Academic Journal

VADA: An Architecture for End User Informed Data Preparation

Bibliographic Details
Title: VADA: An Architecture for End User Informed Data Preparation
Authors: Konstantinou, Nikolaos; Abel, Edward; Bellomarini, Luigi; Bogatu, Alex; Civili, Cristina; Irfanie, Endri; Koehler, Martin; Mazilu, Lacramioara; Sallinger, Emanuel; Fernandes, Alvaro; Gottlob, Georg; Keane, John; Paton, Norman
Source: Konstantinou, N, Abel, E, Bellomarini, L, Bogatu, A, Civili, C, Irfanie, E, Koehler, M, Mazilu, L, Sallinger, E, Fernandes, A, Gottlob, G, Keane, J & Paton, N 2019, 'VADA: An Architecture for End User Informed Data Preparation', Journal of Big Data. https://doi.org/10.1186/s40537-019-0237-9
Publication Year: 2019
Collection: The University of Manchester: Research Explorer - Publications
Subject Terms: data preparation, data quality, data integration
Description: Background: Data scientists spend considerable amounts of time preparing data for analysis. Data preparation is labour intensive because the data scientist typically takes fine grained control over each aspect of each step in the process, motivating the development of techniques that seek to reduce this burden. Results: This paper presents an architecture in which the data scientist need only describe the intended outcome of the data preparation process, leaving the software to determine how best to bring about the outcome. Key wrangling decisions on matching, mapping generation, mapping selection, format transformation and data repair are taken by the system, and the user need only provide: (i) the schema of the data target; (ii) partial representative instance data aligned with the target; (iii) criteria to be prioritised when populating the target; and (iv) feedback on candidate results. To support this, the proposed architecture dynamically orchestrates a collection of loosely coupled wrangling components, in which the orchestration is declaratively specified and includes self-tuning of component parameters. Conclusion: This paper describes a data preparation architecture that has been designed to reduce the cost of data preparation through the provision of a central role for automation. An empirical evaluation with deep web and open government data investigates the quality and suitability of the wrangling result, the cost-effectiveness of the approach, the impact of self-tuning, and scalability with respect to the numbers of sources.
Document Type: article in journal/newspaper
Language: English
DOI: 10.1186/s40537-019-0237-9
Availability: https://research.manchester.ac.uk/en/publications/351ea41e-9e38-4db8-a006-c42ab5f6e5de
https://doi.org/10.1186/s40537-019-0237-9
Rights: info:eu-repo/semantics/openAccess
Accession Number: edsbas.E02F1916
Database: BASE