In automating data wrangling, a key feature of VADA is that the automation takes account of the data context and the user context. The data context is supplementary data about the result of the data wrangling process. The user context is information about what is important to the user, since there are likely to be trade-offs between different features of the result, such as correctness and completeness.
A paper on data context was presented in December 2017 at IEEE Big Data in Boston, and a paper on the user context for source selection has been published in Information Sciences.
- Koehler, M., Bogatu, A., Civili, C., Konstantinou, N., Abel, E., Fernandes, A. A. A., Keane, J., Libkin, L., Paton, N. W. (2017). Data Context Informed Data Wrangling. IEEE 2017 International Conference on Big Data (IEEE Big Data 2017).
- Abel, E., Keane, J., Paton, N. W., Fernandes, A. A. A., Koehler, M., Konstantinou, N., Ríos, J.C.C., Azuan, N., Embury, S. M. (2018). User Driven Multi-Criteria Source Selection. Information Sciences, 430–431, 179–199. https://doi.org/10.1016/j.ins.2017.11.019
VADALOG is used in VADA to represent and reason about information that is shared between data wrangling components and to orchestrate such components. VADALOG can also be used to support knowledge graphs, and has been the subject of an invited talk at IJCAI in Melbourne.
For further information, there is a video of the presentation and an associated paper.
The VADA project is organising a Workshop on Data Wrangling for Big Data at the Alan Turing Institute in London on 14th September 2017. Registration is free, and the workshop will include talks, demonstrations and posters.
The initial implementation of the VADA architecture was demonstrated at ACM SIGMOD in Chicago on 17th May 2017. The paper that gives an overview of the demonstration is in the proceedings of the conference and a screencast has been produced that gives a flavour of the system.
The VADA project is playing a significant role in the 2017 EDBT Summer School on Adding Value to Data. In particular, VADA is one of the sponsors, VADA members are helping to organise the event (Georg Gottlob, Leonid Libkin), and there are VADA speakers (Alvaro Fernandes, Andreas Pieris and Emanuel Sallinger).
Wrapidity, the company founded to commercialise web data extraction software from Oxford, has been acquired by the media intelligence company Meltwater.
Georg Gottlob, Professor at the Oxford University Department of Computer Science and Co-Founder of Wrapidity, said: “Instant access to products, places, people and news has changed our lives in the last decade. The same access, but at a much larger scale, is now changing business in ways we can’t even imagine yet. At Wrapidity, we have responded to this by developing a completely new AI-based technology for extracting massive amounts of relevant data from millions of websites.”
Tim Furche, Lecturer at the Oxford University Department of Computer Science and Co-Founder and Chief Technology Officer (CTO) of Wrapidity, added: “Meltwater already monitors and analyses millions of articles per day across several languages. Combining Meltwater’s industry leadership and global footprint with Wrapidity’s advances in AI technology, we will be able to surface more accurate, timely and insightful content for Meltwater’s customers. Jorn and his team were visionaries in developing the software, services and business models to make such external web data usable for internal decision-making. We truly believe that companies of the future will hinge on Outside Insight, and we’re extremely excited to pursue this together.”
The ACM Journal of Data and Information Quality (ACM JDIQ) is to publish a Special Issue on Improving the Veracity and Value of Big Data. This is a key focus within VADA, and Norman Paton from the project is one of the Guest Editors for the Special Issue. Further details are available at: http://jdiq.acm.org/CFP-JDIQ-SI-VVBD.pdf
Prof. Leonid Libkin has been awarded an EPSRC Established Career Fellowship. The grant’s title is “MAGIC: MAnaGing InComplete Data – New Foundations”, and its total amount is £1.14M over 5 years, starting 1 August 2016.
The main goal of this research programme is to deliver new understanding of uncertain and incomplete information in data processing tasks, and by doing so to provide new ways of getting knowledge out of such data. It will reconcile correctness guarantees with an efficient algorithmic toolkit that scales to large data sets, and put an end to the perceived impossibility of achieving correctness and efficiency simultaneously for large classes of queries over incomplete data.
Data wrangling is the process by which data is identified, extracted, integrated and cleaned for analysis. The New York Times reports that “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data”.
The VADA project exists to put data wrangling on a firmer footing, in which automation, more systematic use of the available evidence, and carefully targeted user input lead to more efficient data wrangling. One of the goals of the project is to encourage a larger community of researchers and developers to work on techniques and tools for data wrangling. With this in mind, a paper on “Data Wrangling for Big Data: Challenges and Opportunities” has been written by members of the VADA team, and published in the Vision Track of the 19th International Conference on Extending Database Technology, March 15-18, 2016, in Bordeaux, France. This paper makes the case that a concerted effort to address specific challenges in data wrangling can be expected to yield substantial rewards.
Data is everywhere, generated by increasing numbers of applications, devices and users, with few or no guarantees on the format, semantics, and quality. The economic potential of data-driven innovation is enormous: the Centre for Economics and Business Research estimates that it could reach as much as £40B in 2017.
To realise this potential, and to provide meaningful data analyses, data scientists must first spend a significant portion of their time (estimated as 50% to 80%) on “data wrangling”, the process of collecting, reorganising, and cleaning data. This heavy toll is due to what are referred to as the four Vs of big data: Volume (the scale of the data), Velocity (the speed of change), Variety (the different forms of data), and Veracity (the uncertainty of data).
There is an urgent need to provide data scientists with a new generation of tools that will unlock the potential of data assets and significantly reduce the data wrangling component. As many traditional tools are no longer applicable in the 4 V’s environment, a radical paradigm shift is required. The VADA Programme Grant aims to add value to data by:
- carrying out data management tasks in an environment that takes full account of data and user contexts, and
- integrating and automating key data management tasks in a way not yet attempted, but desperately needed by many innovative companies in today’s data-driven economy.
The VADA research programme will define principles and solutions for Value Added Data Systems, which support users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions. In so doing, it uses the context of the user, e.g., requirements in terms of the trade-off between completeness and correctness, and the data context, e.g., its cost, provenance and quality. The user context characterises not only what data is relevant, but also the properties it must exhibit to be fit for purpose.
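To make the idea of a user context concrete, the following is a minimal sketch of how a stated trade-off between correctness and completeness could change which data source is preferred. It is an illustration only and not the method used in VADA: the `Source` class, the `rank_sources` function, the source names, the quality estimates and the simple weighted-sum scoring are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A candidate data source with hypothetical quality estimates in [0, 1]."""
    name: str
    correctness: float
    completeness: float


def rank_sources(sources, weights):
    """Order sources by a weighted sum of their criteria, best first.

    `weights` plays the role of the user context: it maps a criterion
    name (an attribute of Source) to its relative importance.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def score(source):
        weighted = sum(w * getattr(source, criterion)
                       for criterion, w in weights.items())
        return weighted / total

    return sorted(sources, key=score, reverse=True)


# Made-up sources and estimates, purely for illustration.
sources = [
    Source("curated_registry", correctness=0.95, completeness=0.40),
    Source("web_extraction", correctness=0.70, completeness=0.90),
]

# A user context that favours correctness over completeness.
precise = rank_sources(sources, {"correctness": 0.8, "completeness": 0.2})
# A user context that favours completeness over correctness.
broad = rank_sources(sources, {"correctness": 0.2, "completeness": 0.8})

print([s.name for s in precise])
print([s.name for s in broad])
```

With these invented numbers, the correctness-focused context ranks the curated source first, while the completeness-focused context ranks the broader source first. The same candidate data therefore leads to different choices depending on what the user says is important.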