Data Wrangling

Research Topics

Data-driven research and decisions are based on the assumption that new knowledge and insights can be obtained by gathering and analyzing large amounts of data from possibly numerous sources. However, raw data produced or provided by these sources cannot simply be taken and analyzed as is. Instead, the data useful for analysis must first be sorted out from vast amounts of irrelevant data, correctly combined, annotated, and so on. Getting the data “in shape” for analysis has been termed data wrangling.

Our research in data wrangling focuses on methods, algorithms, and tools both for individual data wrangling steps (e.g., data selection, data cleansing, data integration, data provisioning) and for full data wrangling pipelines that provide end-to-end processing of raw data into analysis-ready data.
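Such a pipeline can be illustrated with a minimal sketch in Python. The step names mirror those above; the sample records and helper functions are purely illustrative assumptions, not an actual implementation from our projects.

```python
# Illustrative raw input: one incomplete record and one exact duplicate.
RAW = [
    {"id": 1, "name": "Ada Lovelace", "age": "36"},
    {"id": 2, "name": "", "age": "41"},              # incomplete record
    {"id": 1, "name": "Ada Lovelace", "age": "36"},  # duplicate record
]

def select(records):
    # Data selection: keep only records that carry a name.
    return [r for r in records if r["name"]]

def cleanse(records):
    # Data cleansing: normalize types and drop exact duplicates.
    seen, out = set(), []
    for r in records:
        key = (r["id"], r["name"])
        if key not in seen:
            seen.add(key)
            out.append({**r, "age": int(r["age"])})
    return out

def provision(records):
    # Data provisioning: hand analysis-ready records downstream.
    return sorted(records, key=lambda r: r["id"])

def pipeline(raw):
    # End-to-end processing: raw data in, analysis-ready data out.
    return provision(cleanse(select(raw)))
```

Here `pipeline(RAW)` yields a single cleaned record; in practice each stage would be far more elaborate, but the composition of stages is the essential idea.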

Data integration and data cleaning

Our data integration research focuses on effective methods and scalable algorithms that facilitate the (semi-)automatic combination of heterogeneous data from various sources. This serves the overarching goal of providing unified access to data in which entities are represented in a complete, unique, and correct way. In this context, we have extensive expertise in the problems known as entity resolution and data fusion: while entity resolution recognizes the different representations of a real-world entity (e.g., coming from different sources), data fusion combines these different representations into a single entity.
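The division of labor between the two problems can be sketched as follows. This is a simplified assumption of how such a step might look: entity resolution here is reduced to matching on normalized names, and fusion to preferring non-missing field values; the source names and records are invented for illustration.

```python
def normalize(name):
    # Crude name normalization: lowercase, strip periods, collapse spaces.
    return " ".join(name.lower().replace(".", "").split())

def resolve(source_a, source_b):
    # Entity resolution: pair records that refer to the same real-world
    # entity, here via an index on the normalized name.
    index = {normalize(r["name"]): r for r in source_a}
    return [(index[normalize(r["name"])], r)
            for r in source_b if normalize(r["name"]) in index]

def fuse(a, b):
    # Data fusion: merge two representations of one entity,
    # preferring whichever source has a non-missing value.
    return {k: a.get(k) or b.get(k) for k in a.keys() | b.keys()}

# Two hypothetical sources describing the same person differently.
crm = [{"name": "J. Smith", "email": "js@example.org", "city": None}]
erp = [{"name": "j smith", "email": None, "city": "Berlin"}]

unified = [fuse(a, b) for a, b in resolve(crm, erp)]
```

Real entity resolution relies on similarity measures, blocking, and learned matchers rather than exact normalized keys, but the resolve-then-fuse structure is the same.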

The problems of entity resolution and data fusion apply equally in the context of data cleansing, where dedicated solutions help identify and eliminate redundant, incomplete, or incorrect data within a single data set.
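Within a single data set, the same resolve-and-fuse idea becomes deduplication. The sketch below is an illustrative assumption: it uses a simple Jaccard similarity over name tokens as the matcher and fills in missing fields from the duplicate, whereas production cleansing tools use far richer similarity measures.

```python
def similar(a, b):
    # Jaccard similarity over lowercase name tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def deduplicate(records, threshold=0.5):
    # Collapse records whose names are near-identical; fuse by filling
    # in fields the retained record is missing.
    kept = []
    for r in records:
        match = next((k for k in kept
                      if similar(k["name"], r["name"]) >= threshold), None)
        if match is None:
            kept.append(dict(r))
        else:
            for key, value in r.items():
                if match.get(key) in (None, ""):
                    match[key] = value
    return kept

# Hypothetical dirty data set with one redundant, incomplete record.
people = [
    {"name": "Grace Hopper", "born": None},
    {"name": "Hopper Grace", "born": 1906},
    {"name": "Alan Turing", "born": 1912},
]
```

Calling `deduplicate(people)` collapses the two Hopper records into one, carrying over the birth year that the first occurrence lacked.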
