Entity resolution (ER), also known as record linkage or duplicate detection, is the process of disambiguating references to real-world objects, i.e., determining which records refer to the same entity. It is an essential task within data integration, with a wide variety of applications (e.g., biological and public health data, web search, e-commerce). Research on ER also draws on other areas of computer science, such as information retrieval, machine learning, and natural language processing.
While established ER algorithms typically run as off-line batch processes that match several static data sets, this project focuses on ER approaches for large volumes of heterogeneous and dynamic data. Our approaches thus address the volume, variety, and velocity of big data. In particular, we study incremental ER approaches that take a stream of query data items as input and match them against existing data sets in an on-line fashion. We investigate algorithms that leverage modern computing technologies such as parallelization and distribution, as well as space-efficient data structures for ER.
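To make the incremental setting concrete, the following is a minimal sketch of matching a stream of query items against previously seen records. It is an illustration under stated assumptions, not the project's algorithm: the class `IncrementalResolver`, the helpers `tokens` and `jaccard`, the whitespace token blocking, the Jaccard threshold of 0.5, and the sample records are all hypothetical choices made for this example.

```python
from collections import defaultdict


def tokens(record: str) -> set[str]:
    """Lowercased whitespace tokens, used as blocking keys and match features."""
    return set(record.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalResolver:
    """Hypothetical incremental matcher: each arriving record is compared
    only against stored records that share at least one token with it."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold                       # assumed match threshold
        self.records: dict[int, set[str]] = {}           # record id -> token set
        self.index: dict[str, set[int]] = defaultdict(set)  # token -> record ids

    def resolve(self, record_id: int, text: str) -> list[int]:
        toks = tokens(text)

        # Blocking: collect candidates from the inverted index.
        candidates: set[int] = set()
        for tok in toks:
            candidates |= self.index.get(tok, set())

        # Matching: keep candidates whose similarity reaches the threshold.
        matches = sorted(
            c for c in candidates
            if jaccard(toks, self.records[c]) >= self.threshold
        )

        # Update the state so that later query items can match this record.
        self.records[record_id] = toks
        for tok in toks:
            self.index[tok].add(record_id)
        return matches


# Usage: process a small stream of (id, description) pairs one at a time.
resolver = IncrementalResolver(threshold=0.5)
stream = [
    (1, "Apple iPhone 12 64GB"),
    (2, "Samsung Galaxy S21"),
    (3, "iPhone 12 Apple 64GB black"),
]
results = {rid: resolver.resolve(rid, text) for rid, text in stream}
print(results)
```

In this sketch the index grows with every processed item, so each query is matched against the current state of the data rather than a fixed snapshot. A real pipeline would likely add steps this sketch omits, such as pruning superfluous comparisons and parallelizing the per-item work.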
Supplemental material for the paper "Progressive Entity Resolution over Incremental Data":
Please visit the following GitHub repository for the artifacts: https://github.com/UniStuttgart-DataEngineering/progressive-incremental-er
- Gazzarri, L., & Herschel, M. (2021). End-to-end Task Based Parallelization for Entity Resolution on Dynamic Data. Proceedings of the IEEE International Conference on Data Engineering (ICDE).
- Gazzarri, L., & Herschel, M. (2020). Towards task-based parallelization for entity resolution. SICS Software-Intensive Cyber-Physical Systems, 35(1), 31–38. https://doi.org/10.1007/s00450-019-00409-6
- Gazzarri, L., & Herschel, M. (2020). Boosting Blocking Performance in Entity Resolution Pipelines: Comparison Cleaning using Bloom Filters. Proceedings of the International Conference on Extending Database Technology (EDBT), 419–422. https://doi.org/10.5441/002/edbt.2020.47