Pebble: Provenance for nested data processing in big data analytics systems

Analyzing and debugging processing pipelines in big data analytics systems such as Apache Spark or Flink is tedious and typically requires substantial engineering effort. The task becomes even more complex when the pipelines process nested data. In this setting, detailed information on how a pipeline's results were obtained can give developers valuable insights when analyzing and debugging their pipelines.

Motivated by this and other use cases, we research foundations, algorithms, and system architectures to offer provenance solutions tailored to nested data and the distributed and data-parallel processing leveraged by data analytics systems.

Publications

Diestelkämper, R., Lee, S., Glavic, B., & Herschel, M. (2021). Debugging Missing Answers for Spark Queries over Nested Data with Breadcrumb. Proceedings of the VLDB Endowment (PVLDB). http://www.vldb.org/pvldb/vol14/p2731-diestelkamper.pdf
Diestelkämper, R., Lee, S., Herschel, M., & Glavic, B. (2021). To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data. In Proceedings of the ACM SIG Conference on the Management of Data (SIGMOD). https://dl.acm.org/doi/pdf/10.1145/3448016.3457249
Diestelkämper, R., & Herschel, M. (2020). Tracing nested data with structural provenance for big data analytics. Proceedings of the International Conference on Extending Database Technology (EDBT), 253–264. https://doi.org/10.5441/002/edbt.2020.23
Diestelkämper, R., & Herschel, M. (2020). Distributed Tree-Pattern Matching in Big Data Analytics Systems. In Proceedings of the Conference on Advances in Databases and Information Systems (ADBIS), 171–186. https://doi.org/10.1007/978-3-030-54832-2_14
Diestelkämper, R., Glavic, B., Herschel, M., & Lee, S. (2019). Query-based Why-not Explanations for Nested Data. Proceedings of the International Workshop on Theory and Practice of Provenance (TaPP). https://www.usenix.org/conference/tapp2019/presentation/diestelkamper
Diestelkämper, R., & Herschel, M. (2019). Capturing and Querying Structural Provenance in Spark with Pebble. In ACM International Conference on Management of Data (SIGMOD), 1893–1896. https://doi.org/10.1145/3299869.3320225
Diestelkämper, R., Herschel, M., & Jadhav, P. (2017). Provenance in DISC Systems: Reducing Space Overhead at Runtime. Proceedings of the USENIX Conference on Theory and Practice of Provenance (TaPP), 1–13. https://dl.acm.org/doi/abs/10.5555/3183865.3183883
Herschel, M., Diestelkämper, R., & Ben Lahmar, H. (2017). A survey on provenance: What for? What form? What from? The VLDB Journal, 26(6), Article 6. https://doi.org/10.1007/s00778-017-0486-1

Resources

Title	File
Workload details for the EDBT 2019 evaluation of structural provenance management	pebble_edbt_workload.pdf
Workload for tree pattern matching evaluation	pebble_tpm_workload.pdf
