Pebble: Provenance for nested data processing in big data analytics systems

Analyzing and debugging processing pipelines in big data analytics systems such as Apache Spark or Flink is a tedious task that typically requires considerable engineering effort. The task becomes even more complex when the pipelines process nested data. In this setting, detailed information on how the results of such pipelines were obtained can give developers valuable insights when analyzing and debugging their pipelines.

Motivated by this and other use cases, we research foundations, algorithms, and system architectures to offer provenance solutions tailored to nested data and the distributed and data-parallel processing leveraged by data analytics systems. 


  1. Diestelkämper, R., Lee, S., Glavic, B., & Herschel, M. (2021). Debugging Missing Answers for Spark Queries over Nested Data with Breadcrumb. Proceedings of the VLDB Endowment (PVLDB).
  2. Diestelkämper, R., Lee, S., Herschel, M., & Glavic, B. (2021). To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data. In Proceedings of the ACM SIG Conference on the Management of Data (SIGMOD).
  3. Diestelkämper, R., & Herschel, M. (2020). Tracing nested data with structural provenance for big data analytics. Proceedings of the International Conference on Extending Database Technology (EDBT), 253–264.
  4. Diestelkämper, R., & Herschel, M. (2020). Distributed Tree-Pattern Matching in Big Data Analytics Systems. In Proceedings of the Conference on Advances in Databases and Information Systems (ADBIS), 171–186.
  5. Diestelkämper, R., Glavic, B., Herschel, M., & Lee, S. (2019). Query-based Why-not Explanations for Nested Data. Proceedings of the International Workshop on Theory and Practice of Provenance (TaPP).
  6. Diestelkämper, R., & Herschel, M. (2019). Capturing and Querying Structural Provenance in Spark with Pebble. In ACM International Conference on Management of Data (SIGMOD), 1893–1896.
  7. Diestelkämper, R., Herschel, M., & Jadhav, P. (2017). Provenance in DISC Systems: Reducing Space Overhead at Runtime. Proceedings of the USENIX Conference on Theory and Practice of Provenance (TaPP), 1–13.
  8. Herschel, M., Diestelkämper, R., & Ben Lahmar, H. (2017). A survey on provenance: What for? What form? What from? The VLDB Journal, 26(6), Article 6.


Title | File
Workload details for the EDBT 2019 evaluation of structural provenance management | pebble_edbt_workload.pdf
Workload for tree pattern matching evaluation | pebble_tpm_workload.pdf