Pebble: Provenance for nested data processing in big data analytics systems

Analyzing and debugging processing pipelines in big data analytics systems such as Apache Spark or Flink is tedious and typically requires substantial engineering effort. The task becomes even more complex when the pipelines process nested data. In this setting, detailed information on how the results of such pipelines were obtained can give developers valuable insights when analyzing and debugging their pipelines.

Motivated by this and other use cases, we research foundations, algorithms, and system architectures to offer provenance solutions tailored to nested data and the distributed and data-parallel processing leveraged by data analytics systems. 


  1. Diestelkämper, R., & Herschel, M. (2020). Tracing nested data with structural provenance for big data analytics. Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, 253–264.
  2. Diestelkämper, R., Glavic, B., Herschel, M., & Lee, S. (2019). Query-based Why-not Explanations for Nested Data. 11th International Workshop on Theory and Practice of Provenance (TaPP 2019).
  3. Diestelkämper, R., & Herschel, M. (2019). Capturing and Querying Structural Provenance in Spark with Pebble. In P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, & T. Kraska (Eds.), SIGMOD Conference (pp. 1893–1896). ACM.
  4. Diestelkämper, R., Herschel, M., & Jadhav, P. (2017). Provenance in DISC Systems: Reducing Space Overhead at Runtime. Proceedings of the USENIX Conference on Theory and Practice of Provenance (TaPP), 1–13.
  5. Herschel, M., Diestelkämper, R., & Ben Lahmar, H. (2017). A Survey on Provenance - What for? What form? What from? The VLDB Journal, 26, 881–906.


Title | File
Workload details for the EDBT 2019 evaluation of structural provenance management | pebble_edbt_workload.pdf
Workload for tree pattern matching evaluation | pebble_tpm_workload.pdf