Framework for Automatic Selection of Analytic Platforms for Data Mining Tasks
Project: Interactive Rapid Analytic Concepts
Advisor: M.Sc. Manuel Fritz
Examiner: PD Dr. rer. nat. habil. Holger Schwarz


The term “Data Analytics” describes the process that turns raw data into knowledge. Nowadays, multiple reference process models exist, such as KDD or CRISP-DM. These reference models generally comprise the steps (1) data selection, (2) data transformation, (3) data mining, (4) evaluation, and (5) application of the mining results. Although the logical order of the individual steps is reasonable and well established, there are as yet no exact guidelines on how to perform each step. Novice analysts in particular are often overwhelmed by the plethora of possibilities within each of the above-mentioned steps. Furthermore, they often lack technical knowledge, which complicates the analysis of datasets even more.

In the “Data Mining” step, algorithms and statistical approaches are executed on datasets in order to generate models, which subsequently deliver potentially new knowledge. Typically, these algorithms originate from the area of machine learning and may compute models with a high runtime complexity. To accelerate these computations, various distributed analytic platforms are emerging, such as Apache Mahout or Apache Spark as descendants of the MapReduce paradigm [1]. The ability of streaming platforms (e.g. Apache Flink or Esper) to also analyze stored data leads to an overwhelming number of platforms on which to run data mining algorithms. Consequently, novice analysts with little technical knowledge often struggle to choose a platform suitable for their needs.


Currently, the plethora of analytic platforms and respective implementations of mining algorithms hinders the selection of a proper analytic platform. A few benchmarks are available [2], but they either focus on a specific aspect of the mining task (e.g. solely on classification) or on a specific dataset within a predefined domain. Therefore, it is often unclear which platform should be suggested to an analyst to achieve her goals with respect to certain criteria. Such criteria can be (a) the runtime, (b) the reliability in terms of recovery if an error occurs (e.g. hardware outage, network limitations), or (c) the accuracy of the result, since implementations of the very same algorithm may differ across platforms [3].
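One possible way to combine such criteria into a recommendation is a weighted score over normalized benchmark results. The following sketch is purely illustrative: the platform names, the benchmark numbers, and the choice of weights are assumptions, not results from this thesis.

```python
# Illustrative sketch: ranking analytic platforms by weighted criteria.
# All benchmark values and weights below are made-up example numbers.

def rank_platforms(benchmarks, weights):
    """Score each platform as a weighted sum of normalized criteria.

    benchmarks: {platform: {criterion: raw value}}, where lower raw
    values are assumed better (e.g. runtime in seconds, error rate).
    weights: {criterion: importance}, summing to 1.
    Returns platforms sorted from best to worst score.
    """
    scores = {}
    for platform, values in benchmarks.items():
        score = 0.0
        for criterion, weight in weights.items():
            best = min(b[criterion] for b in benchmarks.values())
            worst = max(b[criterion] for b in benchmarks.values())
            # Map each criterion to [0, 1]: best platform gets 1.0.
            if worst == best:
                normalized = 1.0
            else:
                normalized = (worst - values[criterion]) / (worst - best)
            score += weight * normalized
        scores[platform] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical measurements: runtime (s), error rate, recovery time (s).
benchmarks = {
    "Spark":  {"runtime": 120, "error": 0.05, "recovery": 30},
    "Flink":  {"runtime": 150, "error": 0.04, "recovery": 20},
    "Mahout": {"runtime": 300, "error": 0.05, "recovery": 60},
}
weights = {"runtime": 0.5, "error": 0.3, "recovery": 0.2}
ranking = rank_platforms(benchmarks, weights)
print(ranking[0][0])  # the platform with the highest weighted score
```

The weights here encode the analyst's priorities; eliciting them from a technically inexperienced user (e.g. via simple questions rather than raw numbers) would itself be part of the selection concept.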

The goal of this thesis is to develop a concept that automatically provides a technically inexperienced analyst with the analytic platform best suited to her needs. To this end, research has to be conducted on universal criteria for the selection of an analytic platform. These criteria should then be used to evaluate multiple distributed analytic platforms. For the evaluation, an extensible benchmarking suite needs to be implemented that assesses each criterion for each analytic platform and each mining task. Subsequently, the results of the evaluation form the foundation of an automatic platform selection for a technically inexperienced analyst.

An implementation of the benchmarking suite and the automatic selection of analytic platforms based on the benchmarking results are the goals of this thesis. A detailed evaluation of current analytic platforms and their performance across the chosen criteria demonstrates the benefits of the whole framework.
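Extensibility could, for instance, be achieved through a common adapter interface: each platform is wrapped in an adapter, and the suite measures every criterion for every registered platform and mining task. This is a minimal sketch under assumed names (`PlatformAdapter`, `BenchmarkSuite`), not the interface the thesis must use.

```python
# Minimal sketch of an extensible benchmarking suite. Class and method
# names are hypothetical; real adapters would wrap e.g. Spark or Flink.
import abc
import time


class PlatformAdapter(abc.ABC):
    """One adapter per analytic platform; new platforms plug in here."""
    name: str

    @abc.abstractmethod
    def run(self, task, dataset):
        """Execute a mining task on this platform and return its result."""


class BenchmarkSuite:
    def __init__(self):
        self.adapters = []

    def register(self, adapter):
        # Adding a platform requires no changes to the suite itself.
        self.adapters.append(adapter)

    def evaluate(self, task, dataset, accuracy_fn):
        """Return {platform: {criterion: value}} for one mining task."""
        results = {}
        for adapter in self.adapters:
            start = time.perf_counter()
            model = adapter.run(task, dataset)
            runtime = time.perf_counter() - start
            results[adapter.name] = {
                "runtime": runtime,          # criterion (a)
                "accuracy": accuracy_fn(model),  # criterion (c)
            }
        return results
```

A reliability criterion (b) could be added analogously, e.g. by injecting failures during `run` and measuring recovery time; the per-criterion result dictionaries then feed directly into the automatic selection step.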

This thesis includes the following tasks:

  • Researching strategies for automatic selection of analytic platforms
  • Researching promising dimensions for automatic selection
  • Implementation of an extensible framework for benchmarking and selecting platforms
  • Benchmarking of data mining algorithms on multiple platforms
  • Evaluation of the results
  • Presenting intermediate results in a talk
  • Presenting final results in a talk


[1] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 137 (2004)
[2] S. Pafka. Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification. URL https://github.com/szilard/benchm-ml
[3] H.P. Kriegel, E. Schubert, A. Zimek, The (black) art of runtime evaluation: Are we comparing algorithms or implementations?, Knowledge and Information Systems 52(2), 341 (2017)