Contact
+49 711 685 88242
+49 711 685 78242
Email
Business card (VCF)
Universitätsstraße 38
D-70569 Stuttgart
Deutschland
Room: 2.467
Office Hours
by appointment
Subject
Working Area: Data, Metadata and Analytics
The PhD project "Interactive assistance systems in the context of explorative and user-centric data analysis" focuses on interactive methods that allow domain experts to perform analyses beyond standard queries and thereby derive new hypotheses and insights.
Conventional applications from the areas of Visual Analytics or Self-Service Business Intelligence either focus on a specific analytical challenge or follow predefined analysis paths. Since domain experts rarely have in-depth technical knowledge, a more fundamental analysis would have to be implemented by the IT department at considerable cost. This is only feasible if the economic benefit can be predicted, which is rarely the case for explorative analysis scenarios.
It is therefore desirable to enable domain experts to perform initial exploratory analyses on their own in order to verify assumptions. To this end, unnecessary technical detail must be abstracted away in each step of the analysis. A balanced interplay between visual and automated procedures is required, and the domain expert should be involved in every step of the analysis. To support more comprehensive analysis paths, a more generic approach is needed, for example based on data mashup tools. These allow a largely free combination of data sources and operators through an intuitive graphical interface and are therefore well suited to specifying analysis processes for the rapid exploration of data without requiring programming knowledge.
The objective of this project is to develop techniques that support domain experts in explorative analysis, for instance a preselection of data sources, the reduction of repetitive tasks, and interaction concepts for data preparation. By shifting the focus to the integration of one or more domain experts into the analysis process and the resulting increase in freedom, the time-consuming and cost-intensive implementation of a (new) analysis by the IT department can be dispensed with in many cases, and new business opportunities can be identified.
2023
- Michael Behringer (2023). "Interactive, Explorative and User-Centric Data Analysis: Concepts, Systems, Evaluations". PhD Thesis
Abstract:
The present era, oftentimes referred to as the data age, is characterized by an enormous volume of data across various sectors. Similar to how oil has shaped the industrial age in the 19th century, data are now the crucial resource for gaining competitive advantages. However, harnessing this potential requires thorough analysis and domain knowledge to extract valuable information from these data. To optimally leverage this knowledge, domain experts have to be involved in the entire analysis process. This doctoral thesis introduces the user-centric data analysis approach, empowering domain experts to navigate the full-featured analytical journey, from selecting data sources to data preprocessing, data mining, and reporting - without the need for extensive technical knowledge. This holistic approach encompasses not only a reference model for user-centric data analysis but furthermore includes concepts, prototypical implementations as well as comprehensive evaluations for several phases of the analysis. The user-centric data analysis approach is systematically compared to various state-of-the-art approaches, such as process models or visual analytics, based on six different dimensions. This comparison reveals that, through the introduced approach, domain experts are significantly better integrated into the analysis process, resulting in faster insights and competitive advantages.
- Michael Behringer, Pascal Hirmer, Alejandro Gabriel Zacharias Villanueva, Jannis Rapp, and Bernhard Mitschang (2023). "Unobtrusive Integration of Data Quality in Interactive Explorative Data Analysis". Accepted: 25th International Conference on Enterprise Information Systems, ICEIS 2023, Prague, Czech Republic, April 24-26, 2023
Abstract:
The volume of data to be analyzed has increased tremendously in recent years. To extract knowledge from this data, domain experts gain new insights using graphical analysis tools for explorative analyses. Hereby, the reliability and trustworthiness of an explorative analysis are determined by the quality of the underlying data. Existing approaches require a manual inspection to ensure data quality. This inspection is frequently neglected, partly because domain experts often lack the necessary technical knowledge. Moreover, they might need many different tools for this purpose. In this paper, we present a novel interactive approach to integrate data quality into explorative data analysis in an unobtrusive manner. Our approach efficiently combines the strength of different experts, which is currently not supported by state-of-the-art tools, thereby allowing domain-specific adaptation. We implemented a fully working prototype to demonstrate the ability of our approach to support domain experts in explorative data analysis.
- Michael Behringer, Dennis Treder-Tschechlov, Julius Voggesberger, Pascal Hirmer, and Bernhard Mitschang (2023). "SDRank - A Deep Learning Approach for Similarity Ranking of Data Sources to Support User-Centric Data Analysis". Accepted: 25th International Conference on Enterprise Information Systems, ICEIS 2023, Prague, Czech Republic, April 24-26, 2023
Abstract:
Today, data analytics is widely used throughout many domains to identify new trends, opportunities, or risks and improve decision-making. By doing so, various heterogeneous data sources must be selected to form the foundation for knowledge discovery driven by data analytics. However, discovering and selecting the suitable and valuable data sources to improve the analytics results is a great challenge. Domain experts can easily become overwhelmed in the data selection process due to a large amount of available data sources that might contain similar kinds of information. Supporting domain experts in discovering and selecting the best suitable data sources can save time, costs and significantly increase the quality of the analytics results. In this paper, we introduce a novel approach -- SDRank -- which provides a Deep Learning approach to rank data sources based on their similarity to already selected data sources. We implemented SDRank, trained various models on 4 860 datasets, and measured the achieved precision for evaluation purposes. By doing so, we showed that SDRank is able to highly improve the workflow of domain experts to select beneficial data sources.
2022
- Michael Behringer, Manuel Fritz, Holger Schwarz, Bernhard Mitschang (2022). "DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets". Proceedings of the 28th International Conference on Cooperative Information Systems, CoopIS 2022, Bozen, Italy, October 04-07, 2022. Best Conference Paper Award.
Abstract:
In recent years, the volume of data to be analyzed has increased tremendously. However, purposeful data analyses on large-scale data require in-depth domain knowledge. A common approach to reduce data volume and preserve interactivity are sampling algorithms. However, when using a sample, the semantic context across the entire dataset is lost, which impedes data preprocessing. In particular data imputation transformations, which aim to fill empty values for more accurate data analyses, suffer from this problem. To cope with this issue, we introduce DATA-IMP, a novel human-in-the-loop approach that enables data imputation transformations in an interactive manner while preserving scalability. We implemented a fully working prototype and conducted a comprehensive user study as well as a comparison to several non-interactive data imputation techniques. We show that our approach significantly outperforms state-of-the-art approaches regarding accuracy as well as preserves user satisfaction and enables domain experts to preprocess large-scale data in an interactive manner.
BibTeX:
@inproceedings{Behringer2022coopis, abstract = {In recent years, the volume of data to be analyzed has increased tremendously. However, purposeful data analyses on large-scale data require in-depth domain knowledge. A common approach to reduce data volume and preserve interactivity are sampling algorithms. However, when using a sample, the semantic context across the entire dataset is lost, which impedes data preprocessing. In particular data imputation transformations, which aim to fill empty values for more accurate data analyses, suffer from this problem. To cope with this issue, we introduce DATA-IMP, a novel human-in-the-loop approach that enables data imputation transformations in an interactive manner while preserving scalability. We implemented a fully working prototype and conducted a comprehensive user study as well as a comparison to several non-interactive data imputation techniques. We show that our approach significantly outperforms state-of-the-art approaches regarding accuracy as well as preserves user satisfaction and enables domain experts to preprocess large-scale data in an interactive manner.}, address = {Cham}, author = {Behringer, Michael and Fritz, Manuel and Schwarz, Holger and Mitschang, Bernhard}, booktitle = {Cooperative Information Systems}, editor = {Sellami, Mohamed and Ceravolo, Paolo and Reijers, Hajo A. and Gaaloul, Walid and Panetto, Herv{\'e}}, isbn = {978-3-031-17834-4}, pages = {55--74}, publisher = {Springer International Publishing}, title = {DATA-IMP: An Interactive Approach to Specify Data Imputation Transformations on Large Datasets}, year = {2022}}
- Christoph Stach, Clémentine Gritti, Julia Bräcker, Michael Behringer, and Bernhard Mitschang (2022). "Protecting Sensitive Data in the Information Age: State of the Art and Future Prospects". Future Internet. Volume 14, Issue 11, pp. 302:1-302:43
Abstract:
The present information age is characterized by an ever-increasing digitalization. Smart devices quantify our entire lives. These collected data provide the foundation for data-driven services called smart services. They are able to adapt to a given context and thus tailor their functionalities to the user's needs. It is therefore not surprising that their main resource, namely data, is nowadays a valuable commodity that can also be traded. However, this trend does not only have positive sides, as the gathered data reveal a lot of information about various data subjects. To prevent uncontrolled insights into private or confidential matters, data protection laws restrict the processing of sensitive data. One key factor in this regard is user-friendly privacy mechanisms. In this paper, we therefore assess current state-of-the-art privacy mechanisms. To this end, we initially identify forms of data processing applied by smart services. We then discuss privacy mechanisms suited for these use cases. Our findings reveal that current state-of-the-art privacy mechanisms provide good protection in principle, but there is no compelling one-size-fits-all privacy approach. This leads to further questions regarding the practicality of these mechanisms, which we present in the form of seven thought-provoking propositions.
BibTeX:
@article{stach22fi, author = {Stach, Christoph and Gritti, Cl\'{e}mentine and Br\"{a}cker, Julia and Behringer, Michael and Mitschang, Bernhard}, journal = {Future Internet}, title = {{P}rotecting {S}ensitive {D}ata in the {I}nformation {A}ge: {S}tate of the {A}rt and {F}uture {P}rospects}, editor = {Giuli, Dino and Papavassiliou, Symeon and Bellavista, Paolo and Hudson-Smith, Andrew}, year = 2022, month = oct, volume = 14, number = 11, pages = {302:1--302:43}, publisher = {MDPI}, issn = {1999-5903}, doi = {10.3390/fi14110302}, }
- Manuel Fritz, Michael Behringer, Dennis Tschechlov, and Holger Schwarz (2022). "Efficient exploratory clustering analyses in large-scale exploration processes". The VLDB Journal. Volume 31, Issue 4, pp. 711-732
Abstract:
Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit large runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time regarding the defined search space, i.e., provably requiring less executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work which simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we unveil that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts for exploratory clustering analyses in large-scale exploration processes.
BibTeX:
@article{Fritz2022vldbj, author = {Fritz, Manuel and Behringer, Michael and Tschechlov, Dennis and Schwarz, Holger}, title = {{Efficient exploratory clustering analyses in large-scale exploration processes}}, journal = {The VLDB Journal}, year = {2022}, volume = {31}, number = {4}, pages = {711--732}, url = {https://doi.org/10.1007/s00778-021-00716-y} }
- Christoph Stach, Michael Behringer, Julia Bräcker, Clémentine Gritti, and Bernhard Mitschang (2022). "SMARTEN—A Sample-Based Approach towards Privacy-Friendly Data Refinement". Journal of Cybersecurity and Privacy. Volume 2, Issue 3, pp. 606-628
Abstract:
Two factors are crucial for the effective operation of modern-day smart services: Initially, IoT-enabled technologies have to capture and combine huge amounts of data on data subjects. Then, all these data have to be processed exhaustively by means of techniques from the area of big data analytics. With regard to the latter, thorough data refinement in terms of data cleansing and data transformation is the decisive cornerstone. Studies show that data refinement reaches its full potential only by involving domain experts in the process. However, this means that these experts need full insight into the data in order to be able to identify and resolve any issues therein, e.g., by correcting or removing inaccurate, incorrect, or irrelevant data records. In particular for sensitive data (e.g., private data or confidential data), this poses a problem, since these data are thereby disclosed to third parties such as domain experts. To this end, we introduce SMARTEN, a sample-based approach towards privacy-friendly data refinement to smarten up big data analytics and smart services. SMARTEN applies a revised data refinement process that fully involves domain experts in data pre-processing but does not expose any sensitive data to them or any other third-party. To achieve this, domain experts obtain a representative sample of the entire data set that meets all privacy policies and confidentiality guidelines. Based on this sample, domain experts define data cleaning and transformation steps. Subsequently, these steps are converted into executable data refinement rules and applied to the entire data set. Domain experts can request further samples and define further rules until the data quality required for the intended use case is reached. Evaluation results confirm that our approach is effective in terms of both data quality and data privacy.
BibTeX:
@Article{stach2022jcp, author = {Stach, Christoph and Behringer, Michael and Br\"{a}cker, Julia and Gritti, Cl\'{e}mentine and Mitschang, Bernhard}, journal = {Journal of Cybersecurity and Privacy}, title = {{SMARTEN}---{A} {S}ample-{B}ased {A}pproach towards {P}rivacy-{F}riendly {D}ata {R}efinement}, editor = {Rawat, Danda B. and Giacinto, Giorgio}, year = 2022, month = aug, volume = 2, number = 3, pages = {606--628}, publisher = {MDPI}, issn = {2624-800X}, doi = {10.3390/jcp2030031}, }
- Michael Behringer, Manuel Fritz, Holger Schwarz, Bernhard Mitschang (2022). "Increasing Explainability of Clustering Results for Domain Experts by Identifying Meaningful Features". Proceedings of the 24th International Conference on Enterprise Information Systems, ICEIS 2022, Online Streaming, April 25-27, 2022
Abstract:
Today, the amount of data is growing rapidly, which makes it nearly impossible for human analysts to comprehend the data or to extract any knowledge from it. To cope with this, as part of the knowledge discovery process, many different data mining and machine learning techniques were developed in the past. A famous representative of such techniques is clustering, which allows the identification of different groups of data (the clusters) based on data characteristics. These algorithms need no prior knowledge or configuration, which makes them easy to use, but interpreting and explaining the results can become very difficult for domain experts. Even though different kinds of visualizations for clustering results exist, they do not offer enough details for explaining how the algorithms reached their results. In this paper, we propose a new approach to increase explainability for clustering algorithms. Our approach identifies and selects features that are most meaningful for the clustering result. We conducted a comprehensive evaluation in which, based on 216 synthetic datasets, we first examined various dispersion metrics regarding their suitability to identify meaningful features and we evaluated the achieved precision with respect to different data characteristics. This evaluation shows that our approach outperforms existing algorithms in 93 percent of the examined datasets.
BibTeX:
@inproceedings{Behringer2022iceis, author = {Michael Behringer and Pascal Hirmer and Dennis Tschechlov and Bernhard Mitschang}, editor = {Joaquim Filipe and Michal Smialek and Alexander Brodsky and Slimane Hammoudi}, title = {Increasing Explainability of Clustering Results for Domain Experts by Identifying Meaningful Features}, booktitle = {Proceedings of the 24th International Conference on Enterprise Information Systems, {ICEIS} 2022, Online Streaming, April 25-27, 2022, Volume 2}, pages = {364--373}, publisher = {{SCITEPRESS}}, year = {2022}, url = {https://doi.org/10.5220/0011092000003179}, doi = {10.5220/0011092000003179} }
2020
- Manuel Fritz, Michael Behringer, and Holger Schwarz (2020). "LOG-Means: Efficiently Estimating the Number of Clusters in Large Datasets". Proceedings of the VLDB Endowment. Volume 13, Issue 12, pp. 2118-2131
Abstract:
Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results, parameters of the clustering algorithm, e.g., the number of clusters, have to be set appropriately, which is a tremendous pitfall. To this end, analysts rely on their domain knowledge in order to define parameter search spaces. While experienced analysts may be able to define a small search space, especially novice analysts often define rather large search spaces due to the lack of in-depth domain knowledge. These search spaces can be explored in different ways by estimation methods for the number of clusters. In the worst case, estimation methods perform an exhaustive search in the given search space, which leads to infeasible runtimes for large datasets and large search spaces. We propose LOG-Means, which is able to overcome these issues of existing methods. We show that LOG-Means provides estimates in sublinear time regarding the defined search space, thus being a strong fit for large datasets and large search spaces. In our comprehensive evaluation on an Apache Spark cluster, we compare LOG-Means to 13 existing estimation methods. The evaluation shows that LOG-Means significantly outperforms these methods in terms of runtime and accuracy. To the best of our knowledge, this is the most systematic comparison on large datasets and search spaces as of today.
BibTeX:
@article{Fritz2020vldb, author = {Fritz, Manuel and Behringer, Michael and Schwarz, Holger}, title = {LOG-Means: Efficiently Estimating the Number of Clusters in Large Datasets}, publisher = {VLDB Endowment}, volume = {13}, number = {12}, issn = {2150-8097}, url = {https://doi.org/10.14778/3407790.3407813}, doi = {10.14778/3407790.3407813}, year = {2020}, issue_date = {August 2020}, journal = {Proc. VLDB Endow.}, month = {jul}, pages = {2118–2131}, numpages = {14} }
- Michael Behringer, Pascal Hirmer, Manuel Fritz, and Bernhard Mitschang (2020). "Empowering Domain Experts to Preprocess Massive Distributed Datasets". In Proceedings of the 23rd International Conference on Business Information Systems, BIS 2020, Colorado Springs, CO, USA, June 08-10, 2020.
Abstract:
In recent years, the amount of data is growing extensively. In companies, spreadsheets are one common approach to conduct data processing and statistical analysis. However, especially when working with massive amounts of data, spreadsheet applications have their limitations. To cope with this issue, we introduce a human-in-the-loop approach for scalable data preprocessing using sampling. In contrast to state-of-the-art approaches, we also consider conflict resolution and recommendations based on data not contained in the sample itself. We implemented a fully functional prototype and conducted a user study with 12 participants. We show that our approach delivers a significantly higher error correction than comparable approaches which only consider the sample dataset.
BibTeX:
@incollection{Behringer2020ba, author = {Behringer, Michael and Hirmer, Pascal and Fritz, Manuel and Mitschang, Bernhard}, title = {{Empowering Domain Experts to Preprocess Massive Distributed Datasets}}, booktitle = {Business Information Systems}, year = {2020}, editor = {Abramowicz, Witold and Klein, Gary}, pages = {61--75}, publisher = {Springer International Publishing}, address = {Cham}, doi = {10.1007/978-3-030-53337-3_5}, language = {English}, url = {https://link.springer.com/chapter/10.1007/978-3-030-53337-3_5} }
2019
- Manuel Fritz, Osama Muazzen, Michael Behringer, and Holger Schwarz (2019). "ASAP-DM: a framework for automatic selection of analytic platforms for data mining". SICS Software-Intensive Cyber-Physical Systems.
Abstract:
The plethora of analytic platforms escalates the difficulty of selecting the most appropriate analytic platform that fits the needed data mining task, the dataset as well as additional user-defined criteria. Especially analysts, who are rather focused on the analytics domain, experience difficulties to keep up with the latest developments. In this work, we introduce the ASAP-DM framework, which enables analysts to seamlessly use several platforms, whereas programmers can easily add several platforms to the framework. Furthermore, we investigate how to predict a platform based on specific criteria, such as lowest runtime or resource consumption during the execution of a data mining task. We formulate this task as an optimization problem, which can be solved by today’s classification algorithms. We evaluate the proposed framework on several analytic platforms such as Spark, Mahout, and WEKA along with several data mining algorithms for classification, clustering, and association rule discovery. Our experiments unveil that the automatic selection process can save up to 99.71% of the execution time due to automatically choosing a faster platform.
BibTeX:
@article{Fritz2019b, author = {Fritz, Manuel and Muazzen, Osama and Behringer, Michael and Schwarz, Holger}, day = 17, doi = {10.1007/s00450-019-00408-7}, issn = {2524-8529}, journal = {SICS Software-Intensive Cyber-Physical Systems}, month = aug, title = {ASAP-DM: a framework for automatic selection of analytic platforms for data mining}, url = {https://doi.org/10.1007/s00450-019-00408-7}, year = 2019 }
- Manuel Fritz, Michael Behringer, and Holger Schwarz (2019). "Quality-driven early stopping for explorative cluster analysis for big data". SICS Software-Intensive Cyber-Physical Systems - Advancements of Service Computing: Proceedings of SummerSoC 2018. Volume 34, Issue 2-3, pp. 129–140
Abstract:
Data analysis has become a critical success factor for companies in all areas. Hence, it is necessary to quickly gain knowledge from available datasets, which is becoming especially challenging in times of big data. Typical data mining tasks like cluster analysis are very time consuming even if they run in highly parallel environments like Spark clusters. To support data scientists in explorative data analysis processes, we need techniques to make data mining tasks even more efficient. To this end, we introduce a novel approach to stop clustering algorithms as early as possible while still achieving an adequate quality of the detected clusters. Our approach exploits the iterative nature of many cluster algorithms and uses a metric to decide after which iteration the mining task should stop. We present experimental results based on a Spark cluster using multiple huge datasets. The experiments unveil that our approach is able to accelerate the clustering up to a factor of more than 800 by obliterating many iterations which provide only little gain in quality. This way, we are able to find a good balance between the time required for data analysis and quality of the analysis results.
BibTeX:
@article{Fritz2019a, author = {Manuel Fritz and Michael Behringer and Holger Schwarz}, title = {Quality-driven early stopping for explorative cluster analysis for big data}, journal = {{SICS} Softw.-Intensive Cyber Phys. Syst.}, volume = {34}, number = {2-3}, pages = {129--140}, year = {2019}, url = {https://doi.org/10.1007/s00450-019-00401-0}, doi = {10.1007/s00450-019-00401-0}, biburl = {https://dblp.org/rec/journals/ife/FritzBS19.bib} }
2018
- Michael Behringer, Pascal Hirmer, and Bernhard Mitschang (2018). "A Human-Centered Approach for Interactive Data Processing and Analytics". In Enterprise Information Systems : 19th International Conference on Enterprise Information Systems, ICEIS 2017, Porto, Portugal, April 26-29, 2017, Revised Selected Papers, Slimane Hammoudi, Michał Śmiałek, Olivier Camp and Joaquim Filipe (eds.). Springer International Publishing, pp. 498–514.
Abstract:
In recent years, the amount of data increases continuously. With newly emerging paradigms, such as the Internet of Things, this trend will even intensify in the future. Extracting information and, consequently, knowledge from this large amount of data is challenging. To realize this, approved data analytics approaches and techniques have been applied for many years. However, those approaches are oftentimes very static, i.e., cannot be dynamically controlled. Furthermore, their implementation and modification requires deep technical knowledge only technical experts can provide, such as an IT department of a company. The special needs of the business users are oftentimes not fully considered. To cope with these issues, we introduce in this article a human-centered approach for interactive data processing and analytics. By doing so, we put the user in control of data analytics through dynamic interaction. This approach is based on requirements derived from typical case scenarios.
BibTeX:
@inproceedings{Behringer2018, author = {Behringer, Michael and Hirmer, Pascal and Mitschang, Bernhard}, title = {A Human-Centered Approach for Interactive Data Processing and Analytics}, booktitle = {Enterprise Information Systems -- 19th International Conference on Enterprise Information Systems, ICEIS 2017, Porto, Portugal, April 26-29, 2017, Revised Selected Papers}, editor = {Hammoudi, Slimane and {\'{S}}mia{\l}ek, Micha{\l} and Camp, Olivier and Filipe, Joaquim}, address = {Cham}, isbn = {978-3-319-93375-7}, pages = {498--514}, publisher = {Springer International Publishing}, year = {2018} }
- Pascal Hirmer, Michael Behringer, and Bernhard Mitschang (2018). "Partial execution of Mashup Plans during modeling time". SICS Software-Intensive Cyber-Physical Systems - Advancements of Service Computing: Proceedings of SummerSoC 2017. Volume 33, Issue 3-4, pp. 341–352
Abstract:
Workflows and workflow technologies are an approved means to orchestrate services while supporting parallelism, error handling, and asynchronous messaging. A special case workflow technology is applied to are Data Mashups. In Data Mashups, workflows orchestrate services that specialize on data processing. The workflow model itself specifies the order data is processed in. Due to the fact that Data Mashups aim for usability of domain-experts with limited IT and programming knowledge, they oftentimes offer a layer on top that abstracts from the concrete workflow model and technology. This model is then transformed into an executable workflow model. However, transforming and executing the model as a whole leads to efficiency issues. In this paper, we introduce an approach to execute part of this model during modeling time. More precisely, once a specific part is modeled, it is transformed into an executable workflow fragment and executed in the backend. Consequently, once the user created the whole model, the execution time seems to be much shorter for the user because most of the model has already been processed. Furthermore, through our approach, access to intermediate results is enabled at modeling time already.
BibTeX:
@article{Hirmer:2018do, author = {Hirmer, Pascal and Behringer, Michael and Mitschang, Bernhard}, title = {{Partial execution of Mashup Plans during modeling time}}, journal = {Computer Science - Research and Development}, year = {2018}, volume = {33}, number = {3-4}, pages = {341--352}, publisher = {Springer Berlin Heidelberg}, doi = {10.1007/s00450-017-0388-x}, language = {English} }
2017
- Pascal Hirmer and Michael Behringer (2017). "FlexMash 2.0 – Flexible Modeling and Execution of Data Mashups". Rapid Mashup Development Tools : Second International Rapid Mashup Challenge, RMC 2016, Lugano, Switzerland, June 6, 2016, Revised Selected Papers, Florian Daniel and Martin Gaedke (eds.). Springer International Publishing, pp. 10–29.
Abstract:
In recent years, the amount of data highly increases through cheap hardware, fast network technology, and the increasing digitization within most domains. The data produced is oftentimes heterogeneous, dynamic and originates from many highly distributed data sources. Deriving information and, as a consequence, knowledge from this data can lead to a higher effectiveness for problem solving and thus higher profits for companies. However, this is a great challenge – oftentimes referred to as Big Data problem. The data mashup tool FlexMash, developed at the University of Stuttgart, tackles this challenge by offering a means for integration and processing of heterogeneous, dynamic data sources. By doing so, FlexMash focuses on (i) an easy means to model data integration and processing scenarios by domain-experts based on the Pipes and Filters pattern, (ii) a flexible execution based on the user’s non-functional requirements, and (iii) high extensibility to enable a generic approach. A first version of this tool was presented during the ICWE Rapid Mashup Challenge 2015. In this article, we present the new version FlexMash 2.0, which introduces new features such as cloud-based execution and human interaction during runtime. These concepts have been presented during the ICWE Rapid Mashup Challenge 2016.
BibTeX:
@incollection{Hirmer2017, author = {Hirmer, Pascal and Behringer, Michael}, title = {{FlexMash 2.0 {\textendash} Flexible Modeling and Execution of Data Mashups}}, booktitle = {Rapid Mashup Development Tools}, year = {2017}, editor = {Daniel, Florian and Gaedke, Martin}, pages = {10--29}, publisher = {Springer International Publishing}, address = {Cham}, doi = {10.1007/978-3-319-53174-8_2} }
- Michael Behringer, Pascal Hirmer, and Bernhard Mitschang (2017). "Towards Interactive Data Processing and Analytics - Putting the Human in the Center of the Loop". Proceedings of the 19th International Conference on Enterprise Information Systems, ICEIS 2017, Porto, Portugal, April 26-29, 2017. pp. 87–96
Abstract:
Today, it is increasingly important for companies to evaluate data and use the information contained. In practice, this is however a great challenge, especially for domain users that lack the necessary technical knowledge. However, analyses prefabricated by technical experts do not provide the necessary flexibility and are oftentimes only implemented by the IT department if there is sufficient demand. Concepts like Visual Analytics or Self-Service Business Intelligence involve the user in the analysis process and try to reduce the technical requirements. However, these approaches either only cover specific application areas or they do not consider the entire analysis process. In this paper, we present an extended Visual Analytics process, which puts the user at the center of the analysis. Based on a use case scenario, requirements for this process are determined and, later on, a possible application for this scenario is discussed that emphasizes the benefits of our approach.
BibTeX:
@inproceedings{Behringer:2017, author = {Behringer, Michael and Hirmer, Pascal and Mitschang, Bernhard}, title = {{Towards Interactive Data Processing and Analytics - Putting the Human in the Center of the Loop}}, booktitle = {Proceedings of the 19th International Conference on Enterprise Information Systems, ICEIS 2017, Porto, Portugal, April 26-29, 2017}, year = {2017}, editor = {Hammoudi, Slimane and {\'{S}}mia{\l}ek, Micha{\l} and Camp, Olivier and Filipe, Joaquim}, pages = {87--96}, publisher = {SCITEPRESS - Science and Technology Publications}, doi = {10.5220/0006326300870096}, isbn = {978-989-758-247-9} }
2016
- Michael Behringer (2016). "Visual Analytics im Kontext der Daten- und Analysequalität am Beispiel von Data Mashups". Diploma Thesis. Universität Stuttgart
Abstract:
Many contemporary processes and business models are based on the evaluation of data. Advances in storage technology and networking make data acquisition very easy today, and it is used extensively. The volume of data available worldwide is growing exponentially, making analysis increasingly complex. In recent years, the term Visual Analytics has come up more and more often in this context. This research field combines visual and automated methods for data analysis. In this thesis, the use and goals of Visual Analytics are evaluated and a new, more comprehensive definition is developed. From this definition, an extension of the knowledge discovery process is derived and various approaches are assessed. To clarify the differences between data mining, visualization, and Visual Analytics, these fields are compared and classified in a framework along several dimensions. In addition, it is examined to what extent this new approach can be applied with regard to data and analysis quality. Finally, based on the insights gained, a prototypical implementation built on FlexMash, a data mashup tool developed at the University of Stuttgart, is described. Data mashups simplify the involvement of users without a technical background and therefore harmonize excellently with Visual Analytics.
BibTeX:
@mastersthesis{Behringer:2016, author = {Behringer, Michael}, title = {{Visual Analytics im Kontext der Daten- und Analysequalit{\"a}t am Beispiel von Data Mashups}}, school = {Universit{\"a}t Stuttgart}, year = {2016}, publisher = {Universit{\"a}t Stuttgart}, doi = {10.18419/opus-9325}, language = {German} }
2015
- Markus Funk, Stefan Schneegass, Michael Behringer, Niels Henze, and Albrecht Schmidt (2015). "An Interactive Curtain for Media Usage in the Shower". In Proceedings of the 4th International Symposium on Pervasive Displays, PerDis 2015, Saarbrücken, Germany, June 10-12, 2015. pp. 225–231
Abstract:
Smartphones offer an ever-growing range of functions and thereby foster increasing dependence. Accordingly, many people feel uncomfortable when they cannot access their device. Especially in an environment such as the bathroom, this can lead to technical defects in the hardware on the one hand and to hygiene problems on the other. In the context of this study thesis, an online survey was first conducted to gather more detailed information about the available equipment and the range of applications used. Based on these results, a prototype for media usage in the shower cabin was developed. It offers various applications such as a music and video player as well as an overview of upcoming appointments and the weather forecast. Furthermore, three different algorithms were developed that differ in complexity, speed, and fault tolerance. Both the system and the algorithms were presented and evaluated in a user study, which showed that participants responded very positively to such a system and that the recognition works well.
BibTeX:
@inproceedings{DBLP:conf/perdis/FunkSBH015, author = {Funk, Markus and Schneegass, Stefan and Behringer, Michael and Henze, Niels and Schmidt, Albrecht}, title = {{An Interactive Curtain for Media Usage in the Shower}}, booktitle = {Proceedings of the 4th International Symposium on Pervasive Displays, PerDis 2015, Saarbr{\"u}cken, Germany, June 10-12, 2015}, year = {2015}, pages = {225--231}, organization = {ACM}, publisher = {ACM Press}, address = {New York, New York, USA}, affiliation = {ACM}, doi = {10.1145/2757710.2757713}, isbn = {9781450336086}, language = {English} }
2014
- Michael Behringer (2014). "Erforschung der Interaktionsmöglichkeiten mit flexiblen und unebenen Oberflächen". Study Thesis. Universität Stuttgart
Abstract:
Smartphones offer an ever-growing range of functions and thereby foster increasing dependence. Accordingly, many people feel uncomfortable when they cannot access their device. Especially in an environment such as the bathroom, this can lead to technical defects in the hardware on the one hand and to hygiene problems on the other. In the context of this study thesis, an online survey was first conducted to gather more detailed information about the available equipment and the range of applications used. Based on these results, a prototype for media usage in the shower cabin was developed. It offers various applications such as a music and video player as well as an overview of upcoming appointments and the weather forecast. Furthermore, three different algorithms were developed that differ in complexity, speed, and fault tolerance. Both the system and the algorithms were presented and evaluated in a user study, which showed that participants responded very positively to such a system and that the recognition works well.
BibTeX:
@phdthesis{Behringer:2014, author = {Behringer, Michael}, title = {{Erforschung der Interaktionsm{\"o}glichkeiten mit flexiblen und unebenen Oberfl{\"a}chen}}, school = {Universit{\"a}t Stuttgart}, year = {2014}, publisher = {Universit{\"a}t Stuttgart}, doi = {10.18419/opus-3336}, language = {German} }
This section is currently being revised.
Bachelor thesis:
➣ Data quality metrics to support domain experts in interactive analysis
Bachelor thesis
Data quality metrics to support domain experts in interactive analysis
Motivation
Today, large amounts of data are collected and stored. This data must first be processed and integrated before it can be analyzed. Data processing should be as flexible as possible, and domain-specific knowledge is usually required. An application that meets these requirements must therefore also be understandable for users without an extensive technical background, so-called domain experts. Data mashup platforms aim at such flexible, ad hoc integration and analysis of heterogeneous data [1]. At the University of Stuttgart, FlexMash [2] was developed as such a data mashup tool; it allows interactive, graphical modeling of data processing and analysis scenarios. The modeling is based on the pipes-and-filters pattern, in which modular services with uniform interfaces and a uniform data exchange format can be connected with each other as desired. These services represent either the extraction of data, the processing of extracted data, or the visualization of the results.
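To illustrate the pipes-and-filters idea behind this modeling style, the following minimal Python sketch chains extraction, processing, and visualization services that share a uniform exchange format. All names are hypothetical; this is not the FlexMash API.
```python
# Minimal pipes-and-filters sketch (hypothetical names, not the FlexMash API):
# every service consumes and produces the same exchange format, here a list of dicts.
import csv
from typing import Callable, Dict, List

Record = Dict[str, object]
Service = Callable[[List[Record]], List[Record]]

def extract_csv_source(path: str) -> List[Record]:
    """Extraction service: reads a CSV file into the uniform exchange format."""
    with open(path, newline="") as f:
        return [dict(row) for row in csv.DictReader(f)]

def filter_rows(predicate: Callable[[Record], bool]) -> Service:
    """Processing service: keeps only the records matching the predicate."""
    return lambda records: [r for r in records if predicate(r)]

def show_summary(records: List[Record]) -> List[Record]:
    """Visualization service placeholder: here it only reports the record count."""
    print(f"{len(records)} records ready for visualization")
    return records

def run_pipeline(source: List[Record], *services: Service) -> List[Record]:
    """Connects the services in the modeled order, as in a pipes-and-filters graph."""
    data = source
    for service in services:
        data = service(data)
    return data

# Example composition a domain expert might model graphically:
# run_pipeline(extract_csv_source("sales.csv"),             # hypothetical file
#              filter_rows(lambda r: r.get("country") == "DE"),
#              show_summary)
```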
Goals
The goal of this work is to extend FlexMash to provide feedback on data quality to the domain expert. This includes the implementation of a repository that provides data quality metrics and their implementations, an extension for the specification of new data sources (offline phase), and a context-dependent specification by the domain expert at runtime (online phase).
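As a rough sketch of how such a metric repository could be organized (all names and the decorator-based registration are illustrative assumptions, not the planned FlexMash extension), a metric can be registered once in the offline phase and then be selected and applied by the domain expert at runtime:
```python
# Illustrative sketch of a data quality metric repository (hypothetical, not FlexMash code).
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]
Metric = Callable[[List[Record], str], float]

METRIC_REPOSITORY: Dict[str, Metric] = {}

def register_metric(name: str) -> Callable[[Metric], Metric]:
    """Offline phase: make a metric implementation available under a name."""
    def decorator(fn: Metric) -> Metric:
        METRIC_REPOSITORY[name] = fn
        return fn
    return decorator

@register_metric("completeness")
def completeness(records: List[Record], column: str) -> float:
    """Share of non-missing values in a column (1.0 = no missing values)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(column) not in (None, ""))
    return filled / len(records)

# Online phase: the domain expert picks a metric for a column of a data source.
# score = METRIC_REPOSITORY["completeness"](records, "customer_id")
```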
The thesis includes the following tasks:
- Literature research, summary and delimitation of current research results on data quality and its integration into data mashup tools
- Development of a concept for integration into FlexMash
- Prototypical implementation of the developed concept
- Evaluation of results
References
- [1] Daniel, F., Matera, M. (2014). Mashups. Berlin, Heidelberg: Springer
- [2] Hirmer, P., Behringer, M. (2017). FlexMash 2.0 – Flexible Modeling and Execution of Data Mashups. In F. Daniel, M. Gaedke (Eds.), Rapid Mashup Development Tools (Vol. 696, pp. 10–29). Cham: Springer International Publishing
Summary
Type: Bachelor thesis
Title (English): Data quality metrics to support domain experts in interactive analysis
Title (German): Datenqualitätsmetriken zur Unterstützung von Domänenexperten bei interaktiven Analysen
Supervisor(s): Dipl.-Inf. Michael Behringer
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: In Progress
Master thesis:
➣ Using provenance data to explore personal data with GDPR compliance
Master thesis
Using provenance data to explore personal data with GDPR compliance
Motivation
Today, companies collect personal data in almost every interaction on the Internet. This includes, for example, name, address, and payment method when shopping online, but it goes far beyond that, even when merely browsing third-party sites. Since May 2018, the EU has granted consumers far-reaching rights to restrict how companies use this data. Companies that violate these rights face fines of up to 4% of their annual turnover. As a result, businesses have a strong interest in complying with the rules.
Goals
The goal of this thesis is to develop a provenance-based solution that addresses the above challenges. Provenance describes the origin and processing of data. In preliminary work, tools for collecting provenance data (Pebble [1]) and for modeling analysis processes (FlexMash [2]) have already been developed. These tools are to be extended in the context of this thesis with regard to the use case of the General Data Protection Regulation (GDPR). For this purpose, a procedure must first be developed that determines the influence of a data source on the analysis result and enables an efficient recalculation. In a second step, this procedure will be integrated into FlexMash.
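The following sketch illustrates, under simplifying assumptions and with purely hypothetical names (no relation to the internals of Pebble or FlexMash), how an influence index derived from provenance data could map input records to the result partitions they contributed to, so that a deletion only triggers recomputation of the affected parts:
```python
# Simplified influence index built from provenance data (hypothetical sketch).
from collections import defaultdict
from typing import Dict, Set

class InfluenceIndex:
    """Maps input record ids to the result partitions they influenced."""

    def __init__(self) -> None:
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def record(self, input_id: str, result_partition: str) -> None:
        """Called while collecting provenance during analysis execution."""
        self._index[input_id].add(result_partition)

    def affected_partitions(self, deleted_ids: Set[str]) -> Set[str]:
        """Result partitions that must be recomputed after a GDPR deletion."""
        affected: Set[str] = set()
        for input_id in deleted_ids:
            affected |= self._index.get(input_id, set())
        return affected

# A simple cost-based rule of thumb for the partial vs. full recomputation decision:
# recompute everything if most partitions are affected anyway.
def prefer_full_recomputation(affected: int, total: int, threshold: float = 0.5) -> bool:
    return total == 0 or affected / total >= threshold
```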
The thesis includes the following tasks:
- Literature research, summary and delimitation of current research results on data provenance [3], GDPRov [4], interactive data processing, etc.
- Design and implementation of an index structure that records the influence of deleted input elements on the results over several queries. The index structure is based on information gathered by collecting provenance data. This reduces the recalculation to those parts of the result which are actually affected by changes in the input
- Development and implementation of an algorithm to decide whether it is more beneficial to recalculate the analysis results partially or completely
- Evaluation of the index structure and recalculation metrics over different workloads on real data.
- Implementation of the developed approach in FlexMash
References
- [1] Diestelkämper, R., Herschel, M. (2019). Capturing and Querying Structural Provenance in Spark with Pebble. SIGMOD Conference, 1893–1896
- [2] Hirmer, P., Behringer, M. (2017). FlexMash 2.0 – Flexible Modeling and Execution of Data Mashups. In F. Daniel, M. Gaedke (Eds.), Rapid Mashup Development Tools (Vol. 696, pp. 10–29). Cham: Springer International Publishing
- [3] Herschel, M., Diestelkämper, R., Ben Lahmar, H. (2017). A survey on provenance: What for? What form? What from? The VLDB Journal, 26(6), 881–906.
- [4] GDPRov - The GDPR Provenance Ontology, https://openscience.adaptcentre.ie/ontologies/GDPRov/docs/ontology
Summary
Type: Master thesis
Title (English): Using provenance data to explore personal data with GDPR compliance
Title (German): Nutzung von Provenance-Daten zur Analyse personenbezogener Daten gemäß der DSGVO-Richtlinien
Supervisor(s): Dipl.-Inf. Michael Behringer, Ralf Diestelkämper, M. Sc.
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: In Progress
Bachelor thesis:
➣ Automatic context-sensitive visualization of data sources using data mashups
Bachelor thesis
Automatic context-sensitive visualization of data sources using data mashups
Motivation
Today, large amounts of unstructured, semi-structured and heterogeneous data are produced. This data must first be processed and integrated before an analysis can be made. Data processing should be as flexible as possible and allow ad hoc integration based on real-time data. An application that meets these requirements must also be understandable for users without extensive technical background. Data Mashup platforms aim at a flexible, ad hoc integration of heterogeneous data [1].
Goals
In this thesis, different concepts for the automated characterization of data as well as suitable visualizations are to be researched and evaluated with regard to their application in the area of data mashups. Furthermore, a concept tailored to the requirements of the FlexMash tool shall be developed and prototypically implemented. Finally, the results will be evaluated.
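As a purely illustrative sketch of how an automated characterization could drive the choice of a visualization (the type detection and mapping rules below are assumptions, not the concept to be developed), columns can be classified and mapped to a default chart type:
```python
# Sketch: derive a simple characterization of a tabular data source and map it to a
# chart type. The type detection and the mapping rules are illustrative assumptions only.
from typing import Dict, List

def characterize(column: List[str]) -> str:
    """Classify a column as 'numeric' or 'categorical' based on its values."""
    def is_number(v: str) -> bool:
        try:
            float(v)
            return True
        except ValueError:
            return False
    return "numeric" if all(is_number(v) for v in column if v != "") else "categorical"

def suggest_chart(column_types: Dict[str, str]) -> str:
    """Very small rule set mapping a characterization to a default visualization."""
    kinds = sorted(column_types.values())
    if kinds == ["numeric", "numeric"]:
        return "scatter plot"
    if kinds == ["categorical", "numeric"]:
        return "bar chart"
    if kinds == ["numeric"]:
        return "histogram"
    return "table"

# Example: a data source with one categorical and one numeric column
# suggest_chart({"country": "categorical", "revenue": "numeric"})  # -> "bar chart"
```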
The thesis includes the following tasks:
- Literature research on concepts of automatic characterization of data
- Literature research on suitable visualizations for various data
- Prototypical implementation of a suitable concept
- Evaluation of results
References
- [1] Daniel, F., Matera, M. (2014). Mashups. Berlin, Heidelberg: Springer. http://doi.org/10.1007/978-3-642-55049-2
Summary
Type: Bachelor thesis
Title (English): Automatic context-sensitive visualization of data sources using data mashups
Title (German): Automatisierte kontext-sensitive Visualisierung von Datenquellen unter Verwendung von Data Mashups
Supervisor(s): Dipl.-Inf. Michael Behringer
Examiner: Prof. Dr.-Ing. habil. Bernhard Mitschang
Status: Finished
➣ Feature-Driven Representation of Clustering Results
Bachelor thesis
Feature-Driven Representation of Clustering Results
Motivation
Nowadays, data is the basis of many processes in industry and research. However, since data is worthless without evaluation and linking, various algorithms and analysis methods exist. These methods are mostly an opaque black box, because there is no way to intervene between the input of parameters and the output of the result. It is often unclear why and under which conditions a certain result is obtained [1]. An analyst must assess this result in light of his or her domain knowledge and draw conclusions. In particular with clustering methods (such as k-means), the result also depends strongly on the initially selected parameters. The goal of a clustering procedure is to group similar elements into a cluster and to separate dissimilar elements as far as possible. Communicating the results is often difficult.
Goals
For two-dimensional (and, with some limitations, three-dimensional) data sets, a visualization of the results is possible and easy to understand. For higher-dimensional data sets, dimensionality reduction methods such as PCA [2] or t-SNE [3] are typically used. However, the comprehensibility of the clustering result is then rarely preserved. For this reason, different, more easily understandable representations are required for these data sets. Possible approaches are textual representations of the cluster properties.
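As a minimal illustration of such a feature-driven description (assuming numeric features and using a dispersion ratio as one plausible relevance criterion), the following sketch scores each feature by how strongly its spread within a cluster shrinks compared to its spread over the whole data set:
```python
# Sketch: rank features by how much a cluster "tightens" them relative to the full data.
# Assumes numeric features and a non-empty cluster; the dispersion ratio is one
# possible relevance criterion, not a prescribed one.
import numpy as np

def feature_relevance(X: np.ndarray, labels: np.ndarray, cluster: int) -> np.ndarray:
    """Return one relevance score per feature for the given cluster.

    Score = 1 - (std within the cluster / std over all data); higher values mean the
    feature is more characteristic for this cluster and is a good candidate for a
    textual cluster description.
    """
    global_std = X.std(axis=0) + 1e-12          # avoid division by zero
    cluster_std = X[labels == cluster].std(axis=0)
    return 1.0 - cluster_std / global_std

# Usage sketch:
# X, labels = ...                      # data matrix and clustering result
# scores = feature_relevance(X, labels, cluster=0)
# top = np.argsort(scores)[::-1][:3]   # three most meaningful features of cluster 0
```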
The thesis includes the following tasks:
- Literature research on appropriate metrics to identify the most relevant features
- Literature research on concepts for the presentation of multidimensional clustering results
- Development and prototypical implementation of suitable concepts
- Evaluation of concepts
References
- [1] Jain, A. K., Dubes, R. C. (1988). Algorithms for clustering data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
- [2] Wold, S., Esbensen, K., Geladi, P. (1987). Principal Component Analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52
- [3] Maaten, L. V. D., Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.
Summary
Type: Bachelor thesis
Title (English): Feature-Driven Representation of Clustering Results
Title (German): Feature-getriebene Darstellung von Clustering-Resultaten
Supervisor(s): Dipl.-Inf. Michael Behringer, Manuel Fritz, M. Sc.
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: Finished
➣ Interactive and incremental visualization in the context of Big Data
Bachelor thesis
Interactive and incremental visualization in the context of Big Data
Motivation
Today, large amounts of unstructured, semi-structured and heterogeneous data are produced. This data must first be processed and integrated before an analysis can be made. Data processing should be as flexible as possible and allow ad hoc integration based on real-time data. An application that meets these requirements must also be understandable for users without extensive technical background. Data Mashup platforms aim at a flexible, ad hoc integration of heterogeneous data [1].
Goals
In this thesis, an application is to be developed that enables the user to select arbitrary attributes of the data set and to generate a visualization that supports understanding. Since this can lead to higher latencies when creating the visualization, especially in the context of Big Data, it will further be evaluated to what extent an incremental calculation [2] can help here.
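A minimal sketch of the incremental idea, with the binning strategy and the chunked data source as assumptions: the aggregates behind the visualization are updated chunk by chunk, so a preliminary chart can already be rendered before the full dataset has been processed.
```python
# Sketch: incrementally updated histogram for a user-selected attribute.
# Each processed chunk refines the visualization instead of waiting for all data.
from typing import Iterable, List
import numpy as np

class IncrementalHistogram:
    def __init__(self, bin_edges: List[float]) -> None:
        self.bin_edges = np.asarray(bin_edges)
        self.counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
        self.seen = 0

    def update(self, values: Iterable[float]) -> None:
        """Fold a new chunk of attribute values into the running histogram."""
        chunk = np.fromiter(values, dtype=float)
        chunk_counts, _ = np.histogram(chunk, bins=self.bin_edges)
        self.counts += chunk_counts
        self.seen += chunk.size

# hist = IncrementalHistogram(bin_edges=[0, 10, 20, 30, 40, 50])
# for chunk in data_source_in_chunks():    # hypothetical chunked data source
#     hist.update(chunk)
#     render(hist.counts, partial=True)     # hypothetical rendering call
```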
The thesis includes the following tasks:
- Literature research on suitable visualizations for varying data
- Literature research on concepts of incremental visualization
- Prototypical implementation of a suitable concept
- Evaluation of results
References
- [1] Daniel, F., Matera, M. (2014). Mashups. Berlin, Heidelberg: Springer
- [2] Schulz, H.-J., Angelini, M., Santucci, G., Schumann, H. (2016). An Enhanced Visualization Process Model for Incremental Visualization. IEEE Transactions on Visualization and Computer Graphics, 22(7), 1830–1842
Summary
Type: Bachelor thesis
Title (English): Interactive and incremental visualization in the context of Big Data
Title (German): Interaktive und inkrementelle Visualisierung im Kontext von Big Data
Supervisor(s): Dipl.-Inf. Michael Behringer, Manuel Fritz, M. Sc.
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: Finished
➣ Interactive context-sensitive integration and cleaning of heterogeneous data sources using data mashups
Bachelor thesis
Interactive context-sensitive integration and cleaning of heterogeneous data sources using data mashups
Motivation
Today, large amounts of unstructured, semi-structured and heterogeneous data are produced. This data must first be processed and integrated before an analysis can be made. Data processing should be as flexible as possible and allow ad hoc integration based on real-time data. An application that meets these requirements must also be understandable for users without extensive technical background. Data Mashup platforms aim at a flexible, ad hoc integration of heterogeneous data [1].
Goals
In this thesis the existing tool FlexMash shall be extended by a concept for the integration and preprocessing of data sets for subsequent analysis.
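One possible building block for such an integration concept is a simple schema matching between two data sources; the sketch below pairs columns by name similarity and is an illustration only, with the similarity measure and the threshold as assumptions.
```python
# Sketch: naive schema matching as a building block for automated schema integration.
# Columns of two sources are paired by string similarity of their names; the threshold
# and the similarity measure are illustrative assumptions.
from difflib import SequenceMatcher
from typing import Dict, List

def match_columns(cols_a: List[str], cols_b: List[str],
                  threshold: float = 0.7) -> Dict[str, str]:
    """Return a mapping from columns of source A to the best-matching column of B."""
    mapping: Dict[str, str] = {}
    for a in cols_a:
        best, best_score = None, 0.0
        for b in cols_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score > best_score:
                best, best_score = b, score
        if best is not None and best_score >= threshold:
            mapping[a] = best
    return mapping

# match_columns(["customer_id", "zip"], ["CustomerID", "postal_code"])
# -> {"customer_id": "CustomerID"}
```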
The thesis includes the following tasks:
- Literature research on concepts and algorithms for automated schema integration
- Development of a concept for integration into FlexMash
- Prototypical implementation of the developed concept
- Evaluation of results
References
- [1] Daniel, F., Matera, M. (2014). Mashups. Berlin, Heidelberg: Springer
Summary
Type: Bachelor thesis
Title (English): Interactive context-sensitive integration and cleaning of heterogeneous data sources using data mashups
Title (German): Interaktive kontextsensitive Integration und Aufbereitung heterogener Datenquellen unter Verwendung von Data Mashups
Supervisor(s): Dipl.-Inf. Michael Behringer, Dipl.-Inf. Pascal Hirmer
Examiner: Prof. Dr.-Ing. habil. Bernhard Mitschang
Status: Finished
➣ Interactive sampling techniques in the context of data mashup tools
Bachelor thesis
Interactive sampling techniques in the context of data mashup tools
Motivation
Nowadays, data is the basis of many processes in industry and research. However, since data is worthless without evaluation and linking, various analysis methods exist, which are either manual, semi-automatic, or automatic. Manual methods offer the data analyst deep interaction possibilities, but are not practicable for today's data volumes due to the explorative character of data analysis and the required computing power. Automatic methods, on the other hand, can process large amounts of data, but are mostly an opaque black box, since there is no way to intervene between the input of parameters and the output of the result. Automatic methods therefore cannot integrate the specific domain knowledge of the data analyst into the process, or only by repeatedly executing the complete black box without gaining an understanding of the underlying processes.
Goals
In preliminary work, various procedures for data analysis, such as clustering or sampling algorithms [1], have already been implemented on Spark. So far, however, there is no user interface to invoke these procedures from FlexMash [2], a data mashup tool [3] developed at the University of Stuttgart. Therefore, this thesis will first integrate the existing implementations into FlexMash. This includes an adaptation to the architecture used as well as the development of a suitable user interface for specifying the parameters.
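One classic technique such a sampling operator could build on is reservoir sampling, which draws a uniform fixed-size sample in a single pass over the data. The sketch below is a generic illustration, not the existing Spark-based implementation mentioned above.
```python
# Reservoir sampling (Algorithm R): uniform fixed-size sample in one pass over the data.
# Generic illustration; not the existing Spark-based implementation mentioned above.
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

# sample = reservoir_sample(open("events.csv"), k=1000)   # hypothetical data source
```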
The thesis includes the following tasks:
- Integration of the existing procedures into FlexMash
- Literature research on concepts for manual and (semi-)automated control of sampling procedures
- Literature research on suitable metrics for the evaluation of generated samples
- Development and prototypical implementation of a suitable semi-automated approach
- Evaluation of results
References
- [1] Wang, H., Parthasarathy, S., Ghoting, A., Tatikonda, S., Buehrer, G., Kurc, T., & Saltz, J. (2005). Design of a next generation sampling service for large scale data analysis applications (pp. 91–100). Proceedings of the 19th International Conference on Supercomputing, New York, New York, USA
- [2] Hirmer, P., Mitschang, B. (2016). FlexMash – Flexible Data Mashups Based on Pattern-Based Model Transformation. In F. Daniel, C. Pautasso (Eds.), Rapid Mashup Development Tools (Vol. 591, pp. 12–30). Cham: Springer, Cham
- [3] Daniel, F., Matera, M. (2014). Mashups. Berlin, Heidelberg: Springer
Summary
Type: Bachelor thesis
Title (English): Interactive sampling techniques in the context of data mashup tools
Title (German): Interaktive Sampling-Verfahren im Kontext von Data-Mashup-Werkzeugen
Supervisor(s): Dipl.-Inf. Michael Behringer; Manuel Fritz, M. Sc.
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: Finished
➣ Metrics for the evaluation of partial steps in data mining analyses
Bachelor thesis
Metrics for the evaluation of partial steps in data mining analyses
Motivation
Nowadays, data is the basis of many processes in industry and research. However, since data is worthless without evaluation and linking, various algorithms and analysis methods exist. From the point of view of beginners, but also for experienced users, these methods are an opaque black box, as there are no control options or intermediate steps between the input of parameters and the output of the result. Therefore, it is often unclear why and under which conditions a certain result is obtained [1]. These procedures are characterized by iterative algorithms, yet the intermediate steps are not visible to the user.
Goals
In this thesis, suitable points in time for the calculation of intermediate results shall be determined for a specific analysis procedure (clustering, e.g. k-means [2] and DBSCAN [3]). For this purpose, a manual or (semi-)automated selection of metrics shall be performed, which clarifies at which point in time a suitable intermediate result should be visualized. These metrics shall furthermore be used to approximate the clustering result once a sufficient quality has already been achieved. An implementation shall visualize these results.
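One conceivable convergence metric is the relative change of the within-cluster sum of squares (inertia) per k-means iteration: once the improvement falls below a threshold, the current assignment could be treated as an intermediate result worth visualizing. The sketch below illustrates this idea; the synthetic data, the choice of k and the 1% threshold are assumptions, not results of the thesis.

```python
# Minimal sketch: monitor the relative inertia improvement per k-means
# iteration and report the first iteration whose intermediate result looks
# stable enough to visualize. Data, k and the threshold are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(200, 2)) for c in ((0, 0), (4, 4), (0, 4))])

k, threshold = 3, 0.01
centers = X[rng.choice(len(X), k, replace=False)]
prev_inertia = np.inf

for iteration in range(1, 51):
    # assignment step: nearest center per point
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists.min(axis=1) ** 2).sum())
    # update step: recompute centers, keeping the old one for empty clusters
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])

    improvement = (prev_inertia - inertia) / prev_inertia if np.isfinite(prev_inertia) else 1.0
    print(f"iteration {iteration}: inertia={inertia:.1f}, improvement={improvement:.2%}")
    if improvement < threshold:
        print(f"intermediate result after iteration {iteration} is a candidate for visualization")
        break
    prev_inertia = inertia
```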
The thesis includes the following tasks:
- Literature research on clustering algorithms
- Literature research on metrics and convergence criteria
- Development and prototypical implementation of a suitable concept
- Evaluation of results
References
- [1] Jain, A. K., Dubes, R. C. (1988). Algorithms for clustering data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
- [2] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations (Vol. 1, pp. 281–297). Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press
- [3] Ester, M., Kriegel, H. P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Summary
Type: Bachelor thesis
Title (English): Metrics for the evaluation of partial steps in data mining analyses
Title (German): Metriken zur Evaluation von Teilschritten in Data Mining-Analysen
Supervisor(s): Manuel Fritz, M. Sc.; Dipl.-Inf. Michael Behringer
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: Finished
Master thesis:
➣ Dynamic Execution of Workflow Parts During Modeling Time
Master thesis
Dynamic Execution of Workflow Parts During Modeling Time
Motivation
Today, large quantities of unstructured, semi-structured and heterogeneous data are produced. This data must first be processed and integrated before it can be analyzed. In this context, data processing should be as flexible as possible and allow ad hoc integration based on real-time data. An application that meets these requirements must also be understandable for users without an extensive technical background. Data Mashup platforms aim to provide such a flexible, ad hoc integration of heterogeneous data [1]. The University of Stuttgart developed FlexMash, a data mashup tool that, besides the domain-specific, graphical modeling of data processing and integration scenarios, also enables their execution through so-called mashup plans. The way a plan is executed depends on the non-functional requirements of the user, i.e. the components used for the execution are determined dynamically. The modeling is based on the Pipes-and-Filters pattern, in which modular services with uniform interfaces and a uniform data exchange format can be connected with each other arbitrarily. These services represent either the extraction of data, the processing of extracted data, or the visualization of the results. A previously unsolved problem with FlexMash is that even after minimal changes to the model, the entire mashup plan is executed again, which for large amounts of data leads to a greatly increased runtime and correspondingly limited usability. To counter this problem, a partial execution of the modeled processes, i.e. of the mashup plan, is desirable. In this context, the application of various concepts - such as 'smart' re-runs [2] or model-as-you-go [3] - is conceivable, so that the response time of the system can be reduced.
Goals
In this thesis, different concepts for the partial execution of workflows shall be evaluated with regard to their applicability in the area of data mashups. Furthermore, a suitable concept tailored to the requirements of the tool FlexMash shall be developed and prototypically implemented. The results shall finally be evaluated against the formulated requirements.
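One conceivable realization of 'smart' re-runs in a pipes-and-filters setting is to cache each node's result under a fingerprint derived from its own configuration and the fingerprints of its inputs, so that after a change only the affected downstream part of the plan is recomputed. The following sketch illustrates that idea; the node names, configurations and execute functions are hypothetical and do not reflect the actual FlexMash architecture.

```python
# Minimal sketch of 'smart' re-runs for a pipes-and-filters mashup plan:
# each node's result is cached under a fingerprint of its configuration plus
# the fingerprints of its inputs, so unchanged sub-plans are not re-executed.
# Node names and operations are hypothetical, not FlexMash code.
import hashlib
import json

cache = {}  # fingerprint -> materialized result

def fingerprint(config, input_fps):
    payload = json.dumps({"config": config, "inputs": input_fps}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_node(name, config, inputs, execute):
    """Execute a node only if no cached result exists for its fingerprint."""
    input_results, input_fps = zip(*inputs) if inputs else ((), ())
    fp = fingerprint({"name": name, **config}, list(input_fps))
    if fp not in cache:
        print(f"executing {name}")
        cache[fp] = execute(*input_results)
    else:
        print(f"reusing cached result of {name}")
    return cache[fp], fp

# Hypothetical three-node plan: extract -> filter -> visualize.
extract = run_node("extract", {"source": "sensor.csv"}, [], lambda: list(range(10)))
filtered = run_node("filter", {"min": 5}, [extract], lambda d: [x for x in d if x >= 5])
run_node("visualize", {"chart": "bar"}, [filtered], lambda d: f"bar chart of {d}")

# After changing only the visualization, extract and filter are reused.
run_node("visualize", {"chart": "line"}, [filtered], lambda d: f"line chart of {d}")
```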
The thesis includes the following tasks:
- Literature research on concepts of partial execution of workflows
- Development of a suitable concept for FlexMash
- Prototypical implementation of the developed concept
- Evaluation of results
References
- [1] Daniel, F., Matera, M. (2014). Mashups. Berlin, Heidelberg: Springer
- [2] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., et al. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10), 1039–1065
- [3] Sonntag, M., Karastoyanova, D. (2013). Model-as-you-go: An Approach for an Advanced Infrastructure for Scientific Workflows. Journal of Grid Computing, 11(3), 553–583
Summary
Type: Master thesis
Title (English): Dynamic Execution of Workflow Parts During Modeling Time
Title (German): Dynamische Teilausführung von Workflows zur Modellierungszeit
Supervisor(s): Dipl.-Inf. Pascal Hirmer; Dipl.-Inf. Michael Behringer
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: Finished
➣ Evaluation of Prediction Mechanisms of Parameters for Data Mining Algorithms
Master thesis
Evaluation of Prediction Mechanisms of Parameters for Data Mining Algorithms
Motivation
The term "Data Analytics" describes a process that turns raw data into knowledge. Nowadays, multiple reference process models exist, such as KDD or CRISP-DM. These reference models generally range from (1) data selection, (2) data transformation, (3) data mining to (4) evaluation and (5) application of the mining results. Although the logical order of the individual steps is reasonable and well established, there are as yet no exact approaches on how exactly to perform the individual steps. In general, analysts need to explore the solution space to find valid options along the analytics process. Domain knowledge of the specific context may be useful, yet performing such a process remains cumbersome. The main reason is an ever-increasing amount of data to analyze, which leads to huge gaps between the individual steps of the process and therefore hinders exploration. In the "Data Mining" step, algorithms and statistical approaches are executed on the data set in order to generate new knowledge. Typically, these algorithms originate from the area of machine learning and require a set of parameters before the actual execution. These parameters are crucial for the quality of the result, since wrong parameters may lead to wrong results or no results at all. However, the mining algorithms need to be executed completely before it is possible to estimate the quality of the algorithm and its parameters. Hence, an analyst has to iterate over multiple instantiations of algorithms and parameters completely, which results in a highly time-consuming iteration cycle. Even a small change in parameters results in long runtimes, showing that the exploration of the solution space of parameters is highly cumbersome for an analyst.
Goals
Currently, there are a few heuristics [1] and best practices [2] for determining parameters for some mining algorithms. These are very specific to each single algorithm and not necessarily well suited for a more general set of algorithms. Especially with regard to big data, some best practices are not feasible because they need to be executed multiple times on the whole data set to approximate solid parameters. Space partitioning algorithms and visualizations seem to be a promising alternative: binary space partitioning algorithms and partitioning visualizations are suitable for separating the data space into smaller chunks that can be processed more easily. The goal of this thesis is to estimate parameters with such a space partitioning approach, e.g. Voronoi tessellation or Delaunay triangulation. Both structures can be obtained, for example, from well-known algorithms [3], but need further fine-tuning to reflect characteristics of mining algorithms, such as specific density or distance metrics, in order to estimate promising parameters. This work can pursue different directions: from an exhaustive survey and evaluation of heuristics and best practices for estimating parameters in a time-saving manner for a broad range of mining algorithms, to the development of a novel approach using space partitioning concepts together with a basic comparison against a heuristic for a single mining algorithm. A prototypical implementation of the results should emphasize the contribution of this thesis to the research community.
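To make the idea of space-partitioning-based parameter estimation more tangible, the following sketch builds a Delaunay triangulation over a small sample of the data and uses the distribution of its edge lengths as a cheap density proxy to suggest a starting value for DBSCAN's eps parameter. The sample size and the chosen percentile are illustrative assumptions, not a validated heuristic from the literature.

```python
# Minimal sketch: derive a starting value for DBSCAN's eps parameter from the
# edge-length distribution of a Delaunay triangulation built on a sample.
# Sample size and percentile are illustrative assumptions, not validated.
import numpy as np
from scipy.spatial import Delaunay

def suggest_eps(points, sample_size=2_000, percentile=25, seed=0):
    rng = np.random.default_rng(seed)
    if len(points) > sample_size:
        points = points[rng.choice(len(points), sample_size, replace=False)]
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:  # collect the unique edges of all triangles
        for i in range(len(simplex)):
            for j in range(i + 1, len(simplex)):
                edges.add(tuple(sorted((simplex[i], simplex[j]))))
    lengths = [np.linalg.norm(points[a] - points[b]) for a, b in edges]
    return float(np.percentile(lengths, percentile))

# Hypothetical two-cluster data set.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (500, 2)), rng.normal(3, 0.3, (500, 2))])
print("suggested eps:", suggest_eps(X))
```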
The thesis includes the following tasks:
- Researching heuristics for frequently used data mining algorithms
- Research and Evaluation of space partitioning approaches
- Prototypical implementation
- Evaluation of the results
References
- [1] V. Birodkar and D. R. Edla, “Enhanced K-Means Clustering Algorithm using A Heuristic Approach”, Journal of Information and Computing Science, vol. 9, no. 4, pp. 277–284, 2014
- [2] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley, “Google Vizier: A Service for Black-Box Optimization,” in Proceedings of the SIGKDD Conference on Knowledge Discovery and Data Mining, 2017
- [3] S. Fortune, “A Sweepline Algorithm for Voronoi Diagrams,” in Proceedings of the Second Annual Symposium on Computational Geometry, 1986, pp. 313–322
Summary
Type: Master thesis
Title (English): Evaluation of Prediction Mechanisms of Parameters for Data Mining Algorithms
Title (German): Bewertung von Vorhersagemechanismen von Parametern für Data-Mining-Algorithmen
Supervisor(s): Manuel Fritz, M.Sc.; Dipl.-Inf. Michael Behringer
Examiner: PD Dr. rer. nat. habil. Holger Schwarz
Status: Finished