zur Startseite

Masterarbeit

Distributed Data Analytics using Graph Processing Frameworks
Betreuer Dr. rer. nat. Muhammad Adnan Tariq
Dipl.-Inf. Christian Mayer
Prüfer Prof. Dr. rer. nat. Dr. h. c. Kurt Rothermel
Beschreibung

The trend for Big Data analytics continues to attract academia and industry to find feasible solutions for the challenges introduced by processing huge data sets. Large companies such as Google, Facebook and Yahoo are seeking for data analysts and statisticians that are able to extract value out of web, user or financial data.

Because of the size and complexity of the data, Big Data analytics is usually performed in the cloud (and hence in large data centers) and distributed across many computers in order to parallelize execution and improve the latency that is perceived by the data analysts. For the distributed analysis, expert knowledge in both areas, distributed systems and algorithmic design, had been required. In an attempt to release the data scientist from the burden of error-prone and time-consuming distributed system programming, Google initiated the MapReduce project in 2004 [1]. MapReduce was developed as middleware between data scientist and a computing cluster. The data scientist specifies two sequential functions, map and reduce, and the MapReduce framework executes these functions in a parallel, distributed and failure-tolerant way.

MapReduce-based systems are optimized for embarassingly parallel problems, where processors compute the same task on different splits of the data independently from each other. However, the MapReduce abstraction is inefficient for iterative, complex analysis on graph data and data with large computational dependencies [2], mainly because of the overhead of restarting a series of fine-granular MapReduce jobs again and again and writing the data to stable storage after each MapReduce job.

In order to overcome the limitations of MapReduce-based frameworks, researchers proposed graph-parallel systems such as Pregel [2] and GraphLab [3]. Programmers specify their algorithms in a think-like-a-vertex mode (e.g. in the Bulk Synchronous Parallel (BSP) model [4]), where each vertex aggregates values from its neighbors, updates local variables and sends values back to its neighbors. The graph-parallel systems execute these vertex-centric programs in parallel on a computing cluster in a failure-tolerant manner. In this way, these systems provide a middleware comparable to MapReduce-based frameworks that decouples algorithmic from distributed computing expertise.

The class of algorithms that are expressible with the vertex-centric programming model is not limited to classical graph algorithms such as Shortest Paths or the Traveling Salesman Problem, but also to large-scale machine learning algorithms such as clustering or collaborative filtering. In the thesis, we will focus our attention to these non-traditional graph algorithms.

Goals and Tasks

The goals and tasks of this thesis are the following:

  • Survey existing algorithms that can be expressed in the mentioned graph-parallel frameworks and categorize them according to their characteristics and feasibility of parallel execution.
  • Select three specific algorithms and modify them to fit into a particular graph-parallel framework such as GraphLab.
  • Implement and evaluate these algorithms on a computing cluster.

 

The results have to be documented in a written report and presented in the colloquium of the department of Distributed Systems.

Requirements

The student should have a good algorithmic comprehension, especially in the area of distributed machine learning. Moreover, good programming skills are required for implementing and deploying these algorithms on a computing cluster using current graph-parallel technologies. The student can write and present the thesis in German or in English.

References
[1] Dean, J. & Ghemawat, S. MapReduce: simplified data processing on large clusters Communications of the ACM, ACM, 2008, 51, 107-113
[2] Malewicz, G.; Austern, M. H.; Bik, A. J.; Dehnert, J. C.; Horn, I.; Leiser, N. & Czajkowski, G. Pregel: a system for large-scale graph processing Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010, 135-146
[3] Low, Y.; Bickson, D.; Gonzalez, J.; Guestrin, C.; Kyrola, A. & Hellerstein, J. M. Distributed GraphLab: a framework for machine learning and data mining in the cloud Proceedings of the VLDB Endowment, VLDB Endowment, 2012, 5, 716-727
[4] Valiant, L. G. A bridging model for parallel computation Communications of the ACM, ACM, 1990, 33, 103-111