Scaling Complex Analytical Processing on Graph Structured Data Using Map Reduce


Title: Scaling Complex Analytical Processing on Graph Structured Data Using Map Reduce
Author: Sridhar, Radhika
Advisors: Dr. Kemafor Anyanwu, Committee Chair
Dr. Xiaosong Ma, Committee Member
Dr. Tao Xie, Committee Member
Abstract: Efficient analytical processing at Web scale has become an important requirement as more decision support applications rely on data on the Web. One approach to achieving significant scalability is to use parallel processing techniques on a computational cluster of commodity-grade machines. Software platforms such as Map-Reduce, Hadoop and Pig are now available that allow users to encode their tasks in terms of simple low-level primitives that are easily parallelizable. Further, a high-level dataflow language called Pig Latin has been proposed for specifying analytical processing tasks using a mixture of the procedural and declarative paradigms. This approach strikes a good balance between customizability and the potential for automatic query optimization. However, the analytical processing capability currently offered by these frameworks is fairly basic and as such has limited applicability to many real-world scenarios. Furthermore, an increasing amount of the data being made available on the Web is semi-structured. For example, some search engines report that the Resource Description Framework (RDF), the recent W3C standard for representing metadata on the Web, already accounts for about 8,502,794 Web data URLs and 2,759,040 documents. However, such data is typically organized as a set of binary relations (a graph), whereas these frameworks are primarily targeted at processing data structured as n-ary relational tables. This thesis addresses the problem of enabling scalable analytical data processing on RDF datasets. Its approach is based on extending Yahoo’s Pig system (an open-source parallel processing system) with constructs that allow complex data processing problems on graph-structured data to be expressed in a manner that is more amenable to automatic parallelization. Specifically, it makes the following contributions:
1. Extends Pig Latin, the dataflow language for Pig, with primitives that support the expression of queries in terms of a readily parallelizable multidimensional join operator, as well as the expression of graph navigational filter expressions.
2. Implements the introduced primitives in a Hadoop implementation running on VCL.
3. Develops a cost model for estimating the cost of queries expressed in terms of the multidimensional join operator.
Date: 2009-01-06
Degree: MS
Discipline: Computer Science

Files in this item

File: etd.pdf
Size: 2.314 MB
Format: PDF
