An open solution to discover the graph structure of World Wide Web

Title: An open solution to discover the graph structure of World Wide Web
Author: Chen, Kunsheng
Advisors: Vincent W. Freeh, Committee Chair
Frank Mueller, Committee Member
Xuxian Jiang, Committee Member
Abstract: The World Wide Web is a large, complex network of inter-linked web pages. Understanding this structure is of immense benefit both economically and socially. Currently, only incomplete or sparse information about the graph structure of the Web is available in the public domain; the full data is closely guarded by a handful of corporations. Nevertheless, studies of the topological structure of the World Wide Web benefit not only scientists and e-commerce merchants but also ordinary users. A better understanding of this structure helps scientists develop new technologies to improve the Internet. It also helps companies build optimal e-commerce solutions to meet their business needs. The goal of this thesis is to evaluate an open source solution for mapping the structure of the Web. In support of this thesis, we have implemented a prototype using existing open source software, including the volunteer computing platform BOINC (Berkeley Open Infrastructure for Network Computing) and the Hadoop MapReduce framework. We use the computing power and disk space provided by BOINC to perform data collection, and the Hadoop MapReduce framework to perform data analysis on a large data set. The contributions of our research include a low-cost open solution for a distributed web crawling system using BOINC and a URL ranking system built on the Hadoop MapReduce framework. We also provide a feasibility study on crawling the Web using this solution and present experimental results.
Date: 2009-12-23
Degree: MS
Discipline: Computer Science

Files in this item

Files Size Format
etd.pdf 247.5Kb PDF
