An open solution to discover the graph structure of World Wide Web
Date
2009-12-23
Abstract
The World Wide Web is a large, complex network of interlinked web pages. Understanding
this structure is of immense economic and social benefit. Currently, only incomplete
or sparse information about the graph structure of the Web is available in the public
domain. The full data is closely guarded by a handful of corporations.
Nevertheless, studies of the topological structure of the World Wide Web benefit not
only scientists and e-commerce merchants but also ordinary users. A better understanding
of this structure helps scientists develop new technologies to improve the Internet. It
also helps companies build optimal e-commerce solutions to meet their business needs.
The goal of this thesis is to evaluate an open source solution for mapping the
structure of the Web. In support of this thesis, we have implemented a prototype using
existing open source software, including the volunteer computing platform BOINC (Berkeley Open
Infrastructure for Network Computing) and the Hadoop MapReduce framework. We use the
computing power and disk space from BOINC to perform data collection and the Hadoop
MapReduce framework to perform data analysis on a large set of data.
The contributions of our research include a low-cost, open solution for distributed
web crawling using BOINC and a URL ranking system built on the Hadoop MapReduce
framework. We also provide a feasibility study of crawling the Web with this solution
and present experimental results.
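
The abstract does not say how the URL ranking is computed, so the following is only a minimal sketch of how a MapReduce-style ranking over crawled link data might look. It assumes that URLs are ranked by in-link count (in-degree), which is an assumption and not necessarily the method used in the thesis. It is written in plain Python rather than against the Hadoop API. The function names (map_links, shuffle, reduce_counts, rank_by_in_degree) and the sample crawl data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_links(page_url: str, out_links: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (target_url, 1) for each distinct outgoing link of a page."""
    for target in set(out_links):
        if target != page_url:  # assumption: self-links do not count toward rank
            yield target, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key, as a MapReduce framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(url: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the emitted counts to get the in-degree of one URL."""
    return url, sum(counts)


def rank_by_in_degree(crawl: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Run map, shuffle and reduce in-process and sort URLs by in-degree."""
    emitted = (pair for page, links in crawl.items() for pair in map_links(page, links))
    totals = [reduce_counts(url, counts) for url, counts in shuffle(emitted).items()]
    # Highest in-degree first; ties are broken alphabetically for a stable order.
    return sorted(totals, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical crawl output: page URL -> links found on that page.
    sample_crawl = {"a": ["b", "c"], "b": ["c"], "c": ["a", "c"]}
    for url, in_degree in rank_by_in_degree(sample_crawl):
        print(url, in_degree)
```

In a real deployment the map and reduce functions would run as separate tasks over crawl output stored in a distributed file system. The in-process shuffle here only stands in for the grouping that the framework performs between the two phases.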
Keywords
graph structure of World Wide Web, distributed web crawler
Degree
MS
Discipline
Computer Science