An open solution to discover the graph structure of World Wide Web

dc.contributor.advisor: Vincent W. Freeh, Committee Chair (en_US)
dc.contributor.advisor: Frank Mueller, Committee Member (en_US)
dc.contributor.advisor: Xuxian Jiang, Committee Member (en_US)
dc.contributor.author: Chen, Kunsheng (en_US)
dc.date.accessioned: 2010-04-02T18:01:41Z
dc.date.available: 2010-04-02T18:01:41Z
dc.date.issued: 2009-12-23 (en_US)
dc.degree.discipline: Computer Science (en_US)
dc.degree.level: thesis (en_US)
dc.degree.name: MS (en_US)
dc.description.abstract: The World Wide Web is a large, complex network of interlinked web pages. Understanding this structure is of immense benefit both economically and socially. Currently, only incomplete or sparse information about the graph structure of the Web is available in the public domain. The full data is closely guarded by a handful of corporations. Nevertheless, studies on the topological structure of the World Wide Web benefit not only scientists and e-commerce merchants but also ordinary users. A better understanding of this structure helps scientists develop new technologies to improve the Internet. It also helps companies build optimal e-commerce solutions to meet their business needs. The goal of this thesis is to evaluate an open source solution for mapping the structure of the Web. In support of this thesis, we have implemented a prototype using existing open source software, including the volunteer computing library BOINC (Berkeley Open Infrastructure for Network Computing) and the Hadoop MapReduce framework. We use the computing power and disk space provided by BOINC to perform data collection, and the Hadoop MapReduce framework to perform data analysis on a large data set. The contributions of our research include a low-cost open solution for a distributed web crawling system using BOINC and a URL ranking system built on the Hadoop MapReduce framework. We also provide a feasibility study on crawling the Web using the above solution and present experimental results. (en_US)
dc.identifier.other: etd-12222009-192226 (en_US)
dc.identifier.uri: http://www.lib.ncsu.edu/resolver/1840.16/1176
dc.rights: I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. (en_US)
dc.subject: graph structure of World Wide Web (en_US)
dc.subject: distributed web crawler (en_US)
dc.title: An open solution to discover the graph structure of World Wide Web (en_US)

Files

Original bundle

Name: etd.pdf
Size: 247.56 KB
Format: Adobe Portable Document Format