Fast, Incremental, and Scalable All Pairs Similarity Search

dc.contributor.advisor Anatoli V. Melechko, Committee Co-Chair en_US
dc.contributor.advisor Christopher G. Healey, Committee Member en_US
dc.contributor.advisor Kemafor Anyanwu, Committee Member en_US
dc.contributor.advisor Nagiza F. Samatova, Committee Chair en_US
dc.contributor.author Awekar, Amit Chintamani en_US
dc.date.accessioned 2010-04-02T18:48:57Z
dc.date.available 2010-04-02T18:48:57Z
dc.date.issued 2010-01-19 en_US
dc.identifier.other etd-12022009-094010 en_US
dc.description.abstract Searching for pairs of similar data records is an operation required by many data mining techniques, such as clustering and collaborative filtering. With the emergence of the Web, the scale of data has increased to several millions or billions of records. Business and scientific applications such as search engines, digital libraries, and systems biology often deal with massive data sets in a high-dimensional space. The overarching goal of this thesis is to enable fast and incremental similarity search over large high-dimensional data sets through improved indexing, systematic heuristic optimizations, and scalable parallelization. In Task 1, we design a sequential algorithm for all pairs similarity search (APSS), which involves finding all pairs of records having similarity above a specified threshold. Our proposed fast matching technique speeds up APSS computation by using novel, tighter bounds for similarity computation and an indexing data structure. It offers the fastest solution known to date, with up to 6X speed-up over the state-of-the-art APSS algorithm. In Task 2, we address the incremental formulation of the APSS problem, where APSS is performed multiple times over a given data set while varying the similarity threshold. Our goal is to avoid redundant computations across multiple invocations of APSS by storing the history of computation during each APSS run. Depending on the similarity threshold variation, our proposed history binning and index splitting techniques achieve speed-ups from 2X to over 100000X over the state-of-the-art APSS algorithm. To the best of our knowledge, this is the first work that addresses this problem. In Task 3, we design scalable parallel algorithms for APSS that take advantage of modern multi-processor, multi-core architectures to further scale up the APSS computation. Our proposed index sharing technique divides the APSS computation into independent tasks and achieves ideal strong scaling behavior on shared memory architectures. We also propose a complementary incremental index sharing technique, which provides a memory-efficient parallel APSS solution while maintaining almost linear speed-up. Performance of our parallel APSS algorithms remains consistent for data sets of various sizes. To the best of our knowledge, this is the first work that explores parallelization for APSS. We demonstrate the effectiveness of our techniques using four real-world million-record data sets. en_US
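For readers unfamiliar with the problem statement, the following is a minimal sketch of baseline APSS over an inverted index. It is not the fast matching technique from the thesis: it applies no pruning bounds, and the choice of cosine similarity, the sparse-dictionary record format, and the names normalize and all_pairs are assumptions made only for illustration.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def all_pairs(records, threshold):
    """Return {(i, j): cosine} for every i < j with cosine >= threshold.

    Hypothetical baseline: each record is matched against an inverted
    index of the records seen so far, then added to that index.
    """
    index = defaultdict(list)  # feature -> [(record id, weight), ...]
    result = {}
    for j, raw in enumerate(records):
        vec = normalize(raw)
        # Accumulate dot products with earlier records sharing a feature.
        scores = defaultdict(float)
        for f, w in vec.items():
            for i, wi in index[f]:
                scores[i] += w * wi
        for i, s in scores.items():
            if s >= threshold:
                result[(i, j)] = s
        # Index the current record so later records can match against it.
        for f, w in vec.items():
            index[f].append((j, w))
    return result


# Toy data: records 0 and 1 are nearly identical, record 2 is unrelated.
records = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.1},
    {"c": 1.0, "d": 1.0},
]
pairs = all_pairs(records, threshold=0.9)
```

Only records that share at least one feature are ever compared, which is why an inverted index helps on sparse, high-dimensional data. Rerunning all_pairs with a different threshold repeats all of this work, which is the redundancy the incremental formulation in Task 2 aims to avoid.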
dc.rights I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. en_US
dc.subject similarity search en_US
dc.subject parallel algorithms en_US
dc.subject data mining en_US
dc.subject inverted index en_US
dc.title Fast, Incremental, and Scalable All Pairs Similarity Search en_US
dc.degree.name PhD en_US
dc.degree.level dissertation en_US
dc.degree.discipline Computer Science en_US

Files in this item

Files Size Format
etd.pdf 1.971Mb PDF
