Fast, Incremental, and Scalable All Pairs Similarity Search

Abstract

Searching pairs of similar data records is an operation required for many data mining techniques like clustering and collaborative filtering. With emergence of the Web, scale of the data has increased to several millions or billions of records. Business and scientific applications like search engines, digital libraries, and systems biology often deal with massive data sets in a high dimensional space. The overarching goal of this thesis is to enable fast and incremental similarity search over large high dimensional data sets through improved indexing, systematic heuristic optimizations, and scalable parallelization. In Task 1, we design a sequential algorithm for all pairs similarity search (APSS) that involves finding all pairs of records having similarity above a specified threshold. Our proposed fast matching technique speeds-up APSS computation by using novel tighter bounds for similarity computation and indexing data structure. It offers the fastest solution known to-date with up to 6X speed-up over the state-of-the-art existing APSS algorithm. In Task 2, we address the incremental formulation of APSS problem, where APSS is performed multiple times over a given data set while varying the similarity threshold. Our goal is to avoid redundant computations across multiple invocations of APSS by storing history of computation during each APSS. Depending on the similarity threshold variation, our proposed history binning and index splitting techniques achieve speed-ups from 2X to over 100000X over the state-of-the-art APSS algorithm. To the best of our knowledge, this is the first work that addresses this problem. In Task 3, we design scalable parallel algorithms for APSS that take advantage of modern multi-processor, multi-core architectures to further scale-up the APSS computation. Our proposed index sharing technique divides the APSS computation into independent tasks and achieves ideal strong scaling behavior on shared memory architectures. We also propose a complementary incremental index sharing technique, which provides a memory-efficient parallel APSS solution while maintaining almost linear speed-up. Performance of our parallel APSS algorithms remains consistent for datasets of various sizes. To the best of our knowledge, this is the first work that explores parallelization for APSS. We demonstrate the effectiveness of our techniques using four real-world million record data sets.

Description

Keywords

similarity search, parallel algorithms, data mining, inverted index

Citation

Degree

PhD

Discipline

Computer Science

Collections