A Transparent Collaborative Framework for Efficient Data Analysis and Knowledge Annotation on the Web


Title: A Transparent Collaborative Framework for Efficient Data Analysis and Knowledge Annotation on the Web
Author: Breimyer, Paul William
Advisors: Professor Nagiza F. Samatova, Committee Chair
Professor Steffen Heber, Committee Member
Professor Tao Xie, Committee Member
Professor Mladen Vouk, Committee Member
Abstract: High-throughput experiments and ultrascale computing generate scientific data of growing size and complexity. These trends challenge traditional data analysis environments, most of which are based on scripting languages such as R, MATLAB or IDL. To address some of these challenges, this research proposes a framework whose overarching goal is to enable large-scale, high-performance data analytics and collaborative knowledge annotation over the Web. The proposed framework has three major components, which parallel the three core steps of the knowledge discovery cycle. For the first step, defining the data analysis pipeline, the research designs and implements a Web-enabled, interactive and collaborative R-based statistical environment. The component implements a memory management system that minimizes memory requirements, thereby enabling multi-user scalability. To the best of our knowledge, this is the first Web-enabled R system that supports interactive remote access to R servers and enables users to share data, results and analysis sessions. For the second step, executing the data analysis pipeline, the research investigates and proposes a transparent, low-overhead means of executing external compiled-language parallel codes from within R, thus seamlessly bridging two code development paradigms: efficient, compiled parallel codes and high-abstraction, easy-to-use scripting codes. This component contains three elements: a transparent bidirectional translation of data objects between R and compiled languages, such as C/C++/Fortran; seamless integration of external parallel codes; and automatic parallelization of data-parallel computations in hybrid multi-core and multi-node execution environments.
For the third step, annotating the predictive knowledge derived from community analysis pipelines, the research explores an environment for semantically rich, structured and queryable annotation of facts, relationships between those facts, and complex events reported in scientific literature. The social networking nature of this component allows the community to improve the predictions as well as generate new, higher-level inferences, thus filling in the gaps in the community's understanding of physical phenomena. The environment offers mechanisms for streamlining the annotated and curated knowledge into distributed public databases, thus enabling a feedback loop into the database-publication cycle to allow scientists to make connections between data-driven predictions and published evidence.
Date: 2009-07-23
Degree: PhD
Discipline: Computer Science
URI: http://www.lib.ncsu.edu/resolver/1840.16/4020

Files in this item

File Size Format
etd.pdf 8.960Mb PDF
