Publications
A survey of cryptographic approaches to securing big-data analytics in the cloud
Summary
The growing demand for cloud computing motivates the need to study the security of data received, stored, processed, and transmitted by a cloud. In this paper, we present a framework for such a study. We introduce a cloud computing model that captures a rich class of big-data use-cases and allows...
A test-suite generator for database systems
Summary
In this paper, we describe the SPAR Test Suite Generator (STSG), a new test-suite generator for SQL-style database systems. This tool produces an entire test suite (data, queries, and ground-truth answers) as a unit, in response to a user's specification. Thus, database evaluators can use this tool to...
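A minimal sketch of that idea (not the STSG implementation; the people table, the range query, and the SQLite back end are illustrative assumptions): the generator emits data, a query, and the ground-truth answer together, with the answer computed directly from the generated data rather than by the database under test.

```python
# Toy sketch of a test-suite generator: data, query, and ground truth as one unit.
import random
import sqlite3

def generate_test_suite(num_rows=100, seed=7):
    rng = random.Random(seed)
    # Synthetic data: (id, age) pairs drawn from a fixed distribution.
    rows = [(i, rng.randint(18, 90)) for i in range(num_rows)]
    # A query drawn from the "specification" (here, a single range query).
    query = "SELECT id FROM people WHERE age >= 65 ORDER BY id"
    # Ground truth computed directly from the generated data,
    # independently of any database engine.
    truth = sorted(i for i, age in rows if age >= 65)
    return rows, query, truth

def check_database(rows, query, truth):
    # Load the generated data into the system under test (SQLite here)
    # and compare its answer against the precomputed ground truth.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE people (id INTEGER, age INTEGER)")
    db.executemany("INSERT INTO people VALUES (?, ?)", rows)
    answer = [r[0] for r in db.execute(query)]
    return answer == truth

rows, query, truth = generate_test_suite()
print("query passes:", check_database(rows, query, truth))
```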
Sparse matrix partitioning for parallel eigenanalysis of large static and dynamic graphs
Summary
Numerous applications focus on the analysis of entities and the connections between them, and such data are naturally represented as graphs. In particular, the detection of a small subset of vertices with anomalous coordinated connectivity is of broad interest, for problems such as detecting strange traffic in a computer network...
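A simplified sketch of the underlying eigenanalysis (it omits the paper's partitioning scheme; the modularity-style residual B = A - k k^T / (2m) and the planted-clique example are illustrative assumptions): vertices are scored by their weight in the leading eigenvectors of a sparse residual matrix, which tends to surface small sets with anomalous coordinated connectivity.

```python
# Illustrative sketch: score graph vertices via leading eigenvectors of a
# sparse modularity-style residual matrix, without ever forming it densely.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

def anomaly_scores(A, num_eigs=2):
    A = sp.csr_matrix(A, dtype=float)
    k = np.asarray(A.sum(axis=1)).ravel()   # vertex degrees
    two_m = k.sum()                          # total edge-endpoint count (2m)
    n = A.shape[0]
    # Matrix-vector product for B = A - k k^T / (2m), applied implicitly.
    B = LinearOperator((n, n), matvec=lambda x: A @ x - k * (k @ x) / two_m,
                       dtype=float)
    vals, vecs = eigsh(B, k=num_eigs, which="LA")   # leading eigenpairs of B
    # Vertices with large components in the leading eigenspace stand out.
    return np.linalg.norm(vecs, axis=1)

# Toy example: sparse random background graph plus a small planted clique.
rng = np.random.default_rng(0)
n = 200
dense = (rng.random((n, n)) < 0.01).astype(float)
clique = [5, 17, 42, 99, 123, 150, 160, 188]
for i in clique:
    for j in clique:
        dense[i, j] = 1.0          # plant a fully connected subgraph
dense = np.maximum(dense, dense.T)  # symmetrize
np.fill_diagonal(dense, 0.0)        # no self-loops
scores = anomaly_scores(sp.csr_matrix(dense))
# The planted clique members should typically dominate the top scores.
print("highest-scoring vertices:", np.argsort(scores)[-8:])
```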
Big Data dimensional analysis
Summary
Collecting and analyzing large amounts of data is a growing challenge within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity, and variety. One of the main challenges associated with big data...
Achieving 100,000,000 database inserts per second using Accumulo and D4M
Summary
Apache Accumulo is an open-source, relaxed-consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed...
Genetic sequence matching using D4M big data approaches
Summary
Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method...
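A minimal sketch of the associative-array style of matching this line of work builds on (not the D4M implementation; the 10-mer length and toy sequences are assumptions): each sequence becomes a sparse row of k-mer indicators, and matching reduces to a sparse matrix multiply that counts k-mers shared between sample and reference sequences.

```python
# Sketch of k-mer based sequence matching via sparse matrix multiplication.
import scipy.sparse as sp

def kmer_matrix(sequences, k=10):
    # Map every observed k-mer to a column index; one sparse row per sequence.
    vocab, rows, cols = {}, [], []
    for r, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            c = vocab.setdefault(seq[i:i + k], len(vocab))
            rows.append(r); cols.append(c)
    return sp.csr_matrix(([1.0] * len(rows), (rows, cols))), vocab

def match_counts(samples, references, k=10):
    ref_mat, vocab = kmer_matrix(references, k)
    # Build the sample matrix against the reference k-mer vocabulary only.
    rows, cols = [], []
    for r, seq in enumerate(samples):
        for i in range(len(seq) - k + 1):
            c = vocab.get(seq[i:i + k])
            if c is not None:
                rows.append(r); cols.append(c)
    samp_mat = sp.csr_matrix(([1.0] * len(rows), (rows, cols)),
                             shape=(len(samples), len(vocab)))
    # (samples x kmers) @ (kmers x references): shared k-mer counts.
    return (samp_mat @ ref_mat.T).toarray()

refs = ["ACGTACGTACGTTTG", "TTGCAACGGTACGTA"]
samp = ["ACGTACGTACGTAAA"]
print(match_counts(samp, refs, k=10))
```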
Content+context=classification: examining the roles of social interactions and linguistic content in Twitter user classification
Summary
Twitter users demonstrate many characteristics via their online presence. Connections, community memberships, and communication patterns reveal both idiosyncratic and general properties of users. In addition, the content of tweets can be critical for distinguishing the role and importance of a user. In this work, we explore Twitter user classification using...
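A toy sketch of the general content-plus-context idea (the feature choices, example users, and logistic-regression model are illustrative assumptions, not the paper's method): bag-of-words tweet text and simple network statistics are concatenated into one feature vector for a single classifier.

```python
# Sketch: combine linguistic content with network context in one classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack, csr_matrix

# Hypothetical examples: (tweet text, [followers, following, mentions], label).
users = [
    ("breaking: new release of our data tools", [5000, 100, 40], "organization"),
    ("had a great coffee this morning", [150, 300, 2], "individual"),
    ("join our webinar on graph analytics", [8000, 50, 60], "organization"),
    ("cannot believe that game last night", [90, 180, 1], "individual"),
]
texts = [u[0] for u in users]
context = csr_matrix([u[1] for u in users], dtype=float)
labels = [u[2] for u in users]

vec = TfidfVectorizer()
content = vec.fit_transform(texts)        # linguistic content features
features = hstack([content, context])     # content + context, side by side
clf = LogisticRegression().fit(features, labels)

test = hstack([vec.transform(["free workshop on big data"]),
               csr_matrix([[4000., 80., 30.]])])
print(clf.predict(test))
```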
VizLinc: integrating information extraction, search, graph analysis, and geo-location for the visual exploration of large data sets
Summary
In this demo paper, we introduce VizLinc, an open-source software suite that integrates automatic information extraction, search, graph analysis, and geo-location for interactive visualization and exploration of large data sets. VizLinc helps users in: 1) understanding the type of information the data set under study might contain, 2) finding patterns...
Effective Entropy: security-centric metric for memory randomization techniques
Summary
User-space memory randomization is an emerging class of cyber-defense technology that attempts to protect computing systems by randomizing the layout of memory. Quantitative metrics are needed to evaluate the effectiveness of these techniques against modern adversaries and to compare randomization technologies. We introduce Effective Entropy, a...
Using 3D printing to visualize social media big data
Summary
Big data volume continues to grow at unprecedented rates. One of the key features that makes big data valuable is the promise of finding unknown patterns or correlations that may improve the quality of processes or systems. Unfortunately, with the exponential growth in data, users often have...