Publications

Refine Results

(Filters Applied) Clear All

R&D Areas

R&D Groups

Year

Items per page

Tagged As

big data Clear filter

Genetic sequence matching using D4M big data approaches

September 9, 2014

Conference Paper

Author:

Darrell O. Ricke

…

Published in:

HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Topic:

high performance computing

R&D area:

R&D group:

Summary

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold-standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.

READ LESS

Summary

Genetic sequence matching using D4M big data approaches

Using 3D printing to visualize social media big data

July 23, 2014

Journal Article

Author:

Zachary J. Weber

…

Vijay N. Gadepally

Published in:

HPEC 2014: IEEE Conf. on High Performance Extreme Computing, 9-11 September 2014.

Topic:

3D printing

R&D area:

Cyber Security and Information Sciences

R&D group:

Summary

Big data volume continues to grow at unprecedented rates. One of the key features that makes big data valuable is the promise to find unknown patterns or correlations that may be able to improve the quality of processes or systems. Unfortunately, with the exponential growth in data, users often have difficulty in visualizing the often-unstructured, non-homogeneous data coming from a variety of sources. The recent growth in popularity of 3D printing has ushered in a revolutionary way to interact with big data. Using a 3D printed mockup up a physical or notional environment, one can display data on the mockup to show real-time data patterns. In this poster and demonstration, we describe the process of 3D printing and demonstrate an application of displaying Twitter data on a 3D mockup of the Massachusetts Institute of Technology (MIT) campus, known as LuminoCity.

READ LESS

Summary

Using 3D printing to visualize social media big data

Effective parallel computation of eigenpairs to detect anomalies in very large graphs

February 14, 2014

Presentation

Author:

Michael M. Wolf

…

Benjamin A. Miller

Published in:

SIAM Conference on Parallel Processing for Scientific Computing

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Cyber Operations and Analysis Technology

Summary

The computational driver for an important class of graph analysis algorithms is the computation of leading eigenvectors of matrix representations of the graph. In this presentation, we discuss the challenges of calculating eigenvectors of modularity matrices derived from very large graphs (upwards of a billion vertices) and demonstrate the scaling properties of parallel eigensolvers when applied to these matrices.

READ LESS

Summary

Effective parallel computation of eigenpairs to detect anomalies in very large graphs

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

September 10, 2013

Conference Paper

Author:

Jeremy Kepner

…

Published in:

HPEC 2013: IEEE Conf. on High Performance Extreme Computing, 10-12 September 2013.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Summary

Non-traditional, relaxed consistency, triple store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high performance open source relaxed consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [http://www.mit.edu.ezproxyberklee.flo.org/~kepner/D4M] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases D4M naturally leads to a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface. Any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.

READ LESS

Summary

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

Benchmarking parallel eigen decomposition for residuals analysis of very large graphs

September 10, 2012

Conference Paper

Author:

Edward M. Rutledge

…

Published in:

HPEC 2012: IEEE Conf. on High Performance Extreme Computing, 10-12 September 2012.

Topic:

big data

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Graph analysis is used in many domains, from the social sciences to physics and engineering. The computational driver for one important class of graph analysis algorithms is the computation of leading eigenvectors of matrix representations of a graph. This paper explores the computational implications of performing an eigen decomposition of a directed graph's symmetrized modularity matrix using commodity cluster hardware and freely available eigensolver software, for graphs with 1 million to 1 billion vertices, and 8 million to 8 billion edges. Working with graphs of these sizes, parallel eigensolvers are of particular interest. Our results suggest that graph analysis approaches based on eigen space analysis of graph residuals are feasible even for graphs of these sizes.

READ LESS

Summary

Benchmarking parallel eigen decomposition for residuals analysis of very large graphs

Driving big data with big compute

September 10, 2012

Conference Paper

Author:

Chansup Byun

…

Published in:

HPEC 2012: IEEE Conf. on High Performance Extreme Computing, 10-12 September 2012.

Topic:

high performance computing

R&D area:

Cyber Security and Information Sciences

R&D group:

Summary

Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters provide high parallel efficiency for compute intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds. LLGrid MapReduce allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster. D4M (Dynamic Distributed Dimensional Data Model) provided a high level distributed arrays interface to the Apache Accumulo database. The accessibility of these technologies is assessed by measuring the effort to use these tools and is typically a few lines of code. The performance is assessed by measuring the insert rate into the Accumulo database. Using these tools a database insert rate of 4M inserts/second has been achieved on an 8 node cluster.

READ LESS

Summary

Driving big data with big compute

Dynamic Distributed Dimensional Data Model (D4M) database and computation system

March 25, 2012

Conference Paper

Author:

Jeremy Kepner

…

Published in:

ICASSP 2012, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 25-30 March 2012, pp. 5349-52.

Topic:

big data

R&D area:

R&D group:

Summary

A crucial element of large web companies is their ability to collect and analyze massive amounts of data. Tuple store databases are a key enabling technology employed by many of these companies (e.g., Google Big Table and Amazon Dynamo). Tuple stores are highly scalable and run on commodity clusters, but lack interfaces to support efficient development of mathematically based analytics. D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language "SQL" databases). D4M allows linear algebra to be readily applied to databases. Using D4M, it is possible to create composable analytics with significantly less effort than using traditional approaches. This work describes the D4M technology and its application and performance.

READ LESS

Summary

Dynamic Distributed Dimensional Data Model (D4M) database and computation system

Publications

Refine Results

Tagged As

Genetic sequence matching using D4M big data approaches

Summary

Summary

Using 3D printing to visualize social media big data

Summary

Summary

Effective parallel computation of eigenpairs to detect anomalies in very large graphs

Summary

Summary

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

Summary

Summary

Benchmarking parallel eigen decomposition for residuals analysis of very large graphs

Summary

Summary

Driving big data with big compute

Summary

Summary

Dynamic Distributed Dimensional Data Model (D4M) database and computation system

Summary

Summary

Showing Results