Publications

LLSuperCloud: sharing HPC systems for diverse rapid prototyping

Summary

The supercomputing and enterprise computing arenas come from very different lineages. However, the advent of commodity computing servers has brought the two arenas closer than they have ever been. Within enterprise computing, commodity computing servers have resulted in the development of a wide range of new cloud capabilities: elastic computing, virtualization, and data hosting. Similarly, the supercomputing community has developed new capabilities in heterogeneous, massively parallel hardware and software. Merging the benefits of enterprise clouds and supercomputing has been a challenging goal. Significant effort has been expended in trying to deploy supercomputing capabilities on cloud computing systems. These efforts have resulted in unreliable, low-performance solutions that require enormous expertise to maintain. LLSuperCloud provides a novel solution to the problem of merging enterprise cloud and supercomputing technology. More specifically, LLSuperCloud reverses the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capabilities on a supercomputer. The result is a system that can handle heterogeneous, massively parallel workloads while also providing high-performance elastic computing, virtualization, and databases. The benefits of LLSuperCloud are highlighted using a mixed workload of C MPI, parallel MATLAB, Java, databases, and virtualized web services.

D4M 2.0 Schema: a general purpose high performance schema for the Accumulo database

Summary

Non-traditional, relaxed-consistency, triple-store databases are the backbone of many web companies (e.g., Google Big Table, Amazon Dynamo, and Facebook Cassandra). The Apache Accumulo database is a high-performance, open-source, relaxed-consistency database that is widely used for government applications. Obtaining the full benefits of Accumulo requires using novel schemas. The Dynamic Distributed Dimensional Data Model (D4M) [http://www.mit.edu/~kepner/D4M] provides a uniform mathematical framework based on associative arrays that encompasses both traditional (i.e., SQL) and non-traditional databases. For non-traditional databases, D4M naturally leads to a general-purpose schema that can be used to fully index and rapidly query every unique string in a dataset. The D4M 2.0 Schema has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data. The D4M 2.0 Schema is simple, requires minimal parsing, and achieves the highest published Accumulo ingest rates. The benefits of the D4M 2.0 Schema are independent of the D4M interface; any interface to Accumulo can achieve these benefits by using the D4M 2.0 Schema.
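
The "fully index every unique string" idea is concrete enough to sketch. Below is a minimal Python illustration of the exploded edge-table/transpose-table pattern the schema is built on; the record fields, separator character, and function names are illustrative assumptions, not the paper's actual D4M (MATLAB) code.

    # Minimal sketch of the D4M 2.0 "exploded" schema pattern in Python.
    # Record contents and table layout here are invented for illustration.

    def explode(row_id, record, sep="|"):
        """Turn one record into (row, column, value) triples where each
        field/value pair becomes its own column, e.g. 'src|10.0.0.1'."""
        triples = []
        for field, value in record.items():
            col = f"{field}{sep}{value}"
            triples.append((row_id, col, "1"))   # main edge table entry
        return triples

    def transpose(triples):
        """Transpose table: swap row and column so any value string can be
        looked up as a row key, indexing every unique string."""
        return [(col, row, val) for (row, col, val) in triples]

    log = {"src": "10.0.0.1", "dst": "10.0.0.2", "user": "alice"}
    t = explode("event_0001", log)
    for trip in t + transpose(t):
        print(trip)

Storing both the edge table and its transpose is what makes any string queryable as a row key, which is the property the schema exploits for fast lookups.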

Very large graphs for information extraction (VLG) - summary of first-year proof-of-concept study

Summary

In numerous application domains relevant to the Department of Defense and the Intelligence Community, data of interest take the form of entities and the relationships between them, and these data are commonly represented as graphs. Under the Very Large Graphs for Information Extraction effort--a one-year proof-of-concept study--MIT LL developed novel techniques for anomalous subgraph detection, building on tools in the signal processing research literature. This report documents the technical results of this effort. Two datasets--a snapshot of Thomson Reuters' Web of Science database and a stream of web proxy logs--were parsed, and graphs were constructed from the raw data. From the phenomena in these datasets, several algorithms were developed to model the dynamic graph behavior, including a preferential attachment mechanism with memory, a streaming filter that models a graph as a weighted average of its past connections, and a generalized linear model for graphs whose connection probabilities are determined by additional side information or metadata. A set of metrics was also constructed to facilitate comparison of techniques. The study culminated in a demonstration of the algorithms on the datasets of interest, in addition to simulated data. Performance in terms of detection, estimation, and computational burden was measured according to the metrics. Among the highlights of this demonstration were the detection of emerging coauthor clusters in the Web of Science data, detection of botnet activity in the web proxy data after 15 minutes (activity that took 10 days to detect using state-of-the-practice techniques), and demonstration of the core algorithm on a simulated 1-billion-vertex graph using a commodity computing cluster.
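
Of the three algorithms listed, the streaming filter is the easiest to sketch: treat the expected graph as an exponentially weighted average of past adjacency matrices and flag large residuals. The decay constant, toy snapshots, and injected anomaly below are illustrative assumptions, not the report's actual parameters.

    # Hedged sketch of a streaming graph filter: the expected graph is an
    # exponentially weighted average of past adjacency matrices, and the
    # residual norm jumps when an anomalous subgraph appears.
    import numpy as np

    def ewma_graph_filter(snapshots, decay=0.9):
        """snapshots: iterable of (n x n) adjacency matrices over time.
        Yields (expected, residual) pairs for each new snapshot."""
        expected = None
        for A in snapshots:
            if expected is None:
                expected = A.astype(float)
            else:
                residual = A - expected
                yield expected, residual
                expected = decay * expected + (1 - decay) * A

    # Toy demo: sparse random background with an anomaly injected at t=5.
    rng = np.random.default_rng(0)
    snaps = [(rng.random((20, 20)) < 0.05).astype(float) for _ in range(10)]
    snaps[5][:5, :5] = 1.0  # dense block: an "emerging" subgraph
    for t, (exp, res) in enumerate(ewma_graph_filter(snaps), start=1):
        print(t, round(np.linalg.norm(res), 2))  # norm jumps at the injection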

A language-independent approach to automatic text difficulty assessment for second-language learners

Published in:
Proc. 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, 4-9 August 2013.

Summary

In this paper we introduce a new baseline for language-independent text difficulty assessment applied to the Interagency Language Roundtable (ILR) proficiency scale. We demonstrate that reading level assessment is a discriminative problem that is best suited for regression. Our baseline uses z-normalized shallow length features and TF-LOG weighted bag-of-words vectors for Arabic, Dari, English, and Pashto. We compare Support Vector Machines and the Margin-Infused Relaxed Algorithm, measuring performance by mean squared error. We provide an analysis of which features are most predictive of a given level.
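
A rough reconstruction of that baseline in Python with scikit-learn, for orientation only: the two-document corpus and ILR labels are placeholders, and sublinear TF weighting stands in for the paper's TF-LOG scheme (an assumption on my part).

    # Sketch of the baseline: z-normalized length features + log-weighted
    # bag-of-words, fed to a support vector regressor, scored by MSE.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    from sklearn.metrics import mean_squared_error

    docs = ["short easy text .",
            "a considerably longer and more syntactically involved passage ."]
    ilr = np.array([1.0, 2.5])   # placeholder ILR proficiency levels

    # TF-LOG-style bag-of-words: 1 + log(tf), no IDF
    bow = TfidfVectorizer(sublinear_tf=True, use_idf=False).fit_transform(docs).toarray()

    # z-normalized shallow length features: word and character counts
    lengths = np.array([[len(d.split()), len(d)] for d in docs], dtype=float)
    lengths = StandardScaler().fit_transform(lengths)

    X = np.hstack([bow, lengths])
    model = SVR(kernel="linear").fit(X, ilr)
    print(mean_squared_error(ilr, model.predict(X)))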

Efficient anomaly detection in dynamic, attributed graphs: emerging phenomena and big data

Published in:
ISI 2013: IEEE Int. Conf. on Intelligence and Security Informatics, 4-7 June 2013.

Summary

When working with large-scale network data, the interconnected entities often have additional descriptive information. This additional metadata may provide insight that can be exploited for detection of anomalous events. In this paper, we use a generalized linear model for random attributed graphs to model connection probabilities using vertex metadata. For a class of such models, we show that an approximation to the exact model yields an exploitable structure in the edge probabilities, allowing for efficient scaling of a spectral framework for anomaly detection through analysis of graph residuals, and a fast and simple procedure for estimating the model parameters. In simulation, we demonstrate that taking into account both attributes and dynamics in this analysis has a much more significant impact on the detection of an emerging anomaly than accounting for either dynamics or attributes alone. We also present an analysis of a large, dynamic citation graph, demonstrating that taking additional document metadata into account emphasizes parts of the graph that would not be considered significant otherwise.
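
The residuals framework can be sketched compactly: posit a GLM for edge probabilities from vertex attributes, form the residual matrix B = A - P, and examine its dominant eigenvalue. The logistic link, synthetic attributes, and planted subgraph below are assumptions for illustration, not the paper's estimation procedure.

    # Hedged sketch of spectral anomaly detection on graph residuals under
    # an attribute-driven GLM for edge probabilities.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x = rng.normal(size=(n, 1))                 # one vertex attribute
    logits = -3.0 + 0.5 * (x @ x.T)             # GLM: logit(p_ij) from attributes
    P = 1.0 / (1.0 + np.exp(-logits))

    A = (rng.random((n, n)) < P).astype(float)  # sample a graph from the model
    A[:8, :8] = 1.0                             # plant a dense anomalous subgraph

    B = A - P                                   # residuals under the model
    B = (B + B.T) / 2                           # symmetrize for eigenanalysis
    eigvals = np.linalg.eigvalsh(B)
    print("top residual eigenvalue:", round(eigvals[-1], 2))

Subtracting the attribute-predicted probabilities is what keeps "expected" dense regions from masking a genuinely anomalous one, which is the point the abstract makes about metadata.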

Probabilistic threat propagation for malicious activity detection

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-31 May 2013.

Summary

In this paper, we present a method for detecting malicious activity within networks of interest. We leverage prior community detection work by propagating threat probabilities across graph nodes, given an initial set of known malicious nodes. We enhance prior work by employing constraints which remove the adverse effect of cyclic propagation that is a byproduct of current methods. We demonstrate the effectiveness of Probabilistic Threat Propagation on the task of detecting malicious web destinations.
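
A minimal sketch of the propagation step, assuming a simple neighbor-averaging update with known-malicious seeds clamped at probability 1. The paper's constraint that removes cyclic propagation is deliberately omitted here for brevity, and the graph and seed set are toy stand-ins.

    # Unconstrained flavor of threat propagation: seed nodes stay at 1.0
    # and threat diffuses iteratively to their graph neighbors.
    import networkx as nx

    def propagate_threat(G, seeds, iters=20):
        p = {v: (1.0 if v in seeds else 0.0) for v in G}
        for _ in range(iters):
            nxt = {}
            for v in G:
                if v in seeds:
                    nxt[v] = 1.0                 # clamp known-bad nodes
                else:
                    nbrs = list(G.neighbors(v))
                    nxt[v] = sum(p[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
            p = nxt
        return p

    G = nx.karate_club_graph()                   # stand-in for a web graph
    print(sorted(propagate_threat(G, seeds={0}).items())[:5])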

Link prediction methods for generating speaker content graphs

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-31 May 2013.

Summary

In a speaker content graph, vertices represent speech signals and edges represent speaker similarity. Link prediction methods calculate which potential edges are most likely to connect vertices from the same speaker; those edges are included in the generated speaker content graph. Since a variety of speaker recognition tasks can be performed on a content graph, we provide a set of metrics for evaluating the graph's quality independently of any recognition task. We then describe novel global and incremental algorithms for constructing accurate speaker content graphs that outperform the existing k nearest neighbors link prediction method. We evaluate those algorithms on a NIST speaker recognition corpus.
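
For orientation, here is the k-nearest-neighbors baseline the paper improves upon, sketched in Python; the random similarity matrix stands in for real speaker-comparison scores, which is an assumption of the example.

    # kNN link prediction: connect each recording to the k recordings
    # with the highest speaker-similarity scores.
    import numpy as np

    def knn_content_graph(S, k=3):
        """S: symmetric (n x n) similarity matrix. Returns a set of edges."""
        n = S.shape[0]
        edges = set()
        for i in range(n):
            scores = S[i].copy()
            scores[i] = -np.inf                  # no self-edges
            for j in np.argsort(scores)[-k:]:    # k most similar vertices
                edges.add((min(i, j), max(i, j)))
        return edges

    rng = np.random.default_rng(2)
    S = rng.random((10, 10)); S = (S + S.T) / 2
    print(sorted(knn_content_graph(S, k=2)))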

Large-scale community detection on speaker content graphs

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-31 May 2013.

Summary

We consider the use of community detection algorithms to perform speaker clustering on content graphs built from large audio corpora. We survey the application of agglomerative hierarchical clustering, modularity optimization methods, and spectral clustering, as well as two random walk algorithms: Markov clustering and Infomap. Our results on graphs built from the NIST 2005+2006 and 2008+2010 Speaker Recognition Evaluations (SREs) provide insight into both the structure of the speakers present in the data and the intricacies of the clustering methods. In particular, we introduce an additional parameter to Infomap that improves its clustering performance on all graphs. Lastly, we develop an automatic technique to purify the neighbors of each node by pruning away unnecessary edges.
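
One of the surveyed families, modularity optimization, can be sketched in a few lines with networkx; the toy two-clique graph below is an assumption, and the other surveyed methods (spectral clustering, Markov clustering, Infomap) require their own tooling.

    # Modularity-based community detection on a toy speaker content graph:
    # two dense cliques (two "speakers") joined by one spurious cross edge.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.Graph()
    G.add_edges_from([(0, 1), (0, 2), (1, 2),    # speaker A's recordings
                      (3, 4), (3, 5), (4, 5),    # speaker B's recordings
                      (2, 3)])                   # spurious cross-speaker edge

    for c in greedy_modularity_communities(G):
        print(sorted(c))   # ideally recovers {0,1,2} and {3,4,5}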

Sparse Volterra systems: theory and practice

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 25-31 May 2013.

Summary

Nonlinear effects limit analog circuit performance, causing both in-band and out-of-band distortion. The classical Volterra series provides an accurate model of many nonlinear systems, but the number of parameters grows extremely quickly as the memory depth and polynomial order are increased. Recently, concepts from compressed sensing have been applied to nonlinear system modeling in order to address this issue. This work investigates the theory and practice of applying compressed sensing techniques to nonlinear system identification under the constraints of typical radio frequency (RF) laboratories. The main theoretical result shows that these techniques are capable of identifying sparse Memory Polynomials using only single-tone training signals rather than pseudorandom noise. Empirical results using laboratory measurements of an RF receiver show that sparse Generalized Memory Polynomials can also be recovered from two-tone signals.
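
The identification setup can be sketched as a compressed sensing problem: build a memory-polynomial basis matrix from delayed, odd-order powers of the input and recover a sparse coefficient vector. The synthetic system, concatenated single-tone training segments, and the choice of orthogonal matching pursuit below are assumptions; the paper's laboratory procedure differs in detail.

    # Sparse memory-polynomial identification via greedy sparse recovery.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def mp_basis(x, memory=3, order=5):
        """Columns are x[n-m] * |x[n-m]|^(k-1) for delays m and odd orders k."""
        n = len(x)
        cols = []
        for m in range(memory + 1):
            xd = np.concatenate([np.zeros(m), x[:n - m]])
            for k in range(1, order + 1, 2):     # odd-order terms only
                cols.append(xd * np.abs(xd) ** (k - 1))
        return np.column_stack(cols)

    rng = np.random.default_rng(3)
    # Training data: several single-tone segments at different frequencies
    x = np.concatenate([np.cos(2 * np.pi * f * np.arange(200))
                        for f in (0.03, 0.07, 0.11, 0.17)])
    Phi = mp_basis(x)
    a_true = np.zeros(Phi.shape[1]); a_true[[0, 5, 9]] = [1.0, 0.2, -0.05]
    y = Phi @ a_true + 1e-4 * rng.normal(size=len(x))

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(Phi, y)
    idx = np.flatnonzero(omp.coef_)
    print(idx, omp.coef_[idx])                   # recovered sparse support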

An Expectation Maximization Approach to Detecting Compromised Remote Access Accounts

Published in:
Proceedings of FLAIRS 2013, St. Pete Beach, Fla.

Summary

Just as credit-card companies are able to detect aberrant transactions on a customer’s credit card, it would be useful to have methods that could automatically detect when a user’s login credentials for Virtual Private Network (VPN) access have been compromised. We present here a novel method for detecting that a VPN account has been compromised, in a manner that bootstraps a model of the second, unauthorized user.
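
As a hedged illustration of the EM bootstrap, one can fit a two-component mixture to per-session login features so that the smaller component models the unauthorized user. The feature choices (hour of day, session length), the Gaussian model family, and the 0.5 threshold below are invented for the example; the paper's actual features and model may differ.

    # Two-component mixture fit by expectation maximization: one component
    # for the legitimate account holder, one bootstrapped for the intruder.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(4)
    owner    = rng.normal([9.0, 30.0], [1.0, 5.0], size=(200, 2))   # daytime logins
    intruder = rng.normal([3.0, 90.0], [0.5, 10.0], size=(20, 2))   # 3 a.m. logins
    logins = np.vstack([owner, intruder])

    gm = GaussianMixture(n_components=2, random_state=0).fit(logins)  # EM fit
    resp = gm.predict_proba(logins)
    minority = np.argmin(gm.weights_)     # smaller component = candidate intruder
    print("suspected-intruder sessions:", int((resp[:, minority] > 0.5).sum()))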