Publications

Approaches for language identification in mismatched environments

Summary

In this paper, we consider the task of language identification under mismatched conditions. Specifically, we address the issue of using unlabeled data in the domain of interest to improve the performance of a state-of-the-art system. The evaluation is performed on a 9-language set that includes data in both conversational telephone speech and narrowband broadcast speech. Multiple experiments are conducted to assess the system's performance under this condition and to evaluate a number of alternatives for ameliorating the drop in performance. The best system evaluated is based on deep neural network (DNN) bottleneck features with i-vectors and combines all of the approaches proposed in this work. The resulting system improved baseline DNN system performance by 30%.
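One common way to exploit unlabeled in-domain data in an i-vector system is to estimate the score-normalization statistics from the unlabeled i-vectors themselves, so scoring happens in a space matched to the deployment domain. The sketch below illustrates that general technique; it is a hedged illustration, not necessarily the method used in the paper:

```python
import numpy as np

def estimate_whitening(in_domain_ivectors):
    """PCA whitening transform fit on UNLABELED in-domain i-vectors."""
    mean = in_domain_ivectors.mean(axis=0)
    cov = np.cov(in_domain_ivectors, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs / np.sqrt(np.maximum(vals, 1e-10))  # scale each eigenvector column
    return mean, W

def whiten(x, mean, W):
    """Center, whiten, and length-normalize i-vectors (one per row)."""
    y = (x - mean) @ W
    return y / np.linalg.norm(y, axis=1, keepdims=True)

def score(test_iv, language_means):
    """Cosine scores against per-language mean i-vectors (already whitened)."""
    return {lang: float(test_iv @ m) for lang, m in language_means.items()}
```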

Multi-lingual deep neural networks for language recognition

Published in:
SLT 2016, IEEE Spoken Language Technology Workshop, 13-16 December 2016.

Summary

Multi-lingual feature extraction using bottleneck layers in deep neural networks (BN-DNNs) has proven to be an effective technique for low-resource speech recognition and, more recently, for language recognition. In this work we investigate the impact of the multi-lingual BN-DNN architecture and training configurations on language recognition performance for the NIST 2011 and 2015 language recognition evaluations (LRE11 and LRE15). The best-performing multi-lingual BN-DNN configuration yields relative performance gains of 50% on LRE11 and 40% on LRE15 compared to a standard MFCC/SDC baseline system, and 17% on LRE11 and 7% on LRE15 relative to a single-language BN-DNN system. Detailed performance analysis using data from all 24 Babel languages, Fisher Spanish, and Switchboard English shows the impact of language selection and the amount of training data on overall BN-DNN performance.
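At a high level, a multi-lingual BN-DNN shares a trunk of hidden layers ending in a narrow bottleneck, with a separate senone-classification head per training language; after training, the heads are discarded and the bottleneck activations serve as features for the language recognizer. A minimal sketch of the idea (layer sizes and head structure here are assumptions, not the paper's configuration):

```python
import torch.nn as nn

class MultilingualBNDNN(nn.Module):
    """Shared trunk with a narrow bottleneck; one output head per language."""
    def __init__(self, feat_dim, bn_dim, senones_per_lang, hidden=1024):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bn_dim),             # the bottleneck layer
        )
        self.heads = nn.ModuleDict({
            lang: nn.Sequential(nn.Linear(bn_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_senones))
            for lang, n_senones in senones_per_lang.items()
        })

    def forward(self, frames, lang):
        # Multi-lingual training: route each batch to its language's head.
        return self.heads[lang](self.trunk(frames))

    def bottleneck_features(self, frames):
        # After training, these activations feed the downstream
        # language recognition system.
        return self.trunk(frames)
```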

Resilience of cyber systems with over- and underregulation

Published in:
Risk Analysis, Vol. 37, No. 9, 2017, pp. 1644-1651, DOI: 10.1111/risa.12729.

Summary

Recent cyber attacks provide evidence of increased threats to our critical systems and infrastructure. A common reaction to a new threat is to harden the system by adding new rules and regulations. As federal and state governments request new procedures to follow, each organization implements its own cyber defense strategies. This unintentionally increases the time and effort that employees spend on training and policy implementation and decreases the time and latitude they have to perform critical job functions, thus raising overall levels of stress. People's performance under stress, coupled with an overabundance of information, results in even more vulnerabilities for adversaries to exploit. In this article, we embed a simple regulatory model that accounts for cybersecurity human factors and an organization's regulatory environment in a model of a corporate cyber network under attack. The resulting model demonstrates the effect of under- and overregulation on an organization's resilience with respect to insider threats. Currently, there is a tendency to use ad hoc approaches to account for human factors rather than to incorporate them into cyber resilience modeling. A systematic approach that draws on behavioral science, which already exists in cyber resilience assessment, would provide a more holistic view for decision-makers.
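The qualitative effect is easy to illustrate with a toy model (our own illustration, not the model from the article): the raw protection provided by rules grows with the amount of regulation, but employee effectiveness falls as rules consume time and raise stress, so resilience peaks at an intermediate regulation level:

```python
import numpy as np

def resilience(r, stress_sensitivity=2.0):
    """Toy resilience curve over regulation level r in [0, 1]."""
    protection = 1 - np.exp(-3 * r)                       # more rules, more coverage
    effectiveness = np.exp(-stress_sensitivity * r ** 2)  # stress and time costs
    return protection * effectiveness                     # both under- and
                                                          # overregulation lose

r = np.linspace(0, 1, 101)
best = r[np.argmax(resilience(r))]
print(f"toy model peaks at regulation level r = {best:.2f}")
```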

Intersection and convex combination in multi-source spectral planted cluster detection

Published in:
IEEE Global Conf. on Signal and Information Processing, GlobalSIP, 7-9 December 2016.

Summary

Planted cluster detection is an important form of signal detection when the data are in the form of a graph. When there are multiple graphs representing multiple connection types, the method of aggregation can have a significant impact on the results of a detection algorithm. This paper addresses the tradeoff between two possible aggregation methods: convex combination and intersection. For a spectral detection method, convex combination dominates when the cluster is relatively sparse in at least one graph, while the intersection method dominates when the cluster is dense across graphs. Experimental results confirm the theory. We also consider the context of adversarial cluster placement, determining how an adversary would distribute connections among the graphs to best avoid detection.
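Concretely, the two aggregation strategies operate on the adjacency matrices of the individual graphs. A minimal sketch follows; the detection statistic is a generic spectral one chosen for illustration, not taken from the paper:

```python
import numpy as np

def convex_combination(A1, A2, alpha=0.5):
    """Weighted average of adjacency matrices: edges from either graph count."""
    return alpha * A1 + (1.0 - alpha) * A2

def intersection(A1, A2):
    """Keep an edge only if it appears in both graphs."""
    return np.minimum(A1, A2)

def leading_eigenvector_statistic(A, k=20):
    """Mass of the leading eigenvector on its k most active vertices;
    a concentrated eigenvector hints at a planted cluster."""
    _, vecs = np.linalg.eigh(A)
    v = np.abs(vecs[:, -1])
    return float(np.sort(v)[-k:].sum())
```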

Bootstrapping and maintaining trust in the cloud

Published in:
32nd Annual Computer Security Applications Conf., ACSAC 2016, 5-9 December 2016.

Summary

Today's infrastructure as a service (IaaS) cloud environments rely upon full trust in the provider to secure applications and data. Cloud providers do not offer the ability to create hardware-rooted cryptographic identities for IaaS cloud resources or sufficient information to verify the integrity of systems. Trusted computing protocols and hardware like the TPM have long promised a solution to this problem. However, these technologies have not seen broad adoption because of their complexity of implementation, low performance, and lack of compatibility with virtualized environments. In this paper we introduce keylime, a scalable trusted cloud key management system. keylime provides an end-to-end solution both for bootstrapping hardware-rooted cryptographic identities for IaaS nodes and for system integrity monitoring of those nodes via periodic attestation. We support these functions in both bare-metal and virtualized IaaS environments using a virtual TPM. keylime provides a clean interface that allows higher-level security services like disk encryption or configuration management to leverage trusted computing without being trusted-computing aware. We show that our bootstrapping protocol can derive a key in less than two seconds, that we can detect system integrity violations in as little as 110 ms, and that keylime can scale to thousands of IaaS cloud nodes.
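The integrity-monitoring half of such a system rests on TPM-style measurement replay: a verifier recomputes the expected register value from the node's measurement log and compares it against a known-good value on every periodic attestation. The sketch below shows that core check in simplified form; it illustrates the general mechanism, not keylime's actual protocol or API:

```python
import hashlib

def replay_measurement_log(entries):
    """Replay a measurement log by PCR-style hash extension."""
    pcr = b"\x00" * 32
    for measurement in entries:               # each entry is a SHA-256 digest
        pcr = hashlib.sha256(pcr + measurement).digest()
    return pcr

def node_is_trusted(entries, golden_pcr):
    """A verifier revokes the node's bootstrapped key when this fails."""
    return replay_measurement_log(entries) == golden_pcr

# Usage: tampering appends an unexpected measurement, so replay diverges.
good_log = [hashlib.sha256(b"kernel").digest(), hashlib.sha256(b"initrd").digest()]
golden = replay_measurement_log(good_log)
tampered = good_log + [hashlib.sha256(b"rootkit").digest()]
print(node_is_trusted(good_log, golden), node_is_trusted(tampered, golden))
```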

LLTools: machine learning for human language processing

Summary

Machine learning methods in Human Language Technology have reached a stage of maturity where widespread use is both possible and desirable. The MIT Lincoln Laboratory LLTools software suite takes a step towards this goal by providing a set of easily accessible frameworks for incorporating speech, text, and entity resolution components into larger applications. For the speech processing component, the pySLGR (Speaker, Language, Gender Recognition) tool provides signal processing, standard feature analysis, speech utterance embedding, and machine learning modeling methods in Python. The text processing component in LLTools extracts semantically meaningful insights from unstructured data via entity extraction, topic modeling, and document classification. The entity resolution component in LLTools provides approximate string matching, author recognition, and graph-based methods for identifying and linking different instances of the same real-world entity. We show through two applications that LLTools can be used to rapidly create and train research prototypes for human language processing.
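To give a flavor of the approximate string matching that underpins entity resolution, here is a minimal stand-alone sketch using only Python's standard library; it is an illustration of the technique, not the LLTools API:

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to `name`, or None if all are
    below the similarity threshold (similarity in [0, 1])."""
    scored = ((SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates)
    score, match = max(scored)
    return match if score >= threshold else None

# "Jon Smith" and "John Smith" are likely the same real-world entity.
print(best_match("Jon Smith", ["John Smith", "Jane Doe", "J. Smyth"]))
```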

Predicting and analyzing factors in patent litigation

Published in:
30th Conf. on Neural Information Processing Systems, NIPS 2016, 5-10 December 2016.

Summary

Patent litigation is an expensive and time-consuming process. To minimize its impact on the participants in the patent lifecycle, automatic determination of litigation potential is a compelling machine learning application. In this paper, we consider preliminary methods for predicting whether a patent will be involved in litigation using metadata, content, and graph features. Metadata features are top-level, easily extractable features, e.g., assignee and number of claims. The content feature is based on lexical analysis of the claims associated with a patent. Graph features use relational learning to summarize patent references. We apply our methods to US patents using a labeled data set. Prior work has focused on metadata-only features, but we show that both graph and content features have significant predictive capability. Additionally, fusing all features results in improved performance. We also perform a preliminary examination of some of the qualitative factors that may have significant importance in patent litigation.
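Fusing the three feature families can be as simple as concatenating their vector representations before classification. A hedged sketch in scikit-learn, where the column names, feature choices, and classifier are our assumptions for illustration rather than the paper's setup:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Content, metadata, and graph-derived columns are vectorized separately
# and concatenated into a single feature matrix (early fusion).
fused = ColumnTransformer([
    ("claims_text", TfidfVectorizer(max_features=5000), "claims"),        # content
    ("assignee", OneHotEncoder(handle_unknown="ignore"), ["assignee"]),   # metadata
    ("numeric", StandardScaler(), ["num_claims", "graph_pagerank"]),      # metadata + graph
])
model = Pipeline([("features", fused), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(train_df, train_df["litigated"]) on a labeled patent DataFrame.
```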

Making #sense of #unstructured text data

Published in:
30th Conf. on Neural Information Processing Systems, NIPS 2016, 5-10 December 2016.

Summary

The automatic extraction of useful information from data is one of the main goals in data science. Traditional approaches have focused on learning from structured features, i.e., information in a relational database. However, most of the data encountered in practice are unstructured (e.g., social media posts, forums, emails, and web logs); they do not have a predefined schema or format. In this work, we examine unsupervised methods for processing unstructured text data, extracting relevant information, and transforming it into structured information that can then be leveraged in various applications such as graph analysis and matching entities across different platforms. Various efforts have been proposed to develop algorithms for processing unstructured text data. At a top level, text can either be summarized by document-level features (e.g., language, topic, or genre) or analyzed at a word or sub-word level. Text analytics can be unsupervised, semi-supervised, or supervised. In this work, we focus on word-level analysis and unsupervised methods. Unsupervised (or semi-supervised) methods require less human annotation and can easily fulfill the role of automatic analysis. For text analysis, we focus on methods for finding relevant words in the text. Specifically, we look at social media data and attempt to predict hashtags for users' posts. The resulting hashtags can be used for downstream processing such as graph analysis. Automatic hashtag annotation is closely related to automatic tag extraction and keyword extraction. Techniques for hashtag extraction include topic analysis, supervised classifiers, machine translation methods, and collaborative filtering. Methods for keyword extraction include graph-based and topical analysis of text.
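A simple unsupervised baseline in this vein (our own illustration of keyword-based hashtag suggestion, not the specific method evaluated in the paper) ranks a post's words by TF-IDF weight against the corpus and proposes the top-scoring terms as hashtags:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def suggest_hashtags(posts, post_index, k=3):
    """Propose the k highest TF-IDF terms of one post as hashtags."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(posts)
    row = tfidf[post_index].toarray().ravel()
    terms = vec.get_feature_names_out()
    top = row.argsort()[::-1][:k]
    return ["#" + terms[i] for i in top if row[i] > 0]

posts = ["deep learning for speech recognition on mobile devices",
         "watching the game tonight with friends"]
print(suggest_hashtags(posts, 0))
```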

An overview of the DARPA Data Driven Discovery of Models (D3M) Program

Published in:
30th Conf. on Neural Information Processing Systems, NIPS 2016, 5-10 December 2016.

Summary

A new DARPA program called Data Driven Discovery of Models (D3M) aims to develop automated model discovery systems that can be used by researchers with specific subject matter expertise to create empirical models of real, complex processes. Two major goals of this program are to allow experts to create empirical models without the need for data scientists and to increase the productivity of data scientists via automation. The automated model discovery systems developed will be tested on real-world problems that get progressively harder during the course of the program. Toward the end of the program, problems will be both unsolved and underspecified in terms of data and desired outcomes. The program will emphasize creating and leveraging open source technology and architecture. Our presentation reviews the goals and structure of this program, which will begin early in 2017. Although the deadline for submitting proposals has passed, we welcome suggestions concerning challenge tasks, evaluations, or new open-source data sets to be included for system development and evaluation that would supplement data currently being curated from many sources.

Leveraging data provenance to enhance cyber resilience

Summary

Building secure systems used to mean ensuring a secure perimeter, but that is no longer the case. Today's systems are ill-equipped to deal with attackers who are able to pierce perimeter defenses. Data provenance is a critical technology for building resilient systems that can recover from attackers who manage to overcome "hard-shell" defenses. In this paper, we provide background information on data provenance and detail provenance collection, analysis, and storage techniques and their challenges. Data provenance is well positioned to address the challenging problem of allowing a system to "fight through" an attack, and we help to identify the work necessary to ensure that future systems are resilient.
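One way provenance supports fighting through an attack is taint propagation: each provenance record links an output artifact to the process and inputs that produced it, so once a compromised artifact is identified, the graph reveals every downstream artifact that may also be affected. A minimal sketch, where the record fields and traversal are our own illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ProvRecord:
    """One provenance edge: `process` read `inputs` and wrote `output`."""
    output: str
    process: str
    inputs: list = field(default_factory=list)

def taint_set(records, seed):
    """Everything downstream of a known-bad artifact or process."""
    tainted, changed = {seed}, True
    while changed:                      # propagate to a fixed point
        changed = False
        for r in records:
            if (set(r.inputs) & tainted or r.process in tainted) \
                    and r.output not in tainted:
                tainted.add(r.output)
                changed = True
    return tainted

records = [
    ProvRecord(output="report.pdf", process="latex", inputs=["data.csv"]),
    ProvRecord(output="data.csv", process="etl.py", inputs=["raw.log"]),
]
print(taint_set(records, "raw.log"))   # raw.log, data.csv, report.pdf
```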