Publications

Fine structure features for speaker identification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, Speech (Part II), 7-10 May 1996, pp. 689-692.

Summary

The performance of speaker identification (SID) systems can be improved by the addition of the rapidly varying "fine structure" features of formant amplitude and/or frequency modulation and multiple excitation pulses. This paper shows how the estimation of such fine structure features can be improved further by obtaining better estimates of formant frequency locations and uncovering various sources of error in the feature extraction systems. Most female telephone speech showed "spurious" formants, due to distortion in the telephone network. Nevertheless, SID performance was greatest with these spurious formants as formant estimates. A new feature has also been identified which can increase SID performance: cepstral coefficients from noise in the estimated excitation waveform. Finally, statistical tools have been developed to explore the relative importance of features used for SID, with the ultimate goal of uncovering the source of the features that provide SID performance improvement.

Low rate coding of the spectral envelope using channel gains

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 7-10 May 1996, pp. 769-772.

Summary

A dual rate embedded sinusoidal transform coder is described in which a core 14th order allpole coder operating at 2400 b/s is augmented with a set of channel gain residuals in order to operate at the higher 4800 b/s rate. The channel gains are a set of non-uniformly spaced samples of the spline envelope and constitute a lowpass estimate of the short-time vocal tract magnitude spectrum. The channel gain residuals represent the difference between the spline envelope and the quantized 14th order allpole spectrum at the channel gain frequencies. The channel gain residuals are coded using pitch dependent scalar quantization. Informal listening indicates that the quality of the embedded coder at 4800 b/s is comparable to that of an existing high quality 4800 b/s allpole coder.

The effects of handset variability on speaker recognition performance: experiments on the switchboard corpus

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 113-116.

Summary

This paper presents an empirical study of the effects of handset variability on text-independent speaker recognition performance using the Switchboard corpus. Handset variability occurs when training speech is collected using one type of handset, but a different handset is used for collecting test speech. For the Switchboard corpus, the calling telephone number associated with a file is used to imply the handset used. Analysis of experiments designed to focus on handset variability on the SPIDRE database and the May95 NIST speaker recognition evaluation database shows that a performance gap between matched and mismatched handset tests persists even after applying several standard channel compensation techniques. Error rates for the mismatched tests are over 4 times those for the matched tests. Lastly, a new energy-dependent cepstral mean subtraction technique is proposed to compensate for nonlinear distortions, but is not found to improve performance on the databases used.
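
For readers unfamiliar with the baseline, the sketch below shows ordinary cepstral mean subtraction, one of the standard channel compensation techniques of the kind the summary refers to. It is not the energy-dependent variant proposed in the paper, whose details are not given here. The function name and the list-of-lists frame layout are assumptions made for illustration.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of equal-length lists of cepstral coefficients
    (hypothetical layout: one inner list per analysis frame).
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each cepstral coefficient across the utterance.
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_coeffs)]
    # Removing the mean cancels a fixed linear channel, which appears
    # as an additive constant in the cepstral domain.
    return [[f[d] - means[d] for d in range(n_coeffs)] for f in frames]
```

Because a fixed linear channel adds the same offset to every frame, two recordings that differ only by such an offset map to the same normalized features. Nonlinear handset distortions do not reduce to a constant offset, which is consistent with the residual performance gap the summary reports.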

Unsupervised topic clustering of switchboard speech messages

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 315-318.

Summary

This paper presents a statistical technique which can be used to automatically group speech data records based on the similarity of their content. A tree-based clustering algorithm is used to generate a hierarchical structure for the corpus. This structure can then be used to guide the search for similar material in data from other corpora. The SWITCHBOARD Speech Corpus was used to demonstrate these techniques, since it provides sets of speech files which are nominally on the same topic. Excellent automatic clustering was achieved on the truth text transcripts provided with the SWITCHBOARD corpus, with an average cluster purity of 97.3%. Degraded clustering was achieved using the output transcriptions of a speech recognizer, with a clustering purity of 61.4%.
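
As a rough illustration of the purity figures quoted above, the sketch below computes one common definition of cluster purity: the fraction of records that carry the majority topic label of their assigned cluster. The summary does not state the exact definition used in the paper, so this formula, the function name, and the input layout are assumptions.

```python
from collections import Counter


def cluster_purity(cluster_ids, topic_labels):
    """Fraction of records whose topic matches their cluster's majority topic.

    `cluster_ids[i]` is the cluster assigned to record i and
    `topic_labels[i]` is its reference topic (both hypothetical inputs).
    """
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("inputs must have the same length")
    if not topic_labels:
        raise ValueError("inputs must be non-empty")
    # Group reference topics by assigned cluster.
    clusters = {}
    for cid, topic in zip(cluster_ids, topic_labels):
        clusters.setdefault(cid, []).append(topic)
    # Count, per cluster, the records that belong to its most frequent topic.
    majority_total = sum(
        Counter(topics).most_common(1)[0][1] for topics in clusters.values()
    )
    return majority_total / len(topic_labels)
```

Under this definition a purity of 1.0 means every cluster contains a single topic, and the score falls as topics become mixed within clusters.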

Recognition by humans and machines: miles to go before we sleep

Published in:
Speech Commun., Vol. 18, No. 3, 1996, pp. 247-8.

Summary

Bourlard and his colleagues note that much effort over the past few years has focused on creating large-vocabulary speech recognition systems and reducing error rates measured using clean speech materials. This has led to experimental talker-independent systems with vocabularies of 65,000 words capable of transcribing sentences on a limited set of topics. Instead of comparing these systems to each other, or to last year's performance, this short comment compares them to human listeners, who are the ultimate users and judges of speech recognizers.

Comparison of four approaches to automatic language identification of telephone speech

Published in:
IEEE Trans. Speech Audio Process., Vol. 4, No. 1, January 1996, pp. 31-44.

Summary

We have compared the performance of four approaches for automatic language identification of speech utterances: Gaussian mixture model (GMM) classification; single-language phone recognition followed by language-dependent, interpolated n-gram language modeling (PRLM); parallel PRLM, which uses multiple single-language phone recognizers, each trained in a different language; and language-dependent parallel phone recognition (PPR). These approaches, which span a wide range of training requirements and levels of recognition complexity, were evaluated with the Oregon Graduate Institute Multi-Language Telephone Speech Corpus. Systems containing phone recognizers performed better than the simpler GMM classifier. The top-performing system was parallel PRLM, which exhibited an error rate of 2% for 45-s utterances and 5% for 10-s utterances in two-language, closed-set, forced-choice classification. The error rate for 11-language, closed-set, forced-choice classification was 11% for 45-s utterances and 21% for 10-s utterances.

A subband approach to time-scale expansion of complex acoustic signals

Published in:
IEEE Trans. Speech Audio Process., Vol. 3, No. 6, November 1995, pp. 515-519.

Summary

A new approach to time-scale expansion of short-duration complex acoustic signals is introduced. Using a subband signal representation, channel phases are selected to preserve a desired time-scaled temporal envelope. The phase representation is derived from locations of events that occur within filter bank outputs. A frame-based generalization of the method imposes phase consistency across consecutive synthesis frames. The method is applied to synthetic and actual complex acoustic signals consisting of closely spaced, rapidly damped sine waves. Time-frequency resolution limitations are discussed.

Time-scale modification with inconsistent constraints

Published in:
Proc. 1995 Workshop on Applications of Signal Processing to Audio Acoustics, 15-18 October 1995.

Summary

A set-theoretic estimation approach is introduced for time-scale modification of complex acoustic signals. The method determines a signal that meets, in a least-squared error sense, desired temporal and spectral envelope constraints that are inconsistent. These constraints are generalized within the set-theoretic framework to include other signal characteristics such as instantaneous frequency and group delay. The approach can enhance acoustic signals consisting of closely spaced sequential time components, and is applicable to biological, underwater, and music sound processing.

Military and government applications of human-machine communication by voice

Published in:
Proc. Natl. Acad. Sci., Vol. 92, October 1995, pp. 10011-10016.

Summary

This paper describes a range of opportunities for military and government applications of human-machine communication by voice, based on visits and contacts with numerous user organizations in the United States. The applications include some that appear to be feasible by careful integration of current state-of-the-art technology and others that will require a varying mix of advances in speech technology and in integration of the technology into applications environments. Applications that are described include (1) speech recognition and synthesis for mobile command and control; (2) speech processing for a portable multifunction soldier's computer; (3) speech- and language-based technology for naval combat team tactical training; (4) speech technology for command and control on a carrier flight deck; (5) control of auxiliary systems, and alert and warning generation, in fighter aircraft and helicopters; and (6) voice check-in, report entry, and communication for law enforcement agents or special forces. A phased approach for transfer of the technology into applications is advocated, where integration of applications systems is pursued in parallel with advanced research to meet future needs.

Sine-wave amplitude coding using a mixed LSF/PARCOR representation

Published in:
Proc. 1995 IEEE Workshop on Speech Coding for Telecommunications, 20-22 September 1995, pp. 77-8.

Summary

An all-pole model of the speech spectral envelope is used to code the sine-wave amplitudes in the Sinusoidal Transform Coder. While line spectral frequencies (LSFs) are currently used to represent this all-pole model, it is shown that a mixture of line spectral frequencies and partial correlation (PARCOR) coefficients can be used to reduce complexity without a loss in quantization efficiency. Objective and subjective measures demonstrate that speech quality is maintained. In addition, the use of split vector quantization is shown to substantially reduce the number of bits needed to code the all-pole model.