Publications

High-pitch formant estimation by exploiting temporal change of pitch

Published in:
IEEE Trans. on Audio, Speech, and Language Processing, Vol. 18, No. 1, January 2010, pp. 171-186.

Summary

This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological studies implicating the use of pitch dynamics in speech by humans. We develop and assess signal processing schemes aimed at exploiting temporal change of pitch to address the high-pitch formant frequency estimation problem. Specifically, we propose a 2-D analysis framework using 2-D transformations of the time-frequency space. In one approach, we project changing spectral harmonics over time to a 1-D function of frequency. In a second approach, we draw upon previous work of Quatieri and Ezzat et al. [1], [2], with similarities to the auditory modeling efforts of Chi et al. [3], where localized 2-D Fourier transforms of the time-frequency space provide improved source-filter separation when pitch is changing. Our methods show quantitative improvements for synthesized vowels with stationary formant structure in comparison to traditional and homomorphic linear prediction. We also demonstrate the feasibility of applying our methods to stationary vowel regions of natural speech spoken by high-pitched female speakers in the TIMIT corpus. Finally, we show improvements afforded by the proposed analysis framework in formant tracking on examples of stationary and time-varying formant structure.
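
A minimal sketch, in Python, of the localized 2-D Fourier analysis underlying the second approach: compute a narrowband spectrogram of a synthetic harmonic signal with changing pitch, extract a local time-frequency patch, and take its 2-D Fourier transform magnitude. The synthetic vowel, window length, and patch coordinates are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

fs = 8000
t = np.arange(fs) / fs
f0 = 220.0 + 60.0 * t                      # linearly rising pitch (Hz)
phase = 2 * np.pi * np.cumsum(f0) / fs
x = sum(0.5 ** k * np.cos(k * phase) for k in range(1, 10))  # harmonic "vowel"

# Narrowband spectrogram: a long window resolves individual harmonics.
f, frames, Z = stft(x, fs=fs, nperseg=512, noverlap=448)
S = np.abs(Z)

# Localized 2-D transform of a low-frequency patch (40 bins x 40 frames).
patch = S[5:45, 20:60]
patch = patch - patch.mean()               # remove DC before the 2-D FFT
G = np.fft.fftshift(np.abs(np.fft.fft2(patch)))
print(G.shape, float(G.max()))
```

When pitch changes across the patch, the harmonic energy concentrates away from the transform's frequency axis, which is what enables the improved source-filter separation described above.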

Sinewave parameter estimation using the fast Fan-Chirp Transform

Published in:
Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, 18-21 October 2009, pp. 349-352.

Summary

Sinewave analysis/synthesis has long been an important tool for audio analysis, modification, and synthesis [1]. The recently introduced Fan-Chirp Transform (FChT) [2,3] has been shown to improve the fidelity of sinewave parameter estimates for a harmonic audio signal with rapid frequency modulation [4]. A fast version of the FChT [3] reduces computation, but this algorithm presents two factors that affect sinewave parameter estimation. First, the phase of the fast FChT does not match the phase of the original continuous-time transform, which interferes with the estimation of sinewave phases. Second, the fast FChT requires an interpolation of the input signal, and the choice of interpolator affects both the speed of the transform and the accuracy of the estimated sinewave parameters. In this paper we demonstrate how to modify the phase of the fast FChT so that it can be used to estimate sinewave phases, and we explore the use of various interpolators, demonstrating the tradeoff between transform speed and sinewave parameter accuracy.
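
A minimal sketch of the fast-FChT mechanics discussed above: warp the time axis according to a chirp-rate parameter, resample the signal with a selectable interpolator, and apply an ordinary FFT. The warping function, chirp-rate normalization, and the two interpolators are illustrative assumptions rather than the paper's exact algorithm; the phase correction the paper derives is not shown.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fan_chirp_fft(x, fs, alpha, interp="linear"):
    """FChT-style analysis: resample x on a warped time grid, then FFT.

    alpha is a normalized chirp rate; interp selects the interpolator,
    trading speed (linear) against parameter accuracy (cubic).
    """
    n = len(x)
    t = np.arange(n) / fs
    T = t[-1]
    tw = t * (1.0 + 0.5 * alpha * t / T)   # warped sampling instants
    tw = tw * (T / tw[-1])                 # keep the grid inside [0, T]
    if interp == "linear":
        xw = np.interp(tw, t, x)           # fast, less accurate
    else:
        xw = CubicSpline(t, x)(tw)         # slower, more accurate
    return np.fft.rfft(xw * np.hanning(n))

fs = 8000
t = np.arange(1024) / fs
x = np.cos(2 * np.pi * (440 * t + 2000 * t ** 2))   # linear-FM sinusoid
X = fan_chirp_fft(x, fs, alpha=0.3, interp="cubic")
print(int(np.argmax(np.abs(X))))           # peak bin of the chirped tone
```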

Towards co-channel speaker separation by 2-D demodulation of spectrograms

Published in:
Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, 18-21 October 2009, pp. 65-68.

Summary

This paper explores a two-dimensional (2-D) processing approach for co-channel speaker separation of voiced speech. We analyze localized time-frequency regions of a narrowband spectrogram using 2-D Fourier transforms and propose a 2-D amplitude modulation model based on pitch information for single- and multi-speaker content in each region. Our model maps harmonically related speech content to concentrated entities in a transformed 2-D space, thereby motivating 2-D demodulation of the spectrogram for analysis/synthesis and speaker separation. Using a priori pitch estimates of individual speakers, we show through a quantitative evaluation (1) the utility of the model for representing speech content of a single speaker and (2) its feasibility for speaker separation. For the separation task, we also illustrate benefits of the model's representation of pitch dynamics relative to a sinusoidal-based separation system.
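
A minimal sketch of the demodulation intuition: in a local spectrogram patch, each speaker's harmonic structure appears as a distinct peak in the 2-D Fourier domain, so masking around one speaker's peak (located from a priori pitch) and inverting approximates recovering that speaker's contribution. The grating model of the patch and the mask width are illustrative assumptions, not the paper's system.

```python
import numpy as np

nf, nt = 64, 64                            # patch: frequency bins x frames
f = np.arange(nf)[:, None]                 # frequency-bin axis (column)

# Two "speakers" as harmonic gratings along frequency (periods 10 and ~5.9
# bins) under a shared formant-like envelope, constant across time frames.
env = np.exp(-((f - 20) ** 2) / 200.0)
patch = (env * (1 + np.cos(2 * np.pi * 0.10 * f))
         + env * (1 + np.cos(2 * np.pi * 0.17 * f))) * np.ones((1, nt))

P = np.fft.fft2(patch - patch.mean())

# Keep 2-D frequencies near speaker 1's pitch-spacing peak (|Fy| ~ 0.10
# cycles/bin, known here from a priori pitch), zero the rest, and invert.
Fy = np.fft.fftfreq(nf)[:, None] * np.ones((1, nt))
mask = (np.abs(np.abs(Fy) - 0.10) < 0.02).astype(float)
recovered = np.real(np.fft.ifft2(P * mask))

s1 = (env * np.cos(2 * np.pi * 0.10 * f)) * np.ones((1, nt))
c = np.sum(recovered * s1) / np.sqrt(np.sum(recovered**2) * np.sum(s1**2))
print("correlation with speaker-1 structure:", round(float(c), 3))
```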

2-D processing of speech for multi-pitch analysis

Published in:
INTERSPEECH 2009, 10th Annual Conf. of the International Speech Communication Association, 6-10 September 2009.

Summary

This paper introduces a two-dimensional (2-D) processing approach for the analysis of multi-pitch speech sounds. Our framework invokes the short-space 2-D Fourier transform magnitude of a narrowband spectrogram, mapping harmonically related signal components to multiple concentrated entities in a new 2-D space. First, localized time-frequency regions of the spectrogram are analyzed to extract pitch candidates. These candidates are then combined across multiple regions to obtain separate pitch estimates of each speech-signal component at a single point in time. We refer to this as multi-region analysis (MRA). By explicitly accounting for pitch dynamics within localized time segments, this separability is distinct from that which can be obtained using the short-time autocorrelation methods typically employed in state-of-the-art multi-pitch tracking algorithms. We illustrate the feasibility of MRA for multi-pitch estimation on mixtures of synthetic and real speech.
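
A minimal sketch of pitch-candidate extraction from one localized region: for a stationary-pitch patch, the 2-D Fourier transform magnitude of the spectrogram patch peaks at a distance from the origin equal to the reciprocal of the pitch along the cycles-per-Hz axis. The patch size and synthetic source are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 8000
t = np.arange(fs // 2) / fs
f0 = 180.0                                 # stationary pitch for this patch
x = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, 15))

f, tt, Z = stft(x, fs=fs, nperseg=512, noverlap=448)
S = np.abs(Z)
df = f[1] - f[0]                           # spectrogram frequency step (Hz)

patch = S[:80, 10:42]                      # 80 freq bins x 32 time frames
patch = patch - patch.mean()
G = np.abs(np.fft.fft2(patch))

# For stationary pitch the peak sits at zero temporal frequency (column 0);
# its bin index along the frequency-frequency axis encodes 1/f0.
k = 1 + int(np.argmax(G[1:40, 0]))
print("pitch candidate (Hz):", 80 * df / k)
```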

Time-varying autoregressive tests for multiscale speech analysis

Published in:
INTERSPEECH 2009, 10th Annual Conf. of the International Speech Communication Association, 6-10 September 2009, pp. 2839-2842.

Summary

In this paper we develop hypothesis tests for speech waveform nonstationarity based on time-varying autoregressive models, and demonstrate their efficacy in speech analysis tasks at both segmental and sub-segmental scales. Key to the successful synthesis of these ideas is our employment of a generalized likelihood ratio testing framework tailored to autoregressive coefficient evolutions suitable for speech. After evaluating our framework on speech-like synthetic signals, we present preliminary results for two distinct analysis tasks using speech waveform data. At the segmental level, we develop an adaptive short-time segmentation scheme and evaluate it on whispered speech recordings, while at the sub-segmental level, we address the problem of detecting the glottal flow closed phase. Results show that our hypothesis testing framework can reliably detect changes in the vocal tract parameters across multiple scales, thereby underscoring its broad applicability to speech analysis.
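
A minimal sketch of a generalized likelihood ratio test of this kind: fit a fixed AR model (H0) and a time-varying AR model whose coefficients evolve on a linear basis (H1) by least squares, and form the statistic from the residual energies. The model order, coefficient basis, and test signals are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def glr_tvar_test(x, p=4):
    """GLR statistic for fixed-AR vs. linear-TVAR on one frame x."""
    n = len(x)
    rows = n - p
    # Lagged-sample regressors for the fixed AR(p) model.
    A0 = np.column_stack([x[p - k - 1 : n - k - 1] for k in range(p)])
    y = x[p:]
    # TVAR: each coefficient a_k(t) = b_k + c_k * t (basis {1, t}), so
    # augment each lag column with a time-weighted copy.
    tnorm = np.linspace(0.0, 1.0, rows)
    A1 = np.column_stack([A0, A0 * tnorm[:, None]])
    r0 = y - A0 @ np.linalg.lstsq(A0, y, rcond=None)[0]
    r1 = y - A1 @ np.linalg.lstsq(A1, y, rcond=None)[0]
    return rows * np.log(np.sum(r0 ** 2) / np.sum(r1 ** 2))

rng = np.random.default_rng(0)
n = 2000
e = rng.standard_normal(n)
x_stat = np.zeros(n)                       # stationary AR(2) process
x_tv = np.zeros(n)                         # AR(2) with drifting resonance
for i in range(2, n):
    x_stat[i] = 1.6 * x_stat[i - 1] - 0.8 * x_stat[i - 2] + e[i]
    a1 = 1.6 - 0.6 * i / n                 # slowly drifting coefficient
    x_tv[i] = a1 * x_tv[i - 1] - 0.8 * x_tv[i - 2] + e[i]

print("stationary GLR:", glr_tvar_test(x_stat))
print("time-varying GLR:", glr_tvar_test(x_tv))   # expect a larger value
```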

Adaptive short-time analysis-synthesis for speech enhancement

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 31 March - 4 April 2008.

Summary

In this paper we propose a multiresolution short-time analysis method for speech enhancement. It is well known that fixed-resolution methods such as the traditional short-time Fourier transform do not generally match the time-frequency structure of the signal being analyzed, resulting in poor estimates of the speech and noise spectra required for enhancement. This can reduce the quality of the enhanced signal through the introduction of artifacts such as musical noise. To counter these limitations, we propose an adaptive short-time analysis-synthesis scheme for speech enhancement in which the adaptation is based on a measure of local time-frequency concentration. Synthesis is made possible through a modified overlap-add procedure. Empirical results using voiced speech indicate a clear improvement over a fixed time-frequency resolution enhancement scheme, both in terms of mean-square error and as indicated by informal listening tests.
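
A minimal sketch of the adaptive scheme's skeleton: per frame, select among candidate window lengths by a local spectral-concentration measure, then resynthesize by a normalized overlap-add that accommodates the varying windows. The concentration measure here (a spectral peakiness index) is an illustrative stand-in for the paper's measure.

```python
import numpy as np

def concentration(seg):
    # Normalized spectral "peakiness"; an illustrative stand-in for the
    # paper's local time-frequency concentration measure.
    m = np.abs(np.fft.rfft(seg * np.hanning(len(seg))))
    p = m / (m.sum() + 1e-12)
    return float(np.sum(p ** 2))

def adaptive_ola(x, hop=128, lengths=(256, 512, 1024)):
    y = np.zeros(len(x))
    wsum = np.zeros(len(x))
    for start in range(0, len(x) - max(lengths), hop):
        # Pick the window length whose local spectrum is most concentrated.
        best = max(lengths, key=lambda L: concentration(x[start:start + L]))
        w = np.hanning(best)
        seg = x[start:start + best] * w    # analysis (an enhancer would
        y[start:start + best] += seg * w   # modify seg before synthesis)
        wsum[start:start + best] += w ** 2
    return y / np.maximum(wsum, 1e-12)     # normalized overlap-add

fs = 8000
t = np.arange(2 * fs) / fs
x = np.cos(2 * np.pi * 150 * t)
y = adaptive_ola(x)
err = x[2048:-2048] - y[2048:-2048]        # interior: near-exact resynthesis
print("mid-signal reconstruction error:", float(np.max(np.abs(err))))
```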

Exploiting temporal change in pitch in formant estimation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 31 March - 4 April 2008, pp. 3929-3932.

Summary

This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies implicating the use of temporal changes in speech by humans. Specifically, we develop and assess signal processing schemes aimed at exploiting temporal change of pitch as a basis for formant estimation. Our methods are cast in a generalized framework of two-dimensional processing of speech and show quantitative improvements under certain conditions over representations derived from traditional and homomorphic linear prediction. We conclude by highlighting potential benefits of our framework in the particular application of speaker recognition, with preliminary results indicating a closing of the performance gap between male and female speakers on subsets of the TIMIT corpus.
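
A minimal sketch of the underlying observation: as pitch rises, the harmonics k*f0(t) sweep across frequency and sample the vocal-tract envelope far more densely than the harmonics of any single frame. The two-formant envelope and pitch trajectory below are illustrative assumptions.

```python
import numpy as np

fs = 8000
frames = 100
f0 = np.linspace(200.0, 260.0, frames)     # pitch rising over the segment

def envelope(f):
    # Two-formant magnitude envelope (illustrative resonances).
    return (1.0 / (1 + ((f - 700) / 150) ** 2)
            + 0.7 / (1 + ((f - 1800) / 200) ** 2))

acc = np.zeros(400)                        # 10 Hz bins spanning 0-4 kHz
cnt = np.zeros(400)
for i in range(frames):
    k = 1
    while k * f0[i] < 4000:
        fk = k * f0[i]                     # harmonic frequency at frame i
        b = int(fk // 10)
        acc[b] += envelope(fk)             # harmonic amplitude sample
        cnt[b] += 1
        k += 1

proj = np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)  # 1-D projection
print("envelope sampled at", int(np.count_nonzero(cnt)), "of 400 bins")
print("static pitch would sample about", int(4000 // f0[0]), "bins")
```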

Multisensor very low bit rate speech coding using segment quantization

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 31 March - 4 April 2008, pp. 3997-4000.

Summary

We present two approaches to noise-robust very low bit rate speech coding using wideband MELP analysis/synthesis. Both methods exploit multiple acoustic and non-acoustic input sensors, using our previously presented dynamic waveform fusion algorithm to simultaneously perform waveform fusion, noise suppression, and cross-channel noise cancellation. One coder uses a 600 bps scalable phonetic vocoder, with a phonetic speech recognizer followed by joint predictive vector quantization of the error in wideband MELP parameters. The second coder operates at 300 bps with fixed 80 ms segments, using novel variable-rate multistage matrix quantization techniques. Formal test results show that both coders achieve intelligibility equivalent to the 2.4 kbps NATO standard MELPe coder in harsh acoustic noise environments, at much lower bit rates, with only modest quality loss.
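
A minimal sketch of multistage quantization in general, not the paper's trained coder: each stage quantizes the residual left by the previous stage against its own codebook, so bits add across stages while individual codebooks stay small. The random codebooks and dimensions are placeholders; a real coder trains them on speech parameter data.

```python
import numpy as np

rng = np.random.default_rng(0)
stages = [rng.standard_normal((64, 10)) for _ in range(3)]  # 3 x 6-bit stages

def msvq_encode(v, codebooks):
    indices, residual = [], v.copy()
    for cb in codebooks:
        d = np.sum((cb - residual) ** 2, axis=1)   # distances to codewords
        i = int(np.argmin(d))
        indices.append(i)
        residual = residual - cb[i]                # pass residual onward
    return indices

def msvq_decode(indices, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, indices))

v = rng.standard_normal(10)                        # one parameter vector
idx = msvq_encode(v, stages)
vhat = msvq_decode(idx, stages)
print("indices:", idx, " squared error:", float(np.sum((v - vhat) ** 2)))
```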

Spectral representations of nonmodal phonation

Published in:
IEEE Trans. on Audio, Speech, and Language Processing, Vol. 16, No. 1, January 2008, pp. 34-46.

Summary

Regions of nonmodal phonation, which exhibit deviations from uniform glottal-pulse periods and amplitudes, occur often in speech and convey information about linguistic content, speaker identity, and vocal health. Some aspects of these deviations are random, including small perturbations, known as jitter and shimmer, as well as more significant aperiodicities. Other aspects are deterministic, including repeating patterns of fluctuations such as diplophonia and triplophonia. These deviations are often the source of misinterpretation of the spectrum. In this paper, we introduce a general signal-processing framework for interpreting the effects of both stochastic and deterministic aspects of nonmodality on the short-time spectrum. As an example, we show that the spectrum is sensitive to even small perturbations in the timing and amplitudes of glottal pulses. In addition, we illustrate important characteristics that can arise in the spectrum, including apparent shifting of the harmonics and the appearance of multiple pitches. For stochastic perturbations, we arrive at a formulation of the power-spectral density as the sum of a low-pass line spectrum and a high-pass noise floor. Our findings are relevant to a number of speech-processing areas including linear-prediction analysis, sinusoidal analysis-synthesis, spectrally derived features, and the analysis of disordered voices.
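
A minimal sketch of the stochastic-perturbation effect described above: simulating an impulse train with small random timing jitter and estimating its power spectrum shows harmonic lines progressively sinking into a noise floor at high frequency, consistent with the low-pass line spectrum plus high-pass noise decomposition. The jitter level and pulse model are illustrative assumptions.

```python
import numpy as np
from scipy.signal import welch

fs = 8000
f0 = 125.0
period = fs / f0                           # 64 samples per glottal cycle
rng = np.random.default_rng(0)

# Pulse instants with 1% timing jitter; unit amplitudes (no shimmer).
times = np.cumsum(period + 0.01 * period * rng.standard_normal(400))
x = np.zeros(int(times[-1]) + 2)
x[np.round(times).astype(int)] = 1.0

f, pxx = welch(x, fs=fs, nperseg=2048)
lo = pxx[(f > 100) & (f < 500)].max()      # low harmonics: strong lines
hi = pxx[(f > 3000) & (f < 3500)].max()    # high harmonics: near the floor
print("low/high spectral peak ratio:", round(float(lo / hi), 1))
```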

Sinewave analysis/synthesis based on the fan-chirp transform

Published in:
Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, 21-24 October 2007, pp. 247-250.

Summary

There have been numerous recent strides toward making sinewave analysis consistent with time-varying sinewave models. This is particularly important in high-frequency speech regions where harmonic frequency modulation (FM) can be significant. One notable approach is the Fan-Chirp transform, which provides a set of FM-sinewave basis functions consistent with harmonic FM. In this paper, we develop a complete sinewave analysis/synthesis system using the Fan-Chirp transform. With this system we are able to obtain more accurate sinewave frequencies and phases, thus creating more accurate frequency tracks, in contrast to a system derived from the short-time Fourier transform, particularly for high-frequency regions of large-bandwidth analysis. With synthesis, we show an improvement in segmental signal-to-noise ratio with respect to waveform matching, with the largest gains during rapid pitch dynamics.
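
A minimal sketch of the analysis step such systems share: pick spectral peaks frame by frame and record frequency, amplitude, and phase triples for later track formation and synthesis. The peak threshold and frame size are illustrative assumptions; replacing the FFT here with a Fan-Chirp transform is what yields the more accurate parameters discussed above.

```python
import numpy as np

def sinewave_peaks(frame, fs, thresh_db=-30.0):
    """Return (frequency Hz, amplitude, phase rad) for local spectral peaks."""
    w = np.hanning(len(frame))
    X = np.fft.rfft(frame * w)
    mag = np.abs(X)
    ref = mag.max() + 1e-12
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]:
            if 20 * np.log10(mag[k] / ref) > thresh_db:
                peaks.append((k * fs / len(frame),    # frequency (Hz)
                              float(mag[k]),          # amplitude
                              float(np.angle(X[k])))) # phase (rad)
    return peaks

fs = 8000
t = np.arange(512) / fs
frame = np.cos(2 * np.pi * 440 * t) + 0.3 * np.cos(2 * np.pi * 1320 * t)
for fhz, a, ph in sinewave_peaks(frame, fs):
    print(f"{fhz:7.1f} Hz  amp {a:6.1f}  phase {ph:+.2f}")
```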