
Understanding Speech Coding: Phonation Models and Linear Predictive Coding Explained

This article explains how speech codecs based on phonation models work, covering voiced and unvoiced sound generation, glottal pulse modeling, vocal tract and radiation models, and the linear predictive coding (LPC) technique used to reconstruct speech waveforms from acoustic parameters.

Seewo Tech Circle

1. Phonation Model

1.1 Voiced and Unvoiced Sounds

Voiced sounds are produced by the vocal cords: the cords close, pressure from the lungs builds until they are forced apart, and the resulting airflow lowers the pressure so the cords close again. This cycle produces quasi‑periodic pulses that are then shaped by the oral, nasal, and pharyngeal cavities.

Unvoiced sounds arise from a partial constriction in the vocal tract that forces air through at high speed, generating turbulent, broadband noise, such as the /s/ in the word "see".

In a speech waveform, voiced segments show clear periodic patterns, while unvoiced segments are harder to spot and often blend into background noise.

Within a voiced segment, the locally repeating waveform corresponds to the pitch period.
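As a sketch of how that repeating pattern can be measured, the snippet below (assuming Python with NumPy; the synthetic pulse train stands in for a real voiced frame) estimates the pitch period by picking the strongest autocorrelation peak:

```python
import numpy as np

def estimate_pitch_period(x, min_lag=20, max_lag=400):
    """Estimate the pitch period (in samples) of a voiced frame
    by locating the strongest autocorrelation peak."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # r[k] for lags k >= 0
    return min_lag + np.argmax(r[min_lag:max_lag])

# Synthetic "voiced" excitation: 100 Hz pulse train at 8 kHz sampling
fs = 8000
signal = np.zeros(2048)
signal[::fs // 100] = 1.0          # one pulse every 80 samples
period = estimate_pitch_period(signal)
print(period)  # 80 samples, i.e. a 100 Hz pitch
```

The lag search is restricted to a plausible pitch range (here 20–400 samples, roughly 20–400 Hz at 8 kHz), which avoids the trivial peak at lag 0.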

1.2 Glottal Pulse Model

The glottal pulse shapes the quasi‑periodic signal for voiced sounds; adjusting pulse length adapts to different pitch periods and models various open‑close ratios of the glottis.

The frequency response of this pulse shows low‑pass characteristics, attenuating high frequencies.
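One common textbook parameterization is the Rosenberg glottal pulse; the article does not name a specific pulse shape, so the sketch below is illustrative (NumPy, with hypothetical opening/closing lengths). It builds one period and checks the low‑pass behaviour of its spectrum:

```python
import numpy as np

def rosenberg_pulse(n_open, n_close, period):
    """One period of a Rosenberg glottal pulse: a smooth opening
    phase, a faster closing phase, then closure for the rest."""
    g = np.zeros(period)
    n1 = np.arange(n_open)
    g[:n_open] = 0.5 * (1 - np.cos(np.pi * n1 / n_open))       # opening
    n2 = np.arange(n_close)
    g[n_open:n_open + n_close] = np.cos(np.pi * n2 / (2 * n_close))  # closing
    return g

pulse = rosenberg_pulse(n_open=40, n_close=16, period=80)
spectrum = np.abs(np.fft.rfft(pulse, 1024))
low = spectrum[:52].mean()    # magnitudes near DC
high = spectrum[-52:].mean()  # magnitudes near the top of the band
print(f"low/high magnitude ratio: {low / high:.0f}")
```

Varying `period` adapts the pulse to different pitch periods, and the `n_open`/`n_close` ratio models different open–close ratios of the glottis; the low‑frequency magnitudes dominate, matching the low‑pass characteristic described above.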

1.3 Vocal Tract Model

The vocal tract model considers tract area, wave reflections, and losses at the glottis and lips.

1.4 Radiation Model

Lip radiation can be modeled as a source radiating into an infinite planar obstacle.
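In discrete time, lip radiation is often approximated by a first‑order differencer, R(z) = 1 − z⁻¹ (a standard textbook approximation, an assumption of this sketch rather than something stated above). It acts as a high‑pass filter, boosting roughly 6 dB per octave:

```python
import numpy as np

def radiate(x):
    """Apply R(z) = 1 - z^(-1): y[n] = x[n] - x[n-1]."""
    return np.concatenate(([x[0]], np.diff(x)))

# A constant (DC) input is blocked after the first sample...
print(radiate(np.array([1.0, 1.0, 1.0, 1.0])))    # 1, 0, 0, 0
# ...while rapid sample-to-sample changes pass through amplified.
print(radiate(np.array([1.0, -1.0, 1.0, -1.0])))  # 1, -2, 2, -2
```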

1.5 Complete Model

The complete model chains the pieces above: an impulse‑train generator (for voiced sounds) or a noise generator (for unvoiced sounds) excites the glottal pulse model, whose output passes through the vocal tract model and finally the radiation model at the lips.

2. Linear Predictive Coding (LPC)

If the parameters of the speech production model are estimated accurately, the speech waveform can be reconstructed from them.

A simplified production model includes a filter H(z) that incorporates vocal tract resonance, lip radiation, and for voiced sounds, the spectral effect of the glottal pulse. Voiced sounds are excited by a quasi‑periodic pulse train; unvoiced sounds by a random noise sequence.
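This simplified model can be sketched as an excitation signal driving an all‑pole filter H(z) = G / (1 − Σ aₖ z⁻ᵏ). The filter coefficients below are illustrative, not measured from real speech, and the sketch assumes SciPy is available:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
n_samples = 1600                       # 200 ms at 8 kHz
a = np.array([1.3, -0.7])              # hypothetical H(z) coefficients (stable)
G = 0.5                                # gain parameter

# Voiced excitation: quasi-periodic pulse train (100 Hz pitch)
voiced_exc = np.zeros(n_samples)
voiced_exc[::fs // 100] = 1.0

# Unvoiced excitation: random noise sequence
rng = np.random.default_rng(0)
unvoiced_exc = rng.standard_normal(n_samples)

# s[n] = G*e[n] + a1*s[n-1] + a2*s[n-2]  ->  lfilter([G], [1, -a1, -a2], e)
denom = np.concatenate(([1.0], -a))
voiced = lfilter([G], denom, voiced_exc)
unvoiced = lfilter([G], denom, unvoiced_exc)
```

Switching between the two excitations (plus the pitch period, gain, and filter coefficients) is exactly the parameter set listed below.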

The required parameters are:

Voiced/unvoiced classification

Pitch period for voiced sounds

Gain parameter G

Filter coefficients of H(z)

LPC uses linear prediction analysis to estimate H(z) coefficients and gain G from short speech frames (10‑30 ms) because speech is quasi‑stationary over such intervals. The analysis minimizes the mean‑square error of the prediction.

The output speech sample s[n] follows the difference equation

s[n] = Σ_{k=1..M} a_k · s[n−k] + e[n]

where the weighted term is the linear prediction of s[n] from its M previous samples:

ŝ[n] = Σ_{k=1..M} a_k · s[n−k]

e[n] = s[n] − ŝ[n] denotes the prediction error, whose mean‑square error (MSE) is

J = E{ e²[n] } = E{ ( s[n] − Σ_{k=1..M} a_k · s[n−k] )² }

Minimizing this error (setting ∂J/∂a_i = 0 for i = 1, …, M) leads to the normal equations, which can be expressed in matrix form using a Toeplitz matrix:

Σ_{k=1..M} a_k · R(i−k) = R(i),  i = 1, …, M

where R(·) is the autocorrelation of the frame, so the system is R a = r with R a symmetric Toeplitz matrix.

Solving the matrix equation, for example with the Levinson–Durbin recursion, which exploits the Toeplitz structure, yields the filter coefficients.
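Since the normal equations form a symmetric Toeplitz system, they can be solved directly; the sketch below (assuming SciPy's `solve_toeplitz`) recovers the coefficients of a known second‑order autoregressive process:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order):
    """Estimate LPC coefficients a_k by solving the Toeplitz
    normal equations built from the frame autocorrelation."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Matrix: symmetric Toeplitz with first column r[0..order-1];
    # right-hand side: r[1..order]
    return solve_toeplitz(r[:order], r[1:order + 1])

# Test frame: a known AR(2) process s[n] = 1.3 s[n-1] - 0.7 s[n-2] + e[n]
rng = np.random.default_rng(1)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 1.3 * s[n - 1] - 0.7 * s[n - 2] + e[n]

a = lpc_coefficients(s, order=2)
print(a)  # close to [1.3, -0.7]
```

In a real codec the same analysis would run per 10–30 ms frame, usually after windowing; the long stationary frame here just makes the estimate easy to check.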


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
