Voice Conversion

Overview

When we speak, we emit a speech pattern containing at least two kinds of information: our meaning (the message) and our identity. Voice conversion technology transforms one person's speech pattern into another pattern with distinct characteristics, giving it a new identity while preserving the original content (meaning). It changes how something is said without changing what is said. In most applications, the desired characteristic is someone else's voice; in other words, voice conversion allows users to convert one person's speech into that of another.


Voice Quality

When describing the human voice, terms such as 'voice quality', 'voice individuality', or 'voice personality' are typically used. These terms generally refer to the overall quality of a sound, its timbre. People often describe a voice's timbre in terms of what the voice sounds like. Timbre enables listeners to differentiate between sounds, so the timbre of someone's voice is what distinguishes it from other people's voices. For example, timbre accounts for the difference in sound quality between two people speaking the same sentence.

Since it is a perceptual attribute, timbre is difficult to quantify. It is inherently multidimensional and cannot be reduced to one physical dimension. Physical attributes nonetheless play an important role in determining timbre. One problem is to decide which physical attributes matter most when we, as listeners, attempt to differentiate people's voices. Some of the relevant attributes include fundamental frequency (related to pitch), formants, intensity, and the rate at which people speak. To a large extent, these features are determined by our vocal system and how it is controlled.

The Vocal System

A speaker's vocal system can be described in terms of three functional components: the source of airflow, the sound source, and the source of sound modification. These components correspond to the lungs, the larynx, and the vocal tract, respectively. The lungs push air through the trachea into the larynx. The larynx contains the vocal cords, also called vocal folds, which are flaps that control the passage of air into the vocal tract. Muscles in the larynx control the length and thickness of the vocal cords, and altering these changes the fundamental frequency of the sound source. The opening between the vocal cords is called the glottis. The vocal tract, which modifies the sound produced by the vocal cords, is the area between the glottis and the lips. From the glottis, the air is pushed through the pharynx and then through the oral and/or nasal cavity. Within the oral cavity, many articulators, including the lips, jaw, tongue, and teeth, modify the sound.

To make a speech sound, the vocal cords are set into vibration by airflow from the lungs. Pressure generated in the lungs creates an air stream that moves through the glottis. The vocal cords vibrate when the air stream moves through the glottis and causes it to open and close. For vowels, vocal cord vibration is a periodic sequence with a well-defined fundamental frequency. These vibrations are the source of the sound produced by the voice. The spectrum of this periodic sound source is harmonic, and the amplitudes of the harmonics decrease with increasing frequency. Resonances of the vocal tract filter this spectrum and emphasize the harmonics that fall within their bandwidths. These resonances, called formants (resonant frequencies), are determined by the combination of many aspects of the vocal tract, such as its overall shape, the degree to which the jaw and lips are open, and the placement and shape of the tongue.
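The filtering of a harmonic source by vocal tract resonances can be sketched numerically. The following is only an illustration under stated assumptions: the 1/k^2 amplitude roll-off, the simple resonance shape, the way the resonances are combined, and the formant values are all chosen for the example and are not taken from any measured voice.

```python
import math

def glottal_harmonics(f0, n_harmonics):
    """Harmonic source: component k sits at k * f0 Hz.
    The 1 / k**2 amplitude roll-off is an assumption for this sketch."""
    return [(k * f0, 1.0 / k ** 2) for k in range(1, n_harmonics + 1)]

def resonance_gain(freq, center, bandwidth):
    """Gain of one resonance peak; equals 1.0 at the center frequency
    and falls off on either side (a simple assumed shape)."""
    half_bw = bandwidth / 2.0
    return half_bw / math.sqrt((freq - center) ** 2 + half_bw ** 2)

def vocal_tract_gain(freq, formants):
    """Combine several resonances by summing their gains (an assumption)."""
    return sum(resonance_gain(freq, c, bw) for c, bw in formants)

def filtered_spectrum(f0, n_harmonics, formants):
    """Apply the vocal tract filter to every harmonic of the source."""
    return [(f, a * vocal_tract_gain(f, formants))
            for f, a in glottal_harmonics(f0, n_harmonics)]

# Hypothetical (center frequency, bandwidth) pairs in Hz.
FORMANTS = [(700.0, 100.0), (1200.0, 120.0), (2500.0, 160.0)]

source = glottal_harmonics(100.0, 30)
spectrum = filtered_spectrum(100.0, 30, FORMANTS)
```

With these values the source amplitudes fall steadily with frequency, while the filtered spectrum shows a local peak at the harmonic that coincides with the first assumed formant (700 Hz).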

The speech system, as described above, determines the sounds that can be produced by the human voice. The range and function of these sounds within a particular language can be described in terms of phonology.

Phonology

There are two types of phonology relevant to voice conversion: segmental phonology and suprasegmental phonology. Segmental phonology breaks speech down into distinctive units, which are discrete units that can be identified using either physical or auditory criteria. The most common discrete unit used in segmental phonology is the phoneme. A phoneme is the smallest contrastive speech sound that distinguishes meaning within a given language; it is an abstract linguistic unit used to describe sounds. Just as sentences can be broken down into discrete words, words themselves are made up of one or more phonemes. At one level, phonemes can be divided into two groups: vowels and consonants. The acoustic correlates of phonemes derive primarily from vocal tract movement during their articulation.

Suprasegmental (non-segmental) phonology, on the other hand, deals with features that apply to a sequence of phonemes. Such sequences can take the form of syllables, words, or even sentences. The term 'suprasegmental' generally applies to the following features: pitch, loudness, tempo, and rhythm. Sometimes the term 'prosody' is used as a synonym for 'suprasegmental'. In its narrower sense, 'prosody' refers to paralinguistic features such as intonation, stress patterns, and secondary articulations such as lip rounding. In the context of voice conversion, the terms 'suprasegmental' and 'prosody' are used interchangeably to refer to any features that are dealt with as a sequence extending beyond the individual phoneme.

Typically, understanding what makes one voice different from another voice involves determining which features are specific to an individual voice as well as determining which features are needed to differentiate between several voices. In general, features are properties of the acoustic waveform. In speech, the term 'distinctive feature' is used to specify dimensions in which phonemes can be differentiated.

Since there are a number of dimensions in which phonemes can be differentiated, phonemes can be thought of as bundles of distinctive features. The distinctive features associated with each dimension are discerned through binary opposition; either the feature (e.g., nasality) is present or it is not. What features have in common is that they represent some property that is attributed to a particular sound. The number of features a sound can have varies, based on the nature of that feature. In addition to binary features, which are either applicable to a sound or not, there are dimensional properties which represent quantitative variations along some scale. For example, several voices can be ordered on the basis of having higher or lower pitch (related to rate of vocal cord vibration), being louder/softer (amount of pressure), or the overall deepness of their tone (size of vocal tract).
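The distinction between binary features and dimensional properties can be made concrete with a small sketch. The phoneme inventory, the feature names, and the speaker values below are illustrative assumptions, not a complete feature system.

```python
# Each phoneme is a bundle of binary features: present (True) or absent (False).
# The four phonemes and three features are a small illustrative subset.
PHONEME_FEATURES = {
    "m": {"nasal": True, "voiced": True, "bilabial": True},
    "b": {"nasal": False, "voiced": True, "bilabial": True},
    "p": {"nasal": False, "voiced": False, "bilabial": True},
    "n": {"nasal": True, "voiced": True, "bilabial": False},
}

def contrasting_features(a, b):
    """Return the features on which two phonemes stand in binary opposition."""
    fa, fb = PHONEME_FEATURES[a], PHONEME_FEATURES[b]
    return sorted(name for name in fa if fa[name] != fb[name])

# Dimensional properties vary along a scale, so voices can be ordered by them.
# The speakers and their mean pitch values are hypothetical.
VOICES = {
    "speaker_a": {"mean_f0_hz": 110.0},
    "speaker_b": {"mean_f0_hz": 210.0},
    "speaker_c": {"mean_f0_hz": 165.0},
}

def rank_by(voices, dimension):
    """Order voices from lowest to highest along one dimensional property."""
    return sorted(voices, key=lambda name: voices[name][dimension])
```

Here `contrasting_features("m", "b")` isolates nasality as the single opposition between the two sounds, whereas `rank_by(VOICES, "mean_f0_hz")` places the speakers on a continuous pitch scale.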

Voice Conversion

As stated above, the key purpose of a voice conversion system is to transform the voice of one speaker into that of another speaker. Therefore, given two speakers, the goal for a voice conversion system is to design a transformation that makes the speech of the first speaker sound as though it were uttered by the second speaker.

A general voice conversion system works as follows. The system analyzes speech samples from two speakers. This involves collecting the voice characteristics of the original and desired (target) speakers. After learning the characteristics of each speaker, the system automatically creates a conversion rule that maps the original speaker's voice characteristics onto those of the desired speaker. This conversion rule is applied to the original speech to create converted speech that exhibits the target speaker's voice characteristics.
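The three steps (analyze both speakers, derive a conversion rule, apply it) can be sketched with a deliberately simple stand-in. This example assumes that the only characteristic is the distribution of fundamental frequency and that the rule is a linear mapping of log-F0 statistics; both are simplifications chosen for illustration, and the F0 values are made up.

```python
import math
import statistics

def analyze(f0_values_hz):
    """Collect a speaker's characteristics: mean and spread of log-F0."""
    logs = [math.log(f) for f in f0_values_hz]
    return {"mean": statistics.mean(logs), "std": statistics.stdev(logs)}

def make_conversion_rule(source_stats, target_stats):
    """Build a rule that maps source log-F0 statistics onto the target's."""
    scale = target_stats["std"] / source_stats["std"]

    def convert(f0_hz):
        deviation = (math.log(f0_hz) - source_stats["mean"]) * scale
        return math.exp(target_stats["mean"] + deviation)

    return convert

# Hypothetical per-frame F0 measurements (Hz) for each speaker.
source_f0 = [100.0, 110.0, 120.0, 105.0, 115.0]
target_f0 = [200.0, 230.0, 210.0, 190.0, 220.0]

rule = make_conversion_rule(analyze(source_f0), analyze(target_f0))
converted = [rule(f) for f in source_f0]
```

After conversion, the log-F0 mean and spread of `converted` match those of the target speaker, while the relative rise and fall of the original contour is kept.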

Voice Matching

Voice conversion systems must address two problems. The first problem is to identify the speakers' salient characteristics. In speech recognition terms, this analysis involves acquiring speaker-dependent knowledge. The second problem is to incorporate this speaker-dependent knowledge into a transformation process.

Our approach is to treat the voice conversion problem as one of pattern classification followed by modification. Pattern classification is used to partition a set of examples into appropriate classes (categories) and can be divided into two stages: feature extraction followed by classification.

Pattern Classification

Feature extraction for voice conversion begins by dividing the speech signal (a pattern) into frames. An acoustic representation is then extracted for each frame. Acoustic representations are features that often reduce the overall information in the speech signal while at the same time highlighting some aspect of the signal. Choosing which features to highlight is critical to the success of pattern classification. Given that a set of classes is known prior to classification, the best features are those that decrease the variability of features from examples belonging to the same class and increase the variability of features from examples belonging to different classes. So, in the context of voice conversion, the goal of feature extraction is a sequence of feature vectors that represent salient aspects of the speech signal that contribute towards classifying each frame of speech.
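Framing and per-frame feature extraction can be sketched as follows. The frame length, hop size, and the two features (short-time energy and zero-crossing rate) are assumptions chosen to keep the example small; they are not necessarily the features used by the system described here.

```python
import math

def split_into_frames(samples, frame_length, hop_length):
    """Divide the signal into overlapping fixed-length frames."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += hop_length
    return frames

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def extract_features(samples, frame_length=200, hop_length=100):
    """Return one (energy, zero-crossing rate) feature vector per frame."""
    return [(frame_energy(f), zero_crossing_rate(f))
            for f in split_into_frames(samples, frame_length, hop_length)]

# Synthetic test signal: 0.1 s of a 100 Hz tone followed by 0.1 s of silence.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(800)]
silence = [0.0] * 800
features = extract_features(tone + silence)
```

Even these two crude features separate the tone frames from the silent frames, which is the property the paragraph above asks of a good feature: small variation within a class and large variation between classes.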

Voice conversion systems typically use features that correlate with the aspects of the vocal system that are critical to voice quality. From an acoustic perspective, the vocal production mechanism produces a time-varying pressure wave referred to simply as the 'waveform'. Using signal-processing methods, the waveform can be represented in terms of frequency; this representation is referred to as the 'spectrum'. Spectral characteristics can be estimated from the spectrum using a source-filter model, in which the filter represents the vocal tract and the source represents the excitation of the vocal tract. Roughly, the source (also referred to as excitation) corresponds to the function of the larynx and the filter corresponds to the function of the vocal tract. Prosodic characteristics such as a speaker's pitch, energy, and timing can be estimated from the waveform. The result is a set of parameters representing the spectral characteristics (excitation and vocal tract) and the prosodic characteristics (pitch, energy, and timing).
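As a rough illustration of moving from waveform to spectrum and of estimating pitch and energy from the waveform, the sketch below uses a direct discrete Fourier transform and a normalized autocorrelation pitch estimator. These are standard textbook techniques chosen for the example, not necessarily the methods used by the system described here; the test signal and search range are assumptions.

```python
import math

def magnitude_spectrum(frame):
    """Magnitude of the discrete Fourier transform for bins 0..N/2."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        magnitudes.append(math.hypot(re, im))
    return magnitudes

def normalized_autocorrelation(frame, lag):
    """Correlation between the frame and itself shifted by `lag` samples."""
    head, tail = frame[:len(frame) - lag], frame[lag:]
    num = sum(a * b for a, b in zip(head, tail))
    den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    return num / den if den > 0 else 0.0

def estimate_f0(frame, sample_rate, f0_min=80.0, f0_max=400.0):
    """Pick the lag with the strongest self-similarity inside the search range.
    The range is kept below one octave above f0_min * 2 here only by choice of
    test signal; real speech needs extra care with octave errors."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: normalized_autocorrelation(frame, lag))
    return sample_rate / best_lag

def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

# Synthetic voiced frame: 125 Hz fundamental plus two weaker harmonics.
SAMPLE_RATE = 8000
HARMONICS = ((1, 1.0), (2, 0.5), (3, 0.25))
frame = [sum(amp * math.sin(2 * math.pi * 125.0 * k * n / SAMPLE_RATE)
             for k, amp in HARMONICS) for n in range(320)]

spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / len(frame)
f0_hz = estimate_f0(frame, SAMPLE_RATE)
```

For this synthetic frame, both the strongest spectral peak and the autocorrelation estimate land on the 125 Hz fundamental; on real speech the two can disagree, which is one reason pitch estimation is treated as a problem in its own right.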

Classification for voice conversion is based on the idea that speech signals are simply a concatenation of discrete units (words or phones). Therefore, in order to classify these units, based on their corresponding features, a model is needed that represents the fact that each sound unit is dependent on the nearby sound units. Statistical distributions are one way to represent this type of sequence. Such statistical sequence recognition techniques rely on the notion that speech is essentially generated according to probability distributions. In other words, a model used for this type of classification must be able to represent the underlying linguistic units. These models must also be able to capture both the temporal variability as well as the underlying structure of the sequence. The goal of the classifier is to create a uniform representation of some discrete speech unit (phonemes or statistical states) so that two speakers can be compared and contrasted in a uniform manner.
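One common way to model a sequence of units in which each unit depends on its neighbors is a hidden Markov model decoded with the Viterbi algorithm. The text above does not name a specific model, so the sketch below is an assumption for illustration: the two states, the quantized observations, and all probabilities are invented.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] holds the log-probability of the best path ending in s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Choose the predecessor that makes reaching s most probable.
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the most probable final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state model over frame energy quantized to "low" or "high".
STATES = ["sil", "vowel"]
START_P = {"sil": 0.8, "vowel": 0.2}
TRANS_P = {"sil": {"sil": 0.7, "vowel": 0.3},
           "vowel": {"sil": 0.2, "vowel": 0.8}}
EMIT_P = {"sil": {"low": 0.9, "high": 0.1},
          "vowel": {"low": 0.2, "high": 0.8}}

decoded = viterbi(["low", "low", "high", "high", "low"],
                  STATES, START_P, TRANS_P, EMIT_P)
```

The transition probabilities capture the dependence between neighboring units, and the emission probabilities capture the variability of the features within each unit. The decoded labels give every frame a unit identity, which is what allows two speakers to be compared frame by frame.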

Our system classifies the speech signal into phonemes. Classifying speech in terms of its underlying phonemic structure allows the features from each speaker to be compared in a uniform manner. Therefore, the conversion factors between two speakers can be calculated in terms of each phoneme. In other words, a mapping can be calculated for each feature with respect to the entire range of phonemes (forty-two for the English language).
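A per-phoneme conversion factor can be sketched as the ratio of each speaker's average feature value for that phoneme. The ratio-of-means rule, the phoneme labels, and the first-formant values below are assumptions for illustration; the actual mapping used by the system is not specified here.

```python
from collections import defaultdict
from statistics import mean

def phoneme_means(labelled_frames):
    """Average one feature per phoneme from (phoneme, value) pairs."""
    grouped = defaultdict(list)
    for phoneme, value in labelled_frames:
        grouped[phoneme].append(value)
    return {phoneme: mean(values) for phoneme, values in grouped.items()}

def conversion_factors(source_frames, target_frames):
    """One multiplicative factor per phoneme seen for both speakers."""
    src = phoneme_means(source_frames)
    tgt = phoneme_means(target_frames)
    return {ph: tgt[ph] / src[ph] for ph in src if ph in tgt}

def convert(labelled_frames, factors):
    """Scale each frame by its phoneme's factor; unknown phonemes pass through."""
    return [(ph, value * factors.get(ph, 1.0)) for ph, value in labelled_frames]

# Hypothetical first-formant values (Hz), already labelled by phoneme.
source_frames = [("aa", 700.0), ("aa", 720.0), ("iy", 280.0), ("iy", 300.0)]
target_frames = [("aa", 850.0), ("aa", 854.0), ("iy", 310.0), ("iy", 328.0)]

factors = conversion_factors(source_frames, target_frames)
converted = convert(source_frames, factors)
```

Because the factor differs by phoneme (about 1.2 for "aa" and 1.1 for "iy" in this made-up data), the mapping can capture differences between speakers that a single global scale factor would miss.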

Modification

In order to modify the voice, some method of analysis and synthesis is required. Methods of analysis and synthesis have evolved in conjunction with the development of signal processing techniques, but the overall approach has remained the same. First, the analysis stage provides an alternative representation of the acoustic waveform, in this case the speech signal. Second, once this analysis has been performed, the resulting parameters can be used to synthesize a new waveform. The resultant sound can then be compared to the original sound, and the adequacy of the parametric representation can be determined. Given a parametric representation of the waveform, the system can modify the parameters, which in turn causes the synthesized waveform to differ from the original. Additionally, if the parametric representation is based on acoustic correlates of the sound, such as formant locations, it becomes possible to isolate the parameters that are most important to its timbre. Altering the parameters and then listening to the synthesized results allows one to correlate specific physical characteristics with changes in timbre.

Our system uses an analysis-synthesis method that produces a parametric representation of the speech signal. The basic spectral representation begins with sinusoidal analysis, which produces a set of parameters representing the vocal tract and the excitation. These parameters are modified using various mapping techniques, which alter the spectral excitation and filter response as well as the prosodic characteristics of the speech signal. The parameters that are modified during resynthesis include the spectral envelope, the distribution of the fundamental frequency, the energy, and the duration of each phoneme (time scaling).
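The idea of modifying sinusoidal parameters before resynthesis can be sketched as follows. This is a toy model under stated assumptions: a stationary sound described by a fixed list of (frequency, amplitude) partials, a piecewise-linear spectral envelope, and zero starting phases. It is not the system's actual representation.

```python
import math

def envelope_at(freq, partials):
    """Piecewise-linear spectral envelope through the partials' amplitudes."""
    points = sorted(partials)
    if freq <= points[0][0]:
        return points[0][1]
    for (f_lo, a_lo), (f_hi, a_hi) in zip(points, points[1:]):
        if f_lo <= freq <= f_hi:
            weight = (freq - f_lo) / (f_hi - f_lo)
            return a_lo + weight * (a_hi - a_lo)
    return points[-1][1]  # at or above the highest partial

def modify(partials, duration_s, pitch_scale=1.0, energy_scale=1.0, time_scale=1.0):
    """Move the partials to new frequencies, re-reading their amplitudes from
    the original envelope, then scale the energy and the duration."""
    gain = math.sqrt(energy_scale)  # energy is proportional to amplitude squared
    new_partials = [(f * pitch_scale, envelope_at(f * pitch_scale, partials) * gain)
                    for f, _ in partials]
    return new_partials, duration_s * time_scale

def synthesize(partials, duration_s, sample_rate=8000):
    """Resynthesize a waveform as a sum of sinusoids."""
    n_samples = int(round(duration_s * sample_rate))
    return [sum(a * math.sin(2 * math.pi * f * n / sample_rate) for f, a in partials)
            for n in range(n_samples)]

# Hypothetical analysis result: four harmonics of a 100 Hz voice.
PARTIALS = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.8), (400.0, 0.2)]

new_partials, new_duration = modify(PARTIALS, 0.1, pitch_scale=1.5,
                                    energy_scale=4.0, time_scale=2.0)
original_wave = synthesize(PARTIALS, 0.1)
converted_wave = synthesize(new_partials, new_duration)
```

Reading the new amplitudes from the original envelope is what keeps the vocal tract shape in place while the excitation changes. Replacing the envelope itself, for instance with one mapped toward the target speaker, would be the spectral part of the conversion.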

For more information, email me or contact Nellymoser.

Voice Conversion Bibliography (several years out of date)

  • Voice Conversion: Converting one voice to another.
  • Speech Modification: Modifying speech. Includes time-scale and pitch-scale modification.
  • Morphing: blending one acoustic signal (speech or audio) into another.
  • Voice Quality: Characteristics of the voice, especially as a function of different types of speakers.
  • Gender Identification: automatically determining the gender of a speaker from the speech alone.
  • Formant Estimation: estimating formant frequencies automatically.
  • Pitch Estimation: estimating the fundamental frequency automatically.
  • Prosody Analysis: automatically analyzing and characterizing prosody (e.g., pitch, duration) from speech.

Voice Conversion

Abe, Masanobu, Nakamura, Satoshi, Shikano, Kiyohiro, and Kuwabara, Hisao (1988) Voice Conversion through Vector Quantization. ICASSP 1988, pp. 655-658.

Abe, Masanobu, Shikano, Kiyohiro, and Kuwabara, Hisao (1990) Cross-Language Voice Conversion. ICASSP 1990, pp. 345-364.

Abe, Masanobu (1991) A Segment-Based Approach to Voice Conversion. ICASSP 1991, pp. 765-768.

Arslan, Levent M. and Talkin, David (1997) Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum. Proceedings of EuroSpeech97.

Childers, D.G., Yegnanarayana, B. and Wu, Ke (1985) Voice Conversion: Factors Responsible for Quality. ICASSP 1985, pp. 748-751.

Childers, D.G., Wu, Ke, Hicks, D.M., and Yegnanarayana, B. (1989) Voice Conversion. Speech Communication 8, pp. 147-158.

Childers, D.G. (1995) Glottal source modeling for voice conversion. Speech Communication 16, pp. 127-138.

Hoory, R. and Chazan, D. (1994) Speech Synthesis for a Specific Speaker Based on a Labeled Speech Database. IEEE International Conference on Pattern Recognition, Vol. 1, pp. 146-148.

Iwahashi, Naoto and Sagisaka, Yoshinori (1995) Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Speech Communication 16, pp. 139-151.

Iwahashi, Naoto and Sagisaka, Yoshinori (1996?) Speech spectrum transformation by speaker interpolation. ICASSP-96?, Vol. 1, pp. 461-464.

Kuwabara, Hisao and Sagisaka (1995) Acoustic characteristics of speaker individuality: Control and conversion. Speech Communication 16, pp. 165-173.

Mizuno, Hideyuki and Abe, Masanobu () Voice Conversion Based on Piecewise Linear Conversion Rules of Formant Frequency and Spectrum Tilt. Speech Communication 16, pp. 153-164.

Mizuno, Hideyuki and Abe, Masanobu () Voice conversion based on piecewise linear conversion rules of formant frequency and spectrum tilt. ICASSP-?, Vol. 1, pp. 469-472.

Moulines, E. and Sagisaka, Y. (1995) Voice Conversion: State of the Art and Perspectives. Speech Communication 16, pp. 125-126.

Narendranath, M., Murthy, Hema A., Rajendran, S. and Yegnanarayana, B. (1995) Transformation of formants for voice conversion using artificial neural networks. Speech Communication 16, pp. 207-216.

Rinsheid, Ansgar (1996) Voice Conversion on Topological Feature Maps and Time-Variant Filtering. ICSLP 1996.

Savic, Michael and Nam, Il-Hyun (1990) A System For Voice Personality Transformation. Speech Technology.

Savic, Michael, and Nam, Il-Hyun (1991) Voice personality transformation. Digital Signal Processing 1, pp. 107-110.

Shikano, Kiyohiro, Nakamura, Satoshi, and Abe, Masanobu (1991) Speaker Adaptation and Voice Conversion by Codebook Mapping. IEEE Symposium on Circuits and Systems, Vol. 1, pp. 594-597.

Slifka, Janet and Anderson, Timothy R. (1995) Speaker Modification with LPC Pole Analysis. ICASSP 1995, pp. 644-647.

Valbret, H., Moulines, E. and Tubach, J.P. (1992) Voice transformation using PSOLA technique. Speech Communication 11, pp. 175-187.

Verhelst, Werner and Mertens, Johan (1996) Voice Conversion using Partitions of Spectral Feature Space.

Speech Modification

Bi, Ning, and Qi, Yingyong (1997) Application of Speech Conversion to Alaryngeal Speech Enhancement. IEEE Trans. Speech Audio Proc. 5(2), pp. 97-105.

George, E. Bryan and Smith, Mark J. T. (1997) Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model. IEEE Trans. Speech and Audio Proc. 5(5), pp. 389-406.

Kuwabara, Hisao (1984) A Pitch-Synchronous Analysis/Synthesis System to Independently Modify Formant Frequencies and Bandwidths For Voiced Speech. Speech Communication 3, pp. 211-220.

Kawahara, Hideki (1997) Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited. ICASSP-97, pp. 1303-1306.

Leis, John, Phythian, Mark, and Sridharan, Sridha (1997?) Speech compression with preservation of speaker identity. Proc. Eurospeech 97?

Macon, Michael W. and Clements, Mark A. (1997) Sinusoidal modelling and modification of unvoiced speech. IEEE Trans. Speech and Audio Proc. 5(6), pp. 557-560.

Moulines, Eric and Laroche, Jean (1995) Non-parametric techniques for pitch-scale and time-scale modification of speech. Speech Communication 16, pp. 175-205.

Quatieri, Thomas F. and McAulay, Robert J. (1986) Speech transformations based on a sinusoidal representation. IEEE Trans. Acoust., Speech, and Sig. Proc. ASSP-34(6), pp. 1449-1464.

Quatieri, Thomas F. and McAulay, Robert J. (1989) Phase coherence in speech reconstruction for enhancement and coding applications. Proc. ICASSP-89, pp. 207-209.

Quatieri, Thomas F. and McAulay, Robert J. (1992) Shape invariant time-scale and pitch modification of speech. IEEE Trans. Signal Processing 40(3), pp. 497-510.

Seneff, Stephanie (1982) System to Independently Modify Excitation and/or Spectrum of Speech Waveform Without Explicit Pitch Extraction. IEEE Trans Acous. Speech Sig. Proc., ASSP-30(4), pp. 566-578.

Vergin, Rivarol, O'Shaughnessy, Douglas, and Farhat, Azarshid (1997) Time domain technique for pitch modification and robust voice transformation. Proc. ICASSP-97, pp. ?

Morphing

Slaney, Malcolm, Covell, Michele, and Lassiter, Bud (1996) Automatic Audio Morphing. ICASSP 1996, pp. 1001-1004.

Signal/Speech Analysis

Laroche, Jean (1989) A new analysis/synthesis system of musical signals using Prony's method. Application to heavily damped percussive sounds. Proc. ICASSP-89, pp. 2053-2056.

Laroche, Jean, Stylianou, Yannis, and Moulines, Eric (1993) HNS: Speech Modification Based on a Harmonic + Noise Model. ICASSP 1993, Vol. 2, pp. 550-553.

McAulay, Robert J. and Quatieri, Thomas F. (1986) Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, and Sig. Proc. ASSP-34(4), pp. 744-754.

McAulay, Robert J. and Quatieri, Thomas F. (1986) Phase modelling and its application to sinusoidal transform coding. Proc. ICASSP-86, pp. 1713-1715.

Quatieri Jr., Thomas F. (1979) Minimum and mixed phase speech analysis/synthesis by adaptive homomorphic deconvolution. IEEE Trans. Acoust., Speech, and Sig. Proc. ASSP-27(4), pp. 328-335.

General Voice Quality and Speaker Characteristics

Childers, D. G. and Lee, C. K. (1991) Vocal quality factors: analysis, synthesis, and perception. J. Acoust. Soc. Am. 90(5), pp. 2394-2410.

Darwin, C.J. and Gardner, Roy B. (1985) Which harmonics contribute to the estimation of first formant frequency? Speech Communication 4, pp. 231-235.

Fant, Gunnar (1993) Some problems in voice source analysis.

Klatt, Dennis and Klatt, Laura C. (1990) Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. Acoust. Soc. Am. 87(2), pp. 820-857.

Kreiman, Jody and Gerratt, Bruce R. (1996) The perceptual structure of pathologic voice quality. J. Acoust. Soc. Am. 100(3), pp. 1787-1795.

Ladd, Robert D. (1985) Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker affect. J. Acoust. Soc. Am. 78(2), pp. 435-444.

Labov, William (?) The organization of dialect diversity in North America. ?

Price, P. J. (1989) Male and female voice source characteristics: inverse filtering results. Speech Communication 8, pp. 261-277.

Schoentgen, Jean (1982) Quantitative evaluation of the discrimination performance of acoustic features in detecting laryngeal pathology. Speech Communication 1, pp. 269-282.

Traunmuller, Hartmut (1984) Articulatory and Perceptual Factors Controlling the Age- and Sex- Conditioned Variability in Formant Frequencies of Vowels. Speech Communication.

Yang, Chang-Sheng and Kasuya, Hideki (1996) Speaker individualities of vocal tract shapes of Japanese vowels measured by magnetic resonance images. Proc. ICSLP-96.

Gender Identification

Childers, D.G. and Wu, Ke (1991) Gender recognition from speech. Part II: Fine analysis. J. Acoust. Soc. Am. 90(4), pp. 1841-1856.

Parris, Eluned S. and Carey, Michael J. (1996) Language Independent Gender Identification. ICASSP 1996, Vol. 2, pp. 685-688.

Wu, Ke and Childers, D.G. (1991) Gender recognition from speech. Part I: Coarse analysis. J. Acoust. Soc. Am. 90(4), pp. 1828-1839.

Formant Estimation

Christensen, Randall L., Strong, William J. and Palmer, E. Paul (1976) A comparison of three methods of extracting resonance information from predictor-coefficient coded speech. IEEE Trans. Acoust., Speech, and Sig. Proc. ASSP-24(1), pp. 8-14.

Kopec, Gary E. (1986) A family of formant trackers based on hidden Markov models. Proc. ICASSP-86, pp. 1225-1228.

McCandless, Stephanie S. (1974) An algorithm for automatic formant extraction using linear prediction spectra. IEEE Trans. Acoust., Speech, and Sig. Proc. ASSP-22(2), pp. 135-141.

Markel, John D. (1972) Digital inverse filtering - a new tool for formant trajectory estimation. IEEE Trans. Audio and Electroacoustics AU-20(2), pp. 129-137.

Markel, John D. (1973) Application of a digital inverse filter for automatic formant and F0 analysis. IEEE Trans. Audio and Electroacoustics AU-21(3), pp. 154-160.

Nathan, Krishna S. and Silverman, Harvey (1990) High-resolution characterization of formant in vowel-consonant transitions. Proc. ICASSP-90, pp. 353-355.

Reddy, N. Sridhar and Swamy, M. N. S. (1984) High-resolution formant extraction from linear-prediction phase spectra. IEEE Trans. Acoust., Speech, and Sig. Proc. ASSP-32(6), pp. 1136-1144.

Wilcox, Lynn D. and Chen, Francine R. (1990) Application of Markov random fields to formant extraction. Proc. ICASSP-90, pp. 349-351.

Pitch Estimation

Hedelin, Per and Huber, Dieter (1990) Pitch period determination of aperiodic speech signals. Proc. ICASSP-90, pp. 361-364.

McAulay, Robert J. and Quatieri, Thomas F. (1990) Pitch estimation and voicing detection based on a sinusoidal speech model. Proc. ICASSP-90, pp. 249-252.

Schoentgen, Jean and de Guchteneere, Raoul (1991) An algorithm for the measurement of jitter. Speech Communication 10, pp. 533-538.

Slaney, Malcolm and Lyon, Richard F. (1990) A perceptual pitch detector. Proc. ICASSP-90, pp. 357-359.

Prosody Analysis

van Santen, Jan P. H. (1997) Prosodic modeling in text-to-speech synthesis. Proceedings of EuroSpeech97. Keynote Speech.

Riedi, Marcel (1997) Modeling segmental duration with multivariate adaptive regression splines. Proceedings of EuroSpeech97.

 


John Puterbaugh Home