Voice Conversion
Overview

When we speak, we emit a speech pattern containing at least two kinds
of information: our meaning (the message) and our identity. Voice conversion
technology transforms one person's speech pattern into another pattern with
distinct characteristics, giving it a new identity while preserving the
original content (meaning). It changes how something is said without
changing what is said. In most applications, the desired characteristic is
someone else's voice. In other words, voice conversion technology allows
users to convert one person's speech into that of another.
When describing the human voice, terms such as 'voice quality', 'voice individuality', or 'voice personality' are typically used. These terms generally refer to the overall quality of a sound: its timbre. People often describe a voice's timbre in terms of what the voice sounds like. Timbre enables listeners to differentiate between sounds, so the timbre of someone's voice is what distinguishes it from other people's voices. For example, timbre accounts for the difference in sound quality between two people speaking the same sentence. Since it is a perceptual attribute, timbre is difficult to quantify.
It is inherently multidimensional and cannot be reduced to one physical dimension.
Physical attributes play an important role in determining timbre. One
problem is to decide which physical attributes are most important when
we, as listeners, attempt to differentiate people's voices. Some of the
relevant attributes include: fundamental frequency (related to pitch),
formants, the intensity, and the rate at which people speak. To a large
extent, these features are determined by our vocal system and how it is
controlled.

The Vocal System

A speaker's vocal system can be described in terms of three functional components: the source of airflow, the sound source, and the source of sound modification. These components correspond to the lungs, the larynx, and the vocal tract, respectively. The lungs push air through the trachea into the larynx. The larynx contains the vocal cords, also called vocal folds, which are flaps that control the passage of air into the vocal tract. Muscles in the larynx control the length and thickness of the vocal cords. Altering the length and thickness of the vocal cords changes the fundamental frequency of the sound source. The opening between the vocal cords is called the glottis. The vocal tract, which modifies the sound produced by the vocal cords, is the area between the glottis and the lips. From the glottis, the air is pushed through the pharynx and then through the oral and/or nasal cavity. Within the oral cavity, many articulators, including the lips, jaw, tongue, and teeth, modify the sound.

To make a speech sound, the vocal cords are set into vibration by airflow from the lungs. Pressure generated in the lungs creates an air stream that moves through the glottis. The vocal cords vibrate when the air stream moves through the glottis and causes it to open and close. Vocal cord vibration for vowels is a periodic sequence with a well-defined fundamental frequency. These vibrations are the source of the sound produced by the voice. The spectrum of this periodic sound source is harmonic, and the amplitudes of the harmonics decrease with increasing frequency. Resonances produced by the vocal tract filter this spectrum and emphasize harmonics that fall within their frequency bandwidth.
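As a rough illustration of the source-filter behavior described above, the following Python sketch builds a harmonic source whose amplitudes fall off with frequency and passes it through a single two-pole resonator standing in for one vocal tract resonance. The function names, the 1/k amplitude roll-off, and the 100 Hz and 700 Hz values are assumptions chosen for illustration; they are not parameters of the system described here.

```python
import numpy as np

def harmonic_source(f0, sample_rate, duration, tilt=1.0):
    """Periodic source: harmonics of f0 with amplitudes 1 / k**tilt (assumed roll-off)."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    k = np.arange(1, int(sample_rate / 2 / f0) + 1)
    k = k[k * f0 < sample_rate / 2]              # keep harmonics below Nyquist
    partials = np.sin(2 * np.pi * np.outer(k * f0, t)) / k[:, None] ** tilt
    return partials.sum(axis=0)

def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth / sample_rate)  # pole radius set by bandwidth
    a1 = 2 * r * np.cos(2 * np.pi * freq / sample_rate)
    a2 = -r * r
    out = np.zeros(len(signal))
    for n in range(len(signal)):
        out[n] = signal[n]
        if n >= 1:
            out[n] += a1 * out[n - 1]
        if n >= 2:
            out[n] += a2 * out[n - 2]
    return out

def harmonic_amplitude(signal, freq, sample_rate):
    """Magnitude of the FFT bin closest to freq."""
    spectrum = np.abs(np.fft.rfft(signal))
    return spectrum[int(round(freq * len(signal) / sample_rate))]

if __name__ == "__main__":
    sr = 8000
    source = harmonic_source(100.0, sr, 0.5)      # 100 Hz fundamental
    voiced = resonator(source, 700.0, 100.0, sr)  # one resonance near 700 Hz
    for name, sig in (("source", source), ("filtered", voiced)):
        ratio = harmonic_amplitude(sig, 700.0, sr) / harmonic_amplitude(sig, 100.0, sr)
        print(name, "700 Hz / 100 Hz amplitude ratio:", round(float(ratio), 3))
```

In the filtered signal the harmonic at 700 Hz is boosted relative to the fundamental, which is the emphasis effect the resonance is meant to show. A real vocal tract has several such resonances acting together.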
The resonances of the vocal tract, called formants, are determined by a combination of vocal tract properties, such as its overall shape, how far the jaw and lips are open, and the placement and shape of the tongue.

The speech system, as described above, determines the sounds that can
be produced by the human voice. The range and function of these sounds
within a particular language can be described in terms of phonology.

Phonology

There are two types of phonology relevant in voice conversion: segmental phonology and suprasegmental phonology. Segmental phonology breaks down speech into distinctive units. Distinctive units are discrete units that can be identified using either physical or auditory criteria. The most common discrete unit used in segmental phonology is the phoneme. A phoneme is the smallest meaningful contrastive speech sound within a given language; it is an abstract linguistic unit used to describe sounds. Just as sentences can be broken down into discrete words, words themselves are made up of one or more phonemes. At one level, phonemes can be divided into two groups, vowels and consonants. The acoustic correlates of a phoneme derive primarily from vocal tract movement during its articulation.

Suprasegmental (non-segmental) phonology, on the other hand, deals with features that apply to a sequence of phonemes. Sequences of phonemes can take the form of syllables, words, or even sentences. The term 'suprasegmental' generally applies to the following features: pitch, loudness, tempo, and rhythm. Sometimes, the term 'prosody' is used as a synonym for 'suprasegmental'. In its narrower sense, 'prosody' refers to paralinguistic features such as intonation, stress patterns, and secondary articulations such as lip rounding. In the context of voice conversion, the terms 'suprasegmental' and 'prosody' are used interchangeably to refer to any features that are dealt with as a sequence extending beyond the individual phoneme.

Typically, understanding what makes one voice different from another involves determining which features are specific to an individual voice as well as which features are needed to differentiate between several voices. In general, features are properties of the acoustic waveform.
In speech, the term 'distinctive feature' is used to specify dimensions in which phonemes can be differentiated. Since there are a number of dimensions in which phonemes can be differentiated,
phonemes can be thought of as bundles of distinctive features. The distinctive
features associated with each dimension are discerned through binary opposition;
either the feature (e.g., nasality) is present or it is not. What features
have in common is that they represent some property that is attributed
to a particular sound. The number of features a sound can have varies,
based on the nature of that feature. In addition to binary features, which
are either applicable to a sound or not, there are dimensional properties
which represent quantitative variations along some scale. For example,
several voices can be ordered on the basis of having higher or lower pitch
(related to rate of vocal cord vibration), being louder/softer (amount
of pressure), or the overall deepness of their tone (size of vocal tract).

Voice Conversion

As stated above, the key purpose of a voice conversion system is to transform the voice of one speaker into that of another speaker. Therefore, given two speakers, the goal is to design a transformation that makes the speech of the first speaker sound as though it were uttered by the second speaker.

A general voice conversion system works as follows. The system analyzes speech samples from two speakers, collecting the voice characteristics of the original and desired (target) speakers. After learning the characteristics of each speaker, the system automatically creates a conversion rule that maps the original speaker's voice characteristics onto those of the desired speaker. This conversion rule is applied to the original speech to create converted speech that exhibits the target speaker's voice characteristics.

Voice Matching

Voice conversion systems must address two problems. The first problem is to identify the speakers' salient characteristics. In speech recognition terms, this analysis involves acquiring speaker-dependent knowledge. The second problem is to incorporate this speaker-dependent knowledge into a transformation process.

Our approach is to treat the voice conversion problem as one of pattern
classification followed by modification. Pattern classification is used
to partition a set of examples into appropriate classes (categories) and
can be divided into two stages: feature extraction followed by classification.

Pattern Classification

Feature extraction for voice conversion begins by dividing the speech signal (a pattern) into frames. An acoustic representation is then extracted for each frame. Acoustic representations are features that often reduce the overall information in the speech signal while at the same time highlighting some aspect of the signal. Choosing which features to highlight is critical to the success of pattern classification. Given that a set of classes is known prior to classification, the best features are those that decrease the variability of features from examples belonging to the same class and increase the variability of features from examples belonging to different classes. So, in the context of voice conversion, the goal of feature extraction is a sequence of feature vectors that represent salient aspects of the speech signal that contribute towards classifying each frame of speech.

Voice conversion systems typically use features that correlate with the aspects of the vocal system that are critical to voice quality. From an acoustic perspective, the vocal production mechanism produces a time-varying pressure wave referred to simply as the 'waveform'. Using signal-processing methods, the waveform can be represented in terms of frequency; this representation is referred to as the 'spectrum'. Spectral characteristics can be estimated from the spectrum using a source-filter model, in which the filter represents the vocal tract and the source represents the excitation of the vocal tract. Roughly, the source (also referred to as excitation) corresponds to the function of the larynx and the filter corresponds to the function of the vocal tract. Prosodic characteristics such as a speaker's pitch, energy, and timing can be estimated from the waveform.
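The framing step and the waveform-based prosodic estimates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the feature extractor used by the system described here: the function names, the 50 ms frame length, the 10 ms hop, and the autocorrelation pitch estimator with its 60 to 400 Hz search range are all choices made for the example.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def frame_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def frame_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pitch estimate from the strongest autocorrelation peak in [fmin, fmax]."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min : lag_max + 1]))
    return sample_rate / lag

if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 200.0 * t)             # stand-in for voiced speech
    frames = frame_signal(tone, frame_len=400, hop=80)
    print(frames.shape)
    print(frame_energy(frames)[0], frame_pitch(frames[0], sr))
```

Each row of `frames` would then be turned into one feature vector. Spectral features would be computed per frame in the same way, for example from the magnitude of each frame's Fourier transform.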
Feature extraction thus yields a set of parameters representing the spectral characteristics (excitation and vocal tract) and the prosodic characteristics (pitch, energy, and timing).

Classification for voice conversion is based on the idea that speech signals are simply a concatenation of discrete units (words or phones). Therefore, in order to classify these units based on their corresponding features, a model is needed that represents the fact that each sound unit is dependent on the nearby sound units. Statistical distributions are one way to represent this type of sequence. Such statistical sequence recognition techniques rely on the notion that speech is essentially generated according to probability distributions. In other words, a model used for this type of classification must be able to represent the underlying linguistic units. These models must also be able to capture both the temporal variability and the underlying structure of the sequence. The goal of the classifier is to create a uniform representation of some discrete speech unit (phonemes or statistical states) so that two speakers can be compared and contrasted in a uniform manner.

Our system classifies the speech signal into phonemes. Classifying speech in terms of its underlying phonemic structure allows the features from each speaker to be compared in a uniform manner. Therefore, the conversion factors between two speakers can be calculated in terms of each phoneme. In other words, a mapping can be calculated for each feature with respect to the entire range of phonemes (forty-two for the English language).

Modification

In order to modify the voice, some method of analysis and synthesis is required. Methods of analysis and synthesis have evolved in conjunction with the development of signal processing techniques. Despite these changes, the overall approach of analysis and synthesis has remained the same.
First, the analysis portion provides an alternative representation of the acoustic waveform, in this case the speech signal. Second, once this analysis has been performed, the parameters can be used to synthesize a new waveform. The resultant sound can then be compared to the original sound, and the adequacy of the parametric representation can be determined.

Given a parametric representation of the waveform, the system can modify the parameters, which in turn causes the synthesized waveform to differ from the original. Additionally, if the parametric representation is based on acoustic correlates of the sound, such as formant locations, it becomes possible to isolate the parameters that are most important to its timbre. Altering the parameters and then listening to the synthesized results allows one to correlate specific physical characteristics with changes in timbre.

Our system uses an analysis-synthesis method that produces a parametric representation
of the speech signal. The basic spectral representations begin with sinusoidal
analysis, which produces a set of parameters that provides a representation
of the vocal tract and excitation. These parameters are modified using
various mapping techniques, which modify the spectral excitation and filter
response as well as the prosody characteristics of the speech signal.
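One simple way to picture a per-phoneme mapping is as a table of conversion factors, one per phoneme, computed from frames of both speakers that carry the same phoneme label. The Python sketch below uses the ratio of per-phoneme means as the conversion factor. That choice, the function names, and the phoneme labels are assumptions made for illustration; the text does not specify which mapping techniques the system uses.

```python
from collections import defaultdict

def phoneme_means(labels, values):
    """Average a per-frame feature (e.g. F0 in Hz) over frames sharing a phoneme label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def conversion_factors(source_means, target_means):
    """Per-phoneme ratio target / source (an assumed, deliberately simple mapping)."""
    return {p: target_means[p] / source_means[p]
            for p in source_means if p in target_means}

def convert(labels, values, factors):
    """Scale each frame's feature by its phoneme's factor; unseen phonemes pass through."""
    return [value * factors.get(label, 1.0) for label, value in zip(labels, values)]

if __name__ == "__main__":
    # Hypothetical labelled frames: (phoneme label, F0 in Hz) for each speaker.
    src = phoneme_means(["aa", "aa", "iy"], [100.0, 120.0, 150.0])
    tgt = phoneme_means(["aa", "iy", "iy"], [165.0, 290.0, 310.0])
    factors = conversion_factors(src, tgt)
    print(factors)
    print(convert(["aa", "iy", "uw"], [100.0, 150.0, 130.0], factors))
```

The same table structure would apply to other parameters such as energy or duration, with one table per feature.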
The parameters that are modified during resynthesis include the spectral envelope,
the distribution of the fundamental frequency, the energy, and the duration
of each phoneme (time scaling).

For more information, email me or contact Nellymoser.

Voice Conversion Bibliography (several years out of date)

Voice Conversion

Abe, Masanobu, Nakamura, Satoshi, Shikano, Kiyohiro, and Kuwabara, Hisao (1988) Voice Conversion through Vector Quantization. ICASSP 1988, pp. 655-658.
Abe, Masanobu, Shikano, Kiyohiro, and Kuwabara, Hisao (1990) Cross-Language Voice Conversion. ICASSP 1990, pp. 345-364.
Abe, Masanobu (1991) A Segment-Based Approach to Voice Conversion. ICASSP 1991, pp. 765-768.
Levent M. Arslan and David Talkin (1997) Voice conversion by codebook mapping of line spectral frequencies and excitation spectrum. Proceedings of EuroSpeech97.
Childers, D.G., Yegnanarayana, B. and Wu, Ke (1985) Voice Conversion: Factors Responsible for Quality. ICASSP 1985, pp. 748-751.
Childers, D.G., Wu, Ke, Hicks, D.M., and Yegnanarayana, B. (1989) Voice Conversion.
Childers, D.G. (1995) Glottal source modeling for voice conversion. Speech Communication 16, pp. 127-138.
Hoory, R. and Chazan, D. (1994) Speech Synthesis for a Specific Speaker
Based on a Labeled Speech Database. IEEE International Conference on
Pattern Recognition, Vol. 1, pp. 146-148.
Iwahashi, Naoto and Sagisaka, Yoshinori (1995) Speech spectrum conversion
based on speaker interpolation and multi-functional representation with
weighting by radial basis function networks. Speech Communication
16, pp. 139-151.
Iwahashi, Naoto and Sagisaka, Yoshinori (1996?) Speech spectrum transformation
by speaker interpolation. ICASSP-96?, Vol. 1, pp. 461-464.
Kuwabara, Hisao and Sagisaka (1995) Acoustic characteristics of speaker
individuality: Control and conversion. Speech Communication 16,
pp. 165-173.
Mizuno, Hideyuki and Abe, Masanobu () Voice Conversion Based on Piecewise
Linear Conversions Rules of Formant Frequency and Spectrum Tilt. Speech
Communication 16, pp. 153-164.
Mizuno, Hideyuki, and Abe, Masanobu () Voice conversion based on piecewise
linear conversion rules of formant frequency and spectrum tilt. ICASSP-?,
Vol. 1, pp. 469-472.
Moulines, E. and Sagisaka, Y. (1995) Voice Conversion: State of the Art
and Perspectives. Speech Communication 16, pp. 125-126.
Narendranath, M., Murthy, Hema A., Rajendran, S. and Yegnanarayana, B.
(1995) Transformation of formants for voice conversion using artificial
neural networks. Speech Communication 16, pp. 207-216.
Rinsheid, Ansgar (1996) Voice Conversion on Topological Feature Maps
and Time-Variant Filtering. ICSLP 1996.
Savic, Michael and Nam, Il-Hyun (1990) A System For Voice Personality
Transformation. Speech Technology.
Savic, Michael, and Nam, Il-Hyun (1991) Voice personality transformation.
Digital Signal Processing 1, pp. 107-110.
Shikano, Kiyohiro, Nakamura, Satoshi, and Abe, Masanobu (1991) Speaker
Adaptation and Voice Conversion by Codebook Mapping. IEEE Symposium
on Circuits and Systems, Vol. 1, pp. 594-597.
Slifka, Janet and Anderson, Timothy R. (1995) Speaker Modification with
LPC Pole Analysis. ICASSP 1995, pp. 644-647.
Valbret, H., Moulines, E. and Tubach, J.P. (1992) Voice transformation
using PSOLA technique. Speech Communication 11, pp. 175-187.
Verhelst, Werner and Mertens, Johan (1996) Voice Conversion using Partitions
of Spectral Feature Space.
Bi, Ning, and Qi, Yingyong (1997) Application of Speech Conversion to
Alaryngeal Speech Enhancement. IEEE Trans. Speech Audio Proc.,
5(2), pp. 97-105.
E. Bryan George and Mark J. T. Smith (1997) Speech analysis/synthesis
and modification using an analysis-by-synthesis/overlap-add sinusoidal
model. IEEE Trans. Speech and Audio Proc. 5(5), pp. 389-406.
Kuwabara, Hisao (1984) A Pitch-Synchronous Analysis/Synthesis System to
Independently Modify Formant Frequencies and Bandwidths For Voiced Speech.
Speech Communication 3, pp. 211-220.
Kawahara, Hideki (1997) Speech representation and transformation using
adaptive interpolation of weighted spectrum: vocoder revisited. ICASSP-97,
pp. 1303-1306.
Leis, John, Phythian, Mark, and Sridharan, Sridha (1997?) Speech compression
with preservation of speaker identity. Proc. Eurospeech 97?
Macon, Michael W. and Clements, Mark A. (1997) Sinusoidal modelling and
modification of unvoiced speech. IEEE Trans. Speech and Audio Proc.
5(6), pp. 557-560.
Moulines, Eric and Laroche, Jean (1995) Non-parametric techniques for
pitch-scale and time-scale modification of speech. Speech Communication
16, pp. 175-205.
Quatieri, Thomas F. and McAulay, Robert J. (1986) Speech transformations
based on a sinusoidal representation. IEEE Trans. Acoust., Speech,
and Sig. Proc. ASSP-34(6), pp. 1449-1464.
Quatieri, Thomas F. and McAulay, Robert J. (1989) Phase coherence in
speech reconstruction for enhancement and coding applications. Proc.
ICASSP-89, pp. 207-209.
Quatieri, Thomas F. and McAulay, Robert J. (1992) Shape invariant time-scale
and pitch modification of speech. IEEE Trans. Signal Processing
40(3), pp. 497-510.
Seneff, Stephanie (1982) System to Independently Modify Excitation and/or
Spectrum of Speech Waveform Without Explicit Pitch Extraction. IEEE
Trans Acous. Speech Sig. Proc., ASSP-30(4), pp. 566-578.
Vergin, Rivarol, O'Shaughnessy, Douglas, and Farhat, Azarshid (1997)
Time domain technique for pitch modification and robust voice transformation.
Proc. ICASSP-97, pp. ?
Slaney, Malcolm, Covell, Michele, and Lassiter, Bud (1996) Automatic
Audio Morphing. ICASSP 1996, pp. 1001-1004.
Laroche, Jean (1989) A new analysis/synthesis system of musical signals
using Prony's method. Application to heavily damped percussive sounds.
Proc. ICASSP-89, pp. 2053-2056.
Laroche, Jean, Stylianou, Yannis, and Moulines, Eric (1993) HNS: Speech
Modification Based on a Harmonic + Noise Model. ICASSP 1993,
Vol. 2, pp. 550-553.
McAulay, Robert J. and Quatieri, Thomas F. (1986) Speech analysis/synthesis
based on a sinusoidal representation. IEEE Trans. Acoust., Speech,
and Sig. Proc. ASSP-34(4), pp. 744-754.
McAulay, Robert J. and Quatieri, Thomas F. (1986) Phase modelling and
its application to sinusoidal transform coding. Proc. ICASSP-86,
pp. 1713-1715.
Quatieri Jr., Thomas F. (1979) Minimum and mixed phase speech analysis/synthesis
by adaptive homomorphic deconvolution. IEEE Trans. Acoust., Speech,
and Sig. Proc. ASSP-27(4), pp. 328-335.
Childers, D. G. and Lee, C. K. (1991) Vocal quality factors: analysis,
synthesis, and perception. J. Acoust. Soc. Am. 90(5), pp.
2394-2410.
Darwin, C.J. and Gardner, Roy B. (1985) Which harmonics contribute to
the estimation of first formant frequency? Speech Communication
4, pp. 231-235
Fant, Gunnar (1993) Some problems in voice source analysis.
Klatt, Dennis and Klatt, Laura C. (1990) Analysis, synthesis, and perception
of voice quality variations among female and male talkers. J. Acoust.
Soc. Am. 87(2), pp. 820-857.
Kreiman, Jody and Gerratt, Bruce, R. (1996) The perceptual structure
of pathologic voice quality. J. Acoust. Soc. Am. 100(3),
pp. 1787-1795
Ladd, Robert D. (1985) Evidence for the independent function of intonation
contour type, voice quality, and F0 range in signaling speaker affect.
J. Acoust. Soc. Am. 78(2), pp. 435-444.
Labov, William (?) The organization of dialect diversity in North America.
?
Price, P. J. (1989) Male and female voice source characteristics: inverse
filtering results. Speech Communication 8, pp.
261-277.
Schoentgen, Jean (1982) Quantitative evaluation of the discrimination
performance of acoustic features in detecting laryngeal pathology. Speech
Communication 1, pp. 269-282
Traunmuller, Hartmut (1984) Articulatory and Perceptual Factors Controlling
the Age- and Sex- Conditioned Variability in Formant Frequencies of Vowels.
Speech Communication.
Yang, Chang-Sheng and Kasuya, Hideki (1996) Speaker individualities of
vocal tract shapes of Japanese vowels measured by magnetic resonance images.
Proc. ICSLP-96
Childers, D.G. and Wu, Ke (1991) Gender recognition from speech. Part
II: Fine analysis. J. Acoust. Soc. Am. 90(4), pp. 1841-1856.
Parris, Eluned S. and Carey, Michael J. (1996) Language Independent Gender
Identification. ICASSP 1996, Vol. 2, pp. 685-688.
Wu, Ke and Childers, D.G. (1991) Gender recognition from speech. Part
I: Coarse analysis. J. Acoust. Soc. Am. 90(4), pp. 1828-1839.
Christensen, Randall L., Strong, William J. and Palmer, E. Paul (1976)
A comparison of three methods of extracting resonance information from
predictor-coefficient coded speech. IEEE Trans. Acoust., Speech, and
Sig. Proc. ASSP-24(1), pp. 8-14.
Kopec, Gary E. (1986) A family of formant trackers based on hidden Markov
models. Proc. ICASSP-86, pp. 1225-1228.
McCandless, Stephanie S. (1974) An algorithm for automatic formant extraction
using linear prediction spectra. IEEE Trans. Acoust., Speech, and Sig.
Proc. ASSP-22(2), pp. 135-141.
Markel, John D. (1972) Digital inverse filtering - a new tool for formant
trajectory estimation. IEEE Trans. Audio and Electroacoustics AU-20(2),
pp. 129-137.
Markel, John D. (1973) Application of a digital inverse filter for automatic
formant and F0 analysis. IEEE Trans. Audio and Electroacoustics
AU-21(3), pp. 154-160.
Nathan, Krishna S. and Silverman, Harvey (1990) High-resolution characterization
of formant in vowel-consonant transitions. Proc. ICASSP-90, pp.
353-355.
Reddy, N. Sridhar and Swamy, M. N. S. (1984) High-resolution formant
extraction from linear-prediction phase spectra. IEEE Trans. Acoust.,
Speech, and Sig. Proc. ASSP-32(6), pp. 1136-1144.
Wilcox, Lynn D. and Chen, Francine R. (1990) Application of Markov random
fields to formant extraction. Proc. ICASSP-90, pp. 349-351.
Hedelin, Per and Huber, Dieter (1990) Pitch period determination of aperiodic
speech signals. Proc. ICASSP-90, pp. 361-364.
McAulay, Robert J. and Quatieri, Thomas F. (1990) Pitch estimation and
voicing detection based on a sinusoidal speech model. Proc. ICASSP-90,
pp. 249-252.
Schoentgen, Jean and de Guchteneere, Raoul (1991) An algorithm for the
measurement of jitter. Speech Communication 10, pp. 533-538.
Slaney, Malcolm and Lyon, Richard F. (1990) A perceptual pitch detector.
Proc. ICASSP-90, pp. 357-359.
Jan P. H. van Santen (1997) Prosodic modeling in text-to-speech synthesis.
Proceedings of EuroSpeech97. Keynote Speech.
Marcel Riedi (1997) Modeling segmental duration with multivariate adaptive
regression splines. Proceedings of EuroSpeech97.