


1.1 THE SPEECH SIGNAL

The fundamental purpose of speech is human communication; i.e., the transmission of messages between a speaker and a listener. According to Shannon's information theory [364], a message represented as a sequence of discrete symbols can be quantified by its information content in bits, where the rate of transmission of information is measured in bits per second (bps). In speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded (stored), manipulated, and ultimately decoded by a human listener. The fundamental analog form of the message is an acoustic waveform that we call the speech signal. Speech signals, such as the one illustrated in Figure 1.2, can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing methods, and then converted back to acoustic form by a loudspeaker, a telephone handset, or headphones, as desired. This form of speech processing is, of course, the basis for Bell's telephone invention as well as today's multitude of devices for recording, transmitting, and manipulating speech and audio signals. In Bell's own words [47],


Watson, if I can get a mechanism which will make a current of electricity vary its intensity as the air varies in density when sound is passing through it, I can telegraph any sound, even the sound of speech.

Although Bell made his great invention without knowing about information theory, the principles of information theory have assumed great importance in the design of sophisticated modern digital communications systems. Therefore, even though our main focus will be mostly on the speech waveform and its representation in the form of parametric models, it is nevertheless useful to begin with a discussion of the information that is encoded in the speech waveform.


Figure 1.3 shows a pictorial representation of the complete process of producing and perceiving speech: from the formulation of a message in the brain of a speaker, to the creation of the speech signal, and finally to the understanding of the message by a listener. In their classic introduction to speech science, Denes and Pinson appropriately referred to this process as the "speech chain" [88]. A more refined block diagram representation of the speech chain is shown in Figure 1.4. The process starts in the upper left as a message represented somehow in the brain of the speaker. The message information can be thought of as having a number of different representations during the process of speech production (the upper path in Figure 1.4). For example, the message could be represented initially as English text. In order to "speak" the message, the speaker implicitly converts the text into a symbolic representation of the sequence of sounds corresponding to the spoken version of the text. This step, called the language code generator in Figure 1.4, converts text symbols to phonetic symbols (along with stress and durational information) that describe the basic sounds of a spoken version of the message and the manner (i.e., the speed and emphasis) in which the sounds are intended to be produced. As an example, the segments of the waveform of Figure 1.2 are labeled with phonetic symbols using a computer-keyboard-friendly code called ARPAbet.² Thus, the text "should we chase" is represented phonetically (in ARPAbet symbols) as [SH UH D - W IY - CH EY S]. (See Chapter 3 for more discussion of phonetic transcription.) The third step in the speech production process is the conversion to neuro-muscular controls; i.e., the set of control signals that direct the neuro-muscular system to move the speech articulators, namely the tongue, lips, teeth, jaw, and velum, in a manner that is consistent with the sounds of the desired spoken message and with the desired degree of emphasis.


FIGURE 1.2 A speech waveform with phonetic labels for the message represented by the text "should we chase." (Horizontal axis: time in seconds.)



²The International Phonetic Association (IPA) provides a set of rules for phonetic transcription using an equivalent set of specialized symbols. The ARPAbet code does not require special fonts and is thus more convenient for computer applications.

The end result of the neuro-muscular controls step is a set of articulatory motions (continuous control) that cause the vocal tract articulators to move in a prescribed manner in order to create the desired sounds. Finally, the last step in the speech production process is the "vocal tract system," which creates the physical sound sources and appropriate time-varying vocal tract shapes so as to produce an acoustic waveform such as the one shown in Figure 1.2. In this way, the information in the desired message is encoded into the speech signal.


To determine the rate of information flow during speech production, assume that there are about 32 symbols (letters) in the written language. (In English there are 26 letters, but if we include simple punctuation and spaces, we get a count closer to 32 = 2⁵ symbols.) The normal average rate of speaking is about 15 symbols per second. Hence, assuming independent letters as a simple first-order approximation, the base information rate of the text message encoded as speech is about 75 bps (5 bits per symbol times 15 symbols per second). However, the actual rate will vary with speaking rate. For the example of Figure 1.2, the text representation has 15 letters (including spaces) and the corresponding speech utterance has a duration of 0.6 seconds, giving a higher estimate of 15 × 5/0.6 = 125 bps. At the second stage of the process, where the text representation is converted into basic sound units called phonemes along with prosody (e.g., pitch and stress) markers, the information rate can easily increase to over 200 bps. The ARPAbet phonetic symbol set used to label the speech sounds in Figure 1.2 contains approximately 64 = 2⁶ symbols, or about 6 bits/phoneme (again a rough approximation assuming independence of phonemes). In Figure 1.2, there are eight phonemes in approximately 0.6 seconds. This leads to an estimate of 8 × 6/0.6 = 80 bps. Additional information required to describe prosodic features of the signal (e.g., duration, pitch, loudness) could easily add 100 bps to the total information rate for the text message encoded as a speech signal.
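The back-of-the-envelope estimates above can be reproduced in a few lines of Python. The symbol counts and timings are taken directly from the text; the script itself is only an illustration of the arithmetic:

```python
import math

# Stage 1: text. ~32 = 2^5 symbols -> 5 bits/symbol, ~15 symbols/sec.
bits_per_letter = math.log2(32)               # 5.0 bits
base_rate = 15 * bits_per_letter              # ~75 bps for average speech

# "should we chase": 15 letters (including spaces) uttered in 0.6 s.
utterance_rate = 15 * bits_per_letter / 0.6   # ~125 bps for this utterance

# Stage 2: phonemes. ~64 = 2^6 symbols -> 6 bits/phoneme; 8 phonemes in 0.6 s.
bits_per_phoneme = math.log2(64)              # 6.0 bits
phoneme_rate = 8 * bits_per_phoneme / 0.6     # ~80 bps

print(round(base_rate), round(utterance_rate), round(phoneme_rate))  # 75 125 80
```

Both symbol-count approximations assume independent, equiprobable symbols, which is why the text calls them rough first-order estimates.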

The information representations for the first two stages in the speech chain are discrete, so we can readily estimate the rate of information flow with some simple assumptions. For the next stage in the speech production part of the speech chain, the representation becomes continuous (in the form of neuro-muscular control signals for articulatory motion). If they could be measured, we could estimate the spectral bandwidth of these control signals and appropriately sample and quantize these signals to obtain equivalent digital signals for which the data rate could be estimated. The articulators move relatively slowly compared to the time variation of the resulting acoustic waveform. Estimates of bandwidth and required signal representation accuracy suggest that the total data rate of the sampled articulatory control signals is about 2000 bps [105]. Thus, the original text message is represented by a set of continuously varying signals whose digital representation requires a much higher data rate than the information rate that we estimated for transmission of the message as a discrete textual signal.³ Finally, as we will see later, the data rate of the digitized speech waveform at the end of the speech production part of the speech chain can


³Note that we introduce the term "data rate" for digital representations to distinguish it from the inherent information content of the message represented by the speech signal.

be anywhere from 64,000 to more than 700,000 bps. We arrive at such numbers by examining the sampling rate and quantization required to represent the speech signal with a desired perceptual fidelity. For example, "telephone quality" speech processing requires that a bandwidth of 0 to 4 kHz be preserved, implying a sampling rate of 8000 samples/sec. Each sample amplitude can be quantized with 8 bits distributed on a log scale, resulting in a bit rate of 64,000 bps. This representation is highly intelligible (i.e., humans can readily extract the message from it), but to most listeners it will sound different from the original speech signal uttered by the talker. On the other hand, the speech waveform can be represented with "CD quality" using a sampling rate of 44,100 samples/sec with 16-bit samples, or a data rate of 705,600 bps. In this case, the reproduced acoustic signal will be virtually indistinguishable from the original speech signal.
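Both waveform data rates quoted above follow from multiplying sampling rate by bits per sample. A minimal sketch (the helper function name is my own, not from the text):

```python
def waveform_data_rate(samples_per_sec: int, bits_per_sample: int) -> int:
    """Data rate in bps of a uniformly sampled, uniformly quantized waveform."""
    return samples_per_sec * bits_per_sample

# "Telephone quality": 4 kHz bandwidth -> 8000 samples/sec, 8-bit log-scale samples.
telephone_bps = waveform_data_rate(8000, 8)     # -> 64000

# "CD quality": 44,100 samples/sec, 16-bit linear samples.
cd_bps = waveform_data_rate(44100, 16)          # -> 705600

print(telephone_bps, cd_bps)  # 64000 705600
```

Comparing the CD-quality rate with the roughly 75 bps text information rate gives the factor of nearly 10,000 discussed in the next paragraph.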


As we move from a textual representation to the speech waveform representation through the speech chain, the result is an encoding of the message that can be transmitted by acoustic wave propagation and robustly decoded by the hearing mechanism of a listener. The above analysis of data rates shows that as we move from text to a sampled speech waveform, the data rate can increase by a factor of up to 10,000. Part of this extra information represents characteristics of the talker such as emotional state, speech mannerisms, accent, etc., but much of it is due to the inefficiency of simply sampling and finely quantizing analog signals. Thus, motivated by an awareness of the low intrinsic information rate of speech, a central theme of much of digital speech processing is to obtain a digital representation with a lower data rate than that of the sampled waveform.

The complete speech chain consists of a speech production/generation model, of the type discussed above, as well as a speech perception/recognition model, as shown progressing to the left in the bottom half of Figure 1.4. The speech perception model shows the series of processing steps from capturing speech at the ear to understanding the message encoded in the speech signal. The first step is the effective conversion of the acoustic waveform to a spectral representation. This is done within the inner ear by the basilar membrane, which acts as a non-uniform spectrum analyzer by spatially separating the spectral components of the incoming speech signal and thereby analyzing them by what amounts to a non-uniform filter bank. The second step in the speech perception process is a neural transduction of the spectral features into a set of sound features (or distinctive features, as they are referred to in the area of linguistics) that can be decoded and processed by the brain. The third step in the process is a conversion of the sound features into the set of phonemes, words, and sentences associated with the incoming message by a language translation process in the human brain. Finally, the last step in the speech perception model is the conversion of the phonemes, words, and sentences of the message into an understanding of the meaning of the basic message in order to be able to respond to it or take some appropriate action. Our fundamental understanding of the processes in most of the speech perception modules in Figure 1.4 is rudimentary at best, but it is generally agreed that some physical correlate of each of the steps in the speech perception model occurs within the human brain, and thus the entire model is useful for thinking about the processes that occur. The fundamentals of hearing and perception are discussed in Chapter 4.
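The basilar membrane's non-uniform frequency analysis is often approximated in engineering models by a bank of bandpass filters whose bandwidths grow with center frequency. The sketch below only illustrates that geometry with logarithmically spaced band edges; the band count and frequency range are arbitrary choices for illustration, not values from the text:

```python
def log_spaced_bands(f_low=100.0, f_high=4000.0, n_bands=10):
    """Return (low, high) edge pairs of a log-spaced (non-uniform) filter bank."""
    ratio = f_high / f_low
    edges = [f_low * ratio ** (i / n_bands) for i in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))

# Band widths widen toward high frequencies, mimicking the ear's
# coarser spectral resolution there.
for lo, hi in log_spaced_bands():
    print(f"{lo:7.1f} - {hi:7.1f} Hz  (width {hi - lo:6.1f} Hz)")
```

A uniform filter bank would split 100-4000 Hz into equal 390 Hz slices; the log spacing instead packs narrow bands into the low frequencies, which is the sense in which the basilar membrane's analysis is "non-uniform."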

There is one additional process shown in the diagram of the complete speech chain in Figure 1.4 that we have not discussed, namely the transmission channel between the speech generation and speech perception parts of the model. In its simplest embodiment, as depicted in Figure 1.3, this transmission channel consists of just the acoustic wave connection between a speaker and a listener who are in a common space. It is essential to include this transmission channel in our model for the speech chain since it includes real-world noise and channel distortions that make speech and message understanding more difficult in real communication environments. More interestingly for our purpose here, this is where the acoustic waveform of speech is converted to digital form and manipulated, stored, or transmitted by a communication system. That is, it is in this domain that we find the applications of digital speech processing.
