Review of end-to-end speech synthesis technology based on deep learning
Zhaoxi Mu · Xinyu Yang · Yizhuo Dong
Abstract As an indispensable part of modern human-computer interaction systems, speech synthesis technology helps users obtain the output of intelligent machines more easily and intuitively, and has therefore attracted increasing attention. Because traditional speech synthesis technology suffers from high complexity and low efficiency, current research focuses on deep learning-based end-to-end speech synthesis, which offers more powerful modeling ability and a simpler pipeline. It mainly consists of three modules: text front-end, acoustic model, and vocoder. This paper reviews the research status of these three parts, and classifies and compares the various methods according to their emphasis. Moreover, this paper summarizes the open-source speech corpora of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and objective speech quality evaluation methods. Finally, some attractive future research directions are pointed out.
Keywords Speech synthesis · Text-to-speech · End-to-end · Deep learning · Review
Zhaoxi Mu · Xinyu Yang · Yizhuo Dong
Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
1 Introduction
With the rapid development of computer science, artificial intelligence, automation and robot control technology, the demand for human-computer interaction is increasingly being met, and the ways of interacting have become more direct and convenient. Human-computer interaction relies heavily on speech communication. The speech system of a machine is divided into three functional modules: voiceprint recognition, speech recognition and speech synthesis. The most difficult and complex task is speech synthesis. This is because, compared with speech and voiceprint recognition, speech synthesis systems usually require more training data and more complex models in order to accurately synthesize high-fidelity speech with various styles from simple text input.
Speech synthesis is also called text-to-speech (TTS) when the input is text. TTS is a frontier technology in the field of information processing that involves many disciplines, such as acoustics, linguistics, and computer science. Its main task is to convert input text into output speech. The TTS system is the mouth of an intelligent machine. It has been widely used in many areas of daily life, such as voice navigation, information broadcasting, intelligent assistants and intelligent customer service, and has achieved great economic benefits. Moreover, it is also being applied to some new fields, such as article reading, language education, video dubbing, and rehabilitation therapy. TTS applications have become an important part of people's lives.
Deep learning-based TTS technology With the develop-
ment of computer science and technology, the intelligi-
bility and naturalness of synthesized speech have been
greatly improved due to the continuous improvement
of TTS techniques from the formant-based methods
[92, 97, 105, 106, 179, 224] to the unit selection-based
waveform cascade methods [6, 24, 36, 59, 80, 143, 144],
and to the hidden Markov model (HMM)-based sta-
tistical parametric speech synthesis (SPSS) methods
[26, 94, 149, 178, 210, 241, 249, 250]. Deep learning
is a new research direction in the field of artificial in-
telligence in recent years. This method can effectively
capture the latent information and association in data,
and has more powerful modeling ability than tradi-
tional statistical learning methods [238]. TTS methods
based on deep learning have been widely researched
[52, 131, 168, 247]. For example, in the SPSS model
based on deep neural network (DNN), DNN can learn
the mapping function from linguistic features (input)
to acoustic features (output).
DNN-based acoustic models provide an effective dis-
tributed representation of the complex dependencies
between linguistic features and acoustic features. How-
ever, one limitation of the acoustic feature modeling
method based on feedforward DNN is that it ignores
the continuity of speech. The DNN-based method as-
sumes that each frame is sampled independently, al-
though there is correlation between consecutive frames
in the speech data. Recurrent Neural Network (RNN)
provides an effective method to model the correlation
between adjacent frames of speech, because it can use
all the available input features to predict the output fea-
tures of each frame. Based on this, some researchers use
RNN instead of DNN to capture the long-term depen-
dence of speech frames in order to improve the quality
of synthesized speech [51, 53, 93, 213, 248, 251].
End-to-end TTS technology The traditional SPSS net-
work is a complex pipeline containing many modules,
composed of text-to-phoneme network, audio segmen-
tation network, phoneme duration prediction network,
fundamental frequency prediction network and vocoder
[3, 57]. Building these modules requires a lot of profes-
sional knowledge and complex engineering implemen-
tation, which will take a lot of time and effort. Also,
the combination of errors in each component may make
the model difficult to train. End-to-end TTS methods
are driven by the desire to simplify TTS systems and
reduce the need for manual intervention and linguis-
tic background knowledge. The end-to-end TTS model
only needs to be trained from scratch on the paired
data set of ⟨text, speech⟩ pairs, and can directly synthesize
speech from the text. The state-of-the-art end-to-end
TTS models based on deep learning have been able to
synthesize speech close to human voice [151, 189, 227].
It is mainly composed of three parts: text analysis front-end, acoustic model and vocoder, as shown in Fig. 1.
Fig. 1 Pipeline architecture for TTS: text → text front-end → phonemes (e.g., "yu3 yin1 he2 cheng2") → acoustic model → spectrogram → vocoder → waveform
Firstly, the text front-end converts the text
into standard input. Then, the acoustic model converts
the standard input into intermediate acoustic features,
which are used to model the long-term structure of
speech. The most common intermediate acoustic fea-
tures are spectrogram [189, 227], vocoder feature [196]
or linguistic feature [151]. Finally, the vocoder is used
to fill in low-level signal details and convert acoustic
features into time-domain waveform samples. To reduce
the difficulty of training and improve the quality of syn-
thesized speech, the text front-end, acoustic model and
vocoder are usually trained separately [189], and they
can also be fine-tuned jointly [196]. This article will in-
troduce some of the latest developments in each of the
three components according to the structure of Fig. 2.
There have been some reviews on TTS. For exam-
ple, Deng et al. [43] analyzed the number of documents
and citations of TTS papers from 1992 to 2017, aiming
to help researchers understand the development trend
of TTS. Aroon and Dhonde [5] reviewed SPSS meth-
ods based on HMM. Adiga and Prasanna [1] reviewed
SPSS methods and partially deep learning based meth-
ods. Ning et al. [148] and Sruthi and Meharban [197]
reviewed TTS methods based on deep learning. Kalita
and Deb [90] reviewed emotional TTS methods for Hindi.
Tits et al. [207] reviewed the emotional speech corpus
that could be used for TTS.
Although there have been some reviews of TTS methods based on deep learning, they introduce only a few baseline models, such as WaveNet [151], Tacotron [227] and SampleRNN [136]. These models have many problems, such as slow training and inference speed, instability, a lack of emotion and rhythm in the synthesized speech, and the large amount of high-quality speech data required for training. State-of-the-art TTS methods can completely or partially solve these problems, but so far there has been no comprehensive review of the latest deep learning-based TTS models.
Fig. 2 Section organization of the TTS model: text front-end; acoustic model (fast: non-RNN, non-autoregressive, streaming; robust: stable autoregressive generation process, accurate alignment; expressive: reference encoder, explicit modeling of style features; low-resource: multi-speaker); vocoder (fast: small size, non-autoregressive; high-quality)
Moreover, the quantity and quality of the training speech corpus play a decisive role in the training results of the
TTS model, and how to effectively evaluate the quality
of synthesized speech has always been a problem in the
field of TTS. Therefore, this paper will make a detailed
summary of the latest end-to-end TTS models based
on deep learning, speech corpus and evaluation meth-
ods of synthesized speech, and finally give some future
research directions.
The rest of this paper is organized as follows: Sect. 2,
3 and 4 respectively introduce the latest text front-end,
acoustic model and vocoder based on deep learning.
Sect. 5 summarizes the corpora that can be used for TTS.
Sect. 6 introduces commonly used synthesized speech
evaluation methods from both subjective and objective
aspects. Sect. 7 puts forward some challenges and future
research directions for reference. The last section draws
a general conclusion of this paper.
2 Text front-end
It is difficult to synthesize high-fidelity speech only us-
ing original phonemes or original text as the input of
the TTS model, especially for languages that contain
polyphonic characters and have complex prosodic struc-
tures, such as Mandarin. Therefore, it is necessary to
use the text front-end to introduce additional pronun-
ciation and syntactic information. The text front-end
predicts the pronunciation mode from the original text,
aiming to provide enough information for the back-
end to accurately synthesize speech. The quality of the
text front-end has a great impact on the clarity and
naturalness of the synthesized speech. Pronunciation
patterns are important information for languages with
many polyphonic characters and ambiguous pronunci-
ations, such as Mandarin. Syntactic information also
contributes a lot to the pronunciation of a sentence,
which determines the pause and tone of a sentence.
People usually read a phrase that has a full meaning
in its entirety, and pause between phrases that need to
be separated. For languages with many ambiguities, the
effect of syntactic information on sentence segmentation
may also cause listeners to have a completely different
understanding of a sentence. Therefore, this informa-
tion needs to be predicted by the text front-end as a
conditional input of the acoustic model to synthesize
speech with correct pronunciation and prosody.
The traditional Mandarin text front-end is a cas-
cade system, which consists of a series of text processing
components, such as text normalization (TN), Chinese
word segmentation (CWS), part-of-speech (POS) tag-
ging, grapheme-to-phoneme (G2P) and prosodic struc-
ture prediction (PSP). The text front-end structure of
other languages is similar to that of Mandarin. These
components are usually modeled by traditional statisti-
cal methods, such as syntactic trees [264] and CRF [167]
based methods for PSP tasks and dictionary match-
ing based methods [77] for pronunciation prediction
tasks. However, these traditional text front-ends often
fail to predict correctly in some unusual or complex con-
texts. To boost prediction accuracy, some researchers
have adopted state-of-the-art NLP frameworks based
on deep learning methods such as BLSTM-CRF [78,
266], Word2Vec [139], Transformer [222] and BERT [44]
to improve the text front-end model based on dictio-
nary and traditional statistical learning methods. These
models can extract contextual information from the
text effectively, and thus help the text front-end to
accurately determine the pronunciation of polyphonic
characters, the meaning of ambiguous sentences, and
the prosodic boundaries between each word, each phrase
and each sentence. The following will introduce the lat-
est text front-end model based on deep learning from
the aspects of text normalization, prosodic structure
prediction, pronunciation prediction, contextual infor-
mation extraction and so on.
Text normalization Text normalization is an important
preprocessing step for TTS tasks. Zhang et al. [258]
standardized Mandarin text by combining the tradi-
tional rule-based system with a neural text network
consisting of multi-head self-attention modules in Trans-
former to convert Non-Standard Words (NSW) into
Spoken-Form Words (SFW). This method has a higher
prediction accuracy than the rule-based system.
Prosodic structure prediction Prosodic structure pre-
diction is also an important function of the text front-
end. Taking Mandarin as an example, the prosodic struc-
ture of Mandarin is a three-level hierarchical structure
composed of three basic units: prosodic words (PW),
prosodic phrases (PPH) and intonation phrases (IPH)
[35]. Because these three levels of prediction tasks are
interrelated, Pan et al. [154] modeled prosody informa-
tion at all levels of the text in the way of multi-task
learning, and proposed a Mandarin prosodic bound-
ary prediction model based on BLSTM-CRF, which
improved the prediction accuracy and simplified the
model. Lu et al. [130] also proposed a multi-task learning method to efficiently complete PSP tasks based
on the self-attention model.
Pronunciation prediction Other text front-ends have
the pronunciation prediction function on the basis of
text normalization and prosody prediction. The G2P
tasks of Mandarin can be divided into two categories:
G2P of monophonic characters and G2P of polyphonic
characters. The pronunciation of monophonic charac-
ters can be easily determined by a pronunciation dictio-
nary, while G2P of polyphonic characters is highly con-
text sensitive [262]. Therefore, disambiguation of poly-
phonic characters is the main task of Mandarin G2P.
To accurately predict the pronunciation of polyphonic
characters, Cai et al. [21], Shan et al. [185] and Park
and Lee [157] proposed to use Bi-LSTM network for
G2P. On the basis of Pan et al. [154], Yang et al. [236]
proposed to preprocess the original text by replacing
the Word2Vec model with the encoder of Transformer-
based NLP model and BERT pre-training model, and
then carry out G2P and PSP in the Mandarin text
front-end. The accuracy of prediction can be improved
by taking advantage of Transformer and BERT net-
work. However, pre-training models, such as BERT,
are too large to be used in real-time applications and on
edge devices. To reduce the size of the model, Zhang
et al. [262] proposed to use the simplified TinyBERT
model [86] for the G2P and PSP tasks simultaneously
using multi-task learning. It can ensure the accuracy
of the prediction results while reducing the size of the
model. Conkie and Finch [40] proposed a text front-end
that can be used to process multiple languages, includ-
ing text normalization and G2P functions. They regard
these two front-end tasks as two neural machine trans-
lation (NMT) tasks and use Transformer for modeling.
Byte pair encoding (BPE) technology [181] is also used
to process uncommon words, and the splicing technique
is used for long texts, which improves the accuracy of
prediction and the quality of synthesized speech.
Introduction of style information The text front-end
can also directly add additional style information to the
TTS system to provide the synthesized speech style fea-
tures. For example, Tahon et al. [203] added a pronun-
ciation adaptive framework based on CRF between text
front-end and TTS model to generate different styles of
speech. In order to make the synthesized speech closer
to human voice, Székely et al. [201] took the front and
back utterances of an utterance and the breath pronun-
ciation events between them as a data set to learn the
breath location information of the context, thus adding
human breath information into the training data. The
forward and backward breath predictors were also used
to predict the location of breath more accurately.
Contextual information extraction The text front-end
model can also extract the contextual information of
the text. The extracted additional contextual informa-
tion can be input into the acoustic model as prior knowl-
edge. For example, Hayashi et al. [70] directly used
BERT as a context feature extraction network to en-
code input text, and added encoded word or sentence-
level contextual information to the input of the encoder
of the acoustic model to improve the quality of synthe-
sized speech. In order to obtain the phrase structure
of the sentence and word relationship information, Guo
et al. [65] used the factor parser [107] in the Stanford
parser to extract the syntactic tree. The embedding vectors of the extracted syntactic features and the input tokens are then combined as the input of the acoustic
model encoder, enabling TTS models to correctly syn-
thesize speech when facing some ambiguous sentences.
In order to improve the quality of synthesized speech,
GraphSpeech [126] inputs syntactic knowledge as ad-
ditional contextual information into the self-attention
module of Transformer-TTS [118]. The syntax tree of
the input text is converted into a syntax graph to model
the language relation between any two characters in the
input text, describe the global relation between the in-
put characters and extract grammatical features of the
text.
Unified text front-end To reduce the cumulative train-
ing error of each part and simplify the model, the com-
ponents of the text front-end with various functions can
be combined together. Pan et al. [155] proposed a Man-
darin text front-end model that unifies a series of text
processing components, which can directly convert the
original text into linguistic features. Firstly, the original
text is normalized by the method proposed by Zhang
et al. [258]. Then, the Word2Vec model is used to con-
vert sentences into character embedding, and an auxil-
iary model composed of dilated convolution or Trans-
former encoder is used to predict CWS and POS respec-
tively. Finally, the results are embedded and combined
with the original characters as the input of the main
module to jointly predict the labels of phoneme, tone
and prosody.
3 Acoustic model
Tacotron [227] is the first end-to-end acoustic model
based on deep learning, and it is also the most widely
used acoustic model. It can synthesize acoustic features
directly from text, and then synthesize speech wave-
forms using the Griffin-Lim algorithm [62]. Tacotron
is based on the Seq2Seq architecture of encoder-decoder
with attention mechanism. The encoder is composed of
the CBHG network and is used to encode the input
text. The CBHG network includes convolution bank,
highway networks and Bi-GRU [38]. The decoder consists of an RNN with an attention mechanism that aligns the output
of the encoder with the mel-spectrogram to be gener-
ated. Finally, the decoder maps the output sequence of
the encoder to the mel-spectrogram in an autoregressive
manner [220]. The autoregressive generative method is
to decompose the joint probability p(x) of the acoustic feature sequence x = {x_1, x_2, . . . , x_T} into:

p(x) = \prod_{i=0}^{T-1} p(x_{i+1} | x_1, x_2, . . . , x_i)    (1)

This means that the acoustic features of the n-th frame are generated under the condition of the previous n-1
frames. In order to increase the speed of synthesizing
mel-spectrogram, Tacotron generates multiple frames
of mel-spectrogram at each decoding step.
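As an illustration of this factorization, the following minimal Python/NumPy sketch generates a mel-spectrogram block by block, feeding each prediction back as the next decoder input and emitting r frames per decoding step, as Tacotron does with its reduction factor. The decoder_step callable is a hypothetical stand-in for the attention RNN decoder, not part of any specific implementation.

import numpy as np

def decode_autoregressively(decoder_step, n_mels=80, r=2, max_steps=200):
    # Sketch of Eq. (1): each new block of r mel frames is predicted
    # conditioned only on the previously generated frames.
    # decoder_step(prev_frame, state) -> (frames of shape (r, n_mels), new_state, stop_flag)
    prev_frame = np.zeros(n_mels)            # all-zero <GO> frame
    state = None                             # recurrent decoder state
    outputs = []
    for _ in range(max_steps):
        frames, state, stop = decoder_step(prev_frame, state)
        outputs.append(frames)
        prev_frame = frames[-1]              # feed the last predicted frame back
        if stop:                             # predicted stop token ends decoding
            break
    return np.concatenate(outputs, axis=0)   # (T, n_mels) mel-spectrogram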
Although Tacotron is better than most SPSS mod-
els, it still has the following four disadvantages:
– The decoder in Tacotron is composed of RNN and synthesizes acoustic features in an autoregressive manner, which introduces a time-series dependence. Therefore, it cannot be calculated in parallel, resulting in slow training and inference speed.
– Tacotron uses a content-based attention mechanism, thus the synthesized speech will have many errors, such as mispronunciations, missed words and repetitions.
– Tacotron cannot synthesize speech with a specific emotion and rhythm.
– Tacotron needs a lot of high-fidelity speech data during training to get good results.
In order to overcome these disadvantages in Tacotron,
researchers have proposed many new acoustic models
based on Tacotron. The following will introduce various
improvement methods for the above four disadvantages.
3.1 Fast acoustic model
Although Tacotron can synthesize high-fidelity speech
that is close to human voice, it cannot be used in prac-
tical applications due to its slow training and infer-
ence speed. The training and inference speed of the acoustic model can be improved by improving the RNN network, improving the autoregressive generation method, or using streaming methods.
3.1.1 Non-RNN acoustic model
Multi-layer CNN can replace RNN to capture the long-
term dependence of the context, and can speed up train-
ing and inference in the way of parallel computing. For
example, Tacotron 2 [189] replaces the complex CBHG
and GRU structures with simple LSTM [74] and CNN
structures on the basis of Tacotron. Deep Voice 3 [160]
uses residual gated convolution [42, 56] instead of RNN
to capture contextual information, where the encoder
and decoder are composed of non-causal and causal
CNNs. DCTTS [202] replaces RNN with CNN on the
basis of Tacotron, which consists of Text2Mel and Spec-
trogram Super Resolution Network (SSRN).
In addition to CNN, other networks can be used
instead of RNN to achieve parallel computing. For ex-
ample, Li et al. [118] proposed to use Transformer to
replace the RNN and attention networks in Tacotron 2,
thereby increasing the computational efficiency by us-
ing the multi-head self-attention in Transformer to gen-
erate the hidden states of encoder and decoder in paral-
lel. Bi et al. [14] proposed that the deep feed-forward se-
quential memory network (DFSMN) [260] with a struc-
ture similar to dilated-CNN [151] can be used to replace
RNN in the acoustic model. The quality of speech generated by the DFSMN-based model is similar to that of the RNN-based model, while the model complexity and training time are reduced.
3.1.2 Non-autoregressive acoustic model
Although the above models improve the computational
efficiency by means of parallel computation, they still
need to generate acoustic features frame by frame in
an autoregressive manner [220] during inference, re-
sulting in a very slow generation speed. Therefore, if
acoustic features can be generated in parallel, the gen-
eration speed will be greatly improved. However, it is
difficult for the acoustic model based on the attention
mechanism to learn the correct alignment between in-
put and output if the mel-spectrogram is directly gen-
erated in parallel in a non-autoregressive manner. In
order to solve this problem, FastSpeech [172], SpeedyS-
peech [215], ParaNet [159], FastPitch [117] and other
models introduced a teacher network to replace the
implicit autoregressive alignment method of the tra-
ditional seq2seq model through knowledge distillation.
The autoregressive teacher network can guide the non-
autoregressive network to learn correct attention align-
ment.
FastSpeech consists of the feed-forward Transformer
networks, which can generate acoustic feature frames in
parallel under the guidance of the length regulator. The
length regulator aligns each language unit with a cor-
responding number of acoustic frames in a manner pro-
vided by the autoregressive teacher network. However,
the Transformer module is complex and has a large
number of parameters. To reduce model parameters and
further improve the speed of training and inference, De-
viceTTS [79], SpeedySpeech [215], TalkNet [12], and
Parallel Tacotron [49] replace the Transformer mod-
ule in FastSpeech with simple DFSMN [260], residual
dilated-CNN, CNN and lightweight convolution (LConv)
[231], respectively.
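A minimal sketch of the length-regulator idea is given below (assuming integer phoneme durations are already available, e.g. distilled from an autoregressive teacher): each phoneme's hidden vector is simply repeated according to its duration so that the expanded sequence matches the length of the target mel-spectrogram and can be decoded in parallel.

import numpy as np

def length_regulate(phoneme_hidden, durations):
    # phoneme_hidden: (L, d) encoder hidden states, one per phoneme.
    # durations: (L,) integer number of mel frames assigned to each phoneme.
    # Returns a (sum(durations), d) sequence for the parallel decoder.
    return np.repeat(phoneme_hidden, durations, axis=0)

# Toy usage: 3 phonemes with durations 2, 3 and 1 expand to 6 frames.
h = np.random.randn(3, 8)
print(length_regulate(h, np.array([2, 3, 1])).shape)  # (6, 8)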
The training process of models such as FastSpeech, SpeedySpeech, and ParaNet is complicated by the use of knowledge distillation. To simplify the training process, other generative models, such as normalizing flows and generative adversarial networks (GANs), can be used to avoid the autoregressive generation and knowledge distillation processes. Glow-TTS [98] uses
the Glow [101] normalizing flow instead of Transformer
as the decoder to generate mel-spectrogram in parallel
(the Glow normalizing flow will be described in detail in
Sect. 4.1.2). Flow-TTS [138] also uses a Glow-based de-
coder to generate mel-spectrogram non-autoregressively.
Donahue et al. [47] proposed an end-to-end TTS model
EATS based on GAN-TTS [17], which directly syn-
thesized speech non-autoregressively using GAN. Table
1 lists the methods to improve training and inference
speed of each model.
3.1.3 Streaming acoustic model
Although the training and inference speed of TTS mod-
els has been greatly improved, most of the current mod-
els can only output speech after inputting an entire sen-
tence. The longer the sentence, the longer the waiting time; that is, the system responds to the input with a delay, which seriously degrades the human-computer interaction experience. To solve this problem, some re-
searchers have proposed streaming incremental TTS
systems [50, 133, 198, 235], which can output speech in
real time while inputting text, because they only need
to see a few characters or words to synthesize speech.
The streaming system can generate new audio while the
user plays the audio, which greatly improves the appli-
cability of the TTS system and the user experience. It
can be applied in the fields of simultaneous translation,
dialog generation, and assistive technologies [133].
Traditional acoustic models with complete sentences
as input can rely on the full linguistic context (i.e., past
and future words) to construct their internal repre-
sentations for acoustic features, thus generating high-
quality speech. However, due to the limited contextual
information that streaming acoustic models can obtain,
it is a challenge to effectively model the overall prosodic
structure of speech. Yanagita et al. [235] proposed the
streaming neural TTS model for the first time. In order
to learn the intra-sentence boundary features, they used
the start, middle and end symbols to split the train-
ing sentence into multiple subunits, which were used to
train the Tacotron. They also allow the model to learn the acoustic time series within one full sentence by taking the last vector of the mel-spectrogram from the previous units as the initial input for each unit. Finally,
the entire sentence is synthesized by incrementally syn-
thesizing blocks consisting of one or more words with
symbols.
This method needs to preprocess the training data,
and only considers the previous information, which will
cause the prosodic error of synthesized speech. In or-
der to solve this problem, Ma et al. [133] borrowed the
idea of prefix-to-prefix framework of simultaneous ma-
chine translation [132]. When generating acoustic fea-
tures and speech waveforms incrementally, not only the
previous results but also the information of the follow-
ing words should be used as the condition. Stephen-
son et al. [198] also proposed that the following words
should be considered when incrementally encoding each
word. They use Bi-LSTM to encode the first word to
the following few words of the word to be synthesized,
and then input the resulting embedding vector into the
decoder. Finally, the speech segments will be cropped
[104] and spliced. Ellinas et al. [50] proposed a stream-
ing inference method, which can input the generated
acoustic frames into the vocoder before the inference
process of the acoustic model is completed. They accu-
mulate the output frames from each decoding step in a
buffer, and when the buffer includes enough frames to
accommodate the total receptive field of the convolu-
tional layers in post-net, the acoustic frames are passed
to post-net in a larger batch. The post-net is trained to
refine the entire acoustic frames sequence. The acoustic
frames in the buffer are partially redundant to consider
the contextual information of the acoustic frame to be
synthesized. Stephenson et al. [199] used the language
model GPT-2 [169] to predict the next word in the in-
put text, thereby improving the naturalness of speech
synthesized by the incremental TTS model by utilizing
the predicted contextual information.
3.2 Robust acoustic model
The neural TTS models based on autoregressive genera-
tive method and attention mechanism have been able to
generate speech that is as natural as human voice. How-
ever, these models are not as robust as traditional meth-
ods. During training, the autoregression-based models need to first decide whether they should stop when predicting each frame. Therefore, an incorrect prediction of a single frame can result in serious errors, such as ending the generation process early. Moreover, there
are almost no constraints in the attention mechanism
of the acoustic model to prevent problems such as repe-
tition, skipping, long pauses, or nonsense. These errors
are rare and therefore usually do not show up in small
test sets such as those used in subjective listening tests.
However, in customer-oriented products, even if there is
only a small probability of such problems, it will greatly
reduce the user experience. Therefore, many improved
methods for autoregressive generative model and atten-
tion mechanism widely used in neural TTS models have
been proposed.
3.2.1 Stable autoregressive generation process
In order to improve the training convergence speed, autoregressive TTS models such as Tacotron use natural acoustic feature frames as the decoder input for teacher-forcing training in the training stage, while in the inference stage they use the previously predicted acoustic feature frames as the decoder input to generate speech in free-running mode. The distribution of the data predicted by the model is different from the distribution of the real data used in the training process, and the discrepancy between these two distributions can quickly accumulate errors during decoding, resulting in exposure bias and wrong results, such as skipping, repeated words, incomplete synthesis and inappropriate prosodic phrase breaks. This also means that the model can only be used to synthesize short sentences, because the sound quality deteriorates as the length of the synthesized sentence increases.
A simple method to reduce exposure bias is sched-
uled sampling [13], in which acoustic feature frames of
the current time step are predicted by using natural
acoustic feature frames or those predicted by the pre-
vious time step with a certain probability [141, 155].
However, due to the inconsistency between the natural
speech frames and the predicted speech frames during
the scheduled sampling, the temporal correlation of the
acoustic feature sequence is destroyed, leading to the
decline of the quality of the synthesized speech.
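The following sketch illustrates scheduled sampling for one utterance: at each decoding step the decoder is fed the natural frame with probability teacher_prob and its own previous prediction otherwise; this probability is typically annealed during training. The decoder_step callable is again a hypothetical stand-in.

import numpy as np

def scheduled_sampling_decode(decoder_step, target_frames, teacher_prob, rng=None):
    # target_frames: (T, n_mels) natural acoustic feature frames.
    # teacher_prob: probability of feeding the natural frame as the next input.
    if rng is None:
        rng = np.random.default_rng()
    T, n_mels = target_frames.shape
    prev, state, predictions = np.zeros(n_mels), None, []
    for t in range(T):
        pred, state = decoder_step(prev, state)
        predictions.append(pred)
        # Mix ground truth and model prediction as the next decoder input.
        prev = target_frames[t] if rng.random() < teacher_prob else pred
    return np.stack(predictions)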
To avoid this problem, Guo et al. [66] proposed
to use the Professor Forcing [116] method for train-
ing, which is a GAN-based adversarial training method.
The model is composed of a generator and a discrimi-
nator. The generator generates the output sequence in
the manner of teacher forcing and free running, respec-
tively. The discriminator, based on the self-attention GAN (SAGAN) [257], is used to determine in which way the output sequence was generated. They reduce the exposure bias by introducing into the training objective of the generator an additional term that minimizes the gap between the output sequences generated by the two methods, although this solution is not sufficiently stable or simple. Liu et al. [125] proposed the random descent
method, which first uses the natural acoustic features
as the input of the decoder for the first round of teacher
forcing training, and then replaces the natural acous-
tic features with the acoustic features generated in the
first round for the second round of teacher forcing training.
Table 1 Methods to improve the training and inference speed of each acoustic model

Acoustic model | Neural network types | Generative model types | Characteristics
Tacotron (Wang et al., 2017) | CBHG, GRU | Autoregression | Synthesizes speech end-to-end; the structure is complex; training and inference are slow
Deep Voice 3 (Ping et al., 2017) | CNN | Autoregression | Based on CNN; training and inference are faster than Tacotron
DCTTS (Tachibana et al., 2018) | CNN | Autoregression | Based on CNN; training and inference are faster than Tacotron
Tacotron 2 (Shen et al., 2018) | LSTM, CNN | Autoregression | The structure is simpler than Tacotron
Transformer-TTS (Li et al., 2019) | Transformer | Autoregression | Based on Transformer; training and inference are faster than Tacotron
FastSpeech (Ren et al., 2019) | Transformer | Non-autoregression | Trained through knowledge distillation; training is slow, inference is fast
ParaNet (Peng et al., 2020) | CNN | Non-autoregression | Trained through knowledge distillation; based on CNN; the structure is simpler than FastSpeech
EATS (Donahue et al., 2020) | CNN | GAN | Based on CNN and GAN; training and inference are fast; fully end-to-end
Glow-TTS (Kim et al., 2020) | Transformer, Glow | Normalizing flow | Based on normalizing flow; training and inference are fast
SpeedySpeech (Vainer and Dušek, 2020) | CNN | Non-autoregression | Trained through knowledge distillation; based on CNN; the structure is simpler than FastSpeech
TalkNet (Beliaev et al., 2020) | CNN | Non-autoregression | Based on CNN; training and inference are faster; the structure is simpler than FastSpeech
Flow-TTS (Miao et al., 2020) | Glow | Normalizing flow | Based on normalizing flow; training and inference are fast
DeviceTTS (Huang et al., 2020) | DFSMN, RNN | Combination of autoregression and non-autoregression | Based on DFSMN; the structure is simpler than FastSpeech
Parallel Tacotron (Elias et al., 2020) | LConv | Non-autoregression | Based on LConv; the structure is simpler than FastSpeech
FastPitch (Łańcucki, 2020) | Transformer | Non-autoregression | Trained through knowledge distillation; training is slow, inference is fast
The model is trained for multiple iterations to mini-
mize the gap between the generated acoustic features
and the natural acoustic features, thereby reducing the
exposure bias. Liu et al. [127] also proposed a method
based on knowledge distillation to reduce exposure bias,
which is to train a teacher model first, and then use it
to guide the training of the student model. The teacher
model uses ground-truth data for training, and the stu-
dent model uses the predicted value of the previous
time step to guide the prediction of the next time step.
Knowledge distillation is performed by minimizing the
distance between the hidden states of the decoder at
each time step of the two models.
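A minimal sketch of this distillation objective, under the assumption that an L2 distance is used, is shown below: the hidden states of the free-running student decoder are pulled toward those of the teacher-forced teacher decoder at every time step, and the resulting term is added to the usual reconstruction loss.

import numpy as np

def hidden_state_distillation_loss(teacher_states, student_states):
    # teacher_states, student_states: (T, d) decoder hidden states of the
    # teacher-forced model and the free-running student model.
    return np.mean((teacher_states - student_states) ** 2)

def total_loss(recon_loss, teacher_states, student_states, alpha=1.0):
    # Combine the ordinary spectrogram reconstruction loss with the
    # distillation term; alpha is a hypothetical weighting factor.
    return recon_loss + alpha * hidden_state_distillation_loss(teacher_states, student_states)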
When the target sequence is generated by autore-
gressive method, the previous wrong token will affect
the next one. The acoustic feature sequence is usually
longer than the target sequence of other sequence learn-
ing tasks (such as NMT). Therefore, the results of the
TTS task will be more susceptible to error propagation, so the right part of the generated acoustic feature sequence is usually worse than the left part. Ren et al. [173] used the bidirectional sequence
modeling (BSM) technique to alleviate error propaga-
tion. They generated acoustic feature sequences from
left to right and from right to left respectively to pre-
vent the model from generating sequences with poor
quality on one side. Zheng et al. [267] proposed two
BSM methods for acoustic models, which take full ad-
vantage of the autoregressive model at the initial it-
eration stage and reduce errors in synthesized speech
by adding bidirectional decoding regularization term
to the loss function during training. The first method
is to construct two acoustic models that generate the
mel-spectrogram from front to back and from back to
front respectively, and then minimize the difference be-
tween the output mel-spectrogram of the two models.
The second method is to use two decoders to generate
mel-spectrogram forward and backward while sharing
an encoder, and then minimize the difference between
the state or attention weight values of the two decoders
at each time step. Moreover, Vainer and Dušek [215] employed three data augmentations on the input mel-spectrogram to improve the robustness of the model to error propagation during autoregressive generation, as sketched after the following list:
– A small amount of Gaussian noise is added to each spectrogram pixel.
– The model outputs are simulated by feeding the input spectrogram through the network, without gradient updates, in parallel mode.
– The input spectrograms are degraded by randomly replacing several frames with random frames, thereby encouraging the model to use temporally more distant frames.
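A minimal sketch of the first and third degradations is given below (Gaussian noise on every spectrogram bin and random frame replacement); the second augmentation is omitted because it requires running the network itself. The parameter values are illustrative assumptions.

import numpy as np

def degrade_spectrogram(mel, noise_std=0.01, replace_frac=0.05, rng=None):
    # mel: (T, n_mels) input mel-spectrogram.
    if rng is None:
        rng = np.random.default_rng()
    mel = mel + rng.normal(0.0, noise_std, size=mel.shape)   # Gaussian noise on each bin
    T = mel.shape[0]
    n_replace = max(1, int(replace_frac * T))
    dst = rng.choice(T, size=n_replace, replace=False)       # frames to overwrite
    src = rng.choice(T, size=n_replace, replace=True)        # random source frames
    mel[dst] = mel[src]                                      # random frame replacement
    return mel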
When acoustic features are generated by autoregres-
sive acoustic models, there is a problem of local infor-
mation preference [29, 124], that is, the acoustic feature
frames to be generated by the current time step are
completely dependent on the acoustic feature frames
generated by the previous time step, and are indepen-
dent of the text conditions. In order to avoid ignoring
text information during synthesis and thus generating
wrong speech, Liu et al. [124] learned from the idea of
InfoGAN [28] and proposed to use an additional auxil-
iary CTC recognizer to recognize the predicted acous-
tic features. The predicted acoustic features are used
to restore the corresponding input text. This method
essentially maximizes the mutual information between
the predicted acoustic features and the input text to
enhance the dependence between them.
3.2.2 Accurate alignment
Similar to other Seq2Seq models, many TTS models
use the attention mechanism to align input text with
output spectrograms. The attention mechanism allows
the output of the decoder at each step to focus on a
subset of hidden states of the encoder, and the result
directly controls the duration and rhythm of the syn-
thesized speech. The main structure of the attention
mechanism is shown in Fig. 3, which can be expressed
as [25]:
(h_1, h_2, . . . , h_L) = Encoder(x_1, x_2, . . . , x_L)    (2)

s_i = Attention(s_{i-1}, c_{i-1}, y_{i-1})    (3)

e_{i,j} = f_a(s_i, h_j)    (4)

α_{i,j} = f_d(e_{i,j})    (5)

c_i = \sum_j α_{i,j} h_j    (6)

y_i = Decoder(y_{i-1}, c_i, s_i)    (7)

where {x_j}_{j=1}^{L} is the input sequence, L is the length of the input sequence, {h_j}_{j=1}^{L} are the hidden states of the encoder, c_i is the context vector, α_{i,j} are the attention weights over the input, s_i is the hidden state of the decoder, e_{i,j} are energy values, y_i is the output token, f_a is the alignment function, f_d is the distribution function, and the forms of f_a and f_d depend on the specific attention mechanism.
Fig. 3 Attention mechanism structure: the encoder maps (x_1, . . . , x_L) to hidden states (h_1, . . . , h_L); the alignment function produces energies e_{i,j}, the distribution function produces weights α_{i,j}, and their element-wise combination with the encoder states gives the context vector c_i used by the decoder, together with its state s_i, to produce the output y_i
First, the input sequence (x_1, x_2, . . . , x_L) is encoded by the encoder and transformed into (h_1, h_2, . . . , h_L). Then, the hidden states {s_i}_{i=1}^{T} of the decoder are generated by the attention network, and the corresponding weights {α_{i,j}}_{j=1}^{L} of the encoder states at the i-th time step are calculated from s_i. The context vector c_i is a linear combination of the attention weights {α_{i,j}}_{j=1}^{L} and the encoder states {h_j}_{j=1}^{L}. Finally, the decoder generates the output token y_i using the current context vector c_i and hidden state s_i.
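The following NumPy sketch puts Eqs. (4)-(6) together for a single decoding step, using a plain dot product as the alignment function f_a and a softmax as the distribution function f_d; both are placeholders, since the actual forms depend on the specific attention mechanism.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(h, s_i):
    # h: (L, d) encoder hidden states; s_i: (d,) current decoder state.
    e_i = h @ s_i              # Eq. (4): energies from a placeholder alignment function
    alpha_i = softmax(e_i)     # Eq. (5): attention weights over the input
    c_i = alpha_i @ h          # Eq. (6): context vector as a weighted sum of encoder states
    return c_i, alpha_i

# Toy usage with L = 6 input tokens and d = 16.
h = np.random.randn(6, 16)
s = np.random.randn(16)
c, alpha = attention_step(h, s)
print(c.shape, round(alpha.sum(), 3))  # (16,) 1.0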
Since the order and position of input text and out-
put speech in TTS task are corresponding, attention
alignment in TTS is a surjective mapping from the out-
put frames to the input tokens and should satisfy the following strict criteria [71]:
Locality Each output frame should be aligned around
a single input token to avoid attention collapse.
Monotonicity The position of the aligned input to-
ken must never rewind backward to prevent repeat-
ing.
Completeness Each input token should be covered, i.e., aligned with at least one output frame, to avoid skipping.
The original Tacotron model uses the content-based
attention mechanism proposed by Bahdanau et al. [7].
In this case, Eq. (4) is:
e_{i,j} = v^T tanh(W s_i + V h_j + b)    (8)

where W s_i and V h_j represent the query and the key, respectively.
The content-based attention mechanism does not
consider the position information of each item in the
sequence at all, and can not effectively utilize the mono-
tonicity and locality of alignment, thus alignment errors
are common. In order to enable the attention mecha-
nism to consider the positon information of input and
output, and thus enhance the generalization ability of
synthesizing long sentences, Char2wav [196], Voiceloop
[204] and Melnet [221] adopted the Gaussian mixture
model (GMM) attention mechanism proposed by Graves
[61] to replace the content-based attention mechanism
in Tacotron. This method is a purely location-based at-
tention mechanism, which uses an unnormalized mix-
ture of K Gaussians to produce the attention weights,
α_{i,j}, for each encoder state:

α_{i,j} = \sum_{k=1}^{K} \frac{w_{i,k}}{Z_{i,k}} \exp\left( -\frac{(j - µ_{i,k})^2}{2 σ_{i,k}^2} \right)    (9)

µ_{i,k} = µ_{i-1,k} + ∆_{i,k}    (10)

where w_{i,k}, Z_{i,k}, ∆_{i,k} and σ_{i,k} are computed from the attention RNN state. The mean of each Gaussian component µ_{i,k} is computed using the recurrence relation in Eq. (10), which makes the mechanism location-relative and potentially monotonic if ∆_{i,k} is constrained to be positive. Although this location-based attention mechanism can enhance the generalization ability of acoustic models for long sentences, it sacrifices some of the naturalness of the synthesized speech.
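A minimal sketch of Eqs. (9)-(10) follows: the weight of each encoder position j is an unnormalized mixture of K Gaussians whose means can only move forward when ∆_{i,k} is positive. The per-component parameters would normally be predicted from the attention RNN state; here they are simply passed in.

import numpy as np

def gmm_attention_weights(mu_prev, w, delta, sigma, Z, L):
    # mu_prev, w, delta, sigma, Z: (K,) mixture parameters for decoder step i.
    # L: number of encoder states. Returns (alpha_i of shape (L,), mu_i).
    mu_i = mu_prev + delta                                   # Eq. (10)
    j = np.arange(L)[:, None]                                # (L, 1) encoder positions
    gauss = np.exp(-((j - mu_i) ** 2) / (2.0 * sigma ** 2))  # (L, K)
    alpha_i = (w / Z * gauss).sum(axis=1)                    # Eq. (9)
    return alpha_i, mu_i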
In order to combine content and location informa-
tion in alignment, Tacotron 2 uses the hybrid location-
sensitive attention mechanism [31]. In this case, Eq. (4)
is:
e_{i,j} = v^T tanh(W s_i + V h_j + U f_{i,j} + b)    (11)

where U f_{i,j} represents the location-sensitive term, which uses convolutional features computed from the previous attention weights {α_{i-1,j}}_{j=1}^{L}. This method combines the content and location features to make the alignment more accurate by additionally introducing the previous attention weight information.
Based on the monotonicity of alignment between in-
put and output sequences in TTS, various monotonic
attention mechanisms have been proposed to reduce
errors in attention alignment. In order to introduce
monotonicity into the hybrid location-sensitive attention, Battenberg et al. [10] proposed Dynamic Convolution Attention (DCA), which removes the content-based terms W s_i and V h_j, leaving only the location-sensitive term U f_{i,j} as static filters, while adding a set of learned dynamic filters T g_{i,j} and a single fixed prior filter p_{i,j}. In this case, Eq. (4) is redefined as:

e_{i,j} = v^T tanh(U f_{i,j} + T g_{i,j} + b) + p_{i,j}    (12)

Similar to the static filters U f_{i,j}, the dynamic filters T g_{i,j} are computed from the attention RNN state and serve to dynamically adjust the alignment relative to the alignment at the previous step. The prior filter p_{i,j} is used to bias the alignment toward short forward steps. This monotonic DCA has stronger generalization ability and is more stable.
Raffel et al. [170] proposed a monotonic alignment
method that can be applied to TTS: monotonic atten-
tion (MA). At each step i, MA inspects the memory
entries from the memory index t_{i-1} it focused on at the previous step and evaluates the "selection probability" p_{i,j}:

p_{i,j} = σ(e_{i,j})    (13)

where σ is the logistic sigmoid function and the energy values e_{i,j} are produced as in Eq. (4). Starting from j = t_{i-1}, at each time MA would sample z_{i,j} ∼ Bernoulli(p_{i,j}) to decide whether to keep j unmoved (z_{i,j} = 1) or move to the next position (z_{i,j} = 0). j would keep moving forward until reaching the end of the inputs, or until receiving a positive sampling result z_{i,j} = 1, and when j stops, the memory h_j would be directly picked as c_i. With such a
restriction, it is guaranteed that solely one input unit
would be focused on at each step, and its position would
never rewind backward. Moreover, the mechanism only
requires linear time complexity and supports online in-
puts, which could be efficient in practice.
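A sketch of the hard MA decision process at inference time, under the semantics described above, is given below: starting from the previously attended index, z ∼ Bernoulli(p_{i,j}) is sampled at each position and the index keeps moving right until z = 1 is drawn (or the end of the input is reached); the attended memory entry is then used directly as the context vector.

import numpy as np

def monotonic_attention_step(p_i, t_prev, h, rng=None):
    # p_i: (L,) selection probabilities sigma(e_{i,j}) for the current output step.
    # t_prev: index attended at the previous output step; h: (L, d) encoder states.
    if rng is None:
        rng = np.random.default_rng()
    L = len(p_i)
    j = t_prev
    while j < L - 1 and rng.random() >= p_i[j]:  # z_{i,j} = 0: move to the next position
        j += 1
    return h[j], j                               # z_{i,j} = 1 (or end of input): stop here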
In contrast to the traditional “soft” attention using
continuous weights, MA, which simply selects one in-
put unit as the context vector c_i
, is a “hard” attention.
It can ensure the locality of attention alignment, but
it could not be trained by standard back-propagation
(BP) algorithm. Multiple approaches have been pro-
posed for this issue, including reinforcement learning
[122, 233, 246], approximation by beam search [186],
and approximation by soft attention for training [170].
To further guarantee the completeness of alignment,
He et al. [71] proposed stepwise monotonic attention
(SMA), which adds additional restrictions on MA: in
each decoding step, the attention alignment position
moves forward at most one step, and it is not allowed
to skip any input unit. The alignment of soft atten-
tion (SA), MA and SMA is shown in Fig. 4 [71]. The
color depth of each node in the figure represents the size
of the attention weight between each output acoustic
feature frame and the input phoneme. The darker the
color, the greater the value of attention weight. The fig-
ure shows that each acoustic feature frame is calculated
by multiple input phonemes in SA. Each acoustic fea-
ture frame is determined by an input phoneme in MA.
In SMA, not only each acoustic feature frame is deter-
mined by an input phoneme, but all input phonemes
must be corresponding at least once, which ensures
the locality, monotonicity and completeness of atten-
tion alignment.
Zhang et al. [259] and Yasuda et al. [240] also pro-
posed similar monotonic attention mechanisms. Zhang
et al. [259] suggested that only the alignment paths sat-
isfying the monotonic condition are taken into consid-
eration at each decoder time step. The attention prob-
abilities of each time step can be computed recursively
using a forward algorithm, and a transition agent is
proposed to help the attention mechanism make deci-
sions whether to move forward or stay at each decoder
time step. This attention mechanism has the advan-
tages of fast convergence speed and high stability. Ya-
suda et al. [240] also proposed a hard monotonic atten-
tion mechanism. The framework and likelihood function
are similar to those of a hidden Markov model (HMM).
The constrained alignment is conceptually borrowed
from segment-to-segment neural transduction (SSNT)
[244, 245]. They factorized the generation probability
for acoustic features into an alignment transition prob-
ability and emission probability, thereby constraining
the alignment process to moving from left to right, and
only one step at a time. Although this hard monotonic
alignment method can avoid some alignment errors that
are commonly observed in soft-attention-based meth-
ods, including muffling, skipping, and repeating, this
attention mechanism has poor stability and long train-
ing time.
In order to make more direct use of the correspon-
dence between text and speech in TTS, Tachibana et al.
[202] and Wang [225] added a guided attention loss to
content-based dot product attention [222]. More specif-
ically, they added an additional monotonic attention
loss to the original audio reconstruction loss, forcing the non-zero values of the attention weight matrix to be concentrated on the diagonal as much as possible. Further-
more, the forced increment attention was proposed to
force the text and speech to be aligned monotonously by
making the corresponding text position of acoustic fea-
ture frame at each time step move forward by at most
one. To produce monotonic alignment, Deep Voice 3
and ParaNet added positional encoding in Transformer
to the content-based dot product attention.
Fig. 4 The alignment of SA, MA and SMA: attention weight matrices between input text tokens and output acoustic feature frames
Besides, they added an attention window [125, 137] to the attention
during inference, calculated the attention weights only
for the input characters in the window, and took the po-
sition of the character with the largest attention weight
as the starting position of the next window. Moreover,
ParaNet adopted a multi-layer attention mechanism to
iteratively refine attention alignment in a layer-by-layer
manner.
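A minimal sketch of a diagonal guided-attention penalty in the spirit of Tachibana et al. [202] is shown below: attention mass that lies far from the diagonal of the (output frame, input token) matrix is penalized, and the resulting term is added to the audio reconstruction loss with some weight. The Gaussian-shaped penalty and the width g used here follow the common formulation but should be taken as an assumption.

import numpy as np

def guided_attention_loss(attn, g=0.2):
    # attn: (T, L) attention weight matrix (output frames x input tokens).
    T, L = attn.shape
    t = np.arange(T)[:, None] / T
    l = np.arange(L)[None, :] / L
    penalty = 1.0 - np.exp(-((l - t) ** 2) / (2.0 * g ** 2))  # zero on the diagonal
    return np.mean(attn * penalty)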
However, the use of positional encoding can cause
errors when synthesizing long sentences [119]. To syn-
thesize long sentences stably, Glow-TTS removes the
positional encoding and adds relative position represen-
tations [187] into self-attention modules instead. Robu-
Trans [119] counts on the 1-D CNN used in Encoder
Pre-net to model relative position information in a fixed
window. Moreover, in order to make the self-attention
in Transformer more suitable for TTS models, Robu-
Trans also uses Pseudo Non-causal Attention (PNCA)
to replace the traditional causal self-attention. The de-
coding process is more robust by providing the decoder
with the holistic view of the input sequence and the
frame-level context information.
As described in Sect. 3.1.2, a large number of non-
autoregressive acoustic models have been proposed re-
cently. TTS is a one-to-many mapping. For the same
text input, there are many possible speech expressions
with different prosody. To eliminate ambiguity in multi-
mode output, the acoustic models with autoregressive
decoders can predict the acoustic feature frames of the
next time step by combining the contextual information
provided by the acoustic feature frames generated by
the previous time step. However, acoustic models with
non-autoregressive decoders need to obtain contextual
information in other ways to select an appropriate gen-
eration mode. Non-autoregressive acoustic models need
to determine the output length in advance, rather than
predict whether to stop at each frame. In this case, in
order to align the inputs and outputs, a duration pre-
dictor similar to the one used in the traditional SPSS
method [247, 250] can be used instead of the attention
network. Aligning with a duration predictor can avoid
the errors of skipping, repeating, and irregular stops
caused by the attention mechanism. This method first
appeared in NMT [64], and then was introduced into
TTS through non-autoregressive acoustic models such
as FastSpeech [172]. Acoustic models with duration pre-
dictors can align input phonemes and output acoustic
features by introducing additional alignment modules
or using external aligners. Next, these two alignment
methods are introduced separately.
The most direct way to obtain the alignment infor-
mation is provided by an external aligner. For exam-
ple, FastSpeech extracts phoneme duration from a pre-
trained autoregressive model by knowledge distillation
[100]. However, FastSpeech lacks generalization abil-
ity for long utterances, especially those whose length
exceeds the maximum length of the utterance in the
training set. This may be because the self-attention is
a global modeling method. To use the local modeling
method to make network more stable, DeviceTTS [79]
replaces the Transformer with DFSMN, which makes
use of a latency control window size to learn the con-
text. To simplify the training process, JDI-T [121] jointly
trains the autoregressive Transformer teacher network
and the feed-forward Transformer student network. To
avoid the complicated knowledge distillation process,
some models use a separate external alignment model
to predict the target phoneme duration, thus estab-
lishing alignment between input phonemes and output
acoustic features. For example, TalkNet [12] uses the
CTC-based automatic speech recognition (ASR) model
Quartznet [111], FastSpeech 2 [174] uses the forced-
alignment tool MFA toolkit [135], DurIAN [243] uses
an external alignment model [51, 252], RobuTrans [119]
uses speech recognition tools, Parallel Tacotron [49] and
Non-Attentive Tacotron [190] use a speaker-dependent
HMM-based aligner with a lexicon [230]. To address
the difficulty of training an aligner due to data spar-
sity, Shen et al. [190] used fine-grained VAE (FVAE)
to achieve semi-supervised and unsupervised duration
prediction, that is, simply training the model using the
predicted durations instead of the target durations for
upsampling.
It is also possible to directly learn alignment by
training an alignment module within the model. For ex-
ample, AlignTTS [254] uses the dynamic programming
to consider all possible alignments in training, that is,
uses the alignment loss inspired by the Baum-Welch
algorithm [11, 206] to train the mix density network
for alignment. Glow-TTS uses the Monotonic Align-
ment Search (MAS) algorithm to predict the duration
of each input tokens by searching for the most probable
monotonic alignment between text and the latent rep-
resentation of speech. The internal aligner of EATS
[47] implicitly enhances the monotonicity of alignment
by predicting token lengths and obtaining positions us-
ing a cumulative sum operation. Moreover, the dynamic
time warping (DTW) loss and the aligner length loss
are introduced to learn alignment and ensure that the
model can accurately predict phoneme lengths. Flow-
TTS [138] trains a length predictor inside the model to
predict the output length in advance, and takes the po-
sitional encoding of the predicted spectrogram length
as query vector to align the input and output using the
positional attention module based on the multi-head
dot-product attention mechanism [222].
Since one-to-many regression problems like TTS can
benefit from autoregressive decoding, it is also possible
to combine the autoregressive method with duration
predictor to further improve the stability of TTS mod-
els, such as the alignment methods used in DurIAN,
Non-Attentive Tacotron [190], DeviceTTS and Robu-
Trans [119]. The alignment method of each model is
shown in Table 2.
3.3 Expressive acoustic model
The speech synthesized by deep learning methods tends to have a flat tone and lacks rhythm and expressiveness, so there is often a certain gap between it and the real human voice.
In order to synthesize expressive speech, three parts
need to be considered: ”what to say”, ”who to say”
and ”how to say”. ”What to say” is controlled by the
input text and the text front-end. ”Who to say” can
be controlled by collecting a large amount of voice data
of a person and then training the model to learn to
imitate the speaker’s voice. ”How to say” is controlled
by prosodic information such as tone, speech rate, and
emotion of the synthesized speech. In this paper, ”who
to say” and ”how to say” are collectively referred to as
the style features of synthesized speech.
3.3.1 Acoustic model with reference encoder
Style information can be introduced by adding a ref-
erence encoder to synthesize expressive speech. There
are mainly two methods based on reference encoders
that can be used to synthesize speech with a specific
style. The first method is to directly control various
speech style parameters, such as pitch, loudness, and
emotion, by using a trained reference encoder. The sec-
ond method is to input the reference audio into the ref-
erence encoder and use the style parameters encoded
by the reference encoder to transfer the speech style
features between the reference speech and the target
speech. Different methods and models have been pro-
posed to disentangle the different style feature informa-
tion so that each style feature can be easily controlled
individually to synthesize speech with the target style.
These methods and models are described in the follow-
ing paragraphs.
Skerry-Ryan et al. [192] divided the features of speech
into three components: text, speaker, and prosody. A
reference encoder is added to Tacotron to extract the
prosody embedding from the reference speech with a
specific style, and the speaker embedding is obtained
by using a speaker embedding lookup table. Then the
prosody embedding, speaker embedding and text em-
bedding are combined and input into the decoder to
synthesize speech with the style of the reference speech.
Gururani et al. [69] refined the model on the basis of
Skerry-Ryan et al. [192], divided the style features of
speech into pitch and loudness, and selected two 1-
D time series to model the fundamental frequency f0 and loudness of the reference speech respectively. In or-
der to transfer the emotion features in the reference
speech more accurately, Li et al. [120] added two emo-
tion classifiers after the reference encoder and decoder
respectively to enhance emotion classification ability in
the emotion space. Moreover, they adopted a style loss
[54, 88] to measure the style differences between the
generated and reference mel-spectrogram [55, 134].
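A reference encoder of this kind is typically a small convolutional-recurrent network that compresses a reference mel-spectrogram into a fixed-size embedding. The sketch below illustrates the idea under assumed layer sizes; it is not the exact architecture of any cited model.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Minimal reference encoder sketch: 2-D convolutions over the
    reference mel-spectrogram followed by a GRU; the final GRU state
    serves as a fixed-size prosody/style embedding."""
    def __init__(self, n_mels=80, embed_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # two stride-2 convolutions reduce the mel axis by roughly 4x
        self.gru = nn.GRU(64 * ((n_mels + 3) // 4), embed_dim, batch_first=True)

    def forward(self, mel):               # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))  # (batch, 64, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                # h: (1, batch, embed_dim)
        return h.squeeze(0)               # style embedding
```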
Voice conversion (VC) model can disentangle the
speaker-dependent timbre feature from speech [33, 34,
91, 165, 183], but cannot extract other style features
such as the content, pitch and rhythm of speech. In-
spired by the voice conversion model AutoVC [165],
Qian et al. [166] proposed SPEECHFLOW, which is
a speech style conversion model that can disentangle
the rhythm, pitch, content, and timbre information.
Rhythm, pitch and content features are extracted by
three encoders respectively, and timbre feature is rep-
resented by one-hot vector of speaker ID. SPEECH-
FLOW can be trained for speech style conversion by
Table 2 Alignment method of each acoustic model

| Acoustic model | Neural network types | Generative model types | Alignment methods | Characteristics |
|---|---|---|---|---|
| Tacotron (Wang et al., 2017) | CBHG, GRU | Autoregression | Content-based attention | Unstable, alignment errors often occur |
| Char2Wav (Sotelo et al., 2017) | RNN | Autoregression | GMM attention | Low naturalness of synthesized speech |
| Deep Voice 3 (Ping et al., 2017) | CNN | Autoregression | Dot-product attention, positional encoding, attention window | Attention is monotonic |
| VoiceLoop (Taigman et al., 2017) | Shifting buffer | Autoregression | GMM attention | Low naturalness of synthesized speech |
| DCTTS (Tachibana et al., 2018) | CNN | Autoregression | Dot-product attention and guided attention | Stable, alignment errors are rare |
| Tacotron 2 (Shen et al., 2018) | LSTM, CNN | Autoregression | Mixed location-sensitive attention | Able to synthesize long sentences accurately |
| DurIAN (Yu et al., 2019) | CBHG, RNN | Autoregression | Duration prediction model, external alignment model | Stable, alignment errors are rare |
| FastSpeech (Ren et al., 2019) | Transformer | Non-autoregression | Duration prediction model, knowledge distillation | Errors will occur when synthesizing long sentences |
| FastSpeech 2 (Ren et al., 2020) | Transformer | Non-autoregression | Duration prediction model, MFA toolkit | Stable, alignment errors are rare |
| ParaNet (Peng et al., 2020) | CNN | Non-autoregression | Dot-product attention, positional encoding, attention window, multi-layer attention, knowledge distillation | Attention alignment is monotonic and stable |
| EATS (Donahue et al., 2020) | CNN | GAN | Duration prediction model, internal alignment module | Stable, alignment errors are rare |
| Non-Attentive Tacotron (Shen et al., 2020) | RNN | Autoregression | Duration prediction model, external alignment module | Stable, alignment errors are rare |
| FastPitch (Łańcucki, 2020) | Transformer | Non-autoregression | Duration prediction model, knowledge distillation | Can control the pitch contour of synthesized speech |
| Glow-TTS (Kim et al., 2020) | Transformer, Glow | Normalizing flow | Duration prediction model, MAS algorithm | The alignment is monotonic and stable |
| AlignTTS (Zeng et al., 2020) | Transformer | Non-autoregression | Duration prediction model, internal alignment module | Stable, alignment errors are rare |
| SpeedySpeech (Vainer and Dušek, 2020) | CNN | Non-autoregression | Duration prediction model, knowledge distillation | Stable, alignment errors are rare |
| JDI-T (Lim et al., 2020) | Transformer | Non-autoregression | Duration prediction model, knowledge distillation | Joint training of teacher and student network, stable and alignment errors are rare |
| TalkNet (Beliaev et al., 2020) | CNN | Non-autoregression | Duration prediction model, ASR model | Stable, alignment errors are rare |
| Flow-TTS (Miao et al., 2020) | Glow | Normalizing flow | Multi-head dot-product attention, internal length predictor | High quality of synthesized speech, fast training and inference speed |
| DeviceTTS (Huang et al., 2020) | DFSMN, RNN | Combination of autoregression and non-autoregression | Duration prediction model | Stable, alignment errors are rare |
| Parallel Tacotron (Elias et al., 2020) | LConv | Non-autoregression | Duration prediction model, HMM-based aligner | Stable, alignment errors are rare |
| RobuTrans (Li et al., 2020) | Transformer | Autoregression | Duration prediction model, speech recognition tools | Stable, alignment errors are rare |
replacing the input of the three encoders with the spec-
trogram or pitch contour of the reference speech.
Similarly, in order to disentangle different style fea-
tures in speech and achieve the purpose of individually
controlling each feature, Wang et al. [228] introduced
a global style token (GST) network in Tacotron, which
plays a role of clustering. When the GST network is
trained with speech data with various styles, multiple
meaningful and interpretable tokens can be obtained.
The weighted sum of these tokens is used as a style
embedding to control and transfer the style features
of speech. In inference, a specific weight can be cho-
sen directly for each style token, or a reference signal
can be fed to guide the choice of token combination
weights. For the choice of token weight, Kwon et al.
[114] proposed a controlled weight (CW)-based method
to define the weight values by investigating the distri-
bution of each emotion in the emotional vector space.
Um et al. [214] proposed to improve the method of sim-
ply averaging the style embedding vectors belonging
to each emotion category [115] to determine the rep-
resentative weight vectors by maximizing the ratio of
inter-category distance to intra-category distance (I2I),
and proposed to apply the spread-aware I2I (SA-I2I)
method to change the emotion intensity instead of the
simple linear interpolation-based approach. Mellotron
[218] additionally introduces fundamental frequency (f0) information, and takes text, speaker, fundamental frequency f0, attention map, and GST as conditions when synthesizing speech, in which the speaker represents timbre, the fundamental frequency f0 represents pitch, the attention map represents rhythm, and GST represents prosody.
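The following sketch illustrates the core GST computation, a softmax-weighted sum of a small bank of learnable token embeddings; the dimensions, the scaled dot-product scoring, and the variable names are illustrative rather than the exact attention used in the cited work.

```python
import torch
import torch.nn.functional as F

# a small bank of learnable style tokens whose weighted sum acts as the
# style embedding; sizes are illustrative
num_tokens, token_dim = 10, 256
tokens = torch.nn.Parameter(torch.randn(num_tokens, token_dim))

def style_embedding(ref_embedding, query_proj):
    """ref_embedding: (batch, d_ref) output of the reference encoder.
    query_proj: an nn.Linear(d_ref, token_dim) projecting it to a query."""
    query = query_proj(ref_embedding)                  # (batch, token_dim)
    scores = query @ tokens.t() / token_dim ** 0.5     # (batch, num_tokens)
    weights = F.softmax(scores, dim=-1)                # token combination weights
    return weights @ tokens                            # (batch, token_dim)

# at inference, the weights can instead be chosen by hand, e.g. a one-hot
# selection of a single token to impose that token's style
```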
Since GST-Tacotron uses only paired input text and
reference speech for training, inputting unpaired text
and speech during synthesis will cause the generated
sound to become blurry. Moreover, in this case, the ref-
erence encoder may store some text information in the
reference embedding rather than prosody and speaker
information to reconstruct the input speech. Using the
idea of dual learning, Liu et al. [123] proposed to train
GST-Tacotron with unpaired text and speech, and in-
put the output mel-spectrogram into the ASR model
to predict the input text, thus preventing the reference
encoder from encoding any text information. Further-
more, they also use the regularization method of atten-
tion consistency loss to accelerate the training conver-
gence speed of both ASR and TTS models.
In order to control the style of synthesized speech
more flexibly, multiple reference encoders can be used
to extract different style features of multiple reference
speech respectively. For example, Bian et al. [15] used
multiple reference encoders based on GST network to
disentangle different style features, and proposed in-
tercross training technique to separate the style latent
space by introducing orthogonality constraints between
the extracted styles of each encoder. However, this in-
tercross training scheme does not guarantee each com-
bination of style classes is seen during training, caus-
ing a missed opportunity to learn disentangled repre-
sentations of styles and sub-optimal results on disjoint
datasets. Whitehill et al. [229] used an adversarial cy-
cle consistency training scheme to ensure the use of
information from all style dimensions to address the
challenges of multi-reference style transfer on disjoint
datasets. They achieved a higher rate of style transfer
for disjoint datasets than previous models.
Variational auto-encoder (VAE) [102] generates sam-
ples with specific features by sampling from the distri-
bution of latent variables. Latent variables are continu-
ous and can be interpolated, similar to the implicit style
features in speech. The speech style features learned by
VAE in an unsupervised manner can be easily sepa-
rated, scaled and combined. Therefore, there are many
tasks that use VAE to control the synthesized speech
style. The speech style features learned by VAE in an
unsupervised manner can be easily separated, scaled
and combined. Therefore, there are many works using
VAE to control the style of synthesized speech. For
example, Zhang et al. [263] added a VAE network to
Tacotron 2 to learn latent variables representing speech
style. Each dimension of latent variables represents a
different style feature. In order to further disentangle
the various style features of speech, Hsu et al. [76] pro-
posed GMVAE-Tacotron based on the Gaussian mix-
ture VAE network, with two levels of hierarchical latent
variables. The first level is a discrete latent variable,
representing a certain category of style (e.g. speaker
ID, clean/noisy). The second level is a continuous la-
tent variable approximated by the multivariate Gaus-
sian distribution. Each component represents the de-
gree of the feature (e.g. noise level, speaking rate, pitch)
under the category of the first level. In general, it is
equivalent to using the GMM to fit the distribution
of latent variables. This model can effectively factorize
and independently control latent attributes underlying
the speech signal.
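For reference, the two ingredients shared by these VAE-based style models are the reparameterized sampling of the style latent and the KL regularizer of the evidence lower bound; a minimal sketch follows, with tensor shapes and names chosen for illustration only.

```python
import torch

def sample_style_latent(mu, log_var):
    """Reparameterization trick: sample a style latent z ~ N(mu, sigma^2)
    so that gradients flow through mu and log_var. Each dimension of z
    can then be interpolated or scaled to vary one style factor."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_divergence(mu, log_var):
    """KL term of the usual VAE evidence lower bound, pushing the
    approximate posterior towards the standard Gaussian prior."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```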
However, these methods only model the global style
features of speech, without considering prosodic con-
trol at the phoneme and word levels. In order to model
acoustic features at various resolutions, Sun et al. [200],
in addition to modeling global speech features such as
noise and channel number, also modeled word-level and
phoneme-level prosodic features such as fundamental
frequency f0, energy and duration. They used a con-
ditional VAE with an autoregressive structure to make
prosodic features of each layer more interpretable and
to impose hierarchical conditioning across all latent di-
mensions. Parallel Tacotron [49] used two different VAE
models, one similar to Hsu et al. [76] for modeling global
features of speech such as different prosodic patterns of
different speakers, and the other similar to Sun et al.
[200] for modeling phoneme-level fine-grained features.
Normalizing flow can control the latent variables to
synthesize speech with different styles by learning an
invertible mapping of data to a latent space. For ex-
ample, Flowtron [219] applied the normalizing flow to
Tacotron to control speech variation and style transfer
by learning a latent space that stores non-textual infor-
mation. Glow-TTS [98] takes Glow [101] as the decoder
to control the style of synthesized speech by control-
ling the prior distribution of latent variables. It is also
possible to model speech style features with both nor-
malizing flow and VAE. Aggarwal et al. [2] used VAE
and Householder Flow [211] to improve the reference
encoder proposed by Skerry-Ryan et al. [192], thereby
enhancing the disentanglement capability of the TTS
system.
GAN can also be used in style speech synthesis. For
example, Ma et al. [134] enhanced the content-style dis-
entanglement ability and controllability of the model
by combining a pairwise training procedure, an adver-
sarial game, and a collaborative game into one train-
ing scheme. The adversarial game concentrates the true
data distribution, and the collaborative game minimizes
the distance between real samples and generated sam-
ples in both the original space and the latent space.
3.3.2 Acoustic model of explicit modeling style features
The prosody of the speech can also be controlled in-
tuitively by constraining the prosodic features of the
waveform. For example, Morrison et al. [141] proposed
a user-controllable, context-aware neural prosody gen-
erator that allows the input of the f
0
contour for certain
time frames and generates the remaining time frames
from input text and contextual prosody. CHiVE [96] is
a conditional VAE model with a hierarchical structure.
It can generate prosodic features such as fundamental
frequency f0, energy c0 and duration suitable for use
with a vocoder, and yield a prosodic space from which
meaningful prosodic features can be sampled. To effi-
ciently capture the hierarchical nature of the linguistic
input (words, syllables and phones), both the encoder
and decoder parts of the auto-encoder are hierarchical,
in line with the linguistic structure, with layers being
clocked dynamically at the respective rates.
In practical applications, since it is difficult to in-
terpret and give practical meaning to each of the la-
tent variables learned by unsupervised style separation
methods such as GST and VAE, FastSpeech uses a
length adjuster to replicate and expand the hidden state
of the phoneme sequence according to the duration of
each phoneme, thus intuitively controlling the speech
speed and some prosodic features.
FastPitch [117] adds a pitch prediction network to
FastSpeech to control pitch. Compared with FastSpeech
and FastPitch, FastSpeech 2 introduces more style fea-
tures such as pitch, energy, and more accurate duration
as conditional inputs to construct a variance adaptor,
and uses the trained energy, pitch, and duration predictors to synthesize speech with a specific style. DurIAN simply divides speech styles into several discrete categories, learns style embedding vectors from speech data with various styles through supervised learning, and controls the intensity of a style by multiplying its embedding by a scalar.
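The sketch below illustrates the variance-adaptor idea under assumed layer sizes: a small convolutional predictor estimates one scalar per position, and the quantized value is embedded and added back to the hidden sequence. It is a simplification, not the exact FastSpeech 2 architecture.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of a variance predictor: a small 1-D conv stack that predicts
    one scalar (pitch, energy or log-duration) per input position."""
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, h):                             # h: (batch, T, dim)
        x = self.conv(h.transpose(1, 2))              # (batch, hidden, T)
        return self.proj(x.transpose(1, 2)).squeeze(-1)  # (batch, T)

def add_variance(h, value, embedding, bins):
    """Quantize the predicted (or ground-truth) pitch/energy values into
    bins, embed them (embedding: nn.Embedding sized to the bins), and add
    the result back to the hidden sequence as a condition."""
    idx = torch.bucketize(value, bins)
    return h + embedding(idx)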
3.3.3 Multi-speaker acoustic model
Multi-speaker speech synthesis is also an important task
of TTS model. A simple way to synthesize the voices of
multiple speakers is to add a speaker embedding vector
to the input [57, 160]. The speaker embedding vector
can be obtained by additionally training a reference en-
coder. For example, Jia et al. [85], Arik et al. [4] and
Nachmani et al. [145] introduced a speaker encoder in
Tacotron 2, Deep Voice 3 and VoiceLoop [204] respec-
tively to encode the speaker information in the refer-
ence speech into a fixed-dimensional speaker embed-
ding vector. The embedding vector can be extracted
only from a small number of speech fragments of the
target speaker. The speech data corpus used to train
the speaker encoder only needs to contain the record-
ings of a large number of speakers, but does not need to
be of high quality. Even if the training data contains a
small amount of noise, the extraction of timbre features
will not be affected.
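A minimal sketch of this kind of conditioning, assuming the speaker embedding has already been obtained from a lookup table or a separately trained speaker encoder; broadcasting and concatenation are one common choice, not the only one used in the cited systems.

```python
import torch

def condition_on_speaker(encoder_out, speaker_emb):
    """encoder_out: (batch, T, d_text); speaker_emb: (batch, d_spk).
    Broadcast the speaker vector over time and concatenate it so that the
    decoder sees the speaker identity at every step."""
    spk = speaker_emb.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, spk], dim=-1)   # (batch, T, d_text + d_spk)
```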
The speaker adaptation can also be used for multi-
speaker speech synthesis. Arik et al. [4], Taigman et al.
[204], and Zhang et al. [265] fine-tune the trained multi-
speaker model using a small number of ⟨text, speech⟩
data pairs of the target speaker. Fine-tuning can be
applied to the speaker embedded vector [4, 204], part
of the model [265], or the whole model [4]. Moss et al.
[142] proposed a fine-tuning method to select different
model hyperparameters for different speakers, achieving
the goal of synthesizing the voice of a specific speaker
with only a small number of speech samples, in which
the selection of hyperparameters adopts the Bayesian
optimization method [184].
However, these methods are not very effective when
synthesizing the speech of unseen speakers. To solve this
problem, Cooper et al. [41] extracted speaker informa-
tion by using learnable dictionary encoding (LDE) on
the basis of Jia et al. [85], and inserted the speaker em-
bedding into both prenet layer and attention network
of Tacotron 2 as additional information. When training
the speaker encoder, Nachmani et al. [145] introduced a contrastive loss term and a cyclic loss term in addition to the MSE loss, which allows the model to synthesize the voice of a new speaker with only a small amount of audio. Cai et al.
[22] and Shi et al. [191] introduced an identity feedback
constraint by adding an additional loss term between
the reference embedding and the extraction embedding
of the synthesized signal, thus increasing the robustness
and speaker similarity of the produced speeches.
3.4 Low-resource acoustic model
Deep learning-based acoustic models need to be trained with a large number of high-quality ⟨text, speech⟩ data pairs to synthesize high-fidelity speech, and the dataset requirements are even higher when synthesizing speech with specific prosody and emotion. But for 95% of languages, and for audio with a specific style, such corpora are very scarce. Moreover, an English speech corpus used for TTS usually contains about 10-40 hours of speech data and no more than 20,000 distinct words. The largest public English speech corpus, LibriTTS [253], contains only 80,000 words, which is far lower than the size of the regular English vocabulary (usually 130,000-160,000 words). When synthesizing,
the acoustic model may mispronounce words outside
the training set. It is difficult to cover all vocabulary
just by increasing the number of training utterances,
because the natural frequency of words tends to fol-
low the Zipfian distribution [205], which means that
the number of new words contained in the speech data
per hour gradually decreases. Therefore, to achieve a
linear increase in word coverage would require an ex-
ponential increase in audio data, which would be costly
and impractical. Besides, most speech data is recorded
by non-professionals and contains a lot of noise. There-
fore, the shortage of high-quality training data for TTS mainly manifests itself in two ways: the available speech data cannot cover the full vocabulary, and much of it is contaminated with noise.
To solve the problem that the speech data cannot
cover all the words, text and phonemes can be input
into the acoustic network together. During training,
some words can be represented by text randomly, so
that the acoustic model can predict the phoneme pro-
nunciation of unseen words according to the learned
correspondence between characters and phonemes [159,
160]. The text front-end can also be used to convert the
text into phonemes in advance, in order to make the
model only need to learn the pronunciation of a small
number of phonemes.
To solve the problem of the lack of speech data for
minority languages and dialects, the method of cross-
language transfer learning can be used. For example,
Guo et al. [68] and Zhang et al. [261] trained an aver-
age language model with a large Mandarin corpus and
a small Tibetan corpus when training the Tibetan TTS
model, which made up for the lack of Tibetan speech
data. Tu et al. [212] introduced cross-language trans-
fer learning into Tacotron. They used speech data from
high-resource languages to pre-train Tacotron, and then
fine-tuned the pre-trained model with speech data from
low-resource languages. Nekvinda and Dušek [147] used
the idea of meta-learning to train the acoustic model
with only a small number of samples from multiple lan-
guages in order to synthesize speech containing mul-
tiple languages. They used a fully convolutional en-
coder from DCTTS, whose parameters are generated
using a separate contextual parameter generator net-
work [163] conditioned on language embedding, thus
realizing cross-lingual knowledge-sharing.
Semi-supervised pre-training can also be used to re-
duce the demand of the TTS model for paired training
data. Chung et al. [39] proposed training the encoder
and decoder with unpaired text and speech respectively,
and then fine-tuning the pre-trained Tacotron with a
small amount of ⟨text, speech⟩ data pairs. Although
this approach helps the model synthesize more intelligible speech, the experimental results show that pre-training both the encoder and the decoder brings no further improvement over pre-training the decoder alone. Moreover, there is a mismatch be-
tween pre-training only the decoder and fine-tuning the
whole model, because during pre-training the decoder
is only conditioned on the previous frame, while during
fine-tuning the decoder is also conditioned on the text
representation output by the encoder. To avoid poten-
tial error caused by this mismatch and further improve
the data efficiency by using only speech, Zhang and
Lin [255] proposed to use the Vector-Quantized Variational Autoencoder (VQ-VAE) [32, 152] to extract
unsupervised linguistic units from untranscribed speech
and then use ⟨linguistic units, speech⟩ pairs to pre-train
the entire model. The language units act as phonemes
that are paired with the audio, while VQ-VAE plays a
role similar to a speech recognition model. However, VQ-
VAE is trained in an unsupervised way to obtain dis-
cretized linguistic representations, which is suitable for
low-resource languages. Finally, the model is fine-tuned
with a small amount of ⟨text, speech⟩ data pairs.
Using dual learning to train TTS and ASR models
simultaneously can also achieve the purpose of using
text or speech data alone to train both models. Tjandra
et al. [208] proposed an auto-encoder model in which
one is regarded as an encoder and the other as a de-
coder. For example, when there is only speech but no
corresponding text, the ASR model can be used as the
encoder to output text, and the TTS model can be used
as the decoder to output speech, and then the speech
output of the TTS model is expected to be as close as
possible to the input speech. The other situation is sim-
ilar when there is only text but no speech. Ren et al.
[173] also used the idea of dual learning to combine
TTS and ASR to build the capability of the language
understanding and modeling in both speech and text
domains using unpaired data during training, that is,
using a denoising auto-encoder (DAE) to reconstruct corrupted speech and text in an encoder-decoder framework.
They also used a dual transformation (DT) approach
similar to Tjandra et al. to train the model to convert
text to speech and speech to text respectively. The dif-
ference is that Tjandra et al. relied on two well-trained
TTS and ASR models, whereas Ren et al. trained the
two models from scratch, which is better suited to settings where training data is scarce.
Multi-speaker TTS model has much lower require-
ments on the quantity and quality of training data than
models that synthesize speech with a specific style, be-
cause they only need to separate and capture timbre in-
formation in the audio. However, if the speech training
data of the target speaker is too small, the timbre fea-
tures cannot be effectively learned. In order to increase
the amount of speech data of the target speaker, Huy-
brechts et al. [81] used a voice conversion (VC) model to
convert the voice data of other speakers into the voice of
the target speaker for data augmentation, then trained
the TTS model with the expanded speech data, and fi-
nally used the real voice data of the target speaker for
fine-tuning.
The noise in the training data can be reduced by
pre-processing steps. Valentini-Botinhao and Yamag-
ishi [216] took the acoustic features of clean speech
and noisy speech respectively as the input and tar-
get of RNN network, enabling the network to convert
noisy speech into clean speech. Generally, the data in
the corpus containing different styles of speech is of
low quality and contains noise, which will hinder the
training of style speech synthesis model. In this case,
the method of speech style control introduced in Sect.
3.3.1 can be used to train the style extraction network
with clean data and noisy data. By making the net-
work learn about the latent variables of noise features,
it can synthesize clean speech. For example, Wang et al.
[228] trained GST-Tacotron by using data sets mixed
with various noises to learn tokens about the noise fea-
tures. During synthesis, the token representing noiseless
is used as the style embedding to convert noisy refer-
ence speech to clean speech. Hsu et al. [76] used one
dimension of the mixed Gaussian distribution to repre-
sent the noise feature. Clean speech can be synthesized
by using the average value of the clean speech class or
the value of the noise variable extracted from the clean
reference speech as the value of the noise feature.
4 Vocoder
Inspired by the successful application of autoregressive
generative model [220] in the field of image and natu-
ral language generation, Oord et al. [151] first applied
this method to TTS and proposed the most widely
used vocoder WaveNet. In order to capture the long-
range temporal dependencies in audio signals, WaveNet
adopts a multi-layer dilated causal gated convolutional
network, which makes the receptive field grow expo-
nentially with the depth. WaveNet uses speaker iden-
tity and linguistic features as global and local condi-
tions respectively to synthesize the speech of the tar-
get speaker and text. However, WaveNet has a com-
plex network structure and is autoregressive, therefore
the training and inference speed is slow. Moreover, the
speech synthesized with WaveNet is sometimes not nat-
ural. Therefore, since it was proposed, there has been a lot of work to improve it. The direction of improvement is
mainly to accelerate the speed of training and inference
and improve the quality of synthesized speech, which
are respectively called fast vocoder and high-quality
vocoder. These methods are introduced in the following
sections.
4.1 Fast vocoder
The training can be accelerated by reducing the size
and parameters of the vocoder, and the inference can be
accelerated by replacing the autoregressive method in
WaveNet with non-autoregressive methods. The follow-
ing sections will introduce various small-size vocoders
and non-autoregressive vocoders.
4.1.1 Small size vocoder
To improve the speed of training and inference, FFTNet [87] uses the simple ReLU activation function and
1 × 1 convolutions to replace the gated activation units
and dilated convolutions in WaveNet, which reduces the
computational cost. SampleRNN [136] adopted a multi-
scale RNN structure. Different layers operate on audio
data of different time scales. Compared with WaveNet,
it only processes individual samples in the last layer
to improve the synthesis speed, and back-propagates
the gradient of the loss function only on a small frac-
tion of audio to improve the training speed. WaveRNN
[89] only uses a single-layer GRU network with a dual softmax layer that respectively predicts the 8 coarse (most significant) bits and the 8 fine (least significant) bits of each 16-bit audio sample, and applies a
weight pruning technique to further reduce the model
parameters. Furthermore, for the purpose of generat-
ing multiple channels of speech in parallel, WaveRNN
divides a long audio sequence into multiple short se-
quences evenly during inference, and the generation
within and between each short sequence is autoregres-
sive. Although WaveRNN is autoregressive and based
on RNN, its training and inference time is still short,
thus it can be used in systems with few resources such as
mobile phones and embedded systems. The Multi-Band
WaveRNN proposed by Yu et al. [243] further improves
the inference speed of WaveRNN by generating multi-
ple bands in parallel, and performs 8-bit quantization
on the weight value to reduce the model size. LPCNet
[217] reduces the complexity of the model by combin-
ing WaveRNN with linear prediction (LP) technology in
traditional digital signal processing, thereby improving
the synthesis efficiency. The characteristics of various
small-size vocoders are shown in Table 3.
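The coarse/fine factorization used by WaveRNN amounts to splitting each 16-bit sample into its high and low bytes, as in the following sketch (unsigned sample values in 0..65535 are assumed):

```python
def split_coarse_fine(sample_16bit):
    """Split an unsigned 16-bit sample into the 8 coarse (most significant)
    and 8 fine (least significant) bits predicted by the dual softmax."""
    coarse = sample_16bit >> 8        # top 8 bits, 0..255
    fine = sample_16bit & 0xFF        # bottom 8 bits, 0..255
    return coarse, fine

def combine_coarse_fine(coarse, fine):
    """Reconstruct the 16-bit sample from the two predicted bytes."""
    return (coarse << 8) | fine
```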
4.1.2 Non-autoregressive vocoder
As with acoustic models, the vocoders above increase the speed of training and inference to a certain extent, but they all generate audio samples one by one in an autoregressive manner. If a non-autoregressive generation
method can be used to generate speech waveforms in
parallel, the inference speed will be greatly improved.
Based on this idea, various non-autoregressive vocoders
are proposed, and their characteristics are shown in Ta-
ble 4.
The traditional Gaussian autoregressive model is
equivalent to an autoregressive flow (AF) [103], which
is a kind of normalizing flow [175]. The main idea of the
normalizing flow is that a complex distribution can be
obtained by a simple distribution transformed through
multiple invertible functions. It was originally proposed
to make the distribution function of latent variables
in VAE [102] more complex. The flow-based generative
model learns a bidirectional mapping between the input sample x and the latent representation z, i.e. x = f(z) and z = f^{-1}(x). This mapping f is called a normalizing flow and is an invertible function fitted by neural networks, consisting of k invertible transformations f = f_1 \circ \cdots \circ f_k. The normalizing flow transforms a simple density p(z) (such as an isotropic Gaussian distribution) into a complex distribution p(x) by applying the invertible transformation x = f(z). The probability density of x can be calculated through the change-of-variables formula:

p(x) = p(z) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|    (14)

where det denotes the Jacobian determinant. Computing the determinant has complexity O(n^3), where n is the dimension of x and z. In order to reduce the amount of computation, two families of flow models whose Jacobian determinants are easy to calculate have been proposed, based respectively on the autoregressive transformation [220] and the bipartite transformation [45, 46, 101].
During training, the autoregressive flow calculates the latent variables z_i, i = 1, \ldots, D by transforming the speech x = (x_1, x_2, \ldots, x_D):

z_i = \sigma_i(x_{1:i-1}) \cdot x_i + \mu_i(x_{1:i-1})    (15)

where z_{1:D} are D latent variables subject to an isotropic Gaussian distribution, \mu_i are the shift variables representing the mean, and \sigma_i are the scaling variables representing the standard deviation. The training process is non-autoregressive, since z_i only depends on x_{1:i}. In this case, the Jacobian matrix is triangular and its determinant is the product of the diagonal terms:

\det \frac{\partial f^{-1}(x)}{\partial x} = \prod_i \sigma_i(x_{1:i-1})    (16)
During inference, the latent variables z_i, i = 1, \ldots, D and the previously generated audio x_{1:i-1} are used to predict the new x_i:

x_i = \frac{z_i - \mu_i(x_{1:i-1})}{\sigma_i(x_{1:i-1})}    (17)
The inference process of Eq. (17) is autoregressive and therefore slow. In order to speed up inference, Parallel WaveNet [150] and its improved model ClariNet [161] use the inverse autoregressive flow (IAF) [103] to generate speech in parallel. IAF is another normalizing flow. In contrast to AF, IAF uses the previously
Table 3 Small size vocoder

| Vocoder | Neural network types | Characteristics |
|---|---|---|
| WaveNet (Oord et al., 2016) | Dilated causal gated CNN | Based on dilated CNN, the training and inference speed is slow |
| SampleRNN (Mehri et al., 2016) | RNN | Multi-scale RNN structure, training and inference speed is faster than WaveNet |
| FFTNet (Jin et al., 2018) | 1 × 1 CNN | Based on 1 × 1 convolution, the model structure is simple, and the training and inference speed is fast |
| WaveRNN (Kalchbrenner et al., 2018) | GRU | Based on a single layer of GRU, the model structure is simple, and the training and inference speed is fast |
| Multi-Band WaveRNN (Yu et al., 2019) | GRU | Parallel generation of multiple bands, the training and inference speed is fast |
| LPCNet (Valin and Skoglund, 2019) | GRU | The linear prediction (LP) technology is used, the model structure is simple, and the training and inference speed is fast |
obtained latent variables z_{1:i-1} to calculate z_i during training:

z_i = \frac{x_i - \mu_i(z_{1:i-1})}{\sigma_i(z_{1:i-1})}    (18)

This training process is autoregressive. During inference, z_{1:i} is used to predict x_i:

x_i = \sigma_i(z_{1:i-1}) \cdot z_i + \mu_i(z_{1:i-1})    (19)
This inference process is non-autoregressive. Therefore,
AF is fast in training and slow in inference, whereas
IAF is just the opposite. In order to train and synthesize
quickly at the same time, Parallel WaveNet and ClariNet take the autoregressive WaveNet as the teacher network, which provides guidance on the distribution of z_i, i = 1, \ldots, D during training, and use IAF as the student network in charge of the final audio sampling; the problem that IAF cannot be trained in parallel is solved by means of probability density distillation.
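For contrast with the AF sketch above, the following illustrates the IAF generation direction of Eq. (19); the loop is written sequentially for clarity, but since mu_i and sigma_i depend only on the already-known latents, a real implementation produces all samples in one parallel network pass. The function names are again placeholders.

```python
import numpy as np

def iaf_generate(z, mu_fn, sigma_fn):
    """Inverse autoregressive flow, inference direction (Eq. 19): the
    conditioning depends only on z_{1:i-1}, which is known in advance,
    so generation is parallelizable."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        x[i] = sigma_fn(z[:i]) * z[i] + mu_fn(z[:i])
    return x
```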
However, due to the knowledge distillation used in
Parallel WaveNet and ClariNet, the training process is complex. In order to simplify it, Peng et al. [159] proposed WaveVAE. The encoder and decoder of WaveVAE are respectively parameterized by a Gaussian autoregressive WaveNet and the one-step-ahead predictions from an IAF. The encoder q_\phi(z|x) and decoder p_\theta(x|z) can be jointly optimized and trained from scratch by maximizing the evidence lower bound (ELBO) for the observed x, as in a VAE, but at the expense of some sound quality.
In order to train and synthesize more quickly, Ping
et al. [162] proposed WaveFlow which combines au-
toregressive flow and non-autoregressive convolution.
The training process does not need complex knowledge distillation; it is based only on the likelihood function, and combines the advantages of autoregressive and non-autoregressive flows. WaveFlow can be trained quickly, synthesizes high-fidelity speech quickly, and occupies little memory. WaveFlow represents a 1-D audio sequence x = (x_1, x_2, \ldots, x_D) as a 2-D matrix X \in R^{h \times w}, in which adjacent samples are in the same column. The latent variable matrix Z \in R^{h \times w} is defined as:

Z_{i,j} = \sigma_{i,j}(X_{1:i-1,:}) \cdot X_{i,j} + \mu_{i,j}(X_{1:i-1,:})    (20)

where X_{1:i-1,:} denotes all the elements above the i-th row. Therefore, the value of Z_{i,j} depends only on the sample in the i-th row and j-th column and the samples above the i-th row, so all entries of a row can be calculated at the same time. In inference, the samples are generated by:

X_{i,j} = \frac{Z_{i,j} - \mu_{i,j}(X_{1:i-1,:})}{\sigma_{i,j}(X_{1:i-1,:})}    (21)
Although it is autoregressive, it only takes h steps to
generate all samples, and h is usually small, like 8 or
16. WaveFlow uses a 2-D dilated CNN to model a 2-D
Table 4 Non-autoregressive vocoder

| Vocoder | Neural network types | Generative model types | Characteristics |
|---|---|---|---|
| WaveNet (Oord et al., 2016) | Dilated causal gated convolution | Autoregression | Autoregressive generation, slow training and inference speed |
| Parallel WaveNet (Oord et al., 2018) | Dilated causal gated convolution | IAF | Based on knowledge distillation, training and inference speed is fast, Monte Carlo sampling is required to estimate KL divergence, the training process is unstable |
| FloWaveNet (Kim et al., 2018) | Dilated convolution | Normalizing flow | The inference speed is fast, the training convergence speed is slow, the model contains many parameters |
| ClariNet (Ping et al., 2018) | Dilated causal gated convolution | IAF | Based on knowledge distillation, the training and inference speed is fast, the training process is stable |
| WaveGlow (Prenger et al., 2019) | Non-causal dilated convolution, 1 × 1 convolution | Normalizing flow | The inference speed is fast, the training convergence speed is slow, the model contains many parameters |
| MelGAN (Kumar et al., 2019) | Dilated convolution, transposed convolution, grouped convolution | GAN | The inference speed is fast, the training convergence speed is slow |
| GAN-TTS (Bińkowski et al., 2019) | Dilated convolution | GAN | The training and inference speed is fast, no need for mel-spectrogram as input |
| Parallel WaveGAN (Yamamoto et al., 2020) | Non-causal dilated convolution | GAN | The inference speed is fast, the training convergence speed is slow, the model contains many parameters |
| WaveVAE (Peng et al., 2020) | Dilated causal gated convolution | IAF, VAE | The training and inference speed is fast |
| WaveFlow (Ping et al., 2020) | 2-D dilated convolution | Autoregression | Combining the advantages of autoregressive flow and non-autoregressive flow, the training and inference speed is fast |
| WaveGrad (Chen et al., 2020) | Dilated convolution | Diffusion probability model | The inference speed is fast, the training convergence speed is slow |
| DiffWave (Kong et al., 2020) | Bidirectional dilated convolution | Diffusion probability model | The inference speed is fast, the training convergence speed is slow |
| Multi-Band MelGAN (Yang et al., 2021) | Dilated convolution, transposed convolution, grouped convolution | GAN | The training and inference speed is fast |
matrix. Non-causal CNN is used on width dimension,
causal CNN with autoregressive constraints is used on
height dimension, and convolution queue [153] is used to
cache the intermediate hidden states to speed up the au-
toregressive synthesis along the height dimension. Therefore, it retains both the advantage of the autoregressive inference method, which can accurately model the local variations of the waveform, and that of the non-autoregressive convolutional structure, which enables speedy synthesis and captures the long-range structure in the data.
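The squeeze of the waveform into the h × w matrix can be sketched as follows; h and the toy input are illustrative, and the actual WaveFlow implementation may differ in details such as padding.

```python
import numpy as np

def to_waveflow_matrix(x, h):
    """Reshape a 1-D audio sequence into the h x w matrix used by WaveFlow,
    with adjacent samples placed in the same column, so that generation is
    autoregressive only over the h rows (h is small, e.g. 8 or 16)."""
    w = len(x) // h
    return np.asarray(x[: h * w]).reshape(w, h).T   # shape (h, w)

x = np.arange(32)
X = to_waveflow_matrix(x, h=8)   # X[:, 0] == [0, ..., 7]: adjacent samples share a column
```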
WaveGlow [164] and FloWaveNet [99] are also based
on normalizing flow and have similar structures, using
Glow [101] and Real-NVP [46] respectively. Real-NVP
is an improved model of the normalizing flow NICE
[45]. It is trained and inferred by bipartite transfor-
mation, but each layer can only transform a part of
the input. As an improved model of Real-NVP, Glow
introduces 1 × 1 invertible CNN to mix the informa-
tion between two channels and realizes complete trans-
formation. The affine coupling layer in WaveGlow and
FloWaveNet transforms one half x_b of the input vector x at each layer, leaving the other half x_a unchanged. The transformation process is:

z_a = x_a    (22)
z_b = x_b \cdot \sigma_b(x_a) + \mu_b(x_a)    (23)

where x_a and x_b are the result of bisecting x, and z_a and z_b are the corresponding latent variables. The inference process is:

x_a = z_a    (24)
x_b = \frac{z_b - \mu_b(x_a)}{\sigma_b(x_a)}    (25)
Therefore, WaveGlow and FloWaveNet can both compute the latent variable z and synthesize the speech x in parallel. In fact, the bipartite transformation is a special case of the autoregressive transformation [162]: the autoregressive transformation can be reduced to a bipartite transformation by the substitution

(\mu_i(x_{1:i-1}), \sigma_i(x_{1:i-1}))^T = \begin{cases} (0, 1)^T, & i \in a \\ (\mu_i(x_a), \sigma_i(x_a))^T, & i \in b \end{cases}    (26)
However, the bipartite transformation is less expressive than the autoregressive transformation, because it reduces the dependence between the data X and the latent variables Z. As a result, the speech synthesized by WaveGlow and FloWaveNet is of lower quality, and a deeper network is needed to obtain results comparable to the autoregressive models.
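A minimal sketch of an affine coupling layer implementing Eqs. (22)-(25), where `net` stands for any network predicting the scale and shift from the unchanged half; it is a generic sketch, not the exact WaveGlow or FloWaveNet layer.

```python
import torch

def coupling_forward(x, net):
    """Affine coupling (Eqs. 22-23): half of the input passes through
    unchanged and parameterizes an affine transform of the other half.
    `net` maps x_a to (log_sigma_b, mu_b); both directions are parallel."""
    x_a, x_b = x.chunk(2, dim=-1)
    log_sigma, mu = net(x_a).chunk(2, dim=-1)
    z_b = x_b * torch.exp(log_sigma) + mu
    return torch.cat([x_a, z_b], dim=-1), log_sigma.sum(dim=-1)  # log|det J|

def coupling_inverse(z, net):
    """Inverse direction (Eqs. 24-25), also parallel: recover x_b from z_b."""
    z_a, z_b = z.chunk(2, dim=-1)
    log_sigma, mu = net(z_a).chunk(2, dim=-1)
    x_b = (z_b - mu) * torch.exp(-log_sigma)
    return torch.cat([z_a, x_b], dim=-1)
```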
In addition to normalizing flow, GAN [60] can also
be used to synthesize speech in parallel, such as Parallel
WaveGAN [234], MelGAN [113], multi-band MelGAN
[237] and GAN-TTS [17]. Parallel WaveGAN’s genera-
tor is similar in structure to WaveNet, which uses ran-
dom noise and mel-spectrogram conditions to generate
speech waveforms. Its discriminator is used to deter-
mine whether the generated audio is real. MelGAN’s
generator simply uses dilated CNN to increase the re-
ceptive field, and its inference speed is faster than Par-
allel WaveGAN. Its discriminator outputs real/fake la-
bels and feature maps [226], and speeds up training by
using grouped convolutions to reduce the number of model parameters.
The feature matching loss adopted by MelGAN computes feature maps with neural networks, while the multi-resolution STFT loss adopted by Parallel WaveGAN computes feature maps with the STFT. Inspired by this, Multi-Band MelGAN replaces the original feature matching loss of MelGAN with the multi-resolution STFT loss from Parallel WaveGAN, and extends MelGAN to measure the difference between the real and predicted audio on multiple subband scales, which further improves the training and inference speed of MelGAN.
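A minimal sketch of a multi-resolution STFT loss of this kind, combining spectral convergence and log-magnitude terms over a few assumed (n_fft, hop) resolutions; the exact resolutions and weighting used in the cited papers may differ.

```python
import torch

def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram of a (batch, samples) or (samples,) waveform."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(fake, real,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average of spectral convergence and log-magnitude L1 over several
    STFT resolutions; the resolution set is illustrative."""
    loss = 0.0
    for n_fft, hop in resolutions:
        s_fake, s_real = stft_magnitude(fake, n_fft, hop), stft_magnitude(real, n_fft, hop)
        sc = torch.norm(s_real - s_fake) / torch.norm(s_real)                # spectral convergence
        mag = torch.mean(torch.abs(torch.log(s_real) - torch.log(s_fake)))   # log-magnitude L1
        loss = loss + sc + mag
    return loss / len(resolutions)
```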
In order to obtain better results and faster training
speed, GAN-TTS uses an ensemble of small scale un-
conditional and conditional Random Window Discrim-
inators (RWDs) operating at different window sizes,
which respectively assess the realism of the generated
speech and its correspondence with the input text.
The diffusion probability model [73, 193] can also be
used to generate speech waveforms. It is a probabilis-
tic model based on Markov chain, which divides the
mapping relationship between the noise and the tar-
get waveform into several steps, and gradually trans-
forms the simple distribution (e.g., isotropic Gaussian)
into the complex data distribution by means of Markov
chain. A fixed diffusion process (from structured waveform to noise) is defined, and the model learns to decode the noise through the reverse process (from noise to structured waveform). The decoding process re-
quires only a constant few generation steps, so the infer-
ence speed is fast. Chen et al. [27] proposed a fully con-
volutional vocoder WaveGrad to synthesize speech non-
autoregressively based on diffusion probability model
and score matching framework [194, 195]. A similar
model is DiffWave [110], which uses bidirectional di-
lated convolution architecture with a long bidirectional
receptive field and a much smaller number of model pa-
rameters than WaveGrad. However, the inference speed
of the vocoder based on diffusion probability model is
slightly lower than that of the flow-based vocoder.
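The training side of such a diffusion vocoder can be sketched as follows, using the standard closed-form forward corruption; the noise schedule, the denoising network, and the mel conditioning are assumed and omitted here.

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """Forward (diffusion) process: corrupt clean audio x0 with Gaussian
    noise at step t via x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    A diffusion vocoder is trained to predict eps from (x_t, t, mel)."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

# training objective: || eps - eps_theta(x_t, t, mel) ||^2 ; at inference a
# short fixed schedule of reverse steps maps noise back to a waveform.
```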
4.2 High-quality vocoder
To improve the naturalness of speech, WaveNet pro-
poses to expand the receptive field by dilated CNN and
introduce additional conditional information, such as
speaker information (global conditioning) and acous-
tic features (local conditioning), by modeling the con-
ditional probability of audio. WaveNet takes softmax
layer as the output layer of the network, and adopts
nonlinear quantization method of µ-law companding
transformation to obtain discrete-value speech signals.
Although the reconstructed speech signal is close to the
original, the quantization process still introduces white
noise into the original signal. Yoshimura et al. [242]
proposed a quantization noise shaping method based
on mel-cepstrum, which solved this problem by pre-
processing WaveNet with a mel-log spectral approxi-
mation (MLSA) filter [82]. Because the mel-cepstrum
matches the human auditory perception characteris-
tics, this method effectively filters the white noise in-
troduced by the commonly used quantization method
in the speech waveform synthesis system, and has no
extra computational cost compared with WaveNet in
the synthesis stage.
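For reference, a sketch of the standard µ-law companding and 8-bit quantization that WaveNet-style vocoders apply before the softmax output; numerically equivalent routines exist in common audio libraries.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Standard mu-law companding of a waveform in [-1, 1], followed by
    quantization to mu + 1 discrete levels."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)      # integers 0 .. mu

def mu_law_decode(q, mu=255):
    """Inverse companding back to a waveform in [-1, 1]."""
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```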
In order to improve the quality of the speech syn-
thesized by the autoregressive vocoder, Jin et al. [87]
proposed to add zero padding to the input to make
the network have a stronger generalization ability. And
when outputting the result, instead of directly taking
the value of the maximum probability, sampling is con-
ducted according to the probability distribution to sim-
ulate the real speech signal containing noise. Due to the
training error of vocoder, there is always noise in the
generated speech sample. And in the process of autore-
gressive generation, the noise in the synthesized speech
sample will become more and more loud over time. Gen-
erating new samples with noisy speech samples as in-
put to the network adds more and more uncertainty
to the network. Therefore, during the training, they
added some noise to the input to make the network
robust to the input samples containing noise, and re-
duced the noise injected into the pronunciation sam-
ples by post-processing with spectral subtraction noise
reduction [129].
When using implicit generative models such as GAN
to generate audio, speech waveforms of different resolu-
tions can be predicted at the same time to perfect the
details of synthesized speech and stabilize the train-
ing process, as shown in Table 5. Parallel WaveGAN
and Multi-Band MelGAN use a multi-resolution STFT
loss for training. The discriminator in MelGAN adopts
a multi-scale structure to simultaneously discriminate
feature maps of audio waveforms with different sam-
pling frequencies to learn the features of different au-
dio frequency ranges. Besides, MelGAN uses feature
matching loss to optimize both discriminator and gen-
erator, thereby reducing the distance between the fea-
ture maps of the real and synthesized audio. VocGAN
[239] uses both multi-resolution STFT loss and feature
matching loss, and extends the generator on the ba-
sis of MelGAN to output multiple waveforms of differ-
ent scales. It helps the generator learn the mapping of
both low- and high-frequency components of acoustic
features by training the generator with the adversarial
loss calculated by a set of discriminators with different
resolutions. Moreover, VocGAN also applied the joint
conditional and unconditional (JCU) loss [256]. The
conditional loss leads the generator to map the acous-
tic feature of the input mel-spectrogram to the wave-
form more accurately, thus reducing the discrepancy
between the acoustic characteristics of the input mel-
spectrogram and the output waveform. In addition to
using the multi-scale discriminator in MelGAN, HiFi-
GAN [109] introduced the multi-period discriminator
(MPD) to model the periodic patterns of speech. Each
sub-discriminator only accepts equally spaced samples
of an input audio, aiming to capture different implicit
structures from each other by looking at different parts
of the input audio. Besides, the generator in HiFi-GAN
is connected with a multi-receptive field fusion (MRF)
module after each transposed convolution, which can
observe patterns of various lengths in parallel. Grit-
senko et al. [63] proposed a method for training par-
allel vocoder based on the spectral generalized energy
distance (GED) [58, 180, 188] between the generated
and the real audio distribution. The main difference
from other spectrogram-based losses is that, in addi-
tion to the attractive term between the generated data
and the actual data, GED also adds a repulsive term
between generated data to the training loss to avoid
generated samples collapsing to a single point, thus cap-
turing the full data distribution. GED can be combined
with the adversarial loss to further improve the synthe-
sized speech quality.
Similar to the acoustic models, the multi-speaker
TTS task can also be performed only by the vocoder.
Chen et al. [30] borrowed the idea of meta-learning and
proposed three methods to synthesize the voice of a
new speaker using only a small amount of the target
speaker’s speech. The first method is to fix other pa-
rameters of the model and update only the speaker
embedding vector. The second method is to fine-tune
all the parameters of the model. The third method
is to use a trained neural network encoder to predict
the speaker embedding. The experimental results show
that the speech synthesized by the second method has
Table 5 Methods of GAN-based vocoder to improve the naturalness of generated speech

| Vocoder | Characteristics |
|---|---|
| MelGAN (Kumar et al., 2019) | Using multi-scale discriminant structure and feature matching loss |
| Parallel WaveGAN (Yamamoto et al., 2020) | Using multi-resolution STFT loss |
| VocGAN (Yang et al., 2020) | Using multi-resolution STFT loss, feature matching loss, multi-scale waveform generator, and JCU loss |
| HiFi-GAN (Kong et al., 2020) | Using multi-scale discrimination, multi-period discrimination, and MRF |
| Multi-Band MelGAN (Yang et al., 2021) | Using multi-resolution STFT loss |
the highest naturalness. However, the method they pro-
posed only works when the quality of the training speech
data is high.
5 Speech corpus
The proposal of the end-to-end TTS method based on
deep learning reduces the difficulty of developing a high-
quality TTS system. Compared with ASR models, TTS models require more labeled high-quality speech data to achieve good training results, and the number of open-source corpora that meet these conditions is very small.
out experiments, several commonly used open source
TTS corpora are introduced below. The details of each
corpus are shown in Table 6.
5.1 English speech corpus
Due to the versatility of English, the academic research
on English TTS is the most. Therefore, there are many
English TTS corpora available for free, such as VCTK
[223], LJ Speech [83] and LibriTTS [253].
The VCTK corpus¹ includes speech data uttered
by 109 native speakers of English with various accents.
Each speaker reads out about 400 sentences, most of
which were selected from a newspaper plus the Rain-
bow Passage and an elicitation paragraph intended to
identify the speaker's accent. The speakers were recorded with an omni-directional head-mounted microphone in a hemi-anechoic chamber of the University of Edinburgh at a sampling rate of 96 kHz with 24-bit resolution. All recordings were converted into 16 bit, downsampled to
¹ The VCTK corpus is freely available for download from https://datashare.is.ed.ac.uk/handle/10283/2119.
48 kHz, and manually end-pointed. The VCTK corpus
was originally recorded for building HMM-based multi-
speaker TTS systems.
LJ Speech² is a public domain corpus consisting of
13,100 short audio clips of a single speaker, made up of
non-professional audiobooks from the LibriVox project
[95]. Each audio file is a single-channel 16 bit PCM
WAV with a sampling rate of 22,050 Hz. The audio
clips range in length from approximately 1 second to
10 seconds and are segmented automatically based on
silences in the recording, with a total duration of about
24 hours. Clip boundaries generally align with sentence
or clause boundaries. The text was matched to the au-
dio manually, and a QA pass was done to ensure that
the text accurately matched the words spoken in the
audio.
The LibriTTS corpus³ is composed of audio and
text from the LibriSpeech [156] corpus. Librispeech,
made up of audiobooks from the LibriVox project, was
originally designed for ASR research and contains 982
hours of speech data from 2,484 speakers. The Lib-
riTTS corpus inherits some of the properties of the Lib-
riSpeech corpus, while addressing problems that make
LibriSpeech less suitable for TTS tasks. For example,
LibriTTS increases the sampling rate of audio files from
16 kHz to 24 kHz, splits speech at sentence breaks in-
stead of at silences longer than 0.3 seconds, contains
the original text and the standardized text, can ex-
tract contextual information (such as neighbouring sen-
tences), and excludes utterances with significant back-
ground noise. The processed LibriTTS corpus consists
of 585 hours of speech data at 24 kHz sampling rate
² The LJ Speech corpus is freely available for download from https://keithito.com/LJ-Speech-Dataset/.
³ The LibriTTS corpus is freely available for download from http://www.openslr.org/60/.
Table 6 Details of each corpus

| Corpus | Language | Number of speakers | Hours | Labeling method | Sampling rate (kHz) |
|---|---|---|---|---|---|
| VCTK | English | 109 | 44 | Characters | 48 |
| LJ Speech | English | 1 | 24 | Original and standardized characters and phonemes | 22.05 |
| LibriTTS | English | 2,456 | 585 | Original and standardized characters, contextual information | 24 |
| CMU ARCTIC | English | 7 | 7 | Characters | 16 |
| Blizzard2011 | English | 1 | 16.6 | Characters | 16 |
| Blizzard2013 | English | 1 | 300 | Characters | 44.1 |
| Blizzard2017 | English | 1 | 6 | Characters | 44.1 |
| CSMSC | Mandarin | 1 | 12 | Pinyin, rhythm and phoneme boundary | 48 |
| AISHELL-3 | Mandarin | 218 | 85 | Characters, pinyin | 44.1 |
| DiDiSpeech | Mandarin | 6,000 | 800 | Standardized pinyin | 48 |
| CSS10 | German, Greek, Spanish, French, Chinese, Japanese, Russian, Finnish, Hungarian, Dutch | Single speaker per language | - | Original and standardized characters | 22 |
| Common Voice | 60 languages | - | 7,335 | Characters | 48 |
from 2,456 speakers and its corresponding text tran-
scripts.
There are other open source English corpora, such
as the CMU ARCTIC corpus⁴ [108] constructed by
the Language Technologies Institute of Carnegie Mellon
University for unit selection speech synthesis research.
However, the amount of data in this corpus is too small
to train the neural end-to-end TTS model well. Every
year, The Blizzard Challenge, an international speech
synthesis competition, provides participants with open
source English speech data. For example, the corpus of
⁴ The data is freely available for download from http://www.festvox.org/cmu_arctic/.
The Blizzard Challenge 2011, 2013 and 2017⁵ consists
of tens of hours, hundreds of hours and 6 hours of au-
dio and corresponding text transcripts of audiobooks
read by a single speaker, with sampling frequencies of
16 kHz, 44.1 kHz and 44.1 kHz, respectively.
5.2 Mandarin speech corpus
Mandarin is the language with the largest number of
speakers in the world, thus Mandarin TTS has also been
widely researched and applied [57, 160]. However, Man-
darin has a complex tone and prosodic structure [140].
⁵ These data sets are freely available for download from http://www.cstr.ed.ac.uk/projects/blizzard/ and can only be used for non-commercial purposes.
Meanwhile, Chinese characters are ideograms, which
are not directly related to pronunciation. It is neces-
sary to convert the original Chinese text into phonemes
or pinyin as audio transcription. Therefore, compared
with English, the cost of recording and transcribing
high-quality Mandarin corpus is higher, resulting in few
open source high-quality Mandarin corpus. In order to
facilitate researchers to conduct research on Mandarin
TTS, several open source Mandarin corpora that can
be used for TTS will be introduced.
CSMSC (Chinese Standard Mandarin Speech Corpus)⁶ [8] is a single-speaker Mandarin female voice cor-
pus released by data-baker company. The corpus uses
a professional recording studio and recording software
for recording. The recording environment and equip-
ment remain unchanged throughout the recording, and
the signal-to-noise ratio (SNR) of the recording envi-
ronment is not less than 35 dB. The audio format is
a mono PCM WAV with a sampling rate of 48 kHz at 16 bit, and the effective duration is approximately
12 hours. The recordings cover a variety of topics, such
as news, fiction, technology, entertainment, dialogue,
etc. The speech corpus is proofread, and rhythms and
phoneme boundaries are manually edited.
AISHELL-3⁷ [191] is a high-quality Mandarin cor-
pus for multi-speaker TTS published by Shell Shell. It
contains roughly 85 hours of emotion-neutral record-
ings spoken by 218 native Chinese mandarin speakers,
as well as transcripts in Chinese character-level and
pinyin-level. All utterances are recorded using a high-
fidelity microphone (44.1 kHz, 16 bit) in a quiet indoor
environment. The topics of the textual content spread
a wide range of domains including smart home voice
commands, news reports and geographic information.
DiDiSpeech⁸ [67] is a large open-source Mandarin
speech corpus released by DiDi Chuxing company. The
corpus includes approximately 800 hours of speech data
at a sampling rate of 48 kHz from 6,000 speakers and
corresponding text transcripts. All speech data in the
DiDiSpeech corpus are recorded in a quiet environment,
and the audio with significant background noise is fil-
tered. It is suitable for various speech processing tasks,
such as voice conversion, multi-speaker TTS and ASR.
⁶ The CSMSC corpus is available at https://www.data-baker.com/open_source.html for non-commercial use only.
⁷ The AISHELL-3 corpus is available at http://www.aishelltech.com/aishell_3; it supports academic research only, and commercial use is prohibited without permission.
⁸ The DiDiSpeech data set is available upon application at https://outreach.didichuxing.com/research/opendata/.
5.3 Multilingual speech corpus
There has been little research in the TTS field into lan-
guages other than English, partly because of the lack
of available open source corpora. To enable TTS to be
applied to more languages, some researchers have con-
structed speech corpora containing multiple languages,
such as CSS10 [158] and Common Voice [128].
CSS10 is a single-speaker corpus of ten languages, including Chinese, Dutch, French, Finnish, Japanese, Hungarian, Greek, German, Russian and Spanish; it is available for free at https://github.com/Kyubyong/CSS10. It is composed of short audio clips from LibriVox audiobooks and the corresponding standardized transcriptions. All audio files are sampled at 22 kHz.
Common Voice is the largest public multilingual speech corpus, currently containing nearly 9,283 hours (7,335 hours verified) of speech data in 60 languages, fully open to the public and available for free at https://commonvoice.mozilla.org/. The project employs crowdsourcing for both data collection and data validation. The audio clips are released as mono-channel, 16 bit MP3 files with a 48 kHz sampling rate. This corpus is designed for ASR and is rather noisy, so the original audio data needs to be denoised before it is used for the TTS task [147].
5.4 Emotional speech corpus
Emotional TTS has been extensively researched, but one of the current problems in this field is the lack of publicly available emotional speech corpora and the difficulty of recording such data. None of the above-mentioned corpora contains explicit emotional information, and most existing emotional corpora cannot be used effectively to train deep learning-based emotional TTS models, because these data sets either contain only a small number of sentences, such as RAVDESS [128], CREMA-D [23], GEMEP [9] and EMO-DB [18], or contain noise, such as MSP-IMPROV [20] and IEMOCAP [19].
To fill this gap, Tits et al. [207] released the EmoV-DB corpus, which covers five emotions (amusement, anger, sleepiness, disgust, and neutral) and two languages (English and French); the database is available for free at https://github.com/numediart/EmoV-DB. The English speech data is recorded by two male and two female speakers, and the French speech data is recorded by one male speaker. English sentences are taken from the CMU ARCTIC corpus and French sentences from the SIWIS corpus [75]. Each audio file is recorded in 16-bit WAV format.
6 Evaluation method
Speech quality is measured in three aspects: clarity, intelligibility and naturalness. At present, however, there is no uniform evaluation criterion for the quality of synthesized speech. Unlike tasks such as classification and prediction, which can be evaluated quantitatively, generated speech is ultimately judged by human listeners, so its quality usually requires subjective, qualitative evaluation. Subjective evaluation, however, is difficult to standardize, because listener judgments inevitably vary. In addition, some objective speech quality evaluation metrics also have reference value. Therefore, this section summarizes the evaluation methods for synthesized speech from both the subjective and the objective perspective.
6.1 Subjective evaluation method
Subjective evaluation methods are usually more suit-
able for evaluating generative models, but they require
significant resources and face challenges in the relia-
bility, validity and reproducibility of results [84]. The
most commonly used subjective evaluation method is the Mean Opinion Score (MOS), which measures naturalness by asking listeners to score the synthesized speech. MOS adopts a five-point scale, with higher scores indicating higher speech quality; scores can be collected at scale using the CrowdMOS toolkit [176].
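As an illustration of how MOS ratings are typically aggregated, the following Python sketch (an illustrative assumption of this review, not part of the CrowdMOS toolkit) averages the raw listener scores collected for one system and attaches an approximate 95% confidence interval; the function name and the example ratings are hypothetical.

import math

def mos_with_ci(scores, z=1.96):
    """Aggregate 1-5 opinion scores into a MOS and a ~95% confidence interval."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of the ratings (n - 1 in the denominator).
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# Example: ratings from ten listeners for one synthesized test set.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")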
MUSHRA (Multiple Stimuli with Hidden Reference and
Anchor) [171, 182] is also a subjective listening test
method. Specifically, the audio to be tested is presented together with natural speech as a hidden reference (upper limit) and a severely degraded version of the audio as an anchor (lower limit). Listeners score the test audio, the hidden reference and the anchor in a double-blind listening test on a scale from 0 to 100. The 0-100 scale used by MUSHRA allows very small differences to be rated. Its main advantage over the MOS methodology is that MUSHRA requires fewer participants to obtain statistically significant results.
All the above are absolute rating methods, and some-
times it is necessary to compare the speech quality gen-
erated by two models, which requires the use of rela-
tive rating methods, such as comparison mean opinion
score (CMOS) and AB preference test. CMOS is used
to compare the difference between the MOS value of
the model under test and the baseline. AB preference
test selects a better model or finds no significant differ-
ence between the two models by asking the listeners to
compare the speech of the same sentence synthesized by
the two models. The ABX preference test can be used
when comparing multi-speaker TTS models or speech
conversion models. Specifically, listeners are asked to
listen to three speech fragments A, B and X respec-
tively, where X represents the target speech, while A
and B represent the speech generated by the two mod-
els, respectively. The listeners are then asked to judge whether speech A or B is closer to X in terms of speaker characteristics, or to indicate that they cannot give a clear judgment. Finally, the judgments of all listeners are counted to calculate, for each model, the proportion of synthesized speech that sounds more like the target speech.
6.2 Objective evaluation method
The objective evaluation method is mainly the quanti-
tative evaluation of the TTS model and the generated
speech. The differences between the generated samples
and the real samples are usually used to evaluate the
model. However, these evaluation metrics can only re-
flect the data processing ability of the model to a cer-
tain extent, and cannot truly reflect the quality of the
generated speech.
The most intuitive way to objectively evaluate the prosody and accuracy of synthesized speech is to directly calculate the root mean square error (RMSE), absolute error and negative log-likelihood (NLL) of $f_0$, pitch, $c_0$ (the 0-th cepstrum coefficient) and duration between the reference audio and the predicted audio, as well as the character error rate (CER), word error rate (WER) and utterance error rate (UER) of the synthesized speech.
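For the character and word error rates mentioned above, the synthesized speech is usually first transcribed by an ASR system and then compared with the input text. The following sketch computes the standard Levenshtein-distance-based error rate; it is a generic illustration assumed by this review rather than any specific toolkit's implementation, and the function names are hypothetical.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

def error_rate(reference, hypothesis, unit="word"):
    """WER if unit == 'word', CER if unit == 'char' (errors / reference length)."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    return edit_distance(ref, hyp) / len(ref)

# Example: compare an ASR transcript of synthesized speech with the input text.
print(error_rate("the cat sat on the mat", "the cat sat on mat"))       # WER
print(error_rate("speech synthesis", "speach synthesis", unit="char"))  # CER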
Another commonly used objective evaluation met-
ric for judging the difference between the generated
samples and the real samples is Mel-Cepstral Distor-
tion (MCD) [112]. MCD quantifies the reconstruction performance of Mel-Frequency Cepstrum Coefficients (MFCC) by calculating the distance between the synthesized and the reference mel-cepstral features. Its calculation formula is:
\mathrm{MCD}_K = \frac{1}{T}\sum_{t=0}^{T-1}\sqrt{\sum_{k=1}^{K}\left(c_{t,k} - c'_{t,k}\right)^{2}} \qquad (27)
where $c_{t,k}$ and $c'_{t,k}$ are the $k$-th MFCCs of the $t$-th frame of the reference and predicted audio, respectively. MCD is usually computed over $K = 13$ MFCC dimensions. The lower the MCD value, the higher the quality of the synthesized speech. It can be used to evaluate timbral distortion, and its unit is dB.
A similar evaluation metric is mel-spectral distortion (MSD). MSD is calculated in the same way as MCD, but on the logarithmic mel-spectral amplitudes rather than the cepstrum coefficients, which captures harmonic content not reflected in MCD.
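Eq. (27) can be computed directly from frame-aligned MFCC matrices. The sketch below is a minimal NumPy rendering of the formula, under the assumption that the reference and predicted features have already been time-aligned (e.g. by dynamic time warping) and restricted to the K coefficients of interest; the function name is hypothetical.

import numpy as np

def mel_cepstral_distortion(ref_mfcc, pred_mfcc):
    """Eq. (27): mean frame-wise Euclidean distance between MFCC sequences.

    Both inputs have shape (T, K) and are assumed to be time-aligned;
    K = 13 coefficients is the usual choice.
    """
    per_frame = np.sqrt(np.sum((ref_mfcc - pred_mfcc) ** 2, axis=1))
    mcd = per_frame.mean()
    # A constant factor of 10 * sqrt(2) / ln(10) is often applied on top of
    # this distance when reporting MCD in dB.
    return mcd

# Toy usage with random features standing in for extracted MFCCs.
ref = np.random.randn(200, 13)
pred = ref + 0.1 * np.random.randn(200, 13)
print(mel_cepstral_distortion(ref, pred))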
Gross Pitch Error (GPE) and Voicing Decision Er-
ror (VDE) are two commonly used metrics to measure
the error rate of synthesized speech [146]. GPE is the estimation error of the audio $f_0$ value, defined as [192]:
\mathrm{GPE} = \frac{\sum_{t}\mathbf{1}\left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\mathbf{1}[v_t]\,\mathbf{1}[v'_t]}{\sum_{t}\mathbf{1}[v_t]\,\mathbf{1}[v'_t]} \qquad (28)
where $p_t$, $p'_t$ are the pitch signals of the reference and predicted audio, $v_t$, $v'_t$ are the voicing decisions of the reference and predicted audio, and $\mathbf{1}[\cdot]$ is the indicator function. GPE measures the percentage of voiced frames in the predicted audio whose pitch deviates by more than 20% from the reference.
VDE is defined as [192]:
\mathrm{VDE} = \frac{\sum_{t=0}^{T-1}\mathbf{1}\left[v_t \neq v'_t\right]}{T} \qquad (29)
where $v_t$, $v'_t$ are the voicing decisions of the reference and predicted audio, $T$ is the total number of frames, and $\mathbf{1}[\cdot]$ is the indicator function. VDE measures the frame-level voicing decision error rate of the predicted audio. The lower these two metrics, the better. However, some algorithms have low GPE but high VDE. To reduce VDE and GPE at the same time, Chu and Alwan [37] combined GPE and VDE and proposed the $f_0$ Frame Error (FFE) metric.
FFE is used to measure the percentage of frames that
either contain a 20% pitch error (according to GPE) or
a voicing decision error (according to VDE), defined as
[192]:
\mathrm{FFE} = \frac{\sum_{t=0}^{T-1}\left(\mathbf{1}\left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\mathbf{1}[v_t]\,\mathbf{1}[v'_t] + \mathbf{1}\left[v_t \neq v'_t\right]\right)}{T} \qquad (30)
FFE calculates the proportion of frames in which the predicted pitch differs from the true pitch, quantifying the reconstruction error of the $f_0$ trajectory. The lower the value, the better.
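Since Eqs. (28)-(30) share the same ingredients, namely frame-level pitch values and voicing decisions, they can be computed together. The sketch below is a direct NumPy transcription of the three formulas and assumes that the reference and predicted tracks are already frame-aligned; the function name and the toy values are hypothetical.

import numpy as np

def pitch_metrics(p_ref, p_pred, v_ref, v_pred):
    """GPE (Eq. 28), VDE (Eq. 29) and FFE (Eq. 30) for frame-aligned tracks.

    p_ref, p_pred: per-frame pitch values of the reference and predicted audio.
    v_ref, v_pred: per-frame boolean voicing decisions.
    """
    p_ref, p_pred = np.asarray(p_ref, float), np.asarray(p_pred, float)
    v_ref, v_pred = np.asarray(v_ref, bool), np.asarray(v_pred, bool)
    T = len(p_ref)

    both_voiced = v_ref & v_pred
    pitch_err = np.abs(p_ref - p_pred) > 0.2 * p_ref  # >20% pitch deviation

    gpe = np.sum(pitch_err & both_voiced) / np.sum(both_voiced)
    vde = np.sum(v_ref != v_pred) / T
    ffe = (np.sum(pitch_err & both_voiced) + np.sum(v_ref != v_pred)) / T
    return gpe, vde, ffe

# Toy usage with five frames.
gpe, vde, ffe = pitch_metrics(
    p_ref=[200, 210, 0, 190, 205],
    p_pred=[198, 260, 0, 188, 0],
    v_ref=[True, True, False, True, True],
    v_pred=[True, True, False, True, False],
)
print(gpe, vde, ffe)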
Bińkowski et al. [17] also proposed four metrics for evaluating TTS models: the unconditional and conditional Fréchet DeepSpeech Distance (FDSD, cFDSD) and Kernel DeepSpeech Distance (KDSD, cKDSD). These metrics are inspired by metrics commonly used for evaluating GAN-based image generation models [16, 72], and judge the quality of the synthesized speech by calculating the distance between the synthesized audio and the reference audio. Moreover, the quality of the synthesized speech waveform can also be evaluated by calculating the Perceptual Evaluation of Speech Quality (PESQ) [177] between the reference speech and the synthesized speech, with higher values being better.
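If such a full-reference perceptual score is desired, PESQ can be computed with an open-source implementation. The snippet below assumes the third-party Python packages soundfile (for reading WAV files) and pesq (an implementation of ITU-T P.862); it further assumes that both recordings are mono, time-aligned and sampled at 16 kHz, and the file names are placeholders.

import soundfile as sf   # assumed third-party package for WAV loading
from pesq import pesq    # assumed third-party PESQ implementation

# Placeholder file names; both signals are assumed mono, 16 kHz, time-aligned.
ref, sr = sf.read("reference.wav")
deg, _ = sf.read("synthesized.wav")

# Wide-band PESQ; higher scores indicate better perceived quality.
print(pesq(sr, ref, deg, "wb"))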
7 Future development direction
Although existing TTS methods can synthesize high-fidelity speech by drawing on the various Seq2Seq and generative models developed in deep learning fields such as NMT, ASR, image generation and music generation, they still have many shortcomings. For example, deep learning-based end-to-end TTS technology is still unable to synthesize speech stably in real time, and the quality of the generated speech cannot be guaranteed. Therefore, a large proportion of the TTS models currently used in industry are still based on waveform cascade technology [24]. Moreover, state-of-the-art TTS technology is limited to a few common languages such as English and Mandarin. Since it is difficult to obtain ⟨text, speech⟩ data pairs, there has been little research on minority languages and dialects.
Based on the above introduction and summary of
TTS method, it can be concluded that there will be at
least the following development directions in the field
of TTS in the future:
Control the style of speech in a precise and fine-
grained manner Speaking styles such as emotion,
intonation and rhythm often change during conver-
sation. However, current neural TTS systems cannot
precisely control these style features of speech indi-
vidually. How to achieve fine-grained style control
of speech at word level and phrase level will also be
the focus of TTS research in the future. In addition, because emotional speech data is difficult to record and label, how to effectively use emotional speech data that is limited in quantity and quality to train the TTS model, so that it learns representations of the various style features in speech, is also an urgent problem in the field of TTS.
In-depth research on the representation method of
speech signal in deep neural network Children learn
to speak long before they learn to read and write.
They can conduct a dialogue and produce novel sen-
tences, without being trained on an annotated cor-
pus of speech and text or aligned phonetic symbols.
Presumably, they achieve this by recoding the input
speech in their own internal phonetic representa-
tions (proto-phonemes or proto-text) [48]. This idea
can also be applied to TTS systems, as stated in the
goal of the ZeroSpeech Challenge: extract acoustic
units from speech signals by unsupervised learning
and create good data representation. Therefore, rep-
resentation learning and meta-learning can be used
to improve the modeling ability and learning effi-
ciency of TTS model for speech data, thus greatly
reducing the labeled speech data required for train-
ing.
Build a fully end-to-end TTS pipeline Although the
existing TTS models are all called end-to-end, most
of them are divided into three parts: text front-end,
acoustic model and vocoder. These three modules
need to be trained separately, and the errors gener-
ated by each module will gradually accumulate. The
latest TTS frameworks such as ClariNet [161], Fast-
Speech 2s [174], EATS [47] and Wave-Tacotron com-
bine these modules and claim to be fully end-to-end
for training and inference. However, they still gener-
ate intermediate acoustic features as the condition
of the audio generation module, essentially similar
to other methods. A fully end-to-end model that
maps original text or phonemes directly to speech
waveforms would greatly simplify the TTS pipeline.
Apply the deep learning methods used in other tasks
to TTS First, as a generation task, speech synthesis
and image generation have great similarities. Many
methods used in TTS are inspired by image gener-
ation methods. For example, MelNet [221] regards
the speech spectrogram as an image, and synthe-
sizes the mel-spectrogram using a 2-D multi-scale
autoregressive generation method. The methods of
generating images and speech with specific styles
are also very similar. Second, the alignment method
in the acoustic model can learn from the methods
in NMT and ASR, which are also Seq2Seq mod-
els. Third, as recognition and generation are dual
tasks, multi-task learning can be adopted to com-
bine recognition and generation models to improve
each other and reduce the demand for labeled data
during training. In addition to combining TTS and
ASR [123, 173, 208, 209, 232], it is also possible
to combine speaker recognition with multi-speaker
TTS [30, 209], and combine speech emotion recogni-
tion with emotional speech synthesis [120] for dual
training.
8 Conclusion
The research of end-to-end TTS technology based on
deep learning has become a hot topic in the field of
artificial intelligence. In order to give researchers a clear understanding of the latest TTS paradigm, this paper summarizes in detail the latest technologies used in each module of the TTS system, classifies the methods according to their characteristics, and compares their advantages and disadvantages. Furthermore,
the public speech corpus for various TTS tasks and the
commonly used subjective and objective speech qual-
ity evaluation methods are also summarized. Finally,
some suggestions for the future development direction
of TTS are put forward.
References
1. Adiga N, Prasanna S (2019) Acoustic features
modelling for statistical parametric speech synthe-
sis: a review. IETE Technical Review 36(2):130–
149
2. Aggarwal V, Cotescu M, Prateek N, Lorenzo-
Trueba J, Barra-Chicote R (2020) Using vaes and
normalizing flows for one-shot text-to-speech syn-
thesis of expressive speech. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6179–6183
3. Arık SÖ, Chrzanowski M, Coates A, Diamos G,
Gibiansky A, Kang Y, Li X, Miller J, Ng A,
Raiman J, et al. (2017) Deep voice: Real-time neu-
ral text-to-speech. In: International Conference on
Machine Learning, PMLR, pp 195–204
4. Arik SO, Chen J, Peng K, Ping W, Zhou Y (2018)
Neural voice cloning with a few samples. arXiv
preprint arXiv:180206006
5. Aroon A, Dhonde S (2015) Statistical parametric
speech synthesis: A review. In: 2015 IEEE 9th In-
ternational Conference on Intelligent Systems and
Control (ISCO), IEEE, pp 1–5
6. Atal BS, Hanauer SL (1971) Speech analysis and
synthesis by linear prediction of the speech wave.
The journal of the acoustical society of America
50(2B):637–655
7. Bahdanau D, Cho K, Bengio Y (2014) Neural ma-
chine translation by jointly learning to align and
translate. arXiv preprint arXiv:14090473
8. Baker D (2017) Chinese standard mandarin
speech corpus
9. Bänziger T, Mortillaro M, Scherer KR (2012) In-
troducing the geneva multimodal expression cor-
pus for experimental research on emotion percep-
tion. Emotion 12(5):1161
10. Battenberg E, Skerry-Ryan R, Mariooryad S,
Stanton D, Kao D, Shannon M, Bagby T (2020)
Location-relative attention mechanisms for robust
long-form speech synthesis. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6194–6198
11. Baum LE, Petrie T, Soules G, Weiss N (1970) A
maximization technique occurring in the statisti-
cal analysis of probabilistic functions of markov
chains. The annals of mathematical statistics
41(1):164–171
12. Beliaev S, Rebryk Y, Ginsburg B (2020) Talknet:
Fully-convolutional non-autoregressive speech
synthesis model. arXiv preprint arXiv:200505514
13. Bengio S, Vinyals O, Jaitly N, Shazeer N
(2015) Scheduled sampling for sequence prediction
with recurrent neural networks. arXiv preprint
arXiv:150603099
14. Bi M, Lu H, Zhang S, Lei M, Yan Z (2018)
Deep feed-forward sequential memory networks
for speech synthesis. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 4794–4798
15. Bian Y, Chen C, Kang Y, Pan Z (2019) Multi-
reference tacotron by intercross training for style
disentangling, transfer and control in speech syn-
thesis. arXiv preprint arXiv:190402373
16. Bińkowski M, Sutherland DJ, Arbel M, Gretton
A (2018) Demystifying mmd gans. arXiv preprint
arXiv:180101401
17. Bińkowski M, Donahue J, Dieleman S, Clark A,
Elsen E, Casagrande N, Cobo LC, Simonyan K
(2019) High fidelity speech synthesis with adver-
sarial networks. arXiv preprint arXiv:190911646
18. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier
WF, Weiss B (2005) A database of german emo-
tional speech. In: Ninth European Conference on
Speech Communication and Technology
19. Busso C, Bulut M, Lee CC, Kazemzadeh A,
Mower E, Kim S, Chang JN, Lee S, Narayanan
SS (2008) Iemocap: Interactive emotional dyadic
motion capture database. Language resources and
evaluation 42(4):335–359
20. Busso C, Parthasarathy S, Burmania A, Abdel-
Wahab M, Sadoughi N, Provost EM (2016) Msp-
improv: An acted corpus of dyadic interactions to
study emotion perception. IEEE Transactions on
Affective Computing 8(1):67–80
21. Cai Z, Yang Y, Zhang C, Qin X, Li M (2019) Poly-
phone disambiguation for mandarin chinese using
conditional neural network with multi-level em-
bedding features. arXiv preprint arXiv:190701749
22. Cai Z, Zhang C, Li M (2020) From speaker ver-
ification to multispeaker speech synthesis, deep
transfer with feedback constraint. arXiv preprint
arXiv:200504587
23. Cao H, Cooper DG, Keutmann MK, Gur
RC, Nenkova A, Verma R (2014) Crema-
d: Crowd-sourced emotional multimodal actors
dataset. IEEE transactions on affective comput-
ing 5(4):377–390
24. Capes T, Coles P, Conkie A, Golipour L, Had-
jitarkhani A, Hu Q, Huddleston N, Hunt M, Li
J, Neeracher M, et al. (2017) Siri on-device deep
learning-guided unit selection text-to-speech sys-
tem. In: INTERSPEECH, pp 4011–4015
25. Chaudhari S, Polatkan G, Ramanath R, Mithal
V (2019) An attentive survey of attention models.
arXiv preprint arXiv:190402874
26. Chen LH, Raitio T, Valentini-Botinhao C, Ling
ZH, Yamagishi J (2015) A deep generative ar-
chitecture for postfiltering in statistical para-
metric speech synthesis. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing
23(11):2003–2014
27. Chen N, Zhang Y, Zen H, Weiss RJ, Norouzi
M, Chan W (2020) Wavegrad: Estimating gra-
dients for waveform generation. arXiv preprint
arXiv:200900713
28. Chen X, Duan Y, Houthooft R, Schulman J,
Sutskever I, Abbeel P (2016) Infogan: Inter-
pretable representation learning by information
maximizing generative adversarial nets. arXiv
preprint arXiv:160603657
29. Chen X, Kingma DP, Salimans T, Duan Y,
Dhariwal P, Schulman J, Sutskever I, Abbeel
P (2016) Variational lossy autoencoder. arXiv
preprint arXiv:161102731
30. Chen Y, Assael Y, Shillingford B, Budden D, Reed
S, Zen H, Wang Q, Cobo LC, Trask A, Laurie
B, et al. (2018) Sample efficient adaptive text-to-
speech. arXiv preprint arXiv:180910460
31. Chorowski J, Bahdanau D, Serdyuk D, Cho
K, Bengio Y (2015) Attention-based mod-
els for speech recognition. arXiv preprint
arXiv:150607503
32. Chorowski J, Weiss RJ, Bengio S, van den Oord
A (2019) Unsupervised speech representation
learning using wavenet autoencoders. IEEE/ACM
transactions on audio, speech, and language pro-
cessing 27(12):2041–2053
33. Chou Jc, Yeh Cc, Lee Hy, Lee Ls (2018) Multi-
target voice conversion without parallel data by
adversarially learning disentangled audio repre-
sentations. arXiv preprint arXiv:180402812
34. Chou Jc, Yeh Cc, Lee Hy (2019) One-shot voice
conversion by separating speaker and content rep-
resentations with instance normalization. arXiv
preprint arXiv:190405742
35. Chu M, Qian Y (2001) Locating boundaries for
prosodic constituents in unrestricted mandarin
texts. In: International Journal of Computational
Linguistics & Chinese Language Processing, Vol-
ume 6, Number 1, February 2001: Special Issue
on Natural Language Processing Researches in
MSRA, pp 61–82
36. Chu M, Peng H, Zhao Y, Niu Z, Chang E
(2003) Microsoft mulan-a bilingual tts system. In:
2003 IEEE International Conference on Acous-
tics, Speech, and Signal Processing, 2003. Pro-
ceedings.(ICASSP’03)., IEEE, vol 1, pp I–I
37. Chu W, Alwan A (2009) Reducing f0 frame er-
ror of f0 tracking algorithms under noisy condi-
tions with an unvoiced/voiced classification fron-
tend. In: 2009 IEEE International Conference on
Acoustics, Speech and Signal Processing, IEEE,
pp 3969–3972
38. Chung J, Gulcehre C, Cho K, Bengio Y (2014)
Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv preprint
arXiv:14123555
39. Chung YA, Wang Y, Hsu WN, Zhang Y, Skerry-
Ryan R (2019) Semi-supervised training for im-
proving data efficiency in end-to-end speech syn-
thesis. In: ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6940–6944
40. Conkie A, Finch A (2020) Scalable multilingual
frontend for tts. In: ICASSP 2020-2020 IEEE In-
ternational Conference on Acoustics, Speech and
Signal Processing (ICASSP), IEEE, pp 6684–6688
41. Cooper E, Lai CI, Yasuda Y, Fang F, Wang
X, Chen N, Yamagishi J (2020) Zero-shot multi-
speaker text-to-speech with state-of-the-art neural
speaker embeddings. In: ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 6184–
6188
42. Dauphin YN, Fan A, Auli M, Grangier D (2017)
Language modeling with gated convolutional net-
works. In: International conference on machine
learning, PMLR, pp 933–941
43. Deng GF, Tsai CH, Ku T (2018) The historical re-
view and current trends in speech synthesis by bib-
liometric approach. In: International Conference
on Frontier Computing, Springer, pp 1966–1978
44. Devlin J, Chang MW, Lee K, Toutanova K (2018)
Bert: Pre-training of deep bidirectional transform-
ers for language understanding. arXiv preprint
arXiv:181004805
45. Dinh L, Krueger D, Bengio Y (2014) Nice: Non-
linear independent components estimation. arXiv
preprint arXiv:14108516
46. Dinh L, Sohl-Dickstein J, Bengio S (2016) Den-
sity estimation using real nvp. arXiv preprint
arXiv:160508803
47. Donahue J, Dieleman S, Bińkowski M, Elsen E,
Simonyan K (2020) End-to-end adversarial text-
to-speech. arXiv preprint arXiv:200603575
48. Dunbar E, Algayres R, Karadayi J, Bernard M,
Benjumea J, Cao XN, Miskic L, Dugrain C, On-
del L, Black AW, et al. (2019) The zero re-
source speech challenge 2019: Tts without t. arXiv
preprint arXiv:190411469
49. Elias I, Zen H, Shen J, Zhang Y, Jia Y,
Weiss R, Wu Y (2020) Parallel tacotron: Non-
autoregressive and controllable tts. arXiv preprint
arXiv:201011439
50. Ellinas N, Vamvoukakis G, Markopoulos K, Cha-
lamandaris A, Maniati G, Kakoulidis P, Raptis S,
Sung JS, Park H, Tsiakoulis P (2020) High qual-
ity streaming speech synthesis with low, sentence-
length-independent latency. Proc Interspeech 2020
pp 2022–2026
51. Fan Y, Qian Y, Xie FL, Soong FK (2014) Tts
synthesis with bidirectional lstm based recurrent
neural networks. In: Fifteenth annual conference
of the international speech communication associ-
ation
52. Fernandez R, Rendel A, Ramabhadran B, Hoory
R (2013) F0 contour prediction with a deep be-
lief network-gaussian process hybrid model. In:
2013 IEEE International Conference on Acoustics,
Speech and Signal Processing, IEEE, pp 6885–
6889
53. Fernandez R, Rendel A, Ramabhadran B, Hoory
R (2014) Prosody contour prediction with long
short-term memory, bi-directional, deep recurrent
neural networks. In: Fifteenth Annual Conference
of the International Speech Communication Asso-
ciation
54. Gatys LA, Ecker AS, Bethge M (2015) A neu-
ral algorithm of artistic style. arXiv preprint
arXiv:150806576
55. Gatys LA, Ecker AS, Bethge M (2016) Image style
transfer using convolutional neural networks. In:
Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 2414–2423
56. Gehring J, Auli M, Grangier D, Yarats D,
Dauphin YN (2017) Convolutional sequence to se-
quence learning. In: International Conference on
Machine Learning, PMLR, pp 1243–1252
57. Gibiansky A, Arik SÖ, Diamos GF, Miller J, Peng
K, Ping W, Raiman J, Zhou Y (2017) Deep voice
2: Multi-speaker neural text-to-speech. In: NIPS
58. Gneiting T, Raftery AE (2007) Strictly proper
scoring rules, prediction, and estimation. Jour-
nal of the American statistical Association
102(477):359–378
59. Gonzalvo X, Tazari S, Chan Ca, Becker M, Gutkin
A, Silen H (2016) Recent advances in google real-
time hmm-driven unit selection synthesizer
60. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu
B, Warde-Farley D, Ozair S, Courville A, Bengio
Y (2014) Generative adversarial networks. arXiv
preprint arXiv:14062661
61. Graves A (2013) Generating sequences with
recurrent neural networks. arXiv preprint
arXiv:13080850
62. Griffin D, Lim J (1984) Signal estimation from
modified short-time fourier transform. IEEE
Transactions on acoustics, speech, and signal pro-
cessing 32(2):236–243
63. Gritsenko AA, Salimans T, Berg Rvd, Snoek J,
Kalchbrenner N (2020) A spectral energy dis-
tance for parallel speech synthesis. arXiv preprint
arXiv:200801160
64. Gu J, Wang Y, Chen Y, Cho K, Li VO (2018)
Meta-learning for low-resource neural machine
translation. arXiv preprint arXiv:180808437
65. Guo H, Soong FK, He L, Xie L (2019) Exploiting
syntactic features in a parsed tree to improve end-
to-end tts. arXiv preprint arXiv:190404764
66. Guo H, Soong FK, He L, Xie L (2019) A new gan-
based end-to-end tts training algorithm. arXiv
preprint arXiv:190404775
67. Guo T, Wen C, Jiang D, Luo N, Zhang R, Zhao
S, Li W, Gong C, Zou W, Han K, et al. (2020)
Didispeech: A large scale mandarin speech corpus.
arXiv preprint arXiv:201009275
68. Guo W, Yang H, Gan Z (2018) A dnn-based
mandarin-tibetan cross-lingual speech synthesis.
In: 2018 Asia-Pacific Signal and Information Pro-
cessing Association Annual Summit and Confer-
ence (APSIPA ASC), IEEE, pp 1702–1707
69. Gururani S, Gupta K, Shah D, Shakeri Z, Pinto J
(2019) Prosody transfer in neural text to speech
using global pitch and loudness features. arXiv
preprint arXiv:191109645
70. Hayashi T, Watanabe S, Toda T, Takeda K, Tosh-
niwal S, Livescu K (2019) Pre-trained text embed-
dings for enhanced text-to-speech synthesis. In:
INTERSPEECH, pp 4430–4434
71. He M, Deng Y, He L (2019) Robust sequence-to-
sequence acoustic modeling with stepwise mono-
tonic attention for neural tts. arXiv preprint
arXiv:190600672
72. Heusel M, Ramsauer H, Unterthiner T, Nessler B,
Hochreiter S (2017) Gans trained by a two time-
scale update rule converge to a local nash equilib-
rium. arXiv preprint arXiv:170608500
73. Ho J, Jain A, Abbeel P (2020) Denoising
diffusion probabilistic models. arXiv preprint
arXiv:200611239
74. Hochreiter S, Schmidhuber J (1997) Long short-
term memory. Neural computation 9(8):1735–
1780
75. Honnet PE, Lazaridis A, Garner PN, Yamag-
ishi J (2017) The siwis french speech synthesis
database: design and recording of a high quality
french database for speech synthesis. Tech. rep.,
Idiap
76. Hsu WN, Zhang Y, Weiss RJ, Chung YA, Wang
Y, Wu Y, Glass J (2019) Disentangling corre-
lated speaker and noise for speech synthesis via
data augmentation and adversarial factorization.
In: ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 5901–5905
77. Huang FL, Lin JH, Lin XW (2010) Disambigua-
tion for polyphones of chinese based on two-pass
unified approach. In: 2010 International Computer
Symposium (ICS2010), IEEE, pp 603–607
78. Huang Z, Xu W, Yu K (2015) Bidirectional lstm-
crf models for sequence tagging. arXiv preprint
arXiv:150801991
79. Huang Z, Li H, Lei M (2020) Devicetts: A small-
footprint, fast, stable network for on-device text-
to-speech. arXiv preprint arXiv:201015311
80. Hunt AJ, Black AW (1996) Unit selection in a con-
catenative speech synthesis system using a large
speech database. In: 1996 IEEE International
Conference on Acoustics, Speech, and Signal Pro-
cessing Conference Proceedings, IEEE, vol 1, pp
373–376
81. Huybrechts G, Merritt T, Comini G, Perz B,
Shah R, Lorenzo-Trueba J (2020) Low-resource
expressive text-to-speech using data augmenta-
tion. arXiv preprint arXiv:201105707
82. Imai S, Sumita K, Furuichi C (1983) Mel log spec-
trum approximation (mlsa) filter for speech syn-
thesis. Electronics and Communications in Japan
(Part I: Communications) 66(2):10–18
83. Ito K, Johnson L (2017) The lj speech dataset.
https://keithito.com/LJ-Speech-Dataset/
84. Ji S, Luo J, Yang X (2020) A comprehensive sur-
vey on deep music generation: Multi-level repre-
sentations, algorithms, evaluations, and future di-
rections. arXiv preprint arXiv:201106801
85. Jia Y, Zhang Y, Weiss RJ, Wang Q, Shen J,
Ren F, Chen Z, Nguyen P, Pang R, Moreno IL,
et al. (2018) Transfer learning from speaker veri-
fication to multispeaker text-to-speech synthesis.
arXiv preprint arXiv:180604558
86. Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L,
Wang F, Liu Q (2019) Tinybert: Distilling bert for
natural language understanding. arXiv preprint
arXiv:190910351
87. Jin Z, Finkelstein A, Mysore GJ, Lu J (2018)
Fftnet: A real-time speaker-dependent neural
vocoder. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 2251–2255
88. Johnson J, Alahi A, Fei-Fei L (2016) Percep-
tual losses for real-time style transfer and super-
resolution. In: European conference on computer
vision, Springer, pp 694–711
89. Kalchbrenner N, Elsen E, Simonyan K, Noury S,
Casagrande N, Lockhart E, Stimberg F, Oord A,
Dieleman S, Kavukcuoglu K (2018) Efficient neu-
ral audio synthesis. In: International Conference
on Machine Learning, PMLR, pp 2410–2419
90. Kalita J, Deb N (2017) Emotional text to speech
synthesis: A review. International Journal of Ad-
vanced Research in Computer and Communica-
tion Engineering 6(4):428–430
91. Kameoka H, Kaneko T, Tanaka K, Hojo N (2018)
Stargan-vc: Non-parallel many-to-many voice con-
version using star generative adversarial net-
works. In: 2018 IEEE Spoken Language Technol-
ogy Workshop (SLT), IEEE, pp 266–273
92. Kang M, Hong Y (2011) Formant synthesis of
haegeum: a sound analysis/synthesis system us-
ing cepstral envelope. In: 2011 International Con-
ference on Information Science and Applications,
IEEE, pp 1–8
93. Karaali O, Corrigan G, Gerson I, Massey N
(1998) Text-to-speech conversion with neural net-
works: A recurrent tdnn approach. arXiv preprint
cs/9811032
94. Kawahara H, Morise M, Takahashi T, Nisimura R,
Irino T, Banno H (2008) Tandem-straight: A tem-
porally stable power spectral representation for
periodic signals and applications to interference-
free spectrum, f0, and aperiodicity estimation. In:
2008 IEEE International Conference on Acoustics,
Speech and Signal Processing, IEEE, pp 3933–
3936
95. Kearns J (2014) Librivox: Free public domain au-
diobooks. Reference Reviews
96. Kenter T, Wan V, Chan CA, Clark R, Vit J (2019)
Chive: Varying prosody in speech synthesis with
a linguistically driven dynamic hierarchical condi-
tional variational network. In: International Con-
ference on Machine Learning, PMLR, pp 3331–
3340
97. Khorinphan C, Phansamdaeng S, Saiyod S (2014)
Thai speech synthesis with emotional tone: Based
on formant synthesis for home robot. In: 2014
Third ICT International Student Project Confer-
ence (ICT-ISPC), IEEE, pp 111–114
98. Kim J, Kim S, Kong J, Yoon S (2020) Glow-tts:
A generative flow for text-to-speech via monotonic
alignment search. arXiv preprint arXiv:200511129
99. Kim S, Lee SG, Song J, Kim J, Yoon S (2018)
Flowavenet: A generative flow for raw audio. arXiv
preprint arXiv:181102155
100. Kim Y, Rush AM (2016) Sequence-level knowl-
edge distillation. arXiv preprint arXiv:160607947
101. Kingma DP, Dhariwal P (2018) Glow: Genera-
tive flow with invertible 1x1 convolutions. arXiv
preprint arXiv:180703039
102. Kingma DP, Welling M (2013) Auto-encoding
variational bayes. arXiv preprint arXiv:13126114
103. Kingma DP, Salimans T, Jozefowicz R, Chen X,
Sutskever I, Welling M (2016) Improving varia-
tional inference with inverse autoregressive flow.
arXiv preprint arXiv:160604934
104. Kisler T, Reichel U, Schiel F (2017) Multilingual
processing of speech via web services. Computer
Speech & Language 45:326–347
105. Klatt DH (1980) Software for a cascade/parallel
formant synthesizer. the Journal of the Acoustical
Society of America 67(3):971–995
106. Klatt DH (1987) Review of text-to-speech conver-
sion for english. The Journal of the Acoustical So-
ciety of America 82(3):737–793
107. Klein D, Manning CD, et al. (2003) Fast exact in-
ference with a factored model for natural language
parsing. Advances in neural information process-
ing systems pp 3–10
108. Kominek J, Black AW, Ver V (2003) Cmu arctic
databases for speech synthesis
109. Kong J, Kim J, Bae J (2020) Hifi-gan: Generative
adversarial networks for efficient and high fidelity
speech synthesis. arXiv preprint arXiv:201005646
110. Kong Z, Ping W, Huang J, Zhao K, Catanzaro
B (2020) Diffwave: A versatile diffusion model for
audio synthesis. arXiv preprint arXiv:200909761
111. Kriman S, Beliaev S, Ginsburg B, Huang J,
Kuchaiev O, Lavrukhin V, Leary R, Li J, Zhang Y
(2020) Quartznet: Deep automatic speech recogni-
tion with 1d time-channel separable convolutions.
In: ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6124–6128
112. Kubichek R (1993) Mel-cepstral distance mea-
sure for objective speech quality assessment. In:
Proceedings of IEEE Pacific Rim Conference on
Communications Computers and Signal Process-
ing, IEEE, vol 1, pp 125–128
113. Kumar K, Kumar R, de Boissiere T, Gestin L,
Teoh WZ, Sotelo J, de Brébisson A, Bengio Y,
Courville A (2019) Melgan: Generative adversar-
ial networks for conditional waveform synthesis.
arXiv preprint arXiv:191006711
114. Kwon O, Jang I, Ahn C, Kang HG (2019) An effec-
tive style token weight control technique for end-
to-end emotional speech synthesis. IEEE Signal
Processing Letters 26(9):1383–1387
115. Kwon O, Song E, Kim JM, Kang HG (2019) Effec-
tive parameter estimation methods for an excitnet
model in generative text-to-speech systems. arXiv
preprint arXiv:190508486
116. Lamb A, Goyal A, Zhang Y, Zhang S, Courville
A, Bengio Y (2016) Professor forcing: A new al-
gorithm for training recurrent networks. arXiv
preprint arXiv:161009038
117. Łańcucki A (2020) Fastpitch: Parallel text-to-
speech with pitch prediction. arXiv preprint
arXiv:200606873
118. Li N, Liu S, Liu Y, Zhao S, Liu M (2019) Neu-
ral speech synthesis with transformer network. In:
Proceedings of the AAAI Conference on Artificial
Intelligence, vol 33, pp 6706–6713
119. Li N, Liu Y, Wu Y, Liu S, Zhao S, Liu M (2020)
Robutrans: A robust transformer-based text-to-
speech model. In: Proceedings of the AAAI Con-
ference on Artificial Intelligence, vol 34, pp 8228–
8235
120. Li T, Yang S, Xue L, Xie L (2021) Controllable
emotion transfer for end-to-end speech synthe-
sis. In: 2021 12th International Symposium on
Chinese Spoken Language Processing (ISCSLP),
IEEE, pp 1–5
121. Lim D, Jang W, Park H, Kim B, Yoon J, et al.
(2020) Jdi-t: Jointly trained duration informed
transformer for text-to-speech without explicit
alignment. arXiv preprint arXiv:200507799
122. Ling J (2017) Coarse-to-fine attention models for
document summarization. PhD thesis
123. Liu DR, Yang CY, Wu SL, Lee HY (2018) Im-
proving unsupervised style transfer in end-to-end
speech synthesis with end-to-end speech recogni-
tion. In: 2018 IEEE Spoken Language Technology
Workshop (SLT), IEEE, pp 640–647
124. Liu P, Wu X, Kang S, Li G, Su D, Yu D (2019)
Maximizing mutual information for tacotron.
arXiv preprint arXiv:190901145
125. Liu R, Yang J, Liu M (2019) A new end-to-
end long-time speech synthesis system based on
tacotron2. In: Proceedings of the 2019 Interna-
tional Symposium on Signal Processing Systems,
pp 46–50
126. Liu R, Sisman B, Li H (2020) Graphspeech:
Syntax-aware graph attention network for neural
speech synthesis. arXiv preprint arXiv:201012423
127. Liu R, Sisman B, Li J, Bao F, Gao G, Li
H (2020) Teacher-student training for robust
tacotron-based tts. In: ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 6274–
6278
128. Livingstone SR, Russo FA (2018) The ryerson
audio-visual database of emotional speech and
song (ravdess): A dynamic, multimodal set of fa-
cial and vocal expressions in north american en-
glish. PloS one 13(5):e0196391
129. Loizou PC (2013) Speech enhancement: theory
and practice. CRC press
130. Lu C, Zhang P, Yan Y (2019) Self-attention based
prosodic boundary prediction for chinese speech
synthesis. In: ICASSP 2019-2019 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 7035–7039
131. Lu H, King S, Watts O (2013) Combining a vector
space representation of linguistic context with a
deep neural network for text-to-speech synthesis.
In: Eighth ISCA Workshop on Speech Synthesis
132. Ma M, Huang L, Xiong H, Zheng R, Liu K, Zheng
B, Zhang C, He Z, Liu H, Li X, et al. (2018) Stacl:
Simultaneous translation with implicit anticipa-
tion and controllable latency using prefix-to-prefix
framework. arXiv preprint arXiv:181008398
133. Ma M, Zheng B, Liu K, Zheng R, Liu H, Peng K,
Church K, Huang L (2019) Incremental text-to-
speech synthesis with prefix-to-prefix framework.
arXiv preprint arXiv:191102750
134. Ma S, Mcduff D, Song Y (2018) Neural tts styl-
ization with adversarial and collaborative games.
In: International Conference on Learning Repre-
sentations
135. McAuliffe M, Socolof M, Mihuc S, Wagner M, Son-
deregger M (2017) Montreal forced aligner: Train-
able text-speech alignment using kaldi. In: Inter-
speech, vol 2017, pp 498–502
136. Mehri S, Kumar K, Gulrajani I, Kumar R, Jain
S, Sotelo J, Courville A, Bengio Y (2016) Sam-
plernn: An unconditional end-to-end neural audio
generation model. arXiv preprint arXiv:161207837
137. Merboldt A, Zeyer A, Schlüter R, Ney H (2019)
An analysis of local monotonic attention variants.
In: INTERSPEECH, pp 1398–1402
138. Miao C, Liang S, Chen M, Ma J, Wang S, Xiao J
(2020) Flow-tts: A non-autoregressive network for
text to speech based on flow. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 7209–7213
139. Mikolov T, Chen K, Corrado G, Dean J (2013) Ef-
ficient estimation of word representations in vector
space. arXiv preprint arXiv:13013781
140. Minematsu N, Kobayashi S, Shimizu S, Hirose K
(2012) Improved prediction of japanese word ac-
cent sandhi using crf. In: Thirteenth Annual Con-
ference of the International Speech Communica-
tion Association
141. Morrison M, Jin Z, Bryan NJ, Mysore GJ
(2020) Controllable neural prosody synthesis.
arXiv preprint arXiv:200803388
142. Moss HB, Aggarwal V, Prateek N, González
J, Barra-Chicote R (2020) Boffin tts: Few-shot
speaker adaptation by bayesian optimization. In:
ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 7639–7643
143. Moulines E, Charpentier F (1990) Pitch-
synchronous waveform processing techniques for
text-to-speech synthesis using diphones. Speech
communication 9(5-6):453–467
144. Murray IR, Arnott JL, Rohwer EA (1996) Emo-
tional stress in synthetic speech: Progress and
future directions. Speech Communication 20(1-
2):85–91
145. Nachmani E, Polyak A, Taigman Y, Wolf L (2018)
Fitting new speakers based on a short untran-
scribed sample. In: International Conference on
Machine Learning, PMLR, pp 3683–3691
146. Nakatani T, Amano S, Irino T, Ishizuka K, Kondo
T (2008) A method for fundamental frequency es-
timation and voicing decision: Application to in-
fant utterances recorded in real acoustical envi-
ronments. Speech Communication 50(3):203–214
147. Nekvinda T, Dušek O (2020) One model, many
languages: Meta-learning for multilingual text-to-
speech. arXiv preprint arXiv:200800768
148. Ning Y, He S, Wu Z, Xing C, Zhang LJ (2019)
A review of deep learning based speech synthesis.
Applied Sciences 9(19):4050
149. Nose T (2016) Efficient implementation of global
variance compensation for parametric speech
synthesis. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 24(10):1694–
1704
150. Oord A, Li Y, Babuschkin I, Simonyan K,
Vinyals O, Kavukcuoglu K, Driessche G, Lock-
hart E, Cobo L, Stimberg F, et al. (2018) Par-
allel wavenet: Fast high-fidelity speech synthesis.
In: International conference on machine learning,
PMLR, pp 3918–3926
151. Oord Avd, Dieleman S, Zen H, Simonyan K,
Vinyals O, Graves A, Kalchbrenner N, Senior
A, Kavukcuoglu K (2016) Wavenet: A gen-
erative model for raw audio. arXiv preprint
arXiv:160903499
152. Oord Avd, Vinyals O, Kavukcuoglu K (2017) Neu-
ral discrete representation learning. arXiv preprint
arXiv:171100937
153. Paine TL, Khorrami P, Chang S, Zhang Y, Ra-
machandran P, Hasegawa-Johnson MA, Huang TS
(2016) Fast wavenet generation algorithm. arXiv
preprint arXiv:161109482
154. Pan H, Li X, Huang Z (2019) A mandarin prosodic
boundary prediction model based on multi-task
learning. In: INTERSPEECH, pp 4485–4488
155. Pan J, Yin X, Zhang Z, Liu S, Zhang Y, Ma
Z, Wang Y (2020) A unified sequence-to-sequence
front-end model for mandarin text-to-speech syn-
thesis. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6689–6693
156. Panayotov V, Chen G, Povey D, Khudanpur S
(2015) Librispeech: an asr corpus based on public
domain audio books. In: 2015 IEEE international
conference on acoustics, speech and signal process-
ing (ICASSP), IEEE, pp 5206–5210
157. Park K, Lee S (2020) g2pm: A neural grapheme-
to-phoneme conversion package for mandarin chi-
nese based on a new open benchmark dataset.
arXiv preprint arXiv:200403136
158. Park K, Mulc T (2019) Css10: A collection of sin-
gle speaker speech datasets for 10 languages. arXiv
preprint arXiv:190311269
159. Peng K, Ping W, Song Z, Zhao K (2020) Non-
autoregressive neural text-to-speech. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 7586–7598
160. Ping W, Peng K, Gibiansky A, Arik SO, Kannan
A, Narang S, Raiman J, Miller J (2017) Deep voice
3: Scaling text-to-speech with convolutional se-
quence learning. arXiv preprint arXiv:171007654
161. Ping W, Peng K, Chen J (2018) Clarinet: Paral-
lel wave generation in end-to-end text-to-speech.
arXiv preprint arXiv:180707281
162. Ping W, Peng K, Zhao K, Song Z (2020) Wave-
flow: A compact flow-based model for raw audio.
In: International Conference on Machine Learn-
ing, PMLR, pp 7706–7716
163. Platanios EA, Sachan M, Neubig G, Mitchell T
(2018) Contextual parameter generation for uni-
versal neural machine translation. arXiv preprint
arXiv:180808493
164. Prenger R, Valle R, Catanzaro B (2019) Waveg-
low: A flow-based generative network for speech
synthesis. In: ICASSP 2019-2019 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 3617–3621
165. Qian K, Zhang Y, Chang S, Yang X, Hasegawa-
Johnson M (2019) Autovc: Zero-shot voice style
transfer with only autoencoder loss. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 5210–5219
166. Qian K, Zhang Y, Chang S, Hasegawa-Johnson
M, Cox D (2020) Unsupervised speech decompo-
sition via triple information bottleneck. In: Inter-
national Conference on Machine Learning, PMLR,
pp 7836–7846
167. Qian Y, Wu Z, Ma X, Soong F (2010) Automatic
prosody prediction and detection with conditional
random field (crf) models. In: 2010 7th Interna-
tional Symposium on Chinese Spoken Language
Processing, IEEE, pp 135–138
168. Qian Y, Fan Y, Hu W, Soong FK (2014) On the
training aspects of deep neural network (dnn) for
parametric tts synthesis. In: 2014 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 3829–3833
169. Radford A, Wu J, Child R, Luan D, Amodei D,
Sutskever I (2019) Language models are unsuper-
vised multitask learners. OpenAI blog 1(8):9
170. Raffel C, Luong MT, Liu PJ, Weiss RJ, Eck D
(2017) Online and linear-time attention by enforc-
ing monotonic alignments. In: International Con-
ference on Machine Learning, PMLR, pp 2837–
2846
171. Recommendation I (2001) 1534-1,“method for the
subjective assessment of intermediate sound qual-
ity (mushra)”. International Telecommunications
Union, Geneva, Switzerland 2
172. Ren Y, Ruan Y, Tan X, Qin T, Zhao S,
Zhao Z, Liu TY (2019) Fastspeech: Fast, robust
and controllable text to speech. arXiv preprint
arXiv:190509263
173. Ren Y, Tan X, Qin T, Zhao S, Zhao Z, Liu
TY (2019) Almost unsupervised text to speech
and automatic speech recognition. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 5410–5419
174. Ren Y, Hu C, Qin T, Zhao S, Zhao Z,
Liu TY (2020) Fastspeech 2: Fast and high-
quality end-to-end text-to-speech. arXiv preprint
arXiv:200604558
175. Rezende D, Mohamed S (2015) Variational in-
ference with normalizing flows. In: International
Conference on Machine Learning, PMLR, pp
1530–1538
176. Ribeiro F, Florêncio D, Zhang C, Seltzer M (2011)
Crowdmos: An approach for crowdsourcing mean
opinion score studies. In: 2011 IEEE international
conference on acoustics, speech and signal process-
ing (ICASSP), IEEE, pp 2416–2419
177. Rix AW, Beerends JG, Hollier MP, Hekstra
AP (2001) Perceptual evaluation of speech qual-
ity (pesq)-a new method for speech quality as-
sessment of telephone networks and codecs. In:
2001 IEEE International Conference on Acoustics,
Speech, and Signal Processing. Proceedings (Cat.
No. 01CH37221), IEEE, vol 2, pp 749–752
178. Saito Y, Takamichi S, Saruwatari H (2017) Sta-
tistical parametric speech synthesis incorporat-
ing generative adversarial networks. IEEE/ACM
Transactions on Audio, Speech, and Language
Processing 26(1):84–96
179. Schröder M (2001) Emotional speech synthesis:
A review. In: Seventh European Conference on
Speech Communication and Technology
180. Sejdinovic D, Sriperumbudur B, Gretton A, Fuku-
mizu K (2013) Equivalence of distance-based and
rkhs-based statistics in hypothesis testing. The
Annals of Statistics pp 2263–2291
181. Sennrich R, Haddow B, Birch A (2015) Neural
machine translation of rare words with subword
units. arXiv preprint arXiv:150807909
182. Series B (2014) Method for the subjective assess-
ment of intermediate quality level of audio sys-
tems. International Telecommunication Union Ra-
diocommunication Assembly
183. Serrà J, Pascual S, Segura C (2019) Blow:
a single-scale hyperconditioned flow for non-
parallel raw-audio voice conversion. arXiv preprint
arXiv:190600794
184. Shahriari B, Swersky K, Wang Z, Adams RP,
De Freitas N (2015) Taking the human out of the
loop: A review of bayesian optimization. Proceed-
ings of the IEEE 104(1):148–175
185. Shan C, Xie L, Yao K (2016) A bi-directional lstm
approach for polyphone disambiguation in man-
darin chinese. In: 2016 10th International Sym-
posium on Chinese Spoken Language Processing
(ISCSLP), IEEE, pp 1–5
186. Shankar S, Garg S, Sarawagi S (2018) Surpris-
ingly easy hard-attention for sequence to sequence
learning. In: Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Pro-
cessing, pp 640–645
187. Shaw P, Uszkoreit J, Vaswani A (2018) Self-
attention with relative position representations.
arXiv preprint arXiv:180302155
188. Shen C, Vogelstein JT (2020) The exact equiva-
lence of distance and kernel methods in hypothesis
testing. AStA Advances in Statistical Analysis pp
1–19
189. Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N,
Yang Z, Chen Z, Zhang Y, Wang Y, Skerrv-Ryan
R, et al. (2018) Natural tts synthesis by condition-
ing wavenet on mel spectrogram predictions. In:
2018 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 4779–4783
190. Shen J, Jia Y, Chrzanowski M, Zhang Y, Elias
I, Zen H, Wu Y (2020) Non-attentive tacotron:
Robust and controllable neural tts synthesis in-
cluding unsupervised duration modeling. arXiv
preprint arXiv:201004301
191. Shi Y, Bu H, Xu X, Zhang S, Li M (2020) Aishell-
3: A multi-speaker mandarin tts corpus and the
baselines. arXiv preprint arXiv:201011567
192. Skerry-Ryan R, Battenberg E, Xiao Y, Wang Y,
Stanton D, Shor J, Weiss R, Clark R, Saurous
RA (2018) Towards end-to-end prosody trans-
fer for expressive speech synthesis with tacotron.
In: international conference on machine learning,
PMLR, pp 4693–4702
193. Sohl-Dickstein J, Weiss E, Maheswaranathan N,
Ganguli S (2015) Deep unsupervised learning us-
ing nonequilibrium thermodynamics. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 2256–2265
194. Song Y, Ermon S (2020) Improved techniques
for training score-based generative models. arXiv
preprint arXiv:200609011
195. Song Y, Garg S, Shi J, Ermon S (2020) Sliced
score matching: A scalable approach to density
and score estimation. In: Uncertainty in Artificial
Intelligence, PMLR, pp 574–584
196. Sotelo J, Mehri S, Kumar K, Santos JF, Kastner
K, Courville A, Bengio Y (2017) Char2wav: End-
to-end speech synthesis
197. Sruthi K, Meharban M (2020) Review on im-
age captioning and speech synthesis techniques.
In: 2020 6th International Conference on Ad-
vanced Computing and Communication Systems
(ICACCS), IEEE, pp 352–356
198. Stephenson B, Besacier L, Girin L, Hueber T
(2020) What the future brings: Investigating the
impact of lookahead for incremental neural tts.
arXiv preprint arXiv:200902035
199. Stephenson B, Hueber T, Girin L, Besacier L
(2021) Alternate endings: Improving prosody for
incremental neural tts with predicted future text
input. arXiv preprint arXiv:210209914
200. Sun G, Zhang Y, Weiss RJ, Cao Y, Zen H, Wu
Y (2020) Fully-hierarchical fine-grained prosody
modeling for interpretable speech synthesis. In:
ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6264–6268
201. Székely É, Henter GE, Beskow J, Gustafson J
(2020) Breathing and speech planning in spon-
taneous speech synthesis. In: ICASSP 2020-2020
IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 7649–7653
202. Tachibana H, Uenoyama K, Aihara S (2018) Ef-
ficiently trainable text-to-speech system based
on deep convolutional networks with guided at-
tention. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 4784–4788
203. Tahon M, Lecorvé G, Lolive D (2018) Can we
generate emotional pronunciations for expressive
speech synthesis? IEEE Transactions on Affective
Computing 11(4):684–695
204. Taigman Y, Wolf L, Polyak A, Nachmani E (2017)
Voiceloop: Voice fitting and synthesis via a phono-
logical loop. arXiv preprint arXiv:170706588
205. Taylor J, Richmond K (2019) Analysis of pronun-
ciation learning in end-to-end speech synthesis. In:
INTERSPEECH, pp 2070–2074
206. Taylor P (2009) Text-to-speech synthesis. Cam-
bridge university press
207. Tits N, El Haddad K, Dutoit T (2019) Emotional
speech datasets for english speech synthesis pur-
pose: A review. In: Proceedings of SAI Intelligent
Systems Conference, Springer, pp 61–66
208. Tjandra A, Sakti S, Nakamura S (2017) Listen-
ing while speaking: Speech chain by deep learning.
In: 2017 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), IEEE, pp 301–
308
209. Tjandra A, Sakti S, Nakamura S (2018) Machine
speech chain with one-shot speaker adaptation.
arXiv preprint arXiv:180310525
210. Tokuda K, Nankaku Y, Toda T, Zen H, Yamag-
ishi J, Oura K (2013) Speech synthesis based on
hidden markov models. Proceedings of the IEEE
101(5):1234–1252
211. Tomczak JM, Welling M (2016) Improving varia-
tional auto-encoders using householder flow. arXiv
preprint arXiv:161109630
212. Tu T, Chen YJ, Yeh Cc, Lee HY (2019) End-
to-end text-to-speech for low-resource languages
by cross-lingual transfer learning. arXiv preprint
arXiv:190406508
213. Tuerk C, Robinson T (1993) Speech synthesis us-
ing artificial neural networks trained on cepstral
coefficients. In: Third European Conference on
Speech Communication and Technology
214. Um SY, Oh S, Byun K, Jang I, Ahn C, Kang HG
(2020) Emotional speech synthesis with rich and
granularized control. In: ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 7254–
7258
215. Vainer J, Dušek O (2020) Speedyspeech: Ef-
ficient neural speech synthesis. arXiv preprint
arXiv:200803802
216. Valentini-Botinhao C, Yamagishi J (2018) Speech
enhancement of noisy and reverberant speech for
text-to-speech. IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing 26(8):1420–
1433
217. Valin JM, Skoglund J (2019) Lpcnet: Improving
neural speech synthesis through linear prediction.
In: ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 5891–5895
218. Valle R, Li J, Prenger R, Catanzaro B (2020) Mel-
lotron: Multispeaker expressive voice synthesis by
conditioning on rhythm, pitch and global style to-
kens. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6189–6193
219. Valle R, Shih K, Prenger R, Catanzaro B (2020)
Flowtron: an autoregressive flow-based genera-
tive network for text-to-speech synthesis. arXiv
preprint arXiv:200505957
220. Van Oord A, Kalchbrenner N, Kavukcuoglu K
(2016) Pixel recurrent neural networks. In: Inter-
national Conference on Machine Learning, PMLR,
pp 1747–1756
221. Vasquez S, Lewis M (2019) Melnet: A generative
model for audio in the frequency domain. arXiv
preprint arXiv:190601083
222. Vaswani A, Shazeer N, Parmar N, Uszkoreit J,
Jones L, Gomez AN, Kaiser L, Polosukhin I
(2017) Attention is all you need. arXiv preprint
arXiv:170603762
223. Veaux C, Yamagishi J, MacDonald K, et al.
(2016) Superseded-cstr vctk corpus: English
multi-speaker corpus for cstr voice cloning toolkit
224. Vogten L, Berendsen E (1988) From text to
speech: the mitalk system. Journal of Phonetics
16(3):371–375
225. Wang G (2019) Deep text-to-speech system with
seq2seq model. arXiv preprint arXiv:1903.07398
226. Wang TC, Liu MY, Zhu JY, Tao A, Kautz J,
Catanzaro B (2018) High-resolution image syn-
thesis and semantic manipulation with conditional
GANs. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp 8798–
8807
227. Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss
RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S,
et al. (2017) Tacotron: Towards end-to-end speech
synthesis. arXiv preprint arXiv:1703.10135
228. Wang Y, Stanton D, Zhang Y, Ryan RS, Batten-
berg E, Shor J, Xiao Y, Jia Y, Ren F, Saurous RA
(2018) Style tokens: Unsupervised style model-
ing, control and transfer in end-to-end speech syn-
thesis. In: International Conference on Machine
Learning, PMLR, pp 5180–5189
229. Whitehill M, Ma S, McDuff D, Song Y
(2019) Multi-reference neural TTS stylization with
adversarial cycle consistency. arXiv preprint
arXiv:1910.11958
230. Wightman CW, Talkin DT (1997) The aligner:
Text-to-speech alignment using Markov models.
In: Progress in speech synthesis, Springer, pp 313–
323
231. Wu F, Fan A, Baevski A, Dauphin YN, Auli
M (2019) Pay less attention with lightweight
and dynamic convolutions. arXiv preprint
arXiv:1901.10430
232. Xu J, Tan X, Ren Y, Qin T, Li J, Zhao S, Liu TY
(2020) LRSpeech: Extremely low-resource speech
synthesis and recognition. In: Proceedings of the
26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pp 2802–
2812
233. Xu K, Ba J, Kiros R, Cho K, Courville A,
Salakhutdinov R, Zemel R, Bengio Y (2015) Show,
attend and tell: Neural image caption generation
with visual attention. In: International Conference
on Machine Learning, PMLR, pp 2048–2057
234. Yamamoto R, Song E, Kim JM (2020) Paral-
lel WaveGAN: A fast waveform generation model
based on generative adversarial networks with
multi-resolution spectrogram. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6199–6203
235. Yanagita T, Sakti S, Nakamura S (2019) Neural
iTTS: Toward synthesizing speech in real-time with
end-to-end neural text-to-speech framework. In:
Proceedings of the 10th ISCA Speech Synthesis
Workshop, pp 183–188
236. Yang B, Zhong J, Liu S (2019) Pre-trained text
representations for improving front-end text pro-
cessing in Mandarin text-to-speech synthesis. In:
INTERSPEECH, pp 4480–4484
237. Yang G, Yang S, Liu K, Fang P, Chen W, Xie L
(2021) Multi-band MelGAN: Faster waveform gen-
eration for high-quality text-to-speech. In: 2021
IEEE Spoken Language Technology Workshop
(SLT), IEEE, pp 492–498
238. Yang J, Wang Y, Liu H, Li J, Lu J (2014)
Deep learning theory and its application in speech
recognition. Commun Countermeas 33:1–5
239. Yang J, Lee J, Kim Y, Cho H, Kim I (2020)
VocGAN: A high-fidelity real-time vocoder with a
hierarchically-nested adversarial network. arXiv
preprint arXiv:2007.15256
240. Yasuda Y, Wang X, Yamagishi J (2019) Ini-
tial investigation of an encoder-decoder end-
to-end TTS framework using marginalization of
monotonic hard latent alignments. arXiv preprint
arXiv:1908.11535
241. Yoshimura T, Tokuda K, Masuko T, Kobayashi
T, Kitamura T (1999) Simultaneous modeling
of spectrum, pitch and duration in HMM-based
speech synthesis. In: Sixth European Conference
on Speech Communication and Technology
242. Yoshimura T, Hashimoto K, Oura K, Nankaku
Y, Tokuda K (2018) Mel-cepstrum-based quanti-
zation noise shaping applied to neural-network-
based speech waveform synthesis. IEEE/ACM
Transactions on Audio, Speech, and Language
Processing 26(7):1177–1184
243. Yu C, Lu H, Hu N, Yu M, Weng C, Xu K, Liu P,
Tuo D, Kang S, Lei G, et al. (2019) DurIAN: Du-
ration informed attention network for multimodal
synthesis. arXiv preprint arXiv:1909.01700
244. Yu L, Blunsom P, Dyer C, Grefenstette E, Kocisky
T (2016) The neural noisy channel. arXiv preprint
arXiv:1611.02554
245. Yu L, Buys J, Blunsom P (2016) Online segment
to segment neural transduction. arXiv preprint
arXiv:1609.08194
246. Zaremba W, Sutskever I (2015) Reinforcement
learning neural Turing machines-revised. arXiv
preprint arXiv:1505.00521
247. Ze H, Senior A, Schuster M (2013) Statistical
parametric speech synthesis using deep neural net-
works. In: 2013 IEEE International Conference on
Acoustics, Speech and Signal Processing, IEEE, pp
7962–7966
248. Zen H, Sak H (2015) Unidirectional long short-
term memory recurrent neural network with recur-
rent output layer for low-latency speech synthesis.
In: 2015 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP),
IEEE, pp 4470–4474
249. Zen H, Nose T, Yamagishi J, Sako S, Masuko
T, Black AW, Tokuda K (2007) The HMM-based
speech synthesis system (HTS) version 2.0. In: SSW,
Citeseer, pp 294–299
250. Zen H, Tokuda K, Black AW (2009) Statistical
parametric speech synthesis. Speech Communica-
tion 51(11):1039–1064
251. Zen H, Sak H, Graves A, Senior A (2014) Statisti-
cal parametric speech synthesis based on recurrent
neural networks. In: Poster presentation given at
UKSpeech Conference
252. Zen H, Agiomyrgiannakis Y, Egberts N, Hender-
son F, Szczepaniak P (2016) Fast, compact, and
high quality LSTM-RNN based statistical paramet-
ric speech synthesizers for mobile devices. arXiv
preprint arXiv:1606.06061
253. Zen H, Dang V, Clark R, Zhang Y, Weiss RJ, Jia
Y, Chen Z, Wu Y (2019) LibriTTS: A corpus derived
from LibriSpeech for text-to-speech. arXiv preprint
arXiv:1904.02882
254. Zeng Z, Wang J, Cheng N, Xia T, Xiao J (2020)
AlignTTS: Efficient feed-forward text-to-speech sys-
tem without explicit alignment. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6714–6718
255. Zhang H, Lin Y (2020) Unsupervised learn-
ing for sequence-to-sequence text-to-speech
for low-resource languages. arXiv preprint
arXiv:2008.04549
256. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang
X, Metaxas DN (2018) StackGAN++: Realistic im-
age synthesis with stacked generative adversarial
networks. IEEE transactions on pattern analysis
and machine intelligence 41(8):1947–1962
257. Zhang H, Goodfellow I, Metaxas D, Odena A
(2019) Self-attention generative adversarial net-
works. In: International Conference on Machine
Learning, PMLR, pp 7354–7363
258. Zhang J, Pan J, Yin X, Li C, Liu S, Zhang Y,
Wang Y, Ma Z (2020) A hybrid text normalization
system using multi-head self-attention for Man-
darin. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6694–6698
259. Zhang JX, Ling ZH, Dai LR (2018) Forward at-
tention in sequence-to-sequence acoustic modeling
for speech synthesis. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 4789–4793
260. Zhang S, Lei M, Yan Z, Dai L (2018) Deep-
FSMN for large vocabulary continuous speech
recognition. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 5869–5873
261. Zhang W, Yang H, Bu X, Wang L (2019) Deep
learning for Mandarin-Tibetan cross-lingual speech
synthesis. IEEE Access 7:167884–167894
262. Zhang Y, Deng L, Wang Y (2020) Unified Man-
darin TTS front-end based on distilled BERT model.
arXiv preprint arXiv:2012.15404
263. Zhang YJ, Pan S, He L, Ling ZH (2019)
Learning latent representations for style control
and transfer in end-to-end speech synthesis. In:
ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6945–6949
264. Zhang Z, Wu F, Yang C, Dong M, Zhou F (2016)
Mandarin prosodic phrase prediction based on
syntactic trees. In: SSW, pp 160–165
265. Zhang Z, Tian Q, Lu H, Chen LH, Liu S
(2020) AdaDurIAN: Few-shot adaptation for neu-
ral text-to-speech with DurIAN. arXiv preprint
arXiv:2005.05642
266. Zheng Y, Tao J, Wen Z, Li Y (2018) BLSTM-CRF
based end-to-end prosodic boundary prediction
with context sensitive embeddings in a text-to-
speech front-end. In: INTERSPEECH, pp 47–51
267. Zheng Y, Tao J, Wen Z, Yi J (2019) Forward–
backward decoding sequence for regularizing end-
to-end TTS. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 27(12):2067–
2079