Review of end-to-end speech synthesis technology based on deep learning
Zhaoxi Mu · Xinyu Yang · Yizhuo Dong
Abstract As an indispensable part of modern human-computer interaction systems, speech synthesis technology helps users obtain the output of intelligent machines more easily and intuitively, and has therefore attracted increasing attention. Because traditional speech synthesis technology suffers from high complexity and low efficiency, current research focuses on deep learning-based end-to-end speech synthesis, which offers more powerful modeling ability and a simpler pipeline. It mainly consists of three modules: text front-end, acoustic model, and vocoder. This paper reviews the research status of these three parts, and classifies and compares the various methods according to their emphasis. Moreover, this paper summarizes the open-source speech corpora of English, Chinese and other languages that can be used for speech synthesis tasks, and introduces some commonly used subjective and objective speech quality evaluation methods. Finally, some attractive future research directions are pointed out.
Keywords Speech synthesis · Text-to-speech · End-to-end · Deep learning · Review
Zhaoxi Mu · Xinyu Yang · Yizhuo Dong
Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
1 Introduction
With the rapid development of computer science, artificial intelligence, automation and robot control technology, the demand for human-computer interaction is increasingly being met, and the ways of interacting have become more direct and convenient. Human-computer interaction relies heavily on speech communication. The speech system of a machine is divided into three functional modules: voiceprint recognition, speech recognition and speech synthesis. The most difficult and complex task is speech synthesis. This is because, compared with speech and voiceprint recognition, speech synthesis systems usually require more training data and more complex models in order to accurately synthesize high-fidelity speech with various styles from simple text input.
Speech synthesis is also called text-to-speech (TTS) when the input is text. TTS is a frontier technology in the field of information processing that involves many disciplines, such as acoustics, linguistics, and computer science. Its main task is to convert input text into output speech. The TTS system is the mouth of an intelligent machine. It has been widely used in many areas of daily life, such as voice navigation, information broadcasting, intelligent assistants and intelligent customer service, and has achieved great economic benefits. Moreover, it is also being applied to some new fields, such as article reading, language education, video dubbing, and rehabilitation therapy. TTS applications have become an important part of people's lives.
Deep learning-based TTS technology With the develop-
ment of computer science and technology, the intelligi-
bility and naturalness of synthesized speech have been
greatly improved due to the continuous improvement
of TTS techniques from the formant-based methods
[92, 97, 105, 106, 179, 224] to the unit selection-based
waveform cascade methods [6, 24, 36, 59, 80, 143, 144],
and to the hidden Markov model (HMM)-based sta-
tistical parametric speech synthesis (SPSS) methods
[26, 94, 149, 178, 210, 241, 249, 250]. Deep learning
is a new research direction in the field of artificial in-
telligence in recent years. This method can effectively
capture the latent information and association in data,
and has more powerful modeling ability than tradi-
tional statistical learning methods [238]. TTS methods
based on deep learning have been widely researched
[52, 131, 168, 247]. For example, in the SPSS model
based on deep neural network (DNN), DNN can learn
the mapping function from linguistic features (input)
to acoustic features (output).
DNN-based acoustic models provide an effective dis-
tributed representation of the complex dependencies
between linguistic features and acoustic features. How-
ever, one limitation of the acoustic feature modeling
method based on feedforward DNN is that it ignores
the continuity of speech. The DNN-based method as-
sumes that each frame is sampled independently, al-
though there is correlation between consecutive frames
in the speech data. Recurrent Neural Network (RNN)
provides an effective method to model the correlation
between adjacent frames of speech, because it can use
all the available input features to predict the output fea-
tures of each frame. Based on this, some researchers use
RNN instead of DNN to capture the long-term depen-
dence of speech frames in order to improve the quality
of synthesized speech [51, 53, 93, 213, 248, 251].
End-to-end TTS technology The traditional SPSS net-
work is a complex pipeline containing many modules,
composed of text-to-phoneme network, audio segmen-
tation network, phoneme duration prediction network,
fundamental frequency prediction network and vocoder
[3, 57]. Building these modules requires a lot of profes-
sional knowledge and complex engineering implemen-
tation, which will take a lot of time and effort. Also,
the combination of errors in each component may make
the model difficult to train. End-to-end TTS methods
are driven by the desire to simplify TTS systems and
reduce the need for manual intervention and linguis-
tic background knowledge. The end-to-end TTS model
only needs to be trained from scratch on the paired
data set of ⟨text, speech⟩ pairs, and can directly synthesize
speech from the text. The state-of-the-art end-to-end
TTS models based on deep learning have been able to
synthesize speech close to human voice [151, 189, 227].
It is mainly composed of three parts: text analysis front-end, acoustic model and vocoder, as shown in Fig. 1.
Fig. 1 Pipeline architecture for TTS: text → text front-end → phonemes (e.g., "yu3 yin1 he2 cheng2") → acoustic model → spectrogram → vocoder → waveform
Firstly, the text front-end converts the text
into standard input. Then, the acoustic model converts
the standard input into intermediate acoustic features,
which are used to model the long-term structure of
speech. The most common intermediate acoustic fea-
tures are spectrogram [189, 227], vocoder feature [196]
or linguistic feature [151]. Finally, the vocoder is used
to fill in low-level signal details and convert acoustic
features into time-domain waveform samples. To reduce
the difficulty of training and improve the quality of syn-
thesized speech, the text front-end, acoustic model and
vocoder are usually trained separately [189], and they
can also be fine-tuned jointly [196]. This article will in-
troduce some of the latest developments in each of the
three components according to the structure of Fig. 2.
There have been some reviews on TTS. For exam-
ple, Deng et al. [43] analyzed the number of documents
and citations of TTS papers from 1992 to 2017, aiming
to help researchers understand the development trend
of TTS. Aroon and Dhonde [5] reviewed SPSS meth-
ods based on HMM. Adiga and Prasanna [1] reviewed
SPSS methods and partially deep learning based meth-
ods. Ning et al. [148] and Sruthi and Meharban [197]
reviewed TTS methods based on deep learning. Kalita
and Deb [90] reviewed emotional TTS methods for Hindi.
Tits et al. [207] reviewed the emotional speech corpus
that could be used for TTS.
Although there have been some reviews of TTS methods based on deep learning, they introduce only a few baseline models, such as WaveNet [151], Tacotron [227] and SampleRNN [136]. These models have many problems, such as slow training and inference speed, instability, a lack of emotion and rhythm in the synthesized speech, and the large amount of high-quality speech data required for training. State-of-the-art TTS methods can completely or partially solve these problems, but so far there has been no comprehensive review of the latest deep learning-based TTS models.
Fig. 2 Section organization of the TTS model: text front-end; acoustic model (fast: non-RNN, non-autoregressive, streaming; robust: stable autoregressive generation process, accurate alignment; expressive: reference encoder, explicit modeling of style features; low-resource: multi-speaker); vocoder (fast: small size, non-autoregressive; high-quality)
Moreover, the quantity and quality of the training speech corpus play a decisive role in the training results of the
TTS model, and how to effectively evaluate the quality
of synthesized speech has always been a problem in the
field of TTS. Therefore, this paper will make a detailed
summary of the latest end-to-end TTS models based
on deep learning, speech corpus and evaluation meth-
ods of synthesized speech, and finally give some future
research directions.
The rest of this paper is organized as follows: Sect. 2,
3 and 4 respectively introduce the latest text front-end,
acoustic model and vocoder based on deep learning.
Sect. 5 summarizes the corpora that can be used for TTS.
Sect. 6 introduces commonly used synthesized speech
evaluation methods from both subjective and objective
aspects. Sect. 7 puts forward some challenges and future
research directions for reference. The last section draws
a general conclusion of this paper.
2 Text front-end
It is difficult to synthesize high-fidelity speech only us-
ing original phonemes or original text as the input of
the TTS model, especially for languages that contain
polyphonic characters and have complex prosodic struc-
tures, such as Mandarin. Therefore, it is necessary to
use the text front-end to introduce additional pronun-
ciation and syntactic information. The text front-end
predicts the pronunciation mode from the original text,
aiming to provide enough information for the back-
end to accurately synthesize speech. The quality of the
text front-end has a great impact on the clarity and
naturalness of the synthesized speech. Pronunciation
patterns are important information for languages with
many polyphonic characters and ambiguous pronunci-
ations, such as Mandarin. Syntactic information also
contributes a lot to the pronunciation of a sentence,
which determines the pause and tone of a sentence.
People usually read a phrase that has a full meaning
in its entirety, and pause between phrases that need to
be separated. For languages with many ambiguities, the
effect of syntactic information on sentence segmentation
may also cause listeners to have a completely different
understanding of a sentence. Therefore, this informa-
tion needs to be predicted by the text front-end as a
conditional input of the acoustic model to synthesize
speech with correct pronunciation and prosody.
The traditional Mandarin text front-end is a cas-
cade system, which consists of a series of text processing
components, such as text normalization (TN), Chinese
word segmentation (CWS), part-of-speech (POS) tag-
ging, grapheme-to-phoneme (G2P) and prosodic struc-
ture prediction (PSP). The text front-end structure of
other languages is similar to that of Mandarin. These
components are usually modeled by traditional statisti-
cal methods, such as syntactic trees [264] and CRF [167]
based methods for PSP tasks and dictionary match-
ing based methods [77] for pronunciation prediction
tasks. However, these traditional text front-ends often
fail to predict correctly in some unusual or complex con-
texts. To boost prediction accuracy, some researchers
have adopted state-of-the-art NLP frameworks based
on deep learning methods such as BLSTM-CRF [78,
266], Word2Vec [139], Transformer [222] and BERT [44]
to improve the text front-end model based on dictio-
nary and traditional statistical learning methods. These
models can extract contextual information from the
text effectively, and thus help the text front-end to
accurately determine the pronunciation of polyphonic
characters, the meaning of ambiguous sentences, and
the prosodic boundaries between each word, each phrase
and each sentence. The following will introduce the lat-
est text front-end model based on deep learning from
the aspects of text normalization, prosodic structure
prediction, pronunciation prediction, contextual infor-
mation extraction and so on.
Text normalization Text normalization is an important
preprocessing step for TTS tasks. Zhang et al. [258]
standardized Mandarin text by combining the tradi-
tional rule-based system with a neural text network
consisting of multi-head self-attention modules in Trans-
former to convert Non-Standard Words (NSW) into
Spoken-Form Words (SFW). This method has a higher
prediction accuracy than the rule-based system.
Prosodic structure prediction Prosodic structure pre-
diction is also an important function of the text front-
end. Taking Mandarin as an example, the prosodic struc-
ture of Mandarin is a three-level hierarchical structure
composed of three basic units: prosodic words (PW),
prosodic phrases (PPH) and intonation phrases (IPH)
[35]. Because these three levels of prediction tasks are
interrelated, Pan et al. [154] modeled prosody informa-
tion at all levels of the text in the way of multi-task
learning, and proposed a Mandarin prosodic bound-
ary prediction model based on BLSTM-CRF, which
improved the prediction accuracy and simplified the
model. Lu et al. [130] also proposed a multi-task learning method to efficiently complete PSP tasks based
on the self-attention model.
Pronunciation prediction Other text front-ends have
the pronunciation prediction function on the basis of
text normalization and prosody prediction. The G2P
tasks of Mandarin can be divided into two categories:
G2P of monophonic characters and G2P of polyphonic
characters. The pronunciation of monophonic charac-
ters can be easily determined by a pronunciation dictio-
nary, while G2P of polyphonic characters is highly con-
text sensitive [262]. Therefore, disambiguation of poly-
phonic characters is the main task of Mandarin G2P.
To accurately predict the pronunciation of polyphonic
characters, Cai et al. [21], Shan et al. [185] and Park
and Lee [157] proposed to use Bi-LSTM network for
G2P. On the basis of Pan et al. [154], Yang et al. [236]
proposed to preprocess the original text by replacing
the Word2Vec model with the encoder of Transformer-
based NLP model and BERT pre-training model, and
then carry out G2P and PSP in the Mandarin text
front-end. The accuracy of prediction can be improved
by taking advantage of Transformer and BERT net-
work. However, pre-training models, such as BERT,
are too large to be used in real-time applications and on
edge devices. To reduce the size of the model, Zhang
et al. [262] proposed to use the simplified TinyBERT
model [86] for the G2P and PSP tasks simultaneously
using multi-task learning. It can ensure the accuracy
of the prediction results while reducing the size of the
model. Conkie and Finch [40] proposed a text front-end
that can be used to process multiple languages, includ-
ing text normalization and G2P functions. They regard
these two front-end tasks as two neural machine trans-
lation (NMT) tasks and use Transformer for modeling.
Byte pair encoding (BPE) technology [181] is also used
to process uncommon words, and the splicing technique
is used for long texts, which improves the accuracy of
prediction and the quality of synthesized speech.
Introduction of style information The text front-end
can also directly add additional style information to the
TTS system to provide the synthesized speech style fea-
tures. For example, Tahon et al. [203] added a pronun-
ciation adaptive framework based on CRF between text
front-end and TTS model to generate different styles of
speech. In order to make the synthesized speech closer
to human voice, Székely et al. [201] took the front and
back utterances of an utterance and the breath pronun-
ciation events between them as a data set to learn the
breath location information of the context, thus adding
human breath information into the training data. The
forward and backward breath predictors were also used
to predict the location of breath more accurately.
Contextual information extraction The text front-end
model can also extract the contextual information of
the text. The extracted additional contextual informa-
tion can be input into the acoustic model as prior knowl-
edge. For example, Hayashi et al. [70] directly used
BERT as a context feature extraction network to en-
code input text, and added encoded word or sentence-
level contextual information to the input of the encoder
of the acoustic model to improve the quality of synthe-
sized speech. In order to obtain the phrase structure
of the sentence and word relationship information, Guo
et al. [65] used the factor parser [107] in the Stanford
parser to extract the syntactic tree. The embedding vectors of the extracted syntactic features and the input tokens are then combined as the input of the acoustic
model encoder, enabling TTS models to correctly syn-
thesize speech when facing some ambiguous sentences.
In order to improve the quality of synthesized speech,
GraphSpeech [126] inputs syntactic knowledge as ad-
ditional contextual information into the self-attention
module of Transformer-TTS [118]. The syntax tree of
the input text is converted into a syntax graph to model
the language relation between any two characters in the
input text, describe the global relation between the in-
put characters and extract grammatical features of the
text.
Unified text front-end To reduce the cumulative train-
ing error of each part and simplify the model, the com-
ponents of the text front-end with various functions can
be combined together. Pan et al. [155] proposed a Man-
darin text front-end model that unifies a series of text
processing components, which can directly convert the
original text into linguistic features. Firstly, the original
text is normalized by the method proposed by Zhang
et al. [258]. Then, the Word2Vec model is used to con-
vert sentences into character embedding, and an auxil-
iary model composed of dilated convolution or Trans-
former encoder is used to predict CWS and POS respec-
tively. Finally, the results are embedded and combined
with the original characters as the input of the main
module to jointly predict the labels of phoneme, tone
and prosody.
3 Acoustic model
Tacotron [227] is the first end-to-end acoustic model
based on deep learning, and it is also the most widely
used acoustic model. It can synthesize acoustic features
directly from text, and then synthesize speech wave-
forms using the Griffin-Lim algorithm [62]. Tacotron
is based on the Seq2Seq architecture of encoder-decoder
with attention mechanism. The encoder is composed of
the CBHG network and is used to encode the input
text. The CBHG network includes convolution bank,
highway networks and Bi-GRU [38]. The decoder consists of an RNN with an attention mechanism that aligns the output
of the encoder with the mel-spectrogram to be gener-
ated. Finally, the decoder maps the output sequence of
the encoder to the mel-spectrogram in an autoregressive
manner [220]. The autoregressive generative method is
to decompose the joint probability p(x) of the acoustic feature sequence x = {x_1, x_2, . . . , x_T} into:

p(x) = \prod_{i=0}^{T-1} p(x_{i+1} | x_1, x_2, . . . , x_i)    (1)

This means that the acoustic features of the n-th frame are generated under the condition of the previous n-1
frames. In order to increase the speed of synthesizing
mel-spectrogram, Tacotron generates multiple frames
of mel-spectrogram at each decoding step.
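As an illustration of this factorization, the following minimal Python/NumPy sketch generates a mel-spectrogram block by block, feeding each prediction back as the next decoder input and emitting r frames per decoding step, as Tacotron does with its reduction factor. The decoder_step callable is a hypothetical stand-in for the attention RNN decoder, not part of any specific implementation.

import numpy as np

def decode_autoregressively(decoder_step, n_mels=80, r=2, max_steps=200):
    # Sketch of Eq. (1): each new block of r mel frames is predicted
    # conditioned only on the previously generated frames.
    # decoder_step(prev_frame, state) -> (frames of shape (r, n_mels), new_state, stop_flag)
    prev_frame = np.zeros(n_mels)            # all-zero <GO> frame
    state = None                             # recurrent decoder state
    outputs = []
    for _ in range(max_steps):
        frames, state, stop = decoder_step(prev_frame, state)
        outputs.append(frames)
        prev_frame = frames[-1]              # feed the last predicted frame back
        if stop:                             # predicted stop token ends decoding
            break
    return np.concatenate(outputs, axis=0)   # (T, n_mels) mel-spectrogram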
Although Tacotron is better than most SPSS mod-
els, it still has the following four disadvantages:
– The decoder in Tacotron is composed of RNN and synthesizes acoustic features in an autoregressive manner, which introduces a time-series dependence. Therefore, it cannot be calculated in parallel, resulting in slow training and inference speed.
– Tacotron uses a content-based attention mechanism, thus the synthesized speech will have many errors, such as mispronunciations, missed words and repetitions.
– Tacotron cannot synthesize speech with a specific emotion and rhythm.
– Tacotron needs a lot of high-fidelity speech data during training to get good results.
In order to overcome these disadvantages in Tacotron,
researchers have proposed many new acoustic models
based on Tacotron. The following will introduce various
improvement methods for the above four disadvantages.
3.1 Fast acoustic model
Although Tacotron can synthesize high-fidelity speech
that is close to human voice, it cannot be used in prac-
tical applications due to its slow training and infer-
ence speed. The training and inference speed of the acoustic model can be improved by improving the RNN network, improving the autoregressive generation method, or using streaming methods.
3.1.1 Non-RNN acoustic model
Multi-layer CNN can replace RNN to capture the long-
term dependence of the context, and can speed up train-
ing and inference in the way of parallel computing. For
example, Tacotron 2 [189] replaces the complex CBHG
and GRU structures with simple LSTM [74] and CNN
structures on the basis of Tacotron. Deep Voice 3 [160]
uses residual gated convolution [42, 56] instead of RNN
to capture contextual information, where the encoder
and decoder are composed of non-causal and causal
CNNs. DCTTS [202] replaces RNN with CNN on the
basis of Tacotron, which consists of Text2Mel and Spec-
trogram Super Resolution Network (SSRN).
In addition to CNN, other networks can be used
instead of RNN to achieve parallel computing. For ex-
ample, Li et al. [118] proposed to use Transformer to
replace the RNN and attention networks in Tacotron 2,
thereby increasing the computational efficiency by us-
ing the multi-head self-attention in Transformer to gen-
erate the hidden states of encoder and decoder in paral-
lel. Bi et al. [14] proposed that the deep feed-forward se-
quential memory network (DFSMN) [260] with a struc-
ture similar to dilated-CNN [151] can be used to replace
RNN in the acoustic model. The quality of speech generated by the DFSMN-based model is similar to that of the RNN-based model, while the model complexity and training time are reduced.
3.1.2 Non-autoregressive acoustic model
Although the above models improve the computational
efficiency by means of parallel computation, they still
need to generate acoustic features frame by frame in
an autoregressive manner [220] during inference, re-
sulting in a very slow generation speed. Therefore, if
acoustic features can be generated in parallel, the gen-
eration speed will be greatly improved. However, it is
difficult for the acoustic model based on the attention
mechanism to learn the correct alignment between in-
put and output if the mel-spectrogram is directly gen-
erated in parallel in a non-autoregressive manner. In
order to solve this problem, FastSpeech [172], SpeedyS-
peech [215], ParaNet [159], FastPitch [117] and other
models introduced a teacher network to replace the
implicit autoregressive alignment method of the tra-
ditional seq2seq model through knowledge distillation.
The autoregressive teacher network can guide the non-
autoregressive network to learn correct attention align-
ment.
FastSpeech consists of the feed-forward Transformer
networks, which can generate acoustic feature frames in
parallel under the guidance of the length regulator. The
length regulator aligns each language unit with a cor-
responding number of acoustic frames in a manner pro-
vided by the autoregressive teacher network. However,
the Transformer module is complex and has a large
number of parameters. To reduce model parameters and
further improve the speed of training and inference, De-
viceTTS [79], SpeedySpeech [215], TalkNet [12], and
Parallel Tacotron [49] replace the Transformer mod-
ule in FastSpeech with simple DFSMN [260], residual
dilated-CNN, CNN and lightweight convolution (LConv)
[231], respectively.
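A minimal sketch of the length-regulator idea is given below (assuming integer phoneme durations are already available, e.g. distilled from an autoregressive teacher): each phoneme's hidden vector is simply repeated according to its duration so that the expanded sequence matches the length of the target mel-spectrogram and can be decoded in parallel.

import numpy as np

def length_regulate(phoneme_hidden, durations):
    # phoneme_hidden: (L, d) encoder hidden states, one per phoneme.
    # durations: (L,) integer number of mel frames assigned to each phoneme.
    # Returns a (sum(durations), d) sequence for the parallel decoder.
    return np.repeat(phoneme_hidden, durations, axis=0)

# Toy usage: 3 phonemes with durations 2, 3 and 1 expand to 6 frames.
h = np.random.randn(3, 8)
print(length_regulate(h, np.array([2, 3, 1])).shape)  # (6, 8)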
The training process of models such as FastSpeech, SpeedySpeech, and ParaNet is complicated by the use of knowledge distillation. To simplify the training process, other generative models, such as normalizing flows and generative adversarial networks (GANs), can be used to avoid the autoregressive generation and knowledge distillation processes. Glow-TTS [98] uses
the Glow [101] normalizing flow instead of Transformer
as the decoder to generate mel-spectrogram in parallel
(the Glow normalizing flow will be described in detail in
Sect. 4.1.2). Flow-TTS [138] also uses a Glow-based de-
coder to generate mel-spectrogram non-autoregressively.
Donahue et al. [47] proposed an end-to-end TTS model
EATS based on GAN-TTS [17], which directly syn-
thesized speech non-autoregressively using GAN. Table
1 lists the methods to improve training and inference
speed of each model.
3.1.3 Streaming acoustic model
Although the training and inference speed of TTS mod-
els has been greatly improved, most of the current mod-
els can only output speech after inputting an entire sen-
tence. The longer the sentence, the longer the waiting time; that is, the system responds to the input with a delay, which seriously degrades the human-computer interaction experience. To solve this problem, some re-
searchers have proposed streaming incremental TTS
systems [50, 133, 198, 235], which can output speech in
real time while inputting text, because they only need
to see a few characters or words to synthesize speech.
The streaming system can generate new audio while the
user plays the audio, which greatly improves the appli-
cability of the TTS system and the user experience. It
can be applied in the fields of simultaneous translation,
dialog generation, and assistive technologies [133].
Traditional acoustic models with complete sentences
as input can rely on the full linguistic context (i.e., past
and future words) to construct their internal repre-
sentations for acoustic features, thus generating high-
quality speech. However, due to the limited contextual
information that streaming acoustic models can obtain,
it is a challenge to effectively model the overall prosodic
structure of speech. Yanagita et al. [235] proposed the
streaming neural TTS model for the first time. In order
to learn the intra-sentence boundary features, they used
the start, middle and end symbols to split the train-
ing sentence into multiple subunits, which were used to
train the Tacotron. They also allow the model to learn the acoustic time series within one full sentence by taking the last vector of the mel-spectrogram from the previous units as the initial input for each unit. Finally,
the entire sentence is synthesized by incrementally syn-
thesizing blocks consisting of one or more words with
symbols.
This method needs to preprocess the training data,
and only considers the previous information, which will
cause the prosodic error of synthesized speech. In or-
der to solve this problem, Ma et al. [133] borrowed the
idea of prefix-to-prefix framework of simultaneous ma-
chine translation [132]. When generating acoustic fea-
tures and speech waveforms incrementally, not only the
previous results but also the information of the follow-
ing words should be used as the condition. Stephen-
son et al. [198] also proposed that the following words
should be considered when incrementally encoding each
word. They use Bi-LSTM to encode the first word to
the following few words of the word to be synthesized,
and then input the resulting embedding vector into the
decoder. Finally, the speech segments will be cropped
[104] and spliced. Ellinas et al. [50] proposed a stream-
ing inference method, which can input the generated
acoustic frames into the vocoder before the inference
process of the acoustic model is completed. They accu-
mulate the output frames from each decoding step in a
buffer, and when the buffer includes enough frames to
accommodate the total receptive field of the convolu-
tional layers in post-net, the acoustic frames are passed
to post-net in a larger batch. The post-net is trained to
refine the entire acoustic frames sequence. The acoustic
frames in the buffer are partially redundant to consider
the contextual information of the acoustic frame to be
synthesized. Stephenson et al. [199] used the language
model GPT-2 [169] to predict the next word in the in-
put text, thereby improving the naturalness of speech
synthesized by the incremental TTS model by utilizing
the predicted contextual information.
3.2 Robust acoustic model
The neural TTS models based on autoregressive genera-
tive method and attention mechanism have been able to
generate speech that is as natural as human voice. How-
ever, these models are not as robust as traditional meth-
ods. During training, the autoregression-based models need to first decide whether they should stop when predicting each frame. Therefore, an incorrect prediction of a single frame can result in serious errors, such as ending the generation process early. Moreover, there
are almost no constraints in the attention mechanism
of the acoustic model to prevent problems such as repe-
tition, skipping, long pauses, or nonsense. These errors
are rare and therefore usually do not show up in small
test sets such as those used in subjective listening tests.
However, in customer-oriented products, even if there is
only a small probability of such problems, it will greatly
reduce the user experience. Therefore, many improved
methods for autoregressive generative model and atten-
tion mechanism widely used in neural TTS models have
been proposed.
3.2.1 Stable autoregressive generation process
In order to improve the training convergence speed, autoregressive TTS models such as Tacotron use natural acoustic feature frames as the decoder input for teacher-forcing training in the training stage, while in the inference stage they use the previously predicted acoustic feature frames as the decoder input to generate speech in free-running mode. The distribution of the data predicted by the model is different from the distribution of the real data used in the training process, and the discrepancy between these two distributions can quickly accumulate errors during decoding, resulting in exposure bias and wrong results, such as skipping, repeated words, incomplete synthesis and inappropriate prosodic phrase breaks. This also means that the model can only be used to synthesize short sentences, because the sound quality deteriorates as the length of the synthesized sentence increases.
A simple method to reduce exposure bias is sched-
uled sampling [13], in which acoustic feature frames of
the current time step are predicted by using natural
acoustic feature frames or those predicted by the pre-
vious time step with a certain probability [141, 155].
However, due to the inconsistency between the natural
speech frames and the predicted speech frames during
the scheduled sampling, the temporal correlation of the
acoustic feature sequence is destroyed, leading to the
decline of the quality of the synthesized speech.
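The following sketch illustrates scheduled sampling for one utterance: at each decoding step the decoder is fed the natural frame with probability teacher_prob and its own previous prediction otherwise; this probability is typically annealed during training. The decoder_step callable is again a hypothetical stand-in.

import numpy as np

def scheduled_sampling_decode(decoder_step, target_frames, teacher_prob, rng=None):
    # target_frames: (T, n_mels) natural acoustic feature frames.
    # teacher_prob: probability of feeding the natural frame as the next input.
    if rng is None:
        rng = np.random.default_rng()
    T, n_mels = target_frames.shape
    prev, state, predictions = np.zeros(n_mels), None, []
    for t in range(T):
        pred, state = decoder_step(prev, state)
        predictions.append(pred)
        # Mix ground truth and model prediction as the next decoder input.
        prev = target_frames[t] if rng.random() < teacher_prob else pred
    return np.stack(predictions)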
To avoid this problem, Guo et al. [66] proposed
to use the Professor Forcing [116] method for train-
ing, which is a GAN-based adversarial training method.
The model is composed of a generator and a discrimi-
nator. The generator generates the output sequence in
the manner of teacher forcing and free running, respec-
tively. The discriminator, based on the self-attention GAN (SAGAN) [257], is used to determine in which way the output sequence was generated. They reduce the exposure bias by introducing into the training objective of the generator an additional term that minimizes the gap between the output sequences generated by the two methods, although this solution is not sufficiently stable or simple. Liu et al. [125] proposed the random descent
method, which first uses the natural acoustic features
as the input of the decoder for the first round of teacher
forcing training, and then replaces the natural acous-
tic features with the acoustic features generated in the
first round for the second round of teacher forcing training.
Table 1 Methods to improve the training and inference speed of each acoustic model

Acoustic model | Neural network types | Generative model types | Characteristics
Tacotron (Wang et al., 2017) | CBHG, GRU | Autoregression | Synthesizes speech end-to-end; the structure is complex; training and inference are slow
Deep Voice 3 (Ping et al., 2017) | CNN | Autoregression | Based on CNN; training and inference are faster than Tacotron
DCTTS (Tachibana et al., 2018) | CNN | Autoregression | Based on CNN; training and inference are faster than Tacotron
Tacotron 2 (Shen et al., 2018) | LSTM, CNN | Autoregression | The structure is simpler than Tacotron
Transformer-TTS (Li et al., 2019) | Transformer | Autoregression | Based on Transformer; training and inference are faster than Tacotron
FastSpeech (Ren et al., 2019) | Transformer | Non-autoregression | Trained through knowledge distillation; training is slow, inference is fast
ParaNet (Peng et al., 2020) | CNN | Non-autoregression | Trained through knowledge distillation; based on CNN; the structure is simpler than FastSpeech
EATS (Donahue et al., 2020) | CNN | GAN | Based on CNN and GAN; training and inference are fast; fully end-to-end
Glow-TTS (Kim et al., 2020) | Transformer, Glow | Normalizing flow | Based on normalizing flow; training and inference are fast
SpeedySpeech (Vainer and Dušek, 2020) | CNN | Non-autoregression | Trained through knowledge distillation; based on CNN; the structure is simpler than FastSpeech
TalkNet (Beliaev et al., 2020) | CNN | Non-autoregression | Based on CNN; training and inference are faster; the structure is simpler than FastSpeech
Flow-TTS (Miao et al., 2020) | Glow | Normalizing flow | Based on normalizing flow; training and inference are fast
DeviceTTS (Huang et al., 2020) | DFSMN, RNN | Combination of autoregression and non-autoregression | Based on DFSMN; the structure is simpler than FastSpeech
Parallel Tacotron (Elias et al., 2020) | LConv | Non-autoregression | Based on LConv; the structure is simpler than FastSpeech
FastPitch (Łańcucki, 2020) | Transformer | Non-autoregression | Trained through knowledge distillation; training is slow, inference is fast
The model is trained for multiple iterations to mini-
mize the gap between the generated acoustic features
and the natural acoustic features, thereby reducing the
exposure bias. Liu et al. [127] also proposed a method
based on knowledge distillation to reduce exposure bias,
which is to train a teacher model first, and then use it
to guide the training of the student model. The teacher
model uses ground-truth data for training, and the stu-
dent model uses the predicted value of the previous
time step to guide the prediction of the next time step.
Knowledge distillation is performed by minimizing the
distance between the hidden states of the decoder at
each time step of the two models.
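A minimal sketch of this distillation objective, under the assumption that an L2 distance is used, is shown below: the hidden states of the free-running student decoder are pulled toward those of the teacher-forced teacher decoder at every time step, and the resulting term is added to the usual reconstruction loss.

import numpy as np

def hidden_state_distillation_loss(teacher_states, student_states):
    # teacher_states, student_states: (T, d) decoder hidden states of the
    # teacher-forced model and the free-running student model.
    return np.mean((teacher_states - student_states) ** 2)

def total_loss(recon_loss, teacher_states, student_states, alpha=1.0):
    # Combine the ordinary spectrogram reconstruction loss with the
    # distillation term; alpha is a hypothetical weighting factor.
    return recon_loss + alpha * hidden_state_distillation_loss(teacher_states, student_states)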
When the target sequence is generated by autore-
gressive method, the previous wrong token will affect
the next one. The acoustic feature sequence is usually
longer than the target sequence of other sequence learn-
ing tasks (such as NMT). Therefore, the results of the
TTS task will be more susceptible to error propagation, so the right part of the generated acoustic feature sequence is usually worse than the left part. Ren et al. [173] used the bidirectional sequence
modeling (BSM) technique to alleviate error propaga-
tion. They generated acoustic feature sequences from
left to right and from right to left respectively to pre-
vent the model from generating sequences with poor
quality on one side. Zheng et al. [267] proposed two
BSM methods for acoustic models, which take full ad-
vantage of the autoregressive model at the initial it-
eration stage and reduce errors in synthesized speech
by adding bidirectional decoding regularization term
to the loss function during training. The first method
is to construct two acoustic models that generate the
mel-spectrogram from front to back and from back to
front respectively, and then minimize the difference be-
tween the output mel-spectrogram of the two models.
The second method is to use two decoders to generate
mel-spectrogram forward and backward while sharing
an encoder, and then minimize the difference between
the state or attention weight values of the two decoders
at each time step. Moreover, Vainer and Dušek [215] employed three data augmentations on the input mel-spectrogram to improve the robustness of the model to error propagation during autoregressive generation, as sketched after the following list:
– A small amount of Gaussian noise is added to each spectrogram pixel.
– The model outputs are simulated by feeding the input spectrogram through the network, without gradient updates, in parallel mode.
– The input spectrograms are degraded by randomly replacing several frames with random frames, thereby encouraging the model to use temporally more distant frames.
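A minimal sketch of the first and third degradations is given below (Gaussian noise on every spectrogram bin and random frame replacement); the second augmentation is omitted because it requires running the network itself. The parameter values are illustrative assumptions.

import numpy as np

def degrade_spectrogram(mel, noise_std=0.01, replace_frac=0.05, rng=None):
    # mel: (T, n_mels) input mel-spectrogram.
    if rng is None:
        rng = np.random.default_rng()
    mel = mel + rng.normal(0.0, noise_std, size=mel.shape)   # Gaussian noise on each bin
    T = mel.shape[0]
    n_replace = max(1, int(replace_frac * T))
    dst = rng.choice(T, size=n_replace, replace=False)       # frames to overwrite
    src = rng.choice(T, size=n_replace, replace=True)        # random source frames
    mel[dst] = mel[src]                                      # random frame replacement
    return mel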
When acoustic features are generated by autoregres-
sive acoustic models, there is a problem of local infor-
mation preference [29, 124], that is, the acoustic feature
frames to be generated by the current time step are
completely dependent on the acoustic feature frames
generated by the previous time step, and are indepen-
dent of the text conditions. In order to avoid ignoring
text information during synthesis and thus generating
wrong speech, Liu et al. [124] learned from the idea of
InfoGAN [28] and proposed to use an additional auxil-
iary CTC recognizer to recognize the predicted acous-
tic features. The predicted acoustic features are used
to restore the corresponding input text. This method
essentially maximizes the mutual information between
the predicted acoustic features and the input text to
enhance the dependence between them.
3.2.2 Accurate alignment
Similar to other Seq2Seq models, many TTS models
use the attention mechanism to align input text with
output spectrograms. The attention mechanism allows
the output of the decoder at each step to focus on a
subset of hidden states of the encoder, and the result
directly controls the duration and rhythm of the syn-
thesized speech. The main structure of the attention
mechanism is shown in Fig. 3, which can be expressed
as [25]:
(h_1, h_2, . . . , h_L) = Encoder(x_1, x_2, . . . , x_L)    (2)

s_i = Attention(s_{i-1}, c_{i-1}, y_{i-1})    (3)

e_{i,j} = f_a(s_i, h_j)    (4)

α_{i,j} = f_d(e_{i,j})    (5)

c_i = \sum_j α_{i,j} h_j    (6)

y_i = Decoder(y_{i-1}, c_i, s_i)    (7)

where {x_j}_{j=1}^{L} is the input sequence, L is the length of the input sequence, {h_j}_{j=1}^{L} are the hidden states of the encoder, c_i is the context vector, α_{i,j} are the attention weights over the input, s_i is the hidden state of the decoder, e_{i,j} are energy values, y_i is the output token, f_a is the alignment function, f_d is the distribution function, and the forms of f_a and f_d depend on the specific attention mechanism.
Fig. 3 Attention mechanism structure: the encoder maps (x_1, . . . , x_L) to hidden states (h_1, . . . , h_L); the alignment function produces energies e_{i,j}, the distribution function produces weights α_{i,j}, and their element-wise combination with the encoder states gives the context vector c_i used by the decoder, together with its state s_i, to produce the output y_i
First, the input sequence (x_1, x_2, . . . , x_L) is encoded by the encoder and transformed into (h_1, h_2, . . . , h_L). Then, the hidden states {s_i}_{i=1}^{T} of the decoder are generated by the attention network, and the corresponding weights {α_{i,j}}_{j=1}^{L} of the encoder states at the i-th time step are calculated from s_i. The context vector c_i is a linear combination of the attention weights {α_{i,j}}_{j=1}^{L} and the encoder states {h_j}_{j=1}^{L}. Finally, the decoder generates the output token y_i using the current context vector c_i and hidden state s_i.
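The following NumPy sketch puts Eqs. (4)-(6) together for a single decoding step, using a plain dot product as the alignment function f_a and a softmax as the distribution function f_d; both are placeholders, since the actual forms depend on the specific attention mechanism.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(h, s_i):
    # h: (L, d) encoder hidden states; s_i: (d,) current decoder state.
    e_i = h @ s_i              # Eq. (4): energies from a placeholder alignment function
    alpha_i = softmax(e_i)     # Eq. (5): attention weights over the input
    c_i = alpha_i @ h          # Eq. (6): context vector as a weighted sum of encoder states
    return c_i, alpha_i

# Toy usage with L = 6 input tokens and d = 16.
h = np.random.randn(6, 16)
s = np.random.randn(16)
c, alpha = attention_step(h, s)
print(c.shape, round(alpha.sum(), 3))  # (16,) 1.0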
Since the order and position of input text and out-
put speech in TTS task are corresponding, attention
alignment in TTS is a surjective mapping from the out-
put frames to the input tokens and should satisfy the following strict criteria [71]:
Locality Each output frame should be aligned around
a single input token to avoid attention collapse.
Monotonicity The position of the aligned input to-
ken must never rewind backward to prevent repeat-
ing.
Completeness Each input token should be covered, i.e., aligned with at least one output frame, to avoid skipping.
The original Tacotron model uses the content-based
attention mechanism proposed by Bahdanau et al. [7].
In this case, Eq. (4) is:
e_{i,j} = v^T tanh(W s_i + V h_j + b)    (8)

where W s_i and V h_j represent the query and the key, respectively.
The content-based attention mechanism does not
consider the position information of each item in the
sequence at all, and can not effectively utilize the mono-
tonicity and locality of alignment, thus alignment errors
are common. In order to enable the attention mecha-
nism to consider the positon information of input and
output, and thus enhance the generalization ability of
synthesizing long sentences, Char2wav [196], Voiceloop
[204] and Melnet [221] adopted the Gaussian mixture
model (GMM) attention mechanism proposed by Graves
[61] to replace the content-based attention mechanism
in Tacotron. This method is a purely location-based at-
tention mechanism, which uses an unnormalized mix-
ture of K Gaussians to produce the attention weights,
α_{i,j}, for each encoder state:

α_{i,j} = \sum_{k=1}^{K} \frac{w_{i,k}}{Z_{i,k}} \exp\left( -\frac{(j - µ_{i,k})^2}{2 σ_{i,k}^2} \right)    (9)

µ_{i,k} = µ_{i-1,k} + ∆_{i,k}    (10)

where w_{i,k}, Z_{i,k}, ∆_{i,k} and σ_{i,k} are computed from the attention RNN state. The mean of each Gaussian component µ_{i,k} is computed using the recurrence relation in Eq. (10), which makes the mechanism location-relative and potentially monotonic if ∆_{i,k} is constrained to be positive. Although this location-based attention mechanism can enhance the generalization ability of acoustic models for long sentences, it sacrifices some of the naturalness of the synthesized speech.
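A minimal sketch of Eqs. (9)-(10) follows: the weight of each encoder position j is an unnormalized mixture of K Gaussians whose means can only move forward when ∆_{i,k} is positive. The per-component parameters would normally be predicted from the attention RNN state; here they are simply passed in.

import numpy as np

def gmm_attention_weights(mu_prev, w, delta, sigma, Z, L):
    # mu_prev, w, delta, sigma, Z: (K,) mixture parameters for decoder step i.
    # L: number of encoder states. Returns (alpha_i of shape (L,), mu_i).
    mu_i = mu_prev + delta                                   # Eq. (10)
    j = np.arange(L)[:, None]                                # (L, 1) encoder positions
    gauss = np.exp(-((j - mu_i) ** 2) / (2.0 * sigma ** 2))  # (L, K)
    alpha_i = (w / Z * gauss).sum(axis=1)                    # Eq. (9)
    return alpha_i, mu_i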
In order to combine content and location informa-
tion in alignment, Tacotron 2 uses the hybrid location-
sensitive attention mechanism [31]. In this case, Eq. (4)
is:
e_{i,j} = v^T tanh(W s_i + V h_j + U f_{i,j} + b)    (11)

where U f_{i,j} represents the location-sensitive term, which uses convolutional features computed from the previous attention weights {α_{i-1,j}}_{j=1}^{L}. This method combines the content and location features to make the alignment more accurate by additionally introducing the previous attention weight information.
Based on the monotonicity of alignment between in-
put and output sequences in TTS, various monotonic
attention mechanisms have been proposed to reduce
errors in attention alignment. In order to introduce
monotonicity into the hybrid location-sensitive attention, Battenberg et al. [10] proposed Dynamic Convolution Attention (DCA), which removes the content-based terms W s_i and V h_j, leaving only the location-sensitive term U f_{i,j} as static filters, while adding a set of learned dynamic filters T g_{i,j} and a single fixed prior filter p_{i,j}. In this case, Eq. (4) is redefined as:

e_{i,j} = v^T tanh(U f_{i,j} + T g_{i,j} + b) + p_{i,j}    (12)

Similar to the static filters U f_{i,j}, the dynamic filters T g_{i,j} are computed from the attention RNN state and serve to dynamically adjust the alignment relative to the alignment at the previous step. The prior filter p_{i,j} is used to bias the alignment toward short forward steps. This monotonic DCA has stronger generalization ability and is more stable.
Raffel et al. [170] proposed a monotonic alignment
method that can be applied to TTS: monotonic atten-
tion (MA). At each step i, MA inspects the memory
entries from the memory index t_{i-1} it focused on at the previous step and evaluates the "selection probability" p_{i,j}:

p_{i,j} = σ(e_{i,j})    (13)

where σ is the logistic sigmoid function and the energy values e_{i,j} are produced as in Eq. (4). Starting from j = t_{i-1}, at each time MA would sample z_{i,j} ∼ Bernoulli(p_{i,j}) to decide whether to keep j unmoved (z_{i,j} = 1) or move to the next position (z_{i,j} = 0). j would keep moving forward until reaching the end of the inputs, or until receiving a positive sampling result z_{i,j} = 1, and when j stops, the memory h_j would be directly picked as c_i. With such a
restriction, it is guaranteed that solely one input unit
would be focused on at each step, and its position would
never rewind backward. Moreover, the mechanism only
requires linear time complexity and supports online in-
puts, which could be efficient in practice.
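A sketch of the hard MA decision process at inference time, under the semantics described above, is given below: starting from the previously attended index, z ∼ Bernoulli(p_{i,j}) is sampled at each position and the index keeps moving right until z = 1 is drawn (or the end of the input is reached); the attended memory entry is then used directly as the context vector.

import numpy as np

def monotonic_attention_step(p_i, t_prev, h, rng=None):
    # p_i: (L,) selection probabilities sigma(e_{i,j}) for the current output step.
    # t_prev: index attended at the previous output step; h: (L, d) encoder states.
    if rng is None:
        rng = np.random.default_rng()
    L = len(p_i)
    j = t_prev
    while j < L - 1 and rng.random() >= p_i[j]:  # z_{i,j} = 0: move to the next position
        j += 1
    return h[j], j                               # z_{i,j} = 1 (or end of input): stop here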
In contrast to the traditional “soft” attention using
continuous weights, MA, which simply selects one in-
put unit as the context vector c_i
, is a “hard” attention.
It can ensure the locality of attention alignment, but
it could not be trained by standard back-propagation
(BP) algorithm. Multiple approaches have been pro-
posed for this issue, including reinforcement learning
[122, 233, 246], approximation by beam search [186],
and approximation by soft attention for training [170].
To further guarantee the completeness of alignment,
He et al. [71] proposed stepwise monotonic attention
(SMA), which adds additional restrictions on MA: in
each decoding step, the attention alignment position
moves forward at most one step, and it is not allowed
to skip any input unit. The alignment of soft atten-
tion (SA), MA and SMA is shown in Fig. 4 [71]. The
color depth of each node in the figure represents the size
of the attention weight between each output acoustic
feature frame and the input phoneme. The darker the
color, the greater the value of attention weight. The fig-
ure shows that each acoustic feature frame is calculated
by multiple input phonemes in SA. Each acoustic fea-
ture frame is determined by an input phoneme in MA.
In SMA, not only each acoustic feature frame is deter-
mined by an input phoneme, but all input phonemes
must be corresponding at least once, which ensures
the locality, monotonicity and completeness of atten-
tion alignment.
Zhang et al. [259] and Yasuda et al. [240] also pro-
posed similar monotonic attention mechanisms. Zhang
et al. [259] suggested that only the alignment paths sat-
isfying the monotonic condition are taken into consid-
eration at each decoder time step. The attention prob-
abilities of each time step can be computed recursively
using a forward algorithm, and a transition agent is
proposed to help the attention mechanism make deci-
sions whether to move forward or stay at each decoder
time step. This attention mechanism has the advan-
tages of fast convergence speed and high stability. Ya-
suda et al. [240] also proposed a hard monotonic atten-
tion mechanism. The framework and likelihood function
are similar to those of a hidden Markov model (HMM).
The constrained alignment is conceptually borrowed
from segment-to-segment neural transduction (SSNT)
[244, 245]. They factorized the generation probability
for acoustic features into an alignment transition prob-
ability and emission probability, thereby constraining
the alignment process to moving from left to right, and
only one step at a time. Although this hard monotonic
alignment method can avoid some alignment errors that
are commonly observed in soft-attention-based meth-
ods, including muffling, skipping, and repeating, this
attention mechanism has poor stability and long train-
ing time.
In order to make more direct use of the correspon-
dence between text and speech in TTS, Tachibana et al.
[202] and Wang [225] added a guided attention loss to
content-based dot product attention [222]. More specif-
ically, they added an additional monotonic attention
loss to the original audio reconstruction loss, forcing the non-zero values of the attention weight matrix to be concentrated on the diagonal as much as possible. Further-
more, the forced increment attention was proposed to
force the text and speech to be aligned monotonously by
making the corresponding text position of acoustic fea-
ture frame at each time step move forward by at most
one. To produce monotonic alignment, Deep Voice 3
and ParaNet added positional encoding in Transformer
to the content-based dot product attention.
Fig. 4 The alignment of SA, MA and SMA: attention weight matrices between input text tokens and output acoustic feature frames
Besides, they added an attention window [125, 137] to the attention
during inference, calculated the attention weights only
for the input characters in the window, and took the po-
sition of the character with the largest attention weight
as the starting position of the next window. Moreover,
ParaNet adopted a multi-layer attention mechanism to
iteratively refine attention alignment in a layer-by-layer
manner.
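A minimal sketch of a diagonal guided-attention penalty in the spirit of Tachibana et al. [202] is shown below: attention mass that lies far from the diagonal of the (output frame, input token) matrix is penalized, and the resulting term is added to the audio reconstruction loss with some weight. The Gaussian-shaped penalty and the width g used here follow the common formulation but should be taken as an assumption.

import numpy as np

def guided_attention_loss(attn, g=0.2):
    # attn: (T, L) attention weight matrix (output frames x input tokens).
    T, L = attn.shape
    t = np.arange(T)[:, None] / T
    l = np.arange(L)[None, :] / L
    penalty = 1.0 - np.exp(-((l - t) ** 2) / (2.0 * g ** 2))  # zero on the diagonal
    return np.mean(attn * penalty)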
However, the use of positional encoding can cause
errors when synthesizing long sentences [119]. To syn-
thesize long sentences stably, Glow-TTS removes the
positional encoding and adds relative position represen-
tations [187] into self-attention modules instead. Robu-
Trans [119] counts on the 1-D CNN used in Encoder
Pre-net to model relative position information in a fixed
window. Moreover, in order to make the self-attention
in Transformer more suitable for TTS models, Robu-
Trans also uses Pseudo Non-causal Attention (PNCA)
to replace the traditional causal self-attention. The de-
coding process is more robust by providing the decoder
with the holistic view of the input sequence and the
frame-level context information.
As described in Sect. 3.1.2, a large number of non-
autoregressive acoustic models have been proposed re-
cently. TTS is a one-to-many mapping. For the same
text input, there are many possible speech expressions
with different prosody. To eliminate ambiguity in multi-
mode output, the acoustic models with autoregressive
decoders can predict the acoustic feature frames of the
next time step by combining the contextual information
provided by the acoustic feature frames generated by
the previous time step. However, acoustic models with
non-autoregressive decoders need to obtain contextual
information in other ways to select an appropriate gen-
eration mode. Non-autoregressive acoustic models need
to determine the output length in advance, rather than
predict whether to stop at each frame. In this case, in
order to align the inputs and outputs, a duration pre-
dictor similar to the one used in the traditional SPSS
method [247, 250] can be used instead of the attention
network. Aligning with a duration predictor can avoid
the errors of skipping, repeating, and irregular stops
caused by the attention mechanism. This method first
appeared in NMT [64], and then was introduced into
TTS through non-autoregressive acoustic models such
as FastSpeech [172]. Acoustic models with duration pre-
dictors can align input phonemes and output acoustic
features by introducing additional alignment modules
or using external aligners. Next, these two alignment
methods are introduced separately.
The most direct way to obtain the alignment infor-
mation is provided by an external aligner. For exam-
ple, FastSpeech extracts phoneme duration from a pre-
trained autoregressive model by knowledge distillation
[100]. However, FastSpeech lacks generalization abil-
ity for long utterances, especially those whose length
exceeds the maximum length of the utterance in the
training set. This may be because the self-attention is
a global modeling method. To use the local modeling
method to make network more stable, DeviceTTS [79]
replaces the Transformer with DFSMN, which makes
use of a latency control window size to learn the con-
text. To simplify the training process, JDI-T [121] jointly
trains the autoregressive Transformer teacher network
and the feed-forward Transformer student network. To
avoid the complicated knowledge distillation process,
some models use a separate external alignment model
to predict the target phoneme duration, thus estab-
lishing alignment between input phonemes and output
acoustic features. For example, TalkNet [12] uses the
CTC-based automatic speech recognition (ASR) model
Quartznet [111], FastSpeech 2 [174] uses the forced-
alignment tool MFA toolkit [135], DurIAN [243] uses
an external alignment model [51, 252], RobuTrans [119]
uses speech recognition tools, Parallel Tacotron [49] and
Non-Attentive Tacotron [190] use a speaker-dependent
HMM-based aligner with a lexicon [230]. To address
the difficulty of training an aligner due to data spar-
sity, Shen et al. [190] used fine-grained VAE (FVAE)
to achieve semi-supervised and unsupervised duration
prediction, that is, simply training the model using the
predicted durations instead of the target durations for
upsampling.
It is also possible to directly learn alignment by
training an alignment module within the model. For ex-
ample, AlignTTS [254] uses the dynamic programming
to consider all possible alignments in training, that is,
uses the alignment loss inspired by the Baum-Welch
algorithm [11, 206] to train the mix density network
for alignment. Glow-TTS uses the Monotonic Align-
ment Search (MAS) algorithm to predict the duration
of each input tokens by searching for the most probable
monotonic alignment between text and the latent rep-
resentation of speech. The internal aligner of EATS
[47] implicitly enhances the monotonicity of alignment
by predicting token lengths and obtaining positions us-
ing a cumulative sum operation. Moreover, the dynamic
time warping (DTW) loss and the aligner length loss
are introduced to learn alignment and ensure that the
model can accurately predict phoneme lengths. Flow-
TTS [138] trains a length predictor inside the model to
predict the output length in advance, and takes the po-
sitional encoding of the predicted spectrogram length
as query vector to align the input and output using the
positional attention module based on the multi-head
dot-product attention mechanism [222].
Since one-to-many regression problems like TTS can
benefit from autoregressive decoding, it is also possible
to combine the autoregressive method with duration
predictor to further improve the stability of TTS mod-
els, such as the alignment methods used in DurIAN,
Non-Attentive Tacotron [190], DeviceTTS and Robu-
Trans [119]. The alignment method of each model is
shown in Table 2.
3.3 Expressive acoustic model
The speech synthesized by deep learning methods tends to have a flat tone and lacks rhythm and expressiveness, so there is often a certain gap between it and the real human voice.
In order to synthesize expressive speech, three parts
need to be considered: ”what to say”, ”who to say”
and ”how to say”. ”What to say” is controlled by the
input text and the text front-end. ”Who to say” can
be controlled by collecting a large amount of voice data
of a person and then training the model to learn to
imitate the speaker’s voice. ”How to say” is controlled
by prosodic information such as tone, speech rate, and
emotion of the synthesized speech. In this paper, ”who
to say” and ”how to say” are collectively referred to as
the style features of synthesized speech.
3.3.1 Acoustic model with reference encoder
Style information can be introduced by adding a ref-
erence encoder to synthesize expressive speech. There
are mainly two methods based on reference encoders
that can be used to synthesize speech with a specific
style. The first method is to directly control various
speech style parameters, such as pitch, loudness, and
emotion, by using a trained reference encoder. The sec-
ond method is to input the reference audio into the ref-
erence encoder and use the style parameters encoded
by the reference encoder to transfer the speech style
features between the reference speech and the target
speech. Different methods and models have been pro-
posed to disentangle the different style feature informa-
tion so that each style feature can be easily controlled
individually to synthesize speech with the target style.
These methods and models are described in the follow-
ing paragraphs.
Skerry-Ryan et al. [192] divided the features of speech
into three components: text, speaker, and prosody. A
reference encoder is added to Tacotron to extract the
prosody embedding from the reference speech with a
specific style, and the speaker embedding is obtained
by using a speaker embedding lookup table. Then the
prosody embedding, speaker embedding and text em-
bedding are combined and input into the decoder to
synthesize speech with the style of the reference speech.
Gururani et al. [69] refined the model on the basis of
Skerry-Ryan et al. [192], divided the style features of
speech into pitch and loudness, and selected two 1-
D time series to model the fundamental frequency f0 and loudness of the reference speech respectively. In or-
der to transfer the emotion features in the reference
speech more accurately, Li et al. [120] added two emo-
tion classifiers after the reference encoder and decoder
respectively to enhance emotion classification ability in
the emotion space. Moreover, they adopted a style loss
[54, 88] to measure the style differences between the
generated and reference mel-spectrogram [55, 134].
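A reference encoder of this kind is typically a small convolutional-recurrent network that compresses a reference mel-spectrogram into a fixed-size embedding. The sketch below illustrates the idea under assumed layer sizes; it is not the exact architecture of any cited model.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Minimal reference encoder sketch: 2-D convolutions over the
    reference mel-spectrogram followed by a GRU; the final GRU state
    serves as a fixed-size prosody/style embedding."""
    def __init__(self, n_mels=80, embed_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # two stride-2 convolutions reduce the mel axis by roughly 4x
        self.gru = nn.GRU(64 * ((n_mels + 3) // 4), embed_dim, batch_first=True)

    def forward(self, mel):               # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))  # (batch, 64, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                # h: (1, batch, embed_dim)
        return h.squeeze(0)               # style embedding
```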
Voice conversion (VC) model can disentangle the
speaker-dependent timbre feature from speech [33, 34,
91, 165, 183], but cannot extract other style features
such as the content, pitch and rhythm of speech. In-
spired by the voice conversion model AutoVC [165],
Qian et al. [166] proposed SPEECHFLOW, which is
a speech style conversion model that can disentangle
the rhythm, pitch, content, and timbre information.
Rhythm, pitch and content features are extracted by
three encoders respectively, and timbre feature is rep-
resented by one-hot vector of speaker ID. SPEECH-
FLOW can be trained for speech style conversion by
Table 2 Alignment method of each acoustic model

| Acoustic model | Neural network types | Generative model types | Alignment methods | Characteristics |
|---|---|---|---|---|
| Tacotron (Wang et al., 2017) | CBHG, GRU | Autoregression | Content-based attention | Unstable, alignment errors often occur |
| Char2Wav (Sotelo et al., 2017) | RNN | Autoregression | GMM attention | Low naturalness of synthesized speech |
| Deep Voice 3 (Ping et al., 2017) | CNN | Autoregression | Dot-product attention, positional encoding, attention window | Attention is monotonic |
| VoiceLoop (Taigman et al., 2017) | Shifting buffer | Autoregression | GMM attention | Low naturalness of synthesized speech |
| DCTTS (Tachibana et al., 2018) | CNN | Autoregression | Dot-product attention and guided attention | Stable, alignment errors are rare |
| Tacotron 2 (Shen et al., 2018) | LSTM, CNN | Autoregression | Mixed location-sensitive attention | Able to synthesize long sentences accurately |
| DurIAN (Yu et al., 2019) | CBHG, RNN | Autoregression | Duration prediction model, external alignment model | Stable, alignment errors are rare |
| FastSpeech (Ren et al., 2019) | Transformer | Non-autoregression | Duration prediction model, knowledge distillation | Errors will occur when synthesizing long sentences |
| FastSpeech 2 (Ren et al., 2020) | Transformer | Non-autoregression | Duration prediction model, MFA toolkit | Stable, alignment errors are rare |
| ParaNet (Peng et al., 2020) | CNN | Non-autoregression | Dot-product attention, positional encoding, attention window, multi-layer attention, knowledge distillation | Attention alignment is monotonic and stable |
| EATS (Donahue et al., 2020) | CNN | GAN | Duration prediction model, internal alignment module | Stable, alignment errors are rare |
| Non-Attentive Tacotron (Shen et al., 2020) | RNN | Autoregression | Duration prediction model, external alignment module | Stable, alignment errors are rare |
| FastPitch (Łańcucki, 2020) | Transformer | Non-autoregression | Duration prediction model, knowledge distillation | Can control the pitch contour of synthesized speech |
| Glow-TTS (Kim et al., 2020) | Transformer, Glow | Normalizing flow | Duration prediction model, MAS algorithm | The alignment is monotonic and stable |
| AlignTTS (Zeng et al., 2020) | Transformer | Non-autoregression | Duration prediction model, internal alignment module | Stable, alignment errors are rare |
| SpeedySpeech (Vainer and Dušek, 2020) | CNN | Non-autoregression | Duration prediction model, knowledge distillation | Stable, alignment errors are rare |
| JDI-T (Lim et al., 2020) | Transformer | Non-autoregression | Duration prediction model, knowledge distillation | Joint training of teacher and student network, stable and alignment errors are rare |
| TalkNet (Beliaev et al., 2020) | CNN | Non-autoregression | Duration prediction model, ASR model | Stable, alignment errors are rare |
| Flow-TTS (Miao et al., 2020) | Glow | Normalizing flow | Multi-head dot-product attention, internal length predictor | High quality of synthesized speech, fast training and inference speed |
| DeviceTTS (Huang et al., 2020) | DFSMN, RNN | Combination of autoregression and non-autoregression | Duration prediction model | Stable, alignment errors are rare |
| Parallel Tacotron (Elias et al., 2020) | LConv | Non-autoregression | Duration prediction model, HMM-based aligner | Stable, alignment errors are rare |
| RobuTrans (Li et al., 2020) | Transformer | Autoregression | Duration prediction model, speech recognition tools | Stable, alignment errors are rare |
replacing the input of the three encoders with the spec-
trogram or pitch contour of the reference speech.
Similarly, in order to disentangle different style fea-
tures in speech and achieve the purpose of individually
controlling each feature, Wang et al. [228] introduced
a global style token (GST) network in Tacotron, which
plays a role of clustering. When the GST network is
trained with speech data with various styles, multiple
meaningful and interpretable tokens can be obtained.
The weighted sum of these tokens is used as a style
embedding to control and transfer the style features
of speech. In inference, a specific weight can be cho-
sen directly for each style token, or a reference signal
can be fed to guide the choice of token combination
weights. For the choice of token weight, Kwon et al.
[114] proposed a controlled weight (CW)-based method
to define the weight values by investigating the distri-
bution of each emotion in the emotional vector space.
Um et al. [214] proposed to improve the method of sim-
ply averaging the style embedding vectors belonging
to each emotion category [115] to determine the rep-
resentative weight vectors by maximizing the ratio of
inter-category distance to intra-category distance (I2I),
and proposed to apply the spread-aware I2I (SA-I2I)
method to change the emotion intensity instead of the
simple linear interpolation-based approach. Mellotron
[218] additionally introduces fundamental frequency (f0) information, and takes text, speaker, fundamental frequency f0, attention map, and GST as conditions when synthesizing speech, in which the speaker represents timbre, the fundamental frequency f0 represents pitch, the attention map represents rhythm, and GST represents prosody.
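The following sketch illustrates the core GST computation, a softmax-weighted sum of a small bank of learnable token embeddings; the dimensions, the scaled dot-product scoring, and the variable names are illustrative rather than the exact attention used in the cited work.

```python
import torch
import torch.nn.functional as F

# a small bank of learnable style tokens whose weighted sum acts as the
# style embedding; sizes are illustrative
num_tokens, token_dim = 10, 256
tokens = torch.nn.Parameter(torch.randn(num_tokens, token_dim))

def style_embedding(ref_embedding, query_proj):
    """ref_embedding: (batch, d_ref) output of the reference encoder.
    query_proj: an nn.Linear(d_ref, token_dim) projecting it to a query."""
    query = query_proj(ref_embedding)                  # (batch, token_dim)
    scores = query @ tokens.t() / token_dim ** 0.5     # (batch, num_tokens)
    weights = F.softmax(scores, dim=-1)                # token combination weights
    return weights @ tokens                            # (batch, token_dim)

# at inference, the weights can instead be chosen by hand, e.g. a one-hot
# selection of a single token to impose that token's style
```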
Since GST-Tacotron uses only paired input text and
reference speech for training, inputting unpaired text
and speech during synthesis will cause the generated
sound to become blurry. Moreover, in this case, the ref-
erence encoder may store some text information in the
reference embedding rather than prosody and speaker
information to reconstruct the input speech. Using the
idea of dual learning, Liu et al. [123] proposed to train
GST-Tacotron with unpaired text and speech, and in-
put the output mel-spectrogram into the ASR model
to predict the input text, thus preventing the reference
encoder from encoding any text information. Further-
more, they also use the regularization method of atten-
tion consistency loss to accelerate the training conver-
gence speed of both ASR and TTS models.
In order to control the style of synthesized speech
more flexibly, multiple reference encoders can be used
to extract different style features of multiple reference
speech respectively. For example, Bian et al. [15] used
multiple reference encoders based on GST network to
disentangle different style features, and proposed in-
tercross training technique to separate the style latent
space by introducing orthogonality constraints between
the extracted styles of each encoder. However, this in-
tercross training scheme does not guarantee each com-
bination of style classes is seen during training, caus-
ing a missed opportunity to learn disentangled repre-
sentations of styles and sub-optimal results on disjoint
datasets. Whitehill et al. [229] used an adversarial cy-
cle consistency training scheme to ensure the use of
information from all style dimensions to address the
challenges of multi-reference style transfer on disjoint
datasets. They achieved a higher rate of style transfer
for disjoint datasets than previous models.
Variational auto-encoder (VAE) [102] generates sam-
ples with specific features by sampling from the distri-
bution of latent variables. Latent variables are continu-
ous and can be interpolated, similar to the implicit style
features in speech. The speech style features learned by
VAE in an unsupervised manner can be easily sepa-
rated, scaled and combined. Therefore, there are many
tasks that use VAE to control the synthesized speech
style. The speech style features learned by VAE in an
unsupervised manner can be easily separated, scaled
and combined. Therefore, there are many works using
VAE to control the style of synthesized speech. For
example, Zhang et al. [263] added a VAE network to
Tacotron 2 to learn latent variables representing speech
style. Each dimension of latent variables represents a
different style feature. In order to further disentangle
the various style features of speech, Hsu et al. [76] pro-
posed GMVAE-Tacotron based on the Gaussian mix-
ture VAE network, with two levels of hierarchical latent
variables. The first level is a discrete latent variable,
representing a certain category of style (e.g. speaker
ID, clean/noisy). The second level is a continuous la-
tent variable approximated by the multivariate Gaus-
sian distribution. Each component represents the de-
gree of the feature (e.g. noise level, speaking rate, pitch)
under the category of the first level. In general, it is
equivalent to using the GMM to fit the distribution
of latent variables. This model can effectively factorize
and independently control latent attributes underlying
the speech signal.
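For reference, the two ingredients shared by these VAE-based style models are the reparameterized sampling of the style latent and the KL regularizer of the evidence lower bound; a minimal sketch follows, with tensor shapes and names chosen for illustration only.

```python
import torch

def sample_style_latent(mu, log_var):
    """Reparameterization trick: sample a style latent z ~ N(mu, sigma^2)
    so that gradients flow through mu and log_var. Each dimension of z
    can then be interpolated or scaled to vary one style factor."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_divergence(mu, log_var):
    """KL term of the usual VAE evidence lower bound, pushing the
    approximate posterior towards the standard Gaussian prior."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```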
However, these methods only model the global style
features of speech, without considering prosodic con-
trol at the phoneme and word levels. In order to model
acoustic features at various resolutions, Sun et al. [200],
in addition to modeling global speech features such as
noise and channel number, also modeled word-level and
phoneme-level prosodic features such as fundamental
frequency f0, energy and duration. They used a con-
ditional VAE with an autoregressive structure to make
prosodic features of each layer more interpretable and
to impose hierarchical conditioning across all latent di-
mensions. Parallel Tacotron [49] used two different VAE
models, one similar to Hsu et al. [76] for modeling global
features of speech such as different prosodic patterns of
different speakers, and the other similar to Sun et al.
[200] for modeling phoneme-level fine-grained features.
Normalizing flow can control the latent variables to
synthesize speech with different styles by learning an
invertible mapping of data to a latent space. For ex-
ample, Flowtron [219] applied the normalizing flow to
Tacotron to control speech variation and style transfer
by learning a latent space that stores non-textual infor-
mation. Glow-TTS [98] takes Glow [101] as the decoder
to control the style of synthesized speech by control-
ling the prior distribution of latent variables. It is also
possible to model speech style features with both nor-
malizing flow and VAE. Aggarwal et al. [2] used VAE
and Householder Flow [211] to improve the reference
encoder proposed by Skerry-Ryan et al. [192], thereby
enhancing the disentanglement capability of the TTS
system.
GAN can also be used in style speech synthesis. For
example, Ma et al. [134] enhanced the content-style dis-
entanglement ability and controllability of the model
by combining a pairwise training procedure, an adver-
sarial game, and a collaborative game into one train-
ing scheme. The adversarial game concentrates the true
data distribution, and the collaborative game minimizes
the distance between real samples and generated sam-
ples in both the original space and the latent space.
3.3.2 Acoustic model of explicit modeling style features
The prosody of the speech can also be controlled in-
tuitively by constraining the prosodic features of the
waveform. For example, Morrison et al. [141] proposed
a user-controllable, context-aware neural prosody gen-
erator that allows the input of the f
0
contour for certain
time frames and generates the remaining time frames
from input text and contextual prosody. CHiVE [96] is
a conditional VAE model with a hierarchical structure.
It can generate prosodic features such as fundamental
frequency f0, energy c0 and duration suitable for use
with a vocoder, and yield a prosodic space from which
meaningful prosodic features can be sampled. To effi-
ciently capture the hierarchical nature of the linguistic
input (words, syllables and phones), both the encoder
and decoder parts of the auto-encoder are hierarchical,
in line with the linguistic structure, with layers being
clocked dynamically at the respective rates.
In practical applications, since it is difficult to in-
terpret and give practical meaning to each of the la-
tent variables learned by unsupervised style separation
methods such as GST and VAE, FastSpeech uses a
length adjuster to replicate and expand the hidden state
of the phoneme sequence according to the duration of
each phoneme, thus intuitively controlling the speech
speed and some prosodic features.
FastPitch [117] adds a pitch prediction network to
FastSpeech to control pitch. Compared with FastSpeech
and FastPitch, FastSpeech 2 introduces more style fea-
tures such as pitch, energy, and more accurate duration
as conditional inputs to construct a variance adaptor,
and uses the trained energy, pitch, and duration predictors to synthesize speech with a specific style. DurIAN simply divides speech styles into several discrete categories, learns style embedding vectors from speech data with various styles through supervised learning, and controls the intensity of a style by multiplying its embedding by a scalar.
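The sketch below illustrates the variance-adaptor idea under assumed layer sizes: a small convolutional predictor estimates one scalar per position, and the quantized value is embedded and added back to the hidden sequence. It is a simplification, not the exact FastSpeech 2 architecture.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of a variance predictor: a small 1-D conv stack that predicts
    one scalar (pitch, energy or log-duration) per input position."""
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, h):                             # h: (batch, T, dim)
        x = self.conv(h.transpose(1, 2))              # (batch, hidden, T)
        return self.proj(x.transpose(1, 2)).squeeze(-1)  # (batch, T)

def add_variance(h, value, embedding, bins):
    """Quantize the predicted (or ground-truth) pitch/energy values into
    bins, embed them (embedding: nn.Embedding sized to the bins), and add
    the result back to the hidden sequence as a condition."""
    idx = torch.bucketize(value, bins)
    return h + embedding(idx)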
3.3.3 Multi-speaker acoustic model
Multi-speaker speech synthesis is also an important task
of TTS model. A simple way to synthesize the voices of
multiple speakers is to add a speaker embedding vector
to the input [57, 160]. The speaker embedding vector
can be obtained by additionally training a reference en-
coder. For example, Jia et al. [85], Arik et al. [4] and
Nachmani et al. [145] introduced a speaker encoder in
Tacotron 2, Deep Voice 3 and VoiceLoop [204] respec-
tively to encode the speaker information in the refer-
ence speech into a fixed-dimensional speaker embed-
ding vector. The embedding vector can be extracted
only from a small number of speech fragments of the
target speaker. The speech data corpus used to train
the speaker encoder only needs to contain the record-
ings of a large number of speakers, but does not need to
be of high quality. Even if the training data contains a
small amount of noise, the extraction of timbre features
will not be affected.
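A minimal sketch of this kind of conditioning, assuming the speaker embedding has already been obtained from a lookup table or a separately trained speaker encoder; broadcasting and concatenation are one common choice, not the only one used in the cited systems.

```python
import torch

def condition_on_speaker(encoder_out, speaker_emb):
    """encoder_out: (batch, T, d_text); speaker_emb: (batch, d_spk).
    Broadcast the speaker vector over time and concatenate it so that the
    decoder sees the speaker identity at every step."""
    spk = speaker_emb.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, spk], dim=-1)   # (batch, T, d_text + d_spk)
```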
The speaker adaptation can also be used for multi-
speaker speech synthesis. Arik et al. [4], Taigman et al.
[204], and Zhang et al. [265] fine-tune the trained multi-
speaker model using a small number of ⟨text, speech⟩
data pairs of the target speaker. Fine-tuning can be
applied to the speaker embedded vector [4, 204], part
of the model [265], or the whole model [4]. Moss et al.
[142] proposed a fine-tuning method to select different
model hyperparameters for different speakers, achieving
the goal of synthesizing the voice of a specific speaker
with only a small number of speech samples, in which
the selection of hyperparameters adopts the Bayesian
optimization method [184].
However, these methods are not very effective when
synthesizing the speech of unseen speakers. To solve this
problem, Cooper et al. [41] extracted speaker informa-
tion by using learnable dictionary encoding (LDE) on
the basis of Jia et al. [85], and inserted the speaker em-
bedding into both prenet layer and attention network
of Tacotron 2 as additional information. When training
the speaker encoder, Nachmani et al. [145] introduced a contrastive loss term and a cyclic loss term in addition to the MSE loss, which allows the model to synthesize the voice of a new speaker with only a small amount of audio. Cai et al.
[22] and Shi et al. [191] introduced an identity feedback
constraint by adding an additional loss term between
the reference embedding and the extraction embedding
of the synthesized signal, thus increasing the robustness
and speaker similarity of the produced speeches.
3.4 Low-resource acoustic model
Deep learning-based acoustic models need to be trained with a large number of high-quality ⟨text, speech⟩ data pairs to synthesize high-fidelity speech, and the dataset requirements are even higher when synthesizing speech with specific prosody and emotion. But for 95% of languages, and for audio with a specific style, such corpora are very scarce. Moreover, an English speech corpus used for TTS usually contains about 10-40 hours of speech data and no more than 20,000 distinct words. The largest public English speech corpus, LibriTTS [253], contains only 80,000 words, which is far lower than the size of the regular English vocabulary (usually 130,000-160,000 words). When synthesizing,
the acoustic model may mispronounce words outside
the training set. It is difficult to cover all vocabulary
just by increasing the number of training utterances,
because the natural frequency of words tends to fol-
low the Zipfian distribution [205], which means that
the number of new words contained in the speech data
per hour gradually decreases. Therefore, to achieve a
linear increase in word coverage would require an ex-
ponential increase in audio data, which would be costly
and impractical. Besides, most speech data is recorded
by non-professionals and contains a lot of noise. There-
fore, the shortage of high-quality training data for TTS mainly manifests itself in two ways: the available speech data cannot cover the full vocabulary, and much of it is contaminated with noise.
To solve the problem that the speech data cannot
cover all the words, text and phonemes can be input
into the acoustic network together. During training,
some words can be represented by text randomly, so
that the acoustic model can predict the phoneme pro-
nunciation of unseen words according to the learned
correspondence between characters and phonemes [159,
160]. The text front-end can also be used to convert the
text into phonemes in advance, in order to make the
model only need to learn the pronunciation of a small
number of phonemes.
To solve the problem of the lack of speech data for
minority languages and dialects, the method of cross-
language transfer learning can be used. For example,
Guo et al. [68] and Zhang et al. [261] trained an aver-
age language model with a large Mandarin corpus and
a small Tibetan corpus when training the Tibetan TTS
model, which made up for the lack of Tibetan speech
data. Tu et al. [212] introduced cross-language trans-
fer learning into Tacotron. They used speech data from
high-resource languages to pre-train Tacotron, and then
fine-tuned the pre-trained model with speech data from
low-resource languages. Nekvinda and Dušek [147] used
the idea of meta-learning to train the acoustic model
with only a small number of samples from multiple lan-
guages in order to synthesize speech containing mul-
tiple languages. They used a fully convolutional en-
coder from DCTTS, whose parameters are generated
using a separate contextual parameter generator net-
work [163] conditioned on language embedding, thus
realizing cross-lingual knowledge-sharing.
Semi-supervised pre-training can also be used to re-
duce the demand of the TTS model for paired training
data. Chung et al. [39] proposed training the encoder
and decoder with unpaired text and speech respectively,
and then fine-tuning the pre-trained Tacotron with a
small amount of ⟨text, speech⟩ data pairs. Although
this approach helps the model synthesize more intelligible speech, the experimental results show that pre-training both the encoder and the decoder brings no further improvement over pre-training the decoder alone. Moreover, there is a mismatch be-
tween pre-training only the decoder and fine-tuning the
whole model, because during pre-training the decoder
is only conditioned on the previous frame, while during
fine-tuning the decoder is also conditioned on the text
representation output by the encoder. To avoid poten-
tial error caused by this mismatch and further improve
the data efficiency by using only speech, Zhang and
Lin [255] proposed to use the Vector-Quantized Variational Autoencoder (VQ-VAE) [32, 152] to extract
unsupervised linguistic units from untranscribed speech
and then use ⟨linguistic units, speech⟩ pairs to pre-train
the entire model. The language units act as phonemes
that are paired with the audio, while VQ-VAE plays a
role similar to a speech recognition model. However, VQ-
VAE is trained in an unsupervised way to obtain dis-
cretized linguistic representations, which is suitable for
low-resource languages. Finally, the model is fine-tuned
with a small amount of ⟨text, speech⟩ data pairs.
Using dual learning to train TTS and ASR models
simultaneously can also achieve the purpose of using
text or speech data alone to train both models. Tjandra
et al. [208] proposed an auto-encoder model in which
one is regarded as an encoder and the other as a de-
coder. For example, when there is only speech but no
corresponding text, the ASR model can be used as the
encoder to output text, and the TTS model can be used
as the decoder to output speech, and then the speech
output of the TTS model is expected to be as close as
possible to the input speech. The other situation is sim-
ilar when there is only text but no speech. Ren et al.
[173] also used the idea of dual learning to combine
TTS and ASR to build the capability of the language
understanding and modeling in both speech and text
domains using unpaired data during training, that is,
using a denoising auto-encoder (DAE) to reconstruct corrupted speech and text in an encoder-decoder framework.
They also used a dual transformation (DT) approach
similar to Tjandra et al. to train the model to convert
text to speech and speech to text respectively. The dif-
ference is that Tjandra et al. relied on two well-trained
TTS and ASR models, whereas Ren et al. trained the
two models from scratch, which is better suited to settings where training data is scarce.
Multi-speaker TTS model has much lower require-
ments on the quantity and quality of training data than
models that synthesize speech with a specific style, be-
cause they only need to separate and capture timbre in-
formation in the audio. However, if the speech training
data of the target speaker is too small, the timbre fea-
tures cannot be effectively learned. In order to increase
the amount of speech data of the target speaker, Huy-
brechts et al. [81] used a voice conversion (VC) model to
convert the voice data of other speakers into the voice of
the target speaker for data augmentation, then trained
the TTS model with the expanded speech data, and fi-
nally used the real voice data of the target speaker for
fine-tuning.
The noise in the training data can be reduced by
pre-processing steps. Valentini-Botinhao and Yamag-
ishi [216] took the acoustic features of clean speech
and noisy speech respectively as the input and tar-
get of RNN network, enabling the network to convert
noisy speech into clean speech. Generally, the data in
the corpus containing different styles of speech is of
low quality and contains noise, which will hinder the
training of style speech synthesis model. In this case,
the method of speech style control introduced in Sect.
3.3.1 can be used to train the style extraction network
with clean data and noisy data. By making the net-
work learn about the latent variables of noise features,
it can synthesize clean speech. For example, Wang et al.
[228] trained GST-Tacotron by using data sets mixed
with various noises to learn tokens about the noise fea-
tures. During synthesis, the token representing noiseless
is used as the style embedding to convert noisy refer-
ence speech to clean speech. Hsu et al. [76] used one
dimension of the mixed Gaussian distribution to repre-
sent the noise feature. Clean speech can be synthesized
by using the average value of the clean speech class or
the value of the noise variable extracted from the clean
reference speech as the value of the noise feature.
4 Vocoder
Inspired by the successful application of autoregressive
generative model [220] in the field of image and natu-
ral language generation, Oord et al. [151] first applied
this method to TTS and proposed the most widely
used vocoder WaveNet. In order to capture the long-
range temporal dependencies in audio signals, WaveNet
adopts a multi-layer dilated causal gated convolutional
network, which makes the receptive field grow expo-
nentially with the depth. WaveNet uses speaker iden-
tity and linguistic features as global and local condi-
tions respectively to synthesize the speech of the tar-
get speaker and text. However, WaveNet has a com-
plex network structure and is autoregressive, therefore
the training and inference speed is slow. Moreover, the
speech synthesized with WaveNet is sometimes not nat-
ural. Therefore, since it was proposed, there has been a lot of work to improve it. The direction of improvement is
mainly to accelerate the speed of training and inference
and improve the quality of synthesized speech, which
are respectively called fast vocoder and high-quality
vocoder. These methods are introduced in the following
sections.
4.1 Fast vocoder
The training can be accelerated by reducing the size
and parameters of the vocoder, and the inference can be
accelerated by replacing the autoregressive method in
WaveNet with non-autoregressive methods. The follow-
ing sections will introduce various small-size vocoders
and non-autoregressive vocoders.
4.1.1 Small size vocoder
To improve the speed of training and inference, FFTNet [87] uses the simple ReLU activation function and
1 × 1 convolutions to replace the gated activation units
and dilated convolutions in WaveNet, which reduces the
computational cost. SampleRNN [136] adopted a multi-
scale RNN structure. Different layers operate on audio
data of different time scales. Compared with WaveNet,
it only processes individual samples in the last layer
to improve the synthesis speed, and back-propagates
the gradient of the loss function only on a small frac-
tion of audio to improve the training speed. WaveRNN
[89] only uses a single-layer GRU network with a dual softmax layer that respectively predicts the 8 coarse (most significant) bits and the 8 fine (least significant) bits of each 16-bit audio sample, and applies a
weight pruning technique to further reduce the model
parameters. Furthermore, for the purpose of generat-
ing multiple channels of speech in parallel, WaveRNN
divides a long audio sequence into multiple short se-
quences evenly during inference, and the generation
within and between each short sequence is autoregres-
sive. Although WaveRNN is autoregressive and based
on RNN, its training and inference time is still short,
thus it can be used in systems with few resources such as
mobile phones and embedded systems. The Multi-Band
WaveRNN proposed by Yu et al. [243] further improves
the inference speed of WaveRNN by generating multi-
ple bands in parallel, and performs 8-bit quantization
on the weight value to reduce the model size. LPCNet
[217] reduces the complexity of the model by combin-
ing WaveRNN with linear prediction (LP) technology in
traditional digital signal processing, thereby improving
the synthesis efficiency. The characteristics of various
small-size vocoders are shown in Table 3.
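The coarse/fine factorization used by WaveRNN amounts to splitting each 16-bit sample into its high and low bytes, as in the following sketch (unsigned sample values in 0..65535 are assumed):

```python
def split_coarse_fine(sample_16bit):
    """Split an unsigned 16-bit sample into the 8 coarse (most significant)
    and 8 fine (least significant) bits predicted by the dual softmax."""
    coarse = sample_16bit >> 8        # top 8 bits, 0..255
    fine = sample_16bit & 0xFF        # bottom 8 bits, 0..255
    return coarse, fine

def combine_coarse_fine(coarse, fine):
    """Reconstruct the 16-bit sample from the two predicted bytes."""
    return (coarse << 8) | fine
```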
4.1.2 Non-autoregressive vocoder
As with acoustic models, the vocoders above increase the speed of training and inference to a certain extent, but they all generate audio samples one by one in an autoregressive manner. If a non-autoregressive generation
method can be used to generate speech waveforms in
parallel, the inference speed will be greatly improved.
Based on this idea, various non-autoregressive vocoders
are proposed, and their characteristics are shown in Ta-
ble 4.
The traditional Gaussian autoregressive model is
equivalent to an autoregressive flow (AF) [103], which
is a kind of normalizing flow [175]. The main idea of the
normalizing flow is that a complex distribution can be
obtained by a simple distribution transformed through
multiple invertible functions. It was originally proposed
to make the distribution function of latent variables
in VAE [102] more complex. The flow-based generative
model learns a bidirectional mapping between the input sample x and the latent representation z, i.e. x = f(z) and z = f^{-1}(x). This mapping f is called a normalizing flow and is an invertible function fitted by neural networks, consisting of k invertible transformations f = f_1 \circ \cdots \circ f_k. The normalizing flow transforms a simple density p(z) (such as an isotropic Gaussian distribution) into a complex distribution p(x) by applying the invertible transformation x = f(z). The probability density of x can be calculated through the change-of-variables formula:

p(x) = p(z) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|    (14)

where det denotes the Jacobian determinant. Computing the determinant has complexity O(n^3), where n is the dimension of x and z. In order to reduce the amount of computation, two families of flow models whose Jacobian determinants are easy to calculate have been proposed, based respectively on the autoregressive transformation [220] and the bipartite transformation [45, 46, 101].
During training, the autoregressive flow calculates the latent variables z_i, i = 1, \ldots, D by transforming the speech x = (x_1, x_2, \ldots, x_D):

z_i = \sigma_i(x_{1:i-1}) \cdot x_i + \mu_i(x_{1:i-1})    (15)

where z_{1:D} are D latent variables subject to an isotropic Gaussian distribution, \mu_i are the shift variables representing the mean, and \sigma_i are the scaling variables representing the standard deviation. The training process is non-autoregressive, since z_i only depends on x_{1:i}. In this case, the Jacobian matrix is triangular and its determinant is the product of the diagonal terms:

\det \frac{\partial f^{-1}(x)}{\partial x} = \prod_i \sigma_i(x_{1:i-1})    (16)
During inference, the latent variables z_i, i = 1, \ldots, D and the previously generated audio x_{1:i-1} are used to predict the new x_i:

x_i = \frac{z_i - \mu_i(x_{1:i-1})}{\sigma_i(x_{1:i-1})}    (17)
The inference process of Eq. (17) is autoregressive and therefore slow. In order to speed up inference, Parallel WaveNet [150] and its improved model ClariNet [161] use the inverse autoregressive flow (IAF) [103] to generate speech in parallel. IAF is another normalizing flow. In contrast to AF, IAF uses the previously
Table 3 Small size vocoder

| Vocoder | Neural network types | Characteristics |
|---|---|---|
| WaveNet (Oord et al., 2016) | Dilated causal gated CNN | Based on dilated CNN, the training and inference speed is slow |
| SampleRNN (Mehri et al., 2016) | RNN | Multi-scale RNN structure, training and inference speed is faster than WaveNet |
| FFTNet (Jin et al., 2018) | 1 × 1 CNN | Based on 1 × 1 convolution, the model structure is simple, and the training and inference speed is fast |
| WaveRNN (Kalchbrenner et al., 2018) | GRU | Based on a single layer of GRU, the model structure is simple, and the training and inference speed is fast |
| Multi-Band WaveRNN (Yu et al., 2019) | GRU | Parallel generation of multiple bands, the training and inference speed is fast |
| LPCNet (Valin and Skoglund, 2019) | GRU | The linear prediction (LP) technology is used, the model structure is simple, and the training and inference speed is fast |
obtained latent variables z_{1:i-1} to calculate z_i during training:

z_i = \frac{x_i - \mu_i(z_{1:i-1})}{\sigma_i(z_{1:i-1})}    (18)

This training process is autoregressive. During inference, z_{1:i} is used to predict x_i:

x_i = \sigma_i(z_{1:i-1}) \cdot z_i + \mu_i(z_{1:i-1})    (19)
This inference process is non-autoregressive. Therefore,
AF is fast in training and slow in inference, whereas
IAF is just the opposite. In order to train and synthesize
quickly at the same time, Parallel WaveNet and ClariNet take the autoregressive WaveNet as the teacher network, which provides guidance on the distribution of z_i, i = 1, \ldots, D during training, and use IAF as the student network in charge of the final audio sampling; the problem that IAF cannot be trained in parallel is solved by means of probability density distillation.
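For contrast with the AF sketch above, the following illustrates the IAF generation direction of Eq. (19); the loop is written sequentially for clarity, but since mu_i and sigma_i depend only on the already-known latents, a real implementation produces all samples in one parallel network pass. The function names are again placeholders.

```python
import numpy as np

def iaf_generate(z, mu_fn, sigma_fn):
    """Inverse autoregressive flow, inference direction (Eq. 19): the
    conditioning depends only on z_{1:i-1}, which is known in advance,
    so generation is parallelizable."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        x[i] = sigma_fn(z[:i]) * z[i] + mu_fn(z[:i])
    return x
```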
However, due to the knowledge distillation used in
Parallel WaveNet and ClariNet, the training process is complex. In order to simplify it, Peng et al. [159] proposed WaveVAE. The encoder and decoder of WaveVAE are respectively parameterized by a Gaussian autoregressive WaveNet and the one-step-ahead predictions from an IAF. The encoder q_\phi(z|x) and decoder p_\theta(x|z) can be jointly optimized and trained from scratch by maximizing the evidence lower bound (ELBO) for the observed x, as in a VAE, but at the expense of some sound quality.
In order to train and synthesize more quickly, Ping
et al. [162] proposed WaveFlow which combines au-
toregressive flow and non-autoregressive convolution.
The training process does not need complex knowledge distillation; it is based only on the likelihood function, and combines the advantages of autoregressive and non-autoregressive flows. WaveFlow can be trained quickly, synthesizes high-fidelity speech quickly, and occupies little memory. WaveFlow represents a 1-D audio sequence x = (x_1, x_2, \ldots, x_D) as a 2-D matrix X \in R^{h \times w}, in which adjacent samples are in the same column. The latent variable matrix Z \in R^{h \times w} is defined as:

Z_{i,j} = \sigma_{i,j}(X_{1:i-1,:}) \cdot X_{i,j} + \mu_{i,j}(X_{1:i-1,:})    (20)

where X_{1:i-1,:} denotes all the elements above the i-th row. Therefore, the value of Z_{i,j} depends only on the sample in the i-th row and j-th column and the samples above the i-th row, so all entries of a row can be calculated at the same time. In inference, the samples are generated by:

X_{i,j} = \frac{Z_{i,j} - \mu_{i,j}(X_{1:i-1,:})}{\sigma_{i,j}(X_{1:i-1,:})}    (21)
Although it is autoregressive, it only takes h steps to
generate all samples, and h is usually small, like 8 or
16. WaveFlow uses a 2-D dilated CNN to model a 2-D
Table 4 Non-autoregressive vocoder

| Vocoder | Neural network types | Generative model types | Characteristics |
|---|---|---|---|
| WaveNet (Oord et al., 2016) | Dilated causal gated convolution | Autoregression | Autoregressive generation, slow training and inference speed |
| Parallel WaveNet (Oord et al., 2018) | Dilated causal gated convolution | IAF | Based on knowledge distillation, training and inference speed is fast, Monte Carlo sampling is required to estimate KL divergence, the training process is unstable |
| FloWaveNet (Kim et al., 2018) | Dilated convolution | Normalizing flow | The inference speed is fast, the training convergence speed is slow, the model contains many parameters |
| ClariNet (Ping et al., 2018) | Dilated causal gated convolution | IAF | Based on knowledge distillation, the training and inference speed is fast, the training process is stable |
| WaveGlow (Prenger et al., 2019) | Non-causal dilated convolution, 1 × 1 convolution | Normalizing flow | The inference speed is fast, the training convergence speed is slow, the model contains many parameters |
| MelGAN (Kumar et al., 2019) | Dilated convolution, transposed convolution, grouped convolution | GAN | The inference speed is fast, the training convergence speed is slow |
| GAN-TTS (Bińkowski et al., 2019) | Dilated convolution | GAN | The training and inference speed is fast, no need for mel-spectrogram as input |
| Parallel WaveGAN (Yamamoto et al., 2020) | Non-causal dilated convolution | GAN | The inference speed is fast, the training convergence speed is slow, the model contains many parameters |
| WaveVAE (Peng et al., 2020) | Dilated causal gated convolution | IAF, VAE | The training and inference speed is fast |
| WaveFlow (Ping et al., 2020) | 2-D dilated convolution | Autoregression | Combining the advantages of autoregressive flow and non-autoregressive flow, the training and inference speed is fast |
| WaveGrad (Chen et al., 2020) | Dilated convolution | Diffusion probability model | The inference speed is fast, the training convergence speed is slow |
| DiffWave (Kong et al., 2020) | Bidirectional dilated convolution | Diffusion probability model | The inference speed is fast, the training convergence speed is slow |
| Multi-Band MelGAN (Yang et al., 2021) | Dilated convolution, transposed convolution, grouped convolution | GAN | The training and inference speed is fast |
matrix. Non-causal CNN is used on width dimension,
causal CNN with autoregressive constraints is used on
height dimension, and convolution queue [153] is used to
cache the intermediate hidden states to speed up the au-
toregressive synthesis along the height dimension. Therefore, it retains both the advantage of the autoregressive inference method, which can accurately model the local variations of the waveform, and that of the non-autoregressive convolutional structure, which enables speedy synthesis and captures the long-range structure in the data.
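The squeeze of the waveform into the h × w matrix can be sketched as follows; h and the toy input are illustrative, and the actual WaveFlow implementation may differ in details such as padding.

```python
import numpy as np

def to_waveflow_matrix(x, h):
    """Reshape a 1-D audio sequence into the h x w matrix used by WaveFlow,
    with adjacent samples placed in the same column, so that generation is
    autoregressive only over the h rows (h is small, e.g. 8 or 16)."""
    w = len(x) // h
    return np.asarray(x[: h * w]).reshape(w, h).T   # shape (h, w)

x = np.arange(32)
X = to_waveflow_matrix(x, h=8)   # X[:, 0] == [0, ..., 7]: adjacent samples share a column
```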
WaveGlow [164] and FloWaveNet [99] are also based
on normalizing flow and have similar structures, using
Glow [101] and Real-NVP [46] respectively. Real-NVP
is an improved model of the normalizing flow NICE
[45]. It is trained and inferred by bipartite transfor-
mation, but each layer can only transform a part of
the input. As an improved model of Real-NVP, Glow
introduces 1 × 1 invertible CNN to mix the informa-
tion between two channels and realizes complete trans-
formation. The affine coupling layer in WaveGlow and
FloWaveNet transforms one half x_b of the input vector x at each layer, leaving the other half x_a unchanged. The transformation process is:

z_a = x_a    (22)
z_b = x_b \cdot \sigma_b(x_a) + \mu_b(x_a)    (23)

where x_a and x_b are the result of bisecting x, and z_a and z_b are the corresponding latent variables. The inference process is:

x_a = z_a    (24)
x_b = \frac{z_b - \mu_b(x_a)}{\sigma_b(x_a)}    (25)
Therefore, WaveGlow and FloWaveNet can both compute the latent variable z and synthesize the speech x in parallel. In fact, the bipartite transformation is a special case of the autoregressive transformation [162]: the autoregressive transformation can be reduced to a bipartite transformation by the substitution

(\mu_i(x_{1:i-1}), \sigma_i(x_{1:i-1}))^T = \begin{cases} (0, 1)^T, & i \in a \\ (\mu_i(x_a), \sigma_i(x_a))^T, & i \in b \end{cases}    (26)
However, the bipartite transformation is less expressive than the autoregressive transformation, because it reduces the dependence between the data X and the latent variables Z. As a result, the speech synthesized by WaveGlow and FloWaveNet is of lower quality, and a deeper network is needed to obtain results comparable to the autoregressive models.
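A minimal sketch of an affine coupling layer implementing Eqs. (22)-(25), where `net` stands for any network predicting the scale and shift from the unchanged half; it is a generic sketch, not the exact WaveGlow or FloWaveNet layer.

```python
import torch

def coupling_forward(x, net):
    """Affine coupling (Eqs. 22-23): half of the input passes through
    unchanged and parameterizes an affine transform of the other half.
    `net` maps x_a to (log_sigma_b, mu_b); both directions are parallel."""
    x_a, x_b = x.chunk(2, dim=-1)
    log_sigma, mu = net(x_a).chunk(2, dim=-1)
    z_b = x_b * torch.exp(log_sigma) + mu
    return torch.cat([x_a, z_b], dim=-1), log_sigma.sum(dim=-1)  # log|det J|

def coupling_inverse(z, net):
    """Inverse direction (Eqs. 24-25), also parallel: recover x_b from z_b."""
    z_a, z_b = z.chunk(2, dim=-1)
    log_sigma, mu = net(z_a).chunk(2, dim=-1)
    x_b = (z_b - mu) * torch.exp(-log_sigma)
    return torch.cat([z_a, x_b], dim=-1)
```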
In addition to normalizing flow, GAN [60] can also
be used to synthesize speech in parallel, such as Parallel
WaveGAN [234], MelGAN [113], multi-band MelGAN
[237] and GAN-TTS [17]. Parallel WaveGAN’s genera-
tor is similar in structure to WaveNet, which uses ran-
dom noise and mel-spectrogram conditions to generate
speech waveforms. Its discriminator is used to deter-
mine whether the generated audio is real. MelGAN’s
generator simply uses dilated CNN to increase the re-
ceptive field, and its inference speed is faster than Par-
allel WaveGAN. Its discriminator outputs real/fake la-
bels and feature maps [226], and speeds up training by
using grouped convolutions to reduce the number of model parameters.
The feature matching loss adopted by MelGAN computes feature maps with neural networks, while the multi-resolution STFT loss adopted by Parallel WaveGAN computes feature maps with the STFT. Inspired by this, Multi-Band MelGAN replaces the original feature matching loss of MelGAN with the multi-resolution STFT loss from Parallel WaveGAN, and extends MelGAN to measure the difference between the real and predicted audio on multiple subband scales, which further improves the training and inference speed of MelGAN.
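A minimal sketch of a multi-resolution STFT loss of this kind, combining spectral convergence and log-magnitude terms over a few assumed (n_fft, hop) resolutions; the exact resolutions and weighting used in the cited papers may differ.

```python
import torch

def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram of a (batch, samples) or (samples,) waveform."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(fake, real,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average of spectral convergence and log-magnitude L1 over several
    STFT resolutions; the resolution set is illustrative."""
    loss = 0.0
    for n_fft, hop in resolutions:
        s_fake, s_real = stft_magnitude(fake, n_fft, hop), stft_magnitude(real, n_fft, hop)
        sc = torch.norm(s_real - s_fake) / torch.norm(s_real)                # spectral convergence
        mag = torch.mean(torch.abs(torch.log(s_real) - torch.log(s_fake)))   # log-magnitude L1
        loss = loss + sc + mag
    return loss / len(resolutions)
```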
In order to obtain better results and faster training
speed, GAN-TTS uses an ensemble of small scale un-
conditional and conditional Random Window Discrim-
inators (RWDs) operating at different window sizes,
which respectively assess the realism of the generated
speech and its correspondence with the input text.
The diffusion probability model [73, 193] can also be
used to generate speech waveforms. It is a probabilis-
tic model based on Markov chain, which divides the
mapping relationship between the noise and the tar-
get waveform into several steps, and gradually trans-
forms the simple distribution (e.g., isotropic Gaussian)
into the complex data distribution by means of Markov
chain. A fixed diffusion process (from structured waveform to noise) is defined, and the model learns to decode the noise through the reverse process (from noise to structured waveform). The decoding process re-
quires only a constant few generation steps, so the infer-
ence speed is fast. Chen et al. [27] proposed a fully con-
volutional vocoder WaveGrad to synthesize speech non-
autoregressively based on diffusion probability model
and score matching framework [194, 195]. A similar
model is DiffWave [110], which uses bidirectional di-
lated convolution architecture with a long bidirectional
receptive field and a much smaller number of model pa-
rameters than WaveGrad. However, the inference speed
of the vocoder based on diffusion probability model is
slightly lower than that of the flow-based vocoder.
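The training side of such a diffusion vocoder can be sketched as follows, using the standard closed-form forward corruption; the noise schedule, the denoising network, and the mel conditioning are assumed and omitted here.

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """Forward (diffusion) process: corrupt clean audio x0 with Gaussian
    noise at step t via x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    A diffusion vocoder is trained to predict eps from (x_t, t, mel)."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

# training objective: || eps - eps_theta(x_t, t, mel) ||^2 ; at inference a
# short fixed schedule of reverse steps maps noise back to a waveform.
```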
4.2 High-quality vocoder
To improve the naturalness of speech, WaveNet pro-
poses to expand the receptive field by dilated CNN and
introduce additional conditional information, such as
speaker information (global conditioning) and acous-
tic features (local conditioning), by modeling the con-
ditional probability of audio. WaveNet takes softmax
layer as the output layer of the network, and adopts
nonlinear quantization method of µ-law companding
transformation to obtain discrete-value speech signals.
Although the reconstructed speech signal is close to the
original, the quantization process still introduces white
noise into the original signal. Yoshimura et al. [242]
proposed a quantization noise shaping method based
on mel-cepstrum, which solved this problem by pre-
processing WaveNet with a mel-log spectral approxi-
mation (MLSA) filter [82]. Because the mel-cepstrum
matches the human auditory perception characteris-
tics, this method effectively filters the white noise in-
troduced by the commonly used quantization method
in the speech waveform synthesis system, and has no
extra computational cost compared with WaveNet in
the synthesis stage.
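For reference, a sketch of the standard µ-law companding and 8-bit quantization that WaveNet-style vocoders apply before the softmax output; numerically equivalent routines exist in common audio libraries.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Standard mu-law companding of a waveform in [-1, 1], followed by
    quantization to mu + 1 discrete levels."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)      # integers 0 .. mu

def mu_law_decode(q, mu=255):
    """Inverse companding back to a waveform in [-1, 1]."""
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```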
In order to improve the quality of the speech syn-
thesized by the autoregressive vocoder, Jin et al. [87]
proposed to add zero padding to the input to make
the network have a stronger generalization ability. And
when outputting the result, instead of directly taking
the value of the maximum probability, sampling is con-
ducted according to the probability distribution to sim-
ulate the real speech signal containing noise. Due to the
training error of vocoder, there is always noise in the
generated speech sample. And in the process of autore-
gressive generation, the noise in the synthesized speech
sample will become more and more loud over time. Gen-
erating new samples with noisy speech samples as in-
put to the network adds more and more uncertainty
to the network. Therefore, during the training, they
added some noise to the input to make the network
robust to the input samples containing noise, and re-
duced the noise injected into the pronunciation sam-
ples by post-processing with spectral subtraction noise
reduction [129].
When using implicit generative models such as GAN
to generate audio, speech waveforms of different resolu-
tions can be predicted at the same time to perfect the
details of synthesized speech and stabilize the train-
ing process, as shown in Table 5. Parallel WaveGAN
and Multi-Band MelGAN use a multi-resolution STFT
loss for training. The discriminator in MelGAN adopts
a multi-scale structure to simultaneously discriminate
feature maps of audio waveforms with different sam-
pling frequencies to learn the features of different au-
dio frequency ranges. Besides, MelGAN uses feature
matching loss to optimize both discriminator and gen-
erator, thereby reducing the distance between the fea-
ture maps of the real and synthesized audio. VocGAN
[239] uses both multi-resolution STFT loss and feature
matching loss, and extends the generator on the ba-
sis of MelGAN to output multiple waveforms of differ-
ent scales. It helps the generator learn the mapping of
both low- and high-frequency components of acoustic
features by training the generator with the adversarial
loss calculated by a set of discriminators with different
resolutions. Moreover, VocGAN also applied the joint
conditional and unconditional (JCU) loss [256]. The
conditional loss leads the generator to map the acous-
tic feature of the input mel-spectrogram to the wave-
form more accurately, thus reducing the discrepancy
between the acoustic characteristics of the input mel-
spectrogram and the output waveform. In addition to
using the multi-scale discriminator in MelGAN, HiFi-
GAN [109] introduced the multi-period discriminator
(MPD) to model the periodic patterns of speech. Each
sub-discriminator only accepts equally spaced samples
of an input audio, aiming to capture different implicit
structures from each other by looking at different parts
of the input audio. Besides, the generator in HiFi-GAN
is connected with a multi-receptive field fusion (MRF)
module after each transposed convolution, which can
observe patterns of various lengths in parallel. Grit-
senko et al. [63] proposed a method for training par-
allel vocoder based on the spectral generalized energy
distance (GED) [58, 180, 188] between the generated
and the real audio distribution. The main difference
from other spectrogram-based losses is that, in addi-
tion to the attractive term between the generated data
and the actual data, GED also adds a repulsive term
between generated data to the training loss to avoid
generated samples collapsing to a single point, thus cap-
turing the full data distribution. GED can be combined
with the adversarial loss to further improve the synthe-
sized speech quality.
Similar to the acoustic models, the multi-speaker
TTS task can also be performed only by the vocoder.
Chen et al. [30] borrowed the idea of meta-learning and
proposed three methods to synthesize the voice of a
new speaker using only a small amount of the target
speaker’s speech. The first method is to fix other pa-
rameters of the model and update only the speaker
embedding vector. The second method is to fine-tune
all the parameters of the model. The third method
is to use a trained neural network encoder to predict
the speaker embedding. The experimental results show
that the speech synthesized by the second method has
Table 5 Methods of GAN-based vocoder to improve the naturalness of generated speech

| Vocoder | Characteristics |
|---|---|
| MelGAN (Kumar et al., 2019) | Using multi-scale discriminant structure and feature matching loss |
| Parallel WaveGAN (Yamamoto et al., 2020) | Using multi-resolution STFT loss |
| VocGAN (Yang et al., 2020) | Using multi-resolution STFT loss, feature matching loss, multi-scale waveform generator, and JCU loss |
| HiFi-GAN (Kong et al., 2020) | Using multi-scale discrimination, multi-period discrimination, and MRF |
| Multi-Band MelGAN (Yang et al., 2021) | Using multi-resolution STFT loss |
the highest naturalness. However, the method they pro-
posed only works when the quality of the training speech
data is high.
5 Speech corpus
The proposal of the end-to-end TTS method based on
deep learning reduces the difficulty of developing a high-
quality TTS system. Compared with ASR models, TTS models require more labeled high-quality speech data to achieve good training results, and the number of open-source corpora that meet these conditions is very small.
out experiments, several commonly used open source
TTS corpora are introduced below. The details of each
corpus are shown in Table 6.
5.1 English speech corpus
Due to the versatility of English, the academic research
on English TTS is the most. Therefore, there are many
English TTS corpora available for free, such as VCTK
[223], LJ Speech [83] and LibriTTS [253].
The VCTK corpus¹ includes speech data uttered
by 109 native speakers of English with various accents.
Each speaker reads out about 400 sentences, most of
which were selected from a newspaper plus the Rain-
bow Passage and an elicitation paragraph intended to
identify the speaker's accent. The speakers were recorded with an omni-directional head-mounted microphone in a hemi-anechoic chamber of the University of Edinburgh at a sampling rate of 96 kHz with 24-bit resolution. All recordings were converted into 16 bit, downsampled to
¹ The VCTK corpus is freely available for download from https://datashare.is.ed.ac.uk/handle/10283/2119.
48 kHz, and manually end-pointed. The VCTK corpus
was originally recorded for building HMM-based multi-
speaker TTS systems.
LJ Speech² is a public domain corpus consisting of
13,100 short audio clips of a single speaker, made up of
non-professional audiobooks from the LibriVox project
[95]. Each audio file is a single-channel 16 bit PCM
WAV with a sampling rate of 22,050 Hz. The audio
clips range in length from approximately 1 second to
10 seconds and are segmented automatically based on
silences in the recording, with a total duration of about
24 hours. Clip boundaries generally align with sentence
or clause boundaries. The text was matched to the au-
dio manually, and a QA pass was done to ensure that
the text accurately matched the words spoken in the
audio.
The LibriTTS corpus³ is composed of audio and
text from the LibriSpeech [156] corpus. Librispeech,
made up of audiobooks from the LibriVox project, was
originally designed for ASR research and contains 982
hours of speech data from 2,484 speakers. The Lib-
riTTS corpus inherits some of the properties of the Lib-
riSpeech corpus, while addressing problems that make
LibriSpeech less suitable for TTS tasks. For example,
LibriTTS increases the sampling rate of audio files from
16 kHz to 24 kHz, splits speech at sentence breaks in-
stead of at silences longer than 0.3 seconds, contains
the original text and the standardized text, can ex-
tract contextual information (such as neighbouring sen-
tences), and excludes utterances with significant back-
ground noise. The processed LibriTTS corpus consists
of 585 hours of speech data at 24 kHz sampling rate
² The LJ Speech corpus is freely available for download from https://keithito.com/LJ-Speech-Dataset/.
³ The LibriTTS corpus is freely available for download from http://www.openslr.org/60/.
Table 6 Details of each corpus

| Corpus | Language | Number of speakers | Hours | Labeling method | Sampling rate (kHz) |
|---|---|---|---|---|---|
| VCTK | English | 109 | 44 | Characters | 48 |
| LJ Speech | English | 1 | 24 | Original and standardized characters and phonemes | 22.05 |
| LibriTTS | English | 2,456 | 585 | Original and standardized characters, contextual information | 24 |
| CMU ARCTIC | English | 7 | 7 | Characters | 16 |
| Blizzard2011 | English | 1 | 16.6 | Characters | 16 |
| Blizzard2013 | English | 1 | 300 | Characters | 44.1 |
| Blizzard2017 | English | 1 | 6 | Characters | 44.1 |
| CSMSC | Mandarin | 1 | 12 | Pinyin, rhythm and phoneme boundary | 48 |
| AISHELL-3 | Mandarin | 218 | 85 | Characters, pinyin | 44.1 |
| DiDiSpeech | Mandarin | 6,000 | 800 | Standardized pinyin | 48 |
| CSS10 | German, Greek, Spanish, French, Chinese, Japanese, Russian, Finnish, Hungarian, Dutch | Single speaker per language | - | Original and standardized characters | 22 |
| Common Voice | 60 languages | - | 7,335 | Characters | 48 |
from 2,456 speakers and its corresponding text tran-
scripts.
There are other open source English corpora, such
as the CMU ARCTIC corpus⁴ [108] constructed by
the Language Technologies Institute of Carnegie Mellon
University for unit selection speech synthesis research.
However, the amount of data in this corpus is too small
to train the neural end-to-end TTS model well. Every
year, The Blizzard Challenge, an international speech
synthesis competition, provides participants with open
source English speech data. For example, the corpus of
⁴ The data is freely available for download from http://www.festvox.org/cmu_arctic/.
The Blizzard Challenge 2011, 2013 and 2017⁵ consists
of tens of hours, hundreds of hours and 6 hours of au-
dio and corresponding text transcripts of audiobooks
read by a single speaker, with sampling frequencies of
16 kHz, 44.1 kHz and 44.1 kHz, respectively.
5.2 Mandarin speech corpus
Mandarin is the language with the largest number of
speakers in the world, thus Mandarin TTS has also been
widely researched and applied [57, 160]. However, Man-
darin has a complex tone and prosodic structure [140].
⁵ These data sets are freely available for download from http://www.cstr.ed.ac.uk/projects/blizzard/ and can only be used for non-commercial purposes.
Meanwhile, Chinese characters are ideograms, which
are not directly related to pronunciation. It is neces-
sary to convert the original Chinese text into phonemes
or pinyin as audio transcription. Therefore, compared
with English, the cost of recording and transcribing
high-quality Mandarin corpus is higher, resulting in few
open source high-quality Mandarin corpus. In order to
facilitate researchers to conduct research on Mandarin
TTS, several open source Mandarin corpora that can
be used for TTS will be introduced.
CSMSC (Chinese Standard Mandarin Speech Corpus)⁶ [8] is a single-speaker Mandarin female voice cor-
pus released by data-baker company. The corpus uses
a professional recording studio and recording software
for recording. The recording environment and equip-
ment remain unchanged throughout the recording, and
the signal-to-noise ratio (SNR) of the recording envi-
ronment is not less than 35 dB. The audio format is
a mono PCM WAV with a sampling rate of 48 kHz at 16 bit, and the effective duration is approximately
12 hours. The recordings cover a variety of topics, such
as news, fiction, technology, entertainment, dialogue,
etc. The speech corpus is proofread, and rhythms and
phoneme boundaries are manually edited.
AISHELL-3⁷ [191] is a high-quality Mandarin cor-
pus for multi-speaker TTS published by Shell Shell. It
contains roughly 85 hours of emotion-neutral record-
ings spoken by 218 native Chinese mandarin speakers,
as well as transcripts in Chinese character-level and
pinyin-level. All utterances are recorded using a high-
fidelity microphone (44.1 kHz, 16 bit) in a quiet indoor
environment. The topics of the textual content spread
a wide range of domains including smart home voice
commands, news reports and geographic information.
DiDiSpeech⁸ [67] is a large open-source Mandarin
speech corpus released by DiDi Chuxing company. The
corpus includes approximately 800 hours of speech data
at a sampling rate of 48 kHz from 6,000 speakers and
corresponding text transcripts. All speech data in the
DiDiSpeech corpus are recorded in a quiet environment,
and the audio with significant background noise is fil-
tered. It is suitable for various speech processing tasks,
such as voice conversion, multi-speaker TTS and ASR.
⁶ The CSMSC corpus is available at https://www.data-baker.com/open_source.html for non-commercial use only.
⁷ The AISHELL-3 corpus is available at http://www.aishelltech.com/aishell_3; it supports academic research only, and commercial use is prohibited without permission.
⁸ The DiDiSpeech data set is available upon application at https://outreach.didichuxing.com/research/opendata/.
5.3 Multilingual speech corpus
There has been little research in the TTS field into lan-
guages other than English, partly because of the lack
of available open source corpora. To enable TTS to be
applied to more languages, some researchers have con-
structed speech corpora containing multiple languages,
such as CSS10 [158] and Common Voice [128].
CSS10 is a single-speaker corpus of ten languages, including Chinese, Dutch, French, Finnish, Japanese, Hungarian, Greek, German, Russian and Spanish; it is available for free at https://github.com/Kyubyong/CSS10. It is composed of short audio clips from LibriVox audiobooks and the corresponding standardized transcriptions. All audio files are sampled at 22 kHz.
Common Voice is the largest public multilingual speech corpus, currently containing nearly 9,283 hours (7,335 hours verified) of speech data in 60 languages, fully open to the public and available for free at https://commonvoice.mozilla.org/. The project employs crowdsourcing for both data collection and data validation. The audio clips are released as mono-channel, 16 bit MP3 files with a 48 kHz sampling rate. This corpus is designed for ASR and is rather noisy, so the original audio data needs to be denoised before it is used for the TTS task [147].
5.4 Emotional speech corpus
Emotional TTS has been extensively researched, but one of the current problems in this field is the lack of publicly available emotional speech corpora and the difficulty of recording such data. None of the above-mentioned corpora contains explicit emotional information, and most existing emotional corpora cannot be used effectively to train deep learning-based emotional TTS models, because these data sets either contain only a small number of sentences, such as RAVDESS [128], CREMA-D [23], GEMEP [9] and EMO-DB [18], or contain noise, such as MSP-IMPROV [20] and IEMOCAP [19].
To fill this gap, Tits et al. [207] released the EmoV-DB corpus, which covers five emotions (amusement, anger, sleepiness, disgust, and neutral) and two languages (English and French); the database is available for free at https://github.com/numediart/EmoV-DB. The English speech data is recorded by two male and two female speakers, and the French speech data is recorded by one male speaker. English sentences are taken from the CMU ARCTIC corpus and French sentences from the SIWIS corpus [75]. Each audio file is recorded in 16-bit WAV format.
6 Evaluation method
Speech quality is measured in three aspects: clarity, intelligibility and naturalness. At present, however, there is no uniform evaluation criterion for the quality of synthesized speech. Unlike tasks such as classification and prediction, which can be evaluated quantitatively, generated speech is ultimately judged by human listeners, so its quality usually requires subjective, qualitative evaluation. Subjective evaluation, however, is difficult to standardize, because listener judgments inevitably vary. In addition, some objective speech quality evaluation metrics also have reference value. Therefore, this section summarizes the evaluation methods for synthesized speech from both the subjective and the objective perspective.
6.1 Subjective evaluation method
Subjective evaluation methods are usually more suit-
able for evaluating generative models, but they require
significant resources and face challenges in the relia-
bility, validity and reproducibility of results [84]. The
most commonly used subjective evaluation method is the Mean Opinion Score (MOS), which measures naturalness by asking listeners to score the synthesized speech. MOS adopts a five-point scale, with higher scores indicating higher speech quality; scores can be collected at scale using the CrowdMOS toolkit [176].
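As an illustration of how MOS ratings are typically aggregated, the following Python sketch (an illustrative assumption of this review, not part of the CrowdMOS toolkit) averages the raw listener scores collected for one system and attaches an approximate 95% confidence interval; the function name and the example ratings are hypothetical.

import math

def mos_with_ci(scores, z=1.96):
    """Aggregate 1-5 opinion scores into a MOS and a ~95% confidence interval."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of the ratings (n - 1 in the denominator).
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# Example: ratings from ten listeners for one synthesized test set.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")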
MUSHRA (Multiple Stimuli with Hidden Reference and
Anchor) [171, 182] is also a subjective listening test
method. Specifically, the audio to be tested is presented together with natural speech as a hidden reference (upper limit) and a severely degraded version of the audio as an anchor (lower limit). Listeners score the test audio, the hidden reference and the anchor in a double-blind listening test on a scale from 0 to 100. The 0-100 scale used by MUSHRA allows very small differences to be rated. Its main advantage over the MOS methodology is that MUSHRA requires fewer participants to obtain statistically significant results.
All the above are absolute rating methods, and some-
times it is necessary to compare the speech quality gen-
erated by two models, which requires the use of rela-
tive rating methods, such as comparison mean opinion
score (CMOS) and AB preference test. CMOS is used
to compare the difference between the MOS value of
the model under test and the baseline. AB preference
test selects a better model or finds no significant differ-
ence between the two models by asking the listeners to
compare the speech of the same sentence synthesized by
the two models. The ABX preference test can be used
when comparing multi-speaker TTS models or speech
conversion models. Specifically, listeners are asked to
listen to three speech fragments A, B and X respec-
tively, where X represents the target speech, while A
and B represent the speech generated by the two mod-
els, respectively. The listeners are then asked to judge whether speech A or B is closer to X in terms of speaker characteristics, or to indicate that they cannot give a clear judgment. Finally, the judgments of all listeners are counted to calculate, for each model, the proportion of synthesized speech that sounds more like the target speech.
6.2 Objective evaluation method
The objective evaluation method is mainly the quanti-
tative evaluation of the TTS model and the generated
speech. The differences between the generated samples
and the real samples are usually used to evaluate the
model. However, these evaluation metrics can only re-
flect the data processing ability of the model to a cer-
tain extent, and cannot truly reflect the quality of the
generated speech.
The most intuitive way to objectively evaluate the prosody and accuracy of synthesized speech is to directly calculate the root mean square error (RMSE), absolute error and negative log-likelihood (NLL) of $f_0$, pitch, $c_0$ (the 0-th cepstrum coefficient) and duration between the reference audio and the predicted audio, as well as the character error rate (CER), word error rate (WER) and utterance error rate (UER) of the synthesized speech.
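For the character and word error rates mentioned above, the synthesized speech is usually first transcribed by an ASR system and then compared with the input text. The following sketch computes the standard Levenshtein-distance-based error rate; it is a generic illustration assumed by this review rather than any specific toolkit's implementation, and the function names are hypothetical.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

def error_rate(reference, hypothesis, unit="word"):
    """WER if unit == 'word', CER if unit == 'char' (errors / reference length)."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    return edit_distance(ref, hyp) / len(ref)

# Example: compare an ASR transcript of synthesized speech with the input text.
print(error_rate("the cat sat on the mat", "the cat sat on mat"))       # WER
print(error_rate("speech synthesis", "speach synthesis", unit="char"))  # CER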
Another commonly used objective evaluation met-
ric for judging the difference between the generated
samples and the real samples is Mel-Cepstral Distor-
tion (MCD) [112]. MCD quantifies the reconstruction performance of Mel-Frequency Cepstrum Coefficients (MFCC) by calculating the distance between the synthesized and the reference mel-cepstral features. Its calculation formula is:
\mathrm{MCD}_K = \frac{1}{T}\sum_{t=0}^{T-1}\sqrt{\sum_{k=1}^{K}\left(c_{t,k} - c'_{t,k}\right)^{2}} \qquad (27)
where $c_{t,k}$ and $c'_{t,k}$ are the $k$-th MFCCs of the $t$-th frame of the reference and predicted audio, respectively. MCD is usually computed over $K = 13$ MFCC dimensions. The lower the MCD value, the higher the quality of the synthesized speech. It can be used to evaluate timbral distortion, and its unit is dB.
A similar evaluation metric is mel-spectral distortion (MSD). MSD is calculated in the same way as MCD, but on the logarithmic mel-spectral amplitudes rather than the cepstrum coefficients, which captures harmonic content not reflected in MCD.
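Eq. (27) can be computed directly from frame-aligned MFCC matrices. The sketch below is a minimal NumPy rendering of the formula, under the assumption that the reference and predicted features have already been time-aligned (e.g. by dynamic time warping) and restricted to the K coefficients of interest; the function name is hypothetical.

import numpy as np

def mel_cepstral_distortion(ref_mfcc, pred_mfcc):
    """Eq. (27): mean frame-wise Euclidean distance between MFCC sequences.

    Both inputs have shape (T, K) and are assumed to be time-aligned;
    K = 13 coefficients is the usual choice.
    """
    per_frame = np.sqrt(np.sum((ref_mfcc - pred_mfcc) ** 2, axis=1))
    mcd = per_frame.mean()
    # A constant factor of 10 * sqrt(2) / ln(10) is often applied on top of
    # this distance when reporting MCD in dB.
    return mcd

# Toy usage with random features standing in for extracted MFCCs.
ref = np.random.randn(200, 13)
pred = ref + 0.1 * np.random.randn(200, 13)
print(mel_cepstral_distortion(ref, pred))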
Gross Pitch Error (GPE) and Voicing Decision Er-
ror (VDE) are two commonly used metrics to measure
the error rate of synthesized speech [146]. GPE is the estimation error of the audio $f_0$ value, defined as [192]:
\mathrm{GPE} = \frac{\sum_{t}\mathbf{1}\left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\mathbf{1}[v_t]\,\mathbf{1}[v'_t]}{\sum_{t}\mathbf{1}[v_t]\,\mathbf{1}[v'_t]} \qquad (28)
where $p_t$, $p'_t$ are the pitch signals of the reference and predicted audio, $v_t$, $v'_t$ are the voicing decisions of the reference and predicted audio, and $\mathbf{1}[\cdot]$ is the indicator function. GPE measures the percentage of voiced frames in the predicted audio whose pitch deviates by more than 20% from the reference.
VDE is defined as [192]:
\mathrm{VDE} = \frac{\sum_{t=0}^{T-1}\mathbf{1}\left[v_t \neq v'_t\right]}{T} \qquad (29)
where $v_t$, $v'_t$ are the voicing decisions of the reference and predicted audio, $T$ is the total number of frames, and $\mathbf{1}[\cdot]$ is the indicator function. VDE measures the frame-level voicing decision error rate of the predicted audio. The lower these two metrics, the better. However, some algorithms have low GPE but high VDE. To reduce VDE and GPE at the same time, Chu and Alwan [37] combined GPE and VDE and proposed the $f_0$ Frame Error (FFE) metric.
FFE is used to measure the percentage of frames that
either contain a 20% pitch error (according to GPE) or
a voicing decision error (according to VDE), defined as
[192]:
\mathrm{FFE} = \frac{\sum_{t=0}^{T-1}\left(\mathbf{1}\left[\,|p_t - p'_t| > 0.2\,p_t\,\right]\mathbf{1}[v_t]\,\mathbf{1}[v'_t] + \mathbf{1}\left[v_t \neq v'_t\right]\right)}{T} \qquad (30)
FFE calculates the proportion of frames in which the predicted pitch differs from the true pitch, quantifying the reconstruction error of the $f_0$ trajectory. The lower the value, the better.
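Since Eqs. (28)-(30) share the same ingredients, namely frame-level pitch values and voicing decisions, they can be computed together. The sketch below is a direct NumPy transcription of the three formulas and assumes that the reference and predicted tracks are already frame-aligned; the function name and the toy values are hypothetical.

import numpy as np

def pitch_metrics(p_ref, p_pred, v_ref, v_pred):
    """GPE (Eq. 28), VDE (Eq. 29) and FFE (Eq. 30) for frame-aligned tracks.

    p_ref, p_pred: per-frame pitch values of the reference and predicted audio.
    v_ref, v_pred: per-frame boolean voicing decisions.
    """
    p_ref, p_pred = np.asarray(p_ref, float), np.asarray(p_pred, float)
    v_ref, v_pred = np.asarray(v_ref, bool), np.asarray(v_pred, bool)
    T = len(p_ref)

    both_voiced = v_ref & v_pred
    pitch_err = np.abs(p_ref - p_pred) > 0.2 * p_ref  # >20% pitch deviation

    gpe = np.sum(pitch_err & both_voiced) / np.sum(both_voiced)
    vde = np.sum(v_ref != v_pred) / T
    ffe = (np.sum(pitch_err & both_voiced) + np.sum(v_ref != v_pred)) / T
    return gpe, vde, ffe

# Toy usage with five frames.
gpe, vde, ffe = pitch_metrics(
    p_ref=[200, 210, 0, 190, 205],
    p_pred=[198, 260, 0, 188, 0],
    v_ref=[True, True, False, True, True],
    v_pred=[True, True, False, True, False],
)
print(gpe, vde, ffe)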
Bińkowski et al. [17] also proposed four metrics for evaluating TTS models: the unconditional and conditional Fréchet DeepSpeech Distance (FDSD, cFDSD) and Kernel DeepSpeech Distance (KDSD, cKDSD). These metrics are inspired by metrics commonly used for evaluating GAN-based image generation models [16, 72], and judge the quality of the synthesized speech by calculating the distance between the synthesized audio and the reference audio. Moreover, the quality of the synthesized speech waveform can also be evaluated by calculating the Perceptual Evaluation of Speech Quality (PESQ) [177] between the reference speech and the synthesized speech, with higher values being better.
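If such a full-reference perceptual score is desired, PESQ can be computed with an open-source implementation. The snippet below assumes the third-party Python packages soundfile (for reading WAV files) and pesq (an implementation of ITU-T P.862); it further assumes that both recordings are mono, time-aligned and sampled at 16 kHz, and the file names are placeholders.

import soundfile as sf   # assumed third-party package for WAV loading
from pesq import pesq    # assumed third-party PESQ implementation

# Placeholder file names; both signals are assumed mono, 16 kHz, time-aligned.
ref, sr = sf.read("reference.wav")
deg, _ = sf.read("synthesized.wav")

# Wide-band PESQ; higher scores indicate better perceived quality.
print(pesq(sr, ref, deg, "wb"))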
7 Future development direction
Although existing TTS methods can synthesize high-fidelity speech by drawing on the various Seq2Seq and generative models developed in deep learning fields such as NMT, ASR, image generation and music generation, they still have many shortcomings. For example, deep learning-based end-to-end TTS technology is still unable to synthesize speech stably in real time, and the quality of the generated speech cannot be guaranteed. Therefore, a large proportion of the TTS models currently used in industry are still based on waveform cascade technology [24]. Moreover, state-of-the-art TTS technology is limited to a few common languages such as English and Mandarin. Since it is difficult to obtain ⟨text, speech⟩ data pairs, there has been little research on minority languages and dialects.
Based on the above introduction and summary of
TTS method, it can be concluded that there will be at
least the following development directions in the field
of TTS in the future:
Control the style of speech in a precise and fine-
grained manner Speaking styles such as emotion,
intonation and rhythm often change during conver-
sation. However, current neural TTS systems cannot
precisely control these style features of speech indi-
vidually. How to achieve fine-grained style control
of speech at word level and phrase level will also be
the focus of TTS research in the future. In addition, because emotional speech data is difficult to record and label, how to effectively use emotional speech data that is limited in quantity and quality to train the TTS model, so that it learns representations of the various style features in speech, is also an urgent problem in the field of TTS.
In-depth research on the representation method of
speech signal in deep neural network Children learn
to speak long before they learn to read and write.
They can conduct a dialogue and produce novel sen-
tences, without being trained on an annotated cor-
pus of speech and text or aligned phonetic symbols.
Presumably, they achieve this by recoding the input
speech in their own internal phonetic representa-
tions (proto-phonemes or proto-text) [48]. This idea
can also be applied to TTS systems, as stated in the
goal of the ZeroSpeech Challenge: extract acoustic
units from speech signals by unsupervised learning
and create good data representation. Therefore, rep-
resentation learning and meta-learning can be used
to improve the modeling ability and learning effi-
ciency of TTS model for speech data, thus greatly
reducing the labeled speech data required for train-
ing.
Build a fully end-to-end TTS pipeline Although the
existing TTS models are all called end-to-end, most
of them are divided into three parts: text front-end,
acoustic model and vocoder. These three modules
need to be trained separately, and the errors gener-
ated by each module will gradually accumulate. The
latest TTS frameworks such as ClariNet [161], Fast-
Speech 2s [174], EATS [47] and Wave-Tacotron com-
bine these modules and claim to be fully end-to-end
for training and inference. However, they still gener-
ate intermediate acoustic features as the condition
of the audio generation module, essentially similar
to other methods. A fully end-to-end model that
maps original text or phonemes directly to speech
waveforms would greatly simplify the TTS pipeline.
Apply the deep learning methods used in other tasks
to TTS First, as a generation task, speech synthesis
and image generation have great similarities. Many
methods used in TTS are inspired by image gener-
ation methods. For example, MelNet [221] regards
the speech spectrogram as an image, and synthe-
sizes the mel-spectrogram using a 2-D multi-scale
autoregressive generation method. The methods of
generating images and speech with specific styles
are also very similar. Second, the alignment method
in the acoustic model can learn from the methods
in NMT and ASR, which are also Seq2Seq mod-
els. Third, as recognition and generation are dual
tasks, multi-task learning can be adopted to com-
bine recognition and generation models to improve
each other and reduce the demand for labeled data
during training. In addition to combining TTS and
ASR [123, 173, 208, 209, 232], it is also possible
to combine speaker recognition with multi-speaker
TTS [30, 209], and combine speech emotion recogni-
tion with emotional speech synthesis [120] for dual
training.
8 Conclusion
The research of end-to-end TTS technology based on
deep learning has become a hot topic in the field of
artificial intelligence. In order to give researchers a clear understanding of the latest TTS paradigm, this paper summarizes in detail the latest technologies used in each module of the TTS system, classifies the methods according to their characteristics, and compares their advantages and disadvantages. Furthermore,
the public speech corpus for various TTS tasks and the
commonly used subjective and objective speech qual-
ity evaluation methods are also summarized. Finally,
some suggestions for the future development direction
of TTS are put forward.
References
1. Adiga N, Prasanna S (2019) Acoustic features
modelling for statistical parametric speech synthe-
sis: a review. IETE Technical Review 36(2):130–
149
2. Aggarwal V, Cotescu M, Prateek N, Lorenzo-
Trueba J, Barra-Chicote R (2020) Using vaes and
normalizing flows for one-shot text-to-speech syn-
thesis of expressive speech. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6179–6183
3. Arık SÖ, Chrzanowski M, Coates A, Diamos G,
Gibiansky A, Kang Y, Li X, Miller J, Ng A,
Raiman J, et al. (2017) Deep voice: Real-time neu-
ral text-to-speech. In: International Conference on
Machine Learning, PMLR, pp 195–204
4. Arik SO, Chen J, Peng K, Ping W, Zhou Y (2018)
Neural voice cloning with a few samples. arXiv
preprint arXiv:180206006
5. Aroon A, Dhonde S (2015) Statistical parametric
speech synthesis: A review. In: 2015 IEEE 9th In-
ternational Conference on Intelligent Systems and
Control (ISCO), IEEE, pp 1–5
6. Atal BS, Hanauer SL (1971) Speech analysis and
synthesis by linear prediction of the speech wave.
The journal of the acoustical society of America
50(2B):637–655
7. Bahdanau D, Cho K, Bengio Y (2014) Neural ma-
chine translation by jointly learning to align and
translate. arXiv preprint arXiv:14090473
8. Baker D (2017) Chinese standard mandarin
speech corpus
9. Bänziger T, Mortillaro M, Scherer KR (2012) In-
troducing the geneva multimodal expression cor-
pus for experimental research on emotion percep-
tion. Emotion 12(5):1161
10. Battenberg E, Skerry-Ryan R, Mariooryad S,
Stanton D, Kao D, Shannon M, Bagby T (2020)
Location-relative attention mechanisms for robust
long-form speech synthesis. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6194–6198
11. Baum LE, Petrie T, Soules G, Weiss N (1970) A
maximization technique occurring in the statisti-
cal analysis of probabilistic functions of markov
chains. The annals of mathematical statistics
41(1):164–171
12. Beliaev S, Rebryk Y, Ginsburg B (2020) Talknet:
Fully-convolutional non-autoregressive speech
synthesis model. arXiv preprint arXiv:200505514
13. Bengio S, Vinyals O, Jaitly N, Shazeer N
(2015) Scheduled sampling for sequence prediction
with recurrent neural networks. arXiv preprint
arXiv:150603099
14. Bi M, Lu H, Zhang S, Lei M, Yan Z (2018)
Deep feed-forward sequential memory networks
for speech synthesis. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 4794–4798
15. Bian Y, Chen C, Kang Y, Pan Z (2019) Multi-
reference tacotron by intercross training for style
disentangling, transfer and control in speech syn-
thesis. arXiv preprint arXiv:190402373
16. Bińkowski M, Sutherland DJ, Arbel M, Gretton
A (2018) Demystifying mmd gans. arXiv preprint
arXiv:180101401
17. Bińkowski M, Donahue J, Dieleman S, Clark A,
Elsen E, Casagrande N, Cobo LC, Simonyan K
(2019) High fidelity speech synthesis with adver-
sarial networks. arXiv preprint arXiv:190911646
18. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier
WF, Weiss B (2005) A database of german emo-
tional speech. In: Ninth European Conference on
Speech Communication and Technology
19. Busso C, Bulut M, Lee CC, Kazemzadeh A,
Mower E, Kim S, Chang JN, Lee S, Narayanan
SS (2008) Iemocap: Interactive emotional dyadic
motion capture database. Language resources and
evaluation 42(4):335–359
20. Busso C, Parthasarathy S, Burmania A, Abdel-
Wahab M, Sadoughi N, Provost EM (2016) Msp-
improv: An acted corpus of dyadic interactions to
study emotion perception. IEEE Transactions on
Affective Computing 8(1):67–80
21. Cai Z, Yang Y, Zhang C, Qin X, Li M (2019) Poly-
phone disambiguation for mandarin chinese using
conditional neural network with multi-level em-
bedding features. arXiv preprint arXiv:190701749
22. Cai Z, Zhang C, Li M (2020) From speaker ver-
ification to multispeaker speech synthesis, deep
transfer with feedback constraint. arXiv preprint
arXiv:200504587
23. Cao H, Cooper DG, Keutmann MK, Gur
RC, Nenkova A, Verma R (2014) Crema-
d: Crowd-sourced emotional multimodal actors
dataset. IEEE transactions on affective comput-
ing 5(4):377–390
24. Capes T, Coles P, Conkie A, Golipour L, Had-
jitarkhani A, Hu Q, Huddleston N, Hunt M, Li
J, Neeracher M, et al. (2017) Siri on-device deep
learning-guided unit selection text-to-speech sys-
tem. In: INTERSPEECH, pp 4011–4015
25. Chaudhari S, Polatkan G, Ramanath R, Mithal
V (2019) An attentive survey of attention models.
arXiv preprint arXiv:190402874
26. Chen LH, Raitio T, Valentini-Botinhao C, Ling
ZH, Yamagishi J (2015) A deep generative ar-
chitecture for postfiltering in statistical para-
metric speech synthesis. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing
23(11):2003–2014
27. Chen N, Zhang Y, Zen H, Weiss RJ, Norouzi
M, Chan W (2020) Wavegrad: Estimating gra-
dients for waveform generation. arXiv preprint
arXiv:200900713
28. Chen X, Duan Y, Houthooft R, Schulman J,
Sutskever I, Abbeel P (2016) Infogan: Inter-
pretable representation learning by information
maximizing generative adversarial nets. arXiv
preprint arXiv:160603657
29. Chen X, Kingma DP, Salimans T, Duan Y,
Dhariwal P, Schulman J, Sutskever I, Abbeel
P (2016) Variational lossy autoencoder. arXiv
preprint arXiv:161102731
30. Chen Y, Assael Y, Shillingford B, Budden D, Reed
S, Zen H, Wang Q, Cobo LC, Trask A, Laurie
B, et al. (2018) Sample efficient adaptive text-to-
speech. arXiv preprint arXiv:180910460
31. Chorowski J, Bahdanau D, Serdyuk D, Cho
K, Bengio Y (2015) Attention-based mod-
els for speech recognition. arXiv preprint
arXiv:150607503
32. Chorowski J, Weiss RJ, Bengio S, van den Oord
A (2019) Unsupervised speech representation
learning using wavenet autoencoders. IEEE/ACM
transactions on audio, speech, and language pro-
cessing 27(12):2041–2053
33. Chou Jc, Yeh Cc, Lee Hy, Lee Ls (2018) Multi-
target voice conversion without parallel data by
adversarially learning disentangled audio repre-
sentations. arXiv preprint arXiv:180402812
34. Chou Jc, Yeh Cc, Lee Hy (2019) One-shot voice
conversion by separating speaker and content rep-
resentations with instance normalization. arXiv
preprint arXiv:190405742
35. Chu M, Qian Y (2001) Locating boundaries for
prosodic constituents in unrestricted mandarin
texts. In: International Journal of Computational
Linguistics & Chinese Language Processing, Vol-
ume 6, Number 1, February 2001: Special Issue
on Natural Language Processing Researches in
MSRA, pp 61–82
36. Chu M, Peng H, Zhao Y, Niu Z, Chang E
(2003) Microsoft mulan-a bilingual tts system. In:
2003 IEEE International Conference on Acous-
tics, Speech, and Signal Processing, 2003. Pro-
ceedings.(ICASSP’03)., IEEE, vol 1, pp I–I
37. Chu W, Alwan A (2009) Reducing f0 frame er-
ror of f0 tracking algorithms under noisy condi-
tions with an unvoiced/voiced classification fron-
tend. In: 2009 IEEE International Conference on
Acoustics, Speech and Signal Processing, IEEE,
pp 3969–3972
38. Chung J, Gulcehre C, Cho K, Bengio Y (2014)
Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv preprint
arXiv:14123555
39. Chung YA, Wang Y, Hsu WN, Zhang Y, Skerry-
Ryan R (2019) Semi-supervised training for im-
proving data efficiency in end-to-end speech syn-
thesis. In: ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6940–6944
40. Conkie A, Finch A (2020) Scalable multilingual
frontend for tts. In: ICASSP 2020-2020 IEEE In-
ternational Conference on Acoustics, Speech and
Signal Processing (ICASSP), IEEE, pp 6684–6688
41. Cooper E, Lai CI, Yasuda Y, Fang F, Wang
X, Chen N, Yamagishi J (2020) Zero-shot multi-
speaker text-to-speech with state-of-the-art neural
speaker embeddings. In: ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 6184–
6188
42. Dauphin YN, Fan A, Auli M, Grangier D (2017)
Language modeling with gated convolutional net-
works. In: International conference on machine
learning, PMLR, pp 933–941
43. Deng GF, Tsai CH, Ku T (2018) The historical re-
view and current trends in speech synthesis by bib-
liometric approach. In: International Conference
on Frontier Computing, Springer, pp 1966–1978
44. Devlin J, Chang MW, Lee K, Toutanova K (2018)
Bert: Pre-training of deep bidirectional transform-
ers for language understanding. arXiv preprint
arXiv:181004805
45. Dinh L, Krueger D, Bengio Y (2014) Nice: Non-
linear independent components estimation. arXiv
preprint arXiv:14108516
46. Dinh L, Sohl-Dickstein J, Bengio S (2016) Den-
sity estimation using real nvp. arXiv preprint
arXiv:160508803
47. Donahue J, Dieleman S, Bińkowski M, Elsen E,
Simonyan K (2020) End-to-end adversarial text-
to-speech. arXiv preprint arXiv:200603575
48. Dunbar E, Algayres R, Karadayi J, Bernard M,
Benjumea J, Cao XN, Miskic L, Dugrain C, On-
del L, Black AW, et al. (2019) The zero re-
source speech challenge 2019: Tts without t. arXiv
preprint arXiv:190411469
49. Elias I, Zen H, Shen J, Zhang Y, Jia Y,
Weiss R, Wu Y (2020) Parallel tacotron: Non-
autoregressive and controllable tts. arXiv preprint
arXiv:201011439
50. Ellinas N, Vamvoukakis G, Markopoulos K, Cha-
lamandaris A, Maniati G, Kakoulidis P, Raptis S,
Sung JS, Park H, Tsiakoulis P (2020) High qual-
ity streaming speech synthesis with low, sentence-
length-independent latency. Proc Interspeech 2020
pp 2022–2026
51. Fan Y, Qian Y, Xie FL, Soong FK (2014) Tts
synthesis with bidirectional lstm based recurrent
neural networks. In: Fifteenth annual conference
of the international speech communication associ-
ation
52. Fernandez R, Rendel A, Ramabhadran B, Hoory
R (2013) F0 contour prediction with a deep be-
lief network-gaussian process hybrid model. In:
2013 IEEE International Conference on Acoustics,
Speech and Signal Processing, IEEE, pp 6885–
6889
53. Fernandez R, Rendel A, Ramabhadran B, Hoory
R (2014) Prosody contour prediction with long
short-term memory, bi-directional, deep recurrent
neural networks. In: Fifteenth Annual Conference
of the International Speech Communication Asso-
ciation
54. Gatys LA, Ecker AS, Bethge M (2015) A neu-
ral algorithm of artistic style. arXiv preprint
arXiv:150806576
55. Gatys LA, Ecker AS, Bethge M (2016) Image style
transfer using convolutional neural networks. In:
Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 2414–2423
56. Gehring J, Auli M, Grangier D, Yarats D,
Dauphin YN (2017) Convolutional sequence to se-
quence learning. In: International Conference on
Machine Learning, PMLR, pp 1243–1252
57. Gibiansky A, Arik SÖ, Diamos GF, Miller J, Peng
K, Ping W, Raiman J, Zhou Y (2017) Deep voice
2: Multi-speaker neural text-to-speech. In: NIPS
58. Gneiting T, Raftery AE (2007) Strictly proper
scoring rules, prediction, and estimation. Jour-
nal of the American statistical Association
102(477):359–378
59. Gonzalvo X, Tazari S, Chan Ca, Becker M, Gutkin
A, Silen H (2016) Recent advances in google real-
time hmm-driven unit selection synthesizer
60. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu
B, Warde-Farley D, Ozair S, Courville A, Bengio
Y (2014) Generative adversarial networks. arXiv
preprint arXiv:14062661
61. Graves A (2013) Generating sequences with
recurrent neural networks. arXiv preprint
arXiv:13080850
62. Griffin D, Lim J (1984) Signal estimation from
modified short-time fourier transform. IEEE
Transactions on acoustics, speech, and signal pro-
cessing 32(2):236–243
63. Gritsenko AA, Salimans T, Berg Rvd, Snoek J,
Kalchbrenner N (2020) A spectral energy dis-
tance for parallel speech synthesis. arXiv preprint
arXiv:200801160
64. Gu J, Wang Y, Chen Y, Cho K, Li VO (2018)
Meta-learning for low-resource neural machine
translation. arXiv preprint arXiv:180808437
65. Guo H, Soong FK, He L, Xie L (2019) Exploiting
syntactic features in a parsed tree to improve end-
to-end tts. arXiv preprint arXiv:190404764
66. Guo H, Soong FK, He L, Xie L (2019) A new gan-
based end-to-end tts training algorithm. arXiv
preprint arXiv:190404775
67. Guo T, Wen C, Jiang D, Luo N, Zhang R, Zhao
S, Li W, Gong C, Zou W, Han K, et al. (2020)
Didispeech: A large scale mandarin speech corpus.
arXiv preprint arXiv:201009275
68. Guo W, Yang H, Gan Z (2018) A dnn-based
mandarin-tibetan cross-lingual speech synthesis.
In: 2018 Asia-Pacific Signal and Information Pro-
cessing Association Annual Summit and Confer-
ence (APSIPA ASC), IEEE, pp 1702–1707
69. Gururani S, Gupta K, Shah D, Shakeri Z, Pinto J
(2019) Prosody transfer in neural text to speech
using global pitch and loudness features. arXiv
preprint arXiv:191109645
70. Hayashi T, Watanabe S, Toda T, Takeda K, Tosh-
niwal S, Livescu K (2019) Pre-trained text embed-
dings for enhanced text-to-speech synthesis. In:
INTERSPEECH, pp 4430–4434
71. He M, Deng Y, He L (2019) Robust sequence-to-
sequence acoustic modeling with stepwise mono-
tonic attention for neural tts. arXiv preprint
arXiv:190600672
72. Heusel M, Ramsauer H, Unterthiner T, Nessler B,
Hochreiter S (2017) Gans trained by a two time-
scale update rule converge to a local nash equilib-
rium. arXiv preprint arXiv:170608500
73. Ho J, Jain A, Abbeel P (2020) Denoising
diffusion probabilistic models. arXiv preprint
arXiv:200611239
74. Hochreiter S, Schmidhuber J (1997) Long short-
term memory. Neural computation 9(8):1735–
1780
75. Honnet PE, Lazaridis A, Garner PN, Yamag-
ishi J (2017) The siwis french speech synthesis
database: design and recording of a high quality
french database for speech synthesis. Tech. rep.,
Idiap
76. Hsu WN, Zhang Y, Weiss RJ, Chung YA, Wang
Y, Wu Y, Glass J (2019) Disentangling corre-
lated speaker and noise for speech synthesis via
data augmentation and adversarial factorization.
In: ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 5901–5905
77. Huang FL, Lin JH, Lin XW (2010) Disambigua-
tion for polyphones of chinese based on two-pass
unified approach. In: 2010 International Computer
Symposium (ICS2010), IEEE, pp 603–607
78. Huang Z, Xu W, Yu K (2015) Bidirectional lstm-
crf models for sequence tagging. arXiv preprint
arXiv:150801991
79. Huang Z, Li H, Lei M (2020) Devicetts: A small-
footprint, fast, stable network for on-device text-
to-speech. arXiv preprint arXiv:201015311
80. Hunt AJ, Black AW (1996) Unit selection in a con-
catenative speech synthesis system using a large
speech database. In: 1996 IEEE International
Conference on Acoustics, Speech, and Signal Pro-
cessing Conference Proceedings, IEEE, vol 1, pp
373–376
81. Huybrechts G, Merritt T, Comini G, Perz B,
Shah R, Lorenzo-Trueba J (2020) Low-resource
expressive text-to-speech using data augmenta-
tion. arXiv preprint arXiv:201105707
82. Imai S, Sumita K, Furuichi C (1983) Mel log spec-
trum approximation (mlsa) filter for speech syn-
thesis. Electronics and Communications in Japan
(Part I: Communications) 66(2):10–18
83. Ito K, Johnson L (2017) The lj speech dataset.
https://keithito.com/LJ-Speech-Dataset/
84. Ji S, Luo J, Yang X (2020) A comprehensive sur-
vey on deep music generation: Multi-level repre-
sentations, algorithms, evaluations, and future di-
rections. arXiv preprint arXiv:201106801
85. Jia Y, Zhang Y, Weiss RJ, Wang Q, Shen J,
Ren F, Chen Z, Nguyen P, Pang R, Moreno IL,
et al. (2018) Transfer learning from speaker veri-
fication to multispeaker text-to-speech synthesis.
arXiv preprint arXiv:180604558
86. Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L,
Wang F, Liu Q (2019) Tinybert: Distilling bert for
natural language understanding. arXiv preprint
arXiv:190910351
87. Jin Z, Finkelstein A, Mysore GJ, Lu J (2018)
Fftnet: A real-time speaker-dependent neural
vocoder. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 2251–2255
88. Johnson J, Alahi A, Fei-Fei L (2016) Percep-
tual losses for real-time style transfer and super-
resolution. In: European conference on computer
vision, Springer, pp 694–711
89. Kalchbrenner N, Elsen E, Simonyan K, Noury S,
Casagrande N, Lockhart E, Stimberg F, Oord A,
Dieleman S, Kavukcuoglu K (2018) Efficient neu-
ral audio synthesis. In: International Conference
on Machine Learning, PMLR, pp 2410–2419
90. Kalita J, Deb N (2017) Emotional text to speech
synthesis: A review. International Journal of Ad-
vanced Research in Computer and Communica-
tion Engineering 6(4):428–430
91. Kameoka H, Kaneko T, Tanaka K, Hojo N (2018)
Stargan-vc: Non-parallel many-to-many voice con-
version using star generative adversarial net-
works. In: 2018 IEEE Spoken Language Technol-
ogy Workshop (SLT), IEEE, pp 266–273
92. Kang M, Hong Y (2011) Formant synthesis of
haegeum: a sound analysis/synthesis system us-
ing cepstral envelope. In: 2011 International Con-
ference on Information Science and Applications,
IEEE, pp 1–8
93. Karaali O, Corrigan G, Gerson I, Massey N
(1998) Text-to-speech conversion with neural net-
works: A recurrent tdnn approach. arXiv preprint
cs/9811032
94. Kawahara H, Morise M, Takahashi T, Nisimura R,
Irino T, Banno H (2008) Tandem-straight: A tem-
porally stable power spectral representation for
periodic signals and applications to interference-
free spectrum, f0, and aperiodicity estimation. In:
2008 IEEE International Conference on Acoustics,
Speech and Signal Processing, IEEE, pp 3933–
3936
95. Kearns J (2014) Librivox: Free public domain au-
diobooks. Reference Reviews
96. Kenter T, Wan V, Chan CA, Clark R, Vit J (2019)
Chive: Varying prosody in speech synthesis with
a linguistically driven dynamic hierarchical condi-
tional variational network. In: International Con-
ference on Machine Learning, PMLR, pp 3331–
3340
97. Khorinphan C, Phansamdaeng S, Saiyod S (2014)
Thai speech synthesis with emotional tone: Based
on formant synthesis for home robot. In: 2014
Third ICT International Student Project Confer-
ence (ICT-ISPC), IEEE, pp 111–114
98. Kim J, Kim S, Kong J, Yoon S (2020) Glow-tts:
A generative flow for text-to-speech via monotonic
alignment search. arXiv preprint arXiv:200511129
99. Kim S, Lee SG, Song J, Kim J, Yoon S (2018)
Flowavenet: A generative flow for raw audio. arXiv
preprint arXiv:181102155
100. Kim Y, Rush AM (2016) Sequence-level knowl-
edge distillation. arXiv preprint arXiv:160607947
101. Kingma DP, Dhariwal P (2018) Glow: Genera-
tive flow with invertible 1x1 convolutions. arXiv
preprint arXiv:180703039
102. Kingma DP, Welling M (2013) Auto-encoding
variational bayes. arXiv preprint arXiv:13126114
103. Kingma DP, Salimans T, Jozefowicz R, Chen X,
Sutskever I, Welling M (2016) Improving varia-
tional inference with inverse autoregressive flow.
arXiv preprint arXiv:160604934
104. Kisler T, Reichel U, Schiel F (2017) Multilingual
processing of speech via web services. Computer
Speech & Language 45:326–347
105. Klatt DH (1980) Software for a cascade/parallel
formant synthesizer. the Journal of the Acoustical
Society of America 67(3):971–995
106. Klatt DH (1987) Review of text-to-speech conver-
sion for english. The Journal of the Acoustical So-
ciety of America 82(3):737–793
107. Klein D, Manning CD, et al. (2003) Fast exact in-
ference with a factored model for natural language
parsing. Advances in neural information process-
ing systems pp 3–10
108. Kominek J, Black AW, Ver V (2003) Cmu arctic
databases for speech synthesis
109. Kong J, Kim J, Bae J (2020) Hifi-gan: Generative
adversarial networks for efficient and high fidelity
speech synthesis. arXiv preprint arXiv:201005646
110. Kong Z, Ping W, Huang J, Zhao K, Catanzaro
B (2020) Diffwave: A versatile diffusion model for
audio synthesis. arXiv preprint arXiv:200909761
111. Kriman S, Beliaev S, Ginsburg B, Huang J,
Kuchaiev O, Lavrukhin V, Leary R, Li J, Zhang Y
(2020) Quartznet: Deep automatic speech recogni-
tion with 1d time-channel separable convolutions.
In: ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6124–6128
112. Kubichek R (1993) Mel-cepstral distance mea-
sure for objective speech quality assessment. In:
Proceedings of IEEE Pacific Rim Conference on
Communications Computers and Signal Process-
ing, IEEE, vol 1, pp 125–128
113. Kumar K, Kumar R, de Boissiere T, Gestin L,
Teoh WZ, Sotelo J, de Brébisson A, Bengio Y,
Courville A (2019) Melgan: Generative adversar-
ial networks for conditional waveform synthesis.
arXiv preprint arXiv:191006711
114. Kwon O, Jang I, Ahn C, Kang HG (2019) An effec-
tive style token weight control technique for end-
to-end emotional speech synthesis. IEEE Signal
Processing Letters 26(9):1383–1387
115. Kwon O, Song E, Kim JM, Kang HG (2019) Effec-
tive parameter estimation methods for an excitnet
model in generative text-to-speech systems. arXiv
preprint arXiv:190508486
116. Lamb A, Goyal A, Zhang Y, Zhang S, Courville
A, Bengio Y (2016) Professor forcing: A new al-
gorithm for training recurrent networks. arXiv
preprint arXiv:161009038
117. Łańcucki A (2020) Fastpitch: Parallel text-to-
speech with pitch prediction. arXiv preprint
arXiv:200606873
118. Li N, Liu S, Liu Y, Zhao S, Liu M (2019) Neu-
ral speech synthesis with transformer network. In:
Proceedings of the AAAI Conference on Artificial
Intelligence, vol 33, pp 6706–6713
119. Li N, Liu Y, Wu Y, Liu S, Zhao S, Liu M (2020)
Robutrans: A robust transformer-based text-to-
speech model. In: Proceedings of the AAAI Con-
ference on Artificial Intelligence, vol 34, pp 8228–
8235
120. Li T, Yang S, Xue L, Xie L (2021) Controllable
emotion transfer for end-to-end speech synthe-
sis. In: 2021 12th International Symposium on
Chinese Spoken Language Processing (ISCSLP),
IEEE, pp 1–5
121. Lim D, Jang W, Park H, Kim B, Yoon J, et al.
(2020) Jdi-t: Jointly trained duration informed
transformer for text-to-speech without explicit
alignment. arXiv preprint arXiv:200507799
122. Ling J (2017) Coarse-to-fine attention models for
document summarization. PhD thesis
123. Liu DR, Yang CY, Wu SL, Lee HY (2018) Im-
proving unsupervised style transfer in end-to-end
speech synthesis with end-to-end speech recogni-
tion. In: 2018 IEEE Spoken Language Technology
Workshop (SLT), IEEE, pp 640–647
124. Liu P, Wu X, Kang S, Li G, Su D, Yu D (2019)
Maximizing mutual information for tacotron.
arXiv preprint arXiv:190901145
125. Liu R, Yang J, Liu M (2019) A new end-to-
end long-time speech synthesis system based on
tacotron2. In: Proceedings of the 2019 Interna-
tional Symposium on Signal Processing Systems,
pp 46–50
126. Liu R, Sisman B, Li H (2020) Graphspeech:
Syntax-aware graph attention network for neural
speech synthesis. arXiv preprint arXiv:201012423
127. Liu R, Sisman B, Li J, Bao F, Gao G, Li
H (2020) Teacher-student training for robust
tacotron-based tts. In: ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 6274–
6278
128. Livingstone SR, Russo FA (2018) The ryerson
audio-visual database of emotional speech and
song (ravdess): A dynamic, multimodal set of fa-
cial and vocal expressions in north american en-
glish. PloS one 13(5):e0196391
129. Loizou PC (2013) Speech enhancement: theory
and practice. CRC press
130. Lu C, Zhang P, Yan Y (2019) Self-attention based
prosodic boundary prediction for chinese speech
synthesis. In: ICASSP 2019-2019 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 7035–7039
131. Lu H, King S, Watts O (2013) Combining a vector
space representation of linguistic context with a
deep neural network for text-to-speech synthesis.
In: Eighth ISCA Workshop on Speech Synthesis
132. Ma M, Huang L, Xiong H, Zheng R, Liu K, Zheng
B, Zhang C, He Z, Liu H, Li X, et al. (2018) Stacl:
Simultaneous translation with implicit anticipa-
tion and controllable latency using prefix-to-prefix
framework. arXiv preprint arXiv:181008398
133. Ma M, Zheng B, Liu K, Zheng R, Liu H, Peng K,
Church K, Huang L (2019) Incremental text-to-
speech synthesis with prefix-to-prefix framework.
arXiv preprint arXiv:191102750
134. Ma S, Mcduff D, Song Y (2018) Neural tts styl-
ization with adversarial and collaborative games.
In: International Conference on Learning Repre-
sentations
135. McAuliffe M, Socolof M, Mihuc S, Wagner M, Son-
deregger M (2017) Montreal forced aligner: Train-
able text-speech alignment using kaldi. In: Inter-
speech, vol 2017, pp 498–502
136. Mehri S, Kumar K, Gulrajani I, Kumar R, Jain
S, Sotelo J, Courville A, Bengio Y (2016) Sam-
plernn: An unconditional end-to-end neural audio
generation model. arXiv preprint arXiv:161207837
137. Merboldt A, Zeyer A, Schlüter R, Ney H (2019)
An analysis of local monotonic attention variants.
In: INTERSPEECH, pp 1398–1402
138. Miao C, Liang S, Chen M, Ma J, Wang S, Xiao J
(2020) Flow-tts: A non-autoregressive network for
text to speech based on flow. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 7209–7213
139. Mikolov T, Chen K, Corrado G, Dean J (2013) Ef-
ficient estimation of word representations in vector
space. arXiv preprint arXiv:13013781
140. Minematsu N, Kobayashi S, Shimizu S, Hirose K
(2012) Improved prediction of japanese word ac-
cent sandhi using crf. In: Thirteenth Annual Con-
ference of the International Speech Communica-
tion Association
141. Morrison M, Jin Z, Bryan NJ, Mysore GJ
(2020) Controllable neural prosody synthesis.
arXiv preprint arXiv:200803388
142. Moss HB, Aggarwal V, Prateek N, González
J, Barra-Chicote R (2020) Boffin tts: Few-shot
speaker adaptation by bayesian optimization. In:
ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 7639–7643
143. Moulines E, Charpentier F (1990) Pitch-
synchronous waveform processing techniques for
text-to-speech synthesis using diphones. Speech
communication 9(5-6):453–467
144. Murray IR, Arnott JL, Rohwer EA (1996) Emo-
tional stress in synthetic speech: Progress and
future directions. Speech Communication 20(1-
2):85–91
145. Nachmani E, Polyak A, Taigman Y, Wolf L (2018)
Fitting new speakers based on a short untran-
scribed sample. In: International Conference on
Machine Learning, PMLR, pp 3683–3691
146. Nakatani T, Amano S, Irino T, Ishizuka K, Kondo
T (2008) A method for fundamental frequency es-
timation and voicing decision: Application to in-
fant utterances recorded in real acoustical envi-
ronments. Speech Communication 50(3):203–214
147. Nekvinda T, Dušek O (2020) One model, many
languages: Meta-learning for multilingual text-to-
speech. arXiv preprint arXiv:200800768
148. Ning Y, He S, Wu Z, Xing C, Zhang LJ (2019)
A review of deep learning based speech synthesis.
Applied Sciences 9(19):4050
149. Nose T (2016) Efficient implementation of global
variance compensation for parametric speech
synthesis. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 24(10):1694–
1704
150. Oord A, Li Y, Babuschkin I, Simonyan K,
Vinyals O, Kavukcuoglu K, Driessche G, Lock-
hart E, Cobo L, Stimberg F, et al. (2018) Par-
allel wavenet: Fast high-fidelity speech synthesis.
In: International conference on machine learning,
PMLR, pp 3918–3926
151. Oord Avd, Dieleman S, Zen H, Simonyan K,
Vinyals O, Graves A, Kalchbrenner N, Senior
A, Kavukcuoglu K (2016) Wavenet: A gen-
erative model for raw audio. arXiv preprint
arXiv:160903499
152. Oord Avd, Vinyals O, Kavukcuoglu K (2017) Neu-
ral discrete representation learning. arXiv preprint
arXiv:171100937
153. Paine TL, Khorrami P, Chang S, Zhang Y, Ra-
machandran P, Hasegawa-Johnson MA, Huang TS
(2016) Fast wavenet generation algorithm. arXiv
preprint arXiv:161109482
154. Pan H, Li X, Huang Z (2019) A mandarin prosodic
boundary prediction model based on multi-task
learning. In: INTERSPEECH, pp 4485–4488
155. Pan J, Yin X, Zhang Z, Liu S, Zhang Y, Ma
Z, Wang Y (2020) A unified sequence-to-sequence
front-end model for mandarin text-to-speech syn-
thesis. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6689–6693
156. Panayotov V, Chen G, Povey D, Khudanpur S
(2015) Librispeech: an asr corpus based on public
domain audio books. In: 2015 IEEE international
conference on acoustics, speech and signal process-
ing (ICASSP), IEEE, pp 5206–5210
157. Park K, Lee S (2020) g2pm: A neural grapheme-
to-phoneme conversion package for mandarin chi-
nese based on a new open benchmark dataset.
arXiv preprint arXiv:200403136
158. Park K, Mulc T (2019) Css10: A collection of sin-
gle speaker speech datasets for 10 languages. arXiv
preprint arXiv:190311269
159. Peng K, Ping W, Song Z, Zhao K (2020) Non-
autoregressive neural text-to-speech. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 7586–7598
160. Ping W, Peng K, Gibiansky A, Arik SO, Kannan
A, Narang S, Raiman J, Miller J (2017) Deep voice
3: Scaling text-to-speech with convolutional se-
quence learning. arXiv preprint arXiv:171007654
161. Ping W, Peng K, Chen J (2018) Clarinet: Paral-
lel wave generation in end-to-end text-to-speech.
arXiv preprint arXiv:180707281
162. Ping W, Peng K, Zhao K, Song Z (2020) Wave-
flow: A compact flow-based model for raw audio.
In: International Conference on Machine Learn-
ing, PMLR, pp 7706–7716
163. Platanios EA, Sachan M, Neubig G, Mitchell T
(2018) Contextual parameter generation for uni-
versal neural machine translation. arXiv preprint
arXiv:180808493
164. Prenger R, Valle R, Catanzaro B (2019) Waveg-
low: A flow-based generative network for speech
synthesis. In: ICASSP 2019-2019 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 3617–3621
165. Qian K, Zhang Y, Chang S, Yang X, Hasegawa-
Johnson M (2019) Autovc: Zero-shot voice style
transfer with only autoencoder loss. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 5210–5219
166. Qian K, Zhang Y, Chang S, Hasegawa-Johnson
M, Cox D (2020) Unsupervised speech decompo-
sition via triple information bottleneck. In: Inter-
national Conference on Machine Learning, PMLR,
pp 7836–7846
167. Qian Y, Wu Z, Ma X, Soong F (2010) Automatic
prosody prediction and detection with conditional
random field (crf) models. In: 2010 7th Interna-
tional Symposium on Chinese Spoken Language
Processing, IEEE, pp 135–138
168. Qian Y, Fan Y, Hu W, Soong FK (2014) On the
training aspects of deep neural network (dnn) for
parametric tts synthesis. In: 2014 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp 3829–3833
169. Radford A, Wu J, Child R, Luan D, Amodei D,
Sutskever I (2019) Language models are unsuper-
vised multitask learners. OpenAI blog 1(8):9
170. Raffel C, Luong MT, Liu PJ, Weiss RJ, Eck D
(2017) Online and linear-time attention by enforc-
ing monotonic alignments. In: International Con-
ference on Machine Learning, PMLR, pp 2837–
2846
171. Recommendation I (2001) 1534-1,“method for the
subjective assessment of intermediate sound qual-
ity (mushra)”. International Telecommunications
Union, Geneva, Switzerland 2
172. Ren Y, Ruan Y, Tan X, Qin T, Zhao S,
Zhao Z, Liu TY (2019) Fastspeech: Fast, robust
and controllable text to speech. arXiv preprint
arXiv:190509263
173. Ren Y, Tan X, Qin T, Zhao S, Zhao Z, Liu
TY (2019) Almost unsupervised text to speech
and automatic speech recognition. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 5410–5419
174. Ren Y, Hu C, Qin T, Zhao S, Zhao Z,
Liu TY (2020) Fastspeech 2: Fast and high-
quality end-to-end text-to-speech. arXiv preprint
arXiv:200604558
175. Rezende D, Mohamed S (2015) Variational in-
ference with normalizing flows. In: International
Conference on Machine Learning, PMLR, pp
1530–1538
176. Ribeiro F, Florêncio D, Zhang C, Seltzer M (2011)
Crowdmos: An approach for crowdsourcing mean
opinion score studies. In: 2011 IEEE international
conference on acoustics, speech and signal process-
ing (ICASSP), IEEE, pp 2416–2419
177. Rix AW, Beerends JG, Hollier MP, Hekstra
AP (2001) Perceptual evaluation of speech qual-
ity (pesq)-a new method for speech quality as-
sessment of telephone networks and codecs. In:
2001 IEEE International Conference on Acoustics,
Speech, and Signal Processing. Proceedings (Cat.
No. 01CH37221), IEEE, vol 2, pp 749–752
178. Saito Y, Takamichi S, Saruwatari H (2017) Sta-
tistical parametric speech synthesis incorporat-
ing generative adversarial networks. IEEE/ACM
Transactions on Audio, Speech, and Language
Processing 26(1):84–96
179. Schröder M (2001) Emotional speech synthesis:
A review. In: Seventh European Conference on
Speech Communication and Technology
180. Sejdinovic D, Sriperumbudur B, Gretton A, Fuku-
mizu K (2013) Equivalence of distance-based and
rkhs-based statistics in hypothesis testing. The
Annals of Statistics pp 2263–2291
181. Sennrich R, Haddow B, Birch A (2015) Neural
machine translation of rare words with subword
units. arXiv preprint arXiv:150807909
182. Series B (2014) Method for the subjective assess-
ment of intermediate quality level of audio sys-
tems. International Telecommunication Union Ra-
diocommunication Assembly
183. Serrà J, Pascual S, Segura C (2019) Blow:
a single-scale hyperconditioned flow for non-
parallel raw-audio voice conversion. arXiv preprint
arXiv:190600794
184. Shahriari B, Swersky K, Wang Z, Adams RP,
De Freitas N (2015) Taking the human out of the
loop: A review of bayesian optimization. Proceed-
ings of the IEEE 104(1):148–175
185. Shan C, Xie L, Yao K (2016) A bi-directional lstm
approach for polyphone disambiguation in man-
darin chinese. In: 2016 10th International Sym-
posium on Chinese Spoken Language Processing
(ISCSLP), IEEE, pp 1–5
186. Shankar S, Garg S, Sarawagi S (2018) Surpris-
ingly easy hard-attention for sequence to sequence
learning. In: Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Pro-
cessing, pp 640–645
187. Shaw P, Uszkoreit J, Vaswani A (2018) Self-
attention with relative position representations.
arXiv preprint arXiv:180302155
188. Shen C, Vogelstein JT (2020) The exact equiva-
lence of distance and kernel methods in hypothesis
testing. AStA Advances in Statistical Analysis pp
1–19
189. Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N,
Yang Z, Chen Z, Zhang Y, Wang Y, Skerrv-Ryan
R, et al. (2018) Natural tts synthesis by condition-
ing wavenet on mel spectrogram predictions. In:
2018 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 4779–4783
190. Shen J, Jia Y, Chrzanowski M, Zhang Y, Elias
I, Zen H, Wu Y (2020) Non-attentive tacotron:
Robust and controllable neural tts synthesis in-
cluding unsupervised duration modeling. arXiv
preprint arXiv:201004301
191. Shi Y, Bu H, Xu X, Zhang S, Li M (2020) Aishell-
3: A multi-speaker mandarin tts corpus and the
baselines. arXiv preprint arXiv:201011567
192. Skerry-Ryan R, Battenberg E, Xiao Y, Wang Y,
Stanton D, Shor J, Weiss R, Clark R, Saurous
RA (2018) Towards end-to-end prosody trans-
fer for expressive speech synthesis with tacotron.
In: international conference on machine learning,
PMLR, pp 4693–4702
193. Sohl-Dickstein J, Weiss E, Maheswaranathan N,
Ganguli S (2015) Deep unsupervised learning us-
ing nonequilibrium thermodynamics. In: Interna-
tional Conference on Machine Learning, PMLR,
pp 2256–2265
194. Song Y, Ermon S (2020) Improved techniques
for training score-based generative models. arXiv
preprint arXiv:200609011
195. Song Y, Garg S, Shi J, Ermon S (2020) Sliced
score matching: A scalable approach to density
and score estimation. In: Uncertainty in Artificial
Intelligence, PMLR, pp 574–584
196. Sotelo J, Mehri S, Kumar K, Santos JF, Kastner
K, Courville A, Bengio Y (2017) Char2wav: End-
to-end speech synthesis
197. Sruthi K, Meharban M (2020) Review on im-
age captioning and speech synthesis techniques.
In: 2020 6th International Conference on Ad-
vanced Computing and Communication Systems
(ICACCS), IEEE, pp 352–356
198. Stephenson B, Besacier L, Girin L, Hueber T
(2020) What the future brings: Investigating the
impact of lookahead for incremental neural tts.
arXiv preprint arXiv:200902035
199. Stephenson B, Hueber T, Girin L, Besacier L
(2021) Alternate endings: Improving prosody for
incremental neural tts with predicted future text
input. arXiv preprint arXiv:210209914
200. Sun G, Zhang Y, Weiss RJ, Cao Y, Zen H, Wu
Y (2020) Fully-hierarchical fine-grained prosody
modeling for interpretable speech synthesis. In:
ICASSP 2020-2020 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6264–6268
201. Székely É, Henter GE, Beskow J, Gustafson J
(2020) Breathing and speech planning in spon-
taneous speech synthesis. In: ICASSP 2020-2020
IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 7649–7653
202. Tachibana H, Uenoyama K, Aihara S (2018) Ef-
ficiently trainable text-to-speech system based
on deep convolutional networks with guided at-
tention. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 4784–4788
203. Tahon M, Lecorvé G, Lolive D (2018) Can we
generate emotional pronunciations for expressive
speech synthesis? IEEE Transactions on Affective
Computing 11(4):684–695
204. Taigman Y, Wolf L, Polyak A, Nachmani E (2017)
Voiceloop: Voice fitting and synthesis via a phono-
logical loop. arXiv preprint arXiv:170706588
205. Taylor J, Richmond K (2019) Analysis of pronun-
ciation learning in end-to-end speech synthesis. In:
INTERSPEECH, pp 2070–2074
206. Taylor P (2009) Text-to-speech synthesis. Cam-
bridge university press
207. Tits N, El Haddad K, Dutoit T (2019) Emotional
speech datasets for english speech synthesis pur-
pose: A review. In: Proceedings of SAI Intelligent
Systems Conference, Springer, pp 61–66
208. Tjandra A, Sakti S, Nakamura S (2017) Listen-
ing while speaking: Speech chain by deep learning.
In: 2017 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), IEEE, pp 301–
308
209. Tjandra A, Sakti S, Nakamura S (2018) Machine
speech chain with one-shot speaker adaptation.
arXiv preprint arXiv:180310525
210. Tokuda K, Nankaku Y, Toda T, Zen H, Yamag-
ishi J, Oura K (2013) Speech synthesis based on
hidden markov models. Proceedings of the IEEE
101(5):1234–1252
211. Tomczak JM, Welling M (2016) Improving varia-
tional auto-encoders using householder flow. arXiv
preprint arXiv:161109630
212. Tu T, Chen YJ, Yeh Cc, Lee HY (2019) End-
to-end text-to-speech for low-resource languages
by cross-lingual transfer learning. arXiv preprint
arXiv:190406508
213. Tuerk C, Robinson T (1993) Speech synthesis us-
ing artificial neural networks trained on cepstral
coefficients. In: Third European Conference on
Speech Communication and Technology
214. Um SY, Oh S, Byun K, Jang I, Ahn C, Kang HG
(2020) Emotional speech synthesis with rich and
granularized control. In: ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP), IEEE, pp 7254–
7258
215. Vainer J, Dušek O (2020) Speedyspeech: Ef-
ficient neural speech synthesis. arXiv preprint
arXiv:200803802
216. Valentini-Botinhao C, Yamagishi J (2018) Speech
enhancement of noisy and reverberant speech for
text-to-speech. IEEE/ACM Transactions on Au-
dio, Speech, and Language Processing 26(8):1420–
1433
217. Valin JM, Skoglund J (2019) Lpcnet: Improving
neural speech synthesis through linear prediction.
In: ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 5891–5895
218. Valle R, Li J, Prenger R, Catanzaro B (2020) Mel-
lotron: Multispeaker expressive voice synthesis by
conditioning on rhythm, pitch and global style to-
kens. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6189–6193
219. Valle R, Shih K, Prenger R, Catanzaro B (2020)
Flowtron: an autoregressive flow-based genera-
tive network for text-to-speech synthesis. arXiv
preprint arXiv:200505957
220. Van Oord A, Kalchbrenner N, Kavukcuoglu K
(2016) Pixel recurrent neural networks. In: Inter-
national Conference on Machine Learning, PMLR,
pp 1747–1756
221. Vasquez S, Lewis M (2019) Melnet: A generative
model for audio in the frequency domain. arXiv
preprint arXiv:190601083
222. Vaswani A, Shazeer N, Parmar N, Uszkoreit J,
Jones L, Gomez AN, Kaiser L, Polosukhin I
(2017) Attention is all you need. arXiv preprint
arXiv:170603762
223. Veaux C, Yamagishi J, MacDonald K, et al.
(2016) Superseded-cstr vctk corpus: English
multi-speaker corpus for cstr voice cloning toolkit
224. Vogten L, Berendsen E (1988) From text to
speech: the mitalk system. Journal of Phonetics
16(3):371–375
225. Wang G (2019) Deep text-to-speech system with
seq2seq model. arXiv preprint arXiv:1903.07398
226. Wang TC, Liu MY, Zhu JY, Tao A, Kautz J,
Catanzaro B (2018) High-resolution image syn-
thesis and semantic manipulation with conditional
GANs. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp 8798–
8807
227. Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss
RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S,
et al. (2017) Tacotron: Towards end-to-end speech
synthesis. arXiv preprint arXiv:1703.10135
228. Wang Y, Stanton D, Zhang Y, Ryan RS, Batten-
berg E, Shor J, Xiao Y, Jia Y, Ren F, Saurous RA
(2018) Style tokens: Unsupervised style model-
ing, control and transfer in end-to-end speech syn-
thesis. In: International Conference on Machine
Learning, PMLR, pp 5180–5189
229. Whitehill M, Ma S, McDuff D, Song Y
(2019) Multi-reference neural TTS stylization with
adversarial cycle consistency. arXiv preprint
arXiv:1910.11958
230. Wightman CW, Talkin DT (1997) The aligner:
Text-to-speech alignment using Markov models.
In: Progress in speech synthesis, Springer, pp 313–
323
231. Wu F, Fan A, Baevski A, Dauphin YN, Auli
M (2019) Pay less attention with lightweight
and dynamic convolutions. arXiv preprint
arXiv:1901.10430
232. Xu J, Tan X, Ren Y, Qin T, Li J, Zhao S, Liu TY
(2020) LRSpeech: Extremely low-resource speech
synthesis and recognition. In: Proceedings of the
26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pp 2802–
2812
233. Xu K, Ba J, Kiros R, Cho K, Courville A,
Salakhutdinov R, Zemel R, Bengio Y (2015) Show,
attend and tell: Neural image caption generation
with visual attention. In: International Conference
on Machine Learning, PMLR, pp 2048–2057
234. Yamamoto R, Song E, Kim JM (2020) Paral-
lel WaveGAN: A fast waveform generation model
based on generative adversarial networks with
multi-resolution spectrogram. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6199–6203
235. Yanagita T, Sakti S, Nakamura S (2019) Neural
iTTS: Toward synthesizing speech in real-time with
end-to-end neural text-to-speech framework. In:
Proceedings of the 10th ISCA Speech Synthesis
Workshop, pp 183–188
236. Yang B, Zhong J, Liu S (2019) Pre-trained text
representations for improving front-end text pro-
cessing in Mandarin text-to-speech synthesis. In:
INTERSPEECH, pp 4480–4484
237. Yang G, Yang S, Liu K, Fang P, Chen W, Xie L
(2021) Multi-band MelGAN: Faster waveform gen-
eration for high-quality text-to-speech. In: 2021
IEEE Spoken Language Technology Workshop
(SLT), IEEE, pp 492–498
238. Yang J, Wang Y, Liu H, Li J, Lu J (2014)
Deep learning theory and its application in speech
recognition. Commun Countermeas 33:1–5
239. Yang J, Lee J, Kim Y, Cho H, Kim I (2020)
VocGAN: A high-fidelity real-time vocoder with a
hierarchically-nested adversarial network. arXiv
preprint arXiv:2007.15256
240. Yasuda Y, Wang X, Yamagishi J (2019) Ini-
tial investigation of an encoder-decoder end-
to-end TTS framework using marginalization of
monotonic hard latent alignments. arXiv preprint
arXiv:1908.11535
241. Yoshimura T, Tokuda K, Masuko T, Kobayashi
T, Kitamura T (1999) Simultaneous modeling
of spectrum, pitch and duration in HMM-based
speech synthesis. In: Sixth European Conference
on Speech Communication and Technology
242. Yoshimura T, Hashimoto K, Oura K, Nankaku
Y, Tokuda K (2018) Mel-cepstrum-based quanti-
zation noise shaping applied to neural-network-
based speech waveform synthesis. IEEE/ACM
Transactions on Audio, Speech, and Language
Processing 26(7):1177–1184
243. Yu C, Lu H, Hu N, Yu M, Weng C, Xu K, Liu P,
Tuo D, Kang S, Lei G, et al. (2019) DurIAN: Du-
ration informed attention network for multimodal
synthesis. arXiv preprint arXiv:1909.01700
244. Yu L, Blunsom P, Dyer C, Grefenstette E, Kocisky
T (2016) The neural noisy channel. arXiv preprint
arXiv:1611.02554
245. Yu L, Buys J, Blunsom P (2016) Online segment
to segment neural transduction. arXiv preprint
arXiv:1609.08194
246. Zaremba W, Sutskever I (2015) Reinforcement
learning neural Turing machines-revised. arXiv
preprint arXiv:1505.00521
247. Ze H, Senior A, Schuster M (2013) Statistical
parametric speech synthesis using deep neural net-
works. In: 2013 IEEE International Conference on
Acoustics, Speech and Signal Processing, IEEE, pp
7962–7966
248. Zen H, Sak H (2015) Unidirectional long short-
term memory recurrent neural network with recur-
rent output layer for low-latency speech synthesis.
In: 2015 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP),
IEEE, pp 4470–4474
249. Zen H, Nose T, Yamagishi J, Sako S, Masuko
T, Black AW, Tokuda K (2007) The HMM-based
speech synthesis system (HTS) version 2.0. In: SSW,
Citeseer, pp 294–299
250. Zen H, Tokuda K, Black AW (2009) Statistical
parametric speech synthesis. Speech Communica-
tion 51(11):1039–1064
251. Zen H, Sak H, Graves A, Senior A (2014) Statisti-
cal parametric speech synthesis based on recurrent
neural networks. In: Poster presentation given at
UKSpeech Conference
252. Zen H, Agiomyrgiannakis Y, Egberts N, Hender-
son F, Szczepaniak P (2016) Fast, compact, and
high quality LSTM-RNN based statistical paramet-
ric speech synthesizers for mobile devices. arXiv
preprint arXiv:1606.06061
253. Zen H, Dang V, Clark R, Zhang Y, Weiss RJ, Jia
Y, Chen Z, Wu Y (2019) LibriTTS: A corpus derived
from LibriSpeech for text-to-speech. arXiv preprint
arXiv:1904.02882
254. Zeng Z, Wang J, Cheng N, Xia T, Xiao J (2020)
AlignTTS: Efficient feed-forward text-to-speech sys-
tem without explicit alignment. In: ICASSP 2020-
2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), IEEE,
pp 6714–6718
255. Zhang H, Lin Y (2020) Unsupervised learn-
ing for sequence-to-sequence text-to-speech
for low-resource languages. arXiv preprint
arXiv:2008.04549
256. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang
X, Metaxas DN (2018) StackGAN++: Realistic im-
age synthesis with stacked generative adversarial
networks. IEEE transactions on pattern analysis
and machine intelligence 41(8):1947–1962
257. Zhang H, Goodfellow I, Metaxas D, Odena A
(2019) Self-attention generative adversarial net-
works. In: International Conference on Machine
Learning, PMLR, pp 7354–7363
258. Zhang J, Pan J, Yin X, Li C, Liu S, Zhang Y,
Wang Y, Ma Z (2020) A hybrid text normalization
system using multi-head self-attention for Man-
darin. In: ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 6694–6698
259. Zhang JX, Ling ZH, Dai LR (2018) Forward at-
tention in sequence-to-sequence acoustic modeling
for speech synthesis. In: 2018 IEEE International
Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, pp 4789–4793
260. Zhang S, Lei M, Yan Z, Dai L (2018) Deep-
FSMN for large vocabulary continuous speech
recognition. In: 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 5869–5873
261. Zhang W, Yang H, Bu X, Wang L (2019) Deep
learning for Mandarin-Tibetan cross-lingual speech
synthesis. IEEE Access 7:167884–167894
262. Zhang Y, Deng L, Wang Y (2020) Unified Man-
darin TTS front-end based on distilled BERT model.
arXiv preprint arXiv:2012.15404
263. Zhang YJ, Pan S, He L, Ling ZH (2019)
Learning latent representations for style control
and transfer in end-to-end speech synthesis. In:
ICASSP 2019-2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, pp 6945–6949
264. Zhang Z, Wu F, Yang C, Dong M, Zhou F (2016)
Mandarin prosodic phrase prediction based on
syntactic trees. In: SSW, pp 160–165
265. Zhang Z, Tian Q, Lu H, Chen LH, Liu S
(2020) AdaDurIAN: Few-shot adaptation for neu-
ral text-to-speech with DurIAN. arXiv preprint
arXiv:2005.05642
266. Zheng Y, Tao J, Wen Z, Li Y (2018) BLSTM-CRF
based end-to-end prosodic boundary prediction
with context sensitive embeddings in a text-to-
speech front-end. In: INTERSPEECH, pp 47–51
267. Zheng Y, Tao J, Wen Z, Yi J (2019) Forward–
backward decoding sequence for regularizing end-
to-end TTS. IEEE/ACM Transactions on Audio,
Speech, and Language Processing 27(12):2067–
2079