Text To Speech with Deep Learning Introduction

Joseph Cottingham
8 min read · Oct 8, 2021


Text to speech, or speech synthesis, can be accomplished with a variety of models. This article covers the next generation of Text To Speech models. They differ in basic architecture from their predecessors, which rely on an Auto-Regressive architecture. While the predecessors perform well, their Auto-Regressive architecture holds them back: it computes samples serially, one after another. This inability to take advantage of parallel computing makes the models slow, and that slow speed limits their usefulness for real-time speech generation.

If you find any errors, please let me know!

Why Not Machine Learning?

Human speech waveforms are extremely complex. This complexity comes from a variety of factors, including but not limited to the many rules that govern language and the subtlety of tone. It makes the problem nearly impossible to solve with basic statistics-driven machine learning models: the data would need to be mapped onto tens of thousands of factors just to get started, and classic machine learning requires the engineer to understand all of these different facets of the data, which is practically impossible.

General Structure

Models for this purpose typically follow a general structure in which two models are implemented: one converts text data into a Mel Spectrogram, and a second takes the Mel Spectrogram as input and outputs a digital representation of a sound wave that, when played, sounds like a human voice. This is shown in the diagram below (Fig 1).

Fig 1. | Text to Mel to Waveform
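To make that two-stage structure concrete, here is a minimal sketch of how the pipeline fits together. The function names and docstring model references are placeholders of my own, not any particular library's API; the real models are covered in the Models section below.

```python
import numpy as np

def text_to_mel(text: str) -> np.ndarray:
    """Stage 1 (hypothetical): a model such as Tacotron2 or FastSpeech
    maps text/phonemes to a Mel Spectrogram of shape (frames, n_mels)."""
    raise NotImplementedError

def mel_to_waveform(mel: np.ndarray) -> np.ndarray:
    """Stage 2 (hypothetical): a vocoder such as Melgan or MB Melgan
    maps the Mel Spectrogram to raw audio samples."""
    raise NotImplementedError

def synthesize(text: str) -> np.ndarray:
    # The whole pipeline is just the composition of the two models.
    mel = text_to_mel(text)
    return mel_to_waveform(mel)
```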

Evaluating Performance

Evaluating these models requires a different approach because the quality of the results is inherently subjective.

These models use the Mean Opinion Score (MOS), which was developed in the telecommunications industry to rate audio quality. The topic has a rich history, going back to how frequency bands were separated to carry multiple channels down a single line when the telephone was first deployed worldwide. We won't go into that detail here, but MOS is a meta-score made up of a variety of components that contribute to audio quality. It can range from zero to five.

Fig 2. MOS Scoring System
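Since a MOS is ultimately just the mean of many listeners' ratings for a clip, computing one is trivial. The ratings below are made up purely for illustration.

```python
# Hypothetical listener ratings for one synthesized clip.
ratings = [4, 5, 4, 3, 5, 4, 4]

mos = sum(ratings) / len(ratings)
print(f"MOS: {mos:.2f}")  # MOS: 4.14
```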

What is a Mel Spectrogram?

Fig 3. Fourier Transform

To understand a Mel Spectrogram, we need to break the term into two parts. Mel comes from melody and signifies that the Spectrogram represents a melody, or more generally an audio waveform. A spectrum is a representation of a signal converted from the time domain to the frequency domain. The conversion is done with a Fourier transform. We will not do a deep dive into Fourier transforms in this article, but understand that they can break a complex signal into all of the sub-signals that make it up. This can be seen in Fig 3 above. Using Fig 3 as a reference, the solid blue signal on the Time Domain plot is the Waveform, and the gray signals are the sub-signals that combine to form it. The Fourier Transform isolates those gray signals and maps them onto a plot based on frequency. The most important thing to understand is that there is a direct relationship between a Spectrogram and the Waveform.
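If you want to see this relationship yourself, a Mel Spectrogram can be computed from a waveform in a few lines. This sketch uses the librosa library (my own choice, not something the models below depend on), and the file path is just a placeholder.

```python
import librosa
import numpy as np

# Load a short speech clip (placeholder path) at a common TTS sample rate.
y, sr = librosa.load("speech_sample.wav", sr=22050)

# Short-time Fourier transform: time-domain samples -> frequency content per frame.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Map the linear-frequency power spectrogram onto the mel scale (80 bands is a common TTS choice).
mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, as models usually consume it

print(mel_db.shape)  # (80, number_of_frames)
```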

Phoneme

Fig 4. English Phonemic Chart

Phonemes are the distinct units of sound that make up any language. In the English language, there are 44 phonemes.
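As a quick illustration, text can be converted into phonemes with an off-the-shelf tool such as the phonemizer package (my own choice for this example; the models below each ship their own text front ends).

```python
# Requires the `phonemizer` package and an espeak backend installed on the system.
from phonemizer import phonemize

text = "text to speech"
phones = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phones)  # roughly: "tɛkst tə spiːtʃ"
```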

Models

As the section above titled “General Structure” already covered, the most common and basic approach to text to speech involves implementing two distinct models. In this section we will go over four models in total: two that predict Text → Mel Spectrogram and two that predict Mel Spectrogram → Waveform.

Text → Mel Spectrogram Models

Tacotron2

This model was developed by Google in 2018 with the general goal of replacing Tacotron. Tacotron was the first major approach to tackle the problem of producing linguistic and acoustic features in the form of a Mel Spectrogram. While Tacotron was a huge improvement over previous models, its waveform-generation approach (the Griffin-Lim algorithm) produced significant artifacts. Tacotron2's main goal was to drastically reduce those artifacts. Obviously, because we are talking about this model, the designers succeeded.

Fig 5. Tacotron2 Model Architecture

Tacotron2 is a recurrent neural network with the architecture seen in Fig 5. This model's architecture, like many we will talk about, is split into two sections. The first is the Encoder, shown in the lower 25% of Fig 5, and the second is the Decoder, shown in the upper 75% of Fig 5. The Encoder's role is to capture the key positional characteristics of the input text's phonemes, along with the phonemes themselves. This is done through two major steps: Character Embedding and a Long Short Term Memory (LSTM) layer. The result is then passed to the Decoder. The Decoder's role is to take those key characteristics and map them to a Mel Spectrogram that accounts for aspects such as smoothing and pauses. This is done through additional LSTM layers as well as Convolutional layers that amplify the key characteristics.

The original paper uses a WaveNet vocoder for the Mel Spectrogram → Waveform step and achieved a MOS of 4.526.
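As a practical sketch, a pretrained Tacotron2 can be run for the Text → Mel Spectrogram step through the TensorFlowTTS library mentioned at the end of this article. The model id and return values below follow that library's published examples, but they may differ between versions, so treat this as a starting point rather than a definitive recipe.

```python
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

# Pretrained Tacotron2 trained on LJSpeech (model id from the TensorFlowTTS examples).
processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")
tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")

input_ids = processor.text_to_sequence("Text to speech with deep learning.")

# Encoder and Decoder run together at inference; the mel output is what a vocoder consumes.
decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    input_lengths=tf.convert_to_tensor([len(input_ids)], dtype=tf.int32),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
)
print(mel_outputs.shape)  # (1, frames, 80)
```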

FastSpeech

FastSpeech was developed in 2019 in partnership with Microsoft, with the general goal of improving inference speed and controllability. Controllability refers to the ability to modify aspects of the output, such as speed. Previous models were slow at inference; FastSpeech improves inference speed by using a Transformer architecture instead of a recurrent architecture. The general idea is that a Transformer implements an attention mechanism which, unlike a hidden-state mechanism, allows for parallel computation.

Fig 6. FastSpeech Model Architecture

The FastSpeech architecture, shown in Fig 6, is broken into two segments, which the designers refer to as the Phoneme side and the Mel Spectrogram side. These two sections are bridged by the Length Regulator. The focus of the Phoneme side is encoding the phonemes present, encoding their positions, and then sending the encoded data through a Transformer architecture, referred to as the FFT Block in Fig 6. The FFT Block produces a richer encoded representation of the phoneme sequence, which is then sent to the Length Regulator. The Length Regulator resolves the mismatch in length between the phoneme and spectrogram sequences: the phoneme sequence is shorter than the Mel Spectrogram because each phoneme maps to multiple Mel Spectrogram frames. I view this as a learning-based up-sampling, with responsibilities such as setting the speed of speech and the transitions between phonemes. The data then enters the Mel Spectrogram side, where it goes through the FFT Block again. This time the block is responsible for not skipping any mapped Mel Spectrogram frames and for smoothing transitions.

For evaluation, this team used WaveGlow as their Mel Spectrogram → Waveform model and achieved a MOS of 3.84. For comparison, the FastSpeech team achieved a MOS of 3.86 when using the Tacotron 2 model with WaveGlow. While this is a slightly lower MOS, the team's goal was to increase speed and improve controllability.
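That controllability shows up directly in the inference call: the Length Regulator's expansion can be scaled by a speed ratio. Below is a sketch using a pretrained FastSpeech from TensorFlowTTS; the model id and parameter names follow that library's examples and may vary between releases.

```python
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech-ljspeech-en")
fastspeech = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech-ljspeech-en")

input_ids = processor.text_to_sequence("Text to speech with deep learning.")

# speed_ratios scales the predicted phoneme durations, changing the speaking rate
# without retraining the model.
mel_before, mel_after, duration_outputs = fastspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.5], dtype=tf.float32),
)
print(mel_after.shape)  # (1, frames, 80)
```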

Mel Spectrogram → Waveform Models

Melgan

Developed in partnership with Descript in 2019, the Melgan model was the first Generative Adversarial Network (GAN) to perform well, in terms of MOS, on this task. The goal was to prove that a GAN could be used for this task and to improve speed. A key aspect to remember is that audio waveforms are made up of tens of thousands of data points per second of audio, so speed is extremely important in the context of real-time generation.

Fig 7. Melgan Model Architecture

The Melgan model is broken into two parts, a Generator and a Discriminator, following the traditional Generative Adversarial Network (GAN) design. The Generator, shown on the right side of Fig 7, accepts Mel Spectrograms as inputs. The Generator's main goal is, as the GAN designation calls for, to produce fabricated raw waveforms from a Mel Spectrogram. Its general design focuses on learning by decreasing data dimensionality and then increasing it again in a series of stages. The Generator's output, a waveform, is then passed to the Discriminator; note that this output is also what is used when actually converting a Mel Spectrogram to a waveform. The Discriminator's goal is to determine whether an input wave is fabricated or real and to use that classification to train the Generator. The Discriminator takes an ensemble approach. The input data is split into three versions: the first is raw, the second has gone through one average-pooling layer, and the third has gone through two average-pooling layers. These are then sent through discriminator blocks that produce three binary classifications, which together determine the final classification by majority. Each input also produces a list of feature maps that are used to improve the Generator.
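At inference time only the Generator is used; the Discriminator exists purely for training. Here is a sketch of running a pretrained Melgan Generator over a Mel Spectrogram with TensorFlowTTS (the model id comes from that library's examples and may change between releases; the input mel here is a random placeholder).

```python
import soundfile as sf
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel

# Pretrained Melgan vocoder (model id from the TensorFlowTTS examples).
melgan = TFAutoModel.from_pretrained("tensorspeech/tts-melgan-ljspeech-en")

# Placeholder Mel Spectrogram with shape (batch, frames, n_mels); in practice this
# would come from a Text -> Mel model such as Tacotron2 or FastSpeech above.
mel_outputs = tf.random.uniform((1, 200, 80), dtype=tf.float32)

# Only the Generator runs at inference: Mel Spectrogram in, raw waveform out.
audio = melgan.inference(mel_outputs)[0, :, 0]

# LJSpeech models are trained at a 22,050 Hz sample rate.
sf.write("melgan_sample.wav", audio.numpy(), 22050, "PCM_16")
```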

MB Melgan (Multi-Band Melgan)

Based on the Melgan model, MB Melgan has the goals of increasing the receptive field of the Generator and improving the accuracy with which the Discriminator detects fabricated vs. real waveforms.

Fig 8. MB Melgan Model Architecture

Since the MB Melgan architecture, shown in Fig 8, is nearly identical to the Melgan architecture, this section will focus on the differences between the two models. The first key difference is that the MB Melgan Generator outputs waveforms broken down by frequency band. Each of these sub-band waveforms is run through its own filter, which removes noise at each frequency level instead of treating the entire wave with the same filter magnitude. I expect that this approach scales the amount of noise removal to each band's independent amplitude, leading to less loss of key characteristics. The second key change is the use of an STFT loss in place of the waveform-domain auxiliary loss used previously, an approach that had been shown to produce stronger models.
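To give a feel for what that STFT loss measures, here is a rough sketch of a multi-resolution STFT loss in TensorFlow. The resolutions and the exact loss terms are my own simplification of the idea described in the paper, not the authors' implementation.

```python
import tensorflow as tf

def stft_loss(y_true, y_pred, frame_length, frame_step, fft_length):
    # Magnitude spectrograms of the real and generated waveforms.
    s_true = tf.abs(tf.signal.stft(y_true, frame_length, frame_step, fft_length))
    s_pred = tf.abs(tf.signal.stft(y_pred, frame_length, frame_step, fft_length))
    # Spectral-convergence term plus a log-magnitude term.
    sc = tf.norm(s_true - s_pred) / (tf.norm(s_true) + 1e-7)
    log_mag = tf.reduce_mean(tf.abs(tf.math.log(s_true + 1e-7) - tf.math.log(s_pred + 1e-7)))
    return sc + log_mag

def multi_resolution_stft_loss(y_true, y_pred):
    # Average the loss over several analysis resolutions so that errors at
    # different time/frequency scales are all penalized.
    resolutions = [(240, 50, 512), (600, 120, 1024), (1200, 240, 2048)]
    losses = [stft_loss(y_true, y_pred, fl, fs, nfft) for fl, fs, nfft in resolutions]
    return tf.add_n(losses) / len(losses)

# Tiny usage example with random stand-in waveforms (one second at 22,050 Hz).
real = tf.random.normal([22050])
fake = tf.random.normal([22050])
print(multi_resolution_stft_loss(real, fake).numpy())
```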

Last Thoughts

Hopefully this article helps you get started on this very interesting topic. Make sure to check out the sources below! Of course, theory is only one part of any topic, so if you would like to apply these models to your own project or data, check out this Google Colab notebook. In the notebook I use TensorFlowTTS, a great library that implements these models in TensorFlow.
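If you just want to hear something before opening the notebook, the snippet below chains a pretrained FastSpeech2 with MB Melgan using TensorFlowTTS, following the library's published quick-start; the exact model ids and signatures may shift between releases.

```python
import soundfile as sf
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

input_ids = processor.text_to_sequence("Thank you for reading.")

# Text -> Mel Spectrogram (FastSpeech2 adds pitch and energy controls to FastSpeech).
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Mel Spectrogram -> Waveform, then write a playable wav file.
audio = mb_melgan.inference(mel_after)[0, :, 0]
sf.write("demo.wav", audio.numpy(), 22050, "PCM_16")
```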

Thank you for reading!

Sources:

All non-original figures have their reference URL attached to the subtitle. All model architecture figures were pulled from the model papers, which can be found below.

Tacotron2 Model Paper

FastSpeech Model Paper

Melgan Model Paper

Multi-Band Melgan Model Paper



Joseph Cottingham

Engineer, with a focus on Webapps, Mechanical modeling, and Microcontroller firmware.