WaveNet

基本信息

标题: "WaveNet: A Generative Model for Raw Audio"
作者:
- 01 Aaron van den Oord (Google DeepMind)
- 02 Sander Dieleman (Google DeepMind)
- 03 Heiga Zen (Google)
- 04 Karen Simonyan (Google DeepMind)
- 05 Oriol Vinyals (Google DeepMind)
- 06 Alex Graves (Google DeepMind)
- 07 Nal Kalchbrenner (Google DeepMind)
- 08 Andrew Senior (Google DeepMind)
- 09 Koray Kavukcuoglu (Google DeepMind)
链接:
- ArXiv
- [Publication]
- [Github]
- Demo
文件:
- ArXiv
- [Publication]

Abstract: 摘要

原文

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

本文介绍一个用于生成原始音频波形的深度神经网络 WaveNet. 该模型是完全概率性且自回归的, 基于之前的音频样本为条件预测每个音频样本的分布, 但我们展示了它能够在每秒音频的成千上百的样本点上高效训练. 当应用于文本转语音, 它能够得到 SoTA 性能, 人类听众评价其生成的语音在英语和中文方面比最佳参数化和拼接系统听起来要自然得多. 单个 WaveNet 可以捕捉到许多不同说话人的特性, 并能够根据说话人身份切换. 当被训练用于建模音乐时, 我们发现它能够生成新颖且高度逼真的音乐片段. 我们还展示了它可以作为判别模型, 对于音素识别, 给出了有希望的结果.

1·Introduction: 引言

This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images \citep{van2016pixel, ConditionalPixelCNN} and text \citep{RafalLanguage}. Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation.

Remarkably, these architectures are able to model distributions over thousands of random variables (e.g., 64$\times$64 pixels as in PixelRNN \citep{van2016pixel}). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see \figref{fig:audio}).

This paper introduces WaveNet, an audio generative model based on the PixelCNN \citep{van2016pixel, ConditionalPixelCNN} architecture. The main contributions of this work are as follows:

We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
We show that when conditioned on a speaker identity, a single model can be used to generate different voices.
The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.

We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).

2·Related Works: 相关工作

The goal of TTS synthesis is to render naturally sounding speech signals given a text to be synthesized. Human speech production process first translates a text (or concept) into movements of muscles associated with articulators and speech production-related organs. Then using air-flow from lung, vocal source excitation signals, which contain both periodic (by vocal cord vibration) and aperiodic (by turbulent noise) components, are generated. By filtering the vocal source excitation signals by time-varying vocal tract transfer functions controlled by the articulators, their frequency characteristics are modulated. Finally, the generated speech signals are emitted. The aim of TTS is to mimic this process by computers in some way.

TTS can be viewed as a sequence-to-sequence mapping problem; from a sequence of discrete symbols (text) to a real-valued time series (speech signals). A typical TTS pipeline has two parts; 1) text analysis and 2) speech synthesis. The text analysis part typically includes a number of natural language processing (NLP) steps, such as sentence segmentation, word segmentation, text normalization, part-of-speech (POS) tagging, and grapheme-to-phoneme (G2P) conversion. It takes a word sequence as input and outputs a phoneme sequence with a variety of linguistic contexts. The speech synthesis part takes the context-dependent phoneme sequence as its input and outputs a synthesized speech waveform. This part typically includes prosody prediction and speech waveform generation.

There are two main approaches to realize the speech synthesis part; non-parametric, example-based approach known as concatenative speech synthesis \citep{PSOLA,ATR_nutalk,Hunt_UnitSelection_ICASSP}, and parametric, model-based approach known as statistical parametric speech synthesis \citep{yoshimura_PhD,Zen_SPSS_SPECOM}. The concatenative approach builds up the utterance from units of recorded speech, whereas the statistical parametric approach uses a generative model to synthesize the speech. The statistical parametric approach first extracts a sequence of vocoder parameters \citep{Vocoder} $\mathbf{o} = {\mathbf{o}_1, \dots,\mathbf{o}_N }$ from speech signals $\mathbf{x} = { x_1, \dots, x_T }$ and linguistic features $\mathbf{l}$ from the text $W$, where $N$ and $T$ correspond to the numbers of vocoder parameter vectors and speech signals. Typically a vocoder parameter vector $\mathbf{o}_n$ is extracted at every 5 milliseconds. It often includes cepstra \citep{UELS} or line spectral pairs \citep{LSP}, which represent vocal tract transfer function, and fundamental frequency ($F_0$) and aperiodicity \citep{Kawahara_STRAIGHT_Excitation}, which represent characteristics of vocal source excitation signals. Then a set of generative models, such as hidden Markov models (HMMs) \citep{yoshimura_PhD}, feed-forward neural networks \citep{Zen_DNN_ICASSP}, and recurrent neural networks \citep{Robinson_RNNTTS,Karaani_RTDNNTTS,Fan_BLSTM_Interspeech14}, is trained from the extracted vocoder parameters and linguistic features as

$$ \hat{\Lambda} = \argmax_{\Lambda} p\left(\mathbf{o} \mid \mathbf{l}, \Lambda \right), $$

where $\Lambda$ denotes the set of parameters of the generative model. At the synthesis stage, the most probable vocoder parameters are generated given linguistic features extracted from a text to be synthesized as

$$ \hat{\mathbf{o}} = \argmax_{\mathbf{o}} p (\mathbf{o} \mid \mathbf{l}, \hat{\Lambda} ). $$

Then a speech waveform is reconstructed from $\hat{\mathbf{o}}$ using a vocoder. The statistical parametric approach offers various advantages over the concatenative one such as small footprint and flexibility to change its voice characteristics. However, its subjective naturalness is often significantly worse than that of the concatenative approach; synthesized speech often sounds muffled and has artifacts. \cite{Zen_SPSS_SPECOM} reported three major factors that can degrade the subjective naturalness; quality of vocoders, accuracy of generative models, and effect of oversmoothing. The first factor causes the artifacts and the second and third factors lead to the muffleness in the synthesized speech. There have been a number of attempts to address these issues individually, such as developing high-quality vocoders \citep{Kawahara_STRAIGHT,Vocaine,WORLD}, improving the accuracy of generative models \citep{Zen_trjHMM_CSL,Zen_DNN_ICASSP,Fan_BLSTM_Interspeech14,Uria_TrajectoryRNADE_ICASSP2015}, and compensating the oversmoothing effect \citep{Toda_MLGV_IEICE,Takamichi_ModulationSpectrum_TASLP}. \cite{Zen_LSTMprod_Interspeech} showed that state-of-the-art statistical parametric speech syntheziers matched state-of-the-art concatenative ones in some languages. However, its vocoded sound quality is still a major issue.

Extracting vocoder parameters can be viewed as estimation of a generative model parameters given speech signals \citep{LPC,UELS}. For example, linear predictive analysis \citep{LPC}, which has been used in speech coding, assumes that the generative model of speech signals is a linear auto-regressive (AR) zero-mean Gaussian process;

$$ x_t = \sum_{p=1}^P a_p x_{t-p} + \epsilon_t \quad \epsilon_t \sim \mathcal{N}(0, G^2) $$

where $a_p$ is a $p$-th order linear predictive coefficient (LPC) and $G^2$ is a variance of modeling error. These parameters are estimated based on the maximum likelihood (ML) criterion. In this sense, the training part of the statistical parametric approach can be viewed as a two-step optimization and sub-optimal: extract vocoder parameters by fitting a generative model of speech signals then model trajectories of the extracted vocoder parameters by a separate generative model for time series \citep{Tokuda_ASRU2011}. There have been attempts to integrate these two steps into a single one \citep{STAVOCO,MGELSD,Maia_WaveformModel_SSW,Kazuhiro_McepHMM_IEICE,Black_AutoEncoder_arXiv,Tokuda_CepLSTM_ICASSP2015,Tokuda_MixCepLSTM_ICASSP2016,Takaki_FFTDNN_ICASSP}. For example, \cite{Tokuda_MixCepLSTM_ICASSP2016} integrated non-stationary, nonzero-mean Gaussian process generative model of speech signals and LSTM-RNN-based sequence generative model to a single one and jointly optimized them by back-propagation. Although they showed that this model could approximate natural speech signals, its segmental naturalness was significantly worse than the non-integrated model due to over-generalization and over-estimation of noise components in speech signals.

The conventional generative models of raw audio signals have a number of assumptions which are inspired from the speech production, such as

Use of fixed-length analysis window; They are typically based on a stationary stochastic process \citep{LPC,UELS,Poritz_ARHMM_ICASSP82,Juang_MARHMM_ICASSP85,Kameoka_MultiKernelLPC_ASJ}. To model time-varying speech signals by a stationary stochastic process, parameters of these generative models are estimated within a fixed-length, overlapping and shifting analysis window (typically its length is 20 to 30 milliseconds, and shift is 5 to 10 milliseconds). However, some phones such as stops are time-limited by less than 20 milliseconds \citep{Rabiner_ASR}. Therefore, using such fixed-size analysis window has limitations.
Linear filter; These generative models are typically realized as a linear time-invariant filter \citep{LPC,UELS,Poritz_ARHMM_ICASSP82,Juang_MARHMM_ICASSP85,Kameoka_MultiKernelLPC_ASJ} within a windowed frame. However, the relationship between successive audio samples can be highly non-linear.
Gaussian process assumption; The conventional generative models are based on Gaussian process \citep{LPC,UELS,Poritz_ARHMM_ICASSP82,Juang_MARHMM_ICASSP85,Kameoka_MultiKernelLPC_ASJ,Tokuda_CepLSTM_ICASSP2015,Tokuda_MixCepLSTM_ICASSP2016}. From the source-filter model of speech production \citep{Chiba_SourceFilter,Fant_SourceFilter} point of view, this is equivalent to assuming that a vocal source excitation signal is a sample from a Gaussian distribution \citep{LPC,UELS,Poritz_ARHMM_ICASSP82,Juang_MARHMM_ICASSP85,Tokuda_CepLSTM_ICASSP2015,Kameoka_MultiKernelLPC_ASJ,Tokuda_MixCepLSTM_ICASSP2016}. Together with the linear assumption above, it results in assuming that speech signals are normally distributed. However, distributions of real speech signals can be significantly different from Gaussian.

Although these assumptions are convenient, samples from these generative models tend to be noisy and lose important details to make these audio signals sounding natural.

WaveNet, which was described in Section~\ref{sec:wavenet}, has none of the above-mentioned assumptions. It incorporates almost no prior knowledge about audio signals, except the choice of the receptive field and $\mu$-law encoding of the signal. It can also be viewed as a non-linear causal filter for quantized signals. Although such non-linear filter can represent complicated signals while preserving the details, designing such filters is usually difficult \citep{NonlinearFilterDesign}. WaveNets give a way to train them from data.

3·Methodology: 方法

In this paper we introduce a new generative model operating directly on the raw audio waveform. The joint probability of a waveform $\vec{x} = { x_1, \dots, x_T }$ is factorized as a product of conditional probabilities as follows:

$$ p\left(\vec{x}\right) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots ,x_{t-1}\right)\tag{01} $$

Each audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.

Similarly to PixelCNNs \citep{van2016pixel, ConditionalPixelCNN}, the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value $x_{t}$ with a softmax layer and it is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure if the model is overfitting or underfitting.

3.1·Dilated Causal Convolutions

The main ingredient of WaveNet are causal convolutions. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction $p\left(x_{t+1} \mid x_1,...,x_{t}\right)$ emitted by the model at timestep $t$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2},\dots,x_T$ as shown in \figref{fig:masked_convolution}. For images, the equivalent of a causal convolution is a masked convolution \citep{van2016pixel} which can be implemented by constructing a mask tensor and doing an elementwise multiplication of this mask with the convolution kernel before applying it. For 1-D data such as audio one can more easily implement this by shifting the output of a normal convolution by a few timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth $\vec{x}$ are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters to increase the receptive field. For example, in \figref{fig:masked_convolution} the receptive field is only 5 (= #layers + filter length - 1). In this paper we use dilated convolutions to increase the receptive field by orders of magnitude, without greatly increasing computational cost.

A dilated convolution (also called \emph{`a trous}, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, dilated convolution with dilation $1$ yields the standard convolution. \figref{fig:masked_dilated_convolution} depicts dilated causal convolutions for dilations $1$, $2$, $4$, and $8$. Dilated convolutions have previously been used in various contexts, e.g., signal processing \citep{Holschneider1989,Dutilleux1989}, and image segmentation \citep{chen14semantic,YuKoltun2016}.

Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency. In this paper, the dilation is doubled for every layer up to a limit and then repeated: e.g., $1,2,4,\dots,512,1,2,4,\dots,512,1,2,4,\dots,512.$ The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth~\citep{YuKoltun2016}. For example each $1,2,4,\dots,512$ block has receptive field of size $1024$, and can be seen as a more efficient and discriminative (non-linear) counterpart of a $1\times1024$ convolution. Second, stacking these blocks further increases the model capacity and the receptive field size.

3.2·Softmax Distributions

One approach to modeling the conditional distributions $p\left(x_t \mid x_1, \dots ,x_{t-1}\right)$ over the individual audio samples would be to use a mixture model such as a mixture density network \citep{MDN} or mixture of conditional Gaussian scale mixtures (MCGSM) \citep{theis2015generative}. However, \cite{van2016pixel} showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.

Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a $\mu$-law companding transformation \citep{G711} to the data, and then quantize it to 256 possible values:

$$ f\left(x_t\right) = \operatorname{sign}(x_t) \frac{\ln \left(1+\mu \left| x_t \right|\right)}{\ln \left(1+\mu\right)}, $$

where $-1 < x_t < 1$ and $\mu = 255$. This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme. Especially for speech, we found that the reconstructed signal after quantization sounded very similar to the original.

3.3·Gated Activation Units

We use the same gated activation unit as used in the gated PixelCNN \citep{ConditionalPixelCNN}:

$$ \vec{z} = \tanh \left(W_{f, k} \ast \vec{x}\right) \odot \sigma \left(W_{g, k} \ast \vec{x} \right), $$

where $\ast$ denotes a convolution operator, $\odot$ denotes an element-wise multiplication operator, $\sigma(\cdot)$ is a sigmoid function, $k$ is the layer index, $f$ and $g$ denote filter and gate, respectively, and $W$ is a learnable convolution filter. In our initial experiments, we observed that this non-linearity worked significantly better than the rectified linear activation function \citep{nair2010rectified} for modeling audio signals.

3.4·Residual and Skip Connections

Both residual \citep{he15deep} and parameterised skip connections are used throughout the network, to speed up convergence and enable training of much deeper models. In \figref{fig:architecture} we show a residual block of our model, which is stacked many times in the network.

3.5·Conditional WaveNets

Given an additional input $\vec{h}$, WaveNets can model the conditional distribution $p\left(\vec{x} \mid \vec{h}\right)$ of the audio given this input. \eqnref{eq:px} now becomes

$$ p\left( \vec{x} \mid \vec{h} \right) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots ,x_{t-1}, \vec{h}\right). $$

By conditioning the model on other input variables, we can guide WaveNet's generation to produce audio with the required characteristics. For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input. Similarly, for TTS we need to feed information about the text as an extra input.

We condition the model on other inputs in two different ways: global conditioning and local conditioning. Global conditioning is characterised by a single latent representation $\vec{h}$ that influences the output distribution across all timesteps, e.g., a speaker embedding in a TTS model. The activation function from \eqnref{eq:gated_activation} now becomes:

$$ \vec{z} = \tanh \left(W_{f, k} \ast \vec{x} + V_{f, k}^T \vec{h}\right) \odot \sigma \left(W_{g, k} \ast \vec{x} + V_{g, k}^T \vec{h} \right). $$

where $V_{, k}$ is a learnable linear projection, and the vector $V_{, k}^T\vec{h}$ is broadcast over the time dimension.

For local conditioning we have a second timeseries $h_t$, possibly with a lower sampling frequency than the audio signal, e.g., linguistic features in a TTS model. We first transform this time series using a transposed convolutional network (learned upsampling) that maps it to a new time series $\vec{y} = f(\vec{h})$ with the same resolution as the audio signal, which is then used in the activation unit as follows:

$$ \vec{z} = \tanh \left(W_{f, k} \ast \vec{x} + V_{f, k} \ast \vec{y}\right) \odot \sigma \left(W_{g, k} \ast \vec{x} + V_{g, k} \ast \vec{y}\right), $$

where $V_{f, k} \ast \vec{y}$ is now a $1\times 1$ convolution. As an alternative to the transposed convolutional network, it is also possible to use $V_{f, k} \ast \vec{h}$ and repeat these values across time. We saw that this worked slightly worse in our experiments.

3.6·Context Stacks

We have already mentioned several different ways to increase the receptive field size of a WaveNet: increasing the number of dilation stages, using more layers, larger filters, greater dilation factors, or a combination thereof. A complementary approach is to use a separate, smaller \emph{context} stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal (cropped at the end). One can use multiple context stacks with varying lengths and numbers of hidden units. Stacks with larger receptive fields have fewer units per layer. Context stacks can also have pooling layers to run at a lower frequency. This keeps the computational requirements at a reasonable level and is consistent with the intuition that less capacity is required to model temporal correlations at longer timescales.

4·Experiments: 实验

To measure WaveNet's audio modelling performance, we evaluate it on three different tasks: multi-speaker speech generation (not conditioned on text), TTS, and music audio modelling. We provide samples drawn from WaveNet for these experiments on the accompanying webpage: https://www.deepmind.com/blog/wavenet-generative-model-raw-audio/.

4.1.Multi-Speaker Speech Generation: 多说话人语音生成

For the first experiment we looked at free-form speech generation (not conditioned on text). We used the English multi-speaker corpus from CSTR voice cloning toolkit (VCTK) \citep{VCTK} and conditioned WaveNet only on the speaker. The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers.

Because the model is not conditioned on text, it generates non-existent but human language-like words in a smooth way with realistic sounding intonations. This is similar to generative models of language or images, where samples look realistic at first glance, but are clearly unnatural upon closer inspection. The lack of long range coherence is partly due to the limited size of the model's receptive field (about 300 milliseconds), which means it can only remember the last 2--3 phonemes it produced.

A single WaveNet was able to model speech from any of the speakers by conditioning it on a one-hot encoding of a speaker. This confirms that it is powerful enough to capture the characteristics of all 109 speakers from the dataset in a single model. We observed that adding speakers resulted in better validation set performance compared to training solely on a single speaker. This suggests that WaveNet's internal representation was shared among multiple speakers.

Finally, we observed that the model also picked up on other characteristics in the audio apart from the voice itself. For instance, it also mimicked the acoustics and recording quality, as well as the breathing and mouth movements of the speakers.

4.2.Text-To-Speech: 文本转语音

For the second experiment we looked at TTS. We used the same single-speaker speech databases from which Google's North American English and Mandarin Chinese TTS systems are built. The North American English dataset contains 24.6 hours of speech data, and the Mandarin Chinese dataset contains 34.8 hours; both were spoken by professional female speakers.

WaveNets for the TTS task were locally conditioned on \emph{linguistic features} which were derived from input texts. We also trained WaveNets conditioned on the logarithmic fundamental frequency ($\log F_0$) values in addition to the linguistic features. External models predicting $\log F_0$ values and phone durations from linguistic features were also trained for each language. The receptive field size of the WaveNets was 240 milliseconds. As example-based and model-based speech synthesis baselines, hidden Markov model (HMM)-driven unit selection concatenative \citep{Xavi_Barracuda_interspeech} and long short-term memory recurrent neural network (LSTM-RNN)-based statistical parametric \citep{Zen_LSTMprod_Interspeech} speech synthesizers were built. Since the same datasets and linguistic features were used to train both the baselines and WaveNets, these speech synthesizers could be fairly compared.

To evaluate the performance of WaveNets for the TTS task, subjective paired comparison tests and mean opinion score (MOS) tests were conducted. In the paired comparison tests, after listening to each pair of samples, the subjects were asked to choose which they preferred, though they could choose ``neutral'' if they did not have any preference. In the MOS tests, after listening to each stimulus, the subjects were asked to rate the naturalness of the stimulus in a five-point Likert scale score (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). Please refer to Appendix~\ref{appendix:tts_experiment} for details.

\figref{fig:sxs2} shows a selection of the subjective paired comparison test results (see Appendix~\ref{appendix:tts_experiment} for the complete table). It can be seen from the results that WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages. We found that WaveNet conditioned on linguistic features could synthesize speech samples with natural segmental quality but sometimes it had unnatural prosody by stressing wrong words in a sentence. This could be due to the long-term dependency of $F_0$ contours: the size of the receptive field of the WaveNet, 240 milliseconds, was not long enough to capture such long-term dependency. WaveNet conditioned on both linguistic features and $F_0$ values did not have this problem: the external $F_0$ prediction model runs at a lower frequency (200 Hz) so it can learn long-range dependencies that exist in $F_0$ contours.

\tblref{tab:mos} show the MOS test results. It can be seen from the table that WaveNets achieved 5-scale MOSs in naturalness above 4.0, which were significantly better than those from the baseline systems. They were the highest ever reported MOS values with these training datasets and test sentences. The gap in the MOSs from the best synthetic speech to the natural ones decreased from 0.69 to 0.34 (51%) in US English and 0.42 to 0.13 (69%) in Mandarin Chinese.

4.3.Music: 音乐

For out third set of experiments we trained WaveNets to model two music datasets:

the MagnaTagATune dataset \citep{law2009input}, which consists of about 200 hours of music audio. Each 29-second clip is annotated with tags from a set of 188, which describe the genre, instrumentation, tempo, volume and mood of the music.
the YouTube piano dataset, which consists of about 60 hours of solo piano music obtained from YouTube videos. Because it is constrained to a single instrument, it is considerably easier to model.

Although it is difficult to quantitatively evaluate these models, a subjective evaluation is possible by listening to the samples they produce. We found that enlarging the receptive field was crucial to obtain samples that sounded musical. Even with a receptive field of several seconds, the models did not enforce long-range consistency which resulted in second-to-second variations in genre, instrumentation, volume and sound quality. Nevertheless, the samples were often harmonic and aesthetically pleasing, even when produced by unconditional models.

Of particular interest are conditional music models, which can generate music given a set of tags specifying e.g., genre or instruments. Similarly to conditional speech models, we insert biases that depend on a binary vector representation of the tags associated with each training clip. This makes it possible to control various aspects of the output of the model when sampling, by feeding in a binary vector that encodes the desired properties of the samples. We have trained such models on the MagnaTagATune dataset; although the tag data bundled with the dataset was relatively noisy and had many omissions, after cleaning it up by merging similar tags and removing those with too few associated clips, we found this approach to work reasonably well.

4.4.Speech Recognition: 语音识别

Although WaveNet was designed as a generative model, it can straightforwardly be adapted to discriminative audio tasks such as speech recognition.

Traditionally, speech recognition research has largely focused on using log mel-filterbank energies or mel-frequency cepstral coefficients (MFCCs), but has been moving to raw audio recently \citep{palaz2013estimating,tuske2014acoustic,hoshen2015speech,sainath2015learning}. Recurrent neural networks such as LSTM-RNNs \citep{LSTM} have been a key component in these new speech classification pipelines, because they allow for building models with long range contexts. With WaveNets we have shown that layers of dilated convolutions allow the receptive field to grow longer in a much cheaper way than using LSTM units.

As a last experiment we looked at speech recognition with WaveNets on the TIMIT \citep{TIMIT} dataset. For this task we added a mean-pooling layer after the dilated convolutions that aggregated the activations to coarser frames spanning 10 milliseconds (160$\times$ downsampling). The pooling layer was followed by a few non-causal convolutions. We trained WaveNet with two loss terms, one to predict the next sample and one to classify the frame, the model generalized better than with a single loss and achieved $18.8$ PER on the test set, which is to our knowledge the best score obtained from a model trained directly on raw audio on TIMIT.

5·Results: 结果

6·Conclusions: 结论

This paper has presented WaveNet, a deep generative model of audio data that operates directly at the waveform level. WaveNets are autoregressive and combine causal filters with dilated convolutions to allow their receptive fields to grow exponentially with depth, which is important to model the long-range temporal dependencies in audio signals. We have shown how WaveNets can be conditioned on other inputs in a global (e.g., speaker identity) or local way (e.g., linguistic features). When applied to TTS, WaveNets produced samples that outperform the current best TTS systems in subjective naturalness. Finally, WaveNets showed very promising results when applied to music audio modeling and speech recognition.

本文介绍了 WaveNet, 这是一种直接在波形级别上操作的音频数据深度生成模型.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2016.09.12_WaveNet.md

2016.09.12_WaveNet.md

WaveNet

Abstract: 摘要

1·Introduction: 引言

2·Related Works: 相关工作

3·Methodology: 方法

3.1·Dilated Causal Convolutions

3.2·Softmax Distributions

3.3·Gated Activation Units

3.4·Residual and Skip Connections

3.5·Conditional WaveNets

3.6·Context Stacks

4·Experiments: 实验

4.1.Multi-Speaker Speech Generation: 多说话人语音生成

4.2.Text-To-Speech: 文本转语音

4.3.Music: 音乐

4.4.Speech Recognition: 语音识别

5·Results: 结果

6·Conclusions: 结论

Files

2016.09.12_WaveNet.md

Latest commit

History

2016.09.12_WaveNet.md

File metadata and controls

WaveNet

Abstract: 摘要

1·Introduction: 引言

2·Related Works: 相关工作

3·Methodology: 方法

3.1·Dilated Causal Convolutions

3.2·Softmax Distributions

3.3·Gated Activation Units

3.4·Residual and Skip Connections

3.5·Conditional WaveNets

3.6·Context Stacks

4·Experiments: 实验

4.1.Multi-Speaker Speech Generation: 多说话人语音生成

4.2.Text-To-Speech: 文本转语音

4.3.Music: 音乐

4.4.Speech Recognition: 语音识别

5·Results: 结果

6·Conclusions: 结论