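I have recently been running the WaveRNN implementation on GitHub (GitHub - fatchord/WaveRNN: WaveRNN Vocoder + TTS), and these are my notes on the learning process.
First, download the project from GitHub. Getting the model to run is straightforward, and you should not hit any problems.
The author provides pretrained models, so for a quick try you can simply run quick_start.py.
You can also train the model yourself; following the README generally works without trouble.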
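These notes mainly record what I have learned about the Tacotron model code in the project.
Put simply, Tacotron is an end-to-end TTS model: you supply the text, and the model outputs speech.
In the figure above, the part to the left of the + sign is Tacotron. In this GitHub project, the author appears to have dropped the Griffin-Lim reconstruction. The original Tacotron produces a linear-scale spectrogram or mel spectrogram and then obtains the audio directly through Griffin-Lim reconstruction. Here the author removes that step and attaches WaveRNN instead, feeding it both the mel features and the audio samples as input.
Next, let's follow the order of the figure above and go through it step by step.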
Encoder
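First comes the character embedding, which turns each character into a vector. A neural network cannot read characters; it only works with arrays of numbers.
The corresponding code in the project looks like this: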
self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
                       encoder_K, num_highways, dropout)
class Encoder(nn.Module):
    def __init__(self, embed_dims, num_chars, cbhg_channels, K, num_highways, dropout):
        super().__init__()
        self.embedding = nn.Embedding(num_chars, embed_dims)
        self.pre_net = PreNet(embed_dims)
        self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
                         proj_channels=[cbhg_channels, cbhg_channels],
                         num_highways=num_highways)

    def forward(self, x):
        # x: (batch, seq_len) integer character ids
        x = self.embedding(x)  # -> (batch, seq_len, embed_dims)
        x = self.pre_net(x)    # -> (batch, seq_len, 128) with PreNet's default fc2_dims
        x.transpose_(1, 2)     # -> (batch, channels, seq_len), the layout Conv1d expects
        x = self.cbhg(x)
        return x
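The embedding step can be pictured as a table lookup. Below is a minimal sketch in plain Python rather than PyTorch, so the lookup itself is visible. The toy alphabet, the 3-dimensional vectors, and the names `char_to_id` and `embedding_table` are all made up for illustration; the real project defines its own symbol set and takes the vector size from `embed_dims`.

```python
import random

random.seed(0)

# Hypothetical toy alphabet; the real project defines its own symbol set.
symbols = ["_", "a", "b", "c", " "]
char_to_id = {ch: i for i, ch in enumerate(symbols)}

embed_dims = 3  # toy value, much smaller than a real model would use

# One row of `embed_dims` numbers per symbol, playing the role of the
# weight matrix inside nn.Embedding(num_chars, embed_dims).
embedding_table = [[random.uniform(-1, 1) for _ in range(embed_dims)]
                   for _ in symbols]

def embed(text):
    """Map each character to its id, then look up the matching row."""
    ids = [char_to_id[ch] for ch in text]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed("ab ca")
print(ids)                            # [1, 2, 4, 3, 1]
print(len(vectors), len(vectors[0]))  # 5 3
```

The same character always maps to the same row, so the two occurrences of "a" above share one vector. In a trained model those rows are learned parameters rather than random numbers.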
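Once the character vectors are obtained, they pass through a pre-net module. Counting the input, this is a three-layer network with two hidden layers. It applies a series of non-linear transformations to the input, which helps the model converge and generalize better.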
class PreNet(nn.Module):
    def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dims, fc1_dims)
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        self.p = dropout

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = F.dropout(x, self.p, training=self.training)
        x = self.fc2(x)
        x = F.relu(x)
        x = F.dropout(x, self.p, training=self.training)
        return x
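As you can see, this network uses ReLU activations with a dropout rate of 0.5. The first layer has 256 units and the second has 128.
After the pre-net, the output is fed into the CBHG module, whose main purpose is to improve the model's ability to generalize.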
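To show what the dropout calls in the pre-net do, here is a small plain-Python sketch of inverted dropout, the scheme that `F.dropout` follows: during training each value is zeroed with probability p and the survivors are scaled by 1 / (1 - p), while in evaluation mode the input passes through unchanged. The helper `dropout` below is a hypothetical stand-in written for illustration, not the PyTorch function itself.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p and scale the
    survivors by 1 / (1 - p); return the input unchanged when not training."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 8

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(train_out)  # every entry is either 0.0 or 2.0
print(eval_out)   # unchanged: eight 1.0 values
```

Because the pre-net passes `training=self.training`, the module as written behaves like the second call once the model is switched to evaluation mode.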
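The CBHG module starts with a convolution bank: K sets of 1-D filters with different kernel widths (1, 2, 3, ..., K). Kernels of different widths extract context of different lengths. Padding is applied during convolution so that the outputs of all kernels line up along the time axis. The outputs are then stacked and used as input to a max-pooling layer.
Two more 1-D convolution layers follow. All convolution layers in the CBHG module use batch normalization (BatchNormConv) to keep training stable, for example by guarding against exploding gradients. Next comes a residual connection, which adds the output of the convolution layers to the sequence that entered the CBHG module (the embedded sequence after the pre-net). The sum is fed into the highway layers.
There are four highway layers here (the count is set by num_highways). Each highway layer contains two single-layer fully connected networks, one with a ReLU activation and one with a sigmoid activation. The input goes through both networks, giving two outputs, output1 and output2, which are combined into output as follows: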
output = output1 * output2 + input * (1 - output2)
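The gating formula above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the project's `HighwayNetwork` class (which is not shown in this section); the helper names `linear` and `highway` and the hand-picked weights are assumptions made for the example.

```python
import math

def linear(x, weight, bias):
    """Single fully connected layer on plain lists: weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def highway(x, w1, b1, w2, b2):
    """output = output1 * output2 + input * (1 - output2)."""
    output1 = [max(0.0, v) for v in linear(x, w1, b1)]                 # ReLU branch
    output2 = [1.0 / (1.0 + math.exp(-v)) for v in linear(x, w2, b2)]  # sigmoid gate
    return [o1 * g + xi * (1.0 - g) for o1, g, xi in zip(output1, output2, x)]

x = [0.5, -1.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]
b1 = [1.0, 1.0, 1.0]

# Gate pushed towards 0: the layer passes its input through almost unchanged.
closed = highway(x, identity, b1, zeros, [-20.0] * 3)
# Gate pushed towards 1: the layer outputs the ReLU branch, relu(x + 1).
opened = highway(x, identity, b1, zeros, [20.0] * 3)

print(closed)  # close to [0.5, -1.0, 2.0]
print(opened)  # close to [1.5, 0.0, 3.0]
```

The sigmoid output acts as a per-element gate: the closer it is to 0, the more of the raw input is carried through, and the closer it is to 1, the more of the transformed value is used.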
After the highway layers comes a bidirectional GRU. The output of the GRU is the output of the encoder.
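Returning to the padding remark in the CBHG description: the sketch below works out the output length of each bank convolution and of the max-pooling layer, using the standard length formula for a stride-1 convolution or pooling window. It assumes each bank convolution pads by `kernel // 2` on both sides; that is an assumption here, since `BatchNormConv` is not shown in this section. The pooling settings (`kernel_size=2, stride=1, padding=1`) are the ones used in the CBHG code. `conv1d_out_len` is a hypothetical helper.

```python
def conv1d_out_len(seq_len, kernel, padding, stride=1):
    """Output length of a 1-D convolution or pooling window (dilation 1)."""
    return (seq_len + 2 * padding - kernel) // stride + 1

seq_len = 10
K = 6

# Assumed padding of kernel // 2 for every convolution in the bank.
bank_lengths = {k: conv1d_out_len(seq_len, k, padding=k // 2)
                for k in range(1, K + 1)}
print(bank_lengths)  # {1: 10, 2: 11, 3: 10, 4: 11, 5: 10, 6: 11}

# MaxPool1d(kernel_size=2, stride=1, padding=1) also adds one step.
pooled_len = conv1d_out_len(seq_len, kernel=2, padding=1)
print(pooled_len)  # 11

# Trimming every output to seq_len makes all lengths agree.
trimmed = {k: min(n, seq_len) for k, n in bank_lengths.items()}
print(trimmed)  # every value is 10
```

Under this assumption, odd kernel widths keep the length while even widths come out one step longer, so padding alone does not make the outputs identical. A forward pass would need to slice each output back to `seq_len` before stacking, and again after pooling so that the residual addition lines up.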
Code:
class CBHG(nn.Module):
    def __init__(self, K, in_channels, channels, proj_channels, num_highways):
        super().__init__()

        # List of all rnns to call `flatten_parameters()` on
        self._to_flatten = []

        self.bank_kernels = [i for i in range(1, K + 1)]
        self.conv1d_bank = nn.ModuleList()
        for k in self.bank_kernels:
            conv = BatchNormConv(in_channels, channels, k)
            self.conv1d_bank.append(conv)

        self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)

        self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
        self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)

        # Fix the highway input if necessary
        if proj_channels[-1] != channels:
            self.highway_mismatch = True
            self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
        else:
            self.highway_mismatch = False

        self.highways = nn.ModuleList()
        for i in range(num_highways):
            hn = HighwayNetwork(channels)
            self.highways.append(hn)

        self.rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)
        self._to_flatten.append(self.rnn)

        # Avoid fragmentation of RNN parameters and associated warning
        self._flatten_parameters()