This project builds a deep neural network that functions as an end-to-end automatic speech recognition (ASR) pipeline.
This is an ongoing project; we are adding a language model to the pipeline.
The notebook vui.ipynb is the main procedure; it is self-explanatory and is a good place to start.
You should run this project with GPU acceleration for best performance.
-
Install TensorFlow.
- Option 1: To install TensorFlow with GPU support, follow the guide to install the necessary NVIDIA software on your system. If you are using an EC2 GPU instance, you can skip this step and only need to install the tensorflow-gpu package:
pip install tensorflow-gpu==1.1.0
- Option 2: To install TensorFlow with CPU support only:
pip install tensorflow==1.1.0
-
Install a few required packages.
pip install -r requirements.txt
-
Switch Keras backend to TensorFlow.
- Linux or Mac:
KERAS_BACKEND=tensorflow python -c "from keras import backend"
-
Obtain the libav package.
- Linux:
sudo apt-get install libav-tools
or
sudo apt install ffmpeg  # requirement to run avahi
wget http://launchpadlibrarian.net/348889634/libav-tools_3.4.1-1_all.deb
sudo dpkg -i libav-tools_3.4.1-1_all.deb
-
Obtain the appropriate dataset, and convert all flac files to wav format. This works with data directories that are organized like LibriSpeech:
data_directory/group/speaker/[file_id1.wav, file_id2.wav, ..., speaker.trans.txt]
where each line of speaker.trans.txt has the form:
file_id transcription
- Linux or Mac:
mv flac_to_wav.sh $data_folder$
cd $data_folder$
./flac_to_wav.sh
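The contents of flac_to_wav.sh are not shown here, so the following is only a rough Python sketch of what such a conversion pass could look like. It assumes that either `avconv` (from libav-tools) or `ffmpeg` is on the PATH. The helper names `find_converter`, `conversion_commands`, and `convert_all` are hypothetical and not part of this project.

```python
import os
import shutil
import subprocess


def find_converter():
    """Return the first converter found on PATH ('avconv' or 'ffmpeg'), or None."""
    for tool in ("avconv", "ffmpeg"):
        if shutil.which(tool):
            return tool
    return None


def conversion_commands(data_folder, tool):
    """Build one '<tool> -i in.flac out.wav' command per .flac file.

    Files whose .wav counterpart already exists are skipped, so the
    converter is never asked to overwrite an existing output.
    """
    commands = []
    for root, dirs, files in os.walk(data_folder):
        dirs.sort()  # visit sub-directories in a stable order
        for name in sorted(files):
            if not name.endswith(".flac"):
                continue
            src = os.path.join(root, name)
            dst = os.path.splitext(src)[0] + ".wav"
            if os.path.exists(dst):
                continue
            commands.append([tool, "-i", src, dst])
    return commands


def convert_all(data_folder):
    """Run the converter on every .flac file under data_folder."""
    tool = find_converter()
    if tool is None:
        raise RuntimeError("neither avconv nor ffmpeg was found on PATH")
    for cmd in conversion_commands(data_folder, tool):
        subprocess.check_call(cmd)
```

Building the command list separately from running it makes it possible to inspect what would be converted before anything is executed.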
-
Create JSON files corresponding to the train and validation datasets.
cd ..
python create_desc_json.py $data_folder$ train_corpus.json
python create_desc_json.py $data_folder$ valid_corpus.json
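create_desc_json.py itself is not reproduced here. As a rough illustration of what such a step might do, the sketch below walks a LibriSpeech-style directory, reads each `*.trans.txt` file, and writes one JSON object per line. The field names `key`, `duration`, and `text`, and the function names `wav_duration` and `write_corpus_json`, are assumptions made for illustration rather than the script's confirmed output format.

```python
import json
import os
import wave


def wav_duration(path):
    """Return the duration of a wav file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())


def write_corpus_json(data_folder, output_path):
    """Write one JSON line per utterance; return the number of lines written.

    Assumed layout: data_folder/group/speaker/ holds file_id.wav files and a
    *.trans.txt file whose lines look like 'file_id transcription'.
    """
    count = 0
    with open(output_path, "w") as out:
        for root, dirs, files in os.walk(data_folder):
            dirs.sort()  # stable traversal order
            for name in sorted(files):
                if not name.endswith(".trans.txt"):
                    continue
                with open(os.path.join(root, name)) as trans:
                    for line in trans:
                        parts = line.strip().split(None, 1)
                        if not parts:
                            continue  # skip blank lines
                        file_id = parts[0]
                        text = parts[1] if len(parts) > 1 else ""
                        wav_path = os.path.join(root, file_id + ".wav")
                        if not os.path.isfile(wav_path):
                            continue  # transcript without audio
                        entry = {
                            "key": wav_path,
                            "duration": wav_duration(wav_path),
                            "text": text,
                        }
                        out.write(json.dumps(entry) + "\n")
                        count += 1
    return count
```

Keeping the duration next to each transcription would let a training script sort or filter utterances by length without reopening the audio files.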
The performance of the decoding step can be greatly enhanced by incorporating a language model.
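One common way to use a language model at decoding time is to rescore candidate transcriptions: each candidate's acoustic log-probability is combined with a weighted language-model log-probability, and the candidate with the best total is kept. The sketch below uses a toy bigram model with add-one smoothing. The class `BigramLM`, the function `rescore`, and the weight `alpha` are hypothetical names chosen for illustration; this is not this project's decoder.

```python
import math
from collections import Counter


class BigramLM:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence):
        """Return the smoothed log-probability of a whitespace-split sentence."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def rescore(candidates, lm, alpha=0.5):
    """Pick the best text from (text, acoustic_log_prob) pairs.

    The combined score is acoustic_log_prob + alpha * lm.log_prob(text).
    """
    best = max(candidates, key=lambda c: c[1] + alpha * lm.log_prob(c[0]))
    return best[0]
```

With `alpha=0` the choice falls back to the acoustic score alone, which makes it easy to compare decoding with and without the language model on the same candidates.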
Train a network that uses raw audio waveforms!