Mozilla has updated the speech recognition system DeepSpeech, significantly increasing its performance

DeepSpeech, a set of speech recognition tools maintained by a group of developers at Mozilla, has received an update. The new version, DeepSpeech v0.6, is one of the fastest open source speech recognition models available to date. Mozilla developer Reuben Morais described the improvements in a blog post.

The latest version of DeepSpeech adds support for TensorFlow Lite, a version of Google's TensorFlow machine learning framework optimized for mobile devices with limited computing power. As a result, the size of the DeepSpeech package dropped from 98 MB to 3.7 MB, and the size of the bundled English model dropped from 188 MB to 47 MB. Memory consumption is also reported to be 22 times lower, and startup is more than 500 times faster.
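To put the size figures in proportion, the short calculation below turns them into reduction factors. It is only arithmetic on the numbers quoted above; the helper name `reduction_factor` is made up for this illustration and is not part of DeepSpeech.

```python
def reduction_factor(before_mb: float, after_mb: float) -> float:
    """Return how many times smaller the new size is than the old one."""
    return before_mb / after_mb


# Sizes in MB, as quoted in the article.
package_factor = reduction_factor(98, 3.7)   # DeepSpeech package
model_factor = reduction_factor(188, 47)     # bundled English model

print(f"package: about {package_factor:.1f}x smaller")
print(f"English model: {model_factor:.1f}x smaller")
```

By this measure the package shrank roughly 26-fold, while the English model shrank 4-fold.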

Overall, DeepSpeech v0.6 has become much faster thanks to a new streaming decoder, which provides low latency and memory usage regardless of the length of the audio being transcribed. The platform's two main subsystems (the acoustic model and the decoder) now both support streaming, so applications no longer need their own carefully tuned silence detection. The updated DeepSpeech can deliver a transcript 260 ms after the audio ends, which is 73% faster than before the streaming decoder was integrated.
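The idea behind streaming can be sketched with a toy example. The code below is not the DeepSpeech API: `ToyStreamingDecoder`, `chunked`, `fake_audio`, and the 320-sample chunk size are all hypothetical names and values chosen for illustration. It only shows the general pattern of feeding audio in small chunks to an object whose state stays the same size however long the input is.

```python
from typing import Dict, Iterable, Iterator, List

CHUNK_SAMPLES = 320  # assumed chunk size: 20 ms of audio at 16 kHz


class ToyStreamingDecoder:
    """Hypothetical stand-in for a streaming decoder.

    It keeps only two integers of state, so memory use does not grow
    with the length of the audio fed into it.
    """

    def __init__(self) -> None:
        self.samples_seen = 0
        self.peak = 0

    def feed(self, chunk: List[int]) -> None:
        """Consume one chunk and update the running state."""
        self.samples_seen += len(chunk)
        if chunk:
            self.peak = max(self.peak, max(abs(s) for s in chunk))

    def finish(self) -> Dict[str, int]:
        """Return the final result once the audio has ended."""
        return {"samples": self.samples_seen, "peak": self.peak}


def chunked(samples: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group a sample stream into lists of at most `size` samples."""
    buf: List[int] = []
    for s in samples:
        buf.append(s)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def fake_audio(n: int) -> Iterator[int]:
    """Lazily generate n synthetic samples; nothing is stored in full."""
    return ((i % 200) - 100 for i in range(n))


decoder = ToyStreamingDecoder()
for chunk in chunked(fake_audio(16000), CHUNK_SAMPLES):
    decoder.feed(chunk)
result = decoder.finish()
print(result)
```

Because work is done as each chunk arrives, very little remains to be computed when the audio ends, which is the property that makes a short delay between end of audio and final transcript possible.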

Training has also sped up: the new version trains models about twice as fast. This was achieved by moving to TensorFlow 1.14 and integrating new APIs.

The model is trained on the Common Voice dataset, which consists of 1,400 hours of speech in 18 languages. The developers note that this is one of the largest multilingual voice datasets. It is much larger than the previously published Common Voice set, which consisted of 500 hours of speech from 20,000 volunteers (all recordings in English). Mozilla is currently collecting data in 70 languages in order to improve DeepSpeech further.