2021 is the year Carl will finally start listening to what I tell him!
To that end, I needed to revisit my choices for Speech-To-Text, sometimes referred to as Automatic Speech Recognition (ASR).
First up: Google Cloud Speech-to-Text API.
It is very powerful and requires very little of Carl’s Pi3B computing resources, but I don’t want Carl to be tied to external resources. Google Cloud Speech-to-Text is also only free for 60 minutes a month, and my “Free Trial” runs out soon.
That said, I needed to know what it could do for Carl. I was able to install it quickly, configure the billing, and get the two examples needed for comparison with other engines working: recognition from a file and recognition from the microphone.
Result: phenomenal recognition accuracy (100% for the quick testing I did) and “real-time” recognition.
Next up: Mozilla DeepSpeech
I worked with ASR from various vendors for over twenty years. These were traditional technology engines that extracted characteristics from the speech sample and used statistics to estimate what was said.
Lately, all the buzz has been “Machine Learning” this and “Deep Learning” that, with the general rule being it takes a big computer. When ModRobotics released the latest GoPiGo OS it included several TensorFlow Lite vision examples, so I started thinking perhaps Carl might be able to join in on this ML / Convolutional Neural Network (CNN or ConvNet) stuff.
I started seeing articles about Mozilla DeepSpeech and DeepSpeech-TFLite on the Raspberry Pi 4. Now Carl only has a Pi3B, but it is a Pi so I wanted to try out DeepSpeech-TFLite on Carl.
I found out Mozilla laid off 25% of their workforce and the Raspberry Pi is no longer high on their list of priorities. The latest version only releases the full DeepSpeech engine. The engine uses TensorFlow Lite developed models.
Bottom line - I got it installed but the recognition took 9-11 seconds for 2 seconds of speech.
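To put a number on that, here is a minimal sketch of the usual "real-time factor" arithmetic: processing time divided by audio length. The function names and the 1.0 cutoff for calling something “real-time” are my own assumptions for illustration, not anything from DeepSpeech itself.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration.

    A value of 1.0 or less means the engine keeps up with the speech
    (assumed definition of "real-time" for this sketch).
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(processing_seconds, audio_seconds, threshold=1.0):
    """True when the engine is at least as fast as the audio it is given."""
    return real_time_factor(processing_seconds, audio_seconds) <= threshold


if __name__ == "__main__":
    audio = 2.0  # seconds of speech in the test utterance
    for processing in (9.0, 11.0):  # measured DeepSpeech times on the Pi3B
        rtf = real_time_factor(processing, audio)
        print(f"{processing:.0f} s for {audio:.0f} s of speech: "
              f"{rtf:.1f}x real time, real-time={is_real_time(processing, audio)}")
```

By this measure DeepSpeech on the Pi3B runs at roughly 4.5 to 5.5 times slower than real time, so Carl would still be thinking about what I said long after I finished saying it.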
The Plan
I need to revisit the CMU pocketsphinx engine, which I benchmarked several years ago at “real-time” on the Pi3B for small language models and grammar-based recognition.