Speech Recognition is Hot (Words)

Carl is over two years old now and is still waiting for me to engage him in conversations. I set a goal that 2021 will be year Carl learns to listen for commands. If he learns to behave, I’ll consider teaching him more general conversational dialog.

Years back I experimented with the CMU Sphinx / PocketSphinx research engine for speech recognition. PocketSphinx did everything - hotword, grammar, and language model forms of speech recognition, but it was “old” technology considered brute-force by today’s standards.

While I was experimenting with PocketSphinx, the research community was experimenting with “neural nets” for language model speech recognition, which is needed for input into NLU natural language understanding dialog engines.

Today Carl has his choice of DeepSpeech, Kaldi, and Vosk for off-line speech recognition using machine learning / deep learning. (Carl has a Google Cloud API account, but I don’t let him loose with the credit-card very often.) There may be other players, but I only have so much patience to search.

DeepSpeech will run on Carl’s Pi3B, but is primarily focused on big platforms with GPU Graphics Processing Unit to process the connection networks. Carl was not impressed with the processor load and much slower than real-time recognition of DeepSpeech on his RPi.

Vosk is an open-source connection-net powered research engine that heralds from the Kaldi research engine. The engine was created by and supported by a Russian reco guru - Nickolay V. Shmyrev who helped me five years ago with PocketSphinx and is still willing to engage on his “obsession.”

Vosk only has two forms of recognition - language-model or list-of-words. list-of-words constrains the recognition vocabulary, but does not enforce any syntax (no grammars). Either of these two forms use approximately 40% of one core of Carl’s Pi3B even sitting idle, and eat an extra 75 mA per hour, which results in a loss of about 1.5 hours of "playtime, so I wouldn’t want to use it 24/7 when not conversing with Carl.

Now, in 2021, some folks in the neural net for speech recognition are turning their attention to using connection networks for hot-word detection. PyTorch, PicoVoice Porcupine, Nyuyama are three engines, I found.

With PyTorch I would have to become a speech and deep learning guru, then train my own hot-word model on my Mac. That was interesting reading but not going to happen.

I tested the PicoVoice Porcupine engine first, and was extremely impressed with the low (10% of one core) processing load, phenomenal recognition of far-field voice and even whispers, and near zero false recognitions. PicoVoice will not even accept email from non-commercial domains, (google, yahoo, hotmail), and will not create a custom hot-word model for Carl. I would have to say “Porcupine”, or “Computer” to get Carl to begin listening for commands. Carl insists that I say his name if I want his attention.

Carl Listening To CNN for “Marvin”, “Shiela”, “FireFox” or “Alexa”:

I discovered another connection-net based hot-word engine - Nyumaya out of Austria, the brainchild of a smiling fellow by the name of Neumair Günther. This engine requires the same barely-noticeable 10% of one core load, and exhibits great recognition of the hot-words (“Marvin”, “Sheila”, “FireFox”, or “Alexa”) with either normal speaking voice or whispers. It only had 3 false recognitions during an hour of Carl listening to the CNN audio news feed, and still recognized my normal voice hot-word utterance at the same volume as the background audio news feed.

Neumair Günther responded to my inquiry, about a custom model for Carl, with “Of course carl will be trained.” I was so excited I awoke at 5:45am this morning to create a tool to record samples of speaking “carl”, “carl listen”, and “hey carl”, (at a more reasonable hour for my wife and me to “play with Carl”).

So it looks like Carl will be able to use Nyumaya hot-word engine to wake-up when I address him, and the Vosk engine for recognizing what I want from him. Of course recognizing what I want and him knowing how to do my bidding will have to come from my fingers, hopefully at more reasonable hours!

Speech Recognition is Hot!


I would say more like a “Hot Smokin’ Weapon!”

I haven’t gotten that far yet, still working on my multiple O/S project, ultimately hoping to telescope back into to the remote camera robot project again, once I can figure out how to boot into multiple operating systems without having to swap fiddly-little SD cards.

I went out and spent about $100 US for a Seagate One Touch 500 gig SSD external USB drive for Charlie to use as a multi-boot source drive.  I don’t know if it’s cheaper in the US, but at El-Dorado here in Moscow, it was a smidge less than $100 US, (maybe $85, maybe $90, I didn’t do the conversion).  I snapped up three of them, gave one to my son, and kept two to experiment with.

It boots fast as a thief, faster than I can read the boot messages scrolling across the screen.  Maybe even too fast if the mount-count on the /rootfs partition is exceeded and it wants a fsck - it tries to boot faster than the fsck can finish!


Charlie is going to be up on blocks soon with a broken axle and a juice pipe down his throat, if you keep adding more hardware to him. I think he needs some heavy software to keep him “balanced.”



Soon as I finish the multi-boot project, I’ll start work on the software part of all this.

Ultimately, the goal is to make on-device development workable without having to wait five minutes for a refresh.

Where did you find those monitoring tools?

It would help if I could determine with a degree of precision, exactly where the bottlenecks are.  (I’m assuming I/O is the primary one for now.)

I use RPI-Monitor because it has a web graphic interface, with historical plots, and can show battery and docking status.

There are great command-line tools you should become familiar with as a Linux based robot owner, and this article will walk you through finding system bottlenecks:




Seriously though, that’s very cool. And I hadn’t thought about using two different engines in different contexts, but it makes total sense.