Google Cloud Speech-To-Text and DeepSpeech on Pi3B Results

cyclicalobsessive · January 4, 2021, 2:02am

2021 is the year Carl will finally start listening to what I tell him!

To that end, I needed to revisit my choices for Speech-To-Text, sometimes referred to as Automatic Speech Recognition (ASR).

First up: Google Cloud Speech-to-Text API.

While it is very powerful and requires very little of Carl’s Pi3B computing resources, I don’t want Carl to be tied to external resources. Also, Google Cloud Speech-To-Text is free for only 60 minutes a month, and my “Free Trial” runs out soon.

That said, I needed to know what it could do for Carl. I was able to install it quickly, configure billing, and get the two examples I needed for comparison with other engines working: recognition from a file and recognition from the microphone.
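For reference, here is a minimal sketch of the recognize-from-file case using the `google-cloud-speech` Python client (the 2.x-style `speech.SpeechClient`, `RecognitionConfig`, and `RecognitionAudio`). It assumes a short 16-bit WAV file, credentials supplied through Application Default Credentials, and an enabled billing account. It is not the exact code used for Carl’s test. Synchronous `recognize` only handles roughly a minute of audio, so longer files and live microphone input need the streaming or long-running calls instead.

```python
import sys
import wave


def wav_params(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) of a WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels(), wf.getsampwidth()


def recognize_file(path, language_code="en-US"):
    """Send a short 16-bit PCM WAV file to Google Cloud Speech-to-Text and
    return the top transcript of each result."""
    # Imported here so wav_params() works without the cloud library installed.
    from google.cloud import speech

    rate, channels, width = wav_params(path)
    if width != 2:
        raise ValueError("expected 16-bit (LINEAR16) samples")

    with open(path, "rb") as f:
        content = f.read()

    # Uses Application Default Credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS).
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=rate,
        audio_channel_count=channels,
        language_code=language_code,
    )
    audio = speech.RecognitionAudio(content=content)
    response = client.recognize(config=config, audio=audio)
    return [r.alternatives[0].transcript for r in response.results]


if __name__ == "__main__" and len(sys.argv) > 1:
    for line in recognize_file(sys.argv[1]):
        print(line)
```

Reading the sample rate from the WAV header keeps `sample_rate_hertz` consistent with the recording. A mismatch is a common cause of poor results.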

Result: phenomenal recognition accuracy (100% for the quick testing I did) and “real-time” recognition.

Next Up: Mozilla DeepSpeech

I worked with ASR from various vendors for over twenty years. These were traditional technology engines that extracted characteristics from the speech sample and used statistics to estimate what was said.

Lately, all the buzz has been “Machine Learning” this and “Deep Learning” that, with the general rule being that it takes a big computer. When ModRobotics released the latest GoPiGo OS, it included several TensorFlow Lite vision examples, so I started thinking perhaps Carl might be able to join in on this ML / Convolutional Neural Network (CNN or ConvNet) stuff.

I started seeing articles about Mozilla DeepSpeech and DeepSpeech-TFLite on the Raspberry Pi 4. Now Carl only has a Pi3B, but it is a Pi, so I wanted to try out DeepSpeech-TFLite on Carl.

I found out that Mozilla laid off 25% of its workforce and that the Raspberry Pi is no longer a high priority. The latest version releases only the full DeepSpeech engine, and that engine uses models developed with TensorFlow Lite.

Bottom line - I got it installed, but recognition took 9-11 seconds for 2 seconds of speech.
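For comparisons between engines, it helps to express this as a real-time factor (RTF): processing time divided by audio duration. By that measure, 9-11 s for 2 s of speech is an RTF of about 4.5-5.5, and anything under 1.0 keeps up with live speech. The sketch below times a single DeepSpeech transcription. It assumes the DeepSpeech 0.9.x Python package (`deepspeech.Model`, `enableExternalScorer`, `sampleRate`, `stt`), NumPy, and a mono 16-bit WAV at the model’s sample rate. It times only the `stt` call, not model loading.

```python
import sys
import time
import wave


def wav_duration_s(path):
    """Length of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def real_time_factor(processing_s, audio_s):
    """RTF = processing time / audio duration; below 1.0 keeps up with live speech."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def time_deepspeech(model_path, wav_path, scorer_path=None):
    """Transcribe a mono 16-bit WAV with DeepSpeech 0.9.x and return (text, RTF)."""
    import numpy as np
    import deepspeech

    model = deepspeech.Model(model_path)  # e.g. the .tflite model on a Pi
    if scorer_path:
        model.enableExternalScorer(scorer_path)

    with wave.open(wav_path, "rb") as wf:
        if wf.getframerate() != model.sampleRate() or wf.getnchannels() != 1:
            raise ValueError("expected mono audio at the model's sample rate")
        audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    start = time.perf_counter()
    text = model.stt(audio)  # only the recognition itself is timed
    elapsed = time.perf_counter() - start
    return text, real_time_factor(elapsed, wav_duration_s(wav_path))


if __name__ == "__main__" and len(sys.argv) > 2:
    # args: model_path wav_path [scorer_path]
    text, rtf = time_deepspeech(*sys.argv[1:4])
    print(f"{text!r}  RTF={rtf:.2f}")
```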

The Plan

I need to revisit the CMU pocketsphinx engine, which I benchmarked several years ago at “real-time” on the Pi3B for small language models and grammar-based recognition.

jimrh · January 4, 2021, 2:51am

You may wish to revisit your thinking about the Pi-4.

For the kind of heavy lifting you want Carl to be doing, you might want to consider getting a couple or three Pi-4 8 Giggers: one or two to put on the test bench, fire up software on, and try things out before committing them to Carl.

Heat sinking is your friend, especially if you’re going to be running math-intensive stuff on a Pi. I ordered myself an aluminum heat-sink that totally encloses the Raspberry Pi and includes two flush-mount fans. It won’t be much bigger than the arrangement I already have, and if you don’t mind moving the one large capacitor toward the edge of the GoPiGo board, you should be golden.

This is what the newer GPG-3 controller boards look like on the component side:
Note that they’ve moved the large 470μF capacitor to the right of the large inductor and just above the camera cable slot.

This is Charlie’s older-style GPG-3 board that has the large capacitor mounted next to the large square filter inductor.

This is how I implemented the change made to the newer boards to allow for the heat-sink and fan:

You can see that I moved the large 470μF capacitor to the right, just above the camera cable slot.

The large inductor (in the green square on the new board) is still a bit big and can interfere with a heat-sink’s fan. It becomes a matter of spacing the top board an extra mm or two away from the Pi.

Alternatively, if you don’t want to mess with the PCBs at all (and I won’t blame you if you don’t), maybe you can add a fan on the side of the chassis to blow air across a heat-sink?

In any event, if you want any kind of heavy lifting that doesn’t require Google or deep pockets, you’re probably going to need a Pi-4, the more memory the merrier.

cyclicalobsessive · January 4, 2021, 5:27am

Not needed. I did buy a little fan, though, for when Carl learns how to use more than one core. Until then he just needs software - there is no hardware solution that will make him smarter.

The CMU engine ran very acceptably on the Pi3B several years ago, I just need to install the latest and greatest version on Carl and start thinking of witty come-backs for when he starts listening to me.

jimrh · January 4, 2021, 5:50am

I remember you mentioning a while back when you had Carl doing a “talk-back”, repeating what he thought you said.

Don’t remember the specifics, but it was a hoot!
You: “What’s the weather outside, Carl?”
Carl: “Damnit! I said ‘get me a beer!’”

When you get Carl to the point where he’ll fetch a cold one on command, you can truly say he’s arrived.

BTW, I looked up the “Braitenberg vehicle” thing you mentioned, and it turns out that I did a “sorta-like-that” thing a while back using Bloxter and the distance sensor. I programmed a “don’t touch me!” behavior where Charlie would move up to a specific distance from something and then stop. If the obstacle moved away, Charlie moved forward until that distance was reached again. If it moved closer, Charlie backed away.
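Rendered in Python instead of Bloxter, that behavior is a bang-bang controller with a deadband: go forward when too far, back up when too close, and stop inside a tolerance band. This is a minimal sketch that assumes the `easygopigo3` library’s `EasyGoPiGo3`, `init_distance_sensor()`, and `read_mm()`. The 300 mm target and 30 mm deadband are hypothetical values, not what Charlie used.

```python
import sys
import time


def follow_action(distance_mm, target_mm=300, deadband_mm=30):
    """Decide one step of the 'keep your distance' behavior.

    Returns "forward" if the obstacle is farther than the target,
    "backward" if it is closer, and "stop" when within the deadband.
    """
    error = distance_mm - target_mm
    if error > deadband_mm:
        return "forward"
    if error < -deadband_mm:
        return "backward"
    return "stop"


def run(target_mm=300, period_s=0.1):
    """Drive a GoPiGo3 using the easygopigo3 library (assumed installed)."""
    from easygopigo3 import EasyGoPiGo3

    gpg = EasyGoPiGo3()
    sensor = gpg.init_distance_sensor()
    actions = {"forward": gpg.forward, "backward": gpg.backward, "stop": gpg.stop}
    try:
        while True:
            actions[follow_action(sensor.read_mm(), target_mm)]()
            time.sleep(period_s)
    finally:
        gpg.stop()  # never leave the motors running on exit


if __name__ == "__main__" and "--run" in sys.argv:
    run()
```

The deadband keeps the robot from twitching back and forth around the target distance because of sensor noise.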

Not as fancy as Carl, but it’s something.

Surely Carl’s using more than one core. A nine+ second analysis time for a 2 second sound bite? There’s gotta be a better way, and hopefully something that will put more of Carl’s processing power to work.

It might be interesting to try this again with the 64-bit preemptive kernel.

KeithW · January 4, 2021, 1:19pm

Do you want to actually run everything on the Pi per se? I was wondering if you could offload the speech recognition to a microcontroller. Something like TinyML (there’s a book on Amazon) might be just the ticket. The external board probably wouldn’t be too much of a power suck. Of course, then you have to figure out how to communicate, but it seems like you’re well versed in I2C.
/K

cyclicalobsessive · January 4, 2021, 1:33pm

That’s a big YES!

Carl is only using 0.26 of one of four cores of available thinking power right now. When Carl tells me he is feeling weak from overwork, I will probably build the ROS-based, distributed-processing “Carla” or “Ava”? (Hopefully with a better ending than Ex Machina.)
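On Linux, a figure like “0.26 of one core” can be derived from two samples of the aggregate `cpu` line in `/proc/stat`: take the busy share of the elapsed jiffies and multiply it by the number of cores. This is a minimal sketch under that assumption, and not necessarily how Carl’s own monitoring computes the number. It counts idle time as `idle + iowait` and skips the `guest` fields because the kernel already includes them in `user`/`nice`.

```python
import os
import time


def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:9]]  # user..steal; guest is already in user
    idle = fields[3] + fields[4]                  # idle + iowait
    total = sum(fields)
    return total - idle, total


def cores_in_use(before, after, ncores):
    """Average number of cores kept busy between two (busy, total) samples."""
    busy = after[0] - before[0]
    total = after[1] - before[1]
    return ncores * busy / total if total else 0.0


def sample_cores_in_use(interval_s=5.0):
    """Measure average core usage over interval_s seconds on a live system."""
    def read():
        with open("/proc/stat") as f:
            return parse_cpu_line(f.readline())

    first = read()
    time.sleep(interval_s)
    return cores_in_use(first, read(), os.cpu_count() or 1)
```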

jimrh · January 4, 2021, 1:40pm

Maybe “Carolus”?

It seems to me that a ROS version of Carl should be in Latin!

cyclicalobsessive · January 4, 2021, 1:44pm

Close - “Carlushka”: Yiddish / Polish / Czech

jimrh · January 4, 2021, 1:47pm

That’s the Russian diminutive too - “Карлушка”

BTW, etymologically speaking, “Yiddish” is actually a blend of Russian/Ukrainian and Hebrew as carried out of the old Russian Empire when all the Jews were expelled.

Oy! (Ой!) (as in Oy-Vey!) is current Russian for “ouch!” or an exclamation of surprise, like when you almost miss a step or accidentally step off a curb you didn’t see.

KeithW · January 5, 2021, 1:52pm

I really need to take a lesson from your work and track CPU usage more closely. I always just assume it’s running close to max all the time, and clearly that’s not the case. So yeah, why would you feel the need to offload voice recognition?

If you’re feeling Germanic, Karlchen would do (although that would technically be a better fit for @jimrh’s Charlie). From your other post, it looks like you might be stuck with Porcupine.

/K

cyclicalobsessive · January 5, 2021, 3:01pm

Yes, I created the “Carl-ified” demo that says “Yes. I heard you.” whenever anyone speaks (or whispers) the word “porcupine”, but Carl’s investigation continues.
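Here is a minimal sketch of the same wake-word idea, run over a recorded WAV file instead of the live microphone. It assumes `pvporcupine` 2.x or later, where `pvporcupine.create()` requires a Picovoice `access_key` (releases from early 2021 did not). It also assumes 16-bit mono audio at Porcupine’s `sample_rate`. `detect_in_wav` and `frames` are illustrative helpers, not Carl’s demo code.

```python
import struct
import sys
import wave


def frames(pcm_bytes, frame_length):
    """Split 16-bit little-endian PCM bytes into complete frames of frame_length samples.

    A trailing partial frame is dropped.
    """
    step = 2 * frame_length  # 2 bytes per 16-bit sample
    for offset in range(0, len(pcm_bytes) - step + 1, step):
        yield struct.unpack_from(f"<{frame_length}h", pcm_bytes, offset)


def detect_in_wav(wav_path, access_key, keyword="porcupine"):
    """Return start times (seconds) of frames in which Porcupine reports the keyword."""
    import pvporcupine  # imported lazily so frames() works without it

    handle = pvporcupine.create(access_key=access_key, keywords=[keyword])
    try:
        with wave.open(wav_path, "rb") as wf:
            if (wf.getframerate() != handle.sample_rate
                    or wf.getnchannels() != 1 or wf.getsampwidth() != 2):
                raise ValueError("expected 16-bit mono audio at Porcupine's sample rate")
            pcm = wf.readframes(wf.getnframes())
        hits = []
        for n, frame in enumerate(frames(pcm, handle.frame_length)):
            if handle.process(frame) >= 0:  # keyword index, or -1 for no detection
                hits.append(n * handle.frame_length / handle.sample_rate)
        return hits
    finally:
        handle.delete()


if __name__ == "__main__" and len(sys.argv) > 2:
    # args: recording.wav ACCESS_KEY
    for t in detect_in_wav(sys.argv[1], sys.argv[2]):
        print(f"Yes. I heard you. ({t:.2f} s)")
```

Porcupine consumes fixed-size frames, so the audio is chopped into exactly `frame_length` samples per call.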

It seems the CMU pocketsphinx engine I used years back is obsolete, and a new open-source engine for research, called Vosk, is available from Alpha Cephei.

Vosk even purports to support speaker-identification!
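Here is a minimal sketch of file-based recognition with the `vosk` Python package (`Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, `FinalResult`). It assumes a downloaded model directory and a mono 16-bit WAV. The optional phrase list uses Vosk’s grammar mode, which only works with models that support runtime vocabulary changes (typically the small models). Speaker identification needs a separate speaker model and is not shown.

```python
import json
import sys
import wave


def text_from_result(result_json):
    """Pull the recognized text out of a Vosk JSON result string."""
    return json.loads(result_json).get("text", "")


def recognize_wav(model_dir, wav_path, phrases=None):
    """Transcribe a mono 16-bit WAV with Vosk, optionally restricted to a phrase list."""
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        rate = wf.getframerate()
        if phrases:
            # Grammar mode: a JSON list of allowed phrases, plus "[unk]" for anything else.
            rec = KaldiRecognizer(model, rate, json.dumps(list(phrases) + ["[unk]"]))
        else:
            rec = KaldiRecognizer(model, rate)
        texts = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if rec.AcceptWaveform(data):  # True when an utterance is complete
                texts.append(text_from_result(rec.Result()))
        texts.append(text_from_result(rec.FinalResult()))
    return " ".join(t for t in texts if t)


if __name__ == "__main__" and len(sys.argv) > 2:
    # args: model_dir recording.wav [phrase ...]
    print(recognize_wav(sys.argv[1], sys.argv[2], sys.argv[3:] or None))
```

Restricting recognition to a short phrase list is the same small-grammar approach that made pocketsphinx practical on the Pi3B.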

(I spent 20 years at IBM doing speech recognition and speaker identity verification for self-service telephony applications, so I know how hard this stuff is for a little processor like the Raspberry Pi 3B … and how resourceful the research folks are. They did always tell me “You are asking the wrong questions,” but I always managed to figure out how to use what they had for my purposes.)