Speech to Text in Isadora!
-
A user over on the Facebook Group posted a video of actors sitting on a couch in front of a screen, with a projection of their words seemingly falling from the sky as they were spoken. As many of the comments pointed out, an easy way to accomplish this effect in Isadora is to prebake the asset, or even draw the words individually with Text Draw, because in a scripted performance the language is (theoretically) always the same. However, it got my wheels turning: is there a way to do this live, with improvised language? Turns out the answer is: yes!
There is a well-known speech recognition library for Python, and I've become comfortable programming OSC into Python scripts over the years, so I figured there would be a way to combine them to make it happen. Turns out the folks over at Programming for People beat me to it: they put out a YouTube tutorial on programming the functions a few years ago, and they sell a package of the source code and a Max interface for a few dollars on their website. I purchased the source code and began to tinker, as it needed some updates to be compatible with current Python versions. I then received a friendly email from the developer, and after a brief exchange we agreed that I could redistribute my fork of their work free of charge, though I would encourage those of you who end up using this to send a few dollars their way as a gesture of goodwill.
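If you're curious what that combination looks like, here is a minimal sketch of the general idea, assuming the SpeechRecognition and python-osc packages and Google's free web recognizer. It is my own illustration, not OSCTranscribe's actual source (which is structured differently), and the port number is just an example:

```python
# Sketch only: transcribe microphone audio and forward each phrase as OSC.
# Assumes: pip install SpeechRecognition pyaudio python-osc
import speech_recognition as sr
from pythonosc.udp_client import SimpleUDPClient

# Send recognized text to Isadora on this machine. The port is an assumption --
# use whatever OSC input port your Isadora patch actually listens on.
client = SimpleUDPClient("127.0.0.1", 1234)

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    # Sample the room so listen() knows what counts as silence.
    recognizer.adjust_for_ambient_noise(source)
    print("Listening...")
    while True:
        audio = recognizer.listen(source)  # blocks until a pause is detected
        try:
            text = recognizer.recognize_google(audio)  # free web recognizer
            client.send_message("/transcribe/data", text)  # example address
            print("Sent:", text)
        except sr.UnknownValueError:
            pass  # nothing intelligible in this chunk; keep listening
```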
As a result, here is OSCTranscribe! For those of you who want to tinker with the source code, you can follow that link to the git repository where I have the program hosted. Otherwise, you can grab the release for Windows 10 here.
When you run the program, it prints a list of all available audio devices it can bind to, and you select the one you want to use. Then you select the OSC network settings (the defaults are compatible with Isadora). Next, you use Isadora to send the following OSC commands via whatever triggers you want:
OSC API for Controlling OSCTranscribe:
/OSCTranscribe/calibrate {int thresh}: Setup, establishes the amount of quiet space before processing
/OSCTranscribe/startListening: Setup, begins listening to the mic and sending to the learning system
/OSCTranscribe/stopListening: Stops listening to the mic and stops sending to the learning system
By default, you send these commands to port 7070. Then, you should listen for the following OSC message:
/OSCTranscribe/data {string text}: The words recognized from the audio stream in between pauses
Now, Isadora's OSC Multi Listener, set to Text mode, will deliver the spoken words as Text for you to pass to Text Draw actors, formatters, JavaScript, etc.
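If you want to exercise the OSC API without Isadora in the loop (say, to sanity-check your audio device and network settings), a small python-osc test script works too. This is my own illustration, not part of OSCTranscribe; the calibrate value and the reply port are assumptions, so match them to your own configuration:

```python
# Sketch of a test harness: send the control commands to port 7070 and
# print any /OSCTranscribe/data messages that come back.
import threading
from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

REPLY_PORT = 9000  # assumption: the outgoing port configured in OSCTranscribe

def on_text(address, *args):
    # args[0] is the recognized text string
    print("Recognized:", args[0])

dispatcher = Dispatcher()
dispatcher.map("/OSCTranscribe/data", on_text)
server = BlockingOSCUDPServer(("127.0.0.1", REPLY_PORT), dispatcher)
threading.Thread(target=server.serve_forever, daemon=True).start()

client = SimpleUDPClient("127.0.0.1", 7070)
client.send_message("/OSCTranscribe/calibrate", 500)   # example threshold
client.send_message("/OSCTranscribe/startListening", [])
input("Speak into the mic, then press Enter to stop...")
client.send_message("/OSCTranscribe/stopListening", [])
```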
Here's a quick demo:
The recognition from the speech API is pretty good! Visually, a lot more tinkering could be done within Isadora to make this look more pleasing, but I will leave that to all of you to try out. Let me know what you think :)