Introduction: Yes! Voice Recognition Experiment With Wit.ai

With this post, I wish to demonstrate a very simple experiment of using the speech analysis API offered by Facebook's Wit.ai; specifically, utterance recognition of the English word YES. Of course, Wit.ai is much more than voice recognition; nevertheless, that is how we are going to use it in this post.

As in my previous post -- Arduino AI Fun With TensorFlow Lite, Via DumbDisplay -- the ESP32 microcontroller board is used as the driver only.

  • For simplicity, voice utterances are provided by sample WAV files that come with the DumbDisplay app.
  • Yes, DumbDisplay is used for the UI.
  • Indeed, the Wit.ai API request is sent via the DumbDisplay app.

Hopefully, in a future experiment of mine, ESP32 itself will be used to acquire the voice for recognition.

Step 1: Prepare Wit.ai App

For the sketch in this post to work, you will need a Wit.ai app access token. Here are the steps to set up such an app and retrieve the app's access token.

  • Head to Wit.ai.
  • Create a new app, say "yesdetect". The name doesn't matter. What you need is the access token.
  • In the settings page of the newly created app, you should see the Server Access Token. Note it down! You will need it later.

Notice that there is much more that you can do with the app. However, for the experiment of this post, you don't need to do anything other than create an "empty" app and get its access token.

Step 2: Prepare Arduino IDE

In order to be able to compile and run the sketch shown here, you will first need to install the DumbDisplay Arduino library. Open your Arduino IDE; go to the menu item Tools | Manage Libraries, type "dumbdisplay" in the search box there, and install the library found.

On the other side -- your Android phone side -- you will need to install the DumbDisplay Android app.

Step 3: DumbDisplay As UI

The UI for the sketch is realized with DumbDisplay.

  • Sample WAV files for recognition come with the DumbDisplay app. For this, you will need to grant "media storage access" to the DumbDisplay app; you do so in the app's settings dialog.
  • The DumbDisplay app can play the WAV files. You can try this by clicking the gray buttons: the button click is sent to ESP32 as "feedback", and ESP32 then drives the DumbDisplay app to play the corresponding WAV file (see the sketch after this list).
  • The DumbDisplay app is also responsible for sending the API request to Wit.ai, interpreting the JSON response, and sending the extracted "answer" back to ESP32 as "feedback".
  • You trigger recognition by clicking the yellow buttons. If the YES yellow button is clicked, YES should be recognized; after successful recognition, YES will be sounded. Similarly, the sketch will consider NO a correct recognition.
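
To give a feel for how such a button and its "feedback" can be realized with DumbDisplay, here is a minimal sketch; the layer name, label and layout are mine (not the exact ones of the sketch in this post), and playSound() is my assumption for the DumbDisplay call that drives the app to play a WAV file stored with it.

#include "esp32dumbdisplay.h"
DumbDisplay dumbdisplay(new DDBluetoothSerialIO("BT32", true, 115200));

LcdDDLayer* playYesLayer;  // an LCD-like text layer acting as a gray "button"

void setup() {
  playYesLayer = dumbdisplay.createLcdLayer(8, 1);  // 8 columns x 1 row
  playYesLayer->writeCenteredLine("PlayYES");
  playYesLayer->enableFeedback("f");  // "f": flash the layer when clicked; the click is sent to ESP32 as "feedback"
}

void loop() {
  const DDFeedback* feedback = playYesLayer->getFeedback();  // non-blocking poll for a click
  if (feedback != NULL) {
    // assumption: playSound() asks the app to play a WAV it stores; the file name here is hypothetical
    dumbdisplay.playSound("voice_yes.wav");
  }
}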

Step 4: The Sketch

The sketch ddyesdetect.ino can be downloaded here. Put it in a folder called ddyesdetect.

The sketch will not compile yet, since you will additionally need to declare your Wit.ai app access token.

At the top of the sketch, define WIT_ACCESS_TOKEN like so:

#define WIT_ACCESS_TOKEN "<your-app-access-token>"
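
By the way, if you would rather keep the token out of the sketch file itself, one possibility (the header name _wit_secret.h is just my suggestion) is to put the define in a separate header that you keep out of version control:

// _wit_secret.h -- a hypothetical header holding the token
#define WIT_ACCESS_TOKEN "<your-app-access-token>"

Then, at the top of ddyesdetect.ino:

#include "_wit_secret.h"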

As mentioned previously, the sketch targets ESP32. If ESP32 is the build target, Bluetooth connectivity to the DumbDisplay app is configured, with the device name "BT32". If the build target is not ESP32, say a Raspberry Pi Pico instead, a USB OTG connection is assumed.

#if defined(ESP32)
  #include "esp32dumbdisplay.h"
  DumbDisplay dumbdisplay(new DDBluetoothSerialIO("BT32", true, 115200));
#else
  #include "dumbdisplay.h"
  DumbDisplay dumbdisplay(new DDInputOutput(115200));
#endif

Other than using layers for drawing buttons etc., the "meat" is a JsonDDTunnel, which is used to trigger the DumbDisplay app to send a request to the Wit.ai API endpoint and get back the result, on behalf of ESP32.

// declare "tunnel" etc to send detect request to api.wit.ai ... please get "access token" from api.wit.ai
JsonDDTunnel* witTunnel;
const char* witAccessToken = WIT_ACCESS_TOKEN;
DDTunnelEndpoint witEndpoint("https://api.wit.ai/speech");

Setting up the Wit.ai API HTTP request is a bit more involved.

  // create / setup "tunnel" etc to send detect request
  witTunnel = dumbdisplay.createJsonTunnel("", false);
  witEndpoint.addHeader("Authorization", String("Bearer ") + witAccessToken);
  witEndpoint.addHeader("Content-Type", "audio/wav");
  witEndpoint.addParam("text");
  • The HTTP headers "Authorization" and "Content-Type" are set on the DDTunnelEndpoint object witEndpoint.
  • The parameter "text" is used for filtering the result returned, which is a series of JSON objects like
...
{
...
  "is_final": true,
  "speech": {
    "confidence": 0.7399,
    "tokens": [
      {
        "confidence": 0.4797,
...
      }
    ]
  },
  "text": "Yes",
  "traits": {}
}

(How the JSON result is returned to ESP32 will be described later.)

The above is just the setup. The actual request is made like

    witEndpoint.resetSoundAttachment(detectSound);
    witTunnel->reconnectToEndpoint(witEndpoint);
  • The DDTunnelEndpoint object witEndpoint is additionally set with a "sound attachment", which is the WAV file name.
  • Note that this "sound attachment" dictates that the WAV be sent as the content of the HTTP request.
  • Then the JsonDDTunnel object witTunnel is [re]connected to the Wit.ai endpoint, i.e. the request is made.

The Wit.ai API call result is a series of JSON objects, and the JSON values are extracted into key-value pairs for sending back to ESP32 as "feedback".

For example,

  ...
"speech": {
    "confidence": 0.7399,
...

"confidence" value will be extracted as the key-value pair -- "speech.confidence" / "0.7399".

And

...
"text": "Yes",
...

"text" value will be extracted as the key-value pair -- "text" / "Yes".

The parameter "text" set to witEndpoint as mentioned previously, will filter for those key-value (field id-value) pairs with matching key "text".

Hence, the Wit.ai API call result is read like

    String detected = "";
    while (!witTunnel->eof()) {  // keep reading until the "tunnel" signals end of result
      String fieldId;
      String fieldValue;
      if (witTunnel->read(fieldId, fieldValue)) {  // read one key-value pair, if any is available
        if (fieldValue != "") {
          dumbdisplay.writeComment(fieldValue);  // log the value as a DumbDisplay "comment"
          detected = fieldValue;  // due to the "text" filter, this is the recognized text
          statusLayer->writeCenteredLine(String("... ") + " [" + detected + "] ...");  // show it on the status layer
        }
      }
    }

Note that due to the "text" filtering parameter, only "text" key-value pairs will ever be returned. Therefore, all values read here match the key "text", i.e. they are the detected text.
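
To tie this back to the yellow buttons, checking the detected text against the expected answer might look like the following minimal sketch; the detectYes flag (recording whether the YES yellow button was clicked) and the status messages are my own illustration, not necessarily exactly what ddyesdetect.ino does.

    // check the detected text against the expected answer;
    // detectYes is an assumed flag recording which yellow button was clicked
    String expected = detectYes ? "yes" : "no";
    if (detected.equalsIgnoreCase(expected)) {
      statusLayer->writeCenteredLine(String("... detected [") + detected + "] ...");
    } else {
      statusLayer->writeCenteredLine(String("... failed [") + detected + "] ...");
    }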

Step 5: Enjoy!

Hope you will have fun with this experiment!

As mentioned previously, I hope to come up with an extension of this experiment -- ESP32 itself acquiring the voice for recognition. Until then, enjoy!

Peace be with you. Jesus loves you. May God bless you!