Introduction: ESP-Now Voice Commander Fun With Wit.ai and DumbDisplay

Extending the experiments of my previous posts, in this post I will try to demonstrate a hopefully more fun experiment -- a "voice commander".

Voice input will be captured similarly as described in ESP32 Mic Testing With INMP441 and DumbDisplay. The captured voice will be sent to Facebook's Wit.ai service for recognition, as described in Yes! Voice Recognization Experiment With Wit.ai. This time, however, the Wit.ai service will be used more fully -- not just to recognize some pre-recorded utterances of a few words -- with a whole set of voice commands set up for recognition.

The result of the voice recognition will be a voice command, which should be one of the several that the Wit.ai service is set up to recognize. Voice commands are translated into ESP-Now packets and sent to various pre-mapped ESP-Now agents for execution.

DumbDisplay will be used as the UI for the "voice commander", as well as a way to simulate ESP-Now "voice command receiver" agents. Nevertheless, a more "real" ESP-Now agent is set up with an ESP01 relay module, which connects to a motor that takes the role of a fan.

The voice commands in the experiment of this post are:

  • Turn on / off the kitchen.
  • Turn on / off the living room.
  • Lock / unlock the bedroom.
  • Lock / unlock the balcony.
  • Turn on / off the fan.

Here are the roles of the different ESP microcontroller boards:

  • An ESP32 board that acts as the "voice commander". This ESP32 board will have an INMP441 attached for voice capturing, as described in ESP32 Mic Testing With INMP441 and DumbDisplay.
  • ESP32 or ESP8266 (ESP01) boards for simulating "lights" (kitchen and living room), and "locks" (bedroom and balcony).
  • An ESP01 board with 5V ESP01 relay module for the fan.

Although multiple ESP microcontroller boards are needed for the full demonstration of the experiment, the only one strictly required is the ESP32 board that acts as the "voice commander"; the other ESP-Now agents are optional -- without them, you just won't see the effects of sending/receiving ESP-Now voice commands.

Moreover, please note that each instance of the DumbDisplay app can only connect to one microcontroller board at a time. Therefore, you may need to consider using multiple Android phones, Android tablets, Chromebooks, or Android emulators.

Step 1: ESP32 Voice Commander UI

The UI of the "voice commander" is realized with DumbDisplay, and it is pretty simple. Basically, you just click the Start button to start capturing voice commands. It is a loop: it will keep listening for voice commands until you click Stop.

Before starting, you may optionally select some preferences:

  • You can choose to have the captured voice played back before it is sent to Wit.ai for voice command recognition.
  • You can also adjust the simple "software-based amplification" factor, which defaults to 10 (see the sketch after this list).
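
As a side note, such "software-based amplification" can be as simple as multiplying each 16-bit sample by the amplification factor, clipping at the int16_t limits. Here is a minimal hypothetical sketch of the idea -- amplifySamples is a made-up name for illustration; the actual sketch's implementation may differ:

void amplifySamples(int16_t* samples, int sampleCount, int amplifyFactor) {
  for (int i = 0; i < sampleCount; i++) {
    // amplify in 32-bit to avoid overflow, then clip back to the 16-bit range
    int32_t amplified = (int32_t) samples[i] * amplifyFactor;
    if (amplified > 32767) amplified = 32767;
    if (amplified < -32768) amplified = -32768;
    samples[i] = (int16_t) amplified;
  }
}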

Step 2: Voice Commander Wit.ai App -- First Utterance

Since the voice command recognition capability of this experiment is enabled by Facebook's Wit.ai service, you will certainly need to set up such a Wit.ai app.

  • Head to Wit.ai.
  • Create a new app, say "voicecommander". The name doesn't matter.
  • Turn to the Understanding page of the newly created app.
  • In the Utterance text box, enter "turn on the kitchen".
  • Click on Choose or add intent to bring up the "intent selection" dialog box. Create the intent of the utterance -- "command".
  • Then select the word "kitchen" amongst the entered text in the Utterance text box to bring up the "entity selection" dialog box.
  • Create a new entity "target" (command target). Notice that the Resolved value is automatically and correctly set to the selected word "kitchen".
  • Next, click Add Trait to bring up the "trait selection" dialog. From the dialog, create a new trait "onoff" (command action).
  • Then create a new trait value "on" as the "onoff" trait of the utterance.
  • As the last step, click on the Train and Validate button to finish inputting the utterance.

In summary, the utterance "turn on the kitchen"

  • belongs to the intent "command" -- i.e. it is a command
  • is associated with the entity "target" with value "kitchen" -- i.e. the command is targeted at "kitchen"
  • is assigned the trait "onoff" with value "on" -- i.e. the command action for "kitchen" is to turn it "on"

Step 3: Voice Commander -- Second Utterance

Similarly, create the second utterance:

  • In the Utterance text box, enter "lock the bedroom".
  • The intent "command" is automatically selected for you, and it is indeed the intent wanted.
  • Now, select the word "bedroom" in the Utterance text box, and select the previously created entity, "target".
  • Notice that the Resolved value is automatically and correctly set to the selected word "bedroom".
  • Also notice that the trait "onoff" may be automatically added. However, this trait is not what is wanted here. Delete it by clicking the X next to the "onoff" trait.
  • After deleting the incorrect trait, click Add Trait to bring up the "trait selection" dialog. From the dialog, create a new trait "lockunlock" (command action).
  • Then create a new trait value "lock" as the "lockunlock" trait of the utterance.
  • As the last step, click on the Train and Validate button to finish inputting the utterance.

In summary, the utterance "lock the bedroom"

  • belongs to the intent "command" -- i.e. it is a command
  • is associated with the entity "target" with value "bedroom" -- i.e. the command is targeted at "bedroom"
  • is assigned the trait "lockunlock" with value "lock" -- i.e. the command action for "bedroom" is to "lock" it

Step 4: Voice Commander -- Other Utterances

Please go ahead and add the following utterances:

  • turn on the kitchen -- this one should already be added previously
  • turn off the kitchen
  • turn on the living room
  • turn off the living room
  • lock the bedroom -- this one should already be added previously
  • unlock the bedroom
  • lock the balcony
  • unlock the balcony
  • turn on the fan
  • turn off the fan

In summary, the Wit.ai app has

  • a single intent "command"
  • a single entity "target" with different values (the command targets) -- "kitchen", "living room", "bedroom", "balcony" and "fan"
  • two traits -- "onoff" and "lockunlock"
  • 10 utterances, as above

Step 5: Wit.ai App Server Access Token

In the settings page of the app, you should see the Server Access Token. Note it down! You will need it later.


Step 6: Prepare Arduino IDE

In order to compile and run the sketches shown here, you will first need to install the DumbDisplay Arduino library. Open your Arduino IDE, go to the menu item Tools | Manage Libraries, type "dumbdisplay" in the search box there, and install the library found.

On the other side -- your Android phone side -- you will need to install the DumbDisplay Android app.

Step 7: ESP32 Voice Commander Connections

Here are the needed connections between the ESP32 and the INMP441:

  • connect ESP32 3.3V to VDD of INMP441
  • connect ESP32 GND to GND and L/R of INMP441 (connecting L/R to GND means using a single I2S channel for capturing mono sound)
  • connect ESP32 GPIO25 to WS of INMP441
  • connect ESP32 GPIO33 to SD of INMP441
  • connect ESP32 GPIO32 to SCK of INMP441

Note that the picture shows the back of a normal pre-soldered INMP441 board. As a matter of fact, the mic input really is on the back as shown.
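
For reference, these connections correspond to an I2S pin configuration along the lines of the following sketch (based on the common ESP32 I2S driver usage, as in the mic testing post; the exact code in the actual sketch may differ):

#include <driver/i2s.h>

#define I2S_WS   25  // to WS of INMP441
#define I2S_SD   33  // to SD of INMP441
#define I2S_SCK  32  // to SCK of INMP441
#define I2S_PORT I2S_NUM_0

void i2s_setpin() {
  const i2s_pin_config_t pin_config = {
    .bck_io_num = I2S_SCK,             // bit clock
    .ws_io_num = I2S_WS,               // word select
    .data_out_num = I2S_PIN_NO_CHANGE, // not transmitting sound
    .data_in_num = I2S_SD              // serial data in from the mic
  };
  i2s_set_pin(I2S_PORT, &pin_config);
}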

Step 8: ESP32 Voice Commander Sketch

You can download the "voice commander" sketch voicecommander.ino here.

The sketch will make a connection with the DumbDisplay app using WIFI. Hence, you will need to modify the sketch a bit, adding defines for WIFI_SSID and WIFI_PASSWORD:

#define WIFI_SSID     "your-wifi-ssid"
#define WIFI_PASSWORD "your-wifi-password"
...
#include "wifidumbdisplay.h"
DumbDisplay dumbdisplay(new DDWiFiServerIO(WIFI_SSID, WIFI_PASSWORD));

You will also need to define your Wit.ai app access token, with the macro WIT_VC_ACCESS_TOKEN:

#define WIT_VC_ACCESS_TOKEN "your-wit.ai-app-access-token"
...
const char* witAccessToken = WIT_VC_ACCESS_TOKEN;
DDTunnelEndpoint witEndpoint("https://api.wit.ai/speech");

The sketch by default will assume you have all 3 of the ESP-Now agents mentioned previously.

If you do not have any ESP-Now agents, comment out the line that defines ENABLE_ESPNOW_REMOTE_COMMANDS to disable ESP-Now altogether:

// if no ESP-NOW clients, make sure the following line is commented out
#define ENABLE_ESPNOW_REMOTE_COMMANDS
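
Presumably, the ESP-Now-specific parts of the sketch are wrapped in preprocessor guards keyed on this macro, roughly in the following pattern (a sketch of the idea, not the actual code):

#if defined(ENABLE_ESPNOW_REMOTE_COMMANDS)
  // ... initialize ESP-Now, add peers, and send command packets ...
#else
  // ... ESP-Now disabled; recognized voice commands are not sent anywhere ...
#endif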

If you keep ESP-Now enabled, you should have at least one ESP-Now agent. In such a case, you may want to comment out the ESP-Now agents that you do not have:

// define ESP-NOW clients ... comment out what not used
#define LIGHT_ESP_NOW_MAC   0x94, 0xB5, 0x55, 0xC7, 0xCD, 0x60
#define DOOR_ESP_NOW_MAC    0x48, 0x3F, 0xDA, 0x51, 0x22, 0x15
#define FAN_ESP_NOW_MAC     0x84, 0xF3, 0xEB, 0xD8, 0x41, 0x53

It is important to note that those XXX_ESP_NOW_MAC macros actually specify the MAC addresses of the corresponding ESP-Now agents. You will need to replace them with yours! (ESP-Now agent MAC addresses will be mentioned again in a later section.)

Step 9: Voice Commander -- Voice Capturing

As in ESP32 Mic Testing With INMP441 and DumbDisplay, the mic sound capturing is 16-bit mono; however, for the purpose of capturing voice commands, the sample rate used by this sketch is a bit higher, at 16000 samples per second:

const int SoundSampleRate = 16000;  // will be 16-bit per sample

I2S sound capturing is done in basically the same fashion as mentioned in that post:

  i2s_install();
  i2s_setpin();
  i2s_start(I2S_PORT);
...
// 16000 samples per second (32000 bytes per second, since 16 bits per sample) ==> 8192 bytes = 256 ms per read
const int StreamBufferNumBytes = 8192;
const int StreamBufferLen = StreamBufferNumBytes / 2;
int16_t StreamBuffer[StreamBufferLen];
...
bool cacheMicVoice(int amplifyFactor, bool playback) {
...
  int chunkId = dumbdisplay.cacheSoundChunked16(MicVoiceName, SoundSampleRate, SoundNumChannels);
  while (true) {
...
    size_t bytesRead = 0;
    esp_err_t result = i2s_read(I2S_PORT, &StreamBuffer, StreamBufferNumBytes, &bytesRead, portMAX_DELAY);
...
        totalSampleCount += samplesRead;
        dumbdisplay.sendSoundChunk16(chunkId, StreamBuffer, samplesRead, false);
...
  }
  dumbdisplay.sendSoundChunk16(chunkId, NULL, 0, true);
...
}

The loop in cacheMicVoice() that captures sound actually starts off "hearing" -- without "listening" (i.e. without sending sound samples to the DumbDisplay app):

  int32_t silentThreshold = SilentThreshold * amplifyFactor;
  statusLayer->writeCenteredLine("... hearing ...");

It keeps "hearing" as long as it is "silent"; once it is no longer silent, it starts "listening" -- sending sound samples to the DumbDisplay app for caching:

      if (overThresholdCount >= VoiceMinOverSilentThresholdCount) {
        lastHighMillis = millis();
      }
      if (startMillis == -1) {
        if (lastHighMillis != -1) {
          startMillis = millis();
          statusLayer->writeCenteredLine("... listening ...");
        }
      }
      if (startMillis != -1) {
        totalSampleCount += samplesRead;
        dumbdisplay.sendSoundChunk16(chunkId, StreamBuffer, samplesRead, false);
      }

When it detects "silence" again, it stops capturing:

    if (startMillis != -1) {
      if (lastHighMillis != -1) {
        if ((millis() - lastHighMillis) >= StopCacheSilentMillis) {
          // if silent for some time, stop it
          break;
        }
      }
      if ((millis() - startMillis) >= MaxCacheVoiceMillis) {
        // caching for too long, force stop it
        break;
      }
    }
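
For completeness: the overThresholdCount value used above is basically a measure of how many samples in the current chunk are "loud". Here is a hypothetical sketch of how it might be computed for each chunk read (the actual sketch may count it differently):

    // count the samples in the chunk whose amplified magnitude is above the "silent" threshold
    int overThresholdCount = 0;
    for (int i = 0; i < samplesRead; i++) {
      int32_t value = (int32_t) StreamBuffer[i] * amplifyFactor;
      if (value < 0) {
        value = -value;
      }
      if (value > silentThreshold) {
        overThresholdCount += 1;
      }
    }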

After capturing, if "replay" is enabled, the captured sound will be replayed like this:

    float forHowLongS = (float) totalSampleCount / SoundSampleRate;
    dumbdisplay.playSound(MicVoiceName);
    delay(1000 * (1 + forHowLongS));

Step 10: Voice Commander -- Calling Wit.ai

After the sound is captured (and cached by the DumbDisplay app), it is sent to the Wit.ai app you set up previously for "voice command" recognition. The captured sound is sent to the Wit.ai service in the same way as described in Yes! Voice Recognization Experiment With Wit.ai:

  // create / setup "tunnel" etc to send detect request
  witTunnel = dumbdisplay.createJsonTunnel("", false);
  witEndpoint.addHeader("Authorization", String("Bearer ") + witAccessToken);
  witEndpoint.addHeader("Content-Type", "audio/wav");
  witEndpoint.addParam("text");  // "text" is not absolutely needed
  witEndpoint.addParam("value");

...
      // get voice command
      if (!cacheMicVoice(amplifyFactor, replayVoiceAfterCache)) {
        break;
      }
...
      witEndpoint.resetSoundAttachment(MicVoiceName);
      witTunnel->reconnectToEndpoint(witEndpoint);

The result of the Wit.ai API call is a series of JSON objects; for example:

{
  "entities": {
    "target:target": [
      {
        "body": "kitchen",
        "confidence": 0.9995,
        "end": 19,
        "entities": {},
        "id": "537862281773815",
        "name": "target",
        "role": "target",
        "start": 12,
        "type": "value",
        "value": "kitchen"
      }
    ]
  },
  "intents": [
    {
      "confidence": 0.9999980926550052,
      "id": "559840926181009",
      "name": "command"
    }
  ],
  "text": "turn on the kitchen",
  "traits": {
    "onoff": [
      {
        "confidence": 0.8178951209020524,
        "id": "743515353683310",
        "value": "on"
      }
    ]
  }
}

The JSON values are read by the DumbDisplay app, extracted into key-value pairs, and sent back to the ESP32 as "feedback".
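
For instance, for the example JSON result above, the key-value pairs fed back to the ESP32 would look something like the following (the exact field IDs are my assumption, inferred from the checks in the code below):

entities.target:target.value = kitchen
traits.onoff.value = on
text = turn on the kitchen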

On the ESP32 sketch side, the "feedback" key-value pairs extracted from the Wit.ai result are read like this:

      // gather Wit.ai result
      String entity;
      String trait;
      while (!witTunnel->eof()) {
        String fieldId;
        String fieldValue;
        if (witTunnel->read(fieldId, fieldValue)) {
          if (fieldId.startsWith("entities.") && fieldId.endsWith(".value")) {
            entity = fieldValue;
          } else if (fieldId.startsWith("traits.") && fieldId.endsWith(".value")) {
            trait = fieldValue;
          } else if (fieldId == "text") {
            dumbdisplay.writeComment(String("   {") + fieldValue + "}"); // for display only
          }
        }
      }
...
        sendCommand(entity, trait);

The entity obtained is actually the "command target", and the trait obtained is the "command action".

Step 11: Voice Commander -- Sending Voice Command Via ESP-Now

ESP-Now is initialized like this:

#include <esp_now.h>
#include <WiFi.h>
...
  uint8_t LightReceiverMACAddress[] = { LIGHT_ESP_NOW_MAC };
  esp_now_peer_info_t LightPeerInfo;

  // Set device as a Wi-Fi Station and also an Access Point
  WiFi.mode(WIFI_AP_STA);
  // Init ESP-NOW
  if (esp_now_init() != ESP_OK) {
    return false;
  }  
  // Register "send callback" lambda expression
  esp_now_register_send_cb([](const uint8_t *mac_addr, esp_now_send_status_t status) {
    if (status == ESP_NOW_SEND_SUCCESS) {
...
    } else {
...
    }
  });
...
  memcpy(LightPeerInfo.peer_addr, LightReceiverMACAddress, 6);
  LightPeerInfo.channel = 0;  
  LightPeerInfo.encrypt = false;
  if (esp_now_add_peer(&LightPeerInfo) != ESP_OK) {
    // failed to add the peer (error handling elided)
  }

  • it is suggested to set the WIFI mode to both "station" plus "access point" (WIFI_AP_STA), not just the "station" mode required by ESP-Now
  • esp_now_init() is the standard routine to call to initialize ESP-Now
  • esp_now_register_send_cb() registers the callback that ESP-Now will call to notify the result of sending an ESP-Now packet; for this sketch, it is not strictly needed
  • note that [](...){...} is just a C++ lambda expression that defines an inline subroutine
  • esp_now_add_peer() adds the ESP-Now agent (peer) that packets will be sent to
  • note that LightPeerInfo is declared as a global variable; see the sketch after this list for the other agents
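
Presumably, the same "add peer" pattern is repeated for each of the other enabled agents; for example, for the "door" agent (assuming DOOR_ESP_NOW_MAC is defined, and DoorPeerInfo is likewise a global variable):

  uint8_t DoorReceiverMACAddress[] = { DOOR_ESP_NOW_MAC };
  memcpy(DoorPeerInfo.peer_addr, DoorReceiverMACAddress, 6);
  DoorPeerInfo.channel = 0;
  DoorPeerInfo.encrypt = false;
  if (esp_now_add_peer(&DoorPeerInfo) != ESP_OK) {
    // failed to add the peer (error handling elided)
  }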

And here is how a "voice command" is sent via ESP-Now:

// define a structure as ESP Now packet
struct ESPNowCommandPacket {
  char commandTarget[32];
  char commandAction[32];
};

bool sendCommand(const String& commandTarget, const String& commandAction) {
  dumbdisplay.writeComment(String("command for [") + commandTarget + "] to [" + commandAction + "]");

  const uint8_t* receiverMACAddress = getCommandReceiverMACAddress(commandTarget, commandAction);
  if (receiverMACAddress == NULL) {
    dumbdisplay.writeComment("no command receiver");
    return false;
  }
  ESPNowCommandPacket packet;
  strcpy(packet.commandTarget, commandTarget.c_str());  // assumed to fit the 32-char field
  strcpy(packet.commandAction, commandAction.c_str());  // assumed to fit the 32-char field
  if (esp_now_send(receiverMACAddress, (const uint8_t *) &packet, sizeof(packet)) != ESP_OK) {
    dumbdisplay.writeComment("failed to send command");
    return false;
  }

  return true;
}
  • ESPNowCommandPacket defines the structure of the ESP-Now packet for a voice command; e.g. commandTarget is "kitchen", and commandAction is "on"
  • the subroutine getCommandReceiverMACAddress() is called to determine the ESP-Now agent MAC address for the voice command; see the sketch after this list
  • esp_now_send() is the routine used to send the voice command packet
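
The actual getCommandReceiverMACAddress() is not shown here, but conceptually it just maps the command target to the MAC address of the agent responsible for it. Here is a hypothetical sketch (assuming the MAC address arrays like LightReceiverMACAddress are declared as globals):

const uint8_t* getCommandReceiverMACAddress(const String& commandTarget, const String& commandAction) {
#if defined(LIGHT_ESP_NOW_MAC)
  if (commandTarget == "kitchen" || commandTarget == "living room") {
    return LightReceiverMACAddress;  // the "lights" agent
  }
#endif
#if defined(DOOR_ESP_NOW_MAC)
  if (commandTarget == "bedroom" || commandTarget == "balcony") {
    return DoorReceiverMACAddress;  // the "locks" agent
  }
#endif
#if defined(FAN_ESP_NOW_MAC)
  if (commandTarget == "fan") {
    return FanReceiverMACAddress;  // the "fan" agent
  }
#endif
  return NULL;  // no agent mapped for this command target
}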

Step 12: Voice Command Receiver With ESP01 Relay Module

As mentioned previously, a more "real" ESP-Now "voice command receiver" agent is realized using a 5V ESP01 relay module attached to a motor simulating a fan.

For connections:

  • Connect the positive terminal of a 5V power source to VCC of the relay module; connect the negative terminal of that 5V power source to GND of the relay module. This 5V power source supplies power to the ESP01 board as well as the relay module board.
  • Connect the positive terminal of another 5V power source to the positive terminal of the motor; connect the negative terminal of the motor to NC of the relay module; connect the negative terminal of that 5V power source to COM of the relay module.

You can download the sketch esp01_voicecommandagent.ino here.

The sketch is pretty simple, and does not rely on DumbDisplay. Nevertheless, you will need to find out the agent's MAC address for ESP-Now communication.

After uploading the sketch, you can open a serial monitor with baud rate 115200, and see lines like:

ESP01 agent MAC is 84:F3:EB:D8:41:53

Here, you will find the MAC address of your ESP01 (the one attached to the ESP01 relay module). This MAC address is the value you need when defining the "voice commander" sketch's FAN_ESP_NOW_MAC macro:

#define FAN_ESP_NOW_MAC     0x84, 0xF3, 0xEB, 0xD8, 0x41, 0x53
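
For reference, that MAC address line is presumably printed with something simple like the following (WiFi.macAddress() of the ESP8266 WiFi library returns the station MAC address as a String):

  Serial.print("ESP01 agent MAC is ");
  Serial.println(WiFi.macAddress());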

The relay is controlled by GPIO0 of the ESP01. Hence, the "fan pin" (GPIO0) is set to "output mode" in the setup() block:

#define FAN_PIN 0
...
void setup() {
  Serial.begin(115200);
...
  pinMode(FAN_PIN, OUTPUT);
...
}

Since the sketch is an ESP-Now packet receiver, ESP-Now support definitely needs to be initialized, like this:

  // Set device as a Wi-Fi Station
  WiFi.mode(WIFI_STA);

  // Init ESP-NOW
  if (esp_now_init() == ESP_OK) {
    Serial.println("Done initializing ESP-NOW");
  } else {
    Serial.println("Error initializing ESP-NOW");
  }  

  // Register "receive callback"
  if (esp_now_register_recv_cb(OnDataRecv) == ESP_OK) {
    Serial.println("Done registering receive callback");
  } else {
    Serial.println("Error registering receive callback");
  }
  • ESP-Now requires that WIFI be set to "station" mode
  • esp_now_init() is the routine to initialize ESP-Now
  • esp_now_register_recv_cb() is used to register the "receive packet callback" that handles receiving of ESP-Now packets; see the note after this list
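
Note that for the ESP8266 / ESP01, the WIFI and ESP-Now routines come from different headers than on the ESP32; presumably the top of the sketch includes something like:

#include <ESP8266WiFi.h>
#include <espnow.h>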

Here is the ESP-Now "receive packet callback" OnDataRecv:

void OnDataRecv(uint8_t *mac, uint8_t *incomingData, uint8_t len) {
  ESPNowCommandPacket receivedPacket;
  memcpy(&receivedPacket, incomingData, sizeof(receivedPacket));
  String commandTarget = receivedPacket.commandTarget;
  String commandAction = receivedPacket.commandAction;
  Serial.println(String("* Received command for [") + commandTarget + "] to [" + commandAction + "]");
  if (commandTarget == "fan") {
    if (commandAction == "on") {
      digitalWrite(FAN_PIN, 1);
      Serial.println("- turned on fan");
    } else if (commandAction == "off") {
      digitalWrite(FAN_PIN, 0);
      Serial.println("- turned off fan");
    }
  }
}
  • when an ESP-Now packet (a voice command) is received, the corresponding "command target" and "command action" are "extracted"; note that the ESPNowCommandPacket structure here must match the one defined in the "voice commander" sketch
  • if the "command target" is "fan", the fan is switched on / off by setting the "fan pin" HIGH / LOW, depending on whether the "command action" is "on" or "off"

Compared to an ESP32, it is harder to upload a sketch to an ESP01 board. However, with the help of a proper ESP01 USB programmer adapter, uploading a sketch to an ESP01 should be just as easy.

Step 13: General ESP Voice Command Receivers

In addition to the ESP01 relay module agent, you can have two ESP32 / ESP8266 boards acting as "voice command receiver" ESP-Now agents. Such agents will receive voice commands from the "voice commander", and simulate executing them with the help of DumbDisplay.

You can download the sketch voicecommandagents.ino here.

It can connect to the DumbDisplay app using WIFI or using OTG:

#if defined(WIFI_SSID) && defined(WIFI_PASSWORD)

  #include "wifidumbdisplay.h"
  DumbDisplay dumbdisplay(new DDWiFiServerIO(WIFI_SSID, WIFI_PASSWORD));

#else

  #include "dumbdisplay.h"
  DumbDisplay dumbdisplay(new DDInputOutput());

#endif

If you want to use WIFI connectivity, please define WIFI_SSID and WIFI_PASSWORD at the top of the sketch:

#define WIFI_SSID     "your-wifi-ssid"
#define WIFI_PASSWORD "your-wifi-password"

If not, an OTG connection with a USB cable is assumed.

Another important step is to find out the MAC addresses of your ESP boards for ESP-Now connection.

Assuming you are using WIFI for connectivity: after you have updated and uploaded the sketch, you can open a serial monitor with baud rate 115200, and see lines like:

*****
* agent MAC is 94:B5:55:C7:CD:60
*****

Here, you will see the MAC address of the agent. This MAC address is the value you need when defining the "voice commander" sketch's XXX_ESP_NOW_MAC macros:

// define ESP-NOW clients ... comment out what not used
#define LIGHT_ESP_NOW_MAC   0x94, 0xB5, 0x55, 0xC7, 0xCD, 0x60
#define DOOR_ESP_NOW_MAC    0x48, 0x3F, 0xDA, 0x51, 0x22, 0x15
#define FAN_ESP_NOW_MAC     0x84, 0xF3, 0xEB, 0xD8, 0x41, 0x53

For example:

94:B5:55:C7:CD:60 ==> 0x94, 0xB5, 0x55, 0xC7, 0xCD, 0x60

The sketch will just sit there (occasionally writing out the agent's MAC address as comments), waiting for voice commands from the "voice commander". Once an "understandable" voice command is received, the DumbDisplay app's display will be updated to simulate executing the voice command.

Please notice that the UI for simulating lock / unlock with web images uses a similar mechanism as described in Blink Test With Colored Image, With Arduino Nano.

Step 14: Making WIFI Connection With DumbDisplay App

In order for the DumbDisplay app to make a WIFI connection with a microcontroller board, you will need to find out the IP address of the board. To find out the IP address, upload the sketch and connect to a serial monitor with baud rate 115200. From the serial monitor, you should see lines like:

binded WIFI TrevorWireless
listening on 192.168.0.172:10201 ...

Of course, your WIFI SSID will be different from the one shown above, and very likely the IP address will be different as well. Note down the IP address shown. You will need this IP for the DumbDisplay WIFI connection.

On your Android phone side:

  • Start the DumbDisplay app.
  • Click on the Establish Connection icon.
  • In the "establish connection" dialog, you should see the "add WIFI device" icon at the bottom right of the dialog. Click on it.
  • A popup for entering the WIFI IP will be shown. Enter the IP address of your ESP board as the Network Host. Click OK when done.
  • Back in the "establish connection" dialog, a new entry will have been added; click on it to establish the WIFI connection.

One more thing: due to the high volume of data sent to the DumbDisplay app, it is strongly suggested that you turn off the DumbDisplay app's "show commands" option. You can turn off the "show commands" option from the DumbDisplay app menu.

Step 15: Enjoy!

Hope you will have fun with this "voice commander" experiment, which was inspired by my seeing the YouTube video Build your own Alexa with the ESP32 and TensorFlow Lite. Too bad that I didn't finish adapting and making use of a "wake word" for triggering voice commands.

Maybe the experiment can be made more interesting by combining it with speech response -- ESP32 Speech Synthesizer Experiment With XFS5152CE. Anyway, enjoy!


Peace be with you. Jesus loves you. May God bless you!