Introduction: ESP32 Speech Synthesizer Experiment With XFS5152CE

My recent microcontroller experiments have been about voice and AI. This reminds me of the text-to-speech synthesizer experiment I did a while back -- Arduino Speech Synthesizer Experiment with XFS5152CE.

Being able to synthesize arbitrary text with a small chip -- the XFS5152CE -- without connecting to a server is definitely an advantage. Even though the synthesized English sounds robotic, the Putonghua sounds pretty human-like to me.

In this post, I will demonstrate that experiment again, with some code refactoring enhancements. Nevertheless, the core of everything is basically the same as described in the video.

Step 1: The UI

DumbDisplay will be used as the UI for this experiment.

With the UI, you will be able to trigger text-to-speech synthesis in two ways.

You can click on the News button to have a piece of headline news acquired from the news web service NewsApi.

The piece of headline news certainly contains the headline text; it might also contain an image URL to accompany the headline. Both the text and the image will be displayed. And since this experiment is about text-to-speech synthesis with XFS5152CE, you will certainly also hear the headline text synthesized.

You can also input your own text for synthesis. Simply click on the text area (text layer); a keyboard will pop up allowing you to enter whatever text you desire. After receiving the text as "feedback", XFS5152CE will be requested to synthesize the text you entered.

As mentioned previously, my feeling is that XFS5152CE synthesizes Putonghua better. Indeed, the UI provides you with the option to acquire "English only" or "English & Chinese" news headlines. You switch between the options by clicking the English button.

To help visualize, you may want to watch the demo in the video Arduino Speech Synthesizer Experiment with XFS5152CE.

Step 2: Connections

The ESP32 board will be powered with an external 5V power source. The same power source will also be used to power the XFS5152CE board. Please note that the XFS5152CE board here is to be powered with 5V; some other XFS5152CE boards might need to be powered with 3.3V.

  • Connect 5V of ESP32 to the positive terminal of the 5V power source
  • Connect GND of ESP32 to the negative terminal of the 5V power source
  • Connect GPIO17 of ESP32 to RXD of XFS5152CE
  • Connect GPIO16 of ESP32 to TXD of XFS5152CE
  • Connect DC5V of XFS5152CE to the positive terminal of the 5V power source
  • Connect GND of XFS5152CE to the negative terminal of the 5V power source
  • Connect speaker pins of XFS5152CE to the two terminals of the speaker

You may notice that this XFS5152CE board only exposes the pins for UART. Indeed, in this experiment, ESP32 will communicate with XFS5152CE using UART.
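On common ESP32 Arduino cores, Serial2 already defaults to RX = GPIO16 / TX = GPIO17, matching the wiring above. If your core's defaults differ, the pins can be stated explicitly when starting the UART (a sketch fragment for illustration, not from the original code):

```cpp
// Sketch fragment: explicitly map Serial2 (UART2) to the pins wired above.
void setup() {
  // begin(baud, config, rxPin, txPin)
  Serial2.begin(115200, SERIAL_8N1, /*rxPin=*/16, /*txPin=*/17);
}
```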

Step 3: Preparation

In order to be able to compile and run the sketch shown here, you will first need to install the DumbDisplay Arduino library. Open your Arduino IDE; go to the menu item Tools | Manage Libraries, type "dumbdisplay" in the search box there, and install the library found.

On the other side -- your Android phone side -- you will need to install the DumbDisplay Android app.

Step 4: The Sketch

You can download the sketch here.

To connect to the DumbDisplay app, the sketch will make use of ESP32's Bluetooth support; i.e. the connection is via Bluetooth, with the device name "ESP32".

#include "esp32dumbdisplay.h"
DumbDisplay dumbdisplay(new DDBluetoothSerialIO("ESP32"));

To communicate with XFS5152CE, the sketch will make use of ESP32's UART2 with baud rate 115200.

#define synthesizer Serial2
...
void setup() {
  synthesizer.begin(115200);  // XFS5152CE UART baud rate is 115200
...
}


The various UI actions are triggered by "feedbacks" for the different layers, and those "feedbacks" are handled by the "feedback" handler FeedbackHandler, registered to the different layers

void setup() {
...
  langsButton->setFeedbackHandler(FeedbackHandler, "f");
...
  newsButton->setFeedbackHandler(FeedbackHandler, "f");
...
  textLayer->setFeedbackHandler(FeedbackHandler, "f:keys");  // "feedback" is input text (with keyboard)
...
}

Here is the "feedback" handler FeedbackHandler

DDPendingValue<bool> englishOnly(true);
DDPendingValue<bool> requestNews;
DDPendingValue<String> adhocText;
void FeedbackHandler(DDLayer* layer, DDFeedbackType type, const DDFeedback& feedback) {
    if (layer == langsButton) {
        englishOnly = !englishOnly;
    } else if (layer == newsButton) {
        requestNews = true;
    } else if (layer == textLayer) {
        adhocText = feedback.text;
    }
}

It simply changes the "pending value" -- a helper object that keeps track of whether a new value has been set -- of the different global variables, according to which layer has "feedback".
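Although DDPendingValue comes from the DumbDisplay library, the "pending value" idea itself can be sketched in a few lines of plain C++ (a hypothetical re-implementation for illustration only, not the library's actual code):

```cpp
#include <string>

// Hypothetical re-implementation of the "pending value" idea -- NOT the
// DumbDisplay library's DDPendingValue code. It keeps a value plus a
// "pending" flag; acknowledge() reports, exactly once, that a new value
// has been set since the last check.
template <typename T>
class PendingValue {
public:
    PendingValue() : value(T()), pending(false) {}
    explicit PendingValue(const T& initial) : value(initial), pending(true) {}
    PendingValue& operator=(const T& newValue) {
        value = newValue;
        pending = true;   // a new value is now waiting to be handled
        return *this;
    }
    // Returns true if a value was set since the last acknowledge() call.
    bool acknowledge() {
        bool wasPending = pending;
        pending = false;  // the pending state is consumed by this check
        return wasPending;
    }
    operator T() const { return value; }  // read the current value
private:
    T value;
    bool pending;
};
```

With this pattern, the "feedback" handler only records what happened; the loop block decides when to act on it, which keeps the handler itself quick.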

If the langsButton layer has "feedback" (clicked), the englishOnly boolean "pending value" is toggled. Notice that the "pending value" is initially set to true.

If the newsButton layer has "feedback" (clicked), the requestNews boolean "pending value" is set to true.

If the textLayer has "feedback" with the text you entered for synthesizing, the text will be assigned to adhocText string "pending value".

In the loop block, setting of the "pending values" will be detected and "acknowledged" (i.e. handled)

void loop() {
    // check if englishOnly has "pending value" [since last check]
    if (englishOnly.acknowledge()) {
        if (englishOnly) {
            langsButton->writeCenteredLine("English");
        } else {
            langsButton->writeCenteredLine("Eng & 中文");
        }
    }
...
    // check if adhocText has "pending value" [since last check]
    if (adhocText.acknowledge()) {
        HandleAdhocText(adhocText);
    }
...
    // check if requestNews has "pending value" [since last check]
    if (requestNews.acknowledge()) {
        HandleGetAnotherNews();
    }
...
    DDYield();  // yield to DD so that it can do its work
}


As mentioned previously, headline news will be acquired from the news web service NewsApi. In order to be able to call the service, you will need an API key, which you can get from the NewsApi website.

To specify your API key, you define it with the macro NEWSAPI_API_KEY

#define NEWSAPI_API_KEY "your-NewsApi-api-key"

And the API key is used to construct the news web service API endpoint like

const String NewsApiEndpoint = String("https://newsapi.org/v2/top-headlines?apiKey=") + NEWSAPI_API_KEY;

A piece of headline news is acquired via DumbDisplay app in the subroutine HandleGetAnotherNews like

    String country = "us";
    if (!englishOnly) {
        if (rand() % 2 == 0) {
            country = "hk";
        }
    }
    String endpoint = NewsApiEndpoint + ("&pageSize=1&category=" + category) + ("&country=" + country);
    newsTunnel->reconnectTo(endpoint);  // download a piece of headline news

Please note that the parameters "pageSize", "category" and "country" are added to the endpoint in order to customize what is returned.
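For illustration, here is roughly how the endpoint string ends up looking, extracted into a plain C++ helper (the helper name, the "general" category and the "KEY" placeholder are assumptions of this sketch, not values taken from the original code):

```cpp
#include <string>

// Compose the NewsApi top-headlines endpoint the same way the sketch does:
// base URL + apiKey + pageSize + category + country query parameters.
std::string BuildNewsEndpoint(const std::string& apiKey,
                              const std::string& category,
                              const std::string& country) {
    return "https://newsapi.org/v2/top-headlines?apiKey=" + apiKey +
           "&pageSize=1&category=" + category + "&country=" + country;
}
```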

The result of the API call is a JSON like

{
  "status": "ok",
  "totalResults": 1,
  "articles": [
    {
      "title": "<headline text>",
      ...
      "urlToImage": "<URL to image>"
    }
    ...
  ]
}

DumbDisplay app will extract results from the JSON as "id-value" pairs. For example, the "id" of the entry for "title" is "articles.0.title", and the "value" is "<headline text>". The extracted "id-value" pairs will be passed back to ESP32 as "feedbacks". Here is how the sketch reads the "id-value" pairs

    String title = "";
    String imageUrl = "";
    while (!newsTunnel->eof()) {
        if (newsTunnel->count() > 0) {
            textLayer->print(".");
            String fieldId;
            String fieldValue;
            newsTunnel->read(fieldId, fieldValue);
            if (fieldId == "articles.0.title") {
                title = fieldValue;
            } else if (fieldId == "articles.0.urlToImage") {
                imageUrl = fieldValue;
            }
        }
    }

After getting the title, the subroutine SynthesizeVoice is called to synthesize the text

    if (title.length() > 0) {
        textLayer->println(title);
        isIdle = false;
        String text = "[v1][h0]" + title;  // [v1]: volume @ level 1; [h0]: synthesize so as to read out English words (as opposed to spelling them out)
        if (englishOnly) {
            text = "[g2]" + text;          // [g2]: select English ... for reading out things like number
        } else {
            text = "[g1]" + text;         // [g1]: select Chinese ... for reading out things like number
        }
        SynthesizeVoice(text);
    }

And in case an image is associated with that piece of headline news, the image at imageUrl is downloaded to the DumbDisplay app and displayed like

    if (imageUrl.length() > 0) {
        imageTunnel->reconnectTo(imageUrl);  // download the image
        while (true) {
            int result = imageTunnel->checkResult();
            if (result == 1) {
                imageLayer->drawImageFileFit(ImageFileName);
            }
            if (result != 0) {
                break;
            }
        }
    }

Here is the subroutine SynthesizeVoice

void SynthesizeVoice(const String& text) {
    int text_len = text.length(); // text_len is actually the number of chars used to store the text (in UTF8 format)
    uint8_t buffer[2 * text_len];
    int len = StringToUnicode(text, buffer); // convert text to UTF16 format
    int out_len = 2 + len;
    synthesizer.write((byte) 0xFD);                      // header
    synthesizer.write((byte) ((out_len & 0xFF00) >> 8));  // data len: higher byte
    synthesizer.write((byte) (out_len & 0x00FF));        // data len: lower byte
    synthesizer.write((byte) 0x01);                      // command: synthesize
    synthesizer.write((byte) 0x03);                      // data encoding: UTF16
    synthesizer.write(buffer, len);                      // data (UTF16 text)
}

As shown, the text to synthesize will first be converted to UTF16 format (from UTF8, since Arduino String stores the text as UTF8). The data (converted text) is then sent to XFS5152CE via UART, preceded by the needed header bytes.
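Since StringToUnicode itself is not shown in this post, here is a host-side sketch of the two steps involved. The names below are hypothetical, the conversion is BMP-only (no surrogate pairs), and valid UTF-8 input is assumed:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Convert UTF-8 text to UTF-16 little-endian bytes; returns the byte count.
// BMP-only sketch; assumes the input is well-formed UTF-8.
int Utf8ToUtf16LE(const std::string& utf8, std::vector<uint8_t>& out) {
    size_t i = 0;
    while (i < utf8.size()) {
        uint8_t b = (uint8_t)utf8[i];
        uint32_t cp;
        if (b < 0x80) {                       // 1-byte sequence (ASCII)
            cp = b; i += 1;
        } else if ((b & 0xE0) == 0xC0) {      // 2-byte sequence
            cp = ((uint32_t)(b & 0x1F) << 6) | ((uint8_t)utf8[i + 1] & 0x3F);
            i += 2;
        } else {                              // 3-byte sequence (rest of BMP)
            cp = ((uint32_t)(b & 0x0F) << 12) |
                 (((uint32_t)((uint8_t)utf8[i + 1] & 0x3F)) << 6) |
                 ((uint8_t)utf8[i + 2] & 0x3F);
            i += 3;
        }
        out.push_back((uint8_t)(cp & 0xFF));        // low byte first (little-endian)
        out.push_back((uint8_t)((cp >> 8) & 0xFF)); // then high byte
    }
    return (int)out.size();
}

// Wrap the UTF-16 bytes in the XFS5152CE command frame, mirroring
// SynthesizeVoice: 0xFD, data-length high/low, command 0x01, encoding 0x03, data.
std::vector<uint8_t> BuildSynthesizeFrame(const std::string& utf8Text) {
    std::vector<uint8_t> data;
    int len = Utf8ToUtf16LE(utf8Text, data);
    int outLen = 2 + len;  // command byte + encoding byte + text bytes
    std::vector<uint8_t> frame;
    frame.push_back(0xFD);                            // header
    frame.push_back((uint8_t)((outLen >> 8) & 0xFF)); // data len: higher byte
    frame.push_back((uint8_t)(outLen & 0xFF));        // data len: lower byte
    frame.push_back(0x01);                            // command: synthesize
    frame.push_back(0x03);                            // data encoding: UTF16
    frame.insert(frame.end(), data.begin(), data.end());
    return frame;
}
```

Note that the buffer of 2 * text_len bytes in SynthesizeVoice is always large enough for BMP text: every character takes at least one byte in UTF8 but exactly two bytes in UTF16.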


Similarly, arbitrary text you entered for synthesizing is handled by the subroutine HandleAdhocText like

void HandleAdhocText(const String& text) {
...
            isIdle = false;
            String fullText = "[v1][h0]" + text;  // use a new variable; don't shadow the parameter
            if (englishOnly) {
                fullText = "[g2]" + fullText;
            } else {
                fullText = "[g1]" + fullText;
            }
            dumbdisplay.writeComment(fullText);
            SynthesizeVoice(fullText);
...
}

Step 5: Enjoy!

Have fun with this speech synthesizer experiment! Enjoy!


Peace be with you! May God bless you! Jesus loves you!