Introduction: Build a Bedtime Songs Bot With a Custom ML Model

"As a parent of a 3-year-old and a 1-year-old, I am often presented with a toy and asked to sing a song about it. When I was testing out Viam’s ML Model service, I came up with the idea of using machine learning to make my computer do this instead."

- Tess, Engineering Director at Viam


Follow this tutorial to train a machine learning model to make your own “bedtime songs bot” out of a personal computer. It's simple, fast, and endlessly entertaining!


[Posted by Sierra Guequierre on behalf of Tess Avitabile]

Supplies

To make your own singing robot, you need only the following hardware:

  • A computer with a webcam, speakers, and the Go Client SDK installed. We used a MacBook, but you can use any PC with a Viam-compatible operating system that meets the above requirements.

Step 1: Train Your ML Model With Pictures of Toys

Configure your webcam to capture data

In the Viam app, create a new robot and follow the steps on your new robot’s Setup tab.

Navigate to your robot’s page on the app and click on the Config tab.

First, add your personal computer’s webcam to your robot as a camera by creating a new component with type camera and model webcam:

  • Select Builder mode, as shown below. Click the Components subtab, then click Create component in the lower-left corner of the page.
  • Select camera for the type, then select webcam for the model. Enter cam for the name of your camera component, then click Create.

  • You do not have to edit the attributes of your camera at this point. Optionally, select a fixed filepath for the camera from the automated options in the video path drop-down menu.
  • If you use a different name, adapt the code in the later steps of this tutorial to use the name you give your camera.

Save the config. To view your webcam’s image stream, navigate to the Control tab of your robot’s page on the Viam app. Click on the drop-down menu labeled camera and toggle the feed on. If you want to test your webcam’s image capture, you can click on Export Screenshot to capture an image, as shown below:


Now, configure the Data Management Service to capture data, so you can use the image data coming from your camera on your robot to train your ML model:

  1. On the Config tab, select Builder mode. Click on the Services sub-tab, and navigate to Create service.
  2. Add a service so your robot can sync data to the Viam app in the cloud. For type, select Data Management from the drop-down, and name your service Data-Management-Service. If you use a different name, adapt the code in the later steps of this tutorial to use the name you give your service.
  3. Make sure both Data Capture and Cloud Sync are enabled as shown:

Enabling data capture and cloud sync lets you capture images from your webcam, sync them to the cloud, and then easily tag them and train your own machine learning model in the Viam app.

Note: You can leave the default directory as is. By default, captured data is saved to the ~/.viam/capture directory on-robot.
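If you want to verify locally that images are being captured, a small sketch like the following (a hypothetical helper, not part of the tutorial code) counts the .jpeg files under the default capture directory:

package main

import (
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    // Assumes the default on-robot capture directory, ~/.viam/capture.
    home, err := os.UserHomeDir()
    if err != nil {
        fmt.Println(err)
        return
    }
    root := filepath.Join(home, ".viam", "capture")

    // Walk the capture directory and count captured JPEG images.
    count := 0
    filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
        if walkErr == nil && !d.IsDir() && strings.HasSuffix(path, ".jpeg") {
            count++
        }
        return nil
    })
    fmt.Printf("captured %d images under %s\n", count, root)
}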

Next, configure Data Capture for your webcam:

  1. Go to the Components tab and scroll down to the camera component you previously configured.
  2. Click + Add method in the Data Capture Configuration section.
  3. Set the Type to ReadImage and the Frequency to 0.333. This will capture an image from the camera roughly once every 3 seconds. Feel free to adjust the frequency if you want the camera to capture more or less image data. The more varied images you capture for each tag, the more accurate your classifier model can be.
  4. Select the Mime Type as image/jpeg:


At this point, if you select Raw JSON mode on your robot's Config tab, the JSON should look like the following:

{
  "components": [
    {
      "model": "webcam",
      "attributes": {},
      "depends_on": [],
      "service_config": [
        {
          "attributes": {
            "capture_methods": [
              {
                "additional_params": {
                  "mime_type": "image/jpeg"
                },
                "capture_frequency_hz": 0.333,
                "method": "ReadImage"
              }
            ]
          },
          "type": "data_manager"
        }
      ],
      "name": "cam",
      "type": "camera"
    }
  ],
  "services": [
    {
      "attributes": {
        "sync_interval_mins": 0.1,
        "capture_dir": "",
        "tags": []
      },
      "name": "Data-Management-Service",
      "type": "data_manager"
    }
  ]
}


Capture data

Your webcam is now configured to automatically capture images when you are connected to your robot live on the Viam app. At this point, grab the toys or any objects you want the robot to be able to differentiate between.

Tess’s kids like playing with puzzle pieces, which come in different shapes and color combinations. They decided to distinguish between these puzzle pieces by tagging them by shape, but you can categorize your objects however you choose.

Hold up the toys to the camera while photos are being taken. Try to capture images from different angles and backgrounds. Try to get at least 50 images that fit your criteria for each tag.

You set the rate of capture in your webcam’s service configuration attribute capture_frequency_hz. If you set this to 0.333, the data management service will capture one image roughly every 3 seconds as you hold up your toys to the camera.

Go to the DATA tab in the Viam app to see the images captured by your webcam.

When you’ve captured enough images to tag, navigate back to the Config tab. Scroll to the card with the name of your webcam and toggle the switch next to Data Capture Configuration to the off position to disable data capture.

Now, use these pictures to train your machine learning model.

Tag data

Head over to the DATA page and select an image captured from your robot. After selecting the image, you will see all of the data that is associated with that image.

Add tags for each of the puzzle pieces. Type in your desired tag in the Tags section and save the tag. Since Tess wanted to classify their toys by shape, they used “octagon”, “circle”, “triangle”, “oval”, “rectangle”, “pentagon”, “diamond”, and “square”.

Scroll between your images. Add tags for each image that shows an object of the corresponding shape. You can select tags from the Recently used drop-down menu. Try to have at least 50 images labeled for each tag, as the training step needs enough examples per label to produce an accurate model.

Filter based on tags

Now that you’ve tagged the image data, you have the option to filter your images according to those tags. Head over to the Filtering menu and select a tag from the drop-down list to view all images labeled with that tag.

Train a model

After tagging and filtering your images, begin training your model.

Click the Train Model button. Name your model "shape-classifier-model". If you use a different name, adapt the code in the later steps of this tutorial to use the name you give your model. Select Multi label as the model type, which accounts for multiple tags.

Then select the tags that you used to label your toys and click Train Model.

Read through our guide to training a new model for more information.

Step 2: Use Your ML Model to Sing Songs to Your Kids

Configure your webcam to act as a shape classifier

Deploy the model to the robot and configure a vision service classifier of model mlmodel to use the model you trained to classify objects in your robot's field of vision.

Name your mlmodel vision service "shape-classifier". If you use a different name, adapt the code in the later steps of this tutorial to use the name you give your service.

At this point, if you select Raw JSON mode on the Config tab, the full configuration of your robot should look like the following:

{
  "packages": [
    {
      "name": "shapes",
      "version": "latest",
      "package": "20055b44-c8a7-4bc5-ad93-86900ee9735a/shapes"
    }
  ],
  "services": [
    {
      "name": "shape-classifier-model",
      "type": "mlmodel",
      "model": "tflite_cpu",
      "attributes": {
        "model_path": "${packages.shapes}/shapes.tflite",
        "label_path": "${packages.shapes}/labels.txt",
        "num_threads": 1
      }
    },
    {
      "name": "shape-classifier",
      "type": "vision",
      "model": "mlmodel",
      "attributes": {
        "mlmodel_name": "shape-classifier-model"
      }
    }
  ],
  "components": [
    {
      "name": "cam",
      "model": "webcam",
      "type": "camera",
      "namespace": "rdk",
      "attributes": {},
      "depends_on": []
    }
  ]
}


Record bedtime songs

Now, capture the audio of the songs you want your bot to play.

  • Record or download the audio files you want to use to your computer in .mp3 format.
  • Make the names of the files match the classifier tags you used: for example, square.mp3.
  • Navigate to a directory where you want to store your SDK code. Save your audio files inside of this directory.

The audio files Tess used are available to download on GitHub.
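Before moving on, you can check that every tag has a matching audio file. The sketch below (a hypothetical helper, not part of the control code) reports any tag that is missing its .mp3:

package main

import (
    "fmt"
    "os"
)

func main() {
    // The shape tags used in this tutorial; adjust to match your own labels.
    tags := []string{"octagon", "circle", "triangle", "oval", "rectangle", "pentagon", "diamond", "square"}
    for _, tag := range tags {
        // Each classifier tag needs a same-named .mp3 file in this directory.
        if _, err := os.Stat(tag + ".mp3"); err != nil {
            fmt.Printf("missing audio file for tag %q: %v\n", tag, err)
        }
    }
}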

Program your bedtime-songs bot

Now, use Viam’s Go SDK to program your robot so that if your webcam “sees” (captures image data containing) a toy, the robot knows to play a particular song through its computer’s speakers.

Follow these instructions to start working on your Go control code:

  1. Navigate to your robot’s page in the Viam app, and click on the Code sample tab.
  2. Select Go as the language.
  3. Click Copy to copy the generated code sample, which establishes a connection with your robot when run.
  4. Open your terminal. Navigate to the directory where you want to store your code. Paste this code sample into a new file named play-songs.go, and save it.
By default, the sample code does not include your robot location secret. We strongly recommend that you add your location secret as an environment variable and import this variable into your development environment as needed.
To show your robot’s location secret in the sample code, toggle Include secret on the Code sample tab. You can also see your location secret on the locations page.
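For example, reading the secret from an environment variable in your Go code might look like the following sketch. The variable name ROBOT_LOCATION_SECRET is just an example, not a name Viam requires:

// Read the location secret from the environment instead of hard-coding it.
// First export it in your shell: export ROBOT_LOCATION_SECRET=<your-secret>
locationSecret := os.Getenv("ROBOT_LOCATION_SECRET")
if locationSecret == "" {
    logger.Fatal("ROBOT_LOCATION_SECRET is not set")
}
// Pass locationSecret as the Payload in rpc.Credentials when connecting.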

For example, run the following commands on your MacBook to create and open the file:

cd <insert-path-to>/<my-bedtime-songs-bot-directory>
touch play-songs.go
vim play-songs.go

Then, paste the code sample, with your location secret included, into play-songs.go.

Now, you can add code into play-songs.go to write the logic that defines your bedtime songs bot.

To start, add in the code that initializes your speaker and plays the songs. Tess used the platform-flexible Go os package and the beep audio package from GitHub to do this.

// initSpeaker initializes the speaker with the sample rate of one of the
// audio files (all of the files are assumed to share the same sample rate).
func initSpeaker(logger logging.Logger) {
    f, err := os.Open("square.mp3")
    if err != nil {
        logger.Fatal(err)
    }
    defer f.Close()

    streamer, format, err := mp3.Decode(f)
    if err != nil {
        logger.Fatal(err)
    }
    defer streamer.Close()

    speaker.Init(format.SampleRate, format.SampleRate.N(time.Second/10))
}

// play plays the .mp3 file whose name matches the classification label.
func play(label string, logger logging.Logger) {
    f, err := os.Open(label + ".mp3")
    if err != nil {
        logger.Fatal(err)
    }
    defer f.Close()

    streamer, _, err := mp3.Decode(f)
    if err != nil {
        logger.Fatal(err)
    }
    defer streamer.Close()

    // Block until the song finishes playing.
    done := make(chan bool)
    speaker.Play(beep.Seq(streamer, beep.Callback(func() {
        done <- true
    })))

    <-done
}


Modify the above code as you desire.
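For example, if the songs play back too loudly, you could wrap the streamer in the beep library's effects.Volume before playing it. This sketch assumes you also import github.com/faiface/beep/effects:

// Wrap the streamer to halve the loudness: Volume is exponential,
// so Base: 2 with Volume: -1 divides the amplitude by 2.
quieter := &effects.Volume{
    Streamer: streamer,
    Base:     2,
    Volume:   -1,
}
done := make(chan bool)
speaker.Play(beep.Seq(quieter, beep.Callback(func() {
    done <- true
})))
<-done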

Make sure you import the necessary packages by adding the following to the import statement of your program:

"github.com/faiface/beep"
"github.com/faiface/beep/mp3"
"github.com/faiface/beep/speaker"

Also, make sure that you add initSpeaker(logger), a line that initializes the speaker, to the main function of your program.

Now, create the logic for the classifiers. Use the vision service’s classification API method ClassificationsFromCamera to do this.

You can get your components from the robot like this:

visService, err := vision.FromRobot(robot, "shape-classifier")

And you can get the classifications that the "shape-classifier-model" behind "shape-classifier" computes from your camera's images like this:

classifications, err := visService.ClassificationsFromCamera(context.Background(), "cam", 1, nil)

Change the name in FromRobot() if you used a different name for the resource in your code.
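While you experiment with confidence thresholds, it can help to log everything the classifier returns. For example, using the classifications value from the call above:

// Log each returned classification label with its confidence score.
for _, c := range classifications {
    logger.Infof("label: %s, confidence: %.2f", c.Label(), c.Score())
}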

This is what Tess used for the logic for the classifiers:

// Classifications logic
for {
    // Call the classifier several times, discarding the results, before
    // acting on a final classification below.
    for i := 0; i < 3; i++ {
        visService.ClassificationsFromCamera(context.Background(), "cam", 1, nil)
    }

    classifications, err := visService.ClassificationsFromCamera(context.Background(), "cam", 1, nil)
    if err != nil {
        logger.Fatalf("Could not get classifications: %v", err)
    }
    // Only act on confident classifications (score above 0.7).
    if len(classifications) > 0 && classifications[0].Score() > 0.7 {
        logger.Info(classifications[0])
        play(classifications[0].Label(), logger)
    }
}

After completing these instructions, your program play-songs.go should look like the following:

package main

import (
    "context"
    "os"
    "time"

    "github.com/faiface/beep"
    "github.com/faiface/beep/mp3"
    "github.com/faiface/beep/speaker"
    "go.viam.com/rdk/logging"
    "go.viam.com/rdk/robot/client"
    "go.viam.com/rdk/services/vision"
    "go.viam.com/rdk/utils"
    "go.viam.com/utils/rpc"
)

// initSpeaker initializes the speaker with the sample rate of one of the
// audio files (all of the files are assumed to share the same sample rate).
func initSpeaker(logger logging.Logger) {
    f, err := os.Open("square.mp3")
    if err != nil {
        logger.Fatal(err)
    }
    defer f.Close()

    streamer, format, err := mp3.Decode(f)
    if err != nil {
        logger.Fatal(err)
    }
    defer streamer.Close()

    speaker.Init(format.SampleRate, format.SampleRate.N(time.Second/10))
}

// play plays the .mp3 file whose name matches the classification label.
func play(label string, logger logging.Logger) {
    f, err := os.Open(label + ".mp3")
    if err != nil {
        logger.Fatal(err)
    }
    defer f.Close()

    streamer, _, err := mp3.Decode(f)
    if err != nil {
        logger.Fatal(err)
    }
    defer streamer.Close()

    // Block until the song finishes playing.
    done := make(chan bool)
    speaker.Play(beep.Seq(streamer, beep.Callback(func() {
        done <- true
    })))

    <-done
}

// Code Sample Connect() Code
func main() {
    logger := logging.NewLogger("client")
    robot, err := client.New(
        context.Background(),
        ".viam.cloud", // Insert your remote address here. Go to the Code sample tab in the Viam app to find it.
        logger,
        client.WithDialOptions(rpc.WithCredentials(rpc.Credentials{
            Type:    utils.CredentialsTypeRobotLocationSecret,
            Payload: "", // Insert your robot location secret here. Go to the Code sample tab in the Viam app to find it.
        })),
    )
    if err != nil {
        logger.Fatal(err)
    }
    defer robot.Close(context.Background())

    // Get the shape classifier from the robot.
    visService, err := vision.FromRobot(robot, "shape-classifier")
    if err != nil {
        logger.Fatal(err)
    }

    // Initialize the speaker.
    initSpeaker(logger)

    // Classifications logic
    for {
        // Call the classifier several times, discarding the results, before
        // acting on a final classification below.
        for i := 0; i < 3; i++ {
            visService.ClassificationsFromCamera(context.Background(), "cam", 1, nil)
        }

        classifications, err := visService.ClassificationsFromCamera(context.Background(), "cam", 1, nil)
        if err != nil {
            logger.Fatalf("Could not get classifications: %v", err)
        }
        // Only act on confident classifications (score above 0.7).
        if len(classifications) > 0 && classifications[0].Score() > 0.7 {
            logger.Info(classifications[0])
            play(classifications[0].Label(), logger)
        }
    }
}


Save your play-songs.go program with this logic added in. If the directory is not already a Go module, you may need to run go mod init and go mod tidy first to create a module and fetch the dependencies. Then run the code on your personal computer as follows:

go run ~/<my-bedtime-songs-bot-directory>/play-songs.go

The full example source code for play-songs.go is available on GitHub.

Now your bedtime songs bot knows to play the matching song whenever it sees one of your tagged shapes on the camera.

Next steps

This project is just a start.

Expand upon the configuration of your bedtime-songs bot to further customize a robot that can entertain with machine learning, the vision service, and more components and services.