Introduction: Object Detection With Sipeed MaiX Boards (Kendryte K210)

About: My channel about robotics with ROS and machine learning!

As a continuation of my previous article about image recognition with Sipeed MaiX boards, I decided to write another tutorial, this time focusing on object detection. Some interesting hardware built around the Kendryte K210 chip has appeared recently, including the Seeed AI Hat for Edge Computing, M5Stack's M5StickV and DFRobot's HuskyLens (although that one has proprietary firmware and is targeted more at complete beginners). Because of its low price, the Kendryte K210 appeals to people who want to add computer vision to their projects. But, as is usual with Chinese hardware products, tech support is lacking, and this is something I'm trying to improve with my articles and videos. Do keep in mind that I am not on the Kendryte or Sipeed developer team and cannot answer every question related to their products.

With that in mind, let's start! We'll begin with a short (and simplified) overview of how object detection CNN models work.

UPDATE MAY 2020: Seeing how my article and video on object detection with K210 boards are still very popular and among the top results on YouTube and Google, I decided to update the article to include information about aXeleRate, a Keras-based framework for AI on the edge that I develop. aXeleRate is essentially the collection of scripts I used for training image recognition and object detection models, combined into a single framework and optimized for a Google Colab workflow. It is more convenient to use and more up to date.

For the old version of the article, you can still see it on

Step 1: Object Detection Model Architecture Explained

Image recognition (or image classification) models take the whole image as input and output a list of probabilities, one for each class we're trying to recognize. This is very useful if the object we're interested in occupies a large portion of the image and we don't care much about its location. But what if our project (say, a face-tracking camera) requires us to know not only the type of object in the image, but also its coordinates? And what about projects that require detecting multiple objects (for example, for counting)?

This is where object detection models come in handy. In this article we'll use the YOLO (You Only Look Once) architecture and focus the explanation on the internal mechanics of this particular architecture.
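To make the "list of probabilities" concrete, here is a minimal sketch of how raw classifier scores are usually turned into per-class probabilities with a softmax. The labels and scores are made up for illustration and do not come from any model in this article.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and raw network outputs, for illustration only.
labels = ["raccoon", "cat", "dog"]
probabilities = softmax([2.0, 0.5, 0.1])
best_label = labels[probabilities.index(max(probabilities))]
print(best_label, probabilities)
```

Note that the output says nothing about where in the image the object is, which is exactly the limitation discussed above.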

We're trying to determine which objects are present in the picture and what their coordinates are. Machine learning is not magic and not "a thinking machine"; it is just an algorithm that uses statistics to optimize a function (the neural network) so that it solves a particular problem better. So we need to rephrase the problem to make it more "optimizable". A naive approach would be to have the algorithm minimize the loss (difference) between its prediction and the correct coordinates of the object. That works pretty well as long as there is only one object in the image. For multiple objects we take a different approach: we add a grid and make the network predict the presence (or absence) of objects in each grid cell. That sounds great, but it still leaves too much uncertainty for the network: how should it output the prediction, and what should it do when several objects have their centers inside one grid cell? We need to add one more constraint, the so-called anchors. Anchors are initial box sizes (width, height); the ones closest to the object's size are resized to match the object, using some of the outputs of the neural network (the final feature map).

So, here's a top-level view of what happens when a YOLO architecture neural network performs object detection on an image. Based on the features detected by the feature extractor network, a set of predictions is made for each grid cell, which includes the anchor offsets, anchor probability and anchor class. Then we discard the predictions with low probability, and voila!
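The grid and anchor idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 224x224 input split into a 7x7 grid (32-pixel cells), and a made-up set of anchors given in pixels. The function names are hypothetical and are not taken from aXeleRate.

```python
def grid_cell(cx, cy, image_size=224, grid_size=7):
    """Return (column, row) of the grid cell that contains the box center."""
    cell = image_size / grid_size              # cell side in pixels (32 here)
    col = min(int(cx // cell), grid_size - 1)  # clamp centers on the far edge
    row = min(int(cy // cell), grid_size - 1)
    return col, row


def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same center, so only sizes matter."""
    intersection = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union


def best_anchor(box_w, box_h, anchors):
    """Index of the anchor whose shape is closest to the object's shape."""
    scores = [shape_iou(box_w, box_h, aw, ah) for aw, ah in anchors]
    return scores.index(max(scores))


# Hypothetical anchors (width, height) in pixels, for illustration only.
ANCHORS = [(30, 30), (60, 90), (120, 80), (180, 180)]

# An object centered at (100, 150) that is 110 px wide and 85 px tall.
cell = grid_cell(100, 150)
anchor_index = best_anchor(110, 85, ANCHORS)
print(cell, anchor_index)
```

During training, the object is assigned to that grid cell and that anchor; two objects whose centers fall into the same cell can still be told apart as long as they match different anchors.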
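Here is a rough sketch of that last stage: turning raw per-cell outputs into boxes and dropping the low-probability ones. It follows the box parameterization described for YOLOv2 (sigmoid for the center offset and confidence, exponential for the anchor scaling). The 32-pixel cell size, the anchors, the threshold and the raw values are all assumptions chosen for illustration, and class prediction is left out to keep it short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(raw, col, row, anchor, cell=32):
    """Turn one raw prediction (tx, ty, tw, th, to) into a box in pixels."""
    tx, ty, tw, th, to = raw
    return {
        "cx": (col + sigmoid(tx)) * cell,   # center stays inside its grid cell
        "cy": (row + sigmoid(ty)) * cell,
        "w": anchor[0] * math.exp(tw),      # the anchor is rescaled, not replaced
        "h": anchor[1] * math.exp(th),
        "confidence": sigmoid(to),
    }


def keep_confident(boxes, threshold=0.5):
    """Discard predictions whose confidence is below the threshold."""
    return [b for b in boxes if b["confidence"] >= threshold]


# Made-up raw outputs: (raw values, grid column, grid row, anchor in pixels).
raw_predictions = [
    ((0.0, 0.0, 0.0, 0.0, 3.0), 3, 4, (120, 80)),
    ((1.0, -1.0, 0.2, 0.1, -4.0), 0, 0, (30, 30)),
]
boxes = [decode(raw, col, row, anchor) for raw, col, row, anchor in raw_predictions]
detections = keep_confident(boxes)
print(detections)
```

In practice a non-maximum suppression step usually follows, so that several overlapping boxes around the same object collapse into one.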

Step 2: Prepare the Environment

aXeleRate is based on a wonderful project by penny4860, the SVHN yolo-v2 digit detector. aXeleRate takes this Keras implementation of the YOLO detector to the next level and uses its convenient configuration system to train and convert image recognition, object detection and image segmentation networks with various backends.

There are two ways to use aXeleRate: running it locally on an Ubuntu machine or in Google Colab. For running in Google Colab, have a look at this example:

PASCAL-VOC Object Detection Colab Notebook

Training your model locally and exporting it for use with hardware acceleration is also much easier now. I highly recommend installing all the necessary dependencies in an Anaconda environment to keep your project separated from others and avoid conflicts.

Download the installer here.

After installation is complete, create a new environment:

conda create -n yolo python=3.7

Let's activate the new environment:

conda activate yolo

A prefix with the name of the environment will appear before your bash prompt, indicating that you are now working in that environment.

Install aXeleRate on your local machine with

pip install git+

And then run this to download scripts you will need for training and inference:

git clone

You can run quick tests with in aXeleRate folder. It will run training and inference for each model type, then save and convert the trained models. Since it trains for only 5 epochs and the dataset is very small, you will not get useful models; this script is only meant to check that everything runs without errors.

Step 3: Train an Object Detection Model With Keras

Now we can run a training script with the configuration file. Since the Keras implementation of the YOLO object detector is quite complicated, instead of explaining every relevant piece of code, I will explain how to configure the training and also describe the relevant modules, in case you want to make some changes to them yourself.

Let's start with a toy example and train a raccoon detector. There is a config file inside the configs folder, raccoon_detector.json. We choose MobileNet7_5 as the architecture (where 7_5 is the alpha parameter of the original MobileNet implementation, which controls the width of the network) and 224x224 as the input size. Let's have a look at the most important parameters in the config:

Type is the model frontend - Classifier, Detector or Segnet

Architecture is the model backend (feature extractor):

- Full Yolo
- Tiny Yolo
- MobileNet1_0
- MobileNet7_5
- MobileNet5_0
- MobileNet2_5
- SqueezeNet
- VGG16
- ResNet50

For more information on anchors, please read here

Labels are labels present in your dataset. IMPORTANT: Please, list all the labels present in the dataset.

object_scale determines how much to penalize wrong prediction of confidence of object predictors

no_object_scale determines how much to penalize wrong prediction of confidence of non-object predictors

coord_scale determines how much to penalize wrong position and size predictions (x, y, w, h)

class_scale determines how much to penalize wrong class prediction

augumentation - image augmentation: resizing, shifting and blurring the images in order to prevent overfitting and have greater variety in the dataset.

train_times, validation_times - how many times to repeat the dataset. Useful if you have augumentation enabled.


first_trainable_layer - allows you to freeze certain layers if you're using a pre-trained feature network
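To show how the four scale parameters interact, here is a deliberately simplified sketch of a YOLO-style loss for a single anchor slot. It uses plain squared errors and is not the loss implemented in aXeleRate; the default values are placeholders rather than the values from raccoon_detector.json, and the dictionary layout is hypothetical.

```python
def weighted_loss(pred, target, object_scale=5.0, no_object_scale=1.0,
                  coord_scale=1.0, class_scale=1.0):
    """Simplified YOLO-style loss for one anchor slot in one grid cell."""
    if target["has_object"]:
        # Position and size error, weighted by coord_scale
        coord = sum((pred[k] - target[k]) ** 2 for k in ("x", "y", "w", "h"))
        # The slot should be confident that an object is present
        conf = (pred["confidence"] - 1.0) ** 2
        # Class error, weighted by class_scale
        cls = sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
        return coord_scale * coord + object_scale * conf + class_scale * cls
    # Empty slot: only an overly high confidence is penalized
    return no_object_scale * pred["confidence"] ** 2


prediction = {"x": 0.5, "y": 0.5, "w": 1.0, "h": 1.0,
              "confidence": 0.5, "classes": [1.0]}
with_object = dict(prediction, has_object=True)
empty = {"has_object": False}

print(weighted_loss(prediction, with_object))  # only the confidence term is non-zero
print(weighted_loss(prediction, empty))
```

Most slots in an image are empty, so the ratio between object_scale and no_object_scale decides how strongly the network is pushed towards finding objects versus staying quiet.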

Now we need to download the dataset, which I shared on my Google Drive (original dataset). It is a raccoon detection dataset containing 150 annotated pictures.

Make sure to change the lines in configuration file (train_image_folder, train_annot_folder) accordingly and then start the training with the following command:
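If you prefer not to edit the JSON by hand, a small helper like the one below can do it. This is a sketch that assumes only that the keys train_image_folder and train_annot_folder exist somewhere in the file; it searches the nested structure instead of relying on a particular layout. The function names and the example paths are hypothetical.

```python
import json


def set_key(node, key, value):
    """Set `key` to `value` wherever it appears in nested dicts/lists; return the count."""
    changed = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                changed += 1
            else:
                changed += set_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            changed += set_key(item, key, value)
    return changed


def update_dataset_paths(config_path, image_folder, annot_folder):
    """Point the training config at the downloaded dataset and save it."""
    with open(config_path) as f:
        config = json.load(f)
    found_images = set_key(config, "train_image_folder", image_folder)
    found_annots = set_key(config, "train_annot_folder", annot_folder)
    if not (found_images and found_annots):
        raise KeyError("dataset keys not found in " + config_path)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config


# Example call (paths are placeholders, adjust them to where you unpacked the dataset):
# update_dataset_paths("configs/raccoon_detector.json",
#                      "raccoon_dataset/images", "raccoon_dataset/anns")
```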

python axelerate/ -c configs/raccoon_detector.json

The training script reads the configuration from the .json file and trains the model with the axelerate/networks/yolo/ script. yolo/backend/ is where the custom loss function is implemented and yolo/backend/ is where the model is created (input, feature extractor and detection layers put together). axelerate/networks/common_utils/ is the script that implements the training process and axelerate/networks/common_utils/ contains the feature extractors. If you intend to use the trained model with the K210 chip and Micropython firmware, memory limitations restrict your choice to MobileNet (2_5, 5_0 and 7_5) and TinyYolo; I've found that MobileNet gives better detection accuracy.

Since it is a toy example and only contains 150 images of raccoons, the training process should be pretty fast, even without a GPU, although the accuracy will be far from stellar. For work-related projects I've trained a traffic sign detector and a number detector; both datasets included more than a few thousand training examples.

Step 4: Convert It to .kmodel Format

With aXeleRate, model conversion is performed automatically - this is probably the biggest difference from the old version of the training scripts! Plus, you get the model files and training graph saved neatly in the project folder. I also found that validation accuracy sometimes fails to reflect the model's real performance for object detection, which is why I added mAP as a validation metric for object detection models. You can read more about mAP here.

If mAP (mean average precision, our validation metric) does not improve for 20 epochs, training stops early. Every time mAP improves, the model is saved in the project folder. After training is over, aXeleRate automatically converts the best model to the specified formats - as of now you can choose "tflite", "k210" or "edgetpu".
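For readers who have not met mAP before, here is a minimal sketch of how it is commonly computed: average precision (the area under the precision-recall curve) per class, then the mean over classes. It assumes each detection has already been marked as a true or false positive, which is usually done by checking its overlap (IoU) with a ground-truth box against a threshold such as 0.5. This is an illustration, not necessarily the exact procedure aXeleRate uses.

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) for one class."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Interpolate: precision at a recall level is the best precision to its right
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(per_class):
    """per_class: {label: (detections, num_ground_truth)}"""
    scores = [average_precision(d, n) for d, n in per_class.values()]
    return sum(scores) / len(scores)


# Made-up example: two real raccoons, three detections, one of them wrong.
raccoon_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
print(raccoon_ap)
```

Unlike plain accuracy, this metric drops both when the detector misses objects and when it produces confident false alarms.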
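The stopping rule described above can be sketched as follows. The function works on a list of per-epoch mAP values rather than on a real training loop, and its name and return format are hypothetical; the only detail taken from the text is the patience of 20 epochs.

```python
def train_with_early_stopping(map_per_epoch, patience=20):
    """Return (best_epoch, best_map, last_epoch_run) for a sequence of mAP values."""
    best_map, best_epoch = float("-inf"), None
    for epoch, value in enumerate(map_per_epoch, start=1):
        if value > best_map:
            # mAP improved: this is the point where the model would be saved
            best_map, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop early
            return best_epoch, best_map, epoch
    return best_epoch, best_map, len(map_per_epoch)


# Made-up history with a short patience so the effect is visible.
history = [0.1, 0.3, 0.2, 0.25, 0.28, 0.4]
result = train_with_early_stopping(history, patience=3)
print(result)
```

In this made-up run training stops at epoch 5, so the later improvement to 0.4 is never reached; that is the trade-off a patience value controls.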

Now to the last step, actually running our model on Sipeed hardware!

Step 5: Run on Micropython Firmware

It is possible to run inference with our object detection model with C code, but for the sake of convenience we will use Micropython firmware and MaixPy IDE instead.

Download MaixPy IDE from here and the Micropython firmware from here. You can use a Python script to burn the firmware or download a separate GUI flash tool here.

Copy model.kmodel to the root of an SD card and insert the SD card into the Sipeed Maix Bit (or another K210 device). Alternatively, you can burn the .kmodel to the device's flash memory. My example script reads the .kmodel from flash memory. If you are using an SD card, please change this line

task = kpu.load(0x200000)

to

task = kpu.load("/sd/model.kmodel")

Open MaixPy IDE and press the connect button. Open the script from the example_scripts/k210/detector folder and press the Start button. You should see a live stream from the camera with bounding boxes around ... well, raccoons. You can increase the accuracy of the model by providing more training examples, but do keep in mind that it is a fairly small model (1.9 M) and it will have trouble detecting small objects (due to low resolution).

One of the questions I received in the comments to my previous article on image recognition is how to send the detection results over UART/I2C to another device connected to a Sipeed development board. In my github repository you will find another example script, which (you guessed it) detects raccoons and sends the coordinates of the bounding boxes over UART. Keep in mind that the pins used for UART communication differ between boards; this is something you need to check yourself in the documentation.
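Whatever the transport, both sides need to agree on a message format. Below is one possible format, sketched in plain Python: each frame's boxes are sent as a single newline-terminated JSON line. This is a hypothetical format for illustration, not the one used by the example script in the repository, and it assumes a json module is available on the device. On the board, the resulting bytes would be handed to the UART write call; the receiving device reads up to the newline and decodes.

```python
import json


def boxes_to_message(boxes):
    """Encode a list of (x, y, w, h, label) tuples as one newline-terminated JSON line."""
    payload = [{"x": x, "y": y, "w": w, "h": h, "label": label}
               for x, y, w, h, label in boxes]
    return (json.dumps(payload) + "\n").encode("utf-8")


def message_to_boxes(message):
    """Decode a message on the receiving side back into tuples."""
    return [(b["x"], b["y"], b["w"], b["h"], b["label"])
            for b in json.loads(message.decode("utf-8"))]


# Made-up detection: one raccoon at (52, 40), 110 px wide and 85 px tall.
frame_boxes = [(52, 40, 110, 85, "raccoon")]
message = boxes_to_message(frame_boxes)
print(message)
```

A text format like this is easy to debug with a serial terminal; a fixed-size binary layout would be more compact if bandwidth matters.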

Step 6: Summary

Kendryte K210 is a solid chip for computer vision: flexible, albeit with limited memory available. So far in my tutorials we have covered using it for recognizing custom objects, detecting custom objects and running some OpenMV-based computer vision tasks. I know for a fact that it is also suitable for face recognition, and with some tinkering it should be possible to do pose detection and image segmentation (you can use aXeleRate to train a semantic segmentation model, but I have not yet implemented the inference with K210). Feel free to have a look at the aXeleRate repository issues and make a PR if you think there are some improvements that you can contribute!

Here are some articles I used in writing this tutorial; have a look if you want to learn more about object detection with neural networks:

Bounding box object detectors: understanding YOLO, You Look Only Once

Understanding YOLO (more math)

Gentle guide on how YOLO Object Localization works with Keras (Part 2)

Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3

Hope you can use the knowledge you have now to build some awesome projects with machine vision! You can buy Sipeed boards here; they are among the cheapest options available for ML on embedded systems.

Add me on LinkedIn if you have any questions and subscribe to my YouTube channel to get notified about more interesting projects involving machine learning and robotics.