Introduction: A Personal Assistant Using a Raspberry Pi 4 and an LLM via the Gemini API

A personal assistant that leverages a Large Language Model through the Gemini API, built on a Raspberry Pi 4 with the Viam SDK. It can answer questions from voice input and speak the answers through a speaker. It can also take a picture and answer questions about it.

The conversation is triggered once a person is detected using the Pi Camera and the 'EfficientDet-COCO' person detection model from Viam, and the assistant can keep the conversation going through follow-up questions.

Supplies

Step 1: Setting Up Viam Server

For this project I have used the Viam SDK and connected the Pi to the Viam server to access the computer vision model that detects a person from the camera, and I have used the SpeechIO module to play the LLM response through a USB speaker.


The installation guide for the Raspberry Pi can be found here.
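To make the connection step concrete, here is a minimal Python sketch of connecting to the machine with the Viam SDK. The address and API key values are placeholders to be copied from the Viam app, and the exact connection options may vary with the SDK version.

import asyncio

from viam.robot.client import RobotClient


async def connect() -> RobotClient:
    # Placeholder credentials: copy the real values from the Viam app's code sample tab.
    opts = RobotClient.Options.with_api_key(
        api_key="<YOUR-API-KEY>",
        api_key_id="<YOUR-API-KEY-ID>",
    )
    return await RobotClient.at_address("<YOUR-MACHINE-ADDRESS>", opts)


async def main():
    robot = await connect()
    print("Connected. Resources:", robot.resource_names)
    await robot.close()


asyncio.run(main())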

Step 2: Setting Up Board and Camera

Next we need to add the board and camera components in the config of the Viam app.

The instructions to do this can be found in this tutorial. The screenshots show the config JSON after setting up the components for this project.
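For reference, the board and camera components look like this in the raw JSON config (the same entries appear in the full configuration in Step 5):

{
  "name": "local",
  "namespace": "rdk",
  "type": "board",
  "model": "pi",
  "attributes": {}
},
{
  "name": "cam",
  "namespace": "rdk",
  "type": "camera",
  "model": "webcam",
  "attributes": {
    "video_path": "video0",
    "height_px": 1080,
    "width_px": 1920
  }
}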

Step 3: Setting Up Computer Vision Capability

I have used a computer vision model from Viam's model registry to detect people; it runs a TensorFlow Lite model.

The setup instructions can be found in this tutorial.
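Once configured, the detector can be queried from code. A minimal sketch, assuming the service and camera names used in Step 5 ('myPeopleDetector' and 'cam') and a connected RobotClient called robot as in Step 1:

from viam.services.vision import VisionClient


async def person_detected(robot) -> bool:
    # Ask the vision service for detections on the camera feed.
    detector = VisionClient.from_robot(robot, "myPeopleDetector")
    detections = await detector.get_detections_from_camera("cam")
    # Look for a confident "Person" detection from the EfficientDet-COCO labels.
    return any(d.class_name.lower() == "person" and d.confidence > 0.7 for d in detections)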

Step 4: Setting Up Speech Module

I have used the SpeechIO module to play the response from the LLM through our USB speaker. It performs the text-to-speech task in this case.

The instructions to set up the speech module can be found here.


The screenshots show the config JSON after setting up the components for this project. Here I have set 'disable_mic' to true and 'listen' to false so that the USB microphone can be accessed by other modules in the code.
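For reference, the speech service entry looks like this in the raw JSON config (it also appears in the full configuration in Step 5):

{
  "name": "speech",
  "namespace": "viam-labs",
  "type": "speech",
  "model": "viam-labs:speech:speechio",
  "attributes": {
    "listen_trigger_command": "hey there",
    "listen": false,
    "disable_mic": true
  }
}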

Step 5: Robot Configuration on Viam App

After adding all the required components, services, and modules in the Viam app, this is how the configuration JSON looks:

{
  "components": [
    {
      "name": "local",
      "namespace": "rdk",
      "type": "board",
      "model": "pi",
      "attributes": {}
    },
    {
      "name": "cam",
      "namespace": "rdk",
      "type": "camera",
      "model": "webcam",
      "attributes": {
        "video_path": "video0",
        "height_px": 1080,
        "width_px": 1920
      }
    },
    {
      "name": "detectionCam",
      "namespace": "rdk",
      "type": "camera",
      "model": "transform",
      "attributes": {
        "source": "cam",
        "pipeline": [
          {
            "type": "detections",
            "attributes": {
              "detector_name": "myPeopleDetector",
              "confidence_threshold": 0.5
            }
          }
        ]
      },
      "depends_on": [
        "cam"
      ]
    }
  ],
  "services": [
    {
      "name": "roboFlowFaceDetection",
      "namespace": "rdk",
      "type": "vision",
      "model": "viam-labs:vision:roboflow",
      "attributes": {
        "project": "objectdetect-fs9hx",
        "version": 1,
        "api_key": "LB2Xh5i7EmwYGmZv36kV"
      },
      "depends_on": [
        "cam"
      ]
    },
    {
      "name": "speech",
      "namespace": "viam-labs",
      "type": "speech",
      "model": "viam-labs:speech:speechio",
      "attributes": {
        "listen_trigger_command": "hey there",
        "listen": false,
        "disable_mic": true
      }
    },
    {
      "name": "people",
      "namespace": "rdk",
      "type": "mlmodel",
      "model": "tflite_cpu",
      "attributes": {
        "model_path": "${packages.ml_model.EfficientDet-COCO}/effdet0 (3).tflite",
        "label_path": "${packages.ml_model.EfficientDet-COCO}/effdetlabels.txt"
      }
    },
    {
      "name": "myPeopleDetector",
      "namespace": "rdk",
      "type": "vision",
      "model": "mlmodel",
      "attributes": {
        "mlmodel_name": "people"
      }
    }
  ],
  "modules": [
    {
      "type": "registry",
      "name": "viam-labs_roboflow-vision",
      "module_id": "viam-labs:roboflow-vision",
      "version": "0.0.1"
    },
    {
      "type": "registry",
      "name": "viam-labs_speech",
      "module_id": "viam-labs:speech",
      "version": "0.5.2"
    }
  ],
  "packages": [
    {
      "name": "EfficientDet-COCO",
      "package": "6efeb7e5-91aa-4187-8ed3-e84764726078/EfficientDet-COCO",
      "type": "ml_model",
      "version": "latest"
    }
  ],
  "agent_config": {
    "subsystems": {
      "viam-server": {
        "release_channel": "stable",
        "pin_version": "",
        "pin_url": "",
        "disable_subsystem": false
      },
      "agent-provisioning": {
        "disable_subsystem": false,
        "release_channel": "stable",
        "pin_version": "",
        "pin_url": ""
      },
      "agent-syscfg": {
        "release_channel": "stable",
        "pin_version": "",
        "pin_url": "",
        "disable_subsystem": false
      },
      "viam-agent": {
        "pin_version": "",
        "pin_url": "",
        "disable_subsystem": false,
        "release_channel": "stable"
      }
    }
  }
}

Step 6: Setting Up Speech Recognition Module

I have used a library called SpeechRecognition to take voice input from the microphone and pass it to the LLM to get a response back.

The installation guide can be found here.

I have used the recognize_whisper_api function for speech recognition. In order to use this function, we will need an OpenAI API key.
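A minimal sketch of how this library can be used, assuming a working USB microphone and an OpenAI API key stored in the OPENAI_API_KEY environment variable (the helper name listen is just for illustration):

import os

import speech_recognition as sr

recognizer = sr.Recognizer()


def listen() -> str:
    """Record one utterance from the microphone and transcribe it with the Whisper API."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    return recognizer.recognize_whisper_api(audio, api_key=os.environ["OPENAI_API_KEY"])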

Step 7: How to Use Gemini API

For this step we need to install the google-generativeai package using this command:

pip install -U google-generativeai


After that we need to get the API key for Gemini, which can be obtained from here.

Here we need to click on the Get API Key button and follow the instructions. Gemini provides different models: I have used the GenerativeModel named gemini-1.0-pro-latest for chat, and gemini-pro-vision to take image input via the Pi camera and answer questions about the provided image.
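A minimal sketch of both modes, assuming the API key is stored in a GEMINI_API_KEY environment variable and a capture.jpg image file exists (both names are placeholders):

import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Chat mode: a multi-turn conversation that keeps its own history.
chat_model = genai.GenerativeModel("gemini-1.0-pro-latest")
chat = chat_model.start_chat(history=[])
reply = chat.send_message("What is a Raspberry Pi 4?")
print(reply.text)

# Vision mode: a one-off question about an image.
vision_model = genai.GenerativeModel("gemini-pro-vision")
image = Image.open("capture.jpg")
answer = vision_model.generate_content(["Describe what you see in this picture.", image])
print(answer.text)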

Step 8: Choosing APIs Over Local LLMs

For data privacy and security we could run local LLMs on our own computer, but since we are using a single-board computer for this project, I have chosen APIs to access the LLMs. To get real-time (or near real-time) responses we need good (if not great!) compute resources, which Raspberry Pi boards cannot provide.

I tested this by downloading the Whisper model onto the board to transcribe the microphone input locally; the transcription quality was good, but it came at the expense of a significant waiting time. So, to improve the user experience, I opted for APIs to access the available LLMs.

Step 9: How the System Works

At a high level, this is how the system works (a code sketch of the main loop follows the list):

  1. It looks for people using the camera; if it detects someone, it greets the person and asks how it can help. The system can answer questions from voice input, and it can also take pictures with the camera and answer questions about them. This is the main conversation loop.
  2. If the system recognizes 'tell me' in the voice command, it enters the normal chat mode of the conversation. The person can ask a question, and after giving the response back, the system asks if the person has any follow-up questions.
  3. If the recognized speech contains 'Yes', the system continues the chat in this way.
  4. If the recognized speech contains 'No' or anything else, the system goes back to the main loop, though it responds differently in each case.
  5. If the system recognizes 'picture' in the voice command, it enters the image chat mode of the conversation. The system captures an image and explains it. Like the normal chat mode, it asks if the person has any more follow-up questions, and the flow from here is the same as the normal chat mode:
  6. As in the normal chat mode, if the recognized speech contains 'Yes', the system continues the chat in this way.
  7. If the recognized speech contains 'No' or anything else, the system goes back to the main loop, though it responds differently in each case.
  8. For anything else, the system assumes the user doesn't need further assistance in either the chat or the image analysis loop; it says goodbye and breaks out of the main loop.
  9. It then re-enters the person detection loop using the vision service and starts the conversation again when it detects someone.
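The following sketch captures this flow. It is illustrative rather than a drop-in implementation: person_detected() is the helper from Step 3, listen() is the helper from Step 6, and speak(), ask_gemini(), and describe_picture() are hypothetical helpers standing in for the SpeechIO and Gemini calls shown in the other steps.

import asyncio


async def chat_mode(question: str):
    # Normal chat mode: answer, then keep going as long as the user says "yes".
    while True:
        await speak(ask_gemini(question))
        await speak("Do you have any follow-up questions?")
        if "yes" in listen().lower():
            await speak("Go ahead.")
            question = listen()
        else:
            await speak("Okay, let me know if you need anything else.")
            return


async def image_mode(robot):
    # Image chat mode: capture a frame, describe it, then handle follow-ups.
    description = describe_picture(robot)
    await speak(description)
    await speak("Do you have any follow-up questions about the picture?")
    while "yes" in listen().lower():
        await speak("Go ahead.")
        description = describe_picture(robot, question=listen(), previous=description)
        await speak(description)
        await speak("Any more questions about the picture?")
    await speak("Okay, let me know if you need anything else.")


async def run(robot):
    while True:
        # Person detection loop: wait until the camera sees someone.
        if not await person_detected(robot):
            await asyncio.sleep(1)
            continue

        await speak("Hello! How can I help you today?")
        while True:  # main conversation loop
            command = listen().lower()
            if "tell me" in command:
                await chat_mode(command)
            elif "picture" in command:
                await image_mode(robot)
            else:
                # Anything else: assume the user needs no further assistance.
                await speak("Goodbye!")
                break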

Step 10: Enhancing Image Chat Mode

An important thing to mention about answering questions regarding image input: unlike the chat model, the vision model from Gemini does not inherently support follow-up questions about the image input; for the vision model, the conversation is a one-off exchange.

So, to facilitate follow-up questions for a given image input, I passed the model's previous response in the prompt along with the follow-up question from the user, which showed some pretty impressive results.
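A minimal sketch of that idea, reusing the gemini-pro-vision model from Step 7 (the function and variable names are just for illustration, and genai.configure is assumed to have been called as in Step 7):

import google.generativeai as genai
from PIL import Image

vision_model = genai.GenerativeModel("gemini-pro-vision")


def ask_about_image(image_path: str, question: str, previous_answer: str = "") -> str:
    """Ask the vision model about an image, threading the previous answer into the prompt."""
    prompt = question
    if previous_answer:
        prompt = (
            "Earlier you said the following about this picture: "
            f"{previous_answer}\n"
            f"Now answer this follow-up question about the same picture: {question}"
        )
    response = vision_model.generate_content([prompt, Image.open(image_path)])
    return response.text


# first_answer = ask_about_image("capture.jpg", "What do you see?")
# follow_up = ask_about_image("capture.jpg", "What color is it?", previous_answer=first_answer)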


The code can be found in this GitHub repo. A code-level explanation can be found in the README there.