Introduction: Webcam Motion Piano With Python and OpenCV

When I was in a children's museum, there was a camera and a screen: areas on the screen would light up and play music when you moved. My youngest kid really loved it, so I made a simple version with a webcam and a row of piano keys at the top of the screen that you can activate by waving your hands.

This was my first (and only) motion recognition project. I am going to describe a simplified version of my final code with the hope that it will be helpful to other beginners.

It works both on my Windows 10 laptop and, with a bit of latency, on our Raspberry Pi 3B+.

I think I learned how to do this from this tutorial.


Supplies:

  • computer with webcam
  • python3

Step 1: Install Dependencies

You need:

  • python3
  • OpenCV for image processing
  • numpy (required by OpenCV)
  • python-rtmidi for playing music

Once you have python3 installed, run:

python3 -m pip install opencv-python python-rtmidi

If you're using Linux, install Timidity and sound fonts. On a Raspberry Pi, this should work:

sudo apt-get install timidity
sudo apt-get install fluid-soundfont-gm fluid-soundfont-gs

Step 2: Run Code

To download the code, go here, right-click, and choose "Save as..."



Wave your hands or other objects in the keyboard area and it will make music. Press ESC to exit.

There is also a version with a few more options, but in this Instructable I will describe the code for the simpler one.

Step 3: The Algorithm

The basic idea for the main loop is this:

  • Capture frame from camera
  • Extract the piano key area
  • Scale it down
  • Apply a Gaussian blur
  • Compare to a comparison frame and play/sustain music if the difference exceeds a specified threshold in the area of the corresponding key
  • Display the frame with the triggered keys highlighted.

What is the comparison frame? Here there is a slight complication. One option I tried was simply comparing to the previous blurred frame, but that required constant hand motion to sustain sound on a piano key.

A simpler idea is just to save the first blurred frame as a comparison frame and compare against it. But then if the camera shifted or lighting changed, all the keys would trigger and stay triggered for good, and you would have to restart the software. Not good for something you want to be able to leave unattended.

So I went with a more complicated algorithm:

  • Save the first blurred frame as the comparison frame
  • Check whether the key area has changed (beyond the threshold) and then stayed unchanged (according to the threshold) for five seconds, sampling once per second. If so, we have a new comparison frame.

The result is that if you shift the camera or lights change, then if you just leave the motion piano alone for five seconds, it will automatically readjust for the change. It also does mean that if you keep your hand on a key unmoved for five seconds, it will assume the hand is part of the background and will trigger when you remove it. But five seconds is a long time to keep a hand unmoved in the air, and it will reset again if left alone.

Step 4: Code: Initialization

We start with some usual imports and then set up parameters for the music and image recognition.

import time, cv2
import numpy as np
import rtmidi

NOTES = [ 60, 62, 64, 65, 67, 69, 71, 72, 74 ]
WINDOW_NAME = "MotionPiano"
KEY_HEIGHT = 0.25
RECOGNIZER_WIDTH = 320
KERNEL_SIZE = 0.05
THRESHOLD = 25
NOTE_VELOCITY = 127
RESET_TIME = 5
SAVE_CHECK_TIME = 1
COMPARISON_VALUE = 128

The NOTES are MIDI note values C4 to D5. You can add more from this list if you like. The KEY_HEIGHT of 0.25 specifies that the height of each piano key is 25% of the display height.
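If you want to extend NOTES, it helps to see which names the MIDI numbers correspond to. Here is a small helper, not part of the piano code (noteName and NAMES are my own names), that converts a MIDI note number to a note name:

```python
# Hypothetical helper (not in the original code): convert a MIDI note
# number to a name, to make extending the NOTES list easier.
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def noteName(midi):
    # MIDI note 60 is C4; each octave spans 12 semitones
    octave = midi // 12 - 1
    return NAMES[midi % 12] + str(octave)

print([noteName(n) for n in [60, 62, 64, 65, 67, 69, 71, 72, 74]])
# ['C4', 'D4', 'E4', 'F4', 'G4', 'A4', 'B4', 'C5', 'D5']
```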

The RECOGNIZER_WIDTH is what the camera frame gets scaled to before motion recognition. The scaling is useful on a slower computer or with a higher resolution camera. KERNEL_SIZE controls the amount of Gaussian blurring and THRESHOLD is the brightness change that is needed to trigger a key (or saving of a new comparison frame). The RESET_TIME and SAVE_CHECK_TIME are parameters for the comparison frame updating algorithm: by default a new comparison frame must remain unchanged (within THRESHOLD) for five seconds, with sampling every second.

Finally, COMPARISON_VALUE is an internal value used by the key recognition algorithm.

Next, we prepare the music output and the video capture:

numKeys = len(NOTES)
playing = numKeys * [False]

midiout = rtmidi.MidiOut()
portNumber = 0 if len(midiout.get_ports()) == 1 or 'through' not in str(midiout.get_ports()[0]).lower() else 1
midiout.open_port(portNumber)

video = cv2.VideoCapture(0)
frameWidth = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
frameHeight = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))

The playing array stores which keys are currently depressed. For good piano sound, we don't want to send a new note-on to the music engine every frame that a key is down, but only when the key is first pressed. We use the first MIDI port on the computer that does not have 'through' in its name (this works better with Timidity on my Raspberry Pi 3B+). Finally, we initialize video capture and make a note of the dimensions of the incoming camera frames.
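The port-picking one-liner is dense, so here is the same rule factored into a hypothetical helper (pickPort is my name, not part of the original code) that you can try on plain lists of port names:

```python
def pickPort(portNames):
    # Use port 0 unless there are multiple ports and port 0 looks like
    # a MIDI "through" port (which doesn't produce sound by itself).
    if len(portNames) == 1 or 'through' not in portNames[0].lower():
        return 0
    return 1

print(pickPort(['Midi Through 14:0', 'TiMidity port 0']))  # 1
print(pickPort(['USB MIDI Interface']))                    # 0
```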

Next, we calculate some scaled dimensions, since we're going to scale each frame to the RECOGNIZER_WIDTH, though only if the camera frame width is bigger than RECOGNIZER_WIDTH.

if RECOGNIZER_WIDTH >= frameWidth:
    scaledWidth = frameWidth
    scaledHeight = frameHeight
else:
    aspect = frameWidth / frameHeight
    scaledWidth = RECOGNIZER_WIDTH
    scaledHeight = int(RECOGNIZER_WIDTH / aspect)
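To see what these dimensions work out to, here is the same only-downscale logic as a small function (320 is just an assumed RECOGNIZER_WIDTH; scaledSize is my own name):

```python
def scaledSize(frameWidth, frameHeight, recognizerWidth):
    # Only downscale, never upscale, preserving the aspect ratio
    if recognizerWidth >= frameWidth:
        return frameWidth, frameHeight
    aspect = frameWidth / frameHeight
    return recognizerWidth, int(recognizerWidth / aspect)

print(scaledSize(640, 480, 320))  # (320, 240)
print(scaledSize(320, 240, 320))  # camera already small: (320, 240)
```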

We now do some simple initialization.

kernelSize = 2*int(KERNEL_SIZE*scaledWidth/2)+1

blankOverlay = np.zeros((frameHeight,frameWidth,3),dtype=np.uint8)

cv2.namedWindow(WINDOW_NAME, cv2.WINDOW_NORMAL)
cv2.resizeWindow(WINDOW_NAME, frameWidth, frameHeight)

The kernelSize is for the Gaussian blurring. The blankOverlay is a rectangle we will use for drawing keys on the screen. And we set the display window to the same resolution as the camera frame (my more complicated code allows these two sizes to come apart).
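A quick check on the kernel formula: 2*int(x)+1 is always odd, which is what cv2.GaussianBlur requires. For example, with an assumed KERNEL_SIZE of 0.05 and a scaled width of 320 (the function name is mine, for illustration):

```python
def gaussianKernelSize(sizeFraction, scaledWidth):
    # 2*int(...)+1 guarantees an odd kernel size, as cv2.GaussianBlur requires
    return 2*int(sizeFraction*scaledWidth/2)+1

print(gaussianKernelSize(0.05, 320))  # 17
```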

We now set up the coordinates for the piano keys. These are stored in the scaledRects list for coordinates for the recognizer-scaled keys and in frameRects for display/camera-scaled keys. We are careful with rounding to avoid pixel overlap, and we also save various overall coordinates of the key area:

scaledRects = []
frameRects = []

for i in range(numKeys):
    x0 = scaledWidth*i//numKeys
    x1 = scaledWidth*(i+1)//numKeys-1

    r = [(x0,0),(x1,int(KEY_HEIGHT*scaledHeight))]
    scaledRects.append(r)

    x0 = frameWidth*i//numKeys
    x1 = frameWidth*(i+1)//numKeys-1

    r = [(x0,0),(x1,int(KEY_HEIGHT*frameHeight))]
    frameRects.append(r)

keysTopLeftFrame = (min(r[0][0] for r in frameRects),min(r[0][1] for r in frameRects))
keysBottomRightFrame = (max(r[1][0] for r in frameRects),max(r[1][1] for r in frameRects))

keysTopLeftScaled = (min(r[0][0] for r in scaledRects),min(r[0][1] for r in scaledRects))
keysBottomRightScaled = (max(r[1][0] for r in scaledRects),max(r[1][1] for r in scaledRects))
keysWidthScaled = keysBottomRightScaled[0]-keysTopLeftScaled[0]
keysHeightScaled = keysBottomRightScaled[1]-keysTopLeftScaled[1]

Now, we do something a bit strange:

keys = np.zeros((keysHeightScaled,keysWidthScaled),dtype=np.uint8)

def adjustToKeys(xy):
    return (xy[0]-keysTopLeftScaled[0],xy[1]-keysTopLeftScaled[1])
for i in range(numKeys):
    r = scaledRects[i]
    cv2.rectangle(keys, adjustToKeys(r[0]), adjustToKeys(r[1]), i+1, cv2.FILLED)

Here, we have set up an 8-bit grayscale bitmap called keys matching the recognizer dimensions of the key area, with each key's rectangle filled with the gray value i+1, i.e., values 1 through numKeys. This will be used to figure out which keys are being pressed.
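To see why this label bitmap is useful, here is a toy 1×6 version of the trick (not from the actual code): two keys labeled 1 and 2, with motion detected only over the second one:

```python
import numpy as np

# Two "keys" labeled 1 and 2 in a 1x6 strip
keys = np.array([[1, 1, 1, 2, 2, 2]], dtype=np.uint8)
# Motion (value 128) detected only over the second key
delta = np.array([[0, 0, 0, 128, 128, 0]], dtype=np.uint8)
sum_ = keys + delta
# Key i is "pressed" exactly when (i+1)+128 appears somewhere in the sum
print(1+128 in sum_, 2+128 in sum_)  # False True
```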

We now initialize the running data:

comparisonFrame = None
savedFrame = None
savedTime = 0
lastCheckTime = 0

The comparisonFrame is the blurred frame we compare the current blurred frame to. The savedFrame is used for updating the comparisonFrame. And we haven't saved anything or checked anything yet.

Step 5: Code: Main Loop

We start by making a simple threshold comparison function using OpenCV:

def compare(a,b):
    return cv2.threshold(cv2.absdiff(a, b), THRESHOLD, COMPARISON_VALUE, cv2.THRESH_BINARY)[1]

This takes two equally-sized bitmaps a and b, and creates a new bitmap which has value 0 wherever the inputs are similar (relative to the threshold), and has value COMPARISON_VALUE (which, recall, was 128) wherever the inputs are different.
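For readers without OpenCV handy, here is a numpy-only sketch of what compare() computes (the THRESHOLD value here is an assumption, and compareNumpy is my own name):

```python
import numpy as np

THRESHOLD, COMPARISON_VALUE = 25, 128  # assumed values

def compareNumpy(a, b):
    # numpy equivalent of cv2.absdiff + cv2.threshold(THRESH_BINARY):
    # 0 where |a-b| <= THRESHOLD, COMPARISON_VALUE where it exceeds it
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return np.where(diff > THRESHOLD, COMPARISON_VALUE, 0).astype(np.uint8)

a = np.array([[10, 200]], dtype=np.uint8)
b = np.array([[12, 100]], dtype=np.uint8)
print(compareNumpy(a, b))  # [[  0 128]]
```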

And now on to the main loop:

while True:
    ok, frame = video.read()
    if not ok:
        time.sleep(0.01)
        continue
    frame = cv2.flip(frame, 1)

We read a frame. If there wasn't one available, we pause a little and try again. Once we have the frame, we mirror-image it as that makes it look more natural for a motion piano.

We do initial processing on the frame:

    keysFrame = frame[keysTopLeftFrame[1]:keysBottomRightFrame[1], keysTopLeftFrame[0]:keysBottomRightFrame[0]]
    if scaledWidth != frameWidth:
        keysFrame = cv2.resize(keysFrame, (keysWidthScaled,keysHeightScaled))
    keysFrame = cv2.cvtColor(keysFrame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(keysFrame, (kernelSize, kernelSize), 0)

The first thing we did was to extract a smaller bitmap that covers just the piano key area. Of course, we want our code to process as small a part of the image as possible to save CPU time. We rescale the bitmap to the recognizer dimensions if necessary. Then we convert to grayscale and blur it.

If all we want is the simple algorithm where we compare everything to the initial frame, we could just do this:

    if comparisonFrame is None:
        comparisonFrame = blurred
        continue

I.e., if we don't yet have a comparison frame, make one, and then go back and wait for a new frame. But since I want to reset the comparison frame due to changed conditions, before doing the above, I do this somewhat complicated bit of code:

    t = time.time()
    save = False
    if savedFrame is None:
        save = True
        lastCheckTime = t
    elif t >= lastCheckTime + SAVE_CHECK_TIME:
        if COMPARISON_VALUE in compare(savedFrame, blurred):
            save = True
        lastCheckTime = t
    if t >= savedTime + RESET_TIME:
        comparisonFrame = blurred
        save = True
    if save:
        savedFrame = blurred
        savedTime = t

The idea, again, is that the current blurred frame needs to match the savedFrame for five seconds (sampled every second, rather than every frame, to save CPU time) before resetting the comparisonFrame to the current blurred frame. I check for mismatch by seeing if COMPARISON_VALUE occurs inside the frame returned by compare().
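To convince yourself the timing logic works, you can simulate it in miniature, using single integers as stand-in "frames" (two frames "differ" when the values differ) and whole seconds as timestamps. This is my own sketch of the logic, not code from the project:

```python
RESET_TIME, SAVE_CHECK_TIME = 5, 1

savedFrame, savedTime, lastCheckTime, comparisonFrame = None, 0, 0, None

def step(t, frame):
    global savedFrame, savedTime, lastCheckTime, comparisonFrame
    save = False
    if savedFrame is None:
        save = True          # nothing saved yet: save now
        lastCheckTime = t
    elif t >= lastCheckTime + SAVE_CHECK_TIME:
        if frame != savedFrame:
            save = True      # scene changed: restart the clock
        lastCheckTime = t
    if t >= savedTime + RESET_TIME:
        comparisonFrame = frame  # unchanged long enough: new comparison frame
        save = True
    if save:
        savedFrame = frame
        savedTime = t

# Scene value 7 held steady from t=0; by t=5 it becomes the comparison frame
for t in range(6):
    step(t, 7)
print(comparisonFrame)  # 7
```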

We now go to the heart of the key detection code, which is very simple:

    delta = compare(comparisonFrame, blurred)
    sum = keys+delta

This creates an 8-bit image called delta that is equal to COMPARISON_VALUE (=128) wherever comparisonFrame differs significantly from the current frame (both frames are blurred, remember) and 0 where there is no significant difference. Now, recall that mysterious keys bitmap, which had gray-scale values 1,2,...,numKeys? We add that bitmap to delta. The result is a bitmap that has gray-scale value COMPARISON_VALUE+1 wherever there is motion detected in the area of the first key, COMPARISON_VALUE+2 if there is motion in the area of the second key, and so on. This may not be the most efficient way of detecting motion in a rectangular area, but it very easily generalizes to arbitrary shapes if one wants.

The screenshot above (taken from running in DEBUG=True mode) shows the acquired grayscale, blurred and sum images (the delta image looks almost exactly the same as the sum image).

We now use this to act on the keys pressed:

    overlay = blankOverlay.copy()

    for i in range(numKeys):
        r = frameRects[i]
        if 1+i+COMPARISON_VALUE in sum:
            cv2.rectangle(overlay, r[0], r[1], (255,255,255), cv2.FILLED)
            if not playing[i]:
                midiout.send_message([0x90, NOTES[i], NOTE_VELOCITY])
                playing[i] = True
        elif playing[i]:
            midiout.send_message([0x80, NOTES[i], 0])
            playing[i] = False
        cv2.rectangle(overlay, r[0], r[1], (0,255,0), 2)

First, we create a copy of our key-area overlay. Then we go through all the piano keys. If one of them is pressed down (i.e., 1+i+COMPARISON_VALUE occurs in the sum bitmap), we highlight the key. Additionally, if a key is pressed and isn't playing, we play it, and if it's released and is playing, we stop playing it. Next, we draw a green outline around each key.
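The MIDI messages themselves are just three-byte lists: a status byte (0x90 for note-on, 0x80 for note-off on channel 1), the note number, and the velocity. A small sketch (the helper names and the NOTE_VELOCITY value are my own) makes that concrete:

```python
NOTE_VELOCITY = 127  # assumed value

def noteOn(note):
    # 0x90 = note-on, channel 1
    return [0x90, note, NOTE_VELOCITY]

def noteOff(note):
    # 0x80 = note-off, channel 1 (velocity 0)
    return [0x80, note, 0]

print(noteOn(60), noteOff(60))  # [144, 60, 127] [128, 60, 0]
```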

Finally, we alpha-blend the overlay over the camera frame, display it, and exit if the ESC key has been pressed or the window has been closed.

    cv2.imshow(WINDOW_NAME, cv2.addWeighted(frame, 1, overlay, 0.25, 1.0))
    if (cv2.waitKey(1) & 0xFF) == 27 or cv2.getWindowProperty(WINDOW_NAME, 0) == -1:
        break

And that's it for the main loop. Once we're done with it, we do some cleanup:

video.release()
cv2.destroyAllWindows()
del midiout
Participated in the Hour of Code Speed Challenge