Introduction: Sound Localizing Mannequin Head With Kinect

Meet Margaret, a testing dummy for a driver fatigue monitoring system. She recently retired from her duties and found her way to our office space, and has since drawn the attention of those who think she's 'creepy.' In the interest of justice, I've given her the ability to face her accusers head-on; instead of seemingly following you with her soulless gaze, now she actually does so. The system uses the microphone array of a Microsoft Kinect and a servo to steer her in the direction of people talking near her.

Step 1: Theory

Calculating the Angle

When we hear something, unless that noise is directly in front of us, it reaches one ear before the other. Our brains perceive that arrival delay and convert it into a general direction the noise is coming from, allowing us to find the source. We can achieve the exact same kind of localization using a pair of microphones. Consider the diagram shown here, which contains a microphone pair and a sound source. Viewed from the top down, sound waves are circular, but if the distance to the source is large relative to the spacing between the microphones, then from the point of view of our sensors the wave is approximately planar. This is known as the far-field assumption, and it simplifies the geometry of our problem.

So assume the wavefront is a straight line. If the sound is coming from the right, it will hit microphone #2 at time t2 and microphone #1 at time t1. The distance d the sound traveled between hitting microphone #2 and microphone #1 is the time difference in detecting the sound multiplied by the speed of sound vs:

  • d = vs*(t1-t2) = vs*Δt

We can relate this distance to the distance d12 between the microphone pair and the angle θ from the pair to the sound source with the relation:

  • cos(θ) = d/d12 = vs*Δt/d12

Because we only have two microphones, there will be ambiguity in our calculation as to whether the sound source is in front of or behind us. In this system, we will assume that the sound source is in front of the pair and clamp the angle between 0 degrees (fully to the right of the pair) and 180 degrees (fully to the left).

Finally, we can solve for theta by taking the inverse cosine:

  • θ = acos( vs*Δt/d12 ), 0 <= θ <= π

To make the angle a bit more natural, we can subtract 90 degrees from theta, so that 0 degrees is directly in front of the pair and +/- 90 degrees is full left or full right. This turns our expression from the inverse cosine to the inverse sine.

  • cos(θ-π/2) = sin(θ) = d/d12 = vs*Δt/d12

  • θ = asin( vs*Δt/d12 ), -π/2 <= θ <= π/2

Finding the Delay

As you can see from the equation above, all we need to solve for the angle is the delay of the sound wave arriving at microphone one compared to microphone two; the speed of sound and the distance between the microphones are both fixed and known. To accomplish this, we first sample the audio signals at a frequency fs, converting them from analog to digital and storing the data for later use. We sample for a period of time known as the sampling window, which is long enough to capture distinguishable features of our sound wave. For example, our window could be the last half second's worth of audio data.

After obtaining the windowed audio signals, we find the delay between the two by computing their cross-correlation. To compute the cross-correlation, we hold the windowed signal from one microphone fixed and slide the second signal along the time axis, from all the way behind the first to all the way ahead of it. At each step along the slide we multiply each point in the fixed signal by its corresponding point in the sliding signal, then sum the results to get the correlation coefficient for that step. After completing the slide, the step with the highest correlation coefficient is where the two signals are most similar, and that step tells us how many samples n signal two is offset from signal one. If n is negative, signal two is lagging behind signal one; if it's positive, signal two is ahead; and if it's zero, the two are already aligned. We convert this sample offset to a time delay using our sampling frequency with the relation Δt = n/fs, thus (a code sketch of this procedure follows the equation below):

  • θ = asin( vs*n/(d12*fs) ), -π/2 <= θ <= π/2
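
To make this concrete, here is a minimal, self-contained C++ sketch of the procedure (not the project's actual code): a brute-force slide to find the best lag, followed by the lag-to-angle conversion above. The spacing d12 and sampling rate fs are parameters you would fill in from your own hardware, and the sign convention for the lag depends on which microphone you treat as "signal one."

#include <algorithm>
#include <cmath>
#include <vector>

// Slide sig2 against sig1 over every possible offset and return the lag (in
// samples) with the highest correlation coefficient. With this convention, a
// positive lag means sig2 is a delayed copy of sig1.
int estimate_lag(const std::vector<double>& sig1, const std::vector<double>& sig2)
{
    const int n = static_cast<int>(sig1.size());
    int best_lag = 0;
    double best_corr = -1e300;
    for (int lag = -(n - 1); lag <= n - 1; ++lag) {
        double corr = 0.0;
        for (int i = 0; i < n; ++i) {
            const int j = i + lag;
            if (j >= 0 && j < n)
                corr += sig1[i] * sig2[j];
        }
        if (corr > best_corr) { best_corr = corr; best_lag = lag; }
    }
    return best_lag;
}

// Convert a lag in samples to an arrival angle: theta = asin(vs*n/(d12*fs)),
// with the asin argument clamped to [-1, 1] in case of noisy estimates.
double lag_to_angle(int lag, double d12, double fs, double vs = 343.0)
{
    double x = vs * lag / (d12 * fs);
    x = std::max(-1.0, std::min(1.0, x));
    return std::asin(x);  // radians; -pi/2 is one side of the pair, +pi/2 the other
}

As a sanity check, with a 16 kHz sampling rate, a hypothetical spacing of 0.2 m, and a lag of 5 samples, lag_to_angle returns asin(343*5/(0.2*16000)) ≈ 0.57 rad, or about 32 degrees off center.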

Step 2: Components

Parts

  • Microsoft Kinect for Xbox 360, model 1414 or 1473. The Kinect has four microphones arranged in a linear array we will use.
  • Adapter to convert the Kinect's proprietary connector to USB + AC power
  • Raspberry Pi 2 or 3 running Raspbian Stretch. I originally tried to use a Pi 1 Model B+, but it was not powerful enough; it kept disconnecting from the Kinect.
  • The creepiest mannequin head you can find
  • An analog servo strong enough to turn your mannequin head
  • A 5V USB wall charger with enough amperage to power both the Pi and the servo, and with at least two ports (I used a 5A 3-port plug)
  • An extension cord with two outlets (one for the USB wall charger and the other for the Kinect AC adapter)
  • Two USB cables: a type-A to micro-USB cable to power the Pi and another to power the servo that you don't mind cutting up
  • A platform for everything to sit on and another smaller platform for the mannequin head. I used a plastic serving tray as the base and a plastic plate as the head platform. Both were from Walmart and only cost a few dollars
  • 4x #8-32 1/2" bolts and nuts to attach your servo to the larger platform
  • 2x M3 8mm bolts with washers (or whatever size you need to attach your servo horn to the smaller platform)
  • Two male-to-male jumper wires, one red and one black, and one female-to-male jumper wire
  • Adhesive backed Velcro strips
  • Electrical tape
  • Duct tape for cable management

Tools

  • Dremel with cutting wheel
  • Drill
  • 7/64", 11/16", and 5/16" drill bits
  • M3 tap (Optional, depending on your servo horn)
  • Screwdriver
  • Soldering iron with solder
  • Helping hands (optional)
  • Marker
  • Compass
  • Wire strippers
  • Multimeter (Optional)

PPE

  • Safety Glasses
  • Face Mask (for Dremeled plastic bits)

Step 3: Lower Platform Assembly

The first part we will make is the lower platform, which will hold our Kinect, servo, and all our electronics. To make the platform you will need:

  • Plastic Serving Tray
  • Servo
  • 4x #8-32 1/2" bolts with nuts
  • Dremel with Cutting Wheel
  • Screwdriver
  • Drill
  • 11/16" Drill Bit
  • Marker

How to Make

  1. Flip your tray upside down.
  2. Place your servo sideways near the back of the tray, ensure the output gear of the servo lies along the center line of the tray, then mark around the base of the servo.
  3. Using your dremel and cutting wheel, cut out the area you marked, then slide your servo into its slot.
  4. Mark the centers of the servo housing mounting holes on the tray, then remove the servo and drill out those holes with your 11/64" drill bit. It's very easy to crack thin plastic like this when drilling holes, so I find it much safer to run the drill in reverse and slowly whittle away the material. It's much slower than drilling the holes properly, but it ensures that there are no cracks.
  5. Place your servo back in the slot, then mount it to the tray with the #8-32 bolts and nuts.

Step 4: Head Platform Assembly

The next part we will make will be a platform to connect the mannequin head to the servo. To make the head platform you will need:

  • Plastic plate
  • Servo horn
  • 2x M3 8mm bolts with washers
  • Screwdriver
  • Drill
  • 7/64" and 5/16" drill bits
  • Compass
  • Dremel with cutting wheel

How to Make

  1. Set your compass to the radius of the base of your mannequin head.
  2. Use your compass to mark a circle centered on the center of the plate. This will be the actual size of our head platform.
  3. Use your dremel and cutting wheel to cut the smaller platform out of the plate.
  4. Drill out the center of your new platform with a 5/16" drill bit. This will give us access to the screw that mounts our servo horn to our servo. To give the platform stability as I drilled the hole, I put a spool of wire underneath it and drilled through the center of the spool.
  5. Line up your servo horn with the center of the platform and mark two holes to attach the horn to the platform. Make sure these mounting holes are far enough apart so there is room for your M3 bolt heads and washers.
  6. Drill out these marked holes with a 7/64" drill bit.
  7. The lower hole of my servo horn was smooth, i.e. it did not have the threads for the M3 bolt. Thus, I used my drill and an M3 tap to make the threads.
  8. Use the bolts and washers to attach the servo horn to the head platform.

Step 5: Servo Power Cable

Analog servos are typically powered with 4.8-6V. Since the Raspberry Pi is already going to be powered by 5V from USB, we will simplify our system by also powering the servo from USB. To do so we will need to modify a USB cable. To make the servo power cable you will need:

  • Spare USB cable with a type-A end (the kind that plugs into your computer)
  • One red and one black jumper wire
  • Soldering iron
  • Solder
  • Wire strippers
  • Electrical tape
  • Helping hands (optional)
  • Multimeter (optional)

How to Make

  1. Cut off the end of the cable opposite the USB type-A connector, then strip off a bit of the insulation to reveal the four inner wires. Cut off the shielding surrounding the exposed wires.
  2. Typically the USB cable will have four wires: two for data transmission and reception and two for power and ground. We are interested in power and ground, which are commonly red and black, respectively. Strip some of the insulation off of the red and black wires and cut off the green and white wires. If you're worried that you don't have the correct power and ground wires, you can plug your cable into your USB power adapter and check the output voltage with a multimeter.
  3. Next, cut one end off of your red and black jumper cables and strip some of the insulation off.
  4. Now, twist together the exposed black wires of your jumper and USB cables. Cross over the centers of the exposed wires and twist them around each other. Then, apply solder to the mated wires to hold them together. Helping hands will make this easier by holding your cables in place.
  5. Repeat step 4 for the red wires.
  6. Cover the exposed wiring with electrical tape, or heat shrink tubing if you're feeling fancy. These joints will be fragile since the wires are so small, so add a second layer of tape holding the jumper cables to the outer insulation of the USB cable. This will make the assembly more rigid and thus less likely to break from getting bent.

Step 6: Electronics Mounting

Finally, we will bring everything together, mounting our electronics and everything else to the lower platform. You will need:

  • Lower platform
  • Head platform
  • Mannequin head
  • Kinect with USB+AC adapter
  • USB power adapter
  • Extension cord
  • Micro USB cable
  • Servo power cable
  • Raspberry Pi
  • Male-to-Female jumper cable
  • Adhesive Velcro
  • Scissors

How to Make

  1. Mount the Pi to the bottom of the tray with Velcro.
  2. Attach the USB power adapter with Velcro.
  3. Plug servo and Pi into the USB power adapter.
  4. Connect pin 12 (GPIO18) of the Pi to the signal cable of the servo. It is the 6th pin down on the right.
  5. Snake your extension cord through the back handle of the tray and plug the USB power adapter into one side.
  6. Take the Kinect USB+AC adapter and plug the power adapter into the other side of the extension cord and the USB into the Pi.
  7. Snake the cord of the Kinect through the front handle of the tray and plug into the Kinect adapter.
  8. I used duct tape to hold the cables to the underside of the platform. This doesn't look the most elegant, but luckily all this is hidden.
  9. Flip the platform right-side up and use Velcro to mount the Kinect to the front of the platform.
  10. Use Velcro to mount the mannequin head to the head platform. Once everything is lined up, separate the two pieces again so we can access the servo horn mounting screw. Don't screw the horn onto the servo yet, though, as we need to make sure the servo is in its center position first so we can line everything up. We'll do this in a later step.

Step 7: Software and Algorithm

Overview

The software for this project is written in C++ and is integrated with Robot Operating System (ROS), a framework for writing robotics software. In ROS, the software for a system is broken up into a collection of programs called nodes, where each node implements a specific subsection of the system's functionality. Data is passed between nodes using a publish/subscribe method, where nodes that are producing the data publish it and nodes that consume the data subscribe to it. Decoupling the code in this manner allows system functionality to be easily expanded, and allows nodes to be shared between systems for quicker development.
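
As a rough illustration of the publish/subscribe pattern (not code from this project), a node publishing DOA estimates and another consuming them might look like the following roscpp sketch; the topic name and message type here are placeholders I chose for the example.

#include <ros/ros.h>
#include <std_msgs/Float64.h>

// Subscriber callback: runs whenever a new angle message arrives on the topic.
void angleCallback(const std_msgs::Float64::ConstPtr& msg)
{
    ROS_INFO("Received DOA estimate: %f rad", msg->data);
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "doa_pubsub_demo");
    ros::NodeHandle nh;

    // Producer side: advertise a topic and publish an estimate to it.
    ros::Publisher pub = nh.advertise<std_msgs::Float64>("kinect_doa/angle", 10);
    std_msgs::Float64 angle;
    angle.data = 0.0;
    pub.publish(angle);

    // Consumer side: subscribe to the same topic. Neither side needs to know
    // anything about the other beyond the topic name and message type.
    ros::Subscriber sub = nh.subscribe("kinect_doa/angle", 10, angleCallback);
    ros::spin();
    return 0;
}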

In this system, ROS is primarily used to separate the code calculating the direction of arrival (DOA) of the sound source from the code controlling the servo, allowing other projects to include the Kinect DOA estimation without including servo code they may not need or want. If you wish to look at the code itself, it can be found on GitHub:

https://github.com/raikaDial/kinect_doa

Kinect DOA Node

The kinect_doa node is the meat and bones of this system, doing basically everything interesting. Upon startup, it initializes the ROS node, making all the ROS magic possible, then uploads firmware to the Kinect so that the audio streams become available. It then spawns a new thread which opens the audio streams and starts reading in microphone data. The Kinect samples its four microphones at a frequency of 16 kHz each, so it is good to have the cross-correlation and the data collection in separate threads to avoid missing data due to computational load. Interfacing with the Kinect is accomplished using libfreenect, a popular open-source driver.

The collection thread executes a callback function whenever new data is received; it stores the data and determines when to trigger a DOA estimate. The data from each microphone is stored in a rolling buffer equal in length to our sampling window, which here is 8192 samples. This translates to computing the cross-correlation with roughly the past half second's worth of data, which I found through experimentation to be a good balance between performance and computational load. The DOA estimation is triggered every 4096 samples by signaling the main thread, so that consecutive cross-correlations overlap by 50%. Consider a case where there is no overlap, and you make a very quick noise that gets cut in half by the sampling window. Before and after your distinctive sound will likely be white noise, which can be hard to line up with the cross-correlation. Overlapping windows give us a more complete sample of the sound, increasing the reliability of our cross-correlation by giving us more distinct features to line up.
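
A plain C++ sketch of this buffering scheme (simplified to a single microphone, and not the project's actual code) might look like the following: each batch of samples from the collection thread is appended to a rolling window, and the estimation thread is signaled every half window.

#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

constexpr std::size_t WINDOW_SIZE  = 8192;  // ~0.5 s of audio at 16 kHz
constexpr std::size_t TRIGGER_STEP = 4096;  // estimate every half window -> 50% overlap

std::deque<int32_t> window;                 // rolling sample buffer
std::size_t samples_since_estimate = 0;
std::mutex mtx;
std::condition_variable doa_ready;          // the main thread waits on this

// Called from the audio collection thread whenever new samples arrive.
void on_new_samples(const int32_t* samples, std::size_t count)
{
    std::lock_guard<std::mutex> lock(mtx);
    for (std::size_t i = 0; i < count; ++i) {
        window.push_back(samples[i]);
        if (window.size() > WINDOW_SIZE)
            window.pop_front();             // drop the oldest sample
    }
    samples_since_estimate += count;
    if (samples_since_estimate >= TRIGGER_STEP && window.size() == WINDOW_SIZE) {
        samples_since_estimate = 0;
        doa_ready.notify_one();             // wake the main thread to run the estimate
    }
}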

The main thread waits for the signal from the collection thread, then computes the DOA estimate. First, though, it checks whether or not the captured waveforms are significantly different from white noise. Without this check, we would be computing our estimate four times a second regardless of whether there were interesting noises or not, and our mannequin head would be a spastic mess. The white noise detection algorithm used in this system is the first of the two listed here. We compute the ratio of the absolute integral of the derivative of our waveform to its absolute integral; for signals with high white-noise content this ratio is higher than for less noisy signals. By setting a threshold for this ratio separating noise from non-noise, we can trigger the cross-correlation only when appropriate. Of course, the threshold is something that has to be re-tuned every time the system is moved to a new environment.
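
In code, that check might look something like this sketch (the threshold value is purely illustrative and is exactly the number that gets re-tuned per environment):

#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if the window looks like white noise: the ratio of the absolute
// integral of the derivative to the absolute integral of the waveform is high.
bool looks_like_white_noise(const std::vector<double>& x, double threshold)
{
    double abs_deriv_sum = 0.0;  // ~ absolute integral of the derivative
    double abs_sum = 0.0;        // ~ absolute integral of the waveform
    for (std::size_t i = 1; i < x.size(); ++i) {
        abs_deriv_sum += std::fabs(x[i] - x[i - 1]);
        abs_sum += std::fabs(x[i]);
    }
    if (abs_sum == 0.0)
        return true;             // a silent buffer has no features worth correlating either
    return (abs_deriv_sum / abs_sum) > threshold;  // threshold tuned per environment
}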

Once determining that the waveforms contain significant non-noise content, the program proceeds with the cross-correlations. There are however three important optimizations built into these calculations:

  1. There are four microphones on the Kinect, meaning there are six total pairs of waveforms we can cross-correlate. However, if you look at the spatial arrangement of the microphone array, you can see that microphones 2, 3, and 4 are very close to each other. In fact, they are so close that due to the speed of sound and our sampling frequency the waveforms received at 2, 3, and 4 will be separated by at most one sample ahead or behind, which we can verify with the calculation maxlag = Δd*fs/vs, where Δd is the separation of the microphone pair, fs is the sampling frequency, and vs is the speed of sound. Thus, correlating pairs between these three is useless, and we only need to cross-correlate microphone 1 with 2, 3, and 4.
  2. Standard cross-correlation of audio signals is known to perform poorly in the presence of reverberations (echoes). A robust alternative is known as the generalized cross-correlation with phase transform (GCC-PHAT). This method boils down to applying a weighting function that amplifies peaks in the cross-correlation, making it easier to distinguish the original signal from echoes. I compared the performance of GCC-PHAT to the simple cross-correlation in a reverberation chamber (read: concrete bathroom being remodeled), and found GCC-PHAT to be 7 times more effective at estimating the correct angle.
  3. When performing the cross-correlation, we are taking the two signals, sliding one along the other, and at each step multiplying each point in our fixed signal by its corresponding point in our sliding signal. For two signals of length n, this results in n^2 computations. We could improve on this by performing the cross-correlation in the frequency domain instead, which involves a fast Fourier transform (n*log(n) calculations), multiplying each point in one transformed signal by the corresponding point in the other (n calculations), then performing an inverse Fourier transform to go back to the time domain (n*log(n) calculations), resulting in n + 2*n*log(n) calculations, fewer than n^2. However, this is still the naive approach. The microphones in our array are so close together and the speed of sound is so relatively slow that the audio waveforms will already be mostly aligned. Thus, we can window our cross-correlation to only consider offsets that are slightly ahead or behind. For microphones 1 and 4, the lag must fall between +/-12 samples, meaning for each cross-correlation we only need to perform 24*n calculations, resulting in computational savings when our waveforms are longer than 2900 samples.

This system leverages the minidsp library, which implements the GCC-PHAT algorithm with optimization 3.
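
To make optimizations 1 and 3 concrete, here is a short sketch (not minidsp's actual interface) of computing the maximum physically possible lag for a pair from its spacing, plus the median step described below; restricting the earlier brute-force cross-correlation to lags in [-max_lag, +max_lag] is all optimization 3 amounts to in the time domain.

#include <algorithm>
#include <cmath>

// Optimization 1: the largest lag (in samples) a pair separated by spacing_m
// can ever see is maxlag = spacing * fs / vs, rounded up.
int max_lag_for_pair(double spacing_m, double fs_hz, double vs_mps = 343.0)
{
    return static_cast<int>(std::ceil(spacing_m * fs_hz / vs_mps));
}

// The final lag estimate is the median of the three pairs formed with mic 1.
int median_of_three(int a, int b, int c)
{
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

For example, at fs = 16 kHz a hypothetical pair only 2 cm apart gives max_lag_for_pair(0.02, 16000) = 1, which is why correlating microphones 2, 3, and 4 against each other carries almost no directional information.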

After finding the lag in the signals from each microphone pair, the program chooses the median lag, uses it to compute the estimated angle, and publishes the result so it can be used to control the servo.

Servo Control Node

Compared to the kinect_doa node, the servo node is relatively simple. Its sole job is to take the estimated DOA and move the servo to that angle. It uses the wiringPi library to access the hardware PWM module of the Raspberry Pi, using it to set the angle of the servo. Most analog servos are controlled by a PWM signal with a pulse width ranging from 1000 µs to 2000 µs, corresponding to an angle of 0° to 180°, but the servo I used was controlled with 500 µs to 2500 µs, corresponding to an angle of 0° to 270°. Thus, the node is configurable for different servo hardware by setting parameters for the minimum pulse width, maximum pulse width, and the difference between the maximum and minimum angles. Additionally, the servo doesn't immediately move to the target angle, but rather moves toward it at a configurable speed, giving Margaret a more gradual, creepy vibe (plus, the sound of a servo moving quickly back and forth gets annoying really fast).
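
As a rough sketch of what that mapping looks like with wiringPi's hardware PWM (not the node's actual code; the clock and range values mirror the gpio commands used in Step 8, and the pulse-width and angle limits are the configurable parameters just mentioned):

#include <wiringPi.h>

// With the Pi's 19.2 MHz base clock, a divisor of 192 and a range of 2000 give
// a 50 Hz signal where each count of the range is 10 us of pulse width.
const int    SERVO_PIN = 18;     // BCM GPIO18, physical pin 12 (hardware PWM)
const double MIN_US    = 500.0;  // pulse width at 0 degrees   (1000 for most servos)
const double MAX_US    = 2500.0; // pulse width at max angle   (2000 for most servos)
const double MAX_DEG   = 270.0;  // servo travel               (180 for most servos)

void set_servo_angle(double deg)
{
    if (deg < 0.0) deg = 0.0;
    if (deg > MAX_DEG) deg = MAX_DEG;
    double pulse_us = MIN_US + (MAX_US - MIN_US) * deg / MAX_DEG;
    pwmWrite(SERVO_PIN, static_cast<int>(pulse_us / 10.0));  // 10 us per count
}

int main()
{
    wiringPiSetupGpio();             // use BCM pin numbering
    pinMode(SERVO_PIN, PWM_OUTPUT);
    pwmSetMode(PWM_MODE_MS);         // mark/space mode for a steady pulse train
    pwmSetClock(192);                // 19.2 MHz / 192 = 100 kHz counter
    pwmSetRange(2000);               // 100 kHz / 2000 = 50 Hz PWM period
    set_servo_angle(MAX_DEG / 2.0);  // center the head
    return 0;
}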

Step 8: Build and Installation

Install Dependencies:

First, install libfreenect. We have to build it from source because the version you can get with the package manager doesn't include support for audio. This is because we must upload firmware to the Kinect to enable audio, and redistributing this firmware is not legal in certain jurisdictions. Additionally, we can avoid building the examples which require OpenGL and glut, unnecessary for headless Raspbian installations.

sudo apt-get install git cmake build-essential libusb-1.0-0-dev
cd
git clone https://github.com/OpenKinect/libfreenect

cd libfreenect
mkdir build
cd build
cmake .. -DCMAKE_BUILD_REDIST_PACKAGE=OFF -DCMAKE_BUILD_EXAMPLES=OFF
make
sudo make install
sudo cp ~/libfreenect/platform/linux/udev/51-kinect.rules /etc/udev/rules.d
udevadm control --reload-rules && udevadm trigger

Next, we need to install the wiringPi package, which allows us to control the GPIO pins of the Pi:

cd
git clone git://git.drogon.net/wiringPi
cd ~/wiringPi
./build

Attach Mannequin Head:

With wiringPi installed we can now take a quick detour back to hardware-land to attach the mannequin head onto the lower platform. To center the servo via the command line, enter the following commands:

gpio pwm-ms
gpio pwmc 192
gpio pwmr 2000
gpio -g pwm 18 150

These commands configure the Pi's hardware PWM for a 50 Hz servo signal (the 19.2 MHz base clock divided by 192, with a range of 2000), so each count of the range corresponds to 10 µs of pulse width; a value of 150 therefore produces a 1500 µs pulse, the center position. If there is no movement, then your servo is probably already centered. To be sure, though, you could set the servo to a non-center value, e.g. gpio -g pwm 18 200, then set it back to 150.

Once you're sure the servo is centered, attach the servo horn of the head platform to the servo such that your mannequin head will be looking straight forward. Then, screw the horn onto the servo and attach your head via the Velcro bits.

Install ROS:

Next, install ROS on your Pi. A great install guide can be found here; for our system we don't need OpenCV, so you can skip step 3. This build will take several hours to complete. When finished following the install guide, add sourcing the installation to your bashrc so that we can use our newly installed ROS packages:

echo "source /opt/ros/kinetic/setup.bash" >> ~/.bashrc

Build Kinect DOA Package:

After all that is done, make a catkin workspace for our project and enter the src directory:

mkdir -p ~/kinect_doa_ws/src
cd ~/kinect_doa_ws/src

The code for this project is contained in the kinect_doa package, so clone it into the src directory of your new workspace:

git clone https://github.com/rykerDial/kinect_doa

The robot_upstart package provides an easy to use tool for installing launch files so that they run at startup, so also clone this into your workspace:

git clone https://github.com/clearpathrobotics/robot_upstart

Now, we can build the project code by calling catkin_make from the top level directory of our workspace, then source our build so our packages are available:

cd ~/kinect_doa_ws
catkin_make
echo "source /home/pi/kinect_doa_ws/devel/setup.bash" >> ~/.bashrc

Running and Tuning:

Assuming everything is plugged in and powered on, you should now be able to launch the system and have the Kinect track your voice! However, if you have a Kinect model 1473, first open the file ~/kinect_doa_ws/src/kinect_doa/launch/kinect_doa.launch in a text editor and set the parameter using_kinect_1473 to true. Additionally, if you used a different servo than I did, it is probably a standard analog servo, so while in the launch file change the parameter min_us to 1000, max_us to 2000, and max_deg to 180.

roslaunch kinect_doa kinect_doa.launch

Play around with it for a while. If you feel the system is too sensitive (looking in random directions that don't correspond to voices or distinctive noises), try changing the white_noise_ratio parameter in the launch file and relaunching the system until the responsiveness is at a level you're comfortable with. Raising the ratio will make the system less responsive and vice versa. You will likely have to perform this tuning whenever you move the system to a different location to get the performance you want.

To launch the program when we power on the Pi, we use the robot_upstart package to install our launch file. If ROS isn't currently running, start it with the command roscore. Then, open up a new terminal and install the launch with:

rosrun robot_upstart install kinect_doa/launch/kinect_doa.launch --user root --symlink

We create a symlink to the launch file instead of copying it so that we can change parameters by editing ~/kinect_doa_ws/src/kinect_doa/launch/kinect_doa.launch.

Step 9: Hiding It at the Office

Now for the fun part. Head into work after hours and set your mannequin head up in secret. Then just sit back and see how long it takes for your co-workers to catch on! Your new creation is guaranteed to turn a few heads...
