Imagine a conference call where several speakers sit around a single camera. Because the camera's field of view is limited, it often fails to capture the person who is speaking. This hardware limitation significantly degrades the user experience. If the camera could turn toward active speakers, the remote audience would be more engaged in the conversation. In this project, we propose a prototype camera system that detects the active speaker and follows them by rotating the camera. The system combines visual and audio-based approaches. When faces are detected in the camera frame, it determines which person is speaking and calculates the angle to rotate. When no face is detected at the current angle, the system searches for the speaker based on the direction of arrival of the audio signals.

Step 1: Materials

Adafruit Feather nRF52840 Express X 1

https://www.adafruit.com/product/4062

Electret Microphone Amplifier - MAX4466 X 2

https://www.adafruit.com/product/1063

Micro Servo Motor X 1

https://www.adafruit.com/product/169

Android smartphone X 1

Step 2: Hardware - 3D Printing

To build the prototype quickly, we 3D-printed the enclosures we needed. There are two main parts: a turntable and a smartphone stand. We used the turntable from this link (https://www.thingiverse.com/thing:141287), which provides an Arduino case at the bottom and a rotating table that can be connected to a servo motor. We used the smartphone stand from this link (https://www.thingiverse.com/thing:2673050), which is foldable and angle-adjustable, so we can calibrate the angle conveniently. The figure below shows the 3D-printed parts assembled together.

Step 3: Hardware - Electronic Components

There are four wired components: the Adafruit Feather, two microphones, and a motor. To keep the packaging compact, we soldered the wires directly (gray circles) instead of using a breadboard. The circuit diagram and the assembled hardware are shown below.

Step 4: Software

Our system primarily uses visual information from face detection to follow the speaker, since it is more accurate than the audio. To pass the visual information from the Android app to the Feather, we use Bluetooth Low Energy as the main communication method.

When any face is detected, the app calculates the angle the motor needs to rotate to bring the speaker to the center of the frame. We broke the possible scenarios down and handled them as follows:

If one or more faces are detected and speaking, the app calculates the midpoint of the speakers and sends the relative angle to the Feather.
If one or more faces are detected but none of them is speaking, the app calculates the midpoint of the faces and sends the angle accordingly.
If no face is detected, the system switches its speaker-tracking logic from visual to audio.
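
The three scenarios can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the code from the app. The input format, the function name target_angle, the 60-degree horizontal field of view, the pinhole-camera mapping from pixels to degrees, and the choice of "midpoint" as the center of the span between the outermost faces are all assumptions.

```python
import math

def target_angle(faces, frame_width, horizontal_fov_deg=60.0):
    """Return the relative pan angle in degrees, or None to fall back to audio.

    faces: list of (center_x_in_pixels, is_speaking) tuples (hypothetical format).
    horizontal_fov_deg: assumed camera field of view, not taken from the project.
    """
    if not faces:
        return None  # scenario 3: no face, switch to audio-based tracking

    speakers = [x for x, speaking in faces if speaking]
    # Scenario 1: use the speaking faces. Scenario 2: use all detected faces.
    xs = speakers if speakers else [x for x, _ in faces]

    # Assumed definition of "midpoint": center of the span of the chosen faces.
    midpoint = (min(xs) + max(xs)) / 2

    # Horizontal offset from the frame center, in the range [-0.5, 0.5].
    offset = midpoint / frame_width - 0.5

    # Pinhole-camera mapping from the pixel offset to an angle.
    half_fov = math.radians(horizontal_fov_deg / 2)
    return math.degrees(math.atan(2 * offset * math.tan(half_fov)))
```

A return value of None signals the caller to use the microphones instead. A negative angle means the target is left of the frame center, and a positive angle means it is to the right.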

SPACS software is located at https://github.com/yhoonkim/cse599h-fp.

Step 5: Software - Sound

Sound (YH)

To locate the source of incoming sound, we first tried using the time difference of arrival between the two microphones. However, it was less accurate than we expected: the sampling rate (~900 Hz) of the Arduino Leonardo on which we tested the sound signals was too slow to pick up the time difference between microphones 10 cm apart.
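
A quick back-of-the-envelope calculation shows why the time difference was out of reach. The speed of sound (343 m/s, roughly room temperature) is an assumption. The microphone spacing and sampling rate come from the description above.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: air at about 20 degrees Celsius
MIC_SPACING_M = 0.10        # 10 cm between the microphones
SAMPLING_RATE_HZ = 900.0    # approximate rate measured on the test board

# Largest possible delay: the sound arrives along the line through both microphones.
max_delay_s = MIC_SPACING_M / SPEED_OF_SOUND_M_S

# Time between two consecutive samples.
sample_period_s = 1.0 / SAMPLING_RATE_HZ

# Sampling rate at which the largest delay would span exactly one sample.
min_rate_hz = 1.0 / max_delay_s

print(f"largest possible delay: {max_delay_s * 1e6:.0f} microseconds")
print(f"sample period:          {sample_period_s * 1e6:.0f} microseconds")
print(f"rate for a one-sample delay: {min_rate_hz:.0f} Hz")
```

Under these assumptions the largest delay is about 0.29 ms, while samples are about 1.1 ms apart, so even the largest delay falls within a single sample period. Resolving several distinct directions would need a rate well above the roughly 3.4 kHz at which the largest delay equals one sample.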

We therefore switched to using the intensity difference between the two input sound signals. The Feather takes the two sound signals and processes them to detect where the sound is coming from. The processing consists of the following steps:

Take the inputs from the two microphones and subtract the offset to get the amplitudes of the signals.
Accumulate the absolute values of the amplitudes per microphone over 500 samples.
Save the difference of the accumulated values to a queue with 5 slots.
Return the sum of the queue entries as the final difference value.
Compare the final value with thresholds to decide where the sound came from.

We found the thresholds by plotting the final value under various conditions, including sound coming from the left and from the right. In addition to the thresholds for the final value, we set another threshold on the mean of the accumulated amplitudes from step 2 to filter out noise.
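
The five steps and the noise filter can be sketched as follows. This is a Python illustration of the logic rather than the firmware running on the Feather. The window size (500) and queue length (5) come from the steps above. The offset, both threshold values, the class name, and the sign convention (a louder left microphone means the sound came from the left) are hypothetical. Applying the noise threshold only to the final decision is also an assumption.

```python
from collections import deque

SAMPLES_PER_WINDOW = 500    # from the steps above
QUEUE_SLOTS = 5             # from the steps above
OFFSET = 512                # hypothetical: mid-scale reading of a 10-bit ADC
DIRECTION_THRESHOLD = 5000  # hypothetical: would be tuned from plotted data
NOISE_THRESHOLD = 10        # hypothetical: minimum mean amplitude per sample

class SoundDirection:
    def __init__(self):
        # Holds the most recent window differences; old entries drop out.
        self.diffs = deque(maxlen=QUEUE_SLOTS)

    def update(self, left_samples, right_samples):
        """Process one window of raw readings; return 'left', 'right', or None."""
        # Steps 1 and 2: remove the offset, accumulate absolute amplitudes.
        left = sum(abs(s - OFFSET) for s in left_samples)
        right = sum(abs(s - OFFSET) for s in right_samples)

        # Step 3: store the difference of the accumulated values.
        self.diffs.append(left - right)

        # Step 4: the final value is the sum over the queue.
        total = sum(self.diffs)

        # Noise filter: ignore windows whose mean amplitude is too small.
        mean_amplitude = (left + right) / (len(left_samples) + len(right_samples))
        if mean_amplitude < NOISE_THRESHOLD:
            return None

        # Step 5: compare the final value with the thresholds.
        if total > DIRECTION_THRESHOLD:
            return "left"
        if total < -DIRECTION_THRESHOLD:
            return "right"
        return None
```

Summing over the queue smooths the decision across several windows, so a single loud sample on one side is less likely to trigger a rotation.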

Step 6: Software - Face and Speaking Detection

For face detection, we used ML Kit for Firebase, released by Google (https://firebase.google.com/docs/ml-kit). ML Kit provides a face detection API that returns the bounding box of each face and its landmarks, including the eyes, nose, ears, cheeks, and several points on the mouth. Once faces are detected, the app tracks mouth movement to determine whether each person is speaking. We use a simple threshold-based approach, which yields reliable performance. It relies on the fact that the mouth moves more, both horizontally and vertically, when a person speaks. We calculate the vertical and horizontal distances of the mouth, normalize them to the size of the face, and compute the standard deviation of each distance. A larger standard deviation indicates speaking. The limitation of this approach is that any activity involving mouth movement, such as eating, drinking, or yawning, can be recognized as speaking. However, it has a low false negative rate.
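
The threshold test can be sketched in Python as below. This is an illustration and not the Android code. The per-frame input format, the threshold value, normalizing the mouth width by the face width and the mouth height by the face height, and treating either distance exceeding the threshold as speaking are all assumptions.

```python
from statistics import pstdev

SPEAKING_THRESHOLD = 0.02  # hypothetical value; would be tuned on recorded data

def is_speaking(frames, threshold=SPEAKING_THRESHOLD):
    """Decide whether a face is speaking from a short history of frames.

    frames: list of (mouth_width, mouth_height, face_width, face_height)
    tuples in pixels, one per recent video frame (hypothetical format).
    """
    if len(frames) < 2:
        return False  # not enough history to measure movement

    # Normalize the mouth distances by the face size so that the result
    # does not depend on how far the person is from the camera.
    widths = [mw / fw for mw, _, fw, _ in frames]
    heights = [mh / fh for _, mh, _, fh in frames]

    # A large spread in either distance is treated as speaking (assumption).
    return pstdev(widths) > threshold or pstdev(heights) > threshold
```

For example, a history in which the mouth height alternates between 10 and 40 pixels on a 250-pixel-tall face gives a standard deviation of 0.06, which exceeds the assumed threshold.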

Step 7: Software - Rotating Motor

Rotating the motor was not as straightforward as we expected because we needed to control the rotation speed. To control the speed, we declared a global counter variable that allows the motor to turn only when the counter reaches a certain value. We also declared another global variable indicating whether the motor is moving, so that the microphone processing can ignore the sound produced by the motor's rotation.
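
The counter idea can be sketched in Python as follows. This is an illustration of the pacing logic, not the firmware. It groups the two variables into a small class instead of globals. The step interval, the one-degree step size, the 0 to 180 degree range, and all names are assumptions.

```python
STEP_INTERVAL = 20  # hypothetical: main-loop iterations between one-degree steps

class ServoPacer:
    def __init__(self, angle=90):
        self.angle = angle      # current servo angle in degrees
        self.target = angle     # angle requested by the tracking logic
        self.counter = 0        # counts main-loop iterations
        self.is_moving = False  # lets the audio code skip motor noise

    def set_target(self, target):
        # Round to whole degrees and clamp to an assumed 0 to 180 range.
        self.target = max(0, min(180, int(round(target))))

    def tick(self):
        """Call once per main-loop iteration; returns the angle to send to the servo."""
        if self.angle == self.target:
            self.is_moving = False
            self.counter = 0
            return self.angle

        self.is_moving = True
        self.counter += 1
        # Advance by one degree only when the counter reaches the interval.
        if self.counter >= STEP_INTERVAL:
            self.counter = 0
            self.angle += 1 if self.target > self.angle else -1
        return self.angle
```

In this sketch, the audio processing would check is_moving before trusting a microphone window, and a larger STEP_INTERVAL gives a slower, smoother rotation.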

Step 8: Future Improvements

One limitation is that the motor wobbles at certain angles. It seems the motor is not powerful enough to overcome the torque generated by rotating the smartphone. This could be resolved by using a more powerful motor or by moving the smartphone closer to the center of rotation to reduce the torque.

Audio-based direction detection could be improved with a more sophisticated method. We would like to try an acoustic beamforming approach to determine the direction of the incoming sound. We tried using the time of arrival of the audio signals, but the sampling rate of the Feather is too low to detect the time difference when the microphones are only about 10 cm apart.

The final missing piece of this prototype is a usability evaluation. One promising approach is to integrate the system with an existing video call platform and observe users' responses. Those responses would help improve the system and guide the next iteration of this prototype.