About the scene: There are four people in an ordinary room, one of whom is initially outside Nao's field of view. They are placed in different configurations and at varying distances from the robot.
Key steps in detection:
1. Nao first detects the person using face recognition.
2. It then uses a novel sound source localization technique to detect the person who is speaking.
3. Finally, it rotates its head toward the detected person.
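
The steps above could be combined roughly as follows. This is only a minimal sketch, not the method used here: the novel localization technique is not described in this text, so the sketch simply assumes it yields a sound azimuth in radians, and that face recognition yields one bearing per visible face. The function names, the 20-degree matching tolerance, and the head yaw limit of about 119.5 degrees are all illustrative assumptions, and no robot API is called.

```python
import math

# Assumed head yaw joint limit in radians (illustrative value, check the robot's specs).
HEAD_YAW_LIMIT = math.radians(119.5)


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def pick_speaker(face_bearings, sound_azimuth, tolerance=math.radians(20)):
    """Return the index of the detected face closest to the sound azimuth.

    Returns None when no face lies within `tolerance` of the sound direction,
    for example when the speaker is outside the camera's field of view.
    """
    best_index, best_error = None, tolerance
    for index, bearing in enumerate(face_bearings):
        error = abs(wrap_angle(bearing - sound_azimuth))
        if error <= best_error:
            best_index, best_error = index, error
    return best_index


def head_yaw_target(face_bearings, sound_azimuth):
    """Compute a head yaw command, clamped to the assumed joint limit.

    If a visible face matches the sound direction, aim at that face;
    otherwise fall back to the raw sound azimuth.
    """
    index = pick_speaker(face_bearings, sound_azimuth)
    target = face_bearings[index] if index is not None else sound_azimuth
    return max(-HEAD_YAW_LIMIT, min(HEAD_YAW_LIMIT, wrap_angle(target)))


if __name__ == "__main__":
    faces = [-0.5, 0.0, 0.6]  # bearings of three visible people, in radians
    print(head_yaw_target(faces, 0.55))  # speaker matches the third face
    print(head_yaw_target(faces, 2.5))   # speaker out of view, yaw is clamped
```

Falling back to the sound azimuth when no face matches is one way to handle the person who starts outside the field of view: the head turns toward the sound first, after which face recognition can be run again.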