Introduction: Controlling Devices With Your Hands: Automated Filmmaking Rig

About: Self-taught maker of all sorts, digital + physical.



I'm always thinking of new ways I can interact with machines intuitively. Movies are often a great place to find inspiration for these kinds of effortless, futuristic interactions. This one is probably stolen from Iron Man!

In this project I built a motorised slider and zoom rig for my camera, and then used a Kinect sensor (and quite a bit of code) to create an interface between it and my hands, as well as other wired/wireless devices, like Philips Hue lamps.

You can flick lights on and off by pointing, or drag away from them to increase their brightness. You can position the camera along the rail by 'grabbing it' at a distance and you can call it towards/away from you with a flick. You can also precisely control the zoom with a 'stretching' gesture with both hands.

Tangible interaction semantics

I love interaction semantics that mimic real and relatable actions like pointing, grabbing, pulling and stretching. I'm a firm believer that the future lies in mixed reality interactions rather than anything purely virtual, because digital things that actually do something in the physical world make for far more pleasant and understandable UX design.

Everything you see here I learnt how to do on the internet. It's really a lot of different projects brought together, so rather than write an exhaustive guide to replicating exactly this, I think it's more useful to share my struggles and successes in a more general way so you can take these interactions somewhere else - there's so much exciting stuff left to be done with human-machine interfaces and UX/interaction design. These concepts can be applied to any device that can be hacked or accessed via an API, or to your own creations with things like Arduino.


» Making a super cheap slider and zoom rig

» Sockets as a hugely useful concept for connectivity

» Working with Kinect, working with Unity - tracking your skeleton and making basic 3D user interfaces

» Tying things together with NodeJS - a connectivity framework - creating an all-purpose hub for networking wireless IoT devices like Philips Hue and local devices like Arduino


In the future, I'd like to share the communication framework I built when I can maintain it. At the moment though, I'm not sure how user-friendly it is, and there is no guarantee it even runs on other platforms. There is really quite a lot to explain, so hopefully the guidance below is helpful.


For hand/body tracking

» Kinect V1* (Xbox 360) or Kinect V2* (Xbox One) or Azure Kinect

* For V1 and V2 you'll need to buy a PC adapter too since they were originally built for the Xbox.

Kinect sensors (at the moment) are probably the best window into skeleton tracking. They're also useful for things like 3D reconstruction because they have a depth camera; you can scan physical objects into a textured mesh.

As the versions go up, they become more expensive but more capable. I've used the Kinect V1 for a lot of projects and it can definitely do the job for interactions if you're creative about the control logic. It can detect the joints of two users' bodies in real time, but it can't detect hand gestures out of the box. I've got around this by using virtual cursors, and cues like checking whether the user has their hands raised, or the direction their body is pointing, to know when they want to interact.

For this project, I went with the Kinect V2, which has slightly improved features. Importantly, unlike the V1, it can natively recognise several hand states, like an open/closed fist and 'lasso' pointing. This is an incredibly useful step up, because I needed precise cues for actions like 'stop moving' or 'start grabbing'. I haven't yet got my hands on an Azure Kinect, but it seems to be a whole leap better.

For creating connected devices

There are so many options for development boards now so I'm out of my depth. The slider and zoom rig are running on a standard Arduino Mega being controlled over USB, but it would be very easy to make it wireless. I've found Particle boards quite intuitive to use for this kind of thing in the past, and they're fast enough for streaming data from sensors to webpages or sending control signals to motors over wifi.

For stepper motors

If you're looking for stepper motor drivers, search for '9-42V stepper driver' on Amazon. They've not failed on me yet, and I've used them in applications like a CNC where they've run for hours on end. They've got good passive cooling, are very versatile in terms of input voltage, and are easy to reuse between projects.

Step 1: Making a Super Cheap Slider and Zoom Rig

The slider and zoom rig is pretty much as low cost as you can go - it does the job and has survived a lot of knocks. It's a bit longer than most sliders need to be.


I used steel angle for the runners, 9mm MDF for the panels, and 3D printed rollers fitted around skate bearings to ride along them. Under the weight of the camera this system is steady, but if you want the smoothest motion you'd want a proper linear rail system.

Drive system

GT2 belt is relatively cheap now because everyone is making 3D printers, and if I made this again I'd go with it. However, back when I built this I wanted to experiment with an alternative that you can buy in longer sections - the standard bead chain used for window blinds. The drive gear you see in the first photo is a series of laser cut slats held in place by two discs, each slat notched so that it grabs the cord between the beads. It's a robust drive system that is perfect for larger scale mechatronics. The only thing I noticed is that at some slower speeds you get significant vibration (actually big standing waves) forming along the cord.

The zoom rig is even simpler. It uses a small timing belt wrapped around the zoom ring and a small MDF gear directly on the stepper motor's axle.

Power system

Batteries and power supplies can get expensive, so I went with a cordless method which was completely free. This slider was built for filming in a workshop so it made sense to make it run on drill batteries anyway. I took a battery and cloned the measurements of its stem. I then cut and sanded some steel strip from a torsion spring to create a pair of contacts that deflect around each contact of the battery. This method worked extremely well and has survived a long duty cycle without any issues. I reckon this is because I used spring steel. Just be sure to insulate between contacts and watch the battery voltage / add a voltage alarm.

Control signals

Both steppers for the slider and zoom are run from a single Arduino Mega, so in order to be able to zoom in and out whilst the slider is moving, you have to approach movements in a particular way. Each stepper motor relies on short digital pulses to indicate a step, but you can only do one thing at a time on the Arduino's processor. For a single stepper motor, you can get away with the standard Stepper library, but you'll find that as you add more and run them simultaneously, they slow each other down because the signals are blocking.

So the solution is to take turns, moving each motor by one increment and polling at high frequency to check whether it's time for another step. The easiest way to do this is the AccelStepper library. However, I only wanted constant speed movements, so I didn't need all of that wrapping. I created my own implementation that also uses direct port manipulation to speed things up a bit more. It's a useful technique when you're working with low level signals, because you can write to 8 pins simultaneously, which in this case lets me update both steppers in one command (slightly faster than digitalWrite).
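To make the take-turns idea concrete, here's a minimal sketch of the scheduling logic in Python. The real thing runs on the Arduino and pulses actual pins; the class, speeds and simulated clock here are illustrative only:

```python
class Stepper:
    """Constant-speed stepper driven by a shared polling loop - an
    illustrative model of the take-turns scheduling, not the actual
    Arduino sketch (which writes real step pulses to pins)."""
    def __init__(self, interval_us):
        self.interval_us = interval_us   # microseconds between step pulses
        self.next_due_us = 0
        self.position = 0
        self.target = 0

    def poll(self, now_us):
        # Non-blocking: pulse one step only when this motor is due.
        if self.position != self.target and now_us >= self.next_due_us:
            self.position += 1 if self.target > self.position else -1
            self.next_due_us = now_us + self.interval_us

# Two motors at different speeds, polled in turn by one loop -
# neither blocks the other, unlike back-to-back blocking moves.
slider = Stepper(interval_us=2000)   # 500 steps/s
zoom = Stepper(interval_us=5000)     # 200 steps/s
slider.target, zoom.target = 100, 40

for now_us in range(0, 500_000, 100):    # simulated clock, 100us polls
    slider.poll(now_us)
    zoom.poll(now_us)
```

Both motors appear to move simultaneously because each poll is nearly instant; the loop just visits them fast enough that neither misses its next due pulse.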

If you're working with the Mega 2560 and port manipulation, note that the pin-to-port mappings are different from the Uno's.

The slider has a limit switch at one end which, when contacted, triggers an interrupt that the Arduino will act on immediately, regardless of what it's doing, stopping the motor from moving too far. It allows me to define an extent for the travel along the slider in steps, and have a consistent reset point after powering off each time. On startup, the slider runs a homing sequence, letting the computer know when it's ready.

Serial commands

I've worked with GRBL quite a bit, and for more complex systems with multiple steppers this is a perfect solution. It abstracts all the stepper control signals and has a very robust command interface.

However, it was unnecessary here. I made it so that each axis, the slider and the zoom, is set with a position between 0 and 1. You send a value like "0.45s" or "0.8z" over serial to move to a specific position - in this case the slider to 0.45 or the zoom to 0.8 respectively. You can also send commands to start or stop moving, and the slider will continue moving in the direction of the furthest extent.
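A sketch of what parsing that command format might look like - illustrative Python, not the actual Arduino-side parser:

```python
def parse_command(cmd):
    """Parse a command like '0.45s' or '0.8z' into (axis, position).
    's' is the slider axis, 'z' is the zoom axis; positions are
    normalised to the 0-1 range."""
    cmd = cmd.strip()
    axis = cmd[-1]                       # last character selects the axis
    if axis not in ("s", "z"):
        raise ValueError(f"unknown axis: {axis!r}")
    position = float(cmd[:-1])           # the rest is the target position
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    return axis, position
```

Normalising to 0-1 keeps the host side ignorant of step counts; only the Arduino needs to know the travel extent found during homing.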

Step 2: Sockets: a Hugely Useful Concept for Connectivity

Learning about sockets unlocks a whole lot of programming capability for connected things. They are essentially a universal 'jumper wire' that can port data from one program or device to another at high speed. They're what underlies the internet - things like video calls wouldn't be possible without them. You could have microcontrollers sending joystick data to your web browser, a Raspberry Pi sending live video over the network, or two apps on your computer talking to each other, for example.

The great thing about sockets is that they are a universal construct, i.e. they have an implementation in most languages. If you try your best initially to ignore the jargon, they are actually pretty simple to work with, and once you've understood the general idea, they can be easily applied in any other language because everything conceptually (and even syntactically) is largely the same.

In this way they're also a great way to cheat with code. Originally, when I wanted to start working with Kinect, I found that the source code was written in C++, but I only knew Python at the time. So instead of trying to learn a whole new language, I realised I could take this C++ code, and figure out only how to include a basic socket implementation. This way I could grab the skeleton data from the C++ application with very little effort and send it over to Python, where it was far more manageable to deal with, but also crucially where it was faster for me to prototype high-level interaction logic. It also meant that any code I added to my program wouldn't slow down the underlying processing involved with skeleton tracking because they're being done in different processes.

How they work

Sockets allow you to exchange messages at high speed. These messages can contain text, pixels from an image or values from a sensor, for example - anything that can eventually be expressed as binary. The only two things you really need to establish in order to create a socket connection between two devices are the 'location' of each device and the 'language' they'll be using to communicate, since the data has to be encoded so it can be sent between devices.

The 'location', which is made up of a unique address and a port, allows you to target the specific device you want to talk to, as opposed to the many others that might also be listening on the network.

The 'language' is really made up of a protocol and an encoding. A protocol explains the procedure for sending and receiving messages, like "send a message and wait for me to reply" or "say stop at the end of each message". The encoding determines how you should recover the data from the coded binary messages after they are received.

The most common protocol in use is called TCP. In this type of socket, a connection is established between a server and a client. Very generally, the server is running in the background on the 'master' device, and is always listening for clients who might want to communicate. Clients with the right address can attempt to connect, and the server will deal with these new connections being made, as well as old connections being broken as they disconnect.

Once this kind of 'handshake' is successful, you can begin sending binary data back and forth. It's good practice to break information up into smaller messages or 'chunks', so the receiver can tell where one message ends and the next begins and check that each arrived intact.

For most things TCP will suffice. It's very reliable because it involves a few checks to ensure all the data comes through correctly. The alternative you might want to know about is UDP, which throws out some of that checking and takes a slightly different approach to binding client and server. It's great for when you need very low latency and don't mind if the occasional piece of data gets lost, like when streaming video.
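Here's a minimal TCP server/client pair using Python's standard socket module, with a newline as the end-of-message marker - one very simple 'protocol'. The message text is arbitrary and the port is picked automatically:

```python
import socket
import threading

HOST = "127.0.0.1"

# Bind and listen before starting the thread, so the client can't try
# to connect before the server is ready; port 0 asks the OS for a free port.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, 0))
srv.listen(1)
PORT = srv.getsockname()[1]

def serve_once():
    conn, _addr = srv.accept()                 # wait for one client
    with conn:
        data = b""
        while not data.endswith(b"\n"):        # our 'protocol': read to newline
            chunk = conn.recv(1024)
            if not chunk:
                break
            data += chunk
        conn.sendall(b"got: " + data)          # reply, then close
    srv.close()

server = threading.Thread(target=serve_once)
server.start()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall("hello from the tracker\n".encode("ascii"))  # the encoding step
    reply = cli.recv(1024).decode("ascii")                   # the decoding step

server.join()
```

Swap the client side into a C++ program and the server side into Python (or any other pairing) and you have exactly the cheat described above: two processes in different languages exchanging skeleton data.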

Step 3: Working With Kinect (and a Bit of Unity)


The best way to play around with the Kinect is to start with the examples in the SDK Browser, which comes with the SDK install; however, you will need to be happy working in C# or C++. I've had absolutely no luck with the third party bindings people have made for Python in the past, so as I mentioned previously, I originally got around this by taking a basic socket implementation and porting the data from C++ to Python. Eventually though, I did make the jump to C#, and it is a nicer way to work because you can handle things like locating the sensor and responding to connect/disconnect events.

The Kinect has multiple streams you might want to access, either all at once or individually. These are colour, infrared, depth and body frames (the ones containing tracked skeletons). The appeal of a sensor like this is that all the calibration has been done for you. This means you have a kind of universal map between the different frames: you can take a point in 3D space and find out what colour it is, or locate a shape in the colour frame and find out how far away it is. One thing I'd like to implement in the future is using your finger as a real-life colour picker / pipette tool: look at the body frame, find the position of the raised finger and map it into colour space to find the colour you're indicating.

The first animation shows the colourised depth view in Kinect Studio, with my skeleton overlaid. As I raise my hands and clench my fist or point my index and middle finger, it changes the colour of my hand to show the hand state it thinks I'm holding.


Recently I wanted to make another leap and learn a useful tool for 3D. Mapping the interaction space for this project would have been very cumbersome had I not been able to visualise it in Unity. It's quick to pick up, it abstracts the low level stuff nicely, and this was my first project with it. If you're not familiar with it, Unity is a game engine that can actually be used for a whole lot more, like architectural visualisation, simulations and interaction design. Importantly, there's a plugin made for the Kinect V2. It's labelled as exclusive to the Pro version because, back when it was released, you had to have a paid Unity license. That's no longer the case, and you can download both Unity and the plugin for free.

Even though Unity can be used without any coding, it's a bit tricky to circumvent the C# interface when you're working with Kinect. That said, if you are able to integrate the 'body source viewer' and 'body source manager' scripts that come with the plugin, really all you have to do is find a way to get the data you want out. The manager script handles the sensor connection and grabs 'frames' of skeleton information as they are ready so the viewer script can visualise the skeleton in Unity objects as you see in the second animation. In this test I use a flick gesture to create spheres at the location of the palm.

Interaction Zones

Also in the second animation I show the virtual 'zones' in which each interaction output is selected (left lamp, right lamp or slider). This was the benefit of using Unity - I could look at the location of the lights and slider in the room and quickly add some spheres to represent their regions of space. I then used colliders in my hands and these objects to find out what I was touching.

A note if you're implementing something similar: I found working with collider components initially quite cumbersome and unintuitive. You must add a Rigidbody component for colliders to work. If you want a moving collider, like the ones in my hands, it needs to be set to 'kinematic' too, and the collision detection mode should be 'Continuous Speculative'. You also need to move the Rigidbody using 'MovePosition()' in the 'FixedUpdate()' method, rather than simply editing its transform component in the 'Update()' method; the fixed update step is intended for smoothly animating physics components.


If there is one important thing I came across that's worth sharing because it's applicable to a lot of stuff you might build, it's the processing you need to do on the hand states. When I began streaming hand state data from the sensor (which is in the form of states like 'unknown', 'untracked', 'closed', 'open', 'lasso') I noticed that often as you clench your fist, or move your hand whilst using the lasso, it loses tracking for a few frames. If you are writing interaction logic for something like a drag command, this can make for quite erratic and fragile interaction cues.

In the same way that you 'debounce' a physical push button - cleaning up the signal so a single press registers as a simple 'on/off' instead of 'on/off/on/off' - you can average this state data over a number of frames to clean up the result. I did this by logging a set number of hand frames to a queue, a data structure where you add items at one end and remove them at the other (first in, first out). I created an averaging method that is especially sensitive to valid states like open/closed fist and lasso, and will throw out 'unknown' states unless they are really consistent.
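A minimal sketch of this kind of smoothing in Python, using a fixed-length queue; the window size and the bias against 'unknown' are illustrative choices, not the exact values I used:

```python
from collections import deque, Counter

class HandStateFilter:
    """Smooths noisy Kinect hand states over a sliding window, in the
    same spirit as debouncing a push button."""
    VALID = {"open", "closed", "lasso"}

    def __init__(self, window=5):
        self.frames = deque(maxlen=window)   # first in, first out

    def update(self, state):
        self.frames.append(state)
        counts = Counter(self.frames)
        # Prefer the most common *valid* state in the window; only
        # report 'unknown' if no valid state has been seen at all.
        valid = {s: n for s, n in counts.items() if s in self.VALID}
        if valid:
            return max(valid, key=valid.get)
        return "unknown"

f = HandStateFilter(window=5)
raw = ["closed", "closed", "unknown", "closed", "untracked", "closed"]
smoothed = [f.update(s) for s in raw]   # dropouts no longer flicker through
```

With this in place, a drag gesture keeps reporting 'closed' through the few frames where the sensor briefly loses the fist, so the interaction cue stays stable.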

Step 4: Tying Everything Together With NodeJS

Neuronal framework

In order to get all these different devices talking, I needed a communication framework. I didn't want to just churn out a quick hardcoded solution for this particular demo - that would really waste a lot of time. Instead, I've been thinking for a while about creating a universal toolkit for this kind of thing, something structured enough to network everything from mobile devices, serial devices and IoT devices to local apps and webpages: a digital-physical fabric.

The solution came by approaching it with biomimicry: considering all device inputs and outputs as neurons, and the reconfigurable connections specific to a particular demo as synapses - neural connections that can have multiple inputs/outputs. This meant I could create a generic 'entity' system made up of containers I call 'channels' that can handle any kind of device and its data stream, then configure the specifics when it comes to building out a demo. In this one, the Kinect is the main input neuron, and its activity triggers things like the slider moving, the camera zooming or the lights dimming.
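A stripped-down sketch of the channel/synapse idea (in Python for brevity - the real hub is NodeJS, and the names and structure here are simplified stand-ins, not my actual framework):

```python
class Channel:
    """A generic 'neuron': wraps one device input or output and routes
    its activity to any number of downstream connections."""
    def __init__(self, name):
        self.name = name
        self.synapses = []          # downstream handlers ('synapses')

    def connect(self, handler):
        # A synapse routes this channel's activity onward, optionally
        # transforming the data on the way.
        self.synapses.append(handler)

    def fire(self, data):
        for handler in self.synapses:
            handler(data)

# Wiring for one demo: the Kinect channel is the input neuron; the
# slider and a lamp are outputs. In the real system these handlers
# would write to serial or send an HTTP request.
log = []
kinect = Channel("kinect")
kinect.connect(lambda d: log.append(("slider", d["slider_pos"])))
kinect.connect(lambda d: log.append(("lamp", d["brightness"])))
kinect.fire({"slider_pos": 0.45, "brightness": 200})
```

The point of the split is that the channels are generic and reusable, while the `connect()` calls are the only demo-specific wiring - rearranging a demo means rewiring synapses, not rewriting device code.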


NodeJS is a great thing to learn for web development, and you're essentially getting two languages for the price of one (frontend JavaScript and backend Node). To me it definitely seems like the most elegant medium for making networked systems. It's fast, has tons of functionality for server-side operations, uses JavaScript syntax and handles packages and objects in a clean way.


I used the serialport package to control the Arduino. It's pretty straightforward - it uses an event system to notify you of things like data arriving or connections being made/broken. As far as I can tell, Arduino Serial commands are designed around ASCII, so be sure to specify 'ascii' as the encoding when you're setting up a port. If you want to seek out devices automatically rather than hardcoding a device path, scroll down to 'static methods' here and use the list() function.

Http Requests

Practically all smart home devices use a RESTful architecture, an efficient way to exchange state information between IoT devices and a user or server. They almost always rely on HTTP requests, which is no different to what happens when you click on a webpage and the content loads.

Node is excellent for this kind of thing because it lends itself to asynchronous processes - imagine you've sent off a request to a website for some information, but it's really slow; you want to get on with other stuff whilst the response comes through instead of waiting around. If you do want to work with requests, a bit like the socket stuff, once you understand how they work you can apply them in any language. The best way is to have a play and extend examples you find on the internet. The request package has just been deprecated, so if you want a good Node package to work with, try axios or got.

Philips Hue

Philips has a well-documented API that lets you take control of your Hue lights with code. You can send up to 10 commands a second, changing the colour or brightness of each lamp. You need to go through the authentication process here in order to get a token. Once you have one, you can begin sending PUT requests to change parameters like hue or brightness (the initial authentication step is the part that uses POST).
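As a sketch, a state-change request can be built like this, assuming the usual lights/<id>/state endpoint. The bridge address and token below are placeholders - yours come from your own network and the bridge's authentication flow - and actually sending the request (via PUT) is left out:

```python
import json

BRIDGE_IP = "192.168.1.2"    # placeholder: your bridge's address
TOKEN = "your-app-token"     # placeholder: token from the auth step

def hue_state_request(light_id, on=None, bri=None, hue=None):
    """Build (url, json_body) for a state change on one lamp.
    Only the parameters you actually pass end up in the body."""
    url = f"http://{BRIDGE_IP}/api/{TOKEN}/lights/{light_id}/state"
    state = {k: v for k, v in {"on": on, "bri": bri, "hue": hue}.items()
             if v is not None}
    return url, json.dumps(state)

# e.g. flick lamp 1 on at medium-high brightness
url, body = hue_state_request(1, on=True, bri=200)
```

In the hub, the pointing gesture ends up calling something like this, with brightness mapped from the drag distance.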

Step 5: Wrapping Up

I tend to approach projects by breaking them down into components that directly build on something I've seen online or something I've built before, which is why I wanted this to be a vague but all-encompassing interaction guide. Figuring out the bits you don't know as you go is often the best way to pace learning. Hopefully this has given you some general guidance to spur on ideas for your own connected projects. If you make something cool, please send a message - I'd love to see it!

Automation Contest

Second Prize in the Automation Contest