What the Kinect sensor actually does…

by Stephen Hobley on December 4, 2010


I was confused about the Kinect sensor.

I knew that it was somehow capable of recognising human gestures but I didn’t know exactly how that data was presented to the host computer. Peering in on the Kinect community from the outside, it was difficult to work out exactly what the Kinect sensor delivered.

Luckily I was able to find a sensor for sale at my local Meijer supermarket and snapped it up to find out what all the fuss is about.

Basically, the Kinect appears to be a 640×480, 30fps video camera that knows the *depth* of every single pixel in the frame. It does this by projecting a pattern of dots over the scene with a near-infrared laser, and using a detector that establishes the parallax shift of the dot pattern at each pixel.
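That parallax shift maps to depth through the standard stereo triangulation relation Z = f·b/d. As a quick illustration (the focal length and baseline below are made-up numbers for the sake of the example, not actual Kinect calibration values):

```python
# Structured-light / stereo triangulation: depth is inversely
# proportional to the disparity (parallax shift) of a projected dot.
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Z = f * b / d -- the standard triangulation relation."""
    return focal_px * baseline_m / disparity_px

# Illustrative numbers only: a 580px focal length and a 7.5cm
# emitter-to-camera baseline.
print(depth_from_disparity(580, 0.075, 10))  # small shift -> ~4.35 m
print(depth_from_disparity(580, 0.075, 40))  # bigger shift -> ~1.09 m
```

The key property is that nearby objects shift the dot pattern more, so depth resolution is best close to the sensor.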

Alongside this there is a regular RGB video camera that detects a standard video frame. This RGBZ (or ‘D’) data is then packaged up and sent to the host over USB.

First off, this is very cool – RGBD cameras are traditionally very expensive, so $150 is a bargain. Top marks to Microsoft for this.

What it does not do is identify shapes within the field of view and map skeletal outlines onto the shapes it recognises. This was the most confusing thing for me, as every article I’d read had shown the skeletal representation as an explanation of what the Kinect does.

You would need to take each of the 640×480 frames and copy it into a framebuffer so it can be processed by a vision library like OpenCV. Typical operations would be to threshold the depth image to get the “closest” pixels, then perform blob analysis on that region of interest to group the pixels into identifiable features, and finally track those blobs over their lifetime.
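As a minimal sketch of that pipeline – with plain NumPy standing in for the OpenCV calls (cv2.threshold and cv2.findContours would be the real tools), and a synthetic depth frame instead of Kinect data:

```python
import numpy as np

# Synthetic 640x480 depth frame in millimetres: background ~3m away,
# with a "hand" region at ~0.9m.
depth = np.full((480, 640), 3000, dtype=np.uint16)
depth[200:280, 300:360] = 900

# 1. Threshold the depth image: keep only the closest pixels.
mask = depth < 1500

# 2. Crude blob extraction: bounding box of the surviving pixels.
#    (cv2.connectedComponents would label multiple blobs properly;
#    this sketch assumes a single blob.)
ys, xs = np.nonzero(mask)
x, y = xs.min(), ys.min()
w, h = xs.max() - x + 1, ys.max() - y + 1
print(x, y, w, h)  # -> 300 200 60 80
```

With a real stream you would run this on every frame and match blobs between frames by proximity to track them over their lifetime.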

This is actually quite a lot of work – and one thing I’ve noticed about some of the Kinect demo videos is the slight lag in response.

Both these demos are impressive, but I’m not totally convinced that they rely on the Kinect’s 3D abilities. I don’t think it would be too hard to implement either of these using OpenCV and a good 2D camera.

When I was working on the laser harp I spent some time trying various video cameras as potential detectors. What I found was that 30fps was too slow to get a response suitable for music – something in the 60–100fps range was better.

Also 640×480 was just too much data to crunch at this rate and 320×240 was about the maximum that could be processed.

The PS3Eye camera is excellent in this respect – it can deliver 120 fps at 320×240 monochrome – perfect for a laser harp type instrument or a super responsive “surface” computer.

One of the best things about the Wiimote / Pixart sensor is that it does the blob tracking in hardware, so you end up with a datastream containing the X/Y/Z position of up to 4 bright points – perfect for an Arduino* or other “slow” microprocessor.

Where I think the Kinect will be outstanding is in robotic vision applications where the processor has time to analyse the image, update the internal world model and navigate accordingly.

But for true realtime operation – there is still a bit too much work to be done.

*But* maybe I’m wrong on this – because the Xbox seems to manage all that processing, along with rendering a game too.

Comments very welcome on this one…

* iBlogCred *= 100; // Mention of Arduino


(via the comment from Mike below)

This is a fantastic example of using the Kinect’s unique 3D capabilities. I urge you to check out all of the videos in the series.

What is really interesting is that in one of the later videos the framerate counter in the bottom corner reports 100fps. I don’t know how he’s getting a framerate greater than the 30fps limit coming from the Kinect – maybe some of the frames are duplicates, in that the 3D rendering is happening faster than the data is coming in.
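That duplication is easy to sanity-check with some throwaway arithmetic (the rates are the ones mentioned above):

```python
# Renderer at 100fps, sensor at 30fps: which sensor frame is the
# newest one available at each render tick over one second?
render_fps, sensor_fps = 100, 30
latest = [int(t * sensor_fps / render_fps) for t in range(render_fps)]

unique = len(set(latest))          # distinct sensor frames shown
duplicates = len(latest) - unique  # renders that reused a stale frame
print(unique, duplicates)  # -> 30 70
```

So a 100fps counter is perfectly consistent with a 30fps sensor – roughly 70% of the rendered frames are just redraws of stale data.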

In closing, I want to stress that I am in no way trying to “disrespect” the Kinect sensor, or the brilliant work done by the hacking community to put the drivers out. It’s a very interesting device, and quite unique. I do think that the depth mapping is being underutilised in most of the early demos, and we all need to get our respective thinking caps on to come up with implementations that take advantage of it.

Follow-up to this article.

{ 25 comments… read them below or add one }

r December 4, 2010 at 11:39 pm

The depth buffer is only 320×480. I think the hardware will happily give you a 640×480 version (this is 360 API memory, so upscaling may actually occur on the 360), but the hardware itself only gets enough data to fill 320×480.

I’m also positive the sensor uses parallax, and not intensity. The people I’ve talked to who were involved (tangentially, it seemed) said that materials (hair in particular) caused large fluctuations in intensity, so it doesn’t seem like it would be a useful channel to probe for depth data.

Apologies for the atrocious typing. The iPad keyboard goes from useable to preposterous in portrait mode.

Stephen Hobley December 4, 2010 at 11:44 pm

Hey thanks for the info – parallax detection actually makes a lot of sense.

I will check the code and see how big the z buffer is – my initial inspection indicated that the depth was packaged with the video frame, so you had to “receive” the full frame before you could process it, but if it’s less than that, it helps with processing speed a lot.

If you can reduce the framesize, that would be great – I’ve had better performance with 320×240 frames, and that’s more than enough data for my applications.

Steven Leibrock December 5, 2010 at 12:42 am

Hey Stephen, you’re right that real-time processing is probably next to impossible with the Kinect. Since the Kinect is running at 30fps at 640×480, the data is really hard to manage. Even some of the Kinect games aren’t 100% responsive to gestures.

It could be that the Xbox reduces the data to a smaller size, making it easier to process, since there is a lot of work going on. I’ve hypothesized several ideas, and shrinking the data to a smaller size seems like the likeliest option. The other thing I thought about is that the distance most Kinect games ask you to maintain might have an effect on how much data is processed during a game. I’ve only experimented at about 3-4 feet, and most games tell you to stay 6-12 feet away from the Kinect in order for it to work.

My third guess is that the applications only collect depth information every n frames, which would make the processing easier. It wouldn’t produce smooth results, but to the casual eye it wouldn’t matter in the long run.

Let me know what you think, I’d love to hear some input from a fellow Kinect experimenter!

mike December 5, 2010 at 2:22 am

this is a good implementation of the depth http://www.youtube.com/watch?v=7QrnwoO1-8A

Sean December 5, 2010 at 2:36 am

These are first-generation games. I’m sure the Kinect will be superior to Sony’s solution due to the smaller amount of data and the fact that the data returned is much more useful to the programmer.

Michael Giambalvo December 5, 2010 at 6:41 am

“Both these demos are impressive, but I’m not totally convinced that they rely on the Kinect’s 3D abilities. I don’t think it would be too hard to implement either of these using OpenCV and a good 2D camera.”

In order to do the same kinds of things as these demos, you would have to isolate the user’s hands in the scene. This is very tricky to do quickly and well. You can make it much easier by having the user wear special colored gloves, or IR beacons or something like that, but being able to filter the image by depth can make some tasks almost trivial.

Also, Microsoft acquired a company that produced an RGBD camera that used time of flight to measure distance – I originally thought that’s what the Kinect used. The IR emitted by the sensor is definitely structured though, so I guess it could be parallax. I’m really curious as to how well it works outdoors.

Stephen Hobley December 5, 2010 at 6:59 am

With respect, I disagree – the girl is wearing black against a white wall – I could do this with OpenCV and threshold and blob detection – something I would still have to do with the Kinect data.

As for the piano demo – this seems too unresponsive to be useful. With some careful thought on light and camera placement, I could achieve this with just a few scan lines from a camera and simple arithmetic.

blongo December 5, 2010 at 11:31 am

Duh .. the 3D scene is rendered at ~100fps, but the depth data is only updated at 30fps.
You can rotate what you’ve got faster than the Kinect can deliver it. Two different things, not related.

Scott Brickey December 5, 2010 at 12:09 pm

I suspect that the reason the xbox can process the data so quickly is due to its massively parallel processor. It would seem that the best way a desktop could take advantage of the Kinect would be to use GPU processing (CUDA, etc) of the Kinect data, and supply the meaningful output (blobs, etc) to the processor for application usage.

Drake December 5, 2010 at 1:30 pm

I’d like to see the drivers rewritten maybe using compressive sampling techniques. I bet you can get the same performance but manipulate a lot less data at the same time. I suspected that’s what they were doing when I saw some of the patterns the light array makes, but then I saw regular patterns too so who knows…

Stephen Hobley December 5, 2010 at 1:54 pm

blongo – That’s what I meant by this bit:

“maybe some of these are duplicates, in that the processing of this is happening quicker than the data is coming in.”

But thanks for the confirmation on that.

n December 5, 2010 at 3:45 pm

“With respect, I disagree – the girl is wearing black against a white wall – I could do this with OpenCV and threshold and blob detection – something I would still have to do with the Kinect data.”

But you couldn’t do this if she was wearing black against a black wall. What is great about the kinect is that it’s cheap and limits the impact of poor lighting/special conditions required for computer vision. Take a look at this great blog post by memo akten: http://memo.tv/kinect_why_it_matters

Adam Reineke December 5, 2010 at 6:56 pm

The 100fps in that last video, I assume, is the framerate at which his computer is drawing the scene in 3D. I assume he is getting the data to construct the scene at a slower rate.

Stephen Hobley December 6, 2010 at 12:39 am

n – again with respect – you don’t deliberately make computer vision hard by turning out the lights. Doing so just because you *can* seems counterintuitive. I’ll always favour making allowances just to improve the responsiveness and robustness of a system. But then that’s just me.

Maybe it’s because I work in musical control interfaces – where response time is critical to be able to actually play live music using lasers, electric fields, light etc…

From everything I had read thus far, I was led to believe that the Kinect sensor was able to interpret human body movement and gestures. But it turns out this is not the case. You still need to do some heavy lifting after receiving the data to get to where you need to be. In this respect I *was* a little disappointed with the Kinect.

I was hoping that this would help me to solve a problem I’ve been working on for some time – a velocity sensitive controller similar to the laser harp – but at the moment it seems not.

To do so I will need to sample hand position quicker than 30fps – as it’s not the position of the hand that is important, but the rate of change of position.
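To make the rate-of-change point concrete, here is a toy finite-difference velocity estimate; the positions and speeds are invented numbers, not measurements:

```python
# Velocity as a finite difference of successive position samples.
# At 30fps the samples are ~33ms apart; at 100fps, 10ms apart --
# the same strike is resolved more than three times as finely.
def velocity(p_prev, p_curr, fps):
    """Rate of change of position (metres/second) between frames."""
    return (p_curr - p_prev) * fps

# A hand moving at 1.5 m/s covers 5cm per frame at 30fps,
# but only 1.5cm per frame at 100fps:
print(velocity(0.0, 0.05, fps=30))    # 1.5 m/s from one 33ms interval
print(velocity(0.0, 0.015, fps=100))  # 1.5 m/s from one 10ms interval
```

At 30fps a fast strike is over before you get the two or three samples you need to difference, which is why the higher rate matters for a velocity-sensitive instrument.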

I am not saying it won’t though – given enough time…

Christian Sciberras December 6, 2010 at 3:53 am

I was wondering, wouldn’t it be interesting to have two of them at the same time (3 if you want perfect full coverage)?

The last video shows how only one side is being recorded (of course), would be interesting to have two of them monitoring a whole area.

Just a thought.

cfone December 6, 2010 at 4:33 am

“n – again with respect – you don’t deliberately make computer vision hard by turning out the lights. Doing so just because you *can* seems counterintuitive. I’ll always favour making allowances just to improve the responsiveness and robustness of a system. But then that’s just me.”

Not everyone wants to wear particular clothing or has adequate lighting. Adding the depth sensing information on the Kinect sensor makes object recognition far easier and more robust (making allowances for your system does not make it robust, it just speaks to its lack of robustness), as you’re not dependent on sufficient contrast within the image between objects. There’s only so much you can do with a 2D camera anyway (i.e. it is *impossible* to accurately judge the relative depth of objects in a scene); this just opens up new possibilities.

Tom December 6, 2010 at 5:58 am

On the other hand, you don’t make a computer vision system that only works in special lighting conditions because we-never-had-these-new-fangled-depth-fields-back-in-my-day, which seems to be essentially what you are saying. It turns the problem of extracting points of interest from a very complex series of operations – format conversion, edge detection, corner detection, special lighting and a carefully selected background – into a simple threshold on the depth field. And it means your system works in pretty much any lighting conditions. What’s not to like? Previous CV demos had looked cool, but when I asked, “So, when can I have one in my home?”, the answer was always, “Ah, well, we need to set the scene up in quite a special way,” etc. That’s gone.

On the performance of OpenCV, I spent some time messing with it a while back and was disappointed by its sloppy response. Just as I was about to put it all down, I tried switching my build configuration to ‘Release’. Oh my! It is snappy. This type of application is extremely susceptible to all sorts of optimisations – use of the SSE extensions for SIMD operations can reduce runtimes by a factor of ~20 alone, then you add in the OpenMP parallelisations that OpenCV uses to spread the processing across however many cores you’ve got, that’s another 8x speed up on my box (4 cores + HT across the ALUs) and so on. I wonder how many of those sluggish demos were built in the debug configuration?

But you are right about the framerate – 30fps is too slow for anything musical. Even if there is no processing delay at all, 30ms of latency is around the point where you can feel something is not right, and you can maybe hear the delay between doing something and the sound appearing. Add in some processing delay… It’s a shame, as I thought a virtual set of timps was within my grasp…

Tom December 6, 2010 at 6:29 am

It’s worth mentioning kinecthacks.net/ – Basically a 1 stop shop for a lot of the new Kinect hacks being worked on.

Out of interest – what’s the rate limiting step for the data crunching? Couldn’t you throw a cheap CPU or specific chip to be able to crunch more data to get the lag down? Or is it prohibitively expensive currently? Could you offload this to a desktop processor?

This is one place where, say, a Light Peak cable could be interesting – get the computation done by your desktop, and let your games machine do the rest. It seems the Kinect is somewhat hobbled by the Xbox 360’s computational power – it couldn’t handle too much Kinect computation, so the sensor has been held back to some degree so that the Xbox can run the game too.

Jonathan Dickinson December 6, 2010 at 8:13 am

Here is my 2c on how to get it working a little faster.

1. You only really care about the color image when the 3D model changes (i.e. rotates) past a certain threshold, or when you are trying to figure out which pixels make up the ‘initial’ human. So instead of processing the color image every frame, discard it (or cache it) and work off the 3D data instead. This is even faster than the color processing because you are only worrying about 1bpp. Honestly, the only reason we ever used color processing is because we didn’t have an RGBD camera.
2. It looks like Microsoft built in a NN (neural network). I read somewhere (lost the reference) that they needed to train the Kinect with gestures – which lends itself to a backpropagation NN. Running 1bpp through a NN (forwards only) might be pretty fast, especially if you compiled that NN somehow. In regard to your last comment, remember the hackers have not figured out what all the bits mean; in other words, that gesture data could be streaming from the device (they just don’t know how to read it yet).
3. Don’t limit your game/demo framerate to the framerate of your Kinect and image processor. Run that stuff on a separate thread. This means your controls will lag slightly behind reality – but I assume this is how the official Xbox Kinect games do it. The human mind has problems perceiving lag under 30Hz, so if you keep your image processor at least that fast you are sorted; just keep the spiffy graphics rendering at 60Hz.
4. Building a wireframe is awesome, but essentially useless. Decide what you want to do with the sensor and build a specialized system for that. Looking at the data as a heightmap (and hence we go back to the NN argument) might get you much more mileage in terms of speed.
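Point 3 – decoupling capture from rendering – can be sketched with a latest-value mailbox. The class and names here are my own invention for illustration, not any Kinect API:

```python
import threading
import time

class FrameSlot:
    """Latest-value mailbox: the capture thread overwrites the slot,
    the render loop reads whatever is newest without ever blocking."""
    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def put(self, frame):
        with self._lock:
            self._frame = frame

    def get(self):
        with self._lock:
            return self._frame

slot = FrameSlot()

def sensor_loop(n_frames):
    # Stand-in for a 30fps Kinect capture thread.
    for i in range(n_frames):
        slot.put(i)
        time.sleep(0.001)  # simulated capture interval

t = threading.Thread(target=sensor_loop, args=(50,))
t.start()
# A real render loop would spin at 60Hz here, calling slot.get()
# each frame and reusing the previous value when nothing new arrived.
t.join()
print(slot.get())  # the newest frame index
```

The render loop never waits on the sensor; it just redraws the freshest data it has, which is exactly why controls lag slightly behind reality.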

Stephen Hobley December 6, 2010 at 11:11 am

Tom – thanks for the tip – I’ve been disappointed with some of the OpenCV stuff I’ve tried, so I’ll certainly take a look at the optimizations available.

I can totally see how being able to function in the dark etc… is a step forwards in robustness – no argument from me.

I guess I take a different view because my background is in real-time performance, and so I go to great lengths to make a system that has the smallest amount of work to do in the shortest time. Usually I work in embedded control systems too, so every MPU cycle counts. I get twitchy when people want to add another microprocessor when a 555 timer could do the same thing.

I have a (shelved) robotics project where I think the Kinect will be absolutely perfect in solving some of my “navigation” issues.

Stefan December 6, 2010 at 3:55 pm

Hey there,

just wanted to say, I don’t think the Kinect uses a 3×3 checkerboard pattern.
This one at least looks quite different ;)
Would be interesting to know what the other bits from the Kinect mean, but I think that MS might have some kind of Xbox-optimized body/gesture recognition or skeletonization library that they simply put on the discs of the Kinect games. That way a game developer could be more innovative, because he can make his own decisions on how to interpret the “movement” data. He could define new gestures for his game and wouldn’t be bound to the limited set of gestures that occurred to MS when creating the Kinect.



Stephen Hobley December 6, 2010 at 4:14 pm

There’s a pretty good image of the 3×3 effect that I was talking about here.


It’s barely perceptible in my original video, but quite clear in the third picture down.

Stefan December 7, 2010 at 1:35 pm

You’re right, haven’t seen that before ^^

Cahalan Liben November 26, 2011 at 11:47 am

You have a lot of useful pointers on this site. This is a well-written article that I have bookmarked for future reading. Have fun.

Scott Driscoll February 23, 2013 at 11:52 am

This is a bit late, but this short video explains the IR depth detection, and why it doesn’t work outdoors or with multiple kinects: http://www.youtube.com/watch?v=uq9SEJxZiUg
