Aggregated Data
The software that drives the Kinect device is built on a huge database of information collected from real-world scenarios. This vast collection of sample data is what enables the Kinect to recognize people of different heights, ages and body sizes, whatever clothes they are wearing. The Kinect relies on patterns learned from this data to interpret what it sees in front of it and to produce the correct response.
Hardware
The three key components that make up an Xbox Kinect device are a VGA video camera with a resolution of 640 by 480 pixels, a depth sensor and a multi-array microphone. The depth sensor pairs an infrared projector with a monochrome CMOS sensor, and the two work together to create a 3D map of what's in the room: the projector casts invisible infrared light out from the Kinect, and the CMOS sensor reads the light that comes back in order to work out distances.
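As a rough illustration of how a depth reading becomes a point in a 3D map, the sketch below back-projects one pixel of a 640 by 480 depth image using a standard pinhole camera model. The focal length and principal point values are hypothetical placeholders, not published Kinect figures; a real device would supply them through calibration.

```python
from typing import Tuple

# Hypothetical intrinsics for a 640x480 depth image (assumed values,
# not taken from the article or from any Kinect documentation).
FX = 580.0  # focal length in pixels, horizontal
FY = 580.0  # focal length in pixels, vertical
CX = 320.0  # principal point x (image centre)
CY = 240.0  # principal point y (image centre)


def depth_pixel_to_point(u: int, v: int, depth_m: float) -> Tuple[float, float, float]:
    """Convert pixel (u, v) with a depth in metres to camera-space (x, y, z)."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return (x, y, depth_m)


if __name__ == "__main__":
    # A pixel at the image centre lies on the optical axis.
    print(depth_pixel_to_point(320, 240, 2.0))
    # A pixel to the right of centre maps to a positive x offset.
    print(depth_pixel_to_point(400, 240, 2.0))
```

Applying the same conversion to every pixel in a depth frame gives a cloud of 3D points, which is one simple way to represent the room map described above.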
Accuracy and Flexibility
Xbox Kinect can measure depth to within 1 cm, and height and width to within 3 mm. Depth and color information are used together to make the Kinect better at recognizing where body parts and objects start and end. The sensor also draws on information supplied by the app or game that's currently running: if you're watching Netflix, for example, it looks for navigation gestures; if you're playing a dancing game, it looks for that type of movement instead. The Kinect can also recognize faces and distinguish between them.
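The way the running app narrows down what the sensor looks for can be pictured as a lookup from context to an allowed set of gestures. The sketch below is purely illustrative: the context names, gesture names and the filter_candidates function are all hypothetical, and it does not reflect how the Kinect software is actually organized.

```python
from typing import Dict, List, Set

# Hypothetical mapping from the running app's context to the gestures
# worth checking for. All names here are invented for illustration.
GESTURES_BY_CONTEXT: Dict[str, Set[str]] = {
    "video_app": {"swipe_left", "swipe_right", "hover_select"},
    "dance_game": {"arm_raise", "step_left", "step_right", "spin"},
}


def filter_candidates(context: str, candidates: List[str]) -> List[str]:
    """Keep only the candidate gestures that make sense in this context."""
    allowed = GESTURES_BY_CONTEXT.get(context, set())
    return [gesture for gesture in candidates if gesture in allowed]


if __name__ == "__main__":
    detected = ["spin", "swipe_left", "arm_raise"]
    print(filter_candidates("video_app", detected))
    print(filter_candidates("dance_game", detected))
```

Restricting the candidates this way means a movement that resembles a dance step is ignored while a video app is open, which reduces false detections.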
PrimeSense
PrimeSense originally developed the raw technology that was licensed to Microsoft to form the basis of the Kinect motion sensor. The company's work built on existing "time of flight" methods, in which light is bounced off an object and timed on its return, by encoding data within the infrared stream so that the returning light gives a more accurate picture. Objects within a room create deformations in the light as it's reflected back, which gives the Kinect the data it needs to spot shapes and people. The sensor looks for human torsos first of all, then for arms and legs, and then tries to guess where these body parts will be a fraction of a second from now as well as where they are at present.
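The article does not say how the sensor guesses where a body part will be next. One simple stand-in, shown below as a sketch only, is constant-velocity extrapolation: take a joint's position in the last two frames and assume it keeps moving at the same speed. The predict_joint function and its parameters are hypothetical names chosen for this example.

```python
from typing import Tuple

Point = Tuple[float, float, float]


def predict_joint(prev: Point, curr: Point, frame_dt: float, lookahead: float) -> Point:
    """Extrapolate a joint position `lookahead` seconds past `curr`.

    Assumes the joint keeps the velocity it had between the previous
    and current frames, which were captured `frame_dt` seconds apart.
    """
    if frame_dt <= 0:
        raise ValueError("frame_dt must be positive")
    scale = lookahead / frame_dt
    x = curr[0] + (curr[0] - prev[0]) * scale
    y = curr[1] + (curr[1] - prev[1]) * scale
    z = curr[2] + (curr[2] - prev[2]) * scale
    return (x, y, z)


if __name__ == "__main__":
    # A hand that moved 10 cm to the right between two frames is
    # predicted to move another 10 cm over the next frame interval.
    print(predict_joint((0.0, 1.0, 2.0), (0.1, 1.0, 2.0), 1 / 30, 1 / 30))
```

A real tracker would likely combine a prediction like this with the new measurement from each frame, since people rarely move at a perfectly constant speed.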