Recognizing poses directly in the browser in real time

Today we show how to recognize complex human poses using AI right in the browser. This is useful, for example, when building exercise apps. Previously, even the best detectors could not cope with this task. Read on for the details, as our flagship Data Science course gets under way.

We are pleased to present the MoveNet Pose Model with our new Pose API in TensorFlow.js. MoveNet is an ultra-fast and accurate model that identifies 17 key points on the body.

The model is offered on TF Hub in two variants known as Lightning and Thunder. Lightning is for latency-critical applications, while Thunder is for high-fidelity applications.

Both models run faster than real-time (30+ FPS) on most modern desktops, laptops and phones, which is critical for fitness, sports and health applications.

This is achieved by running the model entirely on the client side, in the browser with TensorFlow.js, with no server calls after the initial page load and no dependencies to install.

Try the interactive demo!

MoveNet can track key points during fast movements and atypical postures.

Human pose estimation has come a long way over the past five years, but surprisingly it has not yet found widespread use. This is because more emphasis has been placed on making pose models larger and more accurate than on the engineering work needed to deploy them quickly and everywhere.

With MoveNet, our goal was to design and optimize a model that takes advantage of the best aspects of modern architectures while keeping inference times as low as possible. The result is a model that delivers accurate key points across a wide variety of poses, environments, and hardware setups.

Interactive health apps

To see whether MoveNet can help open up remote care for patients, we teamed up with IncludeHealth, a digital health and performance company.

IncludeHealth has developed an interactive web application that guides a patient through a variety of routines using a phone, tablet or laptop. The routines are built digitally and prescribed by physical therapists to test balance, strength and range of motion.

Such a service requires pose models that run in the browser and, for privacy, locally on the device, and that deliver accurate key points at high frame rates. The key points are then used to quantify and qualitatively assess a person’s postures and movements.

For simple movements such as shoulder abduction or full-body squats, a conventional off-the-shelf detector is enough. More complex poses, such as seated knee extension or lying positions, cause difficulties even for state-of-the-art detectors if they were trained on the wrong data.

Comparison of the traditional detector (top) and MoveNet (bottom) in difficult poses.

We provided IncludeHealth with an early version of MoveNet, available through the new Pose API. The model is trained on fitness, dance, and yoga poses (see the training dataset section below). IncludeHealth integrated the model into its application and benchmarked MoveNet against other available pose detectors:

“The MoveNet model has provided the powerful combination of speed and precision required to deliver prescribed care. While other models trade one for the other, this unique balance opens up a new generation of healthcare delivery. The Google team has been a fantastic partner in this endeavor,” says Ryan Eder, founder and CEO of IncludeHealth.

IncludeHealth partners with hospital systems, insurance companies and the military to extend traditional care and training beyond brick-and-mortar facilities.

IncludeHealth demo application that runs in a browser and evaluates balance and movement using key point estimation powered by MoveNet and TensorFlow.js

Installation

There are two ways to use MoveNet with the new pose detection API:

Via NPM:

import * as poseDetection from '@tensorflow-models/pose-detection';

Through the script tag:

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-core"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-converter"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgl"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/pose-detection"></script>

Try it yourself!

After installing the package, you just need to follow a few steps to start using it:

// Create a detector.
const detector = await poseDetection.createDetector(poseDetection.SupportedModels.MoveNet);

By default, the detector uses the Lightning version; to select Thunder, create a detector as shown below:

// Create a detector that uses the Thunder variant.
const detector = await poseDetection.createDetector(
  poseDetection.SupportedModels.MoveNet,
  {modelType: poseDetection.movenet.modelType.SINGLEPOSE_THUNDER}
);

// Pass a video stream to the model to detect poses.
const video = document.getElementById('video');
const poses = await detector.estimatePoses(video);

Each pose contains 17 key points with absolute x, y coordinates, a confidence score, and the name of a body part:

console.log(poses[0].keypoints);
// Outputs:
// [
//    {x: 230, y: 220, score: 0.9, name: "nose"},
//    {x: 212, y: 190, score: 0.8, name: "left_eye"},
//    ...
// ]
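
As a rough illustration of how this array might be consumed in an exercise app, the sketch below looks up key points by name, discards those with a low confidence score, and computes the angle at a joint. The helpers `getKeypoint` and `jointAngle`, the 0.3 score threshold, and the sample coordinates are assumptions made for this example and are not part of the pose detection API. The names `left_hip`, `left_knee` and `left_ankle` are assumed to follow the same naming pattern as the names shown above.

```javascript
// Hypothetical helpers for this example; not part of the pose-detection API.

// Return the key point with the given name, or null if it is missing
// or its confidence score is below minScore (0.3 is an assumed default).
function getKeypoint(keypoints, name, minScore = 0.3) {
  const kp = keypoints.find((k) => k.name === name);
  return kp && kp.score >= minScore ? kp : null;
}

// Angle in degrees at vertex b, formed by the segments b->a and b->c.
function jointAngle(a, b, c) {
  const angleBA = Math.atan2(a.y - b.y, a.x - b.x);
  const angleBC = Math.atan2(c.y - b.y, c.x - b.x);
  let degrees = Math.abs(angleBA - angleBC) * 180 / Math.PI;
  if (degrees > 180) degrees = 360 - degrees;
  return degrees;
}

// Made-up sample data in the same shape as poses[0].keypoints.
const sampleKeypoints = [
  {x: 100, y: 100, score: 0.9, name: 'left_hip'},
  {x: 100, y: 200, score: 0.8, name: 'left_knee'},
  {x: 200, y: 200, score: 0.7, name: 'left_ankle'},
  {x: 150, y: 50, score: 0.1, name: 'nose'},  // too uncertain to use
];

const hip = getKeypoint(sampleKeypoints, 'left_hip');
const knee = getKeypoint(sampleKeypoints, 'left_knee');
const ankle = getKeypoint(sampleKeypoints, 'left_ankle');

// Only compute the angle when all three key points are reliable.
const kneeAngle = hip && knee && ankle ? jointAngle(hip, knee, ankle) : null;
console.log(kneeAngle);  // approximately 90 (a right angle)
```

Returning null for uncertain key points forces the caller to handle frames in which a joint is occluded, rather than silently computing an angle from unreliable coordinates.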

For more information on the API, see our README.

If you build something with this model, tag your social media post with #MadeWithTFJS so we can find your work; we’d love to see what you create.

MoveNet architecture

MoveNet is a bottom-up estimation model that uses heatmaps to accurately localize human key points. The architecture consists of two components: a feature extractor and a set of prediction heads. The prediction scheme loosely follows CenterNet, with notable changes that improve both speed and accuracy. All models are trained using the TensorFlow Object Detection API.

The feature extractor in MoveNet is MobileNetV2 with an attached feature pyramid network (FPN), which yields a high-resolution (output stride 4), semantically rich feature map. The output stride is the ratio of the input resolution to the feature map resolution, so the feature map is four times smaller than the input along each side. Four prediction heads are attached to the feature extractor and are responsible for dense prediction:

  • Person center heatmap: predicts the geometric center of each person.
  • Key point regression field: predicts the full set of key points for a person; used to group key points into instances.
  • Person key point heatmap: predicts the location of all key points, regardless of which person they belong to.
  • 2D per-key-point offset field: predicts local offsets from each pixel of the output feature map to the precise sub-pixel location of each key point.

MoveNet architecture

Although these predictions are computed in parallel, you can get a general idea of how the model works by looking at this sequence of operations:

  • Step 1. The person center heatmap is used to identify the centers of all people in the frame, where a center is defined as the arithmetic mean of all key points belonging to a person. The location with the highest score (weighted by the inverse distance from the center of the frame) is selected.
  • Step 2. An initial set of key points for the person is produced by slicing the key point regression output at the pixel corresponding to the person's center. Since this is a center-out prediction, which has to work across different scales, the regressed key points will not be very accurate.
  • Step 3. Each pixel in the key point heatmap is multiplied by a weight that is inversely proportional to its distance from the corresponding regressed key point. This ensures that we do not accept key points from background people, since they will generally not be near the regressed key points and will therefore have low resulting scores.
  • Step 4. The final set of key point predictions is selected by finding the coordinates of the maximum heatmap value in each key point channel. The local 2D offset predictions are then added to these coordinates to obtain refined estimates. The figure below shows these four steps.

MoveNet post-processing
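
The four steps can be sketched in plain JavaScript on a tiny 5×5 feature map with a single key point channel. This is a simplified illustration, not the model's actual implementation: the data layout, the helper names (`weightedArgmax2d`, `inverseDistance`, `decodeKeypoint`) and the weight formula 1 / (1 + distance) are all assumptions chosen to keep the example short. Coordinates are in units of feature map pixels.

```javascript
// Simplified, single-key-point sketch of the four post-processing steps.
// All data layouts and helper names are assumptions made for this example.

// Location of the largest value in a 2D array after per-pixel weighting.
function weightedArgmax2d(map, weightAt) {
  let best = {y: 0, x: 0, value: -Infinity};
  for (let y = 0; y < map.length; y++) {
    for (let x = 0; x < map[y].length; x++) {
      const value = map[y][x] * weightAt(y, x);
      if (value > best.value) best = {y, x, value};
    }
  }
  return best;
}

// Weight that decays with the distance from (cy, cx). The exact form
// 1 / (1 + d) is an assumption; it avoids dividing by zero at d = 0.
const inverseDistance = (cy, cx) => (y, x) =>
  1 / (1 + Math.hypot(y - cy, x - cx));

function decodeKeypoint({centerHeatmap, regression, keypointHeatmap, offsets}) {
  const h = centerHeatmap.length;
  const w = centerHeatmap[0].length;

  // Step 1: pick the person center, favoring locations near the frame center.
  const center = weightedArgmax2d(
    centerHeatmap, inverseDistance((h - 1) / 2, (w - 1) / 2));

  // Step 2: read the coarse key point location regressed at the center pixel.
  const coarse = regression[center.y][center.x];

  // Steps 3 and 4: weight the key point heatmap by closeness to the coarse
  // estimate, then take the location of the maximum.
  const peak = weightedArgmax2d(
    keypointHeatmap, inverseDistance(coarse.y, coarse.x));

  // Step 4 (continued): add the sub-pixel offset predicted at the peak.
  const offset = offsets[peak.y][peak.x];
  return {
    y: peak.y + offset.dy,
    x: peak.x + offset.dx,
    score: keypointHeatmap[peak.y][peak.x],
  };
}

// Build an h x w grid whose cells are produced by makeCell().
const grid = (h, w, makeCell) =>
  Array.from({length: h}, () => Array.from({length: w}, makeCell));

const centerHeatmap = grid(5, 5, () => 0);
centerHeatmap[2][2] = 0.9;   // main person, at the frame center
centerHeatmap[0][4] = 0.95;  // background person, in a corner

const regression = grid(5, 5, () => ({y: 0, x: 0}));
regression[2][2] = {y: 1, x: 2};  // coarse guess for the main person

const keypointHeatmap = grid(5, 5, () => 0);
keypointHeatmap[1][3] = 0.8;  // main person's key point
keypointHeatmap[0][4] = 0.9;  // background person's key point

const offsets = grid(5, 5, () => ({dy: 0, dx: 0}));
offsets[1][3] = {dy: 0.25, dx: -0.5};

const result = decodeKeypoint(
  {centerHeatmap, regression, keypointHeatmap, offsets});
console.log(result);  // {y: 1.25, x: 2.5, score: 0.8}
```

Note that the background person has the higher raw score in both heatmaps, yet the distance weighting in steps 1 and 3 makes the main person's center and key point win.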

Training dataset

MoveNet was trained on two datasets: COCO and an internal Google dataset called Active. While COCO is the standard benchmark dataset for detection thanks to its variety of scenes and scales, it is poorly suited to fitness and dance applications, which involve challenging poses and significant motion blur.

Active was created by labeling key points (adopting COCO's standard 17 body key points) on yoga, fitness and dance videos from YouTube. No more than three frames are selected from each video for training, to provide a variety of scenes and individuals.

Evaluations on the Active validation dataset show a significant performance gain over identical architectures trained on COCO alone. This is not surprising, since COCO rarely contains people in extreme poses (yoga, push-ups, headstands, and more).

To learn more about the dataset and how MoveNet performs across different categories, see the model card.


Optimization

While much effort has gone into the architecture, post-processing logic, and data selection, no less attention has been paid to inference speed in order to make MoveNet a high-quality detector.

First, bottleneck layers from MobileNetV2 were selected for the lateral connections in the FPN. Likewise, the number of convolution filters in each prediction head was reduced significantly to speed up execution on the output feature maps. Depthwise separable convolutions are used throughout the network, except for the first layer of MobileNetV2.

MoveNet was profiled repeatedly, and particularly slow ops were identified and removed. For example, we replaced tf.math.top_k with tf.math.argmax because it is much faster and is sufficient for the single-person setting.
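
To see why this swap is safe when only one person is needed, the sketch below compares plain JavaScript stand-ins for the two operations. These are not the TensorFlow implementations, and the function names and sample scores are made up for the example. A top-k query has to rank candidates, while an argmax needs only a single pass, and for k = 1 both return the same index.

```javascript
// Plain JavaScript stand-ins for the two ops (not the TensorFlow versions).

// Indices of the k largest values, found by sorting all indices: O(n log n).
function topKIndices(values, k) {
  return values
    .map((_, i) => i)
    .sort((a, b) => values[b] - values[a])
    .slice(0, k);
}

// Index of the single largest value, found in one pass: O(n).
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Made-up person-center scores, flattened into one array.
const centerScores = [0.1, 0.7, 0.2, 0.9, 0.4];
console.log(topKIndices(centerScores, 1)[0] === argmax(centerScores));  // true
```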

For fast execution in TensorFlow.js, all model outputs were packed into a single output tensor, so that the download from GPU to CPU happens only once.
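
A minimal sketch of what unpacking such a single buffer could look like follows. The layout assumed here, 17 consecutive (y, x, score) triples with y and x normalized to the range 0 to 1, as well as the helper name `decodePackedKeypoints` and the sample values, are assumptions for illustration; the Pose API performs the decoding for you and returns the key points in the format shown earlier.

```javascript
// Key point names in the assumed COCO order.
const KEYPOINT_NAMES = [
  'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
  'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
  'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
  'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
];

// Decode a flat buffer of 17 x (y, x, score) values (assumed layout, with
// y and x normalized to [0, 1]) into named key points in pixel coordinates.
function decodePackedKeypoints(packed, imageWidth, imageHeight) {
  return KEYPOINT_NAMES.map((name, i) => ({
    name,
    y: packed[i * 3] * imageHeight,
    x: packed[i * 3 + 1] * imageWidth,
    score: packed[i * 3 + 2],
  }));
}

// Stand-in for the data downloaded once from the GPU (made-up values).
const packedOutput = new Float32Array(KEYPOINT_NAMES.length * 3);
packedOutput.set([0.5, 0.25, 0.75], 0);  // nose: y, x, score

const unpacked = decodePackedKeypoints(packedOutput, 640, 480);
console.log(unpacked[0]);  // {name: 'nose', y: 240, x: 160, score: 0.75}
```

Reading one buffer and slicing it on the CPU avoids paying the GPU synchronization cost separately for each model output.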

Perhaps the most significant speedup came from using 192×192 inputs for the model (256×256 in the case of Thunder). To compensate for the lower resolution, smart cropping is applied based on the detection from the previous frame. This lets the model devote its attention and resources to the main subject rather than the background.
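
The idea behind cropping from the previous detection can be sketched as follows. This is a simplified stand-in, not MoveNet's actual cropping logic: the helper name `cropFromKeypoints`, the 0.3 score threshold and the 1.25 margin are assumptions made for this example.

```javascript
// Hypothetical helper: derive a square crop for the next frame from the
// key points detected in the previous frame.
function cropFromKeypoints(keypoints, imageWidth, imageHeight,
                           {minScore = 0.3, margin = 1.25} = {}) {
  const visible = keypoints.filter((k) => k.score >= minScore);

  // With no reliable detection, fall back to the full frame.
  if (visible.length === 0) {
    return {x: 0, y: 0, width: imageWidth, height: imageHeight};
  }

  const xs = visible.map((k) => k.x);
  const ys = visible.map((k) => k.y);
  const minX = Math.min(...xs);
  const maxX = Math.max(...xs);
  const minY = Math.min(...ys);
  const maxY = Math.max(...ys);

  // Square side: the longer box edge enlarged by the margin, capped so
  // that the square still fits inside the image.
  const side = Math.min(
    Math.max(maxX - minX, maxY - minY) * margin, imageWidth, imageHeight);

  // Center the square on the box, then shift it back inside the image.
  const cx = (minX + maxX) / 2;
  const cy = (minY + maxY) / 2;
  const x = Math.min(Math.max(cx - side / 2, 0), imageWidth - side);
  const y = Math.min(Math.max(cy - side / 2, 0), imageHeight - side);
  return {x, y, width: side, height: side};
}

// Made-up key points from the previous frame of a 640x480 video.
const previousKeypoints = [
  {x: 300, y: 100, score: 0.9, name: 'nose'},
  {x: 340, y: 300, score: 0.8, name: 'left_ankle'},
  {x: 10, y: 10, score: 0.05, name: 'right_wrist'},  // ignored: low score
];
const crop = cropFromKeypoints(previousKeypoints, 640, 480);
console.log(crop);  // {x: 195, y: 75, width: 250, height: 250}
```

The cropped region would then be resized to the model's input size, so the person fills most of the 192×192 (or 256×256) input instead of a small part of it.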

Temporal filtering

Working with a high-FPS camera stream makes it possible to apply smoothing to the key point estimates. Both Lightning and Thunder apply a robust nonlinear filter to the incoming stream of key point predictions.

This filter is tuned to suppress both high-frequency noise (i.e. jitter) and outliers from the model, while preserving high bandwidth during fast movements. The result is smooth rendering of key points with minimal lag under all circumstances.
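
The text above does not name the filter. As an illustration of the general idea, here is a minimal sketch of the One Euro filter, a well-known nonlinear smoothing filter that lowers its cutoff frequency when the signal is slow (suppressing jitter) and raises it when the signal is fast (reducing lag). Whether MoveNet uses exactly this filter, and with which parameters, is an assumption; the parameter values below are illustrative only.

```javascript
// One Euro filter for a single scalar signal, such as one key point
// coordinate. Parameter values are illustrative, not MoveNet's settings.
class OneEuroFilter {
  constructor({minCutoff = 1.0, beta = 0.0, dCutoff = 1.0} = {}) {
    this.minCutoff = minCutoff;  // Hz; lower means smoother when still
    this.beta = beta;            // higher means less lag on fast movement
    this.dCutoff = dCutoff;      // Hz; cutoff used for the speed estimate
    this.prev = null;            // {t, x, dx} from the previous sample
  }

  // Smoothing factor of an exponential low-pass filter for a given cutoff.
  static alpha(cutoff, dt) {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }

  // t: timestamp in seconds (must increase); x: raw value.
  // Returns the filtered value.
  filter(t, x) {
    if (this.prev === null) {
      this.prev = {t, x, dx: 0};
      return x;
    }
    const dt = t - this.prev.t;

    // Estimate the speed of the signal and smooth it.
    const rawDx = (x - this.prev.x) / dt;
    const aD = OneEuroFilter.alpha(this.dCutoff, dt);
    const dx = aD * rawDx + (1 - aD) * this.prev.dx;

    // Fast movement raises the cutoff, which reduces lag; slow movement
    // lowers it, which suppresses jitter.
    const cutoff = this.minCutoff + this.beta * Math.abs(dx);
    const a = OneEuroFilter.alpha(cutoff, dt);
    const filtered = a * x + (1 - a) * this.prev.x;

    this.prev = {t, x: filtered, dx};
    return filtered;
  }
}

// A jittery coordinate around 100, sampled at 30 FPS (made-up signal).
const smoother = new OneEuroFilter({minCutoff: 1.0, beta: 0.01});
let maxRawDeviation = 0;
let maxFilteredDeviation = 0;
for (let i = 0; i < 90; i++) {
  const raw = 100 + (i % 2 === 0 ? 2 : -2);
  const filtered = smoother.filter(i / 30, raw);
  if (i >= 60) {  // skip the warm-up period
    maxRawDeviation = Math.max(maxRawDeviation, Math.abs(raw - 100));
    maxFilteredDeviation =
      Math.max(maxFilteredDeviation, Math.abs(filtered - 100));
  }
}
console.log(maxRawDeviation, maxFilteredDeviation);
```

In a pose application one such filter would be kept per coordinate of each key point and fed the timestamp of every frame.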

MoveNet and browser performance

To quantify MoveNet’s inference speed, the model was tested on multiple devices. Model speed (expressed in FPS) was measured on the GPU via WebGL and also via WebAssembly (WASM), the typical backend for devices with low-end GPUs or no GPU at all.

MoveNet inference speed on various devices and TF.js backends. The first number in each cell is for Lightning, the second for Thunder.

TF.js is constantly optimizing its backends to speed up model execution on all supported devices. To help the models reach this performance, we took several steps, such as implementing a packed WebGL kernel for depthwise separable convolutions and improving GL scheduling for mobile Chrome.

To see the models' FPS on your own device, try our demo. You can switch the model type and backend in real time to see what works best for your device.

Plans

The next step is to extend the Lightning and Thunder models to work with multiple people. We also plan to speed up the TensorFlow.js backends through repeated benchmarking and optimization, so that the models run even faster.
