by Bhavika Panara, Ivelin Ivanov

MoveNet and PoseNet are computer vision models for pose estimation. They first detect a human figure in an image and then estimate the spatial locations of key body joints (keypoints), such as an elbow, shoulder, or foot.

It is important to note that pose estimation only estimates where key body joints are; it does not recognize who is in an image or video.

Applications

There are two popular classes of applications for pose estimation: healthcare and fitness/yoga.

Health Applications

One popular class of applications for pose detection is physical therapy: monitoring movement and posture for deviations from established medical norms and providing proactive guidance to patients.

During the COVID-19 pandemic, when elderly people were isolated and required remote monitoring, pose estimation helped with real-time remote fall detection and alerting in systems such as Ambianic.ai. The Ambianic Fall Detector is a Raspberry Pi 4B-based smart camera that uses pose estimation to continuously monitor high-risk areas of a home for possible falls and instantly alerts caregivers and family when a loved one has fallen.

A screenshot from Ambianic Fall Detector

Fitness/yoga

Pose estimation can detect athletic movements in activities such as yoga, weight lifting, and squats by tracking joint positions such as shoulders, elbows, and hips in real time. Fitness routines prescribed by therapists can then be built digitally around these measurements.

Pose estimation also allows us to build interactive fitness applications that guide users through their fitness programs from the comfort of their own homes. Such a service can run in the browser or locally, deliver precise keypoints, and help users count their exercises and keep historical records.

Pose Estimation Model Architectures

PoseNet image from the original TensorFlow PoseNet blog

Over the past few years, the Google AI TensorFlow team has introduced several pose estimation models with a variety of architectures: PoseNet, MoveNet, and BlazePose. Each model comes in multiple architecture variants.

The BlazePose model is offered by MediaPipe and infers 33 keypoints of a human body (in 2D and 3D versions), whereas PoseNet and MoveNet infer 17 keypoints in 2D space. MoveNet is considered the new-generation successor to PoseNet. All three models are available on TensorFlow Hub in several runtime formats, such as TensorFlow JavaScript (TF.js), TFLite, and Coral (Edge TPU).

The TF.js model format allows us to deploy ML models in JavaScript and run them directly in the browser or in Node.js.

The TFLite model format allows us to deploy ML models to mobile and IoT devices and run on-device inference. TensorFlow Lite is a lightweight version of TensorFlow designed for mobile and embedded devices: its binary is around 1 MB versus more than 1 GB for a full TensorFlow install. TensorFlow Lite models are 5-10x smaller than their full TF counterparts, usually measuring in the tens of MB rather than hundreds of MB. The compression is achieved via techniques such as quantization, which converts 32-bit parameter data into 8-bit representations.

Until recently, it was not possible to train a model directly with TensorFlow Lite. We had to first train a model with TensorFlow/Keras, save the trained model, convert it to a TFLite model using the TensorFlow Lite converter, and then deploy it on an edge device.
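A minimal sketch of that workflow, assuming a standard Keras model (the model choice and file names here are illustrative):

```python
import tensorflow as tf

# 1. Train or load a regular TensorFlow/Keras model.
model = tf.keras.applications.MobileNetV2(weights="imagenet")

# 2. Convert it with the TensorFlow Lite converter; the default
#    optimization enables post-training quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# 3. Save the .tflite flatbuffer for deployment on an edge device.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```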

In 2020, the TensorFlow team introduced TensorFlow Lite Model Maker, which enables us to train certain types of models on-device with custom datasets. It uses a transfer learning approach to reduce the required amount of training data and shorten training time.

TensorFlow Lite enables us to deploy models on CPU-only devices as well as on devices with Edge TPU support. The Edge TPU supports a specific set of neural network operations and architectures and can execute deep neural networks about 10x faster than a CPU. It runs only TensorFlow Lite models that are fully 8-bit quantized and then compiled specifically for the Edge TPU.

Converting a regular TensorFlow model to TFLite and optionally to Edge TPU for embedded device deployment
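For Edge TPU deployment specifically, the conversion must produce a fully integer-quantized model. A hedged sketch of that step (the random representative dataset is purely for illustration; real calibration should use samples from the training distribution):

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples let the converter estimate activation ranges
    # for full 8-bit quantization; random data is a placeholder here.
    for _ in range(100):
        yield [np.random.rand(1, 192, 192, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the model to the integer-only ops the Edge TPU can execute.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())

# The quantized model is then compiled for the accelerator:
#   $ edgetpu_compiler model_int8.tflite
```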

PoseNet:

PoseNet is an older-generation pose estimation model released in 2017. It is trained on the standard COCO dataset and provides single-pose and multi-pose estimation variants. The single-pose variant can detect only one person in an image or video, while the multi-pose variant can detect multiple persons. Both variants have their own sets of parameters and methodology. Single-pose estimation is simpler and faster, but it requires that only one person be present in the frame; otherwise keypoints from multiple persons will likely be estimated as belonging to a single subject.

In terms of model architecture, PoseNet also has two variants: MobileNetV1 and ResNet50. The MobileNetV1 variant is smaller and faster but less accurate. The ResNet50 variant is larger and slower but more accurate. Both MobileNetV1 and ResNet50 variants support single-person and multi-person pose estimation.

The model returns the coordinates of the 17 keypoints, along with a confidence score for each.

MoveNet:

MoveNet is the latest generation pose estimation model released in 2021. MoveNet is an ultra-fast and accurate model that detects 17 key points of a body. MoveNet has two variants known as Lightning and Thunder. Lightning is meant for latency-critical applications, while Thunder is meant for applications that require high accuracy. Both variants support 30+ FPS on most modern desktops, laptops, and phones. MoveNet outperforms PoseNet on a variety of datasets.
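To make the output format concrete, here is a minimal sketch of running single-pose MoveNet Lightning from TensorFlow Hub on one image (the hub handle and output layout follow the published model page; the input file name is illustrative):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load single-pose MoveNet Lightning from TensorFlow Hub.
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

# Lightning expects a 192x192 int32 RGB image (Thunder uses 256x256).
image = tf.io.decode_jpeg(tf.io.read_file("frame.jpg"))
image = tf.expand_dims(image, axis=0)
image = tf.cast(tf.image.resize_with_pad(image, 192, 192), dtype=tf.int32)

# Output shape [1, 1, 17, 3]: 17 keypoints of (y, x, confidence),
# with coordinates normalized to [0, 1].
keypoints = movenet(image)["output_0"].numpy()[0, 0]
print(keypoints.shape)  # (17, 3)
```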

MoveNet is trained on two datasets: COCO and an internal Google dataset called Active. Active was produced by labeling keypoints on yoga, fitness, and dance videos from YouTube, selecting three frames from each video for training. Evaluations on the Active validation dataset show a significant performance improvement over PoseNet. The initial version of MoveNet supports only single-pose estimation; a multi-person tracking version is under development.

MoveNet is a bottom-up estimation model. The architecture consists of two components: a feature extractor and a set of prediction heads. The feature extractor in MoveNet is MobileNetV2.

Four prediction heads are attached to the feature extractor and computed in parallel:

  • Person center heatmap: predicts the geometric center of each person.
  • Keypoint regression field: predicts a full set of keypoints for a person.
  • Person keypoint heatmap: predicts the location of all keypoints.
  • 2D per-keypoint offset field: predicts fine-grained offsets from each feature map pixel to the precise keypoint location.
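The decoding logic roughly works as sketched below. This is a simplified illustration of the published description, not the model's actual exported graph: the tensor shapes are assumptions, and the per-keypoint offset refinement is omitted.

```python
import numpy as np

def decode_single_pose(center_heatmap, keypoint_regression, keypoint_heatmaps):
    """Schematic single-pose decoding from MoveNet-style prediction heads.

    center_heatmap:      (H, W)         person-center scores
    keypoint_regression: (H, W, 17, 2)  per-pixel regressed keypoint positions
    keypoint_heatmaps:   (H, W, 17)     per-keypoint scores
    """
    h, w = center_heatmap.shape
    # 1. Pick the most likely person center.
    cy, cx = np.unravel_index(np.argmax(center_heatmap), (h, w))
    # 2. Take the full set of keypoints regressed from that center.
    regressed = keypoint_regression[cy, cx]  # (17, 2) of (y, x)
    ys, xs = np.mgrid[0:h, 0:w]
    keypoints = []
    for k in range(17):
        # 3. Down-weight heatmap peaks far from the regressed estimate so
        #    the decoder locks onto this person rather than bystanders.
        dist = np.hypot(ys - regressed[k, 0], xs - regressed[k, 1])
        weighted = keypoint_heatmaps[:, :, k] / (dist + 1.0)
        ky, kx = np.unravel_index(np.argmax(weighted), (h, w))
        keypoints.append((ky, kx, keypoint_heatmaps[ky, kx, k]))
    return keypoints
```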

Fall detection challenges in real-world settings

Base Heuristic Algorithm

One of the promising use cases of pose estimation models is detecting human falls. By analyzing frame sequences with pose estimation, we can predict fall motions. A simple and effective heuristic for detecting a fall is to compare the angle between a person's spinal vectors in the before and after frames of a fall sequence.
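A simplified sketch of this heuristic, assuming the spine is approximated by the vector from the hip midpoint to the shoulder midpoint (both the construction and the threshold value are illustrative assumptions, not the exact production implementation):

```python
import numpy as np

def spinal_vector(kp):
    """Spine approximated as the vector from mid-hip to mid-shoulder.
    kp maps joint name -> (x, y) image coordinates."""
    mid_shoulder = (np.asarray(kp["left_shoulder"]) + np.asarray(kp["right_shoulder"])) / 2
    mid_hip = (np.asarray(kp["left_hip"]) + np.asarray(kp["right_hip"])) / 2
    return mid_shoulder - mid_hip

def spinal_angle_change(kp_before, kp_after):
    """Angle in degrees between the spinal vectors of two frames."""
    v1, v2 = spinal_vector(kp_before), spinal_vector(kp_after)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

FALL_ANGLE_THRESHOLD = 60  # illustrative threshold, in degrees

def looks_like_fall(kp_before, kp_after):
    return spinal_angle_change(kp_before, kp_after) >= FALL_ANGLE_THRESHOLD
```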

In many cases, this approach predicts true events. However, there are also scenarios that produce false positives, for example when someone intentionally leans over to tie their shoes or squats.

This approach also sometimes fails to detect true falls when the spinal vector angle between the before and after frames does not meet the pre-configured threshold. For example, someone may slide out of their bed or couch, be unable to hold their weight, and end up stuck sitting on the floor, unable to call for help. While this is not the most obvious example of a fall, it is ground truth for seniors with specific medical conditions.

In the process of testing the system with users in real-world settings, we discovered a number of challenges. Some of them are listed below:

Distance from subject

While PoseNet and MoveNet were optimized for distances of around 10-15 feet (3-4 m), falls can occur over a wider range of distances from the camera lens. Because falls are unplanned events, we cannot expect people to position themselves at an optimal distance right before they fall. Sometimes incidents happen very close to the camera, and sometimes farther away.

Camera angle

Pose detection models are trained mainly on data where the camera is positioned at eye level relative to the person(s) in the image or video. In home settings, cameras are placed in a range of locations depending on the homeowner's personal preference. The camera not only has to be functional, it also needs to be “out of the way” and “fit in” with the rest of the furniture in the room. Sometimes cameras are mounted near a ceiling corner and sometimes placed on a piece of furniture near the floor.

Example with a ceiling-mounted camera.

In this example with a camera angled down from a ceiling corner, PoseNet confused background objects with a person. MoveNet correctly focused on the person.

Ambient lighting

Falls can happen at any time, day or night. For the current version of the system, we assume minimal lighting exists in the home, but we are well aware that we will eventually have to drop this assumption, as some people walk through dark areas, whether intentionally or unintentionally (e.g., in their sleep).

The base pose detection models perform well in low lighting, although they do have limits. We actually saw examples where the ML models did well even when it was hard for the human eye to distinguish objects in a dimly lit room.

Example of a dimly lit room where it is challenging for the human eye to distinguish keypoints, but the ML models did well.

Another example with dim lighting. PoseNet got confused. MoveNet did well.

Background objects

People personalize their spaces in a variety of ways. Some are minimalists in their choice of furniture and wall colors, while others are happier with a rainbow of colors and objects around them. The latter seems to present a notable challenge to computer vision models that have not seen such unique home decor.

In this example PoseNet got confused and placed pose key points on a vacuum cleaner. MoveNet did better.

In this example, both models confused background objects with person keypoints, although the confidence scores were under 10%.

For the time being, we advise users to keep high fall-risk areas clear of clutter, but we are working on introducing a feedback loop that would allow users to help their local models learn about their personal space and reduce mistakes.

Occlusions

It turns out that everyday items such as chairs and tables can pose significant challenges to pose detection models. Both PoseNet and MoveNet suffer when occlusions block part of the person whose fall we want to detect. Even in cases where the human eye can reasonably determine what position a person is in despite the occlusion, the ML models struggle. See the examples below of PoseNet and MoveNet detections on a video frame sequence:

PoseNet

PoseNet presents very low confidence scores (<10%) on correct detections in this sequence. It also confuses person keypoints with background objects. When occlusions cover a bigger part of the person's body, PoseNet is completely off.

MoveNet

MoveNet does not get easily confused by background objects. It presents high confidence scores (>20%) on correct detections. However, it does have problems with occlusions and sometimes presents similarly high confidence (>20%) on incorrect pose keypoints.

Outdoor scenes

Doorsteps are a high-risk area for falls. We saw examples where the ML models confused trees and pillars with people. The confidence scores for these detections were usually low (less than 10%), which alleviates the issue to some extent, since we only recommend alerting on detections with at least 50% confidence. However, this is another area where a user feedback loop would be appropriate, allowing the local model to learn to avoid detecting incorrect objects in the particular home setting.

Both PoseNet and MoveNet confused a pillar of bricks with a person.

Multiple people

Fall detection system alerts are mainly useful when people fall while no one is around to help. However, there are also situations when the people nearby are in a wheelchair or otherwise unable to assist. It is therefore important to accurately distinguish between different individuals in a scene before analyzing their poses for a possible fall.

In an earlier frame (red lines) PoseNet incorrectly crossed key points between two people next to each other. In a different frame (green lines) it correctly focused on one person. MoveNet did not confuse the two people.

MoveNet’s approach of estimating a body center for the main subject and then weighting other keypoints as a function of their distance from that center helps in these situations. On our data, it did better at focusing on one main subject in an image. However, the single-pose version we tested did not address tracking the subject between frames when multiple people move around. This area still requires work and testing with the upcoming multi-person tracking version of MoveNet.

Other scenes

We continue to learn about new situations from users who share their feedback with the Open Source community. As new challenging cases come in, we expand our model benchmark test suite and discuss the problems the ML system needs to learn to overcome. Along with this blog post, we are also publishing an interactive notebook showing the latest results comparing PoseNet to MoveNet.

Practical workarounds and troubleshooting

While we understand the challenges and continue working to fine-tune the fall detection models, there are several practical steps that work around some of the known issues:

  • Distance: The models perform best when the monitored risk area where a subject might fall is roughly 6-15 feet (2-5 meters) from the camera.
  • Angle: The models perform best when the camera is placed approximately at eye level, 4-6 feet (1.5-2 meters) from the ground, and angled parallel to the floor.
  • Visibility: As mentioned above, large occlusions such as tables and chairs can prevent the pose detection models from confidently detecting keypoints. The fewer occlusions in the monitored area, the better the fall detection model will perform.
  • Multiple Cameras: To accommodate unavoidable occlusions and objects that might confuse the pose detection models, it is advisable to install multiple cameras observing the fall-risk areas from different angles. As a recent publication in Nature on an alternative fall detection system suggests, using up to 8 cameras is a practical tradeoff to increase overall fall detection system performance. The MoveNet model card lists additional technical limitations of the model.

When a setting allows the flexibility to implement the aforementioned steps, the existing fall detection system offers a significant practical improvement over relying exclusively on continuous human supervision.

Future Work

The current system is helpful in many cases when unattended seniors fall and need assistance. That is an improvement over the status quo, which requires a care team member to always be present in person in order to react adequately and in time. While wearable medical alert devices are a viable alternative and have been around for years, research shows that seniors often are not wearing one when needed.

We believe strongly that ambient intelligence has the potential to improve people’s lives in a meaningful way. Although we are in the early days of proving this out, there is a growing community of users and supporters who continue to help us improve daily.

One immediate area of attention for the core Ambianic.ai team is enabling users to provide feedback through the UI app with minimal, intuitive actions. That feedback would enable local on-device transfer learning to train a fall classification model with improved accuracy, precision, and recall.

The idea is to use the current simple heuristic model as a baseline for initial training of the fall classification DNN and gradually apply user feedback to further improve performance.

As users are presented with fall detections (from the current heuristic model) in the UI app, they can choose to Agree or Disagree with each detection. If they Agree, the sample is labeled as positive and added to the local on-device training dataset for the local fall classification model. If they Disagree, the sample is labeled as negative and added to the same local dataset. Every 10th labeled sample is added to a local test dataset instead of the training dataset.
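In sketch form, the labeling logic could look like this (the class and method names are hypothetical, not the actual Ambianic code):

```python
class FeedbackStore:
    """Collects user-labeled fall detection samples on-device."""

    def __init__(self):
        self.train_set, self.test_set = [], []
        self._count = 0

    def add_feedback(self, image, user_agrees: bool):
        label = "fall" if user_agrees else "no_fall"
        self._count += 1
        # Every 10th labeled sample is held out for testing.
        target = self.test_set if self._count % 10 == 0 else self.train_set
        target.append((image, label))
```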

Once there is a batch of 100 new samples collected on the device, TFLite Model Maker will run on-device image classification transfer learning on the fall classification model. When the fall classification model reaches 90% accuracy, it can replace the baseline heuristic model and continue to improve via user feedback as it comes.
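A hedged sketch of that training step with TFLite Model Maker (the directory layout and epoch count are illustrative assumptions):

```python
from tflite_model_maker import image_classifier
from tflite_model_maker.image_classifier import DataLoader

# Locally collected, user-labeled samples: one folder per class.
train_data = DataLoader.from_folder("fall_feedback/train")
test_data = DataLoader.from_folder("fall_feedback/test")

# Transfer learning on top of a pre-trained backbone.
model = image_classifier.create(train_data, epochs=5)
loss, accuracy = model.evaluate(test_data)

# Promote the learned classifier only once it clears the 90% gate.
if accuracy >= 0.90:
    model.export(export_dir="models/")
```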

Further down the road, we are planning to test out the promise of federated learning by enabling users to pool learning from multiple devices without sharing image data from cameras.

If you want to do something about bringing positive change to your loved ones who need better care today or will need it tomorrow, then try the Open Source Ambianic.ai product and join the project now!