Dynamics-based Human Motion Modeling for People Tracking (DARPA Mind's Eye Program)*
Human pose estimation using monocular vision is a challenging problem in computer vision. Past work has focused on developing efficient inference algorithms and probabilistic prior models based on captured kinematic/dynamic measurements. However, such algorithms face challenges in generalization beyond the learned dataset.


Figure 1: Project Overview

In this work, we propose a model-based generative approach for estimating human pose solely from uncalibrated monocular video in unconstrained environments, without any prior learning on motion capture or image annotation data. We propose a novel Product of Heading Experts (PoHE) based generalized heading estimation framework that probabilistically merges heading outputs (probabilistic or non-probabilistic) from a time-varying number of estimators. Our current implementation employs a motion-cue-based human heading estimation framework to bootstrap a synergistically integrated probabilistic-deterministic sequential optimization framework that robustly estimates human pose. Novel pixel-distance based performance measures are developed to penalize false human detections and ensure identity-maintained human tracking. We tested our framework with varied inputs (silhouettes and bounding boxes) to evaluate, compare, and benchmark it against ground-truth data (collected using our human annotation tool) for 52 video vignettes in the publicly available DARPA Mind's Eye Year I dataset 1. Results show robust pose estimates on this challenging dataset of highly diverse activities.
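A minimal sketch of an identity-maintained pixel-distance measure of the kind described above: detections are matched to ground-truth subjects by nearest pixel distance, and unmatched detections (false positives) or unmatched subjects (misses) incur a fixed penalty. The function name, the greedy matching, and the penalty value are illustrative assumptions, not the exact metric from the project.

```python
import numpy as np

def pose_error(est, gt, penalty=100.0):
    """Identity-maintained pixel-distance error for one frame.

    est, gt: lists of 2D pixel locations, one per detected / annotated
             person (e.g. a torso centroid). Detections are greedily
             matched to the nearest unclaimed ground-truth subject;
             unmatched detections and undetected subjects each add a
             fixed pixel penalty.
    """
    est = [np.asarray(e, float) for e in est]
    gt = [np.asarray(g, float) for g in gt]
    # All candidate matches, sorted by pixel distance.
    pairs = sorted((float(np.linalg.norm(e - g)), i, j)
                   for i, e in enumerate(est)
                   for j, g in enumerate(gt))
    used_e, used_g, total = set(), set(), 0.0
    for d, i, j in pairs:
        if i not in used_e and j not in used_g:
            used_e.add(i)
            used_g.add(j)
            total += d
    # Penalize false detections and missed subjects.
    total += penalty * ((len(est) - len(used_e)) + (len(gt) - len(used_g)))
    return total
```

For example, with two well-matched detections and one spurious third detection, the score is the two matching distances plus one penalty term.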

Human tracking is typically formulated as a Bayesian filtering problem and solved with a Particle Filter (PF). In a PF, the posterior is approximated by a set of weighted samples (particles) and computed recursively. In this work, we focus on developing a dynamics-based temporal prior contributing to the posterior, as opposed to the first- or second-order linear dynamical system with Gaussian noise that is often adopted for lack of more realistic priors. We assume that, for simulating the dynamics of the scene, the segment shapes, mass properties, collision geometries, and other associated parameters (e.g., the direction of gravity) are known and remain constant throughout the motion sequence. We also model the human as a loop-free articulated structure. A Bayesian filtering technique such as the PF or the Annealed Particle Filter will finally be employed with the proposed dynamics-based prior.
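The recursive weighted-particle update above can be sketched as follows. The `dynamics_prior` argument stands in for the physics-based temporal prior (in place of the usual linear-Gaussian model), and `likelihood` for the image-observation model; both names, and the systematic structure of the step, are illustrative rather than the project's exact implementation.

```python
import numpy as np

def particle_filter_step(particles, weights, dynamics_prior, likelihood, rng):
    """One recursive update of the weighted-particle posterior.

    particles: (n, d) array of pose-state hypotheses.
    weights:   (n,) normalized particle weights.
    dynamics_prior: maps one state to its predicted next state.
    likelihood:     scores one predicted state against the current frame.
    """
    n = len(particles)
    # Resample in proportion to the current weights.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Propagate each particle through the dynamics-based prior.
    particles = np.array([dynamics_prior(p) for p in particles])
    # Re-weight by the image likelihood and normalize.
    weights = np.array([likelihood(p) for p in particles])
    weights = weights / weights.sum()
    return particles, weights
```

Annealed Particle Filtering would wrap several such updates per frame with a progressively sharpened likelihood.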

Figure 2: Summary of optimization framework implemented for pose estimation on each frame


 Research Issue - Human Heading Estimation from Videos

We model the heading estimation task independently of the features and types of the individual estimators, and focus on optimally fusing the information from all available estimators. Hence, we propose a Product of Heading Experts (PoHE) based generalized heading estimation framework, which probabilistically merges heading outputs from a time-varying number of estimators to produce robust heading estimates under varied conditions in unconstrained scenarios. Further, we developed a novel generative model for estimating the heading direction of the subject in the video using motion-based cues, thus significantly reducing the pose search space.
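A product-of-experts fusion over a circular quantity like heading can be sketched by treating each expert's output as a von Mises density: the product of von Mises densities is again von Mises, with parameters given by the resultant of the concentration-weighted unit vectors. Representing each expert as a `(mean, concentration)` pair is an illustrative assumption; the paper's PoHE formulation may differ in its expert models.

```python
import math

def fuse_headings(experts):
    """Fuse heading experts via a product of von Mises densities.

    experts: list of (mu, kappa) pairs -- heading mean in radians and
             concentration (confidence). The list length may vary per
             frame, matching the time-varying number of estimators.
    Returns the fused heading mean and fused concentration.
    """
    # Product of exp(kappa_i * cos(theta - mu_i)) terms collapses to a
    # single von Mises whose parameters come from the resultant vector.
    cx = sum(k * math.cos(mu) for mu, k in experts)
    cy = sum(k * math.sin(mu) for mu, k in experts)
    return math.atan2(cy, cx), math.hypot(cx, cy)
```

Two equally confident experts at 0 and pi/2 fuse to pi/4, while a higher-concentration expert pulls the fused heading toward itself.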



 Research Issue - Pose Estimation and Optimization

To tackle this complex human pose estimation problem, we adopted a sequential optimization framework that determines the uncoupled pose states (camera/body location, body joint angles) separately, using a combination of deterministic and probabilistic optimization approaches to leverage the advantages of each. By implementing this probabilistic-deterministic optimization scheme, faster convergence to the global minimum was achieved. Initial guesses were estimated using a population-based global optimization technique and then refined by a deterministic convex optimization scheme. Finally, we introduced the notion of pose evaluation for videos with multiple humans, quantitatively evaluating the (optimal) pose estimates via identity-maintained pose evaluation metrics.
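The two-stage scheme above can be sketched as a probabilistic population search that seeds a deterministic local descent. The finite-difference gradient step, the population size, and the step size are illustrative stand-ins for the project's actual global and convex optimizers.

```python
import numpy as np

def optimize_pose(cost, bounds, rng, pop_size=200, iters=100, step=0.05):
    """Two-stage optimization: a population-based (probabilistic) global
    search supplies the initial guess for a deterministic local descent."""
    lo, hi = bounds
    # Stage 1 (probabilistic): sample a population of candidate poses
    # within bounds and keep the best-scoring member as the initial guess.
    pop = rng.uniform(lo, hi, size=(pop_size, len(lo)))
    x = pop[np.argmin([cost(p) for p in pop])]
    # Stage 2 (deterministic): finite-difference gradient descent from
    # that initial guess, clipped to the feasible region.
    for _ in range(iters):
        g = np.array([(cost(x + step * e) - cost(x - step * e)) / (2 * step)
                      for e in np.eye(len(x))])
        x = np.clip(x - step * g, lo, hi)
    return x
```

On a convex cost, the population stage places the start close to the basin of the global minimum and the descent stage then converges quickly, which is the advantage the combination is meant to capture.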


Figure 3: Manually Annotated Human Markers with Image Overlay


 Movies - Human Annotation GUI

- Tutorial for the MATLAB-based manual human annotation application developed to obtain ground-truth pose data from the video datasets.




 Movies - Motion Detection based on Bounding Box (Results)

 Students Involved:

- Priyanshu Agarwal, MS Student, University at Buffalo [Graduated]

- Suren Kumar, PhD Student, University at Buffalo


 Future Goals - Kinematic and Dynamic Schemes:

How much physics?

What level of abstraction is effective for accurate tracking of the human pose?


Do physics-based models for vision need to be as robust and rich as those in robotics?


How to do simultaneous inference of pose and other unknown environmental parameters in physics-based models?

Role of learning & mocap:

What is the role of learning and motion capture data for building useful priors over physics-based models?



*This project is funded by the Defense Advanced Research Projects Agency (DARPA).

New: visit the ARMLAB YouTube Channel for research progress.


 Related Publications - Conference Proceedings:
[01] P. Agarwal, S. Kumar, J. Ryde, J. Corso, and V. Krovi. Estimating Human Dynamics On-the-fly Using Monocular Video for Pose Estimation. Robotics: Science and Systems Conference, University of Sydney, Sydney, Australia, July 9-13, 2012. [PDF]
[02] P. Agarwal, S. Kumar, J. Corso, and V. Krovi. Estimating Dynamics On-the-fly Using Monocular Video. Dynamic Systems and Control Conference, California, October 12-14, 2011. [PDF]


 Related Publications - Theses
[01] P. Agarwal, Dynamics-based Human Pose Estimation Using Monocular Vision, M.S. Thesis, Department of Mechanical & Aerospace Engineering, SUNY at Buffalo, Jun 2012. [PDF]


For questions or comments regarding the website, please contact the webmaster.

Last Updated: April 21, 2012