
ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling

Published 9 Oct 2024 in cs.GR, cs.AI, cs.CV, and cs.LG | arXiv:2410.06963v2

Abstract: This paper introduces ELMO, a real-time upsampling motion capture framework designed for a single LiDAR sensor. Modeled as a conditional autoregressive transformer-based upsampling motion generator, ELMO achieves 60 fps motion capture from a 20 fps LiDAR point cloud sequence. The key feature of ELMO is the coupling of the self-attention mechanism with thoughtfully designed embedding modules for motion and point clouds, significantly elevating the motion quality. To facilitate accurate motion capture, we develop a one-time skeleton calibration model capable of predicting user skeleton offsets from a single-frame point cloud. Additionally, we introduce a novel data augmentation technique utilizing a LiDAR simulator, which enhances global root tracking to improve environmental understanding. To demonstrate the effectiveness of our method, we compare ELMO with state-of-the-art methods in both image-based and point cloud-based motion capture. We further conduct an ablation study to validate our design principles. ELMO's fast inference time makes it well-suited for real-time applications, exemplified in our demo video featuring live streaming and interactive gaming scenarios. Furthermore, we contribute a high-quality LiDAR-mocap synchronized dataset comprising 20 different subjects performing a range of motions, which can serve as a valuable resource for future research. The dataset and evaluation code are available at https://movin3d.github.io/ELMO_SIGASIA2024/
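The core idea in the abstract, a conditional autoregressive generator that emits three 60 fps pose frames per 20 fps LiDAR frame, can be sketched in a few lines. The sketch below is purely illustrative and is not the authors' architecture: the feature dimensions, context length, and network sizes are all assumptions, and the paper's dedicated motion and point-cloud embedding modules are replaced with plain linear layers.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- assumptions for illustration, not from the paper.
POINT_FEAT_DIM = 64   # per-frame point-cloud embedding size (assumed)
POSE_DIM = 72         # per-frame pose vector size (assumed)
UPSAMPLE_FACTOR = 3   # 20 fps LiDAR input -> 60 fps motion output

class UpsamplingMotionGenerator(nn.Module):
    """Toy conditional autoregressive upsampler: predicts the next pose
    frame from point-cloud embeddings and a short history of past poses."""
    def __init__(self, d_model=128, nhead=4, nlayers=2):
        super().__init__()
        self.point_embed = nn.Linear(POINT_FEAT_DIM, d_model)
        self.pose_embed = nn.Linear(POSE_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, POSE_DIM)

    def forward(self, point_feats, prev_poses):
        # point_feats: (B, T_lidar, POINT_FEAT_DIM) at 20 fps
        # prev_poses:  (B, T_ctx, POSE_DIM) previously generated poses
        cond = self.point_embed(point_feats)
        ctx = self.pose_embed(prev_poses)
        h = self.encoder(torch.cat([cond, ctx], dim=1))
        return self.head(h[:, -1])  # next 60 fps pose frame

def upsample(model, point_feats, seed_pose):
    """Autoregressively generate UPSAMPLE_FACTOR poses per LiDAR frame."""
    poses = [seed_pose]
    with torch.no_grad():
        for t in range(point_feats.shape[1]):
            for _ in range(UPSAMPLE_FACTOR):
                ctx = torch.stack(poses[-4:], dim=1)  # short pose history
                poses.append(model(point_feats[:, : t + 1], ctx))
    return torch.stack(poses[1:], dim=1)  # (B, T_lidar * 3, POSE_DIM)
```

Running `upsample` on a 4-frame LiDAR feature sequence yields 12 pose frames, i.e. a 3x temporal upsampling of the input rate, which mirrors the 20 fps to 60 fps relationship described in the abstract.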

