Yuxiang Guo

307B Clark Hall

3400 N Charles St Baltimore, MD 21218

yguo87 at jhu dot edu

Hi! I’m Yuxiang! I’m currently a final-year PhD student at AIEM, advised by Prof. Rama Chellappa. I earned an MS degree from the McCormick School of Engineering, Northwestern University, in 2021 and a BS degree from a joint program hosted by the University of Electronic Science and Technology of China and the University of Glasgow.

My research lies in Machine Learning and Computer Vision, with a particular emphasis on Human-centered AI, Video Understanding, Multi-modal Large Language Models, and 3D Reconstruction. I develop machines to perceive the world from observations, analyze and reason about these perceptions through the knowledge encoded in MLLMs, and generate informed, context-aware responses. My recent work further considers temporal dynamics and focuses on building explainable models that generalize effectively to real-world scenarios.

I had a wonderful experience as a Research Intern at (Spring 2024) mentored by Dr. Shao-Yuan Lo; a Research Intern at mentored by Dr. Jiang Liu (Summer 2025) and a student researcher at (Fall 2025) hosted by Cheng Zhong.

I am on the job market and actively seeking full-time research scientist/engineer opportunities starting in 2026!

selected publications

IJCB2024
Distillation-guided Representation Learning for Unconstrained Gait Recognition

Yuxiang Guo, Siyuan Huang, Ram Prabhakar, and 3 more authors

In 2024 IEEE International Joint Conference on Biometrics (IJCB), 2024

IAPR Best Biometrics Student Paper Award

Gait recognition holds the promise of robustly identifying subjects based on walking patterns instead of appearance information. While previous approaches have performed well for curated indoor data, they tend to underperform in unconstrained situations, e.g., in outdoor, long-distance scenes. We propose a framework, termed GAit DEtection and Recognition (GADER), for human authentication in challenging outdoor scenarios. Specifically, GADER leverages a Double Helical Signature to detect segments that contain human movement and builds discriminative features through a novel gait recognition method, where only frames containing gait information are used. To further enhance robustness, GADER encodes viewpoint information in its architecture, and distills representation from an auxiliary RGB recognition model, which enables GADER to learn from silhouette and RGB data at training time. At test time, GADER only infers from the silhouette modality. We evaluate our method on multiple state-of-the-art (SoTA) gait baselines and demonstrate consistent improvements on indoor and outdoor datasets, especially with a significant 25.2% improvement on unconstrained, remote gait data.
@inproceedings{10744527,
  author = {Guo, Yuxiang and Huang, Siyuan and Prabhakar, Ram and Lau, Chun Pong and Chellappa, Rama and Peng, Cheng},
  booktitle = {2024 IEEE International Joint Conference on Biometrics (IJCB)},
  title = {Distillation-guided Representation Learning for Unconstrained Gait Recognition},
  year = {2024},
  pages = {1--11},
  keywords = {Training;Representation learning;Legged locomotion;Pipelines;Detectors;Information leakage;Feature extraction;Robustness;Gait recognition;Standards},
  doi = {10.1109/IJCB62174.2024.10744527}
}
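The cross-modal distillation mentioned in the GADER abstract above can be pictured with a toy loss. This is a minimal sketch, not the paper's implementation: the cosine-distance form of the loss, the weighting, and every function name here are assumptions made for illustration. The idea it shows is that a silhouette (student) embedding is pulled toward an RGB (teacher) embedding during training, while only the student would be needed at test time.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def distillation_loss(student_feat, teacher_feat):
    """Hypothetical feature-distillation term: 1 - cosine similarity.

    It is 0 when the silhouette (student) feature points in the same
    direction as the RGB (teacher) feature and grows as they diverge.
    """
    return 1.0 - cosine_similarity(student_feat, teacher_feat)


def total_loss(recognition_loss, student_feat, teacher_feat, weight=0.5):
    """Recognition loss plus a weighted distillation term (weight is assumed)."""
    return recognition_loss + weight * distillation_loss(student_feat, teacher_feat)


# Toy features: an aligned pair and an orthogonal pair.
aligned = distillation_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = distillation_loss([1.0, 0.0], [0.0, 1.0])
```

In this toy setup the aligned pair incurs no penalty and the orthogonal pair incurs the full penalty of 1, which is then scaled by `weight` before being added to the recognition loss.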
WACV2025
GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation

Yuxiang Guo, Anshul Shah, Jiang Liu, and 3 more authors

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

Gait recognition holds the promise to robustly identify subjects based on walking patterns instead of appearance information. In recent years, this field has been dominated by learning methods based on two principal input representations: dense silhouette masks or sparse pose keypoints. In this work, we propose a novel, point-based Contour-Pose representation, which compactly expresses both body shape and body parts information. We further propose a local-to-global architecture, called GaitContour, to leverage this novel representation and efficiently compute subject embedding in two stages. The first stage consists of a local transformer that extracts features from five different body regions. The second stage then aggregates the regional features to estimate a global human gait representation. Such a design significantly reduces the complexity of the attention operation and improves efficiency and performance simultaneously. Through large scale experiments, GaitContour is shown to perform significantly better than previous point-based methods, while also being significantly more efficient than silhouette-based methods. On challenging datasets with significant distractors, GaitContour can even outperform silhouette-based methods.
@inproceedings{guo2023gaitcontour,
  title = {GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation},
  author = {Guo, Yuxiang and Shah, Anshul and Liu, Jiang and Gupta, Ayush and Chellappa, Rama and Peng, Cheng},
  booktitle = {2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year = {2025}
}
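The GaitContour abstract above says the local-to-global design reduces the complexity of the attention operation. A rough way to see why is to count pairwise attention scores. The sketch below is only an illustration under stated assumptions: the point count of 100, the equal split across the five body regions, and the use of a single token per region in the second stage are hypothetical and not taken from the paper.

```python
def full_attention_pairs(num_points):
    """Pairwise scores when every point attends to every other point."""
    return num_points * num_points


def local_to_global_pairs(region_sizes):
    """Pairwise scores for a two-stage scheme.

    Stage 1: attention runs independently inside each region.
    Stage 2: one token per region (an assumption) attends to the others.
    """
    local = sum(n * n for n in region_sizes)
    num_regions = len(region_sizes)
    return local + num_regions * num_regions


# Hypothetical input: 100 points split evenly over five body regions.
regions = [20, 20, 20, 20, 20]
full = full_attention_pairs(sum(regions))   # 100 * 100 = 10000
two_stage = local_to_global_pairs(regions)  # 5 * 400 + 25 = 2025
```

Under these assumptions the two-stage scheme computes roughly a fifth of the pairwise scores of full attention, and the gap widens as the number of points grows.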
IJCV
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models

Yuxiang Guo, Faizan Siddiqui, Yang Zhao, and 2 more authors
Posted by JHU Whiting School
International Journal of Computer Vision, 2024

Predicting and reasoning how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models (MLLMs) have shown impressive video understanding capabilities, they tend to focus more on the semantic content of videos, often overlooking emotional stimuli. Hence, most existing MLLMs fall short in estimating viewers’ emotional reactions and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning (VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness involves sampling video frames with events that are most likely to evoke viewers’ emotions. Token-level awareness performs tube selection in the token space to make the MLLM concentrate on emotion-triggered spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering MLLMs’ reasoning strengths towards emotional focus and thereby enhancing their affective reasoning ability. To thoroughly assess the effectiveness of VAR, we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments demonstrate its superiority in understanding viewers’ emotional responses to videos and providing coherent and insightful explanations.
@article{guo2024stimuvar,
  title = {{StimuVAR}: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models},
  author = {Guo, Yuxiang and Siddiqui, Faizan and Zhao, Yang and Chellappa, Rama and Lo, Shao-Yuan},
  journal = {International Journal of Computer Vision},
  year = {2024}
}
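The frame-level awareness described in the StimuVAR abstract above amounts to keeping the frames most likely to evoke emotion. The sketch below illustrates that selection step only. The per-frame scores are hypothetical inputs, since the abstract does not say how they are computed, and the function name and the top-k rule are assumptions for illustration rather than the paper's sampling procedure.

```python
def sample_stimuli_frames(frame_scores, k):
    """Return indices of the k highest-scoring frames, in temporal order.

    frame_scores: one hypothetical "likely to evoke emotion" score per frame.
    """
    if k <= 0:
        return []
    # Rank frame indices from highest to lowest score.
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i],
                    reverse=True)
    # Keep the top k, then restore temporal order for the downstream model.
    return sorted(ranked[:k])


# Toy scores for a five-frame clip.
scores = [0.1, 0.9, 0.2, 0.8, 0.3]
selected = sample_stimuli_frames(scores, 2)
```

Restoring temporal order after ranking matters because the selected frames would still be consumed as a video sequence.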
CVPR2025
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction

Yutao Tang*, Yuxiang Guo*, Deming Li, and 1 more author

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Recent efforts in Gaussian-Splat-based Novel View Synthesis can achieve photorealistic rendering; however, such capability is limited in sparse-view scenarios due to sparse initialization and over-fitting floaters. Recent progress in depth estimation and alignment can provide dense point cloud with few views; however, the resulting pose accuracy is suboptimal. In this work, we present SPARS3R, which combines the advantages of accurate pose estimation from Structure-from-Motion and dense point cloud from depth estimation. To this end, SPARS3R first performs a Global Fusion Alignment process that maps a prior dense point cloud to a sparse point cloud from Structure-from-Motion based on triangulated correspondences. RANSAC is applied during this process to distinguish inliers and outliers. SPARS3R then performs a second, Semantic Outlier Alignment step, which extracts semantically coherent regions around the outliers and performs local alignment in these regions. Along with several improvements in the evaluation process, we demonstrate that SPARS3R can achieve photorealistic rendering with sparse images and significantly outperforms existing approaches.
@inproceedings{tang2024spars3r,
  title = {SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction},
  author = {Tang, Yutao and Guo, Yuxiang and Li, Deming and Peng, Cheng},
  booktitle = {2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2025}
}
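The SPARS3R abstract above describes aligning one point set to another from correspondences, with RANSAC separating inliers from outliers. The sketch below shows that general pattern on a toy problem and is not the paper's method: it works in 2D rather than 3D, represents points as complex numbers so that a similarity transform is simply `dst = a * src + b`, and the threshold, iteration count, and all function names are assumptions for illustration.

```python
import random


def fit_similarity(src, dst):
    """Least-squares 2D similarity dst ~ a * src + b (points are complex)."""
    n = len(src)
    mean_s = sum(src) / n
    mean_d = sum(dst) / n
    num = sum((s - mean_s).conjugate() * (d - mean_d) for s, d in zip(src, dst))
    den = sum(abs(s - mean_s) ** 2 for s in src)
    a = num / den              # encodes rotation and scale
    b = mean_d - a * mean_s    # encodes translation
    return a, b


def ransac_similarity(src, dst, threshold=0.05, iterations=200, seed=0):
    """Estimate a similarity with RANSAC; return (a, b, inliers, outliers)."""
    rng = random.Random(seed)
    indices = range(len(src))
    best_inliers = []
    for _ in range(iterations):
        i, j = rng.sample(indices, 2)      # minimal sample: two pairs
        if src[i] == src[j]:
            continue                        # degenerate sample, skip it
        a, b = fit_similarity([src[i], src[j]], [dst[i], dst[j]])
        inliers = [k for k in indices
                   if abs(a * src[k] + b - dst[k]) < threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("not enough inliers to fit a transform")
    # Refit on all inliers of the best hypothesis.
    a, b = fit_similarity([src[k] for k in best_inliers],
                          [dst[k] for k in best_inliers])
    inlier_set = set(best_inliers)
    outliers = [k for k in indices if k not in inlier_set]
    return a, b, best_inliers, outliers


# Toy data: rotate by 90 degrees, scale by 2, shift by (1, 1).
true_a, true_b = 2j, 1 + 1j
src_points = [0j, 1 + 0j, 1j, 1 + 1j, 2 + 0.5j, 0.5 + 2j, 3 + 3j, -1 + 2j]
dst_points = [true_a * s + true_b for s in src_points]
dst_points[6] += 4 - 3j   # corrupt two correspondences
dst_points[7] += 8 + 6j
a_est, b_est, inlier_ids, outlier_ids = ransac_similarity(src_points, dst_points)
```

On this toy data the two corrupted correspondences end up in `outlier_ids`, which is the kind of set a second, local alignment step could then work on.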
ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning

Yuxiang Guo*, Jiang Liu*, Ze Wang, and 7 more authors

2025

The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a “look-think-predict” paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric. Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality—achieving an improvement of 10% over scalar-based reward models.
@misc{Guo2025ImageDoctorDT,
  title = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
  author = {Guo, Yuxiang and Liu, Jiang and Wang, Ze and Chen, Hao and Sun, Ximeng and Zhao, Yang and Wu, Jialian and Yu, Xiaodong and Liu, Zicheng and Barsoum, Emad},
  year = {2025}
}