Incoming PhD Student · Multimodal AI · Embodied Intelligence

Bohan Hou 侯博涵

I work on multimodal models, information retrieval, embodied intelligence, and robot learning.

Multimodal Models · Information Retrieval · Embodied Intelligence · Robot Learning

Short Bio

A concise research-focused introduction.

I am an incoming PhD student at Nanyang Technological University (NTU), where I will be advised by Prof. Jianfei Yang. I received my undergraduate education at Shandong University, where I studied Computer Science at Taishan College under the guidance of Prof. Xuemeng Song. I have also been a research assistant at the Intelligent Media Research Center (iLearn), led by Prof. Liqiang Nie. My research interests include multimodal models, information retrieval, embodied intelligence, robot learning, and related areas.

Experience

Research appointments and internships.

Nov. 2025 — Present

Alibaba DAMO Academy

Research Intern · Embodied Intelligence and World Model

Mentors: Dr. Jun Cen, Ronghao Dang, and Dr. Xin Li.

I also work closely with my talented colleagues Jiayan Guo (on agents) and Siteng Huang (on VLA models).

Sep. 2022 — Present

iLearn Lab, Shandong University

Research Assistant · Multimodal Large Language Models and Information Retrieval

Supervisors: Prof. Xuemeng Song and Prof. Liqiang Nie.

Education

Academic training and degree programs.

Incoming

Nanyang Technological University

Incoming PhD Student

Advised by Prof. Jianfei Yang. Research directions: multimodal learning, MLLM agents, and embodied intelligence.

Sep. 2022 — Jun. 2026

Shandong University

B.Eng. in Computer Science and Technology

Taishan College, First-Honor Program. GPA: 88.5 / 100; Rank: 5 / 22.

First-Author Publications


arXiv 2026 · Agentic Search

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li, Xuemeng Song, and Jianfei Yang.

A benchmark for interleaved language-vision agentic search with active visual evidence seeking and open-web multimodal search.

arXiv 2026 · Survey

World Model for Robot Learning: A Comprehensive Survey

Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, and Jianfei Yang.

A robotics-centered survey of world models for policy learning, planning, learned simulation, evaluation, and video generation.

arXiv 2026 · Embodied Foundation Models

RynnBrain: Open Embodied Foundation Models

Ronghao Dang*, Jiayan Guo*, Bohan Hou*, Sicong Leng*, Kehan Li*, Xin Li*, et al.

An open embodied foundation model family for egocentric understanding, spatiotemporal localization, physical reasoning, and planning.

SIGIR 2025 · Conference · CCF-A / CORE-A

FiRE: Enhancing MLLMs with Fine-Grained Context Learning for Complex Image Retrieval

Bohan Hou, Haoqiang Lin, Xuemeng Song, Haokun Wen, Meng Liu, Yupeng Hu, and Xiangyu Zhao.

Fine-grained context learning for complex image retrieval with multimodal large language models.

WWW 2025 Challenge · Oral · CCF-A

Visual Anchor Point for Multimodal Document Retrieval

Bohan Hou, Haokun Wen, Haoqiang Lin, Xuemeng Song, and Liqiang Nie.

Winner of the WWW’25 Multimodal Document Retrieval Challenge.

IJCNN 2025 · Conference · CORE-A

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Mingzhu Xu, and Xuemeng Song.

A pseudo-triplet guided scheme for improving few-shot composed image retrieval.

Co-Author Publications


NeurIPS 2025 · Conference · CCF-A

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan.

A unified image editing dataset, benchmark, and model for high-quality single-turn and multi-turn image editing.

TOIS · Journal · CCF-A

A Comprehensive Survey on Composed Image Retrieval

Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, and Liqiang Nie.

A comprehensive survey of composed image retrieval, covering supervised, zero-shot, and related multimodal retrieval settings.

Selected Project

Representative open-source effort.

Embodied-AI-Guide

★ 13.6k stars

A popular Chinese-language knowledge base and resource index for embodied intelligence, organized in an encyclopedia-style format. It covers robotics foundations, robot learning, simulators, benchmarks, datasets, multimodal models, VLA models, navigation, and related embodied-AI research directions.

Honors & Misc.

Compact highlights; link to CV for the full list.

Selected Honors

  • Champion, WWW’25 Multimodal Document Retrieval Challenge.
  • Shandong University Outstanding Student Scholarship, 2023–2025.
  • Shandong University Academic Innovation Scholarship, 2025.
  • Second Prize, National Undergraduate Mathematics Competition, 2024.

Links