**Yiming Wang$^{2, \star, \spadesuit}$ Da Yin$^{1, \star, \spadesuit, \heartsuit}$ Yuedong Cui$^{1, \star}$ Ruichen Zheng$^{1, \star}$ Zhiqian Li$^1$**
**Zongyu Lin$^1$ Di Wu$^1$ Xueqing Wu$^1$ Chenchen Ye$^1$ Yu Zhou$^1$ Kai-Wei Chang$^1$**
$^1$UCLA $^2$Harvard University $^\star$Co-First Authors $^\spadesuit$Co-Lead$_{\textrm{Alphabetical Order}}$ $^\heartsuit$Equal Advising
Last Updated on Oct 16, 2025 | 📄: arXiv | GitHub | 🤗: Hugging Face
We welcome questions and feedback, and are open to discussion and collaboration! Please feel free to contact us at [email protected] and [email protected].
TL;DR
Can we train UI agents with only a small amount of experience in real environments, or even without any?
Figure 1: Performance highlights of UI-Simulator and UI-Simulator-Grow, which it empowers. In particular, UI-Simulator can outperform the same data collection process run on real-world environments, and UI-Simulator-Grow delivers a more rapid scaling trend than UI-Simulator.
We observe that most digital UI environments, including web, mobile, and desktop, can be represented as structured textual accessibility trees. Pre-training on front-end code and procedural knowledge makes LLMs well suited as backbone models for synthesizing plausible UI states and the state transitions triggered by user actions.
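As a concrete illustration, here is a minimal Python sketch of how a UI state might be serialized into the kind of indented textual accessibility tree a simulator could consume. The node roles, labels, and rendering format are illustrative assumptions, not the project's actual schema.

```python
# Minimal sketch: serialize a UI state as an indented textual accessibility tree.
# Node roles, names, and the exact text format are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List


@dataclass
class A11yNode:
    role: str                                   # e.g. "button", "textbox", "link"
    name: str = ""                              # accessible name / visible label
    children: List["A11yNode"] = field(default_factory=list)

    def render(self, depth: int = 0) -> str:
        """Flatten the tree into the indented textual form an LLM would read."""
        line = ("  " * depth + f"[{self.role}] {self.name}").rstrip()
        return "\n".join([line] + [c.render(depth + 1) for c in self.children])


# Example: a tiny fragment of a shopping page rendered as an accessibility tree.
page = A11yNode("RootWebArea", "One Stop Market", [
    A11yNode("searchbox", "Search products"),
    A11yNode("button", "Search"),
    A11yNode("link", "My Cart (2 items)"),
])
print(page.render())
```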
Each transition follows a multi-step pipeline that guides the world simulator to anticipate the outcome of an action, infer a coherent and diverse next state, and render it into a structured format.
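Below is a rough sketch of what such a pipeline could look like in code, assuming a generic `llm(prompt) -> str` completion function; the prompt wording and step boundaries are our own illustration rather than UI-Simulator's actual prompts.

```python
# Rough sketch of the multi-step transition pipeline: anticipate the outcome,
# infer the next state, then render it into a structured accessibility tree.
# `llm` is an assumed generic text-completion callable, not a specific API.

from typing import Callable


def simulate_transition(llm: Callable[[str], str],
                        current_state: str,
                        action: str) -> str:
    """Return the next UI state as a textual accessibility tree."""
    # Step 1: anticipate the high-level outcome of the user action.
    outcome = llm(
        f"Current UI state:\n{current_state}\n\n"
        f"User action: {action}\n"
        "Briefly describe what should happen after this action."
    )

    # Step 2: infer a coherent (and diverse) next state in free-form text.
    next_state_draft = llm(
        f"Current UI state:\n{current_state}\n\n"
        f"Action: {action}\nExpected outcome: {outcome}\n"
        "Describe the full content of the resulting page."
    )

    # Step 3: render the draft into the structured accessibility-tree format.
    next_state = llm(
        "Rewrite the following page description as an indented "
        f"accessibility tree:\n{next_state_draft}"
    )
    return next_state
```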