UI-TARS-7B-DPO

Maintained by: bytedance-research

Property      Value
Model Size    7B parameters
Paper         arXiv:2501.12326
Author        ByteDance Research
Model URL     https://huggingface.co/bytedance-research/UI-TARS-7B-DPO

What is UI-TARS-7B-DPO?

UI-TARS-7B-DPO is a 7B-parameter vision-language model for GUI interaction that automates interface navigation and task completion. It integrates perception, reasoning, grounding, and memory into a single model, enabling end-to-end automation without predefined workflows.

Implementation Details

The model uses a unified architecture that processes visual and textual input together to understand and interact with graphical user interfaces. It was trained with Direct Preference Optimization (DPO), which learns from pairs of preferred and rejected outputs, to improve its decision-making and task execution accuracy.

  • Achieves 89.5% average accuracy on ScreenSpot benchmarks
  • Reaches a 67.1% success rate on cross-domain tasks
  • Handles both mobile and desktop interfaces
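
The DPO training mentioned above can be illustrated for a single preference pair. The sketch below implements the standard DPO loss in plain Python; it is not ByteDance's training code. The function names, the example log-probabilities, and the default beta of 0.1 are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state it
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors the chosen response more strongly.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities for two candidate action sequences.
baseline = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy equals reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)  # policy prefers chosen
print(baseline, improved)
```

When the policy and reference agree, the loss equals log 2 (about 0.693); it drops below that value as the policy assigns relatively more probability to the preferred response.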

Core Capabilities

  • Perception: 79.7% accuracy on VisualWebBench
  • Element grounding across different interface types
  • Handling of both text and icon/widget interactions
  • Online and offline task automation
  • Support for multiple platforms, including mobile, desktop, and web interfaces
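
Element grounding is commonly scored by checking whether a predicted click point lands inside the target element's bounding box. The sketch below shows that check under stated assumptions: the model is assumed to emit points in a normalized coordinate space (the default scale of 1000 is an assumption to verify against the model card), and the `Box`, `to_pixels`, and `grounding_accuracy` names are hypothetical helpers, not part of any UI-TARS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Target element bounds in screen pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def to_pixels(point, screen_size, scale=1000):
    """Map a normalized (x, y) point to pixel coordinates.

    `scale` is the assumed upper bound of the normalized range.
    """
    nx, ny = point
    width, height = screen_size
    return (nx / scale * width, ny / scale * height)


def grounding_accuracy(predictions, targets, screen_size, scale=1000):
    """Fraction of predicted points that fall inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for point, box in zip(predictions, targets)
        if box.contains(*to_pixels(point, screen_size, scale))
    )
    return hits / len(predictions)


# Hypothetical predictions on a 1920x1080 screenshot.
predicted = [(500, 500), (0, 0)]
boxes = [Box(900, 500, 1000, 600), Box(100, 100, 200, 200)]
print(grounding_accuracy(predicted, boxes, (1920, 1080)))
```

Keeping the coordinate scale as a parameter makes the same check usable for models that emit absolute pixels (pass the screen width and height as the scale per axis in a variant) or a different normalized range.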

Frequently Asked Questions

Q: What makes this model unique?

UI-TARS-7B-DPO combines perception, reasoning, grounding, and memory in a single model instead of relying on a traditional modular framework. It reports state-of-the-art performance on multiple benchmarks, including ScreenSpot and VisualWebBench, and can handle complex interface interactions without predefined rules.

Q: What are the recommended use cases?

The model is suited to automated GUI testing, cross-platform task automation, interface accessibility enhancement, and the development of intelligent user assistance systems. It performs particularly well in scenarios that require understanding complex interfaces and executing multi-step tasks.
