DINO-X is a unified object-centric vision model with the best open-world object detection performance. To make long-tailed object detection easier, it accepts text prompts, visual prompts, and customized prompts as input. A universal object prompt has been developed to enable prompt-free open-world detection, so the model can detect any object in an image without the user providing a prompt. A large-scale dataset of over 100 million high-quality grounding samples, dubbed Grounding-100M, has been constructed to enhance the model’s open-vocabulary detection performance. DINO-X has also been extended with multiple perception heads, so it can simultaneously support a range of object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, and object-based QA. DINO-X comprises two models:
a. DINO-X Pro: the most capable model, with enhanced perception capabilities for diverse scenarios.
b. DINO-X Edge: an efficient model optimized for faster inference, making it better suited to deployment on edge devices.
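
The two model variants and the four prompting modes described above can be pictured as a small data model. The Python sketch below is illustrative only: every name in it (Variant, PromptKind, Prompt, and the fields text, boxes, and embedding_id) is hypothetical and does not correspond to any published DINO-X interface. It also assumes that a visual prompt is given as one or more boxes and that a customized prompt is referenced by an identifier, neither of which is stated in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

# NOTE: all names here are hypothetical, for illustration only.
# They are not part of any real DINO-X API.


class Variant(Enum):
    PRO = "pro"    # most capable model
    EDGE = "edge"  # efficient model aimed at edge devices


class PromptKind(Enum):
    TEXT = "text"
    VISUAL = "visual"
    CUSTOMIZED = "customized"
    UNIVERSAL = "universal"  # prompt-free: the user supplies nothing


# Assumed box layout (x_min, y_min, x_max, y_max); an assumption of this sketch.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Prompt:
    kind: PromptKind
    text: Optional[str] = None          # used by TEXT prompts
    boxes: Tuple[Box, ...] = ()         # used by VISUAL prompts (assumed form)
    embedding_id: Optional[str] = None  # used by CUSTOMIZED prompts (assumed form)

    def __post_init__(self) -> None:
        # Each kind requires exactly the payload that matches it.
        if self.kind is PromptKind.TEXT and not self.text:
            raise ValueError("a text prompt needs a non-empty string")
        if self.kind is PromptKind.VISUAL and not self.boxes:
            raise ValueError("a visual prompt needs at least one box")
        if self.kind is PromptKind.CUSTOMIZED and not self.embedding_id:
            raise ValueError("a customized prompt needs an identifier")
        if self.kind is PromptKind.UNIVERSAL and (
            self.text or self.boxes or self.embedding_id
        ):
            raise ValueError("the universal prompt takes no user input")
```

In this sketch, Prompt(PromptKind.UNIVERSAL) is the only form that carries no user input, mirroring the prompt-free mode, while Prompt(PromptKind.TEXT, text="dog") would represent a text prompt.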