ViGoR-Bench: How Far Are Visual Generative Models
From Zero-Shot Visual Reasoners?

Haonan Han1,2, Jiancheng Huang2, Xiaopeng Sun2, Junyan He2,‡, Rui Yang3, Jie Hu2, Xiaojiang Peng4,
Lin Ma2, Xiaoming Wei2, Xiu Li1,†
1 Tsinghua University    2 Meituan M17    3 The University of Hong Kong    4 SIAT, Chinese Academy of Sciences
† Corresponding author    ‡ Project lead

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a 'logical desert': systems fail at tasks that require physical, causal, or complex spatial reasoning. Current evaluations rely largely on superficial metrics or fragmented benchmarks, creating a 'performance mirage' that overlooks the generative process itself. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging image-to-image editing and video generation tasks; 2) a dual-track mechanism that evaluates both intermediate processes and final results; 3) an evidence-grounded automated judge with high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical stress test for the next generation of intelligent vision models.
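As a minimal sketch of the dual-track idea (field names here are illustrative assumptions, not the benchmark's actual schema), each sample receives four dimension scores per track, and each track is summarized by its mean:

```python
# Hypothetical sketch of dual-track aggregation: four dimension scores per
# track, each track summarized by an unweighted mean. Dimension keys below
# mirror the leaderboard columns but are otherwise illustrative.

def track_average(scores: dict[str, float]) -> float:
    """Unweighted mean over a track's dimension scores."""
    return sum(scores.values()) / len(scores)

sample = {
    "process": {"BC": 80.0, "RO": 40.0, "VQ": 90.0, "RA": 30.0},  # intermediate steps
    "result":  {"BC": 60.0, "RO": 35.0, "VQ": 75.0, "RS": 25.0},  # final output
}

proc_avg = track_average(sample["process"])  # 60.0
res_avg = track_average(sample["result"])    # 48.75
```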

Overview

Figure 1: Overview of ViGoR-Bench.

Evaluation Pipeline

Figure 2: ViGoR-Bench evaluation pipeline with dual-track assessment.

Leaderboard

Edit models and non-CoT unified models are scored on the result track only; '–' marks the unavailable process track.

| Category | Type | Model | OS | Proc.BC↑ | Proc.RO↑ | Proc.VQ↑ | Proc.RA↑ | Proc.Avg | Res.BC↑ | Res.RO↑ | Res.VQ↑ | Res.RS↑ | Res.Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Edit | Edit | FLUX.1-Kontext-dev | – | – | – | – | – | – | 65.1 | 13.1 | 75.9 | 1.6 | 38.9 |
| Edit | Edit | FLUX.2-dev | – | – | – | – | – | – | 40.0 | 11.5 | 63.8 | 4.2 | 29.9 |
| Edit | Edit | Qwen-Image-Edit-2509 | – | – | – | – | – | – | 52.4 | 14.3 | 60.7 | 1.4 | 32.2 |
| Edit | Edit | Qwen-Image-Edit-2511 | – | – | – | – | – | – | 42.4 | 11.1 | 44.6 | 4.9 | 25.8 |
| Edit | Edit | LongCat-Image-Edit | – | – | – | – | – | – | 53.6 | 11.4 | 60.5 | 3.3 | 32.2 |
| Edit | Edit | Step1X-Edit | – | – | – | – | – | – | 64.6 | 8.2 | 62.9 | 2.8 | 34.6 |
| Edit | Edit | HiDream-E1.1 | – | – | – | – | – | – | 2.8 | 2.6 | 0.4 | 1.0 | 1.7 |
| Edit | Edit | ICEdit | – | – | – | – | – | – | 34.4 | 5.3 | 35.7 | 0.9 | 19.1 |
| Unified | w/o CoT | Bagel | – | – | – | – | – | – | 35.5 | 8.7 | 46.0 | 2.2 | 23.1 |
| Unified | w/o CoT | OmniGen2 | – | – | – | – | – | – | 27.7 | 6.3 | 54.6 | 0.8 | 22.4 |
| Unified | w/o CoT | UniWorld-V1 | – | – | – | – | – | – | 49.0 | 10.9 | 54.1 | 1.8 | 29.0 |
| Unified | w/o CoT | UniPic2-M-9B | – | – | – | – | – | – | 11.7 | 4.7 | 12.6 | 3.1 | 8.0 |
| Unified | w/o CoT | Ovis-U1-3B | – | – | – | – | – | – | 42.0 | 7.0 | 46.9 | 1.2 | 24.3 |
| Unified | w/o CoT | DiMOO | – | – | – | – | – | – | 73.3 | 13.2 | 48.0 | 1.4 | 34.0 |
| Unified | w/o CoT | Seedream 4.0 | – | – | – | – | – | – | 50.6 | 28.7 | 66.6 | 19.9 | 41.5 |
| Unified | w/o CoT | GPT-image-1 | – | – | – | – | – | – | 57.5 | 29.3 | 87.6 | 13.4 | 46.9 |
| Unified | w/o CoT | Nano Banana | – | – | – | – | – | – | 57.0 | 30.9 | 86.7 | 16.3 | 47.7 |
| Unified | w/o CoT | Nano Banana Pro | – | – | – | – | – | – | 70.2 | 62.0 | 95.1 | 46.4 | 68.4 |
| Unified | w/ CoT | Bagel-Think | – | 15.2 | 4.5 | 36.5 | 2.2 | 14.6 | 8.2 | 6.1 | 21.3 | 9.5 | 9.5 |
| Unified | w/ CoT | Zebra-CoT | – | 49.9 | 8.9 | 57.6 | 2.7 | 29.8 | 42.1 | 13.3 | 35.4 | 1.6 | 23.1 |
| Unified | w/ CoT | Uni-CoT | – | 34.2 | 10.1 | 47.6 | 5.3 | 24.3 | 26.3 | 10.3 | 3.1 | 18.0 | 14.4 |
| Unified | w/ CoT | GPT-image-1 (CoT) | – | 67.4 | 35.8 | 80.7 | 32.9 | 54.2 | 27.8 | 25.7 | 52.0 | 19.7 | 31.3 |
| Unified | w/ CoT | Nano Banana (CoT) | – | 82.2 | 40.7 | 92.6 | 37.7 | 63.3 | 63.8 | 36.8 | 79.1 | 23.8 | 50.5 |
| Unified | w/ CoT | Nano Banana Pro (CoT) | – | 86.0 | 58.6 | 90.9 | 52.0 | 72.0 | 66.3 | 54.5 | 83.9 | 40.2 | 61.2 |
| Video Gen | Video | Wan 2.2 | – | 61.5 | 11.9 | 67.0 | 7.4 | 37.0 | 31.2 | 6.4 | 36.7 | 1.1 | 18.9 |
| Video Gen | Video | Kling 1.6 | – | 73.9 | 12.4 | 77.0 | 9.6 | 43.2 | 59.5 | 6.3 | 52.5 | 1.6 | 30.0 |
| Video Gen | Video | Seedance 1.0 Pro | – | 39.5 | 12.1 | 63.5 | 9.2 | 31.1 | 48.0 | 14.0 | 51.5 | 4.8 | 29.6 |
| Video Gen | Video | Veo 3 | – | 69.6 | 36.2 | 85.3 | 31.9 | 55.8 | 33.6 | 15.0 | 25.0 | 8.4 | 20.5 |
| Video Gen | Video | Sora 2 Pro | – | 70.5 | 38.8 | 85.5 | 34.8 | 57.4 | 24.5 | 16.6 | 20.0 | 10.1 | 17.8 |
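For most rows, the Avg columns are consistent with an unweighted mean of the four preceding dimension scores (a few rows deviate slightly, presumably due to rounding); for example, checking the Zebra-CoT process track:

```python
# Sanity check: Zebra-CoT's process-track scores from the leaderboard above.
proc_scores = [49.9, 8.9, 57.6, 2.7]  # Proc.BC, Proc.RO, Proc.VQ, Proc.RA
proc_avg = sum(proc_scores) / len(proc_scores)  # 29.775
assert abs(proc_avg - 29.8) < 0.05  # matches the reported Proc.Avg of 29.8
```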

Qualitative Results: Binary Models

Gallery: nine examples, each showing the input alongside outputs from GPT-image-1, Seedream 4.0, FLUX-Kontext, Step1X-Edit, LongCat-Image-Edit, and Nano Banana Pro.

Qualitative Results: Chain-of-Thought Models

Gallery: nine examples, each showing the input alongside outputs from GPT-image-1 (CoT) and Nano Banana Pro (CoT).

Qualitative Results: Video Generation

Gallery: nine examples, each showing the input alongside videos generated by Kling 1.6, Veo 3, and Sora 2 Pro.

SFT & RL Training on Maze Navigation

Plots: validation metrics during SFT training (left) and RL training (right) on maze navigation.

Test Case Comparison

Grids: side-by-side SFT vs. RL outputs on eight maze-navigation test cases (SFT 1–8 vs. RL 1–8).
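The maze-navigation experiments above compare SFT against RL post-training. As a hypothetical illustration of the kind of sparse reward such an RL setup might use (an assumption for exposition, not the paper's implementation), a predicted move sequence can be scored by simulating it on the grid:

```python
# Hypothetical sparse reward for maze navigation: simulate a move sequence on
# a grid (1 = wall, 0 = free) and reward only trajectories reaching the goal.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def maze_reward(grid, start, goal, actions):
    rows, cols = len(grid), len(grid[0])
    r, c = start
    for a in actions:
        dr, dc = MOVES[a]
        nr, nc = r + dr, c + dc
        # Illegal moves (off-grid or into a wall) terminate with zero reward.
        if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
            return 0.0
        r, c = nr, nc
    return 1.0 if (r, c) == goal else 0.0

grid = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(maze_reward(grid, (0, 0), (0, 2), "DDRRUU"))  # 1.0
```

Because the reward is sparse and verifiable, it suits policy-gradient-style RL, whereas SFT would instead imitate reference move sequences token by token.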

Citation

@article{han2025vigor,
  title={ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?},
  author={Han, Haonan and Huang, Jiancheng and Sun, Xiaopeng and He, Junyan and Yang, Rui and Hu, Jie and Peng, Xiaojiang and Ma, Lin and Wei, Xiaoming and Li, Xiu},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}