ViGoR-Bench: How Far Are Visual Generative Models
From Zero-Shot Visual Reasoners?

Haonan Han1,2, Jiancheng Huang2, Xiaopeng Sun2, Junyan He2,‡, Rui Yang3, Jie Hu2, Xiaojiang Peng4,
Lin Ma2, Xiaoming Wei2, Xiu Li1,†
1 Tsinghua University    2 Meituan M17    3 The University of Hong Kong    4 SIAT, Chinese Academy of Sciences
† Corresponding author    ‡ Project lead

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a 'logical desert': systems fail at tasks that require physical, causal, or complex spatial reasoning. Current evaluations rely largely on superficial metrics or fragmented benchmarks, creating a 'performance mirage' that overlooks the generative process itself. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging image-to-image editing and video generation tasks; 2) a dual-track mechanism that evaluates both intermediate processes and final results; 3) an evidence-grounded automated judge with high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical stress test for the next generation of intelligent vision models.
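As a minimal sketch of the dual-track idea (field names here are illustrative assumptions, not the benchmark's actual schema), each sample receives four dimension scores per track, and each track is summarized by its mean:

```python
# Hypothetical sketch of dual-track aggregation: four dimension scores per
# track, each track summarized by an unweighted mean. Dimension keys below
# mirror the leaderboard columns but are otherwise illustrative.

def track_average(scores: dict[str, float]) -> float:
    """Unweighted mean over a track's dimension scores."""
    return sum(scores.values()) / len(scores)

sample = {
    "process": {"BC": 80.0, "RO": 40.0, "VQ": 90.0, "RA": 30.0},  # intermediate steps
    "result":  {"BC": 60.0, "RO": 35.0, "VQ": 75.0, "RS": 25.0},  # final output
}

proc_avg = track_average(sample["process"])  # 60.0
res_avg = track_average(sample["result"])    # 48.75
```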

Overview

Figure 1: Overview of ViGoR-Bench.

Evaluation Pipeline

Figure 2: ViGoR-Bench evaluation pipeline with dual-track assessment.

Leaderboard

Edit models and non-CoT unified models are scored on the result track only; '–' marks the unavailable process track.

| Category | Type | Model | OS | Proc.BC↑ | Proc.RO↑ | Proc.VQ↑ | Proc.RA↑ | Proc.Avg | Res.BC↑ | Res.RO↑ | Res.VQ↑ | Res.RS↑ | Res.Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Edit | Edit | FLUX.1-Kontext-dev | – | – | – | – | – | – | 65.1 | 13.1 | 75.9 | 1.6 | 38.9 |
| Edit | Edit | FLUX.2-dev | – | – | – | – | – | – | 40.0 | 11.5 | 63.8 | 4.2 | 29.9 |
| Edit | Edit | Qwen-Image-Edit-2509 | – | – | – | – | – | – | 52.4 | 14.3 | 60.7 | 1.4 | 32.2 |
| Edit | Edit | Qwen-Image-Edit-2511 | – | – | – | – | – | – | 42.4 | 11.1 | 44.6 | 4.9 | 25.8 |
| Edit | Edit | LongCat-Image-Edit | – | – | – | – | – | – | 53.6 | 11.4 | 60.5 | 3.3 | 32.2 |
| Edit | Edit | Step1X-Edit | – | – | – | – | – | – | 64.6 | 8.2 | 62.9 | 2.8 | 34.6 |
| Edit | Edit | HiDream-E1.1 | – | – | – | – | – | – | 2.8 | 2.6 | 0.4 | 1.0 | 1.7 |
| Edit | Edit | ICEdit | – | – | – | – | – | – | 34.4 | 5.3 | 35.7 | 0.9 | 19.1 |
| Unified | w/o CoT | Bagel | – | – | – | – | – | – | 35.5 | 8.7 | 46.0 | 2.2 | 23.1 |
| Unified | w/o CoT | OmniGen2 | – | – | – | – | – | – | 27.7 | 6.3 | 54.6 | 0.8 | 22.4 |
| Unified | w/o CoT | UniWorld-V1 | – | – | – | – | – | – | 49.0 | 10.9 | 54.1 | 1.8 | 29.0 |
| Unified | w/o CoT | UniPic2-M-9B | – | – | – | – | – | – | 11.7 | 4.7 | 12.6 | 3.1 | 8.0 |
| Unified | w/o CoT | Ovis-U1-3B | – | – | – | – | – | – | 42.0 | 7.0 | 46.9 | 1.2 | 24.3 |
| Unified | w/o CoT | DiMOO | – | – | – | – | – | – | 73.3 | 13.2 | 48.0 | 1.4 | 34.0 |
| Unified | w/o CoT | Seedream 4.0 | – | – | – | – | – | – | 50.6 | 28.7 | 66.6 | 19.9 | 41.5 |
| Unified | w/o CoT | GPT-image-1 | – | – | – | – | – | – | 57.5 | 29.3 | 87.6 | 13.4 | 46.9 |
| Unified | w/o CoT | Nano Banana | – | – | – | – | – | – | 57.0 | 30.9 | 86.7 | 16.3 | 47.7 |
| Unified | w/o CoT | Nano Banana Pro | – | – | – | – | – | – | 70.2 | 62.0 | 95.1 | 46.4 | 68.4 |
| Unified | w/ CoT | Bagel-Think | – | 15.2 | 4.5 | 36.5 | 2.2 | 14.6 | 8.2 | 6.1 | 21.3 | 9.5 | 9.5 |
| Unified | w/ CoT | Zebra-CoT | – | 49.9 | 8.9 | 57.6 | 2.7 | 29.8 | 42.1 | 13.3 | 35.4 | 1.6 | 23.1 |
| Unified | w/ CoT | Uni-CoT | – | 34.2 | 10.1 | 47.6 | 5.3 | 24.3 | 26.3 | 10.3 | 3.1 | 18.0 | 14.4 |
| Unified | w/ CoT | GPT-image-1 (CoT) | – | 67.4 | 35.8 | 80.7 | 32.9 | 54.2 | 27.8 | 25.7 | 52.0 | 19.7 | 31.3 |
| Unified | w/ CoT | Nano Banana (CoT) | – | 82.2 | 40.7 | 92.6 | 37.7 | 63.3 | 63.8 | 36.8 | 79.1 | 23.8 | 50.5 |
| Unified | w/ CoT | Nano Banana Pro (CoT) | – | 86.0 | 58.6 | 90.9 | 52.0 | 72.0 | 66.3 | 54.5 | 83.9 | 40.2 | 61.2 |
| Video Gen | Video | Wan 2.2 | – | 61.5 | 11.9 | 67.0 | 7.4 | 37.0 | 31.2 | 6.4 | 36.7 | 1.1 | 18.9 |
| Video Gen | Video | Kling 1.6 | – | 73.9 | 12.4 | 77.0 | 9.6 | 43.2 | 59.5 | 6.3 | 52.5 | 1.6 | 30.0 |
| Video Gen | Video | Seedance 1.0 Pro | – | 39.5 | 12.1 | 63.5 | 9.2 | 31.1 | 48.0 | 14.0 | 51.5 | 4.8 | 29.6 |
| Video Gen | Video | Veo 3 | – | 69.6 | 36.2 | 85.3 | 31.9 | 55.8 | 33.6 | 15.0 | 25.0 | 8.4 | 20.5 |
| Video Gen | Video | Sora 2 Pro | – | 70.5 | 38.8 | 85.5 | 34.8 | 57.4 | 24.5 | 16.6 | 20.0 | 10.1 | 17.8 |
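For most rows, the Avg columns are consistent with an unweighted mean of the four preceding dimension scores (a few rows deviate slightly, presumably due to rounding); for example, checking the Zebra-CoT process track:

```python
# Sanity check: Zebra-CoT's process-track scores from the leaderboard above.
proc_scores = [49.9, 8.9, 57.6, 2.7]  # Proc.BC, Proc.RO, Proc.VQ, Proc.RA
proc_avg = sum(proc_scores) / len(proc_scores)  # 29.775
assert abs(proc_avg - 29.8) < 0.05  # matches the reported Proc.Avg of 29.8
```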

Qualitative Results: Binary Models

Gallery: nine examples, each showing the input alongside outputs from GPT-image-1, Seedream 4.0, FLUX-Kontext, Step1X-Edit, LongCat-Image-Edit, and Nano Banana Pro.

Qualitative Results: Chain-of-Thought Models

Gallery: nine examples, each showing the input alongside outputs from GPT-image-1 (CoT) and Nano Banana Pro (CoT).

Qualitative Results: Video Generation

Gallery: nine examples, each showing the input alongside videos generated by Kling 1.6, Veo 3, and Sora 2 Pro.

SFT & RL Training on Maze Navigation

Plots: validation metrics during SFT training (left) and RL training (right) on maze navigation.

Test Case Comparison

Grids: side-by-side SFT vs. RL outputs on eight maze-navigation test cases (SFT 1–8 vs. RL 1–8).
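The maze-navigation experiments above compare SFT against RL post-training. As a hypothetical illustration of the kind of sparse reward such an RL setup might use (an assumption for exposition, not the paper's implementation), a predicted move sequence can be scored by simulating it on the grid:

```python
# Hypothetical sparse reward for maze navigation: simulate a move sequence on
# a grid (1 = wall, 0 = free) and reward only trajectories reaching the goal.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def maze_reward(grid, start, goal, actions):
    rows, cols = len(grid), len(grid[0])
    r, c = start
    for a in actions:
        dr, dc = MOVES[a]
        nr, nc = r + dr, c + dc
        # Illegal moves (off-grid or into a wall) terminate with zero reward.
        if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
            return 0.0
        r, c = nr, nc
    return 1.0 if (r, c) == goal else 0.0

grid = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(maze_reward(grid, (0, 0), (0, 2), "DDRRUU"))  # 1.0
```

Because the reward is sparse and verifiable, it suits policy-gradient-style RL, whereas SFT would instead imitate reference move sequences token by token.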

Citation

@article{han2025vigor,
  title={ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?},
  author={Han, Haonan and Huang, Jiancheng and Sun, Xiaopeng and He, Junyan and Yang, Rui and Hu, Jie and Peng, Xiaojiang and Ma, Lin and Wei, Xiaoming and Li, Xiu},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}