Pix2Fact · When Vision Is Not Enough

When vision is not enough.

A visual question-answering benchmark of 1,000 expert-crafted questions on 4K+ real-world scenes — where every answer demands both fine-grained visual grounding and open-web knowledge search.

01 The gap

The promise of VLMs runs into the messy last mile.

Picture a tour guide spotting a distant mountain and telling you its history. For humans this is automatic — effortlessly linking subtle visual cues with structured facts via broad knowledge. For modern Vision-Language Models, it remains nearly impossible. The challenge is not visual grounding or knowledge search alone, but their tight coupling.

— From the Pix2Fact paper, Section 1

Pixel-level grounding

Locate a small visual detail — a phone number on a sign, a logo on a tent, a book title spine-on — inside a 4K+ scene full of distractors.

Open-world retrieval

Using only that detail, formulate a query, search the open web, navigate to unstructured sources, and extract the right fact.

Their conjunction

Existing benchmarks test one or the other. Pix2Fact tests both at once: what real human assistance requires, and where today's models break.

Figure 6 · Traffic & Infrastructure: highway aerial, mall behind scaffolding.

A real Pix2Fact question

A tourist might naturally ask:

"The building behind the scaffolding is a mall. Which number should we call for leasing inquiries to open a clothing store there?"

Correct answer

(86 10) 6505-6688

What a model has to do: get past the foreground roads, look through the scaffolding, read the sign, identify the specific mall, then navigate its website to the leasing page. General search engines rarely surface that page directly.

Why this is hard · the 5 things a model must do, in order

Locate the right building

Past foreground roads & distractor buildings.

SPATIAL · GROUNDING

Read tiny signage

OCR the mall name from behind scaffolding.

FINE OCR

Identify which mall

Disambiguate among real-world entities.

ENTITY LINKING

Find the official site

Construct a query that surfaces the right page.

QUERY · RETRIEVAL

Extract the contact

Navigate to leasing page, isolate the number.

NAVIGATION · EXTRACT
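
The five steps above form a strict chain: each stage consumes the output of the one before it, so the earliest failure decides the outcome. A minimal Python sketch of that dependency follows. The stage labels come from the list above; the `outcomes` record and the `first_failed_stage` helper are hypothetical illustrations, not part of any Pix2Fact tooling.

```python
from typing import Optional

# Stage labels taken from the five steps above; the order matters.
STAGES = [
    "SPATIAL · GROUNDING",
    "FINE OCR",
    "ENTITY LINKING",
    "QUERY · RETRIEVAL",
    "NAVIGATION · EXTRACT",
]


def first_failed_stage(outcomes: dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that failed, or None if the whole chain succeeded.

    `outcomes` is a hypothetical record mapping stage label -> success flag.
    A stage missing from the record is treated as not reached, i.e. failed.
    """
    for stage in STAGES:
        if not outcomes.get(stage, False):
            return stage
    return None


# Illustrative record (made up): the model found the building but misread the sign.
example = {"SPATIAL · GROUNDING": True, "FINE OCR": False}
print(first_failed_stage(example))  # FINE OCR
```

Attributing each failure to its earliest broken stage is one simple way to bucket errors; an actual audit may well use finer-grained labels.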

04 Failure modes

Models fail — but they fail in different ways.

A 121-case audit reveals heterogeneous bottlenecks. The dominant breakage has shifted upstream — into query construction and retrieval recall, not perception or reasoning.

Claude

87%

Knowledge bottleneck

Search executes correctly, but the long-tail fact never surfaces. The model then produces a confident but slightly wrong answer.

Qwen

63%

Tool-use breakdown

Fails to invoke the search tool at all. A fundamental gap in agentic behavior, not query formulation.

All models

43%

Query construction

The most common search-stage error — queries built on misidentified entities. Only 17% are integration issues.

Cropping

100%

Crop rescue rate

Expert cropping resolves all 18 grounding failures. 72% involve wrong attention, not just low resolution.
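
As a quick check on the figures in this card: a 100% rescue rate means cropping fixed 18 of 18 grounding failures, and 72% of 18 corresponds to roughly 13 cases. The count of 13 is inferred here from the rounded percentage and is an assumption; the source states only the percentage.

```python
grounding_failures = 18  # grounding failures reported in the audit
rescued_by_crop = 18     # "resolves all 18"

rescue_rate = rescued_by_crop / grounding_failures
print(f"{rescue_rate:.0%}")  # 100%

# Assumption: 72% is a rounded share of the 18 cases; the nearest whole count is 13.
wrong_attention = round(0.72 * grounding_failures)
print(wrong_attention, f"{wrong_attention / grounding_failures:.0%}")  # 13 72%
```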

Two real failure cases

Sourced from the paper's deep-dive analysis.

Case 1 · Index 923-1 · Wrong attention

Attention re-localization

On the right side of the image, what brand are the sneakers the woman is wearing?

Pixel-level grounding · what the model has to find

Figure panels: Original scene (C1), high resolution, 4K+ · Expert crop (C3), target region

Without cropping

Model attends to a visually similar but wrong object: "She is wearing black sneakers with a distinctive white 'N' logo on the side, identifying the brand as New Balance."


With expert crop

Attention re-localizes: "the short-haired woman is wearing black sneakers featuring a distinctive white 'swoosh' logo."

Ground truth Nike

Case 2 · Index 415-1 · Confident-but-wrong

Near-miss factual error

For the well-known clothing brand in the picture, how many stores did it have worldwide in fiscal year 2024?

Pixel-level grounding · the logo on the storefront

Figure panels: Original scene (C1), high resolution, 4K+ · Expert crop (C3), target region

Claude's answer

"According to Fast Retailing's annual report for FY2024, Uniqlo had approximately 2,501 stores worldwide, including around 800 in Japan and over 1,700 internationally."


What actually happened

The brand was correctly identified and the evidence correctly retrieved. But the final number is a fabricated, "reasonable-sounding" approximation: the actual figure is 2,495, so the answer of 2,501 is off by 6 stores, which is enough to make it wrong.

Ground truth 2,495 stores
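
To see why a near miss still counts as a miss, consider a strict numeric comparison. The sketch below assumes exact-match grading on the parsed integer; the benchmark's actual scoring rule is not described on this page, and `parse_count` and `is_correct` are hypothetical helpers written for illustration.

```python
import re


def parse_count(text: str) -> int:
    """Extract the first integer from an answer string, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))


def is_correct(predicted: str, ground_truth: str) -> bool:
    """Assumed grading rule: the parsed integers must match exactly."""
    return parse_count(predicted) == parse_count(ground_truth)


predicted = "approximately 2,501 stores worldwide"
ground_truth = "2,495 stores"

print(parse_count(predicted) - parse_count(ground_truth))  # 6
print(is_correct(predicted, ground_truth))                 # False
```

Under this assumed rule a 6-store gap is scored the same as a wildly wrong answer, which is why a plausible approximation does not help.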

Eight scenes of everyday life.

Street scene with people

Storefronts & facades

Retail & commercial interior

Traffic & infrastructure

Markets & outdoor vendors

Public & cultural interior

Landmarks & attractions

Cityscape & aerial

One real question per scenario — try them yourself.

How each (Q, A) is built — industrial-grade quality.

Source

Filter & categorize

Author Q & ground-truth

Three-tier review

Ten frontier VLMs · the ceiling sits at 51.7%.