The promise of VLMs runs into the messy last mile.
Pixel-level grounding
Locate a small visual detail — a phone number on a sign, a logo on a tent, a book title spine-on — inside a 4K+ scene full of distractors.
Open-world retrieval
Using only that detail, formulate a query, search the open web, navigate to unstructured sources, and extract the right fact.
Their conjunction
Existing benchmarks test one or the other. Pix2Fact tests both at once: what real human assistance requires, and where today's models break.
Locate the right building
Past foreground roads & distractor buildings.
Read tiny signage
OCR the mall name from behind scaffolding.
Identify which mall
Disambiguate among real-world entities.
Find the official site
Construct a query that surfaces the right page.
Extract the contact
Navigate to the leasing page and isolate the number.
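The five steps above form a chain in which each step consumes the previous step's output, so one miss anywhere loses the final answer. A minimal Python sketch of that dependency follows. It is an illustration only, not Pix2Fact's evaluation code: `Step`, `run_chain`, the lookup tables, the `mall.example` URL, and the `555-0100` number are all hypothetical stand-ins for real grounding, OCR, and web-search components.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One stage of the chain (hypothetical structure for illustration)."""
    name: str
    # Takes the previous stage's output; returns the next input, or None on failure.
    run: Callable[[str], Optional[str]]


def run_chain(steps: List[Step], initial: str) -> Tuple[Optional[str], Optional[str]]:
    """Run steps in order and return (answer, name_of_failed_step)."""
    value = initial
    for step in steps:
        result = step.run(value)
        if result is None:
            # A single failure anywhere means there is no final answer.
            return None, step.name
        value = result
    return value, None


# Toy lookup tables standing in for a model's perception and web tools.
CROPS = {"scene.jpg": "crop_of_building"}
SIGNS = {"crop_of_building": "EXAMPLE MALL"}
ENTITIES = {"EXAMPLE MALL": "Example Mall (hypothetical)"}
SITES = {"Example Mall (hypothetical)": "https://mall.example/leasing"}
CONTACTS = {"https://mall.example/leasing": "555-0100"}

STEPS = [
    Step("Locate the right building", CROPS.get),
    Step("Read tiny signage", SIGNS.get),
    Step("Identify which mall", ENTITIES.get),
    Step("Find the official site", SITES.get),
    Step("Extract the contact", CONTACTS.get),
]

answer, failed_at = run_chain(STEPS, "scene.jpg")
print(answer, failed_at)
```

Replacing any single step with one that returns `None` makes `run_chain` report that step's name instead of an answer, which is the sense in which the task is a conjunction of grounding and retrieval rather than either skill alone.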








