
When Multimodal Models Go Blind

A technical exploration of why even natively multimodal LLMs struggle with diagram interpretation in documents

Here's how the exchange goes with o4-mini-high:

You hand o4‑mini‑high a technical patent with an embedded IRR vs Frequency graph and ask:

"At what frequency does IRR peak?"

It thinks for 30 seconds and, instead of just reading the chart, hits you with:

"Which page is that on?"

Cue dramatic facepalm. 🤦

o4-mini-high failing to answer a question about a graph

Even after I grumbled "Page 6," it pulled out the Python tool-use gun (my favorite as well) and proclaimed the peak was "the highest point on the line." Technically wrong and hilariously sure of itself.

Additional context doesn't resolve the limitation

Model's unsuccessful self-analysis attempt

Here's how the same exchange goes with Morphik:

We treat each page like one giant image+text puzzle (rough sketch in code after the list):

  1. Snap the whole page as an image (diagrams, tables, doodles included)
  2. Extract text blocks with their exact positions (headings, captions, footnotes)
  3. Blend vision & text embeddings into a multi-vector cocktail 🍹
  4. Retrieve the full region (text+diagram) as a unit—no more orphaned charts
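Here's a minimal sketch of that pipeline, not Morphik's actual code. It uses PyMuPDF for the page image and positioned text blocks, and the two `embed_*` helpers are hypothetical placeholders for whatever multi-vector (late-interaction style) embedder you'd plug in; the random vectors just keep the sketch runnable.

```python
# Page-as-a-unit indexing sketch. NOT Morphik's implementation; the embedders
# below are random stand-ins for a real vision/text multi-vector model.
import fitz  # PyMuPDF: page rendering + positioned text extraction
import numpy as np

DIM = 128  # embedding dimension, arbitrary for this sketch


def embed_image_patches(png_bytes: bytes) -> np.ndarray:
    """Placeholder: a real system would run a vision encoder and return one
    vector per image patch. Here: deterministic random vectors."""
    rng = np.random.default_rng(len(png_bytes))
    return rng.standard_normal((64, DIM))


def embed_text_block(text: str) -> np.ndarray:
    """Placeholder: a real system would run a text encoder and return one
    vector per token or per block."""
    rng = np.random.default_rng(len(text))
    return rng.standard_normal((1, DIM))


def index_page(pdf_path: str, page_number: int) -> dict:
    doc = fitz.open(pdf_path)
    page = doc[page_number]

    # 1. Snap the whole page as an image (diagrams, tables, doodles included).
    pixmap = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for legibility
    page_png = pixmap.tobytes("png")

    # 2. Extract text blocks with their exact positions.
    #    Each block tuple is (x0, y0, x1, y1, text, block_no, block_type).
    blocks = [
        {"bbox": b[:4], "text": b[4]}
        for b in page.get_text("blocks")
        if b[6] == 0  # keep text blocks only
    ]

    # 3. Blend vision & text embeddings into one multi-vector bag per page.
    vectors = np.vstack(
        [embed_image_patches(page_png)]
        + [embed_text_block(b["text"]) for b in blocks]
    )

    # 4. Store the page as one retrievable unit: vectors plus everything
    #    needed to hand the full region (text + diagram) back to the LLM.
    return {"vectors": vectors, "image": page_png, "blocks": blocks}


def maxsim_score(query_vectors: np.ndarray, page_vectors: np.ndarray) -> float:
    """Late-interaction scoring: each query vector keeps its best page match."""
    sims = query_vectors @ page_vectors.T   # (n_query, n_page)
    return float(sims.max(axis=1).sum())    # sum of per-query best similarities
```

At query time the question gets embedded the same way, pages are ranked with `maxsim_score`, and the winning page's full image is handed to the model, so the chart, its axes, and its caption arrive together instead of as an orphaned snippet.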

Result? The same question returns:

"IRR peaks at 0 MHz." Boom. 🎯

Morphik's technical approach correctly processes the query

Context visualization showing the complete retrieved section

