
When Multimodal Models Go Blind

A technical exploration of why even natively multimodal LLMs struggle with diagram interpretation in documents

Here's how the exchange goes with o4-mini-high:

You hand o4‑mini‑high a technical patent with an embedded IRR vs Frequency graph and ask:

"At what frequency does IRR peak?"

It thinks for 30 seconds and, instead of just reading the chart, hits you with:

"Which page is that on?"

Cue dramatic facepalm. 🤦

o4-mini-high failing to answer a question about a graph

Even after I grumbled "Page 6," it pulled out the Python tool-use gun (my favorite as well) and proclaimed the peak was "the highest point on the line." Technically wrong and hilariously sure of itself.

Additional context doesn't resolve the limitation

Model's unsuccessful self-analysis attempt

Here's how the same exchange goes with Morphik:

We treat each page like one giant image+text puzzle (rough sketch in code after the list):

  1. Snap the whole page as an image (diagrams, tables, doodles included)
  2. Extract text blocks with their exact positions (headings, captions, footnotes)
  3. Blend vision & text embeddings into a multi-vector cocktail 🍹
  4. Retrieve the full region (text+diagram) as a unit—no more orphaned charts
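Here's a minimal sketch of that pipeline, not Morphik's actual code. It uses PyMuPDF for the page image and positioned text blocks, and the two `embed_*` helpers are hypothetical placeholders for whatever multi-vector (late-interaction style) embedder you'd plug in; the random vectors just keep the sketch runnable.

```python
# Page-as-a-unit indexing sketch. NOT Morphik's implementation; the embedders
# below are random stand-ins for a real vision/text multi-vector model.
import fitz  # PyMuPDF: page rendering + positioned text extraction
import numpy as np

DIM = 128  # embedding dimension, arbitrary for this sketch


def embed_image_patches(png_bytes: bytes) -> np.ndarray:
    """Placeholder: a real system would run a vision encoder and return one
    vector per image patch. Here: deterministic random vectors."""
    rng = np.random.default_rng(len(png_bytes))
    return rng.standard_normal((64, DIM))


def embed_text_block(text: str) -> np.ndarray:
    """Placeholder: a real system would run a text encoder and return one
    vector per token or per block."""
    rng = np.random.default_rng(len(text))
    return rng.standard_normal((1, DIM))


def index_page(pdf_path: str, page_number: int) -> dict:
    doc = fitz.open(pdf_path)
    page = doc[page_number]

    # 1. Snap the whole page as an image (diagrams, tables, doodles included).
    pixmap = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for legibility
    page_png = pixmap.tobytes("png")

    # 2. Extract text blocks with their exact positions.
    #    Each block tuple is (x0, y0, x1, y1, text, block_no, block_type).
    blocks = [
        {"bbox": b[:4], "text": b[4]}
        for b in page.get_text("blocks")
        if b[6] == 0  # keep text blocks only
    ]

    # 3. Blend vision & text embeddings into one multi-vector bag per page.
    vectors = np.vstack(
        [embed_image_patches(page_png)]
        + [embed_text_block(b["text"]) for b in blocks]
    )

    # 4. Store the page as one retrievable unit: vectors plus everything
    #    needed to hand the full region (text + diagram) back to the LLM.
    return {"vectors": vectors, "image": page_png, "blocks": blocks}


def maxsim_score(query_vectors: np.ndarray, page_vectors: np.ndarray) -> float:
    """Late-interaction scoring: each query vector keeps its best page match."""
    sims = query_vectors @ page_vectors.T   # (n_query, n_page)
    return float(sims.max(axis=1).sum())    # sum of per-query best similarities
```

At query time the question gets embedded the same way, pages are ranked with `maxsim_score`, and the winning page's full image is handed to the model, so the chart, its axes, and its caption arrive together instead of as an orphaned snippet.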

Result? The same question returns:

"IRR peaks at 0 MHz." Boom. 🎯

Morphik's technical approach correctly processes the query

Context visualization showing the complete retrieved section

