This post examines, from an information-theoretic perspective, the inherent limitations of OCR-based pipelines for document question answering, and discusses why a direct, vision-based approach can be more effective. It also walks through a practical implementation of a vision-based question-answering system for long documents.
PageIndex Chat is the world's first human-like long-document AI analyst. You can upload entire books, research papers, or hundred-page reports and chat with them without context limits, all in the browser.
Unlike traditional RAG or "chat-with-your-doc" tools that rely on vector similarity search, PageIndex builds a hierarchical tree index of your document (like a table of contents), and then reasons over this index to retrieve and interpret relevant sections. It doesn’t search by keywords or embeddings — it reads, understands, and reasons through the document like a human expert.
What makes it different:
- Reasoning-based retrieval: Understands structure, logic, and meaning, not just semantic similarity.
- Page-level references: Every answer includes precise citations for easy verification.
- Cross-section reasoning: Connects information across sections and appendices to find true answers.
- Human-in-the-loop: You can guide, refine, and verify its reasoning.
- Multi-document comparison: Analyze and contrast multiple reports at once.
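To make the tree-index idea concrete, here is a minimal sketch of what such an index and a reasoning-based traversal might look like. The node fields and the `select` callback are illustrative assumptions, not PageIndex's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class TocNode:
    title: str              # section heading, e.g. "3. Risk Factors"
    summary: str            # short LLM-generated description of the section
    pages: tuple[int, int]  # (start_page, end_page), used for citations
    children: list["TocNode"] = field(default_factory=list)

def retrieve(root: TocNode, question: str, select) -> TocNode:
    """Walk the tree top-down. At each level, `select` asks an LLM which
    child is most likely to contain the answer -- a reasoning step over
    titles and summaries, not an embedding lookup."""
    node = root
    while node.children:
        node = select(question, node.children)
    return node  # leaf section whose pages are then read to answer the question
```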
1. What happens when the ToC is too long?
-- This is exactly why we chose a tree structure. If the ToC is too long, retrieval becomes a hierarchical search: the model first searches over the parent-level nodes, selects one, and then searches that node's children.
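A sketch of that level-by-level selection, which plugs into the `retrieve` loop from the earlier sketch. Only one level's nodes are shown to the model per call, so a long ToC never has to fit in context at once. The prompt wording and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def select(question: str, nodes: list) -> "TocNode":
    # Present just this level's nodes (title + summary) as a numbered menu.
    menu = "\n".join(f"[{i}] {n.title}: {n.summary}" for i, n in enumerate(nodes))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model works
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nSections:\n{menu}\n\n"
                "Reply with only the index of the section most likely "
                "to contain the answer."
            ),
        }],
    )
    return nodes[int(resp.choices[0].message.content.strip())]
```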
2. How does the index handle near misses, and how do you disambiguate between close titles?
-- For each node, we generate a description or summary, so the model has more to work with than the title alone.
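One way such summaries could be produced at index-build time, so that sections with near-identical titles (say, two "Results" sections) stay distinguishable. `summarize` here stands in for any LLM call and is purely hypothetical:

```python
def annotate(node: TocNode, section_text: str, summarize) -> None:
    # Attach a one-sentence description so the node carries more signal
    # than its title alone. The truncation limit is an arbitrary choice.
    node.summary = summarize(
        "Summarize this section in one sentence:\n" + section_text[:4000]
    )
```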
3. What about documents without a hierarchical structure?
-- The index simply degenerates into a flat list, which the model can still search through.
We found our MCP service works well for general financial, legal, textbook, and research-paper cases; see https://pageindex.ai/mcp for some examples.
We do agree that some cases, like recommendation systems, genuinely need semantic similarity and a vector DB, so we wouldn't recommend this approach there. Keen to learn about more cases we haven't thought through!
The current OCR approach typically relies on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.
Most existing pipelines address this by preprocessing the table into a linearized 1D string before any question is asked — a question-agnostic step that can discard structural information (merged cells, column spans, visual alignment) the eventual question may depend on.
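A toy example of what that linearization can throw away. The table and values below are made up: a header cell that visually spans two columns loses its span once the grid is flattened into a string.

```python
# "Revenue" visually spans the Q1 and Q2 columns in the rendered table.
table = [
    ["",       "Revenue", ""   ],
    ["Region", "Q1",      "Q2" ],
    ["EMEA",   "10",      "12" ],
]
flat = " | ".join(cell for row in table for cell in row)
print(flat)
# ->  | Revenue |  | Region | Q1 | Q2 | EMEA | 10 | 12
# The 2D adjacency that said "Revenue" covers both quarters is gone.
```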
Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
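A minimal sketch of this question-aware alternative, using the OpenAI vision message format; the model name, file path, and question are placeholders. The table is never linearized: the VLM sees the rendered page and the question together.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Encode the table image (e.g. a cropped page region) for the API.
with open("table_page.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Using the table in this image: what was EMEA revenue in Q2?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```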