mingtianzhang's comments

A VLM can already process both the document images and the query to produce an answer directly. Do we still need the intermediate OCR step?


We discuss the limitations of the classic OCR pipeline and provide a pure vision-based RAG system for document analysis (https://github.com/VectifyAI/PageIndex/blob/main/cookbook/vi...)

Any feedback is welcome!


We actually don't need OCR: https://pageindex.ai/blog/do-we-need-ocr


This blog examines the inherent limitations of the current OCR pipeline in the context of document question-answering systems from an information-theoretic perspective and discusses why a direct, vision-based approach can be more effective. It also provides a practical implementation of a vision-based question-answering system for long documents.


PageIndex Chat is the world's first human-like long-document AI analyst. You can upload entire books, research papers, or hundred-page reports and chat with them without context limits, all in the browser. Unlike traditional RAG or "chat-with-your-doc" tools that rely on vector similarity search, PageIndex builds a hierarchical tree index of your document (like a table of contents), and then reasons over this index to retrieve and interpret relevant sections. It doesn’t search by keywords or embeddings — it reads, understands, and reasons through the document like a human expert.

What makes it different:

- Reasoning-based retrieval: Understands structure, logic, and meaning, not just semantic similarity.

- Page-level references: Every answer includes precise citations for easy verification.

- Cross-section reasoning: Connects information across sections and appendices to find true answers.

- Human-in-the-loop: You can guide, refine, and verify its reasoning.

- Multi-document comparison: Analyze and contrast multiple reports at once.
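
As a rough illustration (the field names here are made up for the example, not our actual schema), the tree index is conceptually a table-of-contents tree like this:

```python
# Illustrative sketch of a table-of-contents-style tree index.
# Node fields are hypothetical examples, not PageIndex's real schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TocNode:
    title: str                      # section heading, e.g. "2. Risk Factors"
    summary: str                    # short description of what the section covers
    start_page: int                 # first page this node spans
    end_page: int                   # last page this node spans
    children: List["TocNode"] = field(default_factory=list)

# A document becomes a tree of such nodes; retrieval reasons over titles and
# summaries to pick a branch, rather than scanning a flat embedding index.
report = TocNode(
    title="Annual Report 2023",
    summary="Full-year results, risks, and outlook",
    start_page=1,
    end_page=120,
    children=[
        TocNode("1. Financial Highlights", "Revenue, margins, and EPS summary", 3, 10),
        TocNode("2. Risk Factors", "Market, credit, and operational risks", 11, 35),
    ],
)
```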


Instead of relying on vector databases, DeepMind proposes:

1. The LLM itself selects the most relevant documents — no vector database needed.

2. The selected documents are then placed directly into the context for generation.

This kind of in-context retrieval approach greatly improves retrieval accuracy compared to traditional vector-based retrieval methods.
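
A minimal sketch of that two-step idea (the `llm` helper and the prompts below are placeholders, not the paper's actual setup):

```python
# Sketch of LLM-based document selection followed by in-context generation.
# `llm` is a hypothetical helper wrapping any chat model; prompts are illustrative.

def llm(prompt: str) -> str:
    raise NotImplementedError  # call your chat model of choice here

def answer(question: str, documents: dict[str, str], k: int = 3) -> str:
    # Step 1: the LLM itself picks the most relevant documents from a list of titles.
    titles = "\n".join(documents)
    selection = llm(
        f"Question: {question}\n"
        f"Candidate documents:\n{titles}\n"
        f"List the {k} most relevant titles, one per line."
    )
    chosen = [t.strip() for t in selection.splitlines() if t.strip() in documents][:k]

    # Step 2: the selected documents go directly into the context for generation.
    context = "\n\n".join(documents[t] for t in chosen)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```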


Hi, thanks for your inspiring questions.

1. What happens when the ToC is too long? -- This is why we chose a tree structure. If the ToC is too long, the system does a hierarchical search: it searches over the parent-level nodes first, selects one node, and then searches that node's children (a rough sketch is below, after this list).

2. How does the index handle near misses, and how do you disambiguate between close titles? -- For each node we generate a description or summary, so the search sees more than just the titles.

3. For documents that have no inherent hierarchy, the index simply becomes a flat list, which you can still search through.
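
Here is the sketch mentioned in point 1: roughly how the hierarchical search walks the tree. The `score` function below is a placeholder for whatever relevance signal is used (e.g. an LLM call over a node's title and summary), not our actual code.

```python
# Rough sketch of the hierarchical search: score the nodes at the current level,
# descend into the best parent node, and repeat until reaching a leaf.
def score(question: str, node: dict) -> float:
    # Placeholder: e.g. ask an LLM to rate how relevant the node's
    # title + summary are to the question.
    raise NotImplementedError

def tree_search(question: str, nodes: list[dict]) -> dict:
    best = max(nodes, key=lambda n: score(question, n))  # best node at this level
    if best.get("children"):                             # go one level deeper
        return tree_search(question, best["children"])
    return best                                          # leaf: the section to read
```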

We also describe how it can be combined with a reasoning process and give some comparisons to vector DBs; see https://vectifyai.notion.site/PageIndex-for-Reasoning-Based-....

We found our MCP service works well for general financial/legal/textbook/research-paper cases; see https://pageindex.ai/mcp for some examples.

We do agree that in some cases, like recommendation systems, you need semantic similarity and a vector DB, so I wouldn't recommend this approach there. Keen to learn about more cases we haven't thought through!


thanks!


The current OCR approach typically relies on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.

Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.

Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.
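
For example, with any OpenAI-compatible vision model this is roughly the following (the model name, file path, and question are placeholders):

```python
# Sketch: skip table-to-JSON conversion and ask the VLM about the table image directly.
# Assumes an OpenAI-compatible endpoint that accepts image inputs.
import base64
from openai import OpenAI

client = OpenAI()

with open("table_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What was Q3 revenue in this table?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```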


Yeah, I wonder how PNG would fare in this contest.

