I built a PDF text extraction library in Zig that's significantly faster than Mu...

DannyBee · 2025-12-31T02:50:27 1767149427

FWIW - mupdf is simply not fast. I've done lots of pdf indexing apps, and mupdf is by far the slowest and least able to open valid pdfs when it came to text extraction. It also takes tons of memory.

A better speed comparison would be against multi-process pdfium (since pdfium was forked from foxit before multi-thread support, you can't thread it), multi-threaded foxit, or something like syncfusion (which is quite fast and supports multiple threads). Or even single-threaded pdfium vs. your code running single-threaded.

These were always the fastest/best options. I can (and do) achieve 41k pages/sec or better on these options.

The other thing you don't appear to mention is whether you handle putting the words in reading order (i.e. how they appear on the page), or only stream order (which varies in its relation to appearance order).

If it's only stream order, sure, that's really fast to do. But also not anywhere near as helpful as reading order, which is what other text-extraction engines do.

Looking at the code, it looks like the code to do reading order exists, but is not what is being benchmarked or used by default?

If so, this is really comparing apples and oranges.
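
To make the stream-order vs. reading-order distinction concrete, here is a minimal sketch of turning stream-order spans into reading order, assuming each span already carries a page position. The `Span` type, the `line_tolerance` parameter, and the top-left-origin coordinates are illustrative assumptions, not taken from zpdf or any other engine. Real engines also have to deal with columns, rotation, and tables, which this ignores.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A hypothetical text span with its position on the page."""
    x: float
    y: float  # assumed to grow downward from the top of the page
    text: str


def reading_order(spans, line_tolerance=2.0):
    """Group spans into lines by y, then order each line left to right."""
    lines = []  # each entry is a list of spans on one visual line
    for span in sorted(spans, key=lambda s: s.y):
        # Join the current line if this span sits close to that line's first span.
        if lines and abs(span.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    return [
        " ".join(s.text for s in sorted(line, key=lambda s: s.x))
        for line in lines
    ]


# The order in the content stream differs from how the words appear on the page.
stream = [
    Span(120, 50.5, "world"),
    Span(10, 80, "Second"),
    Span(10, 50, "Hello"),
    Span(70, 80.4, "line"),
]
stream_text = " ".join(s.text for s in stream)  # what a stream-order dump gives
page_lines = reading_order(stream)              # what a reader expects
```

Dumping `stream` as-is is a single pass with no sorting or grouping, which is why stream-order extraction is cheap compared to the reading-order path.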

tveita · 2025-12-30T21:24:25 1767129865

What kind of performance are you seeing with/without SIMD enabled?

From https://github.com/Lulzx/zpdf/blob/main/src/main.zig it looks like the help text cites an unimplemented "-j" option to enable multiple threads.

There is a "--parallel" option, but that is only implemented for the "bench" command.

lulzx · 2025-12-30T21:38:09 1767130689

I have now made parallel by default and added an option to enable multiple threads.

I haven't tested without SIMD.

cheshire_cat · 2025-12-30T21:28:35 1767130115

You've released quite a few projects lately, very impressive.

Are you using LLMs for parts of the coding?

What's your work flow when approaching a new project like this?

lulzx · 2025-12-30T22:01:04 1767132064

Claude Code.

littlestymaar · 2025-12-30T21:59:47 1767131987

> Are you using LLMs for parts of the coding?

I can't talk about the code, but the readme and commit messages are most likely LLM-generated.

And when you take into account that the first commit happened just three hours ago, it feels like the entire project has been vibe coded.

Neywiny · 2025-12-30T22:24:52 1767133492

Hard disagree. Initial commit was 6k LOC. Author could've spent years before committing. Ill-advised but not impossible.

littlestymaar · 2025-12-30T22:44:04 1767134644

Why would you make Claude write your commit message for a commit you've spent years working on though?

Neywiny · 2025-12-30T22:56:38 1767135398

1. Be not good at or a fan of git when coding

2. Be not good at or a fan of git when committing

Not sure what the disconnect is.

Now if it were vibecoded, I wouldn't be surprised. But benefit of the doubt

Jach · 2025-12-31T00:56:29 1767142589

We're well beyond benefit of the doubt these days. If it looks like a duck... For me there wasn't any doubt, the author's first top comment here was evidence enough, then seeing the readme + random code + random commit message, it's all obvious LLM-speak to me.

I don't particularly care, though, and I'm more positive about LLMs than negative even if I don't (yet?) use them very much. I think it's hilarious that a few people asked for Python bindings and then bam, done, and one person is like "..wha?" Yes, LLMs can do that sort of grunt work now! How cool, if kind of pointless. Couldn't the cycles have just been spent on trying to make muPDF better? Though I see they're in C and AGPL, I suppose either is motivation enough to do a rewrite instead. (This is MIT Licensed though it's still unclear to me how 100% or even large-% vibe-coded code deserves any copyright protection, I think all such should generally be under the Unlicense/public domain.)

If the intent of "benefit of the doubt" is to reduce people having a freak out over anyone who dares use these tools, I get that.

lulzx · 2025-12-31T01:10:54 1767143454

I have updated the licence to WTFPL.

I'll try my best to make it a really good one!

littlestymaar · 2025-12-31T11:41:00 1767181260

> I have updated the licence to WTFPL.

You still have no basis in claiming copyright protection hence you cannot set a license on that code.

Instead of the WTFPL you should just write a disclaimer that, due to being machine-generated and devoid of creative work, the work is not protected by copyright and is free to be used without any license.

lulzx · 2025-12-31T11:47:26 1767181646

Hasn't the world moved on from these things already?

littlestymaar · 2025-12-31T08:11:08 1767168668

> I built

You didn't. Claude did. Like it did write this comment.

And you didn't even bother testing it before submitting, which is insulting to everyone.

lulzx · 2025-12-31T12:01:07 1767182467

tools are tools.

jeffbee · 2025-12-30T21:57:47 1767131867

What's fast about mmap?

kennethallen · 2025-12-31T04:25:22 1767155122

Two big advantages:

You avoid an unnecessary copy. Normal read system call gets the data from disk hardware into the kernel page cache and then copies it into the buffer you provide in your process memory. With mmap, the page cache is mapped directly into your process memory, no copy.

All running processes share the mapped copy of the file.

There are a lot of downsides to mmap: you lose explicit error handling and fine-grained control of when exactly I/O happens. Consult the classic article on why sophisticated systems like DBMSs do not use mmap: https://db.cs.cmu.edu/mmap-cidr2022/
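
The two access patterns described above can be sketched roughly as follows. This is only an illustration using Python's standard `mmap` module, with made-up helper names (`count_newlines_read`, `count_newlines_mmap`); it shows the shape of each approach and makes no claim about which is faster for a given workload.

```python
import mmap
import os
import tempfile


def count_newlines_read(path, chunk_size=64 * 1024):
    """Count newlines with read(): each chunk is copied into a process buffer."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def count_newlines_mmap(path):
    """Count newlines by searching a read-only mapping of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            total = 0
            pos = m.find(b"\n")
            while pos != -1:
                total += 1
                pos = m.find(b"\n", pos + 1)
            return total


# Write a small throwaway file and count its lines both ways.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line\n" * 1000)
    path = tmp.name
try:
    via_read = count_newlines_read(path)
    via_mmap = count_newlines_mmap(path)
finally:
    os.remove(path)
```

Note where errors can surface: in the `read()` version a failing disk shows up as an exception from `f.read`, while in the mapped version it can only show up at whatever memory access happens to touch the bad page.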

commandersaki · 2025-12-31T08:24:29 1767169469

> you lose explicit error handling

I've never had to use mmap, but this has always been the issue in my head. If you're treating I/O as memory pages, what happens when you read a page and it needs to "fault" by reading the backing storage but the storage fails to deliver? What can be said at that point, or does the program crash?

saidinesh5 · 2025-12-31T05:31:24 1767159084

This is a very interesting link. I didn't expect mmap to be less performant than read() calls.

I now wonder which use cases would mmap suit better - if any...

> All running processes share the mapped copy of the file.

So something like building linkers that deal with read-only shared libraries, "plugins", etc.?

squirrellous · 2025-12-31T10:07:34 1767175654

One reason to use shared memory mmap is to ensure that even if your process crashes, the memory stays intact. Another is to communicate between different processes.
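
A minimal sketch of the sharing idea, assuming a Unix-like system where Python's `mmap.mmap` creates a shared mapping by default. To keep it short, both mappings live in one process here; two separate processes mapping the same file would be looking at the same pages in the same way. The names `writer` and `reader` are just for illustration.

```python
import mmap
import os
import tempfile

# Create a one-page backing file filled with zero bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * mmap.PAGESIZE)
    path = tmp.name

try:
    # Open the file twice to stand in for two independent participants.
    with open(path, "r+b") as f1, open(path, "r+b") as f2:
        writer = mmap.mmap(f1.fileno(), mmap.PAGESIZE)
        reader = mmap.mmap(f2.fileno(), mmap.PAGESIZE)
        try:
            # Store through one mapping, load through the other.
            # No read() or write() call is involved in the exchange.
            writer[:5] = b"hello"
            seen = bytes(reader[:5])
        finally:
            writer.close()
            reader.close()
finally:
    os.remove(path)
```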

rishabhaiover · 2025-12-30T23:43:10 1767138190

it allows the program to reference memory without having to manage it in the heap space. it would make the program faster in a memory managed language, otherwise it would reduce the memory footprint consumed by the program.

jeffbee · 2025-12-30T23:53:23 1767138803

You mean it converts an expression like `buf[i]` into a baroque sequence of CPU exception paths, potentially involving a trap back into the kernel.

rishabhaiover · 2025-12-31T00:08:59 1767139739

I don't fully understand the under-the-hood mechanics of mmap, but I can sense that you're trying to convey that mmap shouldn't be used as a blanket optimization technique, as there are tradeoffs in terms of page fault overheads (being at the mercy of OS page cache mechanics).

StilesCrisis · 2025-12-31T03:56:54 1767153414

Tradeoffs such as "if an I/O error occurs, the program immediately segfaults." Also, I doubt you're I/O bound to the point where mmap is noticeably better than read, but I guess it's fine for an experiment.

jibal · 2025-12-31T01:16:02 1767143762

I think he's conveying that he doesn't know what he's talking about. buf[i] generates the same code regardless of whether mmap is being used. The first access to a page will cause a trap that loads the page into memory, but this is also true if the memory is read into.

jonstewart · 2025-12-30T22:27:32 1767133652

What’s the fidelity like compared to tika?

lulzx · 2025-12-30T22:39:23 1767134363

The accuracy difference is marginal (1-2%) but the speed difference is massive.