exactly. Just sent raw tesseract output, no formatting or "fix the OCR text" step. So the data looked like:
```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```
Very messy.
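For anyone curious what parsing that looks like: a minimal sketch (assuming the shape shown above, where the header row arrives fused with no delimiter and data rows are tab-separated):

```python
# Raw tesseract-style output: fused header blob, then tab-separated data rows.
raw = "col1col2col3\nrow label\tdatapoint1\tdatapoint2"

lines = raw.split("\n")
header_blob = lines[0]                           # "col1col2col3" -- no delimiter to split on
rows = [line.split("\t") for line in lines[1:]]  # data rows split cleanly on tabs
```

The data rows come apart fine; the fused header blob is the part that needs a heuristic (or an LLM) to segment.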
I don't think this is generalizable with the same 100% accuracy across any OCR output (they can be _really_ bad). I'm still planning on doing a first pass with a better table OCR system like Textract, DocumentAI, PaddlePaddle's table recognition, etc., which should improve accuracy.
Yeah, my use cases are in the really bad category. I've been building parsers for a while, and I've basically given up and fallen back to manually declaring the rows of interest with if-present logic. Camelot got so close, but I ended up building my own control layer on top of pdfminer.six to accommodate (I'd still recommend Camelot if you're exploring). It absolutely sucks needing to be so specific out of the gate, but at least the context rarely changes.
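The "rows of interest with if-present logic" approach is roughly this kind of thing (hypothetical labels and helper name, just to illustrate the pattern):

```python
# Hard-coded labels we care about; anything else is ignored.
ROWS_OF_INTEREST = {"Total Revenue", "Net Income"}

def extract_rows(lines):
    """Keep only tab-separated rows whose first cell is a listed label."""
    found = {}
    for line in lines:
        cells = line.split("\t")
        if cells[0] in ROWS_OF_INTEREST:  # if-present logic: silently skip when absent
            found[cells[0]] = cells[1:]
    return found

extract_rows(["Total Revenue\t100\t120", "Misc\t5"])
# → {"Total Revenue": ["100", "120"]}
```

Brittle and very specific to each document type, but deterministic, which is the trade-off being described.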
What is the source of these nasty docs? I am also working on a layer above pdfminer.six to parse tables. It seems like this task is never done. LLMs have had mixed results for me too. I am focused on documents containing invoices, income statements, etc from the real estate industry.
My email is in my profile if you want to reach out and compare notes!