
I've tried to apply it to parsing HTML, as this article describes, in a pretty long pipeline. I'm using DeepInfra with Mistral 8x7B and I'm still unsure whether I'll use it in production.

The problem I'm finding is that the time I wanted to save on maintaining selectors and the like is now spent writing wrapper code and dealing with the model's mistakes. Some are OK and I can deal with them; others are pretty annoying because it's difficult to handle them in a deterministic manner.
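For what it's worth, the deterministic part of my wrapper boils down to validating the model's output and retrying on failure. A minimal sketch (the field names and the `call_model` callback are made up for the example):

```python
import json
from typing import Optional

# Fields we expect the model to return, with their expected types
# (hypothetical schema for illustration).
REQUIRED_FIELDS = {"title": str, "price": str, "url": str}

def parse_llm_output(raw: str) -> Optional[dict]:
    """Return a validated dict, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            return None
    return data

def extract(call_model, html: str, max_retries: int = 2) -> Optional[dict]:
    """Call the model up to max_retries+1 times until output validates."""
    for _ in range(max_retries + 1):
        result = parse_llm_output(call_model(html))
        if result is not None:
            return result
    return None
```

That catches malformed JSON and missing fields deterministically, but of course it can't catch outputs that are well-formed and simply wrong.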

I've also tried GPT-4, but it's way more expensive, and despite what this guy got, it also makes mistakes.

I don't really care about inference speed, but I do care about price and correctness.



Might be a silly question, but if you want determinism in this, why don't you get the LLM to write the deterministic code, and use that instead? Interesting experiment, though!

In fact, what about a hybrid of what you're doing now? Initially, you use an LLM to generate examples. And then from those examples, you use that same LLM to write deterministic code?
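To make that concrete, the deterministic code the LLM hands back could be as small as a stdlib-only extractor. A sketch, assuming hypothetical target markup (`<h2 class="title">` is invented for the example):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text of <h2 class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

p = TitleExtractor()
p.feed('<h2 class="title">First</h2><p>x</p><h2 class="title">Second</h2>')
# p.titles → ["First", "Second"]
```

Once generated, code like this runs for free on every page, and you only go back to the LLM when the site's markup changes.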


Have you tried swapping Mistral 8x7B for command-r 34B, Qwen 1.5 70B, or miqu 70B? They're all superior in my experience, though suited to slightly different tasks, so some experimentation is needed.


Parsing HTML and tag soup is IMHO not the right application for LLMs, since these are ultimately structured formats. LLMs are for NLP tasks, like extracting meaning from unstructured and ambiguous text. The computational cost of an LLM chewing through even a moderately sized document would be more efficiently spent on sophisticated parser technologies that have been around for decades and can, to a degree, deal with ambiguous and irregular grammars. LLMs should be able to help you write those parsers.
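Even Python's stdlib parser copes with malformed tag soup without any LLM involved. An illustrative sketch (the input string is invented):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Gather visible text, tolerating unclosed and mismatched tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Deliberately broken markup: unclosed <p>, <b>/<i> mismatched with </p>.
soup = "<p>unclosed paragraph <b>bold <i>nested</p> stray </b> text"
c = TextCollector()
c.feed(soup)
# c.chunks still collects all the visible text despite the broken markup
```

Dedicated libraries like lxml or BeautifulSoup go further, recovering a usable tree from the same soup, all deterministically and at a tiny fraction of the cost of an inference call.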



