yup, this is a pretty common occurrence in using LLMs for data extraction. For personal use (trying to load a receipt) it’s great that the LLM filled in info. For production systems which need high quality, near 100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc These hallucinations occur quite frequently, and we haven’t found a way to minimize this through prompt eng.
To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.