My biggest problem with LLMs at this point is that they produce inconsistent results and behave differently given the same prompt. Better grounding would be amazing. I want to give an LLM the same prompt on different days and be able to trust that it will do the same thing as yesterday. Currently they misbehave multiple times a week and I have to manually steer them a bit, which destroys certain automated workflows completely.
It sounds like you have dug into this problem in some depth, so I would love to hear more. When you've tried to automate things, I'm guessing you've got a template plus some data, and then the same or similar input gives totally different results? What details can you share about how different the results are? Are you asking for e.g. JSON output and getting something that totally isn't, or is it a more subtle difference?
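To make the question concrete, this is roughly the failure mode I have in mind, as a minimal sketch (the function and example strings are made up for illustration, not from your setup):

```python
import json

def parse_strict_json(raw: str) -> dict:
    """Parse model output that was supposed to be JSON; raise if it isn't."""
    # Strip the ```json fences models sometimes wrap around the payload.
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)  # raises json.JSONDecodeError on non-JSON

# The same prompt on two different days:
good_day = '{"status": "ok", "items": 3}'
bad_day = 'Sure! Here is the JSON you asked for:\n{"status": "ok", "items": 3}'

parse_strict_json(good_day)   # parses fine
# parse_strict_json(bad_day)  # raises: the chatty preamble breaks the pipeline
```

Is it that kind of hard failure, or does the JSON parse but the values drift?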
> I want to give an LLM the same prompt on different days and I want to be able to trust that it will do the same thing as yesterday
Bad news, it's winter now in the Northern hemisphere, so expect all of our AIs to get slightly less performant as they emulate humans under-performing until Spring.
It doesn’t really solve it, as a slight shift in the prompt can have totally unpredictable results anyway. And if your prompt is always exactly the same, you’d just cache the response and bypass the LLM entirely.
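For exact repeats the cache really can be that dumb; a minimal sketch, assuming you already have some `generate(prompt)` call (the name is made up):

```python
import hashlib

# Identical prompts hit the cache and never reach the model; anything
# that differs by even one character falls through to generate(), where
# the output can still be wildly different.
_cache: dict[str, str] = {}

def cached_generate(prompt: str, generate) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt)  # only called on a cache miss
    return _cache[key]
```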
What would really be useful is that a very similar prompt should always give a very similar result.
This doesn't work with the current architecture, because we have to introduce some element of stochastic noise into the generation (the sampling temperature), or else the models aren't "creatively" generative.
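Roughly where that noise enters, as a toy sketch (the `logits` array and function are made up for illustration, not any particular model's API):

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits: np.ndarray, temperature: float) -> int:
    """Pick the next token from the model's scores over the vocabulary."""
    if temperature <= 0:
        # Greedy decoding: deterministic, but every run gives the same,
        # less varied output.
        return int(np.argmax(logits))
    # Temperature sampling: repeated runs can pick different tokens,
    # which is exactly the noise that makes outputs drift between days.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```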
Your brain doesn't have this problem because the noise is already present. You, as an actual thinking being, are able to override the noise and say "no, this is false." An LLM doesn't have that capability.