I think the problem is the quirkiness on the English side, not the SQL side. You could translate Datalog to SQL or vice versa, but understanding intention from arbitrary English is much harder. And often query results must be 100% accurate and reliable.
> I think the problem is the quirkiness on the English side
That is likely, but the question was whether any improvement has been shown with other targets, which would validate that assumption. There is no benefit in merely thinking so.
> And often query results must be 100% accurate and reliable.
That seems to be impossible. Even human programmers struggle to reliably convert natural language to SQL, according to the aforementioned test study. They are slightly better than the known alternatives, but far from perfect. But if another target can get closer to human-level performance, that is significant.
When I find someone claiming a suspicious data analysis result, I can ask them for the SQL and investigate it to see if there's a bug in it (or further investigate where the data being queried comes from). If the abstraction layer between the LLM prompt and the data backend is removed, I'm left with (just like other LLM answers) some words but no way to know if they're correct.
1. How would the abstraction be removed? Language generation is what LLMs do; a language abstraction is what you are getting out, no matter what. There is no magic involved.
2. The language has to represent a valid computer program. That is as true of SQL as any other target. You can know that it is correct by reading it.
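To make the "valid computer program" part concrete, here is a minimal sketch, assuming SQLite as a stand-in database. The `compiles` helper, the `orders` table, and the `totl` typo are all made up for illustration. It only shows that the engine can mechanically reject output that is not a valid program against the schema. Whether a valid query answers the original question still takes a human reading it.

```python
import sqlite3

def compiles(conn, sql):
    """Hypothetical helper: ask SQLite to compile (not run) a statement.

    Returns (True, None) if the statement is valid against the current
    schema, otherwise (False, error message).
    """
    try:
        # EXPLAIN makes SQLite parse and plan the statement without
        # executing the query itself.
        conn.execute("EXPLAIN " + sql)
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)

# Made-up schema standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

# A well-formed query and one with a misspelled column name.
ok_good, _ = compiles(conn, "SELECT customer, SUM(total) FROM orders GROUP BY customer")
ok_bad, err = compiles(conn, "SELECT customer, SUM(totl) FROM orders GROUP BY customer")
```

The same kind of check applies to any other target that has a parser or compiler, which is the sense in which SQL is not special here.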
Once you have SQL, you have Datalog. Once you have Datalog, you have SQL. The problem isn't the target, it is getting sufficiently rigorous and structured output from the LLM to target anything.
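A rough sketch of the Datalog-to-SQL direction, under heavy assumptions: one non-recursive rule, variables only (no constants or negation), and every predicate stored as a table whose columns are named `c0`, `c1`, and so on. The `Atom` class, `rule_to_sql`, and the `parent` data are hypothetical names invented for this illustration. Recursive rules would need something like SQL's `WITH RECURSIVE`, which this does not attempt.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    pred: str    # predicate name, assumed to match a table name
    args: tuple  # variable names only; constants are not handled here

def rule_to_sql(head, body):
    """Translate one non-recursive conjunctive rule into a SELECT."""
    first_seen = {}  # variable -> "alias.column" where it first appears
    joins = []       # equality conditions for variables that repeat
    froms = []
    for i, atom in enumerate(body):
        alias = f"t{i}"
        froms.append(f"{atom.pred} AS {alias}")
        for j, var in enumerate(atom.args):
            col = f"{alias}.c{j}"
            if var in first_seen:
                # A repeated variable becomes a join condition.
                joins.append(f"{first_seen[var]} = {col}")
            else:
                first_seen[var] = col
    select = ", ".join(
        f"{first_seen[v]} AS c{k}" for k, v in enumerate(head.args)
    )
    sql = f"SELECT DISTINCT {select} FROM {', '.join(froms)}"
    if joins:
        sql += " WHERE " + " AND ".join(joins)
    return sql

# grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
head = Atom("grandparent", ("X", "Z"))
body = [Atom("parent", ("X", "Y")), Atom("parent", ("Y", "Z"))]
sql = rule_to_sql(head, body)

# Run the generated SQL against made-up data to check the translation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (c0 TEXT, c1 TEXT)")
conn.executemany(
    "INSERT INTO parent VALUES (?, ?)", [("ann", "bob"), ("bob", "cid")]
)
rows = conn.execute(sql).fetchall()
```

The mechanical translation is a few dozen lines because both sides are already rigorous and structured, which is the point: the hard part sits upstream, in getting that structure out of the English.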
So you have already claimed, but, curiously, we still have no answer to the question. If you don't know, why not just say so?
That said, if you have ever used these tools to generate code, you will know that they are much better at some languages than others. In the general case, the target really is the problem sometimes. Does that carry into this particular narrow case? I don't know. What do the comparison results show?