Something we're looking to experiment with is asking the LLM to produce pieces of a query that we then assemble ourselves, rather than asking it to assemble the whole thing. The hypothesis is that it's more likely to produce pieces we can "work with" that are also "interesting" or "useful" to users.
FWIW we have a ~7% failure rate (meaning the model fails to produce a valid, runnable query) after some work to correct what we consider correctable outputs. Not terrible, but we think the above idea could help with that.
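Roughly the shape we have in mind (a minimal sketch; `llm_complete`, the JSON piece format, and the table allowlist are all hypothetical stand-ins, not our actual system):

```python
import json

# Hypothetical stand-in for a real LLM call; assume it returns JSON text
# describing query pieces rather than a finished query.
def llm_complete(prompt: str) -> str:
    return ('{"table": "events", "columns": ["user_id", "count(*)"], '
            '"filters": ["ts > now() - interval 7 day"], "group_by": ["user_id"]}')

ALLOWED_TABLES = {"events", "users"}  # guardrail: only tables we know about

def build_query(user_request: str) -> str:
    pieces = json.loads(llm_complete(f"Emit JSON query pieces for: {user_request}"))
    if pieces["table"] not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {pieces['table']}")
    # We own the assembly step, so the overall query shape is always valid.
    sql = f"SELECT {', '.join(pieces['columns'])} FROM {pieces['table']}"
    if pieces.get("filters"):
        sql += " WHERE " + " AND ".join(pieces["filters"])
    if pieces.get("group_by"):
        sql += " GROUP BY " + ", ".join(pieces["group_by"])
    return sql

print(build_query("weekly active users"))
```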
Based on my personal experience I think that's a much better approach, so I wish you luck with it.
Perhaps counter-intuitively to how most people view LLMs, I strongly believe they're better when you constrain them a bit with some guardrails (e.g. pieces of a query, a bunch of existing queries, etc.).
Happily surprised you guys managed to get it down to only a 7% failure rate though! Given how temperamental LLMs are and the apparent complexity of the task, that's impressive.
> Happily surprised you guys managed to get it down to only a 7% failure rate though!
Thanks! It, uhh, was quite a bit higher before we did some of that work though, heh. Since we can take a query and attempt to run it, we get good errors for anything that's ill-specified, and we can track it. Ideally we'd address everything with better prompt engineering, but it's certainly quicker to just fix stuff up after the fact when we know how to.
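Something like this run-validate-fixup loop, as a sketch (the `validate` stub and the fixup rules are invented examples; in practice validation would run or EXPLAIN the query against the actual database):

```python
import re

def validate(sql: str) -> str | None:
    """Try to run the query; return an error message, or None if it's fine.
    Stubbed here with two toy checks standing in for real database errors."""
    if not sql.rstrip().endswith(";"):
        return "missing trailing semicolon"
    if re.search(r"\bFORM\b", sql, re.IGNORECASE):
        return "syntax error near FORM"
    return None

# Invented examples of "correctable" outputs: known error -> mechanical fix.
FIXUPS = [
    ("missing trailing semicolon", lambda s: s.rstrip() + ";"),
    ("syntax error near FORM",
     lambda s: re.sub(r"\bFORM\b", "FROM", s, flags=re.IGNORECASE)),
]

def repair(sql: str, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        error = validate(sql)
        if error is None:
            return sql  # valid, runnable query
        for pattern, fix in FIXUPS:
            if pattern in error:
                sql = fix(sql)  # apply the known after-the-fact fixup
                break
        else:
            raise ValueError(f"uncorrectable query: {error}")
    raise ValueError("gave up after repeated fixups")

print(repair("SELECT id FORM users"))  # -> "SELECT id FROM users;"
```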
Re: constraints, it turns out that banning tokens in the vocabulary is a great way to force models to be creative and follow syntactic or semantic constraints without errors.
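For instance, with Hugging Face transformers' `bad_words_ids` (a generic sketch, not tied to any particular system; the banned word list here is arbitrary):

```python
# Minimal token-banning sketch: the banned sequences get their logits
# masked at every step, so the model can never emit them.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Arbitrary example constraint; in practice you'd ban whatever token
# sequences violate your grammar or semantics.
banned = [tokenizer(word, add_special_tokens=False).input_ids
          for word in [" SELECT", " DROP", " DELETE"]]

inputs = tokenizer("Describe this query in plain English:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=40,
    bad_words_ids=banned,  # these token sequences can never be generated
    do_sample=True,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```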