If you can generate code from the grammar, then what exactly are you RLing? The point was to generate code in the first place, so what does backpropagation get you here?
The grammar of this language is no more than a few hundred tokens (thousands at worst) & current LLMs support context windows in the millions of tokens.
Theorycrafting is very easy. Not a single person in this thread has shown any code to do what they're suggesting. You have access to the best models & yet you still haven't managed to prompt one to give you the code to prove your point, so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible or admit you lack the relevant understanding to back up your claims.
> You have access to the best models & yet you still haven't managed to prompt one to give you the code to prove your point, so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible
GPU poor here though...
To quote someone (you...) on the internet:
> More generally, don't ask random people on the internet to do work for you for free.
Claims require evidence, & if you are unwilling to present it then admit you do not have any evidence to support your claims. It's not complicated. Either RL works & you have evidence, or you do not know & cannot claim that it works w/o first doing the required due diligence, which (shockingly) actually requires work instead of empty theorycrafting & hand waving.
Why would I do that? If you know something, then quote the relevant passage & equation that says you can train code generators w/ RL on a novel language w/ little to no code to train on. More generally, don't ask random people on the internet to do work for you for free.
Your other comment sounded like you were interested in learning about how AI labs are applying RL to improve programming capability. If so, the DeepSeek-R1 paper is a good introduction to the topic (maybe a bit out of date at this point, but very approachable). RL training works fine for low-resource languages as long as you have tooling to verify outputs and enough compute to throw at the problem.
> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase.
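For context, the "baseline from group scores" bit is just per-prompt normalization: sample G completions for each prompt, score each one with the verifier/reward model, and use the group statistics in place of a learned critic. Roughly, following the paper's outcome-reward formulation, the advantage for sample i is:

\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}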
> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model's responses against a suite of predefined test cases, thereby generating objective feedback on correctness.
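To be concrete about what "a compiler can be utilized to evaluate the model's responses" looks like, here is a minimal sketch of the verifier side (the compiler name `novelc`, the file extension, and the reward scheme are my own placeholders, not anything from the paper): compile each sampled program, run it against the prompt's test cases, and emit a scalar reward.

```python
import os
import subprocess
import tempfile

def compile_and_test_reward(candidate_source: str,
                            test_cases: list[tuple[str, str]]) -> float:
    """Score one sampled program: 0.0 if it fails to compile, otherwise the
    fraction of (stdin, expected_stdout) test cases it passes.
    'novelc' is a hypothetical stand-in for the new language's compiler."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "prog.nl")
        binary = os.path.join(tmp, "prog")
        with open(src, "w") as f:
            f.write(candidate_source)

        # The compiler is the first verifier: any syntax/type error means zero reward.
        try:
            build = subprocess.run(["novelc", src, "-o", binary],
                                   capture_output=True, timeout=60)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return 0.0
        if build.returncode != 0:
            return 0.0

        # The test suite is the second verifier: run the binary and compare stdout.
        passed = 0
        for stdin_text, expected in test_cases:
            try:
                run = subprocess.run([binary], input=stdin_text.encode(),
                                     capture_output=True, timeout=5)
            except subprocess.TimeoutExpired:
                continue
            if run.returncode == 0 and run.stdout.decode().strip() == expected.strip():
                passed += 1
        return passed / len(test_cases) if test_cases else 0.0
```

Score a group of G samples per prompt this way and those scores are the r_i feeding the group-normalized advantage above.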
None of those are novel domains w/ their own novel syntax & semantic validators, not to mention the dearth of readily available sources of examples for sampling the baselines. So again, where does it say it works for a programming language with nothing but a grammar & a compiler?
You're not going to get less confused by doubling down. None of your claims are valid, & this is because you haven't actually tried to do what you're suggesting. Taking a grammar & a compiler & RLing will get you nowhere.