More

measurablefunc · 2026-01-21T08:39:27 1768984767

The task is ill-defined.

saagarjha · 2026-01-21T09:14:38 1768986878

You make it faster

measurablefunc · 2026-01-21T10:57:00 1768993020

Fewer instructions doesn't mean it's faster. It can be faster but it's not guaranteed in general. Obvious counterexample is single threaded vs multi-threaded code. Single threaded code will have fewer instructions but won't necessarily be faster.

saagarjha · 2026-01-21T11:44:33 1768995873

It does in this case; you can read the assignment to see that it is all single-threaded

measurablefunc · 2026-01-21T16:45:08 1769013908

I read it, you're mistaken.

saagarjha · 2026-01-21T18:48:21 1769021301

I did the assignment my guy

measurablefunc · 2026-01-21T19:45:42 1769024742

That's great but I didn't ask & that's still not addressing my point.

saagarjha · 2026-01-21T20:11:11 1769026271

I didn’t ask you to be rude or wrong either, yet here we are. The assignment is explicitly single core and cycle accurate. Your point is completely irrelevant and shows a disconnect with the content being discussed.

measurablefunc · 2026-01-21T20:28:07 1769027287

It's neither rude nor wrong to ask for evidence to support claims being made in what appears to be corporate advertising. The claim is their LLM is better than a person, I asked for evidence. None was presented. It's not complicated.

measurablefunc · 2026-01-21T08:31:06 1768984266

Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any places where they actually disallow cheating b/c it says they only check the final state of the memory² so seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest number of instructions to compute the final state but again, it's not clear what they're actually measuring b/c if you know the final state you can cheat & there is no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.

¹https://github.com/anthropics/original_performance_takehome/...

²https://github.com/anthropics/original_performance_takehome/...

saagarjha · 2026-01-21T09:06:43 1768986403

Well, they read your code in the actual hiring loop.

measurablefunc · 2026-01-21T09:09:42 1768986582

My point still stands. I don't know what the LLM is doing so my guess is it's cheating unless there is evidence to the contrary.

red75prime · 2026-01-21T09:51:53 1768989113

I guess your answer to "Try to run Claude Code on your own 'ill-defined' problem" would be "I'm not interested." Correct? I think we can stop here then.

KeplerBoy · 2026-01-21T11:13:56 1768994036

Well that's certainly a challenge when you use LLMs for this test driven style of programming.

saagarjha · 2026-01-21T09:21:47 1768987307

Why do you assume it’s cheating?

measurablefunc · 2026-01-21T10:55:59 1768992959

Because it's a well know failure mode of neural networks & scalar valued optimization problems in general: https://www.nature.com/articles/s42256-020-00257-z

saagarjha · 2026-01-21T11:41:56 1768995716

Again, you can just read the code

measurablefunc · 2026-01-21T16:46:42 1769014002

You're missing the point. There is no evidence to support their claims which means they are more than likely leaking the memory into the LLM prompt & it is cheating by simply loading constants into memory instead of computing anything. This is why formal specifications are used to constrain optimization. Without proof that the code is equivalent you might as well just load constants into memory & claim victory.

fc417fc802 · 2026-01-21T17:23:57 1769016237

> There is no evidence to support their claims

Do you make a habit of not presuming even basic competence? You believe that Anthropic left the task running for hours, got a score back, and never bothered to examine the solution? Not even out of curiosity?

Also if it was cheating you'd expect the final score to be unbelievably low. Unless you also suppose that the LLM actively attempted to deceive the human reviewers by adding extra code to burn (approximately the correct number of) cycles.

measurablefunc · 2026-01-21T18:48:54 1769021334

This has nothing to do w/ me & consistently making it a personal problem instead of addressing the claims is a common tactic for people who do not know what it means to present evidence for their claims. Anthropic has not provided the necessary evidence for me to conclude that their LLM is not cheating. I have no opinion on their competence b/c that is not what is at issue. They could be incompetent & not notice that their LLM is cheating at their take home exam but I don't care about that.

fc417fc802 · 2026-01-21T19:00:33 1769022033

You are implying that you believe them to be incompetent since otherwise you would not expect evidence in this instance. They also haven't provided independent verification of their claims - do you suspect them of lying as well?

How do you explain the specific score that was achieved if as you suggest the LLM simply copied the answer directly?

measurablefunc · 2026-01-21T19:47:09 1769024829

Either they have proof that their LLM is not cheating or they don't. The linked post does not provide evidence that the LLM is not cheating. I don't have to explain anything on my end b/c my claim is very simple & easily refuted w/ the proper evidence.

red75prime · 2026-01-21T11:32:55 1768995175

And? Anthropic is not aware of this 2020 paper? The problem is not solvable?

measurablefunc · 2026-01-21T16:53:35 1769014415

Why are you asking me? Email & ask Anthropic.

red75prime · 2026-01-22T00:28:05 1769041685

Obviously, because you use this old paper as an argument.

measurablefunc · 2026-01-19T23:48:00 1768866480

There is no RL for programming languages. Especially ones w/ no significant amount of code.

nl · 2026-01-20T04:53:23 1768884803

I guess the op was implying that is something fixable fairly easily?

(Which is true - it's easy to prompt your LLM with the language grammar, have it generate code and then RL on that)

Easy in the sense of "it is only having enough GPUs to RL a coding capable LLM" anyway.

measurablefunc · 2026-01-20T05:24:15 1768886655

If you can generate code from the grammar then what exactly are you RLing? The point was to generate code in the first place so what does backpropagation get you here?

nl · 2026-01-20T11:07:42 1768907262

Post RL you won't need to put the grammar in the prompt anymore.

measurablefunc · 2026-01-20T20:06:52 1768939612

The grammar of this language is no more than a few hundred tokens (thousands at worst) & current LLMs support context windows in the millions of tokens.

nl · 2026-01-20T21:53:22 1768946002

Sure.

The point is that your statement about the ability to do RL is wrong.

Additionally your response to the Deepseek paper in the other subthread shows profound and deliberate ignorance.

measurablefunc · 2026-01-20T23:37:52 1768952272

Theorycrafting is very easy. Not a single person in this thread has shown any code to do what they're suggesting. You have access to the best models & yet you still haven't managed to prompt it to give you the code to prove your point so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible or admit you lack the relevant understanding to back up your claims.

nl · 2026-01-21T05:50:09 1768974609

> You have access to the best models & yet you still haven't managed to prompt it to give you the code to prove your point so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible

GPU poor here though...

To quote someone (you...) on the internet:

> More generally, don't ask random people on the internet to do work for you for free.

https://news.ycombinator.com/item?id=46689232

measurablefunc · 2026-01-21T07:00:43 1768978843

Claims require evidence & if you are unwilling to present it then admit you do not have any evidence to support your claims. It's not complicated. Either RL works & you have evidence or you do not know & can not claim that it works w/o first doing the required due diligence which (shockingly) actually requires work instead of empty theory crafting & hand waving.

thorum · 2026-01-20T07:59:42 1768895982

Go read the DeepSeek R1 paper

measurablefunc · 2026-01-20T08:21:13 1768897273

Why would I do that? If you know something then quote the relevant passage & equation that says you can train code generators w/ RL on a novel language w/ little to no code to train on. More generally, don't ask random people on the internet to do work for you for free.

thorum · 2026-01-20T09:50:02 1768902602

Your other comment sounded like you were interested in learning about how AI labs are applying RL to improve programming capability. If so, the DeepSeek R1 paper is a good introduction to the topic (maybe a bit out of date at this point, but very approachable). RL training works fine for low resource languages as long as you have tooling to verify outputs and enough compute to throw at the problem.

whimsicalism · 2026-01-20T15:39:36 1768923576

imo generally not worth it to keep going when you encounter this sort of HN archetype

measurablefunc · 2026-01-20T20:04:57 1768939497

So you should have no problem bringing up the exact passages & equations they use for their policies.

whimsicalism · 2026-01-20T15:38:56 1768923536

well, that’s one way to react to being provided with interesting reading material.

measurablefunc · 2026-01-20T20:05:41 1768939541

Bring up passage that supports your claim. I'll wait.

nl · 2026-01-21T06:03:10 1768975390

Not exactly sure what you are looking for here.

That GRPO works?

> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase

Page 2 of https://arxiv.org/pdf/2402.03300

That GRPO on code works?

> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness

Page 4 of https://arxiv.org/pdf/2501.12948

measurablefunc · 2026-01-21T07:03:38 1768979018

None of those are novel domains w/ their own novel syntax & semantic validators, not to mention the dearth of readily available sources of examples for sampling the baselines. So again, where does it say it works for a programming language with nothing but a grammar & a compiler?

nl · 2026-01-21T12:21:53 1768998113

To quote you:

> here is no RL for programming languages.

and

> Either RL works & you have evidence

This is just so completely wrong, and here is the evidence.

I think everyone in this thread is just surprised you don't seem to know this.

Haven't you seen the hundreds of job ads for people to write code for LLMs to train on?

measurablefunc · 2026-01-21T16:48:55 1769014135

You're not going to get less confused by doubling down. None of your claims are valid & this is because you haven't actually tried to do what you're suggesting. Taking a grammar & compiler & RLing will get you nowhere.

whimsicalism · 2026-01-19T23:51:01 1768866661

not even wrong

measurablefunc · 2026-01-19T23:55:16 1768866916

Exactly.

measurablefunc · 2026-01-17T21:21:02 1768684862

Too many mistakes & ill-defined concepts to correct them all but their conception of Godel's incompleteness theorem is in the "not even wrong" category.

measurablefunc · 2026-01-17T01:49:10 1768614550

Tokens do not encode semantics.

WithinReason · 2026-01-17T09:32:44 1768642364

You can choose which token to sample based on language semantics. You simply don't sample invalid ones. So the language should be restrictive on what tokens it allows enough that invalid code is impossible.

SkiFire13 · 2026-01-17T12:25:54 1768652754

> You can choose which token to sample based on language semantics

Can you though?

> the language should be restrictive on what tokens it allows

This is a restriction on the language syntax, not its semantics.

measurablefunc · 2026-01-16T05:03:29 1768539809

This is the right answer. Unless there is some equivalent of it on the open internet which their search engine can find you should not expect a good outcome.

danpalmer · 2026-01-16T05:09:58 1768540198

"good outcome" is pretty subjective, I do get useful productivity gains from some LLM work, but the issues are the same as they always have been.

measurablefunc · 2026-01-16T06:27:26 1768544846

That's probably b/c you know how to write code & have enough of an understanding about the fundamentals to know when the LLM is bullshitting or when it is actually on the right track.

measurablefunc · 2026-01-15T04:03:06 1768449786

All of these things have readily available analogues on the web which means they are more than likely just laundering open source code & claiming victory.

rzmmm · 2026-01-15T08:58:55 1768467535

There are many open-source toy browser implementations available, so this seems quite likely.

random_mutex · 2026-01-16T06:09:02 1768543742

It doesn't compile so no victory

measurablefunc · 2026-01-16T06:25:05 1768544705

Just the usual corporate marketing & hype.

measurablefunc · 2026-01-09T23:43:24 1768002204

In 1897, the Indiana General Assembly attempted to legislate a new value for pi, proposing it be defined as 3.2, which was based on a flawed mathematical proof. This bill, known as the Indiana pi bill, never became law due to its incorrect assertions and the prior proof that squaring the circle is impossible: https://en.wikipedia.org/wiki/Indiana_pi_bill

measurablefunc · 2026-01-09T23:42:22 1768002142

You're forgetting that some equations have π/2 so on balance nothing will change. It will be the same number of symbols.

ogogmad · 2026-01-10T01:47:38 1768009658

I don't think it's just the sheer number of symbols. It's also the fact that the symbol τ means "turn". So you can say "quarter-turn" instead of π/2.

I'm not sure why that point gets lost in these discussions. And personally, I think of the set of fundamental mathematical objects as having a unique and objective definition. So, I get weirdly bothered by the offset in the Gamma function.

measurablefunc · 2026-01-09T07:46:52 1767944812

One of my hobbies is reading a paper until I find a statement that is seems obviously false to me

    > mathematicians can derive new knowledge by reasoning from axioms without external information

But their entire section on paradoxes is full of what appears to be nonsense to me b/c I have actually studied the listed topics. They're sweeping too many assumption under the rug & I am confident the rest of the paper is not going to resolve any of the issues I noticed.