findjashua's comments (Hacker News)

won't this essentially disable prompt caching, that you get from a standard append-only chat history?


That's actually a great question, and the answer is yes and no. While it does disable the caching mechanism for the conversation history (though not for the system prompt, which remains constant), there is a difference between a chatbot with an append-only chat history (just an exchange of messages) and an agent that uses a large part of the conversation as a kind of "scratchpad", sometimes even holding variable values near the beginning of the chat (to be sort of 'stateful'). If those variables change (and with them the scratchpad, which can be 30%-40% of the entire conversation), if the cache times out (Claude's standard caching has a 5-minute TTL), or if anything else alters the exact history, you pay to re-cache the entire conversation. Additionally, caching itself still costs money.

The main advantage of the librarian is that it acts as an 'insurance policy' for this caching mechanism. Combine it with solving the context-rot issue, and you get improved performance at scale.
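A toy sketch of why scratchpad edits are so costly, assuming the provider caches on the longest exact message prefix (roughly how the major providers describe their prefix caching; all message contents here are made up):

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading messages that can be served from cache."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # first divergence invalidates everything after it
        n += 1
    return n

history = ["system", "user: hi", "assistant: hello", "user: next"]

# Append-only chat: every previous message is a cache hit.
appended = history + ["assistant: sure"]
print(cached_prefix_len(history, appended))  # 4

# Editing a 'scratchpad' variable early in the chat busts the whole cache.
edited = ["system", "user: hi (x=2)", "assistant: hello", "user: next"]
print(cached_prefix_len(history, edited))  # 1
```

The point is that cache reuse is measured from the first changed token onward, so a small edit near the top of the history forfeits almost all of the cached prefix.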


oh, one other caveat is that each request could result in the curation of system messages earlier in the chat history. i haven't done a deep dive into prompt caching, but that could complicate things. the more i think about it, the more i suspect prompt caching is a patch for "dumb prompting": a way to save money when you're throwing everything you have at the model and praying it gets it right, when it'd make more sense to keep the entire prompt as lean as possible to prevent context rot and maximize the signal-to-noise ratio.


that's a good point. we haven't delved too deeply into prompt caching yet, but my understanding is that it only helps for a conversation that stays "hot", not one a user comes back to every day and keeps adding to over a longer period. i could see an optimization where, while the conversation is "hot", we keep the system message with the summarized index and all subsequent not-yet-summarized messages intact until the conversation cools off.


failed the car wash test.

i think instead of positioning it as a general-purpose reasoning model, they'd have more success focusing on a specific use case (e.g. a coding agent) and benchmarking against the SOTA open models for that use case (e.g. qwen3-coder-next)


Honestly I don't understand why they (or any fast-but-error-prone model) position themselves as coding agents; my experience tells me I'd much rather work with a slow-but-correct model and let it run longer sessions than handhold a fast-but-wrong model.


you need a reviewer agent for every step of the process: review the plan generated by the planner, review each update made by a task-worker subagent, and run a final review once all tasks are done.

this does eat up tokens _very_ quickly though :(
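The control flow described above can be sketched like this (all function names are illustrative stand-ins for LLM calls, identity functions here just to show the structure and where the extra review tokens go):

```python
def run_with_reviews(task, plan_fn, work_fn, review_fn):
    plan = plan_fn(task)
    plan = review_fn("plan", plan)              # reviewer pass on the plan
    results = []
    for step in plan:
        out = work_fn(step)
        results.append(review_fn("step", out))  # reviewer pass per step
    return review_fn("final", results)          # final pass over everything

# Stub "agents" to show the flow (each would be an LLM call in practice):
final = run_with_reviews(
    "add feature",
    plan_fn=lambda t: [f"{t}: part {i}" for i in (1, 2)],
    work_fn=lambda s: f"done({s})",
    review_fn=lambda kind, x: x,  # identity reviewer, for the demo only
)
print(final)
```

Note that for N plan steps this makes N + 2 extra reviewer calls on top of the worker calls, which is where the token cost blows up.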


providers' ToS explicitly state whether or not any data provided is used for training purposes. the usual pattern i've seen is that while they retain the right to use data from free tiers, it's almost never the case for paid tiers


Right, so it's totally cool for them to ignore the law, but their ToS is a binding contract.


Yes, they can be sued for breach of contract. And it’s not a regular ToS but a signed MSA and other legally binding documents.


the license on my open source code is a contract, and they ignored that

if they can get away with it (say by claiming it's "fair use"), they'll ignore corporate ones too


If I were to go out on a limb: those companies spend far more with the AI vendors than you do, and they have larger legal teams than you do. That's both a carrot and a stick for the AI companies to follow the contract.


no, it's not an incentive to follow the contract

it's an incentive to pretend as if you're following the contract, which is not the same thing


Where are they ignoring the law?



That's an allegation. Doesn't an allegation need to be tested?


people that say this tend to have a misinterpretation of copyright, and use all the court cases brought by large rights holders as validation

despite all 3 branches of the government disagreeing with them over and over again


I bet companies are circumventing this in a way that allows them to derive almost all the benefit from your data, yet makes it very hard to build a case against them.

For example, in RL, you have a train set, and a test set, which the model never sees, but is used to validate it - why not put proprietary data in the test set?

I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.
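That argument can be made concrete: data used only for evaluation never touches a gradient update, yet it still steers the final model through selection. A toy illustration (the "candidates" stand in for checkpoints or hyperparameter settings; everything here is hypothetical):

```python
def pick_best(candidates, eval_set):
    """Select the candidate that scores best on the held-out eval set."""
    def score(model, data):
        # toy metric: how many eval items the model gets exactly right
        return sum(model(x) == y for x, y in data)
    return max(candidates, key=lambda m: score(m, eval_set))

eval_set = [(1, 2), (2, 4), (3, 6)]            # proprietary "test" data
candidates = [lambda x: x + 1, lambda x: 2 * x]
best = pick_best(candidates, eval_set)
print(best(5))  # the doubling model wins *because of* the held-out data
```

No candidate was "trained on" the eval set, yet the shipped model is the one the proprietary data picked, which is exactly the gray zone the parent comment describes.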

Or alternatively - it's easier to ask for forgiveness than permission.

I've recently had an apocalyptic vision, that one day we'll wake up and find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop etc.


Given the conduct we've seen to date, I'd trust them to follow the letter - but not the spirit - of IP law.

There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.

The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.


it is amazing that in almost-2026 there is anyone who believes this… amazing


I wonder how much wiggle room there is to collect now (to provide the service, context history, etc.), later anonymise it (somehow, to some level), and then train on it?

Also I wonder if the ToS distinguishes "queries & interactions" from "uploaded data" - I could imagine some tricky language in there that says we won't use your Word document, but we may at some point use the queries you run against it, not as a raw corpus but as a second layer examining which tools/workflows to expand/exploit.


“We don’t train on your data” doesn’t exclude metadata, training on derived datasets via some anonymisation process, etc.

There’s a range of ways to lie by omission, here, and the major players have established a reputation for being willing to take an expansive view of their legal rights.


20-yr CAGR seems to be consistently higher than SP500: https://testfol.io/?s=7boYMdNxqjh


NME at all - 5.1 codex has been the best by far.


How can you stand the excruciating slowness? Claude Code is running circles around codex. The most mundane tasks make it think for a minute before doing anything.


I use it on medium reasoning and it's decently quick. I only switch to gpt-5.1-codex-max xhigh for the most annoying problems.


By learning to parallelize my work. This also solved my problem with slow Xcode builds.


Well you can’t edit files while Xcode is building or the compiler will throw up, so I’m wondering what you mean here. You can’t even run swift test in 2 agents at the same time, because swift serializes access for some reason.

Whenever I have more than 1 agent run Swift tests in a loop to fix things, and another one to build something, the latter will disturb the former and I need to cancel.

And then there’s a lot of work that can’t be parallelized, like complex git rebases - well, you can do other things in a worktree, but good luck merging that after you’ve changed everything in the repo. Codex is really, really bad at git.


Yes these are horrible pain points. I can only hope Apple improves this stuff if it's true that they're adding MCP support throughout the OS which should require better multi-agent handling

You can use worktrees to have multiple copies building or testing at once

I'm a solo dev so I rarely use some git features like rebase. I work out of trunk only without branches (if I need a branch, I use a feature flag). So I can't help with that

What I did is build an Xcode MCP server that controls Xcode via AppleScript and the simulator via accessibility & idb. For running, it gives locks to the agent that the agent releases once it's done via another command (or by pattern matching on logs output or scripting via JS criteria for ending the lock "atomically" without requiring a follow-up command, for more typical use). For testing, it serializes the requests into a queue and blocks the MCP response.

This works well for me because I care more about autonomous parallelization than I do eliminating waiting states, as long as I myself am not ever waiting. (This is all very interesting to me as a former DevOps/Continuous Deployment specialist - dramatically different practices around optimizing delivery these days...)

Once I get this tool working better I will productize it. It runs fully inside the macOS sandbox so I will deploy it to the Mac App Store and have an iOS companion for monitoring & managing it that syncs via iCloud and TailScale (no server on my end, more privacy friendly). If this sounds useful to you please let me know!

In addition to this, I also just work on ~3 projects at the same time and rotate through them by having about 20 iTerm2 tabs open where I use the titles of each tab (cmd-i to update) as the task title for my sake.

I've also started building more with SwiftWASM (with SQLite WASM, and I am working on porting SQLiteData to WASM too so I can have a unified data layer that has iCloud sync on Apple platforms) and web deployment for some of my apps features so that I can iterate more quickly and reuse the work in the apps.


Yes, that makes sense to me. I can't really put builds in a queue, because I give my agents very fine-grained updates, so they need direct feedback to check that what they've just done actually works, or they'll interfere with each other’s work.

I do strive to use Mac OS targets because those are easier to deal with than a simulator, especially when you use Bluetooth stuff and you get direct access to log files and SQLite files.

Solo devs have it way easier in this new world because there’s no strict rules to follow. Whatever goes, goes, I guess.


I found Codex got much better (especially with some AGENTS.md context about it) at ignoring unrelated changes from other agents in the same repo. But making worktrees easier to spin up and integrate back in might be a better approach for you.

When the build fails (rather than functional failure), most of the time I like to give the failure to a brand new agent to fix rather than waste context on the original agent resolving it, now that they're good at picking up on those changes. Wastes less precious context on the main task, and makes it easier to not worry about which agent addresses which build failures.

And then for individual agents checking their own work, I rely on them inspecting test or simulator/app results. This works best if agents don't break tests outside the area they're working in. I try to avoid having parallel agents working on similar things in the same tree.

I agree on the Mac target ease. Especially also if you have web views.

Orgs need to adapt to this new world too. The old way of forcing devs generally to work on only one task at a time to completion doesn't make as much sense anymore even from the perspective of the strictest of lean principles. That'll be my challenge to figure out and help educate that transformation if I want to productize this.


How can I get in touch?


hn () manabi.io


I use the web UI; it's easy to parallelize stuff to 90% done, then manually finish the last 10% and do a quick test


For Xcode projects?


i workshop a detailed outline with it first, and once i'm happy with the plan/outline, i let it run while i go do something else


By my tests (https://github.com/7mind/jopa) Gemini 3 is somewhat better than Claude with Opus 4.5. Both obliterate Codex with 5.1


What's - roughly - your monthly spend when using pay-per-token models? I only use fixed-price Copilot, and my napkin math says I'd be spending something crazy like $200/mo if I went pay-per-token on the more expensive models.


They have subscriptions too (at least Claude and ChatGPT/Codex; I don't use Gemini much). It's far cheaper to use the subscriptions first and then switch to paying per token beyond that.


Something around 500 euros.


Codex is super cheap though; even with the cheapest GPT subscription you get lots of tokens. I use Opus 4.5 at work and Codex at home, and tbh the differences are not that big if you know what you are doing.


NME = "not my experience" I presume.

JFC TLA OD...


i think the standard recommendation is to do range partitioning on a hash of the key, aka hash-range partitioning (i know yugabyte supports this out of the box; i'd be surprised if others don't). this prevents all recent uuids from ending up on the same shard.
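A rough sketch of the idea, not any particular database's implementation (the shard count and key format here are made up): hash the key first, then range-partition the hash space into equal slices, so time-ordered keys scatter instead of piling onto the "latest" range.

```python
import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    """Hash the key, then range-partition the 64-bit hash space."""
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h * NUM_SHARDS >> 64  # which of the 8 equal hash ranges h falls in

# Sequential/time-ordered keys (like UUIDv7) end up spread across shards
# rather than all landing on the shard owning the newest range.
keys = [f"2024-06-01-{i:06d}" for i in range(10)]
print({k: shard_for(k) for k in keys})
```

With plain range partitioning on the raw key, all ten of these keys would fall into the same contiguous range; hashing first trades away efficient range scans for even write distribution.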


Indeed. In fact, Cassandra and DynamoDB have both hash keys and range keys; I've edited my comment to be more specific.


RAG != EBR


why can't you create separate git worktrees, and open each worktree in a separate IDE window? then you get the same functionality, no?


You can, but Aider is designed to work in a console and be interacted with through limited screen real estate, whereas Cursor is designed to be interacted with through a full-screen IDE. Besides the resource-consumption issue, Cursor's manual prompts are hard to interact with when the window is tiny, because it tries to pop up source-file windows and display diffs in an editor pane, for instance.

When we're managing 10-20 AI coding agents to get work done, the interface for each is going to need to be minimal. A lot of Cursor's functionality is going to be vestigial at that point; as a tool it only makes sense as a gap-bridger for people who are still attached to manual coding.


the broker should be adjusting the cost basis on the 1099 to account for the wash sale. are you not seeing that? what broker are you using?

