That's actually a great question. and the answer is yes and no;
While it does disable the caching mechanism for the conversation history (and not for the system prompt, who remains constant), there is a difference between a chatbot with a constant chat history (just exchange of messages) and an agent who uses a large part of the conversation as a type of "scratchpad", sometimes even holding variables value in the beginning of the chat (to be sort of 'stateful'). if these variables change, the scratchpad changes (can be even 30%-40% of the entire conversation), there is a timeout in the cache (Claude gives you 5 minutes of cache for normal caching) or any other change to the exact history - you get a recaching of the entire conversation. additionally, caching still costs money.
The main advantage of the librarian is that is an 'insurance policy' for this caching mechanism. combining it with solving the context rot issue - and you get improved performance at scale.
oh, one other caveat is that each request could result in the curation of system messages earlier in the chat message history, i haven't done a deep dive into prompt caching, but that could complicate things. the more i think about it, the more i wonder that the prompt caching is a patch for "dumb prompting" to try to save money when you're doing things the dumb way of throwing everything you have at it and praying it gets it right, when it'd just make more sense to keep the entirety of the prompt as lean as possible to prevent context rot and maximize signal to noise ratio.
that's a good point, we haven't delved too deeply into prompt caching yet, but my understanding is that it only helps for a conversation that remains "hot", not one that a user just comes back to everyday and keep adding more to it over a longer period of time. i could see some optimization there where when the conversation is "hot" we keep the system message with the summarized index and all subsequent conversation messages that haven't been summarized intact until the conversation cools off.
i think instead of postiioning as a general purpuse reasoning model, they'd have more success focusing on a specific use case (eg coding agent) and benchmark against the sota open models for the use case (eg qwen3-coder-next)
Honestly I don't understand why they/any fast-and-error-prone model position themselves as coding agents; my experience tells me that I'd much rather working with a slow-but-correct model and let it run longer session than handholding a fast-but-wrong model.
you need a reviewer agent for every step of the process - review the plan generated by the planner, the update made by the task worker subagent, and a final reviewer once all tasks are done.
providers' ToS explicitly states whether or not any data provided is used for training purposes. the usual that i've seen is that while they retain the right to use the data on free tiers, it's almost never the case for paid tiers
If I were to go out on a limb, those companies spend more on tech companies than you and they have larger legal teams than you. That is a carrot and a stick for AI companies to follow the contract.
I bet companies are circumventing this in a way that allows them to derive almost all the benefit from your data, yet makes it very hard to build a case against them.
For example, in RL, you have a train set, and a test set, which the model never sees, but is used to validate it - why not put proprietary data in the test set?
I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.
Or alternatively - it's easier to ask for forgiveness than permission.
I've recently had an apocalyptic vision, that one day we'll wake up, an find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop etc.
Given the conduct we've seen to date, I'd trust them to follow the letter - but not the spirit - of IP law.
There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.
The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.
I wonder how much wiggle there is for collect now (to provide service, context history, etc), then later anonymise (some how, to some level) and then train on it?
Also I wonder if the ToS covers "queries & interaction" vs "uploaded data" - I could imagine some tricky language in there that says we wont use your word document, but we may at some time use the queries you put against it, not as raw corpus but as a second layer examining what tools/workflows to expand/exploit.
“We don’t train on your data” doesn’t exclude metadata, training on derived datasets via some anonymisation process, etc.
There’s a range of ways to lie by omission, here, and the major players have established a reputation for being willing to take an expansive view of their legal rights.
How can you stand the excruciating slowness? Claude Code is running circles around codex. The most mundane tasks make it think for a minute before doing anything.
Well you can’t edit files while Xcode is building or the compiler will throw up, so I‘m wondering what you mean here. You can’t even run swift test in 2 agents at the same time, because swift serializes access for some reason.
Whenever I have more than 1 agent run Swift tests in a loop to fix things, and another one to build something, the latter will disturb the former and I need to cancel.
And then there’s a lot of work that can’t be parallelized, like complex git rebases - well you can do other things in a worktree, but good luck merging that after you‘ve changed everything in the repo. Codex is really really bad at git.
Yes these are horrible pain points. I can only hope Apple improves this stuff if it's true that they're adding MCP support throughout the OS which should require better multi-agent handling
You can use worktrees to have multiple copies building or testing at once
I'm a solo dev so I rarely use some git features like rebase. I work out of trunk only without branches (if I need a branch, I use a feature flag). So I can't help with that
What I did is build an Xcode MCP server that controls Xcode via AppleScript and the simulator via accessibility & idb. For running, it gives locks to the agent that the agent releases once it's done via another command (or by pattern matching on logs output or scripting via JS criteria for ending the lock "atomically" without requiring a follow-up command, for more typical use). For testing, it serializes the requests into a queue and blocks the MCP response.
This works well for me because I care more about autonomous parallelization than I do eliminating waiting states, as long as I myself am not ever waiting. (This is all very interesting to me as a former DevOps/Continuous Deployment specialist - dramatically different practices around optimizing delivery these days...)
Once I get this tool working better I will productize it. It runs fully inside the macOS sandbox so I will deploy it to the Mac App Store and have an iOS companion for monitoring & managing it that syncs via iCloud and TailScale (no server on my end, more privacy friendly). If this sounds useful to you please let me know!
In addition to this, I also just work on ~3 projects at the same time and rotate through them by having about 20 iTerm2 tabs open where I use the titles of each tab (cmd-i to update) as the task title for my sake.
I've also started building more with SwiftWASM (with SQLite WASM, and I am working on porting SQLiteData to WASM too so I can have a unified data layer that has iCloud sync on Apple platforms) and web deployment for some of my apps features so that I can iterate more quickly and reuse the work in the apps.
Yes, that makes sense to me. I cannot really put builds in a queue because I have very fine-grained updates that I tell my agents so they do need the direct feedback to check what they have just done actually works, or they will interfere with each other’s work.
I do strive to use Mac OS targets because those are easier to deal with than a simulator, especially when you use Bluetooth stuff and you get direct access to log files and SQLite files.
Solo devs have it way easier in this new world because there’s no strict rules to follow. Whatever goes, goes, I guess.
I found Codex got much better (and with some AGENTS.md context about it) at ignoring unrelated changes from other agents in the same repo. But making worktrees easier to spin up and integrate back in might be a better approach for you.
When the build fails (rather than functional failure), most of the time I like to give the failure to a brand new agent to fix rather than waste context on the original agent resolving it, now that they're good at picking up on those changes. Wastes less precious context on the main task, and makes it easier to not worry about which agent addresses which build failures.
And then for individual agents checking their own work, I rely on them inspecting test or simulator/app results. This works best if agents don't break tests outside the area they're working in. I try to avoid having parallel agents working on similar things in the same tree.
I agree on the Mac target ease. Especially also if you have web views.
Orgs need to adapt to this new world too. The old way of forcing devs generally to work on only one task at a time to completion doesn't make as much sense anymore even from the perspective of the strictest of lean principles. That'll be my challenge to figure out and help educate that transformation if I want to productize this.
What's - roughly - your monthly spend when using ppt models? I only use fixed priced copilot, and my napkin maths says I'd be spending something crazy like $200/mo if I went ppt on the more expensive models.
They have subscriptions too (at least Claude and ChatGPT/Codex; I don't use Gemini much). It's far cheaper to use the subscriptions first and then switch to paying per token beyond that.
Codex is super cheap though even with the cheapest GPT subscription you get lots of tokens. I use 4.5 opus at work and codex at home tbh the differences are not that big if you know what you are doing.
i think the standard recommendation is to do range partitioning on the hash of the key, aka hash range partitioning (i know yugabyte supports this out of the box, i'd be surprised if others don't). this prevents the situation of all recent uuids ending up on the same shard.
You can, but Aider is designed to work in a console and be interacted with through limited screen real estate, whereas cursor is designed to be interacted with through a full screen IDE. Besides the resource consumption issue, Cursor's manual prompts are hard to interact with when the window is tiny because it wants to try and pop up source file windows and display diffs in an editor pane, for instance.
When we're managing 10-20 AI coding agents to get work done, the interface for each is going to need to be minimal. A lot of cursor's functionality is going to be vestigial at that point, as a tool it only makes sense as a gap-bridger for people that are still attached to manual coding.