This looks useful for people not using Claude Code, but I do think that the desktop example in the video could be a bit misleading (particularly for non-developers) - Claude is definitely not taking screenshots of that desktop & organizing it, it's using normal file-management CLI tools. The reason seems a bit obvious - it's much easier to read file names, types, etc. via an "ls" than to try to infer them from an image.
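To make that concrete, here's a toy Node.js version of "organize the desktop by file type" - purely my own sketch, not what Claude actually runs - to show that file names and extensions already give you everything you need, no vision required:

    // Toy sketch: group loose Desktop files into folders by extension.
    // Illustrative only - not Claude's actual implementation.
    const fs = require("node:fs");
    const os = require("node:os");
    const path = require("node:path");

    const desktop = path.join(os.homedir(), "Desktop");
    for (const name of fs.readdirSync(desktop)) {
      const src = path.join(desktop, name);
      if (fs.statSync(src).isDirectory()) continue;            // skip folders
      const ext = path.extname(name).slice(1).toLowerCase() || "misc";
      const dest = path.join(desktop, ext);                    // e.g. ~/Desktop/pdf
      fs.mkdirSync(dest, { recursive: true });
      fs.renameSync(src, path.join(dest, name));
    }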
But it also gets to one of Claude's (Opus 4.5) current weaknesses - image understanding. Claude really isn't able to understand details of images in the same way that people currently can - this is also explained well with an analysis of Claude Plays Pokemon https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i.... I think over the next few years we'll probably see all major LLM companies work on resolving these weaknesses & then LLMs using UIs will work significantly better (and eventually get to proper video stream understanding as well - not 'take a screenshot every 500ms' and call that video understanding).
I keep seeing “Claude image understanding is poor” being repeated, but I’ve experienced the opposite.
I was running some sentiment analysis experiments; describe the subject and the subject's emotional state, that kind of thing. It picked up on a lot of little details: the brand name of my guitar amplifier in the background, what my t-shirt said, and that I must enjoy craft beer and/or running (it was a craft-beer 5K kind of thing), and it picked up on my movement through multiple frames. This was a video sliced into a frame every 500ms, and it noticed me flexing, giving the finger, appearing happy, angry, etc.
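For anyone curious about the mechanics, the slicing step is just something like ffmpeg's fps filter (fps=2 is one frame every 500ms); the model call below is a hypothetical placeholder (`describeFrame` is made up), since that part depends on whichever SDK you use:

    // Extract one frame every 500ms, then feed frames to the model one by one.
    // `describeFrame` is a made-up stand-in for the actual vision API call.
    const { execFileSync } = require("node:child_process");
    const fs = require("node:fs");

    fs.mkdirSync("frames", { recursive: true });
    execFileSync("ffmpeg", ["-i", "clip.mp4", "-vf", "fps=2", "frames/%04d.png"]);

    for (const file of fs.readdirSync("frames").sort()) {
      const image = fs.readFileSync(`frames/${file}`);
      // await describeFrame(image);  // "what is the subject doing / feeling?"
    }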
I was really surprised how much it picked up on, and how well it connected those dots together.
I regularly show Claude Code a screenshot of a completely broken UI - lots of cut-off text, overlapping elements all over the place, the works - and Claude will reply something like "Perfect! The screenshot shows that XYZ is working."
I can describe what is wrong with the screenshot to make Claude fix the problem, but it's not entirely clear to what extent it's using the screenshot versus my description. Any human with two brain cells wouldn't need the problems pointed out.
This is my experience as well. If CC does something and I get broken results and reply with just an image, it will almost always come back with an "X is working!" response. Sometimes just telling it to look more closely is enough; sometimes I have to be more specific. It does seem to be able to read text from screenshots of logs just fine, though, and always seems to process those as I'd expect.
> Claude is definitely not taking screenshots of that desktop & organizing it, it's using normal file-management CLI tools
Are you sure about that?
Try "claude --chrome" with the CLI tool and watch what it does in the web browser.
It takes screenshots all the time to feed back into the multimodal vision and help it navigate.
It can look at the HTML or the JavaScript, but Claude seems to find it "easier" to take a screenshot to find out what exactly is on the screen rather than parse the DOM.
So I don't know how Cowork does this, but there is no reason it couldn't be doing the same thing.
I wonder if there's something to be said about screenshots preventing context poisoning vs. parsing. In other words, the "poison" would have to be visible and obvious on the page, whereas it could be easily hidden in the DOM.
And I do know there are ways to hide data like watermarks in images but I do not know if that would be able to poison an AI.
Considering that very subtle not-human-visible tweaks can make vision models misclassify inputs, it seems very plausible that you can include non-human-visible content the model consumes.
Maybe at one time, but it absolutely understands images now. In VS Code Copilot, I am working on a Python app that generates mesh files that are imported into a Blender project. I can take a screenshot of what the mesh file looks like and ask Claude Code questions about the object, in the context of the Blender file. It even built a test script that would generate the mesh, import it into the Blender project, and render a screenshot. It built me a VS Code task to automate the entire workflow and then compare the image to a mock image. I found its understanding of the images almost spooky.
I'm doing extremely detailed and extremely visual JavaScript UIs with Claude Code, using React and Tailwind, driven by lots of screenshots, which often one-shot the solution.
Claude Opus 4.5 can understand images: one thing I've done frequently in Claude Code with great success is just showing it an image of weird visual behavior (drag and drop into CC), and it finds the bug near-immediately.
The issue is that Claude Code won't automatically Read images by default as a part of its flow: you have to very explicitly prompt it to do so. I suspect a Skill may be more useful here.
I've done similar while debugging an iOS app I've been working on this past year.
Occasionally it needs some poking and prodding but not to a substantial degree.
I also was able to use it to generate SVG files based on the in-app design, using screenshots and the code that handles rendering the UI, and it was able to do a decent job. Granted, not the most complex of SVGs, but the process worked.
On a more serious note, I really only use Windows for games & I'm still always frustrated with how many updates (& restarts during updates) Windows needs. My fans are constantly spinning on Windows too (laptop or desktop), whereas my Mac & Linux machines are generally silent outside of heavy load.
This is common for any self-updating software that you use infrequently.
A friend of mine complained that he hated how Firefox "always" wants to restart with an update. I couldn't understand what he was on about. Turned out I use Firefox daily and he uses it like once every 2 months to test something and yeah, Firefox has an update out every 2 months or so, so that fits.
It's the same with Windows (and, I assume, macOS). Use Windows more and the updater will disappear out of sight.
I update Linux maybe once a year. Sure, there are security vulnerabilities. But I'm behind a firewall. And meanwhile, I don't have to spend any time dealing with update issues.
But Windows is made for the big masses. It's definitely a good thing that Microsoft forces auto-updates, because otherwise 95% of people would run around with devices that have gaping security holes. And 90% of those people are not behind a firewall 100% of the time.
The unfortunate side effect is that they are shoving ad- and bloatware down your throat through these updates.
But that's because Microsoft does not care about the end user at all. It's not the fault of auto-updates.
My Mac doesn't randomly reboot, doesn't force updates on shutdown, and doesn't have weekly updates that require restarts. IMO Apple handles updates much better than Windows.
Windows still reboots instead of shutting down when you choose "Update and shut down".
> My fans are constantly spinning on Windows too (laptop or desktop), whereas my Mac & Linux machines are generally silent outside of heavy load.
Defender seemingly needs to check every 10ms that you still don't have a virus.
I'm always amused by these occasional "you still don't have any viruses" popup notifications from Defender. Well, good to know, thank you very much, I guess.
I feel like I've been monkey's-pawed with the downfall of Windows for gaming. I.e., rather than being at the point where everything just works best/easiest in my Bazzite install, it's a game of DRM, modding-tool support, feature support, and random "this game runs better on Windows, this game runs better on Bazzite" discovery. Also, Windows / SteamOS-clone / "normal Linux" setups all have their own very awkward corners around the non-gaming portions. I've not found one that doesn't require substantial tweaking to get a usable all-around experience, unless you buy a device to use as more of a dedicated gaming console (an Xbox/Steam Deck type device).
I miss the ~mid Windows 7 era. Not that everything ran perfectly without issues on Windows 7 at the time, particularly old games, but at least there was one option good enough that you could just go with it first, instead of "see which of the games you play work best where".
It really depends on what you play. I've been playing online co-op regularly with a bunch of friends since Covid times. We're jumping to new (well, on sale) games regularly, and the only recent time I booted to Windows was because a 4-player mod for Remnant II _might_ not work on Linux. Can't remember the previous game that did not work on Linux.
I'm so used to things working without major tinkering that I forget to check protondb most of the time.
Nice, this is similar to what I was wondering about - it looks like it's pretty limited in capability right now (it only supports canvas2d at the moment: https://nxjs.n8.io/runtime/rendering/canvas), but in theory it would allow you to make a layer to convert WebGPU or WebGL games for the Switch (ignoring the huge performance drop going from V8 / JIT JS engines to QuickJS).
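For reference, the kind of code that fits the current canvas2d support is just a standard 2D loop like the sketch below; the only runtime-specific part is where the canvas and requestAnimationFrame come from (a DOM element in a browser; whatever nx.js exposes per the docs linked above - I'm assuming web-style APIs are available there):

    // Plain Canvas 2D loop - works anywhere a "2d" context and
    // requestAnimationFrame exist. Getting `canvas` is the only
    // runtime-specific bit.
    function start(canvas) {
      const ctx = canvas.getContext("2d");
      let x = 0;
      function frame() {
        ctx.fillStyle = "#000";
        ctx.fillRect(0, 0, canvas.width, canvas.height); // clear
        ctx.fillStyle = "#0f0";
        ctx.fillRect(x, canvas.height / 2, 32, 32);      // moving square
        x = (x + 2) % canvas.width;
        requestAnimationFrame(frame);
      }
      requestAnimationFrame(frame);
    }
    // In a browser: start(document.querySelector("canvas"));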
The interesting part here is about AthenaEnv. It looks like it uses QuickJS for the JavaScript interpreter and wraps the native system libraries that the PS2 provides.
I'm wondering if there's a similar modern project that would allow writing JavaScript canvas games (WebGPU / WebGL) and publishing on Switch / Switch 2, PS5, and Xbox.
From my understanding, they explicitly disallow JITs, so you can't just wrap your JS game with Electron / Node Webkit and use V8. I'm not sure if anyone has tried publishing a game using a V8-jitless Electron fork - the SDKs for consoles are under NDA, so there's not really much written about it publicly, & most games using Unreal or Unity don't deal with these things themselves.
PC, Mac, and even mobile are surprisingly easier here because you can just run the JS via Electron, or in a webview on mobile.
Yeah, I saw the video about that earlier which is what led me to wonder if there was a native JS way now.
They used Kha to port only the console versions; the desktop versions remained JS, from my understanding: https://github.com/Kode/Kha which is built on top of Haxe. This works, but it also means not having a single codebase anymore, which would be one of the benefits of a JS-based system.
There are other options here - something like using an AOT JS compiler such as Porffor, but from my understanding it's never been tested (and would probably be missing a lot of support to get it working - like shimming canvas & providing a WebGPU context that the compiled JS could execute against).
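To illustrate what "shimming canvas" would even mean there - totally hypothetical, `native.*` below is a made-up platform binding, not any real console API - you'd hand the compiled JS an object that looks enough like a canvas and forward each draw call to whatever native renderer the platform gives you:

    // Hypothetical canvas shim for an AOT-compiled JS game.
    // `native` is an imaginary platform binding, for illustration only.
    function makeCanvasShim(native, width, height) {
      const ctx = {
        fillStyle: "#000",
        fillRect(x, y, w, h)  { native.fillRect(x, y, w, h, ctx.fillStyle); },
        clearRect(x, y, w, h) { native.clear(x, y, w, h); },
        drawImage(img, x, y)  { native.blit(img.handle, x, y); },
      };
      return {
        width,
        height,
        getContext(kind) {
          if (kind !== "2d") throw new Error("only 2d is shimmed");
          return ctx;
        },
      };
    }
    // The game's existing code then just sees a "canvas", e.g.:
    //   globalThis.document = { querySelector: () => makeCanvasShim(nativeBinding, 1280, 720) };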
The official Nintendo 3DS and Wii U SDKs both provided an Electron-like framework that allowed games to be written with web technologies. I seem to recall that it was discontinued at some point before the Switch? The Switch does have a WebKit browser applet that games can call to display web-based content, but it's pretty limited since JIT is disabled like you say. I've only ever seen it used for e-manuals.
Was able to sign up for the Max plan & start using it via opencode. It does a way better job than Qwen3 Coder in my opinion. Still extremely fast, but in less than an hour I was able to use 7M input tokens, so with a single agent running I could easily pass that 120M daily token limit. The speed difference compared to Claude Code is significant though - to the point where I'm not waiting for generation most of the time, I'm waiting for my tests to run.
For reference, each new request needs to send all previous messages - tool calls force new requests too. So it's essentially cumulative when you're chatting with an agent - my opencode agent's context window is only 50% used at 72k tokens, but Cerebras' tracking online shows that I've used 1M input tokens and 10k output tokens already.
> For reference, each new request needs to send all previous messages - tool calls force new requests too. So it's essentially cumulative when you're chatting with an agent - my opencode agent's context window is only 50% used at 72k tokens, but Cerebras' tracking online shows that I've used 1M input tokens and 10k output tokens already.
This is how every "chatbot" / "agentic flow" / etc. works behind the scenes. That's why I liked that "you should build an agent" post a few days ago - it gets people to really understand what's behind the curtain. It's requests all the way down, sometimes with more context added, sometimes with less (subagents & co).
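If it helps, the whole "agent" is roughly the loop below (the function names are placeholders, not any particular SDK). Note that the entire messages array goes out on every single call, which is exactly why cumulative input tokens balloon even when the context window itself is only half full:

    // Skeleton agent loop. callModel/runTool are hypothetical stand-ins.
    const messages = [{ role: "system", content: "You are a coding agent." }];

    async function step(userText) {
      messages.push({ role: "user", content: userText });
      while (true) {
        const reply = await callModel(messages);           // full history re-sent each time
        messages.push(reply);
        if (!reply.toolCall) return reply.content;         // plain answer -> done
        const result = await runTool(reply.toolCall);      // e.g. run a shell command
        messages.push({ role: "tool", content: result });  // next request is even bigger
      }
    }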
Many API endpoints (and local services, for that matter) do caching at this point though, with much cheaper prices for input tokens that hit the cache. I know Anthropic does this, and I think DeepSeek does too, at the very least.
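With Anthropic the request shape is roughly the snippet below (from memory, so double-check the docs; the model id is illustrative and LONG_STABLE_PREFIX / conversationSoFar are placeholders): you mark a long stable prefix as cacheable, and later requests sharing that prefix bill those tokens at the cached rate.

    // Rough shape of an Anthropic-style cached request. Only the tail of
    // `messages` changes per turn, so the big prefix can be served from cache.
    const body = {
      model: "claude-opus-4-5",                  // illustrative model id
      max_tokens: 1024,
      system: [
        {
          type: "text",
          text: LONG_STABLE_PREFIX,              // big, unchanging context (placeholder)
          cache_control: { type: "ephemeral" },  // mark prefix as cacheable
        },
      ],
      messages: conversationSoFar,               // placeholder for the running chat
    };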
At those speeds, it's probably impossible. It would require enormous amounts of memory (which the chip simply doesn't have, there's no room for it) or rather a lot of bandwidth off-chip to storage, and again they wouldn't want to waste surface area on the wiring. Bit of a drawback of increasing density.
Is this built with JS / something like Fabric.js? There are some things that feel very similar to a web app that I worked on before. Wondering if there are plans to have a plugin API at some point, if it is.
One interesting thing here is that the chat side panel is agentic - it can read tab contents, open links in the existing tab or create new tabs, and do most of the standard "summarize", etc. things too.
This might be the first time that I move off of Chrome for an extended period of time.
uBlock Origin Lite kinda sucks compared to the OG uBlock Origin, though. YouTube videos have this awkward buffering at the start now, sometimes YouTube homepage ads still load, sponsored placements on GrubHub/DoorDash appear and can't be removed, etc.
"I pay to remove ads so my experience with a neutered adblocker isn't as bad" is a weird take.
If you think the end game is companies deciding they're comfortable with removing ads in exchange for a subscription, rather than a subscription with a gradually increasing amount of ads, then I have a bridge to sell you.
I support the creators I watch by donating to them directly.
I mentioned multiple domains...? I said it also impacts sponsored listings on food delivery platforms. Those used to be blocked and, more broadly, the ability to manually block specific elements of a webpage was lost with the transition to UB lite.