Hacker Newsnew | past | comments | ask | show | jobs | submit | eob's commentslogin

I think because you don't know /which/ developer you're going to get.

One interesting aspect of LLMs is that each one, weights frozen, can be thought of as a single developer whose work you have already evaluated.

The cost of finding, evaluating, and negotiating with a new human is tremenous.


These comments are so filled with cool projects and visualizations I’ll throw my own out there:

Trying to generate non-slop, print-worthy recipe books for different diaspora communities.

www.robotbookclub.com

I’m pretty sure the recipes have surpassed slop. The photos are pretty close. The layouts, intros, and “chef photos” need a lot of work though.


It is good enough to have much fun reading it. About chef photos, in Cook French in America, page 11: no eggs on a breakfast tartine.

As a French, I would like the French version.


This is how I write down recipes for myself! There’s got to be some cognitive archetype that works better with this style encoding.


Do you have any suspicion about what is different between the backends?

That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.


I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I never really was a SWE.

I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.

I really hope folks better than me at backend stuff take a look and dive into it though because it's definitely under-reported and super consistent across model families and backends ranging from ollama, lama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.


Hey, this is most probably related to the chat template or the reasoning parser or the tool call parser or also things like kv cache quantization and possibly other params that affect results like the regular top k top p and all of that, the backend often sets its own defaults or the lack of them. It’s best to have all these under control if possible. I wonder regarding this project have you been testing it on real world projects? I’m working on an agentic loop as well also using a local model.


Yes I've now used it "in the wild" for a handful of use-cases. I still run into the backend thing even when declaring params though, which is odd to me. But there might be params not typically passed in with the model that backends are setting. Again, really not my area of expertise.

As for consumers, I've done a home assistant, an agentic coding harness, and an autonomous engineering project (still in flight).


My guess it's just the emergent behavior that results when a company doesn't provide developers time to fix bugs.

If their week is already booked full just trying to keep up with the roadmap deadlines, a bug ticket feels like being tossed a 25lb weight when you're drowning.

You could say: "but have pride in your work!"

But if your company only values shipping, not fixing, that attitude doesn't make it through the first performance review.


What I've found to be most effective for program management is to set aside a maintenance team separate from the feature teams. The roadmap is then planned without counting anything for the maintenance team and they deal with bug tickets as they come in. Rotate the assignment periodically so that every developer has to occasionally spend a few months on the maintenance team.


Doesn’t this lead to problems like the feature team pushing buggy code and having no accountability or responsibility to deal with it?

My preference is to treat the defects like feature work, size and plan. Yes you might not get all the feature work done but the team is accountable for everything they make


There's a lot more to effective program quality management than I can explain in a comment here. Forcing all developers to rotate through the maintenance team is one incentive not to ship crap because they might end up having to deal with it anyway. But more importantly you have to shift left the quality assurance and control activities to minimize the risk of defect leakage in the first place. And set up a closed-loop system where any leaked defect triggers a rigorous root-cause analysis that results in further process improvement.


You’ve just described AGILE development, a way for product owners to backlog code rot while empowering developers to feel like they have a say in things.


Some outlets reporting T-Mobile and ATT as well.

I assume state on state cyber attacks are commonplace but get minimized to avoid public fear.. perhaps this will be the first notable one.


The alternative network reports are most likely people trying to call Verizon customers and reporting an outage when they can't get through.


You think like a person who’s debugged large systems failures before :). That feels very plausible.


I've seen it happen before - in the big AT&T outage a couple years ago reports came in on downdetector of outages for other providers.


Estonia was the first major NATO victim of such things https://en.wikipedia.org/wiki/2007_cyberattacks_on_Estonia

Made worse by the fact Estonia is a more networked society than, for example, the US.


The down detector site has Verizon outage reports two order of magnitude bigger, so it doesn't seem like a cyber attack to me.

https://downdetector.com/status/t-mobile/ ~ 1,600

https://downdetector.com/status/att/ ~ 1,500

https://downdetector.com/status/verizon/ ~ peaked at ~169k, dropped to 67k


T-Mobile is up and Verizon is down in my house


I build coding agents for a living, and I'm struggling to map this onto the set of things I do at work.

In general, interoperability and user choice are really important for us to get right as the community of people building AI platforms...

Have others reading this document been able to map it onto their work?

As a specific example:

> ai://bank/service/payments?amount=10&currency=USD

I'm not sure what this is representing here. Is it a way to encode a clickable link to chat with `bank` about `service/payments` with a few additional args attached?


Third party apps can’t use the network though. Iirc there’s an async message queue with eventual delivery that each app gets, which it can use to send messages back and forth with a paired phone app.


That was once the case, but no longer. Third-party WatchOS apps can work without a phone present, up to being installed directly from the watch's app store. They can definitely do independent networking, but there are still some restrictions, eg they can't do it when backgrounded, and websockets are pretty locked down (only for audio-streaming as per Apple policy).

I reckon the lack of general-purpose websockets is probably the issue for a system based on Phoenix LiveView.


Bravo -- this is fantastic.

I've been waiting for this ever since reading some interview with Orson Scott Card ages ago. It turns out he thinks of his novels as radio theater, not books. Which is a very different way to experience the audio.


Thanks for the kind words :)))


Or vice versa - perhaps some subset of the "thought chains" of Cyc's inference system could be useful training data for LLMs.


When I first learned about LLMs, what came to mind is some sort of "meeting of the minds" with Cyc. 'Twas not to be, apparently.


I view Cyc's role there as a RAG for common sense reasoning. It might prevent models from advising glue on pizza.

    (is-a 'pizza 'food)
    (not (is-a 'glue 'food))
    (for-all i ingredients
      (assert-is-a i 'food))



sure but the bigger models don’t make these trivial mistakes, and I’m not sure if translating the LLM english sentences into LISP and trying to check them is going to be more accurate than just training the models better


The bigger models avoid those mistakes by being, well, bigger. Offloading to a structured knowledgebase would achieve the same without the model needing to be bigger. Indeed, the model could be a lot smaller (and a lot less resource-intensive) if it only needed to worry about converting $LANGUAGE queries to Lisp queries and converting Lisp results back into $LANGUAGE results (where $LANGUAGE is the user's natural language, whatever that might be), rather than having to store some approximation of that knowledgebase within itself on top of understanding $LANGUAGE and understanding whatever ad-hoc query/result language it's unconsciously invented for itself.


Beyond just checking for mistakes, it would be interesting to see if Cyc has concepts that the LLMs don't or vice versa. Can we determine this by examining the models' internals?


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: