sherlockxu's comments

sherlockxu · 2025-10-27T01:57:21 1761530241

Seems there’s still some confusion around what DeepSeek-OCR really does. Learn about the model, Contexts Optical Compression, and its impact on LLMs here: https://www.bentoml.com/blog/deepseek-ocr-contexts-optical-c...

sherlockxu · 2025-07-14T12:26:58 1752496018

Thanks. We just added the example.

sherlockxu · 2025-07-11T08:00:11 1752220811

Hi everyone. I'm one of the maintainers of this project. We're both excited and humbled to see it on Hacker News!

We created this handbook to make LLM inference concepts more accessible, especially for developers building real-world LLM applications. The goal is to pull together scattered knowledge into something clear, practical, and easy to build on.

We’re continuing to improve it, so feedback is very welcome!

GitHub repo: https://github.com/bentoml/llm-inference-in-production

DiabloD3 · 2025-07-11T18:47:42 1752259662

I'm not going to open an issue on this, but you should consider expanding on the self-hosting part of the handbook and explicitly recommend llama.cpp for local self-hosted inference.

leopoldj · 2025-07-11T21:51:04 1752270664

The self hosting section covers corporate use case using vLlm and sglang as well as personal desktop use using Ollama which is a wrapper over llama.cpp.

DiabloD3 · 2025-07-12T01:19:53 1752283193

Recommending Ollama isn't useful for end users, its just a trap in a nice looking wrapper.

nl · 2025-07-12T02:52:07 1752288727

Strong disagree on this. Ollama is great for moderately technical users who aren't really programmers or proficient with the command line.

DiabloD3 · 2025-07-12T04:29:25 1752294565

You can disagree all you want, but Ollama does not keep their llama.cpp vendored copy up to date, and also ships, via their mirror, completely random badly labeled models claiming to be the upstream real ones, often misappropriated from major community members (Unsloth, et al).

When you get a model offered by Ollama's service, you have no clue what you're getting, and normal people who have no experience aren't even aware of this.

Ollama is an unrestricted footgun because of this.

nl · 2025-07-12T11:20:10 1752319210

I thought the models were like HuggingFace, where anyone can upload a model and you choose which you pull. The Unsloth ones look like this to me, eg: https://ollama.com/secfa/DeepSeek-R1-UD-IQ1_S

DiabloD3 · 2025-07-12T23:10:57 1752361857

Ollama themselves upload models to the mirror, and often mislabel them.

When R1 first came out, for example, their official copy of it was one of the distills labeled as "R1" instead of something like "R1-qwen-distill". They've done this more than once.

ChromaticPanic · 2025-07-12T20:03:45 1752350625

Not the footgun you think it is. Ollama comes with a few things that make it convenient for casual users.

criemen · 2025-07-11T20:18:32 1752265112

Thanks a lot for putting this together!

I have a question. In https://github.com/bentoml/llm-inference-in-production/blob/..., you have a single picture that defines TTFT and ITL. That does not match my understanding (but you guys know probably more than me): In the graphic, it looks like that the model is generating 4 tokens T0 to T3, before outputting a single output token.

I'd have expected that picture for ITL (except that then the labeling of the last box is off), but for TTFT, I'd have expected that there's only a single token T0 from the decode step, that then immediately is handed to detokenization and arrives as first output token (if we assume a streaming setup, otherwise measuring TTFT makes little sense).

sherlockxu · 2025-07-14T12:25:07 1752495907

Thanks. We have updated the image to make it more accurate.

armcat · 2025-07-11T09:30:48 1752226248

Amazing work on this, beautifully put together and very useful!

sethherr · 2025-07-11T22:07:49 1752271669

This seems useful and well put together, but splitting it into many small pages instead of a single page that can be scrolled through is frustrating - particularly on mobile where the table of contents isn't shown by default. I stopped reading after a few pages because it annoyed me.

At the very least, the sections should be a single page each.

sherlockxu · on March 22, 2024

Hi HN readers,

One thing I didn't mention in this blog post is that developing vertical models tailored to specific industries may be more important than creating general-purpose models.

Actually I have been wondering why we need so many general-purpose models? People in this world come from different industries and what they need is targeted solutions. Vertical models can address nuanced problems that general-purpose models might overlook due to their broad training.

Feel free to leave your comments here :-)

kouteiheika · on March 22, 2024

> Actually I have been wondering why we need so many general-purpose models? People in this world come from different industries and what they need is targeted solutions. Vertical models can address nuanced problems that general-purpose models might overlook due to their broad training.

It'd be interesting to see a direct comparison which would answer the question of "how many less parameters do you need for a targeted vertical model to solve the same problem as a general purpose model".

Like, for example, let's say we pick the task of translating Python to JavaScript, or just any other concrete task: how small could you make a model that only can do this task, vs a general purpose model that can also do this equally well plus a bunch of other things? I wonder if there are any interesting papers tackling this?

camkego · on March 22, 2024

Thank you for writing the article on the various models.

But, I think your HN-comment parent is spot on regarding vertical models vs general purpose.

It would be awesome to see an article about when to try to use general-purpose models vs vertical.

The ability of LLM models to serve as FAQs and chat-bots and everything in-between, is super powerful.

But what are the pros and cons of using vertical vs general purpose LLMs for knowledge bases and chat-bots?

I'd love to see an article that addresses how to create these models, and should they be large-scale general LLMs that are tweaked lightly, or vertical models with baked-in understanding of the vertical they are trying to serve.

An article on this might be very useful to many people.

vouaobrasil · on March 22, 2024

> Actually I have been wondering why we need so many general-purpose models? People in this world come from different industries and what they need is targeted solutions. Vertical models can address nuanced problems that general-purpose models might overlook due to their broad training.

It is because the real way to make money from AI is to use it to distract, brainwash, confuse, and make poeple think they need something when they don't. So, everyone wants a slice of that pie. Plus, large corporations know that if they create a general-purpose AI then it will be the perfect drug to further distract us from their unsustainable practices.

sherlockxu · on Dec 26, 2023

Apple recently revealed a new method in a research paper, enabling the operation of AI on iPhones. This approach streamlines LLMs by optimizing flash storage.

sherlockxu · on Aug 11, 2023

"Our hypothesis is that if our semantic search produces high-quality results, technologists looking for answers will use our search instead of a search engine or conversational AI."

I am not sure about others, but as long as my problem is solved, I do not care whether the answer is provided by AI or human.

sherlockxu · on Jan 11, 2023

I asked the author the same thing. But they have not contributed it back to the community, but perhaps that's sth they will consider in the future.

sherlockxu · on Jan 11, 2023

It looks like the website has some problems. See this link: https://www.streamnative.io/blog/handling-100k-consumers-wit...