I'm a Plus member, and the biggest limitation I'm running into by far is the maximum length of the context window. I keep having context fall out of scope partway through a conversation, or I can't give it a large document that I can then interrogate.
So if I go from paying $20/month for 32,000 tokens, to $200/month for Pro, I expect something more akin to Enterprise's 128,000 tokens or MORE. But they don't even discuss the context window AT ALL.
For anyone else out there looking to build a competitor, I STRONGLY recommend you consider the context window as a major differentiator. Let me give you an example of a use case that ChatGPT simply cannot handle well today: dump an XML file into it, then ask it questions about that file. You can attach files to ChatGPT, but it's basically pointless because it isn't able to view the entire file at once due to, again, the limited context window.
Seems like something that would be worth pinging OpenAI about because it's a pretty important claim that they are making on their pricing page! Unless it's a matter of counting tokens differently.
According to the pricing page, 32K context is for Plus users and 128K context is for Pro users. Not disagreeing with you, just adding context for readers: while you're explaining that the 4o API has a 128K window, the 4o model in ChatGPT appears to have a varying context window depending on account type.
The longer the context the more backtracking it needs to do. It gets exponentially more expensive. You can increase it a little, but not enough to solve the problem.
Instead you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context.
An LLM is a cool tool, but you need to build around it. OpenAI should start shipping these other components so people can build their own solutions, and make their money selling shovels.
Instead they want end users to pay them to use the LLM without any custom tooling around it. I don't think that's a winning strategy.
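A minimal sketch of that chunk / embed / retrieve pattern, assuming the sentence-transformers package and a plain in-memory numpy index as stand-ins for whatever embedding model and vector database you actually pick (the file name and query are made up):

```python
# Rough RAG sketch: chunk a document, embed the chunks, retrieve only the
# most relevant ones, and put just those into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def chunk(text, size=500, overlap=50):
    # Naive fixed-size character chunks with a bit of overlap.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")

document = open("big_document.xml").read()          # hypothetical input file
chunks = chunk(document)
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=5):
    # Cosine similarity is just a dot product on normalized vectors.
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(chunk_vecs @ q))[:k]
    return [chunks[i] for i in top]

hits = retrieve("What does the <config> element control?")
prompt = ("Answer using only this context:\n" + "\n---\n".join(hits)
          + "\n\nQuestion: What does the <config> element control?")
```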
Transformer architectures generally take quadratic time wrt sequence length, not exponential. Architectural innovations like flash attention also mitigate this somewhat.
Backtracking isn't involved, transformers are feedforward.
No, additional context does not cause exponential slowdowns, and you absolutely can use FlashAttention tricks during training; I'm doing it right now. Transformers are not RNNs: they are not unrolled across timesteps, and the backpropagation path for a 1,000,000-context LLM is not any longer than that of a 100-context LLM of the same size. The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations. These calculations can be further parallelized using tricks like ring attention to distribute very large attention calculations over many nodes. This is how Google trained their 10M-context version of Gemini.
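For intuition on where the quadratic term comes from, here's naive single-head attention in numpy; the (n, n) score matrix is the part that fused kernels like FlashAttention compute in tiles without ever materializing (this is only an illustration of the scaling, not how those kernels are actually written):

```python
# Naive self-attention for one head: the (n, n) score matrix is the
# quadratic part. Fused kernels compute the same result tile by tile,
# so memory stays roughly linear in n.
import numpy as np

def naive_attention(q, k, v):
    # q, k, v: (n, d) for a single head
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)            # (n, n)  <- O(n^2) compute and memory
    scores -= scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)       # softmax over keys
    return p @ v                             # (n, d)

n, d = 4096, 128
q = np.random.randn(n, d).astype(np.float32)
k = np.random.randn(n, d).astype(np.float32)
v = np.random.randn(n, d).astype(np.float32)
out = naive_attention(q, k, v)
# The score matrix alone is n*n*4 bytes: 4096^2 * 4 ≈ 64 MB per head per layer,
# and it grows 100x if you go from 4k to 40k context.
```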
So why are the context windows so "small", then? It would seem that if the cost was not so great, then having a larger context window would give an advantage over the competition.
The cost for both training and inference is vaguely quadratic, while, for the vast majority of users, the marginal utility of additional context is sharply diminishing. For 99% of ChatGPT users, something like 8192 tokens, or about 20 pages of context, would be plenty. Companies have to balance the cost of training and serving models. Google did train an uber-long-context version of Gemini, but since Gemini itself fundamentally was not better than GPT-4 or Claude, this didn't really matter much: so few people actually benefited from such a niche advantage that it didn't shift the playing field in their favor.
Marginal utility only drops because effective context is really bad, i.e. most models still vastly prefer the first things they see and those "needle in a haystack" tests are misleading in that they convince people that LLMs do a good job of handling their whole context when they just don't.
If we have the effective context window equal to the claimed context window, well, I'd start worrying a bit about most of the risks that AI doomers talk about...
There has been a huge increase in context windows recently.
I think the larger problem is "effective context" and training data.
Being technically able to use a large context window doesn't mean a model can actually remember or attend to that larger context well. In my experience, the kinds of synthetic "needle in haystack" tasks that AI companies use to show how large a context their model can handle (see the sketch after this comment) don't translate very well to more complicated use cases.
You can create data with large context for training by synthetically adding in random stuff, but there's not a ton of organic training data where something meaningfully depends on something 100,000 tokens back.
Also, even if it's not scaling exponentially, it's still scaling: at what point is RAG going to be more effective than just having a large context?
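For anyone who hasn't seen one, a synthetic needle-in-a-haystack probe is roughly this simple, which is part of why passing it says little about harder long-context use; `ask_model` is a placeholder for whatever chat API you're testing:

```python
# Sketch of a needle-in-a-haystack probe: bury one fact at a chosen depth
# in filler text and check whether the model can retrieve it. Passing this
# is a much weaker claim than "the model uses its whole context well".
FILLER = "The sky was grey and the coffee was lukewarm. "  # arbitrary padding
NEEDLE = "The magic number for project Falcon is 7421."

def build_prompt(total_chars=200_000, depth=0.5):
    haystack = FILLER * (total_chars // len(FILLER))
    pos = int(len(haystack) * depth)
    doc = haystack[:pos] + NEEDLE + " " + haystack[pos:]
    return doc + "\n\nQuestion: What is the magic number for project Falcon?"

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call whatever chat API you're testing here")

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    answer = ask_model(build_prompt(depth=depth))
    print(depth, "7421" in answer)
```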
Great point about the meaningful datasets, this makes perfect sense, especially with regard to SFT and RLHF. Although I suppose it would be somewhat easier to do pretraining on really long contexts (books, I assume?).
Because you have to do inference distributed across multiple nodes at that point: for prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you would need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's $1440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only about 700 dollars per million tokens. There are architectural modifications that will bring that down significantly, but it's still really, really expensive, and also quite hard to get to work.
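A back-of-envelope version of that KV-cache math, assuming the published Llama 3.1 405B shape (126 layers, 8 KV heads via GQA, head dim 128) and the serving numbers quoted above:

```python
# KV-cache size and serving cost for a 405B model at 10M context.
# Assumed config (Llama 3.1 405B): 126 layers, 8 KV heads (GQA),
# head_dim 128; bf16 = 2 bytes per value.
n_layers, n_kv_heads, head_dim = 126, 8, 128
bytes_per_val = 2                      # bf16
context = 10_000_000

kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
kv_total_tb = kv_per_token * context / 1e12
print(f"KV cache: {kv_total_tb:.1f} TB")                       # ~5.2 TB

h200_mem_gb = 141
print(f"H200s just for KV: {kv_total_tb * 1000 / h200_mem_gb:.0f}")  # ~36-37

# Serving cost at the numbers quoted above: 48 GPUs at $3/GPU-hour,
# ~100k generated tokens per hour for a single user (no batching).
gpus, dollars_per_gpu_hour, tokens_per_hour = 48, 3, 100_000
cost_per_million = gpus * dollars_per_gpu_hour * (1_000_000 / tokens_per_hour)
print(f"${cost_per_million:.0f} per million tokens")            # $1440
```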
Another factor in context window size is effective recall. If the model can't actually use a fact from 1M tokens earlier, accurately and precisely, then there's no benefit, and it's harmful to the user experience to allow the use of a poorly functioning feature. Part of what Google has done with Gemini's 1-2M token context window is demonstrate that the model will actually recall and use that data. Disclosure: I work at Google, but not on this; I don't have any inside info on the model.
Memory. I don't know the equation, but it's very easy to see when you load a 128K-context model at 8K vs 80K. The quant I am running would double its VRAM requirements when loaded at 80K.
> The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations.
FFWD input is the self-attention output. And since the output of the self-attention layer is [context, d_model], the FFWD layer input will grow as well. Consequently, the FFWD layer compute cost will grow as well, no?
The cost of the FFWD layer, according to my calculations, is ~(4 + 2·[w3 present]) * d_model * d_ff * n_layers * context_size, so the FFWD cost grows linearly with context size.
So, unless I've misunderstood the transformer architecture, the larger the context, the larger the compute of both self-attention and FFWD?
So you're saying that if I have a sentence of 10 words and I want the LLM to predict the 11th word, the FFWD compute is going to be independent of the context size?
I don't understand how, since that very context is what determines whether the next-token prediction is any good or not.
More specifically, the FFWD layer is essentially the self-attention output, a [context, d_model] matrix, matmul'd with the W1, W2 and W3 weights?
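Putting rough numbers on this exchange: yes, FFWD compute grows with context, but only linearly (each token goes through the FFN independently), while the attention-score matmuls are the quadratic part. A per-layer FLOP sketch with assumed Llama-70B-ish shapes (d_model=8192, gated FFN with d_ff=28672, plain multi-head attention for simplicity):

```python
# Rough per-layer FLOP counts vs. context length n: the FFN term is linear
# in n, the attention-score term is quadratic. Shapes are assumptions for
# illustration only.
d_model, d_ff = 8192, 28672

def ffn_flops(n):
    # w1, w3 (gate) and w2 matmuls: (4 + 2) * d_model * d_ff per token
    return 6 * d_model * d_ff * n

def attn_flops(n):
    proj = 8 * d_model**2 * n        # Q, K, V, O projections: linear in n
    scores = 4 * d_model * n**2      # QK^T and (softmax)V: quadratic in n
    return proj + scores

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9}: ffn {ffn_flops(n):.2e}  attn {attn_flops(n):.2e}  "
          f"attn/ffn {attn_flops(n) / ffn_flops(n):.2f}")
# With these shapes the FFN term dominates at short context; the quadratic
# attention term only overtakes it once n is in the tens of thousands.
```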
I may be missing something, but I thought each context token would result in 3 additional parameters for self-attention to build its map, since each attention must calculate a value considering all the existing context.
> you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context
Be aware that this tends to give bad results. Once RAG is involved you essentially only do slightly better than a traditional search, a lot of nuance gets lost.
> Instead you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context.
Isn't that kind of what Anthropic is offering with Projects, where you can upload information and PDF files and such, which are then always available in the chat?
Because they can't do long context windows. That's the only explanation. What you can do with a 1m token context window is quite a substantial improvement, particularly as you said for enterprise usage.
The only reason I open Chat now is because Claude will refuse to answer questions on a variety of topics including for example medication side effects.
When I tested o1 a few hours ago, it seemed like it was losing context. After I asked it to use a specific writing style and pasted a large reference text, it forgot my instruction. I reminded it, and it kept to the rule for a few more messages, then after another long paste it forgot again.
If a $200/month Pro tier is successful, it could open the door to a $2,000/month segment, then a $20,000/month segment will appear, and the stratification of who gets ahead with AI will begin.
Agreed. Where can I read about how to set up an LLM similar to Claude, with at least Claude's context window length, and what are the hardware requirements? I've found Claude incredibly useful.
And now you can get 405b-level quality in a 70b, according to Meta. Costs really come down massively with that. I wonder if it's really as good as they say, though.
Full-blown agents, but they really have to be able to replace a semi-competent human, which is harder than it sounds, especially for edge cases that a human can easily get past.
With o1-preview on the $20 subscription, my queries were typically answered in 10-20 seconds. I've tried the $200 subscription with some queries and got 5-10 minute answer times. Unless the load has increased substantially and I was just waiting in a queue for compute resources, I'd assume they throw a lot more hardware at o1-pro. So it's entirely possible that $200/month is still sold at a loss.
I've been concatenating my source code, ~3300 lines and 123,979 bytes (so likely < 128K context window), into the chat to get better answers. Uploading files is hopeless in the web interface.
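In case it helps anyone doing the same thing, a tiny sketch of that concatenation step with a crude token estimate; the glob pattern and the 4-characters-per-token rule of thumb are assumptions, not a real tokenizer:

```python
# Concatenate source files into one paste-able blob, with a rough token
# estimate to check whether it fits a 128K window.
from pathlib import Path

globs = ["src/**/*.py"]            # adjust to your project layout
parts = []
for pattern in globs:
    for path in sorted(Path(".").glob(pattern)):
        parts.append(f"\n===== {path} =====\n{path.read_text(errors='replace')}")

blob = "".join(parts)
est_tokens = len(blob) // 4        # crude rule of thumb, not a real tokenizer
print(f"{len(blob)} bytes, ~{est_tokens} tokens")
Path("context_dump.txt").write_text(blob)
```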
Have you considered RAG instead of using the entire document? It's more complex but would at least allow you to query the document with your API of choice.
When talking about context windows I'm surprised no one mentions https://poe.com/.
Switched over from ChatGPT about a year ago, and it's amazing. You can use all the models, with their full context windows, for the same price as a ChatGPT subscription.
Poe.com goes straight to a login page and doesn't want to divulge ANY information before I sign up. No About Us, no product description, no pricing - nothing. Strange behavior, but I'm seeing it more and more with modern websites.
What don’t you like about Claude? I believe the context is larger.
Coincidentally, I've been using it with XML files recently (iOS storyboard files), and it seems to do pretty well manipulating and refactoring elements as I interact with it.