I'm a Plus member, and the biggest limitation I'm running into by far is the maximum length of the context window. I keep having context fall out of scope partway through a conversation, or I can't give it a large document that I can then interrogate.
So if I go from paying $20/month for 32,000 tokens, to $200/month for Pro, I expect something more akin to Enterprise's 128,000 tokens or MORE. But they don't even discuss the context window AT ALL.
For anyone else out there looking to build a competitor, I STRONGLY recommend you consider the context window as a major differentiator. Let me give you an example of a use case that ChatGPT simply cannot handle well today: dump an XML file into it, then ask it questions about that file. You can attach files to ChatGPT, but it's basically pointless because it isn't able to view the entire file at once due to, again, the limited context window.
Seems like something that would be worth pinging OpenAI about because it's a pretty important claim that they are making on their pricing page! Unless it's a matter of counting tokens differently.
According to the pricing page, 32K context is for Plus users and 128K context is for Pro users. Not disagreeing with you, just adding context for readers: while you're explaining that the 4o API has a 128K window, the 4o model in ChatGPT appears to have a varying context window depending on account type.
The longer the context the more backtracking it needs to do. It gets exponentially more expensive. You can increase it a little, but not enough to solve the problem.
Instead you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context.
An LLM is a cool tool, but you need to build around it. OpenAI should start shipping these other components so people can build their own solutions, and make their money selling shovels.
Instead they want end users to pay them to use the LLM without any custom tooling around it. I don't think that's a winning strategy.
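A minimal sketch of that chunk / embed / retrieve pattern, assuming the sentence-transformers package and a plain in-memory numpy index as stand-ins for whatever embedding model and vector database you actually pick (the file name and query are made up):

```python
# Rough RAG sketch: chunk a document, embed the chunks, retrieve only the
# most relevant ones, and put just those into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def chunk(text, size=500, overlap=50):
    # Naive fixed-size character chunks with a bit of overlap.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")

document = open("big_document.xml").read()          # hypothetical input file
chunks = chunk(document)
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=5):
    # Cosine similarity is just a dot product on normalized vectors.
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(chunk_vecs @ q))[:k]
    return [chunks[i] for i in top]

hits = retrieve("What does the <config> element control?")
prompt = ("Answer using only this context:\n" + "\n---\n".join(hits)
          + "\n\nQuestion: What does the <config> element control?")
```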
Transformer architectures generally take quadratic time wrt sequence length, not exponential. Architectural innovations like flash attention also mitigate this somewhat.
Backtracking isn't involved, transformers are feedforward.
No, additional context does not cause exponential slowdowns, and you absolutely can use FlashAttention tricks during training; I'm doing it right now. Transformers are not RNNs: they are not unrolled across timesteps, and the backpropagation path for a 1,000,000-context LLM is not any longer than that of a 100-context LLM of the same size. The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations. These calculations can be further parallelized using tricks like ring attention to distribute very large attention calculations over many nodes. This is how Google trained their 10M-context version of Gemini.
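For intuition on where the quadratic term comes from, here's naive single-head attention in numpy; the (n, n) score matrix is the part that fused kernels like FlashAttention compute in tiles without ever materializing (this is only an illustration of the scaling, not how those kernels are actually written):

```python
# Naive self-attention for one head: the (n, n) score matrix is the
# quadratic part. Fused kernels compute the same result tile by tile,
# so memory stays roughly linear in n.
import numpy as np

def naive_attention(q, k, v):
    # q, k, v: (n, d) for a single head
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)            # (n, n)  <- O(n^2) compute and memory
    scores -= scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)       # softmax over keys
    return p @ v                             # (n, d)

n, d = 4096, 128
q = np.random.randn(n, d).astype(np.float32)
k = np.random.randn(n, d).astype(np.float32)
v = np.random.randn(n, d).astype(np.float32)
out = naive_attention(q, k, v)
# The score matrix alone is n*n*4 bytes: 4096^2 * 4 ≈ 64 MB per head per layer,
# and it grows 100x if you go from 4k to 40k context.
```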
So why are the context windows so "small", then? It would seem that if the cost was not so great, then having a larger context window would give an advantage over the competition.
The cost for both training and inference is vaguely quadratic, while, for the vast majority of users, the marginal utility of additional context is sharply diminishing. For 99% of ChatGPT users, something like 8192 tokens, or about 20 pages of context, would be plenty. Companies have to balance the cost of training and serving models. Google did train an uber-long-context version of Gemini, but since Gemini itself fundamentally was not better than GPT-4 or Claude, this didn't really matter much: so few people actually benefited from such a niche advantage that it didn't shift the playing field in their favor.
Marginal utility only drops because effective context is really bad, i.e. most models still vastly prefer the first things they see and those "needle in a haystack" tests are misleading in that they convince people that LLMs do a good job of handling their whole context when they just don't.
If we have the effective context window equal to the claimed context window, well, I'd start worrying a bit about most of the risks that AI doomers talk about...
There has been a huge increase in context windows recently.
I think the larger problem is "effective context" and training data.
Being technically able to use a large context window doesn't mean a model can actually remember or attend to that larger context well. In my experience, the kinds of synthetic "needle in haystack" tasks that AI companies use to show how large a context their model can handle (see the sketch after this comment) don't translate very well to more complicated use cases.
You can create data with large context for training by synthetically adding in random stuff, but there's not a ton of organic training data where something meaningfully depends on something 100,000 tokens back.
Also, even if it's not scaling exponentially, it's still scaling: at what point is RAG going to be more effective than just having a large context?
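For anyone who hasn't seen one, a synthetic needle-in-a-haystack probe is roughly this simple, which is part of why passing it says little about harder long-context use; `ask_model` is a placeholder for whatever chat API you're testing:

```python
# Sketch of a needle-in-a-haystack probe: bury one fact at a chosen depth
# in filler text and check whether the model can retrieve it. Passing this
# is a much weaker claim than "the model uses its whole context well".
FILLER = "The sky was grey and the coffee was lukewarm. "  # arbitrary padding
NEEDLE = "The magic number for project Falcon is 7421."

def build_prompt(total_chars=200_000, depth=0.5):
    haystack = FILLER * (total_chars // len(FILLER))
    pos = int(len(haystack) * depth)
    doc = haystack[:pos] + NEEDLE + " " + haystack[pos:]
    return doc + "\n\nQuestion: What is the magic number for project Falcon?"

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call whatever chat API you're testing here")

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    answer = ask_model(build_prompt(depth=depth))
    print(depth, "7421" in answer)
```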
Great point about the meaningful datasets, this makes perfect sense, especially with regard to SFT and RLHF. Although I suppose it would be somewhat easier to do pretraining on really long contexts (books, I assume?).
Because you have to do inference distributed across multiple nodes at that point: for prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you would need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's $1440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only about 700 dollars per million tokens. There are architectural modifications that will bring that down significantly, but it's still really, really expensive, and also quite hard to get to work.
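A back-of-envelope version of that KV-cache math, assuming the published Llama 3.1 405B shape (126 layers, 8 KV heads via GQA, head dim 128) and the serving numbers quoted above:

```python
# KV-cache size and serving cost for a 405B model at 10M context.
# Assumed config (Llama 3.1 405B): 126 layers, 8 KV heads (GQA),
# head_dim 128; bf16 = 2 bytes per value.
n_layers, n_kv_heads, head_dim = 126, 8, 128
bytes_per_val = 2                      # bf16
context = 10_000_000

kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
kv_total_tb = kv_per_token * context / 1e12
print(f"KV cache: {kv_total_tb:.1f} TB")                       # ~5.2 TB

h200_mem_gb = 141
print(f"H200s just for KV: {kv_total_tb * 1000 / h200_mem_gb:.0f}")  # ~36-37

# Serving cost at the numbers quoted above: 48 GPUs at $3/GPU-hour,
# ~100k generated tokens per hour for a single user (no batching).
gpus, dollars_per_gpu_hour, tokens_per_hour = 48, 3, 100_000
cost_per_million = gpus * dollars_per_gpu_hour * (1_000_000 / tokens_per_hour)
print(f"${cost_per_million:.0f} per million tokens")            # $1440
```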
Another factor in context window size is effective recall. If the model can't actually use a fact from 1M tokens earlier, accurately and precisely, then there's no benefit, and it's harmful to the user experience to allow the use of a poorly functioning feature. Part of what Google has done with Gemini's 1-2M token context window is demonstrate that the model will actually recall and use that data. Disclosure: I work at Google, but not on this; I don't have any inside info on the model.
Memory. I don't know the equation, but it's very easy to see when you load a 128K-context model at 8K vs 80K. The quant I am running would double its VRAM requirements when loaded at 80K.
> The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations.
FFWD input is the self-attention output. And since the output of the self-attention layer is [context, d_model], the FFWD layer input will grow as well. Consequently, the FFWD layer compute cost will grow as well, no?
The cost of the FFWD layer, according to my calculations, is ~(4 + 2·[w3 present]) * d_model * d_ff * n_layers * context_size, so the FFWD cost grows linearly with context size.
So, unless I've misunderstood the transformer architecture, the larger the context, the larger the compute of both self-attention and FFWD?
So you're saying that if I have a sentence of 10 words and I want the LLM to predict the 11th word, the FFWD compute is going to be independent of the context size?
I don't understand how, since that very context is what determines whether the next-token prediction is any good or not.
More specifically, the FFWD layer is essentially the self-attention output, a [context, d_model] matrix, matmul'd with the W1, W2 and W3 weights?
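Putting rough numbers on this exchange: yes, FFWD compute grows with context, but only linearly (each token goes through the FFN independently), while the attention-score matmuls are the quadratic part. A per-layer FLOP sketch with assumed Llama-70B-ish shapes (d_model=8192, gated FFN with d_ff=28672, plain multi-head attention for simplicity):

```python
# Rough per-layer FLOP counts vs. context length n: the FFN term is linear
# in n, the attention-score term is quadratic. Shapes are assumptions for
# illustration only.
d_model, d_ff = 8192, 28672

def ffn_flops(n):
    # w1, w3 (gate) and w2 matmuls: (4 + 2) * d_model * d_ff per token
    return 6 * d_model * d_ff * n

def attn_flops(n):
    proj = 8 * d_model**2 * n        # Q, K, V, O projections: linear in n
    scores = 4 * d_model * n**2      # QK^T and (softmax)V: quadratic in n
    return proj + scores

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9}: ffn {ffn_flops(n):.2e}  attn {attn_flops(n):.2e}  "
          f"attn/ffn {attn_flops(n) / ffn_flops(n):.2f}")
# With these shapes the FFN term dominates at short context; the quadratic
# attention term only overtakes it once n is in the tens of thousands.
```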
I may be missing something, but I thought each context token would result in 3 additional parameters for self-attention to build its map, since each attention must calculate a value considering all the existing context.
> you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context
Be aware that this tends to give bad results. Once RAG is involved you essentially only do slightly better than a traditional search, a lot of nuance gets lost.
> Instead you need to chunk your data and store it in a vector database so you can do semantic search and include only the bits that are most relevant in the context.
Isn't that kind of what Anthropic is offering with Projects, where you can upload information and PDF files and such, which are then always available in the chat?
Because they can't do long context windows. That's the only explanation. What you can do with a 1m token context window is quite a substantial improvement, particularly as you said for enterprise usage.
The only reason I open Chat now is because Claude will refuse to answer questions on a variety of topics including for example medication side effects.
When I tested o1 a few hours ago, it seemed like it was losing context. After I asked it to use a specific writing style and pasted a large reference text, it forgot my instruction. I reminded it, and it kept to the rule for a few more messages, then after another long paste it forgot again.
If a $200/month Pro tier is successful, it could open the door to a $2,000/month segment, then a $20,000/month segment will appear, and the stratification of who gets ahead with AI will begin.
Agreed. Where can I read about how to set up an LLM similar to Claude, with at least Claude's context window length, and what are the hardware requirements? I've found Claude incredibly useful.
And now you can get 405b-level quality in a 70b, according to Meta. Costs really come down massively with that. I wonder if it's really as good as they say, though.
Full-blown agents, but they really have to be able to replace a semi-competent human, which is harder than it sounds, especially for edge cases that a human can easily get past.
With o1-preview on the $20 subscription, my queries were typically answered in 10-20 seconds. I've tried the $200 subscription with some queries and got 5-10 minute answer times. Unless the load has increased substantially and I was just waiting in a queue for compute resources, I'd assume they throw a lot more hardware at o1-pro. So it's entirely possible that $200/month is still sold at a loss.
I've been concatenating my source code, ~3300 lines and 123,979 bytes (so likely < 128K context window), into the chat to get better answers. Uploading files is hopeless in the web interface.
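In case it helps anyone doing the same thing, a tiny sketch of that concatenation step with a crude token estimate; the glob pattern and the 4-characters-per-token rule of thumb are assumptions, not a real tokenizer:

```python
# Concatenate source files into one paste-able blob, with a rough token
# estimate to check whether it fits a 128K window.
from pathlib import Path

globs = ["src/**/*.py"]            # adjust to your project layout
parts = []
for pattern in globs:
    for path in sorted(Path(".").glob(pattern)):
        parts.append(f"\n===== {path} =====\n{path.read_text(errors='replace')}")

blob = "".join(parts)
est_tokens = len(blob) // 4        # crude rule of thumb, not a real tokenizer
print(f"{len(blob)} bytes, ~{est_tokens} tokens")
Path("context_dump.txt").write_text(blob)
```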
Have you considered RAG instead of using the entire document? It's more complex but would at least allow you to query the document with your API of choice.
When talking about context windows I'm surprised no one mentions https://poe.com/.
Switched over from ChatGPT about a year ago, and it's amazing. You can use all the models, with their full context windows, for the same price as a ChatGPT subscription.
Poe.com goes straight to a login page and doesn't want to divulge ANY information before I sign up. No About Us, no product description, no pricing - nothing. Strange behavior, but I'm seeing it more and more with modern websites.
What don’t you like about Claude? I believe the context is larger.
Coincidentally, I've been using it with XML files recently (iOS storyboard files), and it seems to do pretty well manipulating and refactoring elements as I interact with it.