I just got it to install git and clone the (non-existent) repo https://github.com/openai/assistant, and am now browsing its own interpretation of a repo with a lot of Python code, including directories like “training”, “output”, “parsing”, with files with content like this:
import json
from collections import Counter
from typing import Any, Dict, List, Optional, Tuple
import numpy as np
from openai_secret_manager import get_secrets
from assistant.constants import MAX_OUTPUT_LENGTH
from assistant.utils.string_utils import strip_html
from assistant.utils.text_utils import split_text_into_lines
class Output:
    def __init__(
        self,
        generated_text: str,
        response: Optional[Dict[str, Any]] = None,
        score: Optional[float] = None,
    ):
        self.generated_text = generated_text
        self.response = response or {}
        self.score = score
On a side note it feels like each command takes longer to process than the previous - almost like it is re-doing everything for each command (and that is how it keeps state).
>On a side note it feels like each command takes longer to process than the previous - almost like it is re-doing everything for each command (and that is how it keeps state).
That's because it's probably redoing everything.
But that's probably to keep the implementation simple. They are probably just appending the new input and re-running the whole network.
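A minimal sketch of that stateless approach (all names here are hypothetical; `run_model` stands in for the full forward pass):

```python
# Hypothetical sketch of keeping "state" by re-running the whole model on the
# growing transcript each turn; run_model stands in for a full forward pass.
def run_model(transcript: str) -> str:
    # a real model would attend over every token of the transcript here
    return f"(reply to {len(transcript)} chars of context)"

transcript = ""
for user_input in ["ls", "cat output.py", "git log"]:
    transcript += f"\n$ {user_input}"
    transcript += f"\n{run_model(transcript)}"  # cost grows with transcript length
```

Each turn re-feeds the whole transcript, which would explain why every command takes longer than the previous one.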
The typical data-dependency structure in a transformer architecture is the following:

output_t0   output_t1   output_t2   output_t3  | output_t4
feat_L4_t0  feat_L4_t1  feat_L4_t2  feat_L4_t3 | feat_L4_t4
feat_L3_t0  feat_L3_t1  feat_L3_t2  feat_L3_t3 | feat_L3_t4
feat_L2_t0  feat_L2_t1  feat_L2_t2  feat_L2_t3 | feat_L2_t4
feat_L1_t0  feat_L1_t1  feat_L1_t2  feat_L1_t3 | feat_L1_t4
input_t0    input_t1    input_t2    input_t3   | input_t4
The features of layer Li at time tj depend only on the features of layer L(i-1) at times t <= tj.
If you append some new input at the next time t4 and recompute everything from scratch, it doesn't change any feature values for times < t4.
To compute the features and output at time t4 you need all the values of the previous times for all layers.
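That property is easy to check with a toy causal self-attention layer (illustrative: a single head, no learned projections): recomputing from scratch with one extra token appended reproduces the earlier features exactly.

```python
import numpy as np

# Toy causal self-attention: features at time t depend only on inputs at times <= t.
def causal_attention(x):
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))   # causal mask: no peeking ahead
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x4 = rng.normal(size=(4, 8))                      # inputs t0..t3
x5 = np.vstack([x4, rng.normal(size=(1, 8))])     # append a new input at t4

out4 = causal_attention(x4)
out5 = causal_attention(x5)
# recomputing from scratch with the extra token leaves t0..t3 unchanged
assert np.allclose(out4, out5[:4])
```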
The alternative to recomputing would be preserving the previously generated features and incrementally building the last chunk by stitching it to the previous features. If you have your AI assistant running locally, that's something you can do; but when you are serving plenty of different sessions, you will quickly run out of memory.
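A sketch of that alternative with a toy per-layer cache (names and shapes are illustrative, not anyone's actual implementation): each new timestep is stitched onto the preserved features, and the cache grows with the sequence length.

```python
import numpy as np

# Toy single-head attention: weighted sum of the cached features plus the new one.
def attend(query, keys):
    scores = keys @ query / np.sqrt(query.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ keys

def step(new_input, cache, n_layers=3):
    # cache[layer] is a (t, d) array of that layer's features for past timesteps
    h = new_input
    for layer in range(n_layers):
        past = np.vstack([cache[layer], h[None, :]])  # stitch new onto preserved
        h = attend(h, past)                           # feature for the new timestep
        cache[layer] = past                           # memory grows with sequence length
    return h, cache

d = 8
cache = {layer: np.empty((0, d)) for layer in range(3)}
rng = np.random.default_rng(1)
for t in range(5):
    h, cache = step(rng.normal(size=d), cache)
```

Each session must keep this cache alive between turns, which is where the memory pressure comes from when serving many sessions at once.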
With simple transformers, the time horizon used to be limited because the attention scaled quadratically in compute, but they are probably using an attention that scales in O(n*log(n)), something like the Reformer, which allows them to handle very long sequences cheaply and probably explains the boost in performance compared to previous GPTs.
GPT-3 cannot run on a hobbyist-level GPU yet. That's the difference (compared to Stable Diffusion, which could run on a 2070 even with a not-so-carefully-written PyTorch implementation), and the reason why I believe that while ChatGPT is awesome and made more people aware of what LLMs can do today, this is not a moment like the one that happened with diffusion models.
What makes you say this? Rerunning the whole thing, which it appears they're doing, avoids the need to hold onto state, so memory is not used. In other words, they're not having this problem because they're not doing it that way.
If you generate only a single timestep, you can compute layer by layer when recomputing during inference; you don't need to preserve the features of the previous layers, as each layer depends only on the layer immediately below. So your memory needs don't depend on the number of layers.
But in a standard transformer architecture you typically generate multiple timesteps, feeding each output sequentially as the input to the next timestep, so you need to preserve all the features to avoid recomputing them at each timestep. So your memory again depends on the number of layers in your network.
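Back-of-the-envelope arithmetic shows why that adds up quickly across sessions (the numbers below are GPT-3-scale guesses, not known values):

```python
# Rough, illustrative arithmetic: caching per-layer key/value features for every
# past token costs roughly
#   2 (K and V) * n_layers * seq_len * d_model * bytes_per_value
n_layers, d_model, seq_len = 96, 12288, 4096   # GPT-3-scale guesses
bytes_fp16 = 2
cache_bytes = 2 * n_layers * seq_len * d_model * bytes_fp16
print(cache_bytes / 2**30, "GiB per session")  # prints: 18.0 GiB per session
```

On the order of tens of GiB per active session makes it clear why a provider serving many concurrent users might prefer to recompute instead.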
But if you are memory-constrained, you can modify the architecture a little (and the training procedure) to put yourself back in the first situation, where you only generate a single timestep: use the transformer to extract a fixed-size context vector per layer from all the past (including your most recent input prompt), then use another transformer to generate the words in sequence based on this context vector.
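A toy sketch of that idea (everything here is illustrative): mean-pooling stands in for whatever learned compression the encoder would use, and the point is only that what gets handed to the decoder has a fixed size regardless of the length of the history.

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
W = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in for a transformer layer's weights

def encode_context(history, n_layers=3):
    # history: (T, d) features of the whole past, including the latest prompt
    h = history
    context = []
    for _ in range(n_layers):
        h = np.tanh(h @ W)               # stand-in for one transformer layer
        context.append(h.mean(axis=0))   # fixed-size summary vector per layer
    return context                       # n_layers vectors of size d

short = encode_context(rng.normal(size=(10, d)))
long_ = encode_context(rng.normal(size=(1000, d)))
# the decoder's conditioning has the same size regardless of history length
assert all(v.shape == (d,) for v in short + long_)
```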
In my experience, you can get it to change its mind by troubleshooting the connectivity issues. E.g. if you use dig to get the IP and then tell curl to use that IP instead of doing a DNS lookup, it works for me.
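Concretely, something like this (using curl's real `--resolve` flag to pin the hostname to the IP that dig returned; the hostname is just an example):

```shell
# Resolve the IP ourselves, then tell curl to use it directly
# instead of doing its own DNS lookup inside the simulated environment.
ip=$(dig +short icanhazip.com | head -n1)
curl --resolve "icanhazip.com:80:$ip" http://icanhazip.com
```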
I did `curl icanhazip.com` and it spit out the "local" private IP. I told chatgpt that icanhazip would never do that, and it revised the answer to 37.48.80.166, which is an IP owned by LeaseWeb.
OK, fair enough! But it would be interesting to add a link to the real Internet in the next release. Sadly, the model’s global state is not immediately updated, there are snapshots… but I think it would be interesting to watch it conversing in real time here on Hacker News.
Why do you think this? I don't think there's any reason it would be able to reproduce its own code. It's never seen it so it's not in the weights, and it doesn't have that type of reflection so it can't look it up dynamically.
ChatGPT output:
"I am not sure which specific programming languages or libraries were used to train my language model, as I do not have access to that information. Language models are typically trained using a combination of various programming languages and tools, and the specific technologies that are used can vary depending on the specific model and the research team that developed it. I am a large language model trained by OpenAI, and I use artificial intelligence (AI) and natural language processing (NLP) techniques to generate responses to text-based queries."