Did you generate a bunch all at once before starting to get some idea of what the natural variance looks like? I would think it's important to verify some level of progression over time, because with the current four it seems entirely possible that the examples could have all been generated at the same time with no changes to the model.
Then it's using the default value, temperature=1.0, which is by no means deterministic (not that temperature=0.0 is either, but it's far more likely to give similar responses to similar prompts than 1.0 is).
GPT's output is by default somewhat random. If you ask the exact same question several times, you'll potentially get several different answers. Each successive word in the output is chosen from a distribution of possibilities -- that distribution is fixed, but the actual sample chosen from the distribution is not. See, e.g., https://platform.openai.com/docs/api-reference/completions/c...
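To make that concrete, here's a toy sketch (not GPT itself -- the vocabulary and probabilities are made up) of "fixed distribution, non-fixed sample":

```python
import random

# Toy illustration: for a given context, the model yields a fixed
# probability distribution over next tokens, but each call draws an
# independent sample from it.
vocab = ["cat", "dog", "fish"]
probs = [0.6, 0.3, 0.1]  # fixed distribution for this context

def next_token(rng):
    # The distribution is fixed; the draw is not.
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random()  # unseeded, so successive runs can differ
samples = [next_token(rng) for _ in range(10)]
print(samples)
```

Run it twice and you'll usually get two different lists, even though `probs` never changes.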
When using the OpenAI endpoints you can set temperature, top_p, etc. That lets you tone down the randomness. Passing a temperature of 0 means the same input will almost always produce the same output (effectively greedy decoding, though not strictly deterministic in practice).
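A minimal sketch of what temperature actually does (made-up logits, standard softmax scaling -- this is the usual formulation, not OpenAI's internal code): logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the argmax and high temperatures flatten the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax: low temperature
    # sharpens the distribution toward the argmax, high flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three tokens
print(softmax_with_temperature(logits, 1.0))  # default: mass spread out
print(softmax_with_temperature(logits, 0.1))  # near-greedy: ~all mass on argmax
```

As temperature approaches 0 the top token gets essentially all the probability, which is why temperature=0 behaves almost (but not perfectly) deterministically.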
So far as I can see, the only difference is that the GitHub repo uses the API; the ones that I have were rendered using the web chat UI. Which makes me wonder if they're using the bleeding edge model for the chat.
I haven't yet gained access to the enhanced chat features with image outputs. I'm using the API with default parameters and the gpt-4-0314 model, outputting SVG.
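For anyone curious, a request like that might look roughly as follows with the 2023-era `openai` Python library -- the model name is from the comment above, but the prompt text here is entirely hypothetical:

```python
# Sketch only: requires `pip install openai` and an API key to actually run.
# The prompt is a made-up example, not the author's actual prompt.
payload = {
    "model": "gpt-4-0314",
    "messages": [{"role": "user", "content": "Draw a cat as an SVG document."}],
    # temperature / top_p omitted -> API defaults apply (temperature=1.0)
}
# import openai
# resp = openai.ChatCompletion.create(**payload)
# svg = resp["choices"][0]["message"]["content"]
print(payload["model"])
```

Leaving temperature unset is what ties back to the earlier point: the defaults are sampled, not deterministic.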
It's not a perfect experiment, but we'll see how it gets on over time.