
>according to its learned concept of "better".

I'm gonna assume that all the input images fed to the process form its "learned concept of better" (LCoB), so that in effect there is nothing random about the outputs. Indeed, each successive output becomes an optimization of the model fit to the LCoB. Following that, it may start with an initial output that strongly resembles random noise, but inside that first output will be a non-random component that initially fits some part of the LCoB model it is trying to achieve.

Also following that, the lower limit of the process is the output of the first step, since each successive iteration is an optimization towards a model. You already have the model noise floor when you run the first step. Struggling to take that lower is not smart and could be accomplished simply by excluding some part of the collection of images used to form its LCoB.

Does that make sense?

There is nothing random about any of the outputs of this if it is model-driven. Stable Diffusion output is model-driven; therefore it is not random at any stage.

In geophysical processing we have to carefully monitor outputs from each process to make sure that processing artifacts from mathematical operations cannot create remnant waveforms that can be mistaken for geological data and used as a basis for drilling an expensive well. Data-creata is a real thing. Models are used.

Thanks for the discussion.



So the simplified view of a diffusion model is like the following (this leaves out the role of the prompt):

-Sample random noise.

-Ask the neural network "Here is a noisy image. What do you think it looked like before I added all this noise?" (note that you did not form this image by adding random noise to an initial image)

-Adjust the pixels of your random noise towards what the network said.

-Repeat until there is no noise left.
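The loop above can be sketched in a few lines of Python. To be clear, this is a toy, not the real sampler: `predict_clean`, the 4×4 `TARGET`, the step size, and the step count are all invented here so the structure of the loop is visible. A real diffusion model replaces `predict_clean` with a trained neural network and uses a proper noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed: same noise in, same image out

# Hypothetical stand-in for the trained network: instead of learned weights,
# it just returns a fixed "clean" target, so only the loop structure matters.
TARGET = np.full((4, 4), 0.5)

def predict_clean(noisy):
    # A real denoiser answers: "what did this look like before the noise?"
    return TARGET

def sample(steps=50):
    x = rng.standard_normal((4, 4))   # step 1: sample pure random noise
    for _ in range(steps):
        x0_hat = predict_clean(x)     # step 2: ask for the denoised guess
        x = x + 0.2 * (x0_hat - x)    # step 3: nudge pixels toward the guess
    return x                          # step 4: repeat until no noise is left

out = sample()
```

After enough iterations the initial noise has been pulled almost entirely onto the denoiser's answer, which is the sense in which "the noise is gone."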

During training, we take a training image and add noise to it. This way we know the "correct" (in scare quotes because there are many possible clean images that lead to the same noisy image given different realizations of the noise) answer to the question in step 2. This is used to update the weights of the neural network.
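A minimal sketch of that training step, under made-up assumptions: a single "clean image", a denoiser reduced to one scalar weight `w`, and plain gradient descent on the mean-squared error against the known clean answer. Nothing here resembles the real architecture; it only illustrates the corrupt-then-predict idea.

```python
import numpy as np

rng = np.random.default_rng(1)

clean = np.ones((8, 8))   # toy "training image" whose clean version we know
w = 0.0                   # the whole denoiser: predicts clean as w * noisy

for step in range(200):
    noisy = clean + 0.1 * rng.standard_normal(clean.shape)  # add known noise
    pred = w * noisy                                        # denoiser's guess
    grad = 2 * np.mean((pred - clean) * noisy)              # d(MSE)/dw
    w -= 0.1 * grad                                         # update the "weights"
```

Because we added the noise ourselves, the "correct" answer is available, and `w` converges to the value that best undoes that corruption on average.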

Ultimately, a diffusion model is just a denoiser. A denoiser implicitly represents an underlying distribution of clean data. The diffusion process used to sample from the diffusion model is a clever way of drawing samples from that underlying distribution given access only to the denoiser.

At sampling time, we have no training image that we add noise to. We just sample random noise out of thin air. This works because in the limit of large amounts of noise, the distribution of "initial image plus lots of noise" and "just lots of noise" are the same. You can certainly draw an analogy between this initial random noise and the uncarved block of marble that the sculptor says "contains" a sculpture waiting to be uncovered. Given the same noise, the neural network is deterministic - it will always produce the same output.
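That last point is easy to demonstrate with a toy stand-in for the whole pipeline: once the seed is fixed, everything downstream is deterministic arithmetic, so the same seed always yields bit-for-bit the same output and a different seed yields a different sample. (The sampler below is invented for illustration; it just shrinks the noise toward zero.)

```python
import numpy as np

def toy_sampler(seed):
    # The seed is the only source of randomness; the rest is deterministic.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((4, 4))
    for _ in range(10):
        x = x + 0.3 * (0.0 - x)    # deterministic update step
    return x

a = toy_sampler(seed=42)
b = toy_sampler(seed=42)   # same seed  -> identical output
c = toy_sampler(seed=7)    # new seed   -> different sample
```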

You could even imagine an oracle who can unwind the process of the neural network being cranked and tell us exactly what initial noise sample would produce any desired output. Just like an artist might just draw the image that they want rather than waiting for the random pixel machine to output it, the oracle could simply set the precise values of Stable Diffusion's noise input to produce a Hollie Mengert work, rather than sampling repeatedly until it found one.


I think that we just described the same process.

>-Sample random noise.

>-Ask ... ...what the network said.

>-Repeat until there is no noise left.

Initially we have noise (random or not doesn't matter) and we are trying to find inside that noise a matching function to an image that we already have: our model, or training image. That image may or may not also contain some residual noise. As you describe in your iterative steps, noise is added to it, decreasing the signal-to-noise ratio, and the neural network compares the output of its initial or updated image to the "correct answer" image and weights the next iteration accordingly so that a better match can be obtained. The function itself is an optimization process designed to minimize the residual noise (a denoiser, like you say) between the most recent image and our target image.

When you say that you begin with a sample of random noise created by your array-populating function (assumed to be random) but that you have no training image to which you are adding noise, that fits the whole process. But it ignores that you are using the random noise image to iteratively produce an output that fits a model image, one created by compiling statistics from multiple images into a target output that is assumed to be near noiseless, or to have a very high signal-to-noise ratio.

If you are trying to use this to "find" Alfred E. Neuman in your output, the process needs to know what Alfred E. Neuman looks like so it can effectively optimize to that result. Each iteration denoises based on the known output in the model created from the images it must ingest in order to build the model. If you only have a few images of Alfred E. Neuman but thousands of Homer Simpson in your model input dataset, then you will have to fight the tendency of the process to converge on Homer Simpson. No matter: you always have a priori information that is used to verify the integrity of the output. Whether the input is random or not is irrelevant, since you are looking at an optimization process that matches, denoises, and weights iteratively until it minimizes an error function in the match and can be said to be an optimum, or good, match.

This is not particularly new or novel or anything else. It is a typical iterative modeling exercise like those that have been used for decades but now you have the compute power to build a near noise-free target model that fits the known data from every source at your disposal.

The user who created the Hollie Mengert styled outputs could not have hit that target without using a model that was designed to create or mimic that type of output. That is why he chose to use her work in his process. He liked it. Then, when he found out that she was not pleased about not being consulted and that she didn't have the rights to some of those images, I think he had a come-to-Jesus moment that ultimately led him to rename it so he could feel better about it. Guilt-tripped him.

Anyway, ethics should be a required part of every computer science curriculum especially when private personal information is involved.

I'm in the oil and gas industry. It sucks sometimes. Fortunately there has been a push to include or require ethics training. Maybe one day it will clean up that industry. I'm only holding my breath though when I pass a refinery.


>The user who created the Hollie Mengert styled outputs could not have hit that target without using a model that was designed to create or mimic that type of output.

This is the part that I'm not so sure about. The value of fine-tuning on Hollie Mengert's work is not so much that it enables the model to create that type of output, it's that it makes it far less likely to create other types of outputs. It narrows down the haystack, but it doesn't create the needle.

Similarly, if I set out to find Alfred E. Neuman, but my training data has no images from MAD magazine due to licensing concerns, will it be possible? It may not be possible to use the prompt "Alfred E. Neuman", but maybe it's possible to use the prompt "A cartoon drawing of a grinning red-headed boy with a gap in his front teeth". Images recognizable as Alfred are likely still in the model's output space, even if they are not so easily found. They are certainly in the output space of the "random pixel machine". It's just a question of how hard they are to find.
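One way to picture "narrowing the haystack": treat the model's reachable styles as a fixed output space with prior probabilities, and fine-tuning as a reweighting of those priors rather than the creation of a new style. Everything in this sketch is made up for illustration: the style names, the prior values, and the `strength` factor are all hypothetical.

```python
import numpy as np

# Toy output space: the "needle" style already exists, it's just rarely drawn.
styles = ["photo", "oil painting", "cartoon", "hollie-mengert-like"]
base_prior = np.array([0.45, 0.35, 0.19, 0.01])

def fine_tune(prior, target_idx, strength=50.0):
    # Fine-tuning modeled as boosting one existing style's weight,
    # then renormalizing -- it adds no new entries to the space.
    w = prior.copy()
    w[target_idx] *= strength
    return w / w.sum()

tuned = fine_tune(base_prior, styles.index("hollie-mengert-like"))
```

Before tuning, the target style was a 1-in-100 draw; after, it dominates. Nothing new entered the output space, the search just got easier.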


Basically, from what I have read about Stable Diffusion, a model can be created to replicate a particular style of output by incorporating images illustrating that style into the model space. Once that is complete, SD can use that model to create new images in that style because it has a huge text-image trained model space to draw on, where features in images are tagged to provide contextual clues. So when the user inputs "A cartoon drawing of a grinning red-headed boy with a gap in his front teeth", the process knows how to parse the requested image parameters. In short, it already understands what a cartoon drawing is from having ingested multiple images tagged as cartoon drawings. It also knows the difference between a grin and a frown or a look of disapproval, can recognize colors, differentiate gender, and handle the other parameters in the text prompt used to build the image.

However, if it has never seen Alfred E. Neuman it is very unlikely to be able to produce an output that resembles him. This is part of the reason for the huge popularity of SD. Not only is it open source, free to use, and very fast (since it limits image size to 512x512, from what I read), but it also allows tuning (training) of existing models with new images, so the user can easily train it to produce images with specific features or characteristics that may not have been in the original training set. You can steer it to produce variations of any type of image that you can train it on.

As the author mentioned in the article, he trained SD to insert his image into the outputs by adding only 30 properly configured images of himself to the model set. It worked great after that. Without those images it did not work, because the model had no context for part of his prompt.

>Yesterday, I used a simple YouTube tutorial and a popular Google Colab notebook to fine-tune Stable Diffusion on 30 cropped 512×512 photos of me. The entire process, start to finish, took about 20 minutes and cost me about $0.40. (You can do it for free but it takes 2-3 times as long, so I paid for a faster Colab Pro GPU.)
>
>The result felt like I opened a door to the multiverse, like remaking that scene from Everything Everywhere All at Once, but with me instead of Michelle Yeoh.

Without knowing anything about him it would not be able to produce images of him. He uses Garfield as an example where it does not do well.

>...it really struggles with Garfield and Danny DeVito. It knows that Garfield’s an orange cartoon cat and Danny DeVito’s general features and body shape, but not well enough to recognizably render either of them.

SD is a model-based image optimization process which uses a very deep (100 languages recognized) training dataset of millions of tagged, open-source images scraped from the internet to produce a relatively light-weight and thus fast image-generation tool. It has to know what something looks like from contextualized a priori data in order to create an image that uses that object, characteristic, or feature in the output.

Thanks for spurring me to look into this. It looks very interesting though I am unlikely to find time to work with it myself.



