Representative take:
If you ask Stable Diffusion for a picture of a cat, it always seems to produce images of healthy-looking domestic cats. For the prompt “cat” to be unbiased, Stable Diffusion would need to occasionally generate images of dead white tigers, since this would also fit under the label of “cat”.
Ah, but have you considered how much worse they could be if they weren’t prompted with “high quality, masterpiece, best”?
all of my generative AI results have been disappointing because I didn’t give it the confidence it needed to succeed
One of the reasons I dislike this technology so much is that some of the ridiculous tricks actually (sometimes, sort of) work. But they don’t work for the reasons the interface invites the user to think they do, and they don’t work reproducibly or consistently, so the line between “getting large neural networks to behave requires strange tricks” and pure cargo-cult thinking is blurred.
I have no idea what exactly went into the training sets of Midjourney (or DALL-E), except that it’s probably safe to assume it’s a set of (image, text) pairs, like the open source image generators. The easy things to put in the text component are the caption, any accessibility alt-text the image might have, and whatever a computer vision system decides to classify the image as. When the scrapers appropriate images from artists’ forums, personal webpages and social media accounts, they could then also scrape any comments present, process them, and put some of them into the text component as well. So it’s entirely possible that 1. some of the images the generator saw during training had “masterpiece”, “great work” etc. in the text component, and 2. there is a statistically significant correlation between those words being present in the text and the image being something people like looking at.

So, when the generator is trying to pull images out of Gaussian noise, it’ll be trying to spot patterns that match “masterpiece-ness” if prompted with “masterpiece”. Clearly this doesn’t work consistently: e.g. if the generator has never seen a masterpiece-tagged painting of a snake, it’s not at all obvious that its model of “masterpiece-ness” can be applied to snakes at all. Neural networks infamously tend to learn shortcuts rather than what their builders want them to learn.
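To make the speculation above concrete, here is a toy sketch of how such a scraper might assemble the text component of an (image, text) pair. Everything here is hypothetical — the function name, the praise-keyword filter, and the comma-joined format are illustrative assumptions, not how any real pipeline (Midjourney, DALL-E, or LAION-style open datasets) actually works:

```python
# Hypothetical sketch: assembling the text component of an (image, text)
# training pair from scraped sources. All names and formats are invented
# for illustration; no real scraper's API is being described.

PRAISE_KEYWORDS = ("masterpiece", "great work", "best")  # assumed filter list

def build_text_component(caption, alt_text=None, classifier_labels=(), comments=()):
    """Combine the easy-to-scrape text sources into one conditioning string."""
    parts = [caption]
    if alt_text:
        parts.append(alt_text)
    # Labels from a computer vision classifier, e.g. ["cat", "painting"].
    parts.extend(classifier_labels)
    # Scraped comments that contain praise keywords end up in the text too;
    # this is how words like "masterpiece" could become prompt-able.
    praise = [c for c in comments if any(w in c.lower() for w in PRAISE_KEYWORDS)]
    parts.extend(praise)
    return ", ".join(parts)

pair = (
    "cat_painting_0042.png",  # image half of the pair
    build_text_component(
        caption="oil painting of a cat",
        alt_text="a tabby cat on a windowsill",
        classifier_labels=["cat", "painting"],
        comments=["Great work!", "I love the colours", "An absolute masterpiece"],
    ),
)
print(pair[1])
```

If something like this happened at scale, the word “masterpiece” would co-occur in the text component with images people praised, which is all a statistical model needs for the prompt trick to (sometimes, sort of) work.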
Even then, most of it still looks like the result of a mugging in the Uncanny Alley. There’s almost always something “off” about it, even when it is technically impressive. Details that make no sense, weird lighting, shadows and textures, and a feeling of “eeriness” that I’d probably have the vocabulary to describe if I were a visual artist.
(PS: Does the idea of using well-intentioned accessibility features and kind words to artists to create a machine intended to destroy their livelihood make you feel a bit iffy? Congratulations, you are probably not a sociopath.)