Midjourney is by far the most popular Discord server, with 19.5M+ members, and made $200M in revenue in 2023 with zero external investment and only 40 employees.
The problem has nothing to do with commercializing image gen AI and all to do with Emad/Stability having seemingly 0 sensible business plans.
Seriously this seemed to be the plan:
Step 1: Release SD for free
Step 2: ???
Step 3: Profit
The vast majority of users couldn't be bothered to take the steps necessary to get it running locally, so I don't even think the open-sourcing philosophy would have been a serious hurdle to wider commercial adoption.
In my opinion, a paid, easy-to-use, robust UI around Stability's models should have been the number one priority, and they waited far too long to even begin.
There have been a lot of amazing augmentations to the Stable Diffusion models (ControlNet, DreamBooth, etc.) that have popped up, with lots of free research and implementations, because the research community has latched onto the Stability models, and I feel they failed to capitalize on any of it.
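To give a sense of how little glue code those augmentations need, here's a minimal ControlNet sketch using Hugging Face's diffusers library (the model IDs and the pre-computed edge map are illustrative, not anything Stability ships):

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # Community-trained ControlNet that conditions generation on Canny edge maps
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # base Stability model
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # The edge map constrains composition; the prompt controls content/style
    edges = load_image("edges.png")  # placeholder: a pre-computed Canny edge map
    image = pipe("a watercolor city skyline", image=edges).images[0]
    image.save("out.png")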
Leonardo.ai have basically done exactly this and seem to be doing OK.
It's a shame because they're literally just using Stable Diffusion for all their tech, but they built a nicer front end and incorporated ControlNet. No one else has done this.
ControlNet / InstantID etc. are the real killer features of SD and make it way more powerful than Midjourney, but they aren't even available via the Stability API. They just don't seem to care.
InstantID uses a non-commercial licensed model (from insightface) as part of its pipeline so I think that makes it a no-go for being part of Stability's commercial service.
Yes, and MJ has no public API either. Same for Ideogram; I imagine they have at least $10M in the bank, and they aren't even bothering to make an API despite being SoTA in lots of areas.
That’s not true. He was pretty open about the business plan. The plan was to have open foundational models and provide services to governments and corporations that wanted custom models trained on private data, tailored to their specific jurisdictions and problem domains.
Was there any traction on this? I can't imagine government services being early customers. What models would they want? Military, maybe, for simulation or training, but that requires focus, dedicated effort, and a lot of time. My 2c.
I've heard this pitch from a few AI labs. I suspect they will fail; customers just want a model that works, in the shortest amount of time and with the least effort. The vast majority of companies do not have useful fine-tuning data or skills. Consultancy businesses are low-margin and hard to scale.
Here's a Stable Diffusion business idea: sign up all the celebrities and artists who are cool with AI, and provide end users / fans with an AI image-generation interface trained on their exclusive likenesses / artwork (LoRAs).
You know, the old tried and true licensed merchandise model. Everybody gets paid.
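Mechanically this is already cheap to build. A rough sketch of what serving a licensed likeness/style LoRA could look like with the diffusers library (the LoRA repo id here is made up for illustration):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Hypothetical LoRA trained on a consenting artist's work; a service could
    # meter royalties per load or per generation server-side.
    pipe.load_lora_weights("licensed-loras/artist-x-style")  # made-up repo id
    image = pipe("a portrait in the licensed style").images[0]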
I think the following isn't said often enough: there must be a reason why extremely few celebrities and artists are cool with AI, and it can't be something as abstract and bureaucratic as copyright concerns, although those are problematic.
It's just not there yet. GenAI outputs aren't something audiences want to hang on a wall; they're something that evokes a sense of distress. Otherwise everyone would at least be tracing them.
Most people mix up all the different kinds of intellectual property basically all the time[0], so while people say it's about copyright, I (currently) think it's more likely to be a mixture of "moral rights" (the right to be named as the creator of a work) and trademarks (registered or otherwise), and in the case of celebrities, "personality rights": https://en.wikipedia.org/wiki/Personality_rights
> It's just not there yet. GenAI outputs aren't something audiences want to hang on a wall.
People have a wide range of standards. Last summer I attended the We Are Developers event in Berlin, and there were huge posters that I could easily tell were from AI due to the eyes not matching; more recently, I've used (a better version) to convert a photo of a friend's dog into a renaissance oil painting, and it was beyond my skill to find the flaws with it… yet my friend noticed instantly.
Also, even with "real art", Der Kuss (by Klimt) is widely regarded as being good art, beautiful, romantic, etc. — yet to me, the man looks like he has a broken neck, while the woman looks like she's been decapitated at the shoulder then had her head rotated 90° and reattached via her ear.
> Der Kuss (by Klimt) is widely regarded as being good art,
The point is, generative AI images are not widely regarded as good art. They're often seen as passable for some filler use cases and hard to tell apart from human generations, but not "good".
It's not not-there-yet because AI sometimes generates a sixth finger; the thing separating it from Gustav Klimt, Damien Hirst, Yayoi Kusama, or the like is on another level[0]. It could be that genAI leaves in something a human artist would filter out, or that the images are so disorganized they appear to us to encode malice or other negative emotions, or maybe I'm just wrong and it's all about anatomy.
But whatever the reason is, IMO it's far too rarely considered good, and wins over too few supportive celebrities, artists, and audiences, for this to work.
0: I admit I'm not well versed with contemporary art, or art in general for that matter
> The point is, generative AI images are not widely regarded as good art. They're often seen as passable for some filler use cases and hard to tell apart from human generations, but not "good".
> It's not not-there-yet because AI sometimes generates a sixth finger; the thing separating it from Gustav Klimt
My point is: yes AI is different — it's better. (Or, less provocatively: better by my specific standards).
Always? No. But I chose Der Kuss specifically because of the high regard in which it is held, and yet to my eye it messes with anatomy as badly as if he had put six fingers on one of the hands (indeed, my first impression when I look closely at the hand of the man behind the head of the woman is that the fingers are too long and the thumb looks like a finger).
Wait, what? Isn't that missing the point of expressionism? Klimt's Judith I is basically a photo; surely he could draw realistically if he wanted to?
But myriad predecessors such as Vermeer, Rembrandt, Van Gogh, da Vinci, et al. had done enough in realism, and photography was becoming more viable and more prevalent, so artists basically started diversifying? Isn't that what led to the various early-20th-century movements like surrealism (super-real-ism), cubism, etc.?
I don't mean offense, but surely that level of understanding can't be the basis of policy decisions when it comes to moral rights, licensing discussions, "artists should just use AI", and such???
I think you're conflating "good" in the sense of "competent" with "good" in the sense of "ethical" or "legal".
I am asserting here that the AI is (at its best) more competent, not any of the other things.
I suspect that the law will follow the economics, just as it often has done for everything else before — you're communicating with me via a device named after the job that the device made redundant ("computer").
But I said "often" not "always", because the business leaders ignoring the workers they were displacing 200 years ago led to riots, and eventually to the Communist Manifesto. I wouldn't discount this repeating.
--
I've just looked up "Judith I" (I recognise the art, just not the name), and I don't even understand why you're holding this up as an example of "basically a photo".
As for the other artists demonstrating realism: photography made realism redundant despite being initially dismissed as "not real art". Artists were forced to diversify because a small box of chemistry was letting unskilled people do their old job faster, cheaper, and better. Photography only became an art in its own right when people found ways to make it hard, for example by travelling the world and using it to document their travels, or with increasingly complex motion pictures.
I suspect that art fulfils the same role in humans as tails fulfil in peacocks: an expensive signal to demonstrate power, such that the difficulty is the entire point and anything which makes it easy is seen as worse than not even trying. This is also why forgeries are a big deal, instead of being "that's a nice picture", and why an original painting can retain a high price despite (or perhaps because of) a large number of extremely cheap prints being plastered onto everything from dorm rooms to chocolate wrappers.
Why would those celebs pay Stability any significant money for this, given they can get it for a one-off payment of at most a few hundred dollars in salary/opportunity cost by paying an intern to gather the images and feed them into the existing free tools for training a LoRA?
You can already do that with reference images, and even for inpainting. No training required. Also no need to pay actors outrageous sums to use their likeness in perpetuity for as long as you do business. The licensing is still tricky anyway, because even if the face is approved and certified, the entire body and surroundings would also have to be. Otherwise you've basically reinvented the celebrity deepfake porn movement. I don't see any A-lister signing up for that.
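To illustrate the no-training-required point: inpainting with an off-the-shelf model is a few lines with the diffusers library (the file paths here are placeholders):

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    init = load_image("photo.png")  # placeholder source image
    mask = load_image("mask.png")   # placeholder mask: white = region to repaint
    result = pipe(
        prompt="a renaissance oil painting of a dog",
        image=init,
        mask_image=mask,
    ).images[0]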
What's insane to me is the fact that the best interfaces to utilize any of these models, from open source LLMs to open source diffusion models, are still random gradio webUIs made by the 4chan/discord anime profile picture crowd.
Automatic1111, ComfyUI, Oobabooga. There's more value within these 3 projects than in at least a billion dollars' worth of money thrown around on yet another podunk VC-backed firm with no product.
It appears that no one is even trying to seriously compete with them on the two primary things they excel at: 1. developer/prosumer focus and 2. the extension ecosystem.
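Part of why the gradio crowd won is how low the barrier to a basic UI is. A bare-bones text-to-image interface is only a few lines (a sketch, not what these projects actually do under the hood):

    import gradio as gr
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    def generate(prompt: str):
        # One image per call; the real UIs add schedulers, batching, extensions...
        return pipe(prompt).images[0]

    gr.Interface(fn=generate, inputs="text", outputs="image").launch()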
Also, if you're a VC/Angel reading my comments about this, I would very much love to talk to you.
For a founder, maybe; definitely not for employees.
AI startups need a not-insignificant amount of startup capital; you cannot just spend weekends building like you would a SaaS app. Model training is expensive, so only wealthy individuals can even consider this route.
Companies like that have no oversight or control mechanisms when management inevitably goes down crazy paths, and without external valuations, option vesting structures are hard to put a value on.
Sometimes you need to say fuck the money, I’ve already got enough, and I just want to do what I enjoy. It may not be an ideal model for HN but damn not everything in life is about grinding, P/E ratios, and vesting schedules
Yeah, that's easier to say when you have enough. A lot of employees might not be in that privileged position. The reality for some of these folks might be paying off education loans, taking care of family, tuition for kids, medical bills, etc.
As a counter-counter-point that rarely gets discussed on HN: VCs aren't taking as much of the pie as people think. In a 2-founder, 4-engineer company, it wouldn't be unusual to have equity be roughly:
20% investors
70% founders
2-3% employees (1% emp1, 1% emp2, 0.5% emp3, 0.25% emp4)
7% for future employees before next funding round
This is not a fair comparison because you are not taking into account liquidation preferences. Those investors don't have the same class of equity as everyone else. That doesn't matter in the case of lights out success but it matters a great deal in many other scenarios.
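To make it concrete, here's a toy sketch of a 1x non-participating preference (the $10M raise and the mechanics shown are illustrative assumptions, not figures from this thread):

    # Toy cap table from the comment above: 20% investors with a 1x
    # non-participating preference on an assumed $10M investment, 80% common.
    invested = 10_000_000
    inv_pct = 0.20

    def payout(exit_value):
        # Investors take the better of their preference or pro-rata conversion
        preference = min(invested, exit_value)
        pro_rata = inv_pct * exit_value
        to_investors = max(preference, pro_rata)
        return to_investors, exit_value - to_investors

    for exit_value in (8e6, 15e6, 100e6):
        inv, common = payout(exit_value)
        print(f"${exit_value/1e6:.0f}M exit -> investors ${inv/1e6:.1f}M, common ${common/1e6:.1f}M")
    # $8M exit   -> investors $8.0M,  common $0.0M  (preference wipes out common)
    # $15M exit  -> investors $10.0M, common $5.0M
    # $100M exit -> investors $20.0M, common $80.0M (investors convert pro-rata)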
Sure. My point was that most employees think that VCs take 80+%, and especially the first few employees usually have no idea just how little equity they have compared to the founders.
There's money to be made for sure, and Stability's sloppy execution and strategy definitely didn't help them. But I think there are also industry-wide factors at play that make AI companies quite brittle for now.
>The vast majority of users couldn't be bothered to take the steps necessary to get it running locally, so I don't even think the open-sourcing philosophy would have been a serious hurdle to wider commercial adoption.
The more I think about the AI space, the more I realize that open-sourcing large models is pointless right now.
Until you can reasonably buy a rig to run the model, there is simply no point in doing this. It's not like you will be edified by looking at the weights either.
I think an ethical business model for these businesses is to release whatever model can fit on a $10,000 machine and keep the rest closed source until such a machine is able to run them.
The released image generation models run on consumer GPUs. Even the big LLMs will run on a $3500 Mac with reasonable performance, and the CPU of a dirt cheap machine if you don't care about it being slow, which is sometimes important and sometimes isn't.
The "big" AI models are trillion-parameter models.
The medium-sized models like GPT-3 and Grok are 175B and 314B parameters respectively.
There is no way for _anyone_ to run these on a sub-$50k machine in 2024, and even if you could, the token generation speed on CPU is under 0.1 tokens per second.
It's just semantic gymnastics. I'm sure most people would consider LLaMA 70B a big model. Of course, if you define big = trillion, then sure, big = trillion[1].
You can get registered DDR4 for ~$1/GB. A trillion parameter model in FP16 would need ~2TB. Servers that support that much are actually cheap (~$200), the main cost would be the ~$2000 in memory itself. That is going to be dog slow but you can certainly do it if you want to and it doesn't cost $50,000.
For 2TB and the server you're at $1698. You can get a drive bracket for a few bucks and a 2TB SSD for $100 and have almost $200 left over to put faster CPUs in it if you want to.
That's stinking Optane; it would work if you're desperate. Normal 128GB LRDIMMs cost more than other DDR4 DIMMs. You can, however, get DDR4 RDIMMs for ~$1/GB:
You can get a decent approximation of LLM performance in tokens/second by dividing the system's memory bandwidth (in GB/s) by the model size in GB. That's assuming it's well-optimized and memory- rather than compute-bound, but those are often both true or pretty close.
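As a back-of-the-envelope sketch of that rule of thumb (the hardware numbers are illustrative):

    # Memory-bound decode: each generated token streams all weights through RAM
    # once, so tokens/second ~= memory bandwidth / model size.
    def est_tokens_per_sec(model_gb, bandwidth_gb_s):
        return bandwidth_gb_s / model_gb

    print(est_tokens_per_sec(140, 400))   # 70B @ FP16 on ~400 GB/s (M1 Max-ish): ~2.9 tok/s
    print(est_tokens_per_sec(2000, 200))  # 1T @ FP16 on ~200 GB/s server DDR4: ~0.1 tok/s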
And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction but if your concern is to have it process confidential data you don't want to upload to a third party you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if machines necessary to do the former are still prohibitively expensive.
Also, hardware gets cheaper over time and it's useful to have the model out there so it's well-optimized and stable by the time fast hardware becomes affordable instead of waiting for the hardware and only then getting to work on the code.
Worth noting the direction: dividing the model size by the memory bandwidth, as this is sometimes stated, gives seconds per token rather than tokens per second, and would imply that increasing memory bandwidth reduces performance.
Indeed! Also, Mixtral 8x7B runs just as well on older M1 Max and M2 Max Macs, since LLM inference is memory-bandwidth-bound and memory bandwidth hasn't significantly changed between M1 and M3.
ChatGPT is 20B according to Microsoft researchers. Also, the claim that the big AI models are trillion-parameter models is mostly speculation; for GPT-4 it was spread by geohot.
To be precise, ChatGPT 3.5 Turbo being 20B is officially a mistake by a Microsoft researcher, who quoted a wrong source published before the release of GPT-3.5 Turbo. Up to you whether to believe it or not. But I wouldn't claim it's 20B "according to Microsoft researchers".
I think it became apparent when Mixtral came out. I've noticed during training, too, that my model overwrites useful information, so it makes sense for these types of models to have emerged.
Disagree. A few weeks ago, I followed a step-by-step tutorial to download Ollama, which in turn can download various models. On my not-special laptop with a so-so graphics card, Mixtral runs just fine.
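Ollama also exposes a simple local HTTP API once it's running, so you can script against it. A minimal sketch (assumes the server is running and the model has already been pulled with "ollama pull mixtral"):

    import json
    import urllib.request

    # Ollama's local server listens on port 11434 by default
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "mixtral",
            "prompt": "Explain memory-bound inference in one sentence.",
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["response"])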
As models advance, they will become not just larger but also more efficient. Hardware advances too. Large models will run just fine on affordable hardware in just a few years.
I’ve come to the opposite conclusion personally - AI model inference requires burst compute, which particularly suits cloud deployment (for these sort of applications).
And while AIs may become more compute-efficient in some respects, the tasks we ask AIs to do will grow larger and more complex.
Sure, you might get a good image locally, but what about when the market moves to video? Sure, ChatGPT might give good responses locally, but how long will it take when you want it to refactor an entire codebase?
Not saying that local compute won’t have its use-cases though… and this is just a prediction that may turn out to be spectacularly wrong!
I wonder if Midjourney is still ripping. I'm actually curious whether it's superior to ChatGPT's DALL-E images… I switched and cancelled my subscription when ChatGPT added images, but I think I was mostly focused on convenience.
If you have a particular style in mind then results may vary, but aesthetically Midjourney is generally still the best; however, DALL-E 3 has every other model beat in terms of prompt adherence.
Image quality, stylistic variety, and resolution are much better than ChatGPT. Prompt following is a little better with ChatGPT, but MJ v6 has narrowed the gap.