Basically, it's the same math as modern automated manufacturing: a super expensive and complex build-out, then a money printer once it's running and optimized.
I know there's a lot of bearish sentiment here. Lots of people correctly point out that this is not the same math as FAANG products, then jump to the conclusion that it must be bad.
But my guess is these companies end up with margins better than Tesla's (a modern manufacturer), yet below the 80-90% of "pure" software. Somewhere in the middle, which is still pretty good.
Also, once the Nvidia monopoly gets broken, the initial build-out becomes a lot cheaper as well.
And if you ever step off the treadmill and jack up prices to reach profitability, a new upstart without your sunk costs will immediately create a 99% solution and start competing with you. Or, more likely, hundreds of competitors. As we've seen with Karpathy and Murati, any engineer with pedigree working on the frontier models can easily raise billions to compete.
Expect the trend to pick up as the pool of engineers who can create usable LLMs from scratch increases through knowledge/talent diffusion.
The LLM scene is an insane economic bloodbath right now. The tech aside, the financial moves here are historic. It's the ultimate wet dream for consumers: many competitors, face-ripping capex, any misstep quickly punished, and a total inability to hold anything back from the market. Companies are spending hundreds of billions to put the best tech in your hands as fast and as cheaply as possible.
If OpenAI didn't come along with ChatGPT, we would probably just now be getting Google Bard 1.0 with an ability level of GPT-3.5 and censorship so heavy it would make it useless for anything beyond "Tell me who the first president was".
We have been running this playbook for the last 2 years in healthcare, and we have been super successful. Doubling every quarter over the last year. 70%+ profitability, almost 7 figures of revenue. 100% bootstrapped.
People are still mentally locked into the world where code was expensive. Code now is extremely cheap. And if it is cheap, then it makes sense that every customer gets their own.
Before, we built factories to give people heavy machinery. Now, we run a 3D printer.
Every day I thank the SV product-led-growth cargo cults for telling, and sometimes even forcing, our competition not to go there.
One of the most pleasant experiences I've had writing code was in the early AI days, when we did hyperscript SSE. Super locality of behavior, and a super interesting way of writing Server-Sent Events code. For example (the eventsource name and URL here are placeholders):
eventsource ChatUpdates from /chat-updates
  on message as string
    put it into #div
  end
  on open
    log "connection opened."
  end
  on close
    log "connection closed."
  end
  on error
    log "handle error here..."
  end
end
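For anyone who wants to try it, the server side can be tiny: any endpoint that streams text/event-stream will feed those handlers. Here's a minimal sketch in Python with Flask (the route and payloads are illustrative, not what we ran):

    # Minimal SSE endpoint sketch; route and payloads are made up for illustration.
    import time

    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/chat-updates")
    def chat_updates():
        def stream():
            # Unnamed "data:" lines arrive client-side as the default "message"
            # event; "open" and "error" come from the browser's EventSource itself.
            for i in range(10):
                yield f"data: update {i}\n\n"
                time.sleep(1)
        return Response(stream(), mimetype="text/event-stream")

    if __name__ == "__main__":
        app.run(threaded=True)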
Lots of YC companies copy each other's processes and selection criteria. Basically, they all have the same blind spots and look for the same type of engineer.
So it's super easy to scam all of them with the same skill set and mannerisms.
I send this article as part of onboarding for all new devs we hire. It's super helpful for keeping a fast-growing team from falling into the typical cycle of more people, more complexity.
Thanks for the link to the ColPali implementation - interesting! I am specifically interested in evaluation benchmarks for different image embedding models.
I see the ColiVara-Eval repo in your link. If I understand correctly, ColQwen2 is the current leader, followed closely by ColPali, when applying those models to RAG over documents.
But how do those models compare to each other and to the llama3.2-vision embeddings when applied to, for example, sentiment analysis for photos? Do benchmarks like that exist?
The “equivalent” here would be Jina-Clip (architecture-wise, not necessarily performance-wise).
The ColPali paper(1) does a good job explaining why you don't really want to use vision embeddings directly, and why you are much better off optimizing for RAG with a ColPali-like setup. Basically, a plain vision embedding is not optimized for textual understanding: it works if you are searching for the word "bird" and images of birds, but it doesn't work well for pulling up a document that is a paper about birds.
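To make "a ColPali-like setup" concrete: instead of one vector per image, you keep one vector per page patch and per query token, and score with late interaction (MaxSim). A minimal numpy sketch, with shapes and names assumed for illustration:

    # Late-interaction (MaxSim) scoring, ColBERT/ColPali-style. Illustrative only.
    import numpy as np

    def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
        # query_emb: (n_query_tokens, dim); page_emb: (n_patches, dim).
        # Assumes both are L2-normalized, so dot products are cosine similarities.
        sim = query_emb @ page_emb.T          # (n_query_tokens, n_patches)
        return float(sim.max(axis=1).sum())   # best patch per query token, summed

    # Rank pages by score; the page whose patches best cover the query tokens wins.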
Makes sense. My main takeaway from the ColPali paper (and your comments) is that ColPali works best for document RAG, whereas vision model embeddings are best used for image similarity search or sentiment analysis. So to answer my own question: The best model to use depends on the application.
I would like to throw our project in the ring. We use ColQwen2 over a ColPali implementation. Basically, a search-and-extract pipeline: https://docs.colivara.com/guide/markdown
Here is a nice use case: put this in a pharmacy, have people hit a button, and ask questions about over-the-counter medications.
Really, any physical place where people are easily overwhelmed would benefit from something like that.
With some work, you can probably even run RAG on the questions and answer esoteric things like where the food court is in an airport or where the ATM is in a hotel.
> Put this in a pharmacy, have people hit a button, and ask questions about over-the-counter medications.
Even if you trust OpenAI's models more than your trained, certified, and insured pharmacist -- the pharmacists, their regulators, and their insurers sure won't!
They've got a century of sunk costs to consider (and maybe even some valid concern over the answers a model might give on their behalf...)
Don't expect anything like that in a traditional, regulated medical setting any time soon.
The last few doctor's appointments I've had, the clinician used a service to record and summarize the visit. It was using some sort of speech-to-text and an LLM to do so. It's already in medical settings.
Thanks for digging that out. Yes, that makes sense to me as someone who made a fully local speech-to-speech prototype with Electron, including VAD (voice activity detection) and AEC (acoustic echo cancellation). It was responsive but taxing: I had to use a mix of specialty models over onnx/wasm in the renderer and llama.cpp in the main process. One day, a single multimodal model will just do it all.
We benchmarked two ways to improve latency in RAG workflows with a multi-vector setup: hybrid search using Postgres-native capabilities, and a relatively new method called token pooling. Token pooling delivered up to 70% lower latency at <1% retrieval-quality cost.
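For anyone wondering what token pooling means mechanically: you cluster similar token/patch vectors within each document and keep only the cluster means, so every document carries roughly pool_factor times fewer vectors into scoring. A rough sketch in Python (the pool factor and clustering choice are assumptions for illustration, not our exact pipeline):

    # Token pooling sketch: shrink a multi-vector embedding by mean-pooling
    # clusters of similar vectors. Illustrative, not production code.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def pool_tokens(emb: np.ndarray, pool_factor: int = 3) -> np.ndarray:
        # emb: (n_tokens, dim) -> roughly (n_tokens / pool_factor, dim)
        n_clusters = max(1, emb.shape[0] // pool_factor)
        Z = linkage(emb, method="ward")  # hierarchical clustering of token vectors
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        pooled = np.stack([emb[labels == c].mean(axis=0) for c in np.unique(labels)])
        # Re-normalize so cosine/MaxSim scoring still behaves.
        return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

Fewer vectors per document means fewer similarity computations and a smaller index, which is where the latency win comes from.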