
Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

If the sound is already being generated to match specific points in time, surely you can make it produce output that existing audio mixing tools can consume for further refinement.

The problem with doing these all-in-one integrated solutions is that you're kinda giving people an all-or-nothing option, which doesn't seem that useful. Maybe I'll end up being proven wrong.
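
For illustration, a minimal sketch of the workflow that would enable, assuming the generator simply exported one WAV file per stem (the file names and gain values here are made up):

    # Mix generated stems with ordinary tools (numpy + soundfile).
    # Assumes one WAV per stem with the same length and sample rate.
    import numpy as np
    import soundfile as sf

    stems = {
        "drums.wav": 1.0,      # keep as generated
        "bass.wav": 0.8,       # pull back a little
        "melody_v3.wav": 1.0,  # a regenerated part, swapped in without touching the rest
    }

    mix, samplerate = None, None
    for path, gain in stems.items():
        audio, sr = sf.read(path, dtype="float32", always_2d=True)
        samplerate = samplerate or sr
        mix = audio * gain if mix is None else mix + audio * gain

    # Normalize to avoid clipping, then write a file any DAW can import.
    mix /= max(1.0, float(np.max(np.abs(mix))))
    sf.write("mix.wav", mix, samplerate)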



Yes, same problem as with commercial AI music products not providing stems or MIDI. The engineers on these products are too full of themselves to actually ask anyone in the field what they want, so we just keep getting these stupid magic 8-ball efforts.

This one is particularly annoying, as I worked for years as a sound engineer and have recorded or produced the soundtrack for 10 feature films and a large number of shorts. What's going to happen with this is that directors or producers are gonna do this at home for every scene in a burst of over-enthusiasm, realize the totality is Not Great, and then demand someone like me fix it, but for 1/4 of what the job used to pay, arguing 'but most of the work is already done'. It's all so tiresome.


Same reason you don't see AI making images in layers, etc.: it's just much easier to train an AI that generates everything in one layer. Training a model that generates multiple layers with the same quality of output is much, much harder, and of course companies and users prefer the higher quality over having layers, especially since the quality you get with a single layer is still barely passable.


The samples they used for training are mixed.

Unless they have enough raw, unmixed samples, this depends on how well they can "unmix" them.
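
As a rough illustration of what "unmixing" involves, a sketch using an open-source source-separation tool (Demucs here; treat the model name and output layout as assumptions about its current defaults):

    # Sketch: recover stems from a mixed recording with an off-the-shelf
    # source-separation tool (Demucs CLI, installed separately).
    import subprocess
    from pathlib import Path

    track = Path("mixed_sample.wav")  # hypothetical mixed training sample

    # Separates into drums / bass / vocals / other.
    subprocess.run(["demucs", "-n", "htdemucs", str(track)], check=True)

    # By default Demucs writes results under separated/<model>/<track name>/.
    stems_dir = Path("separated") / "htdemucs" / track.stem
    for stem in sorted(stems_dir.glob("*.wav")):
        print("recovered stem:", stem.name)

Whether stems recovered this way are clean enough to train on is another question.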


Yes...that's the problem. A problem that could be easily avoided by asking existing professionals what matters and what tools they actually want.


Most ML engineers know that many people want more fine-grained control. But the straightforward way to train such models is incredibly data-demanding. The datasets used for whole-image generation consist of several billion images, and I do not think anyone has compiled a dataset of DAW projects / stems that comes anywhere close to that size. So that is a limiting factor right now. But we will find ways to get there; probably a lot of progress over the next 5 years, maybe even the next 2.


It sounds like, between the two of you (and the person who mentioned generating images in layers for image-editing software), you've stumbled upon an obvious gap in the market.


I’ve tried to explain this to several friends. Until these tools can generate output that can be mixed properly, they’re going to be very niche.


> Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

That'd interest me (a musical hobbyist) more than the "whole track" generators, for sure.

I imagine it's a harder task tho'. Presumably, if you give the same source material (video, prompt) to the AI multiple times, it will generate different pieces of music. So if you do a series of prompts, each one specifying a different instrument or group/bus, then you (or the AI) need to arrange for the parts to blend correctly, follow the same cues, and assemble into a coherent arrangement. Is that one pass with multiple outputs, or multiple passes/prompts with one output each?

I have got the impression (from casual reading) that the music generators don't inherently "know" about different parts of a piece of music. They just know about the final output.
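
To make the "multiple passes" option concrete anyway, a purely hypothetical sketch; generate_plan and generate_stem don't correspond to any real model API, they just show where the shared cues would have to live:

    # Hypothetical sketch of "multiple passes, one stem per pass".
    # Neither function is a real API; the point is only that every pass must be
    # conditioned on the same shared plan (tempo, key, sections) or the stems
    # won't blend into a coherent arrangement.
    from typing import Dict, List

    def generate_plan(prompt: str) -> Dict:
        # Hypothetical: one pass that decides global structure only.
        return {"tempo": 92, "key": "D minor", "sections": ["intro", "verse", "chorus"]}

    def generate_stem(instrument: str, plan: Dict) -> List[float]:
        # Hypothetical: one pass per instrument, conditioned on the shared plan.
        return []  # audio samples would go here

    plan = generate_plan("moody underscore for a night driving scene")
    stems = {inst: generate_stem(inst, plan) for inst in ["drums", "bass", "pads", "lead"]}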


> Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

Totally, and that is 100% what is coming. For a great many pictures too: why generate a picture full of lighting issues / approximations when you'll soon be able to generate an entire 3D scene and render it properly?

We've mastered 3D rendering and audio engineering.

I want the 3D models and the 3D scenes. I want the individual tracks (and to combine them in Dolby Atmos or whatever is cool by then).

And that is coming, no question about it.


ElevenLabs just released something that is more controllable:

https://news.ycombinator.com/item?id=40736536


Step 2 of the AI musical "If This Then That": https://www.lalal.ai/ ("Extract vocal, accompaniment and various instruments from any audio and video")


It's limited by the mechanism of diffusion.



