Hacker News
Big vs. small GPU clouds for fine-tuning LLMs
12 points by DanyWin on Aug 12, 2023 | hide | past | favorite | 16 comments
Hi everyone,

I am looking to fine-tune Llama 2 (both the 7B and the 70B, to see if there is a big difference), and I am comparing the different cloud options for GPUs.

There are of course the big cloud providers like AWS, and the smaller ones like Paperspace and co.

I am trying to benchmark each in terms of price, ease of use, quick availability of GPUs, and feature-richness.

Could you share your insights on big vs. small cloud providers when training an LLM? If you have other criteria for making the decision, I would be interested in those too!



Start with 7B and refine as much as you possibly can there. Your value is going to be determined by the variable iteration_time, where your final equation involves 1/iteration_time. Any lessons learned there will effectively scale.

When you cannot get any improvements over several days/weeks of different experiments and have converged somewhat, then move up to the next model size, and do that to convergence. Then repeat this loop as you go.
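The refine-then-scale loop described above can be sketched in code. Everything here is illustrative: train_and_eval is a stand-in that just simulates diminishing returns, and the patience/min_delta values are arbitrary.

```python
# Sketch of "refine the small model to convergence, then move up a size".
# train_and_eval is hypothetical; it simulates a run whose validation loss
# improves for a while and then plateaus, so the control flow is runnable.

def train_and_eval(model_size, attempt):
    base = {"7B": 2.0, "13B": 1.8, "70B": 1.6}[model_size]
    return base - min(attempt, 5) * 0.05  # plateaus after 5 attempts

def refine_to_convergence(model_size, patience=3, min_delta=1e-3):
    best, stalls, attempt = float("inf"), 0, 0
    while stalls < patience:
        loss = train_and_eval(model_size, attempt)
        attempt += 1
        if best - loss > min_delta:
            best, stalls = loss, 0  # still improving at this scale
        else:
            stalls += 1             # no meaningful gain this attempt
    return best

# Only move to the next model size once the smaller one has converged.
for size in ["7B", "13B", "70B"]:
    print(size, refine_to_convergence(size))
```

The point is the outer loop: lessons (and hyperparameters) found cheaply at 7B carry forward, so each larger size starts from a better place.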

Same concept as mipmapping, just with resources. How you use your resources is more important than the ones you have. I've made the vast majority of my own big discoveries with a T4 or a single A100, generally speaking, and I've done this for years.

In terms of providers, I like Lambda the best, personally, but I do a shocking amount of work in Colab Pro due to its iterative nature. I believe I've had GPU availability issues with both of them, however.


I have definitely run into GPU scarcity lately on Colab Pro, and I have VERY limited use compared to many people here or researchers and enthusiasts at large.


I've been wondering what's driving much of it, to be honest. Before, I didn't have many problems securing A100s; now shortages seem a lot more frequent (I had taken a short break from things).

I made a training speedrunning repository that sort of assumes anyone can grab an A100 on Colab, but I guess that's not true as much now. Which honestly is a bit of a surprise for me! DDDD::::


I've had good experiences renting 4x 4090 and 6x A5000/A6000 machines from Vast AI [0]. Surprisingly good, but some clients do have issues with multi-week uptime.

0 - https://cloud.vast.ai/?ref_id=74601


Not affiliated with the URL: https://www.unite.ai/best-gpu-hosting-providers/

also this: https://vast.ai/

Maybe also Vultr.


I’ve written quite a bit about this. See https://gpus.llm-utils.org/cloud-gpu-guide/#which-gpu-cloud-... as well as some of the other posts linked in that “which gpu cloud should I use” section.

Edit: And I realize you emailed me the other day - hi again!

See also - https://hn.algolia.com/?dateRange=pastMonth&page=0&prefix=fa...


You can fine-tune Llama 2 7B/13B/70B on Cerebrium (https://www.cerebrium.ai). Disclosure: I am the founder.

We allow you to change all the hyperparameters, such as num_epochs, learning_rate, tokenizer, etc., and you can even submit your own prompt templates. If you want to use the recommended settings, you can. It lets you focus on your data and your hyperparameters rather than worrying about setup and infrastructure.
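To make that concrete, a config along these lines is what's meant; the exact payload shape and any field names beyond the ones mentioned in the comment (num_epochs, learning_rate, tokenizer, prompt template) are my assumptions, not Cerebrium's actual API.

```python
# Illustrative fine-tuning config; field names beyond those named in the
# comment are hypothetical, and the prompt template is a common
# instruction-tuning layout, not a Cerebrium-specific one.
finetune_config = {
    "model": "llama-2-7b",
    "num_epochs": 3,
    "learning_rate": 2e-4,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "prompt_template": (
        "### Instruction:\n{instruction}\n\n### Response:\n{response}"
    ),
}
print(finetune_config["num_epochs"])
```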

If you want to write your own code to train, you can do that as well - the interface is as if you were developing locally :)


I use vast.ai to train 110M models - much smaller, but it is good value for money and I am sure you can scale it up. I just use a single RTX 3090.

You might also consider Lambda Labs. Or GCP.


We have built Shadeform (https://shadeform.ai), which has all the small cloud providers in a single, unified API and platform.


I guess you are early in your journey, but there is no pricing.


You can access our platform with live pricing and availability here: https://platform.shadeform.ai


What claud said.

And QLoRA makes 70B training relatively affordable.
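A rough back-of-envelope on why QLoRA helps: the frozen base weights are held in 4-bit NF4 (about 0.5 bytes per parameter) and only the small LoRA adapters are trained in higher precision. A sketch of the weights-only arithmetic (real runs add activations, adapter optimizer state, and framework overhead, so treat these as lower bounds):

```python
# Memory for just the base-model weights at a given bit width.
# 70B at fp16 is ~140 GB (multi-GPU territory); at 4-bit it is ~35 GB,
# which fits on a single 40/80 GB card with room for LoRA adapters.

def qlora_base_memory_gb(n_params, bits=4):
    return n_params * bits / 8 / 1e9  # params * bytes-per-param, in GB

print(qlora_base_memory_gb(70e9, bits=16))  # full-precision-ish baseline
print(qlora_base_memory_gb(70e9, bits=4))   # 4-bit quantized base
```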

But as a random aside, consider starting with an existing finetune instead of base llama 70B, and match the formatting in your dataset.


Is there a particular fine tune you would suggest starting from?

I really wish there were (maybe there is?) a 7B or 13B (or 70B) version with extended context (at least 16k) and function-calling support a la OpenAI.

Both exist on their own, I don't know of a combination.


TBH I am out of the loop on finetunes, but you can search for "70B" on huggingface and sort by date.


I tried Paperspace, but their pipelines feature bugged out, and the support was entirely unhelpful.

I would say it's not production-ready.


I like RunPod, although I've found that I typically have to set NCCL_P2P_DISABLE=1.
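For anyone hitting the same thing: NCCL reads that flag from the environment when it initializes, so it just needs to be set before the distributed run starts. A minimal sketch (the training launch command is a placeholder):

```python
import os

# Disable NCCL peer-to-peer transport; NCCL picks this up from the
# environment at init time, so set it before any distributed setup runs.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Then launch training as usual, e.g. (placeholder, not run here):
# subprocess.run(["torchrun", "--nproc_per_node=2", "train.py"])

print(os.environ["NCCL_P2P_DISABLE"])
```

Setting it in a wrapper script or shell (`export NCCL_P2P_DISABLE=1`) before `torchrun` works just as well.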



