Interesting paper: 𝐆𝐚𝐋𝐨𝐫𝐞: 𝐭𝐫𝐚𝐢𝐧 𝟕𝐁 𝐦𝐨𝐝𝐞𝐥𝐬 𝐨𝐧 𝐜𝐨𝐧𝐬𝐮𝐦𝐞𝐫-𝐠𝐫𝐚𝐝𝐞 𝐆𝐏𝐔𝐬 💪
It's now possible to 𝙛𝙪𝙡𝙡𝙮 𝙥𝙧𝙚-𝙩𝙧𝙖𝙞𝙣 a 7B model on a consumer-grade GPU with 24 GB of memory, without any performance loss!
The memory usage of model training has always been an acute issue. For instance, full pre-training of a 7B model used to eat ~50 GB of GPU memory!
The common workarounds to reduce memory load are:
- split the model across multiple GPUs ("sharding")
- quantize models: encode weights on fewer bits
Another technique is to 𝙥𝙧𝙤𝙟𝙚𝙘𝙩 𝙩𝙝𝙚 𝙬𝙚𝙞𝙜𝙝𝙩 𝙢𝙖𝙩𝙧𝙞𝙭 𝙩𝙤 𝙡𝙤𝙬𝙚𝙧-𝙧𝙖𝙣𝙠 𝙨𝙥𝙖𝙘𝙚𝙨 (since the weights often do not really vary along all dimensions): this can save a lot of space!
This low-rank projection can be applied to adapters to preserve the original weights (go check out LoRA), but it generally hurts performance too much to be used for pre-training.
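As a refresher, here is a minimal PyTorch sketch of that adapter idea: a LoRA-style linear layer where the frozen weight gets a trainable low-rank update. This is illustrative only, not the official LoRA implementation; the dimensions, rank and scaling below are placeholder choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-style layer: the frozen weight W is augmented with a trainable
    low-rank update B @ A of rank r << min(d_in, d_out)."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # In practice this would be the pretrained weight; random init here for the sketch.
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable, rank r
        self.B = nn.Parameter(torch.zeros(d_out, r))         # trainable, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Full-rank frozen path + scaled low-rank trainable path.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B receive gradients, so the optimizer state shrinks dramatically; but since the base weights stay frozen and updates are constrained to rank r, this works much better as a fine-tuning trick than for pre-training.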
➡️ Enter the authors of 𝘎𝘢𝘓𝘰𝘳𝘦: 𝘔𝘦𝘮𝘰𝘳𝘺-𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵 𝘓𝘓𝘔 𝘛𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘣𝘺 𝘎𝘳𝘢𝘥𝘪𝘦𝘯𝘵 𝘓𝘰𝘸-𝘙𝘢𝘯𝘬 𝘗𝘳𝘰𝘫𝘦𝘤𝘵𝘪𝘰𝘯. They gather (and prove) interesting insights:
⛔ The weight matrix does not reliably converge to lower ranks during training.
✅ But the gradient matrix does!
Based on these insights, 𝘁𝗵𝗲𝘆 𝗯𝘂𝗶𝗹𝗱 𝗚𝗮𝗟𝗼𝗿𝗲, which projects the gradients, rather than the weights, onto a low-rank subspace.
🗺️ 𝗚𝗿𝗲𝗮𝘁 𝗶𝗱𝗲𝗮: to leave the optimization free to explore more of the parameter space, they periodically recompute the low-rank projection throughout training (there is a nice illustration in the paper).
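In code, one GaLore-style update for a single 2-D weight matrix looks roughly like the sketch below, assuming Adam as the inner optimizer. This is a simplified illustration, not the authors' implementation: the function name and defaults are assumptions, per-layer scaling is omitted, and how the moments are treated when the projector is refreshed is glossed over.

```python
import torch

def galore_adam_step(weight, grad, state, rank=128, update_proj_gap=200,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One GaLore-style step: project the gradient to low rank, keep Adam
    moments in that small subspace, project the update back to full rank."""
    step = state.get("step", 0)

    # Periodically recompute the projection from the current gradient's SVD,
    # so the optimizer stays free to explore new subspaces during training.
    if step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                   # (d_out, rank)

    P = state["P"]
    g_low = P.T @ grad                             # low-rank gradient, (rank, d_in)

    # Adam moments are stored at low rank -> optimizer state shrinks by ~d_out/rank.
    m = beta1 * state.get("m", torch.zeros_like(g_low)) + (1 - beta1) * g_low
    v = beta2 * state.get("v", torch.zeros_like(g_low)) + (1 - beta2) * g_low.pow(2)
    m_hat = m / (1 - beta1 ** (step + 1))
    v_hat = v / (1 - beta2 ** (step + 1))

    # Project the low-rank Adam update back to full rank and apply it.
    with torch.no_grad():
        weight -= lr * (P @ (m_hat / (v_hat.sqrt() + eps)))
    state.update(step=step + 1, m=m, v=v)
```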
🤝 This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8 bits).
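For example, with the authors' galore-torch package, combining GaLore with 8-bit Adam looks roughly like this (the import path, the GaLoreAdamW8bit class, and the per-group keys below follow its README at the time of writing; treat them as assumptions and check the current docs):

```python
import torch
from galore_torch import GaLoreAdamW8bit  # assumed import path from the galore-torch README

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
)

# Apply GaLore to the 2-D weight matrices only; biases stay in a regular group.
galore_params = [p for p in model.parameters() if p.ndim == 2]
regular_params = [p for p in model.parameters() if p.ndim != 2]

optimizer = GaLoreAdamW8bit(
    [
        {"params": regular_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},  # illustrative GaLore hyper-parameters
    ],
    lr=1e-3,
)
```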
➡️ 𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
📉 Of course, a huge reduction in memory footprint, allowing training on a consumer-grade GPU (see the figure).
💪 No reduction in performance: the method scales well up to 7B parameters (and has since been independently confirmed) ⇒ this is essential, as it shows the method is viable!
Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)