S3GD Optimizer Algorithm
We are excited to announce WhyPhy’s first open-source release: S3GD, a PyTorch-compatible, fused-kernel implementation of the Smoothed SignSGD training optimizer.
S3GD is nearly equivalent to Adam with aggressive gradient clipping: its sparse update behavior creates a similar “just works” effect during post-training. Derived from the simpler SGD algorithm, Smoothed SignSGD is more compute- and memory-efficient than Adam variants, and that efficiency pairs with its sparsity to make it a compelling candidate for both SFT and RFT. As post-training workloads, and reinforcement learning in particular, grow in prevalence, these efficiency gains compound to improve accessibility, benefiting the entire Open Ecosystem.
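To make the comparison concrete, here are the two update rules side by side in our own notation. The Adam line is the standard formulation; the connection between them (aggressive clipping pushes Adam’s per-coordinate ratio toward ±1, leaving roughly an exponential moving average of the gradient sign) is our informal paraphrase of the Cesista post cited below, not a derivation.

$$
\begin{aligned}
\text{Adam:} \quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \\
\text{Smoothed SignSGD:} \quad & m_t = \beta\,m_{t-1} + (1-\beta)\,\mathrm{sign}(g_t), \qquad \theta_{t+1} = \theta_t - \eta\,m_t
\end{aligned}
$$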
To convert a Triton implementation of Google’s Lion optimizer into Smoothed SignSGD, we 1) replaced the raw gradient with sign(gradient), 2) made the exponential moving average itself the step direction, and 3) eliminated the second β parameter. Given Lion’s success in FP8 training and the similarity of the two algorithms, we further optimized S3GD by storing the momentum buffer in FP8 precision and casting the arithmetic to BF16, cutting state memory and bandwidth 4x while maintaining numerical robustness.
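If it helps to see those three changes concretely, here is a minimal, unfused PyTorch sketch of the resulting update rule. It is not the fused Triton kernel from the repository; the class name, hyperparameter defaults, and the Lion-style decoupled weight-decay term are illustrative assumptions.

import torch

class S3GDReference(torch.optim.Optimizer):
    # Plain-PyTorch sketch of the Smoothed SignSGD update described above.
    # The released optimizer is a fused Triton kernel; names here are illustrative.
    def __init__(self, params, lr=1e-4, beta=0.9, weight_decay=0.0):
        defaults = dict(lr=lr, beta=beta, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, wd = group["lr"], group["beta"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    # The fused kernel stores this buffer in FP8; FP32 keeps the sketch simple.
                    state["momentum"] = torch.zeros_like(p)
                m = state["momentum"]
                if wd != 0:
                    # Decoupled weight decay, carried over from the Lion formulation (assumption).
                    p.mul_(1 - lr * wd)
                # Change 1: feed sign(gradient) rather than the raw gradient into the EMA.
                # Changes 2 and 3: the single-beta EMA itself is the step direction.
                m.mul_(beta).add_(torch.sign(p.grad), alpha=1 - beta)
                p.add_(m, alpha=-lr)
        return loss

Dropping it into a training loop follows the usual torch.optim pattern: construct it from model.parameters(), call loss.backward(), then step().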
S3GD is WhyPhy’s first contribution to the Open Ecosystem, with many more to come. You can find the production-ready implementation, open-sourced under the MIT License, in our S3GD repository.
Spencer Garnets and Aria Bagheri
Co-Founders of WhyPhy Labs
Citations
@misc{cesista2025adamagressiveclipping,
url = {http://leloykun.github.io/ponder/adam-aggressive-clipping/},
author = {Franz Louis Cesista},
title = {Adam with Aggressive Gradient Clipping ≈ Smoothed SignSGD/NormSGD},
year = {2025}
}
@misc{kalomaze2025tweet,
url = {https://x.com/kalomaze/status/1940424032119316813},
author = {kalomaze},
title = {On Adam with aggressive gradient clipping causing sparse updates},
year = {2025}
}
@misc{lucidrains,
url = {https://github.com/lucidrains/lion-pytorch},
author = {Phil Wang},
title = {lion-pytorch},
year = {2024}
}
@misc{chen2023symbolic,
url = {https://arxiv.org/abs/2302.06675},
author = {Chen, Xiangning and Liang, Chen and Huang, Da and Real, Esteban and Wang, Kaiyuan and Liu, Yao and Pham, Hieu and Dong, Xuanyi and Luong, Thang and Hsieh, Cho-Jui and Lu, Yifeng and Le, Quoc V.},
title = {Symbolic Discovery of Optimization Algorithms},
publisher = {arXiv},
year = {2023}
}
@misc{narayan2025munit,
url = {https://arxiv.org/abs/2502.05967},
author = {Narayan, Saaketh and Gupta, Abhay and Paul, Mansheej and Blalock, Davis},
title = {μnit Scaling: Simple and Scalable FP8 LLM Training},
publisher = {arXiv},
year = {2025}
}