Update README.md
README.md CHANGED
@@ -166,7 +166,7 @@ the upper atmosphere (e.g., 50 hPa) contribute relatively little to the total lo
 
 Data parallelism is used for training, with a batch size of 16. One model instance is split across four 40GB A100
 GPUs within one node. Training is done using mixed precision (Micikevicius et al. [2018]), and the entire process
-takes about one week, with 64 GPUs in total. The checkpoint size is 1.19 GB and it does not include the optimizer
+takes about one week, with 64 GPUs in total. The checkpoint size is 1.19 GB and, as mentioned above, it does not include the optimizer
 state.
 
 ## Evaluation
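The paragraph touched by this change describes the training recipe: data-parallel training with a batch size of 16, mixed precision (Micikevicius et al. [2018]), and a checkpoint that stores model weights without optimizer state. As a rough illustration of that recipe, here is a minimal sketch using PyTorch's DDP and AMP utilities. It is not the repository's actual training script: the optimizer, loss, data loader, and checkpoint path are all placeholder assumptions, and distributed initialization (e.g., via torchrun) is assumed to have happened already.

```python
# Hypothetical sketch of the training setup described above: data-parallel
# training with mixed precision, as provided by torch.cuda.amp. All names
# here are placeholders, not the repository's code.
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: nn.Module, loader, max_steps: int, device: str = "cuda"):
    # Assumes torch.distributed.init_process_group() has already run
    # (e.g., launched via torchrun); each process drives one model replica.
    model = DDP(model.to(device))
    opt = torch.optim.AdamW(model.parameters())  # optimizer choice is an assumption
    scaler = torch.cuda.amp.GradScaler()         # dynamic loss scaling for fp16
    loss_fn = nn.MSELoss()                       # placeholder loss

    for step, (x, y) in enumerate(loader):       # batch size (e.g., 16) set in the loader
        if step == max_steps:
            break
        x, y = x.to(device), y.to(device)
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():          # fp16 forward pass and loss
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()            # scale loss before backward
        scaler.step(opt)                         # unscale grads, then optimizer step
        scaler.update()

    # Save only the model weights, matching the README's note that the
    # checkpoint excludes the optimizer state.
    torch.save(model.module.state_dict(), "checkpoint.pt")
```

Excluding the optimizer state is what keeps the checkpoint at 1.19 GB: an Adam-family optimizer, if that is what is used here, tracks two extra tensors per parameter, so saving its state would roughly triple the file size.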