News

If a distributed training job is initially launched on A nodes, the model checkpoint will be saved as blabla_worldsize_A_rank_xx.pt. If the training is interrupted and later resumed with a differen ...
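
As a rough sketch of the checkpoint naming scheme mentioned above (blabla_worldsize_A_rank_xx.pt), the snippet below builds and parses such file names. The helper names (`checkpoint_name`, `parse_checkpoint_name`, `saved_world_sizes`) are hypothetical and not part of this project, and it assumes the rank is written as a plain, unpadded integer, which the text does not specify.

```python
import re
from typing import Iterable, Optional, Set, Tuple

# Assumed pattern: <prefix>_worldsize_<A>_rank_<xx>.pt
# The digit format of the rank (padded or not) is an assumption.
_PATTERN = re.compile(
    r"^(?P<prefix>.+)_worldsize_(?P<world_size>\d+)_rank_(?P<rank>\d+)\.pt$"
)


def checkpoint_name(prefix: str, world_size: int, rank: int) -> str:
    """Build a checkpoint file name for one rank (hypothetical helper)."""
    return f"{prefix}_worldsize_{world_size}_rank_{rank}.pt"


def parse_checkpoint_name(name: str) -> Optional[Tuple[str, int, int]]:
    """Return (prefix, world_size, rank), or None if the name does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return (
        match.group("prefix"),
        int(match.group("world_size")),
        int(match.group("rank")),
    )


def saved_world_sizes(names: Iterable[str]) -> Set[int]:
    """Collect the world sizes recorded in a list of checkpoint file names."""
    sizes = set()
    for name in names:
        parsed = parse_checkpoint_name(name)
        if parsed is not None:
            sizes.add(parsed[1])
    return sizes
```

A resume script could call `saved_world_sizes` on the files in the checkpoint directory and compare the result with the world size of the new job to detect that the two differ.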