experiment tracking (e.g. wandb)
loss tracking
gradient tracking
multi-node multi-gpu
ddp (torch.nn.parallel.DistributedDataParallel)
torch elastic (torchrun)
nccl - cuda only
mpi
gloo - mainly cpu (only limited cuda collective support)
node interconnect
roce (RDMA over Converged Ethernet)
ib (InfiniBand)
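
A rough sketch of how the launcher and backends above fit together: torchrun starts one process per GPU and hands each one its identity through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The helper names below (DistEnv, read_dist_env, pick_backend) are hypothetical and exist only for illustration, and the fallback defaults are assumptions for single-process runs. Only the standard library is used, so the torch calls appear as comments.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Process identity as exported by torchrun (hypothetical container)."""
    rank: int          # global rank across all nodes
    local_rank: int    # rank within this node, usually the GPU index
    world_size: int    # total number of processes
    master_addr: str
    master_port: int


def read_dist_env(env=None) -> DistEnv:
    """Read the torchrun variables; defaults are assumed single-process values."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


def pick_backend(cuda_available: bool) -> str:
    """nccl for CUDA tensors, gloo as the CPU fallback."""
    return "nccl" if cuda_available else "gloo"


# In a real training script (requires torch, not executed here):
#   import torch, torch.distributed as dist
#   from torch.nn.parallel import DistributedDataParallel
#   cfg = read_dist_env()
#   dist.init_process_group(backend=pick_backend(torch.cuda.is_available()))
#   torch.cuda.set_device(cfg.local_rank)
#   model = DistributedDataParallel(model.to(cfg.local_rank),
#                                   device_ids=[cfg.local_rank])
```

A two-node launch might look like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=HOST:PORT train.py`, where HOST:PORT and train.py are placeholders.
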
checkpoints
wandb
s3, etc.
torch.save
safetensors
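
To make the difference between the last two items concrete: torch.save writes a pickle-based file, so loading an untrusted checkpoint can execute code, whereas a safetensors file is an 8-byte little-endian header length, a JSON header (dtype, shape, data_offsets per tensor), and then raw tensor bytes. The sketch below mimics that layout with the standard library only. It is a toy illustration, not the safetensors library API, and the function names (write_toy_file, read_header) are hypothetical; files it writes are not guaranteed to load with the real library.

```python
import json
import struct


def write_toy_file(path, tensors):
    """tensors: dict of name -> (shape, flat list of floats), stored as little-endian F32."""
    header, chunks, offset = {}, [], 0
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            # offsets are relative to the start of the data section
            "data_offsets": [offset, offset + len(raw)],
        }
        chunks.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)                          # JSON header
        f.write(b"".join(chunks))                      # raw tensor bytes


def read_header(path):
    """Read only the JSON header; no tensor data is touched and nothing is unpickled."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))
```

Because the header is plain JSON, shapes and dtypes can be inspected without loading the weights. In practice the real libraries would be used instead (torch.save / torch.load, or save_file / load_file from safetensors.torch).
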