Optimizer
RoundPipe supports any PyTorch optimizer and correctly synchronizes optimizer state in distributed training. For PyTorch optimizers that offer a fused implementation, we recommend passing fused=True when creating the optimizer, which gives significantly better performance. PyTorch's built-in optimizers perform poorly on CPU, however, so RoundPipe also maintains CPU-optimized implementations of selected optimizers, compiled specifically for the host CPU to accelerate optimizer updates.
The following are the optimizer implementations maintained in RoundPipe, listed in alphabetical order.
roundpipe.optim.Adam
class roundpipe.optim.Adam(
    params: ParamsT,
    lr: Union[float, torch.Tensor] = 1e-3,
    betas: Tuple[Union[float, torch.Tensor], Union[float, torch.Tensor]] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 0.0,
    amsgrad: bool = False,
    *,
    foreach: Optional[bool] = None,
    maximize: bool = False,
    capturable: bool = False,
    differentiable: bool = False,
    fused: Optional[bool] = None,
    decoupled_weight_decay: bool = False,
)
Implements the Adam optimization algorithm, performing parameter updates on CPU in fp32 precision. This Adam implementation is compiled and optimized for CPU execution, offering better performance than PyTorch's native CPU Adam.
The interface is designed to be API-compatible with torch.optim.Adam and can be used as a drop-in replacement.
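As a minimal sketch of the drop-in usage described above, the snippet below swaps the optimizer class at import time and leaves the training step untouched. The fallback to torch.optim.Adam, the helper train_one_step, and the toy model are illustrative assumptions, not part of RoundPipe; only the constructor arguments come from the signature above.

```python
import torch

try:
    # Class path taken from the signature above; requires RoundPipe to be installed.
    from roundpipe.optim import Adam
except ImportError:
    # Assumption for this sketch: fall back to PyTorch's Adam so the code still runs.
    from torch.optim import Adam


def train_one_step(model, optimizer, inputs, targets):
    """Hypothetical helper: one forward/backward/update cycle."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = torch.nn.Linear(4, 2)  # float32, contiguous parameters on CPU
optimizer = Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)

weights_before = model.weight.detach().clone()
loss = train_one_step(model, optimizer, torch.randn(8, 4), torch.randn(8, 2))
```

Because the constructor arguments match torch.optim.Adam, the rest of the training loop should not need to change when the import is switched.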
Parameters:
- params: An iterable of parameters or dicts defining parameter groups.
- lr: Learning rate. Defaults to 1e-3.
- betas: Coefficients used for computing running averages of the gradient and its square. Defaults to (0.9, 0.999).
- eps: Term added to the denominator to improve numerical stability. Defaults to 1e-8.
- weight_decay: Weight decay coefficient. Defaults to 0.0.
- amsgrad: Whether to use the AMSGrad variant from the paper "On the Convergence of Adam and Beyond". Defaults to False.
- maximize: Whether to maximize the objective function (instead of minimizing it). Defaults to False.
- decoupled_weight_decay: If True, equivalent to the AdamW algorithm, where weight decay does not accumulate in the momentum and variance terms. Defaults to False.
- foreach: Compatibility placeholder parameter; ignored with a warning if provided.
- capturable: Compatibility placeholder parameter; setting it to True is not supported.
- differentiable: Compatibility placeholder parameter; setting it to True is not supported.
- fused: Compatibility placeholder parameter; ignored with a warning if provided.
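To illustrate how weight_decay, decoupled_weight_decay, and maximize enter the update, here is a scalar reference sketch of one textbook Adam step in pure Python. It is not RoundPipe's compiled implementation; the function adam_step is a hypothetical name, and the AMSGrad variant is left out for brevity.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled_weight_decay=False, maximize=False):
    """One textbook Adam update for a single scalar parameter (step starts at 1)."""
    beta1, beta2 = betas
    if maximize:
        grad = -grad  # ascend instead of descend
    if weight_decay != 0.0:
        if decoupled_weight_decay:
            # AdamW style: shrink the parameter directly; m and v never see the decay.
            param = param * (1.0 - lr * weight_decay)
        else:
            # Classic L2 style: the decay is folded into the gradient, so it
            # accumulates in the running averages m and v.
            grad = grad + weight_decay * param
    m = beta1 * m + (1.0 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running average of its square
    m_hat = m / (1.0 - beta1 ** step)             # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Same starting point, coupled versus decoupled weight decay.
coupled = adam_step(1.0, 0.5, 0.0, 0.0, step=1, weight_decay=0.1)
decoupled = adam_step(1.0, 0.5, 0.0, 0.0, step=1, weight_decay=0.1,
                      decoupled_weight_decay=True)
```

In the coupled call the decay term appears in the returned m and v, while in the decoupled call those moments depend on the raw gradient only, which is the distinction the decoupled_weight_decay description refers to.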
Limitations:
- Only supports float32 tensors on CPU.
- Sparse gradients are not supported.
- All tensors must be contiguous.
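The limitations above can be checked before constructing the optimizer. The following is a small sketch using standard PyTorch tensor attributes; find_incompatibilities is a hypothetical helper written for this example and is not provided by RoundPipe.

```python
import torch


def find_incompatibilities(tensor):
    """Return a list of reasons why a tensor violates the limitations listed above."""
    problems = []
    if tensor.device.type != "cpu":
        problems.append("not on CPU")
    if tensor.dtype != torch.float32:
        problems.append("not float32")
    if tensor.is_sparse:
        problems.append("sparse")
    elif not tensor.is_contiguous():
        # Contiguity is only checked for dense (strided) tensors.
        problems.append("not contiguous")
    return problems


dense_ok = torch.zeros(3, 4)              # float32, CPU, contiguous
transposed = torch.zeros(3, 4).t()        # a transposed view is not contiguous
double_precision = torch.zeros(3, 4, dtype=torch.float64)

report = {
    "dense_ok": find_incompatibilities(dense_ok),
    "transposed": find_incompatibilities(transposed),
    "double_precision": find_incompatibilities(double_precision),
}
```

A non-contiguous parameter can usually be fixed with tensor.contiguous(), and a different dtype with tensor.float(), before the parameters are handed to the optimizer.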