Train Your Large Models
High Performance · Easy to Use · Built for Gaming GPUs
pip install roundpipe
Built for huge models
On a single 24GB GPU, train with 64K+ context length, fully fine-tune 32B models, or LoRA fine-tune models up to 235B.
⚡ High Performance
Push a 4090 close to A800 NVLink-class throughput. Up to 6× faster than FSDP Offload in typical training workloads.
📈 Linear scaling
Scale to multiple GPUs in-node without rewriting your training loop. Throughput grows linearly while your code stays the same.
✨ Feels like PyTorch
A sequential programming interface with a low learning curve. Works well in Jupyter Notebook for rapid iteration.
🧩 General by design
No constraints on layer structure, training flow, or parameter update strategy.
🌍 Portable across accelerators
Pure PyTorch implementation. Runs on Nvidia, AMD, and Ascend accelerator platforms.
Train bigger than ever
64K+ long-context training on a single 24GB GPU
Full fine-tuning for 32B, LoRA for up to 235B
Up to 7× longer sequence lengths than PyTorch FSDP
Extracts maximum performance
A 4090 can reach near A800 NVLink-level throughput
Up to 6× faster than FSDP Offload
As models grow, RoundPipe keeps pulling ahead
Scale out without rewrites
100% automatic multi-GPU scaling within a node
Throughput grows linearly with GPU count
Max sequence length per GPU stays unchanged
Simple API, flexible training
Sequential programming interface
Zero parallel programming
Jupyter Notebook friendly
import torch
from roundpipe import RoundPipe, OptimizerCtx

# Any deep neural network
model = torch.nn.Sequential(layer1, layer2, layer3, ...)

# Any PyTorch optimizer
with OptimizerCtx():
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Any training loop
for data in dataloader:
    loss = model.forward_backward(data)

    # Any parameter update strategy
    def step_fn():
        optimizer.step()
        optimizer.zero_grad()
    model.step(step_fn)
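For reference, the snippet above mirrors a standard single-GPU PyTorch loop. A minimal plain-PyTorch sketch of the same structure, with small hypothetical layers and a toy dataloader standing in for the real model and data, looks like this:

```python
import torch

# Hypothetical stand-in layers (any nn.Module stack works)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy dataloader: a few (input, target) batches
dataloader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(3)]

for x, y in dataloader:
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward
    loss.backward()                                   # backward
    optimizer.step()                                  # parameter update
    optimizer.zero_grad()
```

The RoundPipe version keeps this sequential shape, folding forward and backward into one call and handing the update to a callback instead of running it inline.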
Portable by default
Pure PyTorch implementation
Compatible with Nvidia, AMD, Ascend, and more
Write once, train anywhere