
RoundPipe Banner

Train Your Large Models

High Performance · Easy to Use · Built for Gaming GPUs

> pip install roundpipe
Get started ↗


Train bigger than ever

64K+ long-context training on a single 24GB GPU

Full fine-tuning for 32B, LoRA for up to 235B

Up to 7Γ— longer sequence length than PyTorch FSDP
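To see why a 24GB gaming GPU normally cannot hold such training runs, here is a back-of-envelope memory sketch, assuming bf16 weights and gradients plus fp32 Adam state (these byte counts are standard mixed-precision assumptions, not RoundPipe's actual memory plan):

```python
# Why full fine-tuning a 32B model needs more than a 24 GB GPU can hold
# without offloading. Assumed layout: bf16 weights (2 B) and gradients (2 B),
# fp32 master weights + Adam first/second moments (4 B each).
params = 32e9

weights_gb = params * 2 / 1e9            # bf16 weights
grads_gb = params * 2 / 1e9              # bf16 gradients
optim_gb = params * (4 + 4 + 4) / 1e9    # fp32 master + Adam m, v

total_gb = weights_gb + grads_gb + optim_gb
print(f"weights {weights_gb:.0f} GB, grads {grads_gb:.0f} GB, "
      f"optimizer {optim_gb:.0f} GB, total {total_gb:.0f} GB")
# hundreds of GB before activations, versus 24 GB of VRAM
```

The gap between this total and 24GB of VRAM is what offloading and scheduling techniques like RoundPipe's have to bridge.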

Huge model support

Extracts maximum performance

A 4090 can reach near A800 NVLink-level throughput

Up to 6Γ— faster than FSDP Offload

As models grow, RoundPipe keeps pulling ahead

GPU throughput

Scale out without rewrites

100% automatic multi-GPU scaling within a node

Throughput grows linearly with GPU count

Max sequence length per GPU stays unchanged
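The scaling claims above can be stated as simple arithmetic (the throughput number below is hypothetical, not a measured RoundPipe result): aggregate throughput grows with GPU count while the per-GPU maximum sequence length stays fixed.

```python
# Illustrative linear-scaling arithmetic with assumed numbers
per_gpu_tokens_per_s = 10_000   # hypothetical single-GPU throughput
max_seq_per_gpu = 64 * 1024     # stays constant as GPUs are added

for n_gpus in (1, 2, 4, 8):
    aggregate = n_gpus * per_gpu_tokens_per_s
    print(f"{n_gpus} GPUs: {aggregate} tokens/s, "
          f"max seq/GPU {max_seq_per_gpu}")
```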

Linear scaling

Simple API, flexible training

Sequential programming interface

Zero parallel programming required

Jupyter Notebook friendly

import torch
from roundpipe import RoundPipe, OptimizerCtx

# Any deep neural network, wrapped by RoundPipe
model = RoundPipe(torch.nn.Sequential(layer1, layer2, layer3, ...))

# Any PyTorch optimizer, created inside OptimizerCtx
with OptimizerCtx():
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Any training loop
for data in dataloader:
    loss = model.forward_backward(data)

    # Any parameter update strategy
    def step_fn():
        optimizer.step()
        optimizer.zero_grad()

    model.step(step_fn)
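Passing a closure to `model.step` (instead of calling `optimizer.step()` directly) lets the framework decide when the parameter update actually runs, for example after the last queued micro-batch has drained. A minimal pure-Python mock of that deferred-update pattern (`MockPipelinedModel` is an illustrative stand-in, not RoundPipe's implementation):

```python
class MockPipelinedModel:
    """Toy stand-in for a pipelined model: queues work, defers the update."""

    def __init__(self):
        self._pending = 0  # micro-batches queued but not yet drained

    def forward_backward(self, data):
        self._pending += 1  # pretend to queue one micro-batch of fwd+bwd
        return 0.0          # dummy loss

    def step(self, step_fn):
        # drain all queued micro-batches, then run the user's update once
        while self._pending:
            self._pending -= 1
        step_fn()

calls = []
model = MockPipelinedModel()
for batch in range(4):
    model.forward_backward(batch)
model.step(lambda: calls.append("optimizer.step"))
print(calls)  # the update runs exactly once, after all queued work
```

This is why the training loop above stays sequential: the user writes an ordinary update function, and the scheduler invokes it at the right point.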

Portable by default

Pure PyTorch implementation

Compatible with Nvidia, AMD, Ascend, and more

Write once, train anywhere

Cross-platform