Train Your Large Models
High Performance · Easy to Use · Built for Gaming GPUs
pip install roundpipe
Built for huge models
On a single 24GB GPU, train with 64K+ context length, fully fine-tune 32B models, or LoRA fine-tune models up to 235B.
⚡ High Performance
Push a 4090 close to A800 NVLink-class throughput. Up to 6× faster than FSDP Offload in typical training workloads.
📈 Linear scaling
Scale to multiple GPUs in-node without rewriting your training loop. Throughput grows linearly while your code stays the same.
✨ Feels like PyTorch
A sequential programming interface with a low learning curve. Works well in Jupyter Notebook for rapid iteration.
🧩 General by design
No constraints on layer structure, training flow, or parameter update strategy.
🌍 Portable across accelerators
Pure PyTorch implementation. Runs on Nvidia, AMD, and Ascend accelerator platforms.
Train bigger than ever
64K+ long-context training on a single 24GB GPU
Full fine-tuning for 32B, LoRA for up to 235B
Up to 7× longer sequence lengths than PyTorch FSDP
Extracts maximum performance
A 4090 can reach near A800 NVLink-level throughput
Up to 6× faster than FSDP Offload
As models grow, RoundPipe keeps pulling ahead
Scale out without rewrites
100% automatic multi-GPU scaling within a node
Throughput grows linearly with GPU count
Max sequence length per GPU stays unchanged
Simple API, flexible training
Sequential programming interface
Zero parallel programming
Jupyter Notebook friendly
import torch
from roundpipe import RoundPipe, OptimizerCtx

# Any deep neural network
model = torch.nn.Sequential(layer1, layer2, layer3, ...)

# Any PyTorch optimizer
with OptimizerCtx():
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Any training loop
for data in dataloader:
    loss = model.forward_backward(data)

    # Any parameter update strategy
    def step_fn():
        optimizer.step()
        optimizer.zero_grad()
    model.step(step_fn)
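For reference, the snippet above mirrors a standard single-GPU PyTorch loop. A minimal plain-PyTorch sketch of the same structure, with small hypothetical layers and a toy dataloader standing in for the real model and data, looks like this:

```python
import torch

# Hypothetical stand-in layers (any nn.Module stack works)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy dataloader: a few (input, target) batches
dataloader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(3)]

for x, y in dataloader:
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward
    loss.backward()                                   # backward
    optimizer.step()                                  # parameter update
    optimizer.zero_grad()
```

The RoundPipe version keeps this sequential shape, folding forward and backward into one call and handing the update to a callback instead of running it inline.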
Portable by default
Pure PyTorch implementation
Compatible with Nvidia, AMD, Ascend, and more
Write once, train anywhere