Wrap Model
The first step to training with RoundPipe is wrapping your model into a RoundPipe instance. There are two approaches: for models with built-in presets (e.g., Qwen3, Llama), you can wrap them in one line; for custom models, you convert them to nn.Sequential form and wrap manually.
Using Model Presets
RoundPipe ships with built-in presets for popular large language models, automatically converting them into the Sequential structure required for pipeline execution. See the Model Zoo for the full list of supported models.
Use wrap_model_to_roundpipe() for one-line wrapping:
import torch
from transformers import AutoModelForCausalLM
from roundpipe import wrap_model_to_roundpipe, RoundPipeRunConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B",
    use_cache=False,  # KV cache must be disabled during training
    dtype=torch.float16,
    _attn_implementation="flash_attention_2",
)
model = wrap_model_to_roundpipe(
    model,
    use_sequential_preset=True,  # Force using preset lookup and conversion
    model_run_config=RoundPipeRunConfig(num_microbatch=4),  # Optional: override default run config
    optim_dtype=torch.float32,
    # Additional RoundPipe() constructor arguments can be passed here
)
wrap_model_to_roundpipe() automatically detects the model type. If a matching preset exists, it converts the model into an equivalent Sequential structure and returns a RoundPipe instance. After conversion, the original model's attributes (e.g., model.vocab_size, model.config) remain accessible.
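For example, continuing the Qwen3 snippet above, the original attributes stay reachable on the returned instance:
# Attributes of the underlying model are still accessible after wrapping
print(model.config)       # the original transformers config object
print(model.vocab_size)   # still available after conversion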
Custom Models
RoundPipe can train any deep neural network architecture. To enable correct distributed training with good performance, your model needs to follow a few conventions.
nn.Sequential Representation
The model passed to RoundPipe must be organized as an nn.Sequential. RoundPipe treats each submodule in the Sequential as a model layer for scheduling, so how you partition the layers directly affects training efficiency.
We recommend organizing the model as input adapter + repeated blocks + output adapter. Split points should be at the model's "narrow waists" — where the data passed between layers is smallest. For transformer models, the typical split looks like:
import torch.nn as nn

# Example: a simple transformer model
model = nn.Sequential(
    embedding_layer,       # Input adapter: token ids -> hidden states
    transformer_layer_0,   # Repeated blocks
    transformer_layer_1,
    transformer_layer_2,
    # ...
    transformer_layer_n,
    norm_and_lm_head,      # Output adapter: hidden states -> logits
)
Each transformer layer passes only the hidden-states tensor to the next, which is relatively small — an ideal split point.
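To make the interface concrete, here is a minimal sketch of one repeated block; the layer sizes and structure are illustrative, not part of RoundPipe. The only thing that crosses the split point is the hidden-states tensor:
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Illustrative repeated block: attention weights and MLP activations stay
    # inside the layer; only the hidden-states tensor is passed onward.
    def __init__(self, hidden_size=1024, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, hidden_states):
        # Hidden states in, hidden states out: the "narrow waist"
        attn_out, _ = self.attn(hidden_states, hidden_states, hidden_states)
        hidden_states = self.norm1(hidden_states + attn_out)
        hidden_states = self.norm2(hidden_states + self.mlp(hidden_states))
        return hidden_states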
What to avoid:
# Not recommended: splitting within attention and MLP internals
model = nn.Sequential(
    layer_0_qkv_proj,
    layer_0_attention,
    layer_0_out_proj,
    layer_0_mlp_up_proj,
    layer_0_mlp_down_proj,
    layer_1_qkv_proj,
    # ...
)
While this works, it causes large activation tensors to be transferred between layers, increasing inter-GPU data transfer overhead and hurting training efficiency.
Forward Function Checklist
Variable Access Restrictions
Because RoundPipe executes multiple layers in parallel across GPUs, there are restrictions on variable access inside forward():
| Operation | Global variables | Regular instance variables | Parameters | Standalone module buffers | Shared module buffers |
|---|---|---|---|---|---|
| Read | ✅ | ✅ | ✅ | ✅ | ✅ |
| Write | ❌ | ❌ | ❌ | ✅ | ❌ |
What each category means:
- Global variables: Variables defined outside the model. Multiple layers may run in different threads simultaneously, so writing to globals causes data races.
- Regular instance variables: Accessed via self.xxx but not wrapped with nn.Parameter or registered via register_buffer. Subject to the same concurrency risk.
- Parameters: Model parameters (nn.Parameter). RoundPipe manages their CPU-GPU transfers; they are read-only during forward execution. Parameter sharing within the same RoundPipe instance is supported, but not across different RoundPipe instances.
- Standalone module buffers: Registered via register_buffer and belonging to a single layer (not shared across submodules of nn.Sequential). RoundPipe transfers them alongside parameters, so they can be written safely.
- Shared module buffers: Registered via register_buffer but shared across multiple layers. Since they may be accessed concurrently by different layers, writing is not safe.
Example:
import torch
import torch.nn as nn

class MyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(256, 256)
        self.register_buffer('acc', torch.zeros(256))  # Standalone buffer, safe to write
        self.call_count = 0  # Regular instance variable — do NOT write!

    def forward(self, x):
        # ✅ Correct: read parameters, write to standalone buffer
        out = self.linear(x)
        self.acc.add_(out.mean(dim=0).detach())

        # ❌ Wrong: writing to a regular instance variable
        # self.call_count += 1  # Data race!

        return out
Temporary Tensor Device Placement
Temporary tensors created in forward() must not hard-code a device. Infer the device from inputs or weights instead:
class MyBlock(nn.Module):
    def forward(self, x):
        # ❌ Wrong: hard-coded device
        # mask = torch.ones(x.shape[0], device='cuda:0')

        # ✅ Correct: infer device from input
        mask = torch.ones(x.shape[0], device=x.device)

        # ✅ Correct: infer device from weights
        bias = torch.zeros(self.linear.weight.shape[0],
                           device=self.linear.weight.device)
        return x
RoundPipe schedules different layers on different GPUs. Hard-coding cuda:0 will create the tensor on the wrong device.
Wrapping the Model
Once your nn.Sequential model is ready, wrap it with RoundPipe():
import torch
from roundpipe import RoundPipe, RoundPipeRunConfig

model = RoundPipe(
    model=my_sequential_model.to(torch.float16),
    optim_dtype=torch.float32,
    model_run_config=RoundPipeRunConfig(num_microbatch=4),
    pin_model="alloc",
)
Parameter reference:
model_run_config sets the model-level default run configuration. See RoundPipeRunConfig Tuning for details.
pin_model controls the page-locking strategy for model parameters in CPU memory, which affects CPU-to-GPU transfer performance:
| Option | Description | When to use |
|---|---|---|
| "alloc" | Allocates pinned memory via PyTorch's pin_memory | Default. Best transfer performance, but CPU memory usage may roughly double (PyTorch aligns allocations to powers of 2) |
| "register" | Pins existing memory via cudaHostRegister | NVIDIA GPUs only. Useful for LoRA fine-tuning of large models when CPU memory is tight. ~10% slower transfers |
| "off" | No pinned memory | For LoRA fine-tuning of very large models (e.g., 235B) that exceed CPU memory, used together with mmap loading |
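For example, a LoRA fine-tune of a large model on a machine with tight CPU memory might use "register" instead of the default. This is only a sketch: it is the same constructor call shown above with a different pin_model value.
# Sketch: pin existing CPU memory (NVIDIA GPUs only) instead of allocating a pinned copy
model = RoundPipe(
    model=my_sequential_model.to(torch.float16),
    optim_dtype=torch.float32,
    model_run_config=RoundPipeRunConfig(num_microbatch=4),
    pin_model="register",
)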
optim_dtype specifies the data type for optimizer parameters. A typical setup uses torch.float16 for model parameters (saving VRAM and transfer bandwidth) and torch.float32 for optimizer parameters (preserving numerical stability). If omitted, it defaults to the model parameter type.
Automatic Model Wrapping
Experimental Feature
Automatic splitting is experimental. It does not support the fused forward_backward() and incurs a performance penalty.
For complex models that you prefer not to convert manually to Sequential form, you can try the automatic splitting mode of wrap_model_to_roundpipe().
We strongly recommend writing the Sequential conversion manually for better performance and full feature support (including forward_backward()). If the model is a well-known open-source model not yet in the preset list, consider opening an issue or PR to add a preset.
import torch
from roundpipe import wrap_model_to_roundpipe

model = wrap_model_to_roundpipe(
    model,
    use_sequential_preset=False,  # Skip preset lookup; use automatic splitting
    optim_dtype=torch.float32,
)
Automatic splitting recursively walks the model's submodule tree and decides how to wrap each module based on parameter-size thresholds. If the model ultimately cannot be split into Sequential form, it returns an AutoRoundPipe instance. This instance can still use RoundPipe's forward pass and optimizer features, but cannot use the fused forward_backward() and does not benefit from RoundPipe's pipeline scheduling optimizations.
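Because the return type depends on whether splitting succeeded, it can be worth checking what came back before relying on forward_backward(). A minimal sketch; checking the class name avoids assuming where AutoRoundPipe can be imported from:
wrapped = wrap_model_to_roundpipe(model, use_sequential_preset=False)

if type(wrapped).__name__ == "AutoRoundPipe":
    # Fallback path: forward pass and optimizer still work, but the fused
    # forward_backward() and pipeline scheduling optimizations are unavailable.
    print("automatic wrapping fell back to AutoRoundPipe")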