Custom Execution Plans
What Is an Execution Plan
An execution plan (ModelExecutePlan) defines how model layers are grouped and ordered during forward and backward passes — essentially, it determines which layers belong to the same pipeline stage.
An execution plan has two attributes:
fwd_plan: Forward pass grouping. Type:List[range], where eachrangespecifies the layer indices in one stage.bwd_plan: Backward pass grouping. Type:List[range].
The Concept of Stages
A stage is the basic scheduling unit for the pipeline. All layers in a stage are uploaded to a GPU together and executed sequentially. Stage design directly affects training efficiency:
- Stages too large: A single stage's parameters and activations consume too much VRAM, potentially causing OOM.
- Stages unbalanced: The slowest stage becomes the bottleneck, and other GPUs sit idle.
RoundPipe's asymmetric partitioning allows forward and backward passes to use different stage layouts. Since forward computation takes roughly 1/3 the time of the backward pass, forward stages typically contain more layers and backward stages fewer, balancing execution time across stages.
Example: An execution plan for a 4-layer model:
from roundpipe import ModelExecutePlan
plan = ModelExecutePlan()
# Execution plan when using forward()
plan.fwd_plan = [range(0, 2), range(2, 4)] # Forward: 2 stages, 2 layers each
plan.bwd_plan = [range(3, 4), range(2, 3), # Backward: 4 stages, 1 layer each
range(1, 2), range(0, 1)]
When using forward_backward(), the first backward stage fuses part of the forward computation (avoiding redundant recomputation), so the first backward stage's layers should not overlap with the forward plan:
# Execution plan when using forward_backward()
plan = ModelExecutePlan()
plan.fwd_plan = [range(0, 3)] # Forward covers only the first 3 layers
plan.bwd_plan = [range(3, 4), range(2, 3), # First backward stage starts at layer 4
range(1, 2), range(0, 1)]
Automatic Tuning
In most cases, you don't need to build an execution plan manually. ModelExecutePlan.auto() generates a near-optimal partition based on actual per-layer computation times and memory usage.
from roundpipe import ModelExecutePlan, RoundPipeRunConfig
# Auto-generate the plan (run a few iterations first to collect timing data)
plan = ModelExecutePlan.auto("fused", model)
# Use the generated plan
loss = model.forward_backward(
input_args=(data,),
label=labels,
loss_fn=loss_fn,
run_config=RoundPipeRunConfig(execute_plan=plan),
)
auto() parameters:
run_type: Execution mode."infer": Forward-only inference."train": Separate forward and backward (training based onforward())."fused": Fused forward-backward (training based onforward_backward()— the most common choice).
min_stages: Minimum number of stages. Defaults to the GPU count. More stages mean fewer pipeline bubbles but smaller stages.upper_threshold: Load-balancing tolerance. Defaults to 1.1, meaning a stage is allowed to take up to 1.1x the time of the longest individual layer. Increasing this allows more flexible partitions but may increase memory usage.model_memory_limit: Estimated available GPU memory (GB). Defaults to 60% of the smallest GPU's VRAM. Because RoundPipe prefetches one stage's parameters, each stage's memory limit is half this value.
How Auto-Tuning Works
auto() optimizes based on:
- Timing data: RoundPipe automatically measures per-layer forward, backward, and recomputation times during execution, using a moving average. On the first run, a default partition is used; subsequent iterations can regenerate a better plan based on actual timings.
- Memory constraints: Ensures each stage's total parameter and gradient size stays within the memory limit.
Joint optimization across multiple models:
If training involves multiple RoundPipe models (e.g., encoder + decoder), pass them all to auto() for joint optimization:
plan1, plan2 = ModelExecutePlan.auto("fused", model1, model2)
Manual Execution Plans
When to Use
- Auto-tuning results are unsatisfactory (e.g., unstable per-layer timings).
- You need precise control over each stage's memory footprint.
- Debugging or profiling with a fixed partition.
Goal
The core objective when building a manual plan is to balance execution time across stages. The slowest stage determines the pipeline's throughput; idle time in other stages is wasted.
How to Build a Plan
-
Enable verbose timing to get per-layer measurements:
from roundpipe.timer import ModelTimer ModelTimer.VERBOSE = True # Run a few iterations; per-layer fwd/re/bwd times will be printed to stderr -
Group adjacent layers so that each group's total time is roughly equal:
plan = ModelExecutePlan() # Suppose: 24 transformer layers + 1 lm_head layer # Forward: each layer ≈ 2 ms, lm_head ≈ 6 ms # Partition into 4 stages, each ≈ 14 ms plan.fwd_plan = [ range(0, 7), # layers 0-6: 7×2 = 14 ms range(7, 14), # layers 7-13: 7×2 = 14 ms range(14, 21), # layers 14-20: 7×2 = 14 ms range(21, 24), # layers 21-24: 3×2 + 6 = 12 ms (includes lm_head) ] # Backward is similar, but each layer takes ≈ 6 ms (3× forward) plan.bwd_plan = [ range(24, 25), # lm_head backward range(22, 24), range(20, 22), # ... range(0, 2), ]
Validation Rules
An execution plan must satisfy these conditions, or RoundPipe will raise an error:
- The union of all forward
ranges must cover every layer exactly once (0 to L-1). - The same applies to the backward plan.
- Layer indices in the forward plan must be in ascending order (shallow to deep).
- Layer indices in the backward plan must be in descending order (deep to shallow).
- When using
forward_backward(), the last layer of the forward plan + 1 must equal the first layer of the first backward stage — meaning the final layers participate only in backward (their forward computation is treated as recomputation).