Custom Execution Plans

What Is an Execution Plan

An execution plan (ModelExecutePlan) defines how model layers are grouped and ordered during forward and backward passes — essentially, it determines which layers belong to the same pipeline stage.

An execution plan has two attributes:

fwd_plan: Forward pass grouping. Type: List[range], where each range specifies the layer indices in one stage.
bwd_plan: Backward pass grouping. Type: List[range].

The Concept of Stages

A stage is the basic scheduling unit for the pipeline. All layers in a stage are uploaded to a GPU together and executed sequentially. Stage design directly affects training efficiency:

Stages too large: A single stage's parameters and activations consume too much VRAM, potentially causing OOM.
Stages unbalanced: The slowest stage becomes the bottleneck, and other GPUs sit idle.

RoundPipe's asymmetric partitioning allows forward and backward passes to use different stage layouts. Since forward computation takes roughly 1/3 the time of the backward pass, forward stages typically contain more layers and backward stages fewer, balancing execution time across stages.
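The effect of asymmetric partitioning can be illustrated with a small calculation. The sketch below does not use RoundPipe at all; the per-layer timings are made-up values that follow the rough 1:3 forward-to-backward ratio described above, and `stage_times` and `imbalance` are hypothetical helpers introduced only for this illustration.

```python
# Hypothetical per-layer timings in ms (illustrative, not measured):
# backward is assumed to cost about 3x forward.
FWD_MS = [2.0, 2.0, 2.0, 2.0]
BWD_MS = [6.0, 6.0, 6.0, 6.0]


def stage_times(plan, per_layer_ms):
    """Sum the per-layer times of each stage (one range per stage)."""
    return [sum(per_layer_ms[i] for i in stage) for stage in plan]


def imbalance(times):
    """Ratio of slowest to fastest stage; 1.0 means perfectly balanced."""
    return max(times) / min(times)


fwd_plan = [range(0, 2), range(2, 4)]

# Same grouping for backward as for forward (2 layers per stage).
symmetric_bwd = [range(2, 4), range(0, 2)]
# Smaller backward stages (1 layer per stage).
asymmetric_bwd = [range(3, 4), range(2, 3), range(1, 2), range(0, 1)]

symmetric = stage_times(fwd_plan, FWD_MS) + stage_times(symmetric_bwd, BWD_MS)
asymmetric = stage_times(fwd_plan, FWD_MS) + stage_times(asymmetric_bwd, BWD_MS)

print("symmetric :", symmetric, "imbalance", imbalance(symmetric))
print("asymmetric:", asymmetric, "imbalance", imbalance(asymmetric))
```

With these assumed numbers, the symmetric layout mixes 4 ms and 12 ms stages, while the asymmetric layout narrows the spread to 4 ms and 6 ms.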

Example: An execution plan for a 4-layer model:

from roundpipe import ModelExecutePlan

plan = ModelExecutePlan()

# Execution plan when using forward()
plan.fwd_plan = [range(0, 2), range(2, 4)] # Forward: 2 stages, 2 layers each
plan.bwd_plan = [range(3, 4), range(2, 3), # Backward: 4 stages, 1 layer each
                 range(1, 2), range(0, 1)]

When using forward_backward(), the first backward stage fuses part of the forward computation (avoiding redundant recomputation), so the first backward stage's layers should not overlap with the forward plan:

# Execution plan when using forward_backward()
plan = ModelExecutePlan()
plan.fwd_plan = [range(0, 3)]              # Forward covers only the first 3 layers
plan.bwd_plan = [range(3, 4), range(2, 3), # First backward stage is layer index 3 (the 4th layer)
                 range(1, 2), range(0, 1)]

Automatic Tuning

In most cases, you don't need to build an execution plan manually. ModelExecutePlan.auto() generates a near-optimal partition based on actual per-layer computation times and memory usage.

from roundpipe import ModelExecutePlan, RoundPipeRunConfig

# Auto-generate the plan (run a few iterations first to collect timing data)
plan = ModelExecutePlan.auto("fused", model)

# Use the generated plan
loss = model.forward_backward(
    input_args=(data,),
    label=labels,
    loss_fn=loss_fn,
    run_config=RoundPipeRunConfig(execute_plan=plan),
)

auto() parameters:

run_type: Execution mode.
- "infer": Forward-only inference.
- "train": Separate forward and backward (training based on forward()).
- "fused": Fused forward-backward (training based on forward_backward() — the most common choice).
min_stages: Minimum number of stages. Defaults to the GPU count. More stages mean fewer pipeline bubbles but smaller stages.
upper_threshold: Load-balancing tolerance. Defaults to 1.1, meaning a stage is allowed to take up to 1.1x the time of the longest individual layer. Increasing this allows more flexible partitions but may increase memory usage.
model_memory_limit: Estimated available GPU memory (GB). Defaults to 60% of the smallest GPU's VRAM. Because RoundPipe prefetches one stage's parameters, each stage's memory limit is half this value.
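To make the defaults above concrete, the following sketch derives the two budgets they imply: the per-stage time cap from upper_threshold and the per-stage memory limit from model_memory_limit. It is plain arithmetic based on the parameter descriptions, not a call into RoundPipe; `stage_budgets` and its arguments are hypothetical names, and the layer timings and VRAM size are example values.

```python
def stage_budgets(layer_times_ms, smallest_gpu_vram_gb,
                  upper_threshold=1.1, model_memory_limit=None):
    """Return (time_cap_ms, per_stage_memory_gb) implied by the documented defaults.

    Hypothetical helper: it mirrors the parameter descriptions, not RoundPipe's code.
    """
    if model_memory_limit is None:
        # Documented default: 60% of the smallest GPU's VRAM.
        model_memory_limit = 0.6 * smallest_gpu_vram_gb
    # A stage may take up to upper_threshold x the longest individual layer.
    time_cap_ms = upper_threshold * max(layer_times_ms)
    # One stage is prefetched while another runs, so each stage gets half.
    per_stage_memory_gb = model_memory_limit / 2
    return time_cap_ms, per_stage_memory_gb


# Example values (assumed): longest layer takes 6 ms, smallest GPU has 80 GB.
time_cap, stage_mem = stage_budgets([2.0, 2.0, 6.0], smallest_gpu_vram_gb=80)
print(f"time cap per stage: {time_cap:.1f} ms")
print(f"memory per stage:   {stage_mem:.1f} GB")
```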

How Auto-Tuning Works

auto() optimizes based on:

Timing data: RoundPipe automatically measures per-layer forward, backward, and recomputation times during execution, using a moving average. On the first run, a default partition is used; subsequent iterations can regenerate a better plan based on actual timings.
Memory constraints: Ensures each stage's total parameter and gradient size stays within the memory limit.
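The two inputs above can be sketched in a few lines. The text does not say which kind of moving average RoundPipe uses, so the example assumes an exponential moving average with an arbitrary smoothing factor; `LayerTimeTracker`, `stage_fits`, and all numbers are hypothetical and only illustrate the idea.

```python
class LayerTimeTracker:
    """Keeps a smoothed per-layer time (assumed: exponential moving average)."""

    def __init__(self, num_layers, alpha=0.2):
        self.alpha = alpha                  # weight of the newest measurement
        self.avg_ms = [None] * num_layers   # None until a layer has been timed

    def update(self, layer, measured_ms):
        prev = self.avg_ms[layer]
        if prev is None:
            # First measurement: nothing to smooth against yet.
            self.avg_ms[layer] = measured_ms
        else:
            self.avg_ms[layer] = (1 - self.alpha) * prev + self.alpha * measured_ms
        return self.avg_ms[layer]


def stage_fits(stage, param_gb, grad_gb, stage_limit_gb):
    """True if the stage's parameters plus gradients stay within the limit."""
    used = sum(param_gb[i] + grad_gb[i] for i in stage)
    return used <= stage_limit_gb


tracker = LayerTimeTracker(num_layers=2, alpha=0.5)
tracker.update(0, 4.0)            # first iteration
smoothed = tracker.update(0, 2.0) # later iteration pulls the average down
print("smoothed time for layer 0:", smoothed)

print("stage fits:", stage_fits(range(0, 2), [1.0, 1.0], [1.0, 1.0], stage_limit_gb=4.0))
```

Smoothing is what makes regenerating the plan after a few iterations worthwhile: a single noisy measurement shifts the average only partially.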

Joint optimization across multiple models:

If training involves multiple RoundPipe models (e.g., encoder + decoder), pass them all to auto() for joint optimization:

plan1, plan2 = ModelExecutePlan.auto("fused", model1, model2)

Manual Execution Plans

When to Use

Auto-tuning results are unsatisfactory (e.g., unstable per-layer timings).
You need precise control over each stage's memory footprint.
You are debugging or profiling and want a fixed partition.

Goal

The core objective when building a manual plan is to balance execution time across stages. The slowest stage determines the pipeline's throughput; idle time in other stages is wasted.

How to Build a Plan

Enable verbose timing to get per-layer measurements:

from roundpipe.timer import ModelTimer
ModelTimer.VERBOSE = True
# Run a few iterations; per-layer fwd/re/bwd times will be printed to stderr

Group adjacent layers so that each group's total time is roughly equal:

plan = ModelExecutePlan()
# Suppose: 24 transformer layers + 1 lm_head layer
# Forward: each layer ≈ 2 ms, lm_head ≈ 6 ms
# Partition into 4 stages, each ≈ 14 ms
plan.fwd_plan = [
    range(0, 7),    # layers 0-6: 7×2 = 14 ms
    range(7, 14),   # layers 7-13: 7×2 = 14 ms
    range(14, 21),  # layers 14-20: 7×2 = 14 ms
    range(21, 25),  # layers 21-24: 3×2 + 6 = 12 ms (includes lm_head)
]
# Backward is similar, but each layer takes ≈ 6 ms (3× forward)
plan.bwd_plan = [
    range(24, 25),  # lm_head backward
    range(22, 24),
    range(20, 22),
    # ...
    range(0, 2),
]
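The grouping step can also be done mechanically. The sketch below is one simple greedy approach, not RoundPipe's algorithm: it walks the layers in order and closes a stage as soon as adding the next layer would exceed a target time. `group_adjacent` is a hypothetical helper, the timings are the illustrative ones from the example (25 layers, indices 0 to 24), and the 18 ms lm_head backward time is an assumption (3× its forward time).

```python
def group_adjacent(times_ms, target_ms, reverse=False):
    """Greedily group adjacent layers so each stage stays at or below target_ms.

    A layer that alone exceeds the target gets a stage of its own.
    With reverse=True, stages are built from the deepest layer to the shallowest.
    """
    n = len(times_ms)
    order = range(n - 1, -1, -1) if reverse else range(n)
    stages, current, total = [], [], 0.0
    for i in order:
        if current and total + times_ms[i] > target_ms:
            stages.append(current)
            current, total = [], 0.0
        current.append(i)
        total += times_ms[i]
    if current:
        stages.append(current)
    # Each stage is a run of adjacent indices, so it maps to a single range.
    return [range(min(s), max(s) + 1) for s in stages]


# Illustrative timings: 24 transformer layers plus lm_head at index 24.
fwd_ms = [2.0] * 24 + [6.0]
bwd_ms = [6.0] * 24 + [18.0]   # assumption: backward costs 3x forward

fwd_plan = group_adjacent(fwd_ms, target_ms=14.0)
bwd_plan = group_adjacent(bwd_ms, target_ms=12.0, reverse=True)

print(fwd_plan)
print(bwd_plan[:3], "...", bwd_plan[-1])
```

The resulting lists have the same List[range] shape as fwd_plan and bwd_plan, so they can be assigned to a ModelExecutePlan and then adjusted by hand where measured timings are uneven.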

Validation Rules

An execution plan must satisfy these conditions, or RoundPipe will raise an error:

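The forward ranges must cover each layer exactly once, with no gaps or overlaps: layers 0 to L-1 when using forward(), or layer 0 up to (but not including) the first backward stage when using forward_backward().
The backward ranges must cover every layer exactly once (0 to L-1) in both modes.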
Stages in the forward plan must be listed in ascending layer order (shallow to deep).
Stages in the backward plan must be listed in descending layer order (deep to shallow); each range itself is still written low to high, for example range(22, 24).
When using forward_backward(), the last layer of the forward plan + 1 must equal the first layer of the first backward stage — meaning the final layers participate only in backward (their forward computation is treated as recomputation).
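These rules can be checked before handing a plan to RoundPipe. The following is a minimal sketch written from the rules as stated here; it is not RoundPipe's own validator, and `check_plan` and `_walk` are hypothetical helpers that operate on plain lists of ranges.

```python
def _walk(plan, start, ascending):
    """Follow contiguous stages from `start`.

    Returns the boundary reached, or None if a stage is empty,
    has a step other than 1, or leaves a gap or overlap.
    """
    pos = start
    for stage in plan:
        if len(stage) == 0 or stage.step != 1:
            return None
        entry, leave = (stage.start, stage.stop) if ascending else (stage.stop, stage.start)
        if entry != pos:
            return None
        pos = leave
    return pos


def check_plan(fwd_plan, bwd_plan, num_layers, fused=False):
    """Return a list of rule violations (empty if the plan passes this sketch)."""
    errors = []
    fwd_end = _walk(fwd_plan, 0, ascending=True)
    bwd_end = _walk(bwd_plan, num_layers, ascending=False)
    if fwd_end is None:
        errors.append("forward stages must be contiguous and ascending from layer 0")
    if bwd_end != 0:
        errors.append("backward stages must cover layers L-1 down to 0, deep to shallow")
    if fwd_end is not None:
        if fused:
            # forward_backward(): forward stops where the first backward stage begins.
            first_bwd_start = bwd_plan[0].start if bwd_plan else None
            if fwd_end != first_bwd_start:
                errors.append("forward plan must end where the first backward stage begins")
        elif fwd_end != num_layers:
            errors.append("forward plan must cover every layer when using forward()")
    return errors


bwd = [range(3, 4), range(2, 3), range(1, 2), range(0, 1)]
print(check_plan([range(0, 2), range(2, 4)], bwd, num_layers=4))          # forward() plan
print(check_plan([range(0, 3)], bwd, num_layers=4, fused=True))           # forward_backward() plan
print(check_plan([range(0, 2), range(2, 4)], bwd, num_layers=4, fused=True))
```

Returning a list of messages rather than raising on the first problem makes it easier to fix several mistakes in a hand-written plan at once.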