In the Sora Era, do you still need to spend money on NVLink and high-bandwidth networks to serve long-context Diffusion Models? With PipeFusion, PCIe and Ethernet are enough!
The project provides a suite of efficient parallel inference approaches for Diffusion Models. The backbone networks of diffusion models are primarily U-Net and Transformers (DiT). All of the following methods can be applied to DiT, and some can also be used for U-Net.
- Tensor Parallelism. (DiT, U-Net)
- Sequence Parallelism. USP is a unified sequence parallel approach combining DeepSpeed-Ulysses and Ring-Attention. (DiT)
- Displaced Patch Parallelism, named DistriFusion. (DiT, U-Net)
- Displaced Patch Pipeline Parallelism, named PipeFusion, first proposed in this repo. (DiT)
The communication and memory costs of the above parallel approaches for DiT are listed in the following table. (* indicates that comm. can be hidden by computation, but needs extra buffers.)
|  | attn-KV | communication cost | param | activations | extra buff |
|---|---|---|---|---|---|
| Tensor Parallel | fresh |  |  |  |  |
| DistriFusion* | stale |  |  |  |  |
| Ring Seq Parallel* | fresh | NA |  |  |  |
| Ulysses Seq Parallel | fresh |  |  |  |  |
| PipeFusion* | stale- |  |  |  |  |
The Latency on 4xA100 (PCIe)
The Latency on 8xL20 (PCIe)
The Latency on 8xA100 (NVLink)
Best Practices:
- PipeFusion is preferable for both memory and communication efficiency. It does not require high inter-GPU bandwidth such as NVLink, and therefore achieves the lowest latency on PCIe clusters. On NVLink, however, its advantage is diminished.
- DistriFusion is fast on NVLink, but at the cost of high overall memory usage, and therefore runs out of memory (OOM) for high-resolution images.
- PipeFusion and Tensor Parallelism are able to generate high-resolution images because they split both parameters and activations. Tensor Parallelism is fast on NVLink, while PipeFusion is fast on PCIe.
- Sequence Parallelism is usually faster than Tensor Parallelism, but runs out of memory for high-resolution images.
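The guidance above can be condensed into a small decision helper. This is purely illustrative: the function name, arguments, and return labels are our own, not part of this repo's API.

```python
def choose_parallelism(interconnect: str, high_resolution: bool) -> str:
    """Pick a parallel approach per the best practices above (illustrative only).

    interconnect:     "pcie" or "nvlink"
    high_resolution:  True if replicated parameters/activations would not fit
    """
    if interconnect == "pcie":
        # PipeFusion needs no high inter-GPU bandwidth and splits both
        # parameters and activations, so it wins on PCIe at any resolution.
        return "pipefusion"
    if high_resolution:
        # DistriFusion and Sequence Parallelism replicate parameters and so
        # OOM here; Tensor Parallelism splits params and activations and is
        # fast on NVLink.
        return "tensor"
    # On NVLink at moderate resolution, Sequence Parallelism is usually
    # faster than Tensor Parallelism.
    return "sequence"

print(choose_parallelism("pcie", True))     # pipefusion
print(choose_parallelism("nvlink", True))   # tensor
print(choose_parallelism("nvlink", False))  # sequence
```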
As shown in the above table, PipeFusion significantly reduces memory usage and required communication bandwidth, and it also hides communication overhead under computation. It is the best parallel approach for DiT inference hosted on GPUs connected via PCIe.
The above picture compares DistriFusion and PipeFusion. (a) DistriFusion replicates DiT parameters on two devices. It splits an image into 2 patches and employs asynchronous allgather for activations of every layer. (b) PipeFusion shards DiT parameters across two devices. It splits an image into 4 patches and employs asynchronous P2P for activations across the two devices.
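As a back-of-the-envelope illustration of why this matters, consider the activation traffic one device handles per diffusion step under each scheme. This is a toy model with our own variable names, ignoring constants and KV buffers; its only purpose is to show that DistriFusion's traffic scales with the layer count L while PipeFusion's does not.

```python
def distrifusion_comm_per_device(act: float, n: int, layers: int) -> float:
    """Per diffusion step, each device allgathers the other N-1 devices'
    activation patches for every one of the L layers."""
    return (n - 1) / n * act * layers

def pipefusion_comm_per_device(act: float) -> float:
    """A PipeFusion stage only communicates activations at its pipeline
    boundary: over one diffusion step it receives all input patches (one
    layer's worth of activations in total) and sends its outputs,
    independent of the network depth L."""
    return 2 * act

# One layer's activation normalized to 1.0, N = 4 devices, L = 28 layers.
print(distrifusion_comm_per_device(1.0, 4, 28))  # 21.0
print(pipefusion_comm_per_device(1.0))           # 2.0
```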
PipeFusion partitions an input image into M non-overlapping patches and partitions the DiT network into N stages, each assigned to one computational device.
The PipeFusion pipeline workflow when M = N = 4 is shown in the following picture.
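The displaced-patch schedule can be simulated in a few lines of Python. This is our own toy sketch, not the repo's scheduler: device d holds stage d of the DiT, and at micro-step t it works on patch (t - d) mod M. Here the first diffusion step's fill-in bubbles are shown as None; in PipeFusion proper, stale activations from the previous diffusion step let subsequent steps avoid these bubbles.

```python
M = N = 4  # M image patches, N pipeline stages (one per device)
steps = 8  # micro-steps to simulate

# schedule[t][d] = patch processed by device/stage d at micro-step t
schedule = []
for t in range(steps):
    row = []
    for d in range(N):
        if t < d:
            row.append(None)         # pipeline still filling (warmup only)
        else:
            row.append((t - d) % M)  # patches stream through the stages
    schedule.append(row)

for t, row in enumerate(schedule):
    print(t, row)
```

After N - 1 warmup micro-steps every device is busy: at t = 3 the devices hold patches [3, 2, 1, 0], and the pattern then cycles.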
- Install long-context-attention to use sequence parallelism.
- Install pipefusion from the local source:

```shell
python setup.py install
```
- Usage Example: In `./scripts/pixart_example.py`, we provide a minimal script for running DiT with PipeFusion.
```python
import torch

from pipefuser.pipelines import DistriPixArtAlphaPipeline
from pipefuser.utils import DistriConfig
from pipefuser.modules.opt.chunk_conv2d import PatchConv2d

# parallelism: choose from ["patch", "naive_patch", "pipeline", "tensor"]
distri_config = DistriConfig(
    parallelism="pipeline",
)

pipeline = DistriPixArtAlphaPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path=args.model_id,
)

# use the following patch for memory-efficient VAE
# PatchConv2d(1024)(pipeline)

pipeline.set_progress_bar_config(disable=distri_config.rank != 0)

output = pipeline(
    prompt="An astronaut riding a green horse",
    generator=torch.Generator(device="cuda").manual_seed(42),
    num_inference_steps=20,
    output_type="pil",
)
if distri_config.rank == 0:
    output.save("astronaut.png")
```

You can adapt ./scripts/benchmark.sh to benchmark latency and memory usage of different parallel approaches.
To conduct the FID experiment, follow the detailed instructions provided in the assets/doc/FID.md documentation.
- Memory-Efficient VAE:
The VAE decode implementation from diffusers cannot be applied to high-resolution images (e.g., 8192px) because of a CUDA memory spike issue (diffusers/issues/5924). We fixed the issue by splitting a conv operator into multiple smaller ones and executing them sequentially to reduce the peak memory.
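The idea can be sketched in NumPy. This is a simplified single-channel illustration of the split-and-run-sequentially trick, not the repo's PatchConv2d implementation: the input is convolved in row chunks (with a small halo of overlapping rows), so only one chunk's intermediate is alive at a time, while the stitched result matches the full convolution exactly.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2D convolution, single channel."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def chunked_conv2d(x, k, n_chunks=4):
    """Run the same convolution over row chunks, sequentially.

    Each chunk needs kh - 1 extra input rows (the halo) so the stitched
    output is identical to the monolithic convolution, but peak memory
    for intermediates shrinks to roughly 1/n_chunks of the full conv.
    """
    kh = k.shape[0]
    out_rows = x.shape[0] - kh + 1
    bounds = np.linspace(0, out_rows, n_chunks + 1, dtype=int)
    pieces = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # input rows lo .. hi + kh - 2 produce output rows lo .. hi - 1
        pieces.append(conv2d(x[lo:hi + kh - 1], k))
    return np.vstack(pieces)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
k = rng.standard_normal((3, 3))
assert np.allclose(conv2d(x, k), chunked_conv2d(x, k))
```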
```bibtex
@article{wang2024pipefusion,
  title={PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models},
  author={Jiannan Wang and Jiarui Fang and Aoyu Li and PengCheng Yang},
  year={2024},
  eprint={2405.07719},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Our code is developed based on distrifuser from MIT-HAN-LAB.





