# DiffSynth-Engine Architecture Diagram

This Mermaid diagram shows the overall architecture and data flow of DiffSynth-Engine, a high-performance inference engine for diffusion models.

```mermaid
graph TB
%% Input Layer
A[User Input: Prompt, Image, Parameters] --> B[Configuration]
B --> B1[Pipeline Config<br/>FluxPipelineConfig/SDXLPipelineConfig/etc.]

%% Model Fetching & Loading
B1 --> C[Model Fetching System]
C --> C1[fetch_model<br/>HuggingFace/CivitAI/ModelScope]
C1 --> C2[State Dict Loading<br/>SafeTensors/GGUF]
C2 --> C3[Model Conversion<br/>Diffusers → DiffSynth Format]

%% Pipeline Factory
C3 --> D{Pipeline Type}
D -->|Flux| E[FluxImagePipeline]
D -->|SDXL| F[SDXLImagePipeline]
D -->|SD| G[SDImagePipeline]
D -->|Video| H[WanVideoPipeline]
D -->|Qwen Image| I[QwenImagePipeline]

%% Main Pipeline Flow (using Flux as example)
E --> J[Model Initialization]

%% Text Processing
J --> K[Text Processing]
K --> K1[CLIPTokenizer + T5TokenizerFast]
K1 --> K2[FluxTextEncoder1 + FluxTextEncoder2]
K2 --> K3[Text Embeddings]

%% Image Processing (if img2img)
J --> L[Image Processing]
L --> L1[Image Preprocessing]
L1 --> L2[FluxVAEEncoder]
L2 --> L3[Latent Space]

%% Noise & Sampling Setup
J --> M[Noise & Sampling Setup]
M --> M1[Noise Generation + Dynamic Shifting]
M --> M2[RecifitedFlowScheduler → Timesteps]
M --> M3[FlowMatchEulerSampler → Strategy]

%% Core Denoising Loop
K3 --> N[Core Denoising Loop]
L3 --> N
M1 --> N
M2 --> N
M3 --> N

N --> N1[FluxDiT Transformer<br/>+ ControlNet/IP-Adapter]
N1 --> N2[Noise Prediction]
N2 --> N3[Sampler Step]
N3 --> N4{More Steps?}
N4 -->|Yes| N1
N4 -->|No| O[Final Latents]

%% Image Decoding
O --> P[FluxVAEDecoder]
P --> Q[Generated Image]

%% Performance Optimizations
R[Performance Features]
R --> R1[Memory Management<br/>CPU/GPU Offloading<br/>Sequential Offloading]
R --> R2[Parallel Processing<br/>Tensor/Sequence Parallel<br/>CFG Parallel]
R --> R3[Quantization<br/>FP8/GGUF Support<br/>Model Compilation]

R1 --> J
R2 --> J
R3 --> J

%% Model Customization
S[Model Customization]
S --> S1[LoRA Support<br/>Fused/Unfused Loading]
S --> S2[Conditioning<br/>IP-Adapter/Redux]
S --> S3[Control<br/>ControlNet/Inpainting]

S1 --> N1
S2 --> N1
S3 --> N1

%% Tools & Extensions
T[Tools & Extensions]
T --> T1[FluxInpaintingTool]
T --> T2[FluxOutpaintingTool]
T --> T3[FluxReferenceTools]
T --> T4[FluxReplaceTool]

T --> E

%% Algorithm Foundation
U[Algorithm Foundation]
U --> U1[Noise Schedulers<br/>Beta/DDIM/Exponential/Karras]
U --> U2[Samplers<br/>Euler/DPM++/DDPM/FlowMatch]

U1 --> M
U2 --> M

style A fill:#e1f5fe
style Q fill:#c8e6c9
style N1 fill:#fff3e0
style E fill:#f3e5f5
style C fill:#fce4ec
style R fill:#e8f5e8
style S fill:#fff8e1
```

## Architecture Overview

The DiffSynth-Engine follows a modular architecture with these key components:

### 1. **Pipeline Layer**
- **FluxImagePipeline**: Primary image generation pipeline using Flux models
- **SDXLImagePipeline**: Stable Diffusion XL pipeline
- **SDImagePipeline**: Standard Stable Diffusion pipeline
- **WanVideoPipeline**: Video generation pipeline
- **QwenImagePipeline**: Qwen image generation pipeline
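
As a rough illustration of this layer, the sketch below fetches a Flux checkpoint and generates an image with `FluxImagePipeline`. The model ID and call signature follow the project's README-style API but are not verified against any particular release.

```python
# Hedged sketch: fetch a Flux checkpoint and run text-to-image.
# The model ID and keyword names are illustrative and may vary by version.
from diffsynth_engine import fetch_model, FluxImagePipeline

model_path = fetch_model("muse/flux-with-vae", path="flux1-dev-with-vae.safetensors")
pipe = FluxImagePipeline.from_pretrained(model_path, device="cuda")

image = pipe(prompt="a cat sitting on a windowsill, watercolor")
image.save("output.png")
```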

### 2. **Text Processing**
- **Tokenizers**: CLIPTokenizer and T5TokenizerFast for text preprocessing
- **Text Encoders**: CLIP and T5 models for text embedding generation
- **Prompt Encoding**: Converts text prompts to numerical embeddings
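
The sketch below shows the dual-encoder idea using the Hugging Face equivalents of these components. DiffSynth-Engine wraps its own implementations (`FluxTextEncoder1`/`FluxTextEncoder2`), so the class names here are stand-ins for illustration only.

```python
# Illustrative only: the dual text-encoder pattern Flux-style pipelines use,
# expressed with Hugging Face classes rather than the engine's own wrappers.
import torch
from transformers import CLIPTokenizer, CLIPTextModel, T5TokenizerFast, T5EncoderModel

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

prompt = "a cat sitting on a windowsill"
with torch.no_grad():
    clip_ids = clip_tok(prompt, padding="max_length", truncation=True,
                        return_tensors="pt").input_ids
    pooled = clip_enc(clip_ids).pooler_output      # one pooled vector per prompt
    t5_ids = t5_tok(prompt, padding="max_length", max_length=512,
                    truncation=True, return_tensors="pt").input_ids
    sequence = t5_enc(t5_ids).last_hidden_state    # per-token embedding sequence
```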

### 3. **Image Processing**
- **VAE Encoder**: Encodes images to latent space representation
- **VAE Decoder**: Decodes latents back to pixel space
- **Preprocessing**: Image normalization and format conversion
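
A conceptual sketch of the encode/decode round-trip follows; `vae_encoder` and `vae_decoder` stand in for the engine's `FluxVAEEncoder`/`FluxVAEDecoder`, and the 16-channel latent shape reflects the Flux VAE specifically.

```python
# Conceptual sketch of image <-> latent conversion around the denoising loop.
import torch

def preprocess(image_u8: torch.Tensor) -> torch.Tensor:
    """HWC uint8 in [0, 255] -> BCHW float in [-1, 1]."""
    x = image_u8.float() / 127.5 - 1.0
    return x.permute(2, 0, 1).unsqueeze(0)

image = torch.randint(0, 256, (1024, 1024, 3), dtype=torch.uint8)
x = preprocess(image)                  # (1, 3, 1024, 1024)
# latents = vae_encoder(x)             # (1, 16, 128, 128): 8x downsample, 16 channels
# x_hat   = vae_decoder(latents)       # back to pixel space, still in [-1, 1]
```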

### 4. **Noise Scheduling & Sampling**
- **Schedulers**: Define noise schedules (Beta, DDIM, Exponential, etc.)
- **Samplers**: Implement sampling strategies (Euler, DPM++, DDPM, etc.)
- **Timestep Management**: Controls the denoising process progression
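
The flow-match Euler update named in the diagram reduces to a single line of arithmetic, as the sketch below shows; it omits the dynamic shifting and schedule options the engine layers on top.

```python
# Minimal flow-match Euler step: move the sample along the predicted velocity.
import torch

def flow_match_euler_step(x_t: torch.Tensor, velocity: torch.Tensor,
                          sigma: float, sigma_next: float) -> torch.Tensor:
    return x_t + (sigma_next - sigma) * velocity

# A plain linear sigma schedule for 30 steps; real schedulers (Beta, Karras,
# exponential, shifted) space these values differently.
sigmas = torch.linspace(1.0, 0.0, steps=31)
```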

### 5. **Core Denoising**
- **DiT (Diffusion Transformer)**: Main neural network for noise prediction
- **Attention Mechanisms**: Self-attention and cross-attention layers
- **ControlNet Integration**: Optional conditioning for guided generation
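
Schematically, the denoising loop alternates model prediction and a sampler update. In the sketch below, `dit` and `sampler` are placeholders for the engine's components; the classifier-free-guidance arithmetic is shown in its standard form.

```python
# Schematic denoising loop with classifier-free guidance (CFG).
def denoise(dit, sampler, latents, cond, uncond, timesteps, cfg_scale=3.5):
    for t in timesteps:
        pred_cond = dit(latents, t, cond)       # conditional prediction
        pred_uncond = dit(latents, t, uncond)   # unconditional prediction
        # Push the prediction toward the conditional branch
        pred = pred_uncond + cfg_scale * (pred_cond - pred_uncond)
        latents = sampler.step(pred, t, latents)
    return latents
```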

### 6. **Advanced Features**
- **LoRA Support**: Low-rank adaptation for model customization
- **IP-Adapter & Redux**: Image-based conditioning
- **Parallel Processing**: Multi-GPU and distributed inference
- **Memory Management**: CPU/GPU offloading and optimization
- **Quantization**: FP8 and other precision optimizations
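
The sketch below enables two of these features together, LoRA loading and CPU offloading. `load_lora` and the `offload_mode` value follow the project's documented API, but exact names and accepted values may vary by version, and the LoRA path is hypothetical.

```python
# Hedged sketch: LoRA customization plus memory offloading in one pipeline.
from diffsynth_engine import fetch_model, FluxImagePipeline

model_path = fetch_model("muse/flux-with-vae", path="flux1-dev-with-vae.safetensors")
lora_path = "/path/to/some_flux_style_lora.safetensors"  # hypothetical file

# Offloading keeps idle submodules on CPU to fit large models in limited VRAM.
pipe = FluxImagePipeline.from_pretrained(model_path, offload_mode="cpu_offload")
pipe.load_lora(path=lora_path, scale=1.0)  # fused loading bakes weights in for speed
image = pipe(prompt="a cat, ink wash painting style")
```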

### 7. **Model Management**
- **State Dict Handling**: Loading and converting model weights
- **Device Management**: GPU/CPU memory allocation
- **Model Lifecycle**: Loading, offloading, and cleanup
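
At the lowest level, weight loading can be pictured with the `safetensors` library directly, as in this sketch; the engine adds format detection (SafeTensors/GGUF) and key conversion from the Diffusers layout on top, and the checkpoint path here is hypothetical.

```python
# Sketch of the raw weight-loading step with the safetensors library.
from safetensors.torch import load_file

state_dict = load_file("/path/to/flux1-dev.safetensors")  # hypothetical path
print(f"loaded {len(state_dict)} tensors")
# A converter would then remap Diffusers-style keys to the engine's layout:
# state_dict = {convert_key(k): v for k, v in state_dict.items()}
```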

The engine supports multiple diffusion model families (Flux, SD, SDXL, Wan, Qwen) behind a unified interface, with extensive optimization features for high-performance inference.