diff --git a/docs/architecture-diagram.md b/docs/architecture-diagram.md
new file mode 100644
index 0000000..c3f6d55
--- /dev/null
+++ b/docs/architecture-diagram.md
@@ -0,0 +1,153 @@
+# DiffSynth-Engine Architecture Diagram
+
+This Mermaid diagram shows the overall architecture and data flow of DiffSynth-Engine, a high-performance inference engine for diffusion models.
+
+```mermaid
+graph TB
+    %% Input Layer
+    A[User Input: Prompt, Image, Parameters] --> B[Configuration]
+    B --> B1[Pipeline Config<br/>FluxPipelineConfig/SDXLPipelineConfig/etc.]
+
+    %% Model Fetching & Loading
+    B1 --> C[Model Fetching System]
+    C --> C1[fetch_model<br/>HuggingFace/CivitAI/ModelScope]
+    C1 --> C2[State Dict Loading<br/>SafeTensors/GGUF]
+    C2 --> C3[Model Conversion<br/>Diffusers → DiffSynth Format]
+
+    %% Pipeline Factory
+    C3 --> D{Pipeline Type}
+    D -->|Text-to-Image| E[FluxImagePipeline]
+    D -->|SDXL| F[SDXLImagePipeline]
+    D -->|SD| G[SDImagePipeline]
+    D -->|Video| H[WanVideoPipeline]
+    D -->|Qwen Image| I[QwenImagePipeline]
+
+    %% Main Pipeline Flow (using Flux as example)
+    E --> J[Model Initialization]
+
+    %% Text Processing
+    J --> K[Text Processing]
+    K --> K1[CLIPTokenizer + T5TokenizerFast]
+    K1 --> K2[FluxTextEncoder1 + FluxTextEncoder2]
+    K2 --> K3[Text Embeddings]
+
+    %% Image Processing (if img2img)
+    J --> L[Image Processing]
+    L --> L1[Image Preprocessing]
+    L1 --> L2[FluxVAEEncoder]
+    L2 --> L3[Latent Space]
+
+    %% Noise & Sampling Setup
+    J --> M[Noise & Sampling Setup]
+    M --> M1[Noise Generation + Dynamic Shifting]
+    M --> M2[RectifiedFlowScheduler → Timesteps]
+    M --> M3[FlowMatchEulerSampler → Strategy]
+
+    %% Core Denoising Loop
+    K3 --> N[Core Denoising Loop]
+    L3 --> N
+    M1 --> N
+    M2 --> N
+    M3 --> N
+
+    N --> N1[FluxDiT Transformer<br/>+ ControlNet/IP-Adapter]
+    N1 --> N2[Noise Prediction]
+    N2 --> N3[Sampler Step]
+    N3 --> N4{More Steps?}
+    N4 -->|Yes| N1
+    N4 -->|No| O[Final Latents]
+
+    %% Image Decoding
+    O --> P[FluxVAEDecoder]
+    P --> Q[Generated Image]
+
+    %% Performance Optimizations
+    R[Performance Features]
+    R --> R1[Memory Management<br/>CPU/GPU Offloading<br/>Sequential Offloading]
+    R --> R2[Parallel Processing<br/>Tensor/Sequence Parallel<br/>CFG Parallel]
+    R --> R3[Quantization<br/>FP8/GGUF Support<br/>Model Compilation]
+
+    R1 --> J
+    R2 --> J
+    R3 --> J
+
+    %% Model Customization
+    S[Model Customization]
+    S --> S1[LoRA Support<br/>Fused/Unfused Loading]
+    S --> S2[Conditioning<br/>IP-Adapter/Redux]
+    S --> S3[Control<br/>ControlNet/Inpainting]
+
+    S1 --> N1
+    S2 --> N1
+    S3 --> N1
+
+    %% Tools & Extensions
+    T[Tools & Extensions]
+    T --> T1[FluxInpaintingTool]
+    T --> T2[FluxOutpaintingTool]
+    T --> T3[FluxReferenceTools]
+    T --> T4[FluxReplaceTool]
+
+    T --> E
+
+    %% Algorithm Foundation
+    U[Algorithm Foundation]
+    U --> U1[Noise Schedulers<br/>Beta/DDIM/Exponential/Karras]
+    U --> U2[Samplers<br/>Euler/DPM++/DDPM/FlowMatch]
+
+    U1 --> M
+    U2 --> M
+
+    style A fill:#e1f5fe
+    style Q fill:#c8e6c9
+    style N1 fill:#fff3e0
+    style E fill:#f3e5f5
+    style C fill:#fce4ec
+    style R fill:#e8f5e8
+    style S fill:#fff8e1
+```
+
+## Architecture Overview
+
+DiffSynth-Engine follows a modular architecture with these key components:
+
+### 1. **Pipeline Layer**
+- **FluxImagePipeline**: Primary image generation pipeline using Flux models
+- **SDXLImagePipeline**: Stable Diffusion XL pipeline
+- **SDImagePipeline**: Standard Stable Diffusion pipeline
+- **WanVideoPipeline**: Video generation pipeline
+- **QwenImagePipeline**: Qwen image generation pipeline
+
+### 2. **Text Processing**
+- **Tokenizers**: CLIPTokenizer and T5TokenizerFast for text preprocessing
+- **Text Encoders**: CLIP and T5 models for text embedding generation
+- **Prompt Encoding**: Converts text prompts into numerical embeddings
+
+### 3. **Image Processing**
+- **VAE Encoder**: Encodes images into latent-space representations
+- **VAE Decoder**: Decodes latents back to pixel space
+- **Preprocessing**: Image normalization and format conversion
+
+### 4. **Noise Scheduling & Sampling**
+- **Schedulers**: Define noise schedules (Beta, DDIM, Exponential, etc.)
+- **Samplers**: Implement sampling strategies (Euler, DPM++, DDPM, etc.)
+- **Timestep Management**: Controls progression through the denoising steps
+
+### 5. **Core Denoising**
+- **DiT (Diffusion Transformer)**: Main neural network for noise prediction
+- **Attention Mechanisms**: Self-attention and cross-attention layers
+- **ControlNet Integration**: Optional conditioning for guided generation
+
+### 6. **Advanced Features**
+- **LoRA Support**: Low-rank adaptation for model customization
+- **IP-Adapter & Redux**: Image-based conditioning
+- **Parallel Processing**: Multi-GPU and distributed inference
+- **Memory Management**: CPU/GPU offloading and optimization
+- **Quantization**: FP8, GGUF, and other reduced-precision optimizations
+
+### 7. **Model Management**
+- **State Dict Handling**: Loading and converting model weights
+- **Device Management**: GPU/CPU memory allocation
+- **Model Lifecycle**: Loading, offloading, and cleanup
+
+The engine supports multiple diffusion model families (Flux, SD, SDXL, Wan, Qwen) while providing a unified interface and extensive optimization features for high-performance inference.
\ No newline at end of file
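
To make the core denoising loop in the diagram concrete, here is a toy sketch of fixed-step Euler integration of a rectified-flow ODE (the pattern behind a FlowMatchEuler-style sampler). This is illustrative only, not DiffSynth-Engine's implementation: `model` stands in for the DiT's velocity prediction, and the linear timestep schedule ignores dynamic shifting.

```python
import numpy as np

def euler_flow_sample(model, x1, num_steps=10):
    """Toy rectified-flow sampler: integrate dx/dt = v(x, t) with fixed-step
    Euler from t=1 (pure noise x1) down to t=0 (clean latents)."""
    x = x1.copy()
    timesteps = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = model(x, t_cur)           # velocity prediction (the DiT's role)
        x = x + (t_next - t_cur) * v  # Euler step toward t=0
    return x

# On the straight rectified-flow path x_t = (1 - t) * x0 + t * x1 the true
# velocity is the constant x1 - x0, so a perfect model recovers x0 exactly.
x0 = np.array([1.0, -2.0, 0.5])       # "clean" target latents
x1 = np.array([0.3, 0.7, -1.1])       # "noise" starting point
perfect_model = lambda x, t: x1 - x0
recovered = euler_flow_sample(perfect_model, x1, num_steps=4)
```

In the real pipeline the branch labelled "More Steps?" is just this loop's iteration count, and `model` is the (much more expensive) FluxDiT forward pass with text embeddings and optional ControlNet/IP-Adapter inputs.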
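
The "Dynamic Shifting" box refers to warping the timestep/sigma schedule so more sampling effort lands at high noise levels. One common static-shift formula used by Flux-style rectified-flow schedulers is sketched below; whether DiffSynth-Engine's `RectifiedFlowScheduler` uses exactly this form (versus a resolution-dependent variant) is an assumption here.

```python
def shift_sigmas(sigmas, shift=3.0):
    """Apply the static timestep shift sigma' = s*sigma / (1 + (s - 1)*sigma).
    Endpoints 0 and 1 are fixed; intermediate sigmas are pushed upward, i.e.
    steps concentrate at high noise. The "dynamic" variant picks `shift`
    from the image resolution (not shown here)."""
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in sigmas]

uniform = [1.0, 0.75, 0.5, 0.25, 0.0]
shifted = shift_sigmas(uniform, shift=3.0)  # endpoints 1.0 and 0.0 preserved
```

With `shift=3.0`, the midpoint sigma 0.5 maps to 0.75, so the sampler spends most of its steps resolving coarse structure before refining details.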
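
The fused vs. unfused LoRA loading mentioned under Model Customization comes down to standard LoRA algebra: a fused load merges the low-rank update into the base weight once, while an unfused load keeps the adapter separate and applies it per forward pass. A minimal sketch (function names here are hypothetical, not the engine's API):

```python
import numpy as np

def fuse_lora(weight, lora_down, lora_up, alpha=1.0):
    """Fused loading: merge W' = W + alpha * (up @ down) into the base weight
    once, so inference pays no per-step LoRA cost."""
    return weight + alpha * (lora_up @ lora_down)

def unfused_forward(x, weight, lora_down, lora_up, alpha=1.0):
    """Unfused: keep W and the adapters separate, adding the low-rank path
    each forward pass -- slower, but the LoRA can be detached or re-scaled
    at any time without reloading weights."""
    return x @ weight.T + alpha * (x @ lora_down.T) @ lora_up.T

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
down = rng.standard_normal((2, 4))    # rank-2 adapter
up = rng.standard_normal((4, 2))
x = rng.standard_normal((3, 4))
fused_out = x @ fuse_lora(W, down, up, alpha=0.8).T
unfused_out = unfused_forward(x, W, down, up, alpha=0.8)
# Both paths produce the same outputs; they differ only in cost/flexibility.
```

This is why the diagram routes LoRA into the FluxDiT node: mathematically the adapter is just a delta on the transformer's linear weights, and the loading mode is purely a speed/flexibility trade-off.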