
# Multimodal Preprocessing

## Table of Contents

1. Introduction
2. Core Architecture
3. Configuration System
4. Image Processing
5. Audio Processing
6. Unified Multimodal Interface
7. Batch and Streaming Processing
8. Practical Examples
9. Troubleshooting Guide

## Introduction

The Multimodal Preprocessing system provides a unified interface for processing different types of data (images, audio, text) in machine learning applications. Built on the Candle framework, this system offers specialized processors for vision and audio data while maintaining a consistent API across modalities. The architecture is designed to support various pre-trained models like CLIP, Whisper, DINOv2, and ImageNet, with configurable preprocessing pipelines optimized for each model's requirements.

**Section sources**

- multimodal.rs
- vision.rs

## Core Architecture

The multimodal preprocessing system follows a modular architecture with specialized components for different data modalities. At its core is the MultimodalProcessor which coordinates between vision and audio processors, providing a unified interface for multimodal data processing.

```mermaid
graph TD
A[MultimodalProcessor] --> B[VisionProcessor]
A --> C[AudioProcessor]
B --> D[VisionConfig]
C --> E[AudioConfig]
A --> F[MultimodalConfig]
G[MultimodalSample] --> A
H[MultimodalBatchProcessor] --> A
```


**Diagram sources**
- [multimodal.rs](file://src-tauri/src/core/multimodal.rs#L78-L116)
- [vision.rs](file://src-tauri/src/core/vision.rs#L1-L20)
- [audio.rs](file://src-tauri/src/core/audio.rs#L1-L20)

**Section sources**
- [multimodal.rs](file://src-tauri/src/core/multimodal.rs#L78-L185)

## Configuration System
The system provides a comprehensive configuration framework that allows customization for different models and use cases. The configuration system is hierarchical, with separate configurations for each modality that can be combined into a unified multimodal configuration.

### Multimodal Configuration
The `MultimodalConfig` struct combines vision and audio configurations with additional processing parameters:

```rust
pub struct MultimodalConfig {
    pub vision: VisionConfig,
    pub audio: AudioConfig,
    pub streaming: bool,
    pub max_batch_size: usize,
    pub augmentation: bool,
}
```


The system provides factory methods for common model configurations (a usage sketch follows this list):
- `MultimodalConfig::clip()` - Optimized for CLIP-style models
- `MultimodalConfig::whisper()` - Optimized for Whisper-style models  
- `MultimodalConfig::training()` - Optimized for multimodal training
- `MultimodalConfig::musicgen()` - Optimized for MusicGen-style models
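
A minimal usage sketch, assuming only the `MultimodalConfig` fields shown above and the `MultimodalProcessor::new(config, device)` constructor used in the examples later in this page:

```rust
use candle::Device;

// Start from the CLIP preset, then override fields for this workload.
let mut config = MultimodalConfig::clip();
config.max_batch_size = 16;
config.streaming = true;

// The processor owns the config and the target device.
let processor = MultimodalProcessor::new(config, Device::Cpu);
```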

### Vision Configuration
The `VisionConfig` struct defines parameters for image preprocessing:

```rust
pub struct VisionConfig {
    pub image_size: usize,
    pub mean: [f32; 3],
    pub std: [f32; 3],
    pub normalize: bool,
    pub resize_method: ResizeMethod,
    pub center_crop: bool,
    pub random_flip: bool,
    pub color_jitter: Option<f32>, // jitter strength; f32 element type assumed
}
```


Model-specific configurations:
- `VisionConfig::imagenet()` - ImageNet-style preprocessing
- `VisionConfig::clip()` - CLIP-style preprocessing with specific mean/std values
- `VisionConfig::dinov2()` - DINOv2-style preprocessing with 518x518 resolution
- `VisionConfig::training()` - Training configuration with augmentations
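
As a point of reference, the sketch below spells out what a CLIP-style preset plausibly contains. The mean/std values are CLIP's published normalization constants; every other field value is an assumption to be checked against `VisionConfig::clip()` in the source:

```rust
// Assumed shape of the CLIP preset; mean/std are the published CLIP constants.
let clip_like = VisionConfig {
    image_size: 224,
    mean: [0.48145466, 0.4578275, 0.40821073],
    std: [0.26862954, 0.26130258, 0.27577711],
    normalize: true,
    resize_method: ResizeMethod::ResizeShortest, // assumed
    center_crop: true,                           // assumed
    random_flip: false,
    color_jitter: None,
};
```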

### Audio Configuration
The `AudioConfig` struct defines parameters for audio preprocessing:

```rust
pub struct AudioConfig {
    pub sample_rate: u32,
    pub n_mels: usize,
    pub n_fft: usize,
    pub hop_length: usize,
    pub normalize: bool,
    pub mel_fmin: f32,
    pub mel_fmax: Option<f32>,
}
```


Model-specific configurations:
- `AudioConfig::whisper()` - Whisper-style audio preprocessing
- `AudioConfig::encodec()` - ENCODEC-style configuration
- `AudioConfig::musicgen()` - MusicGen-style configuration
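
For concreteness, Whisper's published front end uses 16 kHz audio, 80 mel bins, a 400-sample FFT window, and a 160-sample hop; the sketch below assumes `AudioConfig::whisper()` mirrors those values (the `normalize` and `mel_fmax` entries are further assumptions):

```rust
// Assumed contents of the Whisper preset, based on Whisper's published front end.
let whisper_like = AudioConfig {
    sample_rate: 16_000,
    n_mels: 80,
    n_fft: 400,
    hop_length: 160,
    normalize: true,          // assumed
    mel_fmin: 0.0,
    mel_fmax: Some(8_000.0),  // Nyquist at 16 kHz; assumed
};
```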

**Section sources**
- [multimodal.rs](file://src-tauri/src/core/multimodal.rs#L45-L116)
- [vision.rs](file://src-tauri/src/core/vision.rs#L45-L150)
- [audio.rs](file://src-tauri/src/core/audio.rs#L45-L150)

## Image Processing
The vision processing system provides comprehensive functionality for image loading, preprocessing, and augmentation. It supports various image formats and implements preprocessing pipelines optimized for different vision models.

### Image Preprocessing Pipeline
The image preprocessing pipeline follows these steps (steps 5 and 6 are sketched in code after the diagram):
1. Load image from file or bytes
2. Convert to RGB format
3. Resize according to configuration
4. Apply center crop if configured
5. Convert to tensor format
6. Normalize pixel values
7. Apply augmentations (during training)

```mermaid
flowchart TD
A[Load Image] --> B[Convert to RGB]
B --> C{Resize Method}
C --> |Exact| D[Resize to Target Size]
C --> |Fill| D
C --> |ResizeLongest| E[Resize Longest Side]
C --> |ResizeShortest| F[Resize Shortest Side]
D --> G{Center Crop?}
E --> G
F --> G
G --> |Yes| H[Apply Center Crop]
G --> |No| I[Convert to Tensor]
H --> I
I --> J[Normalize Values]
J --> K{Augmentation?}
K --> |Yes| L[Apply Augmentations]
K --> |No| M[Return Tensor]
L --> M
```
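
Steps 5 and 6 of the pipeline can be written directly against Candle's tensor API. A self-contained sketch; the helper name and the HWC input layout are illustrative, not the crate's actual internals:

```rust
use candle::{Device, Result, Tensor};

// Illustrative helper: raw HWC RGB bytes -> normalized CHW f32 tensor.
fn to_normalized_tensor(
    rgb: &[u8],
    height: usize,
    width: usize,
    mean: [f32; 3],
    std: [f32; 3],
    device: &Device,
) -> Result<Tensor> {
    // Scale u8 pixels into [0, 1].
    let data: Vec<f32> = rgb.iter().map(|&p| p as f32 / 255.0).collect();
    // Build an HWC tensor, then move channels first (CHW).
    let chw = Tensor::from_vec(data, (height, width, 3), device)?.permute((2, 0, 1))?;
    // Broadcast per-channel mean/std over the spatial dimensions.
    let mean = Tensor::new(&mean, device)?.reshape((3, 1, 1))?;
    let std = Tensor::new(&std, device)?.reshape((3, 1, 1))?;
    chw.broadcast_sub(&mean)?.broadcast_div(&std)
}
```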

**Diagram sources**

- vision.rs

**Section sources**

- vision.rs

## Audio Processing

The audio processing system handles audio file loading, waveform preprocessing, and mel spectrogram generation. It's designed to work with speech recognition, music generation, and other audio-based models.

### Audio Preprocessing Pipeline

The audio preprocessing pipeline follows these steps:

  1. Load audio from file or bytes
  2. Resample to target sample rate
  3. Convert to mono if stereo
  4. Normalize audio values
  5. Generate mel spectrogram
  6. Apply augmentations (during training)

Key configuration parameters:

- `sample_rate`: target sample rate in Hz
- `n_mels`: number of mel filter banks
- `n_fft`: size of the FFT window, in samples
- `hop_length`: hop length between frames, in samples

The system supports streaming processing for large audio files and provides utilities for working with audio tensors.
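
A hedged usage sketch tying these parameters together, assuming the `whisper()` preset and the `process_audio_samples` signature listed in the interface section below. With a 160-sample hop, one second of 16 kHz audio yields roughly 16 000 / 160 = 100 mel frames:

```rust
use candle::Device;

let processor = MultimodalProcessor::new(MultimodalConfig::whisper(), Device::Cpu);

// One second of silence at 16 kHz.
let samples = vec![0.0f32; 16_000];
let mel = processor.process_audio_samples(&samples, 16_000)?;

// Expect on the order of 16_000 / 160 = 100 frames of n_mels bins each.
println!("mel shape: {:?}", mel.dims());
```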

**Section sources**

- audio.rs

## Unified Multimodal Interface

The system provides a unified interface through the `MultimodalProcessor` struct, which abstracts away the differences between modalities and provides consistent methods for processing each type of data.

### Core Processing Methods

The `MultimodalProcessor` provides the following key methods:

```rust
impl MultimodalProcessor {
    // Image processing
    pub fn process_image<P: AsRef<Path>>(&self, path: P) -> Result<Tensor>
    pub fn process_image_bytes(&self, bytes: &[u8]) -> Result<Tensor>
    pub fn process_images_batch(&self, image_paths: Vec<&Path>) -> Result<Tensor>

    // Audio processing
    pub fn process_audio<P: AsRef<Path>>(&self, path: P) -> Result<Tensor>
    pub fn process_audio_bytes(&self, bytes: &[u8]) -> Result<Tensor>
    pub fn process_audio_samples(&self, samples: &[f32], sample_rate: u32) -> Result<Tensor>
    pub fn process_audio_batch(&self, audio_paths: Vec<&Path>) -> Result<Tensor>

    // Feature extraction
    pub fn extract_features(&self, tensor: &Tensor, modality: Modality) -> Result<Tensor>
    pub fn audio_to_mel(&self, waveform: &Tensor) -> Result<Tensor>
}
```

### Data Structures

The system uses the following key data structures:

```rust
pub enum Modality {
    Vision,
    Audio,
}

pub struct MultimodalSample {
    pub vision: Option<Tensor>,
    pub audio: Option<Tensor>,
    pub text: Option<String>,
    pub metadata: SampleMetadata,
}

pub struct SampleMetadata {
    pub file_path: Option<String>,
    pub duration: Option<f32>,
    pub image_size: Option<(u32, u32)>,
    pub sample_rate: Option<u32>,
    pub timestamp: std::time::SystemTime,
}
```
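
Assembling a sample by hand is straightforward; a sketch for a vision-plus-text pair (the field values here are illustrative):

```rust
let image_tensor = processor.process_image("image.jpg")?;

let sample = MultimodalSample {
    vision: Some(image_tensor),
    audio: None,
    text: Some("a photo of a cat".to_string()),
    metadata: SampleMetadata {
        file_path: Some("image.jpg".to_string()),
        duration: None,
        image_size: Some((224, 224)),
        sample_rate: None,
        timestamp: std::time::SystemTime::now(),
    },
};
```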

**Section sources**

- multimodal.rs

## Batch and Streaming Processing

The system provides specialized components for efficient batch and streaming processing of multimodal data.

### Batch Processing

The `MultimodalBatchProcessor` enables efficient processing of multiple samples:

```rust
pub struct MultimodalBatchProcessor {
    processor: MultimodalProcessor,
    batch_size: usize,
}

impl MultimodalBatchProcessor {
    pub fn process_batch(&self, samples: Vec<MultimodalSample>) -> Result<Vec<MultimodalSample>>
    pub fn process_streaming<F>(&self, samples: Vec<MultimodalSample>, callback: F) -> Result<()>
}
```

### Processing Strategies

The system supports different processing strategies based on use case:

- **Batch processing**: for processing multiple samples at once, optimized for throughput
- **Streaming processing**: for large datasets that don't fit in memory
- **Real-time processing**: for interactive applications with low-latency requirements

The `process_streaming` method processes large datasets in chunks, invoking a callback with each processed batch:

```rust
pub fn process_streaming<F>(&self, samples: Vec<MultimodalSample>, mut callback: F) -> Result<()>
where
    F: FnMut(Vec<MultimodalSample>) -> Result<()>,
{
    // Only one processed batch is alive at a time; the callback decides
    // what to do with it (inference, persistence, ...).
    for chunk in samples.chunks(self.batch_size) {
        let processed_chunk = self.process_batch(chunk.to_vec())?;
        callback(processed_chunk)?;
    }
    Ok(())
}
```

**Section sources**

- multimodal.rs

## Practical Examples

This section provides practical examples demonstrating how to use the multimodal preprocessing system for different scenarios.

### Example 1: Image Classification with CLIP

```rust
use candle::{Device, Result};
use std::path::Path;

// Create CLIP-optimized configuration
let config = MultimodalConfig::clip();
let device = Device::Cpu;
let processor = MultimodalProcessor::new(config, device);

// Process a single image
let image_tensor = processor.process_image("path/to/image.jpg")?;
println!("Image tensor shape: {:?}", image_tensor.dims());

// Process multiple images in batch
let image_paths = vec![
    Path::new("image1.jpg"),
    Path::new("image2.jpg"),
    Path::new("image3.jpg"),
];
let batch_tensor = processor.process_images_batch(image_paths)?;
println!("Batch tensor shape: {:?}", batch_tensor.dims());
```

### Example 2: Speech Recognition with Whisper

```rust
// Create Whisper-optimized configuration
let config = MultimodalConfig::whisper();
let device = Device::Cpu;
let processor = MultimodalProcessor::new(config, device);

// Process audio file
let audio_tensor = processor.process_audio("path/to/audio.wav")?;
println!("Audio tensor shape: {:?}", audio_tensor.dims());

// Process raw audio samples
let samples: Vec<f32> = vec![/* ... audio samples ... */];
let audio_tensor = processor.process_audio_samples(&samples, 16000)?;
```

### Example 3: Training with Data Augmentation

```rust
// Create training-optimized configuration
let config = MultimodalConfig::training();
let device = Device::Cpu;
let mut processor = MultimodalProcessor::new(config, device);

// Process image with augmentations
let image_tensor = processor.process_image("training_image.jpg")?;
let augmented_tensor = processor.augment(&image_tensor, Modality::Vision)?;
```

### Example 4: Streaming Processing of a Large Dataset

```rust
// Create batch processor
let config = MultimodalConfig::default();
let device = Device::Cpu;
let processor = MultimodalProcessor::new(config, device);
let batch_processor = MultimodalBatchProcessor::new(processor, 8);

// Process large dataset in streaming mode
let samples: Vec<MultimodalSample> = vec![/* ... large collection of multimodal samples ... */];
batch_processor.process_streaming(samples, |processed_chunk| {
    for sample in processed_chunk {
        // Send to model for inference,
        // save to disk,
        // or perform other processing here.
    }
    Ok(())
})?;
```

**Section sources**

- multimodal.rs
- multimodal_preprocessing.rs

## Troubleshooting Guide

This section addresses common issues encountered when using the multimodal preprocessing system and provides solutions.

### Common Issues and Solutions

#### Issue 1: Image Loading Failures

- **Symptoms**: `Error::wrap` when calling `load_and_preprocess_image`
- **Causes**: corrupted image files, unsupported formats, file permission issues
- **Solutions**:
  - Verify image file integrity
  - Ensure the file format is supported (JPEG, PNG, etc.)
  - Check file permissions and path validity (see the pre-flight check below)
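
A cheap pre-flight check before handing a path to the processor (plain standard library, nothing crate-specific):

```rust
use std::path::Path;

let path = Path::new("path/to/image.jpg");
if !path.is_file() {
    // Catches missing files and directories before the image decoder runs.
    eprintln!("not a readable file: {}", path.display());
}
```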

#### Issue 2: Audio Resampling Problems

- **Symptoms**: audio quality degradation, incorrect sample rates
- **Causes**: incompatible sample rates between the source and the configuration
- **Solutions**:
  - Verify the source audio sample rate matches the configuration
  - Use `process_audio_samples` for precise control over the sample rate
  - Check that the audio file is not corrupted

#### Issue 3: Memory Issues with Large Files

- **Symptoms**: out-of-memory errors, slow processing
- **Causes**: processing very large images or audio files
- **Solutions**:
  - Use the streaming processing mode
  - Reduce the batch size
  - Process files individually rather than in large batches

#### Issue 4: Normalization Artifacts

- **Symptoms**: images appear too dark or too bright after preprocessing
- **Causes**: incorrect mean/std values for the target model
- **Solutions**:
  - Use the appropriate configuration (e.g., `VisionConfig::clip()` for CLIP models)
  - Verify mean/std values match the model's training configuration
  - Check that normalization is enabled when required

#### Issue 5: Dimension Mismatch Errors

- **Symptoms**: tensor dimension errors during model inference
- **Causes**: incorrect image resizing or audio processing
- **Solutions**:
  - Verify the image size matches the model's requirements
  - Check that the resize method produces the expected dimensions
  - Use `tensor.dims()` to debug tensor shapes (see the sketch below)
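
For the last point, checking shapes early localizes the mismatch. A sketch assuming a configured `processor` and the common channels-first layout for a 224x224 model input:

```rust
let t = processor.process_image("image.jpg")?;
// A 224x224 RGB input is typically [3, 224, 224] unbatched,
// or [1, 3, 224, 224] once batched; compare against the model's expectation.
println!("image dims: {:?}", t.dims());
```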

### Performance Optimization Tips

- Use GPU acceleration when available by selecting a `Device::Cuda` or `Device::Metal` device (see the sketch after this list)
- Process data in batches to maximize throughput
- Pre-compute mel filters for audio processing when processing multiple files
- Use batch sizes appropriate to the available memory
- Consider half-precision (F16) tensors to reduce memory usage
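
For the first tip, Candle's usual device-selection pattern looks like this (a sketch using `Device::cuda_if_available` from the Candle device API):

```rust
use candle::Device;

// Prefer CUDA device 0 when compiled with CUDA support, otherwise fall back to CPU.
let device = Device::cuda_if_available(0).unwrap_or(Device::Cpu);
let processor = MultimodalProcessor::new(MultimodalConfig::clip(), device);
```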

**Section sources**

- multimodal.rs
- vision.rs
- multimodal_preprocessing.rs

## Referenced Files in This Document

- multimodal.rs
- vision.rs
- audio.rs
- multimodal_preprocessing.rs
