
# Multimodal Preprocessing

## Table of Contents

1. Introduction
2. Core Architecture
3. Configuration System
4. Image Processing
5. Audio Processing
6. Unified Multimodal Interface
7. Batch and Streaming Processing
8. Practical Examples
9. Troubleshooting Guide

## Introduction

The Multimodal Preprocessing system provides a unified interface for processing different types of data (images, audio, text) in machine learning applications. Built on the Candle framework, this system offers specialized processors for vision and audio data while maintaining a consistent API across modalities. The architecture is designed to support various pre-trained models like CLIP, Whisper, DINOv2, and ImageNet, with configurable preprocessing pipelines optimized for each model's requirements.

**Section sources**

- multimodal.rs
- vision.rs

## Core Architecture

The multimodal preprocessing system follows a modular architecture with specialized components for different data modalities. At its core is the MultimodalProcessor which coordinates between vision and audio processors, providing a unified interface for multimodal data processing.

```mermaid
graph TD
A[MultimodalProcessor] --> B[VisionProcessor]
A --> C[AudioProcessor]
B --> D[VisionConfig]
C --> E[AudioConfig]
A --> F[MultimodalConfig]
G[MultimodalSample] --> A
H[MultimodalBatchProcessor] --> A
```


**Diagram sources**
- [multimodal.rs](file://src-tauri/src/core/multimodal.rs#L78-L116)
- [vision.rs](file://src-tauri/src/core/vision.rs#L1-L20)
- [audio.rs](file://src-tauri/src/core/audio.rs#L1-L20)

**Section sources**
- [multimodal.rs](file://src-tauri/src/core/multimodal.rs#L78-L185)

## Configuration System
The system provides a comprehensive configuration framework that allows customization for different models and use cases. The configuration system is hierarchical, with separate configurations for each modality that can be combined into a unified multimodal configuration.

### Multimodal Configuration
The `MultimodalConfig` struct combines vision and audio configurations with additional processing parameters:

```rust
pub struct MultimodalConfig {
    pub vision: VisionConfig,
    pub audio: AudioConfig,
    pub streaming: bool,
    pub max_batch_size: usize,
    pub augmentation: bool,
}
```


The system provides factory methods for common model configurations (a usage sketch follows this list):
- `MultimodalConfig::clip()` - Optimized for CLIP-style models
- `MultimodalConfig::whisper()` - Optimized for Whisper-style models  
- `MultimodalConfig::training()` - Optimized for multimodal training
- `MultimodalConfig::musicgen()` - Optimized for MusicGen-style models
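
A minimal usage sketch, assuming only the `MultimodalConfig` fields shown above and the `MultimodalProcessor::new(config, device)` constructor used in the examples later in this page:

```rust
use candle::Device;

// Start from the CLIP preset, then override fields for this workload.
let mut config = MultimodalConfig::clip();
config.max_batch_size = 16;
config.streaming = true;

// The processor owns the config and the target device.
let processor = MultimodalProcessor::new(config, Device::Cpu);
```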

### Vision Configuration
The `VisionConfig` struct defines parameters for image preprocessing:

```rust
pub struct VisionConfig {
    pub image_size: usize,
    pub mean: [f32; 3],
    pub std: [f32; 3],
    pub normalize: bool,
    pub resize_method: ResizeMethod,
    pub center_crop: bool,
    pub random_flip: bool,
    pub color_jitter: Option<f32>, // jitter strength; f32 element type assumed
}
```


Model-specific configurations:
- `VisionConfig::imagenet()` - ImageNet-style preprocessing
- `VisionConfig::clip()` - CLIP-style preprocessing with specific mean/std values
- `VisionConfig::dinov2()` - DINOv2-style preprocessing with 518x518 resolution
- `VisionConfig::training()` - Training configuration with augmentations
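
As a point of reference, the sketch below spells out what a CLIP-style preset plausibly contains. The mean/std values are CLIP's published normalization constants; every other field value is an assumption to be checked against `VisionConfig::clip()` in the source:

```rust
// Assumed shape of the CLIP preset; mean/std are the published CLIP constants.
let clip_like = VisionConfig {
    image_size: 224,
    mean: [0.48145466, 0.4578275, 0.40821073],
    std: [0.26862954, 0.26130258, 0.27577711],
    normalize: true,
    resize_method: ResizeMethod::ResizeShortest, // assumed
    center_crop: true,                           // assumed
    random_flip: false,
    color_jitter: None,
};
```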

### Audio Configuration
The `AudioConfig` struct defines parameters for audio preprocessing:

```rust
pub struct AudioConfig {
    pub sample_rate: u32,
    pub n_mels: usize,
    pub n_fft: usize,
    pub hop_length: usize,
    pub normalize: bool,
    pub mel_fmin: f32,
    pub mel_fmax: Option<f32>,
}
```


Model-specific configurations:
- `AudioConfig::whisper()` - Whisper-style audio preprocessing
- `AudioConfig::encodec()` - ENCODEC-style configuration
- `AudioConfig::musicgen()` - MusicGen-style configuration
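
For concreteness, Whisper's published front end uses 16 kHz audio, 80 mel bins, a 400-sample FFT window, and a 160-sample hop; the sketch below assumes `AudioConfig::whisper()` mirrors those values (the `normalize` and `mel_fmax` entries are further assumptions):

```rust
// Assumed contents of the Whisper preset, based on Whisper's published front end.
let whisper_like = AudioConfig {
    sample_rate: 16_000,
    n_mels: 80,
    n_fft: 400,
    hop_length: 160,
    normalize: true,          // assumed
    mel_fmin: 0.0,
    mel_fmax: Some(8_000.0),  // Nyquist at 16 kHz; assumed
};
```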

**Section sources**
- [multimodal.rs](file://src-tauri/src/core/multimodal.rs#L45-L116)
- [vision.rs](file://src-tauri/src/core/vision.rs#L45-L150)
- [audio.rs](file://src-tauri/src/core/audio.rs#L45-L150)

## Image Processing
The vision processing system provides comprehensive functionality for image loading, preprocessing, and augmentation. It supports various image formats and implements preprocessing pipelines optimized for different vision models.

### Image Preprocessing Pipeline
The image preprocessing pipeline follows these steps (steps 5 and 6 are sketched in code after the diagram):
1. Load image from file or bytes
2. Convert to RGB format
3. Resize according to configuration
4. Apply center crop if configured
5. Convert to tensor format
6. Normalize pixel values
7. Apply augmentations (during training)

```mermaid
flowchart TD
A[Load Image] --> B[Convert to RGB]
B --> C{Resize Method}
C --> |Exact| D[Resize to Target Size]
C --> |Fill| D
C --> |ResizeLongest| E[Resize Longest Side]
C --> |ResizeShortest| F[Resize Shortest Side]
D --> G{Center Crop?}
E --> G
F --> G
G --> |Yes| H[Apply Center Crop]
G --> |No| I[Convert to Tensor]
H --> I
I --> J[Normalize Values]
J --> K{Augmentation?}
K --> |Yes| L[Apply Augmentations]
K --> |No| M[Return Tensor]
L --> M
```
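
Steps 5 and 6 of the pipeline can be written directly against Candle's tensor API. A self-contained sketch; the helper name and the HWC input layout are illustrative, not the crate's actual internals:

```rust
use candle::{Device, Result, Tensor};

// Illustrative helper: raw HWC RGB bytes -> normalized CHW f32 tensor.
fn to_normalized_tensor(
    rgb: &[u8],
    height: usize,
    width: usize,
    mean: [f32; 3],
    std: [f32; 3],
    device: &Device,
) -> Result<Tensor> {
    // Scale u8 pixels into [0, 1].
    let data: Vec<f32> = rgb.iter().map(|&p| p as f32 / 255.0).collect();
    // Build an HWC tensor, then move channels first (CHW).
    let chw = Tensor::from_vec(data, (height, width, 3), device)?.permute((2, 0, 1))?;
    // Broadcast per-channel mean/std over the spatial dimensions.
    let mean = Tensor::new(&mean, device)?.reshape((3, 1, 1))?;
    let std = Tensor::new(&std, device)?.reshape((3, 1, 1))?;
    chw.broadcast_sub(&mean)?.broadcast_div(&std)
}
```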

**Diagram sources**

- vision.rs

**Section sources**

- vision.rs

## Audio Processing

The audio processing system handles audio file loading, waveform preprocessing, and mel spectrogram generation. It's designed to work with speech recognition, music generation, and other audio-based models.

### Audio Preprocessing Pipeline

The audio preprocessing pipeline follows these steps:

  1. Load audio from file or bytes
  2. Resample to target sample rate
  3. Convert to mono if stereo
  4. Normalize audio values
  5. Generate mel spectrogram
  6. Apply augmentations (during training)

Key configuration parameters:

- `sample_rate`: target sample rate in Hz
- `n_mels`: number of mel filter banks
- `n_fft`: size of the FFT window, in samples
- `hop_length`: hop length between frames, in samples

The system supports streaming processing for large audio files and provides utilities for working with audio tensors.
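
A hedged usage sketch tying these parameters together, assuming the `whisper()` preset and the `process_audio_samples` signature listed in the interface section below. With a 160-sample hop, one second of 16 kHz audio yields roughly 16 000 / 160 = 100 mel frames:

```rust
use candle::Device;

let processor = MultimodalProcessor::new(MultimodalConfig::whisper(), Device::Cpu);

// One second of silence at 16 kHz.
let samples = vec![0.0f32; 16_000];
let mel = processor.process_audio_samples(&samples, 16_000)?;

// Expect on the order of 16_000 / 160 = 100 frames of n_mels bins each.
println!("mel shape: {:?}", mel.dims());
```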

**Section sources**

- audio.rs

## Unified Multimodal Interface

The system provides a unified interface through the `MultimodalProcessor` struct, which abstracts away the differences between modalities and provides consistent methods for processing each type of data.

### Core Processing Methods

The `MultimodalProcessor` provides the following key methods:

```rust
impl MultimodalProcessor {
    // Image processing
    pub fn process_image<P: AsRef<Path>>(&self, path: P) -> Result<Tensor>
    pub fn process_image_bytes(&self, bytes: &[u8]) -> Result<Tensor>
    pub fn process_images_batch(&self, image_paths: Vec<&Path>) -> Result<Tensor>

    // Audio processing
    pub fn process_audio<P: AsRef<Path>>(&self, path: P) -> Result<Tensor>
    pub fn process_audio_bytes(&self, bytes: &[u8]) -> Result<Tensor>
    pub fn process_audio_samples(&self, samples: &[f32], sample_rate: u32) -> Result<Tensor>
    pub fn process_audio_batch(&self, audio_paths: Vec<&Path>) -> Result<Tensor>

    // Feature extraction
    pub fn extract_features(&self, tensor: &Tensor, modality: Modality) -> Result<Tensor>
    pub fn audio_to_mel(&self, waveform: &Tensor) -> Result<Tensor>
}
```

### Data Structures

The system uses the following key data structures:

```rust
pub enum Modality {
    Vision,
    Audio,
}

pub struct MultimodalSample {
    pub vision: Option<Tensor>,
    pub audio: Option<Tensor>,
    pub text: Option<String>,
    pub metadata: SampleMetadata,
}

pub struct SampleMetadata {
    pub file_path: Option<String>,
    pub duration: Option<f32>,
    pub image_size: Option<(u32, u32)>,
    pub sample_rate: Option<u32>,
    pub timestamp: std::time::SystemTime,
}
```
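
Assembling a sample by hand is straightforward; a sketch for a vision-plus-text pair (the field values here are illustrative):

```rust
let image_tensor = processor.process_image("image.jpg")?;

let sample = MultimodalSample {
    vision: Some(image_tensor),
    audio: None,
    text: Some("a photo of a cat".to_string()),
    metadata: SampleMetadata {
        file_path: Some("image.jpg".to_string()),
        duration: None,
        image_size: Some((224, 224)),
        sample_rate: None,
        timestamp: std::time::SystemTime::now(),
    },
};
```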

**Section sources**

- multimodal.rs

## Batch and Streaming Processing

The system provides specialized components for efficient batch and streaming processing of multimodal data.

### Batch Processing

The `MultimodalBatchProcessor` enables efficient processing of multiple samples:

```rust
pub struct MultimodalBatchProcessor {
    processor: MultimodalProcessor,
    batch_size: usize,
}

impl MultimodalBatchProcessor {
    pub fn process_batch(&self, samples: Vec<MultimodalSample>) -> Result<Vec<MultimodalSample>>
    pub fn process_streaming<F>(&self, samples: Vec<MultimodalSample>, callback: F) -> Result<()>
}
```

### Processing Strategies

The system supports different processing strategies based on use case:

- **Batch processing**: for processing multiple samples at once, optimized for throughput
- **Streaming processing**: for large datasets that don't fit in memory
- **Real-time processing**: for interactive applications with low-latency requirements

The `process_streaming` method processes large datasets in chunks, invoking a callback with each processed batch:

```rust
pub fn process_streaming<F>(&self, samples: Vec<MultimodalSample>, mut callback: F) -> Result<()>
where
    F: FnMut(Vec<MultimodalSample>) -> Result<()>,
{
    // Only one processed batch is alive at a time; the callback decides
    // what to do with it (inference, persistence, ...).
    for chunk in samples.chunks(self.batch_size) {
        let processed_chunk = self.process_batch(chunk.to_vec())?;
        callback(processed_chunk)?;
    }
    Ok(())
}
```

**Section sources**

- multimodal.rs

## Practical Examples

This section provides practical examples demonstrating how to use the multimodal preprocessing system for different scenarios.

### Example 1: Image Classification with CLIP

```rust
use candle::{Device, Result};
use std::path::Path;

// Create CLIP-optimized configuration
let config = MultimodalConfig::clip();
let device = Device::Cpu;
let processor = MultimodalProcessor::new(config, device);

// Process a single image
let image_tensor = processor.process_image("path/to/image.jpg")?;
println!("Image tensor shape: {:?}", image_tensor.dims());

// Process multiple images in batch
let image_paths = vec![
    Path::new("image1.jpg"),
    Path::new("image2.jpg"),
    Path::new("image3.jpg"),
];
let batch_tensor = processor.process_images_batch(image_paths)?;
println!("Batch tensor shape: {:?}", batch_tensor.dims());
```

### Example 2: Speech Recognition with Whisper

```rust
// Create Whisper-optimized configuration
let config = MultimodalConfig::whisper();
let device = Device::Cpu;
let processor = MultimodalProcessor::new(config, device);

// Process audio file
let audio_tensor = processor.process_audio("path/to/audio.wav")?;
println!("Audio tensor shape: {:?}", audio_tensor.dims());

// Process raw audio samples
let samples: Vec<f32> = vec![/* ... audio samples ... */];
let audio_tensor = processor.process_audio_samples(&samples, 16000)?;
```

### Example 3: Training with Data Augmentation

```rust
// Create training-optimized configuration
let config = MultimodalConfig::training();
let device = Device::Cpu;
let mut processor = MultimodalProcessor::new(config, device);

// Process image with augmentations
let image_tensor = processor.process_image("training_image.jpg")?;
let augmented_tensor = processor.augment(&image_tensor, Modality::Vision)?;
```

### Example 4: Streaming Processing of a Large Dataset

```rust
// Create batch processor
let config = MultimodalConfig::default();
let device = Device::Cpu;
let processor = MultimodalProcessor::new(config, device);
let batch_processor = MultimodalBatchProcessor::new(processor, 8);

// Process large dataset in streaming mode
let samples: Vec<MultimodalSample> = vec![/* ... large collection of multimodal samples ... */];
batch_processor.process_streaming(samples, |processed_chunk| {
    for sample in processed_chunk {
        // Send to model for inference,
        // save to disk,
        // or perform other processing here.
    }
    Ok(())
})?;
```

**Section sources**

- multimodal.rs
- multimodal_preprocessing.rs

## Troubleshooting Guide

This section addresses common issues encountered when using the multimodal preprocessing system and provides solutions.

### Common Issues and Solutions

#### Issue 1: Image Loading Failures

- **Symptoms**: `Error::wrap` when calling `load_and_preprocess_image`
- **Causes**: corrupted image files, unsupported formats, file permission issues
- **Solutions**:
  - Verify image file integrity
  - Ensure the file format is supported (JPEG, PNG, etc.)
  - Check file permissions and path validity (see the pre-flight check below)
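
A cheap pre-flight check before handing a path to the processor (plain standard library, nothing crate-specific):

```rust
use std::path::Path;

let path = Path::new("path/to/image.jpg");
if !path.is_file() {
    // Catches missing files and directories before the image decoder runs.
    eprintln!("not a readable file: {}", path.display());
}
```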

#### Issue 2: Audio Resampling Problems

- **Symptoms**: audio quality degradation, incorrect sample rates
- **Causes**: incompatible sample rates between the source and the configuration
- **Solutions**:
  - Verify the source audio sample rate matches the configuration
  - Use `process_audio_samples` for precise control over the sample rate
  - Check that the audio file is not corrupted

#### Issue 3: Memory Issues with Large Files

- **Symptoms**: out-of-memory errors, slow processing
- **Causes**: processing very large images or audio files
- **Solutions**:
  - Use the streaming processing mode
  - Reduce the batch size
  - Process files individually rather than in large batches

#### Issue 4: Normalization Artifacts

- **Symptoms**: images appear too dark or too bright after preprocessing
- **Causes**: incorrect mean/std values for the target model
- **Solutions**:
  - Use the appropriate configuration (e.g., `VisionConfig::clip()` for CLIP models)
  - Verify mean/std values match the model's training configuration
  - Check that normalization is enabled when required

#### Issue 5: Dimension Mismatch Errors

- **Symptoms**: tensor dimension errors during model inference
- **Causes**: incorrect image resizing or audio processing
- **Solutions**:
  - Verify the image size matches the model's requirements
  - Check that the resize method produces the expected dimensions
  - Use `tensor.dims()` to debug tensor shapes (see the sketch below)
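
For the last point, checking shapes early localizes the mismatch. A sketch assuming a configured `processor` and the common channels-first layout for a 224x224 model input:

```rust
let t = processor.process_image("image.jpg")?;
// A 224x224 RGB input is typically [3, 224, 224] unbatched,
// or [1, 3, 224, 224] once batched; compare against the model's expectation.
println!("image dims: {:?}", t.dims());
```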

### Performance Optimization Tips

- Use GPU acceleration when available by selecting a `Device::Cuda` or `Device::Metal` device (see the sketch after this list)
- Process data in batches to maximize throughput
- Pre-compute mel filters for audio processing when processing multiple files
- Use batch sizes appropriate to the available memory
- Consider half-precision (F16) tensors to reduce memory usage
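
For the first tip, Candle's usual device-selection pattern looks like this (a sketch using `Device::cuda_if_available` from the Candle device API):

```rust
use candle::Device;

// Prefer CUDA device 0 when compiled with CUDA support, otherwise fall back to CPU.
let device = Device::cuda_if_available(0).unwrap_or(Device::Cpu);
let processor = MultimodalProcessor::new(MultimodalConfig::clip(), device);
```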

**Section sources**

- multimodal.rs
- vision.rs
- multimodal_preprocessing.rs

## Referenced Files in This Document

- multimodal.rs
- vision.rs
- audio.rs
- multimodal_preprocessing.rs
