# Introduction
- Introduction
- Core Value Proposition
- Key Features
- Architecture Overview
- System Context Diagram
- Use Cases
## Core Value Proposition

Oxide-Lab is a privacy-first, local AI chat application built with Tauri and Rust that enables secure, offline AI interactions directly on the user's machine. The application's core value proposition centers on complete data privacy and security by ensuring that no user data ever leaves the local device during AI inference. By leveraging the Candle framework for Rust, Oxide-Lab provides GPU-accelerated local inference capabilities using GGUF and safetensors model formats, allowing users to interact with powerful AI models without relying on cloud services or internet connectivity. This architecture guarantees that conversations, prompts, and personal information remain entirely under the user's control, making it ideal for privacy-conscious individuals, researchers, and developers who require secure AI interactions.
**Section sources**
- [README.md](file://README.md)
- [TECHNICAL.md](file://TECHNICAL.md)
## Key Features

Oxide-Lab offers a comprehensive set of features designed to enhance the local AI chat experience while maintaining privacy and performance. The application supports GGUF and safetensors model formats, enabling users to leverage quantized models for efficient local inference. One of its standout features is the Thinking Mode, which allows users to see the AI's reasoning process before receiving the final answer, providing deeper insights into complex problem-solving. The application integrates with the Hugging Face Hub, allowing seamless access to a wide range of pre-trained models directly from the repository. For performance optimization, Oxide-Lab supports CUDA for GPU acceleration, significantly speeding up inference times on compatible hardware. The application also features streaming responses, delivering AI-generated content token-by-token in real-time, creating a natural "typing" effect in the chat interface. Additional features include adjustable inference parameters such as temperature, top-k, top-p, and repeat penalty, giving users fine-grained control over the AI's creativity and response style.
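As a rough illustration of how these sampling knobs fit together, the Rust sketch below groups them into a single configuration struct. The `SamplingParams` name, fields, and defaults are hypothetical and not taken from Oxide-Lab's source; they only show the conventional meaning of each parameter.

```rust
/// Illustrative grouping of the sampling knobs exposed in the UI.
/// Names and defaults are hypothetical; Oxide-Lab's actual types may differ.
#[derive(Debug, Clone)]
struct SamplingParams {
    /// Higher values flatten the token distribution (more creative output).
    temperature: f64,
    /// Keep only the k most likely tokens before sampling.
    top_k: usize,
    /// Nucleus sampling: keep the smallest token set whose
    /// cumulative probability exceeds p.
    top_p: f64,
    /// Penalize tokens that already appeared, discouraging repetition loops.
    repeat_penalty: f32,
}

impl Default for SamplingParams {
    fn default() -> Self {
        Self { temperature: 0.8, top_k: 40, top_p: 0.95, repeat_penalty: 1.1 }
    }
}

fn main() {
    // Lower the temperature for more deterministic answers, keep the rest.
    let params = SamplingParams { temperature: 0.2, ..Default::default() };
    println!("{params:?}");
}
```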
The latest update introduces unified multimodal preprocessing capabilities, extending Oxide-Lab's functionality to handle both image and audio inputs. The multimodal system supports vision and audio modalities through dedicated preprocessing pipelines that convert raw media into tensors suitable for AI models. For vision processing, the application implements comprehensive image preprocessing including resizing, normalization, and augmentation with support for various configurations like ImageNet, CLIP, and DINOv2. Audio processing includes loading from various formats, resampling, loudness normalization, and mel-spectrogram generation using configurable parameters for models like Whisper, EnCodec, and MusicGen. The multimodal system also includes batch and streaming processing modes, allowing efficient handling of both individual samples and large datasets.
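The preset-driven design described above might look roughly like the following sketch, showing two of the configurations mentioned in the text. The `VisionConfig` and `AudioConfig` types are hypothetical stand-ins for whatever vision.rs and audio.rs actually define; the ImageNet normalization statistics and Whisper front-end values (16 kHz, 80 mel bins, 400-sample FFT window, 160-sample hop) are the widely published defaults for those models.

```rust
/// Hypothetical preprocessing presets mirroring the ones the text describes.
#[derive(Debug, Clone)]
struct VisionConfig {
    /// Target (width, height) after resizing.
    size: (u32, u32),
    /// Per-channel normalization: (x / 255 - mean) / std.
    mean: [f32; 3],
    std: [f32; 3],
}

impl VisionConfig {
    /// Standard ImageNet statistics, used by many classification models.
    fn imagenet() -> Self {
        Self {
            size: (224, 224),
            mean: [0.485, 0.456, 0.406],
            std: [0.229, 0.224, 0.225],
        }
    }
}

#[derive(Debug, Clone)]
struct AudioConfig {
    /// Target sample rate after resampling, in Hz.
    sample_rate: u32,
    /// Number of mel filterbank bins in the spectrogram.
    n_mels: usize,
    /// STFT window and hop sizes, in samples.
    n_fft: usize,
    hop_length: usize,
}

impl AudioConfig {
    /// Whisper's published front-end parameters: 16 kHz audio,
    /// 80 mel bins, 25 ms windows with a 10 ms hop.
    fn whisper() -> Self {
        Self { sample_rate: 16_000, n_mels: 80, n_fft: 400, hop_length: 160 }
    }
}

fn main() {
    println!("{:?}", VisionConfig::imagenet());
    println!("{:?}", AudioConfig::whisper());
}
```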
**Section sources**
- [README.md](file://README.md)
- [TECHNICAL.md](file://TECHNICAL.md)
- [src/lib/chat/stream/think_html.ts](file://src/lib/chat/stream/think_html.ts)
- [src/lib/services/huggingface.ts](file://src/lib/services/huggingface.ts)
- [src-tauri/src/core/multimodal.rs](file://src-tauri/src/core/multimodal.rs)
- [src-tauri/src/core/vision.rs](file://src-tauri/src/core/vision.rs)
- [src-tauri/src/core/audio.rs](file://src-tauri/src/core/audio.rs)
## Architecture Overview

Oxide-Lab employs a robust two-tier architecture that separates the frontend user interface from the backend inference engine, leveraging the strengths of both Svelte and Rust technologies. The frontend is built with SvelteKit, providing a responsive and modern user interface that runs as a single-page application (SPA) within the Tauri framework. This Svelte-based frontend handles all user interactions, chat rendering, and interface management. The backend, implemented in Rust, serves as the computational engine powered by the Candle framework, which provides efficient tensor operations and neural network inference capabilities. Communication between these layers occurs through Tauri's command system, where frontend actions trigger Rust functions that perform model loading, tokenization, and inference. The architecture maintains a clear separation of concerns, with the Rust backend managing the ModelState, device selection, and token streaming through components like token_output_stream. This design ensures that computationally intensive AI operations are performed efficiently in Rust, while the Svelte frontend provides a smooth and responsive user experience.
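A minimal sketch of this command/event pattern is shown below, assuming Tauri v1-style APIs and a surrounding Tauri project (the `generate_context!` macro reads the project's tauri.conf.json). The `generate` command name, the `ModelState` fields, and the `token` event are illustrative placeholders, not Oxide-Lab's actual interface.

```rust
use std::sync::Mutex;
use tauri::{State, Window};

/// Hypothetical shared state guarding the loaded model; the real ModelState
/// in core/state.rs would carry the Candle model, tokenizer, and device.
struct ModelState {
    model_path: Option<String>,
}

/// The frontend calls this via `invoke("generate", { prompt })`; tokens are
/// pushed back as window events, producing the streaming "typing" effect.
#[tauri::command]
fn generate(prompt: String, state: State<'_, Mutex<ModelState>>, window: Window) -> Result<(), String> {
    let guard = state.lock().map_err(|e| e.to_string())?;
    if guard.model_path.is_none() {
        return Err("no model loaded".into());
    }
    // Placeholder loop standing in for real token-by-token inference.
    for token in prompt.split_whitespace() {
        window
            .emit("token", token) // frontend listens with `listen("token", ...)`
            .map_err(|e| e.to_string())?;
    }
    Ok(())
}

fn main() {
    tauri::Builder::default()
        .manage(Mutex::new(ModelState { model_path: None }))
        .invoke_handler(tauri::generate_handler![generate])
        .run(tauri::generate_context!())
        .expect("error while running tauri application");
}
```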
The multimodal preprocessing system is implemented as a unified module in the Rust backend, with dedicated processors for vision and audio data. The MultimodalProcessor orchestrates the preprocessing workflow, handling both individual samples and batch processing through the MultimodalBatchProcessor. Vision processing is handled by the VisionProcessor, which implements image loading, resizing, normalization, and augmentation according to configurable parameters. Audio processing is managed by the AudioProcessor, which supports loading from various formats, resampling, loudness normalization, and mel-spectrogram generation. The system uses a unified MultimodalSample structure that can contain vision, audio, or text data with associated metadata, enabling flexible handling of different input types. The preprocessing pipeline supports both batch processing for efficiency and streaming mode for handling large datasets that may not fit in memory.
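The unified sample structure might be modeled along these lines; the enum and field names below are guesses for illustration, not the definitions in multimodal.rs.

```rust
use std::collections::HashMap;

/// Hypothetical shape of the unified sample the text describes: exactly one
/// modality payload plus free-form metadata.
#[derive(Debug)]
enum ModalityData {
    /// CHW-ordered pixel values after resizing and normalization.
    Vision { pixels: Vec<f32>, channels: usize, height: usize, width: usize },
    /// Mel-spectrogram frames: n_mels rows by n_frames columns.
    Audio { mel: Vec<f32>, n_mels: usize, n_frames: usize },
    Text { content: String },
}

#[derive(Debug)]
struct MultimodalSample {
    data: ModalityData,
    /// Source path, original dimensions, and similar bookkeeping.
    metadata: HashMap<String, String>,
}

fn main() {
    let sample = MultimodalSample {
        data: ModalityData::Text { content: "hello".into() },
        metadata: HashMap::from([("source".into(), "chat".into())]),
    };
    println!("{sample:?}");
}
```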
```mermaid
graph TB
    subgraph "Frontend (Svelte)"
        UI[User Interface]
        Parser[Incremental Parser]
        Renderer[Segment Renderer]
        UI --> Parser
        Parser --> Renderer
    end
    subgraph "Backend (Rust)"
        Tauri[Tauri Commands]
        State[ModelState]
        Device[Device Manager]
        Stream[TokenOutputStream]
        Multimodal[MultimodalProcessor]
        Vision[VisionProcessor]
        Audio[AudioProcessor]
        Tauri --> State
        State --> Device
        State --> Stream
        State --> Multimodal
        Multimodal --> Vision
        Multimodal --> Audio
    end
    Renderer --> |token events| Tauri
    Stream --> |emit tokens| Tauri
```
**Diagram sources**
- [src-tauri/src/main.rs](file://src-tauri/src/main.rs)
- [src-tauri/src/core/state.rs](file://src-tauri/src/core/state.rs)
- [src-tauri/src/core/token_output_stream.rs](file://src-tauri/src/core/token_output_stream.rs)
- [src-tauri/src/core/device.rs](file://src-tauri/src/core/device.rs)
- [src-tauri/src/core/multimodal.rs](file://src-tauri/src/core/multimodal.rs)
- [src-tauri/src/core/vision.rs](file://src-tauri/src/core/vision.rs)
- [src-tauri/src/core/audio.rs](file://src-tauri/src/core/audio.rs)
**Section sources**
- [TECHNICAL.md](file://TECHNICAL.md)
- [src-tauri/src/main.rs](file://src-tauri/src/main.rs)
- [src-tauri/src/core/multimodal.rs](file://src-tauri/src/core/multimodal.rs)
## System Context Diagram
```mermaid
graph LR
User[User] --> |Input/Output| UI[Frontend UI]
UI --> |Tauri Commands| Backend[Rust Backend]
Backend --> |Model Loading| GGUF[GGUF Models]
Backend --> |Model Loading| Safetensors[safetensors Models]
Backend --> |Hugging Face API| HF[Hugging Face Hub]
Backend --> |CUDA Operations| GPU[NVIDIA GPU]
Backend --> |CPU Operations| CPU[CPU]
Backend --> |Image Processing| Vision[Vision Models]
Backend --> |Audio Processing| Audio[Audio Models]
GGUF --> Backend
Safetensors --> Backend
HF --> Backend
GPU --> Backend
CPU --> Backend
Vision --> Backend
Audio --> Backend
Backend --> |Streaming Tokens| UI
UI --> User
```

**Diagram sources**
- [README.md](file://README.md)
- [TECHNICAL.md](file://TECHNICAL.md)
- [src/lib/services/huggingface.ts](file://src/lib/services/huggingface.ts)
## Use Cases

Oxide-Lab serves a diverse range of users with specific needs for local, private AI interactions. For developers, the application provides a sandbox environment to experiment with different AI models and parameters without exposing sensitive code or data to external servers. Researchers benefit from the ability to conduct AI experiments with complete data control, ensuring the integrity and confidentiality of their work, particularly when handling proprietary or sensitive information. Privacy-conscious users can leverage Oxide-Lab for personal tasks such as journaling, brainstorming, or creative writing, knowing that their thoughts and ideas remain entirely on their local machine. The Thinking Mode feature is particularly valuable for educational purposes, allowing students and educators to explore the reasoning processes behind AI responses. The application's Hugging Face Hub integration enables users to quickly test and compare different models, while CUDA support accelerates inference for users with compatible NVIDIA GPUs and the CPU path keeps the application usable on machines without one. Additionally, the streaming response feature creates a more natural conversational experience, making Oxide-Lab suitable for both technical and non-technical users seeking a private AI assistant.
The new multimodal capabilities significantly expand Oxide-Lab's use cases. Researchers can now analyze both visual and audio data locally without compromising privacy, making it ideal for sensitive research in fields like healthcare or social sciences. Developers can build and test multimodal applications that process images and audio without relying on cloud-based APIs, reducing costs and improving data security. Content creators can use the application to generate captions for images or transcribe and analyze audio content while maintaining complete control over their intellectual property. The unified preprocessing system with configurable parameters for different models (Whisper, CLIP, EnCodec, etc.) allows users to tailor the processing pipeline to their specific needs, whether they're working on speech recognition, image classification, or multimodal reasoning tasks. The batch and streaming processing modes support both real-time interaction and offline processing of large datasets, making Oxide-Lab versatile for various workflows.
**Section sources**
- [README.md](file://README.md)
- [TECHNICAL.md](file://TECHNICAL.md)
- [src-tauri/src/core/multimodal.rs](file://src-tauri/src/core/multimodal.rs)
- [src-tauri/src/core/vision.rs](file://src-tauri/src/core/vision.rs)
- [src-tauri/src/core/audio.rs](file://src-tauri/src/core/audio.rs)
**Referenced Files in This Document**
- [README.md](file://README.md)
- [TECHNICAL.md](file://TECHNICAL.md)
- [src-tauri/src/main.rs](file://src-tauri/src/main.rs)
- [src-tauri/src/core/state.rs](file://src-tauri/src/core/state.rs)
- [src-tauri/src/core/token_output_stream.rs](file://src-tauri/src/core/token_output_stream.rs)
- [src-tauri/src/core/device.rs](file://src-tauri/src/core/device.rs)
- [src/lib/chat/stream/think_html.ts](file://src/lib/chat/stream/think_html.ts)
- [src/lib/services/huggingface.ts](file://src/lib/services/huggingface.ts)
- [src-tauri/src/core/multimodal.rs](file://src-tauri/src/core/multimodal.rs) - Added in recent commit
- [src-tauri/src/core/vision.rs](file://src-tauri/src/core/vision.rs) - Added in recent commit
- [src-tauri/src/core/audio.rs](file://src-tauri/src/core/audio.rs) - Added in recent commit
- [src-tauri/src/core/log.rs](file://src-tauri/src/core/log.rs) - Added in recent commit