
iMessage Chat Bot Fine-Tuning

Train a personalized chatbot on your own iMessage conversations using Qwen models and MLX.

πŸš€ Features

  • Extract and parse iMessage conversations from macOS
  • Fine-tune Qwen models on your chat history
  • Interactive chat interface with conversation memory
  • Support for multiple training checkpoints
  • Clean, conversational responses

πŸ“‹ Prerequisites

  • macOS (for iMessage database access)
  • Python 3.12+
  • Apple Silicon Mac (for MLX framework)
  • uv package manager (installed in step 1 below)

πŸ› οΈ Setup

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Clone the Repository

git clone <your-repo-url>
cd imessage-finetuning

3. Copy Your iMessage Database

Place chat.db in the chat_db_goes_here/ folder:

sudo cp ~/Library/Messages/chat.db ./chat_db_goes_here/

⚠️ Note: You may need to grant Terminal "Full Disk Access" in System Preferences β†’ Security & Privacy β†’ Privacy β†’ Full Disk Access.

4. Run the Commands

That's it! Now just run these three commands:

# 1. Parse your messages (creates training data)
uv run python src/data_collection/parser.py

# 2. Train the model (interactive)
uv run python src/train.py
# β†’ Shows downloaded models OR enter any HuggingFace model name
# β†’ Example: "Qwen/Qwen3-1.7B" or "microsoft/Phi-3-mini-4k-instruct"

# 3. Chat with your bot (interactive)
uv run python src/chat.py
# β†’ Select from your trained models & checkpoints

Training will show you all downloaded models, but you can also type in ANY HuggingFace model name to download and train it automatically!

Chat will show all your trained models and let you pick which checkpoint to use.


Alternative: Manual Training Command

If you prefer to run the training command directly without the interactive menu:

uv run python -m mlx_lm.lora \
  --model Qwen/Qwen3-1.7B \
  --train \
  --data data_set/ \
  --iters 500 \
  --adapter-path adapters/Qwen_Qwen3-1_7B \
  --batch-size 2 \
  --max-seq-length 2048

Replace the model name and adapter path as needed. The interactive train.py script does this for you automatically!



🎯 How It Works

Training - Choose or Add Models

When you run train.py, it shows all downloaded models AND lets you enter new ones:

πŸ€– Available Downloaded Models:

  [1] Qwen/Qwen2.5-1.5B-Instruct
      (medium: iters=500, batch=2, seq=2048)
  [2] Qwen/Qwen3-1.7B
      (medium: iters=500, batch=2, seq=2048)
  [3] Qwen/Qwen3-4B-Instruct-2507
      (large: iters=500, batch=1, seq=2048)

  [4] Enter model name manually

Select a model (1-4): 4
Enter model name (e.g., Qwen/Qwen3-1.7B): microsoft/Phi-3-mini-4k-instruct

Two ways to add models:

  1. Pre-download: huggingface-cli download MODEL_NAME (faster)
  2. Type it in: MLX will download automatically when training starts

The script automatically determines optimal training settings based on model size (small/medium/large/xlarge).
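The size-bucketing heuristic isn't spelled out above, but one plausible approach is to parse the parameter count out of the model name. A minimal sketch (model_size_category is a hypothetical helper; the real logic lives in src/train.py and may differ):

```python
import re

def model_size_category(name: str) -> str:
    """Guess a size bucket from the parameter count embedded in a
    model name, e.g. "Qwen/Qwen3-1.7B" -> "medium".

    Hypothetical helper; src/train.py's actual heuristic may differ.
    """
    m = re.search(r"(\d+(?:\.\d+)?)\s*[bB]\b", name)
    if not m:
        return "medium"  # no size in the name: fall back to a safe default
    params = float(m.group(1))  # billions of parameters
    if params < 1:
        return "small"
    if params < 3:
        return "medium"
    if params < 8:
        return "large"
    return "xlarge"
```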

Each model's adapters are saved in a separate folder: adapters/Qwen_Qwen3-1_7B/, etc.

Inference

When you run chat.py, it automatically finds all trained models:

πŸ€– Available Trained Models:

  [1] Qwen/Qwen3-1.7B
      (6 checkpoint(s) in Qwen_Qwen3-1_7B)
  [2] Qwen/Qwen3-4B-Instruct-2507
      (6 checkpoint(s) in Qwen_Qwen3-4B-Instruct-2507)
  [3] Use base model (no fine-tuning)

Select a model (1-3):

Then you can choose which checkpoint (100, 200, 300, etc.) to use!


πŸ”§ Advanced Configuration (Optional)

Find Your Specific Chat ID

By default, the parser uses CHAT_ID = 3. That value was simply the chat originally targeted during development; chat IDs will be completely different on your machine. To find the ID of the conversation you want to train on:

# Open the database
sqlite3 chat_db_goes_here/chat.db

-- List all chats (run at the sqlite> prompt)
SELECT ROWID, chat_identifier, display_name FROM chat;

-- Exit
.quit

Then edit src/data_collection/parser.py and change CHAT_ID to your desired chat.

Add More Models

To use any HuggingFace model:

# Download a model (optional - MLX will auto-download)
huggingface-cli download Qwen/Qwen2.5-3B-Instruct

# Or just enter the model name when prompted in train.py

The script will automatically detect it!

Customize Training Configs

Edit src/train.py to adjust default training settings:

# Size-based defaults
DEFAULT_CONFIGS = {
    "small": {"iters": 500, "batch_size": 2, "max_seq": 2048},
    "medium": {"iters": 500, "batch_size": 2, "max_seq": 2048},
    "large": {"iters": 500, "batch_size": 1, "max_seq": 2048},
    "xlarge": {"iters": 500, "batch_size": 1, "max_seq": 1024},
}

# Model-specific overrides
MODEL_CONFIGS = {
    "your-model/name": {"iters": 300, "batch_size": 1, "max_seq": 1024},
}

The script automatically categorizes models by size based on their name.

Customize Parser

In src/data_collection/parser.py:

  • CHUNK_SIZE: Messages per training example (default: 50)
  • TRAIN_RATIO: Train/validation split (default: 0.9)
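Together, these two settings amount to: group messages into fixed-size chunks, then split the chunks into train and validation sets. A minimal sketch of that logic (split_chunks is a hypothetical name; parser.py's real implementation may differ):

```python
def split_chunks(messages, chunk_size=50, train_ratio=0.9):
    """Group messages into fixed-size chunks, then split the chunks
    into train/validation sets.

    Sketch of what CHUNK_SIZE and TRAIN_RATIO control in parser.py;
    the real implementation may differ.
    """
    chunks = [messages[i:i + chunk_size]
              for i in range(0, len(messages), chunk_size)]
    cut = int(len(chunks) * train_ratio)  # e.g. 0.9 -> 90% train
    return chunks[:cut], chunks[cut:]
```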

πŸ“ Project Structure

finetuning_imessage_chats/
β”œβ”€β”€ README.md              # You are here!
β”œβ”€β”€ pyproject.toml         # Dependencies
β”œβ”€β”€ chat_db_goes_here/     # Place your chat.db here
β”‚   └── .gitkeep
β”œβ”€β”€ data_set/              # Generated training data (gitignored)
β”‚   β”œβ”€β”€ train.jsonl
β”‚   └── valid.jsonl
β”œβ”€β”€ adapters/              # Trained model weights (gitignored)
β”‚   β”œβ”€β”€ Qwen_Qwen3-1_7B/  # Each model gets its own folder
β”‚   β”‚   β”œβ”€β”€ adapter_config.json
β”‚   β”‚   β”œβ”€β”€ 0000100_adapters.safetensors
β”‚   β”‚   └── ...
β”‚   └── Qwen_Qwen3-4B-Instruct-2507/
β”‚       └── ...
└── src/                   # All code lives here
    β”œβ”€β”€ train.py           # Interactive training script
    β”œβ”€β”€ chat.py            # Interactive chat interface
    └── data_collection/   # Database parsing
        └── parser.py

βš™οΈ Configuration

Chat Bot Settings (src/chat.py)

MAX_HISTORY = 15  # Number of conversation turns to remember
SYSTEM_PROMPT = """..."""  # Customize the bot's personality
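One simple way MAX_HISTORY could be enforced is a bounded deque that silently drops the oldest messages as new turns arrive. This is an illustrative sketch, not necessarily how chat.py implements it:

```python
from collections import deque

MAX_HISTORY = 15  # conversation turns to remember (matches the setting above)

# Each turn contributes two messages (user + assistant), so bound at 2x.
history = deque(maxlen=MAX_HISTORY * 2)

def add_turn(user_msg: str, bot_msg: str) -> None:
    """Append one exchange; the deque drops the oldest messages itself."""
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": bot_msg})
```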

Parser Settings (src/data_collection/parser.py)

CHAT_ID = 3           # Your iMessage chat ID
CHUNK_SIZE = 50       # Messages per training example
TRAIN_RATIO = 0.9     # 90% train, 10% validation

πŸ› Troubleshooting

"Database not found" error

Make sure you've copied chat.db to chat_db_goes_here/:

ls -la chat_db_goes_here/chat.db

"Permission denied" when accessing chat.db

Grant Terminal "Full Disk Access" in System Preferences β†’ Security & Privacy β†’ Privacy β†’ Full Disk Access.

Model runs out of memory

Reduce --batch-size to 1 or use a smaller model:

--model Qwen/Qwen3-0.6B

Bot generates weird responses

  • Try different checkpoints (earlier ones might be better)
  • Lower the temperature in src/chat.py: sampler = make_sampler(temp=0.4)
  • Increase training iterations for more fine-tuning

πŸ“ Tips

  1. Start small: Begin with 500 iterations and test. Increase if needed.
  2. Monitor training: Watch the loss values decrease during training.
  3. Try different checkpoints: Earlier checkpoints (100-300) sometimes perform better than later ones.
  4. Adjust temperature: Lower = more deterministic, higher = more creative (range: 0.1-1.5).
  5. Clean your data: The parser filters out reactions and system messages, but you may want to customize the filters.

πŸ”’ Privacy

⚠️ Important: Your chat.db contains all your personal messages. This repository's .gitignore is configured to exclude:

  • Database files (chat_db_goes_here/*.db)
  • Training data (data_set/*.jsonl)
  • Model weights (adapters/*.safetensors)

Never commit these files to a public repository!

πŸ“„ License

MIT License - Feel free to use and modify as needed.

πŸ™ Acknowledgments

  • MLX - Apple's ML framework
  • MLX-LM - Language model examples
  • Qwen - Base models from Alibaba Cloud
