A comprehensive, production-ready Supervised Fine-Tuning (SFT) pipeline for Large Language Models using HuggingFace Transformers with distributed training support.
- Flexible Data Loading: Support for HuggingFace datasets, local files (JSON, JSONL, CSV, TXT)
- Distributed Training: Multi-GPU training with PyTorch DDP
- Memory Optimization: Gradient checkpointing, mixed precision (FP16/BF16), quantization support
- Parameter-Efficient Fine-tuning: LoRA support for resource-constrained training
- Advanced Monitoring: Wandb integration, comprehensive metrics, automatic batch size calculation
- Production Ready: Robust error handling, checkpointing, resumable training
- Modular Design: Clean, extensible codebase with clear separation of concerns
# Clone the repository
git clone <your-repo-url>
cd sft-training-pipeline
# Install dependencies
pip install -r requirements.txt
# Optional: Install development dependencies
pip install -r requirements-dev.txt
The pipeline supports multiple data formats:
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
{"instruction": "Summarize", "input": "Long text...", "output": "Summary..."}{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]}{"text": "This is a training example with input and expected output."}Create or modify the configuration file:
cp sample_config.yaml my_config.yaml
# Edit my_config.yaml according to your needs
Key configuration options:
# Model
model_name_or_path: "microsoft/DialoGPT-medium"
# Data
dataset_path: "./data"
train_file: "train.jsonl"
validation_file: "validation.jsonl"
max_seq_length: 512
# Training
per_device_train_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 5e-5
num_train_epochs: 3
num_gpus: 2 # Number of GPUs to use
# Optimization
fp16: true
gradient_checkpointing: true
# Optional: LoRA for parameter-efficient training
lora_config:
  r: 16
  lora_alpha: 32
  target_modules: ["q_proj", "v_proj"]
Start training:
python sft_main.py --config_file my_config.yaml
For distributed multi-GPU training:
# Using the launch script (recommended)
./launch_training.sh --config my_config.yaml --gpus 4
# Or using torchrun directly
torchrun --nproc_per_node=4 sft_main.py --config_file my_config.yaml
On a SLURM cluster, submit a batch job:
sbatch slurm_job.sh  # See examples/slurm_job.sh
The repository is organized as follows:
sft-training-pipeline/
├── sft_main.py # Main training script
├── config_manager.py # Configuration management
├── data_handler.py # Data loading and preprocessing
├── model_manager.py # Model and tokenizer management
├── trainer_utils.py # Custom trainer with enhanced features
├── distributed_utils.py # Distributed training utilities
├── sample_config.yaml # Sample configuration file
├── launch_training.sh # Distributed training launcher
└── requirements.txt # Python dependencies
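The modules above are designed to compose. As a purely hypothetical sketch of how an entry point might wire them together (the imported helper names below are illustrative, not the actual API of these files):

```python
# Hypothetical wiring of the pipeline modules; the helper names are illustrative only.
from config_manager import load_config                  # hypothetical helper
from data_handler import build_datasets                 # hypothetical helper
from model_manager import load_model_and_tokenizer      # hypothetical helper
from trainer_utils import build_trainer                 # hypothetical helper

def main(config_path: str) -> None:
    cfg = load_config(config_path)                       # parse the YAML configuration
    model, tokenizer = load_model_and_tokenizer(cfg)     # applies LoRA/quantization if configured
    train_ds, eval_ds = build_datasets(cfg, tokenizer)   # load and preprocess data
    trainer = build_trainer(cfg, model, tokenizer, train_ds, eval_ds)
    trainer.train()

if __name__ == "__main__":
    main("my_config.yaml")
```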
Data configuration options:
# Load from HuggingFace Hub
dataset_name: "squad"
dataset_config_name: "v2.0"
# Or load from local files
dataset_path: "./data"
train_file: "train.jsonl"
validation_file: "val.jsonl"
# Preprocessing
max_seq_length: 512
preprocessing_num_workers: 4
# Batch size and optimization
per_device_train_batch_size: 4 # Adjust based on GPU memory
gradient_accumulation_steps: 2 # Effective batch size = 4 * 2 * num_gpus
learning_rate: 5e-5
weight_decay: 0.01
# Training schedule
num_train_epochs: 3
warmup_steps: 500
lr_scheduler_type: "linear"
# Memory optimization
fp16: true # Mixed precision training
gradient_checkpointing: true  # Trade compute for memory
# Distributed training
num_gpus: 4  # Number of GPUs
ddp_backend: "nccl"  # Communication backend
lora_config:
  r: 16                 # Rank of adaptation
  lora_alpha: 32        # LoRA scaling parameter
  target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
  lora_dropout: 0.1
  bias: "none"
  task_type: "CAUSAL_LM"
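For reference, these options map directly onto the peft library's LoraConfig. A minimal sketch of wrapping a causal LM, assuming peft and transformers are installed (the base model is illustrative, and target_modules must match the module names of your architecture, e.g. c_attn/c_proj for GPT-2-style models such as DialoGPT):

```python
# Minimal LoRA sketch with the peft library (assumes peft and transformers are installed).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # illustrative base model

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # must exist in the chosen architecture
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # prints trainable vs. total parameter counts
```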
quantization_config:
  load_in_4bit: true
  bnb_4bit_quant_type: "nf4"
  bnb_4bit_compute_dtype: "float16"
  bnb_4bit_use_double_quant: true
The pipeline automatically calculates optimal batch sizes based on the following factors (a rough sketch of such a heuristic follows the list):
- GPU memory available
- Model size
- Sequence length
- Target memory utilization
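The pipeline's exact heuristic lives in its own code; the sketch below is only a rough illustration of how such an estimate can be formed from GPU memory, model size and sequence length. The constants (bytes per parameter and per activation token) are assumptions, not the pipeline's actual values:

```python
# Rough illustration of a memory-based batch-size heuristic (constants are assumptions).
import torch

def estimate_batch_size(model, seq_len: int, target_util: float = 0.85,
                        activation_bytes_per_token_per_layer: int = 48) -> int:
    """Coarse upper bound on per-device batch size for fp16 training with Adam."""
    total = torch.cuda.get_device_properties(0).total_memory   # GPU memory in bytes
    n_params = sum(p.numel() for p in model.parameters())
    # fp16 weights (2) + fp16 grads (2) + fp32 Adam states (~8) bytes per parameter
    static = n_params * 12
    budget = total * target_util - static                      # what is left for activations
    n_layers = getattr(model.config, "num_hidden_layers", 12)
    per_sample = seq_len * activation_bytes_per_token_per_layer * n_layers
    return max(1, int(budget // per_sample))
```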
The pipeline tracks the following metrics:
- Perplexity: Language model quality metric
- Token Accuracy: Token-level prediction accuracy
- Sequence Accuracy: Full sequence match accuracy
- Training Speed: Samples per second, GPU utilization
- Memory Usage: GPU memory allocation and utilization
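For reference, a simplified sketch of how perplexity and the two accuracy metrics can be derived from a causal LM's logits and labels (a stand-in for the pipeline's own metric code, not a copy of it):

```python
# Simplified sketch: perplexity, token accuracy and sequence accuracy from causal-LM outputs.
import torch
import torch.nn.functional as F

def lm_metrics(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> dict:
    # Shift so that position t predicts token t+1
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=ignore_index)
    mask = labels != ignore_index
    preds = logits.argmax(dim=-1)
    token_acc = (preds[mask] == labels[mask]).float().mean()
    seq_acc = ((preds == labels) | ~mask).all(dim=-1).float().mean()  # full-sequence match
    return {"perplexity": loss.exp().item(),
            "token_accuracy": token_acc.item(),
            "sequence_accuracy": seq_acc.item()}
```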
report_to: "wandb"
run_name: "my_sft_experiment"
wandb_project: "llm-fine-tuning"report_to: "tensorboard"
logging_dir: "./logs"resume_from_checkpoint: "./sft_output/checkpoint-1000"
save_steps: 500
save_total_limit: 3
load_best_model_at_end: true
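These options mirror HuggingFace TrainingArguments; a minimal sketch of the checkpoint-related subset (note that load_best_model_at_end additionally requires matching evaluation and save strategies):

```python
# Checkpoint-related settings expressed as HuggingFace TrainingArguments (illustrative subset).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./sft_output",
    save_steps=500,
    save_total_limit=3,   # keep only the three most recent checkpoints
)

# Resuming: pass a checkpoint path, or True to pick the latest checkpoint in output_dir.
# trainer.train(resume_from_checkpoint="./sft_output/checkpoint-1000")
```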
"instruction": "Write a haiku about programming",
"input": "",
"output": "Code flows like water\nBugs emerge from hidden depths\nDebug, then release"
}{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is a subset of AI..."}
]
}{
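Chat-format records are typically rendered into a single training string via the tokenizer's chat template. A minimal sketch, assuming a tokenizer that ships a chat template (the model name below is just an example):

```python
# Rendering a chat-format record with a tokenizer's chat template (example tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # one formatted string, ready for tokenization
```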
"input": "What is the capital of France?",
"output": "The capital of France is Paris."
}{
"text": "The quick brown fox jumps over the lazy dog."
}-
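Instruction-format records are usually flattened into a single prompt/response string before tokenization. The sketch below uses an Alpaca-style template as an assumption; the pipeline's actual template may differ:

```python
# Flattening an instruction-format record into one training string
# (Alpaca-style template shown as an assumption).
def format_instruction(example: dict) -> str:
    if example.get("input"):
        prompt = (f"### Instruction:\n{example['instruction']}\n\n"
                  f"### Input:\n{example['input']}\n\n### Response:\n")
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return prompt + example["output"]

record = {"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
print(format_instruction(record))
```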
To reduce GPU memory usage:
- Gradient Checkpointing: Trade compute for memory (gradient_checkpointing: true)
- Mixed Precision: Use FP16 or BF16 (fp16: true, or bf16: true on newer hardware)
- Quantization: 4-bit or 8-bit quantization (quantization_config with load_in_4bit: true); see the sketch after this list
- LoRA: Parameter-efficient fine-tuning (lora_config with r: 16 and target_modules: ["q_proj", "v_proj"])
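The quantization options correspond to transformers' BitsAndBytesConfig. A minimal loading sketch, assuming the bitsandbytes package and a CUDA-capable GPU (the model name is illustrative):

```python
# 4-bit NF4 loading sketch via transformers + bitsandbytes (requires a CUDA GPU).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/DialoGPT-medium",        # illustrative model
    quantization_config=bnb_config,
    device_map="auto",
)
```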
To improve training throughput:
- Batch Size: Use the largest batch size that fits in memory
- Gradient Accumulation: Simulate larger batches
- DataLoader Workers: Parallel data loading (dataloader_num_workers: 4, preprocessing_num_workers: 8)
For distributed training performance:
- Backend Selection: Use NCCL for GPU training
- Network: Use high-bandwidth interconnects (InfiniBand)
- Data Sharding: Ensure balanced data distribution across ranks (see the sketch below)
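The actual setup lives in distributed_utils.py; the stripped-down sketch below only illustrates the two pieces the tips above refer to: an NCCL process group and a DistributedSampler that gives each rank a balanced shard. Launch it with torchrun:

```python
# Stripped-down DDP setup: NCCL process group + DistributedSampler (illustrative, not the pipeline's code).
# Launch with: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")              # NCCL for GPU-to-GPU communication
local_rank = int(os.environ["LOCAL_RANK"])           # set by torchrun
torch.cuda.set_device(local_rank)

dataset = TensorDataset(torch.randn(1024, 16))       # placeholder dataset
sampler = DistributedSampler(dataset, shuffle=True)  # balanced shard per rank
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

model = torch.nn.Linear(16, 16).cuda(local_rank)     # placeholder model
model = DDP(model, device_ids=[local_rank])
```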
If you run out of GPU memory:
# Reduce batch size
per_device_train_batch_size: 2
# Enable gradient checkpointing
gradient_checkpointing: true
# Use mixed precision
fp16: true
# Consider quantization
quantization_config:
  load_in_4bit: true
If distributed training hangs or fails:
# Check CUDA devices
nvidia-smi
# Verify network connectivity
ping <other_node_ip>
# Check for hanging processes
ps aux | grep python
# Kill hanging processes
pkill -f "python.*sft_main.py"# Check file formats
If data loading fails:
# Check file formats
head -n 1 data/train.jsonl | python -m json.tool
# Validate data
python -c "
import json
with open('data/train.jsonl') as f:
    for i, line in enumerate(f):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            print(f'Invalid JSON at line {i+1}: {line[:100]}')
        if i > 10:
            break
"
To monitor resources during training:
# Monitor GPU usage
watch -n 1 nvidia-smi
# Monitor CPU and memory
htop
# Monitor network (for distributed training)
iftop
# Custom monitoring script
import wandb

# Log custom metrics from your own callbacks or training loop
wandb.log({
    "custom_metric": value,   # value, epoch and step come from your training code
    "epoch": epoch,
    "step": step,
})
Example configuration for a quick start with a small model:
model_name_or_path: "gpt2"
dataset_path: "./data"
train_file: "train.jsonl"
per_device_train_batch_size: 8
num_train_epochs: 3
learning_rate: 5e-5
Example configuration for a larger model with LoRA and memory optimizations:
model_name_or_path: "microsoft/DialoGPT-large"
per_device_train_batch_size: 2
gradient_checkpointing: true
fp16: true
lora_config:
  r: 32
  lora_alpha: 64
  target_modules: ["c_attn", "c_proj"]
Example configuration for multi-GPU training:
num_gpus: 4
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
# Effective batch size: 4 * 4 * 2 = 32
To contribute:
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make changes and add tests
- Run tests: pytest tests/
- Submit a pull request
MIT License - see LICENSE file for details.
If you use this pipeline in your research, please cite:
@software{sft_training_pipeline,
  title={SFT Training Pipeline for Large Language Models},
  author={Your Name},
  year={2024},
  url={https://github.com/your-username/sft-training-pipeline}
}
For questions and support:
- Create an issue on GitHub
- Check the documentation
- Join our Discord community
Happy Fine-tuning! 🚀