FaseehGPT is an advanced pipeline for training a GPT-style language model specifically designed for the Arabic language. This project leverages state-of-the-art NLP techniques and pre-trained Arabic tokenizers to generate and train language models, enabling the efficient processing of Arabic text.
See the [model card on Hugging Face](https://huggingface.co/alphatechlogics/FaseehGPT) for full model details and training information.
- Pre-trained Tokenizer: Utilizes `asafaya/bert-base-arabic` for robust Arabic tokenization.
- Efficient Data Preprocessing: Handles large datasets and long sequences with overlapping chunking.
- Customizable Model Architecture: Allows for easy adjustments in model dimensions, attention heads, and layers.
- Training Pipeline: Fully equipped for model training, checkpoint saving, and evaluation.
- Text Generation: Generates high-quality Arabic text from a trained model.
- Evaluation Metrics: Supports perplexity and BLEU score for model performance evaluation.
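The overlapping chunking mentioned above can be sketched as follows. This is an illustrative example, not the project's actual `utils/dataset.py` implementation: long token sequences are split into fixed-size windows that share a `stride`-token overlap, so context at chunk boundaries is not lost.

```python
# Illustrative sketch of overlapping chunking (function name and
# parameters are hypothetical, not FaseehGPT's actual API).
def chunk_with_overlap(token_ids, chunk_size=128, stride=32):
    """Yield chunks of at most `chunk_size` tokens; consecutive
    chunks share `stride` tokens of context."""
    step = chunk_size - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break
    return chunks

tokens = list(range(300))          # stand-in for tokenizer output
chunks = chunk_with_overlap(tokens, chunk_size=128, stride=32)
# 3 chunks; each consecutive pair shares 32 tokens
```

With `chunk_size=128` and `stride=32`, the window advances 96 tokens at a time, so every training example carries 32 tokens of left context from the previous chunk.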
Before running the pipeline, install the required dependencies:

```bash
pip install torch transformers datasets regex tqdm
```

The project is organized as follows:

```
FaseehGPT/
│
├── models/                  # Model architecture and attention mechanisms
│   ├── __init__.py
│   ├── attention.py         # Attention mechanisms implementation
│   └── gpt_model.py         # Core GPT model architecture
│
├── utils/                   # Utility functions
│   ├── __init__.py
│   ├── tokenizer.py         # Arabic tokenizer wrapper
│   ├── dataset.py           # Dataset classes for training/evaluation
│   └── metrics.py           # Perplexity and BLEU score implementations
│
├── training/                # Training and evaluation pipeline
│   ├── __init__.py
│   ├── trainer.py           # Main training script
│   ├── data_loader.py       # Dataset loading functions
│   └── evaluation.py        # Model evaluation functions
│
├── config/                  # Configuration files
│   └── default_config.py    # Default configuration parameters
│
├── scripts/                 # Helper scripts to run training and generation
│   ├── train.py             # Script to start training
│   └── generate.py          # Script to generate text from trained model
│
├── checkpoints/             # Model checkpoints storage
│   └── .gitkeep
│
├── output/                  # Generated text outputs
│   └── .gitkeep
│
├── requirements.txt         # Project dependencies
└── README.md                # Project documentation
```
To begin training the model, use the provided training pipeline. Set the desired configurations in the script, then run:

```bash
python scripts/train.py
```

Once the model is trained, you can use it to generate Arabic text from a prompt:

```bash
python scripts/generate.py --checkpoint_path checkpoints/best_model --prompt "كان يا مكان في قديم الزمان" --max_new_tokens 100
```

Configuration parameters can be found in `config/default_config.py`. Customize options like:
- Model dimensions (embedding size, number of attention heads, etc.)
- Batch size
- Epochs
- Training time limits
- Maximum text sequences
- Checkpoint Path: Specify the path to the model checkpoint for text generation.
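A configuration file covering the options above might look like the sketch below. All parameter names and values here are illustrative assumptions, not the actual contents of `config/default_config.py`:

```python
# Hypothetical sketch of a default configuration; names and values
# are illustrative, not FaseehGPT's actual settings.
DEFAULT_CONFIG = {
    # model dimensions
    "embed_dim": 512,        # token embedding size
    "num_heads": 8,          # attention heads (must divide embed_dim)
    "num_layers": 6,         # number of transformer blocks
    "max_seq_len": 128,      # maximum tokens per training sequence
    # training
    "batch_size": 32,
    "epochs": 3,
    "learning_rate": 3e-4,
    "max_train_hours": 6.0,  # training time limit
    # paths
    "checkpoint_path": "checkpoints/best_model",
}
```

Keeping all such knobs in one dict makes it easy to override them from a training script without editing the defaults in place.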
Evaluate the model using metrics like perplexity and BLEU score. This can be done with the functions in `training/evaluation.py`.
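As a quick reminder of what the perplexity metric measures, here is a minimal, self-contained sketch (the helper name is illustrative; the project's `utils/metrics.py` would compute this from the model's per-token loss):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# Lower is better; a uniform guess over V tokens gives perplexity V.
def perplexity(token_log_probs):
    """Compute perplexity from a list of per-token log-probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning every token probability 0.25 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 10)  # ≈ 4.0
```

Because perplexity is the exponential of the average cross-entropy loss, it can be reported directly from the evaluation loss without re-running the model.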
This project is open-source and available under the MIT License.
If you use FaseehGPT, please cite:

```bibtex
@misc{faseehgpt2025,
  title  = {FaseehGPT: An Arabic Language Model},
  author = {Rohma, Ahsan Umar},
  year   = {2025},
  url    = {https://huggingface.co/alphatechlogics/FaseehGPT}
}

@article{umar2025faseehgpt,
  title     = {FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author    = {Umar, Ahsan},
  publisher = {Engineering Archive}
}
```