FaseehGPT is an advanced pipeline for training a GPT-style language model specifically designed for the Arabic language.

alphatechlogics/FaseehGPT

FaseehGPT: Arabic SLM

FaseehGPT is a pipeline for training a GPT-style language model designed specifically for Arabic. It combines a pre-trained Arabic tokenizer (asafaya/bert-base-arabic) with overlapping-chunk preprocessing, so that large datasets and long Arabic texts can be processed efficiently.

For model training and full details, see Model Details.


Features

  • Pre-trained Tokenizer: Utilizes asafaya/bert-base-arabic for robust Arabic tokenization.
  • Efficient Data Preprocessing: Handles large datasets and long sequences with overlapping chunking.
  • Customizable Model Architecture: Allows for easy adjustments in model dimensions, attention heads, and layers.
  • Training Pipeline: Fully equipped for model training, checkpoint saving, and evaluation.
  • Text Generation: Generates Arabic text from a prompt using a trained model checkpoint.
  • Evaluation Metrics: Supports perplexity and BLEU score for model performance evaluation.
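
The overlapping chunking mentioned in the feature list can be sketched roughly as follows. This is a minimal illustration, not the code in utils/dataset.py: the function name `chunk_with_overlap` and the parameters `max_len` and `overlap` are hypothetical, and the actual pipeline may choose window sizes and handle the final window differently.

```python
from typing import List


def chunk_with_overlap(token_ids: List[int], max_len: int, overlap: int) -> List[List[int]]:
    """Split a token sequence into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens, so context is not lost
    at chunk boundaries. (Hypothetical helper, for illustration only.)
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    stride = max_len - overlap  # how far each new window advances
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        # Stop once a window has reached the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap(list(range(10)), max_len=4, overlap=2):
        print(chunk)
```

A larger overlap preserves more context across boundaries at the cost of more training windows per document.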

Installation

Before running the pipeline, install the required dependencies:

pip install torch transformers datasets regex tqdm

Directory Structure

FaseehGPT/
│
├── models/                # Model architecture and attention mechanisms
│   ├── __init__.py
│   ├── attention.py      # Attention mechanisms implementation
│   └── gpt_model.py      # Core GPT model architecture
│
├── utils/                 # Utility functions
│   ├── __init__.py
│   ├── tokenizer.py      # Arabic tokenizer wrapper
│   ├── dataset.py        # Dataset classes for training/evaluation
│   └── metrics.py        # Perplexity and BLEU score implementations
│
├── training/             # Training and evaluation pipeline
│   ├── __init__.py
│   ├── trainer.py        # Main training script
│   ├── data_loader.py    # Dataset loading functions
│   └── evaluation.py     # Model evaluation functions
│
├── config/               # Configuration files
│   └── default_config.py # Default configuration parameters
│
├── scripts/              # Helper scripts to run training and generation
│   ├── train.py          # Script to start training
│   └── generate.py       # Script to generate text from trained model
│
├── checkpoints/          # Model checkpoints storage
│   └── .gitkeep
│
├── output/               # Generated text outputs
│   └── .gitkeep
│
├── requirements.txt      # Project dependencies
└── README.md             # Project documentation

Usage

1. Train the Model

To train the model, set the desired configuration in the script, then run it:

python scripts/train.py
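
Since the pipeline advertises checkpoint saving and a configurable training time limit, the control flow of such a loop might look like the sketch below. It assumes nothing about the real training/trainer.py: `train_with_time_limit`, `run_epoch`, and `save_checkpoint` are hypothetical names, and the per-epoch work is passed in as a callable so the skeleton stays independent of any framework.

```python
import time
from typing import Callable, Optional


def train_with_time_limit(
    run_epoch: Callable[[int], float],
    epochs: int,
    max_seconds: float,
    save_checkpoint: Optional[Callable[[int, float], None]] = None,
) -> float:
    """Run up to `epochs` epochs, stopping once `max_seconds` has elapsed.

    `run_epoch(epoch)` is assumed to train for one epoch and return a
    validation loss. A checkpoint is saved whenever the loss improves.
    Returns the best (lowest) loss seen, or infinity if no epoch ran.
    """
    start = time.monotonic()
    best_loss = float("inf")
    for epoch in range(epochs):
        # Check the time budget before starting another epoch.
        if time.monotonic() - start >= max_seconds:
            break
        loss = run_epoch(epoch)
        if loss < best_loss:
            best_loss = loss
            if save_checkpoint is not None:
                save_checkpoint(epoch, loss)
    return best_loss
```

Checking the budget only between epochs keeps the sketch simple; a real trainer with long epochs would probably also check it between batches.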

2. Generate Text

Once the model is trained, you can use it to generate Arabic text from a prompt.

python scripts/generate.py --checkpoint_path checkpoints/best_model --prompt "كان يا مكان في قديم الزمان" --max_new_tokens 100
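
The flags in the command above could be parsed with argparse along these lines. Only the three flag names come from the command itself; whether each flag is required, the default of 100 for `--max_new_tokens`, and the `build_parser` helper are assumptions made for illustration, and scripts/generate.py may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the generate command."""
    parser = argparse.ArgumentParser(
        description="Generate Arabic text from a trained checkpoint."
    )
    parser.add_argument("--checkpoint_path", required=True,
                        help="Path to the saved model checkpoint.")
    parser.add_argument("--prompt", required=True,
                        help="Arabic prompt to continue.")
    # The default value here is an assumption, not taken from the project.
    parser.add_argument("--max_new_tokens", type=int, default=100,
                        help="Maximum number of tokens to generate.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.checkpoint_path, args.prompt, args.max_new_tokens)
```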

Configuration

Configuration parameters can be found in config/default_config.py. Customize options like:

  • Model dimensions (embedding size, number of attention heads, etc.)
  • Batch size
  • Epochs
  • Training time limits
  • Maximum text sequences
  • Checkpoint path (the model checkpoint to load for text generation)
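
As an illustration of how the options listed above might be grouped, here is a small dataclass sketch. Every field name and default value is a hypothetical placeholder rather than the contents of config/default_config.py. The one constraint it enforces, that the embedding size be divisible by the number of attention heads, is standard for multi-head attention.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # All names and defaults below are hypothetical placeholders.
    embed_dim: int = 512            # embedding size
    num_heads: int = 8              # attention heads per layer
    num_layers: int = 6             # transformer blocks
    batch_size: int = 16
    epochs: int = 3
    max_training_hours: float = 8.0   # training time limit
    max_texts: int = 100_000          # cap on the number of text sequences
    checkpoint_path: str = "checkpoints/best_model"

    def __post_init__(self) -> None:
        # Each head receives an equal slice of the embedding.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Per-head dimension implied by embed_dim and num_heads."""
        return self.embed_dim // self.num_heads


if __name__ == "__main__":
    cfg = TrainingConfig(embed_dim=256, num_heads=4)
    print(cfg.head_dim)
```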

Evaluation

Evaluate the model using metrics like perplexity and BLEU score. This can be done using the functions in training/evaluation.py.
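
For reference, both metrics can be computed from first principles as sketched below. This is not the code in training/evaluation.py or utils/metrics.py: the function names are hypothetical, perplexity is taken as the exponential of the mean negative log-likelihood per token, and the BLEU variant is an unsmoothed, single-reference, sentence-level form, which may differ from what the project uses.

```python
import math
from collections import Counter
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), given natural-log probabilities
    that the model assigned to each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def _ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: Sequence[str], reference: Sequence[str], max_n: int = 4) -> float:
    """Unsmoothed BLEU against one reference: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times the brevity penalty."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = _ngrams(candidate, n)
        ref_counts = _ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)
```

Lower perplexity means the model assigns higher probability to the held-out text, while BLEU measures n-gram overlap between generated and reference text, so the two are complementary rather than interchangeable.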


License

This project is open-source and available under the MIT License.

Citation

@misc{faseehgpt2025,
  title     = {FaseehGPT: An Arabic Language Model},
  author    = {Rohma, Ahsan Umar},
  year      = {2025},
  url       = {https://huggingface.co/alphatechlogics/FaseehGPT}
}
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  publisher={Engineering Archive}
}
