FaseehGPT is an advanced pipeline for training a GPT-style language model specifically designed for the Arabic language. This project leverages state-of-the-art NLP techniques and pre-trained Arabic tokenizers to generate and train language models, enabling the efficient processing of Arabic text.
See the [model card on Hugging Face](https://huggingface.co/alphatechlogics/FaseehGPT) for full model details and training information.
- Pre-trained Tokenizer: Utilizes `asafaya/bert-base-arabic` for robust Arabic tokenization.
- Efficient Data Preprocessing: Handles large datasets and long sequences with overlapping chunking.
- Customizable Model Architecture: Allows for easy adjustments in model dimensions, attention heads, and layers.
- Training Pipeline: Fully equipped for model training, checkpoint saving, and evaluation.
- Text Generation: Generates high-quality Arabic text from a trained model.
- Evaluation Metrics: Supports perplexity and BLEU score for model performance evaluation.
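The overlapping chunking mentioned above can be sketched as follows. This is an illustrative example, not the project's actual `utils/dataset.py` implementation: long token sequences are split into fixed-size windows that share a `stride`-token overlap, so context at chunk boundaries is not lost.

```python
# Illustrative sketch of overlapping chunking (function name and
# parameters are hypothetical, not FaseehGPT's actual API).
def chunk_with_overlap(token_ids, chunk_size=128, stride=32):
    """Yield chunks of at most `chunk_size` tokens; consecutive
    chunks share `stride` tokens of context."""
    step = chunk_size - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break
    return chunks

tokens = list(range(300))          # stand-in for tokenizer output
chunks = chunk_with_overlap(tokens, chunk_size=128, stride=32)
# 3 chunks; each consecutive pair shares 32 tokens
```

With `chunk_size=128` and `stride=32`, the window advances 96 tokens at a time, so every training example carries 32 tokens of left context from the previous chunk.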
Before running the pipeline, install the required dependencies:

```bash
pip install torch transformers datasets regex tqdm
```

The project is organized as follows:

```
FaseehGPT/
│
├── models/                  # Model architecture and attention mechanisms
│   ├── __init__.py
│   ├── attention.py         # Attention mechanisms implementation
│   └── gpt_model.py         # Core GPT model architecture
│
├── utils/                   # Utility functions
│   ├── __init__.py
│   ├── tokenizer.py         # Arabic tokenizer wrapper
│   ├── dataset.py           # Dataset classes for training/evaluation
│   └── metrics.py           # Perplexity and BLEU score implementations
│
├── training/                # Training and evaluation pipeline
│   ├── __init__.py
│   ├── trainer.py           # Main training script
│   ├── data_loader.py       # Dataset loading functions
│   └── evaluation.py        # Model evaluation functions
│
├── config/                  # Configuration files
│   └── default_config.py    # Default configuration parameters
│
├── scripts/                 # Helper scripts to run training and generation
│   ├── train.py             # Script to start training
│   └── generate.py          # Script to generate text from trained model
│
├── checkpoints/             # Model checkpoints storage
│   └── .gitkeep
│
├── output/                  # Generated text outputs
│   └── .gitkeep
│
├── requirements.txt         # Project dependencies
└── README.md                # Project documentation
```
To begin training the model, use the provided training pipeline. Set the desired configurations in the script, then run:

```bash
python scripts/train.py
```

Once the model is trained, you can use it to generate Arabic text from a prompt:

```bash
python scripts/generate.py --checkpoint_path checkpoints/best_model --prompt "كان يا مكان في قديم الزمان" --max_new_tokens 100
```

Configuration parameters can be found in `config/default_config.py`. Customize options like:
- Model dimensions (embedding size, number of attention heads, etc.)
- Batch size
- Epochs
- Training time limits
- Maximum text sequences
- Checkpoint Path: Specify the path to the model checkpoint for text generation.
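A configuration file covering the options above might look like the sketch below. All parameter names and values here are illustrative assumptions, not the actual contents of `config/default_config.py`:

```python
# Hypothetical sketch of a default configuration; names and values
# are illustrative, not FaseehGPT's actual settings.
DEFAULT_CONFIG = {
    # model dimensions
    "embed_dim": 512,        # token embedding size
    "num_heads": 8,          # attention heads (must divide embed_dim)
    "num_layers": 6,         # number of transformer blocks
    "max_seq_len": 128,      # maximum tokens per training sequence
    # training
    "batch_size": 32,
    "epochs": 3,
    "learning_rate": 3e-4,
    "max_train_hours": 6.0,  # training time limit
    # paths
    "checkpoint_path": "checkpoints/best_model",
}
```

Keeping all such knobs in one dict makes it easy to override them from a training script without editing the defaults in place.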
Evaluate the model using metrics like perplexity and BLEU score. This can be done with the functions in `training/evaluation.py`.
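As a quick reminder of what the perplexity metric measures, here is a minimal, self-contained sketch (the helper name is illustrative; the project's `utils/metrics.py` would compute this from the model's per-token loss):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# Lower is better; a uniform guess over V tokens gives perplexity V.
def perplexity(token_log_probs):
    """Compute perplexity from a list of per-token log-probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning every token probability 0.25 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 10)  # ≈ 4.0
```

Because perplexity is the exponential of the average cross-entropy loss, it can be reported directly from the evaluation loss without re-running the model.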
This project is open-source and available under the MIT License.
If you use FaseehGPT, please cite:

```bibtex
@misc{faseehgpt2025,
  title  = {FaseehGPT: An Arabic Language Model},
  author = {Rohma, Ahsan Umar},
  year   = {2025},
  url    = {https://huggingface.co/alphatechlogics/FaseehGPT}
}

@article{umar2025faseehgpt,
  title     = {FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author    = {Umar, Ahsan},
  publisher = {Engineering Archive}
}
```