This project implements a text generation model using a trigram-based approach. The model learns the probabilities of word sequences from input text and generates sentences by predicting each next word from the two words that precede it.
- Trigram Model: Builds a model based on word triplets (trigrams).
- Probability-Based Generation: Generates text using probabilities calculated from trigram frequencies.
- Top 4 Candidate Selection: Randomly selects the next word from the top 4 most probable candidates to add diversity.
- Random Start: The model can begin generation from a random starting sequence in the training data.
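The probability and top-4 selection features above can be sketched roughly as follows. This is an illustration rather than the notebook's code: the helper names `next_word_probabilities` and `pick_from_top_k` are hypothetical, the follower counts are made up, and choosing uniformly among the top candidates is an assumption.

```python
import heapq
import random
from collections import Counter


def next_word_probabilities(followers):
    """Turn raw follower counts for one two-word context into probabilities."""
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


def pick_from_top_k(followers, k=4, rng=random):
    """Pick uniformly at random among the k most probable next words.

    Uniform choice among the top k is an assumption; a weighted choice
    would also fit the description.
    """
    probs = next_word_probabilities(followers)
    top_k = heapq.nlargest(k, probs, key=probs.get)
    return rng.choice(top_k)


# Hypothetical counts of words seen after the context ("the", "princess").
followers = Counter({"loved": 5, "fought": 4, "was": 3, "smiled": 2, "slept": 1})

print(pick_from_top_k(followers))  # one of: loved, fought, was, smiled
```

Restricting the choice to the top 4 keeps unlikely continuations such as `slept` out of the output while still letting repeated runs differ.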
- Python 3.7+
- Libraries:
  - heapq (standard library)
  - random (standard library)
  - collections (standard library)
  - re (standard library)
- Dataset:
  - WikiText (for training large-scale models)
- Clone the repository:

      git clone https://github.com/dhruv-pharasi/Trigram_TextGeneration.git

- Ensure you have Python installed.
- Install additional libraries if required (e.g., for preprocessing).
- Provide a text corpus as a string.
- Preprocess the text (clean, tokenize, etc.) and feed it into the model.
- Build trigram and bigram frequency dictionaries:

      trigram_counts, bigram_counts = build_trigram_model(words)

- Live text generation with a random start:

      starting_bigrams = list(trigram_counts.keys())
      start_seq = " ".join(random.choice(starting_bigrams))
      generate_text(trigram_counts, bigram_counts, start_seq, length=15)

- Customize the starting sequence:

      start_seq = "once upon"
      generate_text(trigram_counts, bigram_counts, start_seq, length=15)

Example input text:
Once upon a time, there was a brave princess who fought dragons. The princess loved adventures.

Example generated text:
once upon a time there was a brave princess who loved adventures and fought dragons
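
The preprocessing and counting steps from the usage list can be sketched as follows. This is a minimal sketch, not the notebook's code: `tokenize` is a hypothetical helper, and the dictionary layout returned by `build_trigram_model` is an assumption inferred from the fact that the keys of `trigram_counts` are used as two-word starting sequences.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_trigram_model(words):
    """Count trigrams and their two-word prefixes.

    Assumed layout (not confirmed by the notebook):
      trigram_counts[(w1, w2)][w3] -> times w3 followed the pair (w1, w2)
      bigram_counts[(w1, w2)]      -> times the pair started a trigram
    """
    trigram_counts = defaultdict(Counter)
    bigram_counts = Counter()
    # Slide a window of three consecutive words over the corpus.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        trigram_counts[(w1, w2)][w3] += 1
        bigram_counts[(w1, w2)] += 1
    return trigram_counts, bigram_counts


text = ("Once upon a time, there was a brave princess who fought dragons. "
        "The princess loved adventures.")
words = tokenize(text)
trigram_counts, bigram_counts = build_trigram_model(words)

print(trigram_counts[("once", "upon")])  # Counter({'a': 1})
```

With this layout, the estimated probability of a word given its two-word context is `trigram_counts[context][word] / bigram_counts[context]`.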
- Trigram_TextGeneratorModel.ipynb: Contains the implementation of the trigram-based text generation.
- README.md: Documentation for the project.
- Add support for larger n-grams (e.g., 4-grams, 5-grams).
- Implement smoothing techniques (e.g., Laplace smoothing) for better handling of unseen sequences.
- Integrate with external APIs for training on large datasets.
- Provide a graphical interface for interactive text generation.
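
As an illustration of the smoothing item in the list above, add-one (Laplace) smoothing could look roughly like this. The function `laplace_probability` and its parameters are hypothetical, the example counts are made up, and the count dictionaries are assumed to map a two-word context to follower counts and to a total count respectively.

```python
from collections import Counter


def laplace_probability(trigram_counts, bigram_counts, context, word,
                        vocab_size, alpha=1.0):
    """Add-alpha estimate of P(word | context); alpha=1 is Laplace smoothing.

    Unseen words and unseen contexts get a small non-zero probability
    instead of zero.
    """
    seen = trigram_counts.get(context, {}).get(word, 0)
    context_total = bigram_counts.get(context, 0)
    return (seen + alpha) / (context_total + alpha * vocab_size)


# Made-up counts: "a" followed ("once", "upon") three times.
trigram_counts = {("once", "upon"): Counter({"a": 3})}
bigram_counts = Counter({("once", "upon"): 3})
vocab_size = 5  # assumed number of distinct words in the corpus

context = ("once", "upon")
print(laplace_probability(trigram_counts, bigram_counts, context, "a", vocab_size))    # 0.5
print(laplace_probability(trigram_counts, bigram_counts, context, "the", vocab_size))  # 0.125
```

A context that never occurred in training falls back to a uniform `1 / vocab_size` under this scheme.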