subword-tokenization

Here are 8 public repositories matching this topic...

DolbyUUU / byte_pair_encoding_BPE_subword_tokenization_implementation_python

Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch with Python

python nlp natural-language-processing tokenizer data-preprocessing data-cleaning bpe byte-pair-encoding subword-tokenization

Updated Jan 30, 2023
Python

SD7Campeon / Comment-Toxicity-Detection-and-Classification

LLM-inspired BiLSTM pipeline for real-time, multi-label toxicity inference across adversarial discourse modalities.

nlp sklearn transformer discourse-analysis multi-label-classification affective-computing keras-tensorflow text-vectorization bilstm nlp-pipeline deep-sequential-model toxicity-analysis toxicity-prediction toxicity-detection toxicity-classification llm subword-tokenization real-time-inference contextual-nlp

Updated May 15, 2025
Jupyter Notebook

moralesangel / BPE-tokenizer

A clean, educational implementation of the Byte Pair Encoding algorithm used in modern language models like GPT.

nlp machine-learning natural-language-processing deep-learning tokenizer transformers text-processing gpt tokenization bpe byte-pair-encoding large-language-models llm generative-ai subword-tokenization

Updated Nov 11, 2025
Jupyter Notebook

bmikaberidze / tokenizers-for-georgian

Paper: A Comparison of Different Tokenization Methods for the Georgian Language

nlp tokenization low-resource-languages georgian-language subword-tokenization

Updated Jan 9, 2026
Python

rashi-bhansali / subword-dan-sentiment-analysis

Implementation of Deep Averaging Networks (DAN) for sentiment classification with experiments on GloVe embeddings and subword tokenization using Byte Pair Encoding (BPE).

nlp deep-learning sentiment-analysis text-classification pytorch neural-networks dan glove-embeddings byte-pair-encoding subword-tokenization

Updated Jan 31, 2026
Python

TDRH-Undergraduate-Students / tokenization

This repository hosts our comprehensive study of text tokenization methods. It covers word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to multilingual, mathematical, and code tokenization. The study examines their efficiency, consistency, semantic preservation, and influence on LLMs.

tokenization subword-tokenization