Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch with Python
Updated Jan 30, 2023 - Python
LLM-inspired BiLSTM pipeline for real-time, multi-label toxicity inference across adversarial discourse modalities.
A clean, educational implementation of the Byte Pair Encoding algorithm used in modern language models like GPT.
Paper: A Comparison of Different Tokenization Methods for the Georgian Language
Implementation of Deep Averaging Networks (DAN) for sentiment classification with experiments on GloVe embeddings and subword tokenization using Byte Pair Encoding (BPE).
This repository hosts our comprehensive study on text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to discussions on multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on LLMs.
BPE & Unigram Vocab Training library
A minimal Python implementation of Byte Pair Encoding (BPE) with step-by-step visualization of merge operations and vocabulary updates.
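For readers new to the topic, the merge loop that several of these repositories describe can be sketched as follows. This is a minimal illustration and is not taken from any of the listed projects: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`), the whitespace pre-tokenization, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Return a new word table with every occurrence of `pair` fused into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus (hypothetical helper)."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for step in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        # Assumed tie-break: highest count, then the lexicographically largest pair.
        best = max(counts, key=lambda p: (counts[p], p))
        print(f"step {step + 1}: merge {best} (count {counts[best]})")
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words
```

Calling `train_bpe("low low lower newest newest widest", num_merges=5)` prints each merge as it is chosen and returns the ordered merge list together with the final segmentation of each word. Production tokenizers typically work on bytes and add end-of-word markers or regex pre-tokenization, which this sketch leaves out.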