Bibliographic Details
Title:
A partition cover approach to tokenization
Authors:
Lim, Jia Peng; Choo, Davin; Lauw, Hady W.
Publication Year:
2025
Collection:
Computer Science
Subject Terms:
Computer Science - Computation and Language; Computer Science - Artificial Intelligence; Computer Science - Data Structures and Algorithms
Description:
Tokenization is the process of encoding strings into tokens from a fixed vocabulary of size $k$ and is widely used in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which frames tokenization as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which admits a simple $(1 - 1/e)$-approximation algorithm, GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE while achieving an objective score comparable to that of GreedWMC (which could achieve a higher score due to the relaxation).
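The abstract's relaxed problem, weighted maximum coverage, has a textbook greedy $(1 - 1/e)$-approximation: with a budget of $k$ sets, repeatedly pick the set whose uncovered elements add the most weight. The sketch below illustrates that classic algorithm only; it is not the paper's GreedTok or GreedWMC code, and all names in it are hypothetical.

```python
def greedy_weighted_max_coverage(sets, weights, k):
    """Classic greedy (1 - 1/e)-approximation for weighted maximum coverage.

    sets:    dict mapping a set name to a set of elements
    weights: dict mapping each element to a nonnegative weight
    k:       budget, i.e. number of sets that may be selected
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, elems in sets.items():
            # Marginal gain: total weight of elements not yet covered.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no set adds any new weight; stop early
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen, sum(weights[e] for e in covered)


# Toy usage: with unit weights and budget 2, the greedy pick is the
# largest set first, then whichever set adds the most new elements.
sets = {"A": {1, 2}, "B": {2, 3, 4}, "C": {4, 5}}
weights = {e: 1 for e in range(1, 6)}
chosen, total = greedy_weighted_max_coverage(sets, weights, 2)
```

In the paper's relaxation, candidate tokens play the role of sets and the weights reflect how often each token can cover parts of the corpus; the dense-pairwise details are in the paper itself.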
Document Type:
Working Paper
Access URL:
http://arxiv.org/abs/2501.06246
Accession Number:
edsarx.2501.06246
Database:
arXiv