Safely Batching Tokenization Merges
Speaker: Alexander Morgan

A batched approach to building tokenization vocabularies safely achieves a speedup of two to three orders of magnitude, depending on the target vocabulary size. This safe batching makes it possible to process billions of tokens and generate new token vocabularies in minutes on a basic laptop, without changing the end tokenization result.

When building a tokenization vocabulary for an LLM or a compression algorithm, the standard approach is to count all consecutive token pairs, merge the most common pair into a new token, then repeat the process until you reach the desired vocabulary size. With a large dataset, that is an enormous amount of work for a single token merge. I outline the three key insights that let you safely process larger and larger batches of token merges.

Building a tokenization vocabulary is not something most people do very often. However, my open-source, pure-Python solution aims to make it easier for anyone to try out new tokenization ideas. The tokenization step of LLM training is often derided as an annoyance that AI researchers only put up with because no other data representation works as well. Because of this, there are still lots of overlooked "easy wins" in this foundational step of LLM training.

I conclude my talk by showing how a batched approach to tokenization vocabulary building can be combined with other tokenization research and reduced training runs to empirically improve LLM performance through tokenization changes alone.

Python España: https://es.python.org/
BlueSky: https://bsky.app/profile/es.pycon.org
Twitter/X: https://x.com/PyConES
LinkedIn: / pycones
Instagram: / pycon_es
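
For reference, the standard approach described in the abstract (count all consecutive token pairs, merge the most common pair, repeat) can be sketched as below. This is a minimal illustration of the unbatched baseline only, not the speaker's batched implementation. The function names, the byte-level starting vocabulary of 256 tokens, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def count_pairs(tokens):
    """Count every pair of adjacent tokens in the sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_token):
    """Replace each occurrence of `pair`, scanning left to right, with `new_token`."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocab(data, vocab_size):
    """Naive vocabulary building: exactly one merge per full pass over the data.

    Assumption: tokens start as raw byte values (0..255), so new token ids
    begin at 256. Returns the learned merges and the final token sequence.
    """
    tokens = list(data)  # iterating over bytes yields ints
    merges = {}
    next_id = 256
    while next_id < vocab_size:
        counts = count_pairs(tokens)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        # Assumption: ties go to the numerically smallest pair, which keeps
        # the result deterministic.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges[best] = next_id
        tokens = merge_pair(tokens, best, next_id)
        next_id += 1
    return merges, tokens


if __name__ == "__main__":
    merges, tokens = build_vocab(b"aaabdaaabac", vocab_size=259)
    print(merges)
    print(tokens)
```

Each iteration of the loop recounts every pair and rewrites the whole sequence to produce a single new token, which is the per-merge cost the abstract refers to when the dataset is large.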

