Hugging Face NLP Course Chapter 6
Continuing with Chapter 6: The 🤗 Tokenizers library.
Theory
This is a good, dense chapter covering the theory behind tokenizers and their architecture:
the tradeoffs made during the normalization phase, followed by a tour of the three most popular subword tokenization algorithms (a quick comparison sketch follows the list below). I highly recommend going over the videos to get a good feel for the implementations.
BPE (used by GPT-2)
WordPiece (used by BERT)
Unigram (used by T5)
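To get a feel for how these algorithms differ in practice, here is a minimal sketch comparing the splits two of them produce for the same word, using pretrained checkpoints via `AutoTokenizer`. The checkpoints and the example word are my own illustrative choices, not prescribed by the chapter.

```python
# Compare how a BPE tokenizer (GPT-2) and a WordPiece tokenizer (BERT)
# split the same word.
from transformers import AutoTokenizer

for checkpoint in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tok.tokenize("tokenization"))

# gpt2              -> ['token', 'ization']
# bert-base-uncased -> ['token', '##ization']  ("##" marks a continuation)
```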
Practical Coding
The chapter also goes over how the question-answering pipeline manages contexts that are longer than the model's maximum number of tokens. (Spoiler alert) It splits the context into overlapping chunks, attaches the question to each of them, scores all of them, and picks the highest-confidence answer.
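A rough sketch of that chunking trick: a fast tokenizer can produce the overlapping chunks itself via `return_overflowing_tokens` and `stride`. The checkpoint and the `max_length`/`stride` values below are illustrative, and the context is a stand-in for a genuinely long passage.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

question = "Which library does the chapter cover?"
long_context = "a very long passage about tokenizers " * 500  # placeholder

inputs = tokenizer(
    question,
    long_context,
    max_length=384,            # hard cap on tokens per chunk
    truncation="only_second",  # only truncate the context, never the question
    stride=128,                # overlap between consecutive chunks
    return_overflowing_tokens=True,
)

print(len(inputs["input_ids"]))  # number of overlapping chunks produced
for ids in inputs["input_ids"][:2]:
    print(tokenizer.decode(ids)[:80])  # each chunk starts with the question
```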
It also shows how to write your own tokenizer pipeline from scratch using the 🤗 Tokenizers library.
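Here is a condensed sketch of that from-scratch pipeline (normalizer, pre-tokenizer, model, post-processor) for a BERT-style WordPiece tokenizer. The toy corpus and `vocab_size` are placeholders; the chapter trains on a real dataset.

```python
from tokenizers import (
    Tokenizer, models, normalizers, pre_tokenizers, processors, trainers,
)

# Model: WordPiece, with an explicit unknown token.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Normalizer: Unicode decomposition, lowercasing, accent stripping.
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

# Pre-tokenizer: split on whitespace and punctuation.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on a toy corpus (placeholder).
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(["hello tokenizers", "a tiny toy corpus"], trainer=trainer)

# Post-processor: add [CLS]/[SEP] the way BERT expects.
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

print(tokenizer.encode("hello tokenizers").tokens)
```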
My follow-along version of the code is on GitHub or below.