Vector Embeddings: Build an LLM from Scratch

Welcome back to the series on building Large Language Models (LLMs) from scratch! In this lecture, we move beyond basic tokenization to explore vector embeddings and see how models capture the semantic meaning of words.

In this session, we cover:

- The Limits of Tokenization: Why simply converting words to numbers (e.g., assigning "cat" the ID 23) isn't enough: an arbitrary ID carries no information about the word's meaning or context.
- Introduction to Vector Embeddings: How models capture the "nearness" or similarity of words by assigning each word a weight along multiple dimensions (e.g., "is it an animal?", "is it a pet?", "is it edible?").
- Embeddings in GPT Models: A historical look at how dimensionality has scaled, from GPT-1 using 768 dimensions with a vocabulary of about 40,000 tokens, up to GPT-3 using 12,288 dimensions per token.
- The Optimization Process: How training starts with random weights, makes predictions, measures the error with a loss function, and adjusts those weights, together with the rest of the model (including its self-attention layers), to reduce that loss.
- Exploring Google's Word2Vec: A hands-on demonstration using Google's pre-trained Word2Vec model, which has a vocabulary of 3 million words and phrases, each represented as a 300-dimensional vector.
- Vector Math & Word Similarity: Watch the model calculate semantic relationships, like finding that "Yen" minus "Japan" plus "India" lands closest to "Rupee," or that the similarity score between "King" and "Queen" is 65%, compared to just 22% for "King" and "man".
- Custom vs. Pre-Trained Embeddings: Why you might hit out-of-vocabulary errors (like with the name "Bill Gates") and why developing your own custom vector embeddings is essential for specialized training data.

Stay tuned for the next steps in our LLM-building journey!
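
To make the "weighted characteristics" idea and the vector arithmetic from this session concrete, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the tiny vocabularies, the hand-picked dimension values, and the helper names `cosine_similarity`, `most_similar`, and `analogy` are hypothetical and are not taken from the lecture. Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical hand-made embeddings: [is_animal, is_pet, is_edible].
# The values are invented for illustration; real models learn them from data.
PET_EMBEDDINGS = {
    "cat":   [0.9, 0.9, 0.1],
    "dog":   [0.9, 0.8, 0.1],
    "tiger": [0.9, 0.1, 0.1],
    "apple": [0.0, 0.0, 0.9],
}

# Hypothetical embeddings: [japan-ness, india-ness, is_currency, is_country].
MONEY_EMBEDDINGS = {
    "japan": [1.0, 0.0, 0.0, 1.0],
    "india": [0.0, 1.0, 0.0, 1.0],
    "yen":   [1.0, 0.0, 1.0, 0.0],
    "rupee": [0.0, 1.0, 1.0, 0.0],
    "tokyo": [1.0, 0.0, 0.0, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings):
    """Return the other vocabulary word whose vector is closest to `word`."""
    if word not in embeddings:
        # Mirrors the out-of-vocabulary problem with pre-trained embeddings.
        raise KeyError(f"'{word}' is out of vocabulary")
    target = embeddings[word]
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


print(most_similar("cat", PET_EMBEDDINGS))                 # dog
print(analogy("yen", "japan", "india", MONEY_EMBEDDINGS))  # rupee
```

With a real pre-trained model, the same two operations are typically done through a library rather than by hand. Assuming the demo uses gensim (the description only says "the Word2Vec library"), the corresponding calls would be `KeyedVectors.similarity("king", "queen")` and `KeyedVectors.most_similar(positive=["yen", "india"], negative=["japan"])`, and looking up a word the model has never seen raises a `KeyError`.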