Master transformer architecture by building your own GPT
A hands-on, beginner-friendly guide to understanding modern AI systems from first principles. Build a working language model while learning the fundamentals that power ChatGPT and Claude.
Learning path
22 bite-sized tutorials taking you from tensors to a working transformer inference engine.
- 0. Introduction: what we're building
  Available. A high-level overview of the project: what a transformer inference engine is and how this series is structured.
- 2. What is a tensor?
  Coming soon. Multi-dimensional arrays as the foundation of every model computation: shapes, strides, and indexing.
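
A rough preview of the idea as a minimal C sketch. This is not the series' code; the struct layout and names are illustrative assumptions, showing how a shape produces row-major strides and how an (i, j, k) index becomes a flat offset.

```c
#include <stdio.h>

/* Illustrative 3-D tensor view: one flat buffer plus shape and strides. */
typedef struct {
    float *data;
    int shape[3];    /* e.g. {2, 3, 4} */
    int strides[3];  /* elements skipped when that index increases by 1 */
} Tensor3;

/* Row-major strides: the last dimension is contiguous in memory. */
static void set_row_major_strides(Tensor3 *t) {
    t->strides[2] = 1;
    t->strides[1] = t->shape[2];
    t->strides[0] = t->shape[1] * t->shape[2];
}

/* Flat offset of element (i, j, k). */
static int offset(const Tensor3 *t, int i, int j, int k) {
    return i * t->strides[0] + j * t->strides[1] + k * t->strides[2];
}

int main(void) {
    float buf[2 * 3 * 4];
    Tensor3 t = { buf, {2, 3, 4}, {0, 0, 0} };
    set_row_major_strides(&t);
    printf("strides = {%d, %d, %d}, offset(1, 2, 3) = %d\n",
           t.strides[0], t.strides[1], t.strides[2], offset(&t, 1, 2, 3));
    /* prints: strides = {12, 4, 1}, offset(1, 2, 3) = 23 */
    return 0;
}
```
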
- 3. Tensor operations
  Coming soon. Element-wise math, scalar operations, and the building blocks that power every layer.
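
As a taste of what these building blocks look like, here is a small illustrative C sketch (the function names are mine, not the series') of two element-wise operations over flat buffers; every higher-level layer reduces to loops like these.

```c
#include <stddef.h>

/* Element-wise add: out[i] = a[i] + b[i]. Shapes must match, and a tensor
   of any rank is treated here as a flat array of n elements. */
static void tensor_add(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Scalar multiply in place: x[i] *= s. */
static void tensor_scale(float *x, float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}
```
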
- 4. Memory layout matters
  Coming soon. Row-major order, cache lines, and why contiguous access is ~100x faster than random access.
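
To make the claim concrete, here is an illustrative C sketch of the two traversal orders in question. Both functions compute the same sum over a row-major matrix, but the first walks memory contiguously while the second strides across rows, which is where the large gap on real hardware comes from.

```c
#include <stddef.h>

/* Sum a row-major n x n matrix two ways. Both visit every element once,
   but the access patterns differ. */

/* Cache-friendly: walks memory in order, one cache line at a time. */
static double sum_row_order(const float *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Cache-hostile: jumps n floats between consecutive reads, so almost
   every access misses the cache once n is large. */
static double sum_col_order(const float *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}
```
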
- 5. Matrix multiplication
  Coming soon. The single most important operation in deep learning: naive implementation and why it matters.
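
The naive version is short enough to preview here (illustrative C, not the series' code): three nested loops and 2·m·n·k floating-point operations.

```c
#include <stddef.h>

/* Naive matmul: C (m x n) = A (m x k) * B (k x n), all row-major. */
static void matmul(float *C, const float *A, const float *B,
                   size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}
```
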
- 6. Optimizing matmul
  Coming soon. Loop reordering, cache blocking, and going from 0.5 to 5 GFLOPS on a single CPU core.
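
As a preview of the kind of change involved, here is an illustrative C sketch of one of the named tricks, loop reordering; cache blocking builds on the same idea by tiling the loops so the working set fits in cache. This is a sketch of the technique, not the series' implementation.

```c
#include <stddef.h>
#include <string.h>

/* Same matmul with the two inner loops swapped (i-k-j order). The
   innermost loop now walks B and C contiguously, so the hardware
   prefetcher can stream them; on typical CPUs this alone is a large
   speedup over the naive i-j-k order. */
static void matmul_ikj(float *C, const float *A, const float *B,
                       size_t m, size_t k, size_t n) {
    memset(C, 0, m * n * sizeof(float));
    for (size_t i = 0; i < m; i++) {
        for (size_t p = 0; p < k; p++) {
            float a = A[i * k + p];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[p * n + j];
        }
    }
}
```
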
- 7. Batched matrix multiplication
  Coming soon. Extending matmul to handle batches, which is essential for multi-head attention.
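
A minimal illustrative C sketch of the extension: the batch dimension is just an outer loop with pointer offsets into packed buffers (names and layout are assumptions, not from the series).

```c
#include <stddef.h>

/* Batched matmul: `batch` independent products, C[b] = A[b] * B[b], with
   A[b] m x k, B[b] k x n, C[b] m x n, all row-major and packed back to
   back in one buffer per operand. */
static void batched_matmul(float *C, const float *A, const float *B,
                           size_t batch, size_t m, size_t k, size_t n) {
    for (size_t b = 0; b < batch; b++) {
        const float *Ab = A + b * m * k;
        const float *Bb = B + b * k * n;
        float *Cb = C + b * m * n;
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; p++)
                    acc += Ab[i * k + p] * Bb[p * n + j];
                Cb[i * n + j] = acc;
            }
    }
}
```
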
- 8. Linear layers and GELU
  Coming soon. The fundamental building block: output = input × weight + bias, plus the activation that replaced ReLU.
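
The formula translates almost directly into code. Below is an illustrative C sketch of a per-token linear layer plus the tanh approximation of GELU that GPT-2 uses; the function names and memory layout are assumptions, not the series' API.

```c
#include <math.h>
#include <stddef.h>

/* Linear layer on one token: out (n_out) = x (n_in) * W (n_in x n_out) + b.
   W is row-major, so W[i * n_out + j] connects input i to output j. */
static void linear(float *out, const float *x, const float *W,
                   const float *b, size_t n_in, size_t n_out) {
    for (size_t j = 0; j < n_out; j++) {
        float acc = b ? b[j] : 0.0f;
        for (size_t i = 0; i < n_in; i++)
            acc += x[i] * W[i * n_out + j];
        out[j] = acc;
    }
}

/* GELU, tanh approximation as used by GPT-2:
   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). */
static void gelu(float *x, size_t n) {
    const float c = 0.7978845608f;  /* sqrt(2 / pi) */
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        x[i] = 0.5f * v * (1.0f + tanhf(c * (v + 0.044715f * v * v * v)));
    }
}
```
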
- 9. Layer normalization
  Coming soon. Stabilizing activations with mean/variance normalization and learned scale and shift parameters.
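
A minimal C sketch of that normalization over a single feature vector, with gamma and beta as the learned scale and shift (illustrative code, not taken from the series).

```c
#include <math.h>
#include <stddef.h>

/* LayerNorm over one vector of n features:
   out = (x - mean) / sqrt(var + eps) * gamma + beta. */
static void layernorm(float *out, const float *x,
                      const float *gamma, const float *beta, size_t n) {
    const float eps = 1e-5f;
    float mean = 0.0f, var = 0.0f;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (float)n;
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - mean;
        var += d * d;
    }
    var /= (float)n;
    float inv_std = 1.0f / sqrtf(var + eps);
    for (size_t i = 0; i < n; i++)
        out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
}
```
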
- 10. Softmax and numerical stability
  Coming soon. Turning raw logits into probability distributions without overflow or underflow.
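
The standard trick is to subtract the maximum logit before exponentiating; the shift cancels out in the normalization, so the result is mathematically unchanged. A minimal illustrative C sketch:

```c
#include <math.h>
#include <stddef.h>

/* Numerically stable softmax in place: shift by the max so expf() never
   overflows, exponentiate, then normalize. */
static void softmax(float *x, size_t n) {
    float max = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > max) max = x[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        x[i] = expf(x[i] - max);
        sum += x[i];
    }
    for (size_t i = 0; i < n; i++) x[i] /= sum;
}
```
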
- 11. Self-attention from scratch
  Coming soon. Queries, keys, values: the mechanism that lets tokens communicate with each other.
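
As a preview, here is an illustrative single-head, unmasked version in C: for every query, take dot products against all keys, softmax the scaled scores, and use the resulting weights to mix the values. Buffer layout and names are assumptions, not the series' code.

```c
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Single-head attention over T tokens with head size d. Q, K, V are
   T x d row-major; out is T x d.
   out[i] = sum_j softmax_j(Q[i]·K[j] / sqrt(d)) * V[j]. */
static void attention(float *out, const float *Q, const float *K,
                      const float *V, size_t T, size_t d) {
    float *scores = malloc(T * sizeof(float));
    float scale = 1.0f / sqrtf((float)d);
    for (size_t i = 0; i < T; i++) {
        /* scores[j] = Q[i] · K[j] / sqrt(d) */
        float max = -INFINITY;
        for (size_t j = 0; j < T; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < d; k++)
                s += Q[i * d + k] * K[j * d + k];
            scores[j] = s * scale;
            if (scores[j] > max) max = scores[j];
        }
        /* stable softmax over j */
        float sum = 0.0f;
        for (size_t j = 0; j < T; j++) {
            scores[j] = expf(scores[j] - max);
            sum += scores[j];
        }
        /* out[i] = weighted sum of the value rows */
        for (size_t k = 0; k < d; k++) {
            float acc = 0.0f;
            for (size_t j = 0; j < T; j++)
                acc += (scores[j] / sum) * V[j * d + k];
            out[i * d + k] = acc;
        }
    }
    free(scores);
}
```
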
- 12. Multi-head attention and causal masking
  Coming soon. Parallel attention heads and the mask that prevents looking into the future.
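
The mask itself is tiny: before the softmax, every score where the key position lies after the query position is set to negative infinity, so it contributes zero probability. An illustrative C sketch (the multi-head part runs the same computation on independent slices of the model dimension):

```c
#include <math.h>
#include <stddef.h>

/* Causal mask over a T x T score matrix (row i = query i, column j = key j):
   a query may only attend to positions j <= i, so future positions get
   -infinity and become exactly 0 after softmax. */
static void apply_causal_mask(float *scores, size_t T) {
    for (size_t i = 0; i < T; i++)
        for (size_t j = i + 1; j < T; j++)
            scores[i * T + j] = -INFINITY;
}
```
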
- 13. Feed-forward networks
  Coming soon. The two-layer MLP where the model processes the information that attention gathered.
- 14. Transformer blocks and residual connections
  Coming soon. Assembling attention + FFN with skip connections and pre-norm into a complete block.
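
The wiring is worth previewing because it is so regular. The sketch below is illustrative C that assumes the LayerNorm, attention, and MLP routines from the earlier tutorials exist with roughly these made-up signatures; the point is the pre-norm ordering and the two residual additions.

```c
#include <stddef.h>

/* Routines from earlier tutorials; the exact signatures here are assumed. */
void layernorm_seq(float *out, const float *x, size_t T, size_t d);        /* per-token LayerNorm      */
void multi_head_attention(float *out, const float *x, size_t T, size_t d); /* causal self-attention    */
void mlp(float *out, const float *x, size_t T, size_t d);                  /* linear -> GELU -> linear */

/* One pre-norm transformer block over a T x d activation buffer:
   x = x + Attention(LayerNorm(x));
   x = x + MLP(LayerNorm(x));
   The residual adds are what let dozens of blocks stack without the
   signal degrading. */
void transformer_block(float *x, float *scratch, float *normed,
                       size_t T, size_t d) {
    size_t n = T * d;
    layernorm_seq(normed, x, T, d);
    multi_head_attention(scratch, normed, T, d);
    for (size_t i = 0; i < n; i++) x[i] += scratch[i];   /* residual 1 */

    layernorm_seq(normed, x, T, d);
    mlp(scratch, normed, T, d);
    for (size_t i = 0; i < n; i++) x[i] += scratch[i];   /* residual 2 */
}
```
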
- 15. Tokenization
  Coming soon. Converting text to numbers and back: character-level encoding and the path to BPE.
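
Character-level (byte-level) encoding is simple enough to sketch in a few lines of illustrative C; BPE starts from exactly this and merges frequent pairs into new tokens.

```c
#include <stddef.h>

/* Byte-level "tokenizer": every byte of the input is its own token, so the
   vocabulary has exactly 256 entries and round-tripping is trivial. */
static size_t encode(int *ids, const char *text, size_t len) {
    for (size_t i = 0; i < len; i++)
        ids[i] = (unsigned char)text[i];   /* token id = byte value */
    return len;
}

/* text must have room for n + 1 bytes (the trailing '\0'). */
static void decode(char *text, const int *ids, size_t n) {
    for (size_t i = 0; i < n; i++)
        text[i] = (char)ids[i];
    text[n] = '\0';
}
```
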
- 16. Model weights and loading
  Coming soon. Reading pretrained parameters from binary files and placing them into the right structures.
- 17. Embeddings and the forward pass
  Coming soon. Token and position embeddings, stacking transformer blocks, and producing logits end-to-end.
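
The first step of the forward pass previews nicely: each position's activation starts as the sum of a token embedding row and a position embedding row, as in GPT-2. Illustrative C, with the table names (wte, wpe) borrowed from the usual GPT-2 convention as an assumption:

```c
#include <stddef.h>

/* Input embedding for T token ids: row t of the T x d activation is
   wte[tokens[t]] + wpe[t]. */
static void embed(float *x, const int *tokens, size_t T,
                  const float *wte,  /* vocab   x d token embedding table    */
                  const float *wpe,  /* max_seq x d position embedding table */
                  size_t d) {
    for (size_t t = 0; t < T; t++)
        for (size_t i = 0; i < d; i++)
            x[t * d + i] = wte[(size_t)tokens[t] * d + i] + wpe[t * d + i];
}
```
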
- 18. Sampling strategies
  Coming soon. Temperature, top-k, and nucleus sampling: controlling the randomness of generation.
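
Temperature is the simplest of the three and shows the shape of all of them: scale the logits, softmax, then draw from the resulting distribution; top-k and nucleus sampling additionally zero out the tail before drawing. An illustrative C sketch (using rand() for brevity; not the series' code):

```c
#include <math.h>
#include <stdlib.h>

/* Draw one token id from softmax(logits / temperature).
   temperature must be > 0; smaller values approach greedy decoding,
   larger values flatten the distribution. */
static int sample(const float *logits, size_t vocab, float temperature) {
    float *probs = malloc(vocab * sizeof(float));
    float max = logits[0];
    for (size_t i = 1; i < vocab; i++)
        if (logits[i] > max) max = logits[i];
    double sum = 0.0;
    for (size_t i = 0; i < vocab; i++) {
        probs[i] = expf((logits[i] - max) / temperature);  /* stable softmax */
        sum += probs[i];
    }
    /* inverse-CDF draw from the categorical distribution */
    double r = ((double)rand() / RAND_MAX) * sum;
    double acc = 0.0;
    int picked = (int)vocab - 1;
    for (size_t i = 0; i < vocab; i++) {
        acc += probs[i];
        if (acc >= r) { picked = (int)i; break; }
    }
    free(probs);
    return picked;
}
```
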
- 19. Autoregressive generation
  Coming soon. The generation loop that makes it a working inference engine: predict one token, append it, repeat.
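
The loop itself is only a few lines once a forward pass and a sampler exist. The sketch below is illustrative C: forward() and sample() stand in for routines from the other tutorials, and their signatures here are assumptions.

```c
#include <stddef.h>

/* Assumed from earlier tutorials: run the model on the current sequence
   and fill logits for the last position, and draw a token from logits. */
void forward(float *logits, const int *tokens, size_t n_tokens);
int  sample(const float *logits, size_t vocab, float temperature);

/* Autoregressive generation: predict one token, append it, repeat.
   tokens[] initially holds the prompt (n_prompt ids) and is filled up to
   n_total; logits[] must hold `vocab` floats. */
void generate(int *tokens, size_t n_prompt, size_t n_total,
              float *logits, size_t vocab) {
    for (size_t n = n_prompt; n < n_total; n++) {
        forward(logits, tokens, n);   /* logits from the last position predict tokens[n] */
        tokens[n] = sample(logits, vocab, 0.8f);
    }
}
```
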
- 20. KV cache
  Coming soon. The key optimization for inference: caching keys and values to avoid redundant computation.
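
The idea in miniature: keep every key and value row already computed, so each generation step only appends one new row per layer and head instead of recomputing the whole prefix. Illustrative C for a single head, with a fixed capacity stated as an assumption; none of this is the series' code.

```c
#include <math.h>
#include <stddef.h>

/* KV cache for one attention head (head size d). */
typedef struct {
    float *k;     /* capacity x d cached keys   */
    float *v;     /* capacity x d cached values */
    size_t len;   /* tokens cached so far       */
} KVCache;

/* Attend with the newest token's query against everything cached so far.
   The causal mask is implicit: the cache only ever contains the past
   (plus the token just appended). Assumes at most 4096 cached tokens. */
static void attend_with_cache(float *out, const float *q,
                              const float *k_new, const float *v_new,
                              KVCache *c, size_t d) {
    size_t t = c->len++;
    for (size_t i = 0; i < d; i++) {          /* append the new key/value row */
        c->k[t * d + i] = k_new[i];
        c->v[t * d + i] = v_new[i];
    }
    float scale = 1.0f / sqrtf((float)d), max = -INFINITY, sum = 0.0f;
    float scores[4096];
    for (size_t j = 0; j < c->len; j++) {
        float s = 0.0f;
        for (size_t i = 0; i < d; i++) s += q[i] * c->k[j * d + i];
        scores[j] = s * scale;
        if (scores[j] > max) max = scores[j];
    }
    for (size_t j = 0; j < c->len; j++) {     /* stable softmax over the cache */
        scores[j] = expf(scores[j] - max);
        sum += scores[j];
    }
    for (size_t i = 0; i < d; i++) {          /* weighted sum of cached values */
        float acc = 0.0f;
        for (size_t j = 0; j < c->len; j++) acc += (scores[j] / sum) * c->v[j * d + i];
        out[i] = acc;
    }
}
```
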
- 21. Profiling and performance
  Coming soon. Finding bottlenecks, parallelizing with threads, and pre-allocating memory for speed.
- 22. Rotary positional embeddings
  Coming soon. RoPE: encoding position directly into attention via rotation, as used in LLaMA and Mistral.
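
A minimal illustrative C sketch of the rotation, using one common pairing convention (adjacent dimensions; some implementations instead pair dimension i with i + d/2). Each pair of dimensions of a query or key vector is rotated by an angle that grows linearly with position, so relative offsets show up as phase differences in the attention dot product.

```c
#include <math.h>
#include <stddef.h>

/* Apply RoPE to one query or key vector x of head size d (d even) at
   position pos. Pair (i, i+1) is rotated by pos * 10000^(-i/d). */
static void rope(float *x, size_t d, size_t pos) {
    for (size_t i = 0; i < d; i += 2) {
        float theta = powf(10000.0f, -((float)i) / (float)d);
        float angle = (float)pos * theta;
        float c = cosf(angle), s = sinf(angle);
        float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```
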