Master transformer architecture by building your own GPT
A hands-on, beginner-friendly guide to understanding modern AI systems from first principles. Build a working language model while learning the fundamentals that power ChatGPT and Claude.
Learning path
22 bite-sized tutorials taking you from tensors to a working transformer inference engine.
- 0. Introduction: what we're building
  Available. A high-level overview of the project: what a transformer inference engine is and how this series is structured.
- 2. What is a tensor?
  Coming soon. Multi-dimensional arrays as the foundation of every model computation: shapes, strides, and indexing.
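
A rough preview of the idea as a minimal C sketch. This is not the series' code; the struct layout and names are illustrative assumptions, showing how a shape produces row-major strides and how an (i, j, k) index becomes a flat offset.

```c
#include <stdio.h>

/* Illustrative 3-D tensor view: one flat buffer plus shape and strides. */
typedef struct {
    float *data;
    int shape[3];    /* e.g. {2, 3, 4} */
    int strides[3];  /* elements skipped when that index increases by 1 */
} Tensor3;

/* Row-major strides: the last dimension is contiguous in memory. */
static void set_row_major_strides(Tensor3 *t) {
    t->strides[2] = 1;
    t->strides[1] = t->shape[2];
    t->strides[0] = t->shape[1] * t->shape[2];
}

/* Flat offset of element (i, j, k). */
static int offset(const Tensor3 *t, int i, int j, int k) {
    return i * t->strides[0] + j * t->strides[1] + k * t->strides[2];
}

int main(void) {
    float buf[2 * 3 * 4];
    Tensor3 t = { buf, {2, 3, 4}, {0, 0, 0} };
    set_row_major_strides(&t);
    printf("strides = {%d, %d, %d}, offset(1, 2, 3) = %d\n",
           t.strides[0], t.strides[1], t.strides[2], offset(&t, 1, 2, 3));
    /* prints: strides = {12, 4, 1}, offset(1, 2, 3) = 23 */
    return 0;
}
```
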
- 3. Tensor operations
  Coming soon. Element-wise math, scalar operations, and the building blocks that power every layer.
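
As a taste of what these building blocks look like, here is a small illustrative C sketch (the function names are mine, not the series') of two element-wise operations over flat buffers; every higher-level layer reduces to loops like these.

```c
#include <stddef.h>

/* Element-wise add: out[i] = a[i] + b[i]. Shapes must match, and a tensor
   of any rank is treated here as a flat array of n elements. */
static void tensor_add(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Scalar multiply in place: x[i] *= s. */
static void tensor_scale(float *x, float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}
```
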
- 4. Memory layout matters
  Coming soon. Row-major order, cache lines, and why contiguous access is ~100x faster than random access.
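
To make the claim concrete, here is an illustrative C sketch of the two traversal orders in question. Both functions compute the same sum over a row-major matrix, but the first walks memory contiguously while the second strides across rows, which is where the large gap on real hardware comes from.

```c
#include <stddef.h>

/* Sum a row-major n x n matrix two ways. Both visit every element once,
   but the access patterns differ. */

/* Cache-friendly: walks memory in order, one cache line at a time. */
static double sum_row_order(const float *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Cache-hostile: jumps n floats between consecutive reads, so almost
   every access misses the cache once n is large. */
static double sum_col_order(const float *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}
```
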
- 5. Matrix multiplication
  Coming soon. The single most important operation in deep learning: naive implementation and why it matters.
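
The naive version is short enough to preview here (illustrative C, not the series' code): three nested loops and 2·m·n·k floating-point operations.

```c
#include <stddef.h>

/* Naive matmul: C (m x n) = A (m x k) * B (k x n), all row-major. */
static void matmul(float *C, const float *A, const float *B,
                   size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}
```
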
- 6. Optimizing matmul
  Coming soon. Loop reordering, cache blocking, and going from 0.5 to 5 GFLOPS on a single CPU core.
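
As a preview of the kind of change involved, here is an illustrative C sketch of one of the named tricks, loop reordering; cache blocking builds on the same idea by tiling the loops so the working set fits in cache. This is a sketch of the technique, not the series' implementation.

```c
#include <stddef.h>
#include <string.h>

/* Same matmul with the two inner loops swapped (i-k-j order). The
   innermost loop now walks B and C contiguously, so the hardware
   prefetcher can stream them; on typical CPUs this alone is a large
   speedup over the naive i-j-k order. */
static void matmul_ikj(float *C, const float *A, const float *B,
                       size_t m, size_t k, size_t n) {
    memset(C, 0, m * n * sizeof(float));
    for (size_t i = 0; i < m; i++) {
        for (size_t p = 0; p < k; p++) {
            float a = A[i * k + p];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[p * n + j];
        }
    }
}
```
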
- 7. Batched matrix multiplication
  Coming soon. Extending matmul to handle batches, which is essential for multi-head attention.
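
A minimal illustrative C sketch of the extension: the batch dimension is just an outer loop with pointer offsets into packed buffers (names and layout are assumptions, not from the series).

```c
#include <stddef.h>

/* Batched matmul: `batch` independent products, C[b] = A[b] * B[b], with
   A[b] m x k, B[b] k x n, C[b] m x n, all row-major and packed back to
   back in one buffer per operand. */
static void batched_matmul(float *C, const float *A, const float *B,
                           size_t batch, size_t m, size_t k, size_t n) {
    for (size_t b = 0; b < batch; b++) {
        const float *Ab = A + b * m * k;
        const float *Bb = B + b * k * n;
        float *Cb = C + b * m * n;
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < n; j++) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; p++)
                    acc += Ab[i * k + p] * Bb[p * n + j];
                Cb[i * n + j] = acc;
            }
    }
}
```
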
- 8. Linear layers and GELU
  Coming soon. The fundamental building block: output = input × weight + bias, plus the activation that replaced ReLU.
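
The formula translates almost directly into code. Below is an illustrative C sketch of a per-token linear layer plus the tanh approximation of GELU that GPT-2 uses; the function names and memory layout are assumptions, not the series' API.

```c
#include <math.h>
#include <stddef.h>

/* Linear layer on one token: out (n_out) = x (n_in) * W (n_in x n_out) + b.
   W is row-major, so W[i * n_out + j] connects input i to output j. */
static void linear(float *out, const float *x, const float *W,
                   const float *b, size_t n_in, size_t n_out) {
    for (size_t j = 0; j < n_out; j++) {
        float acc = b ? b[j] : 0.0f;
        for (size_t i = 0; i < n_in; i++)
            acc += x[i] * W[i * n_out + j];
        out[j] = acc;
    }
}

/* GELU, tanh approximation as used by GPT-2:
   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). */
static void gelu(float *x, size_t n) {
    const float c = 0.7978845608f;  /* sqrt(2 / pi) */
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        x[i] = 0.5f * v * (1.0f + tanhf(c * (v + 0.044715f * v * v * v)));
    }
}
```
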
- 9. Layer normalization
  Coming soon. Stabilizing activations with mean/variance normalization and learned scale and shift parameters.
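
A minimal C sketch of that normalization over a single feature vector, with gamma and beta as the learned scale and shift (illustrative code, not taken from the series).

```c
#include <math.h>
#include <stddef.h>

/* LayerNorm over one vector of n features:
   out = (x - mean) / sqrt(var + eps) * gamma + beta. */
static void layernorm(float *out, const float *x,
                      const float *gamma, const float *beta, size_t n) {
    const float eps = 1e-5f;
    float mean = 0.0f, var = 0.0f;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (float)n;
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - mean;
        var += d * d;
    }
    var /= (float)n;
    float inv_std = 1.0f / sqrtf(var + eps);
    for (size_t i = 0; i < n; i++)
        out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
}
```
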
- 10. Softmax and numerical stability
  Coming soon. Turning raw logits into probability distributions without overflow or underflow.
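
The standard trick is to subtract the maximum logit before exponentiating; the shift cancels out in the normalization, so the result is mathematically unchanged. A minimal illustrative C sketch:

```c
#include <math.h>
#include <stddef.h>

/* Numerically stable softmax in place: shift by the max so expf() never
   overflows, exponentiate, then normalize. */
static void softmax(float *x, size_t n) {
    float max = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > max) max = x[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        x[i] = expf(x[i] - max);
        sum += x[i];
    }
    for (size_t i = 0; i < n; i++) x[i] /= sum;
}
```
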
- 11. Self-attention from scratch
  Coming soon. Queries, keys, values: the mechanism that lets tokens communicate with each other.
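
As a preview, here is an illustrative single-head, unmasked version in C: for every query, take dot products against all keys, softmax the scaled scores, and use the resulting weights to mix the values. Buffer layout and names are assumptions, not the series' code.

```c
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Single-head attention over T tokens with head size d. Q, K, V are
   T x d row-major; out is T x d.
   out[i] = sum_j softmax_j(Q[i]·K[j] / sqrt(d)) * V[j]. */
static void attention(float *out, const float *Q, const float *K,
                      const float *V, size_t T, size_t d) {
    float *scores = malloc(T * sizeof(float));
    float scale = 1.0f / sqrtf((float)d);
    for (size_t i = 0; i < T; i++) {
        /* scores[j] = Q[i] · K[j] / sqrt(d) */
        float max = -INFINITY;
        for (size_t j = 0; j < T; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < d; k++)
                s += Q[i * d + k] * K[j * d + k];
            scores[j] = s * scale;
            if (scores[j] > max) max = scores[j];
        }
        /* stable softmax over j */
        float sum = 0.0f;
        for (size_t j = 0; j < T; j++) {
            scores[j] = expf(scores[j] - max);
            sum += scores[j];
        }
        /* out[i] = weighted sum of the value rows */
        for (size_t k = 0; k < d; k++) {
            float acc = 0.0f;
            for (size_t j = 0; j < T; j++)
                acc += (scores[j] / sum) * V[j * d + k];
            out[i * d + k] = acc;
        }
    }
    free(scores);
}
```
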
- 12. Multi-head attention and causal masking
  Coming soon. Parallel attention heads and the mask that prevents looking into the future.
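
The mask itself is tiny: before the softmax, every score where the key position lies after the query position is set to negative infinity, so it contributes zero probability. An illustrative C sketch (the multi-head part runs the same computation on independent slices of the model dimension):

```c
#include <math.h>
#include <stddef.h>

/* Causal mask over a T x T score matrix (row i = query i, column j = key j):
   a query may only attend to positions j <= i, so future positions get
   -infinity and become exactly 0 after softmax. */
static void apply_causal_mask(float *scores, size_t T) {
    for (size_t i = 0; i < T; i++)
        for (size_t j = i + 1; j < T; j++)
            scores[i * T + j] = -INFINITY;
}
```
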
- 13. Feed-forward networks
  Coming soon. The two-layer MLP where the model processes the information that attention gathered.
- 14. Transformer blocks and residual connections
  Coming soon. Assembling attention + FFN with skip connections and pre-norm into a complete block.
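
The wiring is worth previewing because it is so regular. The sketch below is illustrative C that assumes the LayerNorm, attention, and MLP routines from the earlier tutorials exist with roughly these made-up signatures; the point is the pre-norm ordering and the two residual additions.

```c
#include <stddef.h>

/* Routines from earlier tutorials; the exact signatures here are assumed. */
void layernorm_seq(float *out, const float *x, size_t T, size_t d);        /* per-token LayerNorm      */
void multi_head_attention(float *out, const float *x, size_t T, size_t d); /* causal self-attention    */
void mlp(float *out, const float *x, size_t T, size_t d);                  /* linear -> GELU -> linear */

/* One pre-norm transformer block over a T x d activation buffer:
   x = x + Attention(LayerNorm(x));
   x = x + MLP(LayerNorm(x));
   The residual adds are what let dozens of blocks stack without the
   signal degrading. */
void transformer_block(float *x, float *scratch, float *normed,
                       size_t T, size_t d) {
    size_t n = T * d;
    layernorm_seq(normed, x, T, d);
    multi_head_attention(scratch, normed, T, d);
    for (size_t i = 0; i < n; i++) x[i] += scratch[i];   /* residual 1 */

    layernorm_seq(normed, x, T, d);
    mlp(scratch, normed, T, d);
    for (size_t i = 0; i < n; i++) x[i] += scratch[i];   /* residual 2 */
}
```
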
- 15. Tokenization
  Coming soon. Converting text to numbers and back: character-level encoding and the path to BPE.
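
Character-level (byte-level) encoding is simple enough to sketch in a few lines of illustrative C; BPE starts from exactly this and merges frequent pairs into new tokens.

```c
#include <stddef.h>

/* Byte-level "tokenizer": every byte of the input is its own token, so the
   vocabulary has exactly 256 entries and round-tripping is trivial. */
static size_t encode(int *ids, const char *text, size_t len) {
    for (size_t i = 0; i < len; i++)
        ids[i] = (unsigned char)text[i];   /* token id = byte value */
    return len;
}

/* text must have room for n + 1 bytes (the trailing '\0'). */
static void decode(char *text, const int *ids, size_t n) {
    for (size_t i = 0; i < n; i++)
        text[i] = (char)ids[i];
    text[n] = '\0';
}
```
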
- 16. Model weights and loading
  Coming soon. Reading pretrained parameters from binary files and placing them into the right structures.
- 17. Embeddings and the forward pass
  Coming soon. Token and position embeddings, stacking transformer blocks, and producing logits end-to-end.
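
The first step of the forward pass previews nicely: each position's activation starts as the sum of a token embedding row and a position embedding row, as in GPT-2. Illustrative C, with the table names (wte, wpe) borrowed from the usual GPT-2 convention as an assumption:

```c
#include <stddef.h>

/* Input embedding for T token ids: row t of the T x d activation is
   wte[tokens[t]] + wpe[t]. */
static void embed(float *x, const int *tokens, size_t T,
                  const float *wte,  /* vocab   x d token embedding table    */
                  const float *wpe,  /* max_seq x d position embedding table */
                  size_t d) {
    for (size_t t = 0; t < T; t++)
        for (size_t i = 0; i < d; i++)
            x[t * d + i] = wte[(size_t)tokens[t] * d + i] + wpe[t * d + i];
}
```
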
- 18. Sampling strategies
  Coming soon. Temperature, top-k, and nucleus sampling: controlling the randomness of generation.
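
Temperature is the simplest of the three and shows the shape of all of them: scale the logits, softmax, then draw from the resulting distribution; top-k and nucleus sampling additionally zero out the tail before drawing. An illustrative C sketch (using rand() for brevity; not the series' code):

```c
#include <math.h>
#include <stdlib.h>

/* Draw one token id from softmax(logits / temperature).
   temperature must be > 0; smaller values approach greedy decoding,
   larger values flatten the distribution. */
static int sample(const float *logits, size_t vocab, float temperature) {
    float *probs = malloc(vocab * sizeof(float));
    float max = logits[0];
    for (size_t i = 1; i < vocab; i++)
        if (logits[i] > max) max = logits[i];
    double sum = 0.0;
    for (size_t i = 0; i < vocab; i++) {
        probs[i] = expf((logits[i] - max) / temperature);  /* stable softmax */
        sum += probs[i];
    }
    /* inverse-CDF draw from the categorical distribution */
    double r = ((double)rand() / RAND_MAX) * sum;
    double acc = 0.0;
    int picked = (int)vocab - 1;
    for (size_t i = 0; i < vocab; i++) {
        acc += probs[i];
        if (acc >= r) { picked = (int)i; break; }
    }
    free(probs);
    return picked;
}
```
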
- 19. Autoregressive generation
  Coming soon. The generation loop that makes it a working inference engine: predict one token, append it, repeat.
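
The loop itself is only a few lines once a forward pass and a sampler exist. The sketch below is illustrative C: forward() and sample() stand in for routines from the other tutorials, and their signatures here are assumptions.

```c
#include <stddef.h>

/* Assumed from earlier tutorials: run the model on the current sequence
   and fill logits for the last position, and draw a token from logits. */
void forward(float *logits, const int *tokens, size_t n_tokens);
int  sample(const float *logits, size_t vocab, float temperature);

/* Autoregressive generation: predict one token, append it, repeat.
   tokens[] initially holds the prompt (n_prompt ids) and is filled up to
   n_total; logits[] must hold `vocab` floats. */
void generate(int *tokens, size_t n_prompt, size_t n_total,
              float *logits, size_t vocab) {
    for (size_t n = n_prompt; n < n_total; n++) {
        forward(logits, tokens, n);   /* logits from the last position predict tokens[n] */
        tokens[n] = sample(logits, vocab, 0.8f);
    }
}
```
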
- 20. KV cache
  Coming soon. The key optimization for inference: caching keys and values to avoid redundant computation.
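
The idea in miniature: keep every key and value row already computed, so each generation step only appends one new row per layer and head instead of recomputing the whole prefix. Illustrative C for a single head, with a fixed capacity stated as an assumption; none of this is the series' code.

```c
#include <math.h>
#include <stddef.h>

/* KV cache for one attention head (head size d). */
typedef struct {
    float *k;     /* capacity x d cached keys   */
    float *v;     /* capacity x d cached values */
    size_t len;   /* tokens cached so far       */
} KVCache;

/* Attend with the newest token's query against everything cached so far.
   The causal mask is implicit: the cache only ever contains the past
   (plus the token just appended). Assumes at most 4096 cached tokens. */
static void attend_with_cache(float *out, const float *q,
                              const float *k_new, const float *v_new,
                              KVCache *c, size_t d) {
    size_t t = c->len++;
    for (size_t i = 0; i < d; i++) {          /* append the new key/value row */
        c->k[t * d + i] = k_new[i];
        c->v[t * d + i] = v_new[i];
    }
    float scale = 1.0f / sqrtf((float)d), max = -INFINITY, sum = 0.0f;
    float scores[4096];
    for (size_t j = 0; j < c->len; j++) {
        float s = 0.0f;
        for (size_t i = 0; i < d; i++) s += q[i] * c->k[j * d + i];
        scores[j] = s * scale;
        if (scores[j] > max) max = scores[j];
    }
    for (size_t j = 0; j < c->len; j++) {     /* stable softmax over the cache */
        scores[j] = expf(scores[j] - max);
        sum += scores[j];
    }
    for (size_t i = 0; i < d; i++) {          /* weighted sum of cached values */
        float acc = 0.0f;
        for (size_t j = 0; j < c->len; j++) acc += (scores[j] / sum) * c->v[j * d + i];
        out[i] = acc;
    }
}
```
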
- 21. Profiling and performance
  Coming soon. Finding bottlenecks, parallelizing with threads, and pre-allocating memory for speed.
- 22. Rotary positional embeddings
  Coming soon. RoPE: encoding position directly into attention via rotation, as used in LLaMA and Mistral.
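
A minimal illustrative C sketch of the rotation, using one common pairing convention (adjacent dimensions; some implementations instead pair dimension i with i + d/2). Each pair of dimensions of a query or key vector is rotated by an angle that grows linearly with position, so relative offsets show up as phase differences in the attention dot product.

```c
#include <math.h>
#include <stddef.h>

/* Apply RoPE to one query or key vector x of head size d (d even) at
   position pos. Pair (i, i+1) is rotated by pos * 10000^(-i/d). */
static void rope(float *x, size_t d, size_t pos) {
    for (size_t i = 0; i < d; i += 2) {
        float theta = powf(10000.0f, -((float)i) / (float)d);
        float angle = (float)pos * theta;
        float c = cosf(angle), s = sinf(angle);
        float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```
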