Open Source Learning Path

Master the transformer architecture by building your own GPT

A hands-on, beginner-friendly guide to understanding modern AI systems from first principles. Build a working language model while learning the fundamentals that power ChatGPT and Claude.

Learning path

22 bite-sized tutorials taking you from tensors to a working transformer inference engine.

  0. Introduction: what we're building

    Available

    A high-level overview of the project — what a transformer inference engine is and how this series is structured.

  2. What is a tensor?

    Coming soon

    Multi-dimensional arrays as the foundation of every model computation — shapes, strides, and indexing.
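
    As a preview, here is a minimal sketch in C (the series doesn't specify an implementation language, and these names are illustrative, not its actual API): a tensor is just a flat buffer plus shape and stride metadata.

      #include <stddef.h>

      // A tensor: one flat buffer plus per-dimension shape and stride metadata.
      typedef struct {
          float  *data;
          size_t  ndim;
          size_t  shape[4];    // e.g. {2, 3} for a 2x3 matrix
          size_t  strides[4];  // elements to skip per step along each dimension
      } Tensor;

      // Strides turn a multi-dimensional index into a flat offset.
      size_t tensor_offset(const Tensor *t, const size_t *idx) {
          size_t off = 0;
          for (size_t d = 0; d < t->ndim; d++)
              off += idx[d] * t->strides[d];
          return off;
      }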

  3. Tensor operations

    Coming soon

    Element-wise math, scalar operations, and the building blocks that power every layer.
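
    The pattern, sketched in illustrative C: every element-wise op is a single loop over the flat buffer.

      #include <stddef.h>

      // Element-wise add: out[i] = a[i] + b[i] over the whole flat buffer.
      void tensor_add(const float *a, const float *b, float *out, size_t n) {
          for (size_t i = 0; i < n; i++)
              out[i] = a[i] + b[i];
      }

      // Scalar ops follow the same single-loop shape.
      void tensor_scale(float *x, float s, size_t n) {
          for (size_t i = 0; i < n; i++)
              x[i] *= s;
      }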

  4. Memory layout matters

    Coming soon

    Row-major order, cache lines, and why contiguous access is ~100x faster than random access.
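
    The core idea in one illustrative C sketch: row-major means element (i, j) of an R x C matrix lives at offset i*C + j, so keeping j in the inner loop touches consecutive addresses.

      #include <stddef.h>

      // With j innermost, memory is walked sequentially and the CPU streams
      // whole cache lines; swapping the loops would jump C floats between
      // consecutive accesses and miss the cache almost every time.
      void sum_rows(const float *m, size_t R, size_t C, float *row_sums) {
          for (size_t i = 0; i < R; i++) {
              float s = 0.0f;
              for (size_t j = 0; j < C; j++)
                  s += m[i * C + j];  // contiguous access
              row_sums[i] = s;
          }
      }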

  5. Matrix multiplication

    Coming soon

    The single most important operation in deep learning — naive implementation and why it matters.
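
    The naive version is three nested loops; an illustrative C sketch:

      #include <stddef.h>

      // C[i][j] = sum over k of A[i][k] * B[k][j].
      // A is M x K, B is K x N, C is M x N, all row-major.
      void matmul_naive(const float *A, const float *B, float *C,
                        size_t M, size_t K, size_t N) {
          for (size_t i = 0; i < M; i++)
              for (size_t j = 0; j < N; j++) {
                  float acc = 0.0f;
                  for (size_t k = 0; k < K; k++)
                      acc += A[i * K + k] * B[k * N + j];  // strided walk down B
                  C[i * N + j] = acc;
              }
      }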

  6. Optimizing matmul

    Coming soon

    Loop reordering, cache blocking, and going from 0.5 to 5 GFLOPS on a single CPU core.
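
    One of the simplest wins, sketched in C (illustrative; actual speedups depend on the machine): reorder the loops to i-k-j so the innermost loop walks B and C contiguously instead of striding down B's columns.

      #include <stddef.h>

      // Same arithmetic as the naive version, cache-friendlier access pattern.
      void matmul_ikj(const float *A, const float *B, float *C,
                      size_t M, size_t K, size_t N) {
          for (size_t i = 0; i < M * N; i++)
              C[i] = 0.0f;                     // clear C before accumulating
          for (size_t i = 0; i < M; i++)
              for (size_t k = 0; k < K; k++) {
                  float a = A[i * K + k];      // reused across the whole j loop
                  for (size_t j = 0; j < N; j++)
                      C[i * N + j] += a * B[k * N + j];  // contiguous in B and C
              }
      }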

  7. Batched matrix multiplication

    Coming soon

    Extending matmul to handle batches — essential for multi-head attention.
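
    A sketch of the extension (illustrative C, reusing the naive matmul from the earlier sketch): each batch entry is an independent matrix product at a fixed offset into the buffers.

      #include <stddef.h>

      void matmul_naive(const float *A, const float *B, float *C,
                        size_t M, size_t K, size_t N);  // from the matmul sketch

      // One independent M x K @ K x N product per batch entry.
      void matmul_batched(const float *A, const float *B, float *C,
                          size_t batch, size_t M, size_t K, size_t N) {
          for (size_t b = 0; b < batch; b++)
              matmul_naive(A + b * M * K, B + b * K * N, C + b * M * N, M, K, N);
      }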

  8. Linear layers and GELU

    Coming soon

    The fundamental building block: output = input × weight + bias, plus the activation that replaced ReLU.
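
    Both pieces fit in a short illustrative C sketch; the GELU shown here is the tanh approximation used by GPT-2.

      #include <stddef.h>
      #include <math.h>

      // GELU, tanh approximation: 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3))).
      float gelu(float x) {
          const float k = 0.7978845608f;  // sqrt(2/pi)
          return 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
      }

      // Linear layer for one input vector: out = in @ W + b.
      void linear(const float *in, const float *W, const float *b,
                  float *out, size_t in_dim, size_t out_dim) {
          for (size_t o = 0; o < out_dim; o++) {
              float acc = b[o];
              for (size_t i = 0; i < in_dim; i++)
                  acc += in[i] * W[i * out_dim + o];  // W is [in_dim][out_dim]
              out[o] = acc;
          }
      }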

  9. Layer normalization

    Coming soon

    Stabilizing activations with mean/variance normalization and learned scale-shift parameters.
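
    In C it is only a few lines (illustrative sketch; the epsilon is the common 1e-5 default):

      #include <stddef.h>
      #include <math.h>

      // Normalize x to zero mean / unit variance, then apply the learned
      // scale (gamma) and shift (beta) parameters.
      void layernorm(const float *x, const float *gamma, const float *beta,
                     float *out, size_t n) {
          float mean = 0.0f, var = 0.0f;
          for (size_t i = 0; i < n; i++) mean += x[i];
          mean /= (float)n;
          for (size_t i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
          var /= (float)n;
          float inv_std = 1.0f / sqrtf(var + 1e-5f);  // epsilon avoids 1/0
          for (size_t i = 0; i < n; i++)
              out[i] = gamma[i] * (x[i] - mean) * inv_std + beta[i];
      }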

  10. Softmax and numerical stability

    Coming soon

    Turning raw logits into probability distributions without overflow or underflow.
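
    The trick is subtracting the maximum logit before exponentiating: the result is mathematically identical, but exp can no longer overflow. An illustrative C sketch:

      #include <stddef.h>
      #include <math.h>

      void softmax(float *x, size_t n) {
          float max = x[0];
          for (size_t i = 1; i < n; i++)
              if (x[i] > max) max = x[i];
          float sum = 0.0f;
          for (size_t i = 0; i < n; i++) {
              x[i] = expf(x[i] - max);  // largest exponent is exp(0) = 1
              sum += x[i];
          }
          for (size_t i = 0; i < n; i++)
              x[i] /= sum;
      }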

  11. Self-attention from scratch

    Coming soon

    Queries, keys, values — the mechanism that lets tokens communicate with each other.
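
    For a single query vector against T key/value rows, the whole mechanism is: score each key, softmax the scores, take the weighted sum of values. Illustrative C, reusing the softmax sketch above:

      #include <stddef.h>
      #include <math.h>

      void softmax(float *x, size_t n);  // from the previous sketch

      // q: [d], K and V: [T][d] row-major, scores: scratch of length T, out: [d].
      void attend(const float *q, const float *K, const float *V,
                  float *scores, float *out, size_t T, size_t d) {
          for (size_t t = 0; t < T; t++) {
              float s = 0.0f;
              for (size_t i = 0; i < d; i++)
                  s += q[i] * K[t * d + i];
              scores[t] = s / sqrtf((float)d);  // scaled dot product
          }
          softmax(scores, T);
          for (size_t i = 0; i < d; i++) out[i] = 0.0f;
          for (size_t t = 0; t < T; t++)
              for (size_t i = 0; i < d; i++)
                  out[i] += scores[t] * V[t * d + i];  // weighted sum of values
      }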

  12. Multi-head attention and causal masking

    Coming soon

    Parallel attention heads and the mask that prevents looking into the future.
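
    The mask itself is tiny (illustrative C): before the softmax, set every score where the key position is ahead of the query position to negative infinity, which becomes a weight of exactly zero.

      #include <stddef.h>
      #include <math.h>  // INFINITY

      // scores is a T x T matrix; row i holds query i's scores over all keys.
      void causal_mask(float *scores, size_t T) {
          for (size_t i = 0; i < T; i++)
              for (size_t j = i + 1; j < T; j++)
                  scores[i * T + j] = -INFINITY;  // no attending to the future
      }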

  13. Feed-forward networks

    Coming soon

    The two-layer MLP where the model processes the information attention gathered.
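
    Structurally it is just two of the linear layers from tutorial 8 with a GELU in between; the 4x expansion factor shown here is the GPT-2 convention. Illustrative C:

      #include <stddef.h>

      float gelu(float x);  // from the linear-layers sketch
      void  linear(const float *in, const float *W, const float *b,
                   float *out, size_t in_dim, size_t out_dim);

      // Expand d -> 4d, apply GELU, project 4d -> d. hidden is caller scratch.
      void ffn(const float *x, const float *W1, const float *b1,
               const float *W2, const float *b2,
               float *hidden, float *out, size_t d) {
          linear(x, W1, b1, hidden, d, 4 * d);
          for (size_t i = 0; i < 4 * d; i++)
              hidden[i] = gelu(hidden[i]);
          linear(hidden, W2, b2, out, 4 * d, d);
      }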

  14. Transformer blocks and residual connections

    Coming soon

    Assembling attention + FFN with skip connections and pre-norm into a complete block.
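
    The wiring in illustrative C (the sublayer functions are placeholders for pieces built in earlier tutorials): each sublayer reads a normalized copy of the stream, and its output is added back onto the skip path.

      #include <stddef.h>

      void layernorm(const float *x, const float *g, const float *b,
                     float *out, size_t n);                      // tutorial 9
      void attn_sublayer(const float *in, float *out, size_t d); // placeholder
      void ffn_sublayer(const float *in, float *out, size_t d);  // placeholder

      // Pre-norm block: x = x + attn(ln1(x)); x = x + ffn(ln2(x)).
      void block_forward(float *x, const float *g1, const float *b1,
                         const float *g2, const float *b2,
                         float *tmp, float *delta, size_t d) {
          layernorm(x, g1, b1, tmp, d);
          attn_sublayer(tmp, delta, d);
          for (size_t i = 0; i < d; i++) x[i] += delta[i];  // residual add
          layernorm(x, g2, b2, tmp, d);
          ffn_sublayer(tmp, delta, d);
          for (size_t i = 0; i < d; i++) x[i] += delta[i];  // residual add
      }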

  15. Tokenization

    Coming soon

    Converting text to numbers and back — character-level encoding and the path to BPE.
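
    Character-level tokenization is the simplest possible scheme (illustrative C): every byte is its own token, giving a fixed 256-entry vocabulary; BPE later merges frequent pairs into larger units.

      #include <stddef.h>

      void encode(const char *text, int *tokens, size_t n) {
          for (size_t i = 0; i < n; i++)
              tokens[i] = (unsigned char)text[i];  // token id = byte value
      }

      void decode(const int *tokens, char *text, size_t n) {
          for (size_t i = 0; i < n; i++)
              text[i] = (char)tokens[i];
      }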

  16. Model weights and loading

    Coming soon

    Reading pretrained parameters from binary files and placing them into the right structures.
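
    The core operation, sketched in C under a deliberately simplified assumption (a file of raw little-endian float32 values; real checkpoint formats add headers and shape metadata):

      #include <stdio.h>
      #include <stdlib.h>

      // Read n float32 values into a freshly allocated buffer; NULL on failure.
      float *read_floats(FILE *f, size_t n) {
          float *buf = malloc(n * sizeof(float));
          if (!buf || fread(buf, sizeof(float), n, f) != n) {
              free(buf);   // free(NULL) is a safe no-op
              return NULL;
          }
          return buf;
      }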

  17. Embeddings and the forward pass

    Coming soon

    Token and position embeddings, stacking transformer blocks, and producing logits end-to-end.
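
    The first step of the forward pass, in illustrative C: the activation for a token at a given position is its token-embedding row plus the position-embedding row.

      #include <stddef.h>

      // wte: [vocab][d] token embeddings, wpe: [max_seq][d] position embeddings.
      void embed(const float *wte, const float *wpe,
                 int token, size_t pos, float *x, size_t d) {
          for (size_t i = 0; i < d; i++)
              x[i] = wte[(size_t)token * d + i] + wpe[pos * d + i];
      }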

  18. Sampling strategies

    Coming soon

    Temperature, top-k, and nucleus sampling — controlling the randomness of generation.
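
    Temperature is the simplest knob (illustrative C): divide the logits before the softmax. Values below 1 sharpen the distribution toward greedy decoding; values above 1 flatten it. Top-k then keeps only the k largest options, and nucleus (top-p) keeps the smallest set whose probabilities sum to p.

      #include <stddef.h>

      // Scale logits in place; apply softmax afterwards to get probabilities.
      void apply_temperature(float *logits, size_t n, float temperature) {
          for (size_t i = 0; i < n; i++)
              logits[i] /= temperature;
      }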

  19. Autoregressive generation

    Coming soon

    The generation loop: predict one token, append it, repeat — building a working inference engine.
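
    The whole engine reduces to one loop (illustrative C; model_forward and sample_token stand in for the pieces built across the series):

      #include <stddef.h>

      void model_forward(const int *tokens, size_t n, float *logits); // placeholder
      int  sample_token(const float *logits);                         // placeholder

      // Grow the token buffer one prediction at a time; returns the new length.
      size_t generate(int *tokens, size_t n_prompt, size_t max_tokens,
                      float *logits) {
          size_t n = n_prompt;
          while (n < max_tokens) {
              model_forward(tokens, n, logits);   // logits for the last position
              tokens[n++] = sample_token(logits); // append and repeat
          }
          return n;
      }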

  20. KV cache

    Coming soon

    The key optimization for inference — caching keys and values to avoid redundant computation.
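
    The insight: past keys and values never change, so compute them once and append. Each new token then costs O(T) attention work instead of recomputing all the scores from scratch. An illustrative C sketch of the cache itself:

      #include <stddef.h>

      typedef struct {
          float *k;     // [max_seq][d] cached keys
          float *v;     // [max_seq][d] cached values
          size_t len;   // positions filled so far
          size_t d;
      } KVCache;

      // Append one position's key and value; attention then reads rows 0..len-1.
      void kv_append(KVCache *c, const float *k_new, const float *v_new) {
          for (size_t i = 0; i < c->d; i++) {
              c->k[c->len * c->d + i] = k_new[i];
              c->v[c->len * c->d + i] = v_new[i];
          }
          c->len++;
      }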

  21. Profiling and performance

    Coming soon

    Finding bottlenecks, parallelizing with threads, and pre-allocating memory for speed.
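
    Step one is always measurement; an illustrative C helper using the POSIX monotonic clock:

      #include <time.h>

      // Wall-clock seconds from a monotonic source (immune to clock changes).
      double now_sec(void) {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
      }

      // Usage (illustrative): double t0 = now_sec();
      //                       /* run the layer */
      //                       printf("%.3f s\n", now_sec() - t0);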

  22. Rotary positional embeddings

    Coming soon

    RoPE — encoding position directly into attention via rotation, as used in LLaMA and Mistral.
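
    The mechanism in illustrative C: rotate each (even, odd) pair of a query or key vector by an angle proportional to the token's position, using the standard 10000^(-2i/d) frequency schedule; relative position then falls out of the dot product between rotated queries and keys.

      #include <stddef.h>
      #include <math.h>

      // Apply RoPE in place to one d-dimensional query or key vector.
      void rope(float *x, size_t d, size_t pos) {
          for (size_t i = 0; i < d; i += 2) {
              float theta = (float)pos * powf(10000.0f, -(float)i / (float)d);
              float c = cosf(theta), s = sinf(theta);
              float x0 = x[i], x1 = x[i + 1];
              x[i]     = x0 * c - x1 * s;  // 2-D rotation of the pair
              x[i + 1] = x0 * s + x1 * c;
          }
      }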