🔥 Flash Attention derived and coded from first principles with Triton (Python)

🌅 Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

🪢 ML Interpretability: feature visualization, adversarial examples, interpretability for language models

🕸️ Kolmogorov-Arnold Networks: MLP vs KAN, Math, B-Splines, Universal Approximation Theorem

🎯 Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math derivations

📐 Reinforcement Learning from Human Feedback explained with math derivations and PyTorch code

🐍 Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent and Convolutional Modes, Math

🌈 Mistral 7B and Mixtral 8x7B Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer (KV) Cache, Model Sharding

🔬 Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

⚛️ Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

🗃️ Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)

👨 BERT explained: Training, Inference, BERT vs GPT/LLaMA, Fine-tuning, [CLS] token

🌄 Coding Stable Diffusion From Scratch

🦙 Coding LLaMA 2 From Scratch

🦙 LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

🌍 Segment Anything - Model explanation with code

🧮 LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch

LongNet: Scaling Transformers to 1,000,000,000 tokens - Python Code + Explanation

🖼 How diffusion models work - explanation and code!

⚙️ Variational Autoencoder - Model, ELBO, loss function and maths explained easily!

🎛 Coding a Transformer from scratch in PyTorch, with full explanation, training and inference

🪬 Attention Is All You Need (Transformer) - Model explanation (including math), Inference and Training