This repository is a hardware-first roadmap for people who want to become AI hardware engineers.
It sits in the gap between "I can use AI frameworks" and "I can explain how models map onto compilers, runtimes, boards, and chips." The repository connects the layers that are usually learned separately: digital design, computer architecture, operating systems, parallel programming, embedded systems, AI workloads, deployment, ML compilers, and accelerator design.
The goal is not to collect random resources or to teach generic AI in isolation. The goal is to build cross-stack engineering judgment: how workloads create bottlenecks, how software reaches hardware, and how to design, optimize, deploy, or debug AI systems close to the silicon.
AI content in this repository exists to teach the workloads that hardware must serve. The center of gravity is still hardware, systems, deployment, and performance.
By the end, you should be able to:
- trace an AI workload from model code to compiler, runtime, and hardware behavior
- write and profile performance-critical code, including GPU and parallel workloads
- deploy AI on real embedded or programmable hardware such as Jetson and FPGA platforms
- reason about memory, latency, throughput, precision, and architecture tradeoffs
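That last kind of reasoning can be previewed with the roofline model: compare a kernel's arithmetic intensity against the machine's compute and bandwidth ceilings. A minimal Python sketch — the 100 TFLOP/s and 1 TB/s peaks below are made-up numbers for a hypothetical accelerator, not any specific part's spec:

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Return attainable FLOP/s and whether the kernel is compute- or memory-bound."""
    intensity = flops / bytes_moved            # FLOPs per byte of DRAM traffic
    attainable = min(peak_flops, intensity * peak_bw)
    bound = "compute" if intensity * peak_bw >= peak_flops else "memory"
    return attainable, bound

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s DRAM bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Elementwise fp32 add: 1 FLOP per 12 bytes (two loads + one store).
perf, bound = roofline_bound(flops=1, bytes_moved=12,
                             peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW)
print(f"vector add: {perf / 1e9:.1f} GFLOP/s attainable, {bound}-bound")

# Large fp32 matmul (N=4096): ~2N^3 FLOPs over ~3N^2 * 4 bytes of traffic.
N = 4096
perf, bound = roofline_bound(flops=2 * N**3, bytes_moved=3 * N**2 * 4,
                             peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW)
print(f"matmul N={N}: {perf / 1e12:.0f} TFLOP/s attainable, {bound}-bound")
```

The same two-line calculation explains why elementwise ops saturate DRAM long before they saturate the ALUs, while large matmuls do the opposite.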
This repository is for engineers who want to move into AI hardware work, not just use AI tools at a high level.
It is built for people crossing into a neighboring layer of the stack:
- from software into performance, compilers, and hardware behavior
- from ML into deployment, systems, and runtime constraints
- from embedded into AI products and accelerator-backed inference
- from hardware into workloads, compiler flow, and software integration

- Software Engineer: Move from APIs and infrastructure into CUDA, runtime behavior, compiler flow, memory hierarchy, and accelerator execution.
- ML / AI Engineer: Connect quantization, batching, graph lowering, deployment, and inference behavior to the hardware limits that actually shape performance.
- Embedded / Firmware Engineer: Extend RTOS, Linux, drivers, BSP, and bring-up skills into Jetson, edge inference, sensor pipelines, and shipped AI devices.
- Computer Science Student: Use a structured path from fundamentals to systems, workloads, deployment, and specialization instead of guessing what to study next.
- Hardware / RTL / FPGA Engineer: Add workload intuition, compiler context, kernels, and deployment constraints so existing hardware knowledge maps to real AI systems.
This roadmap uses an 8-layer stack to explain AI hardware work end to end. The point is not just to label layers. The point is to understand how decisions in one layer affect the others, from application code at the top to implementation and fabrication at the bottom.
L1–L6: Hands-on throughout this roadmap. L7–L8: Included so the stack stays complete, with guided conceptual labs.
Pick the path that matches both your current background and your target role. Most people should choose one primary entry path first, then branch out later.
- Software / ML: Start with execution and performance. Path: Phase 1 (C++ / Parallel) -> Phase 3 -> Phase 4C or 4B. Best if you already build models or infrastructure and want to understand kernels, memory behavior, compiler lowering, and deployment constraints.
- Embedded / Firmware: Start with systems and deployment. Path: Phase 1 (Architecture) -> Phase 2 -> Phase 4B. Best if you already know boards, RTOS, buses, or Linux bring-up and want to move into edge AI products.
- Already know CUDA: Jump to specialized tracks. Path: Phase 4A / 4B / 4C. Best if profiling, kernels, and low-level performance already feel familiar.
- Chip design target: Follow the full hardware path. Path: Phase 1 -> Phase 2 -> Phase 4A -> Phase 5F. Best if your goal is accelerator architecture, FPGA prototyping, RTL implementation, or silicon-adjacent work.
Do not treat this repository like a book to finish once. Use it like a build-and-measure curriculum.
- Read the theory
- Build the subsystem or implementation
- Measure performance, power, correctness, or utilization
- Ship one reusable artifact
The artifact matters as much as the reading. Good outputs include a CUDA profile, TensorRT benchmark, device-tree patch, FPGA timing report, compiler experiment, or architecture write-up. The point is to leave each block with evidence of engineering work, not just notes.
Before you start, decide three things:
- Which role or stack layer you are aiming at. Start with Roles & Market Analysis.
- What hardware and toolchain you can actually use.
- How you will track outputs, failures, measurements, and decisions.
Learn the language of hardware. Go from logic gates to writing GPU code.
| Module | What you'll learn |
|---|---|
| Digital Design & HDL | How digital logic works; write Verilog, simulate circuits |
| Computer Architecture | How CPUs and GPUs work internally — pipelines, caches, memory |
| Operating Systems | Processes, memory, scheduling, device drivers |
| C++ & Parallel Computing | SIMD, OpenMP, oneTBB, CUDA, ROCm, OpenCL/SYCL |
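The data-parallel mindset behind SIMD and CUDA can be previewed before touching intrinsics: express work as whole-array operations instead of scalar loops. A small NumPy sketch of SAXPY (`alpha * x + y`) — timings are illustrative and depend on your CPU:

```python
import time
import numpy as np

N = 100_000
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

def saxpy_loop(alpha, x, y):
    """Scalar loop: one element per iteration, interpreter overhead dominates."""
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = alpha * x[i] + y[i]
    return out

def saxpy_vec(alpha, x, y):
    """Vectorized: NumPy dispatches to SIMD-optimized C kernels internally."""
    return alpha * x + y

t0 = time.perf_counter(); r1 = saxpy_loop(2.0, a, b); t1 = time.perf_counter()
r2 = saxpy_vec(2.0, a, b); t2 = time.perf_counter()
assert np.allclose(r1, r2)
print(f"loop: {t1 - t0:.4f}s  vectorized: {t2 - t1:.4f}s")
```

The C++ modules then make the same transformation explicit with intrinsics, OpenMP pragmas, and CUDA thread blocks, where you control the mapping instead of delegating it to a library.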
Get hands-on with real hardware: microcontrollers, sensors, and embedded Linux.
| Module | What you'll learn |
|---|---|
| Embedded Software | ARM Cortex-M, FreeRTOS, communication buses (SPI/I2C/CAN), power management |
| Embedded Linux | Build custom Linux for embedded devices with Yocto and PetaLinux |
Understand the AI workloads your hardware must run. Two tracks — pick one or both.
Core (everyone does these):
| Module | What you'll learn |
|---|---|
| Neural Networks | How neural networks learn — backprop, CNNs, transformers from scratch |
| Deep Learning Frameworks | micrograd → PyTorch → tinygrad: understand what frameworks actually do |
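To see what "frameworks actually do", a scalar reverse-mode autograd engine in the micrograd style fits in a few dozen lines. This is an illustrative toy (supporting only add and mul), not micrograd's actual code:

```python
class Value:
    """A scalar that records its computation graph for reverse-mode autodiff."""
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():                      # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():                      # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# d(x*y + x)/dx = y + 1 = 4,  d(x*y + x)/dy = x = 2
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Everything in PyTorch's `loss.backward()` is this idea scaled up to tensors, fused kernels, and device dispatch — which is exactly the gap the micrograd → PyTorch → tinygrad progression walks you through.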
Track A — Hardware & Edge AI (leads to Phase 4A/B)
| Module | What you'll learn |
|---|---|
| Computer Vision | Object detection, segmentation, 3D vision, OpenCV |
| Sensor Fusion | Fuse camera + LiDAR + IMU; Kalman filters, BEVFusion |
| Voice AI | Speech-to-text (Whisper), TTS, wake-word detection |
| Edge AI & Optimization | Quantization, pruning, deploying models on constrained devices |
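The core arithmetic behind the quantization work in this track fits in a few lines: map floats to int8 with a scale and zero-point, then measure the round-trip error. A minimal asymmetric per-tensor sketch, not a production scheme:

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric per-tensor quantization of a float array to int8."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale) - 128          # maps lo onto -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1024).astype(np.float32)
q, s, zp = quantize_int8(x)
x_hat = dequantize(q, s, zp)
print(f"max abs error: {np.abs(x - x_hat).max():.5f} (scale = {s:.5f})")
```

The round-trip error stays within about one scale step; real deployments layer per-channel scales, calibration, and quantization-aware training on top of this same mapping.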
Track B — Agentic AI & ML Engineering (leads to Phase 4C / Phase 5)
| Module | What you'll learn |
|---|---|
| Agentic AI & GenAI | Build LLM agents, RAG systems, tool-using AI |
| ML Engineering & MLOps | Training pipelines, model serving, monitoring |
| LLM Application Development | Fine-tuning, RAG architecture, production LLM apps |
Deploy AI on real chips. Three specialized tracks — choose based on your target role.
Design hardware accelerators and deploy AI on programmable chips.
| Module | What you'll learn |
|---|---|
| FPGA Development | Vivado, IP cores, timing constraints, hardware debugging |
| Zynq MPSoC | Combine ARM CPU + FPGA fabric on one chip |
| Advanced FPGA Design | Clock domain crossing, floorplanning, power |
| HLS (High-Level Synthesis) | Write C++ → get synthesized RTL automatically |
| Runtime & Drivers | Linux driver for your FPGA, DMA, Vitis AI |
| Projects | Build a 4K wireless video pipeline end-to-end |
Ship AI products on NVIDIA's embedded GPU platform.
| Module | What you'll learn |
|---|---|
| Jetson Platform | JetPack, L4T, GPU on Orin — get up and running |
| Carrier Board Design | Design your own PCB that hosts a Jetson module |
| L4T Customization | Custom Linux kernel, device tree, OTA updates |
| Firmware (FSP) | FreeRTOS on the safety co-processor |
| AI Application Dev | ML inference, ROS 2, real-time video on Jetson |
| Security & OTA | Secure boot, encrypted storage, over-the-air updates |
| Manufacturing | FCC/CE compliance, production flashing, DFM |
| TensorRT & DLA | Optimize models for Jetson's GPU and neural accelerator |
Learn how AI models are compiled and optimized into chip instructions.
| Module | What you'll learn |
|---|---|
| Compiler Fundamentals | How MLIR, TVM, and LLVM work; build a custom backend |
| DL Inference Optimization | Triton kernels, Flash-Attention, TensorRT-LLM, quantization |
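A concrete way to see why these compilers fuse kernels: count the DRAM traffic of an unfused elementwise chain versus its fused version. A back-of-envelope Python sketch, assuming fp32 tensors and that unfused intermediates round-trip through DRAM:

```python
def elementwise_traffic_bytes(n_elems, n_ops, fused, dtype_bytes=4):
    """DRAM bytes for a chain of n_ops unary elementwise ops on n_elems values.

    Unfused: every op reads its input from DRAM and writes its output back.
    Fused: one read of the input and one write of the final result;
    intermediates live in registers or shared memory.
    """
    if fused:
        return 2 * n_elems * dtype_bytes
    return n_ops * 2 * n_elems * dtype_bytes

N = 1 << 20                      # about one million elements
for ops in (2, 4, 8):
    unfused = elementwise_traffic_bytes(N, ops, fused=False)
    fused = elementwise_traffic_bytes(N, ops, fused=True)
    print(f"{ops} ops: {unfused / 2**20:.0f} MiB unfused vs "
          f"{fused / 2**20:.0f} MiB fused ({unfused // fused}x less traffic)")
```

Since elementwise chains are memory-bound, that traffic ratio is roughly the speedup fusion buys — the same reasoning that motivates Triton kernels and Flash-Attention's refusal to materialize the attention matrix.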
Go deep in one area. These tracks are ongoing and expand continuously.
| Track | What you'll specialize in | Guide |
|---|---|---|
| GPU Infrastructure | Multi-GPU systems, NVLink, NCCL, AMD ROCm/HIP, MI300X | → |
| High-Performance Computing | 40+ CUDA-X libraries: cuBLAS, cuDNN, NVSHMEM and more | → |
| Edge AI | Efficient model architectures, Holoscan, real-time pipelines | → |
| Robotics | ROS 2, Nav2, MoveIt, motion planning | → |
| Autonomous Vehicles | openpilot, BEV perception, functional safety, hardware debug | → |
| AI Chip Design | Systolic arrays, dataflow architectures, tinygrad↔hardware, ASIC flow | → |
| Target Role | Key Layers | Recommended Path |
|---|---|---|
| ML Inference Engineer | L1 | Phase 3 → Phase 4C |
| Edge AI Engineer | L1 | Phase 3 Track A → Phase 4B |
| AI Compiler Engineer | L2 | Phase 1 → Phase 4C → Phase 5B |
| GPU Runtime Engineer | L3 | Phase 1 (CUDA) → Phase 4A/B §Runtime |
| Firmware / Embedded Engineer | L4 | Phase 1 → Phase 2 → Phase 4B |
| AI Accelerator Architect | L5 | Phase 1 → Phase 4A → Phase 5F |
| RTL / FPGA Design Engineer | L6 | Phase 1 (HDL) → Phase 4A |
| Autonomous Vehicles Engineer | L1–L4 | Phase 3 Track A → Phase 4B → Phase 5E |
| AI Hardware Engineer (Full-Stack) | L1–L6 | Full curriculum — the signature role this roadmap targets |
| Project | Why it's used |
|---|---|
| tinygrad | A tiny DL framework (~2,500 lines) — shows exactly how frameworks, compilers, and hardware backends connect |
| openpilot | Real-world ADAS software — shows how perception, ML, and hardware work together in production |
| jetson-llm-runtime | A highly optimized Jetson LLM runtime project — useful for studying inference kernels, memory behavior, runtime design, build flow, and edge deployment tradeoffs |
| jetson-esp-hosted | A Jetson-oriented ESP-Hosted fork validated on Jetson Orin Nano — useful for studying SPI bring-up, Wi-Fi/BLE coprocessor integration, Linux driver loading, and embedded connectivity on real hardware |
- Roles & Market Analysis — 23 sub-roles, salary data, job postings, remote %, hiring priorities
A hardware-first roadmap for people learning to build, deploy, and optimize AI systems close to the silicon.
⭐ Star this repo if you find it useful — it helps others discover it.