This repository is a hardware-first roadmap for people who want to become AI hardware engineers.
It sits in the gap between "I can use AI frameworks" and "I can explain how models map onto compilers, runtimes, boards, and chips." The repository connects the layers that are usually learned separately: digital design, computer architecture, operating systems, parallel programming, embedded systems, AI workloads, deployment, ML compilers, and accelerator design.
The goal is not to collect random resources or to teach generic AI in isolation. The goal is to build cross-stack engineering judgment: how workloads create bottlenecks, how software reaches hardware, and how to design, optimize, deploy, or debug AI systems close to the silicon.
AI content in this repository exists to teach the workloads that hardware must serve. The center of gravity is still hardware, systems, deployment, and performance.
By the end, you should be able to:
- trace an AI workload from model code to compiler, runtime, and hardware behavior
- write and profile performance-critical code, including GPU and parallel workloads
- deploy AI on real embedded or programmable hardware such as Jetson and FPGA platforms
- reason about memory, latency, throughput, precision, and architecture tradeoffs
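That last kind of reasoning can be previewed with the roofline model: compare a kernel's arithmetic intensity against the machine's compute and bandwidth ceilings. A minimal Python sketch — the 100 TFLOP/s and 1 TB/s peaks below are made-up numbers for a hypothetical accelerator, not any specific part's spec:

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Return attainable FLOP/s and whether the kernel is compute- or memory-bound."""
    intensity = flops / bytes_moved            # FLOPs per byte of DRAM traffic
    attainable = min(peak_flops, intensity * peak_bw)
    bound = "compute" if intensity * peak_bw >= peak_flops else "memory"
    return attainable, bound

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s DRAM bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Elementwise fp32 add: 1 FLOP per 12 bytes (two loads + one store).
perf, bound = roofline_bound(flops=1, bytes_moved=12,
                             peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW)
print(f"vector add: {perf / 1e9:.1f} GFLOP/s attainable, {bound}-bound")

# Large fp32 matmul (N=4096): ~2N^3 FLOPs over ~3N^2 * 4 bytes of traffic.
N = 4096
perf, bound = roofline_bound(flops=2 * N**3, bytes_moved=3 * N**2 * 4,
                             peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW)
print(f"matmul N={N}: {perf / 1e12:.0f} TFLOP/s attainable, {bound}-bound")
```

The same two-line calculation explains why elementwise ops saturate DRAM long before they saturate the ALUs, while large matmuls do the opposite.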
This repository is for engineers who want to move into AI hardware work, not just use AI tools at a high level.
It is built for people crossing into a neighboring layer of the stack:
- from software into performance, compilers, and hardware behavior
- from ML into deployment, systems, and runtime constraints
- from embedded into AI products and accelerator-backed inference
- from hardware into workloads, compiler flow, and software integration

- Software Engineer: Move from APIs and infrastructure into CUDA, runtime behavior, compiler flow, memory hierarchy, and accelerator execution.
- ML / AI Engineer: Connect quantization, batching, graph lowering, deployment, and inference behavior to the hardware limits that actually shape performance.
- Embedded / Firmware Engineer: Extend RTOS, Linux, drivers, BSP, and bring-up skills into Jetson, edge inference, sensor pipelines, and shipped AI devices.
- Computer Science Student: Use a structured path from fundamentals to systems, workloads, deployment, and specialization instead of guessing what to study next.
- Hardware / RTL / FPGA Engineer: Add workload intuition, compiler context, kernels, and deployment constraints so existing hardware knowledge maps to real AI systems.
This roadmap uses an 8-layer stack to explain AI hardware work end to end. The point is not just to label layers. The point is to understand how decisions in one layer affect the others, from application code at the top to implementation and fabrication at the bottom.
L1–L6: Hands-on throughout this roadmap. L7–L8: Included so the stack stays complete, with guided conceptual labs.
Pick the path that matches both your current background and your target role. Most people should choose one primary entry path first, then branch out later.
- Software / ML: Start with execution and performance. Path: Phase 1 (C++ / Parallel) -> Phase 3 -> Phase 4C or 4B. Best if you already build models or infrastructure and want to understand kernels, memory behavior, compiler lowering, and deployment constraints.
- Embedded / Firmware: Start with systems and deployment. Path: Phase 1 (Architecture) -> Phase 2 -> Phase 4B. Best if you already know boards, RTOS, buses, or Linux bring-up and want to move into edge AI products.
- Already know CUDA: Jump to specialized tracks. Path: Phase 4A / 4B / 4C. Best if profiling, kernels, and low-level performance already feel familiar.
- Chip design target: Follow the full hardware path. Path: Phase 1 -> Phase 2 -> Phase 4A -> Phase 5F. Best if your goal is accelerator architecture, FPGA prototyping, RTL implementation, or silicon-adjacent work.
Do not treat this repository like a book to finish once. Use it like a build-and-measure curriculum.
- Read the theory
- Build the subsystem or implementation
- Measure performance, power, correctness, or utilization
- Ship one reusable artifact
The artifact matters as much as the reading. Good outputs include a CUDA profile, TensorRT benchmark, device-tree patch, FPGA timing report, compiler experiment, or architecture write-up. The point is to leave each block with evidence of engineering work, not just notes.
Before you start, decide three things:
- Which role or stack layer you are aiming at. Start with Roles & Market Analysis.
- What hardware and toolchain you can actually use.
- How you will track outputs, failures, measurements, and decisions.
Learn the language of hardware. Go from logic gates to writing GPU code.
| Module | What you'll learn |
|---|---|
| Digital Design & HDL | How digital logic works; write Verilog, simulate circuits |
| Computer Architecture | How CPUs and GPUs work internally — pipelines, caches, memory |
| Operating Systems | Processes, memory, scheduling, device drivers |
| C++ & Parallel Computing | SIMD, OpenMP, oneTBB, CUDA, ROCm, OpenCL/SYCL |
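The data-parallel mindset behind SIMD and CUDA can be previewed before touching intrinsics: express work as whole-array operations instead of scalar loops. A small NumPy sketch of SAXPY (`alpha * x + y`) — timings are illustrative and depend on your CPU:

```python
import time
import numpy as np

N = 100_000
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

def saxpy_loop(alpha, x, y):
    """Scalar loop: one element per iteration, interpreter overhead dominates."""
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = alpha * x[i] + y[i]
    return out

def saxpy_vec(alpha, x, y):
    """Vectorized: NumPy dispatches to SIMD-optimized C kernels internally."""
    return alpha * x + y

t0 = time.perf_counter(); r1 = saxpy_loop(2.0, a, b); t1 = time.perf_counter()
r2 = saxpy_vec(2.0, a, b); t2 = time.perf_counter()
assert np.allclose(r1, r2)
print(f"loop: {t1 - t0:.4f}s  vectorized: {t2 - t1:.4f}s")
```

The C++ modules then make the same transformation explicit with intrinsics, OpenMP pragmas, and CUDA thread blocks, where you control the mapping instead of delegating it to a library.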
Get hands-on with real hardware: microcontrollers, sensors, and embedded Linux.
| Module | What you'll learn |
|---|---|
| Embedded Software | ARM Cortex-M, FreeRTOS, communication buses (SPI/I2C/CAN), power management |
| Embedded Linux | Build custom Linux for embedded devices with Yocto and PetaLinux |
Understand the AI workloads your hardware must run. Two tracks — pick one or both.
Core (everyone does these):
| Module | What you'll learn |
|---|---|
| Neural Networks | How neural networks learn — backprop, CNNs, transformers from scratch |
| Deep Learning Frameworks | micrograd → PyTorch → tinygrad: understand what frameworks actually do |
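To see what "frameworks actually do", a scalar reverse-mode autograd engine in the micrograd style fits in a few dozen lines. This is an illustrative toy (supporting only add and mul), not micrograd's actual code:

```python
class Value:
    """A scalar that records its computation graph for reverse-mode autodiff."""
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():                      # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():                      # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# d(x*y + x)/dx = y + 1 = 4,  d(x*y + x)/dy = x = 2
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Everything in PyTorch's `loss.backward()` is this idea scaled up to tensors, fused kernels, and device dispatch — which is exactly the gap the micrograd → PyTorch → tinygrad progression walks you through.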
Track A — Hardware & Edge AI (leads to Phase 4A/B)
| Module | What you'll learn |
|---|---|
| Computer Vision | Object detection, segmentation, 3D vision, OpenCV |
| Sensor Fusion | Fuse camera + LiDAR + IMU; Kalman filters, BEVFusion |
| Voice AI | Speech-to-text (Whisper), TTS, wake-word detection |
| Edge AI & Optimization | Quantization, pruning, deploying models on constrained devices |
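The core arithmetic behind the quantization work in this track fits in a few lines: map floats to int8 with a scale and zero-point, then measure the round-trip error. A minimal asymmetric per-tensor sketch, not a production scheme:

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric per-tensor quantization of a float array to int8."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale) - 128          # maps lo onto -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1024).astype(np.float32)
q, s, zp = quantize_int8(x)
x_hat = dequantize(q, s, zp)
print(f"max abs error: {np.abs(x - x_hat).max():.5f} (scale = {s:.5f})")
```

The round-trip error stays within about one scale step; real deployments layer per-channel scales, calibration, and quantization-aware training on top of this same mapping.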
Track B — Agentic AI & ML Engineering (leads to Phase 4C / Phase 5)
| Module | What you'll learn |
|---|---|
| Agentic AI & GenAI | Build LLM agents, RAG systems, tool-using AI |
| ML Engineering & MLOps | Training pipelines, model serving, monitoring |
| LLM Application Development | Fine-tuning, RAG architecture, production LLM apps |
Deploy AI on real chips. Three specialized tracks — choose based on your target role.
Design hardware accelerators and deploy AI on programmable chips.
| Module | What you'll learn |
|---|---|
| FPGA Development | Vivado, IP cores, timing constraints, hardware debugging |
| Zynq MPSoC | Combine ARM CPU + FPGA fabric on one chip |
| Advanced FPGA Design | Clock domain crossing, floorplanning, power |
| HLS (High-Level Synthesis) | Write C++ → get synthesized RTL automatically |
| Runtime & Drivers | Linux driver for your FPGA, DMA, Vitis AI |
| Projects | Build a 4K wireless video pipeline end-to-end |
Ship AI products on NVIDIA's embedded GPU platform.
| Module | What you'll learn |
|---|---|
| Jetson Platform | JetPack, L4T, GPU on Orin — get up and running |
| Carrier Board Design | Design your own PCB that hosts a Jetson module |
| L4T Customization | Custom Linux kernel, device tree, OTA updates |
| Firmware (FSP) | FreeRTOS on the safety co-processor |
| AI Application Dev | ML inference, ROS 2, real-time video on Jetson |
| Security & OTA | Secure boot, encrypted storage, over-the-air updates |
| Manufacturing | FCC/CE compliance, production flashing, DFM |
| TensorRT & DLA | Optimize models for Jetson's GPU and neural accelerator |
Learn how AI models are compiled and optimized into chip instructions.
| Module | What you'll learn |
|---|---|
| Compiler Fundamentals | How MLIR, TVM, and LLVM work; build a custom backend |
| DL Inference Optimization | Triton kernels, Flash-Attention, TensorRT-LLM, quantization |
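A concrete way to see why these compilers fuse kernels: count the DRAM traffic of an unfused elementwise chain versus its fused version. A back-of-envelope Python sketch, assuming fp32 tensors and that unfused intermediates round-trip through DRAM:

```python
def elementwise_traffic_bytes(n_elems, n_ops, fused, dtype_bytes=4):
    """DRAM bytes for a chain of n_ops unary elementwise ops on n_elems values.

    Unfused: every op reads its input from DRAM and writes its output back.
    Fused: one read of the input and one write of the final result;
    intermediates live in registers or shared memory.
    """
    if fused:
        return 2 * n_elems * dtype_bytes
    return n_ops * 2 * n_elems * dtype_bytes

N = 1 << 20                      # about one million elements
for ops in (2, 4, 8):
    unfused = elementwise_traffic_bytes(N, ops, fused=False)
    fused = elementwise_traffic_bytes(N, ops, fused=True)
    print(f"{ops} ops: {unfused / 2**20:.0f} MiB unfused vs "
          f"{fused / 2**20:.0f} MiB fused ({unfused // fused}x less traffic)")
```

Since elementwise chains are memory-bound, that traffic ratio is roughly the speedup fusion buys — the same reasoning that motivates Triton kernels and Flash-Attention's refusal to materialize the attention matrix.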
Go deep in one area. These tracks are ongoing and expand continuously.
| Track | What you'll specialize in | Guide |
|---|---|---|
| GPU Infrastructure | Multi-GPU systems, NVLink, NCCL, AMD ROCm/HIP, MI300X | → |
| High-Performance Computing | 40+ CUDA-X libraries: cuBLAS, cuDNN, NVSHMEM and more | → |
| Edge AI | Efficient model architectures, Holoscan, real-time pipelines | → |
| Robotics | ROS 2, Nav2, MoveIt, motion planning | → |
| Autonomous Vehicles | openpilot, BEV perception, functional safety, hardware debug | → |
| AI Chip Design | Systolic arrays, dataflow architectures, tinygrad↔hardware, ASIC flow | → |
| Target Role | Key Layers | Recommended Path |
|---|---|---|
| ML Inference Engineer | L1 | Phase 3 → Phase 4C |
| Edge AI Engineer | L1 | Phase 3 Track A → Phase 4B |
| AI Compiler Engineer | L2 | Phase 1 → Phase 4C → Phase 5B |
| GPU Runtime Engineer | L3 | Phase 1 (CUDA) → Phase 4A/B §Runtime |
| Firmware / Embedded Engineer | L4 | Phase 1 → Phase 2 → Phase 4B |
| AI Accelerator Architect | L5 | Phase 1 → Phase 4A → Phase 5F |
| RTL / FPGA Design Engineer | L6 | Phase 1 (HDL) → Phase 4A |
| Autonomous Vehicles Engineer | L1–L4 | Phase 3 Track A → Phase 4B → Phase 5E |
| AI Hardware Engineer (Full-Stack) | L1–L6 | Full curriculum — the signature role this roadmap targets |
| Project | Why it's used |
|---|---|
| tinygrad | A tiny DL framework (~2,500 lines) — shows exactly how frameworks, compilers, and hardware backends connect |
| openpilot | Real-world ADAS software — shows how perception, ML, and hardware work together in production |
| jetson-llm-runtime | A highly optimized Jetson LLM runtime project — useful for studying inference kernels, memory behavior, runtime design, build flow, and edge deployment tradeoffs |
| jetson-esp-hosted | A Jetson-oriented ESP-Hosted fork validated on Jetson Orin Nano — useful for studying SPI bring-up, Wi-Fi/BLE coprocessor integration, Linux driver loading, and embedded connectivity on real hardware |
- Roles & Market Analysis — 23 sub-roles, salary data, job postings, remote %, hiring priorities
A hardware-first roadmap for people learning to build, deploy, and optimize AI systems close to the silicon.
⭐ Star this repo if you find it useful — it helps others discover it.