ai-hpc/ai-hardware-engineer-roadmap

What is this?

This repository is a hardware-first roadmap for people who want to become AI hardware engineers.

It sits in the gap between "I can use AI frameworks" and "I can explain how models map onto compilers, runtimes, boards, and chips." The repository connects the layers that are usually learned separately: digital design, computer architecture, operating systems, parallel programming, embedded systems, AI workloads, deployment, ML compilers, and accelerator design.

The goal is not to collect random resources or to teach generic AI in isolation. The goal is to build cross-stack engineering judgment: how workloads create bottlenecks, how software reaches hardware, and how to design, optimize, deploy, or debug AI systems close to the silicon.

AI content in this repository exists to teach the workloads that hardware must serve. The center of gravity is still hardware, systems, deployment, and performance.

By the end, you should be able to:

  • trace an AI workload from model code to compiler, runtime, and hardware behavior
  • write and profile performance-critical code, including GPU and parallel workloads
  • deploy AI on real embedded or programmable hardware such as Jetson and FPGA platforms
  • reason about memory, latency, throughput, precision, and architecture tradeoffs
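One concrete way to practice the last bullet is roofline-style arithmetic. The sketch below is illustrative (the device numbers in the comments are hypothetical, not from this roadmap): it estimates whether a dense matmul is compute-bound or memory-bound from its arithmetic intensity.

```cpp
#include <cstdint>

// Roofline-style estimate for a dense matmul C = A * B with square size n.
// FLOPs: 2*n^3 (one multiply + one add per inner-product term).
// Minimum traffic: read A and B, write C, at `bytes_per_elem` each.
double arithmetic_intensity(std::int64_t n, int bytes_per_elem) {
    double flops = 2.0 * n * n * n;
    double bytes = 3.0 * n * n * bytes_per_elem;
    return flops / bytes;  // FLOPs per byte moved
}

// A kernel is memory-bound when its intensity sits below the machine's
// balance point: peak_flops / peak_bandwidth.
bool memory_bound(double intensity, double peak_flops, double peak_bw) {
    return intensity < peak_flops / peak_bw;
}
```

For example, an fp32 matmul (4 bytes/element) at n = 4096 has intensity of roughly n/6 ≈ 682 FLOPs/byte; on a hypothetical device with 100 TFLOP/s and 1 TB/s (balance point 100), that matmul is compute-bound, while the same kernel at n = 64 is memory-bound.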

Who is this for?

This repository is for engineers who want to move into AI hardware work, not just use AI tools at a high level.

It is built for people crossing into a neighboring layer of the stack:

  • from software into performance, compilers, and hardware behavior

  • from ML into deployment, systems, and runtime constraints

  • from embedded into AI products and accelerator-backed inference

  • from hardware into workloads, compiler flow, and software integration

Typical starting points, by role:

  • Software Engineer: Move from APIs and infrastructure into CUDA, runtime behavior, compiler flow, memory hierarchy, and accelerator execution.

  • ML / AI Engineer: Connect quantization, batching, graph lowering, deployment, and inference behavior to the hardware limits that actually shape performance.

  • Embedded / Firmware Engineer: Extend RTOS, Linux, drivers, BSP, and bring-up skills into Jetson, edge inference, sensor pipelines, and shipped AI devices.

  • Computer Science Student: Use a structured path from fundamentals to systems, workloads, deployment, and specialization instead of guessing what to study next.

  • Hardware / RTL / FPGA Engineer: Add workload intuition, compiler context, kernels, and deployment constraints so existing hardware knowledge maps to real AI systems.


AI Chip Stack

This roadmap uses an 8-layer stack to explain AI hardware work end to end. The point is not just to label layers. The point is to understand how decisions in one layer affect the others, from application code at the top to implementation and fabrication at the bottom.

[Diagram: the 8-layer AI Chip Stack]

L1–L6: Hands-on throughout this roadmap. L7–L8: Included so the stack stays complete, with guided conceptual labs.


Where Do I Start?

Pick the path that matches both your current background and your target role. Most people should choose one primary entry path first, then branch out later.

  • Software / ML: Start with execution and performance. Path: Phase 1 (C++ / Parallel) → Phase 3 → Phase 4C or 4B. Best if you already build models or infrastructure and want to understand kernels, memory behavior, compiler lowering, and deployment constraints.
  • Embedded / Firmware: Start with systems and deployment. Path: Phase 1 (Architecture) → Phase 2 → Phase 4B. Best if you already know boards, RTOS, buses, or Linux bring-up and want to move into edge AI products.
  • Already know CUDA: Jump to specialized tracks. Path: Phase 4A / 4B / 4C. Best if profiling, kernels, and low-level performance already feel familiar.
  • Chip design target: Follow the full hardware path. Path: Phase 1 → Phase 2 → Phase 4A → Phase 5F. Best if your goal is accelerator architecture, FPGA prototyping, RTL implementation, or silicon-adjacent work.

How To Use This Roadmap

Do not treat this repository like a book to finish once. Use it like a build-and-measure curriculum.

  1. Read the theory
  2. Build the subsystem or implementation
  3. Measure performance, power, correctness, or utilization
  4. Ship one reusable artifact

The artifact matters as much as the reading. Good outputs include a CUDA profile, TensorRT benchmark, device-tree patch, FPGA timing report, compiler experiment, or architecture write-up. The point is to leave each block with evidence of engineering work, not just notes.

Before you start, decide three things:

  1. Which role or stack layer you are aiming at. Start with Roles & Market Analysis.
  2. What hardware and toolchain you can actually use.
  3. How you will track outputs, failures, measurements, and decisions.

The 5 Phases

Phase 1: Learn the language of hardware. Go from logic gates to writing GPU code.

| Module | What you'll learn |
| --- | --- |
| Digital Design & HDL | How digital logic works; write Verilog, simulate circuits |
| Computer Architecture | How CPUs and GPUs work internally — pipelines, caches, memory |
| Operating Systems | Processes, memory, scheduling, device drivers |
| C++ & Parallel Computing | SIMD, OpenMP, oneTBB, CUDA, ROCm, OpenCL/SYCL |
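As a taste of the parallel-computing module, here is a minimal OpenMP reduction in C++ (a sketch, not roadmap code; compile with `-fopenmp`, and without it the pragma is simply ignored and the loop runs serially with the same result):

```cpp
#include <cstddef>
#include <vector>

// Parallel dot product using an OpenMP reduction. Each thread accumulates
// a private partial sum; OpenMP combines the partials after the loop.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(a.size()); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```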

Phase 2: Get hands-on with real hardware, including microcontrollers, sensors, and embedded Linux.

| Module | What you'll learn |
| --- | --- |
| Embedded Software | ARM Cortex-M, FreeRTOS, communication buses (SPI/I2C/CAN), power management |
| Embedded Linux | Build custom Linux for embedded devices with Yocto and PetaLinux |

Phase 3: Understand the AI workloads your hardware must run. Two tracks — pick one or both.

Core (everyone does these):

| Module | What you'll learn |
| --- | --- |
| Neural Networks | How neural networks learn — backprop, CNNs, transformers from scratch |
| Deep Learning Frameworks | micrograd → PyTorch → tinygrad: understand what frameworks actually do |
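To preview what "backprop from scratch" means, the toy sketch below (ours, not from the roadmap) computes the gradients of f(a, b, c) = a*b + c by the chain rule and sanity-checks one of them against a finite difference, the same trick frameworks use in their gradcheck utilities:

```cpp
#include <cmath>

// Forward pass: f(a, b, c) = a*b + c
double f(double a, double b, double c) { return a * b + c; }

// Backward pass, derived by hand with the chain rule:
// df/da = b, df/db = a, df/dc = 1
struct Grads { double da, db, dc; };
Grads backward(double a, double b, double /*c*/) {
    return {b, a, 1.0};
}

// Central finite difference for df/da, to verify the analytic gradient.
double fd_da(double a, double b, double c, double eps = 1e-6) {
    return (f(a + eps, b, c) - f(a - eps, b, c)) / (2 * eps);
}
```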

Track A — Hardware & Edge AI (leads to Phase 4A/B)

| Module | What you'll learn |
| --- | --- |
| Computer Vision | Object detection, segmentation, 3D vision, OpenCV |
| Sensor Fusion | Fuse camera + LiDAR + IMU; Kalman filters, BEVFusion |
| Voice AI | Speech-to-text (Whisper), TTS, wake-word detection |
| Edge AI & Optimization | Quantization, pruning, deploying models on constrained devices |
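The core arithmetic behind the quantization module fits in a few lines. This sketch shows symmetric per-tensor int8 quantization, one common scheme among several; the function names are ours, not from any particular toolkit:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor int8 quantization: map [-max_abs, max_abs] onto
// [-127, 127] with a single scale factor.
struct Quantized {
    std::vector<std::int8_t> q;
    float scale;
};

Quantized quantize_int8(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    Quantized out{{}, scale};
    out.q.reserve(x.size());
    for (float v : x) {
        float r = std::round(v / scale);
        r = std::clamp(r, -127.0f, 127.0f);
        out.q.push_back(static_cast<std::int8_t>(r));
    }
    return out;
}

// Dequantize back to float; the difference from the original input is the
// quantization error that accuracy-aware deployment must manage.
std::vector<float> dequantize(const Quantized& qx) {
    std::vector<float> out;
    out.reserve(qx.q.size());
    for (std::int8_t v : qx.q) out.push_back(v * qx.scale);
    return out;
}
```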

Track B — Agentic AI & ML Engineering (leads to Phase 4C / Phase 5)

| Module | What you'll learn |
| --- | --- |
| Agentic AI & GenAI | Build LLM agents, RAG systems, tool-using AI |
| ML Engineering & MLOps | Training pipelines, model serving, monitoring |
| LLM Application Development | Fine-tuning, RAG architecture, production LLM apps |

Phase 4 — Hardware Deployment & Compilation

Deploy AI on real chips. Three specialized tracks — choose based on your target role.

Track A — Xilinx FPGA

Design hardware accelerators and deploy AI on programmable chips.

| Module | What you'll learn |
| --- | --- |
| FPGA Development | Vivado, IP cores, timing constraints, hardware debugging |
| Zynq MPSoC | Combine ARM CPU + FPGA fabric on one chip |
| Advanced FPGA Design | Clock domain crossing, floorplanning, power |
| HLS (High-Level Synthesis) | Write C++ → get hardware automatically |
| Runtime & Drivers | Linux driver for your FPGA, DMA, Vitis AI |
| Projects | Build a 4K wireless video pipeline end-to-end |
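To give a flavor of HLS, the loop below is plain C++ annotated in the style Vitis HLS uses: an HLS tool would synthesize it into a pipelined datapath, while a normal compiler simply ignores the unknown pragma and runs it on the CPU. Treat the pragma details as illustrative, not a verified Vitis snippet:

```cpp
#include <cstddef>

// Fixed-size vector add, written HLS-style: static bounds and no dynamic
// memory, so a synthesis tool can size the datapath at compile time.
constexpr std::size_t N = 64;

void vadd(const int a[N], const int b[N], int out[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        // In Vitis HLS, PIPELINE II=1 asks for one loop iteration per clock.
        // A software compiler warns and ignores this pragma.
        #pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```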

Track B — NVIDIA Jetson

Ship AI products on NVIDIA's embedded GPU platform.

Module What you'll learn
Jetson Platform JetPack, L4T, GPU on Orin — get up and running
Carrier Board Design Design your own PCB that hosts a Jetson module
L4T Customization Custom Linux kernel, device tree, OTA updates
Firmware (FSP) FreeRTOS on the safety co-processor
AI Application Dev ML inference, ROS 2, real-time video on Jetson
Security & OTA Secure boot, encrypted storage, over-the-air updates
Manufacturing FCC/CE compliance, production flashing, DFM
TensorRT & DLA Optimize models for Jetson's GPU and neural accelerator

Track C — ML Compiler

Learn how AI models are compiled and optimized into chip instructions.

| Module | What you'll learn |
| --- | --- |
| Compiler Fundamentals | How MLIR, TVM, and LLVM work; build a custom backend |
| DL Inference Optimization | Triton kernels, Flash-Attention, TensorRT-LLM, quantization |

Phase 5: Go deep in one area. These tracks are ongoing and expand continuously.

| Track | What you'll specialize in |
| --- | --- |
| GPU Infrastructure | Multi-GPU systems, NVLink, NCCL, AMD ROCm/HIP, MI300X |
| High-Performance Computing | 40+ CUDA-X libraries: cuBLAS, cuDNN, NVSHMEM and more |
| Edge AI | Efficient model architectures, Holoscan, real-time pipelines |
| Robotics | ROS 2, Nav2, MoveIt, motion planning |
| Autonomous Vehicles | openpilot, BEV perception, functional safety, hardware debug |
| AI Chip Design | Systolic arrays, dataflow architectures, tinygrad↔hardware, ASIC flow |

What Jobs Does This Lead To?

| Target Role | Key Layers | Recommended Path |
| --- | --- | --- |
| ML Inference Engineer | L1 | Phase 3 → Phase 4C |
| Edge AI Engineer | L1 | Phase 3 Track A → Phase 4B |
| AI Compiler Engineer | L2 | Phase 1 → Phase 4C → Phase 5B |
| GPU Runtime Engineer | L3 | Phase 1 (CUDA) → Phase 4A/B §Runtime |
| Firmware / Embedded Engineer | L4 | Phase 1 → Phase 2 → Phase 4B |
| AI Accelerator Architect | L5 | Phase 1 → Phase 4A → Phase 5F |
| RTL / FPGA Design Engineer | L6 | Phase 1 (HDL) → Phase 4A |
| Autonomous Vehicles Engineer | L1–L4 | Phase 3 Track A → Phase 4B → Phase 5E |
| AI Hardware Engineer (Full-Stack) | L1–L6 | Full curriculum — the signature role this roadmap targets |

Reference Projects Used Throughout

| Project | Why it's used |
| --- | --- |
| tinygrad | A tiny DL framework (~2,500 lines) — shows exactly how frameworks, compilers, and hardware backends connect |
| openpilot | Real-world ADAS software — shows how perception, ML, and hardware work together in production |
| jetson-llm-runtime | A highly optimized Jetson LLM runtime project — useful for studying inference kernels, memory behavior, runtime design, build flow, and edge deployment tradeoffs |
| jetson-esp-hosted | A Jetson-oriented ESP-Hosted fork validated on Jetson Orin Nano — useful for studying SPI bring-up, Wi-Fi/BLE coprocessor integration, Linux driver loading, and embedded connectivity on real hardware |

Additional Resources


A hardware-first roadmap for people learning to build, deploy, and optimize AI systems close to the silicon.

⭐ Star this repo if you find it useful — it helps others discover it.