Skip to content

Latest commit

 

History

History
855 lines (673 loc) · 33.4 KB

File metadata and controls

855 lines (673 loc) · 33.4 KB

Breenix ARM64 Parity Project

Tracking Document for ARM64 vs x86-64 Feature Parity

Last Updated: 2026-02-04


Executive Summary

Goal: Achieve ARM64 test parity with x86-64

Test Inventory Comparison

Category x86-64 ARM64 Gap
Kernel Test Framework 91 91* 0*
Userspace Test Binaries 122 100 -22
Rust Integration Tests 72 2 -70
Total Potential Tests 285 193 -92

*Kernel test framework exists on ARM64 but many tests may be arch-gated

Current ARM64 Userspace Results

Metric Baseline Current Target
Userspace Pass Rate 41.25% (33/80) 62.5% (50/80) ~90%
Tests in Suite 80 80 100+

Progress: +21.25% improvement from baseline, +17 new tests passing

Key Fixes Applied ✅

  1. exec() spinlock deadlock FIXED - added explicit lock release before ERET
  2. sys_read for RegularFile FIXED - implemented ext2 file read support
  3. Network ENETUNREACH FIXED - added virtio-net to QEMU scripts
  4. Filesystem readonly FIXED - removed readonly=on from QEMU disk
  5. argc/argv setup FIXED - implemented create_process_with_argv()
  6. COW syscalls ALREADY IMPLEMENTED - no changes needed

Priority Queue: Ordered by Tests Unblocked

Principle: Each priority level is chosen because it unblocks the largest number of tests for the effort required.

PRIORITY 1: Port Rust Integration Tests to ARM64 [+70 tests]

Status: NOT STARTED Effort: High (infrastructure + test porting) Tests Unblocked: 70

The Rust integration tests (tests/*.rs) provide 72 tests on x86-64 but only 2 on ARM64. This is the single largest gap in test coverage.

What's needed:

  1. Create ARM64 QEMU test harness (like shared_qemu.rs but for ARM64)
  2. Port architecture-specific test setup code
  3. Verify each test file works on ARM64:
Test File x86-64 Tests ARM64 Status
syscall_tests.rs 1 Not ported
memory_tests.rs 3 Not ported
system_tests.rs 4 Not ported
exception_tests.rs ~8 Not ported
guard_page_tests.rs ~5 Not ported
interrupt_tests.rs ~6 Not ported
timer_tests.rs ~4 Not ported
keyboard_tests.rs ~3 Not ported
logging_tests.rs ~2 Not ported
stack_bounds_tests.rs ~4 Not ported
async_executor_tests.rs ~5 Not ported
boot_post_test.rs 1 Ported
arm64_boot_post_test.rs 1 ARM64 native
shared_qemu.rs - x86-64 only
shared_qemu_aarch64.rs - Exists
simple_kernel_test.rs ~2 Not ported
kernel_build_test.rs ~1 Not ported
ring3_smoke_test.rs ~3 Not ported
ring3_enosys_test.rs ~2 Not ported

PRIORITY 2: Port x86-64 Assembly Tests to libbreenix [+12 tests]

Status: NOT STARTED Effort: Medium (rewrite tests to use libbreenix syscall wrappers) Tests Unblocked: 12

These tests are excluded from ARM64 because they use x86-64 inline assembly (int 0x80, register manipulation). They need to be rewritten to use libbreenix syscall wrappers.

Test Current Issue Porting Approach
pipe_test Uses int 0x80 for syscalls Use libbreenix::io::pipe()
pipe_refcount_test Uses int 0x80 Use libbreenix::io::pipe()
poll_test Uses int 0x80 Use libbreenix::io::poll()
select_test Uses int 0x80 Use libbreenix::io::select()
brk_test Uses int 0x80 Use libbreenix::memory::brk()
stdin_test Uses int 0x80 Use libbreenix::io::read()
timer_test Uses int 0x80 Use libbreenix::time
register_init_test x86-64 register checks Rewrite for ARM64 regs (x0-x30)
syscall_diagnostic_test Uses int 0x80 Use libbreenix syscalls
syscall_enosys Uses int 0x80 Use libbreenix syscalls
signal_regs_test Uses r12-r15 (x86-64) Rewrite for x19-x28 (ARM64)
sigaltstack_test Uses RSP access Rewrite for SP access (ARM64)

PRIORITY 3: Fix Signal Delivery [+5 tests]

Status: NOT STARTED Effort: Medium (kernel debugging) Tests Unblocked: 5 (of the 30 currently failing)

Signal delivery via kill() is not working on ARM64. SIGCHLD works (sigchld_test passes), so some signal paths are functional.

Debugging approach:

  1. Trace kill() syscall on ARM64
  2. Compare with x86-64 signal delivery path
  3. Fix the delivery mechanism

Tests that will pass once fixed:

Test What it tests
signal_test kill(pid, SIGTERM) to child
ctrl_c_test kill(pid, SIGINT) to child
alarm_test SIGALRM after timer expires
itimer_test setitimer/getitimer + SIGALRM
kill_process_group_test kill(0, sig), kill(-pgid, sig)

PRIORITY 4: Remaining Userspace Test Failures [~25 tests]

Status: INVESTIGATION NEEDED Effort: Variable Tests Unblocked: ~25

After P1-P3, approximately 25 tests may still be failing. These need individual investigation:

Category Est. Count Notes
Filesystem write bugs ~6 May already be fixed by readonly fix
Process/fork edge cases ~5 Need investigation
Network edge cases ~4 May need virtio-net driver fixes
TTY/PTY issues ~3 Need investigation
Other ~7 Need investigation

PRIORITY 5: Kernel Test Framework on ARM64 [+91 tests potentially]

Status: NOT STARTED Effort: Medium-High Tests Unblocked: Up to 91 (if not already running)

Verify the kernel test framework runs on ARM64 boot and all arch-agnostic tests pass.

What's needed:

  1. Verify test framework initializes on ARM64
  2. Identify which tests are x86-64 only vs arch-agnostic
  3. Run and fix arch-agnostic tests
  4. Port x86-64-specific tests where valuable

Test Gap Summary

Priority Work Item Tests Unblocked Effort Cumulative
P1 Port Rust integration tests +70 High 70
P2 Port x86-64 asm tests +12 Medium 82
P3 Fix signal delivery +5 Medium 87
P4 Fix remaining failures +25 Variable 112
P5 Kernel test framework +91 Medium-High 203

Total potential improvement: From current 50 passing to 203+ tests


Key Design Decisions

KD-1: Skip Parallel Boot Testing for ARM64

Decision: Do NOT create run-aarch64-boot-parallel.sh

Rationale:

  • ARM64 boots natively on Apple Silicon - extremely fast compared to x86-64 in Docker/QEMU
  • Primary benefit of parallel boot testing on x86-64 is CI throughput - not needed for ARM64
  • Race condition stress testing is better addressed by kthread parallel tests
  • Engineering effort better spent on actual bug fixes (exec return path)

Date: 2026-02-01


KD-2: Kthread Parallel Testing is Valuable

Decision: Create run-aarch64-kthread-parallel.sh

Rationale:

  • Tests threading subsystem under load (scheduler, context switching, locks)
  • Architecture-specific behavior that needs validation
  • Stress testing kthreads has caught real bugs on x86-64

Date: 2026-02-01


Project Plan

Priority -1: Project Planning [COMPLETE]

  • Analyze ARM64 vs x86-64 feature delta
  • Document key design decisions
  • Create prioritized action plan
  • Establish tracking in arm64-parity.md

Priority 0: Fix exec() Spinlock Deadlock [CRITICAL PATH] ✅ COMPLETE

Status: FIXED (2026-02-01)

Problem: 35 tests hung after calling exec(). The child process was created but never executed userspace code.

Root Cause Identified: Spinlock deadlock in ext2 filesystem

The bug was NOT in the return-to-userspace path. The actual problem:

  1. run_userspace_from_ext2() acquires ext2::root_fs() spinlock at boot
  2. This function never returns - it jumps directly to userspace via ERET
  3. The MutexGuard was never dropped, so the spinlock stayed locked forever
  4. When userspace fork+exec tried to load ELF from ext2, it deadlocked at root_fs()

Debugging Approach:

  • Implemented lock-free trace buffer (kernel/src/arch_impl/aarch64/trace.rs)
  • Added byte markers throughout exec() path (no locks, no serial, no formatting)
  • Examined trace buffer via GDB after hang
  • Trace showed: D D D E N O 1 then hang - identified load_elf_from_ext2() as hang location
  • Further tracing narrowed it to ext2::root_fs() lock acquisition

Fix Applied (main_aarch64.rs line 75):

let elf_data = fs.read_file_content(&inode)?;
// CRITICAL: Release ext2 lock BEFORE creating process and jumping to userspace.
// return_to_userspace() never returns, so fs_guard would never be dropped.
// If we hold the lock, fork/exec in userspace will deadlock trying to acquire it.
drop(fs_guard);

Validation:

  • fork_test: PASS - child exit(42), parent exit(0)
  • exec_from_ext2_test: PASS - exec's /bin/hello_world successfully
  • ARM64 boot test: PASS

Files Modified:

  • kernel/src/main_aarch64.rs - Added explicit drop(fs_guard) after ELF read
  • kernel/src/arch_impl/aarch64/trace.rs - Created trace buffer module (can be removed)
  • kernel/src/arch_impl/aarch64/syscall_entry.rs - Added trace points (can be removed)

Priority 1: Implement COW Syscalls on ARM64

Status: NOT STARTED Effort: Low (stub exists, just needs implementation)

Problem: COW_STATS and SIMULATE_OOM return ENOSYS on ARM64, blocking 6 tests.

Files:

  • kernel/src/arch_impl/aarch64/syscall_entry.rs lines 736-738

Tests blocked:

  • COW stress/validation tests that use these diagnostic syscalls

Priority 2: Create Kthread Parallel Test Script

Status: NOT STARTED Effort: Medium

Create docker/qemu/run-aarch64-kthread-parallel.sh modeled on x86-64 version.

Purpose: Stress test threading subsystem (scheduler, context switch, locks)


Priority 3: Fix Specific Syscall Bugs

Status: PARTIALLY FIXED Effort: Medium (individual investigations)

Syscall Issue Test Status
getitimer Returns uninitialized memory itimer_test ✅ FIXED
socketpair Returns EFAULT (14) unix_socket_test TODO

Priority 4: Network Configuration

Status: NOT STARTED Effort: Medium

Problem: UDP/TCP tests fail with ENETUNREACH (101)

Tests affected: udp_socket_test, network-related tests


Priority 5: Port Integration Tests

Status: NOT STARTED Effort: High (ongoing)

Port x86-64 Rust integration tests to ARM64:

Test File Priority Status
syscall_tests.rs High Not started
memory_tests.rs High Not started
system_tests.rs Medium Not started
exception_tests.rs Medium Not started
guard_page_tests.rs Low Not started
stack_bounds_tests.rs Low Not started

Test Results (2026-02-02)

Summary

Metric Baseline (2026-02-01) Current Change
PASS 33 50 +17
FAIL 47 30 -17
Pass Rate 41.25% 62.5% +21.25%

Passing Tests (50)

access_test           cat_test              clock_gettime_test
cow_oom_test          cow_readonly_test     cow_signal_test
cow_stress_test       cp_mv_argv_test       cwd_test
dup_test              echo_argv_test        fcntl_test
file_read_test        fork_memory_test      fork_pending_signal_test
fork_state_test       fork_test             fs_directory_test
getdents_test         head_test             http_test
job_control_test      job_table_test        ls_test
mkdir_argv_test       nonblock_eagain_test  nonblock_test
pause_test            pipe2_test            pipe_concurrent_test
pipe_fork_test        pipeline_test         pty_test
rm_argv_test          session_test          shell_pipe_test
sigchld_job_test      sigchld_test          signal_exec_test
signal_fork_test      signal_handler_test   signal_return_test
sigsuspend_test       tail_test             tty_test
unix_named_socket_test waitpid_test         wc_test
which_test            wnohang_timing_test

Tests Fixed by exec() Fix (+9, 2026-02-02 Session 1)

Test Category
access_test Filesystem access checks
cwd_test Current working directory
echo_argv_test Echo with arguments
fork_test Fork syscall now working
getdents_test Directory listing
itimer_test Interval timers
ls_test Directory listing
signal_exec_test Signals + exec working together
which_test PATH lookup

Tests Fixed by RegularFile Read (+8, 2026-02-02 Session 2)

Test Category
cat_test File content display
cp_mv_argv_test File copy/move operations
file_read_test Explicit file reading
fs_directory_test Directory operations
head_test File head reading
mkdir_argv_test Directory creation
rm_argv_test File deletion
tail_test File tail reading
wc_test Word count (file reading)

Failure Breakdown

Pattern Count Root Cause Priority Status
Hangs after exec() 35 ext2 spinlock deadlock P0 ✅ FIXED
sys_read returns EOPNOTSUPP 8 RegularFile not implemented P1 ✅ FIXED
Network ENETUNREACH 6 Missing virtio-net device in QEMU P1 ✅ FIXED
Filesystem write errors 6 QEMU disk mounted read-only P2 ✅ FIXED
argc/argv setup ~4 Initial process setup P3 TODO
Signal/process bugs ~8 Various P4 TODO
COW syscall ENOSYS ~2 Not implemented P5 TODO
Other ~4 Various P6 TODO

P4 Signal/Process Bugs - Detailed Enumeration

Last Updated: 2026-02-04

This section catalogs the failing tests related to signal handling and process management on ARM64. The tests are organized by root cause category to enable systematic debugging.

Summary

Category Count Tests Notes
Signal Delivery 4 alarm_test, ctrl_c_test, signal_test, itimer_test Built, need kernel fixes
Signal Handling (registers/stack) 2 signal_regs_test, sigaltstack_test NOT BUILT - need ARM64 port
Process Group Kill 1 kill_process_group_test Built, depends on signal delivery
Signal + Exec 1 signal_exec_check Built, may be false positive

Total: 8 tests in signal/process category

  • 6 tests built for ARM64 - need kernel signal delivery fixes
  • 2 tests need porting - use x86-64 inline assembly

Category 1: Signal Delivery (4 tests)

These tests fail because signals are not being delivered to userspace on ARM64, or the signal delivery timing is incorrect.

1.1 alarm_test

  • Description: Tests the alarm() syscall which schedules SIGALRM delivery after N seconds
  • What it tests:
    • Register SIGALRM handler via sigaction()
    • Call alarm(1) to schedule signal in 1 second
    • Wait (via yield loop) for signal delivery
    • Verify handler was invoked
    • Test alarm(0) to cancel pending alarm
  • Syscalls used: sigaction, alarm, yield
  • Likely root cause: ARM64 timer interrupt not triggering signal delivery, or alarm() syscall not implemented/wired up for ARM64
  • Priority: High - foundational signal delivery mechanism

1.2 ctrl_c_test

  • Description: Tests SIGINT delivery to a child process (simulating Ctrl-C)
  • What it tests:
    • Fork a child process
    • Parent sends SIGINT to child via kill()
    • Child should be terminated by default SIGINT handler
    • Parent calls waitpid() and verifies WIFSIGNALED with WTERMSIG == SIGINT
  • Syscalls used: fork, kill, waitpid
  • Likely root cause: Either kill() syscall not queueing signals on ARM64, or default signal disposition not terminating process
  • Priority: High - core signal delivery path

1.3 signal_test

  • Description: Tests basic signal delivery with SIGTERM
  • What it tests:
    • Check process existence via kill(pid, 0)
    • Fork child process
    • Parent sends SIGTERM to child
    • Verify child terminated by signal via waitpid
  • Syscalls used: fork, kill, waitpid
  • Likely root cause: Same as ctrl_c_test - signal delivery path on ARM64
  • Priority: High - same root cause as ctrl_c_test

1.4 itimer_test

  • Description: Tests interval timers (setitimer/getitimer)
  • What it tests:
    • ITIMER_VIRTUAL/ITIMER_PROF should return ENOSYS
    • ITIMER_REAL should fire SIGALRM repeatedly at specified intervals
    • getitimer() should return remaining time
    • Setting timer to zero should cancel it
  • Syscalls used: sigaction, setitimer, getitimer
  • Likely root cause: Timer subsystem not triggering SIGALRM delivery on ARM64
  • Priority: Medium - depends on timer infrastructure
  • Note: Listed as passing in some runs but failing in others - may be timing-sensitive

Category 2: Signal Handling - Registers/Stack (2 tests)

These tests verify that register state and stack are correctly managed during signal delivery and return.

IMPORTANT: Both tests use x86-64 inline assembly and are NOT BUILT for ARM64. They are excluded from build-aarch64.sh. These need ARM64 ports before they can be tested.

2.1 signal_regs_test [NOT BUILT FOR ARM64]

  • Description: Tests that callee-saved registers (r12-r15) are preserved across signal delivery
  • What it tests:
    • Set r12-r15 to known values
    • Register signal handler that clobbers these registers
    • Send SIGUSR1 to self
    • Handler runs and intentionally corrupts r12-r15
    • After sigreturn, verify original values restored
  • Syscalls used: sigaction, kill, sigreturn (implicit)
  • Status: NOT BUILT - uses x86-64 inline asm
  • Porting required: Rewrite with ARM64 assembly (x19-x28 are callee-saved on ARM64)
  • Priority: High - validates sigreturn correctness, but requires porting first

2.2 sigaltstack_test [NOT BUILT FOR ARM64]

  • Description: Tests alternate signal stack functionality
  • What it tests:
    • Set alternate stack via sigaltstack()
    • Query stack configuration
    • Register handler with SA_ONSTACK flag
    • Verify handler executes on alternate stack (check RSP/SP within alt stack range)
    • Test SS_DISABLE flag
    • Test minimum stack size validation (MINSIGSTKSZ)
  • Syscalls used: sigaltstack, sigaction, kill
  • Status: NOT BUILT - uses x86-64 inline asm (mov {0}, rsp)
  • Porting required: Replace mov {0}, rsp with ARM64 equivalent (mrs {0}, sp or similar)
  • Priority: Medium - used for stack overflow handling, but requires porting first

Category 3: Process Group Kill (1 test)

3.1 kill_process_group_test

  • Description: Tests comprehensive process group signal delivery semantics
  • What it tests:
    • kill(pid, 0) - check if process exists
    • kill(0, sig) - send signal to own process group
    • kill(-pgid, sig) - send signal to specific process group
    • kill(-1, sig) - broadcast signal to all processes
  • Syscalls used: sigaction, kill, setpgid, getpgrp, getpgid, fork, waitpid, sigsuspend, sigprocmask
  • Likely root cause: Process group handling in kill() syscall may not iterate process list correctly on ARM64
  • Priority: Medium - complex test with multiple sub-tests; may pass partially

Category 4: Signal + Exec (1 test)

4.1 signal_exec_check

  • Description: Helper program exec'd by signal_exec_test to verify signal handler reset after exec
  • What it tests:
    • Query SIGUSR1 handler state via sigaction(SIGUSR1, None, &mut old)
    • Verify handler is SIG_DFL (reset after exec per POSIX)
  • Syscalls used: sigaction
  • Likely root cause: Not really a signal bug - this is the child program. The parent test (signal_exec_test) passes, so this may be a false positive or the test is being run standalone incorrectly.
  • Priority: Low - investigate if actually failing

Tests Excluded from ARM64 Build (x86-64 Assembly)

The following tests use x86-64 inline assembly (int 0x80, rax/rdi/rsi/rdx registers) and are NOT BUILT for ARM64:

Test Reason Porting Effort
signal_regs_test x86-64 register manipulation Medium - rewrite for ARM64 registers
sigaltstack_test x86-64 RSP access Low - simple SP access
register_init_test x86-64 naked assembly High - full rewrite
pipe_test int 0x80 syscalls Medium - use libbreenix
pipe_refcount_test int 0x80 syscalls Medium - use libbreenix
poll_test int 0x80 syscalls Medium - use libbreenix
select_test int 0x80 syscalls Medium - use libbreenix
brk_test int 0x80 syscalls Medium - use libbreenix
stdin_test int 0x80 syscalls Low - use libbreenix
timer_test int 0x80 syscalls Medium - use libbreenix
syscall_diagnostic_test Direct syscall testing Medium
syscall_enosys Direct syscall testing Low

Note: These tests need to be ported to use libbreenix syscall wrappers instead of raw x86-64 inline assembly. The libbreenix library already supports ARM64 via svc #0.


Related Tests That ARE Passing

For context, these signal/process tests pass on ARM64:

Test What it validates
fork_test Basic fork/waitpid
fork_memory_test COW memory after fork
fork_pending_signal_test Pending signals across fork
fork_state_test Process state after fork
signal_exec_test Signal reset across exec
signal_fork_test Signal handlers across fork
signal_handler_test Basic signal handler registration
signal_return_test Signal handler return path
sigchld_test SIGCHLD delivery on child exit
sigchld_job_test SIGCHLD with job control
sigsuspend_test sigsuspend() syscall
waitpid_test waitpid() variants
wnohang_timing_test WNOHANG behavior
job_control_test Process groups + SIGTTOU/SIGTTIN
pause_test pause() syscall

This suggests:

  • Basic signal handler registration works
  • Fork + waitpid work
  • SIGCHLD delivery works
  • The issue is specifically with:
    • Signal delivery via kill() to terminate processes
    • Timer-based signal delivery (alarm, itimer)
    • Register preservation in sigreturn
    • Alternate signal stack

Debugging Priority Order

Phase 1: Fix kernel signal delivery (4 tests)

  1. signal_test / ctrl_c_test (same root cause)

    • Start here - simplest signal delivery test
    • Debug with GDB: breakpoint on sys_kill, trace signal queue
    • Verify signals are being queued and delivered on ARM64
  2. alarm_test / itimer_test (same root cause)

    • Timer-based signal delivery
    • Check ARM64 timer interrupt handler for signal delivery trigger
    • Depends on Phase 1 fixes
  3. kill_process_group_test

    • Complex test - fix after simpler ones pass
    • May "just work" once signal delivery is fixed

Phase 2: Port tests to ARM64 (2 tests)

  1. sigaltstack_test [NEEDS PORTING]

    • Port x86-64 mov {0}, rsp to ARM64
    • Then test alternate stack functionality
    • Lower priority - stack overflow handling
  2. signal_regs_test [NEEDS PORTING]

    • Port x86-64 r12-r15 manipulation to ARM64 x19-x28
    • Validates sigreturn register restoration
    • Important but blocked on porting

Investigation Notes

ARM64 vs x86-64 Differences:

  • Syscall entry: ARM64 uses svc #0, x86-64 uses int 0x80 or syscall
  • Registers: ARM64 x0-x30 vs x86-64 rax/rbx/etc
  • Signal frame layout may differ
  • sigaltstack_test and signal_regs_test have x86-64 inline assembly that needs porting

Key Files to Investigate:

  • kernel/src/arch_impl/aarch64/syscall_entry.rs - ARM64 syscall handler
  • kernel/src/syscall/signal.rs - Signal syscall implementations
  • kernel/src/process/signal.rs - Signal delivery logic
  • kernel/src/arch_impl/aarch64/process.rs - ARM64 process/signal frame setup

Architecture Parity Matrix

Dimension x86-64 ARM64 Parity
HAL/Kernel Core Full Full 100%
Syscalls ~70 ~68 ~97%
Drivers 3 types 4 types 100%
Userspace Binaries 120 120 100%
Shell Test Scripts 8 4 50%
Rust Integration Tests 16 2 12.5%
Test Pass Rate ~90% 62.5% Improving

New Infrastructure: DTrace-Style Tracing Framework

Added: 2026-02-02

A comprehensive lock-free tracing framework has been implemented for kernel observability on both x86-64 and ARM64. This infrastructure was originally created during ARM64 debugging (exec spinlock deadlock) and has been generalized into a production-quality subsystem.

Features

Component Description
Per-CPU Ring Buffers 1024 events × 16 bytes per CPU, lock-free writes
Provider/Probe Model Subsystems define their own trace points
Atomic Counters Per-CPU counters with 64-byte cache alignment
GDB Integration #[no_mangle] symbols for direct inspection
/proc Integration /proc/trace/{enable,events,buffer,counters,providers}
Serial Output Lock-free panic-safe dump functions

Built-in Providers

Provider Events
SYSCALL_PROVIDER SYSCALL_ENTRY, SYSCALL_EXIT
SCHED_PROVIDER CTX_SWITCH_ENTRY, CTX_SWITCH_EXIT, SCHED_PICK
IRQ_PROVIDER IRQ_ENTRY, IRQ_EXIT, TIMER_TICK

Built-in Counters

  • SYSCALL_TOTAL - Total syscall invocations
  • IRQ_TOTAL - Total interrupt invocations
  • CTX_SWITCH_TOTAL - Total context switches
  • TIMER_TICK_TOTAL - Total timer tick interrupts

Validation

GDB-based memory dump testing confirms the framework works correctly:

  • 278 events captured during kernel boot on x86-64
  • TIMER_TICK events with incrementing tick counts
  • CTX_SWITCH_ENTRY events showing thread context switches
  • Timestamps are monotonically increasing

Files Added

kernel/src/tracing/           # Core framework (~2500 lines)
  mod.rs                      # Public API and re-exports
  core.rs                     # TraceEvent, ring buffers, global state
  buffer.rs                   # Per-CPU TraceCpuBuffer
  timestamp.rs                # RDTSC (x86) / CNTVCT (ARM64) timestamps
  provider.rs                 # TraceProvider, TraceProbe registration
  counter.rs                  # Atomic per-CPU counters
  output.rs                   # Serial dump, GDB helpers
  macros.rs                   # trace_event!, define_trace_counter!
  providers/                  # Built-in providers (syscall, sched, irq)

kernel/src/fs/procfs/         # Virtual filesystem
  mod.rs                      # procfs core infrastructure
  trace.rs                    # /proc/trace/* content generators

scripts/                      # Testing infrastructure
  trace_memory_dump.py        # Parse trace buffer memory dumps
  test_tracing_via_gdb.sh     # Automated GDB-based validation

docs/planning/
  TRACING_FRAMEWORK_DESIGN.md        # Design document
  TRACING_FRAMEWORK_IMPLEMENTATION.md # Implementation guide

Cross-Architecture Support

Feature x86-64 ARM64
Timestamp source RDTSC CNTVCT_EL0
Serial output COM1 (0x3F8) PL011 UART
Tracing enabled at boot Yes TODO
/proc/trace Yes TODO

Note: ARM64 tracing init call needs to be added to main_aarch64.rs.


Completed Work (Previous Session)

Bugs Fixed

1. IRQ Handler Register Restoration (boot.S)

  • Root Cause: x0/x1 not restored from exception frame after context switch
  • Fix: Properly restore all registers x0-x30 before ERET
  • Commit: 35a569b
  • Impact: +5 tests passing (fork return value was corrupted)

2. exit() Syscall (syscall_entry.rs)

  • Root Cause: Just halted system instead of terminating process
  • Fix: Call ProcessScheduler::handle_thread_exit(), trigger reschedule
  • Commit: 35a569b
  • Impact: Child processes now exit properly, parents wake from waitpid()

3. Test Suite Missing Feature Flag

  • Root Cause: Built without --features testing
  • Fix: Added flag to build command
  • Impact: exec() no longer returns ENOSYS

Tests Fixed by Previous Session

  • waitpid_test - exit() handler fix
  • wnohang_timing_test - exit() handler fix
  • sigchld_job_test - exit() handler fix
  • pipe_fork_test - fork return value fix
  • job_control_test - combined fixes

Files Reference

ARM64-Specific Kernel Code

kernel/src/arch_impl/aarch64/           # 16 modules
kernel/src/arch_impl/aarch64/syscall_entry.rs  # Syscall dispatcher
kernel/src/arch_impl/aarch64/boot.S     # Exception handling, context switch
kernel/src/arch_impl/aarch64/process.rs # Process setup

Test Infrastructure

docker/qemu/run-aarch64-*.sh            # ARM64 test scripts
tests/arm64_boot_post_test.rs           # ARM64 integration tests
tests/shared_qemu_aarch64.rs            # Shared QEMU harness

x86-64 Reference (for comparison/porting)

kernel/src/syscall/handler.rs           # x86-64 syscall handler
kernel/src/process/manager.rs           # Process setup (shared, arch-conditional)
docker/qemu/run-kthread-parallel.sh     # Reference for parallel testing
tests/syscall_tests.rs                  # Tests to port

Next Actions

  1. Reproduce exec() hang ✅ DONE
  2. Debug and fix exec() bug ✅ DONE - spinlock deadlock fixed
  3. Run full test suite ✅ DONE - 62.5% pass rate (50/80)
  4. Implement RegularFile read ✅ DONE - +8 tests passing
  5. Fix network configuration ✅ DONE - Added virtio-net device to QEMU scripts
  6. Fix filesystem writes ✅ DONE - Removed readonly=on from QEMU disk options
  7. Fix argc/argv setup - P3 priority, initial process args (~4 tests)
  8. Fix signal/process bugs - P4 priority (~8 tests)

Session Log

2026-02-02 (Session 6) - FILESYSTEM WRITE FIX

  • Root cause: QEMU test scripts mounted ext2 disk with readonly=on flag
  • Symptom: VirtIO block device reported "Device is read-only", filesystem write operations failed
  • Fix: Removed readonly=on from QEMU drive options, scripts now create writable disk copies
  • Files modified:
    • docker/qemu/run-aarch64-boot-test-native.sh
    • docker/qemu/run-aarch64-boot-test-strict.sh
    • docker/qemu/run-aarch64-userspace-test.sh
    • docker/qemu/run-aarch64-interactive.sh
    • docker/qemu/run-aarch64-userspace.sh
  • Note: run-aarch64-test-suite.sh already had correct behavior (created writable copy)
  • Results: VirtIO block device no longer reports read-only, filesystem writes should now work

2026-02-02 (Session 5) - NETWORK CONFIGURATION FIX

  • Root cause: ARM64 QEMU test scripts missing virtio-net device configuration
  • Symptom: All network tests failing with ENETUNREACH (errno 101)
  • Fix: Added -device virtio-net-device,netdev=net0 -netdev user,id=net0 to all ARM64 QEMU scripts
  • Files modified:
    • docker/qemu/run-aarch64-test-suite.sh
    • docker/qemu/run-aarch64-boot-test-native.sh
    • docker/qemu/run-aarch64-boot-test-strict.sh
    • docker/qemu/run-aarch64-userspace-test.sh
    • docker/qemu/run-aarch64-test.sh
    • docker/qemu/run-aarch64-userspace.sh
    • docker/qemu/run-aarch64-test-runner.py
    • docker/qemu/run-aarch64-test.exp
  • Results: Network tests now pass (http_test, dns_test), boot test still passes
  • +2 tests now passing: http_test, dns_test
  • Remaining network issues: udp_socket_test loopback, tcp_socket_test blocking (separate bugs)

2026-02-02 (Session 4) - REGULAR FILE READ FIX

  • Root cause: ARM64 sys_read returned EOPNOTSUPP for FdKind::RegularFile
  • Fix: Implemented proper ext2 file read in kernel/src/syscall/io.rs
  • Also fixed: Test suite now uses writable ext2 disk copy
  • PR #140: Merged to main
  • Results: 50/80 passing (62.5%), up from 42/80 (52.5%)
  • +8 new tests passing:
    • cat_test, cp_mv_argv_test, file_read_test, fs_directory_test
    • head_test, mkdir_argv_test, rm_argv_test, tail_test, wc_test

2026-02-02 (Session 3) - TEST SUITE VALIDATION

  • Committed and merged exec() fix - PR #138 merged to main
  • Ran full ARM64 test suite - 80 tests
  • Results: 42/80 passing (52.5%), up from 33/80 (41.25%)
  • +9 new tests passing after exec() fix:
    • access_test, cwd_test, echo_argv_test, fork_test
    • getdents_test, itimer_test, ls_test, signal_exec_test, which_test
  • Notable: itimer_test now passes (was listed as P3 bug)
  • Remaining failures categorized:
    • Filesystem write errors (~12 tests) - ext2 mounted read-only?
    • Network errors (~6 tests) - ENETUNREACH
    • argc/argv setup (~4 tests)
    • COW syscalls (~2 tests)
    • Other (~14 tests)

2026-02-01 (Session 2) - MAJOR BREAKTHROUGH

  • FIXED P0 exec() bug - Root cause was ext2 spinlock deadlock, not return-to-userspace
  • Implemented lock-free trace buffer debugging system
  • Trace markers identified exact hang location: ext2::root_fs() in load_elf_from_ext2()
  • Fix: Added drop(fs_guard) in run_userspace_from_ext2() before jumping to userspace
  • Validated: fork_test PASS, exec_from_ext2_test PASS, boot test PASS
  • Remaining issue: argc/argv not set up for initial process (separate from exec)

2026-02-01 (Session 1)

  • Renamed NEXT.md to arm64-parity.md
  • Documented KD-1 (skip parallel boot) and KD-2 (kthread parallel valuable)
  • Established prioritized project plan with P0 = exec() bug
  • Baseline: 33/80 tests passing (41.25%)