Upgrade CPU scaling for multi-cluster SoCs#118
Conversation
Extends the CPU autoscaler to support big.LITTLE and tri-cluster SoCs (Allwinner A523, Snapdragon 865, etc.) while maintaining backward compatibility with single-cluster devices. Key features: - Detects CPU clusters via sysfs (/sys/devices/system/cpu/cpufreq/policyN) - Classifies clusters as LITTLE/BIG/PRIME based on max frequency and core count - Builds a PerfState ladder using governors (powersave/schedutil/performance) - Uses CPU affinity to guide the emulation thread to the appropriate cluster The governor-based approach works WITH the kernel's frequency scaling: - Our algorithm provides frame timing intelligence (which cluster tier) - The kernel provides frequency scaling intelligence (schedutil) - Inactive clusters get powersave governor, allowing them to truly idle PerfState ladder structure: - Dual-cluster: 6 states (LITTLE×3 governors, BIG×3 governors) - Tri-cluster: 9 states (adds PRIME×3 governors) - Progression: LITTLE powersave → LITTLE schedutil → LITTLE performance → BIG powersave → BIG schedutil → BIG performance → ... Single-cluster devices continue using the existing granular frequency mode with userspace governor and scaling_setspeed.
Introduces two helper functions that abstract over the three CPU scaling modes (topology/granular/fallback): - PlayerCPU_getPerformancePercent(): returns 0-100% normalized level - PlayerCPU_getModeName(): returns mode string for logging/debugging Updates the debug HUD to properly display topology mode state. Previously it fell through to fallback mode display which showed incorrect info. New format: "T3/5 60% u:75% b:80%" showing state index, max state, performance %, utilization %, and buffer fill %.
There was a problem hiding this comment.
Pull request overview
This PR adds multi-cluster CPU topology support for ARM SoCs with heterogeneous CPU architectures (big.LITTLE, tri-cluster designs like Snapdragon 865). Instead of manipulating frequency bounds, the new topology mode uses CPU governors (powersave/schedutil/performance) and CPU affinity to guide the emulation thread across performance tiers, working with the kernel's frequency scaling intelligence rather than against it.
Key changes:
- Topology detection via sysfs that enumerates CPU clusters and classifies them as LITTLE/BIG/PRIME based on max frequency
- PerfState ladder system with 3 governor levels per cluster (6 states for dual-cluster, 9 for tri-cluster)
- Governor-based scaling approach that leverages kernel's schedutil intelligence instead of fighting with frequency bounds
- Backward compatibility maintained: single-cluster devices continue using granular frequency mode
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| workspace/all/player/player_cpu.h | Defines new topology types (PlayerCPUCluster, PlayerCPUPerfState, PlayerCPUTopology), governor enum, and function declarations for multi-cluster support |
| workspace/all/player/player_cpu.c | Implements topology mode update logic, PerfState building/application, cluster classification, and CPU list parsing |
| workspace/all/common/api.h | Declares new API functions for topology detection, governor control, and thread affinity |
| workspace/all/common/api.c | Implements sysfs-based topology detection, governor/affinity control, and cluster enumeration |
| workspace/all/player/player.c | Integrates topology mode into auto CPU thread, adds HUD display for topology mode showing state/max and performance percentage |
| tests/unit/all/player/test_player_cpu.c | Adds comprehensive test coverage for topology functions including parsing, classification, PerfState building, and update decisions |
| docs/auto-cpu-scaling.md | Documents topology mode architecture, governor-based approach, PerfState ladder structure, and cluster classification algorithm |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // Set pending_affinity under mutex (main thread will apply it) | ||
| // This avoids race condition with main thread reading pending_affinity | ||
| PlayerCPUPerfState* ps = &auto_cpu_state.topology.states[target_state]; |
There was a problem hiding this comment.
Potential race condition when accessing auto_cpu_state.topology.states[target_state] without holding the mutex. After releasing the mutex at line 855, the code accesses the topology.states array at line 869. While the topology structure is likely read-only after initialization, this access pattern is inconsistent with the mutex protection used for other shared state. Consider moving this access inside the mutex lock at lines 870-874, or document that topology.states is read-only after initialization.
| if (auto_cpu_state.use_topology) { | ||
| // Topology mode: show state/max and performance % | ||
| int perf_pct = PlayerCPU_getPerformancePercent(&auto_cpu_state); | ||
| int max_state = auto_cpu_state.topology.state_count - 1; |
There was a problem hiding this comment.
Accessing auto_cpu_state.topology.state_count without holding the auto_cpu_mutex. While topology data is likely read-only after initialization, this access at line 4107 is outside the mutex lock that was released at line 4090. For consistency with the protection of current_idx, current_state, and level, consider reading state_count under the mutex, or document that topology structure is read-only after initialization.
| if (auto_cpu_state.use_topology) { | ||
| // Topology mode: show state/max and performance % | ||
| int perf_pct = PlayerCPU_getPerformancePercent(&auto_cpu_state); | ||
| int max_state = auto_cpu_state.topology.state_count - 1; |
There was a problem hiding this comment.
Accessing auto_cpu_state.topology.state_count without holding the auto_cpu_mutex. While topology data is likely read-only after initialization, this access at line 3786 is outside the mutex lock that was released at line 3769. For consistency with the protection of current_idx, current_state, and level, consider reading state_count under the mutex, or document that topology structure is read-only after initialization.
| static int compare_cluster_by_max_khz(const void* a, const void* b) { | ||
| const PlayerCPUCluster* ca = (const PlayerCPUCluster*)a; | ||
| const PlayerCPUCluster* cb = (const PlayerCPUCluster*)b; | ||
| return ca->max_khz - cb->max_khz; |
There was a problem hiding this comment.
Potential integer overflow in subtraction-based comparison. If ca->max_khz and cb->max_khz have large values with different signs or differ by more than INT_MAX, the subtraction ca->max_khz - cb->max_khz could overflow. While this is unlikely with CPU frequencies (typically < 5000000 kHz), consider using the standard three-way comparison pattern: if (ca->max_khz < cb->max_khz) return -1; if (ca->max_khz > cb->max_khz) return 1; return 0;
| return ca->max_khz - cb->max_khz; | |
| if (ca->max_khz < cb->max_khz) | |
| return -1; | |
| if (ca->max_khz > cb->max_khz) | |
| return 1; | |
| return 0; |
| *cpu_mask |= (1 << cpu); | ||
| (*cpu_count)++; |
There was a problem hiding this comment.
The CPU count increment logic may produce incorrect results. In parse_related_cpus, (*cpu_count)++ is called for every CPU in a range without checking if the CPU was already in the mask. For example, parsing "0-3,2-5" would count CPU 2 and 3 twice, resulting in cpu_count=8 instead of 6. Consider adding a duplicate check similar to PlayerCPU_parseCPUList which checks if (!(mask & (1 << cpu))) before incrementing the count.
| *cpu_mask |= (1 << cpu); | |
| (*cpu_count)++; | |
| int bit = 1 << cpu; | |
| if (!(*cpu_mask & bit)) { | |
| *cpu_mask |= bit; | |
| (*cpu_count)++; | |
| } |
| pthread_mutex_unlock(&auto_cpu_mutex); | ||
|
|
||
| if (target_state != current_state && target_state >= 0 && | ||
| target_state < auto_cpu_state.topology.state_count) { | ||
| LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state, | ||
| auto_cpu_state.topology.state_count - 1); | ||
|
|
||
| int result = PlayerCPU_applyPerfState(&auto_cpu_state); | ||
| if (result != 0) { | ||
| LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state); | ||
| } |
There was a problem hiding this comment.
Potential race condition when accessing auto_cpu_state.topology.state_count without holding the mutex. The line reads auto_cpu_state.topology.state_count at line 858 after releasing the mutex at line 855, but this field could theoretically be modified by another thread. While unlikely in practice given the initialization pattern, consider reading state_count under the mutex for consistency with the pattern used for target_state and current_state.
| pthread_mutex_unlock(&auto_cpu_mutex); | |
| if (target_state != current_state && target_state >= 0 && | |
| target_state < auto_cpu_state.topology.state_count) { | |
| LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state, | |
| auto_cpu_state.topology.state_count - 1); | |
| int result = PlayerCPU_applyPerfState(&auto_cpu_state); | |
| if (result != 0) { | |
| LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state); | |
| } | |
| int state_count = auto_cpu_state.topology.state_count; | |
| pthread_mutex_unlock(&auto_cpu_mutex); | |
| if (target_state != current_state && target_state >= 0 && | |
| target_state < state_count) { | |
| LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state, | |
| state_count - 1); | |
| int result = PlayerCPU_applyPerfState(&auto_cpu_state); | |
| if (result != 0) { | |
| LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state); |
On multi-cluster ARM SoCs (tg5050, retroid), the existing PWR_setCPUSpeed() calls were no-ops, leaving CPUs at full power during menu browsing and in tools. PWR_setLowPowerMode() enumerates all cpufreq policies and sets each to "powersave" governor. On single-cluster devices, it uses PLAT_setCPUSpeed() instead.
Moved player_cpu.c/h to common/cpu.c/h and renamed all PlayerCPU* identifiers to CPU* since the code is now shared infrastructure used by both launcher (for topology detection) and player (for autoscaling).
Extracts debug overlay rendering from player.c into reusable platform hooks, eliminating duplication between software and hardware render paths and enabling future platform-specific optimizations.
Audio rate control improvements: - Add cubic boost to proportional term (1× at center, 4× at limits) - Prevents buffer hitting 0%/100% during integral learning - Only boosts P term, not integral (avoids over-correction) - Improves tg5040 stability: 40.5% mean fill, 704 warnings (down from 8000+) CPU scaling improvements: - Add panic grace period (60 frames after frequency change) - Ignore underruns during grace to prevent cascade panics - Add buffer threshold for reduce (min 40% to allow scaling down) - Add stability decay (reduce panic counts after stable operation) - Reduce panic step from 2 to 1 (less aggressive on panic)
…ion. Frame pacer improvements: - Measure actual vsync timing by recording SDL_RenderPresent() intervals - Use exponential moving average (α=0.01) for stable Hz measurement - Stable after 120 samples (~2 seconds of gameplay) - Continuously track drift, re-check every 300 frames (~5 seconds) - Automatically reinitialize when measured Hz differs >0.1% from current - Reset Bresenham accumulator on Hz change to prevent glitches Tolerance and cleanup: - Reduce FRAME_PACER_TOLERANCE from 2% to 1% (matches RetroArch guidance) - Remove FORCE_FRAME_PACING from tg5040/tg5050 (testing hack) Results on tg5050: - Discovered actual refresh rate: 62-68Hz (SDL reports 60Hz) - Screen drifts dynamically: starts 65-68Hz, settles to 62-63Hz - 14% mismatch explains audio buffer issues - Frame pacing insufficient at this mismatch (needs audio-clock mode) Tests: - Added 6 tests for vsync measurement and drift tracking - All 1578 tests passing
Replaces compile-time dual-loop architecture (vsync vs audio-clock) with single integrated loop that measures actual display refresh rate and adapts sync mode at runtime. Starts in audio-clock mode (safe, universal) and switches to vsync if measured Hz is compatible (<1% mismatch from game fps). Removes frame pacing (Bresenham), cubic boost, and 350 lines of code. Simplifies auto CPU scaling to use pure utilization-based decisions, with time-based reduction strategy for audio-clock mode where blocking makes utilization metrics unreliable.
Audio rate control: - Replace dual-timescale PI controller with pure P control (Arntzen algorithm) - Remove integral term, error smoothing, and per-frame SND_newFrame() updates - Reduce d parameter from 1.2% to 0.8% (larger buffer provides more headroom) - Enable rate control in both modes as buffer health mechanism Audio blocking: - Replace SDL_Delay polling with true SDL_CondWait blocking in audio-clock mode - Add SDL mutex/cond for thread-safe audio writes - Block at 90% buffer full, wake on callback drain signal - Pre-fill buffer to 50% at startup to prevent initial underruns Audio buffer: - Increase from 4096 to 6400 samples (~133ms, matches RetroArch 128ms default) - Provides headroom for CPU frequency transitions and timing variance sync_manager: - Replace EMA smoothing with circular buffer and stddev-based convergence - Measurement stable when stddev/mean < 1% (typically ~60 samples) - Timeout after 1800 samples (~30s) if never converges - Add 512-sample circular buffer for rolling statistics Vsync control: - Add GLVideo_setVsync() to enable/disable vsync at runtime - Disable vsync in audio-clock mode (blocking is sole timing source) - Enable vsync in vsync mode (display-driven timing) - Update vsync setting on mode transitions CPU scaling (all 3 modes): - Add panic counting for topology and fallback modes (was granular-only) - Add skip-blocked-states logic to all reduction paths - Use time-based probing in AC mode with buffer-guided timing - Fix stability decay to use mode-appropriate indices
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 50 out of 50 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This PR upgrades the CPU scaling system to support multi-cluster SoCs (big.LITTLE architectures) and replaces the compile-time sync mode system with runtime-adaptive audio/video synchronization. The changes include:
Fixes #117