Skip to content

Upgrade CPU scaling for multi-cluster SoCs#118

Merged
nchapman merged 11 commits into
developfrom
feature/complex-cpu-scaling
Jan 8, 2026
Merged

Upgrade CPU scaling for multi-cluster SoCs#118
nchapman merged 11 commits into
developfrom
feature/complex-cpu-scaling

Conversation

@nchapman
Copy link
Copy Markdown
Collaborator

@nchapman nchapman commented Jan 5, 2026

This PR upgrades the CPU scaling system to support multi-cluster SoCs (big.LITTLE architectures) and replaces the compile-time sync mode system with runtime-adaptive audio/video synchronization. The changes include:

  • Moved CPU scaling from player-specific to common code, enabling launcher access
  • Added CPU topology detection and governor-based PerfState management for multi-cluster SoCs
  • Replaced frame_pacer (Bresenham algorithm) with sync_manager for runtime mode switching
  • Enhanced audio system with true blocking via SDL_CondWait (audio-clock mode)
  • Updated audio rate control to pure proportional (removed integral term)
  • Increased audio buffer size from 4096 to 6400 samples for better headroom

Fixes #117

Extends the CPU autoscaler to support big.LITTLE and tri-cluster SoCs
(Allwinner A523, Snapdragon 865, etc.) while maintaining backward
compatibility with single-cluster devices.

Key features:
- Detects CPU clusters via sysfs
  (/sys/devices/system/cpu/cpufreq/policyN)
- Classifies clusters as LITTLE/BIG/PRIME based on max frequency and
  core count
- Builds a PerfState ladder using governors
  (powersave/schedutil/performance)
- Uses CPU affinity to guide the emulation thread to the appropriate
  cluster

The governor-based approach works WITH the kernel's frequency scaling:
- Our algorithm provides frame timing intelligence (which cluster tier)
- The kernel provides frequency scaling intelligence (schedutil)
- Inactive clusters get powersave governor, allowing them to truly idle

PerfState ladder structure:
- Dual-cluster: 6 states (LITTLE×3 governors, BIG×3 governors)
- Tri-cluster: 9 states (adds PRIME×3 governors)
- Progression: LITTLE powersave → LITTLE schedutil → LITTLE performance
  → BIG powersave → BIG schedutil → BIG performance → ...

Single-cluster devices continue using the existing granular frequency
mode with userspace governor and scaling_setspeed.
Introduces two helper functions that abstract over the three CPU scaling
modes (topology/granular/fallback):
- PlayerCPU_getPerformancePercent(): returns 0-100% normalized level
- PlayerCPU_getModeName(): returns mode string for logging/debugging

Updates the debug HUD to properly display topology mode state.
Previously it fell through to fallback mode display which showed
incorrect info. New format: "T3/5 60% u:75% b:80%" showing state index,
max state, performance %, utilization %, and buffer fill %.
Copilot AI review requested due to automatic review settings January 5, 2026 23:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds multi-cluster CPU topology support for ARM SoCs with heterogeneous CPU architectures (big.LITTLE, tri-cluster designs like Snapdragon 865). Instead of manipulating frequency bounds, the new topology mode uses CPU governors (powersave/schedutil/performance) and CPU affinity to guide the emulation thread across performance tiers, working with the kernel's frequency scaling intelligence rather than against it.

Key changes:

  • Topology detection via sysfs that enumerates CPU clusters and classifies them as LITTLE/BIG/PRIME based on max frequency
  • PerfState ladder system with 3 governor levels per cluster (6 states for dual-cluster, 9 for tri-cluster)
  • Governor-based scaling approach that leverages kernel's schedutil intelligence instead of fighting with frequency bounds
  • Backward compatibility maintained: single-cluster devices continue using granular frequency mode

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
workspace/all/player/player_cpu.h Defines new topology types (PlayerCPUCluster, PlayerCPUPerfState, PlayerCPUTopology), governor enum, and function declarations for multi-cluster support
workspace/all/player/player_cpu.c Implements topology mode update logic, PerfState building/application, cluster classification, and CPU list parsing
workspace/all/common/api.h Declares new API functions for topology detection, governor control, and thread affinity
workspace/all/common/api.c Implements sysfs-based topology detection, governor/affinity control, and cluster enumeration
workspace/all/player/player.c Integrates topology mode into auto CPU thread, adds HUD display for topology mode showing state/max and performance percentage
tests/unit/all/player/test_player_cpu.c Adds comprehensive test coverage for topology functions including parsing, classification, PerfState building, and update decisions
docs/auto-cpu-scaling.md Documents topology mode architecture, governor-based approach, PerfState ladder structure, and cluster classification algorithm

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread workspace/all/player/player.c Outdated

// Set pending_affinity under mutex (main thread will apply it)
// This avoids race condition with main thread reading pending_affinity
PlayerCPUPerfState* ps = &auto_cpu_state.topology.states[target_state];
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential race condition when accessing auto_cpu_state.topology.states[target_state] without holding the mutex. After releasing the mutex at line 855, the code accesses the topology.states array at line 869. While the topology structure is likely read-only after initialization, this access pattern is inconsistent with the mutex protection used for other shared state. Consider moving this access inside the mutex lock at lines 870-874, or document that topology.states is read-only after initialization.

Copilot uses AI. Check for mistakes.
Comment thread workspace/all/player/player.c Outdated
if (auto_cpu_state.use_topology) {
// Topology mode: show state/max and performance %
int perf_pct = PlayerCPU_getPerformancePercent(&auto_cpu_state);
int max_state = auto_cpu_state.topology.state_count - 1;
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing auto_cpu_state.topology.state_count without holding the auto_cpu_mutex. While topology data is likely read-only after initialization, this access at line 4107 is outside the mutex lock that was released at line 4090. For consistency with the protection of current_idx, current_state, and level, consider reading state_count under the mutex, or document that topology structure is read-only after initialization.

Copilot uses AI. Check for mistakes.
if (auto_cpu_state.use_topology) {
// Topology mode: show state/max and performance %
int perf_pct = PlayerCPU_getPerformancePercent(&auto_cpu_state);
int max_state = auto_cpu_state.topology.state_count - 1;
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing auto_cpu_state.topology.state_count without holding the auto_cpu_mutex. While topology data is likely read-only after initialization, this access at line 3786 is outside the mutex lock that was released at line 3769. For consistency with the protection of current_idx, current_state, and level, consider reading state_count under the mutex, or document that topology structure is read-only after initialization.

Copilot uses AI. Check for mistakes.
Comment thread workspace/all/common/api.c Outdated
static int compare_cluster_by_max_khz(const void* a, const void* b) {
const PlayerCPUCluster* ca = (const PlayerCPUCluster*)a;
const PlayerCPUCluster* cb = (const PlayerCPUCluster*)b;
return ca->max_khz - cb->max_khz;
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential integer overflow in subtraction-based comparison. If ca->max_khz and cb->max_khz have large values with different signs or differ by more than INT_MAX, the subtraction ca->max_khz - cb->max_khz could overflow. While this is unlikely with CPU frequencies (typically < 5000000 kHz), consider using the standard three-way comparison pattern: if (ca->max_khz < cb->max_khz) return -1; if (ca->max_khz > cb->max_khz) return 1; return 0;

Suggested change
return ca->max_khz - cb->max_khz;
if (ca->max_khz < cb->max_khz)
return -1;
if (ca->max_khz > cb->max_khz)
return 1;
return 0;

Copilot uses AI. Check for mistakes.
Comment thread workspace/all/common/api.c Outdated
Comment on lines +3452 to +3453
*cpu_mask |= (1 << cpu);
(*cpu_count)++;
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPU count increment logic may produce incorrect results. In parse_related_cpus, (*cpu_count)++ is called for every CPU in a range without checking if the CPU was already in the mask. For example, parsing "0-3,2-5" would count CPU 2 and 3 twice, resulting in cpu_count=8 instead of 6. Consider adding a duplicate check similar to PlayerCPU_parseCPUList which checks if (!(mask & (1 << cpu))) before incrementing the count.

Suggested change
*cpu_mask |= (1 << cpu);
(*cpu_count)++;
int bit = 1 << cpu;
if (!(*cpu_mask & bit)) {
*cpu_mask |= bit;
(*cpu_count)++;
}

Copilot uses AI. Check for mistakes.
Comment on lines +855 to +865
pthread_mutex_unlock(&auto_cpu_mutex);

if (target_state != current_state && target_state >= 0 &&
target_state < auto_cpu_state.topology.state_count) {
LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state,
auto_cpu_state.topology.state_count - 1);

int result = PlayerCPU_applyPerfState(&auto_cpu_state);
if (result != 0) {
LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state);
}
Copy link

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential race condition when accessing auto_cpu_state.topology.state_count without holding the mutex. The line reads auto_cpu_state.topology.state_count at line 858 after releasing the mutex at line 855, but this field could theoretically be modified by another thread. While unlikely in practice given the initialization pattern, consider reading state_count under the mutex for consistency with the pattern used for target_state and current_state.

Suggested change
pthread_mutex_unlock(&auto_cpu_mutex);
if (target_state != current_state && target_state >= 0 &&
target_state < auto_cpu_state.topology.state_count) {
LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state,
auto_cpu_state.topology.state_count - 1);
int result = PlayerCPU_applyPerfState(&auto_cpu_state);
if (result != 0) {
LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state);
}
int state_count = auto_cpu_state.topology.state_count;
pthread_mutex_unlock(&auto_cpu_mutex);
if (target_state != current_state && target_state >= 0 &&
target_state < state_count) {
LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state,
state_count - 1);
int result = PlayerCPU_applyPerfState(&auto_cpu_state);
if (result != 0) {
LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state);

Copilot uses AI. Check for mistakes.
On multi-cluster ARM SoCs (tg5050, retroid), the existing
PWR_setCPUSpeed() calls were no-ops, leaving CPUs at full power during
menu browsing and in tools.

PWR_setLowPowerMode() enumerates all cpufreq policies and sets each to
"powersave" governor. On single-cluster devices, it uses
PLAT_setCPUSpeed() instead.
Moved player_cpu.c/h to common/cpu.c/h and renamed all PlayerCPU*
identifiers to CPU* since the code is now shared infrastructure used by
both launcher (for topology detection) and player (for autoscaling).
Extracts debug overlay rendering from player.c into reusable platform
hooks, eliminating duplication between software and hardware render
paths and enabling future platform-specific optimizations.
Audio rate control improvements:
- Add cubic boost to proportional term (1× at center, 4× at limits)
- Prevents buffer hitting 0%/100% during integral learning
- Only boosts P term, not integral (avoids over-correction)
- Improves tg5040 stability: 40.5% mean fill, 704 warnings (down from 8000+)

CPU scaling improvements:
- Add panic grace period (60 frames after frequency change)
- Ignore underruns during grace to prevent cascade panics
- Add buffer threshold for reduce (min 40% to allow scaling down)
- Add stability decay (reduce panic counts after stable operation)
- Reduce panic step from 2 to 1 (less aggressive on panic)
…ion.

Frame pacer improvements:
- Measure actual vsync timing by recording SDL_RenderPresent() intervals
- Use exponential moving average (α=0.01) for stable Hz measurement
- Stable after 120 samples (~2 seconds of gameplay)
- Continuously track drift, re-check every 300 frames (~5 seconds)
- Automatically reinitialize when measured Hz differs >0.1% from current
- Reset Bresenham accumulator on Hz change to prevent glitches

Tolerance and cleanup:
- Reduce FRAME_PACER_TOLERANCE from 2% to 1% (matches RetroArch guidance)
- Remove FORCE_FRAME_PACING from tg5040/tg5050 (testing hack)

Results on tg5050:
- Discovered actual refresh rate: 62-68Hz (SDL reports 60Hz)
- Screen drifts dynamically: starts 65-68Hz, settles to 62-63Hz
- 14% mismatch explains audio buffer issues
- Frame pacing insufficient at this mismatch (needs audio-clock mode)

Tests:
- Added 6 tests for vsync measurement and drift tracking
- All 1578 tests passing
Replaces compile-time dual-loop architecture (vsync vs audio-clock) with
single integrated loop that measures actual display refresh rate and
adapts sync mode at runtime. Starts in audio-clock mode (safe,
universal) and switches to vsync if measured Hz is compatible (<1%
mismatch from game fps). Removes frame pacing (Bresenham), cubic boost,
and 350 lines of code. Simplifies auto CPU scaling to use pure
utilization-based decisions, with time-based reduction strategy for
audio-clock mode where blocking makes utilization metrics unreliable.
Audio rate control:
- Replace dual-timescale PI controller with pure P control (Arntzen
  algorithm)
- Remove integral term, error smoothing, and per-frame SND_newFrame()
  updates
- Reduce d parameter from 1.2% to 0.8% (larger buffer provides more
  headroom)
- Enable rate control in both modes as buffer health mechanism

Audio blocking:
- Replace SDL_Delay polling with true SDL_CondWait blocking in
  audio-clock mode
- Add SDL mutex/cond for thread-safe audio writes
- Block at 90% buffer full, wake on callback drain signal
- Pre-fill buffer to 50% at startup to prevent initial underruns

Audio buffer:
- Increase from 4096 to 6400 samples (~133ms, matches RetroArch 128ms
  default)
- Provides headroom for CPU frequency transitions and timing variance

sync_manager:
- Replace EMA smoothing with circular buffer and stddev-based
  convergence
- Measurement stable when stddev/mean < 1% (typically ~60 samples)
- Timeout after 1800 samples (~30s) if never converges
- Add 512-sample circular buffer for rolling statistics

Vsync control:
- Add GLVideo_setVsync() to enable/disable vsync at runtime
- Disable vsync in audio-clock mode (blocking is sole timing source)
- Enable vsync in vsync mode (display-driven timing)
- Update vsync setting on mode transitions

CPU scaling (all 3 modes):
- Add panic counting for topology and fallback modes (was granular-only)
- Add skip-blocked-states logic to all reduction paths
- Use time-based probing in AC mode with buffer-guided timing
- Fix stability decay to use mode-appropriate indices
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 50 out of 50 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@nchapman nchapman merged commit cf4c0c8 into develop Jan 8, 2026
6 checks passed
@nchapman nchapman deleted the feature/complex-cpu-scaling branch January 8, 2026 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement topology-aware CPU autoscaling for Linux SoCs (big.LITTLE & multi-cluster)

2 participants