Upgrade CPU scaling for multi-cluster SoCs by nchapman · Pull Request #118 · lessui-hq/LessUI

nchapman · 2026-01-05T23:48:50Z

This PR upgrades the CPU scaling system to support multi-cluster SoCs (big.LITTLE architectures) and replaces the compile-time sync mode system with runtime-adaptive audio/video synchronization. The changes include:

Moved CPU scaling from player-specific to common code, enabling launcher access
Added CPU topology detection and governor-based PerfState management for multi-cluster SoCs
Replaced frame_pacer (Bresenham algorithm) with sync_manager for runtime mode switching
Enhanced audio system with true blocking via SDL_CondWait (audio-clock mode)
Updated audio rate control to pure proportional (removed integral term)
Increased audio buffer size from 4096 to 6400 samples for better headroom

Fixes #117

Extends the CPU autoscaler to support big.LITTLE and tri-cluster SoCs (Allwinner A523, Snapdragon 865, etc.) while maintaining backward compatibility with single-cluster devices. Key features: - Detects CPU clusters via sysfs (/sys/devices/system/cpu/cpufreq/policyN) - Classifies clusters as LITTLE/BIG/PRIME based on max frequency and core count - Builds a PerfState ladder using governors (powersave/schedutil/performance) - Uses CPU affinity to guide the emulation thread to the appropriate cluster The governor-based approach works WITH the kernel's frequency scaling: - Our algorithm provides frame timing intelligence (which cluster tier) - The kernel provides frequency scaling intelligence (schedutil) - Inactive clusters get powersave governor, allowing them to truly idle PerfState ladder structure: - Dual-cluster: 6 states (LITTLE×3 governors, BIG×3 governors) - Tri-cluster: 9 states (adds PRIME×3 governors) - Progression: LITTLE powersave → LITTLE schedutil → LITTLE performance → BIG powersave → BIG schedutil → BIG performance → ... Single-cluster devices continue using the existing granular frequency mode with userspace governor and scaling_setspeed.

Introduces two helper functions that abstract over the three CPU scaling modes (topology/granular/fallback): - PlayerCPU_getPerformancePercent(): returns 0-100% normalized level - PlayerCPU_getModeName(): returns mode string for logging/debugging Updates the debug HUD to properly display topology mode state. Previously it fell through to fallback mode display which showed incorrect info. New format: "T3/5 60% u:75% b:80%" showing state index, max state, performance %, utilization %, and buffer fill %.

Copilot

Pull request overview

This PR adds multi-cluster CPU topology support for ARM SoCs with heterogeneous CPU architectures (big.LITTLE, tri-cluster designs like Snapdragon 865). Instead of manipulating frequency bounds, the new topology mode uses CPU governors (powersave/schedutil/performance) and CPU affinity to guide the emulation thread across performance tiers, working with the kernel's frequency scaling intelligence rather than against it.

Key changes:

Topology detection via sysfs that enumerates CPU clusters and classifies them as LITTLE/BIG/PRIME based on max frequency
PerfState ladder system with 3 governor levels per cluster (6 states for dual-cluster, 9 for tri-cluster)
Governor-based scaling approach that leverages kernel's schedutil intelligence instead of fighting with frequency bounds
Backward compatibility maintained: single-cluster devices continue using granular frequency mode

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
workspace/all/player/player_cpu.h	Defines new topology types (PlayerCPUCluster, PlayerCPUPerfState, PlayerCPUTopology), governor enum, and function declarations for multi-cluster support
workspace/all/player/player_cpu.c	Implements topology mode update logic, PerfState building/application, cluster classification, and CPU list parsing
workspace/all/common/api.h	Declares new API functions for topology detection, governor control, and thread affinity
workspace/all/common/api.c	Implements sysfs-based topology detection, governor/affinity control, and cluster enumeration
workspace/all/player/player.c	Integrates topology mode into auto CPU thread, adds HUD display for topology mode showing state/max and performance percentage
tests/unit/all/player/test_player_cpu.c	Adds comprehensive test coverage for topology functions including parsing, classification, PerfState building, and update decisions
docs/auto-cpu-scaling.md	Documents topology mode architecture, governor-based approach, PerfState ladder structure, and cluster classification algorithm

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-05T23:53:04Z

+
+				// Set pending_affinity under mutex (main thread will apply it)
+				// This avoids race condition with main thread reading pending_affinity
+				PlayerCPUPerfState* ps = &auto_cpu_state.topology.states[target_state];


Potential race condition when accessing auto_cpu_state.topology.states[target_state] without holding the mutex. After releasing the mutex at line 855, the code accesses the topology.states array at line 869. While the topology structure is likely read-only after initialization, this access pattern is inconsistent with the mutex protection used for other shared state. Consider moving this access inside the mutex lock at lines 870-874, or document that topology.states is read-only after initialization.

Copilot · 2026-01-05T23:53:04Z

+			if (auto_cpu_state.use_topology) {
+				// Topology mode: show state/max and performance %
+				int perf_pct = PlayerCPU_getPerformancePercent(&auto_cpu_state);
+				int max_state = auto_cpu_state.topology.state_count - 1;


Accessing auto_cpu_state.topology.state_count without holding the auto_cpu_mutex. While topology data is likely read-only after initialization, this access at line 4107 is outside the mutex lock that was released at line 4090. For consistency with the protection of current_idx, current_state, and level, consider reading state_count under the mutex, or document that topology structure is read-only after initialization.

Copilot · 2026-01-05T23:53:04Z

+		if (auto_cpu_state.use_topology) {
+			// Topology mode: show state/max and performance %
+			int perf_pct = PlayerCPU_getPerformancePercent(&auto_cpu_state);
+			int max_state = auto_cpu_state.topology.state_count - 1;


Accessing auto_cpu_state.topology.state_count without holding the auto_cpu_mutex. While topology data is likely read-only after initialization, this access at line 3786 is outside the mutex lock that was released at line 3769. For consistency with the protection of current_idx, current_state, and level, consider reading state_count under the mutex, or document that topology structure is read-only after initialization.

Copilot · 2026-01-05T23:53:05Z

+static int compare_cluster_by_max_khz(const void* a, const void* b) {
+	const PlayerCPUCluster* ca = (const PlayerCPUCluster*)a;
+	const PlayerCPUCluster* cb = (const PlayerCPUCluster*)b;
+	return ca->max_khz - cb->max_khz;


Potential integer overflow in subtraction-based comparison. If ca->max_khz and cb->max_khz have large values with different signs or differ by more than INT_MAX, the subtraction ca->max_khz - cb->max_khz could overflow. While this is unlikely with CPU frequencies (typically < 5000000 kHz), consider using the standard three-way comparison pattern: if (ca->max_khz < cb->max_khz) return -1; if (ca->max_khz > cb->max_khz) return 1; return 0;

Suggested change

return ca->max_khz - cb->max_khz;

if (ca->max_khz < cb->max_khz)

return -1;

if (ca->max_khz > cb->max_khz)

return 1;

return 0;

Copilot · 2026-01-05T23:53:05Z

+				*cpu_mask |= (1 << cpu);
+				(*cpu_count)++;


The CPU count increment logic may produce incorrect results. In parse_related_cpus, (*cpu_count)++ is called for every CPU in a range without checking if the CPU was already in the mask. For example, parsing "0-3,2-5" would count CPU 2 and 3 twice, resulting in cpu_count=8 instead of 6. Consider adding a duplicate check similar to PlayerCPU_parseCPUList which checks if (!(mask & (1 << cpu))) before incrementing the count.

Suggested change

*cpu_mask |= (1 << cpu);

(*cpu_count)++;

int bit = 1 << cpu;

if (!(*cpu_mask & bit)) {

*cpu_mask |= bit;

(*cpu_count)++;

}

Copilot · 2026-01-05T23:53:05Z

+			pthread_mutex_unlock(&auto_cpu_mutex);
+
+			if (target_state != current_state && target_state >= 0 &&
+			    target_state < auto_cpu_state.topology.state_count) {
+				LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state,
+				          auto_cpu_state.topology.state_count - 1);
+
+				int result = PlayerCPU_applyPerfState(&auto_cpu_state);
+				if (result != 0) {
+					LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state);
+				}


Potential race condition when accessing auto_cpu_state.topology.state_count without holding the mutex. The line reads auto_cpu_state.topology.state_count at line 858 after releasing the mutex at line 855, but this field could theoretically be modified by another thread. While unlikely in practice given the initialization pattern, consider reading state_count under the mutex for consistency with the pattern used for target_state and current_state.

Suggested change

pthread_mutex_unlock(&auto_cpu_mutex);

if (target_state != current_state && target_state >= 0 &&

target_state < auto_cpu_state.topology.state_count) {

LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state,

auto_cpu_state.topology.state_count - 1);

int result = PlayerCPU_applyPerfState(&auto_cpu_state);

if (result != 0) {

LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state);

}

int state_count = auto_cpu_state.topology.state_count;

pthread_mutex_unlock(&auto_cpu_mutex);

if (target_state != current_state && target_state >= 0 &&

target_state < state_count) {

LOG_debug("Auto CPU: applying PerfState %d/%d\n", target_state,

state_count - 1);

int result = PlayerCPU_applyPerfState(&auto_cpu_state);

if (result != 0) {

LOG_warn("Auto CPU: failed to apply PerfState %d\n", target_state);

On multi-cluster ARM SoCs (tg5050, retroid), the existing PWR_setCPUSpeed() calls were no-ops, leaving CPUs at full power during menu browsing and in tools. PWR_setLowPowerMode() enumerates all cpufreq policies and sets each to "powersave" governor. On single-cluster devices, it uses PLAT_setCPUSpeed() instead.

Moved player_cpu.c/h to common/cpu.c/h and renamed all PlayerCPU* identifiers to CPU* since the code is now shared infrastructure used by both launcher (for topology detection) and player (for autoscaling).

Extracts debug overlay rendering from player.c into reusable platform hooks, eliminating duplication between software and hardware render paths and enabling future platform-specific optimizations.

Audio rate control improvements: - Add cubic boost to proportional term (1× at center, 4× at limits) - Prevents buffer hitting 0%/100% during integral learning - Only boosts P term, not integral (avoids over-correction) - Improves tg5040 stability: 40.5% mean fill, 704 warnings (down from 8000+) CPU scaling improvements: - Add panic grace period (60 frames after frequency change) - Ignore underruns during grace to prevent cascade panics - Add buffer threshold for reduce (min 40% to allow scaling down) - Add stability decay (reduce panic counts after stable operation) - Reduce panic step from 2 to 1 (less aggressive on panic)

…ion. Frame pacer improvements: - Measure actual vsync timing by recording SDL_RenderPresent() intervals - Use exponential moving average (α=0.01) for stable Hz measurement - Stable after 120 samples (~2 seconds of gameplay) - Continuously track drift, re-check every 300 frames (~5 seconds) - Automatically reinitialize when measured Hz differs >0.1% from current - Reset Bresenham accumulator on Hz change to prevent glitches Tolerance and cleanup: - Reduce FRAME_PACER_TOLERANCE from 2% to 1% (matches RetroArch guidance) - Remove FORCE_FRAME_PACING from tg5040/tg5050 (testing hack) Results on tg5050: - Discovered actual refresh rate: 62-68Hz (SDL reports 60Hz) - Screen drifts dynamically: starts 65-68Hz, settles to 62-63Hz - 14% mismatch explains audio buffer issues - Frame pacing insufficient at this mismatch (needs audio-clock mode) Tests: - Added 6 tests for vsync measurement and drift tracking - All 1578 tests passing

Replaces compile-time dual-loop architecture (vsync vs audio-clock) with single integrated loop that measures actual display refresh rate and adapts sync mode at runtime. Starts in audio-clock mode (safe, universal) and switches to vsync if measured Hz is compatible (<1% mismatch from game fps). Removes frame pacing (Bresenham), cubic boost, and 350 lines of code. Simplifies auto CPU scaling to use pure utilization-based decisions, with time-based reduction strategy for audio-clock mode where blocking makes utilization metrics unreliable.

Audio rate control: - Replace dual-timescale PI controller with pure P control (Arntzen algorithm) - Remove integral term, error smoothing, and per-frame SND_newFrame() updates - Reduce d parameter from 1.2% to 0.8% (larger buffer provides more headroom) - Enable rate control in both modes as buffer health mechanism Audio blocking: - Replace SDL_Delay polling with true SDL_CondWait blocking in audio-clock mode - Add SDL mutex/cond for thread-safe audio writes - Block at 90% buffer full, wake on callback drain signal - Pre-fill buffer to 50% at startup to prevent initial underruns Audio buffer: - Increase from 4096 to 6400 samples (~133ms, matches RetroArch 128ms default) - Provides headroom for CPU frequency transitions and timing variance sync_manager: - Replace EMA smoothing with circular buffer and stddev-based convergence - Measurement stable when stddev/mean < 1% (typically ~60 samples) - Timeout after 1800 samples (~30s) if never converges - Add 512-sample circular buffer for rolling statistics Vsync control: - Add GLVideo_setVsync() to enable/disable vsync at runtime - Disable vsync in audio-clock mode (blocking is sole timing source) - Enable vsync in vsync mode (display-driven timing) - Update vsync setting on mode transitions CPU scaling (all 3 modes): - Add panic counting for topology and fallback modes (was granular-only) - Add skip-blocked-states logic to all reduction paths - Use time-based probing in AC mode with buffer-guided timing - Fix stability decay to use mode-appropriate indices

Copilot

Pull request overview

Copilot reviewed 50 out of 50 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nchapman added 3 commits January 5, 2026 13:53

Update CPU scaling docs.

b67543c

Copilot AI review requested due to automatic review settings January 5, 2026 23:48

Copilot started reviewing on behalf of nchapman January 5, 2026 23:49 View session

Copilot AI reviewed Jan 5, 2026

View reviewed changes

nchapman added 7 commits January 5, 2026 16:15

Refactor player_cpu to common/cpu with CPU_ prefix.

5186b6f

Moved player_cpu.c/h to common/cpu.c/h and renamed all PlayerCPU* identifiers to CPU* since the code is now shared infrastructure used by both launcher (for topology detection) and player (for autoscaling).

Refactor debug HUD to use consistent scaling.

a95774c

Extracts debug overlay rendering from player.c into reusable platform hooks, eliminating duplication between software and hardware render paths and enabling future platform-specific optimizations.

nchapman requested a review from Copilot January 8, 2026 04:56

Copilot started reviewing on behalf of nchapman January 8, 2026 04:56 View session

Copilot AI reviewed Jan 8, 2026

View reviewed changes

Minor clean up.

2021287

nchapman merged commit cf4c0c8 into develop Jan 8, 2026
6 checks passed

nchapman deleted the feature/complex-cpu-scaling branch January 8, 2026 05:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade CPU scaling for multi-cluster SoCs#118

Upgrade CPU scaling for multi-cluster SoCs#118
nchapman merged 11 commits into
developfrom
feature/complex-cpu-scaling

nchapman commented Jan 5, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 5, 2026

Uh oh!

Copilot AI Jan 5, 2026

Uh oh!

Copilot AI Jan 5, 2026

Uh oh!

Copilot AI Jan 5, 2026

Uh oh!

Copilot AI Jan 5, 2026

Uh oh!

Copilot AI Jan 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-	return ca->max_khz - cb->max_khz;
+	if (ca->max_khz < cb->max_khz)
+		return -1;
+	if (ca->max_khz > cb->max_khz)
+		return 1;
+	return 0;

Conversation

nchapman commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nchapman commented Jan 5, 2026 •

edited

Loading