
Commit f0ace35

committed
Merge branch 'pm-cpufreq' into linux-next
* pm-cpufreq: (28 commits)
  cpufreq: drop redundant cpus_read_lock() from store_local_boost()
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
  cpufreq: intel_pstate: Document hybrid processor support
  cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  cpufreq: intel_pstate: EAS support for hybrid platforms
  cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
  cpufreq: intel_pstate: Populate the cpu_capacity sysfs entries
  arch_topology: Relocate cpu_scale to topology.[h|c]
  cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  cpufreq/sched: schedutil: Add helper for governor checks
  amd-pstate-ut: Reset amd-pstate driver mode after running selftests
  cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option
  cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver
  cpufreq: Force sync policy boost with global boost on sysfs update
  cpufreq: Preserve policy's boost state after resume
  cpufreq: Introduce policy_set_boost()
  cpufreq: Don't unnecessarily call set_boost()
  cpufreq/amd-pstate: Move max_perf limiting in amd_pstate_update
  cpufreq: Drop unused cpufreq_get_policy()
  cpufreq: Pass policy pointer to ->update_limits()
  ...
2 parents 8f4cd93 + fd3d883 commit f0ace35

14 files changed

Lines changed: 659 additions & 404 deletions

Documentation/admin-guide/pm/intel_pstate.rst

Lines changed: 102 additions & 2 deletions
@@ -329,6 +329,106 @@ information listed above is the same for all of the processors supporting the
 HWP feature, which is why ``intel_pstate`` works with all of them.]
 
 
+Support for Hybrid Processors
+=============================
+
+Some processors supported by ``intel_pstate`` contain two or more types of CPU
+cores differing by the maximum turbo P-state, performance vs power characteristics,
+cache sizes, and possibly other properties. They are commonly referred to as
+hybrid processors. To support them, ``intel_pstate`` requires HWP to be enabled
+and it assumes the HWP performance units to be the same for all CPUs in the
+system, so a given HWP performance level always represents approximately the
+same physical performance regardless of the core (CPU) type.
+
+Hybrid Processors with SMT
+--------------------------
+
+On systems where SMT (Simultaneous Multithreading), also referred to as
+HyperThreading (HT) in the context of Intel processors, is enabled on at least
+one core, ``intel_pstate`` assigns performance-based priorities to CPUs. Namely,
+the priority of a given CPU reflects its highest HWP performance level which
+causes the CPU scheduler to generally prefer more performant CPUs, so the less
+performant CPUs are used when the other ones are fully loaded. However, SMT
+siblings (that is, logical CPUs sharing one physical core) are treated in a
+special way such that if one of them is in use, the effective priority of the
+other ones is lowered below the priorities of the CPUs located in the other
+physical cores.
+
+This approach maximizes performance in the majority of cases, but unfortunately
+it also leads to excessive energy usage in some important scenarios, like video
+playback, which is not generally desirable. While there is no other viable
+choice with SMT enabled because the effective capacity and utilization of SMT
+siblings are hard to determine, hybrid processors without SMT can be handled in
+more energy-efficient ways.
+
+.. _CAS:
+
+Capacity-Aware Scheduling Support
+---------------------------------
+
+The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by
+``intel_pstate`` by default on hybrid processors without SMT. CAS generally
+causes the scheduler to put tasks on a CPU so long as there is a sufficient
+amount of spare capacity on it, and if the utilization of a given task is too
+high for it, the task will need to go somewhere else.
+
+Since CAS takes CPU capacities into account, it does not require CPU
+prioritization and it allows tasks to be distributed more symmetrically among
+the more performant and less performant CPUs. Once placed on a CPU with enough
+capacity to accommodate it, a task may just continue to run there regardless of
+whether or not the other CPUs are fully loaded, so on average CAS reduces the
+utilization of the more performant CPUs which causes the energy usage to be more
+balanced because the more performant CPUs are generally less energy-efficient
+than the less performant ones.
+
+In order to use CAS, the scheduler needs to know the capacity of each CPU in
+the system and it needs to be able to compute scale-invariant utilization of
+CPUs, so ``intel_pstate`` provides it with the requisite information.
+
+First of all, the capacity of each CPU is represented by the ratio of its highest
+HWP performance level, multiplied by 1024, to the highest HWP performance level
+of the most performant CPU in the system, which works because the HWP performance
+units are the same for all CPUs. Second, the frequency-invariance computations,
+carried out by the scheduler to always express CPU utilization in the same units
+regardless of the frequency it is currently running at, are adjusted to take the
+CPU capacity into account. All of this happens when ``intel_pstate`` has
+registered itself with the ``CPUFreq`` core and it has figured out that it is
+running on a hybrid processor without SMT.
+
+Energy-Aware Scheduling Support
+-------------------------------
+
+If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and
+``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling
+`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the
+Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if
+``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate``
+to operate in the `passive mode <Passive Mode_>`_.
+
+The Energy Model registered by ``intel_pstate`` is artificial (that is, it is
+based on abstract cost values and it does not include any real power numbers)
+and it is relatively simple to avoid unnecessary computations in the scheduler.
+There is a performance domain in it for every CPU in the system and the cost
+values for these performance domains have been chosen so that running a task on
+a less performant (small) CPU appears to be always cheaper than running that
+task on a more performant (big) CPU. However, for two CPUs of the same type,
+the cost difference depends on their current utilization, and the CPU whose
+current utilization is higher generally appears to be a more expensive
+destination for a given task. This helps to balance the load among CPUs of the
+same type.
+
+Since EAS works on top of CAS, high-utilization tasks are always migrated to
+CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization
+tasks tend to be placed on the CPUs that look less expensive to the scheduler.
+Effectively, this causes the less performant and less loaded CPUs to be
+preferred as long as they have enough spare capacity to run the given task
+which generally leads to reduced energy usage.
+
+The Energy Model created by ``intel_pstate`` can be inspected by looking at
+the ``energy_model`` directory in ``debugfs`` (typically mounted on
+``/sys/kernel/debug/``).
+
+
 User Space Interface in ``sysfs``
 =================================
 
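The capacity computation described in the hunk above (highest HWP performance level, scaled so that the most performant CPU gets 1024) can be sketched in a few lines. This is an illustration only, with made-up HWP performance values, and the integer arithmetic is a simplification of whatever rounding the driver actually does:

```python
def cpu_capacity(highest_perf: int, max_highest_perf: int) -> int:
    # Capacity = highest HWP performance level * 1024, divided by the
    # highest HWP performance level of the most performant CPU.
    return highest_perf * 1024 // max_highest_perf

# Hypothetical hybrid system: two "big" CPUs (highest perf 60) and two
# "small" CPUs (highest perf 45); the values are invented for illustration.
highest_perf = {0: 60, 1: 60, 2: 45, 3: 45}
top = max(highest_perf.values())
capacities = {cpu: cpu_capacity(p, top) for cpu, p in highest_perf.items()}
print(capacities)  # {0: 1024, 1: 1024, 2: 768, 3: 768}
```

Because the HWP performance units are the same on all CPUs, the ratio is meaningful across core types, which is exactly the assumption the documentation states.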

@@ -697,8 +797,8 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
 	Limits`_ for details).
 
 ``no_cas``
-	Do not enable capacity-aware scheduling (CAS) which is enabled by
-	default on hybrid systems.
+	Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by
+	default on hybrid systems without SMT.
 
 Diagnostics and Tuning
 ======================
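The interplay of CAS and EAS described in the documentation above can be caricatured as a two-step placement rule: CAS filters out CPUs without enough spare capacity, then EAS prefers the cheapest remaining CPU. The model below is a deliberate toy, not the scheduler's actual algorithm; the capacity and utilization numbers are invented:

```python
def pick_cpu(task_util: int, cpus: dict) -> int:
    # CAS filter: only CPUs with enough spare capacity may take the task.
    fits = [c for c, s in cpus.items() if s["cap"] - s["util"] >= task_util]
    # EAS preference (toy): smaller-capacity CPUs look cheaper, and among
    # CPUs of the same size the less utilized one is preferred.
    return min(fits, key=lambda c: (cpus[c]["cap"], cpus[c]["util"]))

cpus = {0: {"cap": 1024, "util": 100}, 2: {"cap": 768, "util": 50}}
print(pick_cpu(200, cpus))  # -> 2: the small CPU still has room
print(pick_cpu(800, cpus))  # -> 0: the task no longer fits on the small CPU
```

This mirrors the documented behavior: less performant and less loaded CPUs are preferred as long as they have enough spare capacity, while high-utilization tasks end up on big CPUs.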

drivers/base/arch_topology.c

Lines changed: 0 additions & 52 deletions
@@ -154,14 +154,6 @@ void topology_set_freq_scale(const struct cpumask *cpus, unsigned long cur_freq,
 		per_cpu(arch_freq_scale, i) = scale;
 }
 
-DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
-EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
-
-void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
-{
-	per_cpu(cpu_scale, cpu) = capacity;
-}
-
 DEFINE_PER_CPU(unsigned long, hw_pressure);
 
 /**
@@ -207,53 +199,9 @@ void topology_update_hw_pressure(const struct cpumask *cpus,
 }
 EXPORT_SYMBOL_GPL(topology_update_hw_pressure);
 
-static ssize_t cpu_capacity_show(struct device *dev,
-				 struct device_attribute *attr,
-				 char *buf)
-{
-	struct cpu *cpu = container_of(dev, struct cpu, dev);
-
-	return sysfs_emit(buf, "%lu\n", topology_get_cpu_scale(cpu->dev.id));
-}
-
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
 
-static DEVICE_ATTR_RO(cpu_capacity);
-
-static int cpu_capacity_sysctl_add(unsigned int cpu)
-{
-	struct device *cpu_dev = get_cpu_device(cpu);
-
-	if (!cpu_dev)
-		return -ENOENT;
-
-	device_create_file(cpu_dev, &dev_attr_cpu_capacity);
-
-	return 0;
-}
-
-static int cpu_capacity_sysctl_remove(unsigned int cpu)
-{
-	struct device *cpu_dev = get_cpu_device(cpu);
-
-	if (!cpu_dev)
-		return -ENOENT;
-
-	device_remove_file(cpu_dev, &dev_attr_cpu_capacity);
-
-	return 0;
-}
-
-static int register_cpu_capacity_sysctl(void)
-{
-	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "topology/cpu-capacity",
-			  cpu_capacity_sysctl_add, cpu_capacity_sysctl_remove);
-
-	return 0;
-}
-subsys_initcall(register_cpu_capacity_sysctl);
-
 static int update_topology;
 
 int topology_update_cpu_topology(void)

drivers/base/topology.c

Lines changed: 52 additions & 0 deletions
@@ -208,3 +208,55 @@ static int __init topology_sysfs_init(void)
 }
 
 device_initcall(topology_sysfs_init);
+
+DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
+EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
+
+void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
+{
+	per_cpu(cpu_scale, cpu) = capacity;
+}
+
+static ssize_t cpu_capacity_show(struct device *dev,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, dev);
+
+	return sysfs_emit(buf, "%lu\n", topology_get_cpu_scale(cpu->dev.id));
+}
+
+static DEVICE_ATTR_RO(cpu_capacity);
+
+static int cpu_capacity_sysctl_add(unsigned int cpu)
+{
+	struct device *cpu_dev = get_cpu_device(cpu);
+
+	if (!cpu_dev)
+		return -ENOENT;
+
+	device_create_file(cpu_dev, &dev_attr_cpu_capacity);
+
+	return 0;
+}
+
+static int cpu_capacity_sysctl_remove(unsigned int cpu)
+{
+	struct device *cpu_dev = get_cpu_device(cpu);
+
+	if (!cpu_dev)
+		return -ENOENT;
+
+	device_remove_file(cpu_dev, &dev_attr_cpu_capacity);
+
+	return 0;
+}
+
+static int register_cpu_capacity_sysctl(void)
+{
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "topology/cpu-capacity",
+			  cpu_capacity_sysctl_add, cpu_capacity_sysctl_remove);
+
+	return 0;
+}
+subsys_initcall(register_cpu_capacity_sysctl);
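The `cpu_capacity` attribute registered above is exposed per CPU under `/sys/devices/system/cpu/cpuN/cpu_capacity`. A small userspace sketch that collects these values; the base directory is a parameter so the function can be exercised against a test tree rather than a live system:

```python
from pathlib import Path

def read_cpu_capacities(cpu_dir="/sys/devices/system/cpu"):
    """Collect the per-CPU 'cpu_capacity' sysfs values into {name: capacity};
    CPUs that do not expose the attribute are simply skipped."""
    caps = {}
    for attr in sorted(Path(cpu_dir).glob("cpu[0-9]*/cpu_capacity")):
        caps[attr.parent.name] = int(attr.read_text())
    return caps
```

On a hybrid system with this series applied, the most performant cores would report 1024 and the smaller cores proportionally less, matching the capacity ratio described in the intel_pstate documentation.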

drivers/cpufreq/amd-pstate-ut.c

Lines changed: 13 additions & 8 deletions
@@ -242,25 +242,30 @@ static int amd_pstate_set_mode(enum amd_pstate_mode mode)
 static int amd_pstate_ut_check_driver(u32 index)
 {
 	enum amd_pstate_mode mode1, mode2 = AMD_PSTATE_DISABLE;
+	enum amd_pstate_mode orig_mode = amd_pstate_get_status();
+	int ret;
 
 	for (mode1 = AMD_PSTATE_DISABLE; mode1 < AMD_PSTATE_MAX; mode1++) {
-		int ret = amd_pstate_set_mode(mode1);
+		ret = amd_pstate_set_mode(mode1);
 		if (ret)
 			return ret;
 		for (mode2 = AMD_PSTATE_DISABLE; mode2 < AMD_PSTATE_MAX; mode2++) {
 			if (mode1 == mode2)
 				continue;
 			ret = amd_pstate_set_mode(mode2);
-			if (ret) {
-				pr_err("%s: failed to update status for %s->%s\n", __func__,
-				       amd_pstate_get_mode_string(mode1),
-				       amd_pstate_get_mode_string(mode2));
-				return ret;
-			}
+			if (ret)
+				goto out;
 		}
 	}
 
-	return 0;
+out:
+	if (ret)
+		pr_warn("%s: failed to update status for %s->%s: %d\n", __func__,
+			amd_pstate_get_mode_string(mode1),
+			amd_pstate_get_mode_string(mode2), ret);
+
+	amd_pstate_set_mode(orig_mode);
+	return ret;
 }
 
 static int __init amd_pstate_ut_init(void)
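The behavioral change in this hunk is that the selftest now records the driver mode before cycling through every mode-to-mode transition and restores it on the way out, whether or not a transition failed. The same control flow can be sketched in Python; the mode names and the fake driver below are invented for illustration:

```python
def check_driver(set_mode, get_status, modes):
    """Try every mode1 -> mode2 transition, mirroring the control flow of
    the reworked amd_pstate_ut_check_driver(); whatever happens, restore
    the mode the driver was in before the test ran."""
    orig = get_status()
    try:
        for m1 in modes:
            ret = set_mode(m1)
            if ret:
                return ret
            for m2 in modes:
                if m1 == m2:
                    continue
                ret = set_mode(m2)
                if ret:
                    return ret
        return 0
    finally:
        set_mode(orig)  # runs on success and on failure alike

# Fake driver whose transitions always succeed, for demonstration.
state = {"mode": "active"}
def fake_set(mode):
    state["mode"] = mode
    return 0

result = check_driver(fake_set, lambda: state["mode"],
                      ["disable", "passive", "active"])
print(result, state["mode"])  # 0 active
```

Python's `try`/`finally` plays the role of the C code's `out:` label plus the unconditional `amd_pstate_set_mode(orig_mode)` call before returning.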
