.. SPDX-License-Identifier: GPL-2.0

=======================
Energy Model of devices
=======================

1. Overview
-----------

The Energy Model (EM) framework serves as an interface between drivers that know
the power consumed by devices at various performance levels, and the kernel
subsystems that want to use that information to make energy-aware decisions.

The source of the information about the power consumed by devices can vary
greatly from one platform to another. These power costs can be estimated using
devicetree data in some cases. In others, the firmware will know better.
Alternatively, userspace might be best positioned. And so on. To avoid having
every client subsystem re-implement support for every possible source of
information on its own, the EM framework acts as an abstraction layer which
standardizes the format of power cost tables in the kernel, thereby avoiding
redundant work.

The power values might be expressed in micro-Watts or in an 'abstract scale'.
Multiple subsystems might use the EM and it is up to the system integrator to
check that the requirements for the power value scale types are met. An example
can be found in the Energy-Aware Scheduler documentation
Documentation/scheduler/sched-energy.rst. For some subsystems, like thermal or
powercap, power values expressed in an 'abstract scale' might cause issues.
These subsystems are more interested in estimating the power used in the past,
thus real micro-Watts might be needed.
An example of these requirements can
be found in the Intelligent Power Allocation in
Documentation/driver-api/thermal/power_allocator.rst.
Kernel subsystems might implement automatic detection to check whether EM
registered devices have an inconsistent scale (based on an EM internal flag).
An important thing to keep in mind is that when the power values are expressed
in an 'abstract scale', deriving real energy in micro-Joules is not possible.

The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
framework, and interested clients reading the data from it::

       +---------------+  +-----------------+  +---------------+
       | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
       +---------------+  +-----------------+  +---------------+
               |                   | em_cpu_energy()   |
               |                   | em_cpu_get()      |
               +---------+         |         +---------+
                         |         |         |
                         v         v         v
                        +---------------------+
                        |    Energy Model     |
                        |     Framework       |
                        +---------------------+
                           ^       ^       ^
                           |       |       | em_dev_register_perf_domain()
                +----------+       |       +---------+
                |                  |                 |
        +---------------+  +---------------+  +--------------+
        |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
        +---------------+  +---------------+  +--------------+
                ^                  ^                 ^
                |                  |                 |
        +--------------+   +---------------+  +--------------+
        | Device Tree  |   |   Firmware    |  |      ?       |
        +--------------+   +---------------+  +--------------+

In the case of CPU devices, the EM framework manages power cost tables per
'performance domain' in the system. A performance domain is a group of CPUs
whose performance is scaled together. Performance domains generally have a
1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are
required to have the same micro-architecture. CPUs in different performance
domains can have different micro-architectures.

To better reflect power variation due to static power (leakage), the EM
supports runtime modifications of the power values. The mechanism relies on
RCU to free the modifiable EM perf_state table memory. Its user, the task
scheduler, also uses RCU to access this memory. The EM framework provides
an API for allocating and freeing the memory for the modifiable EM table.
The old memory is freed automatically by an RCU callback once there are no
owners left for the given EM runtime table instance. Ownership is tracked
using the kref mechanism. The device driver which provided the new EM at
runtime should call the EM API to free it safely when it is no longer needed.
The EM framework will handle the clean-up when it is possible.

Kernel code which modifies the EM values is protected from concurrent
access by a mutex. Therefore, the device driver code must run in a context
that is allowed to sleep when it tries to modify the EM.

With the runtime modifiable EM we switch from a 'single EM that stays static
during the entire runtime' (system property) design to a 'single EM which can
be changed during runtime according e.g. to the workload' (system and workload
property) design.
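
The ownership rule described above (a table is released only once no owner
remains) can be illustrated with a small userspace sketch. Everything in it is
hypothetical: 'struct demo_table', demo_table_alloc(), demo_table_put() and
demo_domain_update() are illustrative stand-ins, a plain integer replaces kref,
and RCU grace periods are not modelled at all. It is not the kernel
implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel structures. */
struct demo_table {
        int refcount;            /* plays the role of the kref */
        int *freed_flag;         /* set to 1 when the table is released */
        unsigned long power[4];  /* placeholder payload */
};

struct demo_domain {
        struct demo_table *active;  /* table currently visible to readers */
};

/* Allocate a table; the caller (the "driver") owns the first reference. */
static struct demo_table *demo_table_alloc(int *freed_flag)
{
        struct demo_table *t = calloc(1, sizeof(*t));

        if (!t)
                return NULL;
        t->refcount = 1;
        t->freed_flag = freed_flag;
        return t;
}

/* Drop one reference; the memory is released when the last owner is gone. */
static void demo_table_put(struct demo_table *t)
{
        assert(t->refcount > 0);
        if (--t->refcount == 0) {
                if (t->freed_flag)
                        *t->freed_flag = 1;
                free(t);
        }
}

/* Install a new table: the domain takes a reference and drops the old one. */
static void demo_domain_update(struct demo_domain *pd,
                               struct demo_table *new_table)
{
        struct demo_table *old = pd->active;

        new_table->refcount++;
        pd->active = new_table;
        if (old)
                demo_table_put(old);
}
```

In this sketch, a driver that allocates a table, installs it with
demo_domain_update() and then calls demo_table_put() leaves the domain as the
only owner; the table is released when a later update replaces it.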

It is also possible to modify the CPU performance values for each EM's
performance state. Thus, the full power and performance profile (which
is an exponential curve) can be changed according e.g. to the workload
or system property.


2. Core APIs
------------

2.1 Config options
^^^^^^^^^^^^^^^^^^

CONFIG_ENERGY_MODEL must be enabled to use the EM framework.


2.2 Registration of performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Registration of 'advanced' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'advanced' EM gets its name from the fact that the driver is allowed
to provide a more precise power model. It is not limited to a math formula
implemented in the framework (as is the case for the 'simple' EM). It can
better reflect the real power measurements performed for each performance
state. Thus, this registration method should be preferred when accounting for
EM static power (leakage) is important.

Drivers are expected to register performance domains into the EM framework by
calling the following API::

  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
                struct em_data_callback *cb, cpumask_t *cpus, bool microwatts);

Drivers must provide a callback function returning <frequency, power> tuples
for each performance state. The callback function provided by the driver is free
to fetch data from any relevant location (DT, firmware, ...), and by any means
deemed necessary.
Only for CPU devices, drivers must specify the CPUs of the
performance domains using a cpumask. For devices other than CPUs, the 'cpus'
argument must be set to NULL.
It is important to set the last argument, 'microwatts', to the correct value.
Kernel subsystems which use the EM might rely on this flag to check if all EM
devices use the same scale. If there are different scales, these subsystems
might decide to return a warning/error, stop working or panic.
See Section 3. for an example of a driver implementing this
callback, or Section 2.5 for further documentation on this API.

Registration of EM using DT
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EM can also be registered using the OPP framework and the information in
the DT "operating-points-v2" node. Each OPP entry in DT can be extended with a
property "opp-microwatt" containing a micro-Watts power value. This OPP DT
property allows a platform to register EM power values which reflect total
power (static + dynamic). These power values might come directly from
experiments and measurements.

Registration of 'artificial' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is an option to provide a custom callback for drivers missing detailed
knowledge about the power value of each performance state. The callback
.get_cost() is optional and provides the 'cost' values used by the EAS.
This is useful for platforms that only provide information on relative
efficiency between CPU types, where one could use the information to
create an abstract power model.
But even an abstract power model can
sometimes be hard to fit in, given the input power value size restrictions.
The .get_cost() callback allows the driver to provide 'cost' values which
reflect the efficiency of the CPUs. This gives EAS information with a
different relation than the one that would be forced by the EM internal
formulas calculating 'cost' values. To register an EM for such a platform, the
driver must set the flag 'microwatts' to false, provide the .get_power()
callback and provide the .get_cost() callback. The EM framework handles such a
platform properly during registration. The flag EM_PERF_DOMAIN_ARTIFICIAL is
set for such a platform. Other frameworks which use the EM should take special
care to test and treat this flag properly.

Registration of 'simple' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'simple' EM is registered using the framework helper function
cpufreq_register_em_with_opp(). It implements a power model which is tied to
the math formula::

  Power = C * V^2 * f

An EM registered using this method might not correctly reflect the physics of
a real device, e.g. when static power (leakage) is important.


2.3 Accessing performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two API functions which provide access to the energy model:
em_cpu_get(), which takes a CPU id as an argument, and em_pd_get(), which takes
a device pointer as an argument.
Which interface to use depends on the
subsystem, but in the case of CPU devices both functions return the same
performance domain.

Subsystems interested in the energy model of a CPU can retrieve it using the
em_cpu_get() API. The energy model tables are allocated once upon creation of
the performance domains, and kept in memory untouched.

The energy consumed by a performance domain can be estimated using the
em_cpu_energy() API. For CPU devices, the estimation is performed assuming that
the schedutil CPUfreq governor is in use. Currently this calculation is
not provided for other types of devices.

More details about the above APIs can be found in ``<linux/energy_model.h>``
or in Section 2.5.


2.4 Runtime modifications
^^^^^^^^^^^^^^^^^^^^^^^^^

Drivers willing to update the EM at runtime should use the following dedicated
function to allocate a new instance of the modified EM. The API is listed
below::

  struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);

This allocates a structure which contains the new EM table together with the
RCU and kref members needed by the EM framework. The 'struct em_perf_table'
contains the array 'struct em_perf_state state[]', which is a list of
performance states in ascending order. That list must be populated by the
device driver which wants to update the EM. The list of frequencies can be
taken from the existing EM (created during boot). The content of each
'struct em_perf_state' must be populated by the driver as well.
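
As an illustration of the population step, the following userspace sketch fills
a table in ascending frequency order, using the P = C * V^2 * f relation that
Section 2.2 gives for the 'simple' EM. The names ('struct demo_perf_state',
'struct demo_opp', demo_power_uw(), demo_fill_table()), the chosen units
(nF, mV, kHz, uW) and the ordering check are assumptions made for this example
only; a real driver would fill 'struct em_perf_state' with values from its own
data source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one performance state; not the kernel struct. */
struct demo_perf_state {
        unsigned long frequency_khz;
        unsigned long power_uw;
};

/* Hypothetical input: one operating point with its supply voltage. */
struct demo_opp {
        unsigned long frequency_khz;
        unsigned long millivolts;
};

/*
 * P = C * V^2 * f with C in nF, V in mV and f in kHz:
 * nF * mV^2 * kHz = 1e-9 * 1e-6 * 1e3 W = 1e-12 W = 1e-6 uW,
 * hence the division by 1000000 to obtain micro-Watts.
 */
static unsigned long demo_power_uw(unsigned long cap_nf, unsigned long mv,
                                   unsigned long khz)
{
        uint64_t p = (uint64_t)cap_nf * mv * mv * khz;

        return (unsigned long)(p / 1000000);
}

/*
 * Fill 'states' from 'opps'. Returns 0 on success, or -1 if the input
 * frequencies are not strictly ascending (a check chosen for this sketch).
 */
static int demo_fill_table(struct demo_perf_state *states,
                           const struct demo_opp *opps, size_t n,
                           unsigned long cap_nf)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (i > 0 &&
                    opps[i].frequency_khz <= opps[i - 1].frequency_khz)
                        return -1;
                states[i].frequency_khz = opps[i].frequency_khz;
                states[i].power_uw = demo_power_uw(cap_nf, opps[i].millivolts,
                                                   opps[i].frequency_khz);
        }
        return 0;
}
```

With a hypothetical capacitance of 1 nF, an operating point of 1000000 kHz at
1000 mV yields 1000000 uW (1 W) in this sketch.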

This is the API which does the EM update, using an RCU pointer swap::

  int em_dev_update_perf_domain(struct device *dev,
                        struct em_perf_table __rcu *new_table);

Drivers must provide a pointer to the allocated and initialized new EM
'struct em_perf_table'. That new EM will be safely used inside the EM framework
and will be visible to other sub-systems in the kernel (thermal, powercap).
The main design goal for this API is to be fast and avoid extra calculations
or memory allocations at runtime. When pre-computed EMs are available in the
device driver, then it should be possible to simply re-use them with low
performance overhead.

In order to free the EM provided earlier by the driver (e.g. when the module
is unloaded), the following API must be called::

  void em_table_free(struct em_perf_table __rcu *table);

It allows the EM framework to safely release the memory when there is
no other sub-system using it, e.g. EAS.

To use the power values in other sub-systems (like thermal, powercap), there is
a need to call an API which protects the reader and provides consistency of the
EM table data::

  struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);

It returns the 'struct em_perf_state' pointer, which is an array of performance
states in ascending order.
This function must be called inside an RCU read-side critical section (after
rcu_read_lock()). When the EM table is not needed anymore there is a need to
call rcu_read_unlock().
In this way the EM safely uses the RCU read section
and protects the users. It also allows the EM framework to manage the memory
and free it. More details on how to use it can be found in Section 3.2 in the
example driver.

There is a dedicated API for device drivers to calculate em_perf_state::cost
values::

  int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
                           int nr_states);

These 'cost' values from the EM are used in EAS. The new EM table should be
passed together with the number of entries and the device pointer. When the
computation of the cost values completes successfully, the function returns 0.
The function also takes care of setting the inefficiency of each performance
state correctly. It updates em_perf_state::flags accordingly.
A new EM prepared in this way can then be passed to the
em_dev_update_perf_domain() function, which puts it into use.

More details about the above APIs can be found in ``<linux/energy_model.h>``
or in Section 3.2, with example code showing a simple implementation of the
updating mechanism in a device driver.


2.5 Description details of this API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. kernel-doc:: include/linux/energy_model.h
   :internal:

.. kernel-doc:: kernel/power/energy_model.c
   :export:


3. Examples
-----------

3.1 Example driver with EM registration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The CPUFreq framework supports a dedicated callback for registering
the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
That callback has to be implemented properly for a given driver,
because the framework calls it at the right time during setup.
This section provides a simple example of a CPUFreq driver registering a
performance domain in the Energy Model framework using the (fake) 'foo'
protocol. The driver implements an est_power() function to be provided to the
EM framework::

  -> drivers/cpufreq/foo_cpufreq.c

  static int est_power(struct device *dev, unsigned long *uW,
                       unsigned long *KHz)
  {
          long freq, power;

          /* Use the 'foo' protocol to ceil the frequency */
          freq = foo_get_freq_ceil(dev, *KHz);
          if (freq < 0)
                  return freq;

          /* Estimate the power cost for the dev at the relevant freq. */
          power = foo_estimate_power(dev, freq);
          if (power < 0)
                  return power;

          /* Return the values (power in micro-Watts) to the EM framework */
          *uW = power;
          *KHz = freq;

          return 0;
  }

  static void foo_cpufreq_register_em(struct cpufreq_policy *policy)
  {
          struct em_data_callback em_cb = EM_DATA_CB(est_power);
          struct device *cpu_dev;
          int nr_opp;

          cpu_dev = get_cpu_device(cpumask_first(policy->cpus));

          /* Find the number of OPPs for this policy */
          nr_opp = foo_get_nr_opp(policy);

          /* And register the new performance domain (power in micro-Watts) */
          em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
                                      true);
  }

  static struct cpufreq_driver foo_cpufreq_driver = {
          .register_em = foo_cpufreq_register_em,
  };


3.2 Example driver with EM modification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of a thermal driver modifying the EM.
The driver implements a foo_thermal_em_update() function.
The driver is woken
up periodically to check the temperature and modify the EM data::

  -> drivers/soc/example/example_em_mod.c

  static void foo_get_new_em(struct foo_context *ctx)
  {
          struct em_perf_table __rcu *em_table;
          struct em_perf_state *table, *new_table;
          struct device *dev = ctx->dev;
          struct em_perf_domain *pd;
          unsigned long freq;
          int i, ret;

          pd = em_pd_get(dev);
          if (!pd)
                  return;

          em_table = em_table_alloc(pd);
          if (!em_table)
                  return;

          new_table = em_table->state;

          /* Read the frequencies of the current EM under RCU protection */
          rcu_read_lock();
          table = em_perf_state_from_pd(pd);
          for (i = 0; i < pd->nr_perf_states; i++) {
                  freq = table[i].frequency;
                  foo_get_power_perf_values(dev, freq, &new_table[i]);
          }
          rcu_read_unlock();

          /* Calculate 'cost' values for EAS on the new table */
          ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
          if (ret) {
                  dev_warn(dev, "EM: compute costs failed %d\n", ret);
                  em_table_free(em_table);
                  return;
          }

          ret = em_dev_update_perf_domain(dev, em_table);
          if (ret) {
                  dev_warn(dev, "EM: update failed %d\n", ret);
                  em_table_free(em_table);
                  return;
          }

          /*
           * Since it's one-time-update drop the usage counter.
           * The EM framework will later free the table when needed.
           */
          em_table_free(em_table);
  }

  /*
   * Function called periodically to check the temperature and
   * update the EM if needed
   */
  static void foo_thermal_em_update(struct foo_context *ctx)
  {
          struct device *dev = ctx->dev;

          ctx->temperature = foo_get_temp(dev, ctx);
          if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
                  return;

          foo_get_new_em(ctx);
  }