.. SPDX-License-Identifier: GPL-2.0

=======================
Energy Model of devices
=======================

1. Overview
-----------

The Energy Model (EM) framework serves as an interface between drivers knowing
the power consumed by devices at various performance levels, and the kernel
subsystems willing to use that information to make energy-aware decisions.

The source of the information about the power consumed by devices can vary greatly
from one platform to another. These power costs can be estimated using
devicetree data in some cases. In others, the firmware will know better.
Alternatively, userspace might be best positioned. And so on. To spare each
client subsystem from re-implementing support for every possible source of
information on its own, the EM framework acts as an abstraction layer which
standardizes the format of power cost tables in the kernel, thereby avoiding
redundant work.

The power values might be expressed in micro-Watts or in an 'abstract scale'.
Multiple subsystems might use the EM and it is up to the system integrator to
check that the requirements for the power value scale types are met. An example
can be found in the Energy-Aware Scheduler documentation
Documentation/scheduler/sched-energy.rst. For some subsystems, like thermal or
powercap, power values expressed in an 'abstract scale' might cause issues.
These subsystems are more interested in estimating the power used in the past,
thus the real micro-Watts might be needed. An example of these requirements can
be found in the Intelligent Power Allocation documentation in
Documentation/driver-api/thermal/power_allocator.rst.
Kernel subsystems might implement automatic detection to check whether EM
registered devices have an inconsistent scale (based on an EM internal flag).
An important thing to keep in mind is that when the power values are expressed
in an 'abstract scale', deriving real energy in micro-Joules is not possible.

The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
framework, and interested clients reading the data from it::

       +---------------+  +-----------------+  +---------------+
       | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
       +---------------+  +-----------------+  +---------------+
               |                   | em_cpu_energy()   |
               |                   | em_cpu_get()      |
               +---------+         |         +---------+
                         |         |         |
                         v         v         v
                        +---------------------+
                        |    Energy Model     |
                        |     Framework       |
                        +---------------------+
                           ^       ^       ^
                           |       |       | em_dev_register_perf_domain()
                +----------+       |       +---------+
                |                  |                 |
        +---------------+  +---------------+  +--------------+
        |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
        +---------------+  +---------------+  +--------------+
                ^                  ^                 ^
                |                  |                 |
        +--------------+   +---------------+  +--------------+
        | Device Tree  |   |   Firmware    |  |      ?       |
        +--------------+   +---------------+  +--------------+

In case of CPU devices, the EM framework manages power cost tables per
'performance domain' in the system. A performance domain is a group of CPUs
whose performance is scaled together. Performance domains generally have a
1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are
required to have the same micro-architecture. CPUs in different performance
domains can have different micro-architectures.

To better reflect power variation due to static power (leakage), the EM
supports runtime modifications of the power values. The mechanism relies on
RCU to free the modifiable EM perf_state table memory. Its user, the task
scheduler, also uses RCU to access this memory. The EM framework provides
an API for allocating/freeing the new memory for the modifiable EM table.
The old memory is freed automatically using the RCU callback mechanism when
there are no owners anymore for the given EM runtime table instance. This is
tracked using the kref mechanism. The device driver which provided the new EM
at runtime should call the EM API to free it safely when it's no longer needed.
The EM framework will handle the clean-up when it's possible.

The kernel code which wants to modify the EM values is protected from concurrent
access using a mutex. Therefore, the device driver code must run in sleeping
context when it tries to modify the EM.

With the runtime modifiable EM we switch from a 'single and during the entire
runtime static EM' (system property) design to a 'single EM which can be
changed during runtime according e.g. to the workload' (system and workload
property) design.

It is also possible to modify the CPU performance values for each EM's
performance state. Thus, the full power and performance profile (which
is an exponential curve) can be changed according e.g. to the workload
or system property.


2. Core APIs
------------

2.1 Config options
^^^^^^^^^^^^^^^^^^

CONFIG_ENERGY_MODEL must be enabled to use the EM framework.


2.2 Registration of performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Registration of 'advanced' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'advanced' EM gets its name from the fact that the driver is allowed
to provide a more precise power model. It's not limited to a math formula
implemented in the framework (as is the case for the 'simple' EM). It can
better reflect the real power measurements performed for each performance
state. Thus, this registration method should be preferred when accounting for
static power (leakage) in the EM is important.

Drivers are expected to register performance domains into the EM framework by
calling the following API::

  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
		struct em_data_callback *cb, cpumask_t *cpus, bool microwatts);

Drivers must provide a callback function returning <frequency, power> tuples
for each performance state. The callback function provided by the driver is free
to fetch data from any relevant location (DT, firmware, ...), and by any means
deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
performance domains using cpumask. For devices other than CPUs, the 'cpus'
argument must be set to NULL.
The last argument 'microwatts' must be set to the correct value. Kernel
subsystems which use EM might rely on this flag to check if all EM devices use
the same scale. If there are different scales, these subsystems might decide
to return warning/error, stop working or panic.
See Section 3. for an example of driver implementing this
callback, or Section 2.5 for further documentation on this API.

Registration of EM using DT
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EM can also be registered using the OPP framework and the information in
DT "operating-points-v2". Each OPP entry in DT can be extended with a property
"opp-microwatt" containing the power value in micro-Watts. This OPP DT property
allows a platform to register EM power values which reflect the total power
(static + dynamic). These power values might come directly from
experiments and measurements.

Registration of 'artificial' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is an option to provide a custom callback for drivers missing detailed
knowledge about the power value for each performance state. The callback
.get_cost() is optional and provides the 'cost' values used by the EAS.
This is useful for platforms that only provide information on relative
efficiency between CPU types, where one could use the information to
create an abstract power model. But even an abstract power model can
sometimes be hard to fit in, given the input power value size restrictions.
The .get_cost() callback allows providing 'cost' values which reflect the
efficiency of the CPUs. This makes it possible to give EAS information which
has a different relation than what would be forced by the EM internal
formulas calculating 'cost' values. To register an EM for such a platform, the
driver must set the flag 'microwatts' to 0, provide the .get_power() callback
and provide the .get_cost() callback. The EM framework handles such a platform
properly during registration. The flag EM_PERF_DOMAIN_ARTIFICIAL is set for
such a platform. Special care should be taken by other frameworks which are
using EM to test and treat this flag properly.

Registration of 'simple' EM
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'simple' EM is registered using the framework helper function
cpufreq_register_em_with_opp(). It implements a power model which is tied to
the math formula::

	Power = C * V^2 * f

The EM which is registered using this method might not correctly reflect the
physics of a real device, e.g. when static power (leakage) is important.


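As an illustration of the 'simple' EM formula, the following userspace sketch
computes the power of one operating point. It is not kernel code:
simple_em_power_uw() is a hypothetical helper, and the units chosen for its
arguments (coefficient in uW/MHz/V^2, voltage in mV, frequency in MHz) are
assumptions made for this example.

.. code-block:: c

  #include <stdint.h>

  /*
   * Hypothetical helper, not a kernel function: estimate the power of one
   * operating point from Power = C * V^2 * f.
   *
   * Units assumed for this sketch only:
   *   coeff    - dynamic power coefficient in uW/MHz/V^2
   *   volt_mv  - voltage in millivolts
   *   freq_mhz - frequency in MHz
   *
   * Returns the estimated power in micro-Watts.
   */
  static uint64_t simple_em_power_uw(uint64_t coeff, uint64_t volt_mv,
  				   uint64_t freq_mhz)
  {
  	/* (mV)^2 is 10^6 times V^2, hence the division by 1000000. */
  	return coeff * volt_mv * volt_mv * freq_mhz / 1000000ULL;
  }

With a coefficient of 100, a voltage of 1000 mV and a frequency of 1000 MHz the
helper returns 100000 uW. Leakage does not appear anywhere in the calculation,
which is the limitation of the 'simple' EM.
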
2.3 Accessing performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two API functions which provide access to the energy model:
em_cpu_get(), which takes a CPU id as an argument, and em_pd_get(), which takes
a device pointer as an argument. Which interface to use depends on the
subsystem, but in case of CPU devices both functions return the same
performance domain.

Subsystems interested in the energy model of a CPU can retrieve it using the
em_cpu_get() API. The energy model tables are allocated once upon creation of
the performance domains, and kept in memory untouched.

The energy consumed by a performance domain can be estimated using the
em_cpu_energy() API. For CPU devices, the estimation is performed assuming
that the schedutil CPUfreq governor is in use. Currently this calculation is
not provided for other types of devices.

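The principle behind such an estimation can be sketched in plain userspace C.
This is a simplified illustration and not the implementation of
em_cpu_energy(): the struct mock_perf_state type, the mock_pd_energy()
function and its utilization arguments are hypothetical, and details such as
the schedutil headroom and capacity scaling are left out. The sketch picks the
lowest performance state able to serve the busiest CPU of the domain and
charges its 'cost' for the summed utilization of all CPUs in the domain.

.. code-block:: c

  #include <stddef.h>

  /* Mock of the fields this sketch needs; not the kernel's struct layout. */
  struct mock_perf_state {
  	unsigned long performance;	/* capacity delivered at this state */
  	unsigned long cost;		/* cost per unit of utilization */
  };

  /*
   * Hypothetical estimator: find the lowest state able to serve the busiest
   * CPU (max_util), then charge its cost for the summed utilization.
   */
  static unsigned long mock_pd_energy(const struct mock_perf_state *table,
  				    size_t nr_states, unsigned long max_util,
  				    unsigned long sum_util)
  {
  	size_t i;

  	if (!nr_states)
  		return 0;

  	/*
  	 * States are in ascending order: stop at the first one that fits,
  	 * or fall back to the highest state if none does.
  	 */
  	for (i = 0; i < nr_states - 1; i++) {
  		if (table[i].performance >= max_util)
  			break;
  	}

  	return table[i].cost * sum_util;
  }

For a table with the performance/cost pairs (256, 2), (512, 3) and (1024, 5),
a maximum utilization of 300 selects the second state, so a summed utilization
of 600 gives an estimate of 1800 in the arbitrary units of the sketch.
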
More details about the above APIs can be found in ``<linux/energy_model.h>``
or in Section 2.5.


2.4 Runtime modifications
^^^^^^^^^^^^^^^^^^^^^^^^^

Drivers willing to update the EM at runtime should use the following dedicated
function to allocate a new instance of the modified EM. The API is listed
below::

  struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);

This allocates a structure which contains the new EM table together with the
RCU and kref members needed by the EM framework. The 'struct em_perf_table'
contains the array 'struct em_perf_state state[]', which is a list of
performance states in ascending order. That list must be populated by the
device driver which wants to update the EM. The list of frequencies can be
taken from the existing EM (created during boot). The content of each
'struct em_perf_state' must be populated by the driver as well.

This is the API which does the EM update, using an RCU pointer swap::

  int em_dev_update_perf_domain(struct device *dev,
			struct em_perf_table __rcu *new_table);

Drivers must provide a pointer to the allocated and initialized new EM
'struct em_perf_table'. That new EM will be safely used inside the EM framework
and will be visible to other sub-systems in the kernel (thermal, powercap).
The main design goal for this API is to be fast and avoid extra calculations
or memory allocations at runtime. When pre-computed EMs are available in the
device driver, then it should be possible to simply re-use them with low
performance overhead.

In order to free the EM provided earlier by the driver (e.g. when the module
is unloaded), the following API must be called::

  void em_table_free(struct em_perf_table __rcu *table);

It allows the EM framework to safely free the memory when no other
sub-system, e.g. EAS, is using it.

To use the power values in other sub-systems (like thermal, powercap), there is
a need to call an API which protects the reader and provides consistency of the
EM table data::

  struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);

It returns the 'struct em_perf_state' pointer which is an array of performance
states in ascending order.
This function must be called in the RCU read lock section (after the
rcu_read_lock()). When the EM table is not needed anymore there is a need to
call rcu_read_unlock(). In this way the EM safely uses the RCU read section
and protects the users. It also allows the EM framework to manage the memory
and free it. More details on how to use it can be found in Section 3.2 in the
example driver.

There is a dedicated API for device drivers to calculate em_perf_state::cost
values::

  int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
                           int nr_states);

These 'cost' values from the EM are used in EAS. The new EM table should be
passed together with the number of entries and the device pointer. When the
computation of the cost values succeeds, the function returns 0.
The function also takes care of setting the inefficiency of each performance
state correctly. It updates em_perf_state::flags accordingly.
The new EM prepared this way can then be passed to the
em_dev_update_perf_domain() function, which puts it into use.

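How 'cost' values and inefficiency flags relate to each other can be
illustrated with a userspace sketch. It is not the kernel implementation of
em_dev_compute_costs(): struct mock_em_state, MOCK_STATE_INEFFICIENT,
mock_compute_costs() and the factor of 10 used as a fixed-point scale are all
assumptions made for this example. The cost is taken as power per unit of
performance, and a state is marked inefficient when some higher state has a
cost that is lower or equal.

.. code-block:: c

  #include <limits.h>
  #include <stddef.h>

  #define MOCK_STATE_INEFFICIENT	(1UL << 0)	/* hypothetical flag value */

  /* Mock of one performance state; not the kernel's struct layout. */
  struct mock_em_state {
  	unsigned long performance;
  	unsigned long power;
  	unsigned long cost;
  	unsigned long flags;
  };

  /* Returns 0 on success, -1 if a state has no performance value. */
  static int mock_compute_costs(struct mock_em_state *table, size_t nr_states)
  {
  	unsigned long prev_cost = ULONG_MAX;
  	size_t i;

  	/* Walk from the highest state down, tracking the lowest cost seen. */
  	for (i = nr_states; i-- > 0; ) {
  		if (!table[i].performance)
  			return -1;

  		/* Scale by 10 (arbitrary) to keep some integer precision. */
  		table[i].cost = table[i].power * 10 / table[i].performance;

  		if (table[i].cost >= prev_cost) {
  			/* A higher state delivers more for the same or less. */
  			table[i].flags |= MOCK_STATE_INEFFICIENT;
  		} else {
  			table[i].flags &= ~MOCK_STATE_INEFFICIENT;
  			prev_cost = table[i].cost;
  		}
  	}

  	return 0;
  }

With the performance/power pairs (256, 100), (512, 400), (768, 500) and
(1024, 1000), the computed costs are 3, 7, 6 and 9. Only the second state is
flagged, because the third state offers more performance at a lower cost.
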
More details about the above APIs can be found in ``<linux/energy_model.h>``
or in Section 3.2 with an example code showing simple implementation of the
updating mechanism in a device driver.


2.5 Description details of this API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. kernel-doc:: include/linux/energy_model.h
   :internal:

.. kernel-doc:: kernel/power/energy_model.c
   :export:


3. Examples
-----------

3.1 Example driver with EM registration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The CPUFreq framework supports a dedicated callback for registering
the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
That callback has to be implemented properly for a given driver,
because the framework calls it at the right time during setup.
This section provides a simple example of a CPUFreq driver registering a
performance domain in the Energy Model framework using the (fake) 'foo'
protocol. The driver implements an est_power() function to be provided to the
EM framework::

  -> drivers/cpufreq/foo_cpufreq.c

  static int est_power(struct device *dev, unsigned long *uW,
  			unsigned long *KHz)
  {
  	long freq, power;

  	/* Use the 'foo' protocol to ceil the frequency */
  	freq = foo_get_freq_ceil(dev, *KHz);
  	if (freq < 0)
  		return freq;

  	/* Estimate the power cost for the dev at the relevant freq. */
  	power = foo_estimate_power(dev, freq);
  	if (power < 0)
  		return power;

  	/* Return the values to the EM framework (power in micro-Watts) */
  	*uW = power;
  	*KHz = freq;

  	return 0;
  }

  static void foo_cpufreq_register_em(struct cpufreq_policy *policy)
  {
  	struct em_data_callback em_cb = EM_DATA_CB(est_power);
  	struct device *cpu_dev;
  	int nr_opp;

  	cpu_dev = get_cpu_device(cpumask_first(policy->cpus));

  	/* Find the number of OPPs for this policy */
  	nr_opp = foo_get_nr_opp(policy);

  	/* And register the new performance domain; power is in micro-Watts */
  	em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
  				    true);
  }

  static struct cpufreq_driver foo_cpufreq_driver = {
  	.register_em = foo_cpufreq_register_em,
  };


3.2 Example driver with EM modification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of a thermal driver modifying the EM.
The driver implements a foo_thermal_em_update() function. The driver is woken
up periodically to check the temperature and modify the EM data::

  -> drivers/soc/example/example_em_mod.c

  static void foo_get_new_em(struct foo_context *ctx)
  {
  	struct em_perf_table __rcu *em_table;
  	struct em_perf_state *table, *new_table;
  	struct device *dev = ctx->dev;
  	struct em_perf_domain *pd;
  	unsigned long freq;
  	int i, ret;

  	pd = em_pd_get(dev);
  	if (!pd)
  		return;

  	em_table = em_table_alloc(pd);
  	if (!em_table)
  		return;

  	new_table = em_table->state;

  	/* Read the frequencies of the current EM under RCU protection */
  	rcu_read_lock();
  	table = em_perf_state_from_pd(pd);
  	for (i = 0; i < pd->nr_perf_states; i++) {
  		freq = table[i].frequency;
  		foo_get_power_perf_values(dev, freq, &new_table[i]);
  	}
  	rcu_read_unlock();

  	/* Calculate 'cost' values for EAS in the new table */
  	ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
  	if (ret) {
  		dev_warn(dev, "EM: compute costs failed %d\n", ret);
  		em_table_free(em_table);
  		return;
  	}

  	ret = em_dev_update_perf_domain(dev, em_table);
  	if (ret) {
  		dev_warn(dev, "EM: update failed %d\n", ret);
  		em_table_free(em_table);
  		return;
  	}

  	/*
  	 * Since it's one-time-update drop the usage counter.
  	 * The EM framework will later free the table when needed.
  	 */
  	em_table_free(em_table);
  }

  /*
   * Function called periodically to check the temperature and
   * update the EM if needed
   */
  static void foo_thermal_em_update(struct foo_context *ctx)
  {
  	struct device *dev = ctx->dev;

  	ctx->temperature = foo_get_temp(dev, ctx);
  	if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
  		return;

  	foo_get_new_em(ctx);
  }
