power-thermal-optimizations

Revision 11 as of 2007-07-12 09:14:12


Please check the status of this specification in Launchpad before editing it. If it is Approved, contact the Assignee or another knowledgeable person before making changes.

Summary

Thermal extension: The platform thermal solution depends on the kernel framework to control device performance states and to monitor the platform's thermal sensors. The kernel's thermal monitoring and control mechanism is currently spread across the ACPI thermal driver and non-ACPI thermal sensor drivers, and the thermal algorithms are embedded in the kernel drivers. The proposed patch extends the thermal driver and unifies the various thermal sensing and control properties behind a sysfs interface, so that platform-level thermal decisions can be made in user space.

S3 Resume Optimisation: A patch in which device initialisation on resume is executed in a separate kernel thread, giving the user and applications control sooner after an S3 resume.

Power Event Notification: The Linux kernel has no way to notify user-space applications of power events (suspend to RAM, suspend to disk and resume from suspend). This patch sends a notification to user-space applications whenever one of these events happens.

Release Note

Thermal extension: The current thermal zone driver is modified to expose the thermal properties of the platform through sysfs. A new thermal sysfs driver is introduced, which exports two sets of interfaces: one for the platform-specific sensor drivers and one for the component throttle drivers. The CPU thermal driver will work as it does today, but will interface with the thermal sysfs driver.

S3 Resume Optimisation: The current resume code in kernel/power/main.c is modified to do device initialisation in a separate kernel thread. Before dispatching a request to a device, the I/O schedulers check whether the resume thread has exited.

Power Event Notification: In the system suspend and resume handling functions, the kernel function kobject_uevent() is used to post a message to user space so that user-space applications can receive the event.

Rationale

Thermal extension: Linux notebooks today use a combination of ACPI and native-device thermal control. The system uses ACPI’s CRT/HOT trip points for critical system shutdown, since on a handheld, shutdown and hibernate to disk (if one even exists) are likely to be synonymous. Active trip points are of no use on systems that have no fans. That leaves the single PSV trip point. ACPI 2.0 can associate (only) a processor throttling device with a trip point, but the processor isn’t expected to always be the dominant contributor to the thermal footprint on handhelds, as it often is on notebooks. ACPI 2.0 includes the _TZD method to associate devices with thermal zones. However, ACPI doesn’t say anything about how to throttle non-processor devices, so that must be handled by native device drivers.

S3 Resume Optimisation: For MID devices, better power saving requires the system to enter S3 as frequently as possible and to resume from S3 as quickly as possible. However, the current resume time from S3 is close to 5 sec, and the bulk of this is device initialisation. This patch parallelises device initialisation on resume by running it in a separate kernel thread. To avoid race conditions, the I/O scheduler checks for device readiness before dispatching requests. With this patch, resume time improves by 70%.

Power Event Notification: User-level applications that depend on knowing when the system is going to S3 (initiated by some other application) need a clean notification mechanism from the kernel whenever the system enters S3 or resumes from it. This patch adds that capability to the kernel.

Use Cases

Assumptions

Design

Thermal Extension

  • Thermal monitoring will be done using inexpensive thermal sensors, polled by a low-power EC (embedded controller).
  • Thermal management policy decisions will be made from user space, as the user has a comprehensive view of the platform.
  • The kernel provides only the mechanism to deliver thermal events to user space, and the mechanism for user space to communicate its throttling decisions to native device drivers.

Figure 1: thermal control software stack (attachment: thermal-power-opt.gif)

Figure 1 shows the thermal control software stack. The thermal management policy control application sits on top. It receives netlink messages from the kernel thermal zone driver. It then implements device-specific thermal throttling via sysfs. Native device drivers supply the throttling controls in sysfs and implement device-specific throttling functions.
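As an illustration of the kind of decision the user-space policy application might make when a thermal event arrives, here is a minimal sketch. The function name choose_throttle_level() and the linear scaling between the passive and critical trip points are assumptions made for this example, not part of the proposed patch; temperatures are assumed to be in millidegrees Celsius, following the hwmon convention.

```c
/*
 * Hypothetical user-space policy helper (not part of the patch).
 * Maps a zone temperature to a throttling level between 0 (no throttling)
 * and throttling_max (deepest throttling). All temperatures are assumed
 * to be in millidegrees Celsius.
 */
int choose_throttle_level(long temp_mc, long passive_mc, long crit_mc,
                          int throttling_max)
{
    long span, over, level;

    /* Below the passive trip point (or no throttling support): run at full speed. */
    if (throttling_max <= 0 || temp_mc < passive_mc)
        return 0;

    /* At or beyond the critical trip point: throttle as deeply as possible. */
    if (crit_mc <= passive_mc || temp_mc >= crit_mc)
        return throttling_max;

    /* Scale linearly between the two trip points, rounding up. */
    span = crit_mc - passive_mc;
    over = temp_mc - passive_mc;
    level = (over * throttling_max + span - 1) / span;

    /* Crossing the passive trip point always yields at least level 1. */
    if (level < 1)
        level = 1;
    return (int)level;
}
```

A real policy would probably also consider which devices are linked to the zone and how much each contributes to the thermal load; that is exactly the platform knowledge the design wants to keep in user space.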

Thermal zone module

The thermal zone module has two components — a thermal zone sysfs driver and thermal zone sensor driver.

The thermal zone sysfs driver is platform-independent, and handles all the sysfs interaction. The thermal zone sensor driver is platform-dependent. It works closely with the platform BIOS and sensor driver, and has knowledge of sensor information in the platform.

Thermal zone sysfs driver

The thermal sysfs driver exports two interfaces to component drivers, which the component drivers call to register their control capability with the thermal zone sysfs driver:

  • thermal_control_register()
  • thermal_control_deregister()

The thermal sysfs driver also exports two interfaces to the platform-specific sensor drivers, which the sensor drivers use to register their sensor capability:

  • thermal_sensor_register()
  • thermal_sensor_deregister()

This driver is responsible for all thermal sysfs entries. It interacts with all the platform-specific thermal sensor drivers and component drivers to populate the sysfs entries. The thermal zone driver also provides a temperature notification service to component drivers. As part of registration, the thermal zone sensor driver exposes its sensing and thermal zone capability.
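The registration flow can be modelled in plain user-space C. The sketch below is only an illustration: the specification names thermal_control_register() and thermal_control_deregister() but does not give their signatures, so the argument lists, the thermal_control_ops structure, the thermal_control_set() dispatcher and the DRAM example driver are all assumptions.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CONTROLS 8

/* Hypothetical ops table handed over by a component driver. */
struct thermal_control_ops {
    int (*set_throttle)(int level);  /* apply a throttling level */
    int throttling_max;              /* deepest level the device supports */
};

struct thermal_control {
    const char *name;                /* caller keeps this string alive */
    const struct thermal_control_ops *ops;
};

static struct thermal_control controls[MAX_CONTROLS];

static struct thermal_control *find_control(const char *name)
{
    for (size_t i = 0; i < MAX_CONTROLS; i++)
        if (controls[i].name && strcmp(controls[i].name, name) == 0)
            return &controls[i];
    return NULL;
}

/* Assumed signature: returns 0 on success, -1 on bad input, duplicate or full table. */
int thermal_control_register(const char *name,
                             const struct thermal_control_ops *ops)
{
    if (!name || !ops || !ops->set_throttle || find_control(name))
        return -1;
    for (size_t i = 0; i < MAX_CONTROLS; i++) {
        if (!controls[i].name) {
            controls[i].name = name;
            controls[i].ops = ops;
            return 0;
        }
    }
    return -1;
}

/* Assumed signature: returns 0 on success, -1 if the name is unknown. */
int thermal_control_deregister(const char *name)
{
    struct thermal_control *c = name ? find_control(name) : NULL;

    if (!c)
        return -1;
    c->name = NULL;
    c->ops = NULL;
    return 0;
}

/* Redirect a throttling request to the component driver that owns the device. */
int thermal_control_set(const char *name, int level)
{
    struct thermal_control *c = name ? find_control(name) : NULL;

    if (!c || level < 0 || level > c->ops->throttling_max)
        return -1;
    return c->ops->set_throttle(level);
}

/* Example component driver: a DRAM driver that records the requested level. */
static int dram_level;
static int dram_set_throttle(int level) { dram_level = level; return 0; }
static const struct thermal_control_ops dram_ops = { dram_set_throttle, 4 };
```

With this model, a component driver would call thermal_control_register("dram", &dram_ops) at load time, and a later request such as thermal_control_set("dram", 3) would be routed to dram_set_throttle().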

Thermal Zone sensor driver

The thermal zone sensor driver provides all the platform-specific sensor information to the thermal sysfs driver. It is platform-specific in that it has prior information about the sensors present in the platform. The thermal zone driver directly maps the ACPI 2.0 thermal zone definition. The thermal zone sensor driver also handles the interrupt notifications from sensor trips and delivers them to user space through a netlink socket.

Component Throttle driver

All the component drivers participating in a given thermal zone can register with the thermal driver, each providing the set of thermal ops it can support. The thermal driver redirects all control requests to the appropriate component driver when the user programs the throttling level. It is up to the component driver to implement the thermal control. For example, a component driver associated with DRAM would slow down the DRAM clock on throttling requests.

Thermal Zone Sysfs Property

Table 1 shows the directory structure exposing each thermal zone sysfs property to user space. The intent is that any combination of ACPI and native thermal zones may exist on a platform, but the generic sysfs interface looks the same for all of them. Thus, the syntax of the files borrows heavily from the Linux hwmon subsystem.

Each thermal zone provides its current temperature and an indicator that can be used by user-space to see if the current temperature has changed since the last read. If a critical trip point is present, its value is indicated here, as well as an alarm indicator showing whether it has fired. If a passive trip point is present, its value is indicated here, as well as an alarm indicator showing whether it has fired. There are symbolic links to the device nodes of the devices associated with the thermal zone. Those devices will export their throttling controls under their device nodes.

Throttling Sysfs Properties

Devices that support throttling will have two additional properties associated with their device nodes: throttling and throttling_max. A value of 0 means maximum performance, that is, no throttling. A value of throttling_max means maximum power savings, in the deepest throttling state available before device state is lost.
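A user-space program that sets the throttling property would want to keep the value within the range 0 to throttling_max. The following sketch shows one way to do that; clamp_throttling() and write_throttling() are hypothetical helper names, and the assumption that the attribute accepts a decimal integer followed by a newline is borrowed from common sysfs practice rather than stated in this specification.

```c
#include <stdio.h>

/*
 * Clamp a requested throttling level to the valid range:
 * 0 (maximum performance) through throttling_max (deepest throttling).
 */
int clamp_throttling(int requested, int throttling_max)
{
    if (throttling_max < 0)
        throttling_max = 0;
    if (requested < 0)
        return 0;
    if (requested > throttling_max)
        return throttling_max;
    return requested;
}

/*
 * Write a clamped level to an already-open attribute stream.
 * Returns the level actually written, or -1 on error.
 * Assumes the attribute takes a decimal integer (an assumption, see above).
 */
int write_throttling(FILE *attr, int requested, int throttling_max)
{
    int level = clamp_throttling(requested, throttling_max);

    if (!attr || fprintf(attr, "%d\n", level) < 0 || fflush(attr) != 0)
        return -1;
    return level;
}
```

The caller would open the device's throttling file for writing, read throttling_max from the same device node beforehand, and pass both to write_throttling().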

Events will be passed from the kernel to userspace using the Linux netlink facility. Interrupts from the sensor or EC are delivered to user-space through a netlink socket.

sysfs                  ACPI    Description                              R/W
temp1_input            _TMP    Current temperature                      RO
temp1_alarm            -       Temperature change occurred              RW
temp1_crit             _CRT    Critical alarm temperature               RO
temp1_crit_alarm       -       Critical alarm occurred                  RW
temp1_passive          _PSV    Passive alarm temperature                RO
temp1_passive_alarm    -       Passive alarm occurred                   RW
<device_name1>         -       Link to device 1 associated with zone    RO
<device_name2>         -       Link to device 2 associated with zone    RO
...                    -       ...                                      RO

Table 1
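To show how a user-space application might consume the properties in Table 1, here is a small sketch that reads one attribute of a thermal zone. The helper names parse_sysfs_long() and read_zone_attr() are hypothetical, the zone directory is supplied by the caller because this specification does not fix its location, and the assumption that each temperature file holds a single decimal integer (millidegrees Celsius, as in hwmon) is not confirmed by the text above.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text of an attribute that holds a single decimal integer,
 * optionally followed by a newline. Returns 0 on success, -1 on error.
 */
int parse_sysfs_long(const char *text, long *out)
{
    char *end;
    long value;

    if (!text || !out)
        return -1;
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno == ERANGE)
        return -1;                      /* no digits, or out of range */
    if (*end != '\0' && *end != '\n')
        return -1;                      /* unexpected trailing characters */
    *out = value;
    return 0;
}

/*
 * Read one attribute (for example "temp1_input") from a zone directory.
 * Returns 0 on success, -1 if the file cannot be read or parsed.
 */
int read_zone_attr(const char *zone_dir, const char *attr, long *out)
{
    char path[256];
    char buf[32];
    FILE *f;
    int n;

    if (!zone_dir || !attr)
        return -1;
    n = snprintf(path, sizeof path, "%s/%s", zone_dir, attr);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;                      /* path too long */
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_sysfs_long(buf, out);
}
```

A policy application could call read_zone_attr() for temp1_input, temp1_passive and temp1_crit on each zone and compare the values to decide whether throttling is needed.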

S3 Resume Optimisation

Here is a simple patch for optimising the S3 resume. With this patch the resume time is about 0.85 sec. Device initialisation takes almost 70% of the resume time, so by executing the whole device_resume() function in a separate kernel thread, the resume completes (i.e. as the user perceives it) in ~0.85 sec.

To avoid any possible race condition while processing I/O requests, and to make sure all I/O requests stay queued until the device resume thread exits, the I/O schedulers (patched CFQ and AS) check a system_resuming flag, which is set when the device resume thread starts. While the flag is set, the scheduler does not put requests in the dispatch queue. Once the flag is cleared, i.e. when the device resume thread has completed, the I/O scheduler behaves as in the normal situation.

I did some validation of this patch on a NAPA board (Calistoga chipset with a Dothan processor, with and without SMP) locally and haven't noticed any issues so far.
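The gating idea can be shown with a small user-space model. This is not kernel code and not the patch itself: the io_queue structure and the io_enqueue(), io_dispatch(), resume_begin() and resume_end() helpers are hypothetical, and only the system_resuming flag name comes from the patch. The sketch shows that requests keep being accepted while the flag is set, but none are dispatched until it is cleared.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_MAX 16

/* Set while the (modelled) device resume thread is running. */
static atomic_int system_resuming;

/* A tiny FIFO of pending request ids, standing in for a scheduler queue. */
struct io_queue {
    int pending[QUEUE_MAX];
    size_t head;
    size_t count;
};

void resume_begin(void) { atomic_store(&system_resuming, 1); }
void resume_end(void)   { atomic_store(&system_resuming, 0); }

/* Requests are always accepted, even during resume. Returns -1 if full. */
int io_enqueue(struct io_queue *q, int request)
{
    if (q->count == QUEUE_MAX)
        return -1;
    q->pending[(q->head + q->count) % QUEUE_MAX] = request;
    q->count++;
    return 0;
}

/*
 * Dispatch the oldest request. Returns 1 and stores it in *request,
 * or returns 0 if the queue is empty or devices are still resuming.
 */
int io_dispatch(struct io_queue *q, int *request)
{
    if (atomic_load(&system_resuming) || q->count == 0)
        return 0;
    *request = q->pending[q->head];
    q->head = (q->head + 1) % QUEUE_MAX;
    q->count--;
    return 1;
}
```

In this model the flag is set before any request can be dispatched. A real implementation would need the same ordering guarantee between starting the resume thread and the first scheduler check.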

Power Event Notification

Here is a simple patch for power event notification to user-space applications. It notifies user-space applications that the system is entering a low-power state (Suspend-to-RAM or Suspend-to-Disk) or resuming from one. This lets user-space applications take some significant action when the system goes to the low-power state (like saving an unsaved file). User-space applications can open a netlink socket and listen for these events, which are delivered through the kobject netlink socket. For this I have used the kernel function kobject_uevent(), posting the notification to user space with standby, hibernate or resume as the action string of the uevent (mapped from the KOBJ_S3, KOBJ_S4 and KOBJ_RESUME enums passed in the action parameter).
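On the receiving side, a user-space listener has to recognise the new action strings. A uevent message received on a NETLINK_KOBJECT_UEVENT socket begins with a header of the form action@devpath, so a listener could classify messages as in the sketch below. The power_event enum and the classify_power_uevent() function are hypothetical names for this example, and the socket setup is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of the power events added by the patch. */
enum power_event {
    POWER_EVENT_NONE,     /* not a power event, or a malformed message */
    POWER_EVENT_S3,       /* "standby": suspend to RAM */
    POWER_EVENT_S4,       /* "hibernate": suspend to disk */
    POWER_EVENT_RESUME    /* "resume" */
};

/*
 * Classify a uevent message by the action string that precedes the '@'.
 * The message is expected to start with "action@devpath".
 */
enum power_event classify_power_uevent(const char *msg)
{
    const char *at = msg ? strchr(msg, '@') : NULL;
    size_t len;

    if (!at)
        return POWER_EVENT_NONE;

    len = (size_t)(at - msg);
    if (len == 7 && strncmp(msg, "standby", len) == 0)
        return POWER_EVENT_S3;
    if (len == 9 && strncmp(msg, "hibernate", len) == 0)
        return POWER_EVENT_S4;
    if (len == 6 && strncmp(msg, "resume", len) == 0)
        return POWER_EVENT_RESUME;
    return POWER_EVENT_NONE;
}
```

An application would call this on each received message and, for example, save unsaved files when it returns POWER_EVENT_S3 or POWER_EVENT_S4.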

Implementation


UI Changes


Code Changes

S3 Resume Patch (against 2.6.21-rc7)

diff -aur linux-2.6.21-rc7-vanilla/block/as-iosched.c linux-2.6.21-rc7/block/as-iosched.c
--- linux-2.6.21-rc7-vanilla/block/as-iosched.c 2007-04-16 05:20:57.000000000 +0530
+++ linux-2.6.21-rc7/block/as-iosched.c 2007-07-04 14:00:39.000000000 +0530
@@ -903,6 +903,14 @@
                 return 0;
         rq = rq_entry_fifo(ad->fifo_list[adir].next);
+
+        extern int system_resuming;
+        if (system_resuming != 0)
+                return 0;
         return time_after(jiffies, rq_fifo_time(rq));
 }

diff -aur linux-2.6.21-rc7-vanilla/block/cfq-iosched.c linux-2.6.21-rc7/block/cfq-iosched.c
--- linux-2.6.21-rc7-vanilla/block/cfq-iosched.c 2007-04-16 05:20:57.000000000 +0530
+++ linux-2.6.21-rc7/block/cfq-iosched.c 2007-07-04 14:01:05.000000000 +0530
@@ -880,6 +880,7 @@
         struct cfq_data *cfqd = cfqq->cfqd;
         struct request *rq;
         int fifo;
+        extern int system_resuming;
         if (cfq_cfqq_fifo_expire(cfqq))
                 return NULL;
@@ -888,7 +889,13 @@
         if (list_empty(&cfqq->fifo))
                 return NULL;
-
+
+        if(system_resuming != 0)
+                return NULL;
         fifo = cfq_cfqq_class_sync(cfqq);
         rq = rq_entry_fifo(cfqq->fifo.next);

diff -aur linux-2.6.21-rc7-vanilla/kernel/power/main.c linux-2.6.21-rc7/kernel/power/main.c
--- linux-2.6.21-rc7-vanilla/kernel/power/main.c 2007-07-04 13:47:02.000000000 +0530
+++ linux-2.6.21-rc7/kernel/power/main.c 2007-07-04 13:59:30.000000000 +0530
@@ -23,7 +23,7 @@
 #include <linux/vmstat.h>
 #include "power.h"
-
+int system_resuming;
 /*This is just an arbitrary number */
 #define FREE_PAGE_NUMBER (100)
@@ -129,7 +129,16 @@
         local_irq_restore(flags);
         return error;
 }
-
+static int dev_resume_proc(void * data)
+{
+
+        system_resuming =1;
+        device_resume();
+        system_resuming = 0;
+        return (0);
+}
 /**
  *      suspend_finish - Do final work before exiting suspend sequence.
@@ -141,9 +150,15 @@
 static void suspend_finish(suspend_state_t state)
 {
+        int thread;
         enable_nonboot_cpus();
         pm_finish(state);
-        device_resume();
+        system_resuming = 0;
+        thread = kernel_thread(dev_resume_proc,NULL,CLONE_KERNEL);
+        if (thread < 0){
+                printk ("Suspend resume Cannot create Kernel_thread\n");
+                device_resume();
+        }
         resume_console();
         thaw_processes();
         pm_restore_console();

Power event notification patch (against 2.6.21-rc7)

diff -aruN kernel-mid/include/linux/kobject.h linux-pwr-evnt-notfn/include/linux/kobject.h
--- kernel-mid/include/linux/kobject.h 2007-04-16 05:20:57.000000000 +0530
+++ linux-pwr-evnt-notfn/include/linux/kobject.h 2007-07-05 15:17:11.000000000 +0530
@@ -48,6 +48,9 @@
         KOBJ_OFFLINE = (__force kobject_action_t) 0x06,
         KOBJ_ONLINE = (__force kobject_action_t) 0x07,
         KOBJ_MOVE = (__force kobject_action_t) 0x08,
+        KOBJ_S3 = (__force kobject_action_t) 0x09,
+        KOBJ_S4 = (__force kobject_action_t) 0x0A,
+        KOBJ_RESUME = (__force kobject_action_t) 0x0B,
 };
 struct kobject {

diff -aruN kernel-mid/kernel/power/disk.c linux-pwr-evnt-notfn/kernel/power/disk.c
--- kernel-mid/kernel/power/disk.c 2007-04-16 05:20:57.000000000 +0530
+++ linux-pwr-evnt-notfn/kernel/power/disk.c 2007-07-05 15:17:04.000000000 +0530
@@ -184,6 +184,7 @@
         resume_console();
  Thaw:
         unprepare_processes();
+        kobject_uevent(&power_subsys.kset.kobj, KOBJ_RESUME);
         return error;
 }

diff -aruN kernel-mid/kernel/power/main.c linux-pwr-evnt-notfn/kernel/power/main.c
--- kernel-mid/kernel/power/main.c 2007-04-16 05:20:57.000000000 +0530
+++ linux-pwr-evnt-notfn/kernel/power/main.c 2007-07-05 15:17:04.000000000 +0530
@@ -141,6 +141,7 @@
 static void suspend_finish(suspend_state_t state)
 {
+        kobject_uevent(&power_subsys.kset.kobj, KOBJ_RESUME);
         enable_nonboot_cpus();
         pm_finish(state);
         device_resume();
@@ -191,6 +192,11 @@
 {
         int error;
+        if (state == PM_SUSPEND_MEM)
+                kobject_uevent(&power_subsys.kset.kobj, KOBJ_S3);
+        if (state == PM_SUSPEND_DISK)
+                kobject_uevent(&power_subsys.kset.kobj, KOBJ_S4);
+
         if (!valid_state(state))
                 return -ENODEV;
         if (!mutex_trylock(&pm_mutex))
@@ -335,8 +341,10 @@
 static int __init pm_init(void)
 {
         int error = subsystem_register(&power_subsys);
-        if (!error)
+        if (!error) {
+                kset_set_kset_s(&power_subsys, power_subsys);
                 error = sysfs_create_group(&power_subsys.kset.kobj,&attr_group);
+        }
         return error;
 }

diff -aruN kernel-mid/lib/kobject_uevent.c linux-pwr-evnt-notfn/lib/kobject_uevent.c
--- kernel-mid/lib/kobject_uevent.c 2007-04-16 05:20:57.000000000 +0530
+++ linux-pwr-evnt-notfn/lib/kobject_uevent.c 2007-07-05 15:15:14.000000000 +0530
@@ -52,6 +52,12 @@
                 return "online";
         case KOBJ_MOVE:
                 return "move";
+        case KOBJ_S3:
+                return "standby";
+        case KOBJ_S4:
+                return "hibernate";
+        case KOBJ_RESUME:
+                return "resume";
         default:
                 return NULL;
         }

Migration


Test/Demo Plan


Outstanding Issues


BoF agenda and discussion



CategorySpec