QuantalUpstartStatefulReexec

Revision 17 as of 2012-05-20 17:29:00


Summary

This specification describes the plan to implement "stateful re-exec" support in Upstart.

Stateful re-exec refers to being able to restart Upstart and maintain the internal state across an exec(2) call.

Upstart already has the ability to re-exec itself, but it loses all internal state when doing this since it is equivalent to simply restarting Upstart. This causes problems since after re-exec there may be processes running which Upstart had been managing, but post-exec it has no knowledge of them. As such, re-exec is only used on shutdown to minimise issues.

Release Note

  • FIXME:

Rationale

By providing the ability to retain state, the following goals can be achieved:

  • Allow Upstart to be started in the initramfs and then be re-execed
  • Allow Upstart to be upgraded and not require a reboot to take advantage of new features.
  • Allow clean upgrades to libc6 (eglibc) and NIH (being two libraries Upstart relies upon).
    • Currently, upgrading either of these libraries is problematic: because Upstart cannot perform a stateful re-exec, it continues to hold open the original versions of these files, which causes shutdown issues. For full details, see bug 985755.

Use Cases

  • Clio is a sysadmin. She wants to be able to upgrade any userland package on her 10,000 servers without having to schedule down-time.
  • Mnemosyne is an experienced Ubuntu user who lives her life on her laptop. She hardly ever reboots, preferring instead to suspend where possible. However, she'd still like to know she's running the latest and greatest version of all the (non-kernel) packages.
  • Morpheus is a very cautious server owner. He only runs LTS releases and wants to be assured that upgrading to a new LTS will be 100% reliable. He always reboots after an upgrade, but expects services he left running before the upgrade to still be running after the upgrade.

Requirements

  • Ability to serialise Upstart's internal state.
  • Ability to deserialise previously serialised state back into Upstart.
  • Ability to upgrade from a version not supporting stateful re-exec to a version that does support it.
  • Ability to handle downgrade from a version supporting stateful re-exec to a version that does not.

  • Ability to upgrade from a version supporting stateful re-exec to a newer version that also supports it but whose serialisation format may have changed.
  • Ability to downgrade from a version supporting stateful re-exec to an older version supporting it but whose serialisation format may have changed.

  • Ability to handle failure to read the existing state partially or fully.
  • Ability to handle failure to parse the existing state partially or fully.
  • The serialisation data should encode the version of Upstart.
  • The serialisation data should encode the version of the serialisation format.
  • The serialisation data should encode Session objects.
  • The serialisation data should encode Event objects.
  • The serialisation data should encode JobClass objects.

  • The serialisation data should encode Job objects.
  • Ability to retain command-line settings across an exec.
    • Currently, Upstart clears the command-line to "prettify" output for ps(1). We should probably continue to do this, but also save argv to allow the re-exec to be run with the same options as when it was originally started (it would be confusing to boot with debug mode and have it revert to non-debug mode after a re-exec).
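
The argv requirement above could be met along the following lines. This is only a sketch: the function name build_reexec_argv and the decision to drop a stale "--state-fd" or "--restart" from the saved copy are assumptions made for illustration, not part of the specification.

```python
import sys

# Copy taken at startup, before argv is cleared to prettify ps(1) output.
SAVED_ARGV = list(sys.argv)


def build_reexec_argv(saved_argv, state_fd):
    """Return the argument vector to use when re-exec'ing with state passing."""
    argv = []
    skip_next = False
    for arg in saved_argv:
        if skip_next:
            skip_next = False  # this was the value of a stale --state-fd
            continue
        if arg == "--state-fd":
            skip_next = True  # drop the flag and the old descriptor number
            continue
        if arg == "--restart":
            continue  # assumption: a stateful re-exec replaces a bare one
        argv.append(arg)
    return argv + ["--state-fd", str(state_fd)]
```

With this approach an instance booted with "--debug" would be re-exec'ed with "--debug" still present, followed by the new "--state-fd <fd>" pair.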

Design

Upstart needs the ability to perform the following operations in order:

  1. Serialise its existing internal state.
  2. Re-exec itself.
  3. Read the serialised state and deserialise it back into its internal data structures.
  4. Continue operating as normal.

Although the state passing will not be "total" (not every single internal data structure will or can be handled), after the deserialisation Upstart should have as near full knowledge of the state prior to the re-exec as possible.

Implementation

State-Passing

Upstart will re-exec itself using the following process.

Note that the parent becomes the new instance of Upstart, not the child (since Upstart must continue to be PID 1).

  1. Admin or maintainer script calls "telinit u || :" to request Upstart restart itself.

    • This is the existing re-exec interface so Upstart will be changed to now also perform state-passing.

      Basic (non-stateful) re-exec support is currently available and used by /etc/init.d/umountroot to ensure Upstart doesn't hold "stale" links to old library versions which would cause shutdown to hang.

  2. The SIGTERM handler calls a new re-exec handling function.

    • This ensures Upstart is not sitting in the main loop. Ensure all signals are blocked.
  3. Dispatch all D-Bus messages to ensure no initctl commands are being handled.

    • Upstart is now effectively "paused": it is no longer accepting D-Bus (and hence initctl) commands, handling jobs, or emitting events.
  4. Determine if "new" /sbin/init is capable of stateful re-exec.

      • NOTE: This isn't very elegant, but better safe than sorry!

    • Run "/sbin/init --help 2>/dev/null" using popen(3).

    • Look for "--serialisation-version" option.

      • If not found, "new" version of upstart is actually an old version (we are downgrading), so log a warning to the system log and perform a "bare" re-exec (state will be lost).
    • Run "/sbin/init --serialisation-version" using popen(3).

      • This option will return two values, separated by a single comma (','). The first value is the oldest serialisation data format version supported and the second the newest version supported. It is permissible for the values to be the same. The initial version of Upstart that supports stateful re-exec will therefore return "1,1".

      Note that the first step is required since Upstart ignores unknown options so if /sbin/init was too old to support serialisation, invoking "/sbin/init --serialisation-version" would actually just run another instance of Upstart, which would not exit.

  5. Create a pipe.
    • Ensure fds are NOT marked O_CLOEXEC such that they persist across an exec(2).

  6. Parent marks D-Bus file descriptors so they are NOT closed on exec.
  7. Fork to create a child process.
  8. Parent closes writing end of pipe.
  9. Parent exec(2)s the new version of /sbin/init passing a magic flag ("--state-fd <fd>") which informs Upstart to read state from the specified file descriptor.

    • The --state-fd option is new and distinct from the existing --restart flag which performs a "bare" re-exec.

      The child is now actually the "old" process (running the original version of Upstart), whereas the parent is now the "new" process (running the most recently installed version of Upstart, which may be older than the old one!).

      Note that it may be worth performing the following operations prior to the exec to ensure Upstart can re-exec statefully:

    • fork a child.
    • exec /sbin/init --version and parse the output to ensure the version is at least 1.6 (and hence supports stateful re-exec).

    • If the version check fails, log a message to syslog and default to performing a non-stateful re-exec.

      This can be handled at package policy level for Ubuntu using dpkg --compare-versions. However, Upstart should probably also perform its own checks to handle scenarios where it is not running on a system supporting such a policy.

  10. Parent blocks reading from file descriptor "<fd>".

  11. Child meanwhile closes reading end of the pipe.
  12. Child writes state in JSON format to writing end of pipe and exits.
    • If it fails to complete the write in "some amount of time" (say 10 seconds?), this is taken to mean that the new parent does not support serialisation and that the check in step 4 failed to detect this, so the child should log an error to the system log and exit.
  13. Parent reads serialisation data from reading end of pipe and reconstructs the internal objects (sessions, events, JobClasses, Jobs).

  14. Parent closes reading end of pipe.
  15. Parent continues with normal initialisation.
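
The pipe handling in steps 5 to 14 can be sketched as below. This is a minimal illustration, not the planned implementation: the function name is hypothetical, the exec(2) of the new /sbin/init in step 9 is deliberately left out so the sketch can run on its own, and the write timeout from step 12 is not modelled.

```python
import json
import os


def pass_state_through_pipe(state):
    """Fork, have the child write `state` as JSON, and read it in the parent."""
    read_fd, write_fd = os.pipe()
    # Python creates pipe descriptors close-on-exec by default; clear that
    # on the read end so it would survive an exec(2), as step 5 requires.
    os.set_inheritable(read_fd, True)

    pid = os.fork()
    if pid == 0:
        # Child: plays the "old" instance. Write the state, then exit.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as writer:
                json.dump(state, writer)
            status = 0
        finally:
            os._exit(status)

    # Parent: keeps its PID. The real design would exec the new /sbin/init
    # with "--state-fd <read_fd>" here; this sketch just keeps reading.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        restored = json.load(reader)  # blocks until the child closes its end
    os.waitpid(pid, 0)
    return restored
```

Note the direction of the roles: the process that keeps the original PID is the reader, which is what allows the new instance to remain PID 1.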

Serialisation

JSON will be used to represent the serialised state.

Rationale:

  • JSON is simple.
  • JSON is standardised.
  • JSON is able to represent arrays and objects.
  • JSON is UTF-8 encoded and human-readable.
  • There are a number of available parsers.

Process

The minimum set of internal objects required to perform stateful re-exec is:

  • Session

  • Event

  • JobClass

  • Job

  • Log

Serialisation
  • FIXME: incomplete

  • Create a "header" encoding meta data about the serialisation, including:
    • Upstart version
    • timestamp
    • serialisation data format version
  • Serialise all Session objects.

  • Iterate over all JobClass objects in job_classes and all Job instances referenced by each JobClass.

    • If class->start_on, class->stop_on, job->blocker, or job->blocking contain values:

      • add the referenced event names to a temporary hash of event names.
      • assign a unique ID to the event and store the event ID in the serialisation data for the Job.

    • Serialise all Job objects.

  • Serialise all Event objects, including the event ID stored as a number.

  • For those jobs whose associated JobClass.console is CONSOLE_LOG, serialise a Log object.
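
The serialisation pass above might look roughly like the following. Everything about the data shape here is an assumption made for illustration: job classes and jobs are plain dictionaries, start_on and stop_on are reduced to lists of event names rather than full event expressions, the key names are invented, and Log objects are omitted.

```python
import json
import time

FORMAT_VERSION = 1  # assumed serialisation data format version


def serialise_state(upstart_version, sessions, job_classes, pending_events):
    """Return a JSON string: header, sessions, job classes with jobs, events."""
    event_ids = {}  # temporary hash: event name -> unique numeric ID

    def event_id(name):
        # Hand out the next free ID the first time a name is seen.
        return event_ids.setdefault(name, len(event_ids))

    classes_out = []
    for cls in job_classes:
        start_on = [event_id(n) for n in cls.get("start_on", [])]
        stop_on = [event_id(n) for n in cls.get("stop_on", [])]
        jobs_out = []
        for job in cls.get("jobs", []):
            blocker = job.get("blocker")
            jobs_out.append({
                "name": job["name"],
                "blocker": None if blocker is None else event_id(blocker),
            })
        classes_out.append({"name": cls["name"], "start_on": start_on,
                            "stop_on": stop_on, "jobs": jobs_out})

    for name in pending_events:
        event_id(name)  # events not referenced by any job still get an ID

    return json.dumps({
        "header": {
            "upstart_version": upstart_version,
            "timestamp": int(time.time()),
            "format_version": FORMAT_VERSION,
        },
        "sessions": sessions,
        "job_classes": classes_out,
        "events": [{"id": i, "name": n} for n, i in event_ids.items()],
    })
```

Jobs store only the numeric event ID, so the events themselves are written once, after every reference to them has been discovered.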

Notes:

  • Since an Event can block a Job and a Job has pointers to one or more Event objects, the serialisation cannot be performed in a single pass.

    • To resolve this circular reference, each time an Event is seen it will be added to a hash table with a unique ID. When deserialising, each stored ID is looked up in this table to recover the Event.

  • If Upstart detects that the user is down-grading and it sees syntax it doesn't understand, a flag will be added to the ConfSource stating that the job cannot be restarted (although it can still be stopped).

  • Log may pose some problems since it includes an NihIo and an NihIoBuffer.

    • The NihIoBuffer is relatively easy to serialise, but the NihIo contains an NihIoWatch as well as an NihIoBuffer.

Deserialisation
  • FIXME: incomplete

  • Read the serialisation data ensuring the "header" can be understood.
  • Deserialise the Session objects.

  • Deserialise the remaining objects.
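
A sketch of the deserialisation steps follows, including the header check and the ID lookup used to reconnect jobs to events. The JSON layout is assumed for illustration (top-level "header", "sessions", "events" and "job_classes" keys, with events referenced by numeric "id"); the real schema is still to be defined.

```python
import json

SUPPORTED_FORMATS = range(1, 2)  # assumption: this build reads format 1 only


def deserialise_state(text):
    """Return (sessions, job_classes) rebuilt from JSON text, or None on failure."""
    try:
        doc = json.loads(text)
        if doc["header"]["format_version"] not in SUPPORTED_FORMATS:
            return None  # header not understood
        sessions = list(doc["sessions"])
        # Lookup table used to turn stored event IDs back into objects.
        events = {event["id"]: event for event in doc["events"]}
        job_classes = []
        for cls in doc["job_classes"]:
            jobs = [
                {"name": job["name"],
                 "blocker": None if job["blocker"] is None
                 else events[job["blocker"]]}
                for job in cls["jobs"]
            ]
            job_classes.append({
                "name": cls["name"],
                "start_on": [events[i] for i in cls["start_on"]],
                "stop_on": [events[i] for i in cls["stop_on"]],
                "jobs": jobs,
            })
    except (ValueError, KeyError, TypeError):
        return None  # unreadable or unparsable state; caller decides how to recover
    return sessions, job_classes
```

Because every reference goes through the same lookup table, two jobs blocked on the same event end up pointing at one shared object rather than two copies.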

Create a ConfSource with a "special" path ("serialized_conf_source" or similar) that is not backed by any actual file. If a job that is already running from the initramfs is stopped and started, at that point you get the correct new /etc/init/job.conf from the main system.

Data Representation

Schema

The format of the JSON should resemble:

  • FIXME

Example

See http://people.canonical.com/~jhunt/upstart/stateful-reexec/state.json

Re-Exec Scenarios

Table showing possible re-exec scenarios.

Scenario | Old Version               | New Version               | Description                                  | Notes
---------+---------------------------+---------------------------+----------------------------------------------+------
SSU-E    | supported @ s-version 'x' | supported @ s-version 'x' | stateful to stateful upgrade                 | Expected common case (*) - versions equal.
SND      | supported @ s-version 'x' | not supported             | stateful to non-stateful downgrade           |
SSU-G    | supported @ s-version 'x' | supported @ s-version 'y' | stateful to newer stateful version upgrade   | moving to newer (greater than) version
SSD-L    | supported @ s-version 'y' | supported @ s-version 'x' | stateful to older stateful version downgrade | moving to old (less than) version
NSU      | not supported             | supported @ s-version 'x' | non-stateful to stateful upgrade             | Upgrading to first version of Upstart supporting stateful re-exec
NNU      | not supported             | not supported             | non-stateful upgrade                         | Behaviour today

Key:

  • The term "supported" refers to re-exec support being available.
  • "s-version" refers to the serialisation data format version, not the version of Upstart itself.
  • (*) - the serialisation data format version should change as little as possible.
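
The scenario codes can be derived mechanically from the two serialisation data format versions. The helper below is a hypothetical illustration; it uses None to mean "stateful re-exec not supported".

```python
def classify_reexec(old_sversion, new_sversion):
    """Map old/new s-versions (None = unsupported) to a scenario code."""
    if old_sversion is None:
        return "NNU" if new_sversion is None else "NSU"
    if new_sversion is None:
        return "SND"
    if new_sversion == old_sversion:
        return "SSU-E"
    return "SSU-G" if new_sversion > old_sversion else "SSD-L"
```

For example, classify_reexec(1, 1) gives "SSU-E" and classify_reexec(2, None) gives "SND".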

Risks

3rd Party Library

Making use of a 3rd-party library for JSON parsing is a risk in that it won't be using the NIH Utility Library and so won't have all the benefits associated with using it.

To mitigate this risk, the chosen library will be audited and improved where necessary before being used. Additionally, the set of tests for this feature will be extremely large and must cover all possible failure scenarios.

Unrepresentable State

This design requires any changes to internal data structures to be accompanied by:

  • Appropriate updates to the serialisation and deserialisation code.
  • Additional stateful re-exec tests to cover the changed internals.

There are 2 problems with this:

  • It's a manual process, so great care needs to be taken to ensure it happens!

    • This could be mitigated by somehow auto-generating the serialisation/deserialisation code, but that would be very complex. A cheap compromise would be to ensure that every change to Upstart forces a complete set of system tests to run that exercise all possible stateful re-exec scenarios, to at least ensure that stateful re-exec will not break.

  • It makes any future changes to Upstart internals potentially very costly in terms of time and could slow down development speed as a result.
    • Aside from automation, it is unclear how to further mitigate this issue.

It is worth noting that Scott included (partial?) stateful re-exec support in early revisions of Upstart (versions 0.2.0 to 0.3.2 inclusive), but eventually dropped this feature due to the maintenance cost.

Testing

This is a large feature which will require extremely careful unit, functional, and system testing.

All failure scenarios should be included in the tests (where possible).

  • Ensure that a stopped system job can be serialised, deserialised and started.
  • Ensure that a system job blocked on an event can be serialised, deserialised and started.
  • Ensure that a user job blocked on an event can be serialised, deserialised and started.
  • Ensure that an event blocked on a system job can be serialised, deserialised and started.
  • Ensure that an event blocked on a user job can be serialised, deserialised and started.
  • Ensure that a stopped system job in a chroot can be serialised, deserialised and started.
  • Ensure that a stopped user job can be serialised, deserialised and started.

  for proc in pre-start main post-start pre-stop post-stop
  do
    for type in system user
    do
      for action in "allowed to continue" stopped restarted
      do
        echo "Ensure that a $type job running the $proc process can be serialised, deserialised and $action."
      done
    done
  done
  • Ensure that a task which dies after serialisation but before deserialisation is handled.

  • Ensure that a respawn service which dies after serialisation but before deserialisation is handled.

  • Ensure that a respawn service which forks once after serialisation but before deserialisation is handled.

  • Ensure that a respawn service which forks twice after serialisation but before deserialisation is handled.

Impossible Scenarios

The following is a list of scenarios that cannot be handled directly:

  • Upstart is downgraded to a version that requires an older libnih, but libnih itself is not downgraded.
  • libnih's ABI changes so it is upgraded. Upstart is then re-exec'ed but either this happens before an updated Upstart package is installed or no new Upstart package is available to install.

These scenarios must be handled via packaging policy.

Unresolved Issues

  • If Upstart is downgraded to a version which supports stateful re-exec but whose serialisation data format cannot represent the current state, should Upstart refuse to perform stateful re-exec or simply "do the best it can"?
    • Ideally, any newer version of Upstart will support every previous serialisation data format version so that this scenario can be handled correctly. However:
    • Downgrading like this should not be a common operation (so should we spend the effort doing this?).
    • Ideally we need a way to query the serialisation data format version prior to attempting the re-exec.
      • We could add a --serialisation-version flag and have Upstart fork, run "/sbin/init --serialisation-version" first and if it detects that it cannot represent the state for the new version refuse to retain state on re-exec.
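
The check suggested above could be as simple as the sketch below. It assumes the "oldest,newest" output format proposed for "--serialisation-version" in the State-Passing section; both function names are hypothetical.

```python
def parse_serialisation_range(output):
    """Parse "oldest,newest" as printed by the proposed option."""
    oldest_text, newest_text = output.strip().split(",")
    oldest, newest = int(oldest_text), int(newest_text)
    if oldest > newest:
        raise ValueError("oldest version is greater than newest version")
    return oldest, newest


def can_pass_state(current_format, new_init_output):
    """Return True if the new init can read state written in current_format."""
    try:
        oldest, newest = parse_serialisation_range(new_init_output)
    except ValueError:
        return False  # unparsable output: fall back to a "bare" re-exec
    return oldest <= current_format <= newest
```

Treating any unparsable output as "cannot pass state" keeps the failure mode conservative: the worst case is the loss of state that a bare re-exec already causes today.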

Additional Information

A debug facility should be added to Upstart where it exposes a D-Bus method that allows the current state to be serialised when called. This can then be used as work progresses to ensure expected results.

