Summary

This specification describes the plan to implement "stateful re-exec" support in Upstart.

Stateful re-exec refers to being able to restart Upstart and maintain the internal state across an exec(2) call.

Upstart already has the ability to re-exec itself, but it loses all internal state when doing this since it is equivalent to simply restarting Upstart. This causes problems since after re-exec there may be processes running which Upstart had been managing, but post-exec it has no knowledge of them. As such, re-exec is only used on shutdown to minimise issues.

Release Note

The "stateful re-exec" feature makes no changes to the externals of Upstart; it simply means that "telinit u" is now safe to use and will restart Upstart with no loss of state. The man page for telinit(8) has been updated to state this. "telinit u" should be called whenever Upstart itself or any of the libraries it depends on (libc, libnih and libjson) is upgraded, to ensure that the running instance of Upstart matches the on-disk version and is using the latest on-disk versions of those libraries.

Note that Upstart now relies on (and is therefore linked to) libjson.so to handle serialisation and deserialisation of state.

Rationale

By providing the ability to retain state, the following goals can be achieved:

Use Cases

Requirements

Design

Upstart needs the ability to perform the following operations in order:

  1. Serialise its existing internal state.
  2. Re-exec itself.
  3. Read the serialised state and deserialise it back into its internal data structures.
  4. Continue operating as normal.

Although the state passing will not be "total" (not every single internal data structure will or can be handled), after the deserialisation Upstart should have as near full knowledge of the state prior to the re-exec as possible.

Implementation

State-Passing

Upstart will re-exec itself using the following process.

Note that the parent becomes the new instance of Upstart, not the child (since Upstart must continue to be PID 1).

  1. Admin or maintainer script calls "telinit u || :" to request Upstart restart itself.

    • This is the existing re-exec interface so Upstart will be changed to now also perform state-passing.

      Basic (non-stateful) re-exec support is currently available and used by /etc/init.d/umountroot to ensure Upstart doesn't hold "stale" links to old library versions which would cause shutdown to hang.

  2. The SIGTERM handler calls a new re-exec handling function.

    • This ensures Upstart is not sitting in the main loop. Ensure all signals are blocked.
  3. Dispatch all D-Bus messages to ensure no initctl commands are being handled.

    • Upstart is now effectively "paused": it is no longer accepting D-Bus (and therefore initctl) commands, handling jobs or emitting events.
  4. Serialise all required internal state.
    • If this fails, degrade to stateless re-exec.
  5. Create a pipe.
  6. Ensure pipe fds are NOT marked O_CLOEXEC such that they persist across an exec(2).

  7. Parent clears O_CLOEXEC for all D-Bus file descriptors so they are NOT closed on exec.

  8. Parent clears O_CLOEXEC for all Log file descriptors so they are NOT closed on exec.

  9. Fork to create a child process.
  10. Parent closes writing end of pipe.
  11. Parent exec(2)s the new version of /sbin/init passing a magic flag ("--state-fd <fd>") which informs Upstart to read state from the specified file descriptor.

    • The --state-fd option is new and distinct from the existing --restart flag which performs a "bare" re-exec.

      The child is now actually the "old" process (running the original version of Upstart), whereas the parent is now the "new" process (running the most-recently-installed version of Upstart, which may be older than the old one).

  12. Parent blocks reading from file descriptor "<fd>".

  13. Child meanwhile closes reading end of the pipe.
  14. Child closes D-Bus control server connection to allow new parent to open it.
  15. Child closes D-Bus control bus connection to allow new parent to open it.
  16. Child writes state in JSON format (as generated by original parent) to writing end of pipe and exits.
    • If the child fails to complete the write within a bounded amount of time (say 10 seconds?), this is deemed to indicate that the new parent doesn't support serialisation but the check performed above failed somehow, so the child should log an error to the system log and exit.

      This scenario should be impossible, but handle it anyway.

  17. Parent reads serialisation data from reading end of pipe and reconstructs the internal objects (sessions, events, JobClasses, Jobs).

  18. Parent closes reading end of pipe.
  19. Parent sets the O_CLOEXEC flag for all deserialised D-Bus and Log objects such that they are not leaked to Jobs.

  20. Parent continues with normal initialisation.
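
The pipe, fork and exec steps above can be sketched as follows. This is a minimal illustration in Python rather than Upstart's C, and it deliberately omits signal blocking, D-Bus handling, Log descriptors and the write timeout. The helper names reexec_with_state and read_state are hypothetical; only the "--state-fd <fd>" option comes from this specification.

```python
import json
import os


def reexec_with_state(state, argv):
    """Hand `state` to a freshly exec'd program over a pipe.

    Hypothetical helper. It does not return on success, because the
    calling process image is replaced by argv[0].
    """
    # Serialise before forking so a failure here can still be handled.
    payload = json.dumps(state).encode("utf-8")

    read_fd, write_fd = os.pipe()
    # Python creates pipe fds close-on-exec, so the read end must be
    # explicitly marked inheritable to survive the exec.
    os.set_inheritable(read_fd, True)

    if os.fork() == 0:
        # Child: the "old" process. It only feeds the state to the parent.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as writer:
                writer.write(payload)
            status = 0
        finally:
            os._exit(status)

    # Parent: keeps its PID and becomes the "new" process.
    os.close(write_fd)
    os.execv(argv[0], list(argv) + ["--state-fd", str(read_fd)])


def read_state(fd):
    """Run by the new process: read until the old process closes its end."""
    with os.fdopen(fd, "rb") as reader:
        return json.loads(reader.read().decode("utf-8"))
```

The parent execs rather than the child so that the process keeping the original PID is the one running the new binary, which matches the requirement that Upstart remain PID 1.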

Preparatory Tasks

Before attempting stateful re-exec, PID 1 needs to handle the following:

D-Bus

ptrace

All processes currently being ptraced need to be handled. The most reasonable approach would seem to be to wait for the application to reach the started state.

However, that is dangerous. Imagine this scenario:

  1. User creates a new job and mis-specifies the expect stanza ("expect daemon" when the application doesn't even fork).

  2. sudo apt-get dist-upgrade pulls in new version of Upstart.

  3. PID 1 waits for the erroneous app to complete 2 forks.

The final step will never complete so the apt-get dist-upgrade will hang indefinitely.

We could time out after a few seconds of waiting, but that approach is ugly.

Since ptrace(2) IS retained across an exec(2) of the parent ("debugger") process, no special treatment is required: processes which were being ptraced prior to the re-exec will continue to trap and pass control to the re-exec'ed PID 1 after the re-exec.

Serialisation

JSON will be used to represent the serialised state.

Rationale:

Process

The minimum set of internal objects that are required to perform stateful re-exec are:

Serialisation
  1. Create a "header" encoding meta data about the serialisation, including:
    • Upstart version
    • timestamp
    • serialisation data format version
  2. Serialise all Session objects.

  3. Serialise all Event objects including the blocking list.

  4. Iterate over all JobClass objects in job_classes and all Job instances referenced by each JobClass, and serialise all Job objects "below" (as a child of) their associated parent JobClass in the JSON.

  5. For those jobs whose associated JobClass.console is CONSOLE_LOG, serialise a Log object.
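
As a rough illustration of the serialisation steps above, the sketch below builds a JSON document with a header, sessions, events and job classes with their jobs nested beneath them. It is written in Python for brevity, not in Upstart's C, and every key name ("header", "format_version", "job_classes" and so on), the FORMAT_VERSION constant and the dictionary-based inputs are assumptions made for the example; they are not the schema this specification defines.

```python
import json
import time

FORMAT_VERSION = 1  # hypothetical serialisation data format version


def serialise_state(upstart_version, sessions, events, job_classes):
    """Return a JSON string; all field names here are illustrative only."""
    classes_out = []
    for job_class in job_classes:
        jobs_out = []
        for job in job_class["jobs"]:
            entry = {"name": job["name"], "state": job["state"]}
            # Only jobs whose class uses CONSOLE_LOG carry a Log object.
            if job_class["console"] == "CONSOLE_LOG":
                entry["log"] = job.get("log", {})
            jobs_out.append(entry)
        # Jobs are nested "below" their parent JobClass.
        classes_out.append({
            "name": job_class["name"],
            "console": job_class["console"],
            "jobs": jobs_out,
        })

    document = {
        "header": {
            "upstart_version": upstart_version,
            "timestamp": int(time.time()),
            "format_version": FORMAT_VERSION,
        },
        "sessions": list(sessions),
        "events": [
            {"name": e["name"], "blocking": list(e.get("blocking", []))}
            for e in events
        ],
        "job_classes": classes_out,
    }
    return json.dumps(document)
```

Nesting jobs under their class keeps the parent/child relationship explicit in the data, so the reader does not need a separate lookup table to reattach each Job to its JobClass.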

Notes:

Deserialisation
  1. Read the serialisation data ensuring the "header" can be understood.
  2. Deserialise the Session objects.

  3. Deserialise the Event objects.

  4. Deserialise the JobClass objects.

  5. Deserialise the Job objects associated with each JobClass object.

  6. Create a ConfSource with a "special" path ("serialized_conf_source" or similar) that is not backed by any actual file. If a job that is already running from the initramfs is stopped and started, at that point you get the correct new /etc/init/job.conf from the main system.
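
The first five deserialisation steps can be sketched along these lines. Again this is Python for illustration only, and the key names, the SUPPORTED_FORMAT_VERSIONS set and the choice to raise ValueError when the header cannot be understood are assumptions of the example; the specification does not say how that failure is reported. The ConfSource step is not modelled.

```python
import json

SUPPORTED_FORMAT_VERSIONS = {1}  # hypothetical set of understood versions


def deserialise_state(text):
    """Rebuild simple in-memory structures from the serialised JSON.

    Raises ValueError if the data is not valid JSON or the header is not
    understood (json.JSONDecodeError is itself a ValueError subclass).
    """
    document = json.loads(text)
    header = document.get("header") if isinstance(document, dict) else None
    if (not isinstance(header, dict)
            or header.get("format_version") not in SUPPORTED_FORMAT_VERSIONS):
        raise ValueError("unrecognised serialisation header")

    # Order follows the steps above: sessions, events, classes, then jobs.
    sessions = list(document.get("sessions", []))
    events = list(document.get("events", []))

    job_classes = {}
    for job_class in document.get("job_classes", []):
        jobs = {job["name"]: job for job in job_class.get("jobs", [])}
        job_classes[job_class["name"]] = {
            "console": job_class.get("console"),
            "jobs": jobs,
        }

    return {"sessions": sessions, "events": events, "job_classes": job_classes}
```

Checking the header before touching anything else means an incompatible or corrupt stream is rejected before any partial state has been constructed.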

Notes:

Data Representation

Schema

The format of the JSON should resemble:

Example

See http://people.canonical.com/~jhunt/upstart/stateful-reexec/state.json

Re-Exec Scenarios

Table showing possible re-exec scenarios.

| Scenario | Importance | Old Version | New Version | Description | Re-Exec Strategy | Notes |
|----------|------------|-------------|-------------|-------------|------------------|-------|
| NNU | high | not supported | not supported | non-stateful upgrade | stateless | Behaviour today |
| NSU | high | not supported | supported @ s-version 'x' | non-stateful to stateful upgrade | stateless (1) | Upgrading to first version of Upstart supporting stateful re-exec |
| NSU-X | low | non-Upstart init, not supported | supported @ s-version 'x' | non-stateful to stateful upgrade | stateless (1) | Non-Upstart init daemon upgrading to a version of Upstart supporting stateful re-exec |
| SSU-E | high | supported @ s-version 'x' | supported @ s-version 'x' | stateful to stateful upgrade | stateful | Expected common case (2) - versions equal. |
| SND | medium | supported @ s-version 'x' | not supported | stateful to non-stateful downgrade | stateless | |
| SSU-G | high | supported @ s-version 'x' | supported @ s-version 'y' | stateful to newer stateful version upgrade | stateful | moving to newer (greater than) version |
| SSD-L | medium | supported @ s-version 'y' | supported @ s-version 'x' | stateful to older stateful version downgrade | stateful (3) | moving to old (less than) version |
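
The scenario table can be read as a small decision function. The sketch below is a Python illustration under two assumptions that are not stated in the table: s-versions are comparable integers, and None stands for "not supported". The function name classify_reexec is hypothetical. It cannot distinguish NSU from NSU-X (that depends on whether the old init was Upstart at all), and it does not model the qualifications marked (1), (2) and (3).

```python
def classify_reexec(old_s_version, new_s_version):
    """Return (scenario, strategy) for a pair of serialisation versions.

    None means the corresponding init has no stateful re-exec support.
    """
    if old_s_version is None and new_s_version is None:
        return ("NNU", "stateless")
    if old_s_version is None:
        # Also covers NSU-X, which looks identical from this viewpoint.
        return ("NSU", "stateless")
    if new_s_version is None:
        return ("SND", "stateless")
    if new_s_version == old_s_version:
        return ("SSU-E", "stateful")
    if new_s_version > old_s_version:
        return ("SSU-G", "stateful")
    return ("SSD-L", "stateful")
```

In short, the strategy is stateful only when both the old and the new binary support serialisation; every other combination falls back to the stateless behaviour.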

Key and Notes:

Risks

System Testing

Currently, there is no facility that would allow re-exec scenarios to be exercised against multiple /sbin/init binaries.

That is to say, although the scenarios can be tested indirectly by generating the appropriate JSON and piping it into /sbin/init, there is no way to automatically test the real scenario where, for example, version 'X' of Upstart is upgraded to version 'Y'.
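
The indirect approach of generating JSON and piping it into a binary could look roughly like the following Python sketch. The helper name run_with_state is hypothetical, and the sketch makes no claim about which other options a real /sbin/init would need in order to run outside PID 1; the caller supplies the full command line, and only "--state-fd <fd>" is taken from this specification.

```python
import json
import os
import subprocess


def run_with_state(argv, state):
    """Start `argv` with a pipe carrying `state` as JSON; return its exit code."""
    read_fd, write_fd = os.pipe()
    # pass_fds keeps read_fd open, under the same number, in the new process.
    proc = subprocess.Popen(
        list(argv) + ["--state-fd", str(read_fd)],
        pass_fds=[read_fd],
    )
    os.close(read_fd)  # the test harness itself does not read the state
    with os.fdopen(write_fd, "w") as writer:
        json.dump(state, writer)
    # Closing the write end above signals end-of-data to the reader.
    return proc.wait()
```

A harness built this way exercises only the deserialisation half with hand-crafted input, which is exactly the limitation described above: it does not prove that state written by version 'X' is accepted by version 'Y'.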

3rd Party Library

Making use of a 3rd-party library for JSON parsing is a risk in that it won't be using the NIH Utility Library and so won't have all the benefits associated with using it.

To mitigate this risk, the chosen library will be audited and improved where necessary before being used. Additionally, the set of tests for this feature will be extremely large and must cover all possible failure scenarios.

Unrepresentable State

This design requires any changes to internal data structures to be accompanied by:

There are 2 problems with this:

It is worth noting that Scott included (partial?) stateful re-exec support in early revisions of Upstart (versions 0.2.0 to 0.3.2 inclusive), but eventually dropped this feature due to the maintenance cost.

D-Bus handling

D-Bus marks all sockets as O_CLOEXEC and it does not appear there is an easy way to determine the fds D-Bus has in use and clear that bit.
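
For a descriptor whose number is already known, toggling the close-on-exec bit is straightforward, as the Python sketch below shows using fcntl with F_GETFD and F_SETFD (the same calls are available in C). The helper names are hypothetical. This does not address the actual risk described above, which is discovering which descriptors D-Bus is using in the first place.

```python
import fcntl


def is_cloexec(fd):
    """Return True if `fd` would be closed automatically on exec."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_cloexec(fd, enabled):
    """Set or clear the close-on-exec bit, leaving other fd flags untouched."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)
```

Before the re-exec the bit would be cleared with set_cloexec(fd, False) so the descriptor survives; afterwards it would be set again with set_cloexec(fd, True) so the descriptor is not inherited by jobs.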

Testing

This is a large feature which will require extremely careful unit, functional, and system testing.

All failure scenarios should be included in the tests (where possible).

  # Enumerate one test description for every combination of job process,
  # job type and post-re-exec action.
  for proc in pre-start main post-start pre-stop post-stop
  do
    for type in system user
    do
      for action in "allowed to continue" stopped restarted
      do
        echo "Ensure that a $type job running the $proc process can be serialised, deserialised and $action."
      done
    done
  done

Impossible Scenarios

The following is a list of scenarios that cannot be handled directly:

These scenarios must be handled via packaging policy.

Unresolved Issues

Limitations

We are extremely keen to involve the community from an early stage to aid in testing and to allow them to provide feedback on this useful feature. However, since the code is still in development, this initial "preview" of the stateful re-exec feature will have a number of limitations:

Additional Information

A debug facility should be added to Upstart where it exposes a D-Bus method that allows the current state to be serialised when called. This can then be used as work progresses to ensure expected results.



FoundationsTeam/Specs/QuantalUpstartStatefulReexec (last edited 2012-11-13 09:35:21 by jamesodhunt)