Summary

This specification describes the way Ubuntu should gather and record run-time measurements of itself and the machine it's running on. It also describes a set of tools and libraries to facilitate displaying this information to the user. Think of it as extending top and System Monitor with a long-term history feature.

Rationale

Modern computers have many ways of measuring their run-time properties (e.g. temperature, battery charge, processor voltage and frequency). Most operating systems also allow measuring many of their own parameters while running (e.g. resource usage and availability).

Most operating systems have tools that allow viewing some of these properties. The venerable program top is an example, as is the Gnome System Monitor or the Gnome Power Manager. Most of these tools have been written with a single purpose in mind; this is a noble thing in principle, but in this case it causes several problems.

The first problem is data availability. Most monitoring programs have been written to display the instantaneous value of some parameters, so most have no or very limited features for recording data over long periods of time. In particular, a common problem is that monitoring happens only when requested, not continuously. (The Gnome Power Manager is a notable exception; however, it only retains its data until shutdown.)

The other problem is data accessibility: each program generates data separately, usually in its own format, so it is difficult for users to view, let alone analyze, the data. In particular, tools like the Gnome System Monitor's resource usage graphs are only useful for rough estimates.

Bottom line: a single (modular) "monitoring daemon" could run continuously and accumulate information in a consistent way. The process can be managed with a unified interface, and a library with a few well-chosen utilities and widgets would make tools like the Gnome System Monitor much more useful without huge effort.

Use Cases

Scope

This specification covers feature specifications for Ubuntu. However, the features developed would likely be useful for other Linux distributions.

Implementing this specification will likely result in developing or choosing a standard for data recording, writing a daemon to gather such data, writing a set of "input modules" used by the daemon to obtain the data, and writing a library of commonly-used tools and widgets for visualizing and manipulating the logs.

Design

Note: this specification is a very early approximation. It is written from the point of view of the user rather than the developer. The author expects it to change significantly after input from developers.

This specification describes two mostly independent deliverables. The first is a system-wide data-collection service, together with several pluggable modules for collecting data, configuration files and control scripts. The second is a library of small utilities for handling the logs, widgets and applications for viewing and searching through these logs, and changes to already-existing applications that use this type of functionality. The only coupling between the two parts should be (a) the format of the logs and (b) the control & configuration interface of the service.

This specification is currently incomplete, in that it does not yet specify precisely these interfaces and logs. Please help with enhancing it.

The next section describes the expected behavior of the two components.

Data-collection service: Braindump

A system-wide daemon is started as early as possible after the system starts. It enters a loop where, each second, it reads the entire set of values it is configured to read, records them in an in-memory log, and goes back to sleep. As soon as the partition holding /var becomes writable, these in-memory logs are saved to disk, while attempting to minimize interference with other processes.
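The loop described above can be sketched in a few lines. This is only an illustration of the intended behavior, not a proposed implementation; the names (`collection_loop`, `probes`, `max_samples`) are invented here, and `max_samples` exists only to make the sketch testable.

```python
import time

def collection_loop(probes, log, interval=1.0, max_samples=None):
    """Sketch of the sampler's main loop: once per interval, read every
    registered probe and append the readings to an in-memory log.
    `probes` maps a probe name to a zero-argument callable."""
    next_tick = time.monotonic()
    while max_samples is None or len(log) < max_samples:
        # Read the entire configured set of values in one pass.
        sample = {name: probe() for name, probe in probes.items()}
        log.append((time.monotonic(), sample))
        next_tick += interval
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)                  # sleep until the next tick
        else:
            next_tick = time.monotonic()       # fell behind: realign the grid
```

Keeping the schedule anchored to `next_tick` rather than sleeping a fixed second each pass avoids drift accumulating from the probes' own execution time.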

To read an observable value, the daemon maintains a registered set of "probes", each measuring a certain set of parameters. This is because there are many different kinds of things to record, available at different times. Probes are added and removed as they become available or disappear, and as configured by the administrator. For instance, just after start-up only a limited set of probes will be available (e.g. CPU usage, memory availability and process behavior); more will appear while the system runs (e.g. disk access, network IO and others, as modules are loaded, hardware is plugged in, and new probes are added by the administrator).

Each probe outputs only numeric values (usually several). Examples: the percentage of time the processor was busy during the last one-second interval, the amount of memory in use, the battery charge. The "collector" daemon maintains a list of values for each probe, and at each step collects an entire set of values.
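One possible shape for the probe/collector relationship is sketched below. The class names and the `/proc/loadavg` example probe are assumptions for illustration, not part of the specification; the spec only requires that a probe yield numeric values and that the collector keep one series per value.

```python
class Probe:
    """A probe declares its channel names and returns one numeric
    value per channel on each read (hypothetical interface)."""
    channels = ()

    def read(self):
        raise NotImplementedError

class LoadavgProbe(Probe):
    """Illustrative Linux-only probe: the 1-minute load average."""
    channels = ("loadavg1",)

    def read(self):
        with open("/proc/loadavg") as f:
            return (float(f.read().split()[0]),)

class Collector:
    def __init__(self):
        self.series = {}                   # channel name -> list of samples

    def register(self, probe):
        for ch in probe.channels:
            self.series.setdefault(ch, [])

    def step(self, probes):
        """Collect one full set of values across all registered probes."""
        for probe in probes:
            for ch, value in zip(probe.channels, probe.read()):
                self.series[ch].append(value)
```

Registering and unregistering probes at run time then amounts to adding or dropping entries in the collector's channel table.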

The lists must be held in memory, for speed. Constant disk access would be undesirable, especially for laptops, because it would prevent disks from spinning down. Thus, we will need to be able to record at least an hour of data in memory. There are two simple optimizations: first, most of this data will compress very well (if it were random, we wouldn't need to gather it), so a low-priority thread can apply a simple compression to everything older than a few minutes. Second, once we have some data we can watch for any disk access, and schedule a flush (with low-priority IO) after the first one. This way disk spin-down is not greatly affected.
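The first optimization, compressing in-memory segments older than a few minutes, could look like the following sketch. The segment layout (a timestamp plus a list of floats) is an assumption made here for illustration; the point is only that repetitive monitoring data shrinks dramatically under even generic compression such as zlib.

```python
import struct
import zlib

def compress_old_segments(segments, now, keep_seconds=300):
    """Low-priority housekeeping pass: pack and zlib-compress every
    in-memory segment older than `keep_seconds`.  A segment is
    (timestamp, list_of_floats); once compressed it becomes
    (timestamp, bytes).  Recent segments are left untouched so the
    sampler can keep appending cheaply."""
    out = []
    for ts, values in segments:
        if now - ts > keep_seconds and not isinstance(values, bytes):
            raw = struct.pack("%df" % len(values), *values)
            out.append((ts, zlib.compress(raw)))
        else:
            out.append((ts, values))
    return out
```

A real daemon would run this from a low-priority thread and write the already-compressed segments out whenever a flush is scheduled after someone else touches the disk.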

High-frequency (1-second) data takes up a lot of space very quickly, but it is not needed for very long. This means we can take the logs older than a certain threshold (a few days) and compress them by reducing the frequency, say, to once every five minutes.
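The frequency reduction is just interval averaging, as in this small sketch (the function name is invented here):

```python
def downsample(samples, factor=300):
    """Reduce a 1 Hz series to one value per `factor` seconds by
    averaging each consecutive window, e.g. factor=300 turns
    1-second data into 5-minute data."""
    return [sum(samples[i:i + factor]) / len(samples[i:i + factor])
            for i in range(0, len(samples), factor)]
```

Averaging is appropriate for rate-like values (CPU usage, IO throughput); counters or min/max-style observables might instead want sum, last, or extremum aggregation per window.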

Cost estimate

Assume 32 bits per observed value, 100 values monitored, sampled at 1 Hz. That is 400 bytes per second, or roughly 1.5 MB per hour; this could easily be kept in RAM on new machines, sometimes for more than an hour (on a battery-running laptop, we would flush only when memory fills up or something else accesses the disk). With a 100 MB hard-drive budget we could keep almost three days of history at this frequency, even without compression. Older data is aggregated to a lower frequency: each day we would have to load the data from three days ago, about 35 MB of saved data, aggregate it (usually just simple interval averaging) and output about 115 kB of 5-minute-frequency data. With another 100 MB, we could archive these low-frequency logs for over two years.

To summarize: 400 bytes/second recorded, about 1.5 MB per hour. Every day, with low priority, re-read and process (in linear time) about 35 MB of days-old data and archive it at lower frequency. With 200 MB of disk space we have 5-minute data (for 100 values) for the last two years, and 1-second data for the last three days.
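The budget above can be checked with a few lines of arithmetic:

```python
# Back-of-the-envelope check of the storage budget.
BYTES_PER_SAMPLE = 4 * 100            # 32 bits x 100 monitored values
per_second = BYTES_PER_SAMPLE         # 1 Hz  -> 400 bytes/s
per_hour = per_second * 3600          # 1,440,000 bytes, ~1.5 MB/hour
per_day = per_second * 86400          # ~35 MB/day of 1-second data

# Aggregated to one sample per 5 minutes (300 s):
per_day_5min = BYTES_PER_SAMPLE * (86400 // 300)   # ~115 kB/day

days_of_raw = 100e6 / per_day         # ~2.9 days in a 100 MB budget
days_of_5min = 100e6 / per_day_5min   # ~870 days, i.e. over two years
```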

Simple compression will probably further reduce the space (or lengthen the retention) by at least a factor of two, more likely by an order of magnitude. Also, many values do not need 32 bits of precision (e.g. temperature sensors); for the rest it is a reasonable size: for processor occupancy on a 1 GHz CPU, 32 bits give one-cycle precision each second; for gigabit Ethernet, 32 bits can count every bit each second. Few things are that fast or that fully used.

Of course, we could have several levels of frequency-compression, for example 50 MB each for 1-second, 10-second, 1-minute and 5-minute data.

Examples

Examples of data we will want to record with this daemon:

Many of these signals would have interesting relationships with a very different kind of observable: events. For instance, the Power Manager is also interested in things like suspend/resume, lid closed, screen turned off, power adapter plugged in/unplugged, etcetera. In the general case we would be interested in many similar events: network connection changes, peripherals plugged in/unplugged, programs started/terminated, users logged in, etc. I don't know for sure how these should be recorded; if you have worked on the Power Manager or anything similar, please give some input.

Returning to time-sequence signals, another thing we need to consider is level of detail. It would be harder, but it's probably very useful to record per-process observation for some probes, like resource usage. This increases the complexity significantly. (However, the volume of data is only increased a lot for servers; users rarely have more than a dozen processes actually doing anything interesting at any time, the rest will compress very well.) The biggest problem is that new processes will be started and stopped all the time, so we'll have thousands of short (usually) data streams, not only the dozens of session-long streams.
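One way to handle the thousands of short-lived streams is sketched below. Keying by (pid, start time) is a hypothetical scheme invented here; it is one way to ensure that a recycled PID opens a fresh stream instead of extending a dead process's data.

```python
from collections import defaultdict

def record_per_process(series, timestamp, readings):
    """Append one step of per-process readings.  `readings` maps a
    (pid, start_time) key to a numeric value; a short-lived process
    simply leaves a short stream behind."""
    for key, value in readings.items():
        series[key].append((timestamp, value))

series = defaultdict(list)
record_per_process(series, 0, {(100, 0): 0.5, (200, 0): 0.1})
record_per_process(series, 1, {(100, 0): 0.6, (300, 1): 0.2})  # pid 200 exited
```

Idle processes produce long runs of identical (usually zero) values, which is exactly the case where the simple compression above pays off.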

Privacy and Security

Much of the gathered data can be sensitive. It is probably safe to assign each data stream the same sensitivity as the "probe" it comes from. For instance, the utility top can currently access many statistics without super-user privileges; there is little reason to restrict access to long-term data gathered from the same sources any further. (After all, anyone interested could record it in parallel.) However, everything that non-privileged users cannot observe should by default be restricted to the super-user. (It could be made accessible to exactly the users who could originally see it, but this is likely complicated.) In particular, GPS data, accelerometer data and user input are very sensitive. (Security and privacy experts are welcome to provide a more detailed analysis.)

Precision and Performance

Obviously the precision of the recorded info is dependent on the probes' quality. However, the precision of the sampling needs some attention. Ideally a real-time scheduled process should gather the data to memory buffers, and a low-priority background process would handle compression, writing to disk, and old-data aggregation. However, a lot of care is needed to avoid the obvious risks:

The sampler process needs to call each probe every second. Normally each pass would take very little time: each probe only needs to copy several values (which the kernel already calculates) into a buffer. The amount of data gathered per step is small (less than a kB in the example above), so for well-written probes the total cost of data gathering is likely comparable to that of a process switch. (The compression, disk writing and down-sampling take longer, but they are not high-priority tasks.)

However, a misbehaving probe module can starve other processes. On the other hand, if the collector process is not real-time, the sampling becomes imprecise at exactly the most interesting moments: under high load.

External programs: Braindump

We'll need several tools to actually use the gathered data:

Outstanding Issues

This specification needs help from a few experienced people to advance to an implementable state:


CategorySpec

UnifiedSystemMonitoring (last edited 2008-08-06 16:22:21 by localhost)