AutomatedProblemReports


Introduction

We need to streamline the process of collecting data for common end-user problems, so that they can be prioritized and addressed.

This would ideally mean that crashes of userspace applications and the kernel, as well as packaging-related failures, are detected automatically so the user gets an easy-to-use frontend for adding information to the problem report, and can send the report to our database.

Rationale

Currently, many classes of problems (esp. program crashes) remain unreported or unfixed because:

  • many crashes are not easily reproducible (e.g. after installing a debug version)
  • end users do not know how to prepare a report that is really useful for developers
  • we have no easy frontend that allows users to submit detailed problem reports.

If data collection is automated and detailed information can be captured at the very moment a crash occurs, developers will be notified about problems and will receive much of the information they need to deal with them.

We hope that this will lead to a much better level of quality assurance in the future.

Scope

This specification deals with detecting crashes of processes running in the user's session. Crashes of system processes are covered to some degree. Kernel and package failures will be dealt with in separate specifications.

Use Cases

  • Martin wants to add a new TODO item in Evolution, which causes Evolution to crash. Instead of blaming everything on GTK bugs, he wants to provide Sebastien with as much information about the crash as possible, in order to help him fix the bug.
  • Stuart runs a PostgreSQL server in the data center. If the current postmaster process crashes, he wants to be notified about it and wants to get information about the crash.

Design

Debug symbol extraction

In order to produce good backtraces, we need to extract debug symbols from standard builds and store them in a centralized repository for use in analyzing these reports.

We will use deb files as the container for debug symbols. Compared to flat files, they offer the following advantages:

  • They can be arranged in a proper pool structure with a Packages file etc., so that existing tools to mirror, download, and ship debs can be reused. (However, we will not put them into the regular distribution. They should either live on a separate server (debug.ubuntu.com) or at least in a different suite (like "breezy-debug").)
  • Users can actually install them if they want to.

Process crash detection

There are two ways to detect a crash:

  • Create a small library libcrashrep.so whose init function installs a signal handler for the most common types of crashes (segmentation violation, floating point error, and bus error). The handler will catch all signals that the application does not handle itself. When a crash is detected, the library calls an external program. The library is put into /etc/ld.so.preload.

  • Extend the kernel to call a userspace program when a process exits with one of the mentioned signals. The program should be configured in /proc/sys/proc/process_crash_handler (or a similar file).

The library solution does not require any changes to the existing system, but it is less robust than the kernel approach, since it has to handle the crash in a corrupted environment. According to Ben Collins, the kernel hook is relatively easy to implement, so we should aim for this solution. If it does not work for some reason, we can always fall back to the library solution, which is already implemented and tested (and found not to produce stack traces reliably).

Presenting the information

There is no single way of presenting the collected debug information, so we have to try a list of possible actions after a crash:

  1. If the owner of the crashed process is currently logged in and the process has $DISPLAY defined, a pygtk frontend will be invoked.

  2. If the owner of the crashed process is currently logged in and the process has no $DISPLAY defined, but the process has an attached terminal, a console frontend will be invoked.

  3. If /usr/sbin/sendmail exists, a mail is sent to the process owner, containing the info file and asking them to forward it to an appropriate email address. Since Breezy does not install even a local MTA by default, we cannot rely on this, though.

  4. Dump the report into syslog with no further action.
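The fallback order above can be sketched as a small selection function. This is only an illustration under stated assumptions: the function name, its parameters, and the returned labels are hypothetical, and how the caller determines whether the owner is logged in or has a terminal is left open.

```python
import os


def choose_frontend(owner_logged_in, environ, has_terminal,
                    sendmail_path="/usr/sbin/sendmail"):
    """Return a label for the first applicable action in the list above.

    All names and labels here are hypothetical; `environ` is the
    environment of the crashed process as a dict.
    """
    # 1. Owner logged in and $DISPLAY set: graphical (pygtk) frontend.
    if owner_logged_in and environ.get("DISPLAY"):
        return "gtk"
    # 2. Owner logged in, no $DISPLAY, but a terminal is attached.
    if owner_logged_in and has_terminal:
        return "console"
    # 3. A sendmail binary exists: mail the report to the process owner.
    if os.path.exists(sendmail_path):
        return "mail"
    # 4. Last resort: dump the report into syslog.
    return "syslog"
```

For example, a crashed daemon whose owner is not logged in would end up with "mail" on a system with an MTA and "syslog" otherwise.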

In the future we should consider automatic processing of the generated reports by Launchpad. For now, both the interactive interface and the automatically sent mails should just ask the user to file a bug and include the generated report.

Implementation

Debug symbol extraction

dh_strip already offers to generate a debug package with the extracted symbols. However, it requires the debug package to be mentioned in debian/control, which we do not want to do permanently. Since modifying debhelper is considered bad and we just eliminated a similar modification to dh_builddeb, we will create a new package pkgstripdebug, which diverts dh_strip to change its behaviour. This package needs to be installed into the buildd chroots, similar to pkgstriptranslations. The diverted dh_strip does the following:

  1. Create a debug package in debian/ for all packages dh_strip is asked to act on.

    • The package name is the original one plus -dbgsym appended.

    • Packages which are Architecture: all, or end with -dbg are excluded.

    • Dependencies are Depends: Original package name (= ${Source-Version}).

    • If there already is a -dbg package, Conflicts: and Replaces: on it.

    • Point out the purpose and the original package name in the package description.
  2. Find all ELF files, call objcopy --only-keep-debug on them, and put the symbols into /usr/lib/debug/original path in the -dbgsym package. dh_strip has a similar feature, but it has different semantics in different compatibility levels and generally interacts too much with the packaging to be used in a robust and generic way.

  3. Create a deb and register it with dpkg-distaddfile for Section: raw-debug, so that the launchpad installer can put them into a proper place.

  4. Call the original dh_strip with the same parameters.
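The naming and path rules from the steps above could look roughly like this. It is a sketch, not the pkgstripdebug implementation: all function names are hypothetical, and the only external command assumed is objcopy with its --only-keep-debug option as named in step 2.

```python
import posixpath

DEBUG_ROOT = "/usr/lib/debug"


def dbgsym_package_name(package, architecture):
    """Return the -dbgsym package name, or None if the package is excluded."""
    # Architecture: all packages and existing -dbg packages are skipped.
    if architecture == "all" or package.endswith("-dbg"):
        return None
    return package + "-dbgsym"


def debug_symbol_path(elf_path):
    """Map an installed ELF file to its location under /usr/lib/debug."""
    return posixpath.join(DEBUG_ROOT, elf_path.lstrip("/"))


def objcopy_command(elf_path, output_path):
    """Build the objcopy call that writes only the debug sections."""
    return ["objcopy", "--only-keep-debug", elf_path, output_path]


def dbgsym_control_fields(package, has_dbg_package=False):
    """Return the control fields described in step 1 as (name, value) pairs."""
    fields = [
        ("Package", package + "-dbgsym"),
        ("Depends", "%s (= ${Source-Version})" % package),
    ]
    if has_dbg_package:
        fields.append(("Conflicts", package + "-dbg"))
        fields.append(("Replaces", package + "-dbg"))
    return fields
```

With these helpers, the symbols of /usr/bin/evolution from the evolution package would be written to /usr/lib/debug/usr/bin/evolution inside evolution-dbgsym.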

Fedora uses a similar process and apparently they developed something better than objcopy, which produces much smaller debug info files. This should be investigated, see [https://launchpad.net/bugs/14484 Ubuntu #14484] for some further information.

Process crash detection

The crash handler collects the following information about the crash:

  • Executable name
  • Signal name
  • proc information (/proc/pid/{cmdline,environ,maps,status})
  • Package name and version
  • Stack trace

To get a human readable backtrace, the handler looks for available debug symbols in /usr/lib/debug/. If none are present, the graphical crash handler should offer to download the dbgsym deb from the Ubuntu server. All data is written into a file in RFC822 format and presented to the user (see below).
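As a rough illustration of the collection step, the proc files and the signal name could be gathered as follows. The function names and the proc_root parameter are assumptions made for this sketch; package lookup and stack trace generation are not covered.

```python
import os
import signal

# The proc files named in the list above.
PROC_FILES = ("cmdline", "environ", "maps", "status")


def signal_name(signum):
    """Translate a signal number into its symbolic name, e.g. SIGSEGV."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "signal %d" % signum


def collect_proc_info(pid, proc_root="/proc"):
    """Read the proc files of a process into a dict keyed by file name.

    Files that cannot be read (for example because the process is
    already gone) are skipped rather than treated as fatal.
    """
    info = {}
    for name in PROC_FILES:
        path = os.path.join(proc_root, str(pid), name)
        try:
            with open(path, "rb") as proc_file:
                data = proc_file.read()
        except OSError:
            continue
        # cmdline and environ are NUL-separated; show one entry per line.
        text = data.replace(b"\0", b"\n").decode("utf-8", "replace")
        info[name] = text.strip()
    return info
```

Skipping unreadable files matters because the handler may run while the crashed process is being torn down, so a partial report is preferable to none.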

Problem information file format

An RFC822-encoded file with the information about the problem. Three different problem types exist: program crash, packaging problem, and kernel crash. We only support the first type for now, but the file format should support future improvements. The file should contain enough information to make analyzing the problem possible. A possible list of fields includes:

  • ProblemType: [Crash|Packaging|Kernel] 

  • Date

  • Architecture

  • DistroRelease

  • Locale

  • RunningKernel

  • PackageAffected

  • Dependencies (with Versions)

  • UserNotes

  • Backtrace (ProblemType: Kernel or Crash)

  • PackageError (ProblemType: Packaging, dependency problem or dpkg output)

  • ExecutableName (ProblemType: Crash)

  • SignalName (ProblemType: Crash)

  • CmdArguments (ProblemType: Crash, from /proc/$pid/cmdline)

  • Environment (ProblemType: Crash, from /proc/$pid/environ)

  • ProcStatus (ProblemType: Crash, from /proc/$pid/status)
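To make the format concrete, here is a minimal sketch of how such a file could be serialized. It assumes that multi-line values (such as a backtrace) are written as indented continuation lines in the usual RFC822 header style; the function name is hypothetical and the field values are invented for illustration.

```python
def format_report(fields):
    """Serialize (name, value) pairs as RFC822-style header lines.

    Single-line values are written as "Name: value". Multi-line values
    are written as "Name:" followed by continuation lines that start
    with a single space.
    """
    lines = []
    for name, value in fields:
        value_lines = str(value).splitlines()
        if len(value_lines) <= 1:
            first = value_lines[0] if value_lines else ""
            lines.append("%s: %s" % (name, first))
        else:
            lines.append("%s:" % name)
            lines.extend(" " + line for line in value_lines)
    return "\n".join(lines) + "\n"


example = format_report([
    ("ProblemType", "Crash"),
    ("SignalName", "SIGSEGV"),
    ("Backtrace", "#0 foo ()\n#1 main ()"),
])
```

Keeping the fields as an ordered list of pairs lets the writer emit ProblemType first, so a reader can decide early which of the type-specific fields to expect.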

Data Preservation and Migration

Those processes will not alter the user's data in any way.

Outstanding Issues

Future improvements

  • Automated crash reporting to Launchpad (taking privacy issues into account).
  • Handling of kernel crashes.
  • Handling of package installation/removal/upgrade errors.
  • Duplicate recognition based on the package and backtrace.
  • Offer to save the core file somewhere, so that the user can further assist the people who try to fix the bug
  • Add a power-user option to directly call ggdb or gdb-in-a-terminal

CBI

The [http://www.cs.wisc.edu/cbi/ Cooperative bug isolation] project was mentioned in this BoF, and there was some ongoing discussion about whether to adopt it in Ubuntu. CBI focuses on compiling applications with a modified toolchain to enrich them with code augmentations and debug information. However, this enlarges packages considerably, which would affect the number of packages we could ship on a CD. On the other hand, the solution that is proposed here works for all packages, does not enlarge packages, and does not require a modified toolchain. On the downside, our solution requires network access to get usable backtraces, but this can be mitigated by caching downloaded debug symbol files.

Kernel crash detection

Many kernel oopses find their way through klogd into the kernel log file. At boot time, we should detect if there is a kernel oops log in /var/log/kern.log, use ksymoops to make the dump actually readable and write the trace into an RFC822 format file which is then presented to the user (see below).
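The boot-time check could start from something like the following sketch. The marker strings are an assumption about what an oops looks like in the log and would need to be verified against real kernel output; the function name is hypothetical, and running ksymoops on the result is not shown.

```python
import re

# Assumption: an oops is introduced by a line containing one of these
# markers. Real logs may need additional or different patterns.
OOPS_START = re.compile(r"Oops:|Unable to handle kernel")


def find_oops_lines(log_lines):
    """Return the log lines from the first oops marker onwards, or []."""
    for index, line in enumerate(log_lines):
        if OOPS_START.search(line):
            return list(log_lines[index:])
    return []
```

Returning everything after the first marker is deliberately crude; a real implementation would also have to decide where the oops ends and whether it was already reported on a previous boot.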

There is the kernel crashdump project at http://lkcd.sourceforge.net/ that should be investigated.

Package installation failures

For package system failures, code needs to be written so that apt can report dependency problems (apt-get install $foo fails) and package installation/removal/upgrade failures to an external application. Before reporting a problem, apt needs to check that the installed dependencies on the system are consistent (apt-get install -f runs successfully). An option in apt should control whether apt reports problems at all, so that users and developers running an unstable distribution can turn reporting off. The report should include the user's sources.list to identify problems with 3rd party repositories. In some cases the output of apt-get install -o Debug::pkgProblemResolver=true is useful as well. The list of installed packages is sometimes useful too, but it can easily get huge, so it is probably not feasible to include it in a report.

Providing minimal symbols in binaries

A possible alternative to creating separate debug packages for everything is to include some symbols in binary packages. The primary problem for upstream developers receiving backtraces is that functions are listed as (???) instead of by name. Additional information such as source code file and line number, although interesting, is less important. Including symbols for every function directly in the binary file would provide the function names without increasing the binary size as much as including full debugging information. This can be implemented by using the -g option to strip instead of what is currently used. Some discussion is necessary to determine the optimal strip flags.


CategorySpec

AutomatedProblemReports (last edited 2008-08-06 16:26:25 by localhost)