AutomatedProblemReports

Status

  • EditedSpecification, BreezyGoal, DistroSpecification, MattZimmermanQueue

Introduction

We need to streamline the process of collecting data for common end-user problems, so that they can be prioritized and addressed.

Ideally, crashes of userspace applications and the kernel, as well as packaging-related failures, are detected automatically. The user then gets an easy-to-use frontend for adding information to the problem report and is offered the option of sending the report to our database.

Rationale

Currently many classes of problems like program crashes remain unreported or unfixed because:

  • many crashes are not easily reproducible (e.g. after installing a debug version)
  • end users do not know how to prepare a report that is really useful for developers
  • and we have no easy frontend that allows users to submit detailed problem reports.

If the process of data collection is automated and detailed information about a crash can be collected at the very time the crash occurs, developers will be notified about problems and will receive much of the information they need to deal with them.

We hope that this will lead to a much better level of quality assurance in the future.

Scope and Use Cases

  • Extract and store debug symbols from standard builds, and store them in a centralized repository for use in analyzing these reports.
  • When a program crashes, send a report (with an absolute minimum of user interaction).
  • When a package installation, removal or upgrade fails, send a report (with an absolute minimum of user interaction).
  • When a kernel panic/oops/etc. occurs, send a report (with an absolute minimum of user interaction).

Implementation Plan

Data Preservation and Migration

These processes will not alter the user's data in any way.

Packages Affected

  • update-notifier

  • debhelper

  • buildd scripts
  • apt

Debug symbol extraction

Our package build process will be modified to preserve debug symbols of all packages in all versions and publish them on our servers (e.g. http://debug.ubuntu.com/package/version/path_to_binary_or_library/filename.dbg). A package (and program) pkgstripdebug will be created which calls objcopy --only-keep-debug on binaries and libraries before stripping them. The set of debugging symbol files is exported in a tarball sourcepackagename_version_debug.tar.gz (similar to the translation tarballs), and the buildd scripts will publish the tarball contents to the download server.
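As a rough illustration of what pkgstripdebug might do for a single file, the following Python sketch builds the command lines and the download URL. The .dbg suffix and the URL layout follow the example above. The function names, the strip --strip-unneeded step, and the choice to return commands instead of running them are assumptions made for illustration only.

```python
def split_debug_commands(binary_path):
    """Return the command lines that separate debug symbols from one ELF file."""
    debug_path = binary_path + ".dbg"
    return [
        # Copy only the debug sections into a separate .dbg file.
        ["objcopy", "--only-keep-debug", binary_path, debug_path],
        # Then strip the original file; the exact strip flags are an assumption.
        ["strip", "--strip-unneeded", binary_path],
    ]


def debug_symbol_url(package, version, binary_path,
                     base="http://debug.ubuntu.com"):
    """Build the URL of a .dbg file, following the example layout in the text."""
    return f"{base}/{package}/{version}/{binary_path.lstrip('/')}.dbg"
```

Returning the argument lists keeps the sketch free of side effects; a real tool would pass each list to something like subprocess.run and check the exit status.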

Fedora uses a similar process and has apparently developed something better than objcopy, which produces much smaller debug info files. This should be investigated; see [http://bugzilla.ubuntu.com/8149 Ubuntu #8149] for some further information.

For the majority of our packages it is sufficient to modify dh_strip in the debhelper package to call pkgstripdebug before actually stripping ELF files. Packages which don't use debhelper or have a broken build system that does not build binaries and libraries with debugging information have to be manually fixed to do so.

Process crash detection

We will create a small library libcrashrep.so whose init function installs a signal handler for the most common types of crashes (segmentation violation, floating point error, and bus error). The handler will catch all signals that the application does not handle itself. When a crash is detected, the library calls an external program crashrep with the application's process ID and signal number as arguments. crashrep collects the following information about the crash:

  • Executable name
  • Signal name
  • proc information (/proc/pid/{cmdline,environ,maps,status})

  • Package name and version
  • Stack trace.
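The /proc part of this list could be gathered roughly as follows. This is a minimal sketch, not the crashrep implementation: the function name, the proc_root parameter (added so the code can be exercised against a test directory), and the decision to skip unreadable files silently are all assumptions.

```python
from pathlib import Path

# The four /proc entries named in the list above.
PROC_FILES = ("cmdline", "environ", "maps", "status")


def collect_proc_info(pid, proc_root="/proc"):
    """Read the /proc files listed above for one process.

    Files that cannot be read (process gone, permission denied) are skipped.
    """
    info = {}
    for name in PROC_FILES:
        try:
            raw = (Path(proc_root) / str(pid) / name).read_bytes()
        except OSError:
            continue
        # cmdline and environ are NUL-separated; put one entry per line.
        parts = [part for part in raw.split(b"\0") if part]
        text = b"\n".join(parts).decode("utf-8", errors="replace")
        info[name] = text.rstrip("\n")
    return info
```

Because the crashed process can disappear at any moment, the sketch treats every file as optional and returns whatever it managed to read.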

To get a human readable trace, crashrep attempts to download debug symbols from the Ubuntu server and load them into gdb (symbol-file foo.dbg) before performing the backtrace command. All data is written into a file in RFC822 format and presented to the user (see below).
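One way the gdb step could be driven is sketched below. The specification does not say how crashrep invokes gdb, so the batch-mode invocation, the function name, and the fallback of running without a symbol file when the download fails are assumptions.

```python
def gdb_backtrace_argv(pid, symbol_file=None):
    """Build a gdb command line that attaches to a process and prints a backtrace."""
    argv = ["gdb", "--batch", "-p", str(pid)]
    if symbol_file is not None:
        # Load the downloaded debug symbols before asking for the trace.
        argv += ["-ex", f"symbol-file {symbol_file}"]
    argv += ["-ex", "backtrace"]
    return argv
```

If no symbol file could be fetched, the same command still yields a trace, only with fewer resolved names.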

Kernel crash detection

Many kernel oopses find their way through klogd into the kernel log file. At boot time, we should detect if there is a kernel oops log in /var/log/kern.log, use ksymoops to make the dump actually readable and write the trace into an RFC822 format file which is then presented to the user (see below).
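The boot-time check could start with a simple scan of the log, sketched here. The marker strings are heuristics chosen for illustration, not an exhaustive or authoritative list, and the function name and the fixed number of context lines are assumptions. Decoding the dump with ksymoops would be a separate step.

```python
# Substrings that commonly appear at the start of a kernel oops (heuristic).
OOPS_MARKERS = ("Oops:", "kernel BUG at", "Unable to handle kernel")


def find_oops_lines(log_lines, context=20):
    """Return the first oops-like line plus up to `context` following lines."""
    for index, line in enumerate(log_lines):
        if any(marker in line for marker in OOPS_MARKERS):
            return log_lines[index:index + context + 1]
    return []
```

A caller would read /var/log/kern.log into a list of lines and pass it in; an empty result means nothing that looks like an oops was found.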

There is the kernel crashdump project at http://lkcd.sourceforge.net/ that should be investigated.

Package installation failures

For package system failures, code needs to be written so that apt can report dependency problems (apt-get install $foo fails) and package installation/removal/upgrade failures to an external application. Before reporting a problem, apt needs to check that the dependencies installed on the system are consistent (apt-get install -f runs successfully). An option in apt should control whether apt reports problems (so that users and developers running an unstable distribution can turn it off). The report should include the user's sources.list to help identify problems with third-party repositories. In some cases the output of apt-get install -o Debug::pkgProblemResolver=true is useful as well. The list of installed packages is sometimes useful too, but it can easily get huge, so it is probably not feasible to include it in a report.
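The decision logic described above can be sketched as a small function. Everything here is hypothetical: the function name, its parameters, and the SourcesList field name do not come from apt or from this specification; only ProblemType and PackageError appear in the field list of this document.

```python
def build_packaging_report(reporting_enabled, system_consistent,
                           error_output, sources_list):
    """Return the fields of a packaging report, or None if none should be sent.

    reporting_enabled: the (hypothetical) apt option that turns reporting on.
    system_consistent: True if "apt-get install -f" ran successfully.
    """
    if not reporting_enabled or not system_consistent:
        return None
    return {
        "ProblemType": "Packaging",
        "PackageError": error_output,
        # Hypothetical field carrying the user's sources.list contents.
        "SourcesList": sources_list,
    }
```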

Problem information file format

An RFC822-encoded file holds the information about the problem. There are three types of problem: program crash, packaging problem, and kernel crash. The file should contain enough information to make analyzing the problem possible. A possible list of fields includes:

  • ProblemType: [Crash|Packaging|Kernel] 

  • Date

  • Architecture

  • DistroRelease

  • Locale

  • RunningKernel

  • PackageAffected

  • Dependencies (with Versions)

  • DebconfInformation

  • UserNotes

  • Backtrace (ProblemType: Kernel or Crash)

  • PackageError (ProblemType: Packaging, dependency problem or dpkg output)

  • ExecutableName (ProblemType: Crash)

  • SignalName (ProblemType: Crash)

  • CmdArguments (ProblemType: Crash, from /proc/$pid/cmdline)

  • Environment (ProblemType: Crash, from /proc/$pid/environ)

  • ProcStatus (ProblemType: Crash, from /proc/$pid/status)
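To make the format concrete, here is a minimal sketch that writes such a file and reads it back with Python's standard email parser. The field names come from the list above; the function names and the choice to indent continuation lines with a single space (RFC822-style folding) for multi-line values such as a backtrace are assumptions.

```python
from email import message_from_string


def format_report(fields):
    """Serialize a dict of report fields as RFC822-style 'Key: value' lines."""
    lines = []
    for key, value in fields.items():
        first, *rest = str(value).split("\n")
        lines.append(f"{key}: {first}")
        # Continuation lines start with a space, as in RFC822 header folding.
        lines.extend(" " + line for line in rest)
    return "\n".join(lines) + "\n"


def parse_report(text):
    """Parse a report written by format_report back into a dict."""
    return dict(message_from_string(text).items())
```

For example, format_report({"ProblemType": "Crash", "SignalName": "SIGSEGV"}) yields two header lines, and parse_report recovers the same two fields.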

Presenting the information

There is no single way of presenting the information contained in the RFC822 file, so we have to try a list of possible actions after a crash:

  1. If the owner of the crashed process is currently logged in and the process has $DISPLAY defined, a pygtk frontend will be invoked.

  2. If the owner of the crashed process is currently logged in and the process has no $DISPLAY defined, but the process has an attached terminal, a console frontend will be invoked.

  3. If /usr/sbin/sendmail exists, a mail is sent to the process owner, containing the info file and asking the owner to forward it to an appropriate email address. Since Breezy does not install even a local MTA by default, we cannot rely on this, though.

  4. Dump the report into syslog with no further action.

The frontend should then ask the user to add some comments and ask whether to send a report to the developers. Interactive frontends should use HTTP (since this works everywhere); the mail "frontend" should ask the user to forward the mail to an automatically processed email address.
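The fallback order in the numbered list above can be sketched as a single decision function. The function name, the boolean parameters, and the returned labels are illustrative assumptions; how each condition is actually detected is left open.

```python
def choose_frontend(owner_logged_in, has_display, has_terminal, sendmail_exists):
    """Pick the presentation method, trying the options in the listed order."""
    if owner_logged_in and has_display:
        return "gtk"      # 1. pygtk frontend
    if owner_logged_in and has_terminal:
        return "console"  # 2. console frontend
    if sendmail_exists:
        return "mail"     # 3. mail to the process owner
    return "syslog"       # 4. dump into syslog
```

Since the first branch already covers every case with $DISPLAY set, the second branch is only reached when no display is available, matching step 2.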

User Interface Requirements

Every particular type of event needs a special dialog which displays the information to the user and asks how to proceed. The user must be able to choose whether a report shall be sent to a database. We should not do this unconditionally since stack traces, environments, etc. may contain sensitive and private information. This dialog should also allow the user to input some comment about how the problem could be reproduced (if the event notifies about a problem).

In the future we should consider using event-notifier instead of displaying dialogs directly from the information collection process.

Processing reports

The server collects reports submitted over HTTP or email and, for now, stores them in a database. In the future we should think about automatically grouping reports that describe the same problem and automatically generating Malone bug reports.

A more general solution would be a general event notification framework which can queue messages if the target user is not logged in. However, that feels much like reinventing mail delivery.

Discussion

The [http://www.cs.wisc.edu/cbi/ Cooperative bug isolation] project was mentioned in this BoF, and there was some ongoing discussion about whether to adopt it in Ubuntu. CBI focuses on compiling applications with a modified toolchain to enrich them with code augmentations and debug information. However, this enlarges packages considerably, which would affect the number of packages we could ship on a CD. On the other hand, the solution that is proposed here works for all packages, does not enlarge packages, and does not require a modified toolchain. On the downside, our solution requires network access to get usable backtraces, but this can be mitigated by caching downloaded debug symbol files.

Outstanding Issues

Things to consider in the future

  • Automated bug reporting to Malone.
  • Caching of downloaded debug symbols.

UDU Pre-Work

  • MartinPitt already created a prototype for crash interception and information extraction, see AutomatedCrashReporting (note that the relevant information was merged into this specification).

AutomatedProblemReports (last edited 2008-08-06 16:26:25 by localhost)