AutomatedProblemReports

Differences between revisions 16 and 17
Revision 16 as of 2005-04-28 00:41:13
Size: 3821
Editor: intern146
Comment: demoted to brain dump, we have another bof today
Revision 17 as of 2005-04-29 05:53:08
Size: 7345
Editor: intern146
Comment: current state of editing (BROKEN!)
Line 17: Line 17:
  * UduSessions: 1, 4, 8, etc [[BR]]   * UduSessions: 2(1) [[BR]]
Line 21: Line 21:

Streamline the process of collecting data for common end-user
problems, so that they can be prioritized and addressed
Streamline the process of collecting data for common end-user problems, so that they can be prioritized and addressed.
Line 29: Line 27:
 * When a program crashes, send a report (with an absolute minimum of user interaction)
 * Extract and store debug symbols from standard builds, and store them in a centralized repository for use in analyzing these reports

 * When a package installation, removal or upgrade fails, send a report (with an absolute minimum of user interaction)
 * When a kernel panic/oops/etc. occurs, send a report (with an absolute minimum of user interaction)
 * Extract and store debug symbols from standard builds, and store them in a centralized repository for use in analyzing these reports.
* When a program crashes, send a report (with an absolute minimum of user interaction).
 * When a package installation, removal or upgrade fails, send a report (with an absolute minimum of user interaction).
 * When a kernel panic/oops/etc. occurs, send a report (with an absolute minimum of user interaction).
Line 37: Line 35:
1. The debugging symbols need to be extracted at build time and put on a central server so that they can be downloaded. Storing the debug symbols in the packages themselves is not feasible because it would make the size of the packages explode. But the debug symbols need to be available to make useful backtraces from a crashed application. === Data Preservation and Migration ===
Line 39: Line 37:
1. A problem is how to report the crash to the user. A Breezy install will no longer have an MTA installed, and if a daemon or non-interactive application crashes it can't report the problem to the user (even when the user is logged in and runs X, it is probably not possible to connect to the display). Those processes will not alter the user's data in any way.

=== Packages Affected ===

 * `update-notifier`
 * `debhelper`
 * buildd scripts

=== User Interface Requirements ===

`update-notifier` needs to be generalized to a daemon which informs (and asks) the user about arbitrary system events. "Package updates available" is one such event, but the daemon should also cover events like "An application has crashed" or "A new piece of hardware was plugged in, do you want to configure it?". Alternatively, we need to write a general notifier with an extensible plugin architecture from scratch and convert `update-notifier` to use that.

Every particular type of event needs a special dialog which displays the information to the user and asks how to proceed. The user must be able to choose whether a report shall be sent to a database. We should not do this unconditionally since stack traces, environments, etc. may contain sensitive and private information. This dialog should also allow the user to input some comment about how the problem could be reproduced (if the event notifies about a problem).

=== Debug symbol extraction ===

Our package build process will be modified to preserve debug symbols of all packages in all versions and publish them on our servers (e.g. `http://debug.ubuntu.com/`''package/version/path_to_binary_or_library/filename''`.dbg`). A package (and program) `pkgstripdebug` will be created which calls `objcopy --only-keep-debug` on binaries and libraries before stripping them. The set of debugging symbol files is exported in a tarball ''sourcepackagename''`_`''version''`_debug.tar.gz` (similar to the translation tarballs), and the buildd scripts will publish the tarball contents to the download server.

Fedora uses a similar process and apparently they developed something better than `objcopy`, which produces much smaller debug info files. This should be investigated; see [http://bugzilla.ubuntu.com/8149 Ubuntu #8149] for some further information.

For the majority of our packages it is sufficient to modify `dh_strip` in the `debhelper` package to call `pkgstripdebug` before actually stripping ELF files. Packages which don't use debhelper or have a broken build system that does not build binaries and libraries with debugging information have to be manually fixed to do so.

=== Process crash detection ===

We will create a small library `libcrashrep.so` whose init function installs a signal handler for the most common types of crashes (segmentation violation, floating point error, and bus error). The handler will catch all of these signals that the application does not handle itself. When a crash is detected, the library calls an external program `crashrep` with the application's process id and signal number as arguments. `crashrep` collects the following information about the crash:

 * Executable name
 * Signal name
 * proc information (`/proc/pid/{cmdline,environ,maps,status}`)
 * Package name and version
 * Stack trace.

To get a human readable trace, `crashrep` attempts to download debug symbols from the Ubuntu server and load them into gdb (`symbol-file foo.dbg`) before performing the `backtrace` command. All data is written into a file in RFC822 format and queued in a spool directory, where it can be picked up by a frontend. Finally a dbus notification is sent out that informs clients about the crash and the location of the data file.

=== Kernel crash detection ===

Many kernel oopses find their way through `klogd` into the kernel log file. At boot time, we should detect whether there is a kernel oops log in `/var/log/kern.log`, use `ksymoops` to make the dump actually readable, and write the trace into an RFC822 format file which is queued in a spool directory (further processing is similar to process crashes).

There is the kernel crashdump project at http://lkcd.sourceforge.net/ that should be investigated.

=== Package installation failures ===

TODO

=== Problem information files ===

TODO

=== Presenting the information ===

A Breezy install will no longer have an MTA installed, and if a daemon or non-interactive application crashes it can't report the problem to the user (even when the user is logged in and runs X, it is probably not possible to connect to the display).
Line 41: Line 89:

The front end (the future event notifier) is called (passing the name of the data file); it informs the user about the crash and asks how to proceed.

TODO: different users? what about server processes?

By using dbus rather than directly spawning a PyGTK interface, it is possible to intercept crashes of processes whose owner is not currently logged in or which do not run as the user on the desktop. Also, crash reports can be queued if no user is currently logged in. However, we
Line 44: Line 98:
1. When a package install/upgrade/remove fails this should be reported too. This needs to be hooked into dpkg.   1. The kernel needs ksymoops to get the debug information. There is a kernel crashdump project (http://lkcd.sourceforge.net/) that should be investigated. 1. When a package install/upgrade/remove fails this should be reported too. This needs to be hooked into dpkg.
Line 49: Line 101:


=== Data Preservation and Migration ===

=== Packages Affected ===

=== User Interface Requirements ===
Line 61: Line 106:
 * What data needs to be submitted?
  * core dump; MartinPitt: core dumps are difficult to handle since you need to have exactly the same libraries; a good stack trace is already very helpful, much smaller, and easier to handle
  * identifying information for all code involved (package versions? filenames? md5sums?)
  * additional run-time information (environment, command line arguments, user comments for reproduction)
 * How should the debug symbol extraction work?
 * Handling and caching of debug symbols on client
Line 68: Line 107:
 * Definition of RFC822 info file
 * Handling of package installation failures
 * Presenting the information
Line 71: Line 113:
 * MartinPitt already has a prototype for crash reports, see AutomatedCrashReporting  * MartinPitt already created a prototype for crash interception and information extraction, see AutomatedCrashReporting

Status

Introduction

Streamline the process of collecting data for common end-user problems, so that they can be prioritized and addressed.

Rationale

Scope and Use Cases

  • Extract and store debug symbols from standard builds, and store them in a centralized repository for use in analyzing these reports.
  • When a program crashes, send a report (with an absolute minimum of user interaction).
  • When a package installation, removal or upgrade fails, send a report (with an absolute minimum of user interaction).
  • When a kernel panic/oops/etc. occurs, send a report (with an absolute minimum of user interaction).
  • [http://www.cs.wisc.edu/cbi/ Cooperative bug isolation]?

Implementation Plan

Data Preservation and Migration

Those processes will not alter the user's data in any way.

Packages Affected

  • update-notifier

  • debhelper

  • buildd scripts

User Interface Requirements

update-notifier needs to be generalized to a daemon which informs (and asks) the user about arbitrary system events. "Package updates available" is one such event, but the daemon should also cover events like "An application has crashed" or "A new piece of hardware was plugged in, do you want to configure it?". Alternatively, we need to write a general notifier with an extensible plugin architecture from scratch and convert update-notifier to use that.

Every particular type of event needs a special dialog which displays the information to the user and asks how to proceed. The user must be able to choose whether a report shall be sent to a database. We should not do this unconditionally since stack traces, environments, etc. may contain sensitive and private information. This dialog should also allow the user to input some comment about how the problem could be reproduced (if the event notifies about a problem).
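The plugin idea above could be sketched roughly as follows. This is only an illustration in Python; the class EventNotifier, its methods, and the handler signature are hypothetical names that do not appear in this specification.

```python
from typing import Callable, Dict, List

# Hypothetical handler signature: a handler receives the event payload,
# shows its own dialog, and returns True if the user agreed to send a report.
Handler = Callable[[dict], bool]


class EventNotifier:
    """Minimal registry that maps event types to dialog handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def register(self, event_type: str, handler: Handler) -> None:
        # A plugin registers one handler per event type it understands.
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type: str, payload: dict) -> List[bool]:
        # Call every handler for this event type; unknown types are ignored.
        return [handler(payload) for handler in self._handlers.get(event_type, [])]
```

Each event type ("updates available", "application crashed", and so on) would bring its own handler, so nothing is sent unless a handler reports that the user consented.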

Debug symbol extraction

Our package build process will be modified to preserve debug symbols of all packages in all versions and publish them on our servers (e.g. http://debug.ubuntu.com/package/version/path_to_binary_or_library/filename.dbg). A package (and program) pkgstripdebug will be created which calls objcopy --only-keep-debug on binaries and libraries before stripping them. The set of debugging symbol files is exported in a tarball sourcepackagename_version_debug.tar.gz (similar to the translation tarballs), and the buildd scripts will publish the tarball contents to the download server.

Fedora uses a similar process and apparently they developed something better than objcopy, which produces much smaller debug info files. This should be investigated; see [http://bugzilla.ubuntu.com/8149 Ubuntu #8149] for some further information.

For the majority of our packages it is sufficient to modify dh_strip in the debhelper package to call pkgstripdebug before actually stripping ELF files. Packages which don't use debhelper or have a broken build system that does not build binaries and libraries with debugging information have to be manually fixed to do so.
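A minimal sketch of what pkgstripdebug might do, written in Python for illustration. The function names are hypothetical, the use of plain strip --strip-debug for the stripping step is an assumption (dh_strip may use different options), and the URL layout simply follows the example given above.

```python
import subprocess
from typing import List

DEBUG_SERVER = "http://debug.ubuntu.com"  # example host taken from the text above


def debug_symbol_url(package: str, version: str, binary_path: str) -> str:
    """Build the download URL following package/version/path/filename.dbg."""
    return f"{DEBUG_SERVER}/{package}/{version}/{binary_path.lstrip('/')}.dbg"


def strip_commands(binary: str, debug_file: str) -> List[List[str]]:
    """Return the commands to run: save the debug info first, then strip."""
    return [
        ["objcopy", "--only-keep-debug", binary, debug_file],
        ["strip", "--strip-debug", binary],  # assumption: simplest stripping mode
    ]


def pkgstripdebug(binary: str, debug_file: str) -> None:
    """Run the commands in order; abort on the first failure."""
    for command in strip_commands(binary, debug_file):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from execution makes the order (extract, then strip) easy to check without touching a real binary.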

Process crash detection

We will create a small library libcrashrep.so whose init function installs a signal handler for the most common types of crashes (segmentation violation, floating point error, and bus error). The handler will catch all of these signals that the application does not handle itself. When a crash is detected, the library calls an external program crashrep with the application's process id and signal number as arguments. crashrep collects the following information about the crash:

  • Executable name
  • Signal name
  • proc information (/proc/pid/{cmdline,environ,maps,status})

  • Package name and version
  • Stack trace.
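Collecting the proc information listed above could look roughly like this. It is a Python sketch under the assumption that crashrep runs while the crashed process still exists; the function name and the proc_root parameter are hypothetical and exist mainly to make the code testable.

```python
from pathlib import Path
from typing import Dict

# The four files named in the list above.
PROC_FILES = ("cmdline", "environ", "maps", "status")


def collect_proc_info(pid: int, proc_root: str = "/proc") -> Dict[str, str]:
    """Read /proc/<pid>/{cmdline,environ,maps,status} into a dictionary."""
    info: Dict[str, str] = {}
    for name in PROC_FILES:
        path = Path(proc_root) / str(pid) / name
        try:
            raw = path.read_bytes()
        except OSError:
            # The process may already be gone or the file may be unreadable.
            continue
        text = raw.decode("utf-8", errors="replace")
        if name in ("cmdline", "environ"):
            # These two files separate their entries with NUL bytes.
            text = "\n".join(part for part in text.split("\0") if part)
        info[name] = text
    return info
```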

To get a human readable trace, crashrep attempts to download debug symbols from the Ubuntu server and load them into gdb (symbol-file foo.dbg) before performing the backtrace command. All data is written into a file in RFC822 format and queued in a spool directory, where it can be picked up by a frontend. Finally a dbus notification is sent out that informs clients about the crash and the location of the data file.
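Since the RFC822 info file is not yet defined (see the TODO below), the following Python sketch only illustrates one possible encoding. The field names are hypothetical, and marking empty lines with a single dot is an assumption borrowed from the Debian control file convention.

```python
from typing import Dict


def format_field(key: str, value: str) -> str:
    """Format one field; extra lines become indented continuation lines."""
    lines = str(value).split("\n")
    out = [f"{key}: {lines[0]}"]
    for line in lines[1:]:
        # An empty line would end the header block, so mark it with a dot.
        out.append(" " + (line if line.strip() else "."))
    return "\n".join(out)


def format_report(fields: Dict[str, str]) -> str:
    """Serialize all fields in insertion order as an RFC822-style block."""
    return "\n".join(format_field(k, v) for k, v in fields.items()) + "\n"
```

A multi-line stack trace then fits into a single field, and a front end can read the file back with any RFC822 header parser.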

Kernel crash detection

Many kernel oopses find their way through klogd into the kernel log file. At boot time, we should detect whether there is a kernel oops log in /var/log/kern.log, use ksymoops to make the dump actually readable, and write the trace into an RFC822 format file which is queued in a spool directory (further processing is similar to process crashes).
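The boot-time check could start with something as simple as the Python sketch below. The marker strings are assumptions about what typical oops messages contain, not an exhaustive list, and the function name is hypothetical.

```python
from typing import Iterable, List

# Assumed substrings that commonly appear in kernel oops messages.
OOPS_MARKERS = ("Oops:", "kernel BUG at", "Unable to handle kernel")


def find_oops_lines(log_lines: Iterable[str]) -> List[int]:
    """Return the indexes of log lines that look like part of an oops."""
    return [
        index
        for index, line in enumerate(log_lines)
        if any(marker in line for marker in OOPS_MARKERS)
    ]
```

If the returned list is non-empty, the surrounding lines would be passed on to ksymoops and written to the spool directory.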

There is the kernel crashdump project at http://lkcd.sourceforge.net/ that should be investigated.

Package installation failures

TODO

Problem information files

TODO

Presenting the information

A Breezy install will no longer have an MTA installed, and if a daemon or non-interactive application crashes it can't report the problem to the user (even when the user is logged in and runs X, it is probably not possible to connect to the display). The solution for this problem is a generic event notifier applet based on dbus, which is probably useful for other things too, such as "disk full" or "temperature too hot".

The front end (the future event notifier) is called (passing the name of the data file); it informs the user about the crash and asks how to proceed.

TODO: different users? what about server processes?

By using dbus rather than directly spawning a PyGTK interface, it is possible to intercept crashes of processes whose owner is not currently logged in or which do not run as the user on the desktop. Also, crash reports can be queued if no user is currently logged in. However, we

1. A common frontend needs to be developed that will present the crash/problem to the user and is able to collect the debug symbols for a useful report. It then must be able to send it to a central server that collects the bug reports. A daemon on the server must remove duplicates and report the problems to Malone. HTTP will be used to send the reports to the server.

1. When a package install/upgrade/remove fails this should be reported too. This needs to be hooked into dpkg.

1. Fedora is using a similar scheme (Ubuntu #8149) that may be interesting for us too.
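The specification does not say how the server daemon should recognize duplicates. One possible approach, shown here only as a Python sketch, is to hash the executable name, the signal, and the function names of the topmost stack frames while ignoring addresses. The function name, the frame pattern (modelled on gdb backtrace output), and the depth of five frames are all assumptions.

```python
import hashlib
import re

# Matches gdb-style frames such as "#0  0x0804838c in foo () at foo.c:3"
# and "#1  main () at foo.c:9"; the address part is optional and ignored.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def crash_signature(executable: str, signal_name: str,
                    stack_trace: str, depth: int = 5) -> str:
    """Return a hash that is equal for crashes with the same top frames."""
    functions = []
    for line in stack_trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            functions.append(match.group(1))
    key = "\n".join([executable, signal_name] + functions[:depth])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```

Two reports with the same signature would be treated as one problem before anything is filed in Malone.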

Outstanding Issues

UDU BOF Agenda

  • Automated bug reporting to Malone
  • Definition of RFC822 info file
  • Handling of package installation failures
  • Presenting the information

UDU Pre-Work

AutomatedProblemReports (last edited 2008-08-06 16:26:25 by localhost)