Created: 2006-07-29 by JohnMoser
This spec describes a method for categorizing and sorting AutomatedProblemReports to help developers identify similar or identical bugs.
CrashReporting can result in DrinkingFromTheFirehose due to the massive influx of problem reports. We must provide a way for developers to work through this flood without getting lost when thousands of copies of the same problem arrive at once.
Some example use cases:
- Rhythmbox spontaneously crashes. On six thousand machines.
- Totem and Nautilus both crash, due to a bug in the gstreamer XviD plug-in being triggered during playback in Totem or thumbnailing in Nautilus. Virtually every user sends this report.
In each of these cases the CrashReporting daemon would report a problem back to Ubuntu. On the server, these reports would be analyzed, categorized, and tagged using the resulting information.
The scope of this spec includes all problems reported by CrashReporting.
The scope of tagging is to identify characteristics of crash reports such as what caused the crash, what program was being executed, what libraries were linked in, where the crash occurred, and elements of the program's state such as call traces (if not destroyed by stack smashes).
The characteristics identified should pertain only to the process in question, not to its relation to other crash reports. Beyond that, there is no limit on the characteristics identified; in fact, as wide a scope as possible is desired. In particular, this includes any signal delivered at death and details such as whether a SIGSEGV occurred on a read or an execute, whether it was due to insufficient memory protections or to unmapped memory, and where the attempt was made.
Automatically identifying bugs as duplicates is out of scope.
We will need a crash handler and reporter, which CrashReporting provides.
The server handling the CrashReporting should tag problems exhibiting known characteristics to fall into certain categories. Known characteristics could include termination methods such as SIGSEGV read unmapped or SIGKILL self.
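A minimal sketch of how such server-side tagging might look, assuming a simplified report structure. The `CrashReport` fields, the `RULES` table, and `tag_report` are hypothetical names chosen for illustration; they are not part of CrashReporting.

```python
from dataclasses import dataclass, field


@dataclass
class CrashReport:
    """Hypothetical, simplified crash report; all field names are assumptions."""
    program: str
    signal: str              # e.g. "SIGSEGV"
    access: str = ""         # "read" or "execute"; "" if unknown
    mapped: bool = True      # was the faulting address mapped?
    pid: int = 0             # PID of the crashed process
    sender_pid: int = 0      # PID that sent the signal; 0 if unknown
    tags: set = field(default_factory=set)


# Each rule pairs a tag name with a predicate over the report.
RULES = [
    ("SIGSEGV read unmapped",
     lambda r: r.signal == "SIGSEGV" and r.access == "read" and not r.mapped),
    ("SIGKILL self",
     lambda r: r.signal == "SIGKILL" and r.pid != 0 and r.sender_pid == r.pid),
]


def tag_report(report):
    """Add the tag of every rule whose predicate matches; return the tags."""
    for name, predicate in RULES:
        if predicate(report):
            report.tags.add(name)
    return report.tags
```

Keeping the rules in a table rather than in branching code means a new known characteristic is one more entry in the list.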
The interface developers use to view CrashReporting reports would be capable of displaying all characteristics of a report and allowing developers to select a subset of these characteristics. This subset would then be cross-checked against all reports to generate a list of matching reports.
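The cross-check can be sketched as a subset test over tag sets. This assumes each report's characteristics are stored as a set of strings keyed by report ID; the tag spellings and the `matching_reports` name are made up for the example.

```python
def matching_reports(reports, selected):
    """Return the IDs of reports that carry every selected characteristic.

    `reports` maps a report ID to its set of tags (an assumed layout).
    """
    wanted = set(selected)
    return sorted(rid for rid, tags in reports.items() if wanted <= tags)


reports = {
    101: {"SIGSEGV", "execute", "module:very_broken_lib.so"},
    102: {"SIGSEGV", "read", "module:very_broken_lib.so"},
    103: {"SIGILL", "module:other_lib.so"},
}

# The developer ticks two characteristics of report 101 and searches.
print(matching_reports(reports, ["SIGSEGV", "module:very_broken_lib.so"]))
# prints [101, 102]
```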
From a list of reports, developers should be able to review reports and tag them as being related to any other reports. For our purposes the only "relation" we care about is whether two reports are the same bug.
Reports marked as the same bug should always be marked as the same bug as the earliest reported instance. The earliest known instance of any bug will be marked in its report as being the same as "itself". This may complicate the marking process in some cases, but searching will be much faster because everything points to the same bug: listing reports that are "the same bug as this report" is simply a matter of looking for reports whose "the same bug as" field matches the current report's.
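One way this marking scheme could work is sketched below, under the assumption that each report stores a submission time and a single "same bug as" field. The dictionary layout and the function names are hypothetical.

```python
def new_report(report_id, submitted):
    """A hypothetical report record; a fresh report is the same bug as itself."""
    return {"id": report_id, "submitted": submitted, "same_bug_as": report_id}


def mark_same_bug(reports, a_id, b_id):
    """Mark two reports as the same bug, pointing both groups at the earliest one."""
    root_a = reports[a_id]["same_bug_as"]
    root_b = reports[b_id]["same_bug_as"]
    if root_a == root_b:
        return root_a
    keep, drop = sorted((root_a, root_b),
                        key=lambda rid: reports[rid]["submitted"])
    # The costlier marking step: repoint every report in the later group.
    for report in reports.values():
        if report["same_bug_as"] == drop:
            report["same_bug_as"] = keep
    return keep


def same_bug_reports(reports, report_id):
    """Listing is one field comparison; there is no chain of links to follow."""
    root = reports[report_id]["same_bug_as"]
    return sorted(rid for rid, r in reports.items() if r["same_bug_as"] == root)
```

The trade-off matches the paragraph above: marking may touch many rows, while listing every report for a bug is a plain equality match on one field.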
Everything described in Design needs to be implemented on top of the facilities of CrashReporting.
We will need CrashReporting working first with automatic submissions to the server.
The server handling CrashReporting must tag problems exhibiting known characteristics, allowing developers to sort and examine them.
Data preservation and migration
No issues exist.
Characteristics to tag
There are several characteristics we can use to identify crashes, including:
- Crash occurs at the same point
- The same function
e.g. we found a stack smash because the program called __stack_chk_fail(); the calling function was some_vuln_function()
- The same module (library or program)
e.g. some_vuln_function() is from some_vuln_lib.so
- The same function
- List all crashes that share SOME_NUMBER frames of back trace leading up to the crash point
e.g. we found a double free(), determined that some_stupid_function() called free(), and the last 3 calls up to that point were good_function(), fine_function(), some_stupid_function().
- Crashes fault on the same problem
e.g. SIGILL crash.
SIGSEGV detected attempting to execute unmapped memory
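The "SOME_NUMBER of common back trace" characteristic from the list above can be sketched as a comparison of the innermost frames. This assumes a back trace is a list of function names ordered from outermost call to crash point; `shares_crash_tail` is a hypothetical helper name.

```python
def shares_crash_tail(trace_a, trace_b, n):
    """True if the n frames closest to the crash point are identical.

    Traces are assumed to be ordered outermost call first, crash point last.
    """
    if n <= 0 or len(trace_a) < n or len(trace_b) < n:
        return False
    return trace_a[-n:] == trace_b[-n:]


a = ["main", "good_function", "fine_function", "some_stupid_function"]
b = ["main", "other_caller", "good_function", "fine_function",
     "some_stupid_function"]

print(shares_crash_tail(a, b, 3))  # prints True
print(shares_crash_tail(a, b, 4))  # prints False
```

A trace destroyed by a stack smash would simply be too short to match, which is why short traces return False rather than raising an error.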
There are several more I don't discuss here. We need a complete list.
Adding new characteristics later is all right, because the only characteristics that help us are those that can be detected heuristically. It follows that when a new characteristic is added, the system can rescan every report ever made and tag any matches.
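A rescan of this kind might look like the following sketch. It assumes stored reports are dictionaries with a `tags` set, and that a characteristic is a tag name plus a detector function; both are assumptions made for illustration.

```python
def rescan(reports, tag, detector):
    """Apply a newly added detector to every stored report.

    Returns the number of reports that gained the tag.
    """
    newly_tagged = 0
    for report in reports:
        if tag not in report["tags"] and detector(report):
            report["tags"].add(tag)
            newly_tagged += 1
    return newly_tagged


stored = [
    {"signal": "SIGILL", "tags": set()},
    {"signal": "SIGSEGV", "tags": {"SIGSEGV read unmapped"}},
    {"signal": "SIGILL", "tags": set()},
]

# A "SIGILL crash" characteristic is introduced after these reports arrived.
print(rescan(stored, "SIGILL crash", lambda r: r["signal"] == "SIGILL"))
# prints 2
```

Because already-tagged reports are skipped, running the same rescan twice is harmless.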
Use of tags
The CrashReporting server would tag reports with information such as the above. Developers would later be able to use these tags as search criteria to find similar bugs. For example, pitti may select a crash and "search for similar crashes." He may then select (through check boxes) the similarities to match on. Take the example of a SIGSEGV attempting to execute the stack; the options below may be available to him:
- Similar fault
- Attempt to execute
- Attempt to execute non-executable area
- Attempt to execute
- Same module
Previous executing function was very_broken_function(), so we assume we are in module very_broken_lib.so
The version was 0.1.19 (i.e. very_broken_lib.so.0.1.19)
- Same program
- Similar backtrace
Fault occurred in very_broken_function()
- Previous N calls (you enter a value for N)
- Related (as manually tagged by the developers)
The range of options is limitless; but some example searches include:
SIGSEGV in /usr/bin/program_that_crashed in module very_broken_lib.so (any version) apparently from very_broken_function()
- Will not worry about matching any part of the tail of the backtrace, or that the attempt was to execute.
SIGSEGV on attempt to execute related to very_broken_lib.so.
Any fault in very_broken_lib.so.
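The three example searches could be expressed as field criteria where any field left out matches everything. This is only a sketch: the report fields, the sample data, and the `search` function are invented for illustration.

```python
def search(reports, **criteria):
    """Return reports matching every given criterion; omitted fields match anything."""
    return [r for r in reports
            if all(r.get(key) == value for key, value in criteria.items())]


reports = [
    {"id": 1, "signal": "SIGSEGV", "access": "execute",
     "program": "/usr/bin/program_that_crashed",
     "module": "very_broken_lib.so", "module_version": "0.1.19",
     "function": "very_broken_function"},
    {"id": 2, "signal": "SIGSEGV", "access": "read",
     "program": "/usr/bin/other_program",
     "module": "very_broken_lib.so", "module_version": "0.1.18",
     "function": "another_function"},
    {"id": 3, "signal": "SIGILL", "access": "",
     "program": "/usr/bin/other_program",
     "module": "unrelated_lib.so", "module_version": "2.0",
     "function": "fine_function"},
]

# SIGSEGV in one program and module (any version), from one function.
first = search(reports, signal="SIGSEGV",
               program="/usr/bin/program_that_crashed",
               module="very_broken_lib.so", function="very_broken_function")

# SIGSEGV on attempt to execute, related to the module.
second = search(reports, signal="SIGSEGV", access="execute",
                module="very_broken_lib.so")

# Any fault in the module.
third = search(reports, module="very_broken_lib.so")
```

Leaving `module_version` out of the first query is what makes it match "any version".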
This would prove to be a powerful tool for taking characteristics of an automatically detected problem and matching it with other problems.
The other problems listed may or may not be related. Developers would manually tag them as being the same bug, group them together, and so have quick and easy future reference to all the different reports.
BoF agenda and discussion
This is more generic than BugPatterns. BugPatterns looks as if it is about creating an identifier based on a bug, whereas this wiki page is about identifying all characteristics of CrashReporting entries and allowing people to search for similar reports. As I understand the difference, this spec describes a method by which, when you wake up one day and find that the last release of Firefox caused 3,976,472 cores to be uploaded automatically, you'll be able to quickly find a scheme to relate them all to each other, while BugPatterns appears to me to be a suggestion for using a pattern you've manufactured from a bug. I'm focusing on reducing information overload when you have a large amount of data to sift through. --JohnMoser
I've been meaning to bring this up. With a database of collected crashes that is hopefully easy to analyze for characteristics, how likely is it that the interface described here, if implemented, would be accessible to the public? I am personally interested in using the information collected to determine patterns that indicate security holes; in other words, leaving this information out in the open would be similar to opening up all bugs marked as security issues to public view. This is not necessarily a bad thing; it just may go against existing policy. --JohnMoser
We may want to eject the concept of a manually set "Related" tag; set it up per-user with some magic (like "search based on pitti's choices" or "average all users' ratings"); and/or move to a seven-point Likert scale. If we keep it, I highly recommend a seven-point Likert scale, possibly with the center at 4 being "Unknown relationship." Of note, implementing this kind of each-to-all mesh mapping directly would be somewhat difficult; one option is to create a meta-object that all related reports are linked to, and to unlink anything related at level 4. --JohnMoser