Specs/M/ARMValidationDashboard

Launchpad Entry: https://blueprints.launchpad.net/ubuntu/+spec/arm-m-validation-dashboard
Created: PaulLarson
Contributors: PaulLarson, ZygmuntKrynicki
Packages affected:

Summary

As a part of the automated testing efforts on ARM we need a dashboard interface for visualizing the current state of the image. This interface must allow the user to see, at a glance, the state of functional tests, performance tests, as well as other useful data that is described in more detail below. Please note that the individual tests themselves are beyond the scope of this specification. This specification is only concerned with the infrastructure needed to deploy a centralized web application for test submission, processing and presentation.

This specification is a part of a larger project, see other blueprints for reference:

Release Note

No user visible changes

Rationale

We need to easily see how various development efforts are affecting the image over time. A dashboard interface helps us to visualize, in one place, the results of running tests on multiple machines. The dashboard can also display results of performance measurements across different image build dates to allow developers to quickly see how their efforts are affecting performance. Targets and baselines can be set for any performance metric so that it is possible to detect deviations and track goals.

User stories

Bob is a release manager for Ubuntu on a particular ARM device. Bob wants to check the overall status of the image produced yesterday before releasing Alpha 1. Bob visits the dashboard to check for test failures. Bob has marked some tests as expected to fail on this device, as not all components are in place yet and some things are still broken. As all other tests have good results, Bob can go forward with the release. Bob is a user who will visit the dashboard as a part of his daily routine. He wants most of the data he is interested in to be displayed on a single front page. Since he is logged in, his homepage contains a summary of the image he is working on. Since Bob visits this page daily, he is mostly interested in the differences, or updates, since yesterday. The website prominently highlights package information (packages that changed, that failed to build, etc.), test information (what tests were run and processed by the system over the last 24 hours, which tests failed, if any), benchmark information (emphasized samples from new measurements, regressions and other deviations from baseline) and bug information (new or modified bugs being targeted for the upcoming milestone).
Jane is interested in basic performance metrics of the current Ubuntu image. Jane can check some synthetic benchmarks for CPU, GPU and IO performance. Jane can also check some end-to-end benchmarks for user applications (browser startup time, time to full desktop, time to render a snapshot of key websites, etc.). Jane can set up a baseline for each metric and request to be notified of all variances that exceed a given threshold. Jane uses the dashboard rarely, definitely not on a daily basis. Jane is looking for performance regressions after key packages are changed or added. Jane is also looking at the numbers and graphs more than at anything else. Jane marks milestones such as 'new kernel added' or 'gtk sync complete' to add some context to some graphs. Baselines allow her to see how the current release performs in comparison to previous releases. [optionally, if it goes forward] Baselines also make it possible to see how one distribution compares to other distributions. Jane can easily set up identical devices with Ubuntu, Fedora, and SUSE (or, for some tests, even Windows) and have the data readily available and accessible online.
Alice is on the Desktop QA team and wants to integrate some of the tests her team has created into the dashboard. QA Engineers quickly bootstrap a local installation of the dashboard and check the bundled documentation and examples. Within hours the dashboard displays results from some of the tests that the local engineering team has already adapted. Alice sees the dashboard as a free tool that she can take advantage of. Alice and her team of engineers are on track to deliver measurable performance improvements of the Desktop. She is more interested in connecting the tests they have been using so far and in using the dashboard as a good user interface to all the data that they can produce. Alice is also interested in migrating historical records to the dashboard database interface, but doing so is an investment she is not yet ready to justify. Alice hopes that additional people will enhance the dashboard, either by adding more valuable tests or by improving the user interface and processing components, thereby allowing her engineers to focus on what is truly important to them and not on the surrounding infrastructure.
Yung is a product manager in Big Corp ltd. Yung is building a product based on Ubuntu and wants to reuse our QA infrastructure. Yung instructs his engineers to deploy a local dashboard installation and run our tests on their new secret product. Yung's engineers write an adapter that takes some of the results from the dashboard and pushes them to the internal QA system. Yung is a different type of user. He is not as familiar with open source methodologies, technology and infrastructure as a regular open source developer or activist would be. Yung was briefed about this technology and how it can be applied to his work process during a business meeting with representatives of a third-party company. Yung is neither a big proponent nor an opponent of open source technologies and merely wants to use them if they can help him do his work. For Yung the big factors are ease of deployment, first impression, disruptiveness, localisation (so that engineers can use languages other than English, especially far-east languages) and privacy. If the technology fails to meet his requirements it will be discarded and not revisited again. Time to market is paramount. If the technology works and is adopted, Yung is interested to know about support options.
David is an engineer at SoC Vendor Inc. David uses the QA dashboard to compare performance metrics across a whole line of SoCs that are manufactured by his company. David can quickly create custom graphs by adding data series from various measured properties (or probes, as the system calls them) and aggregating data sources across time or device type. David can also print them or re-use them in an office document he is working on. David saves some of the most often used graphs and shares them with the rest of the team. David is another external user. David is similar to Yung in his desire to add value without requiring too much investment, but unlike Yung he is primarily an engineer. David is fine with experiencing minor issues and is more interested in looking under the hood to tweak the application towards his needs. David might be an internal user evaluating this technology before broader acceptance or just doing a local installation for the purpose of his project. David might join an IRC channel or a mailing list to chat with developers or ask questions. He is not interested in formal support.

Design

The dashboard has the following core concepts:

software image (set of packages shipped together)
test collection (scripts, programs and sources that constitute a test)
test collection run (act of running tests on device)
test result (single pass/fail result within a test collection run)
performance measurement (single quantitative measurement within a test collection run)
performance metric/probe (a well-defined concept that can be used to compare performance measurements)

The dashboard is designed around the concept of test collections and test collection runs. Typical/expected tests include a group of unit tests for a major library, a stand-alone test suite designed to check compliance or correctness of some APIs, or an existing (including binary-only) application scripted to perform some scenarios. Each test collection run (abbreviated to test run from now on) is performed on a specific device (computer). The result of that run is a tree of log files. Log files are uploaded to the 'gateway' component of the dashboard for storage.

The second major concept is log file analysis. Each test collection has a log processing script that is designed to understand the format of the log files and translate them to one of two entities:

pass/fail test result
performance measurement and associated performance metric (probe)

All data that is displayed by the dashboard can be traced back to a log file. The system preserves this information for credibility and assistance in manual analysis.

Pass/fail test results are simple to understand: each one indicates that some test succeeded or failed. The identity of such tests is not maintained. That is, it is not possible to automatically compare two test runs and see whether the same pass/fail test succeeded in both. This limitation is by design.

In contrast, performance measurements always need a performance metric to be meaningful. This makes it possible to define metrics in the system and compare them across time, hardware and other factors. The metric also designates the unit of each measurement. The units may be time, bytes/second, pixels/second or any other, as required by the particular use case.

The decision not to maintain the identity of pass/fail tests is based on the assumption that typical qualitative (pass/fail) tests are far more numerous than quantitative tests (benchmarks), and that maintaining identity support in the log processors would be an additional effort with little gain.
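
The core concepts above can be sketched as a small data model. This is only an illustration in current Python 3 (not the Python 2.5 baseline listed under Limitations); the class and field names (Metric, PassFailResult, Measurement, CollectionRun, series) are assumptions and not part of this specification. Note that a pass/fail result carries no name or identity, only a reference to its log file, while every measurement points at a metric that makes it comparable across runs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Metric:
    """A performance metric (probe): names a measured property and its unit."""
    name: str
    unit: str  # e.g. "seconds", "bytes/second"


@dataclass
class PassFailResult:
    """A single pass/fail outcome. No test identity is kept, by design."""
    passed: bool
    log_file: str  # where in the uploaded log tree this result came from


@dataclass
class Measurement:
    """A single quantitative measurement, meaningful only via its metric."""
    metric: Metric
    value: float
    log_file: str


@dataclass
class CollectionRun:
    """One run of a test collection on one device with one software image."""
    collection: str
    device: str
    image: str
    results: List[PassFailResult] = field(default_factory=list)
    measurements: List[Measurement] = field(default_factory=list)

    def failure_count(self) -> int:
        return sum(1 for r in self.results if not r.passed)


def series(runs: List[CollectionRun], metric: Metric) -> List[Tuple[str, float]]:
    """Collect (image, value) pairs for one metric across several runs."""
    return [(run.image, m.value)
            for run in runs
            for m in run.measurements
            if m.metric == metric]
```

Because results have no identity, the only thing that can be compared between two runs on the pass/fail side is a count, whereas series() can line up the same metric across image versions.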

Implementation

Components

The dashboard is a collection of components that are maintained together. Those components are:

Dashboard, web application for user interaction (frontend)
Backend, application logic server with XML-RPC APIs for interaction with others (backend)
Data Gateway, custom FTP service for uploading and downloading files (gateway)
Log Analyzer, sandboxed process for analyzing uploaded files (analyzer)
SQL database, database of all shared application state
Python APIs for talking to various parts of the service
Command line tools for manipulating the system, registering tests, registering test runs, etc

The picture below shows how those components fit together in a full deployment scenario.

Limitations

Python 2.5+ required
Django 1.1+ required
PostgreSQL 8.4+ required
Deployment supported on Ubuntu Server 10.10+
One-way connectivity sufficient for working correctly.
IPv6 not supported officially but may work (no additional IPv6 code required)

Ubuntu Package Details

The launch control project is separated into the following packages:

launch-control
- A meta-package that depends on all components of the launch control suite.
launch-control-dashboard
- Web front-end for the whole application (the actual dashboard).
launch-control-backend
- Back-end for the whole application (database and application logic)
launch-control-data-gateway
- Data gateway service (for dropping test results from devices)
launch-control-log-analyzer
- Log analysis service.
launch-control-tools
- Command line tools for manipulating a launch-control installation.
launch-control-common
- Private APIs and other common files for the whole suite.
python-launch-control
- Public APIs wrapped as a python library.

Component Details

Dashboard Web Application

Front-end of the system
Allows browsing projects
- List with pagination
Allows browsing project releases
- requires project context
Allows browsing development snapshots / software images
- requires project release context
- shows all software images recorded in the system
- link to software profile manifest for each image
Allows browsing test collections
- shows basic information:
  - origin/url
  - license
  - shows capabilities (test/benchmarks/others)
- links to test collection run viewer
Allows browsing test collection runs (acts of running the test somewhere)
- search/filter by:
  - project (e.g. Linaro, Ubuntu)
  - image version (e.g. "Linaro 10.11, 2010-07-12-01": Linaro 10.11 release, built on 12 July 2010, first image for that day)
  - specific device class (e.g. Beagle Board)
  - specific device (e.g. Bob's Beagle Board)
  - submitter (e.g. Bob, anonymous)
  - specific software profile property (e.g. libc version=1.2.3)
  - specific hardware class property (e.g. memory=512MB)
Allows displaying information for a specific test collection run:
- display basic information about test collection run:
  - image version (date-time + serial for date)
  - software profile (packages and versions)
  - hardware profile (various bits)
  - test device (if registered)
  - submitter (if registered)
- display all failed tests:
  - with references from log file
- display all successful tests:
  - with AJAX'ed references to log file
  - hidden by default (summary view only)
- display all benchmark measurements:
  - when in hardware context:
    - show baseline for this hardware (if any)
    - highlight when deviates
  - when in project context:
    - show baseline for this project (if any)
    - highlight when deviates
  - when in image version/history context:
    - show results (y-axis) across image version (x-axis)
    - when in package context:
      - show specific package version (x-axis)
Allows showing aggregate results of certain test runs:
- select results matching:
  - test collection (e.g. Linux Test Suite)
  - software image version
  - hardware device class (e.g. Beagle Board)
  - hardware device (e.g. Bob's Beagle Board)
  - package version (e.g. libc6 v=1.2.3)
- for non-benchmark tests:
  - show pass/fail counts
  - with options to aggregate
- for each probe in all benchmarks:
  - show results across software image versions/time
  - show additional data series for:
    - different device class/hardware profile
Allows showing image 'health' summary:
- Test failures
- Package build failures
- Benchmarks deviated from baseline
- Unresolved bugs targeting upcoming milestone
- Unfinished work items targeting upcoming milestone
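
The "highlight when deviates" items above could be driven by a check along the following lines. This is a sketch under stated assumptions: the specification does not say how deviation is computed, so the relative threshold, the function names (deviates, highlight) and the dictionary layout are all hypothetical.

```python
def deviates(value, baseline, threshold):
    """Return True when value differs from baseline by more than threshold.

    threshold is a relative tolerance (0.10 means 10%); this definition is
    an assumption, the specification only mentions "deviations from baseline".
    """
    if baseline == 0:
        return value != 0
    return abs(value - baseline) / abs(baseline) > threshold


def highlight(measurements, baselines, threshold):
    """List metric names whose measured value deviates from its baseline.

    measurements and baselines both map metric name -> value. Metrics with
    no baseline are never highlighted ("show baseline ... (if any)").
    """
    return sorted(name
                  for name, value in measurements.items()
                  if name in baselines
                  and deviates(value, baselines[name], threshold))
```

For example, with a 10% threshold a boot time of 23.0 seconds against a baseline of 20.0 would be highlighted, while an IO throughput of 104 against a baseline of 100 would not.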

Backend Service

Back-end of the application
Shares database with the dashboard
Exposes log submission interface:
- setupSubmission(device_uuid, type, size):
  - takes arguments:
    - device_uuid - ID of the device
    - type - one of:
      - LOG_SUBMISSION
      - LOG_ANALYSIS [optional]
      - SW_IMAGE_MANIFEST
      - HW_PROFILE
      - SW_PROFILE
    - size - size of the submission in bytes
  - returns:
    - submission_id
    - submission_URL - FTP URL where the files can be uploaded
  - may raise exception:
  - on success:
    - asks the gateway to prepare submission directory and give write access to device_uuid
- completeSubmission(device_uuid, submission_id)
  - takes arguments:
    - device_uuid - ID of the device
    - submission_id - as obtained from setupSubmission()
  - does not return anything
  - may raise exception:
    - InvalidDevice
    - InvalidSubmission
  - on success:
    - asks the gateway to mark the submission directory read only
    - if type == LOG_SUBMISSION:
      - schedules log for processing
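
The submission interface above can be illustrated with an in-memory stand-in for the backend. This is a sketch, not the real service: the class name StubBackend, the URL layout, and raising InvalidDevice from setupSubmission() are assumptions (the specification leaves the exceptions of setupSubmission() unspecified), and the gateway is reduced to a "writable" flag.

```python
import itertools


class InvalidDevice(Exception):
    pass


class InvalidSubmission(Exception):
    pass


class StubBackend:
    """In-memory stand-in for the backend's log submission interface."""

    def __init__(self, known_devices, gateway_url="ftp://gateway.example"):
        self._devices = set(known_devices)
        self._gateway_url = gateway_url      # hypothetical gateway address
        self._ids = itertools.count(1)
        self._submissions = {}               # submission_id -> details
        self.analysis_queue = []             # submissions awaiting analysis

    def setupSubmission(self, device_uuid, type, size):
        if device_uuid not in self._devices:
            raise InvalidDevice(device_uuid)  # assumption, see lead-in
        submission_id = next(self._ids)
        # Stands in for "prepare submission directory, give write access".
        self._submissions[submission_id] = {
            "device": device_uuid, "type": type, "size": size, "writable": True}
        url = "%s/%s/%d" % (self._gateway_url, device_uuid, submission_id)
        return submission_id, url

    def completeSubmission(self, device_uuid, submission_id):
        if device_uuid not in self._devices:
            raise InvalidDevice(device_uuid)
        submission = self._submissions.get(submission_id)
        if submission is None or submission["device"] != device_uuid:
            raise InvalidSubmission(submission_id)
        # Stands in for "mark the submission directory read only".
        submission["writable"] = False
        if submission["type"] == "LOG_SUBMISSION":
            self.analysis_queue.append(submission_id)

    def is_writable(self, submission_id):
        return self._submissions[submission_id]["writable"]
```

A device would call setupSubmission(), upload its files to the returned URL, and then call completeSubmission(); only LOG_SUBMISSION uploads end up scheduled for processing.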
Exposes queue interface for taking log processing jobs:
- getNextAnalysisJob(job_server_name, job_server_key):
  - takes arguments:
    - job_server_name - hostname of the job server - informative purpose only
    - job_server_key - shared secret of the job server
  - returns:
    - submission_id - id of the submission to analyze
    - analysis_id - unique to this request
  - associates submission with job server
  - changes submission status to busy (by storing analysis_id)
  - sets a timeout to return results based on average processing time for same test collection [optional]
- completeAnalysisJob(analysis_id, status):
  - takes arguments:
    - analysis_id - same job ID that was obtained from getNextAnalysisJob()
    - status - Finished | Failed
  - does not return anything
  - may raise exception:
    - ValueError - wrong id or status
  - on success:
    - job processing results are available in shared storage
    - processing results are loaded into the database
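
The queue interface above might behave roughly like the following in-memory sketch. Several details are assumptions not stated in the specification: the class name AnalysisQueue, returning None when no job is pending, rejecting a wrong shared secret with PermissionError, and the schedule() helper that stands in for completeSubmission() scheduling a log. The optional timeout is left out.

```python
import uuid
from collections import deque


class AnalysisQueue:
    """In-memory sketch of the backend's log processing job queue."""

    def __init__(self, server_keys):
        self._server_keys = dict(server_keys)  # job server name -> shared secret
        self._pending = deque()                # submission ids waiting for analysis
        self._busy = {}                        # analysis_id -> submission_id
        self.finished = []
        self.failed = []

    def schedule(self, submission_id):
        """Hypothetical hook: queue a completed LOG_SUBMISSION for analysis."""
        self._pending.append(submission_id)

    def getNextAnalysisJob(self, job_server_name, job_server_key):
        if self._server_keys.get(job_server_name) != job_server_key:
            raise PermissionError("unknown job server or wrong key")
        if not self._pending:
            return None                        # assumption: nothing to do
        submission_id = self._pending.popleft()
        analysis_id = uuid.uuid4().hex         # unique to this request
        self._busy[analysis_id] = submission_id  # submission is now "busy"
        return submission_id, analysis_id

    def completeAnalysisJob(self, analysis_id, status):
        if analysis_id not in self._busy or status not in ("Finished", "Failed"):
            raise ValueError("wrong id or status")
        submission_id = self._busy.pop(analysis_id)
        if status == "Finished":
            self.finished.append(submission_id)
        else:
            self.failed.append(submission_id)
```

Storing the analysis_id against the submission is what marks it busy, so a second analyzer asking for work never receives the same submission while a job is in flight.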
Exposes provisioning interface:
- configureNewDevice():
  - does not take any arguments
  - returns:
    - device_uuid - freshly assigned to this device
  - on success:
    - sets provisioning status of that device to INCOMPLETE
- updateHardwareProfile(device_uuid, submission_id):
  - takes arguments:
    - device_uuid - id of the device
    - submission_id - id of the submission
  - does not return anything
  - may raise exception:
    - InvalidDevice
    - InvalidSubmission
  - on success:
    - harvests basic profile information from the log file
    - recalculates device information for test scheduler
    - updates device and
- updateSoftwareProfile(device_uuid, submission_id):
  - takes arguments:
    - device_uuid - id of the device
    - submission_id - id of the submission
  - does not return anything
  - may raise exception:
    - InvalidDevice
    - InvalidSubmission
  - on success:
    - harvests basic profile information from the log file
    - recalculates device information for test scheduler
    - sets provisioning status of that device to COMPLETE
Exposes scheduling interface for automatic test requests:
- getNextTestJob(device_uuid):
  - takes arguments:
    - device_uuid - id of the device
  - returns:
    - test name - well-known name of the test collection to run
  - may raise exception:
    - NothingToDo - no activity required, sleep for one hour
    - DeviceNotProvisioned - device is not provisioned yet
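
On the device side, the scheduling interface above suggests a simple polling loop. The sketch below assumes a backend object with a getNextTestJob() method that raises the two exceptions named in the list; the names poll_for_tests and ScriptedBackend, the max_polls bound, and the choice to stop polling when the device is not provisioned are illustrative assumptions.

```python
import time


class NothingToDo(Exception):
    pass


class DeviceNotProvisioned(Exception):
    pass


def poll_for_tests(backend, device_uuid, run_test, sleep=time.sleep, max_polls=24):
    """Ask the backend for test jobs up to max_polls times.

    Returns the names of the test collections that were run.
    """
    ran = []
    for _ in range(max_polls):
        try:
            test_name = backend.getNextTestJob(device_uuid)
        except NothingToDo:
            sleep(3600)   # "no activity required, sleep for one hour"
            continue
        except DeviceNotProvisioned:
            break         # assumption: provisioning must be completed first
        run_test(test_name)
        ran.append(test_name)
    return ran


class ScriptedBackend:
    """Test double: replays a fixed list of job names and exceptions."""

    def __init__(self, script):
        self._script = list(script)

    def getNextTestJob(self, device_uuid):
        step = self._script.pop(0) if self._script else NothingToDo()
        if isinstance(step, Exception):
            raise step
        return step
```

Passing sleep as a parameter keeps the loop easy to exercise without actually waiting an hour between polls.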

Data Gateway Service

Implemented as an FTP daemon
- files stored in designated tree (/srv/launch-control/gateway)
- uses http://code.google.com/p/pyftpdlib/ for FTP
Management service for talking with the backend and reconfiguring the ftp service
Authenticates using:
- device UUID, submission ID
- analysis server account/password
Allows uploading submissions/files
Allows downloading files for analysis

Log Analyzer Service

Batch processing system for analyzing submitted log files
- talks to the back-end to get things to do
- talks to the data gateway to access logs
Runs log analyzers on submitted log files
- Updates the result database via internal database link [variant-1]
- Produces standardized result format document and uploads it to the data gateway [variant-2]
Runs inside a sandbox/chroot [optional]
Runs on additional compute nodes [optional]
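
A log processing script for one test collection might look like the sketch below. The log format (lines starting with PASS:, FAIL: or MEASURE:) is invented for illustration, since each real test collection defines its own format, and the dictionary-based result layout is likewise an assumption. Every entry keeps the log file name and line number so the data can be traced back to its source, and pass/fail entries deliberately carry no test identity.

```python
import re

# Hypothetical log line formats for one test collection.
_RESULT = re.compile(r"^(PASS|FAIL):\s*(.+)$")
_MEASURE = re.compile(r"^MEASURE:\s*(\S+)\s+([-+]?\d+(?:\.\d+)?)\s+(\S+)$")


def analyze_log(text, log_file):
    """Translate one log file into pass/fail results and measurements.

    Returns a (results, measurements) pair of lists of dictionaries.
    Lines that match neither format are ignored.
    """
    results = []
    measurements = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        match = _MEASURE.match(line)
        if match:
            measurements.append({
                "metric": match.group(1),
                "value": float(match.group(2)),
                "unit": match.group(3),
                "log_file": log_file,
                "line": lineno,
            })
            continue
        match = _RESULT.match(line)
        if match:
            results.append({
                "passed": match.group(1) == "PASS",
                "log_file": log_file,
                "line": lineno,
            })
    return results, measurements
```

Under variant-1 the returned entries would be written to the database directly; under variant-2 they would be serialized into the standardized result document and uploaded to the data gateway.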

Database model

TODO

Python APIs

TODO

Command line tools

Called launch-control-tool
usable on the device and for debugging
interface based on commands with options (like bzr)
commands for provisioning devices:
- configure-new-device
- update-software-profile <device-uuid> <submission-id>
- update-hardware-profile <device-uuid> <submission-id>
commands for talking with the gateway:
- setup-submission <device-uuid> <type>
- complete-submission <device-uuid> <submission-id>
commands for requesting test jobs:
- get-next-test-job <device-uuid>
commands for log analysis service:
- get-next-analysis-job <node-id> <node-secret>
- complete-analysis-job <node-id> <node-secret> <job-id> <status> [optional for variant 2 <submission-id>]
commands for integration with image builder:
- ingest-software-image-manifest <project-name> <release-name> <image-id> <manifest-file>
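
A command-based interface like the one listed above maps naturally onto argparse sub-commands. The sketch below covers only a subset of the commands and does no real work; the function name build_parser and the decision to validate <type> against the submission types of setupSubmission() are assumptions.

```python
import argparse

SUBMISSION_TYPES = ["LOG_SUBMISSION", "LOG_ANALYSIS", "SW_IMAGE_MANIFEST",
                    "HW_PROFILE", "SW_PROFILE"]


def build_parser():
    """Build a parser for a subset of the launch-control-tool commands."""
    parser = argparse.ArgumentParser(prog="launch-control-tool")
    commands = parser.add_subparsers(dest="command", required=True)

    # Provisioning
    commands.add_parser("configure-new-device")
    for name in ("update-software-profile", "update-hardware-profile"):
        cmd = commands.add_parser(name)
        cmd.add_argument("device_uuid", metavar="<device-uuid>")
        cmd.add_argument("submission_id", metavar="<submission-id>")

    # Talking with the gateway
    cmd = commands.add_parser("setup-submission")
    cmd.add_argument("device_uuid", metavar="<device-uuid>")
    cmd.add_argument("type", metavar="<type>", choices=SUBMISSION_TYPES)
    cmd = commands.add_parser("complete-submission")
    cmd.add_argument("device_uuid", metavar="<device-uuid>")
    cmd.add_argument("submission_id", metavar="<submission-id>")

    # Requesting test jobs
    cmd = commands.add_parser("get-next-test-job")
    cmd.add_argument("device_uuid", metavar="<device-uuid>")
    return parser
```

Parsing ["setup-submission", "dev-1", "LOG_SUBMISSION"] yields a namespace whose command attribute selects the action, which a dispatcher could then forward to the matching backend call.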

Migration

Currently there is no direct migration plan. One thing we could consider is taking open source technology that is already up and running, either at Canonical or elsewhere in the community or at other parties, and integrating its tests into our framework. If that happens we might want to look at migrating from qa-tool-foo to our new technology.

BoF agenda and discussion

Goal: define visualization interface for QA control that shows daily/snapshot/other summary of the 'health' of the image for a given platform

Different images based on a common base:

server
desktop
netbookcurrent

Stuff we want to show:

difference from yesterday
performance history
performance regressions
performance targets (as infrastructure to see if it works during the cycle)

Dashboard mockup:

Two columns:
1)
- Current build status
  - FTBFS count
  - New package count, number of packages
  - Latest build date/time
- Test result
- Build history
2)
- News
- Performance Targets

Q: What about some UI for scheduling test runs? A: We're not targeting this for the first release, but we want to have a UI for doing that in the future.

Q: How does our project relate to other Ubuntu QA projects? A:

Stuff to check:

buildbot (python)
hudson (java)

Action item: check hudson out? (Zygmunt Krynicki)
Hudson instance for Bzr at Canonical: http://babune.ladeuil.net:24842/view/Ubuntu/job/selftest-jaunty/buildTimeTrend

We want to store the log file of each test run just in case (for unexpected successes)

CategorySpec