ARMArchiveBranching

Summary

We wish to provide a way to create derived archives. This will consist of a way to "branch" an archive in order to make changes, which can then be tracked and fed back. We wish to treat Ubuntu as one example of such an archive, and strive to make it the same as any other archive from the point of view of the tools.

This will give greater scope for experimentation, allowing people to easily create an archive for larger experiments, while providing them with tools that allow them to track their changes, merge in updates from the parents, and submit their branches back for inclusion.

Rationale

Currently if you wish to undertake an experiment outside Ubuntu that you can share with others, you either need to do everything in a PPA, which has some limitations, or to set up your own archive on different infrastructure.

Allowing people to do this more easily gives scope for doing these experiments, and given the right tools the burden of managing the delta shouldn't be oppressive.

This is particularly interesting in the ARM world, where there is typically more divergence in kernels and more experimentation with compiler options. Without branching, doing these things requires a greater investment of effort.

User stories

  • Scott wishes to test a system booting with the new version of Upstart. He branches a subset of Ubuntu corresponding to the ubuntu-desktop CD to a new archive, and then uploads the new version of Upstart to it. From there he uploads further packages which have Upstart jobs, to transition them to the new format. He can continue to merge in changes made in Ubuntu, and then, when he is ready, propose all of his changes as branches back to Ubuntu.

Definitions

  • Source repository: a collection of source packages and associated metadata (such as a Sources file) such that deb-src lines pointing to the collection of files can be put in a sources.list for "apt-get source" etc. to work.
  • Binary repository: a collection of binary packages and associated metadata (such as a Packages file) such that deb lines pointing to the collection of files can be put in a sources.list for "apt-get install" etc. to work.
  • Archive: A co-located binary repository and source repository for delivery of packages to users.
  • Derived archive: An archive that has a logical relationship to one or more other archives, where changes tend to flow from the parents into the child, and modifications are submitted back. The Ubuntu archive is an example of this, as it is derived from the Debian archive, amongst others. This concept is understood to exist already, but few tools understand it. It is not necessarily relevant to apt and other clients that retrieve packages from archives.
  • Full archive: an archive that is closed with respect to dependencies. All dependencies and build-dependencies of each package in the archive are also present in the archive. This means that an archive can be built against itself, and can be used by clients such as apt without reference to any other archive. Ubuntu is one such archive.
  • Slim/overlay archive: an archive that contains only a subset of the packages needed to fulfil the constraint of a full archive. For clients to make use of this archive they must also be able to retrieve packages from another archive to satisfy dependencies. Most PPAs are an example of this. They will often have one or more archives which they are intended to be used with. If they are a derived archive then this will usually include the parent archives, but may include others. This type of archive is only really useful for short-lived changes to packages, or to add packages that aren't in the other archives.
  • Publication: the act of writing the files needed for an apt archive to disk for serving. The term is used more frequently in the case where the information is tracked in a database, rather than just by the files on disk.
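
The distinction between a full archive and a slim/overlay archive can be checked mechanically. The following is a minimal sketch, assuming the dependency information has already been parsed into a plain mapping from package name to the names it depends on. The function names and the input shape are hypothetical, and versioned dependencies, alternatives, and virtual packages are ignored.

```python
from typing import Dict, Set


def missing_dependencies(archive: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per package, the dependencies that the archive does not contain.

    `archive` maps each package name to the names of its dependencies and
    build-dependencies (a simplified, hypothetical input format).
    """
    present = set(archive)
    missing = {}
    for package, deps in archive.items():
        absent = deps - present
        if absent:
            missing[package] = absent
    return missing


def is_full_archive(archive: Dict[str, Set[str]]) -> bool:
    """An archive is full when nothing it depends on lies outside it."""
    return not missing_dependencies(archive)
```

An archive for which `missing_dependencies` returns a non-empty result is a slim/overlay archive in the sense above, and the result lists what clients would need to fetch from elsewhere.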

Assumptions

Design

We need to be able to branch an archive and publish a subset of it (or in some cases a complete copy). It should be available as a usual apt archive so that tools can work with it without modification. Where it is a subset we may want to enforce it being a closed subset, but given the complexities of determining that, we may instead want to have the tools that developers use to create them default to requesting a closed subset.
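
As an illustration of what "requesting a closed subset" could mean for a client tool, here is a minimal sketch that expands a requested set of packages to its transitive dependencies within the parent. The function name and the dependency mapping are assumptions made for the example; real dependency resolution (alternatives, virtual packages, version constraints) is considerably more involved.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def closed_subset(
    requested: Iterable[str], parent_deps: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str]]:
    """Expand `requested` with everything it transitively depends on.

    `parent_deps` maps each package in the parent archive to the names it
    depends on (hypothetical, simplified input).
    Returns (selected, unresolved): the packages to branch, and any names
    that the parent archive does not provide.
    """
    selected: Set[str] = set()
    unresolved: Set[str] = set()
    queue = deque(requested)
    while queue:
        name = queue.popleft()
        if name in selected or name in unresolved:
            continue  # already visited
        if name not in parent_deps:
            unresolved.add(name)
            continue
        selected.add(name)
        queue.extend(parent_deps[name])
    return selected, unresolved
```

If `unresolved` is empty the selection is closed with respect to the parent; otherwise the tool can warn the developer rather than refuse outright, which matches the "default rather than enforce" option discussed above.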

The branching will be assumed to typically have one parent where the majority of packages are taken from, the "primary parent". Additional parents can be specified where particular packages will be taken from. These secondary parents then become the primary parent for the packages taken from there.
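
The per-package parent lookup described above can be sketched as follows. The function name, the archive names, and the shape of the `secondary_parents` mapping are all hypothetical.

```python
from typing import Dict, Iterable


def primary_parent_for(
    package: str,
    default_parent: str,
    secondary_parents: Dict[str, Iterable[str]],
) -> str:
    """Return the archive that acts as primary parent for `package`.

    `secondary_parents` maps a secondary parent archive to the packages
    taken from it; every other package falls back to `default_parent`.
    """
    for parent, packages in secondary_parents.items():
        if package in packages:
            return parent
    return default_parent
```

For example, with `default_parent="ubuntu"` and `secondary_parents={"debian": ["linux"]}`, the package `linux` would be compared against and synced from `debian`, while everything else would use `ubuntu`.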

The usual operations can be performed on the archive, such as uploading packages, including new ones, and removing packages. In addition there are certain operations that are specific to a derived archive, mainly around managing the difference from the parent archives.

The first part of this is visualising the differences between the archive and its parents. For each package in the archive a base version will be designated, starting as the version that was taken from the parent when the branching was done. It will then be possible to query the archive for the list of packages where the version in the archive is greater than the base version for that package. This indicates that the package has been modified in the archive, and the changes should be evaluated for forwarding to the parent archive. It will also be possible to query for those packages where the version in the primary parent for that package is greater than the base version, and partition the results into two sets according to whether the package has also been modified in the archive itself.
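
The queries above rely on ordering Debian version strings. The sketch below includes a pure-Python comparison written to follow the ordering rules in Debian Policy (epoch, upstream version, revision, with "~" sorting before everything); in practice an existing implementation such as python-apt's `apt_pkg.version_compare` would likely be preferable. The `PackageState` type, `classify`, and the result keys are hypothetical names chosen for the example.

```python
import re
from typing import Dict, NamedTuple, Set, Tuple


def _order(ch: str) -> int:
    # "~" sorts before everything, letters sort before other characters.
    if ch == "~":
        return -1
    return ord(ch) if ch.isalpha() else ord(ch) + 256


def _compare_fragment(a: str, b: str) -> int:
    # Alternate between non-digit runs and digit runs, left to right.
    while a or b:
        na, nb = re.match(r"\D*", a).group(), re.match(r"\D*", b).group()
        a, b = a[len(na):], b[len(nb):]
        for i in range(max(len(na), len(nb))):
            ca = _order(na[i]) if i < len(na) else 0
            cb = _order(nb[i]) if i < len(nb) else 0
            if ca != cb:
                return -1 if ca < cb else 1
        da, db = re.match(r"\d*", a).group(), re.match(r"\d*", b).group()
        a, b = a[len(da):], b[len(db):]
        if int(da or 0) != int(db or 0):
            return -1 if int(da or 0) < int(db or 0) else 1
    return 0


def _split(version: str) -> Tuple[int, str, str]:
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def version_compare(a: str, b: str) -> int:
    """Return a negative, zero, or positive number as a <, ==, > b."""
    (ea, ua, ra), (eb, ub, rb) = _split(a), _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_fragment(ua, ub) or _compare_fragment(ra, rb)


class PackageState(NamedTuple):
    base: str    # version taken from the parent when branching
    child: str   # version currently in the derived archive
    parent: str  # version currently in the primary parent


def classify(states: Dict[str, PackageState]) -> Dict[str, Set[str]]:
    modified = {n for n, s in states.items()
                if version_compare(s.child, s.base) > 0}
    updated = {n for n, s in states.items()
               if version_compare(s.parent, s.base) > 0}
    return {
        "to_forward": modified,         # changed in the child
        "to_sync": updated - modified,  # changed only in the parent
        "to_merge": updated & modified, # changed on both sides
    }
```

The three sets correspond to the three views described above: packages whose changes should be evaluated for forwarding, and the parent-side updates partitioned by whether the child has also modified the package.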

The results of these queries will be the basis of the overview for visualising the differences between an archive and its parents. They will allow developers to drill further into the differences for specific packages as required. Ideally the archive should be able to provide HTML versions of the results of these queries, and also have a facility to snapshot some of this information and then show how things have changed over time.

The second part of managing the difference from the parent archives is to allow modifications to be made in reaction to the current state. There are three things to handle:

  • Pulling in changed packages from the parent when the package is not modified in the child, or when the modifications in the child should be dropped. This is known as "syncing".
  • Merging the changes in the parent with the changes in the child.
  • Submitting changes back to a parent archive.

The second and third will be done external to the archive, based on the output from the queries described above. Merging the changes will result in something that just looks like another upload. Submitting changes back doesn't require any modifications to the archive.

In the case of "syncing", the archive will have a method to copy a package from the parent, replacing the version that it currently has, assuming that the version number is greater in the parent. This should allow any archive to be specified, but will default to the primary parent for that specific package. It can be specified whether to copy binaries, or to not copy them so that they are rebuilt in the context of the child archive. It may be desirable for this to be a per-archive default, as some archives will want the rebuild, while others can skip the extra build to save time.

There should be a pre-canned operation to pull in all packages eligible for syncing, which can be hooked up to cron if desired. This will allow the archive to stay reasonably up to date with less human input during periods of development. The archive may also want to have an option to say whether this operation should be allowed, accessible from the web interface, allowing it to be enabled/disabled without having access to the cron configuration.
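
The single-package sync and the pre-canned bulk operation described above could look roughly like the sketch below. The `Archive` class, its fields, and both function names are hypothetical. The version comparison is passed in as `compare` (for instance python-apt's `apt_pkg.version_compare`), so the sketch makes no assumption about how versions are ordered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Compare = Callable[[str, str], int]  # negative, zero, positive like cmp()


@dataclass
class Archive:
    versions: Dict[str, str] = field(default_factory=dict)  # source -> version
    base: Dict[str, str] = field(default_factory=dict)      # source -> base version
    copy_binaries: bool = False     # per-archive default for syncs
    auto_sync_enabled: bool = True  # switch exposed in the web interface
    pending_builds: List[str] = field(default_factory=list)


def sync_package(child: Archive, parent: Archive, name: str,
                 compare: Compare, copy_binaries: Optional[bool] = None) -> bool:
    """Copy `name` from parent to child if the parent's version is greater."""
    parent_version = parent.versions.get(name)
    if parent_version is None:
        return False
    current = child.versions.get(name)
    if current is not None and compare(parent_version, current) <= 0:
        return False
    child.versions[name] = parent_version
    child.base[name] = parent_version
    use_binaries = child.copy_binaries if copy_binaries is None else copy_binaries
    if not use_binaries:
        child.pending_builds.append(name)  # rebuild in the child's context
    return True


def sync_all(child: Archive, parent: Archive, compare: Compare) -> List[str]:
    """Sync every package that is unmodified in the child; suitable for cron."""
    if not child.auto_sync_enabled:
        return []
    synced = []
    for name, current in sorted(child.versions.items()):
        unmodified = compare(current, child.base.get(name, current)) == 0
        if unmodified and sync_package(child, parent, name, compare):
            synced.append(name)
    return synced
```

Packages that have been modified in the child are skipped by `sync_all`, since they need a merge or an explicit decision to drop the local changes; calling `sync_package` directly expresses that decision.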

As new versions are accepted by the archive it needs to update the base version for the package that is being modified. If the new version is equal to or greater than the version in the primary parent for that package, then the base version should be set to the version in the parent archive. This will ensure that the comparisons for that package will be correct given the new base. This does however require discipline in the use of version numbers. The tools we build can do the right thing for people, and the rules should be documented so that developers know how to follow them when not using the tools. We may want to go as far as to have the archive software be able to enforce rules about the version numbers, such as ensuring that packages that differ from the parent have a certain suffix added to the version number.
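
The base version rule above, and one possible form of the version-number policy, can be sketched as follows. The function names and the example suffix are hypothetical, and `compare` stands for a Debian version comparison supplied by the caller (for instance python-apt's `apt_pkg.version_compare`).

```python
import re
from typing import Callable, Optional

Compare = Callable[[str, str], int]  # negative, zero, positive like cmp()


def updated_base(base: str, new_version: str,
                 parent_version: Optional[str], compare: Compare) -> str:
    """Return the base version to record once `new_version` is accepted.

    If the upload is equal to or greater than the parent's version, the
    parent's version becomes the new base; otherwise the base is unchanged.
    """
    if parent_version is not None and compare(new_version, parent_version) >= 0:
        return parent_version
    return base


def has_required_suffix(new_version: str, parent_version: str, suffix: str) -> bool:
    """Optional policy check: a version that differs from the parent's must
    end in `suffix` followed by a number, e.g. "1.0-1vostok1" (the suffix
    name is only an example)."""
    if new_version == parent_version:
        return True
    return re.search(re.escape(suffix) + r"\d+$", new_version) is not None
```

Keeping the policy check separate from the base update lets an archive owner decide whether to enforce the suffix rule or merely document it.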

Bazaar branches can be used to merge packages where they have been modified in both archives, and to propose changes for merging back. This gives the full power of Bazaar for the merging stages, which is better than trying to merge source packages. It does mean that Bazaar branches will have to be maintained for the derived archives. When uploads are done they can use the Bazaar branches directly, but where e.g. copies are done from the parent archives Bazaar will not necessarily be involved. In these cases the archive will have to update the Bazaar branches itself.

Branching archives will be based on seeds. The archive will store a copy of the seed as it was when it branched, so that it can be compared to the current version of the seed as used by the parent.

The visualisation of the differences between two archives will then include changes to the seeds, and the modifications will allow the seeds to be updated based on that information, and new packages pulled in as appropriate.
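
Comparing the stored copy of a seed with the parent's current version reduces to set differences once the seeds have been parsed into package names. A minimal sketch, with hypothetical function names and with seed parsing left out:

```python
from typing import Dict, Iterable, Set


def seed_changes(stored: Iterable[str], current: Iterable[str]) -> Dict[str, Set[str]]:
    """Compare the seed saved at branch time with the parent's current seed."""
    stored_set, current_set = set(stored), set(current)
    return {
        "added": current_set - stored_set,    # newly seeded in the parent
        "removed": stored_set - current_set,  # dropped from the parent's seed
    }


def packages_to_pull(changes: Dict[str, Set[str]],
                     archive_packages: Iterable[str]) -> Set[str]:
    """Packages added to the parent's seed that the archive does not yet have."""
    return changes["added"] - set(archive_packages)
```

The result of `packages_to_pull` is the list a developer would be offered when updating the stored seed to match the parent.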

The archive should allow the seeds to be accessible and editable, optionally over HTTP too. This is crucial for allowing customisation of images. If a package is added that is not in the repo then the archive should either pull it in from the primary parent, or mark it as a task that needs doing somehow.

Because the archive isn't going to have a build farm built in, there needs to be handling of binary uploads too. The owner can decide to have each source upload accompanied by binaries, and to allow binary-only uploads for other architectures if more than one architecture is desired for the archive. There should be facilities for querying for outstanding builds as well. When there is a source upload without a binary for an architecture that is supported by the archive in question there should be a build record created. There should then be ways to query for outstanding build records, and to fail build records as needed. If there is a binary upload that corresponds to one of the builds then the build record should be closed with reference to the new package.
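
The build record lifecycle described above can be sketched as a small in-memory tracker. The class names, the state strings, and the method names are all hypothetical; a real implementation would keep these records in the archive database.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class BuildRecord:
    source: str
    version: str
    arch: str
    state: str = "pending"        # "pending", "built", or "failed"
    binary: Optional[str] = None  # set when a matching binary arrives


class BuildTracker:
    def __init__(self, architectures: Iterable[str]):
        self.architectures = list(architectures)
        self.records: List[BuildRecord] = []

    def source_uploaded(self, source: str, version: str,
                        binaries_for: Iterable[str] = ()) -> List[BuildRecord]:
        """Create a record for each supported architecture lacking a binary."""
        have = set(binaries_for)
        created = [BuildRecord(source, version, arch)
                   for arch in self.architectures if arch not in have]
        self.records.extend(created)
        return created

    def outstanding(self) -> List[BuildRecord]:
        return [r for r in self.records if r.state == "pending"]

    def _find(self, source: str, version: str, arch: str) -> Optional[BuildRecord]:
        for record in self.outstanding():
            if (record.source, record.version, record.arch) == (source, version, arch):
                return record
        return None

    def binary_uploaded(self, source: str, version: str, arch: str, binary: str) -> bool:
        """Close the matching record with a reference to the new package."""
        record = self._find(source, version, arch)
        if record is None:
            return False
        record.state, record.binary = "built", binary
        return True

    def fail(self, source: str, version: str, arch: str) -> bool:
        record = self._find(source, version, arch)
        if record is None:
            return False
        record.state = "failed"
        return True
```

A source upload accompanied by an armel binary to an archive supporting armel and i386 would leave one outstanding record, for i386, until a binary-only upload closes it or someone marks it failed.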

The service should also have some Bazaar branch hosting facilities, or have ties with another service that can do that, so that the Bazaar branches that are kept up to date with the packages can be used by developers. This hosting service should aim to avoid having all of the data duplicated many times, as that would be very expensive. It should have access control for the branches based on the access control for the associated packages, and so should serve the branches over ssh.

Implementation

We shall create a project named vostok which will have the code needed to achieve this.

We first need a database of the current state of packages, plus the logic to manipulate the files on disk based on that. Something such as reprepro might be a good candidate to base that on. This database will store the current versions of binary and source packages, the links between them, the base versions, and the primary parents. It will also have a way to insert a new upload which does the necessary checks, updates the base version, and then puts the files on disk in the correct place to make an archive suitable for APT.
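
To make the stored data concrete, here is one possible shape for that database, sketched with Python's built-in sqlite3 module. The choice of SQLite, the table and column names, and the helper functions are assumptions made for illustration; they are not taken from reprepro or from any existing implementation.

```python
import sqlite3
from typing import List, Optional, Tuple

SCHEMA = """
CREATE TABLE source_package (
    name           TEXT PRIMARY KEY,
    version        TEXT NOT NULL,
    base_version   TEXT,            -- NULL for packages new to this archive
    primary_parent TEXT             -- NULL when there is no parent
);
CREATE TABLE binary_package (
    name         TEXT NOT NULL,
    architecture TEXT NOT NULL,
    version      TEXT NOT NULL,
    source_name  TEXT NOT NULL REFERENCES source_package(name),
    PRIMARY KEY (name, architecture)
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_source(conn: sqlite3.Connection, name: str, version: str,
                  base_version: Optional[str], primary_parent: Optional[str]) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO source_package VALUES (?, ?, ?, ?)",
        (name, version, base_version, primary_parent),
    )


def record_binary(conn: sqlite3.Connection, name: str, architecture: str,
                  version: str, source_name: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO binary_package VALUES (?, ?, ?, ?)",
        (name, architecture, version, source_name),
    )


def binaries_of(conn: sqlite3.Connection, source_name: str) -> List[Tuple[str, str, str]]:
    """Follow the link from a source package to the binaries built from it."""
    rows = conn.execute(
        "SELECT name, architecture, version FROM binary_package "
        "WHERE source_name = ? ORDER BY name, architecture",
        (source_name,),
    )
    return rows.fetchall()
```

Publication would then be a separate step that reads these tables and writes the Sources and Packages files to disk.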

We then need an HTTP interface built on top of that database, allowing for displaying HTML reports, and for some modifications. It should also have an HTTP API, secured with OAuth. It should have a concept of users and permissions, and support OpenID to delegate these to Launchpad or another system if desired.

We also need an sftp server to receive uploads and pass them on to the core for processing.

Bazaar hosting should be tied in, with an ssh server for authenticated access, and possibly loggerhead for browsing branches. The access controls for branches should be the same as the access controls for the related packages.

UI

A very basic mockup of the HTTP view of the differences of this archive from a single parent.

overview.png

NB: the last entry should have a package name other than "bzr".

It presents an up to date view of the situation, and the developer can drill down to look at the differences for a single package in more detail.

In addition there is a "sync" button to pull in changes from the parent, if the current user has permission to do that.

Further extensions would be to show merge diffs for the last entry, and a way to mark changes as unwanted to get them off the default view until another upload was done.

In addition this needs to show new packages that have been added locally as eligible for submission back, and new packages added to a relevant seed in the parent as eligible. Further to that, new dependencies should be shown as well.

There is a lot more work that needs to be done on the UI design, so this is not final.

Code

We will create a library for maintaining information about an archive in a database, and publishing the results to disk. It shouldn't have to know anything about HTTP requests and the like. Doing it in a library means that we can include it in the client-side tools for managing developer archives if desired.

We should first evaluate reprepro and other similar tools to assess their suitability for reuse. As derivation isn't supported in many tools yet it is unlikely that any of them have the concept, and so an assessment will be made as to whether it is better to extend one of those tools or to implement something that does understand the concept of derivation.

The second part of the task will be to implement a Django-based frontend to this library to provide the HTTP parts. It should first provide HTML views of the data, and then allow some operations to be performed from those views. In addition it should have an OAuth-secured HTTP API for interrogating and manipulating the data.

Lastly, an sftp server should be available, which can be run, if desired, to receive package uploads to the archive.

Migration

No existing data needs to be migrated, but we need to consider Ubuntu as a target, which may require something akin to a migration, see "Unresolved issues."

Test/Demo Plan

We will have a number of milestone test cases:

  • Having basic archive management working and able to accept uploads.
  • Ability to branch a subset of Ubuntu in to the new archive management software.
  • Ability to view differences between Ubuntu and a derivative using the web interface.
  • Ability to do the same via the API.
  • Ability to trigger syncs using the web interface.
  • Ability to do the same via the API.

Unresolved issues

  • How do we prevent having to copy GBs of bzr data around when branching?
  • Do we care about parents going offline? We could have situations where that left you high and dry.
  • How do we cover Ubuntu as a target with this? It won't use our custom software, so we will have to do more work in the client. Do we want to store some of the data in an instance of vostok and then have the client use both to do things?
  • Should the branching be "create an archive and request that it branch", or "request an archive and stuff it full of the stuff that it needs to be a branch"?
  • Should the client tools be able to create a new archive instance, or can they only be pointed at an existing server?



Specs/M/ARMArchiveBranching (last edited 2010-05-28 13:54:31 by 74)