apt-sync

Summary

Succinct is based on rsync/zsync, extended specifically to handle .deb files. This makes it possible to update existing Debian packages without downloading redundant data that is common to both the original and the updated package. An alternative solution would be to use dedicated patch files. An advantage of Succinct is that separate patch files are not necessary, since this method automatically identifies and downloads only those portions of a .deb package that differ from the version already available locally. Computing which blocks of a .deb file have changed is handled entirely on the client side, so the server does not have to deal with the overhead of creating patch files on a per-client basis.

Further information about Succinct can be found at:

For discussion on the patch approach, see:

This project is part of Google Summer of Code 2006, and is supervised by Michael Vogt.

Rationale

To save bandwidth when distributing updates.

Use cases

Sandy Slowspeed wants to apply the latest security updates for Dapper Drake. Upon realizing that it would take over 12 hours to download the 300 MB required on a 56k modem, Sandy decides not to update after all.

Scope

Design

This project is based on the rsync algorithm as it is implemented by zsync (with all computation done on the client side). It is extended to work effectively on the .deb package format.

The rsync algorithm is used to identify which blocks of data are the same in the Debian package already available locally and in the updated package on a remote server. The identification works by calculating checksums over blocks of data within the two files and identifying common sequences. A detailed explanation of the algorithm used by rsync can be found at:
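As an illustration of the block matching described above, here is a minimal sketch of an rsync-style rolling checksum. The tiny block size, the function names and the use of MD5 as the strong hash are assumptions made for the example; they are not taken from rsync or zsync source code.

```python
import hashlib

BLOCK = 8  # tiny block size so the example is easy to follow (an assumption)

def weak_sum(data: bytes):
    """rsync-style weak checksum: a = sum of bytes, b = sum of the running sums."""
    a = b = 0
    for byte in data:
        a = (a + byte) & 0xFFFF
        b = (b + a) & 0xFFFF
    return a, b

def roll(a, b, out_byte, in_byte, size):
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - size * out_byte + a) & 0xFFFF
    return a, b

def find_matches(old: bytes, new: bytes, size: int = BLOCK) -> dict:
    """Map block numbers of `new` to an offset in `old` holding the same bytes."""
    index = {}
    for i in range(0, len(new) - size + 1, size):
        block = new[i:i + size]
        entry = (hashlib.md5(block).digest(), i // size)
        index.setdefault(weak_sum(block), []).append(entry)
    matches = {}
    if len(old) < size:
        return matches
    a, b = weak_sum(old[:size])
    for off in range(len(old) - size + 1):
        # The cheap weak checksum filters candidates; the strong hash confirms them.
        for strong, block_no in index.get((a, b), []):
            if hashlib.md5(old[off:off + size]).digest() == strong:
                matches.setdefault(block_no, off)
        if off + size < len(old):
            a, b = roll(a, b, old[off], old[off + size], size)
    return matches

old = b"xx" + b"abcdefgh" + b"ijklmnop" + b"qrstuvwx"
new = b"abcdefgh" + b"NEWBLOCK" + b"qrstuvwx"
print(find_matches(old, new))  # blocks of `new` that can be copied from `old`
```

Because the weak checksum can be updated one byte at a time, matching blocks are found even when they sit at a different offset in the old file, as the two-byte prefix in `old` shows.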

The standard rsync implementation performs the block matching on the server side for every client, and a dedicated rsync server must run on the server. The high server-side overhead can make it unsuitable for widespread distribution of files. zsync is an alternative implementation based on the rsync algorithm that does all comparisons on the client side. To achieve this, the server calculates the checksums of all blocks in the file to be distributed once and stores them in a separate file. This file contains everything a client needs to work out which data it already has and which data it must download, so the client can do all the calculations. Another advantage of this approach is that no dedicated rsync server needs to run on the server, as HTTP can be used for all data transfer. Further explanation of the zsync program can be found at:
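The division of labour described above can be sketched as follows. This is a simplified illustration, not the zsync file format: `make_control` and `plan_download` are hypothetical names, the checksum file is modelled as a plain dictionary, and the client scans its local file naively instead of using a rolling checksum.

```python
import hashlib

def make_control(target: bytes, size: int) -> dict:
    """Server side, done once per published file: record one hash per block."""
    return {
        "size": size,
        "length": len(target),
        "hashes": [hashlib.md5(target[i:i + size]).hexdigest()
                   for i in range(0, len(target), size)],
    }

def plan_download(control: dict, local: bytes) -> list:
    """Client side: return merged (start, end) byte ranges still to be fetched.

    `end` is exclusive. A short final block never matches a full-size local
    window here, so it is always scheduled for download.
    """
    size = control["size"]
    # Hash every window of the local file (naive; fine for a small example).
    have = {hashlib.md5(local[i:i + size]).hexdigest()
            for i in range(max(len(local) - size + 1, 0))}
    ranges = []
    for n, digest in enumerate(control["hashes"]):
        if digest in have:
            continue
        start = n * size
        end = min(start + size, control["length"])
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)  # extend the previous range
        else:
            ranges.append((start, end))
    return ranges

target = b"abcdefgh" + b"NEWBLOCK" + b"MOREDATA" + b"qrstuvwx"
local = b"xx" + b"abcdefgh" + b"ijklmnop" + b"qrstuvwx"
print(plan_download(make_control(target, 8), local))
```

The server never sees the client's file: it only publishes the output of `make_control`, and each returned range can be requested over plain HTTP.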

The rsync algorithm can be ineffective when sending compressed data, since compression can radically alter common sequences of data within a file. Debian packages contain two gzipped tar files, one for data and another for control information. Because of the compression, rsync can be ineffective if applied directly to Debian packages. The Succinct project will investigate ways of finding common sequences of data between two Debian packages despite the compression, and implement the best solution in a program based on zsync. Loosely speaking, zsync will be extended to "understand" Debian packages.
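The effect of compression on block matching can be observed with a small experiment. The sketch below uses Python's zlib module (the same deflate compression that gzip uses) on synthetic data; the word list, block size and helper name are assumptions chosen for the example, and real packages will behave differently in detail.

```python
import random
import zlib

def shared_fraction(a: bytes, b: bytes, size: int = 64) -> float:
    """Fraction of aligned `size`-byte blocks that are identical in a and b."""
    total = (max(len(a), len(b)) + size - 1) // size
    same = sum(1 for i in range(0, min(len(a), len(b)), size)
               if a[i:i + size] == b[i:i + size])
    return same / total

rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
old = b" ".join(rng.choice(words) for _ in range(5000))
new = b"OMEGA" + old[5:]  # same length, only the first five bytes differ

raw = shared_fraction(old, new)
packed = shared_fraction(zlib.compress(old), zlib.compress(new))
print(f"uncompressed: {raw:.1%} of blocks shared, compressed: {packed:.1%}")
```

A change confined to one block of the uncompressed data leaves every other block reusable, whereas the compressed streams share a smaller fraction of their blocks.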

Further, a patch to APT will be provided so that it is able to use Succinct when downloading updates from a remote server.

Sample Data

The table below shows the performance of zsync when transferring the data.tar.gz file from some Debian packages when it is uncompressed, compressed with gzip, and compressed with gzip using the --rsyncable option. Note that by default zsync automatically looks inside gzip-compressed files; this can be prevented with the -Z option (which treats the gzipped file as normal binary data). In the table below, the performance on compressed data is shown both with and without -Z.

Using the automatic zsync look-inside for standard gzip files consistently produces the best results (the least data needs to be fetched). The problem with it is that zsync cannot guarantee that exactly the same file will be reproduced (such that checksums match).

The table below shows the amount of data downloaded for some more packages as a percentage of the original file size. The first column for each package refers to the amount downloaded by zsync for the uncompressed data (data.tar). The second and third columns show the amount downloaded by zsync with the -Z option (so that it does not look inside the compressed data) for gzip and gzip --rsyncable compressed data (data.tar.gz) respectively. The fourth and fifth columns show the amount downloaded by zsync when looking inside the compressed data for gzip and gzip --rsyncable compressed data respectively.

Columns for each package, in order: tar; -Z tar.gz; -Z --rsyncable tar.gz; tar.gz; --rsyncable tar.gz

http://homepages.inf.ed.ac.uk/s0343894/test-data.png

Implementation

Code

Code will be written in Python where possible/sensible.

Data preservation and migration

Outstanding issues

  • How best to identify common segments of data between two gzipped files. A "look-inside" mechanism could be used, or a --zsync option could be added to gzip.
  • APT must be able to validate an updated .deb package, so the MD5 sum of the rebuilt package must match that of the original package. Another (deprecated) option would be to modify APT to handle different MD5 sums.

BoF agenda and discussion

I think it would be a good thing to fetch Packages.gz using delta compression too. This is already done in the new version of apt-get in Debian.

rsync:

zsync:

Discussions on the debian-devel list:

Some thoughts from Martin Pool (of rsync):

Bite-size pieces (sladen)

Generation

  1. Patch dpkg-deb to use external gzip callout

  2. Add a flag to dpkg-deb to build with --rsyncable

  3. Fork zsync and call it apt-debsync or something

  4. Allow apt-debsync to have an 'offset' field

  5. Get dpkg to pass the offset of the zsync from the start of the resultant .deb

  6. Use the packages listed in Packagename:, Depends:, Provides: and Replaces: and search the files inside of these (dpkg -L) to get a list of search targets.

Reassembly

15:45 <mvo> that would mean we would have to put the offset inside the packages file?
15:45 <mvo> an interesting idea!
  1. Make apt-debsync do a download of 0..offset and (offset+length)..filesize. These are the header and footer of the .deb.

  2. Try to assemble the .tar from local files and downloaded pieces;
  3. Gzip the .tar

  4. Top-and-tail it with the two downloaded chunks.
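The header and footer download in the first reassembly step maps naturally onto HTTP range requests. A minimal sketch, assuming the offset and length of the compressed payload are already known (for example from the package metadata); `header_footer_ranges` is a hypothetical helper name. Note that HTTP byte ranges are inclusive at both ends.

```python
def header_footer_ranges(offset: int, length: int, filesize: int) -> list:
    """Return HTTP Range header values for the bytes before and after the payload."""
    if offset < 0 or length < 0 or offset + length > filesize:
        raise ValueError("payload does not fit inside the file")
    ranges = []
    if offset > 0:
        # Header of the .deb: bytes 0 up to, but not including, `offset`.
        ranges.append(f"bytes=0-{offset - 1}")
    if offset + length < filesize:
        # Footer of the .deb: everything after the payload.
        ranges.append(f"bytes={offset + length}-{filesize - 1}")
    return ranges

print(header_footer_ranges(132, 1000, 1200))
```

When the payload spans the whole file, as it would for Packages.gz, the function returns an empty list and nothing extra needs to be fetched.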

Optional

  1. Add support for '.bz2' (actually really easy compared to the above, see the source code for bzip2recover).

  2. Make debsync use a text file format and gzip the resultant digest-file
  3. Make debsync use a longer checksum.

Meta-data

Apt-sync: version digest-file offset

Apt-sync: 0.1 packagename_version.deb.apt-sync 1234

Possibly include an offset and length of the digest-file so that the digest can be added to the end of the .deb!
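A client could read the proposed field along these lines. This is only a sketch of the three-value form shown above (version, digest-file, offset); the field is a proposal, and the `AptSyncField` type and `parse_apt_sync` function are hypothetical names introduced for the example.

```python
from typing import NamedTuple

class AptSyncField(NamedTuple):
    version: str
    digest_file: str
    offset: int

def parse_apt_sync(line: str) -> AptSyncField:
    """Parse a line of the form 'Apt-sync: version digest-file offset'."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "apt-sync":
        raise ValueError(f"not an Apt-sync field: {line!r}")
    parts = value.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 values, got {len(parts)}: {line!r}")
    version, digest_file, offset = parts
    return AptSyncField(version, digest_file, int(offset))

print(parse_apt_sync("Apt-sync: 0.1 packagename_version.deb.apt-sync 1234"))
```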

Bite sized pieces (mvo)

This should be equivalent to the version from sladen, but worded differently. Sladen, please confirm.

apt-sync

General: 
- Packages.gz and file.deb files should be aptsync-able
- have a .aptsync file for each deb package in the pool

Generate:
1) patch dpkg-deb to have an option to build the debs with --rsyncable
 - dpkg-deb currently does its own chunking every 8kB, so the result is not the same stream as calling out to gzip
 - zsync calls out to 'gzip' to compress the stream
 - dpkg-deb needs to have an option to call out to 'gzip' so that the results are known to be the same.
2) extract the data.tar.gz from the deb
3) construct a .aptsync file from the data.tar.gz; it is a zsync file, but it needs additional information such as the (offset, len, checksum) of the data.tar.gz inside the deb and possibly the contents of the tarball
 - if we didn't want to make /any/ changes to zsync, then the offset and length could be stored in Packages.
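The offset and length of data.tar.gz inside a .deb can be read from the ar container that wraps the package. The sketch below walks the common 60-byte ar member headers; `find_member` and the `_member` test helper are hypothetical names, and long-name extensions of the ar format are not handled.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_LEN = 60

def find_member(deb: bytes, wanted: str):
    """Return (offset, length) of the named member's data within an ar archive."""
    if not deb.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    pos = len(AR_MAGIC)
    while pos + HEADER_LEN <= len(deb):
        header = deb[pos:pos + HEADER_LEN]
        name = header[0:16].decode("ascii").rstrip(" /")
        size = int(header[48:58].decode("ascii").strip())
        data_start = pos + HEADER_LEN
        if name == wanted:
            return data_start, size
        # Member data is padded to an even length.
        pos = data_start + size + (size % 2)
    raise KeyError(wanted)

def _member(name: str, data: bytes) -> bytes:
    """Build one ar member (illustration only: zeroed mtime, uid and gid)."""
    header = f"{name:<16}{0:<12}{0:<6}{0:<6}{'100644':<8}{len(data):<10}`\n"
    return header.encode("ascii") + data + (b"\n" if len(data) % 2 else b"")

deb = (AR_MAGIC + _member("debian-binary", b"2.0\n")
       + _member("control.tar.gz", b"CTRL!") + _member("data.tar.gz", b"PAYLOAD"))
offset, length = find_member(deb, "data.tar.gz")
print(offset, length)
```

The pair returned here is the kind of value that could be stored in the .aptsync file or, alternatively, in Packages.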

Assemble:
1) get the .aptsync file for the deb
2) If there is an existing .deb in /var/cache/apt/archives, take the existing deb, extract and uncompress it to get 'data.tar', and put this in /tmp.
3) Assemble a list of 'likely files' on the local disk to hunt for chunks.  This includes the output of 'dpkg -L' for the previous version of the package and for those listed in 'Provides:/Replaces:'.  If the previous step found a 'data.tar', add this to the list.
4) let apt-debsync (aka zsync) construct a new 'data.tar' by searching all the likely files and downloading any pieces not found locally from the central mirror.
5) Compress this new 'data.tar' using a callout to either gzip or bzip2 (zsync can do this for us).
6) Make sure the checksum matches for the new 'data.tar.gz'
7) Unconditionally download the 'top' and 'tail' of the '.deb' from the mirror.  In the case of 'Packages.gz' the top and tail will be zero in length.
8) Re-assemble the .deb by cat'ing the 'top', new 'data.tar.gz' and 'tail' together
9) Ensure the checksum of the result matches, or else just download the whole file by HTTP.
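The final re-assembly and verification steps could look roughly like this. It is a sketch under stated assumptions: the pieces are held in memory, the expected MD5 sum comes from the Packages index, and `fetch_whole` is a hypothetical callback that performs the ordinary full HTTP download.

```python
import hashlib
from typing import Callable

def reassemble_deb(top: bytes, data_tar_gz: bytes, tail: bytes,
                   expected_md5: str, fetch_whole: Callable[[], bytes]) -> bytes:
    """Concatenate the pieces; fall back to a full download on a checksum mismatch."""
    candidate = top + data_tar_gz + tail
    if hashlib.md5(candidate).hexdigest() == expected_md5:
        return candidate
    # The rebuilt file is not bit-identical, so use the plain download instead.
    return fetch_whole()

original = b"HEAD" + b"PAYLOAD" + b"TAIL"
md5 = hashlib.md5(original).hexdigest()
rebuilt = reassemble_deb(b"HEAD", b"PAYLOAD", b"TAIL", md5, lambda: original)
print(rebuilt == original)
```

Keeping the fallback inside the same function means a failed reconstruction costs bandwidth but never leaves APT with a package it cannot validate.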

Comments

* As has been discussed previously, for gz files, --rsyncable does not help zsync. Since the zsync look-inside does not currently appear to support gz files embedded in ar/deb files, --rsyncable may be enabled as a hack to get things working quickly; modifying zsync to support deb files properly is possible and is a better long-term solution. --JohnMccabeDansted

* How can we measure the compression of --zsyncable? Isn't --zsyncable a proposed implementation of 7z? I think we mean --rsyncable. --JohnMccabeDansted


CategorySpec

apt-sync (last edited 2008-08-06 16:39:31 by localhost)