OpenBSD Journal

Faster snapshots packages synch

Contributed by jj on from the difference engine dept.

Marc Espie (espie@) wrote to tech@:
I've just committed changes to pkg_create that will help mirrors synch by using much less bandwidth.
I just ran a final test.
Rsynching a full amd64 snapshot now says something like:
sent 7,315,796,510 bytes received 40,292,721 bytes 4,517,095.01 bytes/sec
total size is 28,752,806,019 speedup is 3.91
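
As a sanity check on these figures: the speedup rsync prints is the total size of the files divided by the bytes that actually crossed the wire (sent plus received). A minimal Python sketch using only the numbers from the output above; the function name is purely illustrative:

```python
def rsync_speedup(sent: int, received: int, total_size: int) -> float:
    """Ratio of total file size to bytes actually transferred."""
    return total_size / (sent + received)


# Figures copied from the rsync summary above.
sent = 7_315_796_510
received = 40_292_721
total_size = 28_752_806_019

speedup = rsync_speedup(sent, received, total_size)
fraction = (sent + received) / total_size

print(f"speedup is {speedup:.2f}")                     # speedup is 3.91
print(f"transferred {fraction:.1%} of the snapshot")   # transferred 25.6% of the snapshot
```

In other words, roughly a quarter of the snapshot had to be transferred in this test.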

A few months ago, after the "reorder files in packages", Stuart Henderson commented that this would not help mirrors, but just the end user, which got me thinking...

(Reminder: archives are compressed files. rsync does not peek inside compressed data, so its comparison algorithms work poorly on them: the first differing byte changes everything after it in the archive, so compressed files get no speed-up.)

I looked at the --rsyncable patch for zlib/gzip, and talked it over with sthen@ and millert@, but pretty soon we discarded that idea. That patch is brittle (every zlib version has got its own flavor of it, with wild differences) and a nightmare to maintain. Plus it won't work at all with other compression formats.

The solution was low-tech: simply cut the archive into more gzip chunks (signatures already split the package into two parts, so we know the tools work). I chose 16 files as a simple guideline to experiment with. There were still some discrepancies, such as tar timestamp metadata, which is why timestamps migrated to the plist a few weeks ago (side-effect: the tarball effectively says everything dates back to the epoch... not so bad).
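
The chunking idea can be illustrated with a short Python sketch (Python 3.8 or later, for the mtime argument). This is not how pkg_create does it: the fixed-size split, the helper name chunked_gzip, and the synthetic data are all assumptions made for the demo. It only shows the two properties the approach relies on: concatenated gzip members still form one valid gzip file, and a change confined to one chunk leaves the other compressed members byte-for-byte identical.

```python
import gzip


def chunked_gzip(data: bytes, n_chunks: int = 16) -> list:
    """Compress data as up to n_chunks independent gzip members (mtime zeroed)."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, never zero
    return [gzip.compress(data[i:i + size], mtime=0)
            for i in range(0, len(data), size)]


# Synthetic stand-in for package contents: 1 MiB, so 16 chunks of 64 KiB.
old = bytes(range(256)) * 4096
# Same length, but the first 8 bytes differ (only the first chunk changes).
new = b"CHANGED!" + old[8:]

old_members = chunked_gzip(old)
new_members = chunked_gzip(new)

# Concatenated members decompress as a single stream.
assert gzip.decompress(b"".join(new_members)) == new

unchanged = sum(a == b for a, b in zip(old_members, new_members))
print(f"{unchanged} of {len(new_members)} gzip members are byte-identical")
# 15 of 16 gzip members are byte-identical
```

The demo replaces bytes without changing the length. A real tool needs chunk boundaries that stay stable when content grows or shrinks, and the announcement does not describe how pkg_create chooses them.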

I was pleasantly surprised: the size increase is minimal (very much under 1%).

I also whacked the gzip timestamps, which don't serve any useful purpose either, especially since the plist signature also contains a timestamp (and that one is signed, so it's way more trustworthy).
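
To see why the gzip timestamp matters for byte-level comparison: the gzip header stores a 4-byte little-endian MTIME field at offset 4, so compressing identical data at two different times yields two different files. A small Python sketch (Python 3.8 or later); the helper name gzip_mtime and the payload are illustrative only:

```python
import gzip
import struct


def gzip_mtime(blob: bytes) -> int:
    """Read the 4-byte little-endian MTIME field at offset 4 of a gzip header."""
    return struct.unpack("<I", blob[4:8])[0]


payload = b"same payload, compressed twice\n" * 100

a = gzip.compress(payload, mtime=1_000_000)
b = gzip.compress(payload, mtime=2_000_000)
zeroed = gzip.compress(payload, mtime=0)

print(gzip_mtime(a), gzip_mtime(b), gzip_mtime(zeroed))   # 1000000 2000000 0
# The two files differ only in the MTIME bytes.
print("identical apart from MTIME:", a[:4] == b[:4] and a[8:] == b[8:])   # True
```

With the field zeroed, identical input produces identical output no matter when the package was built.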

Obviously, the first snapshot out will still copy everything. But from the second one, mirror owners should see a difference.

To benefit:
- mirror owners must now use rsync algorithms. Turn off -W / --whole-file if you were using it.
- turn on -y / --fuzzy, as this will "track" minor package version changes.
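
For mirror scripts that assemble their rsync invocation programmatically, the two recommendations above might look like the following Python sketch. The source URL, destination path, function name, and the use of -a are assumptions for illustration, not part of the announcement. For a remote transfer, simply leaving out -W is enough; --no-whole-file just makes the intent explicit.

```python
import shlex


def mirror_rsync_command(source: str, dest: str) -> list:
    """Build an rsync argument list that keeps the delta algorithm enabled."""
    return [
        "rsync",
        "-a",                # assumption: archive mode, as many mirror scripts use
        "--no-whole-file",   # explicitly undo -W so deltas are computed
        "--fuzzy",           # same as -y: use a similarly named file as the basis
        source,
        dest,
    ]


cmd = mirror_rsync_command(
    "rsync://mirror.example.org/OpenBSD/snapshots/packages/amd64/",  # hypothetical
    "/var/www/pub/OpenBSD/snapshots/packages/amd64/",                # hypothetical
)
print(shlex.join(cmd))  # shown for review; pass cmd to subprocess.run to execute
```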

Note that this only applies to the "package snapshots" part of OpenBSD.

My test was a bit extreme: I built two snaps with the exact same ports tree, so the similarities are maximal. Nevertheless, there are lots of *huge* packages in the ports tree, so I expect the bandwidth gain to be very significant anyway, especially for fast architectures which turn out one snapshot a week or more. For example, I expect bandwidth use to be more than halved.



Comments
  1. By journeysquid (Tor) on http://www.bsdnow.tv/

    I'm excited to see that bzip2/xz packages are potentially in the future.

    Comments
    1. By Anonymous Coward (91.154.65.231) on

      > I'm excited to see that bzip2/xz packages are potentially in the future.

      Bzip2, old and cranky, is kind of a "worst of both worlds" compression scheme, if you compare it to fast ones (like gzip) or slow ones with higher compression ratio (like lzma). Bzip2 is slow and doesn't compress particularly well for its resource usage. Xz might be an improvement. On the other hand, gzip is cool because it's small and fast and works everywhere.

      If you're on a reasonably fast connection but using old or low-power hardware, these "better" compression programs can actually slow you down.

      Comments
      1. By journeysquid (Tor) on http://www.bsdnow.tv/

        > Bzip2, old and cranky, is kind of a "worst of both worlds" compression scheme, if you compare it to fast ones (like gzip) or slow ones with higher compression ratio (like lzma). Bzip2 is slow and doesn't compress particularly well for its resource usage. Xz might be an improvement. On the other hand, gzip is cool because it's small and fast and works everywhere.

        You are absolutely correct. While I copy-pasted the wording from the announcement, my interest is in the xz format specifically for its a.) impressive size reduction and b.) decompression speed.

    2. By BSDfan (193.200.119.132) on

      > I'm excited to see that bzip2/xz packages are potentially in the future.

      First we should have -J (--xz) switch in tar in base...

      Comments
      1. By Marc Espie (espie) on

        > > I'm excited to see that bzip2/xz packages are potentially in the future.
        >
        > First we should have -J (--xz) switch in tar in base...

        That's actually incorrect.
        The pkgtools use the perl libraries directly without tar.

        For instance, gunzipping happens thru IO::Uncompress::AnyUncompress;

        to support xz, it would require IO::Uncompress::UnXz, which depends
        on raw lzma, which depends on liblzma...

        Considering that's GPLv2+, it's a bit unlikely to happen in base.

        But at least, there's no technical impossibility in the tools...

        Comments
        1. By Renaud Allard (renaud) on

          > > > I'm excited to see that bzip2/xz packages are potentially in the future.
          > >
          > > First we should have -J (--xz) switch in tar in base...
          >
          > That's actually incorrect.
          > The pkgtools use the perl libraries directly without tar.
          >
          > For instance, gunzipping happens thru IO::Uncompress::AnyUncompress;
          >
          > to support xz, it would require IO::Uncompress::UnXz, which depends
          > on raw lzma, which depends on liblzma...
          >
          > Considering that's GPLv2+, it's a bit unlikely to happen in base.
          >
          > But at least, there's no technical impossibility in the tools...
          >
          >

          Sorry Marc, I was curious, so I just checked, not implying that you should do it of course.

          It seems liblzma is in the public domain. http://tukaani.org/xz/

          IO::Uncompress::UnXz and IO::Uncompress::UnLzma are under the same license as perl (which is in base, so I expect it to be OK) http://search.cpan.org/~pmqs/IO-Compress-Lzma-2.066/lib/IO/Uncompress/UnXz.pm http://search.cpan.org/~pmqs/IO-Compress-Lzma-2.066/lib/IO/Uncompress/UnLzma.pm
          And it seems the perl dependencies for those are also licensed the same way as perl.


        2. By BSDfan (193.200.119.132) on

          > > > I'm excited to see that bzip2/xz packages are potentially in the future.
          > >
          > > First we should have -J (--xz) switch in tar in base...
          >
          > That's actually incorrect.
          > The pkgtools use the perl libraries directly without tar.
          >
          > For instance, gunzipping happens thru IO::Uncompress::AnyUncompress;
          >
          > to support xz, it would require IO::Uncompress::UnXz, which depends
          > on raw lzma, which depends on liblzma...
          >
          > Considering that's GPLv2+, it's a bit unlikely to happen in base.
          >
          > But at least, there's no technical impossibility in the tools...

          Thanks for the reply, I understand.

          But what I meant is that if I want to extract a tar.xz archive, I have to install gtar from ports instead of simply using tar from base with the -J switch.
          I am not asking for the lzma library in base (though that would of course be nice if the license allowed it). I would like to be able to work with tar.xz archives without having to install the additional gtar package.
          For example, bzip2 isn't in base either and you have to install it from ports, but tar in base has the -j switch for bzip2 support.

          Comments
          1. By Renaud Allard (renaud) on


            > But what I meant is that if I want to extract a tar.xz archive, I have to install gtar from ports instead of simply using tar from base with the -J switch.
            > I am not asking for the lzma library in base (though that would of course be nice if the license allowed it). I would like to be able to work with tar.xz archives without having to install the additional gtar package.
            > For example, bzip2 isn't in base either and you have to install it from ports, but tar in base has the -j switch for bzip2 support.

            gunzip -c file.tar.gz | tar xvf -
            bunzip2 -c file.tar.bz2 | tar xvf -
            unxz -c file.tar.xz | tar xvf -

            Seem pretty standardized to me...

            Comments
            1. By BSDfan (193.200.119.132) on

              >
              > > But what I meant is that if I want to extract a tar.xz archive, I have to install gtar from ports instead of simply using tar from base with the -J switch.
              > > I am not asking for the lzma library in base (though that would of course be nice if the license allowed it). I would like to be able to work with tar.xz archives without having to install the additional gtar package.
              > > For example, bzip2 isn't in base either and you have to install it from ports, but tar in base has the -j switch for bzip2 support.
              >
              > gunzip -c file.tar.gz | tar xvf -
              > bunzip2 -c file.tar.bz2 | tar xvf -
              > unxz -c file.tar.xz | tar xvf -
              >
              > Seem pretty standardized to me...

              I didn't know this, thanks.
              However, I still prefer the shortest form, which is tar -xJpf archive.tar.xz ...

  2. By Marc Espie (espie) espie@nerim.net on

    Actual test on a newer "real snap" (about a week apart from the previous one):

    sent 7,502,610,665 bytes received 40,313,605 bytes 4,561,792.72 bytes/sec
    total size is 28,788,092,671 speedup is 3.82


    So, numbers are still very good... Admittedly, there is no large update in there, just business as usual...
