0pointer.net
Introducing casync
In the past months I have been working on a new project: casync. casync takes inspiration from the popular rsync file synchronization tool as well as the probably even more popular git revision control system. It combines the idea of the rsync algorithm with the idea of git-style content-addressable file systems, and creates a new system for efficiently storing and delivering file system images, optimized for high-frequency update cycles over the Internet. Its current focus is on delivering IoT, container, VM, application, portable service or OS images, but I hope to extend it later in a generic fashion to become useful for backups and home directory synchronization as well (but more about that later).
The basic technological building blocks casync is built from are neither new nor particularly innovative (at least not anymore), however the way casync combines them is different from existing tools, and that's what makes it useful for a variety of usecases that other tools can't cover that well.
Why?
I created casync after studying how today's popular tools store and deliver file system images. To very incomprehensively and briefly name a few: Docker has a layered tarball approach, OSTree serves the individual files directly via HTTP and maintains packed deltas to speed up updates, while other systems operate on the block layer and place raw squashfs images (or other archival file systems, such as IS09660) for download on HTTP shares (in the better cases combined with zsync data).
Neither of these approaches appeared fully convincing to me when used in high-frequency update cycle systems. In such systems, it is important to optimize towards a couple of goals:
Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
Put boundaries on disk space usage on clients
Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
Simplicity to use for users, repository administrators and developers
I don't think any of the tools mentioned above are really good on more than a small subset of these points.
Specifically: Docker's layered tarball approach dumps the "delta" question onto the feet of the image creators: the best way to make your image downloads minimal is basing your work on an existing image clients might already have, and inherit its resources, maintaing full history. Here, revision control (a tool for the developer) is intermingled with update management (a concept for optimizing production delivery). As container histories grow individual deltas are likely to stay small, but on the other hand a brand-new deployment usually requires downloading the full history onto the deployment system, even though there's no use for it there, and likely requires substantially more disk space and download sizes.
OSTree's serving of individual files is unfriendly to CDNs (as many small files in file trees cause an explosion of HTTP GET requests). To counter that OSTree supports placing pre-calculated delta images between selected revisions on the delivery servers, which means a certain amount of revision management, that leaks into the clients.
Delivering direct squashfs (or other file system) images is almost beautifully simple, but of course means every update requires a full download of the newest image, which is both bad for disk usage and generated traffic. Enhancing it with zsync makes this a much better option, as it can reduce generated traffic substantially at very little cost of history/metadata (no explicit deltas between a large number of versions need to be prepared server side). On the other hand server requirements in disk space and functionality (HTTP Range requests) are minus points for the usecase I am interested in.
(Note: all the mentioned systems have great properties, and it's not my intention to badmouth them. They only point I am trying to make is that for the use case I care about — file system image delivery with high high frequency update-cycles — each system comes with certain drawbacks.)
DISCUSS