About Ugarit

What's content-addressible storage?

Traditional backup systems work by storing copies of your files somewhere. Perhaps they go onto tapes, or perhaps they're in archive files written to disk. They will either be full dumps, containing a complete copy of your files, or incrementals or differentials, which only contain files that have been modified since some point. This saves making repeated copies of unchanging files, but it means that to do a full restore, you need to start by extracting the last full dump then applying one or more incrementials, or the latest differential, to get the latest state.

Not only do differentials and incrementals let you save space, they also give you a history - you can restore up to a previous point in time, which is invaluable if the file you want to restore was deleted a few backup cycles ago!

This technology was developed when the best storage technology for backups was magnetic tape, because each dump is written sequentially (and restores are largely sequential, unless you're skipping bits to pull out specific files).

However, these days, random-access media such as magnetic disks and SSDs are cheap enough to compete with magnetic tape for long-term bulk storage (especially when one considers the cost of a tape drive or two). And having fast random access means we can take advantage of different storage techniques.

A content-addressible store is a key-value store, except that the keys are always computed from the values. When a given object is stored, it is hashed, and the hash used as the key. This means you can never store the same object twice; the second time you'll get the same hash, see the object is already present, and re-use the existing copy. Therefore, you get deduplication of your data for free.

But, I hear you ask, how do you find things again, if you can't choose the keys?

When an object is stored, you need to record the key so you can find it again later. In Ugarit, everything is stored in a tree-like directory structure. Files are uploaded and their hashes obtained, and then a directory object is constructed containing a list of the files in the directory, and listing the key of the Ugarit objects storing the contents of each file. This directory object itself has a hash, which is stored inside the directory entry in the parent directory, and so on up to the root. The root of a tree stored in a Ugarit vault has no parent directory to contain it, so at that point, we store the key of the root in a named "tag" that we can look up by name when we want it.

Therefore, everything in a Ugarit vault can be found by starting with a named tag and retrieving the object whose key it contains, then finding keys inside that object and looking up the objects they refer to, until we find the object we want.

When you use Ugarit to back up your filesystem, it uploads a complete snapshot of every file in the filesystem, like a full dump. But because the vault is content-addressed, it automatically avoids uploading anything it already has a copy of, so all we upload is an incremental dump - but in the vault, it looks like a full dump, and so can be restored on its own without having to restore a chain of incrementals.

Also, the same storage can be shared between multiple systems that all back up to it - and the incremental upload algorithm will mean that any files shared between the servers will only need to be uploaded once. If you back up a complete server, than go and back up another that is running the same distribution, then all the files in /bin and so on that are already in the storage will not need to be backed up again; the system will automatically spot that they're already there, and not upload them again.

As well as storing backups of filesystems, Ugarit can also be used as the primary storage for read-only files, such as music and photos. The principle is exactly the same; the only difference is in how the files are organised - rather than as a directory structure, the files are referenced from metadata objects that specify information about the file (so it can be found) and a reference to the contents. Sets of metadata objects are pointed to by tags as well, so they can also be found.

So what's that mean in practice?

Backups

You can run Ugarit to back up any number of filesystems to a shared storage area (known as a vault, and on every backup, Ugarit will only upload files or parts of files that aren't already in the vault - be they from the previous snapshot, earlier snapshots, snapshot of entirely unrelated filesystems, etc. Every time you do a snapshot, Ugarit builds an entire complete directory tree of the snapshot in the vault - but reusing any parts of files, files, or entire directories that already exist anywhere in the vault, and only uploading what doesn't already exist.

The support for parts of files means that, in many cases, gigantic files like database tables and virtual disks for virtual machines will not need to be uploaded entirely every time they change, as the changed sections will be identified and uploaded.

Because a complete directory tree exists in the vault for any snapshot, the extraction algorithm is incredibly simple - and, therefore, incredibly reliable and fast. Simple, reliable, and fast are just what you need when you're trying to reconstruct the filesystem of a live server.

Also, it means that you can do lots of small snapshots. If you run a snapshot every hour, then only a megabyte or two might have changed in your filesystem, so you only upload a megabyte or two - yet you end up with a complete history of your filesystem at hourly intervals in the vault.

Conventional backup systems usually either store a full backup then incrementals to their archives, meaning that doing a restore involves reading the full backup then reading every incremental since and applying them - so to do a restore, you have to download *every version* of the filesystem you've ever uploaded, or you have to do periodic full backups (even though most of your filesystem won't have changed since the last full backup) to reduce the number of incrementals required for a restore. Better results are had from systems that use a special backup server to look after the archive storage, which accept incremental backups and apply them to the snapshot they keep in order to maintain a most-recent snapshot that can be downloaded in a single run; but they then restrict you to using dedicated servers as your archive stores, ruling out cheaply scalable solutions like Amazon S3, or just backing up to a removable USB or eSATA disk you attach to your system whenever you do a backup. And dedicated backup servers are complex pieces of software; can you rely on something complex for the fundamental foundation of your data security system?

System Requirements

Ugarit should run on any POSIX-compliant system that can run Chicken Scheme. It stores and restores all the file attributes reported by the stat system call - POSIX mode permissions, UID, GID, mtime, and optionally atime and ctime (although the ctime cannot be restored due to POSIX restrictions). Ugarit will store files, directories, device and character special files, symlinks, and FIFOs.

Support for extended filesystem attributes - ACLs, alternative streams, forks and other metadata - is possible, due to the extensible directory entry format; support for such metadata will be added as required.

Currently, only local filesystem-based vault storage backends are complete: these are suitable for backing up to a removable hard disk or a filesystem shared via NFS or other protocols. However, the backend can be accessed via an SSH tunnel, so a remote server you are able to install Ugarit on to run the backends can be used as a remote vault.

However, the next backend to be implemented will be one for Amazon S3, and an SFTP backend for storing vaults anywhere you can ssh to. Other backends will be implemented on demand; a vault can, in principle, be stored on anything that can store files by name, report on whether a file already exists, and efficiently download a file by name. This rules out magnetic tapes due to their requirement for sequential access.

Although we need to trust that a backend won't lose data (for now), we don't need to trust the backend not to snoop on us, as Ugarit optionally encrypts everything sent to the vault.

Terminology

A Ugarit backend is the software module that handles backend storage. An actual storage area - managed by a backend - is called a storage, and is used to implement a vault; currently, every storage is a valid vault, but the planned future introduction of a distributed storage backend will enable multiple storages (which are not, themselves, valid vaults as they only contain some subset of the information required) to be combined into an aggregrate storage, which then holds the actual vault. Note that the contents of a storage is purely a set of blocks, and a series of named tags containing references to them; the storage does not know the details of encryption and hashing, so cannot make any sense of its contents.

For example, if you use the recommended "splitlog" filesystem backend, your vault might be /mnt/bigdisk on the server prometheus. The backend (which is compiled along with the other filesystem backends in the backend-fs binary) must be installed on prometheus, and Ugarit clients all over the place may then use it via ssh to prometheus. However, even with the filesystem backends, the actual storage might not be on prometheus where the backend runs - /mnt/bigdisk might be an NFS mount, or a mount from a storage-area network. This ability to delegate via SSH is particularly useful with the "cache" backend, which reduces latency by storing a cache of what blocks exist in a backend, thereby making it quicker to identify already-stored files; a cluster of servers all sharing the same vault might all use SSH tunnels to access an instance of the "cache" backend on one of them (using some local disk to store the cache), which proxies the actual vault storage to a vault on the other end of a high-latency Internet link, again via an SSH tunnel.

A vault is where Ugarit stores backups (as chains of snapshots) and archives (as chains of archive imports). Backups and archives are identified by tags, which are the top-level named entry points into a vault. A vault is based on top of a storage, along with a choice of hash function, compression algorithm, and encryption that are used to map the logical world of snapshots and archive imports into the physical world of blocks stored in the storage.

A snapshot is a copy of a filesystem tree in the vault, with a header block that gives some metadata about it. A backup consists of a number of snapshots of a given filesystem.

An archive import is a set of filesystem trees, each along with metadata about it. Whereas a backup is organised around a series of timed snapshots, an archive is organised around the metadata; the filesystem trees in the archive are identified by their properties.

So what, exactly, is in a vault?

A Ugarit vault contains a load of blocks, each up to a maximum size (usually 1MiB, although other backends might impose smaller limits). Each block is identified by the hash of its contents; this is how Ugarit avoids ever uploading the same data twice, by checking to see if the data to be uploaded already exists in the vault by looking up the hash. The contents of the blocks are compressed and then encrypted before upload.

Every file uploaded is, unless it's small enough to fit in a single block, chopped into blocks, and each block uploaded. This way, the entire contents of your filesystem can be uploaded - or, at least, only the parts of it that aren't already there! The blocks are then tied together to create a snapshot by uploading blocks full of the hashes of the data blocks, and directory blocks are uploaded listing the names and attributes of files in directories, along with the hashes of the blocks that contain the files' contents. Even the blocks that contain lists of hashes of other blocks are subject to checking for pre-existence in the vault; if only a few MiB of your hundred-GiB filesystem has changed, then even the index blocks and directory blocks are re-used from previous snapshots.

Once uploaded, a block in the vault is never again changed. After all, if its contents changed, its hash would change, so it would no longer be the same block! However, every block has a reference count, tracking the number of index blocks that refer to it. This means that the vault knows which blocks are shared between multiple snapshots (or shared *within* a snapshot - if a filesystem has more than one copy of the same file, still only one copy is uploaded), so that if a given snapshot is deleted, then the blocks that only that snapshot is using can be deleted to free up space, without corrupting other snapshots by deleting blocks they share. Keep in mind, however, that not all storage backends may support this - there are certain advantages to being an append-only vault. For a start, you can't delete something by accident! The supplied fs and sqlite backends support deletion, while the splitlog backend does not yet. However, the actual snapshot deletion command in the user interface hasn't been implemented yet either, so it's a moot point for now...

Finally, the vault contains objects called tags. Unlike the blocks, the tags' contents can change, and they have meaningful names rather than being identified by hash. Tags identify the top-level blocks of snapshots within the system, from which (by following the chain of hashes down through the index blocks) the entire contents of a snapshot may be found. Unless you happen to have recorded the hash of a snapshot somewhere, the tags are where you find snapshots from when you want to do a restore.

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the files, directories, and index blocks required, it looks up the tag you have identified as the target of the snapshot. If the tag already exists, then the snapshot it currently points to is recorded in the new snapshot as the "previous snapshot"; then the snapshot header containing the previous snapshot hash, along with the date and time and any comments you provide for the snapshot, and is uploaded (as another block, identified by its hash). The tag is then updated to point to the new snapshot.

This way, each tag actually identifies a chronological chain of snapshots. Normally, you would use a tag to identify a filesystem being backed up; you'd keep snapshotting the filesystem to the same tag, resulting in all the snapshots of that filesystem hanging from the tag. But if you wanted to remember any particular snapshot (perhaps if it's the snapshot you take before a big upgrade or other risky operation), you can duplicate the tag, in effect 'forking' the chain of snapshots much like a branch in a version control system.

Archive imports cause the creation of one or more archive metadata blocks, each of which lists the hashes of files or filesystem trees in the archive, along with their metadata. Each import then has a single archive import block pointing to the sequence of metadata blocks, and pointing to the previous archive import block in that archive. The same filesystem tree can be imported more than once to the same archive, and the "latest" metadata always wins.

Generally, you should create lots of small archives for different categories of things - such as one for music, one for photos, and so on. You might well create separate archives for the music collections of different people in your household, unless they overlap, and another for Christmas music so it doesn't crop up in random shuffle play! It's easy to merge archives if you over-compartmentalise them, but harder to split an archive if you find it too cluttered with unrelated things.

I've spoken of archive imports, and backup snapshots, each having a "previous" reference to the last import or snapshot in the chain, but it's actually more complex than that: they have an arbitrary list of zero or more previous objects. As such, it's possible for several imports or snapshots to have the same "previous", known as a "fork", and it's possible to have an import or snapshot that merges multiple previous ones.

Forking is handy if you want to basically duplicate an archive, creating two new archives with the same contents to begin with, but each then capable of diverging thereafter. You might do this to keep the state of an archive before doing a bit import, so you can go back to the original state if you regret the import, for instance.

Forking a backup tag is a more unusual operation, but also useful. Perhaps you have a server running many stateful services, and the hardware becomes overloaded, so you clone the basic setup onto another server, and run half of the services on the original and half on the new one; if you fork the backup tag of the original server to create a backup tag for the new server, then both servers' snapshot history will share the original shared state.

Merging is most useful for archives; you might merge several archives into one, as mentioned.

And, of course, you can merge backup tags, as well. If your earlier splitting of one server into two doesn't work out (perhaps your workload reduces, or you can now afford a single, more powerful, server to handle everything in one place), you might rsync back the service state from the two servers onto the new server, so it's all merged in the new server's filesystem. To preserve this in the snapshot history, you can merge the two backup tags of the two servers to create a backup tag for the single new server, which will accurately reflect the history of the filesystem.

Also, tags might fork by accident - I plan to introduce a distributed storage backend, which will replicate blocks and tags across multiple storages to create a single virtual storage to build a vault on top of; in the event of the network of actual storages suffering a failure, it may be that snapshots and imports are only applied to some of the storages - and then subsequent snapshots and imports only get applied to some other subset of the storages. When the network is repaired and all the storages are again visible, they will have diverged, inconsistent, states for their tags, and the distributed storage system will resolve the situation by keeping the majority state as the state of the tag on all the backends, but preserving any other states by creating new tags, with the original name plus a suffix. These can then be merged to "heal" the conflict.