Debusine is a tool designed for Debian developers and Operating System developers in general. This post describes how Debusine stores and manages files.
Debusine has been designed to run a network of “workers” that can perform various “tasks” that consume and produce “artifacts”. The artifact itself is a collection of files structured into an ontology of artifact types. This generic architecture should be suited to many sorts of build & CI problems. We have implemented artifacts to support building a Debian-like distribution, but the foundations of Debusine aim to be more general than that.
For example, a package build task takes a debian:source-package as input and produces some debian:binary-packages and a debian:package-build-log as output.
This generalized approach is quite different to traditional Debian APT archive implementations, which typically require the archive contents to be on the filesystem. Traditionally, most Debian distribution management tasks happen within bespoke applications that cannot share much common infrastructure.
File Stores
Debusine’s files themselves are stored by the File Store layer. There can be multiple file stores configured, with different policies. Local storage is useful as the initial destination for uploads to Debusine, but it has to be backed up manually and might not scale to sufficiently large volumes of data. Remote storage such as S3 is also available. It is possible to serve a file from any store, with policies for which one to prefer for downloads and uploads.
Administrators can set policies for which file stores to use at the scope level, as well as policies for populating and draining stores of files.
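As a rough illustration of how per-scope store policies could drive the choice of store, here is a minimal sketch. The names used (StorePolicy, upload_priority, download_priority, read_only) are hypothetical and chosen for this example; they are not Debusine's actual configuration model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorePolicy:
    """Hypothetical per-scope policy for one file store."""

    name: str
    upload_priority: int  # higher is preferred for new uploads
    download_priority: int  # higher is preferred for downloads
    read_only: bool = False  # e.g. a store that is being drained


def pick_upload_store(policies: list[StorePolicy]) -> StorePolicy:
    """Return the preferred writable store for a new upload."""
    writable = [p for p in policies if not p.read_only]
    if not writable:
        raise LookupError("no writable file store configured")
    return max(writable, key=lambda p: p.upload_priority)


def pick_download_store(
    policies: list[StorePolicy], holders: set[str]
) -> StorePolicy:
    """Return the preferred store among those that hold the file."""
    candidates = [p for p in policies if p.name in holders]
    if not candidates:
        raise LookupError("file is not present in any configured store")
    return max(candidates, key=lambda p: p.download_priority)


# Example scope: uploads land locally, downloads prefer the remote store.
scope_policies = [
    StorePolicy("local", upload_priority=10, download_priority=5),
    StorePolicy("s3", upload_priority=1, download_priority=10),
]
```

In this sketch a file that exists in both stores is served from "s3", while a file that has only reached "local" so far is still served from there.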
Artifacts
As mentioned above, files are collected into Artifacts. They combine:
- a set of files with names (including potentially parent directories)
- a category, e.g. debian:source-package
- key-value data in a schema specified by the category and stored as a JSON-encoded dictionary.
Within the stores, files are content-addressed: a file with a given SHA-256 digest is stored only once in any given store, and may be retrieved by that digest. When a new artifact is created, its files are uploaded to Debusine as needed. Some of the files may already be present in the Debusine instance. If such a file is already part of the artifact’s workspace, the client does not need to re-upload it. Otherwise it must be re-uploaded, so that users cannot obtain unauthorized access to existing file contents in another private workspace or multi-tenant scope.
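The upload rule can be sketched with a toy in-memory store. This is an illustration under assumptions, not Debusine's implementation: the class ContentAddressedStore and its methods are invented for this example, and visibility is tracked per workspace only.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store keyed by SHA-256 digest (illustrative only)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        # workspace name -> digests of files already visible in that workspace
        self._visible: dict[str, set[str]] = {}

    def needs_upload(self, workspace: str, digest: str) -> bool:
        """Skip the upload only if this workspace already has the file.

        Knowing a digest that exists elsewhere is not enough: otherwise a
        client could gain access to contents held in another workspace.
        """
        return digest not in self._visible.get(workspace, set())

    def upload(self, workspace: str, content: bytes) -> str:
        """Store the content (at most once) and return its digest."""
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)
        self._visible.setdefault(workspace, set()).add(digest)
        return digest

    def stored_bytes(self) -> int:
        """Total size of the unique blobs held by the store."""
        return sum(len(blob) for blob in self._blobs.values())


store = ContentAddressedStore()
digest = store.upload("team-a", b"hello world\n")
```

Here a second workspace, "team-b", would still be asked to upload the same bytes, but the store would keep only one copy of them.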
Because content-addressing makes storing duplicates cheap, it’s common to have artifacts whose files overlap.
For example, a debian:upload will contain some of the same files as the related debian:source-package, as well as the .changes file.
Looking at the debusine.debian.net instance that we run, we can see content-addressing savings of 629 GiB across our (currently) 2 TiB file store. This is somewhat inflated by the Debian Archive import, which did not bother to share artifacts between suites. But it still shows reasonable real-world savings.
APT Repository Representation
Unlike a traditional Debian APT repository management tool, the source package and binary packages are not stored directly in the “pool” of an APT repository on disk on the Debusine server. Instead we abstract the repository into a debian:suite collection within the Debusine database. The collection contains the artifacts that make up the APT repository.
To ensure that the repository can be safely represented as a valid URL structure (or as files on disk), the suite collection maintains an index of the pool filenames of its artifacts.
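A pool filename index of this kind might look like the sketch below. The pool_path function follows the conventional Debian pool layout (pool/component/prefix/source/filename, with a four-character prefix for "lib" packages). The PoolIndex class is hypothetical and only shows why such an index is useful: two different files must never claim the same path.

```python
def pool_path(component: str, source: str, filename: str) -> str:
    """Build a conventional Debian pool path for a file."""
    # Sources starting with "lib" are spread over four-character prefixes.
    prefix = source[:4] if source.startswith("lib") else source[:1]
    return f"pool/{component}/{prefix}/{source}/{filename}"


class PoolIndex:
    """Hypothetical index mapping pool paths to file digests."""

    def __init__(self) -> None:
        self._paths: dict[str, str] = {}

    def add(self, path: str, digest: str) -> None:
        """Register a file, refusing a different file at a taken path."""
        existing = self._paths.get(path)
        if existing is not None and existing != digest:
            # Two different files cannot share one URL in the repository.
            raise ValueError(f"conflicting contents for {path}")
        self._paths[path] = digest

    def lookup(self, path: str) -> str:
        """Return the digest of the file published at this path."""
        return self._paths[path]
```

Adding the same file twice is harmless in this sketch, while adding a different file under an existing path is rejected before it can corrupt the published view.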
Suite collections can combine into a debian:archive collection that shares a common file pool.
Debusine collections can keep a historical record of when things were added and removed. This, combined with the database-backed, collection-driven repository representation, makes it very easy to provide APT-consumable snapshot views of every point in a repository’s history.
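The reason snapshots come almost for free can be shown with a small sketch. It assumes each collection item records when it was added and, if applicable, when it was removed; the CollectionItem class and its field names are hypothetical rather than Debusine's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CollectionItem:
    """Hypothetical record of one artifact's membership in a suite."""

    name: str
    added_at: datetime
    removed_at: Optional[datetime] = None  # None means still present


def snapshot(items: list[CollectionItem], when: datetime) -> list[str]:
    """Return the names of the items that were in the collection at `when`."""
    return sorted(
        item.name
        for item in items
        if item.added_at <= when
        and (item.removed_at is None or when < item.removed_at)
    )


def utc(year: int, month: int, day: int) -> datetime:
    """Small helper to build timezone-aware timestamps."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# One package version replaced by another on 2024-06-01.
history = [
    CollectionItem("hello_2.10-2", utc(2024, 1, 1), utc(2024, 6, 1)),
    CollectionItem("hello_2.10-3", utc(2024, 6, 1)),
]
```

Because nothing is deleted from the history, a snapshot is just a filter over these records, and the files it refers to are still retrievable by digest.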
Expiry
While a published distribution probably wants to keep the full history of all its package builds, we don’t need to retain all of the output of all QA tasks that were run. Artifacts can have an expiration delay or inherit one from their workspace. Once this delay has expired, artifacts which are not being held in any collection are eligible to be automatically cleaned up.
QA work that is done in a workspace with automatic artifact expiry, and whose results are not published to an APT suite, will expire automatically and safely.
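The expiry rule described above can be expressed as a short predicate. This is a sketch under assumptions: the Artifact fields are invented for the example, and treating "no delay configured anywhere" as "never expires" is a guess rather than documented behaviour.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Artifact:
    """Hypothetical view of the fields relevant to expiry."""

    created_at: datetime
    workspace_expiration_delay: Optional[timedelta]
    expiration_delay: Optional[timedelta] = None  # the artifact's own delay
    in_collection: bool = False  # e.g. published in an APT suite


def is_expired(artifact: Artifact, now: datetime) -> bool:
    """Return True if the artifact is eligible for automatic cleanup."""
    if artifact.in_collection:
        return False  # held by a collection: never cleaned up
    delay = artifact.expiration_delay
    if delay is None:
        delay = artifact.workspace_expiration_delay  # inherit from workspace
    if delay is None:
        return False  # assumption: no delay means keep indefinitely
    return now >= artifact.created_at + delay


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
qa_result = Artifact(created, workspace_expiration_delay=timedelta(days=7))
published = Artifact(
    created, workspace_expiration_delay=timedelta(days=7), in_collection=True
)
```

In this example the QA result is eligible for cleanup after its seven-day delay, while the published artifact is retained regardless of age.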
Daily Vacuum
A daily vacuum task handles all of the periodic maintenance for file stores. It does some cleanup of working areas, scans for unreferenced and missing files, and enforces file store policies. The policy work could be copying files for backup or moving files between stores to keep them within size limits (e.g. from a local upload store into a general cloud store).
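Two of these steps reduce to simple set and size arithmetic, sketched below. The function names are hypothetical, and the choice to move the largest files first is arbitrary; the real vacuum task may select files differently.

```python
def scan_store(
    referenced: set[str], present: set[str]
) -> tuple[set[str], set[str]]:
    """Compare digests referenced by the database with those in a store.

    Returns (unreferenced, missing): files that can be cleaned up, and
    files that the database expects but the store no longer holds.
    """
    unreferenced = present - referenced
    missing = referenced - present
    return unreferenced, missing


def files_to_move(sizes: dict[str, int], max_bytes: int) -> list[str]:
    """Pick files to move elsewhere until the store fits in max_bytes.

    Largest files go first (ties broken by digest) so that few moves
    are needed; this ordering is an illustrative choice only.
    """
    total = sum(sizes.values())
    moves: list[str] = []
    for digest, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if total <= max_bytes:
            break
        moves.append(digest)
        total -= size
    return moves
```

A store that is already within its limit yields an empty list, so running the step daily is harmless when there is nothing to do.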
In Conclusion
Debusine provides abstractions for low-level file storage and object collections. This allows storage to be scalable beyond a single filesystem and highly available. Using content-addressed storage minimizes data duplication within a Debusine instance.
For Debian distributions, storing the archive metadata entirely in a database made providing built-in snapshot support easy in Debusine.