Using systemd as Docker replacement

For a recent customer project, I migrated a Docker and Compose based project to use SquashFS images supervisored by systemd. In this post I'm writing down the process and learnings.

To prevent confusion up front: I'm not claiming this is a solution to fully replace Docker or other similar solutions. This approach was designed for a special project with special requirements you might not have.

Still, I hope this writeup might be useful to others as well.

Why switch in the first place?

Docker and Compose have served me well over the years. They're great tools to allow packaging and deploying complex applications without having to worry too much about the underlying Linux distribution.

Nonetheless, it's always a good idea to re-evaluate technical choices and try out new things. A rich toolbox allows you to solve a wide array of problems.

Some of the annoyances I had with Docker and Compose:

Compose acts like an alien on the target system. It has its own commands, configuration files, service management and even managed networks.
Docker creates its own firewall rules, a potential footgun if you're relying on ufw or nftables directly.
Log management is annoying and it's easy to lose log messages.
It's easy to fill up your entire disk with unused images or even running out of inodes.
The always running daemon is a potential security issue, especially when using root containers.
As the company behind Docker is increasing their efforts to monetize the projects, the future could bring surprises like the BSL situation with ElasticSearch/Hashicorp/Redis. You won't have that problem with systemd.

Project requirements

The project I was working on was quite the opposite to my typical cloud projects: Running on a single VM in a on-premise datacenter and no constant internet access (only for updates). Also I only had limited VPN access, so no "just casually log in and restart misbehaving services".

Because of this, I shifted my mindset more to operations and thought: "How would I built it, so it's easy to understand for the average Linux admin?"

As this was a freelance project, I did not want to permanently host an entire container registry and CI process for something that is only updated on customer requests maybe a few times a month.

Additionally, the customer should be able to build new releases of this app without any of my infrastructure.

The solution in this post is the result of this.

The build process

A while ago, based on a request at the Hetzner customer forum (requires account), I came up with an idea. The poster asked: "I create throwaway VMs that run a Docker container. This container is over 3 GiB large and deploying takes forever. How can I make this faster?".

I asked some questions for clarification and the poster replied: The huge size came from packaging up Blender and a lot of libraries.

To improve this flow, I suggested to convert the Docker image to a SquashFS, a readonly filesystem that was originally used for Linux live CDs, but is nowadays also used by the snap package manager.

A SquashFS filesystem is a compressed and deduplicated single-file image that can be mounted and decompressed on the fly. While Docker images consist of multiple layers that need to be downloaded and extracted (which takes time and many inodes), a SquashFS can be directly mounted and used.

Building these images is really easy and allows you to even reuse your existing build process: After building the container image, you can create a temporary container to export the filesystem. Pipe that directly into tar2sqfs and voila: Your image is now on average 50-70% smaller as well!

Bash

# Build the container as usual
docker build -f myapp .

# Make a temporary container (command doesn't matter as it's not executed)
docker create --name=export-temp myapp bash

# Export the container filesystem directly to tar2sqfs,
# which is available in Debian as package "squashfs-tools-ng".
docker container export export-temp | tar2sqfs myapp.sqfs

# Cleanup
docker rm export-temp

If you prefer Podman just replace the command names - it's fully compatible.

Deployment process

Put your created image file to a cloud storage of your choice or even just on a local fileserver - no need to host or maintain a container registry.

To transfer them to the target system, use wget, rsync or any other file copy tool of your choice.

To run the application, you don't even need to bother with mount/umount: systemd had the option RootImage= that does all the magic for you.

systemd-container/app.service

[Unit]
Description=API
After=postgresql.service

[Service]
User=my-app
RootImage=/opt/app/components/api.sqfs
ExecStart=/bin/container-entrypoint.sh
EnvironmentFile=/etc/app/api.env

# Persistent storage
BindPaths=/opt/core/data

# DNS
BindReadOnlyPaths=/etc/hosts
BindReadOnlyPaths=/etc/resolv.conf

[Install]
WantedBy=multi-user.target

Roadblocks and learnings

Getting everything to work has been quite an journey, especially as some error messages provided by systemd are rather unhelpful.

Rootless containers

Creating containers without root is astonishingly easy:

Create a dedicated user account on the host (e.g. using Ansible).
Set the User= option in your service file.

In most cases you don't even need to create the users in the container, this is especially great when using third party containers like databases.

Read only container filesystem

When using Docker, you can mount in things from the host to any place in the container. This is possible, as Docker adds a writable overlayfs by default. systemd doesn't, which will result in namespace mount errors when starting the service unit.

The reason behind: Mountpoints already have to exist in the container filesystem before they can be used, as it's not possible to modify the parent directory to create one.

While you can emulate the Docker behaviour (writable overlayfs) in systemd, I consider the standard behaviour to be a great security feature, as it prevents even a temporary persistence when a service is compromised.

To fix this problem, create empty directories or files at the end of your Dockerfile:

systemd-container/Dockerfile-mkdir

# Mount points needed in NAME.service. Must be part of the container image,
# as systemd cannot create them at runtime due to the lack of a overlay filesystem.
RUN mkdir -p /var/lib/app /etc/app

Disadvantage: You create a direct dependency now between your container image and how they're deployed. I can live with that, as I oversee both development and deployment.

Running ad-hoc commands

If you want to run a command in the container context directly, e.g. for upgrade scrips or debugging something in a shell, you can use the systemd-nspawn command.

Unfortunately, to make it work in all cases, you need a bunch of flags to make it work properly. For example, the following invocation is used in my project to spawn an interactive admin shell inside a container:

systemd-container/container-shell.sh

systemd-nspawn \
    --quiet \
    --image /app/.sqfs \
    --as-pid2 \
    --register=no \
    --volatile=overlay \
    --bind-ro="/etc/hosts" \
    --bind-ro="/etc/resolv.conf" \
    --bind="/app/data:/data" \
    bash

Especially —-register=no is important to prevent errors when the image is already mounted by a service. Binding the hosts and resolv.conf ensures that DNS works inside the container - see the next section for details.

No automatic name resolution configuration

To be able to resolve any hostnames on your system, you need the /etc/hosts and /etc/resolv.conf files.

Docker is very helpful in that regard and automatically mounts them for you to ensure everything is working smoothly. systemd on the other hand only does exactly what you tell it to do, and nothing else.

Now you might think: "Ha, my container doesn't need to talk to the outside world!"

Well, without a /etc/hosts, you cannot even resolve localhost. 😅

No support for Dockerfile defined entrypoint/commands

Because the SquashFS image contains only a filesystem and no metadata, you need to specify the exact command to run in your systemd service file.

For your own images this shouldn't be much of an issue, but it can be troublesome sometimes to find this out when using third-party images. In that case, you have to find their original Dockerfiles and take the commands from there.

Using systemd-analyze security to tune sandboxing

Docker provides some sandboxing by default, but what exactly it's doing is somewhat a mystery even to experienced users and rarely configured further.

systemd by default does no sandboxing, it's up to you to configure your service properly. While more work, it allows you to tighten security to extreme levels.

Start by executing systemd-analyze security app.service. You'll get a large report about potential options you could set to take privileges away from your application. As a rule of thumb: Take away everything that you don't need (Zero Trust).

<p>Example invocation of systemd-analyze with results for a service
that had some basic settings applied.</p> — Example invocation of systemd-analyze with results for a service that had some basic settings applied.

Ensure auto-restart of services

To ensure that failing services are always restarted on failure, add the following settings to your service files.

Ini

[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always
RestartSec=1s

This great blog post explains the theory behind it.

Using Conflicts for safer database upgrades

The project I worked on is using ArangoDB as its primary database. Every time a major version is released, you need to to run an upgrade command that quickly starts and exits the server. During that, the actual database server must not run.

In the previous Docker-based deployment of the project, we patched the upstream container to do that (and other things) at the start, which had proven to be very fragile and might also cause unwanted upgrades to happen.

In the new deployment, there's a dedicated onshot service that is started during application upgrades by Ansible. By leveraging the Conflict= option, systemd automatically shuts down the database in case it's running and prevents it from starting while the upgrade is running.

While that case should never happen, better safe than sorry :-)

systemd-container/arangodb-upgrade.service

[Unit]
Description=ArangoDB Database Upgrade
Conflicts=arangodb.service

[Service]
Type=oneshot
User=arangodb
RootImage=/apps/arangodb.squashfs
ExecStart=/entrypoint.sh arangod --database.auto-upgrade

Disadvantages of this approach

As already mentioned in the introduction of this post, this approach is not perfect either and depending on your requirements might not be useful at all.

Non-standard process and tooling

Docker, Compose and even Kubernetes have become industry-wide standards, while deploying directly on host systems seem to become outdated practices in the cloud world (often for good reasons!).

On the other hand, this approach reduces the training time for classic system administrators, as they're generally aware of the management commands that systemd provides (like systemctl and journalctl) and don't need to remember an extra set of invocations for a different service supervisor.

More own tooling needed

The container ecosystem is huge and there are many tools available, from building container images to storing and deploying them. With this approach, you're forced to build your own tooling.

Worse CI-support

Many modern CI systems are running their jobs inside containers and do not provide access to the container host for security reasons. Within these CI systems, it might be hard to run the necessary commands to export the built container images.

Slower and resource hungry build process

Builds definitely take longer than before, as an additional step is being added. Compressing an entire SquashFS will use all available cores - and you better have at least 4 of them.

Personally I think that this one-time cost is worth the faster downloads and less storage space on the target systems.

Conclusion

This approach was deployed to the customer around a year ago and has proven to be very stable.