Nowadays, large software projects can have tons of dependencies. As customers require solutions faster than ever, the software development cycle has shrunk from years (back when we shipped CDs) to hours: at my last job, we even deployed new versions of the software multiple times per day. To keep up with this pace, developers often integrate third-party (open source) libraries into their code instead of writing everything themselves.

While this approach is quite comfortable (just add a package) and has boosted productivity, it also creates a considerable risk: if your dependency provider goes down, or someone deletes a package half the ecosystem relies on, building your application can go from “yarn build” to “pretty much impossible”.

Additionally, these dependencies get re-downloaded every time a developer sets up their local environment or the project is built on a CI system. This makes modern development heavily reliant on the internet and on a stable, fast connection to it. Instead, being able to produce reproducible builds from the repository alone should be a development goal.

In more “classic” environments, like the C/C++/Java world, this hasn’t been an issue, as all dependencies were vendored into the project’s repository - or things were just constantly re-implemented from scratch.

Fortunately, it’s possible to bring these worlds together: keep using the rich open ecosystems, but also keep a reliable local backup of your dependencies.

In this blog post, I’ll try this for projects in the two languages I use the most: Python (PyPI) and JavaScript (npm).

Vendoring JavaScript dependencies

For JavaScript, an automated solution is necessary: When using tools like create-react-app, you can easily grab thousands (!) of packages before even writing your first line of code.

Fortunately, we’re not the first to recognize this issue, and the yarn package manager has some documentation on this matter. Let’s try it out!

Preparations

To follow this on your machine, you’ll need the following tools installed:

- Node.js (for npx) and the yarn package manager
- git, including the git-lfs extension
- Python 3, including venv and pip
- optionally Docker, if you want to try the Dockerfile part

The project

For realism, let’s create an “empty” create-react-app:

npx create-react-app airgap-app

(depending on your PC and internet connection, this can take many seconds…)

Let’s take a look at how many packages we got:

cd airgap-app

yarn list | wc -l
# => 5116

When I first saw that number, I was like: Hold on, that can’t be right 🤨

What if we only count direct dependencies?

yarn list --depth=0 | wc -l
# => 1252

Thanks to this dependency bloat, and the general “stability” of the ecosystem, it’s no wonder projects fall apart if you don’t constantly update everything. But I digress…

Creating the offline cache

According to a blog post by a yarn developer, setting up a mirror shouldn’t be too hard:

yarn config set yarn-offline-mirror ./package-cache
yarn config set yarn-offline-mirror-pruning true

Note that this updates the global yarn config, even without the --global parameter set. To apply the changes only to the project, move the yarn config file into the repository and keep the mirror path relative.

Extra care is needed if you have already changed other yarn settings before, as the moved file will carry them into the project as well.

mv ~/.yarnrc ./
vim ./.yarnrc

The resulting file should look like this:

# THIS IS AN AUTOGENERATED FILE. DO NOT EDIT THIS FILE DIRECTLY.
# yarn lockfile v1

yarn-offline-mirror "./package-cache"
yarn-offline-mirror-pruning true

After the configuration has been updated, we can delete the previously installed packages, as well as the yarn lockfile, and re-download everything:

rm -rf node_modules/ yarn.lock
yarn install

After successful execution, let’s see how much we got in the cache:

du -sh ./package-cache
# => 32M ./package-cache

Putting 32 MiB (or more, for real projects) straight into git is not ideal, as git was optimized for text, not for piles of binary files. Git Large File Storage (git-lfs) to the rescue!

# This requires git-lfs to be installed. On macOS, you can get it using Homebrew:
#   brew install git-lfs

git lfs install  # Inside the repository
git lfs track "./package-cache/*"

At this point, we can commit everything to git:

git add .gitattributes yarn.lock .yarnrc ./package-cache  # .gitattributes contains the LFS tracking rules
git commit -m "Vendor all JavaScript dependencies"

Why use git-lfs, even though all the files are tiny?
As the name indicates, LFS was designed for large files, e.g. images or other build artifacts. The advantage of using LFS only shows later in a project’s life, after dependencies have been added, updated and removed many times:

When you clone a git repository, its entire history is downloaded. That includes every historic version of every dependency you’ve ever used in the project - and over the years, this can pile up to a large amount of space. With LFS, only the files actually needed for the current checkout are downloaded. The only downside I see is that some of git’s offline features stop working - but how often do you really need an older version of a library?
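
To see this in action, the following commands might help (a quick sketch - the repository URL is just a placeholder for illustration):

# Show which files are currently managed by git-lfs
git lfs ls-files

# A normal clone downloads the full git history, but only the LFS objects
# needed for the current checkout - old package versions stay on the server
git clone <repository-url>

# If even that is too much, skip LFS objects entirely and fetch them later
GIT_LFS_SKIP_SMUDGE=1 git clone <repository-url>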

Using the offline cache

When later installing the packages using yarn, you can ensure that yarn never touches the internet by using the --offline argument:

rm -rf node_modules/ # Keep the yarn.lock this time!
yarn install --offline

This execution should be much quicker than the first time, as the thousands of files can be read from disk instead of having to be downloaded from a CDN.
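
If you want to put a rough number on that speedup, the shell’s time builtin is enough (an informal comparison, not a proper benchmark):

# Fresh install, served entirely from the local mirror
rm -rf node_modules/
time yarn install --offline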

💡 For further details on the offline mode (including purging unused packages) check out the original blog post.

Use in Dockerfiles

When building in Docker, make sure you add the .yarnrc and the package-cache directory to the image as well:

# Add package mirror
ADD .yarnrc .
ADD ./package-cache ./package-cache

# Install packages
ADD yarn.lock .
ADD package.json .
RUN yarn install --offline
RUN rm -rf ./package-cache

# Build app
ADD public public
ADD src src
RUN yarn build

💡 If you’re using GitLab and your own runner, make sure git-lfs is installed on the host. This has bitten me when vendoring dependencies for a project.
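
On a Debian/Ubuntu based runner host, installing it might look roughly like this (a sketch - package names and whether you need sudo depend on your distribution):

# Install git-lfs on the runner host
sudo apt-get update && sudo apt-get install -y git-lfs

# Register the LFS filters system-wide, so every repository
# the runner clones picks them up automatically
sudo git lfs install --system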

Vendoring Python dependencies

In general, Python projects tend to have a much lower dependency count - most projects I’ve worked on so far had fewer than 30. It’s hard to compare these numbers though, as Python libraries tend to be larger (think of something like Django), whereas the JS ecosystem likes to publish tiny packages (sometimes even one per function…).

While there was a nice blog post to start from for yarn, for Python you’ll need to scrape together information from different places - fortunately, the approach itself is even easier.

Let’s start by writing a small requirements.txt using some commonly used packages:

Click==7.1.2
requests[security]==2.25.1
Flask==1.1.2

(both Flask and requests have multiple dependencies, which makes them great for this example)

👍 Always remember kids: Pin your dependencies, or your stuff might randomly break the next time you build your containers.
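
If your versions aren’t pinned yet, pip can generate them from an existing environment (assuming a virtualenv like the one created below, with your packages already installed):

# Write the exact versions of all installed packages to requirements.txt
./venv/bin/pip freeze > requirements.txt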

Creating the offline cache

The following commands will create a virtualenv in the project directory and download the dependencies into a local cache. In contrast to yarn, the packages will only be fetched - not installed.

# Do not clutter packages into the system directories
python3 -m venv ./venv

# Make sure we have the latest pip
./venv/bin/pip install -U pip

# Download our requirements
./venv/bin/pip download -r ./requirements.txt --dest ./package-cache

After successful execution, let’s see how much we got in the cache:

du -sh ./package-cache
# => 3.6M ./package-cache

Not bad. Not bad at all.
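
If you’re curious what exactly was fetched: the cache is just a flat directory of wheels and source archives, one file per package, including all transitive dependencies of Flask and requests.

# List the downloaded archives (.whl / .tar.gz files) ...
ls ./package-cache

# ... and count them
ls ./package-cache | wc -l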

Using the offline cache

When installing packages using pip, you can specify the cache directory directly:

./venv/bin/pip install --no-index --find-links ./package-cache -r requirements.txt

(while there is no explicit offline mode for pip, --no-index prevents it from contacting PyPI)
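
If you use this in a CI script or Dockerfile, repeating these flags for every pip call gets tedious. pip also reads any option from a PIP_-prefixed environment variable, so the same behaviour can be set once (a sketch relying on pip’s standard environment variable mapping):

# Equivalent to passing --no-index --find-links ./package-cache to every pip call
export PIP_NO_INDEX=1
export PIP_FIND_LINKS=./package-cache

./venv/bin/pip install -r requirements.txt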

At this point, we can commit everything to git again:

git lfs track "./package-cache/*"

git add .gitattributes requirements.txt ./package-cache
git commit -m "Vendor all Python dependencies"

Conclusion

This approach is something that should be considered for pretty much any project, as the upsides (resilience, time savings) greatly outweigh the downsides (larger storage requirements).