Solving GitLab Migration Issue:
Monkeypatching GitLab for Multicore Backup
Challenge: Multicore Backups in GitLab Migrations
Recently we migrated a very large GitLab installation for a customer from an aging Ubuntu system to a Debian-based systemd-nspawn container running on an Arch Linux host.
The system was originally laid out as a standard-sized GitLab Omnibus deployment, but due to heavy usage it grew over time into a large behemoth: multiple terabytes of data spread across multiple disks, running on an aging Ubuntu system. We decided on a migration to our default platform, a Debian container on a fresh rolling-release operating system. (We opted for a containerized approach because GitLab only supports Ubuntu, Debian, SUSE and RHEL-based distributions, and we believe in keeping deployments as vanilla as possible.)
Our customer’s requirements mandated that the GitLab system be available around the clock on regular workdays, because a branch of the company operates in a different timezone. The migration window was a maximum of 48 hours on a single weekend, so we allocated several terabytes of virtual disk space to the VM and initiated the GitLab backup procedure on a Friday night.
However, we had to stop the backup task after it had been running nonstop for almost two days. We later calculated that the entire process would take around three days to finish, and this estimate covered only the data export. The restore phase would presumably have been quicker than the backup phase, but given our migration window of one weekend, this was not an option and the whole project had to be rescheduled.
We were not alone with this issue: many GitLab Enterprise customers struggle with the GitLab backup routine on large instances, and the backup script seems to have received very little attention in recent years. We were stuck with the 48-hour limit, so we had to come up with our own solution.
We cloned the entire virtual machine and ran some experiments. Eventually, we arrived at a surprisingly simple solution that reduced backup and restore time by an order of magnitude:
- Replace the system gzip with pigz
- Patch the backup helper script
- Increase the number of virtual CPU cores
- Skip the unnecessary tar step in the backup
1. Replace the system gzip with pigz
The experiments showed that the compression process executed by the backup script was bottlenecked by a single-core gzip operation. Consequently, we installed pigz (https://zlib.net/pigz/), a parallel implementation of gzip designed for modern multi-processor, multi-core machines.
We then replaced the system gzip with pigz, so that calling gzip --version reported the pigz version instead of the original gzip.
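One way to do this is to shadow the gzip binary with a symlink to pigz. A minimal sketch, assuming pigz is already installed (the shim directory is an arbitrary choice of ours; a permanent swap on Debian could use dpkg-divert instead):

```shell
# Sketch: put a pigz symlink named "gzip" first in PATH,
# so every "gzip" invocation actually runs pigz.
if command -v pigz >/dev/null; then
  mkdir -p /tmp/pigz-shim
  ln -sf "$(command -v pigz)" /tmp/pigz-shim/gzip
  export PATH=/tmp/pigz-shim:$PATH
fi
gzip --version   # reports the pigz version while the shim is active
```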
2. Patch the backup helper script
Using pigz alone was not sufficient. Fortunately, CPUs have evolved since the early days of pigz, so we also increased the compression block size from the tiny default of 128 KiB to 4096 KiB. Larger values might work, but we did not test further.
The block size is set in /opt/gitlab/embedded/service/gitlab-rails/lib/backup/helper.rb, lines 35 and 37.
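The exact method names in helper.rb vary between GitLab versions, so the following is only a sketch of the idea: with gzip symlinked to pigz, appending pigz's -b flag (block size in KiB) to the compression command line is all the patch amounts to.

```ruby
# Hypothetical sketch of the patched compression command in
# /opt/gitlab/embedded/service/gitlab-rails/lib/backup/helper.rb
# (the method name here is illustrative, not GitLab's actual API).
def gzip_cmd
  # -b 4096 raises pigz's compression block size
  # from the 128 KiB default to 4096 KiB.
  'gzip -c -b 4096'
end
```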
3. Increase the number of virtual CPU cores as needed
After some brief tests, we figured that 16 virtual cores were the sweet spot for pigz on our virtual machine.
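The sweet spot depends on the workload and host, so it is worth probing: time pigz at different -p (thread count) values on a sample file. A rough sketch (sample size and path are arbitrary; real repository data gives more representative numbers than random bytes):

```shell
# Generate a throwaway sample file, then time pigz at several thread counts.
dd if=/dev/urandom of=/tmp/sample.bin bs=1M count=32 status=none
for p in 1 4 8 16; do
  command -v pigz >/dev/null || break   # skip if pigz is not installed
  start=$(date +%s%N)
  pigz -p "$p" -c /tmp/sample.bin > /dev/null
  end=$(date +%s%N)
  echo "-p $p: $(( (end - start) / 1000000 )) ms"
done
```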
4. Skip the unnecessary tar step in the backup
Finally, we discovered that the script pipes the multiple gzip outputs into a single tar archive (which, in our case, crashed after the backup was complete, deleting the entire progress). Since this step takes single-core computation time and can be skipped via a command-line flag, we executed the backup without it.
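GitLab's backup Rake task exposes this through its documented SKIP variable (on older versions the task is invoked as gitlab-rake gitlab:backup:create instead):

```shell
# Create the backup but skip packing everything into one tar archive;
# the backup directory then contains the per-component gzip files directly.
sudo gitlab-backup create SKIP=tar
```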
The backup time was reduced by an order of magnitude, to approximately four hours. The restore process does not require the backup to be in tar format; a collection of gzip files suffices. Restoration likewise completed within about four hours, so after eight hours the very large GitLab installation was happily running on the new system.
You can read the article in German here: GitLab-Migration: Monkeypatching von GitLab für Multicore-Backup