Overnight (at least in Australia) GitLab had a little problem:
At the time of writing, it looks like GitLab has lost about 6 hours of data that it will never see again, affecting a bit over 5,000 projects.
In that document it looks like they were doing some maintenance which triggered a replication problem. While trying to fix the replication problem, they accidentally ran a delete on the primary server instead of the secondary…oops. This post is not about having a go at the people at GitLab; I’m sure they are far smarter than I am, and I’m sure that I could make the exact same mistake (I’ve run queries on Production instead of Test before).
What I do want to comment on is three things that we all need to remember:
- Replication doesn’t remove the need for backups! Just because you have your data on a secondary server, it doesn’t help you if you delete the data. If someone runs a truncate statement, that’s going to be replicated too!
- Know your RPO (recovery point objective). For this incident, GitLab had a real RPO of 6 hours. How much data can you afford to lose? Do your key business stakeholders know your RPO, and have they agreed to it? Can you lose 1 minute, 1 hour, 6 hours, or more of data? Are you maintaining your RPO during maintenance windows?
- Assuming you have your backups running at the right interval, are you confident you can restore them? The only way to know for sure that a database can be restored is to restore it. At my current workplace our RPO is flexible enough that we rely on differential backups to achieve it. Every differential backup is restored to a test server and has DBCC CHECKDB run on it to confirm it’s a good restore.
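The restore-and-verify step described above looks roughly like this in T-SQL (database names and file paths here are made up for illustration; the pattern is restore full, restore differential, then check):

```sql
-- Restore the full backup first, leaving the database ready for the diff.
RESTORE DATABASE [ProjectsDB_Verify]
    FROM DISK = N'\\backupshare\ProjectsDB_full.bak'
    WITH NORECOVERY, REPLACE,
         MOVE N'ProjectsDB'     TO N'D:\Verify\ProjectsDB.mdf',
         MOVE N'ProjectsDB_log' TO N'D:\Verify\ProjectsDB.ldf';

-- Apply the latest differential and bring the database online.
RESTORE DATABASE [ProjectsDB_Verify]
    FROM DISK = N'\\backupshare\ProjectsDB_diff.bak'
    WITH RECOVERY;

-- Only now do you know the backup chain is actually restorable and clean.
DBCC CHECKDB (N'ProjectsDB_Verify') WITH NO_INFOMSGS;
```

If any of these steps fails on the test server, you’ve learned your backups are broken while it’s still cheap to find out.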
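The RPO point above is easy to turn into a concrete check. Here’s a minimal sketch (the function name and dates are illustrative, not from the incident) that asks: is the newest restorable backup older than the RPO the business agreed to?

```python
from datetime import datetime, timedelta

# Hypothetical helper: does the age of the newest *restorable* backup
# exceed the agreed recovery point objective?
def rpo_breached(last_good_backup: datetime, agreed_rpo: timedelta,
                 now: datetime) -> bool:
    return (now - last_good_backup) > agreed_rpo

# The GitLab incident as numbers: the last usable copy was ~6 hours old.
now = datetime(2017, 2, 1, 6, 0)
last_backup = datetime(2017, 2, 1, 0, 0)
print(rpo_breached(last_backup, timedelta(hours=1), now))   # True: a 1-hour RPO is blown
print(rpo_breached(last_backup, timedelta(hours=12), now))  # False: a 12-hour RPO holds
```

The key subtlety is `last_good_backup`: it should be the newest backup you have actually proven you can restore, not just the newest file on disk.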
What lessons did you learn from the GitLab outage?