Pretty simple: verify your backups!
GitLab.com is in crisis after experiencing a severe data loss caused by human error and ineffective backups.
On Tuesday evening, one database experienced severe performance degradation, and a sysadmin started emergency database maintenance.
We are performing emergency database maintenance, https://t.co/r11UmmDLDE will be taken offline— GitLab.com Status (@gitlabstatus) January 31, 2017
But another (tired) sysadmin, working late at night in the Netherlands, accidentally deleted a directory on the wrong server during a database replication process, wiping a folder containing 300GB of live production data.
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8— GitLab.com Status (@gitlabstatus) February 1, 2017
In the Google Doc, the sysadmins note:
“This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis).”
So not all is lost? Do not be too optimistic! The document concludes with the following grim summary:
“So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.”
A few hours later, GitLab published a post on its blog trying to reassure users about the situation:
Yesterday we had a serious incident with one of our databases. We lost 6 hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com. Git/wiki repositories and self hosted installations were not affected. Losing production data is unacceptable and in a few days we’ll post the 5 why’s of why this happened and a list of measures we will implement.
As of time of writing, we’re restoring data from a 6-hours old backup of our database.
This means that any data between 17:20 UTC and 23:25 UTC from the database (projects, issues, merge requests, users, comments, snippets, etc.) is lost by the time GitLab.com is live again.
Update 18:14 UTC: GitLab.com is back online
I'd like to quote this sentence from The Register:
The world doesn’t contain enough faces and palms to even begin to offer a reaction to that sentence.
https://t.co/vJ4RzuYLz3 melts down after wrong directory deleted. The backups failed too. Moral of the story? Verify backups.— nixCraft: The Best Linux Blog In the Unixverse (@nixcraft) February 1, 2017
Moment of silence for our fellow sysadmin/ops who is on-call right now. GitLab Database Melt Down Incident https://t.co/NlMn54uYUW— nixCraft: The Best Linux Blog In the Unixverse (@nixcraft) February 1, 2017
Let's observe a minute's silence in memory of the sysadmin. #gitlab— Eduardo Casas (@eduardo_casas) February 1, 2017
Moral of the story?
- Verify backups
- Do not work late!
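The first lesson deserves emphasis: a backup you have never restored and checked is not a backup. As a minimal illustrative sketch (not GitLab's actual setup — the file names and helper functions here are hypothetical), here is the kind of check a backup job can run after every copy, rather than assuming the copy succeeded:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(original: Path, backup: Path) -> bool:
    """A backup only counts if it exists, is non-empty, and matches the source."""
    if not backup.exists() or backup.stat().st_size == 0:
        return False
    return sha256_of(original) == sha256_of(backup)

# Demo: make a "backup" of a file, then actually verify it.
workdir = Path(tempfile.mkdtemp())
source = workdir / "production.db"
source.write_bytes(b"precious production data")

backup = workdir / "production.db.bak"
shutil.copy2(source, backup)

print(verify_backup(source, backup))            # True: copy matches the source
print(verify_backup(source, workdir / "nope"))  # False: backup never existed
```

For a real database, the equivalent step is periodically restoring a dump into a scratch instance and running sanity queries against it; the point is the same — the backup pipeline should fail loudly when the restore check fails, instead of silently producing empty files.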