Our Backup/Recovery Plan

Posted on Oct 3, 2014 in Tech Blog

Due to the recent issues OSgrid had with their hard drive array, I have received many inquiries about how we would handle such a situation.
Backup and restoration procedures vary from person to person. While there are many “common” solutions, there will always be an argument about which is best.
It’s like arguing that Ford is better than Chevrolet, or Chevrolet is better than Dodge, or Coke is better than Pepsi.

 

From what I’ve heard, it seems OSgrid had a RAID 10 array which they used for their assets.
RAID 10 is typically a very good solution, as not only do you get mirroring (each piece of data is kept on more than one drive), but you also get striping.
Striping is when data is split into pieces and spread across multiple drives instead of being stored whole on a single drive.
To be more clear, if I were to store my name, “Butch Arnold”, on a single drive with no striping, my name would exist exactly as “Butch Arnold” on that single drive.
If I stored my name on multiple drives using striping, I might get my first name, “Butch”, stored on the 1st drive, while my last name might be stored on a 2nd drive.
Striping is a common practice and, when it is working correctly, it helps retrieve the data much faster because the data can be read from 2 drives (or more) at the same time.
Those who think striping is bad usually think so because troubles similar to those experienced by OSgrid have happened.
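
To make the striping idea concrete, here is a minimal Python sketch. It is only an illustration: the function names `stripe` and `read_back` and the 6-character chunk size are assumptions made for this example, and a real RAID controller works on disk blocks, not on strings. It also shows striping only, not the mirroring half of RAID 10.

```python
from typing import List


def stripe(data: str, drive_count: int, chunk_size: int) -> List[List[str]]:
    """Split data into fixed-size chunks and deal them out to the drives in turn."""
    drives: List[List[str]] = [[] for _ in range(drive_count)]
    for start in range(0, len(data), chunk_size):
        chunk_index = start // chunk_size
        drives[chunk_index % drive_count].append(data[start:start + chunk_size])
    return drives


def read_back(drives: List[List[str]]) -> str:
    """Reassemble the data by taking one chunk from each drive in turn."""
    pieces = []
    longest = max(len(drive) for drive in drives)
    for row in range(longest):
        for drive in drives:
            if row < len(drive):
                pieces.append(drive[row])
    return "".join(pieces)


drives = stripe("Butch Arnold", drive_count=2, chunk_size=6)
print(drives)             # [['Butch '], ['Arnold']]
print(read_back(drives))  # Butch Arnold
```

With both drives present the name comes back whole. If the second drive were lost and there were no mirror, only “Butch ” would remain, which is the risk discussed here.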

I’m not a striping or RAID “hater”, but I’ve chosen not to use them due to this risk.

 

Instead, we have a primary database on one machine which feeds 2 slave databases located on 2 separate machines.
Each evening, we make a backup of one of these databases and store it on that same machine. We also send a copy of that backup to yet another machine and to an Amazon-hosted storage area.

We also generate individual region OAR files and region database backups each night and store them in a similar fashion.
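
The nightly routine described above, one backup copied to several places, could be sketched roughly as follows. This is not our actual script. The function names `sha256_of` and `fan_out` are made up for this example, and every destination is treated as a plain directory (for instance a mounted share), so the upload to Amazon is not shown.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def fan_out(backup: Path, destinations: List[Path]) -> List[Path]:
    """Copy one backup file to every destination and verify each copy."""
    expected = sha256_of(backup)
    copies = []
    for destination in destinations:
        destination.mkdir(parents=True, exist_ok=True)
        target = destination / backup.name
        shutil.copy2(backup, target)
        # A copy that does not match the original is worse than no copy.
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Hypothetical usage; these paths are placeholders, not real locations:
# fan_out(Path("/backups/grid.sql.gz"),
#         [Path("/mnt/second-machine"), Path("/mnt/offsite")])
```

Checking the checksum after each copy is a design choice in this sketch: it catches a damaged copy on the night it is made instead of on the day it is needed.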

I’m not saying this is the way it should be done, but this has worked well for us.
It has allowed us to retrieve a backup of a specific region, using either a database backup or an OAR file, whenever we’ve needed it.

 

All data storage methods will fail at some point; you can count on that.
The trick, in my opinion, is to be ready to restore your data using backups stored in multiple locations, as it is unlikely that all of these separate storage devices will fail at the same time.

I’m not second-guessing anyone here. I’m sure the operators of OSgrid had systems in place, and it’s unfortunate that they failed.

In our configuration, if our main database fails, we can quickly switch over to one of our slave databases.
If, for some reason, our slave databases contain corrupted data and cannot be used, we can then install a completely new database and rebuild it using one of our backups to restore our grid to that point.
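
The order of fallbacks described above can be written down as a small decision function. This is a sketch under assumptions: the names `choose_database` and `recovery_plan` are invented for illustration, and `is_healthy` stands in for whatever check an operator actually runs (a test query, a replication status check, and so on).

```python
from typing import Callable, Optional, Sequence


def choose_database(candidates: Sequence[str],
                    is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that passes the health check, or None."""
    for name in candidates:
        if is_healthy(name):
            return name
    return None


def recovery_plan(primary: str,
                  replicas: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> str:
    """Describe the recovery step: primary first, then replicas, then backups."""
    chosen = choose_database([primary, *replicas], is_healthy)
    if chosen == primary:
        return f"use {primary}"
    if chosen is not None:
        return f"fail over to {chosen}"
    # Nothing usable is left, so install a fresh database and restore into it.
    return "rebuild from latest backup"


# Example: the primary is down, both replicas are fine.
status = {"primary": False, "slave1": True, "slave2": True}
print(recovery_plan("primary", ["slave1", "slave2"], lambda name: status[name]))
# fail over to slave1
```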

 

My thoughts on backups are to do them regularly, store them on several physical machines, and have a plan in place to restore if needed.
Even our backup plans could fail due to unforeseen circumstances, but it is highly unlikely that our main database, our second database, and our third database would all fail, and even more unlikely that every one of our backups would be lost as well, since they are stored on separate machines.
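
The reasoning behind “highly unlikely” is simple multiplication, shown in the sketch below. The numbers are invented purely for illustration, not measured from our machines, and the calculation assumes the failures are independent of each other, which shared causes such as a datacenter outage or a bad software update would break.

```python
import math
from typing import Sequence


def all_fail_probability(failure_chances: Sequence[float]) -> float:
    """Chance that every copy fails together, assuming independent failures."""
    return math.prod(failure_chances)


# Hypothetical figures: each of 3 databases has a 1% chance of failing
# in the same window.
three_databases = all_fail_probability([0.01, 0.01, 0.01])
print(three_databases)  # roughly one in a million

# Adding 2 backup copies on other machines, each also at a made-up 1%,
# makes losing everything at once rarer still.
with_backups = all_fail_probability([0.01] * 5)
print(with_backups < three_databases)  # True
```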

What happened to OSgrid could happen to anyone. The lesson to be learned here is to be ready in case it happens to you.

 

Another part of this which helps me sleep better is the fact that we own all of our equipment.
We have 2 “hot” spare servers running online, ready if we need them in a hurry for grid use.
We also have a 3rd spare server which is offline, but can be placed online in an hour or so and be ready for our use if needed.
We use both HP and Dell servers, and we have the in-house knowledge and ability to do repairs and upgrades on them as needed.
The datacenter we use is located just a short drive from me, and I can go there at any time, day or night, and work on our servers myself.

While our solution may not work for some, it has worked well for us thus far.