If you manage data for a business, the biggest risk you should always be weighing is: how much of this data am I going to lose if a disaster strikes? Your manager will always want zero data loss, where everything ever saved survives no matter how big the disaster. However, they are usually not willing to pay what it costs to make that a reality, and so they will trade some data loss for a lower cost.
While our options for zero data loss are becoming more affordable, they still require a substantial investment. In theory, synchronous Availability Groups and mirroring are a simple way to prevent data loss; in reality, they need people to maintain them (staff costs) and near-impossible network speeds to avoid latency to your offsite replica, and they still have a real possibility of falling behind.
Your other option is basically to rewrite the application to somehow save all the data in two locations at the same time, re-synchronize that data when (not if) it gets out of sync, and maintain that whole beast. Most managers will accept a small amount of data loss to keep things affordable. For example, an asynchronous availability group won't cripple the speed of your primary database, and you can realistically keep the replica within a few minutes, or even a few seconds, of the primary most of the time (index rebuilds might set you back). This is vastly cheaper than rebuilding all of your software and dealing with the complications, and most managers will happily accept that trade.
When we consider backups, we need to think about how much data we can lose to a fire, a server crash, or a developer who ran a query without a WHERE clause (which is why HA isn't a backup). So how much data can you afford to lose?
What you’re trying to calculate is your recovery point objective (RPO). Your RPO is the point in time that you want to get your data back to in the event of a disaster. I normally like to work with general numbers like 5 minutes, 1 hour, 6 hours, 12 hours, 1 day, 1 week, 1 year. As a data professional, you need to know what your RPO is – your job depends on it.
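To make those general numbers concrete, your worst-case RPO with log-shipped or log-backed-up databases is driven by the largest gap between consecutive log backups. Here is a minimal sketch in Python, assuming you have already pulled backup timestamps out of your backup history; the `worst_case_rpo` helper and the example timestamps are hypothetical, not from any real system:

```python
from datetime import datetime, timedelta

def worst_case_rpo(log_backup_times):
    """Estimate worst-case RPO as the largest gap between
    consecutive log backups. A missed backup run widens the gap,
    which is exactly the data you could lose."""
    times = sorted(log_backup_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return max(gaps)

# Example: log backups roughly every 15 minutes, with one missed run.
backups = [
    datetime(2023, 1, 1, 9, 0),
    datetime(2023, 1, 1, 9, 15),
    datetime(2023, 1, 1, 9, 45),  # the 9:30 backup failed
    datetime(2023, 1, 1, 10, 0),
]
print(worst_case_rpo(backups))  # 0:30:00
```

Note that this measures when backups were taken, not when they were copied off the server; if the server itself is the disaster, your real RPO is only as good as the last backup that made it somewhere else.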
Once you have figured out your RPO, you need to communicate that to your manager so that you can make sure that it meets their expectations.
It’s ok to have different RPOs for each of your databases. Some data may change infrequently, or be easy to recreate, so you can afford to lose a whole day’s worth; other data is critical to the business, and if you lose more than a minute or two then you’re going to break the business.
It’s also ok to have different RPOs for different types of disasters. For example, if your building burns down it might be ok to lose today’s data, but if it’s just a query that deletes data then you need to be able to get it all back.
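One simple way to keep those per-database, per-disaster agreements straight is to write them down as a small lookup table you can show your manager. A minimal sketch, where the database names, disaster categories, and minute values are all hypothetical examples:

```python
# Agreed RPO targets, expressed in minutes of acceptable data loss.
# Keyed by (database, disaster type); the entries are illustrative only.
RPO_MINUTES = {
    ("Sales",     "bad_query"): 5,        # near-zero loss: restore to a point in time
    ("Sales",     "site_loss"): 60,       # losing the building: an hour is acceptable
    ("Reporting", "bad_query"): 24 * 60,  # rebuilt nightly, a day's loss is fine
    ("Reporting", "site_loss"): 24 * 60,
}

def rpo_for(database, disaster):
    """Look up the agreed RPO for a database and disaster type."""
    return RPO_MINUTES[(database, disaster)]

print(rpo_for("Sales", "bad_query"))      # 5
print(rpo_for("Reporting", "site_loss"))  # 1440
```

The point of the table is less the code than the conversation: every row is a trade-off your manager has explicitly signed off on before anything catches fire.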
As a data professional, you need to know how much data could be lost at any point of the day or night, for each type of disaster, and you need to communicate that to your manager so they know what to expect from you when a disaster happens. This needs to happen before you have a disaster – it’s too late to work out how much data you’re going to lose when your building is burning to the ground.