When the Internet Fails: Application Availability, SLAs, and Disaster Recovery Planning
September 24, 2008
All too frequently, web sites do not meet the expectations of those who want to use them. Sometimes, a site is completely unreachable. At other times, although reachable, its performance is poor. Occasionally, users have other unexpected experiences: for example, after connecting to a web site almost instantly, they cannot authenticate even though they have valid login credentials.
When a web site depends on its users having a satisfactory experience, such failures have economic consequences for its owners, whether or not a service level agreement (SLA) obligates them to deliver at a given level.
Service Level Agreements
The terms of service for a web server (or any other kind of server, for that matter) are sometimes provided in a document called a service level agreement (SLA). This document details the fine print: how much uptime the service guarantees, how quickly and effectively the provider will respond to an outage, and whether the provider will compensate users or reduce their fees if it breaches its service guarantee. An SLA can also contain specifications for other attributes, such as quality of service (QoS).
In short, an SLA is a contract that dictates what level of service the provider is obligated to deliver and what credits or remuneration, if any, are required when the terms of the SLA are not met. SLAs can be between different businesses or between different segments of the same organization.
You can read an example SLA at http://www.amazon.com/gp/browse.html?node=379654011. Notice that it excuses Amazon from liability for both outages and less severe failures caused by force majeure events.
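To make the uptime and credit arithmetic concrete, here is a minimal Python sketch. The 30-day billing period and the credit tiers in HYPOTHETICAL_TIERS are assumptions made up for illustration; they are not taken from any particular provider's SLA, and a real contract will define its own measurement period, thresholds, and exclusions.

```python
# Illustrative SLA arithmetic. The tiers below are hypothetical, not any
# real provider's terms.

def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Minutes of downtime an uptime guarantee permits over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


def measured_uptime_percent(downtime_minutes, period_days=30):
    """Uptime actually achieved, given the total downtime observed."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Hypothetical tiers: (uptime threshold in percent, credit in percent of fees).
# Uptime below a threshold earns that tier's credit; the lowest tier wins.
HYPOTHETICAL_TIERS = [(99.9, 10.0), (99.0, 25.0)]


def service_credit_percent(uptime_percent, tiers=HYPOTHETICAL_TIERS):
    """Credit owed for the period, or 0.0 if the guarantee was met."""
    credit = 0.0
    # Walk from the highest threshold down so deeper breaches override.
    for threshold, tier_credit in sorted(tiers, reverse=True):
        if uptime_percent < threshold:
            credit = tier_credit
    return credit
```

Under these assumptions, a 99.9 percent guarantee leaves room for roughly 43 minutes of downtime in a 30-day period, so a single 90-minute outage would breach it and fall into the first hypothetical credit tier.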
From time to time, even the most technology-savvy enterprises fail to deliver the online services that users expect. For example, at one time or another over the past year or so, the SalesForce.com, RIM (BlackBerry), Microsoft, Google, Netflix, eBay, Skype, Vonage, IBM, and Beijing Olympic Games web sites, to name just some of the best-known cases, have disappointed their users. As you can imagine, these are not isolated instances. So, take a look at a few of the scenarios that many end users encounter.
But, first, it is important to remember that lost connectivity doesn't mean lost data, just lack of access to the data. The data is still there; you simply can't get to it right now. For some businesses, that can disrupt actual operations. In other cases, it means that backups or disk mirroring are suspended, so you have only your local copies of data until connectivity is restored.
Fortunately, full-blown outages are far less common than other situations where performance is just somewhat below acceptable or contractually set limits.
In either case, man-made or natural causes can be to blame. This article focuses on these two situations, both of which can cause your system to perform out of spec; that is, to the point where the penalties of an SLA (if you have one) kick in.
When your site is experiencing difficulty, you should first determine whether the problem is caused by events within your organization or events in the external network(s). Naturally, if it's not in the latter, it's in the former. Fortunately, there are sometimes simple steps you can take and tools you can use to investigate this question.
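One simple step of this kind is to probe a few hosts inside your own network and a few outside it, then compare the results. The following Python sketch is one possible way to do that, not a complete diagnostic tool. The function names, the result labels, and the host names in the usage comment are hypothetical, and a successful TCP connection shows only that a host is reachable, not that the application on it is healthy.

```python
import socket


def can_connect(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def locate_problem(internal_ok, external_ok):
    """Classify probe results.

    internal_ok and external_ok map a probed target to True (reachable)
    or False (unreachable).
    """
    if not all(internal_ok.values()):
        return "internal"            # something inside your network is down
    if not any(external_ok.values()):
        return "uplink or external"  # nothing outside is reachable
    if not all(external_ok.values()):
        return "partial external"    # only some outside sites are affected
    return "no problem detected"


# Example usage with hypothetical hosts (performs real network probes):
#   internal = {h: can_connect(h) for h in ["intranet.example.internal"]}
#   external = {h: can_connect(h) for h in ["www.example.com", "www.example.org"]}
#   print(locate_problem(internal, external))
```

Keeping the classification separate from the probing makes the decision logic easy to check on its own, and lets you swap in a different probe (an HTTP request or a ping, say) without changing it.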
The Internet Traffic Report (ITR) shown in Figures 1, 2, and 3 monitors the flow of Internet data around the world. It then displays a value between 0 and 100. Higher values indicate faster and more reliable connections. The higher the packet loss percentage, the slower the connection will be because, in most instances, the same piece of information has to be sent several times.
Note: This free resource for the Internet community is updated every 5 minutes. You can request that a router on your network be added to the ITR list.
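The link between packet loss and repeated sends can be illustrated with a small calculation. This Python sketch assumes each transmission is lost independently with the same probability and that the sender simply retries until the packet gets through; it is a simplified model for illustration, not the formula ITR uses for its index, and real protocols such as TCP also slow their sending rate when they detect loss.

```python
def expected_transmissions(loss_percent):
    """Average number of sends needed per delivered packet.

    Assumes independent losses and retry-until-success, so the count
    follows a geometric distribution with mean 1 / (1 - p).
    """
    if not 0 <= loss_percent < 100:
        raise ValueError("loss_percent must be in the range [0, 100)")
    loss_probability = loss_percent / 100.0
    return 1.0 / (1.0 - loss_probability)
```

Under this model, a link with 20 percent packet loss needs 1.25 sends per delivered packet on average, and one with 50 percent loss needs 2.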
Internet connectivity may be smooth today, but perhaps users won't be able to reach a few web sites in Europe tomorrow. ITR will tell you if those regions of the Internet are currently slowed down. So, by checking ITR, you may be able to determine whether your problems are global or local.
Figure 1: Global overview of Internet traffic