High-Availability Basics for Developers
“High availability computing” has, alas, sometimes been presented as a buzzword, which gets in the way of discussing the actual issues behind the many zero-failure initiatives: Non-Stop Computing, Unbreakable (fill in product genre here), and a gamut of lofty, expensive, and even impractical projects. Most such terms are the lore of marketing teams, rather than of the people who had to make these systems work. This article is for developers and sysadmins who really do care about keeping systems running no matter what, along with resources that might help achieve that goal.
High Availability (HA) is in the eye of the beholder. If you're the CEO, it means you didn't lose any sales and you didn't upset customers or business partners (much). If you're a systems engineer, it means uptime and rapid repairs. Among the terms with which you should be familiar: long mean time between failures (MTBF), short mean time to repair (MTTR), hot-swappable components, and Redundant Arrays of Inexpensive Disks (RAID). You also think deeply about such matters as keeping spares within arm's reach, the fact that everyone knows your cell phone number(s), and that you can text nearly as fast as you can type.
The customer side of HA is different. Customers see a combination of qualities: the site loads quickly, links fetch pages or perform actions quickly (or at least predictably), and the experience is satisfying. The words “quickly” and “satisfying” are qualitative and subjective, often open to deep and long arguments, but really “snappy,” responsive sites tend to earn user approval. Users can identify responsiveness even if they can't articulate exactly what metric is behind it: they can find what they want and get their work done. Most users never realize how much systems work goes into making that experience possible, including plentiful redundancy and fault-tolerant infrastructure.
Sometimes real uptime and high availability also mean monitoring social networks for signs of bugs or outages, letting customers ring the alarm bell when no other symptoms are present. External inputs can sometimes be faster than systems reporting their own problems, especially when those systems fall out of communication because they're unlinked, unavailable, or perhaps in the middle of insanity caused by a heretofore unheard-of, unique, or zero-day failure. It might be as simple as: some equipment was stolen.
Certainly uptime is the result of quality code, and vendors (including, if we may be so bold as to mention, SmartBear) publish quality-tracking tools that make work more productive and audit-worthy for team development. In this article, however, the theme is understanding the systems approach to High Availability/HA.
High Availability is a systems approach to the application usage experience from beginning to end, over the application's lifecycle and the infrastructure that supports it. Achieved HA goals keep an application fully functional and tolerant of faults, and provide a method to audit successful availability.
Beyond just keeping an application going, HA is a systems approach to quality-of-experience metrics: fast transaction times, a low rate of support reports, identifiably responsive interactions under peak loads, response times, queue management, resource allocation and load balancing, and inter-related resource availability. Success is an agreed-upon achievement of the application's functionality at the chosen metrics, measurable by audit (internal or external). This includes forensic response times, as well as the interdepartmental liaison actions associated with the application or its bearing on other organizational functions. It also means instance flexibility: virtualization and cloud functionality have become almost mandatory at deployment or redeployment intervals.
Another issue in HA systems is fault tolerance, imbued through systems redundancy. When problems occur on a host or platform, N+1 or 2N redundancy allows client/server transactions to be moved or restarted on the redundant gear. The gear might be a disk, another virtual machine with a “hot” copy of the application running successfully, or a fast instance move from one host to another. There are plentiful methods for aggregating application back-end services as a “farm” by hypervisor and/or host operating system; farm members either balance workloads or allow transactions to “fail over” to other host members of the farm when they cannot complete.
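The failover idea can be reduced to a few lines of client-side logic. The sketch below is illustrative, not a vendor API: it assumes a list of interchangeable farm members and a transaction function that raises `ConnectionError` on a recoverable fault.

```python
# Minimal client-side failover sketch: try each member of a redundant
# "farm" in order until one completes the transaction. All names here
# are illustrative, not a specific vendor API.

def run_with_failover(transaction, backends):
    """Attempt transaction(backend) on each backend; return the first
    successful result, or raise if every member of the farm fails."""
    last_error = None
    for backend in backends:
        try:
            return transaction(backend)
        except ConnectionError as exc:   # treat as a recoverable fault
            last_error = exc             # remember it, then fail over
    raise RuntimeError("all farm members failed") from last_error
```

Real load balancers add health checks and weighting, but the branch structure is the same: prefer the healthy member, surface an error only when the whole farm is exhausted.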
None of these concerns address the nature of the application, which could be as simple as a single read-only Web application. The important bit is keeping things running, available, and with acceptable performance. There may be dynamic pages, and perhaps many internal scripts for various purposes that communicate with business partners. Money might be transacted. Or the application might be more local, and permit database lookups or perhaps transactional interaction with local and/or remote hosts.
The HA component relies on vetted infrastructure poised to take a licking and keep on ticking. Whether the problem is a component, communications, storage, RPC, host, or even client failure, each element of the transaction chain needs a specific systems treatment to optimize measurable outcomes.
Because applications typically see aperiodic, even wiggy loads, each of the teams that must support the application(s) must understand the standards so as to establish an agreed-upon metric for availability. Quality code is but one important element in the delivery chain, and unless the entire delivery chain works in a way that's measurable, often from an audited perspective, HA is only a lofty goal rather than an objective you can measurably achieve. Your code, however, can often adjust to variances in overall systems availability, including network traffic and storage difficulties, and by intention make itself more resilient to problems or anomalies.
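One common way code makes itself resilient to such variances is a circuit breaker: after enough consecutive faults, stop hammering the failing dependency and fail fast, giving the network or storage room to recover. The threshold below is illustrative.

```python
# A minimal circuit-breaker sketch: after max_failures consecutive
# faults, stop calling the dependency and fail fast instead.

class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn, *args):
        if self.open:
            raise RuntimeError("circuit open: dependency assumed down")
        try:
            result = fn(*args)
        except ConnectionError:
            self.failures += 1   # count the fault, then re-raise
            raise
        self.failures = 0        # a success resets the count
        return result
```

Production implementations (Hystrix-style breakers, for example) add a half-open state that probes the dependency after a cooldown; the sketch keeps only the core fail-fast behavior.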
Start by consciously identifying your assumptions. The metrics often measure client experience and the transactions behind that experience. Was the application available to its audience? How quickly and accurately did it perform? Were all the elements available to achieve a transaction in a rational period of time — or better than that? What happened to transactions during outages? How are clobbered transactions committed or resolved when things go awry? How quickly can transactional difficulties be resolved? Do forensics come into play? Will the accounting department get premature baldness from tearing their hair out? Will users leave the site in droves, or are they captive to a line-of-business application?
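When the agreed-upon metric is expressed as a percentage of uptime (the familiar “nines”), the arithmetic behind the questions above is simple. A sketch, using only the standard availability definitions:

```python
# Availability as a fraction of total time, and the downtime budget a
# yearly availability target permits. Plain "nines" arithmetic.

def availability(uptime_s, downtime_s):
    """Fraction of time the service was up."""
    return uptime_s / (uptime_s + downtime_s)

def downtime_budget_per_year(target):
    """Seconds of downtime a yearly availability target allows."""
    return (1.0 - target) * 365 * 24 * 3600

# "Three nines" (99.9%) allows roughly 8.76 hours of downtime a year;
# "five nines" (99.999%) allows only about 5 minutes.
```

Putting numbers like these in front of every team is what turns the arguments in the paragraph above into an agreed, auditable target.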
Numerous systems and network management tools are available to help you measure activities, both internal and external. Where practical, your application can also query system and event logs as easily as management tools can, as a gauge of systems infrastructure metrics and, of course, of error messages encountered through the transaction cycles.
As an example, Anturis allows a website to be tested externally, at intervals, from remote locations. It can send dummy transactions to a specific website, emulating user actions, then compare desired values with those obtained during a test. As a Web-based offsite monitoring tool, Anturis can report either success or failure on an ongoing basis.
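The same style of dummy-transaction probe can be scripted in-house. The sketch below fetches a page and checks for a value the transaction was expected to produce; the URL and marker are placeholders, and the comparison is split out so it can be exercised without a live site.

```python
# A bare-bones synthetic probe in the spirit of external monitoring
# tools: fetch a page and compare what came back to what was expected.
# The URL and expected marker are placeholders.

from urllib.request import urlopen

def page_matches(body, expected_marker):
    """Did the dummy transaction return the value we planted?"""
    return expected_marker in body

def probe(url, expected_marker, timeout=10):
    """True only if the page loads and carries the expected value."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return resp.status == 200 and page_matches(body, expected_marker)
    except OSError:   # DNS failure, refused connection, timeout
        return False
```

Run from several remote locations on a schedule, a probe like this catches the outages your own hosts cannot report on themselves.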
Resources for Coders
Platform and Systems Event-driven Decision Making: Platform availability surrounds the systems approach. Your code has, under certain circumstances, the ability to monitor other components, especially hardware resources, via APIs that expose hardware, network, and shared-resource state. You can allow an application to deal with a lack of availability, or to sustain sessions or transactions until they go cold and then unwind from them gracefully.
IPMI and the DMTF's management standards are both cross-platform API sets that your code can monitor, then act upon where they are present. Each is a repository of systems events, sometimes with streams of information available, much as the Simple Network Management Protocol (SNMP) can offer.
Your users may have these APIs installed without even knowing it, and many hardware products and operating systems support them. You may need to check your company policy before including these in your code, as some of them may be off-limits. Testing both clients and interacting servers/daemons might be permitted — or it might trigger various kinds of alarms (often intruder application-behavior triggers and monitors).
The APIs may be implementation-specific, or generic, or platform-sensitive. Entrance points and RPC functions between your code and other hosts require a valid service bus—a link between your code and wherever a service or daemon lives—such as open ports via Ethernet. Microsoft, VMware, and others use the concept of service buses for interactive, REST communications queries with working APIs. At least we hope they're working.
Systems availability procedures are helpful once the details of checking a service daemon have been worked out through authentication, RPC calls, or API calls. The idea is to monitor systems functions periodically for availability. A function says something like, “Yo, service, are you alive?” The service answers. Your code branches on the result.
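Reduced to its simplest form, the “are you alive?” check is a TCP connect followed by a branch. The host and port below are placeholders, and the degraded path (queue, retry, serve from cache) is whatever your application can honestly do without the service.

```python
# The "Yo, service, are you alive?" check, reduced to a TCP connect.
# Host and port are placeholders for a real service endpoint.

import socket

def is_alive(host, port, timeout=2.0):
    """True if something is accepting connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def handle_request(host="db.internal", port=5432):
    if is_alive(host, port):
        return "primary"    # normal transaction path
    return "degraded"       # queue, retry later, or serve from cache
```

A connect test only proves the port answers; richer checks (an application-level ping, a trivial query) catch the case where the daemon is up but wedged.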
Your code might ask: Is the network, hard drive, share, object, RPC, database server, or something else external working? Do these objects have a condition (disk full, network I/O at 80%, share locked, object passing various qualities) that your code does something with, including discarding lots of stuff to focus on something necessary for your code to stay alive? Does the code act upon an event should a condition be present, like “Uh oh, disk is 87% full and I need more, and this is a Mac, and cache is about to vomit”? Does it report a problem for forensic analysis? To whom? How quickly? With what information? Signed off by whom?
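The disk-full case above can be handled with nothing more than the standard library. A sketch, reusing the 87% figure from the example; the path and the chosen action are placeholders for whatever your policy dictates.

```python
# Acting on a condition before it becomes an outage: check free disk
# space with the standard library and branch on a threshold. The 87%
# figure echoes the example in the text; the path is a placeholder.

import shutil

def disk_percent_used(path="."):
    usage = shutil.disk_usage(path)   # named tuple: total, used, free
    return 100.0 * usage.used / usage.total

def disk_action(percent_used, threshold=87.0):
    """Branch on the condition rather than waiting for the failure."""
    if percent_used >= threshold:
        return "prune-caches-and-alert"   # and report it for forensics
    return "ok"
```

The same shape, a sampled metric compared against an agreed threshold followed by a branch, applies to shares, queues, and connection pools.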
Some management/monitoring API calls emit streams, as in periodic samples of a metric such as network I/O. Network busy? This might indicate a thrashing condition caused by another process, or peak loads in progress. Your code might want to find a different route to a database server, or double-check a circuit's or RPC's availability.
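Deciding when a stream of samples means “thrashing” rather than a momentary spike usually calls for a moving average. The samples below are plain numbers; in practice they would come from an SNMP counter or a management API, and the window and threshold are illustrative.

```python
# Watching a stream of periodic samples (say, link utilization %) and
# flagging a *sustained* busy condition rather than a single spike.

from collections import deque

class BusyDetector:
    def __init__(self, window=5, threshold=80.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def add(self, percent_busy):
        """Record one sample; return True once a full window's moving
        average crosses the threshold (time to try another route)."""
        self.samples.append(percent_busy)
        avg = sum(self.samples) / len(self.samples)
        return (len(self.samples) == self.samples.maxlen
                and avg >= self.threshold)
```

Keying the decision to a full window keeps one bursty sample from triggering a reroute, which is exactly the single-spike false alarm you want to avoid.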
There are various initiatives that use client-resident daemons/services to communicate conditions. One of them is the Intelligent Platform Management Interface (IPMI), an Intel-spawned API set that, in many forms, is a service or daemon your code can query where implemented, often on server platforms. It is good for network and communications queries; recently updated, it also covers IPv6 connectivity and qualities. IPMI keeps track of a long list of platform information.
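One common route to IPMI data from application code is shelling out to the `ipmitool` utility and parsing its pipe-delimited `sensor` listing. Column details vary by firmware, so the sketch parses a captured sample; on a host with a real BMC you would feed it the output of `subprocess.run(["ipmitool", "sensor"], ...)` instead.

```python
# Parse ipmitool-style "sensor" output (pipe-delimited rows) into a
# dict of readings. The SAMPLE text is a hand-captured illustration;
# real column layouts vary by firmware.

def parse_ipmi_sensors(text):
    """Return {sensor_name: (value, status)} for sensors with data."""
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[1] not in ("", "na"):
            readings[fields[0]] = (fields[1], fields[3])
    return readings

SAMPLE = """\
CPU Temp         | 42.000    | degrees C | ok
Sys Fan 1        | 5400.000  | RPM       | ok
PS2 Status       | na        |           | ns
"""
```

Skipping `na` rows matters: absent sensors are routine on most boards and shouldn't be confused with faults.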
The Distributed Management Task Force (DMTF), through specifications such as the Desktop Management Interface (DMI), takes a similar approach to reporting the qualities of desktops rather than servers, especially hardware conditions. If allowed by policy, you can check user-side component availability and qualities in a similar way. In a client-server environment, this lets you establish a baseline of client readiness that completes the transaction chain between client, server, and success.
Segmenting Code: Using breakpoints in code is another common way to trap faulty interactions, then branch to remediation after tests and sanity checks. During development phases, you can use breakpoints to measure interaction times, such as external DB queries, network responsiveness, and other dependencies, against the HA metrics established for the overall experience. This helps reduce the blame game.
One choice is DTrace or a similar tracing framework that renders a real-time analysis of execution dynamics, including CPU usage, network waits, system calls, various caches and their qualities, and more. A variant, VMware's VProbes, solves the similar problem of establishing breakpoints and metric examinations within virtualization frameworks.
The idea is to allow code examinations both to optimize code and to establish realistic response times among system members within the scope and lens of the application code. Some tracing frameworks also allow measurements within browsers and database platforms, and work with any of several languages (ranging from C and Java to Erlang and Ruby). They're a great way to baseline applications so that metrics for high availability and/or fault tolerance are realistic.
Virtualization and Cloud: HA Is Default and Mandatory
The virtualization phenomenon takes running servers and allows them to be re-instantiated at will. Storage, networking, and KVM can all be completely disassociated from compute services. This means that applications must often rely on operating system and virtualization information to keep connections alive and to do work transactionally.
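When an instance is re-instantiated underneath you, connections die and come back, so transactional code needs a reconnect path. A minimal retry-with-backoff sketch, assuming a `connect` function from whatever driver you use and a work function that raises `ConnectionError` on a dropped link:

```python
# Keeping transactional work alive across an instance move: on a
# connection fault, reconnect and retry with exponential backoff.
# connect() stands in for whatever your driver provides.

import time

def with_reconnect(work, connect, attempts=3, backoff=0.5):
    """Run work(conn); on a connection fault, reconnect and retry
    with exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return work(connect())
        except ConnectionError:
            if attempt == attempts - 1:
                raise                      # out of retries: surface it
            time.sleep(backoff * (2 ** attempt))
```

The backoff matters: retrying instantly against a host that is mid-migration just adds load at the worst possible moment.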
Each virtualization vendor has a High Availability initiative, and hypervisor vendors often support OS-hosted application monitoring to ensure availability. As an example, VMware, in its 5.5 platform update, has a method to monitor Oracle, SQL Server, and other applications so that they can be restarted if they fail somehow. This affects code utilizing those services; read-after-write transactional methods can help maintain high availability, but at the cost of knowing how these APIs work so that your code doesn't blow up. Microsoft's System Center 2012 does similar things for Microsoft's long list of applications. And platforms like Apache and nginx have fault-tolerance components that can be fiendishly difficult to use, yet almost infallible when faults occur, once you garner an understanding of how they complete transactions and react to faults during transactions.
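The read-after-write idea mentioned above is simple in miniature: write, then read the value back through the same service before treating the transaction as committed. The in-memory store below stands in for whatever monitored service (database, queue) the hypervisor may restart underneath you.

```python
# Read-after-write verification in miniature: a write only counts as
# committed if a verifying read through the same service returns it.
# Store is a stand-in for a real database or queue client.

class Store:
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

def committed_write(store, key, value):
    """Return True only if the write survived a verifying read."""
    store.write(key, value)
    return store.read(key) == value
```

If the service was restarted mid-write, the verifying read fails and your code can retry or escalate rather than silently losing the transaction.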
Using the cloud extends the systems domain away from the confines of the target organization, but similar principles apply. Management APIs and cloud vendors often have alerting functions for traffic, authentication, or service/communications bus problems that can be monitored. If you use provisioning initiatives such as OpenStack, Azure, or CloudStack, well-known communications bus links allow direct communications with instances (applications and hosts) using very simple systems calls to monitor your applications and the platform where they live.
High Availability isn't an accident, and none of us is that lucky. Catastrophic events and Murphy's Law teach us that anything that can go wrong will, but arguments tend to cease when you have real data against an agreed-upon set of availability metrics. Depending on HA mandates, applications have an increasing number of resources available, driven not only by organizational policies (and audit/compliance requirements) but also by programmer initiatives to bulletproof code, making it more flexible and resilient through knowledge of underlying systems state.
Each responsible member's use of HA metrics isn't enough on its own; everyone needs to communicate about the desired state, and also about what happens when the undesired occurs while you're on the beach on vacation.