What is Real-User Monitoring?

Written by Tom Huston

Introduction

When it comes to application performance management, deciding on the best tools and techniques to use can get complicated, fast. That’s why it’s helpful to remember that the ultimate purpose of APM is simply to determine two things: (1) how your end-users are actually experiencing your website or mobile application and (2) how such data can be translated into actionable insights to achieve DevOps and business goals. And to get as close as possible to hitting both criteria, there may be no better starting point than adopting an approach that incorporates real-user monitoring (RUM).

RUM (pronounced like every pirate’s favorite beverage) is, as its name implies, an approach to Web monitoring that aims to capture and analyze every transaction of every user of your website or application. Also known as real-user measurement, real-user metrics, or end-user experience monitoring (EUM), it’s a form of passive monitoring, relying on Web-monitoring services that continuously observe your system in action, tracking availability, functionality, and responsiveness. While some “bottom-up” forms of RUM rely on capturing server-side information in order to reconstruct end-user experience, “top-down” client-side RUM can see, directly, how real human beings interact with your application and what the experience is like for them. By using local agents or small bits of JavaScript to gauge site performance and reliability from the perspective of client apps and browsers, top-down RUM focuses on the direct relationship between site speed and user satisfaction, providing valuable insights into ways you can optimize your application’s components and improve overall performance.

Unlike active monitoring, which attempts to gain Web performance insights by regularly testing synthetic interactions, RUM cuts through the guesswork by seeing exactly how your users are interacting with—or being prevented from interacting with—your site or app. From page load times to traffic bottlenecks to global DNS resolution delays, RUM offers a penetrating, top-down view of a wide range of frontend browser, backend database, and server-level issues as they’re actually experienced by all of your end-users, everywhere.



A good RUM tool, for example, can show you a clickstream analysis of how and when a specific user made it halfway through an e-commerce checkout process and then clicked away to another page, while also revealing that the majority of users from Abu Dhabi using Safari are suddenly getting lost in network translation, timing out before your homepage loads.
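
The second half of that example, spotting that one location-and-browser combination is timing out, comes down to segmenting page-load records. The following is a minimal sketch only, assuming each record (here called a "beacon") carries hypothetical `city`, `browser`, and `load_ms` fields and that a 10-second cutoff counts as a timeout; real RUM products define their own schemas and thresholds.

```python
from collections import defaultdict

def timeout_rate_by_segment(beacons, timeout_ms=10_000):
    """Group page-load beacons by (city, browser) and return each segment's timeout rate."""
    totals = defaultdict(int)
    timeouts = defaultdict(int)
    for b in beacons:
        segment = (b["city"], b["browser"])
        totals[segment] += 1
        # Assumption: a missing load time means the page never finished loading.
        if b["load_ms"] is None or b["load_ms"] > timeout_ms:
            timeouts[segment] += 1
    return {seg: timeouts[seg] / totals[seg] for seg in totals}

# Hypothetical sample data, for illustration only.
beacons = [
    {"city": "Abu Dhabi", "browser": "Safari", "load_ms": None},
    {"city": "Abu Dhabi", "browser": "Safari", "load_ms": 12_500},
    {"city": "Abu Dhabi", "browser": "Safari", "load_ms": 2_100},
    {"city": "Abu Dhabi", "browser": "Chrome", "load_ms": 1_800},
    {"city": "Boston", "browser": "Safari", "load_ms": 1_400},
]
rates = timeout_rate_by_segment(beacons)
```

With this sample, two of the three Safari loads from Abu Dhabi count as timeouts, while the other segments show none, which is the kind of contrast a segmented report is meant to surface.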

By monitoring every user interaction of a site—including surprising behavior that would otherwise be nearly impossible to predict and test for ahead of time—RUM goes beyond simple up/down availability monitoring, providing end-to-end transaction reporting and analysis that makes it easy to pinpoint where problems are occurring and, therefore, how they might best be resolved.

Understanding Real-User Monitoring

Perhaps the most familiar and basic example of real-user monitoring is Google Analytics, which passively monitors a certain spectrum of interactions between users and your website (such as page views, click paths, browser versions, and traffic sources) and delivers reports accordingly, using broad, user-averaging and data-sampling algorithms. As every novice Web developer knows, Google Analytics can be a great tool for gaining a high-level, generalized perspective on site performance and user profiles, showing how your website performs in a given country or browser and providing a nice introduction to RUM. But for the professional developer or system administrator for whom Web performance monitoring (WPM) can be a full-time job, specialized RUM software is generally required to determine site response times and availability, 24/7, and to drill down to a depth of detailed, multi-tiered analysis that more basic tools like Google Analytics cannot reach.

In a way, this distinction is comparable to the difference between a standard DVD and a 3D Blu-ray disc—largely a matter of resolution, detail, and dimensionality—although our metaphorical media formats are also representing slightly different genres. RUM-specific software focuses on analyzing site performance from the user’s perspective, while Google Analytics focuses more on profiling the users themselves.

In order to track such a high resolution of data, full-featured RUM solutions typically employ the automated injection of small bits of JavaScript code to monitor the client-side experience of application performance, tracking key identifying information and milestones as a user waits for an application to load or clicks through a site. These key events may include, among other things, DNS resolution, TCP connect time, SSL encryption negotiation, first-byte transmission, navigation display, page render time, TCP out-of-order segments, and user think time. Some RUM systems collect additional information as well, such as network provider, OS, browser version, user location, application version, mobile device specs, connection type (e.g., wi-fi, 4G, LTE, or EDGE), network latency, and available end-to-end bandwidth.
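
To make a few of those milestones concrete, here is a rough sketch of how raw timestamps from one page load might be turned into per-phase durations. It assumes the JavaScript snippet reports millisecond timestamps using field names borrowed from the browser Navigation Timing API (`domainLookupStart`, `connectEnd`, and so on); the article does not say which fields any particular product collects, so the payload shape and the `timing_breakdown` helper are illustrative only.

```python
def timing_breakdown(t):
    """Turn raw millisecond timestamps from one page load into per-phase durations."""
    # secureConnectionStart is 0 (or absent) when the page was loaded over plain HTTP.
    secure_start = t.get("secureConnectionStart", 0)
    if secure_start:
        tcp_ms = secure_start - t["connectStart"]
        tls_ms = t["connectEnd"] - secure_start
    else:
        tcp_ms = t["connectEnd"] - t["connectStart"]
        tls_ms = 0
    return {
        "dns_ms": t["domainLookupEnd"] - t["domainLookupStart"],
        "tcp_connect_ms": tcp_ms,
        "tls_ms": tls_ms,
        # Time from sending the request to receiving the first byte of the response.
        "first_byte_ms": t["responseStart"] - t["requestStart"],
        # Rough proxy for render time: last byte received until the load event ends.
        "render_ms": t["loadEventEnd"] - t["responseEnd"],
        "total_ms": t["loadEventEnd"] - t["startTime"],
    }

# Hypothetical timestamps for a single page load.
sample = {
    "startTime": 0,
    "domainLookupStart": 5, "domainLookupEnd": 45,
    "connectStart": 45, "secureConnectionStart": 75, "connectEnd": 130,
    "requestStart": 131, "responseStart": 310, "responseEnd": 420,
    "loadEventEnd": 1650,
}
breakdown = timing_breakdown(sample)
```

In this sample the network phases (DNS, TCP, TLS) account for well under a tenth of the total, so the breakdown would point toward rendering rather than connectivity as the place to look first.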

Specific monitoring details aside, however, most instances of RUM (including server-side techniques) will cover the following general criteria for application performance management, as detailed by Alistair Croll and Sean Power in the book Complete Web Monitoring, published by O’Reilly:

  1. Capture. The monitoring system captures page and object hits from several sources—JavaScript on a browser, a passive network tap, a load balancer, or a server and its logfiles.

  2. Sessionization. The system takes data about these hits and reassembles it into a record of the pages and components of individual visits, along with timing information.

  3. Problem detection. Objects, pages, and visits are examined for interesting occurrences—errors, periods of slowness, problems with navigation, and so on.

  4. Individual visit reporting. You can review individual visits re-created from captured data. Some solutions replay the screens as the visitors saw them; others just present a summary.

  5. Reporting and segmentation. You can look at aggregate data, such as the availability of a particular page or the performance on a specific browser.

  6. Alerting. Any urgent issues detected by the system may trigger alerting mechanisms.
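
Steps 2, 3, and 6 above can be sketched in a few lines. This is not how Croll and Power or any vendor implement them; it is a simplified illustration that assumes each captured hit is a record with hypothetical `user`, `ts` (seconds), `url`, `status`, and `load_ms` fields, that 30 idle minutes separate one visit from the next, and that a page slower than 3 seconds counts as a problem.

```python
from collections import defaultdict

SESSION_GAP_S = 30 * 60  # assumed: 30 idle minutes start a new visit
SLOW_PAGE_MS = 3000      # assumed slowness threshold

def sessionize(hits):
    """Step 2: group hits by user, splitting into visits on long idle gaps."""
    by_user = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["ts"]):
        by_user[hit["user"]].append(hit)
    visits = []
    for user_hits in by_user.values():
        current = [user_hits[0]]
        for hit in user_hits[1:]:
            if hit["ts"] - current[-1]["ts"] > SESSION_GAP_S:
                visits.append(current)
                current = [hit]
            else:
                current.append(hit)
        visits.append(current)
    return visits

def detect_problems(visit):
    """Step 3: flag server errors and slow pages within one visit."""
    problems = []
    for hit in visit:
        if hit["status"] >= 500:
            problems.append(("error", hit["url"]))
        elif hit["load_ms"] > SLOW_PAGE_MS:
            problems.append(("slow", hit["url"]))
    return problems

def should_alert(visits, max_bad_fraction=0.5):
    """Step 6: alert when more than the given fraction of visits hit a problem."""
    bad = sum(1 for v in visits if detect_problems(v))
    return bad / len(visits) > max_bad_fraction

# Hypothetical captured hits (step 1 is assumed to have happened already).
hits = [
    {"user": "a", "ts": 0, "url": "/", "status": 200, "load_ms": 800},
    {"user": "b", "ts": 10, "url": "/checkout", "status": 500, "load_ms": 300},
    {"user": "a", "ts": 60, "url": "/cart", "status": 200, "load_ms": 4200},
    {"user": "a", "ts": 7260, "url": "/", "status": 200, "load_ms": 700},
]
visits = sessionize(hits)
```

Here user "a" ends up with two visits because of the two-hour gap, and two of the three visits contain a problem, so `should_alert(visits)` returns `True` under the assumed 50% threshold.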

Like the JavaScript installed across any site tracked by Google Analytics, and like other lightweight WPM tools, RUM software is intended to be as transparent and unobtrusive as possible. While some organizations in the past have avoided engaging in full-scale RUM for fear of unnecessarily slowing down their already latency-sensitive systems, most new APM suites, such as SmartBear’s Lucierna software, are capable of conducting real-user monitoring of 100% of client-server transactions and 100% of code in high-volume production environments while incurring minimal transaction latency and CPU overhead (around 1-2%).

Also, any perceived application sluggishness incurred by conducting real-user monitoring is easily mitigated by developers and sysadmins responding to RUM insights and making the effort to improve an application’s performance, which is why one conducts Web monitoring in the first place. The plethora of insights gained from real-user monitoring therefore obviates any latency concerns, making it harder than ever to deny the value of investing in a RUM system. Plus, such systems are easier than ever to install and use.

The Limitations of RUM

For all its benefits, though, the sheer volume of data generated by real-user monitoring can have its downsides. Compared to more superficial approaches to Web monitoring, RUM’s attention to detail generates data in direct proportion to traffic: 100 users produce roughly 100 times the transaction data of a single user. This level of precision naturally results in a more accurate diagnosis of end-user experience, both individually and in collective segmentations, but responding to specific issues can prove unwieldy if your tools lack the capacity to conduct root-cause analysis and to generate intelligent, prioritized decision analytics, leaving the bulk of data analysis and next steps up to a likely overwhelmed and under-resourced DevOps crew. Still, no matter how basic or sophisticated your RUM solution is, it’s generally hard to complain about having too much accurate information.

A more serious limitation of RUM is this: Even when you’re continuously monitoring 100% of your real users, how can you monitor the performance of a new app function or site feature ahead of time, before re-deployment? And what about real-time monitoring during periods of low engagement, such as late at night, when there are far fewer people using your application? Is RUM the best way to check performance then?

This is where active monitoring—also known as synthetic-user monitoring (SUM) or, more commonly, synthetic transaction monitoring (STM)—reasserts its usefulness, offering a way to fill the gaps left by the more passive, real-time RUM. Synthetic monitoring works by issuing automated, simulated transactions from a robot client to your application in order to mimic what a typical user might do. These server calls and testing scripts become “monitoring” tools by running at set, regular intervals—say, every 15 minutes—and can be issued from a single designated STM client browser or from multiple browsers at different server locations to better gauge site availability and responsiveness, globally. In this way, STM can give you a steady, solid baseline on which to monitor server and application performance, 24/7, even during periods of low user engagement. Moreover, because it consists of test scripts—simulating an end-user’s clickstream through basic navigation, form submission, shopping-cart transactions, or even online gaming—STM can run in private test environments before deploying new features or during regular offline maintenance, revealing potential obstacles before real users have the chance to run into them.
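
A single synthetic transaction of the kind described above might look like the following sketch. The step names, the `example.com` URLs, the 2-second limit, and the `run_synthetic_check` helper are all hypothetical; the `fetch` argument is passed in so that the same script can be pointed at a test environment, a production site, or (as here) a stand-in function.

```python
import time

def run_synthetic_check(steps, fetch, max_ms=2000):
    """Run one scripted transaction, stopping at the first step that fails or is too slow."""
    results = []
    for name, url in steps:
        started = time.monotonic()
        try:
            status = fetch(url)
        except Exception as exc:
            # Network errors and timeouts count as a failed step.
            results.append({"step": name, "ok": False, "reason": repr(exc)})
            break
        elapsed_ms = (time.monotonic() - started) * 1000
        ok = status == 200 and elapsed_ms <= max_ms
        results.append({"step": name, "ok": ok, "status": status, "elapsed_ms": elapsed_ms})
        if not ok:
            break
    return results

# A simulated clickstream through a hypothetical shop.
steps = [
    ("home", "https://example.com/"),
    ("add_to_cart", "https://example.com/cart"),
    ("checkout", "https://example.com/checkout"),
    ("confirm", "https://example.com/confirm"),
]

def fake_fetch(url):
    """Stand-in for a real HTTP client: the checkout page is broken."""
    return 503 if url.endswith("/checkout") else 200

results = run_synthetic_check(steps, fake_fetch)
```

To turn this into monitoring, a scheduler such as cron would run the check at the chosen interval, and `fake_fetch` would be replaced by a real client, for example a small wrapper around `urllib.request.urlopen(url, timeout=10).status`. Note that `urlopen` raises an exception for HTTP error responses, which the `except` branch above records as a failed step.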

But “real users” are where the limitations of STM come into the picture, because even the most thorough and prescient DevOps professional, running a regular bevy of synthetic tests, can’t predict exactly what real humans will do. This is why STM can be said to focus more on availability while RUM focuses on functionality. Any given user’s particular path through a site or app is rarely as straightforward (or problem-free) as repeated synthetic scripts like to imagine, as Andrew McHugh explains:

How often have you hit the “back” button before checking out to confirm some details about an item? Have you ever closed a check out page only to return to it later in the day—and have you ever returned directly to the page by highlighting its cached URL in the address bar as you type in the website?

These are just a few examples of common variables that interrupt a user’s intended click path—each deviation representing a possible performance issue, none of which you would be aware of (unless you’re waiting to hear about it on Twitter) without RUM.

So even though synthetic transaction monitoring can overcome some of RUM’s biggest limitations, offering a great way to fill in its gaps, there’s still far more that can be done with real-user monitoring. (Especially since RUM also provides the best way to formulate and calibrate synthetic tests.) Speaking at Velocity 2012, performance expert Patrick Lightbody suggested that a healthy balance of STM and RUM is ideal for achieving Web performance optimization goals, segmenting an ideal APM strategy into 25% internal ops and network monitoring, 25% STM, and 50% RUM. Yet to your boss and your end-users, he added, “all that matters is the external performance as perceived by real users,” which in his model would mean shifting the perceived focus on RUM to somewhere around 95%.

But no matter how your company chooses to divvy up the balance of power, it’s clear that in any APM solution, thanks to its customer-centric focus and depth of actionable data, real-user monitoring should have the upper hand.

The Future of Real-User Monitoring

In this always-on, ever-more-mobile era, the ability to monitor real users in real time is becoming an increasingly critical weapon in any organization’s APM arsenal. But the potential benefits of investing in any Web monitoring tools must, as always, be weighed against actual business objectives. And depending on your organization’s priorities, basic Web-analytics tools coupled with simple synthetic monitoring techniques like server-pinging may work just fine. That said, in a recent report issued by Gartner, analyst Jonah Kowall advised companies to “reduce staff and monetary investments in synthetic monitoring and redirect them to higher value activities, such as real-user monitoring”—again suggesting that, at least for larger businesses, RUM offers the most bang for the buck.

With this in mind, let’s introduce just one more acronym to this already highly acronymous article: BTM, for business transaction management, which basically means application performance management done with business goals firmly in mind. Even if you decide to prioritize RUM, coupling it with STM to achieve a perfectly holistic balance, the data you generate will still only be as good as your ability to respond to it effectively. “Measurement metrics,” observes eBay’s Steve Lerner, “must contain a path for action.”

This is precisely where APM software solutions like SmartBear’s Lucierna excel. With clear, easy-to-use interfaces that display the full transaction paths of real users across frontend, backend, and network tiers, tools like these can reveal the multiple dimensions at play in any end-user engagement. Add to that robust root-cause analysis that quickly identifies the source of trouble spots, along with smart decision analytics that can predict the value of potential fixes, and the way to make real meaning out of vast repositories of RUM data becomes clear. Without this level of clarity and event prioritization, even the best-intentioned APM approach can fail to translate into an efficient path to healthy ROI, resulting in your BTM suffering and your company being, potentially, SOL. And no one wants that, now, do they?

Further Resources