Don’t Fool Yourself: Load Testing APIs and Scale
A typical web application uses a combination of local code and external services.
Adding a Twitter feed to your application's interface is cheap and easy, and using Facebook or Google for login can speed up development by reusing functionality those companies have already built. That means third-party APIs.
Once the programming team knows how to utilize these APIs, they'll likely want to develop their own microservices on the back-end. So once you serve the web page, the application itself calls internal services to return a list of search results or tags, or the second or third page of results.
What will the user see if Twitter is down or slow? What if the server that hosts multiple services suddenly has load from a different service, one we did not expect?
We don't know. So we have to test.
If the services are straightforward (say, a simple web application that uses Google's OAuth API for authentication), then we can performance test the front end, mock out a Google outage to simulate failures under load, and call it a day. We might want to do a little more and simulate not just an outage at Google but a slow response. What is that like for the user?
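As a rough sketch of that idea, the Python below swaps the real provider for an in-process stand-in that can be healthy, slow, or down, and drives it with a small thread pool. Everything here is an assumption made for illustration: `FakeAuthProvider`, `run_load`, the mode names, and the delay values are hypothetical, and a real test would point your load tool at a stub that speaks the provider's actual protocol.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeAuthProvider:
    """Hypothetical in-process stand-in for an external login service."""

    def __init__(self, mode="ok", delay_seconds=0.0):
        self.mode = mode                    # "ok", "slow", or "down"
        self.delay_seconds = delay_seconds  # only used in "slow" mode

    def authenticate(self, user):
        if self.mode == "down":
            raise ConnectionError("simulated outage")
        if self.mode == "slow":
            time.sleep(self.delay_seconds)
        return {"user": user, "authenticated": True}


def run_load(provider, users, workers=8):
    """Call the provider concurrently; return (successes, failures, latencies)."""

    def one_call(user):
        started = time.perf_counter()
        try:
            provider.authenticate(user)
            ok = True
        except ConnectionError:
            ok = False
        return ok, time.perf_counter() - started

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_call, users))

    successes = sum(1 for ok, _ in results if ok)
    latencies = [elapsed for _, elapsed in results]
    return successes, len(results) - successes, latencies


if __name__ == "__main__":
    users = [f"user{i}" for i in range(20)]
    for provider in (FakeAuthProvider("ok"),
                     FakeAuthProvider("slow", delay_seconds=0.05),
                     FakeAuthProvider("down")):
        ok, failed, latencies = run_load(provider, users)
        print(provider.mode, ok, failed, f"max={max(latencies):.3f}s")
```

Running the same load against all three modes gives you three answers to "what does the user see?" instead of one.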
It's tempting to ignore that one. If Google is down, what are you going to do? The short answer is to set a timeout, trap the error, and send a friendlier message, such as "Google authentication does not seem to be responding, try again later (if you can get to your Gmail but get this message, call us at support_contact_info)."
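Here is a minimal sketch of "set a timeout, trap the error, send a friendlier message" using only the Python standard library. The function name `call_provider`, the `opener` parameter, the three-second default, the example URL, and the two fake openers are all assumptions added so the behavior can be exercised without a network; the message text is a shortened version of the one above.

```python
import io
import socket
import urllib.error
import urllib.request

FRIENDLY_MESSAGE = (
    "Google authentication does not seem to be responding, "
    "try again later."
)


def call_provider(url, timeout_seconds=3.0, opener=urllib.request.urlopen):
    """Return (ok, payload_or_message) without letting a slow provider hang us."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return True, response.read()
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # URLError covers DNS failures, refused connections, and HTTP errors;
        # socket.timeout covers a provider that accepts but never answers.
        return False, FRIENDLY_MESSAGE


# Hypothetical openers that stand in for the network during a test.
def slow_opener(url, timeout):
    raise socket.timeout("simulated slow provider")


def healthy_opener(url, timeout):
    return io.BytesIO(b"token")


if __name__ == "__main__":
    url = "https://auth.example.invalid/token"  # placeholder, not a real endpoint
    print(call_provider(url, opener=healthy_opener))
    print(call_provider(url, opener=slow_opener))
```

Passing the opener in as a parameter is a design choice for testability: the same code path runs in production and under a simulated outage.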
Sometimes the services are more complex, with chains of events or use cases more involved than anything a public user does through the web front end. Traditional web-based performance testing fails to identify these cases.
Let's talk about it.
When the Going Gets Tough
Consider an internal service used for warehouse routing and trafficking. That service goes haywire, clogging up the machine that also hosts the login, search, and checkout APIs. These services run on different virtual servers but, thanks to server virtualization, they share the same physical machine. Suddenly the whole machine slows down, which might be fine, unless we have a chain of APIs, a dependency where one service calls another. As we saw with healthcare.gov, these dependency chains mean that four slow transactions, each taking sixty seconds, add up to at least four minutes just to get the data out of the building, not even to the end customer.
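The arithmetic behind that chain is simple enough to write down. This is only a sketch: it assumes each call waits for the previous one to finish, ignores network and queueing overhead (which is why the result is a lower bound), and the function name is made up for illustration.

```python
def chain_latency_seconds(call_latencies):
    """Lower bound on end-to-end time when each call waits for the previous one."""
    return sum(call_latencies)


slow_chain = [60, 60, 60, 60]  # four dependent calls, sixty seconds each
total = chain_latency_seconds(slow_chain)
print(total, "seconds =", total / 60, "minutes")  # 240 seconds = 4.0 minutes
```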
That might be an extreme example but these problems are not just theory; two of the companies I have worked with lately mix internal and external APIs, often on the same virtual hosts. Most of the companies I work with have a shared database for at least core transactions — customers, orders, and so on.
One of my colleagues, working at a large public-cloud company, was trying to test APIs in the test environment. Management wanted a simple, overnight test run that showed "green results." The problem was that the environment was in such a state of flux that he could not get consistent results. Test servers would be up for half an hour, then down, often because the test machines were on shared servers. In production they would be separate, but in test, everyone was trying to test at roughly the same time overnight.
That's an infrastructure problem, not a testing problem. Testing can do its job and shine a light on the problem, but until the issue was resolved my friend could not give a clear "green light" on test results, at least not without spending an hour or two each day massaging test data and re-running scenarios that had failed overnight, along with making a few assumptions.
Doing it Better
Thinking of API performance testing as a test, like a high school geometry test, might be half the problem. Good performance testing is not a set of clear multiple-choice questions; it is an investigation. We survey the risks, then ask what is appropriate to do to nail each one down. That generally means simulating the software as it will be used. One tool that can help here is a diagram of the systems along with a dependency graph: which APIs my software uses, which machines those services run on, and what other services share those machines. To load test the internal API, for example, we might need to run the web performance tests at the same time, to simulate real use, and then look at the results for both.
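The dependency graph does not need a fancy tool; two small tables are enough to start asking useful questions. The sketch below is built entirely on invented data: the service names, host names, and both helper functions are hypothetical, and in practice you would fill the tables from your own architecture diagram.

```python
# Hypothetical inventory: which services each one calls, and where each runs.
CALLS = {
    "web": ["login", "search"],
    "search": ["tags"],
    "login": [],
    "tags": [],
    "warehouse-routing": [],
}
HOSTS = {
    "web": "host-a",
    "login": "host-b",
    "search": "host-b",
    "tags": "host-c",
    "warehouse-routing": "host-b",
}


def transitive_dependencies(service, calls):
    """Every service reachable from `service` by following calls."""
    seen = set()
    stack = list(calls.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(calls.get(current, []))
    return seen


def noisy_neighbors(service, calls, hosts):
    """Services that share a host with `service` or its dependencies
    but are not part of its own call chain."""
    chain = transitive_dependencies(service, calls) | {service}
    chain_hosts = {hosts[name] for name in chain}
    return {
        name: host
        for name, host in hosts.items()
        if host in chain_hosts and name not in chain
    }


print(sorted(transitive_dependencies("web", CALLS)))  # ['login', 'search', 'tags']
print(noisy_neighbors("web", CALLS, HOSTS))  # {'warehouse-routing': 'host-b'}
```

In this made-up inventory, the warehouse routing service never appears in the web application's call chain, yet it shares a host with login and search, so it belongs in the load test plan.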
When I do this work, I typically find a choke point, the shared resource that will break first, and roughly the amount of load it will take to break it. Sometimes I find odd conditions that can cause failure. My report looks more like a survey of risks and known issues ("if this happens, we'll likely see that result") than a yes/no answer. It's rare that I can get real performance requirements from management, so instead of discussing the service level agreement (SLA), I express the service level capability (SLC): how the system is actually performing.
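One way to put numbers on an SLC is to summarize the latencies observed at each load level and note where they start to degrade. This is a sketch under stated assumptions, not a standard: the choice of median and 95th percentile, the nearest-rank method, the function names, and the `p95_limit_ms` threshold are all mine, and the sample numbers in the usage lines are invented.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    # Ceiling division on integers avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def service_level_capability(latencies_ms, error_count):
    """Summarize observed behavior at one load level."""
    total = len(latencies_ms) + error_count
    return {
        "requests": total,
        "error_rate": error_count / total,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "max_ms": max(latencies_ms),
    }


def first_breaking_load(results_by_load, p95_limit_ms):
    """Lowest load level whose p95 exceeds the limit, or None if none does."""
    for load in sorted(results_by_load):
        if percentile(results_by_load[load], 95) > p95_limit_ms:
            return load
    return None


# Invented measurements: concurrent users -> latencies in milliseconds.
measured = {
    10: [100] * 20,
    50: [120] * 20,
    100: [120] * 18 + [900, 950],
}
print(service_level_capability(measured[100], error_count=0))
print(first_breaking_load(measured, p95_limit_ms=500))  # 100
```

The threshold is only there to locate the choke point; the report itself is the table of observed numbers per load level.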
The list is sorted by risk, with the biggest problems up front. If the SLCs and the first few problems are not big issues for management, then we don't have a problem. If they are, then we have an engineering problem, not a test problem. (Although identifying which components will fail, and when, allows me to help with engineering.)
The Bottom Line
Coordinating load tests under shared systems, checking when external APIs are slow (or flaky) — these things can be hard to set up and perform. Several times in my career I've been in that position, with risks that are hard to test for. To simulate load accurately I would have to turn this off, simulate that, separate this thing from another, change a variable or two, and rerun an entire process. All because I thought of just one more aspect, one more load pattern. It was always tempting to say that it was an edge case, that the programmers knew it should work, to call the work so far "good enough".
Then reality would set in. If I thought it was too hard to test, and I was the specialist, then the programmers probably hadn't thought of it at all. So we would test and, of course, the system would fail. It was a lot of work, but that was okay; that is what they were paying me for.
In other words, performance testing APIs is more like an investigation than a simple input/pass/fail exercise. Picture yourself as Columbo, the police detective who always has "just one more thing" to ask, trying to connect the dots on the system. It won't be easy, but there's a much better chance you won't be fooled.
That is a task worth doing.