What Am I Really Looking At? Monitoring Your System For Clues
Let me set the scene… It’s the end of a long week. We’ve had lots of stories being tested, lots of functional details being reviewed and compared, and quite a bit of exploration going into the nooks and crannies of the app. All this to make sure that we are covering “everything.” After several rounds, revisions and reworking of issues, we feel confident that we are ready for our next release. Ah, but there’s one thing we need to do first.
No release goes out to the public until we've taken a close look at it first on our own staging servers. That’s where our company does all of its work; if it’s not good enough for us, it certainly won't be good enough for our customers.
We make the update, interact with the product, and at first, everything looks clean. A day goes by, and we feel confident that we have a winner… until we receive an email from our Ops team. Our log files had grown exponentially in a short period. Wait, what?! That can’t be right.
However, as we examined them, we saw telltale signs that, indeed, multiple interactions with objects were creating error conditions, and these errors were being dutifully logged. The logs kept swelling, to the point that they were threatening to fill up the file system.
We all tested and reviewed the data, and hunted down the root problem (displaying certain content as a preview). Fortunately, the situation was resolved, a fix was made and deployed, and we all breathed a sigh of relief. Naturally, during our retrospective, the big question came up: why did we miss this issue in development and testing?
The answer was a simple but often overlooked aspect of programming and testing. We didn’t check the logs to see if something looked out of place.
Log files serve an important purpose. They allow the users to go through and see what is happening on their systems and track down potential “rogue issues.” They can provide valuable clues as to where and when an action took place (such as a DDoS attack) or didn’t take place (queries to the database not returning the right values). They are important, but the fact remains: they have to be reviewed and monitored regularly to actually be of value to programmers and testers. Seems simple, until we get into those log files, and are confronted with a kitchen sink of information that, while valuable, suffers from “too much all at once.”
The reaction I have, more times than not, when I start digging into various log files is, “What am I actually looking at?!”
When we think of monitoring, we are interested in the relative health of our systems, and in finding out what is happening on our machines that is measurable. Log files, in many ways, act as early warning systems for our application. They let us know whether our application has a “cold” or not. Just as we check our temperature and the way our body feels to determine if we might be coming down with a cold or a flu bug, we can see if something abnormal is happening in our application by tailing and watching log files.
Front-end interactions may not look unusual when we are testing one to one, and a problem may not show up under moderate load testing either. If we are tracking the logs while we run our tests (manual or automated), though, we can occasionally see large spikes in the log data. One or two lines generated at a given event may be normal transactional behavior. Large blocks of data written to the log in short bursts, with a pause between the bursts, can pinpoint areas that we would want to investigate more closely.
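To make the idea of spotting spikes a little more concrete, here is a minimal Python sketch. It assumes each log line begins with an ISO-8601 timestamp (for example 2024-01-05T14:03:27), which may not match your application's format, and the function names and the "ten times the median" threshold are my own illustrative choices rather than any standard.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def lines_per_minute(lines):
    """Count log lines per minute.

    Assumes (hypothetically) that each line starts with a 19-character
    ISO-8601 timestamp such as 2024-01-05T14:03:27.
    """
    counts = Counter()
    for line in lines:
        try:
            minute = datetime.fromisoformat(line[:19]).replace(second=0)
        except ValueError:
            continue  # no parseable timestamp, e.g. a stack trace line
        counts[minute] += 1
    return counts


def find_spikes(counts, factor=10):
    """Return the minutes whose line count exceeds `factor` times the median."""
    if not counts:
        return []
    baseline = max(median(counts.values()), 1)
    return sorted(m for m, n in counts.items() if n > factor * baseline)
```

Feeding it the lines of a log captured during a test run (for instance, `find_spikes(lines_per_minute(open(path)))`, where `path` points at your own log) gives a short list of minutes worth a closer look, instead of thousands of lines to scroll through.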
Log files are helpful, but why stop there?
There are many tools that programmers and testers can use, and several just within the operating system and the browsers they are using. Keeping track of CPU usage, memory, disk access, and network interface utilization can be as simple as using internal system tools on your Mac, PC or Linux box. There are also free and nearly free tools to help you see what your system is doing. These data points are valuable while you are running tests or walking through your application.
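You don't even need a dedicated tool to get started. As a rough sketch, the Python standard library alone can record how large a log file is and how full the disk is before and after a test run. The function names and the dictionary layout here are hypothetical, chosen only for illustration.

```python
import os
import shutil
import time


def resource_snapshot(log_path, volume_path="."):
    """Return a timestamped reading of log size and disk usage.

    `volume_path` is any path on the disk you want to measure.
    """
    usage = shutil.disk_usage(volume_path)
    return {
        "time": time.time(),
        # A log that does not exist yet is treated as zero bytes.
        "log_bytes": os.path.getsize(log_path) if os.path.exists(log_path) else 0,
        "disk_used_pct": round(100 * usage.used / usage.total, 1),
    }


def log_growth(before, after):
    """Bytes the log grew between two snapshots."""
    return after["log_bytes"] - before["log_bytes"]
```

Take one snapshot before a test session and one after. If a handful of clicks added megabytes to the log, that is exactly the kind of clue that would have caught our staging problem earlier.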
Additionally, all of the modern browsers have network monitoring capabilities that show how transactions are created, measured and displayed. The developer tools can trace all of the interactions and plot them out to be viewed either while they are running in isolation on one machine, or with automated tests running underneath.
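Most browser developer tools can also export the network trace as a HAR (HTTP Archive) file, which is plain JSON. The sketch below ranks the requests in such an export by total time. It relies only on the standard HAR fields `log.entries`, `time`, `request.url` and `response.status`; the function name and the `top` parameter are my own.

```python
import json


def slowest_requests(har_text, top=5):
    """Return (url, status, milliseconds) for the slowest requests in a HAR export."""
    har = json.loads(har_text)
    entries = har["log"]["entries"]
    ranked = sorted(entries, key=lambda entry: entry["time"], reverse=True)
    return [
        (entry["request"]["url"], entry["response"]["status"], entry["time"])
        for entry in ranked[:top]
    ]
```

Running this against an export saved during a manual walkthrough gives a quick shortlist of calls that are slow or returning unexpected status codes.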
There’s a lot of creativity we can apply when we are testing applications, but if we don’t use logging and system monitoring tools as part of our testing arsenal, we’re performing the equivalent of testing with one hand tied behind our back.
Open up those gauges, dials and sliders. Read those logs (and parse them with scripts to help clear out the noise). Most important, be curious. Your system can tell you a lot about what’s going on, but you have to be curious enough to ferret that information out. If you are, logs and system resource monitoring can be a veritable goldmine for finding and confirming bugs.
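As one example of a script that clears out the noise, here is a small Python sketch that keeps only warning and error lines and groups them by message, replacing numbers with a placeholder so that repeats of the same problem collapse into one entry. The level names and the masking rule are assumptions about what a typical log looks like; adjust both to your own format.

```python
import re
from collections import Counter

# Assumed level keywords; your logger may use different ones.
LEVEL = re.compile(r"\b(ERROR|WARN|WARNING|CRITICAL)\b")
# Mask hex values and integers so similar messages group together.
VARIABLE_PARTS = re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+\b")


def summarize(lines):
    """Return (message, count) pairs for warning and error lines, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if not match:
            continue  # skip informational noise
        message = line[match.start():].strip()
        counts[VARIABLE_PARTS.sub("#", message)] += 1
    return counts.most_common()
```

A message that shows up hundreds of times after a single test session floats straight to the top of the list, which makes it much harder to overlook.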