Managing Log Files, or Drinking From The Fire Hose Without Drowning
  January 03, 2013

Log files are one of the most useful tools for software developers. They can also be among the most annoying tools, if you don't handle them right. Log files are notorious for burying developers in data, making it hard to extract useful information. Here are a few suggestions for how to successfully mine those mountains for nuggets of useful information.

“For starters, most applications’ log files suck!” says John Sextro, an Agile software developer at Asynchrony, a St. Louis-based development firm.

When it comes to log files, there are two problems. The first, obvious, problem is separating the wheat from the chaff. A log file blithely and happily records all the errors, unusual conditions and everything else that happens in a program as it runs.

You, however, are likely to be neither blithe nor happy when confronted with the mountain of data in that log file as you try to tease bits of information out of the mass of mostly-irrelevant data.

This leads to the second problem with log files: Only in the very simplest programs does the log file point directly to the problem with one error message per error. Instead, the programmer probably has to chase through the code, backtracking from the line that threw the error message – or messages by the dozen – to find what is causing it all. Of course, not all the exceptions thrown by the bug are likely to give the same error message.

This results in a classic quandary: Do you limit yourself to logging only what you think might be significant and run the risk of missing what's really going on? Or do you log essentially everything and try to drink from the fire hose without drowning?

Fortunately, there's a third way. The key to managing any data is organizing it, and log files are no exception. By ruthlessly organizing log file entries you can ignore the irrelevancies and focus on what's important.

Trap All Errors

The first step in this process is simple, if draconian. That is to trap every error message from every procedure in the software at all times.

“If you're not trapping errors, the application shuts down [on hitting an error],” says John Murdoch, president of Wind Gap Data, a software development firm specializing in developing data-intensive software solutions. “You as a developer have no record of what happened. It's one thing to trap an error and record of what happened. It's another to have the application stop cold. If you're trapping errors in every single procedure that will never happen.”

The second step, says Murdoch, is even more paradoxical. Don't read your log files.

“The trick is: You don't use log files by reading them,” Murdoch says. Instead, organize them using log analysis tools and work off the resulting reports.

Create Templates First

The process starts by using log analysis tools to create templates for your log files so you can use the resulting reports. “Most all logging frameworks provide the ability to specify a logging level for each message and allow the application to constrain the levels of logging that it will emit,” says Sextro. Almost every log file program provides basic facilities for organizing log files with templates and the more sophisticated ones have rather complex features. Or, if you prefer, create your own log file template programs using something like Perl or Python.

“The first thing in starting a new project is to decide how we're going to log errors. Every program should begin with a standard template,” says Murdoch.

The template needs to include information to identify the error, where it occurred, and any other useful information. The fields need to be delimited in such a way as to make the separation obvious, and grouped together appropriately by type of error and location in the software. Most logging tools and log file analysis tools can parse error files and produce such reports. Often the errors are further identified by the time of error. The result is a neat list of errors.

The Perils of Prototyping

Traditional prototyping seems like an obvious path, but Murdoch points out a subtle difficulty with this approach. You need to decide what to log and set up your templates before writing any code.

However, today software development often starts somewhere else. “Generally, people start writing applications by putting together a prototype and then adding on to the prototype. That gives them something to show the guys with the suits and wood furniture.” Once management signs off on the project the development starts building out from the prototype.

“But the first thing is you've got to decide is the plumbing and foundation, before you decide what color to paint the house,” Murdoch says. If you've already built a prototype before designing and producing the templates and other foundational work, he asserts, it's extremely difficult to go back and apply those templates to the project.

How Many Log Files?

Another question is how to set up error handling. One method is to collect data by modules, with each module generating a separate log file. Murdoch's preferred method is one giant error-handler that logs every error in the same place. That way all the errors of a particular type can be displayed, he says, no matter what module they're in. You might find a better process works for you.

One difference between teaching environments and the real world is the number and kind of errors. When you're learning, error log files tend to be sparse. In practice, the size of the software is a lot larger, generating much-bigger error log files.

One of the problems is that a single bug can throw thousands of exceptions. “Show me the procedures, show me the problems and what you'll find is a very common error,” says Murdoch. “If you're doing this using SQL Server you might have hundreds of thousands of errors. That can be overwhelming.”

Grouping log files’ result by procedure helps, but more powerful tools are available as well.

“I have 6,700 errors and oh boy, I have a problem here,” says Murdoch. “The trick is not looking at each data line; it's structuring the log files so you can use database analysis to develop a statistical analysis of what the errors look like.”

To handle all that information, Murdoch uses statistical process control methods originally developed for the auto industry. “The technique was developed by Toyota in the 1950s for statistical process control. It works by measuring defects and driving toward zero defects, he says. If you get the log in the right format, you can parse it in the database, and then use database analysis to develop a statistical profile of the quality of the application.

Driving Toward Zero User-Reported Defects

The next part of the job is to test the software extensively to root out all the defects before the software is released to customers.

This becomes even more important as applications grow larger and handle increasingly important data for customers. The error rate can be extremely low and the consequences can still be unacceptable.

Murdoch uses the banking system as an example. By Murdoch's back-of-the-envelope calculation, there are about 1.5 billion transactions daily in the U.S. banking system.

“What's an acceptable rate of failure in doing those calculations?” he asks. “If you are a black belt 6 Sigma kind of guy, that's 300 failures. In the banking system, people would be going to jail with a failure rate like that.”

The answer, Murdoch believes, is simple. “There is no such thing as an acceptable rate of user-reported error.”

But simple isn't easy. Getting to zero defects is an involved process which requires work and constant vigilance. Murdoch says it also requires a different way of thinking about software errors.

“Partly it's a mindset,” Murdoch says. “Instead of planning how to record and deal with customer reported errors, you begin with the assumption that you can't allow errors.”

“The way you deal with it,” says Murdoch, is log file analysis.”

See also:

[dfads params='groups=932&limit=1&orderby=random']

[dfads params='groups=937&limit=1&orderby=random']