Databases are Software, Too

Cameron Laird

January 17, 2011

These four programming techniques can improve your database quality (and incidentally improve relations between software developers and the data management people).

Common sense says that databases deserve to be managed just as other components of a computing solution are. Development and IT staff have strong traditions about how to maintain such “computer stuff” as applications, servers, printer drivers, and so on. When it comes to databases, though, organizations often behave as though all we’ve learned about quality, security, and engineering standards doesn’t apply. For a variety of reasons, databases often suffer undeserved neglect, with predictable consequences for reliability. A few simple changes can heal the worst of the wounds.

Do these situations sound familiar?

You find errors in a production database. You report them. Two months later, the errors are still present.

You report an error in a production database. The database staff promptly corrects it that night. Then the problem is back again the following Monday.

A long-awaited application finally comes on-line, in a carefully-coordinated weekend update that synchronizes software, database, and security configurations. Three existing applications break because of an unanalyzed change in a database schema.

IT announces that a weekend migration of a database to new hardware was successful — but a dozen existing applications lose their access to the data on which they depend.

Application developers program to a data dictionary. It clearly documents that every invoice number references a unique customer, or repair authorizations will always be countersigned by officers from two different units, or so on – but they’re not.

You’re not alone. As common — and maddening! — as such problems are, though, the good news is that there are well-established ways to attack them.

A Brief Cultural Introduction

First, it’s important to understand that databases are different, at least from the standpoint of someone with a background in development, network engineering, or related computing disciplines. Not better or worse, but different. Data professionals have a particular regard for “uptime,” “design,” and “security” that involves many of the same words as their development counterparts use, but with different meanings. Author and consultant Scott Ambler, for example, has written several articles on The Cultural Impedance Mismatch Between Data Professionals and Application Developers.

There’s a lot of good in “data culture,” and the best data shops have a sophisticated understanding and control of all the quality matters this article raises. The curious fact, though, is that a large segment of data operations remains for which the example problems listed above are typical. Moreover, the issues are not getting fixed; however urgent these defects look to you and me, the data organization itself has other priorities. To improve the situation involves changing the way work is done, not just doing more of what already leads to these errors.

Source Control

The first technique that will feel like a “no-brainer” to you as a developer, but can represent a “great leap forward” in data operations, is source control. Also discussed as “revision control,” “configuration management,” “version control,” “scripting,” “change management,” and more esoteric titles, the fundamental ideas of source control are that:

State has a textual representation;
Change is best managed in terms of texts; and
Computing systems easily automate the commonest management operations involved in development and maintenance.

How does source control apply in a data operation? While others have written on this topic before, the message has yet to spread far among database practitioners. Recognize that casual database administrators often work through graphical user interfaces (GUIs) which make interactive updates — adding a column, redefining a table, snapshotting an instance — easy. That ease is hard to manage, though: low-end GUIs support only the most primitive means for comparing state now to a previous state, documenting changes, or rolling them back. Even rarer are ways to partition or merge the efforts of different teams working on different parts of the same database.

When workflow is based on texts — scripts — all these difficulties evaporate. Capture the design of a database in a scripted schema, and script all updates to the schema: Now it’s straightforward to create an instance with any desired design from the past. Every change to the design becomes easy to tag with its date, purpose, author, and so on. You make it feasible to implement requests such as, “I know everything was working in June. Create an instance that includes all the work we’ve done since then, except I don’t want anything to do with the partitioning update from October.”

Content can be more problematic to script than to design, mostly for practical reasons having to do with scale. Even in the most intensively transactional environments, though, there are great advantages to scripting as many tables or objects as possible, and to clearly identifying which residual content is not under source control.

Source control also brings subtler benefits. When a database instance can be recreated by scripting, it becomes possible to create secondary instances, probably on cheaper hardware and software, likely with smaller content sets, that are convenient and inexpensive enough to allow developers free rein. Instead of having to send all change requests through the data organization, developers can experiment as quickly as they generate new ideas. You’ll bring measurement, rather than speculation, to design meetings.

“It is virtually impossible to overstate how much using source control improves software development,” as one recent kalzumeus commenter testifies. It’s a fact, though, that while the virtues of source control have been documented for decades already, adoption remains fragmentary. If your data organization isn’t using source control now, you can almost surely advance your quality enormously with such a project.

Scaling into the Clouds

A second way in which developers have experience with the potential to pay off in data management has to do with commoditization. Developers are increasingly comfortable moving into the cloud, virtualizing in all directions, dealing with mixed tenancy, and, in general, “outsourcing” as much as possible.

I don’t know in detail how these trends apply to data management; I doubt that anyone does yet. What is certain, though, is that data organizations have crucial decisions before them about such topical technologies as

tuned hardware-software bundles;

hosted instances; and

new models for data management, other than relational database management systems (RDBMSs).

As keenly as I see problems with all these — Exadata is single-sourcing, I don’t trust my prime corporate assets to database.com, and NoSQL is largely a fashion statement — I also can see that these or closely-related answers to scaling are certain to be in our future.

Simply because development has already lived through so much virtualization, it’s a great advantage to a data organization to include some of this experience when assigning teams to analyze and implement data’s move to the cloud. Consulting on scalability options, therefore, is a second way that developers can help with data operations.

Stage service

The article on kalzmeus mentioned above for “… source control improves software …” actually focuses most on the importance of staging servers. Author Patrick MacKenzie aptly summarizes that “… staging = production – users.” The idea is that a “dress rehearsal” on a staging server flushes out errors that otherwise would be found only when in production.

It’s a good idea, of course, and one largely familiar within the data world. Why bring it up in this article, then? Because, from the perspective of a data organization, if you’re a developer, you’re likely to be categorized as a user. By the time you see a database, it’s in production, at least as it’s known to the data team.

What’s the remedy? Get on the data team’s calendar. Let them know that, however they shake down their updates on their side, there’s value in exercising a new release with you (and other developers) before end-users see it. This might be one of the easiest adjustments to make. However obvious it seems to you, it’s unlikely to happen, though, until you, as a developer, explicitly raise it with the data organization.

Automation Beyond the Relational Model: Quality Reports

Data teams understand constraints, triggers, and so on. If you’re working with an organization where you have to fight for the uniqueness of primary keys, you’re facing bigger changes than anything this article can supply.

It’s still common, however, for data teams to restrict their automations to the relational model. That is, while they might document or otherwise agree to semantic relations about the data they maintain, if the relationship isn’t declaratively expressible, no one takes responsibility for implementation of the relation.

This is where it helps to be a good programmer in query languages. I’ve often come across cases where a design has been implemented with mild denormalizations, sometimes for well-founded reasons. The tendency I’ve observed is that once an area is a little dirty — denormalized, for example — the data team gives up on its maintenance.

One example in which I’m currently involved is entirely typical: Certain customers are supposed to have a particular attribute when their records are created, based on their affiliations and activities. If the customers don’t have the attribute, problems result that demand expert diagnosis.

I accept that the data design can’t enforce this constraint at the schema level — that is, it could — but there are other legitimate reasons not to choose that approach. It would be easy enough to write a trigger that tosses an exception whenever a non-conforming customer record is created. I’d gladly do that, with confidence that we’d expose in just a few months which operations persist in creating the anomalies.

For organizational reasons, though, I can’t contribute such a trigger. What’s the solution? I just write a report which tells me of any non-conforming records, and pass it to crontab, Nagios, or a similar mechanism. I invariably learn of non-conformances before the customer sees any symptoms, and before the data team otherwise becomes aware of them. Not only do I have precise data which flow into correction requests, but I also use the data to track the incidence of the problem. I know when it’s getting better and worse, and when it’s time to argue for a different approach.

More generally, everything that appears in a data dictionary deserves an executable expression. Developers know this as test-driven development (TDD), among other labels. Every correction or enhancement to a data instance is represented by a collection of tests that confirm the change. Your contribution as a developer might be that some of those tests will be procedural scripts, rather than declarative specifications. Once written, though, whether declarative or procedural, they all can be automated, and form a basis for quality control.

The ideas above are moderate, and generally unthreatening to data organizations. I’m not (yet) on the radical wing of Agile, where, for instance, TDD replaces documentation. When you’re combating “dirty” data, though, you can probably make rapid gains by importing help from developers who already have worked with source control, virtualization, production staging, and exception queries. Put developers to work in the ways outlined above, and the gains will soon be evident both to data and development contributors.

Databases are Software, Too

A Brief Cultural Introduction

Source Control

Scaling into the Clouds

Stage service

Automation Beyond the Relational Model: Quality Reports

You Might Also Like

Company

Resources

Products

Legal