How In-Memory Databases May Remake Your Database
Virtualization, cheap memory, and private cloud technologies have coalesced to create the perfect storm for modern databases. In the last three years, these technologies have obliterated barricades for speed and capacity that once appeared insurmountable. So they’re being marketed as accelerators.
That might make you think you can insert your existing databases into these new engines, like Easy-Bake Ovens, and transform them into super-heroic stature overnight. No, it doesn’t work that way. Although people tend to think of databases as merely collections of data, they are sophisticated network structures whose architecture is based on the platform originally intended to support them. So, for the new breed of in-memory databases, the old methodologies are incompatible. To accelerate and grow your database for the new generation, you may need to rebuild it. Here is why it might not be such a bad idea in the end.
Here’s a question that sounds like it could have been asked by the stereotypical housewife in a 1960s cake mix commercial: Why are in-memory databases like SAP’s HANA, Oracle’s TimesTen, and VMware’s SQLFire so many orders of magnitude faster than conventional, relational database systems?
I’m not in the habit of asking stupid-sounding questions, at least I don’t think I am, but I asked this of several people whom you’d think would be in the know. I didn't do this because I enjoy rehashing the Goldwater era but because I wanted to test a little theory. One fellow provided me with the only analogy you may ever hear about database technology borrowed from The Ed Sullivan Show - specifically, the stage act wherein a guy simultaneously spins china plates on bamboo sticks. For those of you playing at home, his name was Erich Brenn. This fellow remarked that a conventional database balances petabytes of valuable data upon teetering arrays of spinning plates. Which would I think is faster and more reliable, he asked me, "a re‑eally big shoe-w" that’s about to shatter into a million fragments, or a solid-state microchip that securely houses electrons?
Thank you, database-marketing guy, for that patronizing response. But like all of the topics you’ll find me writing about, it’s never that simple.
All databases are complex communications systems. The work that databases perform is expressed in terms of logic. So database architecture is typified by how that logic is represented, and to what nodes — the components of a database that do actual work — the logic is distributed.
Since a database maintains a network of users, its logic must contain both its scheme of organization (the schema) and the instructions for how data is fetched and updated (the “what it is” and the “how it works”). But because these are two fundamentally different classes, they are generally regarded as separate operations governed by exclusive principles: database logic and application logic. Entire textbooks explain this division in very nuanced detail, but a fairly wise person (wiser than Mr. Plate Spinner Guy) provided me with the more encapsulated analogy that a database is to the data as an airport is to an airplane. The application is merely a shuttle service between the two.
Marketers would like you to believe that moving the data storage-and-retrieval process from disk to memory would be analogous to building a new airport and redirecting air traffic. But when you get to the truth, no matter how padded it may be with euphemisms (passive voice, grand metaphors or frequent flyer miles), it still strikes you squarely in the face: An in-memory database is a new order of beast.
“When the assumption of disk-residency is removed, complexity is dramatically reduced,” reads Oracle’s documentation for its TimesTen in-memory database system. “When memory-residency for data is the bedrock assumption, the design gets simpler, more elegant, more compact, and requests are executed faster.”
Which is almost accurate. It isn’t so much that the design gets simpler and more elegant, like waving a wand and chanting, “Expelliarmus!” Neither TimesTen nor any other in-memory system is a magic plug-in; it’s a new communications system as fundamentally different from the existing one as airlines are from the interstate highway system. When airborne capacity is the bedrock assumption (cue the “Flintstones” music), the vehicle becomes faster, more streamlined, and less affected by gravity.
The Subject of Predicates
The reason why the in-memory scheme is fundamentally different is because, to fulfill its mission of being so much faster, it forces us to re-examine the point of principal division in database architecture — the empirical from the practical — and, to enable these new efficiencies, to relocate it.
In an earlier era when computer architecture was simpler, database architecture was more straightforward. Transactions were sequential and performed in batches. So one expressed how transactions take place in terms of things that happen — lists of commands. The way such commands can be expressed mathematically is through a system called imperative logic. Essentially, when action (a) is done, state (b) becomes true.
The problem with representing everything a computer does in terms of procedural instructions is that the same procedures yield the same results. You can’t have a database, therefore, that is independent of procedures; the only way to reproduce it is to follow the same processes. That’s a problem when you need a database to represent something as independent as, say, an economy.
The first great architectural breakthrough toward that end came in the late 1960s, when Dr. E. F. Codd of IBM developed a language, SQL, that expresses things that happen in terms of predicate logic — the rules of semantics that govern true and false statements. Its literal Latin translation makes the most sense: That which you can presume has been declared. Thus, a sentence that uses a predicate is equivalent to an observation (“Mark’s last name is Ludlow,” “The prescription was called in to CWV Pharmacy,” etc.) rather than a directive (“Make Mark’s last name Ludlow now,” etc.). And a SQL procedure, written in the original style and syntax Dr. Codd intended, could be interpreted not as a list of things to be done imperatively, like a COBOL program, but rather a scroll full of facts. The data is independent of the procedure, and merely described by the language as a series of observations. Ideally, that series is independent of any prescribed sequence; again, as intended, you could reorganize the instructions and get the same results.
As databases were first put to use in modern systems, whose operations were governed by sequences of commands, it became necessary to maintain a firm boundary of separation — a fence of sorts - with both types of logic on opposite sides. One early example of such a fence, which remains in place today in the financial sector, comes from SAP, whose most recent version of client/server architecture is called R/3. Its applications are written using a system called ABAP, whose German acronym means “general report creation processor.” Officially, ABAP databases have their own SQL interpreter called, for lack of any other name, Native SQL. But in addressing the need for databases to be portable, as well as mixed-and-matched with other brands of data, SAP created a separate language called Open SQL with the idea that a customer could replace SAP as a data provider with someone else’s system (Oracle, Microsoft, etc.) without upsetting the application logic. The database layer and the application layer are interfaced.
This notion of an interface has become critical not just to SAP but also to all database architecture. It sets up the first checkpoint between one active node and another and implies the existence of an order of operations: something that’s done first, then you proceed to the checkpoint, then you hang around and wait for a response from the other side, then you take the response and run with it. The ODBC driver has become the universal messenger for database checkpoints, a way for nodes within a system to pass on a procedure to other nodes and wait for the response.
Over time, it actually became less convenient to pretend that both sides of the fence should speak the same language. While messages for the database logic side were crafted in something that, from a distance, still resembled SQL (kind of), the applications resumed their normal course of imperative and object-oriented logic, using C++, C#, and to a large amount more recently, Java. Today, Java uses its own optimized derivative to the ODBC model, called JDBC.
The differences between an imperative language like Java and a declarative language like SQL are not just semantic. They not only differ in ways of thinking but, to borrow an image from philosophy, in states of being. Specifically, an imperative language designed to explain “what to do” must be used to craft explanations of “the way things are.” It’s like a military sergeant passing on marching orders to a Buddhist priest.
Here’s a case in point: In Java, you import the java.sql package, then declare a variable of type Connection whose value is the result passed back from the .getConnection method. That method establishes a literal connection with the database, assuming it was already published as a Web service and is accessible through a username and password. Oftentimes you don’t necessarily get a connection just because you called for one, which is why it may be a good idea to couch this method call within a try/catch clause in case of an exception.
Next, you declare an object of class PreparedStatement whose contents are the text of a dynamic SQL query. That query is usually pieced together like parts of disparate jigsaw puzzles, so it never looks quite like SQL to anything else except the SQL interpreter on the other side of the fence — for instance, fetch_query = "SELECT surname, givenname, mid, account_balance FROM " + account_title + "WHERE account_balance > " + field_value;
Assuming that query is lexically correct (prayer may help here), you assign that string to a method that effectively rephrases the query as though it were an imperative command, the result of which is an indication of success or failure expressed as an integer.
It’s no longer a division of logical functions, but rather a sublimation of one by the other.
As a direct result, database logic ceases to be free to use its own methods as originally intended. Instead it must wait for commands to be processed in batches and sent out as signals to a distant executive, which then evaluates the results in sequence (or, using the database term, tuple-at-a-time) in exactly the opposite manner of SQL’s architecture. For example, a Java function would set up a de facto cursor that looks up the first record of the ResultSet class variable, checks the record to see if that’s what we want, does something if the return value is true, and kicks the cursor can down the street one hop, under a while clause, which stops the process when it can’t be kicked any further. Even the very phraseology of the clause assumes the existence of time (the sequence of events), which in predicate logic cannot be expressed reliably.
Division by Zero
There are two unavoidable reasons why database architecture evolved to become far more complex than Dr. Codd might have hoped:
- Client/server systems like R/3 enabled thousands to use the same database at once. So it became necessary for intermediary nodes to provide individual users with views or snapshots of segments of data, so that changes that another user may be making to the same data that one user is viewing don’t suddenly render that one user’s changes invalid. Transactional logic took on a new role: that of reconciliation.
- No matter how fast those spinning ceramic plates called hard drives spun (albeit not on bamboo poles); it became impossible to record each transaction in sequence in exactly the order it transpired. Batch processing techniques needed to be employed, using the same techniques that make CPUs faster (caching, which is the real reason microchips are faster than spinning plates), and in most cases leveraging the help of sequential storage and retrieval techniques (one record at a time) that completely went against the grain of everything Codd had intended.
If you’ve ever had to get some bit of business done in a bureaucracy (for instance, getting your driver’s license renewed before next week), perhaps you’ve learned from experience that the best course of action isn’t always to direct your request or complaint to the fellow at the top, but rather to the person in the middle who greases the wheels and gets things moving. And that’s what has happened with database architecture. Logic was re-phrased so that it no longer addresses the node that even IBM once called the “executive,” but rather some node in the middle of the action that can actually get the job done. That middleman, as you may already have surmised, is the ODBC/JDBC driver.
Maybe the inefficiency of this chain of events would be masked by the speed of the modern processor, if indeed it were being executed on one processor. But it isn’t. And that’s the problem. All modern processors are designed, where applicable, to execute instructions in parallel - to bunch instructions together that can be bunched, so long as the results aren’t changed in the process. Moving databases into memory means making them available to the processor at multiple points at once. If you unravel the fabric that SQL has so neatly woven into an intricate sequence of transactions that must be processed in a certain order, then whenever you multiply those processes by the total number of users in the system, at some point, you actually end up slowing the system down rather than speeding it up.
It’s the same principle that graphics chip engineers learned two decades ago: You can’t expedite 3D graphics processes in a computer by subdividing the sequences and shoving them through multiple threads. You can expedite them, however, by queuing identical functions into multiple pipelines and executing them all at once.
The priest had it right from the start: Predicate logic can be more easily rendered in parallel because it is not framed in terms of the sequence of operations. In fact, database engines have already been optimizing their instructions for the better part of three decades, based on this simple fact.
So if you’re still reading, Mr. Plate Spinner Fellow, the moral of the story is that a faster technology foundation can actually slow down a process that was pre-engineered for a slower mechanism. This is why in-memory databases are, to varying but substantive degrees, atypical designs. It’s not that they speed up your data. Rather, when you redesign your logic to eliminate the presumptions forced upon it by the predecessor system, the result is orders of magnitude faster than what you had.
SAP’s HANA utilizes yet another derivative of SQL, this time called SQLScript. But its purpose is stated right up front, in the first official sentence of its manual (PDF available here): “The motivation for SQLScript is to offload data-intensive application logic into the database.”
Call it “Dr. Codd’s Revenge.” Through cloud and virtualization technologies, huge pools of memory can be cobbled together from multiple sources, some not even physically sharing the same systems. Once data is housed there, databases don’t need to be told how to fetch or update contents. Like Codd intended, SQL figures that part out for itself.
SAP HANA architecture, clearly showing the “fence” relocation. [Courtesy SAP]
So HANA completely reformulates database architecture in the following way:
- It picks up and moves the fence, as it were, giving the database the responsibility for orchestrating the retrieval process, and even of analyzing and filtering data so that the application receives only what it asks for.
- It creates a new concept called a data flow, similar to a workflow: a graph that explains how declarative logic statements are paired with imperative orchestration commands in a safe manner that cannot impair the integrity of the data.
- With these data flows in place, HANA’s SQLScript enables certain safe “primitives,” to invoke an old term, such as SELECT statements and custom operators, to be invoked in a declarative (procedural) fashion, but in such a way that HANA can safely execute them in parallel.
But just in case I didn’t make this point clear enough, in-memory databases are an entirely new architecture. You can move your existing data there, but only after a substantial rethinking of the way your business logic works. And the silver lining in the story for the manufacturers of conventional databases is that businesses typically aren’t compelled to take this critical step, not even to attain world-changing speed and efficiency. It is that fact that leaves my poor friend, the database-marketing guy, smiling in the end after all.