What Good are Terabytes? Because Data Sets are Not Information.


We keep on amassing more and more data. And we're told frequently about the uses to which this data can be put. But data is not information and information is hard to glean from terabytes of stuff. There are, however, needles in those haystacks.

At the same time, some organizations are jumping on the big-data bandwagon because of all the hype and promise. They aren’t looking at, let alone understanding, what they need to know, want, and consider to get the project done. As Marilyn Cohodas pointed out, “You can’t shake a stick today without coming up with a fascinating story about how big-data is saving lives, catching terrorists, or unlocking the secrets of climate change. Truth is, most big data projects are never completed.”

Remember this, because it'll be important in a moment:  “A Postal Service program created after anthrax attacks gathers photos of the exterior of every piece of paper mail processed in the nation — about 160 billion pieces last year.” (New York Times, July 3, 2013).

Saving data is the easy part. Storage has gone from kilobytes to terabytes over the past 35 years. But finding things and relating things to one another is still difficult. The U.S. Postal Service (USPS) program mentioned by the New York Times is a good example. Is there information to be gathered from the 160 billion addresses and return addresses?

Yes and no. Yes, because (for example) if we have established that a given envelope contained anthrax spores, we could sort for every piece of mail with the same return address (or the same postmark) and then analyze the intended destinations. No, because scanning technology is less than perfect. My Kindle provides me with hilarious examples nearly every day. One of the leading characters in Shaw's Back to Methuselah changes his name (Burge) regularly; in Joyce's Ulysses, Molly Bloom's drowsy monologue contains III (Roman 3) for 'I'll' several times.

Postal workers (and expert readers) interpret errors when reading; scanners don't.

But let's assume you know what you're looking for. You're going to conduct a directed search. For the past four decades, Unix systems have included a program called grep (written by Ken Thompson, 3 March 1973). grep searches the named input for lines containing a match to the given pattern. For example, if I had a complete collection of everything written by Sir Walter Scott (about 18 megabytes) and wanted to know where “Wamba” occurred, I might enter grep Wamba scott and each line in the file scott containing the name of the jester in Ivanhoe would be listed.

Forty years on, grep, or one of its variants, is available for Mac, Linux, BSD, and Windows systems.
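
For readers who have never used grep, here is a minimal Python sketch of the same idea. It is an approximation, not grep itself: it matches a fixed string rather than a regular expression, it also reports line numbers (which grep does only when asked), and the sample lines are invented stand-ins for the file scott.

```python
from typing import Iterable, Iterator, Tuple


def grep_lines(pattern: str, lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line that contains the pattern."""
    for number, line in enumerate(lines, start=1):
        if pattern in line:  # plain substring test, not a regular expression
            yield number, line.rstrip("\n")


# Invented sample text standing in for the file "scott".
sample = [
    "Gurth drove the swine home.\n",
    "Wamba shook his cap and bells.\n",
    "The knight rode on in silence.\n",
    '"Nay," said Wamba, "I am but a fool."\n',
]

for number, line in grep_lines("Wamba", sample):
    print(f"{number}: {line}")
```

Because the function accepts any iterable of lines, an open file object could be passed in place of the list, so an 18-megabyte collection would be read one line at a time.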

Let's try another example, also literary. Let's say we're curious as to how much of Shakespeare is also in Scott. But the plays and poems come to over five megabytes. And we're not interested in “of” or “the” or even “Capulet.” But we could write a program to take three or four words, define them as a pattern, and try to match them against the Scott poems and novels.
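
One way such a program might look is sketched below. It assumes a run of three words counts as a "pattern", uses a deliberately tiny stopword list, and only discards runs made up entirely of stopwords; it does nothing about proper names such as “Capulet.” The Scott sentence is invented for the demonstration, and the function names are my own.

```python
import re
from typing import Set, Tuple

# Assumed, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"of", "the", "and", "a", "to", "in"}


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return every run of n consecutive words, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = set()
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        # Skip runs that consist of nothing but stopwords.
        if all(word in STOPWORDS for word in gram):
            continue
        grams.add(gram)
    return grams


def shared_phrases(a: str, b: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the n-word runs that occur in both texts."""
    return word_ngrams(a, n) & word_ngrams(b, n)


shakespeare = "The quality of mercy is not strained."
scott = "He said the quality of mercy was lost on such men."  # invented line

for phrase in sorted(shared_phrases(shakespeare, scott)):
    print(" ".join(phrase))
```

Holding both sets of word runs in memory is fine for a few megabytes of plays and novels; the same approach would need rethinking for anything on the scale of the postal archive.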

Now, imagine those 160 billion addresses piled up by the post office. Can I pull out all the items sent to a given address? Yes. Could I then pull out all the items sent from that address? Yes. And could I pull out all the items that passed between that address and any of the others, and among any of those other addressees? Yes and yes.

We've set up an elaborate network. Which is precisely what the USPS (and the NSA) wanted to do.
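
To make the idea of a network concrete, here is a small sketch. It assumes each scanned envelope has already been reduced to a clean (return address, destination address) pair, which is exactly the step that imperfect scanning makes hard. The single-letter addresses and the function names are placeholders of my own.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def build_network(mail: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each address to the set of addresses it has exchanged mail with."""
    links: Dict[str, Set[str]] = defaultdict(set)
    for sender, recipient in mail:
        links[sender].add(recipient)
        links[recipient].add(sender)
    return links


def within_hops(links: Dict[str, Set[str]], start: str, max_hops: int) -> Set[str]:
    """Breadth-first search: addresses reachable from start in at most max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        address, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in links.get(address, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    seen.discard(start)
    return seen


# Placeholder (sender, recipient) pairs.
mail = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "A")]
network = build_network(mail)

print(sorted(within_hops(network, "A", 1)))  # direct correspondents of A
print(sorted(within_hops(network, "A", 2)))  # plus their correspondents
```

Each extra hop widens the circle of addresses quickly, which is why a pile of envelope photographs turns into a map of relationships.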

Now, imagine setting up similar relationships for totally different purposes.

Every time something is scanned and rung up in your supermarket, it produces a record. Depending on how much money your supermarket has been willing to invest, those records are kept and correlated.

On the most basic level, the supermarket maintains an inventory so that it knows when an item is running low and reordering is necessary. On the next step, it might note that crackers and peanut butter or cheese are purchased together.
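
Spotting items that are purchased together can be as simple as counting pairs. The sketch below assumes each checkout record has been reduced to a list of item names; the baskets are invented, and a real system would add proper statistics on top of raw counts.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def pair_counts(baskets: Iterable[List[str]]) -> Counter:
    """Count how often each pair of items appears in the same basket."""
    counts: Counter = Counter()
    for basket in baskets:
        # Sort the unique items so each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Invented checkout records.
baskets = [
    ["crackers", "peanut butter", "milk"],
    ["crackers", "cheese"],
    ["crackers", "peanut butter"],
    ["milk", "bread"],
]

for pair, count in pair_counts(baskets).most_common(3):
    print(pair, count)
```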

And if your vendor has a loyalty points program, it not only tracks and increments your total, but also records that you purchase mustard X as opposed to Y when you purchase ground beef or hotdogs.

Similarly, a manufacturer or a vendor can track both sales and purchases. Knowing that one member of its salesforce appears to encourage certain sales is important. Knowing that a customer purchases certain items with double the frequency of others is valuable. And tracking the quantity and the frequency enables a salesperson to call a few days or a week early – which produces good will in the customer.
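
The "call a few days early" idea can be sketched from nothing more than a customer's past order dates. This version assumes the customer reorders at roughly regular intervals and simply averages the gaps; the five-day lead time, the dates, and the function names are all made up for illustration.

```python
from datetime import date, timedelta
from statistics import mean
from typing import List


def next_expected_order(order_dates: List[date]) -> date:
    """Estimate the next order date from the average gap between past orders."""
    ordered = sorted(order_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two orders to estimate an interval")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return ordered[-1] + timedelta(days=round(mean(gaps)))


def suggested_call_date(order_dates: List[date], lead_days: int = 5) -> date:
    """Suggest calling the customer lead_days before the expected order."""
    return next_expected_order(order_dates) - timedelta(days=lead_days)


# Invented order history: one order roughly every thirty days.
history = [date(2013, 1, 7), date(2013, 2, 6), date(2013, 3, 8)]

print(next_expected_order(history))
print(suggested_call_date(history))
```

A customer with irregular habits would defeat this average, which is one more reason a person needs to look at the result before picking up the phone.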

But all of these things require vast amounts of memory.

And strange errors can arise. When I searched on Amazon for Blue Jasmine (a Woody Allen film, released in July 2013), I was offered the film and TV series Dead Like Me (which stars Callum Blue and Jasmine Guy). Machine searches are far from perfect.

Machines are also quite stupid. One of my credit cards has only my first initial on it. Receiving emails beginning “Dear P” ensures trashing. Similarly, “Mr./Ms. P.” Any marketing department’s use of the card data without human intervention has guaranteed that the vendor's time and money have been squandered.

Here's another example. Let's imagine that I'm in Sydney, Australia, and want to go to Perth. My laptop informs me that the distance is 3294 km by air; 4041 km if I drive; and 4352 km by rail. I can also walk (3731 km) or bike (3810 km). To the best of my knowledge, no one has ever walked or ridden across Australia. None of the early exploration parties was successful. I don't know what the nautical distance would be. But I know that the Bass Strait (between Tasmania and the mainland) is quite dangerous.
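
The "by air" figure is presumably a great-circle distance, and that at least is easy to check. The sketch below uses the haversine formula with approximate city coordinates and a mean Earth radius of 6371 km; all three are my assumptions, so the result should land close to the quoted figure rather than match it exactly.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the Earth is not a perfect sphere


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate (latitude, longitude) pairs; south latitudes are negative.
SYDNEY = (-33.87, 151.21)
PERTH = (-31.95, 115.86)

print(round(great_circle_km(*SYDNEY, *PERTH)), "km")
```

The formula tells you nothing about whether the trip is sensible on foot, which is the point: the machine computes and the human judges.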

Returning to the beginning of this post: Amazon, Google, VISA, the various phone and cable companies, and a vast number of chains have lots of data about me (and you). They have lists of what we have purchased, where we have gone, what we have viewed, and both who has phoned us and to whom we have placed calls.

But mere lists of data aren't portraits. A bookstore may “know” that I've purchased several volumes about the U.S. War between the States, and it will conclude that I have an interest in that conflict. I don't. All those books were bought as gifts for a relative.

On a statistical basis, market analysis incorporating the data from credit cards yields a picture of behavior. But it breaks down when we attempt to limn the individual.

Polls reveal what groups have stated or revealed, but not what Dick, Sally, Tom, or Victoria believes.

The data that fill terabytes of storage are the raw material for megabytes of information. However, those stored data require a great deal of processing and analysis before they yield anything useful.

Storing just for the sake of amassing “stuff” isn't useful. Working out how it can be made useful is a human task… though it might use computer power to execute the plan.
