The Buzz Around Big Data
  June 27, 2014

In this video clip, Marlon Bailey, Lead Software Architect for Hosted Systems at SmartBear Software, explains why the science behind big data isn't quite as new as everyone makes it seem, and why he doesn't necessarily buy into the hyperbole that surrounds the buzzword itself.


Video Transcript:

One of the big buzzwords, I think, is Big Data. And, you know, as I told you off-camera before, I said, “You stay in this industry long enough, you see things every 10 years where they just rename the same thing over and over again.” So many of these terms are just people solving -- and not really solving, but applying old practices to solving -- new problems, which really weren’t new but were old problems. They became new problems because the actual quantity of the problem got bigger.

You know, big data is nothing more than just saying, “a lot of data,” right? But the technology may not be there to handle it the same way that you would traditionally handle it. And that’s why they go back to using old methodologies, but those old methodologies really are just still regular computer science. Going back to [the conversation of] there being a shortage of people who are doing computer science – and that cultural thing – maybe that’s why it looks like a new thing to people.

But you look at a computer system and there’s things like interconnects, your RAM and that kind of stuff that you think about and it’s like… Okay, if the interconnects are fast enough or if you had enough RAM and enough CPU, big data wouldn’t exist. You’d just put it all on one machine and use the standard stuff, right? Which, in essence... When you go into a machine and break it all down, you look at a distributed system and you’re building a large computer the same way a small computer is made. You’re just breaking it out across multiple machines, right? You have your interconnect to transfer data.
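The hierarchy of interconnects he alludes to can be made concrete with rough numbers. The figures below are ballpark illustrations for the era, not measurements from the talk:

```python
# Ballpark bandwidth for the "interconnects" inside one machine versus
# the network links that join machines in a distributed system.
# All numbers are rough illustrative assumptions, not benchmarks.
bandwidth_mb_per_s = {
    "RAM bus (DDR3-era)": 10_000,  # ~10 GB/s
    "SATA 3 disk link":      600,
    "gigabit Ethernet":      125,  # 1 Gb/s divided by 8 bits per byte
}

# Fastest link first: the network is the slow lane by a wide margin.
for link, mb in sorted(bandwidth_mb_per_s.items(), key=lambda kv: -kv[1]):
    print(f"{link:20s} ~{mb:>6,} MB/s")
```

If the network link were as fast as the RAM bus, there would be little reason to treat the cluster differently from one big machine -- which is exactly his point.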

Well, with big data, a gigabit Ethernet link is not going to be fast enough. RAM would be fast enough. So you’re like, okay, if I was to do it the regular way where I’m making RPC calls -- which is what people generally do for small chunks of data that the interconnects can handle -- then it’s no problem. Nobody thinks it’s anything special, right? It’s TCP/IP. You’re passing data. Or you’re doing UDP, which sits alongside it. No big deal.
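The “no big deal” RPC path he describes can be sketched with Python’s standard library. The `word_count` function, the sample text, and the server setup are illustrative assumptions, not anything from the talk:

```python
# A toy RPC call: a small chunk of data crosses TCP/IP, a result comes back.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def word_count(text):
    """A trivial remote procedure (illustrative stand-in)."""
    return len(text.split())

# Port 0 lets the OS pick any free port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(word_count)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
proxy = ServerProxy(f"http://localhost:{port}")

# The "small chunk of data" ships over the wire; the answer ships back.
result = proxy.word_count("a lot of data")
print(result)  # 4
```

For kilobytes of payload this pattern is unremarkable, which is the point: it only stops working when the payload outgrows the pipe.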

With big data you can’t do that. It will choke the pipe. So you’re like, okay, let me compress it and ship it out without processing it. Then, once it gets to the machines, I can ship the program, which is much smaller, and have the program run on each machine using RAM, which is a much faster interconnect. You know, or bus, right?
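The size asymmetry behind “ship the program, not the data” can be sanity-checked in a few lines. The sample log line and the tiny counting job are made-up stand-ins for “the data” and “the program”:

```python
import zlib

# "The data": 200,000 copies of a sample log line (an illustrative stand-in).
data = ("2014-06-27 10:31:07 GET /index.html 200\n" * 200_000).encode()

# "The program": the source of a tiny counting job, as bytes.
program = b"def count_requests(chunk): return chunk.count(b'\\n')"

compressed = zlib.compress(data)

print(len(data))        # 8,000,000 bytes of raw data
print(len(compressed))  # far smaller once compressed
print(len(program))     # ~50 bytes -- cheapest of all to ship
```

Even compressed, the data dwarfs the program, so moving the computation to the machines that hold the data is the cheaper direction.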

So, it’s all the same principles. Things are just getting moved around. But if you understand the principle underneath, none of it looks that new to you. And I think that’s the reason why you see some of these buzzwords, especially big data, where you’re like, “Oh, it’s just a lot of data.”

Yes, you have to process it differently. But you’re really just aware of what’s necessary to process data, and you’re realizing that the standard system has limitations. And those limitations can be discovered through standard scientific practice, right? It’s like, say you have a requirement that you need this done in an hour. Okay? You need to shovel 15 gigs of data a second. Okay, well, I don’t have an interconnect for that. I need to figure something else out, which means I’m going to have to start balling it all up and not doing the processing that way, but [rather] chopping it up so I make that data smaller. It’s just solving a problem.
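The back-of-the-envelope step in that example looks like this. The 15 GB/s figure is his; the link speed is the gigabit Ethernet he mentioned earlier:

```python
# How far short a single gigabit link falls of the stated requirement.
GIGABIT_LINK_BYTES_PER_S = 1_000_000_000 // 8  # 1 Gb/s = 125 MB/s
REQUIRED_BYTES_PER_S = 15 * 1_000_000_000      # "15 gigs of data a second"

links_needed = REQUIRED_BYTES_PER_S // GIGABIT_LINK_BYTES_PER_S
print(links_needed)  # 120 -- hence chopping the data across many machines
```

One link delivers less than one percent of the requirement, which is the arithmetic that forces the “chop it up” answer rather than any new science.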

So, big data, I know, is a big buzz term right now, but some of the other coders I talk to joke around with me -- we say it’s just called “a lot of data.” And that’s not to belittle the systems that are built to process those things. No way. That’s sound science. Those things are built well. The buzz side of it is what’s surprising to me and kind of funny.