When Should You Use Hadoop And When Not?
Hadoop is easily one of the most important pieces of software invented in recent times. It ushered in the era of big data and continues to be the focal point of some of the largest companies out there today, including Facebook, Amazon, eBay, and Alibaba.
And while it’s easy to get caught up in all the hype around big data and the Hadoop train, it’s imperative to understand that it’s not a magic bullet that will solve all your business intelligence problems. By gaining an understanding of why it’s such a monumental piece of tech and, more importantly, its flaws, you can then assess whether your business needs it or not.
When you should use Hadoop
1. You have a lot of data to process
Hadoop is usually credited with kicking off the big data revolution. It came into being at a time when the internet was growing rapidly, and the amount of data available to the average business could no longer be processed using traditional methods. Google was one of the companies struggling with this problem, and its work on a solution led to the publication of the MapReduce whitepaper in 2004.
MapReduce is the programming model that Hadoop runs on under the hood. It’s meant for processing very large data sets (think 10TB+) by splitting the work across large clusters of machines. That design carries a fixed cost: scheduling jobs and moving data between machines takes time no matter how little data there is, which makes MapReduce extremely inefficient for smaller sets of data.
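To make the model more concrete, here is a minimal sketch of the classic word-count example in plain Python. It does not use Hadoop’s actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs on a single machine, whereas Hadoop would spread the map and reduce calls across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (a framework would do this between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Each call to map_phase only looks at one line, and each call to reduce_phase only looks at one key, which is what lets a framework hand those calls to different machines.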
2. You need to process non-uniform sets of data
Before Hadoop became popular, relational databases were the standard solution for storing and analyzing data. Relational databases rely on schemas to provide an organized and systematic approach to accessing information stored in the database. Schemas are predefined and describe the type, size, and organization of all data stored in a database. They essentially introduce structure to how information is stored.
The biggest problem with this approach is that data comes in several formats – most of which don’t have any kind of predefined structure. And yet, all these forms of data have useful information that needs to be queried and processed.
Text needed to be extracted from images, audio files needed to be automatically transcribed and PDF files needed to be processed for the growing field of Natural Language Processing (NLP). Relational databases fell short by a mile.
When Hadoop came in, it allowed developers to store all these kinds of data in a single place. Despite the number of downsides this poses, it can be an incredibly useful feature if placed in the right hands.
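The idea of keeping differently shaped data in one place and only imposing structure when the data is read can be sketched in a few lines of Python. This is an illustration under assumptions, not how Hadoop stores files: RAW_STORE and read_record are hypothetical names, and the three sample records are invented.

```python
import json
from typing import Dict, List

# One store holding records of different shapes, kept exactly as they arrived.
RAW_STORE: List[str] = [
    '{"user": "ana", "action": "login"}',        # JSON event
    "bob,purchase,19.99",                         # CSV row
    "free-form note: server restarted at noon",   # plain text
]


def read_record(raw: str) -> Dict[str, object]:
    """Decide how to interpret a raw record at read time, not at write time."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return {"kind": "json", **parsed}
    except json.JSONDecodeError:
        pass  # not JSON, try the next interpretation

    fields = raw.split(",")
    if len(fields) == 3:
        try:
            return {
                "kind": "csv",
                "user": fields[0],
                "action": fields[1],
                "amount": float(fields[2]),
            }
        except ValueError:
            pass  # third field is not a number, treat as text

    return {"kind": "text", "body": raw}


if __name__ == "__main__":
    for raw in RAW_STORE:
        print(read_record(raw))
```

The flexibility comes at a price: every reader has to carry its own parsing logic, which is one of the downsides of storing everything in a single place.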
3. You need to take advantage of parallel processing
The history of parallel computing goes back to the 1950s, but one of its most notable milestones came just nine years before the initial release of Hadoop in 2006. Massively parallel processors, invented in the 1980s, came to dominate the high-end computing market, and, in 1997, the ASCI Red supercomputer broke the barrier of 1 trillion floating-point operations per second.
By the time MapReduce came around, this approach was well established, and the software would go on to support embarrassingly parallel computations. These computations are usually referred to as ‘embarrassingly parallel’ because the work splits into tasks that need little or no communication with one another.
Parallel computing finishes the same work in much less time than sequential computing and is especially effective for large sets of data. This makes it essential for applications like simulating real-world phenomena and modeling or solving computationally expensive problems that a single computer cannot handle in a reasonable time.
In other words, Hadoop is not only useful for processing large sets of data in general, but it also enables businesses to introduce new complex use cases such as simulation and data analytics.
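As a rough sketch of what an embarrassingly parallel computation looks like, the Python example below estimates pi with a Monte Carlo simulation. It uses the standard library’s ProcessPoolExecutor on one machine rather than a Hadoop cluster, and the names count_hits and estimate_pi, along with the sample sizes, are illustrative choices.

```python
import random
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple


def count_hits(task: Tuple[int, int]) -> int:
    """Count random points that land inside the unit quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # each task owns its generator, nothing is shared
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(hit_counts: List[int], samples_per_task: int) -> float:
    """Combine the independent results only once, at the very end."""
    total = samples_per_task * len(hit_counts)
    return 4.0 * sum(hit_counts) / total


if __name__ == "__main__":
    samples_per_task = 200_000
    tasks = [(seed, samples_per_task) for seed in range(8)]
    with ProcessPoolExecutor() as pool:
        hit_counts = list(pool.map(count_hits, tasks))
    print(estimate_pi(hit_counts, samples_per_task))
```

No task ever needs a result from another task, so adding more workers (or more machines) simply means handing out more tasks.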
When not to use Hadoop
1. You need answers back in a hurry
As great as Hadoop is, it isn’t without its flaws. The first and most important of these is that it relies on batch processing. Simply put, batch processing is a data processing model where computers take in the whole dataset, process it and write a similarly large output.
It’s a pretty efficient way of dealing with big data, but can be very slow depending on factors such as the number of nodes and the configuration of the system.
This is one of the issues at the core of the Spark versus Hadoop debate. Rather than writing intermediate results to disk between steps as MapReduce does, Spark keeps them in memory, and it also supports stream processing. Stream processing is a model in which data is processed in small subsets as it arrives, rather than as one whole chunk at once. This lets results come back much sooner than with batch processing, in near real-time.
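The difference between waiting for the whole dataset and producing answers along the way can be shown with a small Python sketch. It is a toy illustration of the two processing models, not Hadoop or Spark code, and the function names batch_average and streaming_average are hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch style: needs the complete dataset before producing one answer."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: yields an updated answer after every record that arrives."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    data = [10.0, 20.0, 30.0, 40.0]
    print(batch_average(data))            # 25.0
    print(list(streaming_average(data)))  # [10.0, 15.0, 20.0, 25.0]
```

Both approaches end at the same final value, but the streaming version has a usable answer after the very first record, which is what matters when you need results in a hurry.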
2. You need to store sensitive data
Hadoop is a massive framework. Together with the many extensions it can be used alongside, it can be difficult for even experienced programmers to handle. Of all the problems Hadoop can introduce into your workflow, perhaps the most crucial is the security flaws it comes bundled with.
Hadoop doesn’t have any security features enabled by default. In other words, if you start up a new cluster, it assumes that only trusted users have access to it. This is a serious disadvantage because developers have to spend extra time working with configuration files and third-party programs such as Kerberos to enable encryption, authorization, and authentication.
While these functions can be enabled with a bit of hands-on work, encryption of data at rest and file auditing still don’t work out of the box. Getting all of this up and running is not a trivial task, either.
Conclusion
With all the fanfare around Hadoop, big data and related applications like Spark, it’s easy to get lost in the excitement and forget that Hadoop isn’t without its weaknesses. Hadoop is generally great for processing very large sets of data, distributed computing and dealing with unstructured data. However, it falls short on speed and security, and it isn’t novice-friendly at all. Before adopting it into your ecosystem, it’s important to weigh these drawbacks against the benefits it grants.