Big Data Blog

Hadoop Foundation: When to use Hadoop for a Data-driven enterprise?

Confluent raised $24 million for the data ‘streams’ powering LinkedIn, Netflix, and Uber. The company helps these corporate giants get new insights from their data using Apache Kafka.

Do you know why big data is becoming such a big deal? Have a look at this short video to kickstart this article.

Why is it important to get data-driven?

The context here is that digital transformation is impacting every industry. The growth of data is widely discussed and no longer up for debate. This new data can come from:

  • Sensors & machines typically referred to as the Internet of Things
  • Geo-location
  • Server Logs
  • Clickstream and social media
  • Files and emails

In technical terms, this data lands in non-relational databases or other non-traditional data management systems. Alongside it, you have data coming from traditional sources (like ERP, CRM, and PoS terminals) that you store in data warehouses. Both kinds of data are growing rapidly. The question becomes:

How do you effectively blend this information in a way that is transformational to the business?

Now, don’t take “transformational” as a cliche. Savvy folks and companies have been using Hadoop for years (often without naming it as we do), but it remains transformational for a company that has not implemented it yet. This blended data (coming from traditional and non-traditional sources) helps you be proactive with your customers and supply chain, as opposed to reacting days, weeks, or months after the fact.
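To make the idea of blending concrete, here is a minimal sketch in plain Python. It is not Hadoop code, and every name and record in it (crm_customers, clickstream_events, the /cancel-subscription page) is a made-up assumption for illustration. It joins a traditional CRM record set with non-traditional clickstream events so that a customer showing warning signs can be flagged before the fact rather than after.

```python
from collections import defaultdict

# Traditional source: customer records as they might come out of a CRM.
# (Hypothetical sample data.)
crm_customers = {
    "C001": {"name": "Acme Corp", "segment": "enterprise"},
    "C002": {"name": "Bolt Ltd", "segment": "smb"},
}

# Non-traditional source: clickstream events. (Hypothetical sample data.)
clickstream_events = [
    {"customer_id": "C001", "page": "/pricing"},
    {"customer_id": "C001", "page": "/cancel-subscription"},
    {"customer_id": "C002", "page": "/docs"},
    {"customer_id": "C001", "page": "/cancel-subscription"},
]


def blend(customers, events, risk_page="/cancel-subscription"):
    """Join per-customer counts of visits to a risk page onto CRM records."""
    visits = defaultdict(int)
    for event in events:
        if event["page"] == risk_page:
            visits[event["customer_id"]] += 1

    blended = []
    for customer_id, record in customers.items():
        count = visits.get(customer_id, 0)
        blended.append({
            "customer_id": customer_id,
            **record,
            "risk_visits": count,
            # Proactive signal: reach out before the customer actually leaves.
            "needs_outreach": count > 0,
        })
    return blended


blended_view = blend(crm_customers, clickstream_events)
for row in blended_view:
    print(row)
```

On a real cluster the same join would run over far larger data sets, but the shape of the work is the same: aggregate the new source, then attach it to the records you already trust.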


The opportunity is to unlock business value from full-fidelity data and from analytics across that data.


The reality is that much of the new data exists in flight: it is in motion, and it lives in the systems and devices that make up the Internet of Everything landscape.

As Fig 1 below shows, the ability to consume data is a challenge (see the line in the middle).

Fig 1 | Source: Hortonworks

Another challenge is how you actively manage the data from as close as possible to its point of inception, through its lifecycle, and through whatever real-time or historical analytics you may want to apply to it.

So, that’s the backdrop of many folks’ journey towards becoming data-driven.

How do companies start their Hadoop journey?

The guys at Hortonworks see a clear pattern, particularly over the last few years, when they help bring Hadoop into enterprise IT infrastructure. Companies have adopted the Hadoop ecosystem for two reasons: to save costs and to unlock transformational business outcomes.

These are the governing use cases, if you will, that form common patterns. Look at the cost savings segment at the bottom center of Fig 2 below.

Fig 2 | Source: Hortonworks

What do I begin with?

Begin with ETL (extract-transform-load). In the cost savings segment at the bottom center of Fig 2, the idea is to right-size your traditional environment and prepare to bring in some of the IoT sources, so that you run the transformation logic on a platform like Hadoop rather than on your traditional platform.

ETL use case

There are significant cost savings as well as performance benefits to off-loading much of that ETL workload into a Hadoop system. In many cases, the purpose is to augment or enrich downstream data warehouses, data marts, or traditional BI systems, so that you can bring in new sources of IoT data and start to enrich existing systems.

This is the common data warehouse augmentation and optimization use case.
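For a feel of what an off-loaded ETL step looks like, here is a small sketch in plain Python. It only imitates the extract, transform, and load stages on a handful of made-up server log lines; it does not use any Hadoop API, and the log format, the function names, and the "summary table" are all assumptions for illustration. On a cluster, the transform would play the role of the map side and the aggregation the reduce side.

```python
from collections import Counter

# Extract: raw server log lines as they might land in the cluster.
# (Hypothetical format: timestamp, method, path, status.)
RAW_LOG_LINES = [
    "2016-03-01T10:00:01 GET /products/42 200",
    "2016-03-01T10:00:02 GET /products/42 200",
    "2016-03-01T10:00:03 GET /checkout 500",
    "corrupted line",
    "2016-03-02T09:15:00 GET /products/7 200",
]


def transform(lines):
    """Transform: parse each line into (day, path, status), skipping malformed lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # Drop lines that do not match the expected format.
        timestamp, _method, path, status = parts
        yield timestamp[:10], path, int(status)


def load(records):
    """Load: aggregate successful hits per day and path into summary rows.

    The returned list stands in for a table in a downstream data warehouse.
    """
    hits = Counter()
    for day, path, status in records:
        if status == 200:
            hits[(day, path)] += 1
    return [
        {"day": day, "path": path, "hits": count}
        for (day, path), count in sorted(hits.items())
    ]


summary_rows = load(transform(RAW_LOG_LINES))
for row in summary_rows:
    print(row)
```

The point of the off-load is that the bulky raw lines never reach the warehouse. Only the small, cleaned summary does.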

Active archive use case

Some folks look at the active archive use case as a way of bringing online data that previously would have been aged off to tape, or might not have been stored at all. That gives them a lot of data available for archival reporting. That’s traditionally where the journeys have started in the past few years. The guys at Hortonworks say that over the last 12-18 months or so, real-time and predictive models have increasingly come onto the scene. And that begins to unlock the business outcomes.

You can bring data online into a central database where you can discover new patterns. You can also bring in the new IoT sources you may have at your disposal inside your organization. Increasingly, you can gather third-party data sets as well, such as demographic data, population data, and government data sets, and use them to enrich existing data so you can discover new insights.
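As a rough illustration of enrichment with a third-party data set, the plain-Python sketch below combines internal order data with population figures per region. All names and numbers are invented for the example, and the per-capita metric is just one assumed way such a blend could surface a new insight.

```python
from collections import defaultdict

# Internal data: orders tagged with a sales region. (Hypothetical sample data.)
orders = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 200.0},
    {"order_id": 4, "region": "west", "amount": 50.0},
]

# Third-party data set: population per region. (Made-up numbers.)
population = {"north": 1000, "south": 4000}


def revenue_per_capita(order_rows, population_by_region):
    """Enrich regional revenue with population to get revenue per person.

    Regions missing from the third-party data set map to None, so gaps in
    the external data stay visible instead of being silently dropped.
    """
    revenue = defaultdict(float)
    for order in order_rows:
        revenue[order["region"]] += order["amount"]

    enriched = {}
    for region, total in revenue.items():
        people = population_by_region.get(region)
        enriched[region] = total / people if people else None
    return enriched


insights = revenue_per_capita(orders, population)
for region, value in insights.items():
    print(region, value)
```

Raw revenue alone would rank the regions one way; dividing by population can rank them differently, which is the kind of pattern the enriched data lets you discover.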

The article is based on the Hortonworks webinar titled “Laying the foundation for a Data-Driven Enterprise with Hadoop”. It can be accessed here.