Is 2016 the year you get serious about starting your organization’s Big Data journey? Then this is a must-read series. Our first post discussed three approaches to Big Data implementation; in this blog, we’ll look at the technology at the core of Big Data analytics.
If Big Data technology seems overwhelmingly complicated, don’t worry. You’re far from alone in that view. But today is the day we change all that. It’s Big Data for Everyone day.
As tempting as it is to let the techies handle every bit of your Big Data journey, don’t give in to the temptation. You need to be able to make an informed decision about what your company is contemplating. And to do that, you need an overall picture of how Big Data works.
We’re not suggesting you roll up your sleeves and start a crash course in Hadoop. But a basic understanding of the fundamentals of Big Data processing is both necessary and relatively brief. Ready? Let’s get started.
Meet Your Big Data Tech
Think of Big Data technology like a sports team. On the team, you have five full-time players and an optional sixth player. For data processing, these players are:
- The Source Systems – where all data initially lives
- The Data Ingestion Layer – which moves data from the Source Systems to the Warehouse
- The Data Warehouse – where data is stored
- The Data Mart – where analytics-ready data is kept
- The Analytics Engine – where data is processed and analyzed
- The (optional) Data Sink – a repository for analyzed data
In this article, we’ll focus on the first five components on our list. In particular, we’ll discover what function they serve on the Big Data team, and how they relate to each other.
The Source Systems
The source systems are where all your data starts. This is the headwaters of the data river, the place where all the action begins. In the source systems, raw data of all types lives in its native IT environment (e.g. an Excel spreadsheet) or in a program built for the purpose, such as a MySQL database.
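As a concrete sketch of what "data living in its source system" looks like, here is a toy relational source. SQLite stands in for a production database such as MySQL, and the table and column names are invented for illustration:

```python
import sqlite3

# Hypothetical source system: a small relational database of raw sales
# records. (SQLite stands in here for a production system such as MySQL.)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "East", 120.0), (2, "West", 340.5), (3, "East", 99.9)],
)
source.commit()

# At this stage the data simply lives where it was created -- nothing has
# been moved, cleaned, or analyzed yet.
rows = source.execute("SELECT * FROM sales").fetchall()
print(rows)
```

The point is only that source data sits wherever it was created, in whatever shape its creator chose; everything downstream has to go and get it.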
The Data Ingestion Layer
Putting the analytics into Big Data analytics requires that all this disparate information be processed into cohesive units and studied. Obviously, this can’t happen unless the data leaves its source system. To get to the giant holding and sorting center that is the data warehouse, the data travels through the data ingestion layer.
Basically, the data ingestion layer works in two ways: it can stream information into the warehouse in real time, or it can send it in scheduled batches. In batch mode, the initial data transfer will likely be massive and will include various historical data. After this point, incremental data transfers keep the warehouse up to date. Streaming mode runs nearly continuously: either the source systems push their data to a third-party app, or the consumers themselves initiate the transfer.
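The batch pattern described above, a big initial load followed by incremental top-ups, can be sketched in a few lines. This is a minimal illustration using SQLite in place of real source and warehouse systems; the `events` table and the id-based high-water mark are assumptions, not a description of any particular tool:

```python
import sqlite3

# Hypothetical batch-mode ingestion: copy only rows newer than the last
# transfer (an incremental load) from a source table into a warehouse table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

def incremental_load(src, dst, last_id):
    """Copy rows with id > last_id into the warehouse; return the new high-water mark."""
    rows = src.execute(
        "SELECT id, payload FROM events WHERE id > ?", (last_id,)
    ).fetchall()
    dst.executemany("INSERT INTO events VALUES (?, ?)", rows)
    dst.commit()
    return max((r[0] for r in rows), default=last_id)

# Initial (historical) load, then an incremental one after new data arrives.
mark = incremental_load(source, warehouse, last_id=0)
source.execute("INSERT INTO events VALUES (5, 'e')")
mark = incremental_load(source, warehouse, mark)
count = warehouse.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # all five rows now sit in the warehouse
```

Real ingestion tools add scheduling, retries, and schema handling, but the core idea, remember where you left off and move only what is new, is exactly this.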
The Data Warehouse
A data warehouse is almost exactly what it sounds like: a vast holding space for data. Once data has been ingested, it heads into the warehouse. This data can be raw (from the Source Systems) or it can be either fully or partially processed.
However, this isn’t just a passive place; as the backbone of a data analytics system, it needs to be properly built. Any data warehouse should come with scalable distributed storage. An efficient processing engine and the right programming language are key here. Your organization’s choice will be based on several factors, including the amount and type of data being processed and company preferences.
The Data Mart
The data mart is a subset of your data warehouse. Consider it as a sort of specialty shop within a giant megastore. While the data warehouse holds all your data, the data mart holds a specific set of data – say for your marketing department or your HR department. Data in the data mart has already been prepped and is ready for analysis.
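To make the "specialty shop" idea concrete, here is a toy illustration of carving a marketing mart out of a warehouse. The department names and fields are invented for the example:

```python
# Hypothetical illustration of a data mart: a department-specific slice of
# the warehouse, trimmed and prepped for analysis. Field names are made up.
warehouse = [
    {"dept": "marketing", "campaign": "spring", "spend": 1200.0},
    {"dept": "hr", "role": "recruiter", "headcount": 3},
    {"dept": "marketing", "campaign": "summer", "spend": 800.0},
]

# The marketing mart keeps only marketing rows, reduced to the fields
# that department's analysts actually use.
marketing_mart = [
    {"campaign": row["campaign"], "spend": row["spend"]}
    for row in warehouse
    if row["dept"] == "marketing"
]
print(marketing_mart)
```

In practice a mart is usually built with queries against the warehouse rather than in application code, but the relationship is the same: the warehouse holds everything, the mart holds one department's ready-to-analyze slice.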
The Analytics Engine
The analytics engine is where the fun really starts. Until now, data has more or less been lying around; the analytics engine gets it up and puts it to work.
There are two parts to an analytics engine: data processing and data analysis. In the data processing step, data from the warehouse is validated, cleaned, and checked for missing values. Once it is in a usable form, it can go to the analytics side. Here it is segmented, classified, and modeled. The ultimate goal for all this work is to generate recommendations based on a thorough study of your data.
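Here is a toy sketch of those two steps, processing (deduplicate, fill missing values) followed by analysis (a simple segmentation). All the customer data is invented, and real engines would use far richer cleaning rules and models:

```python
# Step 1 inputs: raw warehouse rows with a missing value and a duplicate.
raw = [
    {"customer": "A", "spend": 120.0},
    {"customer": "B", "spend": None},      # missing value
    {"customer": "C", "spend": 480.0},
    {"customer": "C", "spend": 480.0},     # duplicate row
]

# -- Processing: drop duplicates, then fill missing spend with the mean.
seen, cleaned = set(), []
for row in raw:
    key = (row["customer"], row["spend"])
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(row))
known = [r["spend"] for r in cleaned if r["spend"] is not None]
mean_spend = sum(known) / len(known)
for r in cleaned:
    if r["spend"] is None:
        r["spend"] = mean_spend

# -- Analysis: segment customers into low/high spenders around the mean.
segments = {r["customer"]: ("high" if r["spend"] >= mean_spend else "low")
            for r in cleaned}
print(segments)
```

Even in this tiny example the order matters: segmenting before cleaning would have crashed on the missing value and double-counted the duplicate, which is why processing always comes first.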
Obviously, this setup isn’t something you can pull off over the weekend or set up in the corner of your office. So this raises another question:
Where Will You Keep Your Big Data Technology?
You have two options: in-house or in the cloud. Both have their advantages and disadvantages.
If you want to keep everything on-site, data integration is usually easier, and data privacy is stronger. But scalability – a major factor in data analytics – can present a problem unless you design your system carefully. And while the initial cost of an on-site system is higher, its long-term maintenance costs are lower.
Cloud-based data analytics offers its own perks. The initial setup costs are lower, and it’s highly flexible and scalable. But these are balanced by weaker data privacy, trickier data integration, and higher long-term costs.
There’s no right or wrong answer for where you’ll keep your Big Data analytics system. Whatever works best for your business is the right choice.
So, now you have a basic understanding of the technology involved in Big Data analytics. In the next post, we’ll move deeper into the subject and talk about developing your data science team. Don’t miss it!