Cloud Native Data Platform without Hadoop Installation

Kidong Lee
5 min read · Apr 29, 2020
Photo by Mel Poole on Unsplash

Hadoop has become the most popular tool for handling Big Data.

But it is not easy to build a data platform composed of Hadoop, Hadoop ecosystem projects like Hive, HBase, and Pig, and other components like Spark and Kafka, because there are many things to get right: component compatibility, configuration tuning, optimization, security, and so on. Such difficult and complicated work has been done for us by data platform providers like Hortonworks, Cloudera, and MapR.

Now that Cloudera has announced it will stop offering a free version, we can no longer use Cloudera's data platform without a subscription.

If you want to build a data platform with components like Hadoop, Hive, Spark, and Kafka yourself, I think the most complicated part is installing Hadoop.

Maybe people don't want to pay for the subscription, or they want to replace Hadoop for other reasons. Either way, we can take a look at alternatives.

Here, I will talk about building a data platform without installing Hadoop, in a cloud native way.

Hadoop is the Standard

To handle file formats like Parquet, ORC, and Avro in Hadoop file systems, you have to use Hadoop-based APIs. You can also access files in Hadoop-supported file systems like S3, Azure Blob Storage, and Ozone through Hadoop connectors. Many systems use the Hadoop APIs as a de facto standard.
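For instance, once a connector is on the classpath, the ordinary Hadoop file system tooling works against object storage too. A minimal sketch — the bucket name and paths below are hypothetical:

```shell
# List files in an S3 bucket through the Hadoop S3A connector,
# using the same `hadoop fs` commands you would use against HDFS.
hadoop fs -ls s3a://my-bucket/warehouse/

# Copy a local Parquet file into object storage via the same API.
hadoop fs -put events.parquet s3a://my-bucket/warehouse/events/
```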

For now, we cannot imagine solving Big Data problems without Hadoop. But we can replace Hadoop's own components with others.

Replace Hadoop components with others

Hadoop is mainly composed of two parts: HDFS for storage and YARN for resource management.

HDFS is the default distributed file system in Hadoop, but there are Hadoop connectors (HDFS connector, S3A connector, WASBS connector, O3FS connector, etc.) with which you can access files saved in Hadoop-supported file systems, such as object storages like S3, MinIO, Azure Blob Storage, Ozone, and Ceph.
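Pointing the S3A connector at an S3-compatible store is a matter of configuration. A sketch of submitting a Spark job against a MinIO endpoint — the endpoint, credentials, bucket, and job file are placeholders:

```shell
# Configure the S3A connector for an S3-compatible endpoint (e.g. MinIO).
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.example.com:9000 \
  --conf spark.hadoop.fs.s3a.access.key=ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  my-job.py s3a://my-bucket/input/
```

The same `fs.s3a.*` properties could equally go into `core-site.xml`.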

YARN is the resource manager in Hadoop. Hive jobs on Tez and Spark jobs (batch and streaming) can be run on YARN.

But if you want to run all your batch and streaming jobs with Spark, you can take a look at Kubernetes, on which Spark jobs can also be run.
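Submitting a Spark job to Kubernetes instead of YARN only changes the master URL and a few configuration properties. A sketch — the API server address, namespace, image, and job class are placeholders:

```shell
# Submit a Spark job directly to a Kubernetes cluster instead of YARN.
spark-submit \
  --master k8s://https://kube-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name my-batch-job \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=my-registry/spark:2.4.5 \
  --conf spark.executor.instances=3 \
  --class com.example.MyJob \
  local:///opt/spark/jars/my-job.jar
```

Kubernetes then launches the driver and executors as pods in the given namespace.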

Because Hive on Tez jobs for long-running batch work can be run only on YARN, you can replace Tez with Spark as the execution engine for Hive; that is, Hive on Spark jobs can be run on Kubernetes. HiveServer2 can be replaced with Spark Thrift Server, which is compatible with HiveServer2.
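Since Spark Thrift Server speaks the HiveServer2 protocol, existing Hive clients keep working. A sketch of starting it against an external metastore and querying it with beeline — host names are placeholders:

```shell
# Start Spark Thrift Server (HiveServer2-compatible SQL endpoint).
$SPARK_HOME/sbin/start-thriftserver.sh \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore.example.com:9083

# Any HiveServer2 client (beeline, JDBC, ODBC) can connect to it.
beeline -u jdbc:hive2://spark-thrift-server.example.com:10000 -e "SHOW TABLES;"
```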

As seen in the table below, you can compare the components of a data platform with and without Hadoop installation.

Note that this table does not list all the components of a complete data platform.

Cloud Native Data Platform

Let's look at a typical data platform based on the Lambda Architecture.

The respective components of this data platform do their jobs as follows.

In Ingestion Layer:

  • Importer: structured data, such as RDBMS data, is imported to Storage.
  • File Uploader: unstructured data, such as images, videos, and binary files, is loaded to Storage.
  • Event Collector: semi-structured data, such as user behavior events from web sites, is collected.

In Stream Layer:

  • Unified Log System: events from the Event Collector are published to a unified log system like Kafka.
  • Streaming: a streaming application like Spark Streaming handles the events consumed from Kafka topics.
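A streaming job like this can also be submitted to Kubernetes; the only extra piece is the Kafka connector package. A sketch — the connector version must match your Spark/Scala build, and all addresses, images, and file names below are placeholders:

```shell
# Submit a Spark Structured Streaming job that consumes a Kafka topic,
# running on Kubernetes rather than YARN.
spark-submit \
  --master k8s://https://kube-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
  --conf spark.kubernetes.container.image=my-registry/spark:2.4.5 \
  my-streaming-job.py kafka-broker.example.com:9092 events-topic
```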

In Batch Layer:

  • Data Prep: data preparation, such as cleansing and transforming data from Storage, produces datasets for the AI/ML jobs.
  • EDA: exploratory data analysis can be done before running AI/ML jobs.
  • Analytics: AI/ML jobs can be run.

In Storage Layer:

  • Object Store: S3-compatible object storage like S3, Ozone, or Ceph can be used to save Hadoop files (Delta Lake, Parquet, ORC, Avro, SequenceFile, etc.) and binary files (images, videos, etc.).
  • Data Catalog: Hive Metastore can be used as the data catalog.
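Engines find their tables by pointing at that shared metastore. A sketch of wiring a Spark job to an external Hive Metastore with a warehouse on object storage — host names, bucket, and job file are placeholders:

```shell
# Use an external Hive Metastore as the shared data catalog for Spark.
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore.example.com:9083 \
  --conf spark.sql.warehouse.dir=s3a://my-bucket/warehouse \
  my-job.py
```

Presto reads the same metastore through its Hive connector, so both engines see one set of tables.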

In Interactive Query Layer:

  • Interactive Query Service: Presto can be used as the interactive query service. Hive on Spark (through Spark Thrift Server) can also be used as an interactive query service.
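Querying the same catalog interactively through Presto then looks like this — the coordinator address and table name are placeholders:

```shell
# Run an ad-hoc query against the Hive catalog through the Presto CLI.
presto --server presto-coordinator.example.com:8080 \
       --catalog hive --schema default \
       --execute "SELECT count(*) FROM events"
```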

In BI Layer:

  • BI: Superset, connected to the Presto Coordinator or Spark Thrift Server, can be used as the BI tool.
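In Superset, each of those connections is configured as a SQLAlchemy URI. A sketch, assuming the PyHive drivers and placeholder host names:

```
# Presto coordinator:
presto://presto-coordinator.example.com:8080/hive

# Spark Thrift Server (speaks the HiveServer2 protocol):
hive://spark-thrift-server.example.com:10000/default
```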

So far, we have seen that HDFS can be replaced with object storage, and YARN can be replaced with Kubernetes.

Let's think about this data platform from the point of view of a container orchestrator like Kubernetes.

As seen in the picture below, all the components of the data platform and all the batch/streaming jobs can be run on Kubernetes.

Generally speaking, both stateless and stateful applications run on a Kubernetes container orchestrator. If your stateful applications need volumes, they can get them mounted from external storage via CSI, etc.

But Why Not Storage on Kubernetes?

There are S3-compatible, CSI-supporting object storages like Ceph and Ozone which can be deployed on Kubernetes. Note that S3-compatible object storage will be used as Hadoop storage, while CSI-supporting object storage will be used as Kubernetes storage.

Now, object storages like Ceph and Ozone can themselves be deployed on Kubernetes.
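For example, Ceph can be deployed on Kubernetes through the Rook operator, which also exposes an S3-compatible gateway. A rough sketch — the manifest file names follow the Rook example layout and are assumptions here, not exact paths:

```shell
# Deploy Ceph on Kubernetes with the Rook operator (manifest names assumed).
kubectl create -f common.yaml      # namespaces and RBAC for Rook
kubectl create -f operator.yaml    # the Rook Ceph operator
kubectl create -f cluster.yaml     # the Ceph cluster itself
kubectl create -f object.yaml      # CephObjectStore: S3-compatible gateway
```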

A cloud native data platform, then, is one in which all the components of the data platform, all the batch/streaming jobs, and the storage run on Kubernetes.

With this cloud native data platform approach, scale-out, deployment, monitoring, security, etc. can be handled with ease and in a standard way, because everything runs on Kubernetes and is managed by it.
