DataRoaster is a tool that provisions data platforms running on Kubernetes. I have recently open-sourced it.
Before I developed DataRoaster, I used free data platforms like HDP (Hortonworks Data Platform) to build data lakes. After Hortonworks was acquired by Cloudera, HDP was no longer free. To build a data lake, there are also managed services like AWS EMR, but you have to consider the cost of using such services from public cloud providers.
As mentioned in A Concept: Kubernetes based Private Cloud Platform, I have been looking for an alternative to commercial data platforms and managed cloud services. …
Trino (formerly PrestoSQL) is a popular distributed interactive query engine for data lakes. Trino can be used not only as a query engine but also as a data preparation engine in a data lake, and it is one of my favorite data platform components. Here I am going to show you how to deploy Trino on Nomad.
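To give a rough idea of what such a deployment looks like, here is a minimal sketch of a Nomad job spec for a single Trino coordinator. The Docker image tag, port, and resource values are assumptions for illustration, not the exact configuration from my setup.

```hcl
# Hypothetical minimal Nomad job for a single Trino coordinator.
# Image, port, and resources are placeholders to adapt to your cluster.
job "trino" {
  datacenters = ["dc1"]
  type        = "service"

  group "coordinator" {
    count = 1

    network {
      port "http" {
        static = 8080
      }
    }

    task "trino" {
      driver = "docker"

      config {
        image = "trinodb/trino:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 1000
        memory = 4096
      }
    }
  }
}
```

A real deployment would also mount Trino's catalog and node configuration into the task, for example via Nomad templates.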
The following components should be available to you before getting started.
As I mentioned in the previous post, I have been looking for an alternative to Kubernetes for deploying stateful applications on container orchestrators. Generally, stateful applications need volumes to persist data, and those volumes should be provisioned dynamically when the stateful application job is submitted. Currently, Nomad does not support the kind of dynamic volume provisioning that Kubernetes does. Nevertheless, Nomad gives me an advantage in operating stateful applications: on Kubernetes, there is no concept of starting, stopping, and redeploying stateful applications like StatefulSets without risking data loss. If you have deleted StatefulSets on Kubernetes, it is really difficult to recover data on…
Hive Metastore is one of the most important components in a data lake: it serves as the data catalog. Many execution engines use Hive Metastore, for instance Spark, Presto, Hive on Tez, and Hive on Spark. You can deploy Hive Metastore on Kubernetes, but here I am going to talk about how to deploy Hive Metastore on Nomad.
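Whatever orchestrator it runs on, a standalone Hive Metastore essentially needs only a backing relational database and a Thrift listener. A hypothetical metastore-site.xml fragment might look like the following; the MySQL hostname, database name, and credentials are placeholders.

```xml
<!-- Hypothetical metastore-site.xml fragment.
     Hostname, database, and credentials are placeholders. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql.example.com:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>changeme</value>
  </property>
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://0.0.0.0:9083</value>
  </property>
</configuration>
```

The Nomad job then only has to run the metastore container with this configuration mounted, plus a one-time schema initialization against the database.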
If you want to run stateful applications on Kubernetes, CSI (Container Storage Interface) plays an important role in provisioning volumes dynamically. Generally speaking, CSI is used to provision volumes from storage backends not only for Kubernetes, but also for other container orchestrators such as Mesos and Nomad.
Here, I am going to talk about volume provisioning for Kubernetes and HashiCorp Nomad using Ceph CSI against an external Ceph storage cluster.
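On the Kubernetes side, dynamic provisioning from Ceph is typically wired up with a StorageClass that points at the Ceph RBD CSI driver. The following is a sketch under that assumption; the clusterID, pool, and secret names are placeholders for your own Ceph cluster.

```yaml
# Hypothetical StorageClass backed by the Ceph RBD CSI driver.
# clusterID, pool, and secret names are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-id>
  pool: kubernetes
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete
allowVolumeExpansion: true
```

With this in place, any PVC that names this StorageClass gets an RBD image carved out of the external Ceph cluster automatically; on Nomad, by contrast, the volume has to be created and registered with the cluster up front.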
The important components I have used for this post are as follows.
In this example, you will see how…
Kubernetes is a popular container orchestrator nowadays. I have also proposed a concept of a private cloud platform based on Kubernetes, and I have implemented a multi-tenant data platform for it. From my experience of building that platform, it is great to run stateless applications on Kubernetes, but I have noticed there are many things to take care of for stateful applications running on Kubernetes. Most of my data platform components running on Kubernetes are StatefulSets, and the things to be considered include, for instance, Pod QoS class rules that can get StatefulSet pods killed, careful rollouts, backup of PVs for DR…
If you want to expose an S3 service using MinIO, you should consider keeping MinIO data secure both in transit and at rest. In the previous post, I talked about securing MinIO in transit; in this post, I am going to show how to secure MinIO data at rest with Vault.
There are many key management systems (KMS) out there, like HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, and Gemalto KeySecure. Here, Vault will be used as the KMS.
The following picture shows how MinIO data is encrypted with Vault.
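In recent MinIO releases, MinIO does not talk to Vault directly but through KES, a stateless key distribution service backed by Vault. A sketch of the MinIO container environment for such a setup might look like the following; the KES endpoint, certificate paths, and key name are assumptions for illustration.

```yaml
# Hypothetical env section of a MinIO container spec, pointing MinIO
# at a KES server that is itself backed by Vault.
# Endpoint, cert paths, and key name are placeholders.
env:
  - name: MINIO_KMS_KES_ENDPOINT
    value: "https://kes.example.com:7373"
  - name: MINIO_KMS_KES_KEY_FILE
    value: "/certs/kes-client.key"
  - name: MINIO_KMS_KES_CERT_FILE
    value: "/certs/kes-client.crt"
  - name: MINIO_KMS_KES_KEY_NAME
    value: "minio-sse-key"
```

The named key is created in KES and used by MinIO for server-side encryption of objects, while the actual key material stays protected by Vault.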
MinIO is a popular S3-compatible object storage which can be run on Kubernetes. MinIO running on Kubernetes can easily be managed in the standard Kubernetes way, but you also have to worry that MinIO, deployed as a StatefulSet, could be restarted for unexpected reasons (for instance, StatefulSet pods can be killed according to Pod QoS class rules), which could result in bucket loss. To guard against that, you should use tools for backing up and restoring the PVs behind StatefulSets. The concept of storage on Kubernetes is great, but keeping it stable is not free.
Azkaban is a popular workflow engine which I have used many times to run jobs, especially in data lakes. There are similar workflow schedulers like Oozie and Airflow which provide more functionality than Azkaban does, but I prefer Azkaban because it has a more attractive UI.
Even though Azkaban provides several job types like hadoop, java, command, pig, and hive, I have used just the command job type in most cases. With the command job type, you simply type some shell commands to run jobs. It is simple, and it works for most cases, I think…
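A command-type job is just a small properties file. For example, this hypothetical hello.job runs a single shell command:

```
# hello.job — a minimal Azkaban command-type job (hypothetical example)
type=command
command=echo "hello from azkaban"
```

Zipping one or more such .job files and uploading the archive to an Azkaban project is enough to schedule them as a flow.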
It is not easy to run Hive on Kubernetes. As far as I know, Tez, which is a Hive execution engine, can run only on YARN, not on Kubernetes.
There is an alternative way to run Hive on Kubernetes: Spark can be run on Kubernetes, and Spark Thrift Server, which is compatible with HiveServer2, is a great candidate. That is, Spark will be used as the Hive execution engine.
I am going to talk about how to run Hive on Spark in a Kubernetes cluster.
All the code mentioned here can be cloned from my GitHub repo: https://github.com/mykidong/hive-on-spark-in-kubernetes
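Because Spark Thrift Server speaks the HiveServer2 protocol, any HiveServer2 client can connect to it. For instance, with beeline (the service hostname below is a placeholder, and 10000 is the default HiveServer2 port):

```shell
# Connect to Spark Thrift Server with beeline and list databases.
# The hostname is a placeholder for your Thrift Server service.
beeline -u jdbc:hive2://spark-thrift-server.example.com:10000 \
  -e "SHOW DATABASES;"
```

Existing Hive SQL clients and BI tools can point at the same JDBC URL without changes.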
Before running Hive on Kubernetes…