Some Differentiation Factors in Chango

Kidong Lee
3 min read · Mar 24, 2023

Chango is a SQL data lakehouse platform. There are several data lakehouse services out there, but Chango differs from them in a few ways that reduce complexity for its users.

Chango Architecture

Chango consists of a data ingestion layer and a query layer.

  • On the left side, external data such as CSV, JSON and Excel is inserted into Chango through the data API, Kafka cluster and Spark streaming job in the data ingestion layer.
  • On the right side, all data saved as Iceberg tables in Chango is queried through Trino clusters, to which the Trino gateway routes queries in the query layer.

No Streaming Platform, No Streaming Job

In most data lakehouses, inserting streaming events requires a streaming platform like Kafka plus an additional streaming job, such as a Spark streaming job, that you have to write yourself.

In contrast, a streaming application developer does not need to set up a streaming platform like Kafka or write a Spark streaming job to insert streaming events into Chango.

As depicted in the Chango architecture picture above, Chango provides a data ingestion group (data API, Kafka cluster and Spark streaming job) by default in the data ingestion layer to collect incoming streaming events and save them to Chango. So streaming application developers just need to call a Java class method of the Chango library to insert streaming events into Chango.
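To make the "just call a method" idea concrete, here is a minimal sketch of what a client of this kind might do internally: buffer JSON events and hand them to a sender in batches. This is not the Chango library. The class `BufferingEventClient`, its methods, and the batching behavior are all hypothetical, and the `sender` callback merely stands in for whatever call would deliver events to a data API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only: none of these names come from the Chango library.
class BufferingEventClient {
    private final int batchSize;
    // Stands in for the call that would deliver a batch to a data API (assumption).
    private final Consumer<List<String>> sender;
    private final List<String> buffer = new ArrayList<>();

    BufferingEventClient(int batchSize, Consumer<List<String>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Queue one JSON event; send the batch as soon as it is full.
    synchronized void add(String jsonEvent) {
        buffer.add(jsonEvent);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered, if anything, and clear the buffer.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sender.accept(List.copyOf(buffer));
        buffer.clear();
    }
}
```

With a client shaped like this, the application only calls `add(...)` for each event and `flush()` on shutdown; everything downstream of the sender is the platform's job, not the developer's.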

As a result, real-time analytics can be done in Chango without having to build and operate a complicated event streaming pipeline of Kafka and Spark streaming jobs yourself.

Monolithic Big Trino Cluster Problem

In the query layer, let’s say there is one big monolithic Trino cluster with a lot of worker nodes that handles Trino queries from both the data engineering and BI teams. If long-running ETL queries by data engineers, for example data preparation for AI algorithms or other data shaping jobs, occupy all the resources of the big Trino cluster, then interactive queries by BI teams to show monthly statistics in dashboards cannot run. So we need a Trino gateway to avoid this problem.

A big monolithic Trino cluster needs to be split into several Trino clusters according to team function. The Trino gateway routes queries from the data engineering team to the data engineering Trino clusters, and queries from BI teams to the BI Trino clusters. Because the two workloads run on separate clusters, they no longer compete for the same resources in Chango.
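The routing idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not the actual Trino gateway implementation: the class `ClusterGroupRouter`, the group names `etl` and `bi`, the cluster URLs, and the round-robin choice within a group are all hypothetical. How a query gets mapped to a group (by user, team, or some request attribute) is left out and assumed to happen before `route` is called.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: maps a team's cluster group to its Trino clusters
// and picks one of them round-robin.
class ClusterGroupRouter {
    private final Map<String, List<String>> clustersByGroup;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    ClusterGroupRouter(Map<String, List<String>> clustersByGroup) {
        this.clustersByGroup = Map.copyOf(clustersByGroup);
    }

    // Return the URL of the cluster that should run the next query of this group.
    String route(String group) {
        List<String> clusters = clustersByGroup.get(group);
        if (clusters == null || clusters.isEmpty()) {
            throw new IllegalArgumentException("No Trino cluster registered for group: " + group);
        }
        AtomicInteger counter = counters.computeIfAbsent(group, g -> new AtomicInteger());
        int index = Math.floorMod(counter.getAndIncrement(), clusters.size());
        return clusters.get(index);
    }
}

class GatewayRoutingDemo {
    public static void main(String[] args) {
        // Cluster URLs are placeholders, not real endpoints.
        ClusterGroupRouter router = new ClusterGroupRouter(Map.of(
                "etl", List.of("http://trino-etl-1:8080", "http://trino-etl-2:8080"),
                "bi", List.of("http://trino-bi-1:8080")));

        System.out.println("ETL query -> " + router.route("etl"));
        System.out.println("ETL query -> " + router.route("etl"));
        System.out.println("BI query  -> " + router.route("bi"));
    }
}
```

Because each group has its own list of clusters, a burst of ETL queries only cycles through the ETL clusters and never lands on the cluster serving BI dashboards.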

In addition to avoiding the monolithic Trino cluster problem, there are other reasons to use a Trino gateway: for example, to support high availability (HA) of the Trino coordinator, to scale out the number of concurrent Trino queries, and to upgrade Trino clusters without downtime.

That’s it.
