Distributed Lock for Iceberg in Chango

Kidong Lee
2 min readAug 17, 2024

--

Everytime iceberg table is committed, many files will be created like data files, metadata files, and snapshots. Even if iceberg supports transaction, but the conflict of iceberg table commits can occur when multiple engines write commits to the same iceberg table simultaneously.

This article shows how to avoid conflict of iceberg commits in Chango.

Distributed Lock for Iceberg

There are many scenarios of iceberg write, for example.

  • streaming ingestion in which iceberg write can occur very often.
  • write in append mode.
  • write in overwrite mode.
  • MERGE INTO query.
  • INSERT INTO query.

In order to avoid iceberg commit conflict, Chango uses distributed lock controlled by zookeeper. You can use curator to accomplish distributed lock with zookeeper.

Let’s see the following pseudo codes.

DistributedLock distributedLock = null;
try {
// create distributed lock.
distributedLock = new DistributedLock(zkHosts);

// acquire distributed lock.
int tryCount = 10;
String zkPath = DistributedLock.TX_ROOT_PATH + "/" + tablePathName;
InterProcessSemaphoreMutex interProcessSemaphoreMutex =
distributedLock.acquire(zkPath, 5, TimeUnit.SECONDS, tryCount);

if (interProcessSemaphoreMutex != null) {

// HERE, process of iceberg write.

// release lock.
interProcessSemaphoreMutex.release();
}
} finally {
if(distributedLock != null) {
distributedLock.close();
}
}

After acquiring lock for the iceberg table path in zookeeper, your logic for iceberg write will be processed, and then, distributed lock will be released. Finally, close the distributed lock.

Iceberg Table Maintenance After Iceberg Commits

As seen for now, distributed lock needs to be used to avoid conflict of iceberg commits. But because everytime iceberg table is committed, many data files and metadata, snapshot will be created, you also need to maintain iceberg tables. In Chango, Chango REST Catalog will do such as iceberg table maintenance like small data file compaction, snapshot expiration, old metadata and orphan file removal automatically. Because iceberg maintenance queries are also iceberg write process, distributed lock needs to be used to maintain iceberg tables.

Even if iceberg supports transaction, iceberg commit conflict can occur very often when multiple engines / multiple applications write commits to the same iceberg table simultaneously. Distributed lock is a very simple solution to avoid such as iceberg commit conflict.

That’s it.

--

--