Saturday, November 2, 2024

Tip of the Iceberg

(Image credit: https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/)

 

In 1912, the Titanic sank after hitting an iceberg. While the Titanic might come to mind when we hear 'iceberg,' this post explores a different kind of iceberg, one that is revolutionizing data architecture: Apache Iceberg.

If you don't know about Apache Iceberg yet, do not waste a minute: read on.

There has been a lot of momentum around Apache Iceberg recently, and you might have heard news like Datazip raising $1M in funding, Google announcing a preview of BigQuery tables for Apache Iceberg, and Cloudera unveiling a partnership with Snowflake to enhance hybrid data management (a unified hybrid data lakehouse powered by Apache Iceberg).

So, what exactly is Apache Iceberg, and how does it fit into modern data architecture? 

Data lakes are ideal for storing massive volumes of semi-structured and unstructured data in native formats. This is called a file-based data lake, and it is an ideal choice for organizations that want cost-effective storage and flexible data exploration.

The individual files in a file-based data lake do not carry the information query engines need for pruning, time travel, or schema evolution (basically adding, removing, or renaming columns without unloading and reloading data). A file-based data lake also does not support ACID transactions.

On the other hand, a data warehouse enforces structure, supports SQL and query pruning, guarantees ACID transactions, and is optimized for analytical workloads.

Moving data between a lake and a warehouse is laborious, and keeping data up to date in both systems increases the risk of inconsistencies, delays, and operational bottlenecks.

There is a third architecture that sits between the data lake and the data warehouse: the data lakehouse. The data lake serves as storage, with warehouse functionality built on top of it, so there is no need to move data between systems. It supports SQL and guarantees ACID transactions over a single copy of the data.

Components of a Data Lakehouse:

·         Storage System: to keep files on your cloud object store or distributed file system.

·         File Format: to store data efficiently – Apache Parquet or ORC

·         Table Format: to organize files into tables – Apache Iceberg or Delta Lake.

·         Catalog: to track metadata and ensure consistency and ease of access.

·         Query Engine: lets you run operations on these tables; examples include Spark, Dremio, and Snowflake (a minimal wiring sketch follows this list).
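To make these components concrete, here is a minimal wiring sketch with PySpark, assuming Iceberg as the table format, Parquet as the file format, a local file-system (Hadoop) catalog, and a local path standing in for cloud storage. The catalog name demo, the table demo.db.orders, the warehouse path, and the pinned runtime version are all illustrative choices, not requirements.

# Minimal lakehouse wiring with PySpark; all names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    # Table format: pull in the Iceberg runtime and its SQL extensions.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Catalog: a simple file-system (Hadoop) catalog for local experimentation.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    # Storage system: a local path here; an s3://... bucket in the cloud.
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# File format: Iceberg writes Parquet data files by default.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.orders (
        order_id BIGINT,
        customer STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP
    ) USING iceberg
""")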

So Apache Iceberg is an open table format that fits into the data lakehouse architecture. It was originally developed at Netflix and later donated to the Apache Software Foundation. One more advantage of an open table format is that there is no vendor lock-in: we can switch to other query engines easily.

Benefits of Apache Iceberg (a few of these are sketched in code after the list):

·         Expressive SQL: Enables updates, merges, and deletes.

·         Schema Evolution: Allows metadata updates without data rewrites.

·         Partition Evolution: Lets you change a table's partition layout without rewriting existing data.

·         Time Travel and Rollback: Access previous versions of data.

·         Transactional Consistency: Supports ACID transactions.

·         Faster Querying: Optimized for performance.
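A hedged sketch of a few of these benefits in action, continuing with the demo.db.orders table and the spark session from the earlier snippet:

from datetime import datetime

# Stage some incoming rows as a temp view for the MERGE below.
updates = spark.createDataFrame(
    [(1, "alice", 42.0, datetime(2024, 11, 1))],
    "order_id BIGINT, customer STRING, amount DOUBLE, order_ts TIMESTAMP",
)
updates.createOrReplaceTempView("updates")

# Expressive SQL: MERGE upserts rows in a single atomic statement.
spark.sql("""
    MERGE INTO demo.db.orders t
    USING updates u
    ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Schema evolution: adding a column is a metadata-only change, no data rewrite.
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN discount DOUBLE")

# Time travel: read the table as of an earlier point in time (the timestamp
# must fall after the table's first snapshot).
spark.sql(
    "SELECT * FROM demo.db.orders TIMESTAMP AS OF '2024-11-01 12:00:00'"
).show()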

What are the key components of Iceberg, and what is their purpose?

·         Catalog Layer:

 

o   The catalog is responsible for managing the metadata of Iceberg tables. It maintains a pointer to each table's current metadata file, which describes the state of the table.

o   Iceberg supports various catalog implementations, including Hive Metastore, AWS Glue, and its own standalone catalog, allowing flexibility in how metadata is stored and accessed. Snowflake has donated its Polaris catalog to the Apache Software Foundation.

o   The catalog acts as the gatekeeper, keeping track of which metadata.json file of a table the engine should be working with.

 

Broadly, there are two types of catalogs:

File system catalog: stores metadata in the file system itself; this has bottlenecks and is not recommended for production.

Service catalog: Hive Metastore, JDBC, and REST-based catalogs. Snowflake's Polaris is a REST-based catalog (see the pyiceberg sketch below).
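As a sketch of what talking to a service catalog looks like, here the pyiceberg library is pointed at a hypothetical REST catalog endpoint; the URI, token, warehouse, and table name are placeholders, not a real deployment.

from pyiceberg.catalog import load_catalog

# Connect to a REST-based service catalog (e.g., a Polaris-style endpoint).
# All connection details below are placeholders.
catalog = load_catalog(
    "rest_demo",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",
        "token": "<access-token>",
        "warehouse": "demo_warehouse",
    },
)

# The catalog resolves the table name to its current metadata.json pointer.
table = catalog.load_table("db.orders")
print(table.metadata_location)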

·         Metadata Layer:

o   This layer consists of several components:

o   Metadata File: Contains information about the table's schema, partitioning, snapshots, and the current state of the table.

o   Manifest List: A file that lists the manifest files that make up a snapshot.

o   Manifest Files: These track individual data files along with statistics and other details about each file, enabling efficient querying and data retrieval.

o   The metadata layer allows Iceberg to support features like schema evolution, partition evolution, and time travel. It is also directly queryable, as sketched below.
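This layer is not a black box: Iceberg exposes it through queryable metadata tables. A small sketch, again assuming the demo.db.orders table from the Spark snippets above:

# Snapshots: one row per table version; this is what powers time travel.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM demo.db.orders.snapshots
""").show()

# Manifests tracked by the current snapshot's manifest list.
spark.sql("""
    SELECT path, added_data_files_count
    FROM demo.db.orders.manifests
""").show(truncate=False)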

·         Data Layer:

o   The data layer holds the actual data files (e.g., in formats like Parquet or ORC) along with delete files for managing data changes.

o   This layer is where the actual data resides, and it serves the data for queries executed against Iceberg tables. It can be inspected the same way, as shown below.
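A small sketch: the files metadata table lists each live data file with its per-file statistics (the columns selected below are a subset), again using the placeholder demo.db.orders table:

# Each row is one data file, with the stats engines use for pruning.
spark.sql("""
    SELECT file_path, file_format, record_count, file_size_in_bytes
    FROM demo.db.orders.files
""").show(truncate=False)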

One unique feature of Apache Iceberg over other table formats is partition evolution. The partition information of an Iceberg table is tracked in metadata, not in the data files. This decoupling lets you modify the partitioning logic without rewriting the whole dataset 😊
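A sketch of what partition evolution looks like with the Iceberg Spark SQL extensions, using the same placeholder table; each statement below is a metadata-only change, so existing data files are left untouched:

# Start partitioning new data by day of the order timestamp.
spark.sql("ALTER TABLE demo.db.orders ADD PARTITION FIELD days(order_ts)")

# Later, switch to hourly partitions; files written under the old spec
# keep their layout and remain fully queryable.
spark.sql("ALTER TABLE demo.db.orders DROP PARTITION FIELD days(order_ts)")
spark.sql("ALTER TABLE demo.db.orders ADD PARTITION FIELD hours(order_ts)")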

How can we use Iceberg tables in Snowflake?

Snowflake works as a query engine for Apache Iceberg: the data stays in your own cloud storage, and Snowflake queries it in place. Using Iceberg tables in Snowflake allows you to manage large datasets stored in external cloud storage while leveraging Snowflake's powerful querying capabilities.

An Iceberg table is a first-class object in Snowflake, meaning we get the same advantages as permanent tables in Snowflake: Snowflake data types, performance, RBAC, data masking, multi-table transactions, time travel, and search optimization.

We can even join an Iceberg table with other Snowflake permanent tables.
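For a flavor of what this looks like, here is a hedged sketch using the Snowflake Python connector. The account details, external volume, and the table and column names (including the customers table in the join) are placeholders, and the external volume must already be set up; the tutorial below walks through that.

import snowflake.connector

# Placeholder credentials; in practice use key-pair auth or SSO.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="DEMO_WH", database="DEMO_DB", schema="PUBLIC",
)

# A Snowflake-managed Iceberg table: data lives in your own cloud storage
# (via the external volume), but Snowflake acts as catalog and engine.
conn.cursor().execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS orders_iceberg (
        order_id NUMBER,
        customer STRING,
        amount   DOUBLE
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'demo_ext_vol'
    BASE_LOCATION = 'orders_iceberg/'
""")

# It queries like any other table, including joins with permanent tables.
rows = conn.cursor().execute("""
    SELECT o.order_id, c.customer_name
    FROM orders_iceberg o
    JOIN customers c ON o.customer = c.customer_id
""").fetchall()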

Below is a quick tutorial: https://quickstarts.snowflake.com/guide/tasty_bytes_working_with_iceberg_tables/index.html#0

 Use Cases for Iceberg Tables in Snowflake:

·         Data Lakes with Existing Datasets: Efficiently query large datasets already in data lake formats.

·         Multi-Engine Workflows: Query the same tables across different engines (e.g., Spark and Snowflake).

·         Cost-Efficient Storage Solutions: Utilize existing data lakes without ingesting into Snowflake.

·         Machine Learning Applications: Integrate external and internal datasets seamlessly.

As the title suggests, this is just the tip of the iceberg. Have you used Iceberg in your Snowflake environment?

Resources:

https://www.dremio.com/press-releases/dremio-team-authoring-oreillys-definitive-guide-on-apache-iceberg-only-book-of-its-kind/