In 1912, the Titanic sank after hitting an iceberg. While the ship might come to mind when we hear 'iceberg,' this post explores a different kind of Iceberg, one that is reshaping data architecture: Apache Iceberg.
If you want to know more about Apache Iceberg, do not waste a minute: read on.
There has been a lot of momentum around Apache Iceberg recently, and you might have heard news such as Datazip raising $1M in funding, Google announcing a preview of BigQuery tables for Apache Iceberg, and Cloudera unveiling a partnership with Snowflake to enhance hybrid data management (a unified hybrid data lakehouse powered by Apache Iceberg).
So, what exactly is Apache Iceberg, and how does it
fit into modern data architecture?
Data lakes are ideal for storing massive volumes of semi-structured and unstructured data in native formats. This is called a file-based data lake, and it is an ideal choice for organizations that want cost-effective storage and flexible data exploration.
However, the individual files in a file-based data lake do not carry the information query engines need for pruning, time travel, or schema evolution (basically adding, removing, or renaming columns without unloading and reloading data). A file-based data lake also does not support ACID transactions.
On the other hand, a data warehouse enforces structure, supports SQL and query pruning, guarantees ACID, and is optimized for analytics.
Moving data between a lake and a warehouse is laborious, and keeping data up to date in both systems increases the risk of inconsistencies, delays, and operational bottlenecks.
There is another architecture that sits between the data lake and the data warehouse: the data lakehouse. The data lake serves as storage, with warehouse functionality built on top of it, so there is no need to move data between systems. It supports SQL and guarantees ACID with a single copy of the data.
Components of a data lakehouse:
· Storage system: to keep files on your cloud or distributed file system.
· File format: to store data efficiently (Apache Parquet or ORC).
· Table format: to organize files into tables (Apache Iceberg or Delta Lake).
· Catalog: to track metadata and ensure consistency and ease of access.
· Query engine: to run operations on these tables, such as Spark (streaming), Dremio (batch processing), or Snowflake (batch and incremental loads).
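To make these layers concrete, here is a minimal sketch of how a Spark session might wire them together for Iceberg. The catalog name `lakehouse`, the S3 warehouse path, and the choice of a Hive-backed catalog are illustrative assumptions, not something from this post.

```python
# A sketch of the lakehouse layers expressed as Spark configuration for
# Apache Iceberg. Catalog name "lakehouse" and the s3 path are placeholders.
spark_conf = {
    # Table format + engine integration: Iceberg's SQL extensions for Spark.
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    # Catalog layer: register an Iceberg catalog named "lakehouse".
    "spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog",
    # A service catalog (Hive Metastore) rather than a file-system catalog.
    "spark.sql.catalog.lakehouse.type": "hive",
    # Storage system: where the data and metadata files live.
    # (The file format defaults to Parquet for Iceberg data files.)
    "spark.sql.catalog.lakehouse.warehouse": "s3://my-bucket/warehouse",
}
```

With this configuration, `SELECT * FROM lakehouse.db.orders` in Spark SQL would resolve the table through the catalog, read Iceberg metadata, and scan Parquet files in the warehouse path.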
So, Apache Iceberg is an open table format that fits into the data lakehouse architecture. It was developed at Netflix and donated to the Apache Software Foundation. Another advantage of an open table format is that there is no vendor lock-in: we can switch to other query engines easily.
Benefits of Apache Iceberg:
· Expressive SQL: enables updates, merges, and deletes.
· Schema evolution: allows metadata-only column changes without data rewrites.
· Partition evolution: changes how similar rows are grouped for efficient access, without rewriting existing data.
· Time travel and rollback: access previous versions of data.
· Transactional consistency: supports ACID transactions.
· Faster querying: optimized for performance.
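Time travel and rollback are easiest to see with a toy model. The sketch below is plain Python, not Iceberg's actual API: it mimics how a table accumulates immutable snapshots, so a reader can query an older snapshot or roll back without any data being rewritten.

```python
# Toy model of snapshot-based time travel; not Iceberg's real API.
class ToyTable:
    def __init__(self):
        self.snapshots = []          # each snapshot is an immutable list of rows

    def commit(self, rows):
        """Writes never mutate old data; they append a new snapshot."""
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(rows))
        return len(self.snapshots) - 1   # the new snapshot id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or any earlier one (time travel)."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

    def rollback(self, snapshot_id):
        """Rollback just moves the 'current' pointer; no files are rewritten."""
        self.snapshots = self.snapshots[: snapshot_id + 1]

t = ToyTable()
s0 = t.commit([("order-1", 100)])
s1 = t.commit([("order-2", 250)])
assert t.read() == [("order-1", 100), ("order-2", 250)]
assert t.read(s0) == [("order-1", 100)]   # time travel to the first snapshot
t.rollback(s0)
assert t.read() == [("order-1", 100)]     # rollback restores the old state
```

In real Iceberg, the same idea is exposed through SQL clauses and procedures that select a snapshot by id or timestamp.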
What are the key components of Iceberg, and what is their purpose?
· Catalog layer:
o The catalog is responsible for managing the metadata of Iceberg tables. It maintains a pointer to the current metadata file, which describes the state of the table.
o Iceberg supports various catalog implementations, including Hive Metastore, AWS Glue, and its own standalone catalog, allowing flexibility in how metadata is stored and accessed. Snowflake has donated its Polaris catalog to the Apache Software Foundation.
o The catalog acts as the gatekeeper, keeping track of which metadata.json of the table the engine should be working with.
Broadly, there are two types of catalogs:
File system catalog: stores metadata in the file system; it has bottlenecks and is not recommended for production.
Service catalog: Hive Metastore, JDBC, or REST-based catalogs. Snowflake Polaris is a REST-based catalog.
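The gatekeeper role can be sketched as a tiny compare-and-swap store: a commit succeeds only if the writer saw the latest metadata pointer, which is what lets multiple engines share one table safely. This is a simplified model, not a real catalog implementation; the table and path names are made up.

```python
# Minimal sketch of a catalog: table name -> current metadata.json pointer.
# Commits use compare-and-swap so concurrent writers cannot clobber each other.
class ToyCatalog:
    def __init__(self):
        self._pointers = {}

    def current_metadata(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new_metadata):
        """Atomically swap the pointer only if it still equals `expected`."""
        if self._pointers.get(table) != expected:
            return False                  # someone else committed first; retry
        self._pointers[table] = new_metadata
        return True

cat = ToyCatalog()
assert cat.commit("orders", None, "s3://wh/orders/metadata/v1.metadata.json")
# A writer that read v1 can commit v2...
assert cat.commit("orders",
                  "s3://wh/orders/metadata/v1.metadata.json",
                  "s3://wh/orders/metadata/v2.metadata.json")
# ...but a stale writer that still thinks v1 is current must retry.
assert not cat.commit("orders",
                      "s3://wh/orders/metadata/v1.metadata.json",
                      "s3://wh/orders/metadata/v2b.metadata.json")
```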
· Metadata layer:
o This layer consists of several components:
o Metadata file: contains information about the table's schema, partitioning, snapshots, and the current state of the table.
o Manifest list: a collection of manifest files that describe the data files included in a snapshot.
o Manifest files: track individual data files along with statistics and other details about each file, enabling efficient querying and data retrieval.
o The metadata layer is what allows Iceberg to support features like schema evolution, partition evolution, and time travel.
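These three pieces form a tree: metadata file, then manifest list, then manifest files, then data files. A hypothetical sketch of that tree (field names simplified, not the real spec), showing how per-file statistics in a manifest enable pruning:

```python
# Hypothetical, simplified shape of Iceberg's metadata tree.
metadata_file = {
    "schema": ["id", "event_date"],
    "partition_spec": ["event_date"],
    "current_snapshot": {
        "manifest_list": [               # the manifest list for this snapshot
            {   # a manifest file tracking data files plus column statistics
                "data_files": [
                    {"path": "a.parquet", "min_id": 1,   "max_id": 100},
                    {"path": "b.parquet", "min_id": 101, "max_id": 200},
                ]
            }
        ]
    },
}

def prune(meta, wanted_id):
    """Use manifest statistics to skip files that cannot contain wanted_id."""
    hits = []
    for manifest in meta["current_snapshot"]["manifest_list"]:
        for f in manifest["data_files"]:
            if f["min_id"] <= wanted_id <= f["max_id"]:
                hits.append(f["path"])
    return hits

assert prune(metadata_file, 150) == ["b.parquet"]   # a.parquet is skipped
```

A query engine walks exactly this tree: it asks the catalog for the current metadata file, then uses the statistics to read only the data files that can match.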
· Data layer:
o The data layer holds the actual data files (e.g., in formats like Parquet or ORC) along with delete files for managing data changes.
o This layer is where the actual data resides, and it serves the data for queries executed against Iceberg tables.
One feature that sets Apache Iceberg apart from other table formats is partition evolution. The partition information of an Iceberg table is tracked in metadata, not in the data files. This decoupling lets you modify the partition logic without rewriting the whole dataset 😊
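Because the partition spec lives in metadata, evolving it only affects files written afterwards; existing files keep the spec they were written under, and the engine plans both together. A toy illustration (not Iceberg internals; the spec and file names are invented):

```python
# Toy illustration: the partition spec is metadata, so evolving it does not
# rewrite existing files; each data file remembers the spec it was written with.
table = {"specs": {0: ["event_month"]}, "files": []}

def write(table, path):
    spec_id = max(table["specs"])            # new files use the latest spec
    table["files"].append({"path": path, "spec_id": spec_id})

write(table, "jan.parquet")                  # written under monthly partitioning

# Partition evolution: switch to daily partitioning; a metadata-only change.
table["specs"][1] = ["event_day"]
write(table, "feb-01.parquet")               # new files use the daily spec

# The old file was not rewritten; it still references the old spec.
assert table["files"][0]["spec_id"] == 0
assert table["files"][1]["spec_id"] == 1
```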
How can we use Iceberg tables in Snowflake?
Snowflake works as a query engine for Apache Iceberg: the data stays in your own cloud storage, and Snowflake acts as the query engine. Using Iceberg tables in Snowflake allows you to manage large datasets stored in external cloud storage while leveraging Snowflake's powerful querying capabilities.
An Iceberg table is a first-class object in Snowflake, meaning we get the same advantages as Snowflake permanent tables: support for Snowflake data types, performance, RBAC, data masking, multi-table transactions, time travel, and search optimization.
We can also join an Iceberg table with other Snowflake permanent tables.
Below is a quick tutorial: https://quickstarts.snowflake.com/guide/tasty_bytes_working_with_iceberg_tables/index.html#0
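For orientation, here is a hedged sketch of the DDL involved, held in Python strings so the shape is easy to see. All object names (`tasty_iceberg`, `my_ext_vol`, `customers`) are placeholders, not from this post; the quickstart linked above walks through the real setup.

```python
# Hedged sketch of Snowflake DDL for a Snowflake-managed Iceberg table.
# Object names (tasty_iceberg, my_ext_vol, customers) are placeholders.
create_iceberg_table = """
CREATE ICEBERG TABLE tasty_iceberg (
    order_id    INT,
    customer_id INT,
    order_ts    TIMESTAMP_NTZ
)
CATALOG = 'SNOWFLAKE'             -- Snowflake manages the Iceberg metadata
EXTERNAL_VOLUME = 'my_ext_vol'    -- points at your own cloud storage bucket
BASE_LOCATION = 'tasty_iceberg/'  -- where data and metadata files are written
"""

# Because Iceberg tables are first-class objects, they join directly with
# permanent tables in ordinary queries:
join_query = """
SELECT o.order_id, c.name
FROM tasty_iceberg o
JOIN customers c ON c.customer_id = o.customer_id  -- customers is a permanent table
"""
```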
Use cases for Iceberg tables in Snowflake:
· Data lakes with existing datasets: efficiently query large datasets already stored in data lake formats.
· Multi-engine workflows: query the same tables across different engines (e.g., Spark and Snowflake).
· Cost-efficient storage solutions: use existing data lakes without ingesting into Snowflake.
· Machine learning applications: integrate external and internal datasets seamlessly.
As the title suggests, this is just the tip of the iceberg. Have you used Iceberg in your Snowflake environment?
Resources: