Saturday, November 2, 2024

Tip of Iceberg

 



Image credit: https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/

 

In 1912, the Titanic sank after hitting an iceberg. While the Titanic might come to mind when we hear "iceberg," this post explores a different kind of Iceberg, one that is reshaping data architecture: Apache Iceberg.

If you don't already know about Apache Iceberg, don't waste a minute: read on.

There is a lot of momentum around Apache Iceberg lately, and you might have heard recent news like Datazip raising $1M in funding, Google announcing a preview of BigQuery tables for Apache Iceberg, and Cloudera unveiling a partnership with Snowflake to enhance hybrid data management (introducing a unified hybrid data lakehouse powered by Apache Iceberg).

So, what exactly is Apache Iceberg, and how does it fit into modern data architecture? 

Data lakes are ideal for storing massive volumes of semi-structured and unstructured data in native formats. This is called a file-based data lake, and it is the ideal choice for organizations that want cost-effective storage and flexible data exploration.

The individual files in a file-based data lake do not contain the information query engines need for pruning, time travel, or schema evolution (adding/removing/renaming columns without unloading and reloading data). A file-based data lake also does not support ACID transactions.

On the other hand, a data warehouse enforces structure, supports SQL and query pruning, guarantees ACID transactions, and is optimized for analytical workloads.

Moving data between a lake and a warehouse is laborious, and keeping data up to date in both systems increases the risk of inconsistencies, delays, and operational bottlenecks.

There is a third architecture that sits between the data lake and the data warehouse: the data lakehouse. The data lake serves as storage, with warehouse functionality built on top of it. There is no need to move data between systems; it supports SQL and guarantees ACID transactions with a single copy of the data.

Components of a data lakehouse:

·         Storage system: to keep files on your cloud or distributed file system.

·         File Format: to store data efficiently, e.g., Apache Parquet or ORC.

·         Table Format: to organize files into tables, e.g., Apache Iceberg or Delta Lake.

·         Catalog: to track metadata and ensure consistency and ease of access.

·         Query Engine: allows you to run operations on these tables, e.g., Spark (streaming), Dremio (batch processing), Snowflake (batch & incremental loads).

So Apache Iceberg is an open table format that fits into the data lakehouse architecture. It was developed at Netflix and donated to the Apache Software Foundation. Another advantage of an open table format is that there is no vendor lock-in: we can switch to other query engines easily.

Benefits of Apache Iceberg:

·         Expressive SQL: Enables updates, merges, and deletes.

·         Schema Evolution: Allows metadata updates without data rewrites.

·         Partition Evolution: Change the partitioning scheme over time without rewriting existing data.

·         Time Travel and Rollback: Access previous versions of data (see the sketch after this list).

·         Transactional Consistency: Supports ACID transactions.

·         Faster Querying: Optimized for performance.
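
To make a couple of these concrete, here is a minimal sketch in Spark SQL with the Iceberg runtime; the table names (prod.db.orders, staged_orders) and the snapshot id are hypothetical:

-- Expressive SQL: merge staged changes into an Iceberg table.
MERGE INTO prod.db.orders t
USING staged_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: read the table as of an earlier timestamp or snapshot.
SELECT * FROM prod.db.orders TIMESTAMP AS OF '2024-10-01 00:00:00';
SELECT * FROM prod.db.orders VERSION AS OF 1234567890;  -- a snapshot id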

What are the key components of Iceberg, and what is their purpose?

·         Catalog Layer:

 

o   The catalog is responsible for managing the metadata of Iceberg tables. It maintains a pointer to the current metadata file, which describes the state of the table.

o   Iceberg supports various catalog implementations, including Hive Metastore, AWS Glue, and its own standalone catalog, allowing flexibility in how metadata is stored and accessed. Snowflake has donated its Polaris catalog to the Apache Software Foundation.

o   The catalog acts as the gatekeeper, keeping track of which metadata.json file of the table the engine should be working with.

 

Broadly, there are two types of catalogs:

File system catalog: stores metadata in the file system; it has bottlenecks and is not recommended for production.

Service catalog: Hive Metastore, JDBC, and REST-based catalogs. Snowflake Polaris is a REST-based catalog.

·         Metadata Layer:

o   This layer consists of several components:

o   Metadata File: Contains information about the table's schema, partitioning, snapshots, and the current state of the table.

o   Manifest List: A file listing the manifest files that make up a snapshot.

o   Manifest Files: These track individual data files along with statistics and other details about each file, enabling efficient querying and data retrieval.

o   The metadata layer allows Iceberg to support features like schema evolution, partition evolution, and time travel (the sketch after this list shows how to inspect these structures).

·         Data Layer:

o   The data layer holds the actual data files (e.g., in formats like Parquet or ORC) along with delete files for managing data changes.

o   This layer is where the actual data resides, and it provides the necessary data for queries executed against Iceberg tables.
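
Iceberg exposes the structures above as queryable metadata tables. Here is a small inspection sketch in Spark SQL; the table name prod.db.orders is hypothetical:

-- Snapshots: one row per table version (this is what powers time travel).
SELECT snapshot_id, committed_at, operation FROM prod.db.orders.snapshots;

-- Manifests: the manifest files behind the current snapshot.
SELECT path, length FROM prod.db.orders.manifests;

-- Files: the actual data files, with per-file statistics.
SELECT file_path, record_count, file_size_in_bytes FROM prod.db.orders.files;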

One unique feature of Apache Iceberg over other table formats is partition evolution. The partition information of an Iceberg table is tracked in metadata, not in the data files. This decoupling of partition information lets you modify the partition logic without rewriting the whole dataset 😊
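
Here is a hedged sketch of partition evolution in Spark SQL with the Iceberg SQL extensions enabled; the table and column names are hypothetical, and the table is assumed to start out partitioned by months(event_ts):

-- Switch from monthly to daily partitioning; existing files stay as they are,
-- and only newly written data uses the new layout.
ALTER TABLE prod.db.events ADD PARTITION FIELD days(event_ts);
ALTER TABLE prod.db.events DROP PARTITION FIELD months(event_ts);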

How can we use Iceberg tables in Snowflake?

Snowflake works as a query engine for Apache Iceberg: the data stays in your own cloud storage while Snowflake does the querying. Using Iceberg tables in Snowflake allows you to manage large datasets stored in external cloud storage while leveraging Snowflake's powerful querying capabilities.

An Iceberg table is a first-class object in Snowflake, meaning we get the same advantages as permanent tables: Snowflake data types, performance, RBAC, data masking, multi-table transactions, time travel, and search optimization.

We can join an Iceberg table with other Snowflake permanent tables, as the sketch below shows.
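
A minimal sketch in Snowflake SQL; the external volume name, base location, and table/column names are hypothetical, and an external volume must be configured beforehand:

-- Snowflake-managed Iceberg table: the data lands in your own cloud storage.
CREATE ICEBERG TABLE customer_reviews (
  review_id INT,
  customer_id INT,
  review_text STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_iceberg_vol'
BASE_LOCATION = 'customer_reviews/';

-- It queries and joins like any other table:
SELECT r.review_id, c.customer_name
FROM customer_reviews r
JOIN customers c ON c.customer_id = r.customer_id;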

Below is a quick tutorial: https://quickstarts.snowflake.com/guide/tasty_bytes_working_with_iceberg_tables/index.html#0

 Use Cases for Iceberg Tables in Snowflake:

·         Data Lakes with Existing Datasets: Efficiently query large datasets already in data lake formats.

·         Multi-Engine Workflows: Query the same tables across different engines (e.g., Spark and Snowflake).

·         Cost-Efficient Storage Solutions: Utilize existing data lakes without ingesting into Snowflake.

·         Machine Learning Applications: Integrate external and internal datasets seamlessly.

As the title suggests, this is just the tip of the iceberg. Have you used Iceberg in your Snowflake environment?

Resources:

https://www.dremio.com/press-releases/dremio-team-authoring-oreillys-definitive-guide-on-apache-iceberg-only-book-of-its-kind/

 

 

Friday, May 27, 2016

Informatica Scenario and SQL Scenario Question

Scenario:
Source data looks like below:

col1: A, A, B, B, B, C, D
Target 1 (Even number of occurrences)
A occurs 2 times in the source, so it should go to Target 1.

col1  col2
A     2
Target 2 (Odd number of occurrences)
If a source value repeats an odd number of times, it should go to Target 2.

col1  col2
B     3
C     1
D     1
Solution:
create table test_odd_even_src
(
col1 varchar2(3)
);

create table test_odd_tgt
(
col1 varchar2(3),
col2 int
);
create table test_even_tgt
(
col1 varchar2(3),
col2 int
)
;

insert into test_odd_even_src values('A');
insert into test_odd_even_src values('A');
insert into test_odd_even_src values('B');
insert into test_odd_even_src values('B');
insert into test_odd_even_src values('B');
insert into test_odd_even_src values('C');
insert into test_odd_even_src values('D');
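
For reference, the same odd/even split can be written directly in SQL. This is only a sketch against the tables created above; the Informatica mapping below achieves it with Aggregator, Joiner, and Router transformations.

-- The Aggregator computes the count per value; the Router splits
-- on whether that count is even or odd.
insert into test_even_tgt
select col1, count(*) from test_odd_even_src
group by col1
having mod(count(*), 2) = 0;

insert into test_odd_tgt
select col1, count(*) from test_odd_even_src
group by col1
having mod(count(*), 2) = 1;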


The mapping looks like this:

[screenshot of the mapping]
Aggregator Transformation:

[screenshot]
Joiner:

[screenshot]
Router Transformation:

[screenshot]
Now try to insert 2 more records into the source and run the session again (truncate the target tables first):
insert into test_odd_even_src values('A');
insert into test_odd_even_src values('B');

Source data after inserting the 2 new records:

col1: A, A, B, B, B, C, D, A, B
select * from test_even_tgt

col1  col2
B     4
select * from test_odd_tgt

col1  col2
A     3
C     1
D     1

SQL Scenario:

Source
id,color
a1,yellow
a2,unknown
a1,green
a3,unknown
a1,unknown
a2,blue
Target
id,color
a1,yellow
a1,green
a3,unknown
a2,blue
Data Preparation:
create table test_color_src
(
id varchar2(30),
color varchar2(30)
);

create table test_color_tgt
(
id varchar2(30),
color varchar2(30)
);


insert into test_color_src values('a1','yellow');
insert into test_color_src values('a2','unknown');
insert into test_color_src values('a1','green');
insert into test_color_src values('a3','unknown');
insert into test_color_src values('a1','unknown');
insert into test_color_src values('a2','blue');

Solution:
Source data:

id, color
a1, yellow
a2, unknown
a1, green
a3, unknown
a1, unknown
a2, blue
SQL Query and Its output:

-- Keep rows with a real color; an id keeps 'unknown' only when it
-- never has a real color anywhere in the source.
select id, color from test_color_src where color <> 'unknown'
union
(select id, 'unknown' from test_color_src where color = 'unknown'
 MINUS
 select id, 'unknown' from test_color_src where color <> 'unknown'
);

Output:

id, color
a1, green
a1, yellow
a2, blue
a3, unknown
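
An equivalent formulation, as a sketch, uses NOT EXISTS instead of MINUS: keep every row with a real color, plus the 'unknown' rows of ids that never get a real color.

select distinct id, color
from test_color_src s
where s.color <> 'unknown'
or not exists (select 1 from test_color_src t
               where t.id = s.id and t.color <> 'unknown');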



Thanks for reading my blog!

Wednesday, December 9, 2015

SQL Test : Sample Questions from Amazon Test (2/2)

The SQL questions are about a monorail company; the database consists of tables for the company.

Below are the table definitions, with column names and one row of sample data.

1.      Table employee contains information about all the employees in the organization (including the monorail drivers).

 

| eid  | ename   | salary |
| 555  | Raheem  | 235    |

 

2. Table monorail contains the information about each monorail.

 

| rid   | rname      | running_range |
| 1234  | Speedster  | 454           |

 

3. Table route_info contains the description of the routes on which the monorails travel.

 

 

| route_no | origin  | destination | distance | cost |
| 4456     | Gwalior | Delhi       | …        | 543  |

 

4. Table assignment contains data about which monorail is driven by which employee. This can be a many-to-many relation.

 

| eid   | rid      |
| 101   | 4454     |

 

NOTE: Make all necessary assumptions.

Table structure repeated below for quick reference:

 

employee(eid,ename,salary)

monorail(rid,rname,running_range)

route_info(route_no,origin,destination,distance,cost)

assignment(eid,rid)

Table structures:

 

create table employee
(
eid int,
ename varchar2(50),
salary int
);

 

create table monorail
(
rid int,
rname varchar2(50),
running_range int
);

 

 

create table route_info
(
route_no int,
origin varchar2(50),
destination varchar2(50),
distance int,
cost int
);

 

 

create table assignment
(
eid int,
rid int
);

 

Problem 1: Underpaid Employee:

Find the names ("ename") of employees whose salary is less than the cost of the cheapest route from 'Bangalore' to 'Delhi' (no need to consider routes with connections).

Problem 2: One-Stop Route:

Find one-stop routes from 'Bangalore' to 'Delhi'. The route makes one transfer; time and cost need not be considered. The query should return two columns: the first-section route_no and the second-section route_no.

Problem 3: How Many Monorails Does Each Drive:

Provide all employee names ("ename") along with a status flag for whether they drive a monorail (Y/N) and the number of monorails they drive (0 if they drive none).

Note: The format of output should be like below

Raheem N 0

Nusrat Y 3

……….

 

Problem 4: Cheapest Route to Destination:

List every route destination along with the route_no of the cheapest route to that destination from any origin. The output should have three columns: destination, route_no, and the cost of the route.

 

Problem 5: Top Paid Drivers:

List all the monorails along with the top 2 paid (salary ranked 1st or 2nd) drivers of each monorail. The output should have four columns: the name of the monorail (rname), the salary ranking among drivers of the same monorail, the name of the driver (ename), and the driver's salary (salary).

Solution 1: Underpaid Employee:

select ename from employee
where salary < (select min(cost) from route_info
                where origin = 'Bangalore' and destination = 'Delhi');

Solution 2: One-Stop Route:

-- self-join: the first leg must end where the second leg starts
select a.route_no as first_section_route, b.route_no as sec_section_route
from route_info a, route_info b
where a.origin = 'Bangalore'
and b.destination = 'Delhi'
and a.destination = b.origin;

Solution 3: How Many Monorails Does Each Drive:

-- the left outer join keeps employees who drive nothing; their count
-- comes back NULL, which maps to 'N' and 0
select emp.ename,
       case when xx.cnt is null then 'N' else 'Y' end,
       nvl(xx.cnt, 0)
from employee emp
left outer join
(
  select x.eid, x.ename, count(1) as cnt
  from (select e.eid, e.ename, a.rid
        from employee e inner join assignment a on e.eid = a.eid) x
  group by x.eid, x.ename
) xx
on emp.eid = xx.eid;

 

Solution 4: Cheapest Route to Destination:

-- find the minimum cost per destination, then join back to pick the route(s)
select r.destination, r.route_no, r.cost
from route_info r,
     (select destination, min(cost) as mr_cost
      from route_info
      group by destination) mr
where r.cost = mr.mr_cost
and r.destination = mr.destination;

 

Solution 5: Top Paid Drivers:

-- rank drivers within each monorail by salary, then keep the top 2
select m.rname, x.salary_rank, x.ename, x.salary
from (
  select a.rid, a.eid, e.ename, e.salary,
         rank() over (partition by a.rid order by e.salary desc) as salary_rank
  from assignment a inner join employee e on a.eid = e.eid
) x, monorail m
where x.rid = m.rid and x.salary_rank <= 2;

 

Click on the link below for Part 1:

SQL Test: Sample Questions from Amazon Test (1/2)