Data lakehouse with Snowflake Iceberg tables - introduction

Snowflake has officially entered the world of Data Lakehouses! What is a data lakehouse, where are such solutions a good fit, and how can they be introduced into a Snowflake-centered data ecosystem? We’ll walk you through this topic in a series of blog posts. This first part answers questions such as: How does a data lakehouse combine the key features of a data warehouse and a data lake, and which shortcomings of those solutions does it address? What role do open table formats (like Iceberg) play in Data Lakehouse (DLH) architectures, and what are Snowflake Iceberg Tables? In the second part, we will share our blueprint architecture and some observations on how cost-efficient, flexible and secure a Data Lakehouse on Snowflake with the Iceberg format can be. We will also touch upon what you need to be aware of when deciding on such a solution. Now, let’s get to the first part!

Where does a data Lakehouse come from?

In modern data environments it has become obvious that limiting your focus to structured data and predefined schemas is no longer sufficient if you want to get maximum value from your data assets, shorten time-to-value and optimize the total cost of ownership (TCO). Traditional data warehouses also often struggle with massive data volumes and varying workloads. Data Lakes, on the other hand, allow raw, unstructured or semi-structured data to be stored in its native format. Although they offer flexibility in data interpretation, they may face challenges in governance and security because they lack predefined structures, and they are not optimized for structured queries. So, as often happens in nature, when none of the existing solutions covers all the needs, a new… third solution eventually appears.

Data Lakehouses offer a balance: they handle both structured and unstructured data efficiently, which lets them serve diverse analytics needs and provide first-class support for machine learning and data science workloads. They combine the governance capabilities of data warehouses with the flexibility of data lakes, providing strong security and access controls while supporting diverse data types. They also address some of the most common challenges of data warehouses, such as data staleness, reliability, total cost of ownership, data lock-in and limited use-case support. It looks awesome in theory, but… how is this concept implemented from a technical standpoint?

Figure: evolution of data platform architectures

Source: Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

Data Lakehouses and open formats

In the era of Hadoop and early data lakes, transactional consistency, schema evolution and metadata management were among the top challenges when building data products. There was a need for an abstraction layer that would address these shortcomings, and this is more or less how open table formats came into play. Apache Iceberg is one of the most popular ones (along with Delta Lake and Apache Hudi) in the context of Data Lakehouse architectures.

Apache Iceberg as an open table format

Apache Iceberg is an open-source table format for large-scale data processing. It supports schema evolution, transactional consistency and efficient data pruning. The key features of Apache Iceberg from the perspective of Data Lakehouse architecture include:

Interoperability: ensures interoperability across different data processing engines and frameworks (such as Spark, Trino, Flink and Hive). This allows users to leverage various analytics tools and engines while accessing the same underlying data stored in the Data Lakehouse.

Schema Evolution: allows for the evolution of table schemas over time, without requiring a rewrite of the entire dataset. This flexibility is crucial in dynamic data environments where schema changes are frequent.

Transactional Consistency: provides ACID (Atomicity, Consistency, Isolation, Durability) transactions for write operations. This ensures that data writes are either fully committed or fully rolled back, maintaining consistency.

Snapshot Isolation: supports snapshot isolation, allowing concurrent readers and writers to operate on the same table without interfering with one another. Each reader works against a consistent snapshot of the table, which is important for maintaining consistent query results in multi-user environments.

Time Travel: enables querying data at different points in time, providing a historical view of the dataset. This feature is valuable for auditing, debugging and analysis of changes over time.

Metadata Management: maintains metadata on the table, including information about schema, partitions and statistics. This metadata is essential for optimizing query performance and enabling efficient pruning of data during query execution.

Optimized Query Performance: provides optimizations for query performance by enabling efficient data pruning and filtering, reducing the amount of data that needs to be scanned during query execution. This is particularly important for large-scale data lakehouse environments.
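
To make a few of the features above more tangible, here is a minimal, purely illustrative Python sketch. It is not Iceberg's actual implementation or API: ToyTable, DataFile and the file paths are hypothetical names invented for this example. It only mimics three ideas from the list: every commit publishes a new immutable snapshot, older snapshots stay readable (time travel), and per-file min/max statistics kept in metadata let a scan skip files without opening them (pruning).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    """An immutable data file plus the column statistics kept in metadata."""
    path: str
    rows: Tuple[int, ...]  # toy payload: a single integer column
    min_value: int
    max_value: int


@dataclass
class ToyTable:
    """Toy table: every commit produces a new immutable snapshot (a tuple of files)."""
    snapshots: Dict[int, Tuple[DataFile, ...]] = field(default_factory=lambda: {0: ()})
    current_snapshot_id: int = 0

    def append(self, path: str, rows: List[int]) -> int:
        """Write a new (non-empty) file, then publish a new snapshot in one step."""
        new_file = DataFile(path, tuple(rows), min(rows), max(rows))
        new_id = self.current_snapshot_id + 1
        # Older snapshots are never modified, so readers that use them are unaffected.
        self.snapshots[new_id] = self.snapshots[self.current_snapshot_id] + (new_file,)
        self.current_snapshot_id = new_id  # the "commit": new readers see the new snapshot
        return new_id

    def scan(self, lower: int, upper: int, snapshot_id: Optional[int] = None):
        """Return (matching rows, files read) for lower <= value <= upper."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        files_read, result = [], []
        for f in self.snapshots[sid]:
            if f.max_value < lower or f.min_value > upper:
                continue  # pruned using metadata only; the file is never read
            files_read.append(f.path)
            result.extend(v for v in f.rows if lower <= v <= upper)
        return result, files_read


table = ToyTable()
first = table.append("data/file-a.parquet", [1, 5, 9])
second = table.append("data/file-b.parquet", [100, 150, 200])

# Current snapshot: file-a is skipped because its max value (9) is below 120.
rows_now, files_now = table.scan(120, 300)
# Time travel: the first snapshot only knows about file-a.
rows_then, files_then = table.scan(0, 300, snapshot_id=first)
print(rows_now, files_now)
print(rows_then, files_then)
```

In a real Iceberg table the snapshots, manifests and statistics live in metadata files next to the data, and the commit is an atomic swap of the table's metadata pointer in the catalog. The sketch only captures the shape of that idea.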

Apache Iceberg and Snowflake

Snowflake has entered the BYOS (Bring Your Own Storage) game.

Snowflake Data Cloud supports processing big data workloads using numerous file formats, including Parquet, Avro, ORC, JSON and XML. While Snowflake’s internal, fully managed table format simplifies storage maintenance (encryption, transactional consistency, versioning, fail-safe and time travel), some organizations with regulatory or other constraints are either unable to store all of their data in Snowflake or prefer to store it externally in open formats (like Apache Iceberg). One of the key reasons why organizations rely on open formats is interoperability: they can safely process the same tables with Spark, Trino, Flink, Presto, Hive and many other engines at the same time. Others prefer storage costs to be allocated to their cloud provider’s bill, or simply prefer open formats because they don’t want to feel locked in and like to stay flexible in their architecture choices. At the end of the day it always comes down to the familiar dilemma: managed vs. flexible solutions. Regardless of the reasons, fans of open table formats should now be happy, as Snowflake has recently announced support for Iceberg tables.

Iceberg tables store their data and metadata files in an external cloud storage location (Amazon S3, Google Cloud Storage, or Azure Storage), which is not part of Snowflake storage and does not incur Snowflake storage costs. Such external storage might be relevant for some organizations due to compliance and data security restrictions. This, however, means that all management of this storage (including data security aspects) is on you (or at least not on Snowflake). Snowflake connects to your storage location using an external volume, so the data stays outside of Snowflake, but you keep the performance and other benefits of Snowflake (e.g. security, governance, sharing).
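
As a rough illustration of the external volume idea, the Python sketch below only renders the SQL statements as strings; it does not connect to Snowflake. The statement shapes (CREATE EXTERNAL VOLUME, and CREATE ICEBERG TABLE with CATALOG = 'SNOWFLAKE') follow the Snowflake documentation as we understand it for an S3 location, so please verify them against the current docs before use. All object names, the bucket URL and the role ARN are made-up placeholders.

```python
def external_volume_ddl(volume: str, location: str, base_url: str, role_arn: str) -> str:
    """Render a CREATE EXTERNAL VOLUME statement for a single S3 storage location."""
    return (
        f"CREATE EXTERNAL VOLUME {volume}\n"
        f"  STORAGE_LOCATIONS = (\n"
        f"    (\n"
        f"      NAME = '{location}'\n"
        f"      STORAGE_PROVIDER = 'S3'\n"
        f"      STORAGE_BASE_URL = '{base_url}'\n"
        f"      STORAGE_AWS_ROLE_ARN = '{role_arn}'\n"
        f"    )\n"
        f"  );"
    )


def managed_iceberg_table_ddl(table: str, columns: str, volume: str, base_location: str) -> str:
    """Render a CREATE ICEBERG TABLE statement that uses Snowflake as the Iceberg catalog."""
    return (
        f"CREATE ICEBERG TABLE {table} ({columns})\n"
        f"  CATALOG = 'SNOWFLAKE'\n"
        f"  EXTERNAL_VOLUME = '{volume}'\n"
        f"  BASE_LOCATION = '{base_location}';"
    )


# Hypothetical placeholder values, for illustration only.
volume_sql = external_volume_ddl(
    volume="lakehouse_vol",
    location="lakehouse-s3",
    base_url="s3://example-lakehouse-bucket/",
    role_arn="arn:aws:iam::123456789012:role/example-snowflake-role",
)
table_sql = managed_iceberg_table_ddl(
    table="orders_iceberg",
    columns="order_id INT, amount NUMBER(10, 2)",
    volume="lakehouse_vol",
    base_location="orders/",
)
print(volume_sql)
print(table_sql)
```

The strings are built with plain f-strings to keep the example short. In real code, never interpolate untrusted input into SQL this way.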

Snowflake supports different catalog options - you can use Snowflake as an Iceberg catalog, but also use a catalog integration to connect Snowflake to an external Iceberg catalog such as AWS Glue, or to Iceberg metadata files in object storage. An Iceberg table that uses Snowflake as the Iceberg catalog provides full Snowflake platform support with read and write access. Snowflake handles all life-cycle maintenance, such as compaction of the table.

As a general rule of thumb, use a Snowflake-managed Iceberg table when the data needs to be in an open format, consumable by external processes, and Snowflake maintains the table (and the catalog). Use an unmanaged Iceberg table when Snowflake needs to read open-format Iceberg data but is just a consumer referencing an external catalog (e.g. AWS Glue Data Catalog).
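
The rule of thumb can be written down as a tiny helper. The Python sketch below is only an illustration under stated assumptions: choose_iceberg_table_type is a hypothetical function, and the two rendered statements (a catalog integration for AWS Glue, and an Iceberg table that references it) follow the Snowflake documentation as we recall it from the Public Preview period, so the parameter names should be checked against the current docs. All object names, the ARN and the account ID are placeholders.

```python
def choose_iceberg_table_type(snowflake_maintains_table: bool) -> str:
    """Encode the rule of thumb: who owns the table and its catalog?"""
    return "snowflake-managed" if snowflake_maintains_table else "unmanaged"


def glue_catalog_integration_ddl(name: str, namespace: str, role_arn: str, catalog_id: str) -> str:
    """Render a CREATE CATALOG INTEGRATION statement for AWS Glue (assumed syntax)."""
    return (
        f"CREATE CATALOG INTEGRATION {name}\n"
        f"  CATALOG_SOURCE = GLUE\n"
        f"  CATALOG_NAMESPACE = '{namespace}'\n"
        f"  TABLE_FORMAT = ICEBERG\n"
        f"  GLUE_AWS_ROLE_ARN = '{role_arn}'\n"
        f"  GLUE_CATALOG_ID = '{catalog_id}'\n"
        f"  ENABLED = TRUE;"
    )


def unmanaged_iceberg_table_ddl(table: str, volume: str, integration: str, catalog_table: str) -> str:
    """Render a CREATE ICEBERG TABLE statement that points at an external catalog."""
    return (
        f"CREATE ICEBERG TABLE {table}\n"
        f"  EXTERNAL_VOLUME = '{volume}'\n"
        f"  CATALOG = '{integration}'\n"
        f"  CATALOG_TABLE_NAME = '{catalog_table}';"
    )


# Scenario: Spark jobs own the table in AWS Glue, Snowflake is only a consumer.
table_type = choose_iceberg_table_type(snowflake_maintains_table=False)
integration_sql = glue_catalog_integration_ddl(
    name="glue_catalog_int",
    namespace="analytics_db",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    catalog_id="123456789012",
)
unmanaged_sql = unmanaged_iceberg_table_ddl(
    table="events_iceberg",
    volume="lakehouse_vol",
    integration="glue_catalog_int",
    catalog_table="events",
)
print(table_type)
print(integration_sql)
print(unmanaged_sql)
```

Note that the unmanaged table carries no column list: in this setup the schema comes from the external catalog rather than from Snowflake.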

So it feels that with the advent of Iceberg tables, Snowflake data architectures have become more flexible and open to new types of organizations. Among the benefits, we really like the commitment to eliminating data movement, ensuring that the data stays in place, which has a positive impact on reducing latency and optimizing overall query performance.

Please note that at the time of writing, Snowflake Iceberg Tables were in Public Preview, which means that some limitations and disclaimers applied. Please check the Snowflake documentation for details.

So… when should you use Iceberg with Snowflake?

Knowing what the Iceberg format offers and how it integrates with Snowflake, a natural question arises: what are the most common use cases for the Iceberg format in Snowflake? Let’s consider the scenarios that we find most suitable:

You already have large datasets in your data lake in Iceberg format, and you want to query them using Snowflake with performance similar to that of tables in the native Snowflake format, without having to ingest the data.
You want Snowflake to manage your datasets and you want to query them with Snowflake, but you also need to query the same tables directly using other query engines such as Spark, Trino, Redshift, etc., without having to pass the data through Snowflake.
You would like to access a massive amount of data for running sophisticated ML/AI training pipelines, where a standard JDBC/ODBC interface might be a bottleneck, and at the same time you would like the same datasets to be queryable from reporting tools.

As already mentioned, greater flexibility means greater responsibility for securing your data via IAM rules, as the data is stored in your own storage and can be accessed directly without going through Snowflake. Therefore, if you primarily use your data for business intelligence, you probably don't need an open format like Iceberg. However, if you want to access your data in many different ways with different tools, the open standard option can be more beneficial.

Our Data Lakehouse on Snowflake

OK, it’s been a nice update from the field. But… is that all we have prepared? Of course not! As a group of expert engineers we don’t just write IT novels, we like to build stuff. That was also the case with our Data Lakehouse on Snowflake with the Iceberg format. In the second part of this blog post series we’ll share our blueprint architecture and reveal a couple of interesting observations on how cost-efficient, flexible and secure this kind of solution can be. Stay tuned!

Inspirations:

Snowflake Iceberg Tables — Powering open table format analytics

Apache Iceberg or Snowflake Table format? | by James Malone

Iceberg Tables on Snowflake: Design considerations and Life of an INSERT query | Medium

When To Use Iceberg Tables in Snowflake | by Mike Taveirne

Iceberg tables | Snowflake Documentation

Last updated: 11 March 2024

Written by

Michał Rudko

Big Data Analyst / Architect
