Iceberg (https://iceberg.apache.org) is an open source alternative to Delta Lake that I cannot recommend enough.
It organizes your Parquet files (or other serialization formats) in a logical structure with snapshots to allow time travel and git-like semantics for data management and Write-Audit-Publish strategies.
My favorite recent use is idempotent change data capture, which eases replication in the event of failures. If your publishing job fails, you can simply replay the same diff between two snapshots and pick up where you left off.
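The replay idea can be sketched in plain Python. Everything here (the snapshot sets, function names, file names) is made up for illustration; it's the concept, not the Iceberg API:

```python
# Toy model of idempotent snapshot-diff replay: each snapshot is
# just the set of data files it references.
def snapshot_diff(old, new):
    """Files added and files removed between two snapshots."""
    return new - old, old - new

def publish(target, added, removed):
    """Apply a diff to a replica. Applying it twice is a no-op."""
    target -= removed
    target |= added
    return target

snap_a = {"f1.parquet", "f2.parquet"}
snap_b = {"f2.parquet", "f3.parquet"}
added, removed = snapshot_diff(snap_a, snap_b)

replica = set(snap_a)
publish(replica, added, removed)
publish(replica, added, removed)  # replay after a failure: same result
assert replica == snap_b
```

Because the diff is a pure set operation, re-running the publishing job after a crash converges to the same state instead of duplicating data.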
AFAIK, it’s limited to fast-forward merge strategies, but you can also create or replace branches and tags, along with cherry-picking snapshots.
Comparing Delta Lake to Parquet is a bit nonsensical, isn't it? Like comparing Postgres to a zip file. After trying all of the major open table formats, Iceberg is the future in my opinion. Delta is great if you use Databricks, but otherwise I don't see a compelling reason to use it over Iceberg.
Lots of organizations have Parquet data lakes and are considering switching to Delta Lake.
Converting a Parquet table to a Delta table is an in-place, cheap computation. You can just add the Delta Lake metadata to an existing Parquet table and then take advantage of transactions and other features. I don't think it's a meaningless comparison.
There is no Parquet table. Parquet is a compressed file format like a zip. Parquet can be read into table formats like Hive, Delta, etc. That is why this comparison makes no sense.
> Lots of Parquet files in the same directory are typically referred to as a "Parquet table".
This is my point though? This is an apples to oranges comparison. A directory of Parquet files is not a table format. Comparing Delta to Hive or Iceberg is a more apt comparison. I have worked with all types of companies and I have yet to work with one that is just using a directory of Parquet files and calling it a day without using something like Hive with it.
Yea, comparing Delta Lake to Iceberg is more apt, but I've been shying away from that content cause I don't wanna flamewar. Another poster is asking for this post tho, so maybe I should write it.
I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered. If you persist a Spark DataFrame in Delta with save it's not registered in the Hive metastore. If you persist it with saveAsTable it is registered. I've been meaning to write a blog post on this, so you're motivating me again.
I've seen a bunch of enterprises that are still working with Parquet tables that aren't registered in Hive. I worked at an org like this for many years and didn't even know Hive was a thing, haha.
> I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered.
You are right about Delta tables in the Hive metastore, but if you are writing from the perspective of "there are companies that don't know what Hive is," then I feel the next step up is "there are companies that just stuff files in S3 and query them with Athena" (which handles all the Hive stuff for you when you make tables). Explaining what Delta gives them over that feels worth doing.
But the article isn't just saying which is better (though I realize the article is meant as a sales pitch, so there's some of that), it's explaining what the differences are. I can't claim any sort of expertise in video codecs and file formats, but I'm guessing that one is a codec and the other is a file format that wraps the codec. If that's true, then I would say that a similar comparison between the two is also valid.
Also, as someone who chooses between Parquet and Delta on a fairly regular basis, I can say from experience that there are as many situations where both are viable options as there are situations where such a comparison is invalid. So, it's hardly an apples to oranges comparison. At worst it's maybe a pomelos to oranges comparison.
I think the primary cognitive dissonance is that Delta Lake is a storage framework, not an open table format. It lives in a different layer of the data storage and processing stack than a file format like Parquet. Any time you're comparing apples to oranges, it's going to set off alarm bells with some readers.
Delta Lake is a storage framework that uses Parquet files. Parquet files are a thing, but Delta Lake files are not a thing. The Delta Lake framework uses Parquet files, plus some additional stuff (transaction log, checkpoint files) that enable capabilities that Parquet files alone do not.
That collection of files is called a "Delta table". If the title of this was, "Benefits of Delta Tables vs. Parquet Files Alone", and the article was revised to be more careful about not conflating Delta Lake and Parquet, I think that would benefit everyone.
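The "Parquet files plus a transaction log" idea is easy to sketch in plain Python. This is a toy, not the actual Delta log schema (the real log is newline-delimited JSON with richer action types, plus Parquet checkpoints), but it shows why the log, not the directory listing, defines the table:

```python
import json

# Toy Delta-style transaction log: each commit is a JSON document
# listing files added to and removed from the table.
commits = [
    json.dumps({"add": ["part-0.parquet", "part-1.parquet"], "remove": []}),
    json.dumps({"add": ["part-2.parquet"], "remove": ["part-0.parquet"]}),
]

# Replaying the log yields the current set of live data files.
active_files = set()
for commit in commits:
    actions = json.loads(commit)
    active_files -= set(actions["remove"])
    active_files |= set(actions["add"])

# part-0.parquet may still sit in the directory, but readers ignore it:
# the log, not a file listing, says what belongs to the table.
assert active_files == {"part-1.parquet", "part-2.parquet"}
```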
But, similarly, if you're setting up a data lake, a co-located collection of Parquet files that share the same schema is often colloquially referred to as a "Parquet table". And it's common to put some extra layers beyond just squirreling some files away on a disk somewhere to manage and govern these logical units that people call Parquet tables.
I think that the people elsewhere in this thread who are trying to be pedantic about this are maybe more familiar with some of the individual open source technologies that are used in data lake applications than they are with conventions around how these technologies get assembled into a full-fledged data management system in a business setting?
In other words, it seems like people who are trying to talk about the forest are getting downvoted and picked on by a bunch of folks who seem to maintain that there is no forest, only trees.
> …if you're setting up a data lake, a co-located collection of Parquet files that share the same schema is often colloquially referred to as a "Parquet table". And it's common to put some extra layers beyond just squirreling some files away on a disk somewhere to manage and govern these logical units that people call Parquet tables.
Absolutely, and standardizing that has really cool benefits. You're clearly knowledgeable enough to know where the author means "Delta table" instead of "Delta Lake," "Parquet tables" instead of "Parquet", etc., but not everyone is.
I can understand if the author feels picked on, but I'm sure he knows the bar for technical correctness for developer content marketing is high (especially on HN). Honestly, if I were Mr. Powers I'd be happy for the "strict mode" feedback!
That's not quite what's annoying me. Maybe the more bothersome thing is that some number of people in here have been regularly downvoting people who have done nothing worse than posting factually correct information.
Almost as if people were trying to use HN's voting system as an ersatz referendum on their preferred big data packages rather than as a way to self-moderate the quality of the discussion.
We're supposed to be a crowd that favors mature discussions about technical topics. We can do better.
* What does CLI support mean in the context of a Lakehouse storage system? You can open up a Spark shell or Python shell to interface with your Delta table. That's like saying "CSV doesn't have a CLI". I don't get it.
I'm not well versed in these things, but at this point, aren't you re-inventing database systems? Talking about things like ACID transactions, schema evolution, dropping columns, ... in the context of a file-format feels bizarre to me.
Yep, it is re-inventing database systems and you raise a great question.
At first glance, it seems like Delta Lake is inferior to a database. Most databases support multi-table transactions, while Delta Lake only supports transactions on a single table. ACID transaction support is nothing new for a database.
Delta Lake is useful for large datasets and to keep costs low.
There are organizations that are ingesting hundreds of terabytes and petabytes of data into a Delta table every day. They're able to ingest data, perform upserts, and build realtime pipelines with this architecture.
Delta Lake is also free, so you only have to pay for storing the files in the cloud. This is usually a lot cheaper than a database.
Data warehouses are often packaged with a certain amount of shared RAM/storage. This can be a problem for a team with large workflows from many users. It's annoying to share compute with someone that's running a large experiment.
It’s not so bizarre if you realize that bringing ACID semantics to files, lets you use the scalability of file/blob storage like S3 combined with DB-like access.
Traditional RDBMSes just don't scale as well as S3. But S3 didn't have ACID semantics. Now it does!
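One way these systems layer ACID-ish commits on top of dumb object storage is optimistic concurrency: every writer tries to create the next numbered log entry, and exactly one wins. A toy sketch, with a dict standing in for the object store (the real protocols involve more machinery, e.g. conditional puts or a commit service):

```python
store = {}  # stands in for an object store bucket

def try_commit(version, payload):
    """Put-if-absent: only one writer can create a given log entry."""
    key = f"_delta_log/{version:020d}.json"
    if key in store:
        return False  # another writer committed this version first; retry
    store[key] = payload
    return True

assert try_commit(0, '{"add": ["a.parquet"]}')
assert not try_commit(0, '{"add": ["b.parquet"]}')  # conflicting writer loses
assert try_commit(1, '{"add": ["b.parquet"]}')      # it retries at version 1
```

The losing writer re-reads the log, checks that its change still makes sense, and retries at the next version number, which is where the isolation guarantees come from.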
You don't have access to the underlying data storage for Snowflake, BigQuery, and Firebolt. You can import files into those, but the data will be duplicated.
If you have less than a terabyte of data, using one of those data warehouses is the smart move though.
But they also kind of lock you into that classic DB-like interface. Which works great for many people - hence their overwhelming popularity - but not necessarily everyone.
This is exactly right. The lakehouse is a custom data warehouse you can build out of these cloud primitives to suit the specific data needs of an organisation. Think of it as a database scaled up by several orders of magnitude. Everything from storage costs to latency can be optimised as design choices. The common core in this architecture is data held in standard formats such as Parquet, Avro, and Delta tables.
Yes, it is basically just another relational database system. -but-, it's a database system that's optimized for a different purpose.
A traditional RDBMS is designed for OLTP workloads, and it does a great job of that. Ideally operations are small, discrete, and handled within milliseconds. In service of that speed, you also want to keep them small and lean, so that you can take maximum advantage of caching hot data in memory. Maybe on the megabytes-to-gigabytes scale.
A data warehouse is designed for more OLAP-style workloads, but the emphasis is still on real-time responses to relatively predictable requests. But it's at the more relaxed end of the "real-time" scale - a query might take a few seconds to run. You'll use extract-transform-load jobs to get the data organized into a structure that's optimized for those workloads before you load it into the warehouse. Data volumes still matter here, but they can be allowed to get quite a bit bigger than what's typical in OLTP databases. Think gigabytes-to-terabytes scale.
Lakehouses, on the other hand, are meant for more of a "get the data somewhere, and then figure out how to use it" mindset. So getting the data into it follows more of an extract-load-transform regime, meaning that significant processing and transformation of the data happens in the course of executing the query itself. The kinds of questions you want to ask are almost unconstrained, and that changes the performance situation again. Millisecond response times are now something that just never happens. Instead you're looking at seconds to minutes, perhaps even hours, being typical execution times for a query. The data also gets bigger again. People often suggest it's potentially on the terabytes-to-petabytes scale, but I haven't seen that myself. Mostly because I've never worked anywhere where anyone even wants to have that much data sitting around to have to manage and govern.
I would say don't get caught up too much on the scale consideration, though. That's real, but I think that the more interesting distinction, and the one that explains why OLTP systems and data warehouses are often implemented using the same RDBMS systems, while lakehouses really do merit a completely different tech stack, is the ETL vs ELT distinction.
Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes. I've worked on data lakes that require file listing operations that run for hours. Object stores like S3 aren't good at listing files the way Unix filesystems are.
When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll just grab the schema from one of the files and just assume that all the others have the same schema. This could be wrong.
You can set an option telling Spark to read the schemas of all 40,000 Parquet files and make sure they all have the same schema. That's expensive.
Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?
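The trade-off is easy to see with toy per-file schemas (plain dicts standing in for Parquet footers; the file count and column names are made up). Sampling one footer is cheap but can silently miss drift, while verifying every footer is O(number of files):

```python
# Hypothetical per-file schemas, as a stand-in for Parquet footers.
schemas = [{"id": "long", "name": "string"}] * 39_999
schemas.append({"id": "long", "name": "string", "extra": "double"})  # drifted file

# Cheap default: trust the first footer and assume the rest match.
assumed = schemas[0]

# Safe but expensive: check every footer against the assumed schema.
mismatches = [i for i, s in enumerate(schemas) if s != assumed]
assert mismatches == [39_999]  # the sampling shortcut would have missed this
```

With the cheap default, the drifted file's extra column would just be dropped or misread; the full check catches it but means touching all 40,000 footers, which is exactly the cost a central metadata layer avoids.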
The schema in the Parquet footer is perfect for a single file. I think storing the schema in the metadata is much better when data is spread across many Parquet files.
Yea, that's exactly what Delta Lake does. All the table metadata is stored in a Parquet file (it's initially stored in JSON files, but eventually compacted into Parquet files). These tables are sometimes so huge that the table metadata is big data also.
If the format is splittable you can generally get similar benefits, and Parquet files have metadata to point a given reader at a specific chunk of the file that can be read independently. In the case of Parquet, the writer decides when to finish writing a block/row group, so manually creating files smaller than that can increase parallelism. But you can only go so far; I'm pretty sure I've seen Spark combine very small files into a single-threaded read task.
"This post explains how to scale developer advocacy by creating content in a way that answers current user questions and makes it easier to generate additional content in the future"
As a lead for a team of developers who have used Parquet and are considering Iceberg for our next-gen stuff, you aren't "answering current user questions" about whether we should consider Delta Lake, at least for me. You are marketing to a past world.
You are right there will be a flamewar, and others will discount some of what you say because of your bias, you will get criticism and personal remarks (mostly off base) and you will suffer tremendous heat for it. I have been there in a past life re: unix wars.
But, particularly if you acknowledge opposing views in your content and don't hide counterarguments via cherry picking, you will really add value to the data community in exposing the truth, and educating people both on your team and the other team which ultimately spurs improvements where both sides have gaps and performs a greater benefit for the broader community.
It takes courage and care to put a controversial rigorous viewpoint out there; you do risk your "reputation". But, particularly if you make corrections where appropriate, people will recognize you as genuine.
It is not bad to have a point of view. What is bad is to hide your bias or counterarguments to deceive people.
Be part of the thesis + antithesis → synthesis Hegelian dialog that brings progress. Ultimately, as you advocate for your customers (developers/data users), not "your team", you will perform a true service to the community, even if only you and a few others recognize it.
Yea, there is a Rust implementation of the Delta Lake protocol that lets you do upserts without Spark too. This allows pandas, Polars, DataFusion, and PyArrow users to easily do upserts as well.
There are a few features missing from the FOSS Scala/Spark implementation of Delta Lake, but I wouldn't say a lot. The FOSS version supports all the table features in the Delta Lake protocol.
The Delta Rust implementation is missing more table features, but we're closing the gap fast. We just added support for constraints to Delta Rust and are working on change data feed right now.
I’d take issue with the “Iceberg is slow” theme that Databricks in particular has tried to push.
If that were true, Snowflake would not be as fast on Iceberg/Parquet as its native format. The engine makes something fast or slow, not the table format.
Back when we were choosing between the three formats about 1.5 years ago, Iceberg was definitely the slowest. If the situation has changed since then, I would love to see an updated comparison.
We tested all three of them using Spark batches that converted a stream of changes into SCD2.
Databricks has been struggling to defend Delta against the fast-moving improvements and widening adoption of Iceberg, championed by two of its major competitors, AWS and Snowflake. This article seems like a bizarre, and maybe even misleading, artifact, given that no one in the industry is comparing Parquet to Delta. They're weighing Iceberg, which, like Delta, can organize and structure groups of Parquet (or other format) files…
I did some keyword research and wrote this post cause lots of folks are doing searches for Delta Lake vs Parquet. I'm just trying to share a fair summary of the tradeoffs with folks who are doing this search. It's a popular post and that's why I figured I would share it here.
Data Lakes (i.e. Parquet files in storage without a metadata layer) don't support transactions, require expensive file listing operations, and don't support basic DML operations like deleting rows.
Delta Lake stores data in Parquet files and adds a metadata layer to provide support for ACID transactions, schema enforcement, versioned data, and full DML support. Delta Lake also offers concurrency protection.
This post explains all the features offered by Delta Lake in comparison to a plain vanilla Parquet data lake.