Comparing Delta Lake to Parquet is a bit nonsense isn't it? Like comparing Postg...

MrPowers · on Jan 19, 2024

Lots of organizations have Parquet data lakes and are considering switching to Delta Lake.

Converting a Parquet table to a Delta table is an in-place, cheap computation. You can just add the Delta Lake metadata to an existing Parquet table and then take advantage of transactions and other features. I don't think it's a meaningless comparison.

Iceberg is cool too.

BadHumans · on Jan 19, 2024

There is no Parquet table. Parquet is a compressed file format like a zip. Parquet can be read into table formats like Hive, Delta, etc. That is why this comparison makes no sense.

MrPowers · on Jan 19, 2024

Lots of Parquet files in the same directory are typically referred to as a "Parquet table".

Yes, Parquet can be compressed with zip, but snappy is much more common because it's splittable.

Parquet tables can be registered in a Hive metastore. Delta metadata can be added to a Parquet table to make it a Delta table.

BadHumans · on Jan 19, 2024

> Lots of Parquet files in the same directory are typically referred to as a "Parquet table".

This is my point though? This is an apples to oranges comparison. A directory of Parquet files is not a table format. Comparing Delta to Hive or Iceberg is a more apt comparison. I have worked with all types of companies and I have yet to work with one that is just using a directory of Parquet files and calling it a day without using something like Hive with it.

MrPowers · on Jan 19, 2024

Yea, comparing Delta Lake to Iceberg is more apt, but I've been shying away from that content cause I don't wanna flamewar. Another poster is asking for this post tho, so maybe I should write it.

I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered. If you persist a Spark DataFrame in Delta with save it's not registered in the Hive metastore. If you persist it with saveAsTable it is registered. I've been meaning to write a blog post on this, so you're motivating me again.

I've seen a bunch of enterprises that are still working with Parquet tables that aren't registered in Hive. I worked at an org like this for many years and didn't even know Hive was a thing, haha.

BadHumans · on Jan 19, 2024

> I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered.

You are right about Delta tables in the Hive metastore but if you are writing from the perspective of "there are companies that don't know what Hive is" then I feel the next step up is "there are companies that just stuff files in S3 and query them with Athena(which handles all the Hive stuff for you when you make tables). Explaining what Delta gives them over that I feel is something worth explaining.

chimerasaurus · on Jan 19, 2024

I agree with the points you make above.

bunderbunder · on Jan 19, 2024

I fail to see what's nonsense about comparing an extension to a format to the format it extends.

CharlesW · on Jan 19, 2024

It’s like saying, “Which is better, ISOBMFF or MPEG-4”? It’s comparing a format with an application of the format.

bunderbunder · on Jan 19, 2024

But the article isn't just saying which is better (though I realize the article is meant as a sales pitch, so there's some of that), it's explaining what the differences are. I can't claim any sort of expertise in video codecs and file formats, but I'm guessing that one is a codec and the other is a file format that wraps the codec. If that's true, then I would say that a similar comparison between the two is also valid.

Also, as someone who chooses between Parquet and Delta on a fairly regular basis, I can say from experience that there are as many situations where both are viable options than there are situations where such a comparison is invalid. So, it's hardly an apples to oranges comparison. At worst it's maybe a pomelos to oranges comparison.

CharlesW · on Jan 19, 2024

I think the primary cognitive dissonance is that Delta Lake is a storage framework, not an open table format. It lives in a different layer of the data storage and processing stack than a file/table format like Parquet. Any time you're comparing apples to oranges, it's going to set off alarm bells with some readers.

Delta Lake is a storage framework that uses Parquet files. Parquet files are a thing, but Delta Lake files are not a thing. The Delta Lake framework uses Parquet files, plus some additional stuff (transaction log, checkpoint files) that enable capabilities that Parquet files alone do not.

That collection of files is called a "Delta table". If the title of this was, "Benefits of Delta Tables vs. Parquet Files Alone", and the article was revised to be more careful about not conflating Delta Lake and Parquet, I think that would benefit everyone.

bunderbunder · on Jan 19, 2024

But, similarly, if you're setting up a data lake, a co-located collection of Parquet files that share the same schema is often colloquially referred to as a "Parquet table". And it's common put some extra layers beyond just squirreling some files away on a disk somewhere to manage and govern these logical units that people call Parquet tables.

I think that the people elsewhere in this thread who are trying to be pedantic about this are maybe more familiar with some of the individual open source technologies that are used in data lake applications than they are with conventions around how these technologies get assembled into a full-fledged data management system in a business setting?

In other words, it seems like people who are trying to talk about the forest are getting downvoted and picked on by a bunch of folks who seem to maintain that there is no forest, only trees.

CharlesW · on Jan 19, 2024

> …if you're setting up a data lake, a co-located collection of Parquet files that share the same schema is often colloquially referred to as a "Parquet table". And it's common put some extra layers beyond just squirreling some files away on a disk somewhere to manage and govern these logical units that people call Parquet tables.

Absolutely, and standardizing that has really cool benefits. You're clearly knowledgeable enough to know where the author means "Delta table" instead of "Delta Lake," "Parquet tables" instead of "Parquet", etc., but not everyone is.

I can understand if the author feels picked on, but I'm sure he knows the bar for technical correctness for developer content marketing is high (especially on HN). Honestly, if I were Mr. Powers I'd be happy for the "strict mode" feedback!

bunderbunder · on Jan 19, 2024

That's not quite what's annoying me. Maybe the more bothersome thing is that some number of people in here have been regularly downvoting people who have done nothing worse than posting factually correct information.

Almost as if people were trying to use HN's voting system as an ersatz referendum on their preferred big data packages rather than as a way to self-moderate the quality of the discussion.

We're supposed to be a crowd that favors mature discussions about technical topics. We can do better.