
It's not doing that. If you look at the repository, it's adding a new commit with tiny parquet files every 5 minutes. The most recent one was only a 20.9 KB parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it were a median of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...

The bigger concern is how large the git history is going to get on the repository.
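For a rough sense of scale, a back-of-envelope sketch in Python using the figures above (one commit every 5 minutes at a median of roughly 5 KB of parquet each; git commit/tree object overhead is ignored):

    # assumptions: 1 commit per 5 minutes, ~5 KB of parquet per commit
    commits_per_day = 24 * 60 // 5    # 288 commits/day
    kb_per_commit = 5                 # median file size from the tree view
    mb_per_day = commits_per_day * kb_per_commit / 1024
    print(f"~{mb_per_day:.1f} MB/day, ~{mb_per_day * 365 / 1024:.2f} GB/year")
    # -> ~1.4 MB/day, ~0.50 GB/year of data, before git object overhead

So the raw data grows slowly; the commit count (roughly 100k commits per year at this cadence) is more likely to be the pain point for the history.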



I recall that this became a big problem for the Homebrew project in terms of load on the repo, to the extent that GitHub asked them not to recommend/default-enable shallow clones for their users: https://github.com/Homebrew/brew/issues/15497#issuecomment-1...

This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!


How would shallow clone be more stressful for GitHub than a regular clone?


Shallow clones (and the resulting lack of shared history data) break many assumptions that packfile optimisations rely on.

See also: https://github.com/orgs/Homebrew/discussions/225


This makes more sense. I still wonder if the author isn't just effectively recreating Apache Iceberg manually here.


I intentionally kept it lightweight. Just Parquet files + simple partitioning + commits on Hugging Face. That already covers most of what I need, without introducing a heavier stack or extra dependencies.

Also, I wanted something that is easy to consume anywhere. With this setup, you can point DuckDB or Polars directly at the data and start querying, no catalog or special tooling required.
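To make that concrete, here's a minimal sketch (not from the post) of querying the parquet files straight off the Hub with DuckDB and Polars. The dataset id open-index/hacker-news comes from the link above; the "data/**/*.parquet" layout is an assumption, so check the repo tree for the actual paths, and you need versions of DuckDB and Polars recent enough to support hf:// paths:

    import duckdb
    import polars as pl

    # partition layout is assumed; adjust the glob to match the repo tree
    path = "hf://datasets/open-index/hacker-news/data/**/*.parquet"

    # DuckDB: aggregate directly over the remote files
    duckdb.sql(f"SELECT count(*) AS rows FROM read_parquet('{path}')").show()

    # Polars: lazy scan, so only the needed row groups are fetched
    lf = pl.scan_parquet(path)
    print(lf.select(pl.len()).collect())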


Are they paying for the repo space, I wonder?


someone's paying to keep name-dropping Iceberg(tm)


Weird accusation. Iceberg is an Apache project. I don't think anyone gets paid when you use it, so I'm not sure what the benefit of shilling would be. It is just a table format that's well suited for this purpose. I would expect any professional to make a similar recommendation.


So they are sharding by time/day?

I have a similar project right now where I am scraping a dataset that only ever offers the current state. I am trying to preserve the history of this dataset and was thinking of using the same strategy. If anyone has experience or pointers on how best to add time as a dimension to an existing generic dataset, I'd love to read about it.
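In case it helps, here is one lightweight approach as a sketch in Python with Polars (all names hypothetical, not tied to the post's repo): stamp every scrape with a snapshot timestamp and write it into a date-partitioned parquet layout, so the time dimension lives both in the path and in a column, and you can later diff or replay snapshots:

    from datetime import datetime, timezone
    from pathlib import Path
    import polars as pl

    def save_snapshot(records: list[dict], root: str = "snapshots") -> Path:
        """Write the current state as a timestamped, date-partitioned parquet file."""
        now = datetime.now(timezone.utc)
        df = pl.DataFrame(records).with_columns(pl.lit(now).alias("snapshot_at"))
        out_dir = Path(root) / f"date={now:%Y-%m-%d}"   # hive-style partition
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / f"{now:%H%M%S}.parquet"
        df.write_parquet(out_path)
        return out_path

    # querying later, e.g. with DuckDB:
    #   SELECT * FROM read_parquet('snapshots/**/*.parquet', hive_partitioning = true)
    #   WHERE date = '2024-06-01' ORDER BY snapshot_at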



