Hacker News

How is large file support?

What strategy is employed for binary file handling? How does it compare to git LFS / annex / mercurial?

Aside from that, Fossil is very intriguing.



> How is large file support?

Fossil is, because of its sqlite dependency, limited to blobs no larger than 2GB each. Some of its algorithms require keeping two versions of a file in memory at once, so "stupidly huge" blobs are not something it's ideal for. Fossil is designed for SCM'ing source code, and source code never gets anywhere near 2GB per file. The only projects which use such files seem to be (based on fossil forum traffic) high-end games and similar media-heavy/media-centric projects which fossil is not designed for.

> What strategy is employed for binary file handling?

That's a vague question, so here's a vague answer: fossil handles binary files just fine and can delta them just fine. It cannot automatically merge binary files which have been edited concurrently by 2+ users, because doing so requires file-format-specific logic. (AFAIK _no_ SCM can merge, as opposed to delta, binaries of any sort.)


Fair enough; that matches vanilla git as of today.

I do hope someday git and others employ a git-annex- or mercurial-style scheme where, for large binary files: 1. no diff is performed, and 2. only the latest version is kept within the history.

This would blow the possibilities wide open for using Fossil in binary-heavy projects such as machine learning, games, and simulation.
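For reference, git LFS implements roughly this scheme: the repository only tracks a small text pointer, while the payload lives in a separate object store keyed by its hash. A minimal sketch of generating such a pointer (format per the git-lfs v1 pointer spec):

```python
import hashlib

def lfs_pointer(payload: bytes) -> str:
    """Build a git-LFS-style pointer file. The repo versions this tiny
    text stanza; the actual payload is stored out-of-band, addressed by
    its SHA-256, so history never accumulates old binary blobs."""
    oid = hashlib.sha256(payload).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(payload)}\n"
    )

print(lfs_pointer(b"big binary asset"))
```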

I could see the SQLite limitation worked around by just splitting up binary data into multiple pieces.
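The splitting idea itself is straightforward at the storage layer. A sketch of the workaround described above, with a hypothetical 1 MiB chunk size, storing one logical file as multiple rows so no single blob approaches SQLite's cap:

```python
import sqlite3

CHUNK = 1 << 20  # hypothetical 1 MiB per row

def store_sharded(con, name: str, payload: bytes) -> None:
    # Split one large binary into numbered rows.
    con.execute("CREATE TABLE IF NOT EXISTS shard"
                "(name TEXT, seq INTEGER, data BLOB,"
                " PRIMARY KEY(name, seq))")
    con.executemany(
        "INSERT INTO shard VALUES (?, ?, ?)",
        ((name, i, payload[off:off + CHUNK])
         for i, off in enumerate(range(0, len(payload), CHUNK))))

def load_sharded(con, name: str) -> bytes:
    # Reassemble the chunks in order.
    rows = con.execute(
        "SELECT data FROM shard WHERE name=? ORDER BY seq", (name,))
    return b"".join(r[0] for r in rows)

con = sqlite3.connect(":memory:")
blob = bytes(range(256)) * 10_000  # ~2.5 MB test payload
store_sharded(con, "asset.bin", blob)
assert load_sharded(con, "asset.bin") == blob
```

As the reply below this comment notes, though, sharding only fixes storage; it does nothing for the delta machinery.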


> I do hope someday git and others employ a git-annex- or mercurial-style scheme where, for large binary files: 1. no diff is performed, and 2. only the latest version is kept within the history.

That will never happen in fossil: one of fossil's core-most design features and goals is that it remembers _everything_, not just the latest copy of a file. The way it records checkins, as a list of files and their hashes, is fundamentally incompatible with the notion of tossing out files. It is capable of permanently removing content, but that's a feature best reserved for removal of content which should never have been checked in (e.g. passwords, legally problematic checkins, etc.). Removing content from a fossil repo punches holes in the DAG/blockchain and is always to be considered a measure of last resort. In my 14+ years in the fossil community, i can count on 2 fingers the number of times i've recommended that a user use that capability.

> I could see the SQLite limitation worked around by just splitting up binary data into multiple pieces.

There's no need to work around that "limitation" because "source code" trees don't deal with files of anywhere _near_ that size. Fossil is, first and foremost, designed to support the sqlite project itself: it was literally designed and written to be sqlite's SCM. Projects operating at 1000x that scale are simply not on fossil's radar.

Sharding large files over multiple blobs doesn't solve some of the underlying limitations, e.g. performing deltas. Fossil's delta algorithm requires that both the "v1" and "v2" versions of a given piece of content be in memory at once (along with the delta itself), and rewriting it to account for sharded blobs would be an undertaking in and of itself. That's almost certain to never happen until/unless the sqlite project needs such a feature (which, i'm confident in saying, it never will).
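To make the "both versions in memory" point concrete, here is an illustrative delta in the same spirit (this is _not_ Fossil's actual delta format, just a generic copy/insert encoding built on Python's `difflib`; note the matcher needs v1 and v2 fully resident):

```python
from difflib import SequenceMatcher

def delta(v1: bytes, v2: bytes):
    """Illustrative only. Emit ('copy', start, length) ops referencing
    v1 and ('insert', data) ops, enough to rebuild v2. Both versions
    must be entirely in memory for the matcher to run."""
    ops = []
    matcher = SequenceMatcher(a=v1, b=v2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:  # 'replace' or 'insert': take bytes from v2
            ops.append(("insert", v2[j1:j2]))
    return ops

def apply_delta(v1: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, n = op
            out += v1[start:start + n]
        else:
            out += op[1]
    return bytes(out)

old = b"The quick brown fox jumps over the lazy dog"
new = b"The quick red fox leaps over the lazy dog"
assert apply_delta(old, delta(old, new)) == new
```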

TL;DR: fossil is, plain and simple, not the SCM for projects which need massive blobs.


Fair enough, thank you for the detailed insight.



