Can you elaborate? GP's argument is verbose, with a lot of explanations and examples that I agree with. Which text encoding do you mean, and how would it be more efficient?
When the text file is a source of data rather than a reflection of the data you want to operate on, and particularly when much of that data was entered by humans (or derived from human input) in suitable units, it very often happens that many of the numbers generated are short: shorter, as text, than a regular double.
I know it is difficult to believe, so don't; I was not trying to say in my comment that text is better everywhere, always; of course not. I myself let users keep text as a reference while creating a one-to-one representation in a well-known binary format for speed of access. When you really have huge amounts of data, each use of it usually requires filtering, selection, manipulation, a new memory layout, etc., which makes the original optimizations in the binary format useless, complicated to maintain, and harder for anyone to use with low friction even ten years in the future.
I consider the textual representation the source of truth; the binary allows me to do the filtering faster, although on many occasions it takes more space. Yes, one could choose better encodings for the binary data (the 12-bit encoding referenced above, etc.), but the complexity of the code increases, and the data becomes even more obscure to the humans who have to use it (and sometimes debug it).
I understand it is difficult to believe. Well, do not believe it. Probably my experience is very niche.
Seems we spoke past each other, then. I did not focus on practical aspects or other such tradeoffs, just on what the most efficient way to represent numbers is. If your argument is that for many practical applications text is good enough, then I don't disagree.
But if the argument was that text is a more efficient way to store numerical data, then there is no need to bring beliefs into this discussion: it is just not the case. Serialisation of numbers was solved by engineers many decades ago, and there are many encodings to choose from based on extra information about what the numbers represent. One such representation is actually useful as a lookup table for text. But to suggest that breaking a single numerical value up into individual lookup-table values is somehow more efficient than storing the original value with the right choice of encoding... I don't know. I don't care all that much. But surely you can see my argument?
The whole thing is a tautology. For the sake of silliness, if ASCII-encoding numbers somehow were the most efficient way of encoding numbers (it isn't, but let's say it were), then it would be exactly as efficient as "choosing the best way to serialize a number". But then surely, whatever that data was, it would in turn be most efficiently represented by its ASCII representation. So we should store the ASCII representation of the bytes that store the lookup-table values, as encoded ASCII values. It doesn't take more than a handful of iterations before you run out of storage space on the planet.
I agree with everything. For users who are experts in their domain (not computer science), if the frontend doesn't give them text that they can parse with a little (but very inefficient) code in their scripting language of choice, they will reinvent some even more difficult, complex, yet-another way to handle text that you will have to support in the backend for performance. And in the backend I choose binary formats (in my experience, without focusing much on any aspect of performance other than retrieval, because most uses of large amounts of data require specific filtering, reordering, or a new memory layout for efficient algorithms anyway). And, counterintuitively, when you do that, and you have hundreds of files of hundreds of megabytes where some columns are integers, and you look at the size of your binary data, you wonder: why is my binary data a few times larger?
That is a good point. It is a good observation that just binary-encoding numbers does not mean you have an efficient serialized representation, and that sometimes it can be so bad that even encoding numbers as serialized text beats it.
I think what threw me off was the context of this discussion, which was the 3MF file format. For such cases, the usefulness of humans opening the data in a text editor is somewhat contrived. I believe that tradeoff was a mistake for 3MF and will likely hinder broad adoption. I'm mostly concerned, though, with the extra processing required to decode the human-readable files. We're talking about a likely speedup of a factor of 50-100. For models that push this limit, imagine a file that takes 60 seconds to process. That isn't entirely unreasonable for a model with an insane level of detail. Had they instead gone for a binary representation, the load time would likely be under a second. Failing to recognize this cost of human-readable text... is unfortunate.
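To make the decode gap concrete, here is a rough micro-benchmark sketch. The payload and sizes are made up for illustration, and the exact factor depends on the parser, the runtime, and the data; the point is only that parsing decimal text is far slower than reinterpreting raw bytes.

```python
import struct
import timeit

# Hypothetical payload: 100,000 doubles, stored once as decimal text
# and once as raw little-endian IEEE 754 bytes.
values = [i * 0.001 for i in range(100_000)]
as_text = "\n".join(repr(v) for v in values)
as_binary = struct.pack(f"<{len(values)}d", *values)

# Decoding text means scanning and parsing every number character by character.
t_text = timeit.timeit(lambda: [float(s) for s in as_text.split()], number=5)

# Decoding the binary form is essentially a bulk memory reinterpretation.
t_binary = timeit.timeit(
    lambda: struct.unpack(f"<{len(values)}d", as_binary), number=5
)

print(f"text: {t_text:.4f}s  binary: {t_binary:.4f}s  "
      f"ratio: {t_text / t_binary:.0f}x")
```

Both paths recover the same doubles; only the decoding cost differs.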
Hmm, I think I see your point. So, for example, if most numbers were something like 0.1, 0.25, or 3.25, they would take around 4 bytes as text in most cases, whereas storing them as doubles would always take 8 bytes.
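A minimal sketch of that size comparison (the sample values are assumptions, chosen to look like short human-entered numbers):

```python
import struct

# Hypothetical short, human-entered values, as in the example above.
values = [0.1, 0.25, 3.25, 12.5, 0.75]

# Space-separated decimal text, the way a human-edited file might store them.
as_text = " ".join(str(v) for v in values)

# The same values packed as raw IEEE 754 doubles, 8 bytes each.
as_doubles = struct.pack(f"<{len(values)}d", *values)

print(len(as_text))     # 23 bytes of text
print(len(as_doubles))  # 40 bytes of doubles
```

For long or high-precision values the comparison flips, of course; the text wins only while the decimal representations stay short.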
We don't want to use float32 because of the inherently bad precision of working with it, so storing as text is better.
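A quick illustration of that precision point: the decimal text "0.1" parses back to the same double every time, while squeezing the value through a 4-byte float loses bits.

```python
import struct

# Round-trip 0.1 through a 4-byte float: what comes back is the nearest
# float32, widened again to a double, which differs from the nearest double.
f32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
print(f32)  # 0.10000000149011612

# The decimal text "0.1" round-trips to exactly the same double.
assert float("0.1") == 0.1
assert f32 != 0.1
```

(Neither value is exactly one tenth, of course; doubles just get much closer.)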
If you’re reading or writing a file format, whether it’s text, floating point or some other encoding, you don’t have to store the data in memory the same way when you work with it.
…which frees you to write your data to disk however you want.
And given that freedom to choose any serialization format, you definitely should not be manually figuring out every byte you write; you probably write an abstraction, so whatever complexity there is should be completely transparent, because you unit-tested the hell out of your serialization routines.
…which to me makes the whole complexity point moot, and which also makes the whole "text is better for minimizing size" argument moot.
I do think text formats have a place — primarily readability and interoperability — but choosing them purely to reduce your file sizes sounds insane to me.
There is a good reason why one would want to store data the same way it is laid out when you work on it: it removes the need to decode the data at all. A common approach is to follow it with a fast compression pass when storage space is important.
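The approach above can be sketched as follows. This is a minimal illustration with made-up data, using Python's `array` for a flat in-memory layout of doubles and `zlib` at a low level as the fast compression pass:

```python
import array
import zlib

# In-memory layout: a flat array of doubles (assumed data, for illustration).
data = array.array("d", (i * 0.5 for i in range(100_000)))

# Serialize by dumping the in-memory bytes directly: no per-value encoding
# step, and loading needs no parsing either.
raw = data.tobytes()

# Optional fast compression pass when storage space matters.
packed = zlib.compress(raw, level=1)

# Loading is the reverse: decompress, then view the bytes as doubles again.
restored = array.array("d")
restored.frombytes(zlib.decompress(packed))
assert restored == data
```

In a real system you would also record the element type and byte order alongside the payload, since `tobytes` just emits the machine's native layout.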
The difference in post-processing when having to decode text is around two orders of magnitude, which is very significant. In the case of the 3MF format discussed in this article, I find the tradeoff in favor of human-readable files a very strange one. Gaining that benefit at the cost of, say, waiting a minute to load a file that would otherwise take a second is puzzling. But it's all a tradeoff, after all. If humans work with the files, that is a strong argument to work with text. When software stores and loads the files, text will likely be a poor choice.
But you are right: the file size consideration is not all that relevant. However, what was mostly discussed was the argument that serializing numbers as text is more space efficient, which, as I mentioned a few times, can only be the case when the straight serialization of the numbers is done poorly.
Of course... I never thought that using text files to reduce file sizes was sane. Maybe because I didn't want to write in a wishy-washy way, as if I were stepping into a minefield, I came across like that. (Look at my first comment; I think I started with "I know it is counterintuitive".) I wanted to point out that binary (as naturally interpreted: storing floats/doubles) does not necessarily mean smaller file sizes for files generated by humans.
I too am curious. Not that it really matters: the statement itself is a tautology. There is no "opinion" here, just straight logic and math. Text serialization is a way to serialize numbers; it's just a very poor and roundabout way, made for human readability. The idea that such a roundabout way of storing numbers as other numbers should somehow be more efficient than storing the actual numbers... is kind of silly.