
If a wise programmer decided he needs a serialization format, would he deliberately include in that format all the crap so vividly pointed to by the article?

No.

He will think of the "serialization format" as an interchange format between two different instances of his program. One process first writes the data file and another process later will read it. He also knows that sooner or later the "serialization format" needs to talk with different versions of his program, not just different running instances.
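To make the "different versions" point concrete, here is a minimal sketch in Python. It assumes a made-up JSON document with a `version` field; the field names, the version numbers, and the rule that version 1 had no `tags` are all hypothetical and do not come from any real format.

```python
import json

FORMAT_VERSION = 2  # hypothetical current version of this toy format


def save(doc: dict) -> str:
    """Write the document with an explicit version tag."""
    return json.dumps(
        {"version": FORMAT_VERSION, "title": doc["title"], "tags": doc["tags"]}
    )


def load(text: str) -> dict:
    """Read any version this program knows about and upgrade it in memory."""
    raw = json.loads(text)
    version = raw.get("version", 1)  # assumption: untagged files are version 1
    if version > FORMAT_VERSION:
        raise ValueError(f"file is version {version}, newer than this program")
    if version == 1:
        raw["tags"] = []  # assumption: version 1 had no tags field
    return {"title": raw["title"], "tags": raw["tags"]}
```

The reader never depends on how the writer laid things out in memory, only on the named fields and the version tag, so an older file still loads in a newer program.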

AFAIK, the Word .doc also started (and unfortunately continued) as basically a not-so-designed memory dump of the in-memory OLE data model. It's a format that more often than not has infamously stumped its own implementation as well. (Over time, OpenOffice has saved quite a lot of .doc files of Office users.)



The overriding aim of most formats was to load into memory efficiently - fast load times were the key winner in the 80s and 90s. So you did not want a simple serialisation, because that meant slow, CPU-intensive saves and loads. But if you slammed it in pretty much as it would be in memory, you would win. The downside is that if you changed the in-memory representation of the running program, you had to change the file format.

And .mov would have no such concerns - its prime use case is to store data in serialised chunks anyway - it was already serialised, so it could use very dumb stores.
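The downside of the dump-it-as-it-sits-in-memory approach can be shown with a rough sketch. The two record layouts below are invented for illustration (a width/height pair, then the same record with a `flags` field added in the middle); they are not taken from any real file format.

```python
import struct

# Hypothetical in-memory layouts of the same record in two program versions.
LAYOUT_V1 = struct.Struct("<HH")   # width, height
LAYOUT_V2 = struct.Struct("<HIH")  # width, flags (new), height


def dump(layout: struct.Struct, *fields: int) -> bytes:
    """Save: write the record exactly as it is laid out, no transformation."""
    return layout.pack(*fields)


def load(layout: struct.Struct, blob: bytes):
    """Load: reinterpret the bytes using the layout this program version has.

    Returns None when the bytes do not fit the layout at all.
    """
    try:
        return layout.unpack(blob)
    except struct.error:
        return None


old_file = dump(LAYOUT_V1, 640, 480)
same_version = load(LAYOUT_V1, old_file)   # round-trips cleanly
newer_version = load(LAYOUT_V2, old_file)  # layout changed, old file no longer fits
```

Saving and loading are a single pack or unpack call, which is the speed win. The cost is that the file format is whatever the struct happens to be in that build.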


You are making a general argument why serialization formats should not exist. Fine, but in reality, and for any number of reasons, they do: they are easier at first, they are actually often somewhat easier over time, the pain cost that occurs is often easily amortized over time, they are fast to load (no transformations), they are fast to edit (you can often treat them as some insane memory page container and do internal allocation for updates, leaving old content behind until it is recycled), and their concept makes them capable of handling random seemingly-unrelated garbage that these mega-programs end up being popular for.
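As a rough illustration of the "fast to edit" point, here is a toy sketch of a container that allocates space for updates internally and leaves the old content in place. The `PagedFile` class and its methods are hypothetical names for this example only; real formats of this kind are far more involved.

```python
class PagedFile:
    """Toy container: updates are appended, old bytes stay until recycled."""

    def __init__(self) -> None:
        self.data = bytearray()  # stands in for the file on disk
        self.index: dict[str, tuple[int, int]] = {}  # name -> (offset, length)

    def write(self, name: str, payload: bytes) -> None:
        # Allocate at the end and repoint the index; nothing is rewritten.
        offset = len(self.data)
        self.data += payload
        self.index[name] = (offset, len(payload))

    def read(self, name: str) -> bytes:
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])

    def dead_bytes(self) -> int:
        # Space still occupied by content that no index entry points at.
        live = sum(length for _, length in self.index.values())
        return len(self.data) - live
```

Editing one stream costs an append and an index update regardless of how large the rest of the file is, which is why this style works for documents bigger than RAM. It is also why such files carry stale content around until something compacts them.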

They aren't even always considered the non-ideal: I have seen many an argument from people who use Smalltalk that the ideal transfer format is to literally serialize part of the running program state and call it a "document", including whatever code might be required to operate the more epic parts of the document. (If you think about it, this is actually fairly similar to the various file formats that involve OLE, as you end up having the identifier of some code the user hopefully has installed attached to a block of data that that code hopefully can reinstate itself using.)

So, given that it is a tradeoff, and given that it was often a necessary one for file formats where you want or need to be able to edit files that both contain numerous nearly-unrelated features (OLE would be the most beautiful example of this in the Word container format) where the entire contents may be larger than the RAM available to the entire computer, it simply seems silly to complain about this: man up, import the data, make your own format for saving your files, and stop complaining that someone in 1990 made something that over 22 years has become slightly difficult to understand without that historical context.


> AFAIK, the Word .doc also started (and unfortunately continued) as basically a not-so-designed memory dump

This may be true but not the whole story. It's the reason why the MS Office team bit the bullet and replaced .doc with .docx about 5 years ago: http://en.wikipedia.org/wiki/Office_Open_XML

Docx is basically XML in a zip file. It's a beast and has lots of compromises for backward compatibility, but as a design starting point, "zipped XML" is far far better than a binary dump of the in-memory data.
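For a feel of what "zipped XML" means as a starting point, here is a minimal sketch using only the Python standard library. It is a toy container, not the real .docx layout: the `document.xml` member name and the `<document>`/`<p>` elements are assumptions made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def write_container(paragraphs: list[str]) -> bytes:
    """Serialize paragraphs to XML and store the XML inside a zip archive."""
    root = ET.Element("document")
    for text in paragraphs:
        ET.SubElement(root, "p").text = text
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("document.xml", ET.tostring(root, encoding="utf-8"))
    return buf.getvalue()


def read_container(data: bytes) -> list[str]:
    """Open the zip, parse the XML member, and return the paragraph texts."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("document.xml"))
    return [p.text for p in root.findall("p")]
```

Any tool that can unzip and parse XML can at least get at the contents, which is not true of a binary dump of an in-memory data model.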


It's still worse than ODT (which itself isn't exactly pretty), for no good reason. That's sad.


ODT is also XML-based, so Docx's problems compared to ODT can't be blamed on XML.


I never said it has anything to do with XML. OOXML is extremely complex for little reason. Even though it is also quite complex, ODT is much, much simpler.


There are actually reasons for some of OOXML's weirdness, just not good ones. For instance, it appears the reason why OOXML is pretty much the only XML-based document format which doesn't use a mixed content model is because there's a huge amount of prior art that'd have made it impossible to patent if they had. (Microsoft tried anyway though.)


I'm not disagreeing with you; but the context is mostly about the use of XML.


It could be that the format was very reasonable at first, but the surrounding platform changed completely during its development. New layers of specification were then added in whatever form seemed the best possible solution on that platform at that time. Wasn't Photoshop originally an app for the original m68k Macintosh? Surely different field sizes made more sense in that world than in ours - tradeoffs made for the sake of performance could also have had some say.


Word dates back to 1983, while OLE was only introduced in 1990 (but otherwise I think you are correct)


Office 97 dates back to 1996.


Not to mention security bugs too.



