I realize you're trying to clarify, but the context of that author's sentence was "text editors", and things like vi/emacs/TextPad/Notepad do not understand a binary format such as a zip file. (And docx is a zip file, as you already noted.)
Yes, inside the binary zip file is a text file called "document.xml" but none of those plain text editors will parse a zip file to get to that xml file. From the viewpoint of plain text editors, a docx file is an opaque binary format.
Yeah, I see, but my line of thinking was more along the lines of: "if I have this docx in 100 years' time, will I be able to decode its contents?"
I think the answer is probably yes (pending possible "premature" apocalyptic scenarios =)
Regarding the zip compression, docx is a "Document Container File" standardized under ISO/IEC, so provided we still have computers and the standard documentation is still known, it should in theory be possible to decompress (inflate) the DEFLATE-compressed file.
Inside you would find all the text content as xml, and as some have pointed out here, binary blobs as well. Those would be mostly font files and image files (or am I missing some others that would be important?). In my opinion the fonts would not be critical to the content meaning.
The images would of course depend on what format they were encoded in. But that's a whole other discussion!
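To make the "could I decode this later?" question concrete, here is a rough sketch using only Python's standard library. It assumes the main text lives in a member named word/document.xml and that text runs are w:t elements in the WordprocessingML namespace. The sample archive built by make_sample_docx is a hypothetical stand-in, not a complete Word file, and the image blob in it is fake.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def make_sample_docx() -> bytes:
    """Build a tiny docx-like ZIP in memory (hypothetical stand-in, not a full Word file)."""
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", document_xml)
        zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")  # fake image blob
    return buf.getvalue()


def extract_paragraphs(docx_bytes: bytes) -> list[str]:
    """Inflate word/document.xml and join the w:t text runs of each paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in p.iter(W_NS + "t"))
        for p in root.iter(W_NS + "p")
    ]


def binary_parts(docx_bytes: bytes) -> list[str]:
    """List archive members that are not XML (images, fonts, and so on)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return [n for n in zf.namelist() if not n.endswith((".xml", ".rels"))]


sample = make_sample_docx()
print(extract_paragraphs(sample))
print(binary_parts(sample))
```

The text comes out without any Word-specific software. Anything listed by binary_parts would still need its own decoder, which is the "whole other discussion" about image formats.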
I wonder whether in an ideal world there would be an alternative to git that doesn't have git's long learning curve, for use by scholars in quantitative disciplines.
Why go through all this trouble if you are a social science student? If I were not into computing, I would probably just use Google Docs for all the drafts, and for the final version, I would copy+paste into Word and give it a finishing touch (proper formatting).
(That last step should not be necessary really, because that is the publisher's task, but publishers seem to be getting away with being lazy).
I summarize this to my students using two laws and two postulates.
Two laws:
(1) For long documents, especially if they require citations, bibliographies, mathematical notation, etc. – Word is a poor choice.
(2) For documents that take a long time to write and that you want to survive a long time – Word is a poor choice.
The problem is that the alternative is to learn a new way of doing things, which may have a long learning curve. So don't kill your research by wasting time learning new tools. But if you can spare some time to learn better tools, you should.
To understand the alternative, you need to internalize the two postulates:
(1) You should use tools that focus on the semantic aspects of the text, not its visual appearance. For example: tools that encourage you to say “this is a chapter heading” are good, tools that encourage you to say “this is Arial-14” are bad.
(2) You should save your work in standard file formats. For text, which is most of what we produce, this means TEXT FILES, not Word documents (.doc, .docx), which tend to break when Word switches versions.
> You should use tools that focus on the semantic aspects of the text, not its visual appearance. For example: tools that encourage you to say “this is a chapter heading” are good, tools that encourage you to say “this is Arial-14” are bad.
All modern text processing systems can do that, including Word and Google Docs.
> that tend to break when they switch versions.
I don't see the problem. I have two other postulates: (1) While you are writing research papers or even a thesis, you should not change the version of your word processor. (2) While writing, you should focus on the content, not on the formatting.
Follow these laws, and you will do just fine with any decent word processor.
Indeed, with strict discipline you can use semantic styles in Word and other tools. People typically don't, and the default behavior in many cases is annoying or leads to problems (copying the style when you intend to copy just the text, and so on). And fiddling with these things takes a lot of time.
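For what it's worth, the difference between "this is a chapter heading" and "this is Arial-14" is visible in the docx XML itself. The sketch below is only an illustration: it assumes a named paragraph style appears as w:pStyle and that hand-applied font and size appear as w:rFonts and w:sz on a run (w:sz is in half-points, so 28 means 14 pt). SAMPLE_BODY is a made-up two-paragraph fragment, and classify is a hypothetical helper that ignores many real-world cases, such as a paragraph that has both a style and direct formatting.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical two-paragraph body: one styled heading, one hand-formatted "heading".
SAMPLE_BODY = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Semantic heading</w:t></w:r>
  </w:p>
  <w:p>
    <w:r>
      <w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="28"/></w:rPr>
      <w:t>Visual heading</w:t>
    </w:r>
  </w:p>
</w:body>
""".strip()


def classify(paragraph: ET.Element) -> str:
    """Return 'style:<name>' for a named paragraph style, else flag direct font/size formatting."""
    style = paragraph.find("w:pPr/w:pStyle", NS)
    if style is not None:
        return "style:" + style.get(f"{{{W}}}val", "")
    has_direct = any(
        rp.find("w:rFonts", NS) is not None or rp.find("w:sz", NS) is not None
        for rp in paragraph.findall("w:r/w:rPr", NS)
    )
    return "direct-formatting" if has_direct else "plain"


def audit(body_xml: str) -> list[tuple[str, str]]:
    """Pair each top-level paragraph's text with how its appearance was specified."""
    root = ET.fromstring(body_xml)
    report = []
    for p in root.findall("w:p", NS):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        report.append((text, classify(p)))
    return report


for text, kind in audit(SAMPLE_BODY):
    print(f"{kind:18} {text}")
```

Both paragraphs may look identical on screen, but only the first one records that it is a heading. The second records nothing except a font and a size.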
As for software versions, moving operating systems, computer crashes and the like -- things happen when you are working on a dissertation or any other long-term project, and you end up trying to rescue your files. Moreover, when you are a scholar, you often find yourself needing stuff you prepared decades earlier. In many cases the old software versions are not even easily available, if at all (Word 2.0, anyone?).
There are tons of sites by academics describing these scenarios in more detail. I submitted a few links to HN a few minutes ago.
.docx is not a binary file format, unlike the .doc files of the past.
.docx is a plain-text, XML-based format with an openly published standard.
Most people don't know this, but you can simply unzip the contents of a docx or xlsx file and inspect its plain text contents.
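A quick way to see both sides of this is to look at the container and its members separately. The snippet below is a minimal sketch with a hypothetical in-memory archive. The member names mirror what a docx typically contains, but the XML bodies are placeholders and describe is an illustrative helper, not part of any library.

```python
import io
import zipfile

# Hypothetical stand-in for a .docx: real files have more parts and real content.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("[Content_Types].xml", '<?xml version="1.0" encoding="UTF-8"?><Types/>')
    zf.writestr("word/document.xml", '<?xml version="1.0" encoding="UTF-8"?><document/>')
raw = buf.getvalue()


def describe(container: bytes) -> dict:
    """Report the container's first bytes and the first characters of each member."""
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        members = {name: zf.read(name).decode("utf-8")[:5] for name in zf.namelist()}
    return {
        "magic": container[:4],  # ZIP local file header signature
        "is_zip": zipfile.is_zipfile(io.BytesIO(container)),
        "members": members,
    }


info = describe(raw)
print(info["magic"], info["is_zip"])
for name, head in info["members"].items():
    print(name, "starts with", head)
```

The outer file starts with the ZIP signature, which is why a plain text editor shows noise, while each member decodes as ordinary UTF-8 XML once it has been unzipped.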
Archival and long-term accessibility of the document contents should not be the deciding factor in opting out of .docx.