Document everything as you explore it. I'm an advocate for literate programming, but I recognize that most organizations won't adopt it, so I use it as a personal tool.
Tools: emacs, org mode, org babel.
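Create a parallel directory structure. For a hypothetical project, make a new directory tree with one org file per source file and one index org file. (You can organize it differently; this has worked for me.)
index.org will be a simple tree view of the folder hierarchy. You may add some notes about the general purpose of each of those files.
In main.org, copy the entire code of main.js into the file, like: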
* main.js
#+BEGIN_SRC js
// all the code from main.js
#+END_SRC
* [[file:../index.org][Project Root]]
Start splitting the contents of main.js into separate snippets. I don't know JavaScript very well, so here is a quick C example instead:
* main.c
#+BEGIN_SRC c :noweb yes :tangle yes
<<includes>>
<<structs>>
<<functions>>
#+END_SRC
** includes
#+NAME: includes
#+BEGIN_SRC c
,#include <stdio.h> // [0]
#+END_SRC
** structs
#+NAME: structs
#+BEGIN_SRC c
// structs
#+END_SRC
** functions
#+NAME: functions
#+BEGIN_SRC c :noweb yes
<<some_func>>
<<main>>
#+END_SRC
*** main
#+NAME: main
#+BEGIN_SRC c
int main (...) {...}
#+END_SRC
*** some_func
This function will initialize a block of memory
to be used as a shared buffer between two processes.
#+NAME: some_func
#+BEGIN_SRC c
void some_func (...) {...}
#+END_SRC
As you go through this you can make cross-references to other files and functions/structures. Eventually you'll find a smallest reasonable unit. A long but clear function doesn't need to be dissected, but a short, complex one may end up with each line broken out and analyzed.
I don't just import a massive code base and do this in one go. Instead I import parts of it and break down all the files related to a particular topic ("How does X happen?"). I trace it from start to end, and then repeat with the next question. Good, modular code makes this much, much easier. The more tightly coupled the code, the harder it is to understand, no matter the method.
[0] The comma is inserted by org babel to distinguish the line from its own #-prefixed content.
I like this and have considered this approach using a git branch for annotations (that's specific to git; I'm not familiar with other version control software). Have you done the git branch (or equivalent) approach?
Yes, I have made a new repo or a branch, but I typically keep it to myself and generate reports for others (if used at work).
EDIT: I was on mobile earlier, so extending my thoughts.
I typically make a new branch or repository but keep it on my own machine. I've gotten zero interest from colleagues in collaborating on this sort of thing, but they usually like the output. Org mode (my tool of choice, but not the only one) creates decent HTML output (you may want to play around with your own CSS or color schemes for the code blocks). So what I've done when we on-boarded a new project was to start doing this for certain critical sections that were under-documented. I then generated HTML output as a sort of white paper, and a PowerPoint deck that walked through the structure and control flow (would be best if I used flowcharts, but usually this is just text).
If we had good development machines at work, I'd definitely do the above with PlantUML or something similar for text-based diagrams. Org will produce the images and embed them in the HTML output. This would make the flow for producing documentation much easier. I disliked trying to embed flowcharts created in Visio (the tool available at work) into the HTML: I had to generate them, export each to an image, link the image in org, and then keep it up to date manually. For a few charts it's not bad, but if you make a lot it's tedious to switch between tools and correctly export the image.
=====
For non-work stuff, I try to use literate programming from the start, but it's always solo projects so there's no "selling" this method. If I were collaborating with others, I'd have to reconsider the method. Leo has (from what I've read) an effective literate->code->literate story (that is, edit the code and the changes show back up in the literate format). Org mode can do that, but I haven't explored it. I'd really want that if I was to pursue literate programming in a collaborative environment (so that those uninterested in my method could still contribute).
[I had constructed a reply to your comment while it was being edited so when I posted the comment was much longer and had provided more than enough detail! Revised this comment accordingly]
I came to believe that ‘bird's-eye view’ summary documentation (index.org here, readme.md elsewhere) should be created for each more-or-less isolated module in the codebase. It should describe why the module exists and how it is used, i.e. its external contract/API, including the expected ranges of argument values.
This makes it much easier to learn proper use of a module when adding new calls to it. And of course, several months down the road you'll feel like you're seeing the module for the first time, so an overview should serve you well as a reminder.
I agree. You could do what I've described to produce such documentation if it hasn't been written already, which is the situation I'm normally in as a professional maintenance programmer (poorly documented code design, even if we have a "complete" system specification).
And even if such documentation exists, it's often useful to recreate it yourself in developing an understanding of a complex code base (or at least sections of it).
It's not really the same: documentation that's tied to code structure tends to describe what code does instead of why it exists and how it works on a larger scale. That's why I prefer (additionally) having plain-human-language descriptions separated from the code―it forces the perspective of an external user, at least a little.
This is a gripe of mine especially with inline comments that are too often as useful as this:
// increments the counter
i += 1
At the same time, the ‘self-documenting code’ crowd forget that code can't really describe the rationale for its existence and e.g. the expected sequence of calls to its public functions, so plain-language descriptions are still necessary even if the syntax of the chosen language approaches English.
I agree completely, but as a maintenance coder I often don't inherit good documentation. Typically, by the time it hit my shop the code "documentation" was doxygen or similar auto-generated documentation. It showed the program structure but not why it happened. When I do this I don't just tear the code apart. I explain the rationale (as I understand it):
* can_send: () -> bool
=can_send= will signal =true= if the conditions are correct
for transmitting a message over the radio. Otherwise, it'll
signal =false=. Here are the conditions that it checks:
- Condition :: description
- Condition :: description
If any of these are true, then we can transmit.
#+BEGIN_SRC C
// body of can_send
#+END_SRC
With perhaps more levels to my org tree structure if appropriate. If one of those conditions is particularly complex, I'd give it its own explanation.
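To make the can_send notes a bit more concrete, here is a rough C sketch of the kind of function they might sit on top of. Only the name can_send and the "true if any condition holds" rule come from the notes; the radio_state struct, its two fields, and the pointer argument are placeholders invented for illustration, not the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical radio state; both fields are made-up stand-ins for the
 * "Condition :: description" entries in the notes. */
struct radio_state {
    bool channel_clear;      /* condition: nothing else is transmitting */
    bool priority_override;  /* condition: urgent traffic may skip the wait */
};

/* Returns true if any one of the documented conditions holds,
 * false otherwise. */
bool can_send(const struct radio_state *rs)
{
    assert(rs != NULL);  /* precondition: caller passes a valid state */
    return rs->channel_clear || rs->priority_override;
}
```

With a layout like this, each struct field lines up with one entry in the condition list, so a condition that needs a longer explanation can get its own subheading next to the line that tests it.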
If I have a system spec, which in my field I usually do, I'll try to relate it back to the specific requirements or specification elements that this code is implementing.
- Condition :: description, which maps to Requirement SRD-1010.
* Message Y
// description of the message format
// code for packing it or the class struct or whatever
So the first pass is more "what does this do", second pass is "why does it do it". Again, it's because of where I'm coming from, always late to the party. If I were doing a project from scratch, I'd try to keep the "why's" present more than the "what's".
As someone who would rather read comments than code, I like the idea of "literate programming". But I was expecting you to be taking notes about each function, not documenting the file structure.
What's the goal of breaking the file down like this? You don't try to maintain this when you change the code, right? So it's just a one-time familiarization with the code files? But why do I need to note down "This section of the file has structs?" I can see that by scrolling or with an IDE.
Doesn't reading through one entire source file make about as much sense as reading the first paragraph of every column in the newspaper? Don't you want to read up and down the call stack of something that does something interesting, instead of a bunch of code that may never need maintenance as long as you work there?
I threw that post together in about 5 minutes at work and never came back to it.
I do describe the what and why more, but I start with the code structure because that's what I've been handed and need to understand. I also work in embedded systems where, generally, the code call tree is acyclic and sticking with this format works well (each file is often its own module with clearly defined, if not clearly documented, interfaces for the outside). If I weren't dealing with these systems I'd need to reconsider the structure.
It's not just "this section has structs". I'd start with that:
** Structs
#+NAME: structs
#+BEGIN_SRC c
// all the struct defs
#+END_SRC
then:
** Structs
#+NAME: structs
#+BEGIN_SRC c :noweb yes
<<msg123>>
// rest
#+END_SRC
*** msg123
This struct holds messages of type 123. Here's the spec for it [some link].
Here's a list of each component and their acceptable values:
- type :: stored in the lower 7 bits of the first 16-bit word, should be 123
- timetag :: stored in the second 16-bit word, represents time since midnight
in seconds.
#+NAME: msg123
#+BEGIN_SRC c
// code
#+END_SRC
If the structs are straightforward (think a standard quick-and-dirty llnode definition), I won't bother breaking it down because it's clear (for me). If I'm communicating it to a new developer, maybe I write more.
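For a struct like msg123, the field notes translate fairly directly into small accessors. The following is only a sketch under assumptions: the struct layout, the macro names, and the helper functions are all hypothetical, and the type field is taken to be 7 bits wide because the value 123 needs at least 7 bits. The timetag is returned as the raw 16-bit word; how it scales to seconds is left to the spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: two 16-bit words, as in the notes. */
struct msg123 {
    uint16_t word0;  /* message type lives in the low bits */
    uint16_t word1;  /* timetag */
};

#define MSG123_TYPE_BITS 7u  /* assumption: enough bits to hold 123 */
#define MSG123_TYPE_MASK ((1u << MSG123_TYPE_BITS) - 1u)
#define MSG123_TYPE      123u

/* Extract the type field from the low bits of the first word. */
static inline unsigned msg123_type(const struct msg123 *m)
{
    assert(m != NULL);
    return m->word0 & MSG123_TYPE_MASK;
}

/* Return the raw timetag word, uninterpreted. */
static inline uint16_t msg123_timetag(const struct msg123 *m)
{
    assert(m != NULL);
    return m->word1;
}

/* A message is acceptable only if its type field reads 123. */
static inline bool msg123_is_valid(const struct msg123 *m)
{
    return msg123_type(m) == MSG123_TYPE;
}
```

Writing the mask and the expected value as named constants keeps the "acceptable values" from the notes visible in the code, which makes the two easier to check against each other.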
While I don't actually develop from this code (except for personal projects), I do a sanity check. I attempt to tangle the code (generate the source output from the org files) and run a git diff. If it shows any non-whitespace differences, then I accidentally altered a line I didn't mean to, which means my documentation will be wrong.