Write like coding

Disclaimer. This post states some guidelines for working within my group. The advice may still help others, although at times the discussion drifts into details particular to our practices.

If you are writing a paper or your thesis, chances are high that you are using LaTeX (and friends). Moreover, you may already have some setup for writing your documents, and you may be doing fine working with it.

Now, most probably you are not versioning your documents, as it may seem weird to have a version control system for them, right? Well, if you already do, you can skip my justification and jump to our guidelines. If you do not, keep reading.

Why should I version my documents?

Documents are just code that is compiled by humans. One step further, and your writing may be in an actual coding language (like LaTeX). Moreover, during your career in academia (either in your current program or the next) you may need to reuse a piece of writing and re-shape it to suit your needs.

So, I argue that you should have versions of your writing, and treat them as code. That way, you can apply all the good practices you have for working with code to your documents.

But you may still not be convinced. So, let me give you some examples:

  • Working with others. At some point, you will stop writing alone and will start working with others. At that point, you may need to parallelize the work, so several people can be working on different sections of the paper simultaneously.

    Moreover, you can benefit from issues and boards (like in GitLab or similar software) to keep track of the discussions in the same place where the writing is happening.

    You may object: "But I have these open tools (e.g., Overleaf or ShareLaTeX) to write and collaborate without handling the details myself. So, why bother?"

    Well, yes. But do you own them? Do they work and keep the history the way you like, or the way they do? In my case, I couldn't settle for their way of hiding the history, and I wanted to maintain my own.

  • Different versions of your document. This is most visible when you are writing your thesis: you end up with many versions and shapes of the document, all following a coherent and sound naming convention, right? Right? (Relevant PhD comic.)

    Well, a real versioning system solves this problem: the history is right there, and you can use tools built for controlling versions right away.

    Moreover, you may write differently depending on the venue and target audience, and those variants may get lost in your edits. Well, no more! With the system in place you can reuse them later on and, when using git, just branch out of them.

  • Tracking changes and comments. You can use the facilities of the versioning system as roadmaps to keep your writing on track. They can hold not only ideas for writing or to-do lists, but also comments from other members of the team, or even from reviewers (when a revision comes in). In this sense, having a tool that concentrates all the effort and allows you to link back into your writing is very helpful.

So, now that we have stated the benefits of having such a system in place, let's discuss how to set it up.

Guidelines

Repository

My rule of thumb is to keep one paper per repository until it is published. However, if you need to change the ideas or the venue of the paper, you can keep branches for the different versions of it.

Let's assume you prepared a paper for a venue (a conference or a journal). For whatever reason (maybe your paper got rejected, or you couldn't make the deadline), you now need to re-purpose the paper for another venue. Do not create another repository! Branch out, and tag the commits. Then, go back and keep working on it. See the Branching Models section for some ideas on how to handle the branches.
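As a minimal sketch of that workflow, the commands below tag the submitted version and then branch out for the next venue. The tag name submitted-venue-a, the branch name venue-b, and the throwaway repository are hypothetical, chosen only for illustration; adapt them to your own conventions.

```shell
#!/bin/sh
set -eu

# Work in a throwaway repository so the example is safe to run anywhere.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.org"

# The paper as submitted to the first (hypothetical) venue.
printf '%s\n' '\documentclass{article}' > paper.tex
git add paper.tex
git commit -q -m "Submission to venue A"

# Tag the submitted state so it can always be recovered.
git tag -a submitted-venue-a -m "Version submitted to venue A"

# Branch out for the new venue and keep working there.
git checkout -q -b venue-b
printf '%s\n' '% reformatted for venue B' >> paper.tex
git add paper.tex
git commit -q -m "Adapt format for venue B"
```

Later on, checking out the tag submitted-venue-a brings back exactly what was submitted, while the work for the new venue continues on its own branch.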

File Structure

By default, you will have at least two documents: paper.tex and paper.bib. The former is the full writing and the latter holds the references. I highly advise you to use a set of bibstrings instead of hard-coded journal names. That gives you the flexibility to switch between full and abbreviated names.

Additionally, you can have a folder with your images. The rule of thumb is to commit assets in raw formats (prefer vector images, .svg, over raster ones, and use native tools, like TikZ, to store your drawings). Hence, you will need a routine to convert them into .pdf when using them (see the CI setup).

Refrain from using commit -a. If you need (and want) to use it, set up a .gitignore to catch the auxiliary files.
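A minimal sketch of such a setup follows. The list of patterns is an assumption about what a typical pdflatex, bibtex, and latexmk run leaves behind, not a complete list; the repository is a throwaway one created only for the example.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.org"

# Source files plus the kind of by-products a LaTeX run leaves behind.
touch paper.tex paper.bib paper.aux paper.log paper.bbl paper.fdb_latexmk

# Patterns for common auxiliary files; extend the list to match your toolchain.
cat > .gitignore <<'EOF'
*.aux
*.bbl
*.blg
*.fdb_latexmk
*.fls
*.log
*.out
*.synctex.gz
*.toc
EOF

# Stage files by name instead of relying on commit -a.
git add .gitignore paper.tex paper.bib
git commit -q -m "Add paper sources and ignore auxiliary files"

# Lists anything still untracked or modified; it should print nothing here.
git status --porcelain
```

With the patterns in place, the auxiliary files no longer show up as untracked, so an accidental broad add does not drag them into the history.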

Branching Models

As with code, you should have a master branch that takes care of the deliverables, and your work can happen on branches that grow out of it and merge back into it. However, since we are working with documents, you can break the rule of not committing directly to master and keep things simple. But be aware of the dangers.

For instance, if the paper gets rejected and you need to change the format and venue, having different branches with no subsequent commits may look weird when someone comes and tries to follow your steps on the paper. In that case, a journal or venue branch that holds the evolution of the writing for that venue, and that merges back into master, may be better. Another option is to use tags and mark the commits with the relevant information.

Notice that when writing in parallel with others, you must stick to a branching model for the sections (or features), and then go back into master with recorded merges, so everyone can keep track of what is happening.

Also, when dividing the work, keep in mind that it is easier to edit a section someone else wrote than to do a massive merge of two separately written versions of the same section.
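As a sketch of what a recorded merge looks like, the commands below write a section on its own branch and merge it back with --no-ff, which forces a merge commit even when a fast-forward would be possible. The branch name section/introduction and the file contents are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.org"

printf '%s\n' '\section{Introduction}' > paper.tex
git add paper.tex
git commit -q -m "Start the paper"

# Remember the name of the main branch (master in this post; newer Git setups may use main).
trunk=$(git rev-parse --abbrev-ref HEAD)

# One branch per section being written.
git checkout -q -b section/introduction
printf '%s\n' 'Documents are code compiled by humans.' >> paper.tex
git add paper.tex
git commit -q -m "Draft the introduction"

# Merge back with --no-ff so the merge is recorded in the history.
git checkout -q "$trunk"
git merge -q --no-ff -m "Merge section/introduction" section/introduction

# Show the resulting history, including the merge commit.
git log --oneline --graph
```

The graph printed at the end shows the section branch joining the main branch through an explicit merge commit, which is what lets the rest of the team see when each section came in.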

Continuous Integration and Development

One aspect of writing in LaTeX that doesn't happen with other tools is that you end up having some libraries and scripts of your own that your team may not have. Then, when sharing the code, the same nightmares of setting up a development environment start. For the non-expert, this becomes a barrier, and they brand LaTeX as difficult to work with. I digress, though.

The main problem arises from the lack of reproducible environments. One solution is to use a reusable environment for all involved. That is, you can all share a Docker image, for instance, and all work on it with the data persisted on your computer. If that is too much, at least have a minimal continuous development cycle set up in your repository that is easier for your team to reproduce.

At a minimum, I advise having a PDF creation pipeline so you can check that your document can be built on others' machines. A template that can be used to test the paper's creation using GitLab as your main repository is:

Let’s break it down:

  • The variable FILE (on line 2) is the one that you need to update to make it work in your repository. That variable is used in the rest of the script to reference your paper.

  • The before_script section defines a set of commands that will be processed before executing your stage.

    In this case, the script sets up the journal-list repo (lines 14 and 15), for the custom bibstrings. Other on-demand repos can be cloned and installed into LaTeX in this section.

    Additionally, since we are assuming only raw images will be committed, we need to convert them into .pdf files. To do so, we use inkscape (line 18), and we assume that the images are stored in the images folder (see the structure of the repository for more details).

  • Then the build stage is the one that actually compiles the document (lines 20 to 28).

    The script we use is latexmk (line 23), since it takes care of calling pdflatex and bibtex alternately, as needed. However, any other way of creating the pdf can be used here.

    Also, we need to expose the created pdf. We do this by stating the artifacts path on the stage (lines 24 to 26).

  • To compile and test all these steps, we are using a Docker image named adnrv/texlive (line 21). That is an automated build that creates a minimal version (with different flavors) of the latest TeX Live.
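To try the same steps outside the CI, they can be sketched as a small local script. This is not the GitLab template itself, and the line numbers in the breakdown above do not refer to it. It assumes Inkscape 1.x (whose command line accepts --export-filename; older 0.9x releases used --export-pdf instead), it skips cloning the journal-list repo, and the DRY_RUN switch and the run and build helpers are hypothetical names introduced only for this example.

```shell
#!/bin/sh
set -eu

FILE=${FILE:-paper}        # base name of the main .tex file, as in the CI variable
IMAGES=${IMAGES:-images}   # folder holding the raw .svg sources
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build() {
    # Convert every raw vector image into a .pdf next to its source.
    for svg in "$IMAGES"/*.svg; do
        [ -e "$svg" ] || continue   # skip when the folder has no .svg files
        run inkscape "$svg" --export-filename="${svg%.svg}.pdf"
    done

    # latexmk reruns pdflatex and bibtex as many times as needed.
    run latexmk -pdf "$FILE.tex"
}

build
```

By default the script only prints what it would do, which makes it easy to check on a machine without LaTeX installed; running it with DRY_RUN=0 executes the conversion and the compilation for real.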

Issues and Boards

Now, you have a consistent structure, a branching model to progress, and a continuous integration that lets you know if your paper is working for others or not. The next step is to bring the conversation that normally happens in other places into the repository itself.

On GitLab, we have the advantage of having the issue tracker and the board on the same site. Let's take advantage of that!

To have a useful conversation, create the issues and mark them using the different labels:

  • Board labels:
    • .ToDo is the set of tasks that you (and your team) are planning to develop.
    • .Working is the set of tasks that you are effectively working on right now. Once you finish them, close them or move them to their appropriate label.
  • Issue labels:
    • Discussion tasks are the ones that need attention from the team: they gather points that need to be settled, or information that needs to be conveyed to everyone.

      Remember that the idea of logging the information is that others can come after and check why the paper was in that particular shape, and what arguments led to that decision.

    • Enhancement tasks are related to improving the content of the paper or its related parts (e.g., images or tables).

    • Maintenance tasks are to maintain the repository structure or meta-tasks.

    • Writing tasks are the ones related to actually writing different pieces of the paper. Use them to mark the work on particular sections of the writing. They can involve the creation of epics used to mark a set of tasks.

    • Reviewing tasks are often done after a writing one. They are about updating and refining your writing. You can assign particular reviewing tasks to other team members and state what their revision should focus on.

    • Study tasks are related to reading or checking some other documentation. They may still appear here, but most should be confined to edge cases on your tasks.

The main idea of these labels is to help minimally organize the repository. Feel free to create other tags that may be useful for the particular task at hand, and we may graduate them to the general labels.

Also, remember that the issues are there to help you manage your time, not to consume all of it.

Summing it up

In conclusion, you can treat your writing as code, and then use the best practices of coding to make it work. Using these tools may look bothersome at the beginning, but they become easier and easier to use as you master them.

Remember that others will be working on these files later on. So be kind, and document your work. And, on top of that, document the process too!

If you do it, others will too. They will be glad, and so will you.
