Research Flow

About

Since mid-2014, I have been using Gitlab more and more to host not only my papers and personal projects, but also the code of my research. Then, in 2016, I started moving the research workflow to the platform too (with the aid of issues and boards). Since then, I have been using it with my research team to host the pipelines of research projects, papers, and the flow from research ideas to results.

This post is a summary of my experiences over the past years working with my research group using some kind of “research flow” based on git. It is a first attempt to unify my experiences and the practices (hopefully, good ones) to follow when working in a team and trying to get the best out of it while using Gitlab (or any set of tools that combines versioning with issue boards).

Ideally, this post is to save me some time. It is an explanation to new students on the way of working within our team and our repos. It is a (not so) minimal introduction of our research flow. However, I still think that it may be useful to have it open and get some feedback from others with more and different (or similar) experiences.

The idea

I spend a lot of time explaining to students or other team members a way to work (and do research) based on git and issue boards. Most of the time, I try to follow some good practices from software engineering. But following the whole process may be cumbersome and too slow for a small research team. On the other hand, rapid development with minimal documentation is not well suited when turnover is high and the documentation is the only legacy of the work done on a problem.

The main idea behind this post (guide?) is to use a workflow based on gitflow to perform research. Most of the time, you are doing tasks similar to the ones in a software development project, so the practices of one methodology can be applied to the other.

There are many lessons learned by thousands of development teams, and that knowledge can be useful for research too (or at least we are trying to figure out if it will be useful).

This post intends to motivate the task and set a starting point (by setting some conventions) for working with others on research, using git as a powerful tool to track the progress and the research results. Moreover, the repositories and findings can be published as scratch-pads to share the lessons learned, instead of being thrown away.

Mistakes are more valuable than successes, but only if you share them.

Existing problems

What I do works, why should I keep reading?

Check these scenarios, and let’s see…

Where did it go?

It is incredible that people still do not version their work. But it happens. If you are not versioning your work yet, start using some kind of versioning system, whichever it is.

Imagine the following. You are working on your research project. After hours (days!) of work, you are ready to show it to someone. But first, let’s do a little cleanup. Then, in your carelessness, you type a few simple cleaning commands:

$ ls
awesome_algorithm.cpp awesome_headers.h data outputs results
$ cd output
bash: cd: output: No such file or directory
$ rm -rf *
$ ls
# Now you cry realizing what happened

Yep, wrong command in the wrong directory, plus Murphy’s law. You better have a backup somewhere. And if you back up often, then you may have lost just a few hours of work.
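To make the contrast concrete, here is a minimal sketch of the same accident inside a git repository. The file names mirror the listing above; the throwaway directory, the identity, and the file contents are placeholders I made up for the example. Note that only committed work comes back, so this is an argument for committing often, not a replacement for backups.

```shell
set -e
# Work in a throwaway directory so the example is safe to run anywhere.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

# Some precious work, committed.
printf 'int main() { return 0; }\n' > awesome_algorithm.cpp
mkdir results
printf 'accuracy,0.93\n' > results/run-01.csv
git add awesome_algorithm.cpp results
git commit -q -m "Add first version of the algorithm and its results"

# The accident: the glob does not match .git, so the history survives.
rm -rf ./*

# Restore every tracked file from the last commit.
git checkout -- .
ls
```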

Let’s give it a look

But, I already learned that lesson, and I do regular backups using this other great software.

Well, but what about working with others? You do that from time to time, don’t you?

Again, imagine. You have a meeting with your advisor in a few hours (and, as usual, there is some tweaking of the code in between). You copy your precious work into a special folder to show it:

$ cp -R ./work ./meeting-with-prof
$ cd ./meeting-with-prof
# Hack your code until something pretty comes out
# Show your pretty plots and data

Now, you feel OK because your mess stays hidden in work, the original things are out of sight (because reasons), and no one will steal your ideas (?). And your hard work paid off: you have pretty nice data that your advisor agreed will be Figure 1 in the introduction of your next paper. Yay!

However, you already forgot all the tests you made, and the parameters you used to obtain said data. Moreover, you may have changes in the implementation. What if you were hacking the code even before or during the meeting?

What will you do now with the code? And the results? The pretty plots?

Git to the rescue! (or any version management system for that matter)
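As a sketch of what that rescue could look like: instead of copying the folder, commit the exact state you showed and give it a name with a tag. The tag name, the parameter file, and its values are made up for the example.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

# The state that produced the pretty plots.
printf 'threshold = 0.5\n' > params.txt
git add params.txt
git commit -q -m "Set the parameters used for the meeting plots"

# Name that exact state so it can be found again.
git tag -a meeting-figure-1 -m "State shown to the advisor; source of Figure 1"

# Keep hacking after the meeting.
printf 'threshold = 0.7\n' > params.txt
git commit -q -am "Try a higher threshold"

# Weeks later: which parameters produced Figure 1?
git show meeting-figure-1:params.txt
```

The last command prints the file as it was at the tagged commit, no matter how much the code changed afterwards.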

Versioning

Git is a version control system that keeps track of every change you make to your files. It is really easy to learn, and there is a lot of documentation about it.

But, I like to use this other versioning system? Why should I go that way?

Fair enough. The idea is not for you to switch systems, but to have one (at least). Nevertheless, I like how git interacts with other tools that we can take advantage of. Keep reading.

Basics

If you are new, and do not know about git, I encourage you to read the basics (somewhere else), and to try it for yourself. Before moving on, be sure that you understand and know how to

  • Create a repository
  • Check out a branch or commit
  • Add files
  • Commit
  • Push to remotes
  • Branch
  • Merge
  • Tag
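As a quick self-check, the following sketch runs through every item of the list once. It is only a cheat sheet, not a substitute for a proper tutorial. The "remote" is a local bare repository standing in for a server such as Gitlab, and all names (branches, files, tag) are placeholders.

```shell
set -e
work="$(mktemp -d)"

# A bare repository plays the role of the remote server.
git init -q --bare "$work/remote.git"

# Create a repository.
git init -q "$work/project"
cd "$work/project"
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"
git remote add origin "$work/remote.git"

# Add files and commit.
echo "# Problem-solution pair" > README.md
git add README.md
git commit -q -m "Describe the problem we want to solve"

# Push to the remote.
git push -q -u origin master

# Branch (and check out the new branch).
git checkout -q -b first-idea
echo "Notes on the first idea" > notes.md
git add notes.md
git commit -q -m "Sketch the first idea"

# Check out master again and merge the branch.
git checkout -q master
git merge -q --no-ff -m "Bring the first idea into master" first-idea

# Tag the result and publish both the branch and the tag.
git tag -a v0.1 -m "First working state"
git push -q origin master v0.1
```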

Now that you know the basic commands, let’s move to what we came to do. Let’s talk about a workflow.

Extra lectures

Continuous Integration

I will summarize CI as:

I test, therefore I code; code as often as you test.

You test your code, don’t you? And, can you do it automatically and frequently? Your group may not have an integration environment and a deployment one, but that won’t prevent you from testing. Right? Create tests for your repository and test often.

You can automate the process of testing by adding hooks to your git repository that run when committing or pushing. That will force you to have working code, all the time. Period.
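Here is a minimal sketch of such a hook, assuming a test script called run_tests.sh at the root of the repository. The script name and its single placeholder check are inventions for the example; the only git feature relied on is that an executable .git/hooks/pre-commit that exits with a non-zero status aborts the commit.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

# Placeholder test suite: replace the check with your real tests.
cat > run_tests.sh <<'EOF'
#!/bin/sh
grep -q "return 0" awesome_algorithm.cpp
EOF
chmod +x run_tests.sh

# The hook: git runs it before every commit and aborts if it fails.
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
exec ./run_tests.sh
EOF
chmod +x .git/hooks/pre-commit

# A failing version is rejected...
printf 'int main() { return 1; }\n' > awesome_algorithm.cpp
git add awesome_algorithm.cpp run_tests.sh
if git commit -q -m "Add the algorithm"; then
    first_attempt="accepted"
else
    first_attempt="rejected"
fi
echo "First attempt: $first_attempt"

# ...and the fixed version goes through.
printf 'int main() { return 0; }\n' > awesome_algorithm.cpp
git add awesome_algorithm.cpp
git commit -q -m "Add the algorithm with passing tests"
```

Hooks live in .git/hooks and are not shared when someone clones the repository, so every team member has to install them. That is one reason a server-side CI pipeline is a good complement.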

There are many sites that offer CI tools, and some will even run a Docker container for you to test your code. Take advantage of them.

This is where Gitlab starts excelling for my team. We can use the built-in CI pipelines that interact really nicely with the repositories. Every repo has its own configuration file (.gitlab-ci.yml) that defines what to test and in which order. And, once it is set up, tests happen automatically on every push.
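To give an idea of what that file looks like, the commands below write a minimal .gitlab-ci.yml with two stages. This is a sketch and not our actual configuration: the job names, the gcc image, the compile command, and the run_tests.sh script are all assumptions chosen for illustration.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"

# Write a two-stage pipeline definition at the root of the repository.
cat > .gitlab-ci.yml <<'EOF'
# Stages run in the order listed here.
stages:
  - build
  - test

compile:
  stage: build
  image: gcc:latest
  script:
    - g++ -o awesome_algorithm awesome_algorithm.cpp
  artifacts:
    paths:
      - awesome_algorithm   # hand the binary over to the next stage

unit-tests:
  stage: test
  image: gcc:latest
  script:
    - ./run_tests.sh
EOF

cat .gitlab-ci.yml
```

Once a file like this is committed and pushed, every push triggers the jobs, and a failing script line marks the pipeline as failed.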

Disclaimer. Doing research on computer vision and machine learning with CI is hard for us, since we don’t have a scalable infrastructure that allows us to run Docker containers with TensorFlow and other software on demand. But that is another issue for another time.

Let’s talk repos

When working in research, most probably you are not alone. At the minimum, you have an advisor (right?). And even if you are alone now, you most probably want to share your ideas and findings with someone else (so you won’t be alone in the long run).

Sharing with others creates the need for a structure that lets you present your ideas in a nice and clean way. Thus, you need to handle each idea on its own merits. In our convention, we use one repository per problem-solution pair. In general, our team works with two types of repositories:

  • Research (from the development of an idea to the production of code)
  • Writing (papers, projects, or theses)

From ideas to code

Each repository will hold the work done and the research performed to solve a given problem using a given method. If you want to test a different solution, create a new repository. When I say “solution,” I’m talking about a particular type of algorithm (or set of algorithms) that work together to solve that problem. However, you may have several ideas that spawn different instances of that solution. In our case, we keep the instances of the same solution together to

  • compare and benchmark them;
  • iterate on previous instances and improve upon them;
  • learn from previous mistakes; and
  • give new members who join a problem-solution pair a clear starting point (assuming that documentation and proper integration are happening).

Finally, you want to develop within the same repository, but you want to keep the work separated and organized. That is why you need a convention on branches to create and manage your work.

The idea behind this separation is to keep it clean. Maybe you reach a dead end now, but later we can revisit what we tested. Moreover, remember that you may be working with others. In a lab, an idea that failed now may flourish later in the light of new ideas, findings, and time (especially time).

Additionally, having a place where someone can explore an idea from its inception all the way to a final set of results (and the experiments that produced them) will help immensely. You can do reproducible research and, more importantly, do it in the open. Others can check your work, and build upon it. The process is more than its tasks.

Disclaimer. Currently, in my team, the repos that go from idea to code are closed to the public but open internally. However, my goal is to open them and do our research in the open. Nevertheless, the problems of having open code and the paranoia (mostly paranoia) of a small team are holding us back. This is one of the problems we need to address in order to open the research and bring the “research flow” to the next level.

You have results and want to write the paper? It is time for a new repository!

Writing repos

The other type of work we frequently do is writing. Our pipeline for writing follows the previous ideas of continuous integration more closely. Let me go into detail.

We write in (La)TeX and friends. (If you don’t know it, I highly recommend it for formal documentation, and even for documents that can be automated or have repetition in them. It is really good.) In summary, (La)TeX is a language that lets you code the style once and then worry only about the content of your document (like CSS and HTML), but the output is a PDF. That is an oversimplification of its capabilities, but you get the idea.

Since we write collaboratively, we need a way to keep track of the changes in the document, and a way of knowing that what we write does not break the build on other computers.

But, I have a lot of experience working with Word, and they have a versioning system there. So…

If you say that, you don’t know the pain of writing a thesis in Word. Or of writing with a lot of references in different styles, or without programmable macros. Here is where the beauty of having a versioning and integration system comes in.

Our pipeline involves a Docker container that has all the software ((La)TeX and friends) installed, and that builds the documents on demand whenever changes are pushed. That is, if in your writing frenzy you also coded something that changes the layout or some other aspect of the structure of the document and forgot to share it with the team (through the repo), you get a message that your code is not compiling. Then, you can promptly check it and fix the error. In contrast, before we worked this way, a string of emails would follow explaining that the code was not compiling on someone else’s side, and a war would start over who does not know how to use the tools. Only to realize, after some iterations, that a library had changed and needed to be updated too. Oh! The good old days.
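A sketch of what such a writing pipeline could look like is shown below. It is not our exact setup: the texlive/texlive image, the main.tex file name, and the job name are assumptions for the example. The important parts are that latexmk stops with an error when the document does not compile (which fails the pipeline), and that the resulting PDF is kept as an artifact.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"

# Write a one-job pipeline that builds the paper on every push.
cat > .gitlab-ci.yml <<'EOF'
build-pdf:
  image: texlive/texlive:latest
  script:
    # Fail the job on the first LaTeX error instead of waiting for input.
    - latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
  artifacts:
    paths:
      - main.pdf   # the latest PDF can be downloaded from the web interface
EOF

cat .gitlab-ci.yml
```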

Moreover, when I’m working with students on their theses or papers, the documents are built on the web, so we always have the latest version ready for discussion or for showing around. Similar things happen with projects (although it is harder to introduce this type of workflow to colleagues).

On top of having the PDFs built and available, we can have a pipeline for following the work and tasks. More about that below.

Branching Model

Now that we have repositories, we need a way to create the work in a way that is easy to manage. In the following, I present a rather simplistic way of working based on previous gitflow approaches.

Development (master)

The core of each repository should be the master branch, which acts as the development (or active) branch. The main implementation and active code (the one that actually compiles and runs) should be there, that is, the code that represents (or summarizes) the working solution of your problem (remember the problem-solution pair).

As the problem gets solved, code will be produced, either through small and rapid iterations within master itself, or through longer development cycles using feature branches that will be merged back into master.

Since we are not doing traditional software development cycles, some of our ideas may not work and will end up as dead branches that are not merged back. In that case, what you should do is to write a postmortem report (probably as a .md file within the same branch, which is linked on the README.md or is easy to access). The idea of a postmortem is to understand what went wrong and where. If you can shed light on your work, others may come later with a new vision and may try to solve the problems you faced, or find new ways of tackling the problem. That is where the open research part kicks in.
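One possible way to follow this convention is sketched below. The branch name, the file name POSTMORTEM.md, and its three headings are my own choices for the example and not a fixed rule.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

echo "# Problem-solution pair" > README.md
git add README.md
git commit -q -m "Describe the problem-solution pair"

# An experiment that will turn out to be a dead end.
git checkout -q -b experimental-idea
echo "threshold = 0.5" > params.txt
git add params.txt
git commit -q -m "Try a fixed threshold"

# Document the failure inside the dead branch itself.
cat > POSTMORTEM.md <<'EOF'
# Postmortem: experimental-idea
## What we tried
## What went wrong
## What could be tried next
EOF
git add POSTMORTEM.md
git commit -q -m "Explain why the fixed threshold failed"

# Back on master: do not merge, only leave a pointer in the README.
git checkout -q master
echo "- Dead end: see POSTMORTEM.md in branch experimental-idea" >> README.md
git commit -q -am "Link the postmortem of experimental-idea"

# Anyone can read the report without switching branches.
git show experimental-idea:POSTMORTEM.md
```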

Document all the things!

Features

The idea of using features is to allow you (and your team) to do parallel development and track it. The parallelism may come from several ideas thrown out by your advisor and team in the last meeting, from cool papers you are reading that may fit your problem/solution, from fixes that you realize need to be done, or from work that your team is doing on the same problem, among many other sources. The point is that some parallel work will happen, and you need to be prepared.

So, when should I do rapid commits versus a branch?

While coding:

  • When you are working on different ideas and experimenting, but not sure if that will be a good thing to put into master, create an experimental (feature) branch.
  • When you are doing work on a repository that may take a while (that is several commits and work that may break master), create a long-development (feature) branch.
  • When you are fixing some code that is not OK on the master branch, create a hotfix or fix (feature) branch.

And while writing:

  • When you are working collaboratively on parts of a paper that may conflict too much, create a section (feature) branch.
  • When you are writing your thesis, and advance on several chapters simultaneously, create a chapter (feature) branch.
  • When working on complex figures that may demand a long time and may mess up others’ work, create a figure (feature) branch.

Do you see my point? The idea is to spread the work and not interfere with others’ work. Some of today’s platforms (like Gitlab) have merge requests that can merge your code back automatically once the tests pass (remember CI).

Merging

All these branches will branch from master into specifically named ones (for example, topic-one, fix-042, issue-137, and so on), and will return to master when the work is over.

Now, there is a nice debate on whether to use merge or rebase. The Git documentation has a nice description of this matter:

One point of view on this is that your repository’s commit history is a record of what actually happened. It’s a historical document, valuable in its own right, and shouldn’t be tampered with. From this angle, changing the commit history is almost blasphemous; you’re lying about what actually transpired. So what if there was a messy series of merge commits? That’s how it happened, and the repository should preserve that for posterity.

The opposing point of view is that the commit history is the story of how your project was made. You wouldn’t publish the first draft of a book, and the manual for how to maintain your software deserves careful editing. This is the camp that uses tools like rebase and filter-branch to tell the story in the way that’s best for future readers.

Now, to the question of whether merging or rebasing is better: hopefully you’ll see that it’s not that simple. Git is a powerful tool and allows you to do many things to and with your history, but every team and every project is different. Now that you know how both of these things work, it’s up to you to decide which one is best for your particular situation.

My point of view, since we are doing research (in contrast to pure software development), is that the notes and scratchpad of researchers are as valuable as the final product. Thus, having the history of what happened, and which features (branches) did or didn’t work, is a really nice thing for the community. Not only having the good results but the bad ones too. We learn more from our failures than from our successes. So, make a lot of mistakes but document them! (But remember my disclaimer.)

For starters, the convention asks for merges instead of rebases. Do the merges with the no-fast-forward option (--no-ff), so that every merge creates a merge commit that marks the point where the branch entered master. However, change the default message to an informative one that explains what you are merging into the repository. As with other commits, do not write about the contents; rather, talk about your intentions.
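A minimal sketch of this convention, using one of the branch names from the examples above (the file and the commit messages are placeholders):

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

echo "learning_rate = 0.1" > model.txt
git add model.txt
git commit -q -m "Add the baseline model"

# Work on the feature in its own branch.
git checkout -q -b issue-137
echo "learning_rate = 0.01" > model.txt
git commit -q -am "Lower the learning rate"

# Merge back with --no-ff so a merge commit is always created,
# and state the intention of the merge in the message.
git checkout -q master
git merge -q --no-ff -m "Merge issue-137 to stabilize training with a lower learning rate" issue-137

# The graph shows where the branch left master and where it came back.
git log --oneline --graph
```

Without --no-ff, git would fast-forward master in this situation, and the history would not show that these commits ever lived in their own branch.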

Most of the time, we want to keep track of all the features for later use and reference. But do not be afraid of cleaning your repo: it is OK to delete unnecessary branches.
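The following sketch shows one way to clean up without losing the history. Deleting a merged branch with -d is safe because git refuses when the branch is not merged. For a dead branch, the archive tag is my own suggestion (not part of the convention above): it keeps the commits reachable after the branch name is gone. Branch, tag, and file names are placeholders.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master on any Git version
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

echo "baseline" > notes.txt
git add notes.txt
git commit -q -m "Add baseline notes"

# A fix that gets merged: its branch can go away safely.
git checkout -q -b fix-042
echo "fixed typo" >> notes.txt
git commit -q -am "Fix a typo in the notes"
git checkout -q master
git merge -q --no-ff -m "Merge fix-042 to correct the notes" fix-042
git branch -q -d fix-042      # -d refuses to delete unmerged work

# A dead end that was never merged: tag it first, then force-delete.
git checkout -q -b topic-one
echo "idea that did not work" > idea.txt
git add idea.txt
git commit -q -m "Try the first topic"
git checkout -q master
git tag -a archive/topic-one -m "Dead end kept for reference" topic-one
git branch -q -D topic-one    # -D deletes even if unmerged

git for-each-ref --format='%(refname:short)' refs/heads refs/tags
```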

Other branching models

There are several branching models that can be adapted or used. The previous ideas are adaptations of “A successful branching model” and the “Gitlab Flow.” However, there are several opinions and other models that support other paths, cf. a not-so successful branching model or the simpler Github Flow.

Tracking progress

By now, you have knowledge of version control systems, you have a solid setup, and everyone in your team is in sync by following the same rules on branching and merging. Cool! However, how will anyone know what to do? And when to deliver the results? And what if you miss the next deadline?

A lesson from software engineering is to know which features to develop, which bugs to solve, and when to deliver them. You need a schedule and a way of tracking it!

The existing platforms (the most common ones: Gitlab or Github) support issue tracking systems embedded into the repositories. So we will take advantage of them.

Milestones

First, you need to set your goals, and by “you” I mean you and your team. Set realistic things that need to be delivered. What are you expecting to see to measure your advancement in the solution to the problem? What is the expected outcome for your next paper?

If your answer is “a piece of code that solves my problem,” think again. To reach the final “code” that solves your problem, you may need to have done several things before. These “things” are the different milestones that you need to hit before reaching your final objective.

Don’t be afraid to change and modify the road map along the way. But you do need one at the start.

But, I like coding, and writing and planning what I want to do is a waste of my time. I can spend that time doing more code.

Think again. A good plan can save you a lot of time on wasted development. You need to carefully think about what to code, and how to test it. If you spend time doing your homework and do some research on what others have done, and how they did it, you can save a lot of time. Thus, your milestones will involve reading tasks, writing (documenting the project) tasks, coding your solutions, coding tests, running experiments on your code, and writing up the outcomes and explanations of the code and the executions.

Issue boards

Now you have repositories. You are regularly versioning your code. You even have branches and you are collaborating with others writing code and papers. But, how are you coordinating with them? Do you send mails to divide the work? What about the follow-ups? Do you set everything in the research meeting?

Introducing issue boards. An issue board is just a project management tool (ideally integrated with your repositories).

How is that different from having emails and using chats to communicate?

First, it provides a log of the process. When new members join the team or the project, they can see the decisions made and take actions based on that. In contrast to mailing, where discussions are lost for the newcomers, having a log available for others creates an open context (this will play particularly well if we want to push the open research ideas). Additionally, you can track milestones using these tools, and attach meta-information regarding commits, merge requests, and other goodies from your coding time.

Normally, we operate with a single development board (either for coding or writing). However, in theory, you may have as many boards as needed. I have a use case that I haven’t implemented successfully for every type of repo, yet. More on that soon.

Our boards have basically four stages:

  • Backlog: where you have all your issues
  • .To-Do: where you have all the issues you are planning to work on this development cycle.
  • .Working: where you have all the issues that you are actively working on.
  • Closed: where all the closed issues live.

Notice that I use a simple marker, a leading dot (.), to differentiate board labels from “normal” issue labels. Again, this will be handy if you have a set of boards that you use regularly, so newcomers can understand and differentiate how to use them.

A normal use scenario is to plan your week/month/cycle of work, bring in the particular issues you want to work on, and move them accordingly within the different lists. The idea is that if one person is in charge of a particular list, they can consume the tasks from that list without further notice. Also, another person who comes into the project can see that some tasks were moved back and forth between .Working and .To-Do, or some other strange behavior that may give clues on how to improve the process or to understand the intricate nature of that particular issue.

A particular case of multi-boarding arises when writing. When we are working collaboratively there are two dimensions within the issues: one for the writing process (drafting, writing, editing, and closing), and another for the development of the document (to do, doing, and closing). In this case, having two boards may help to distribute the work and understand how particular tasks (that may be tied to topics to be developed within the document) evolve in each of their dimensions.

Issues

The boards are composed of issues or tasks. A task is the atomic unit of measure that you will be using to measure your progress in the project. The tasks will vary from project to project, but the main idea is to make them a unit of work that may be composed of several subtasks to work on. The issues are created to address particular problems within the project and to move it forward.

You can take advantage of the issue tracking system, and use the issues to track other tasks (not necessarily code or writing) to reach your goals. So, divide the big milestone into the different tasks that need to be done to reach it. (You will need a labeling system to organize your work.) A simple measure of progress is the percentage of closed tasks within the milestone.
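The platforms show this number on the milestone page, but the arithmetic is simple enough to sketch. The function name and the counts below are made up for the example; the result is rounded down to a whole percentage, and an empty milestone counts as 0.

```shell
# Percentage of closed issues in a milestone, rounded down.
# Usage: milestone_progress CLOSED TOTAL
milestone_progress() {
    closed=$1
    total=$2
    if [ "$total" -eq 0 ]; then
        echo 0          # an empty milestone has no progress to report
        return
    fi
    echo $(( closed * 100 / total ))
}

# Example: 3 of the 8 issues in the milestone are closed.
milestone_progress 3 8    # prints 37
```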

Depending on your time and the distribution of work, you will end up with several tasks (issues) that will lead to the final milestone. Moreover, take advantage of the commenting system and localize the conversations among your team to make progress and to adapt the road-map as needed.

Labels

We need a taxonomy to relate tasks and issues and be in sync. The following covers most cases that my team has been using.

Research, paper studying, and writing:

  • Study: understanding papers, code, or documentation.

    This type of task must produce an artifact (that is, something tangible as a result). Don’t get caught in the trap of “I read a lot of papers, so my work is done.” Show it (somehow).

    My suggestion is to exploit the wiki pages or the markdown files within a documentation folder on the development platforms. Each time you have a study task it should produce an entry or enhance it. Again, the main idea of using the repository as a research scratch-pad is to share all your findings (that includes what you read).

    The advantage of the wiki is that it can be cloned and shared (commonly it is your-repo.wiki.git).

    Disclaimer. I haven’t found a way to make this type of task work consistently. There is another type of repo (that I didn’t talk about before): the literature review. The idea is to create a scratch-pad of research summaries on particular topics. When someone wants to enter a topic, they can easily catch up using this condensed and processed material. However, that hasn’t worked. People just stop writing. My assumption is that there is no instant or apparent return on investment. Having the information from papers available would help everybody, but it is not happening, so I’m accepting suggestions on how to make this particular part work.

  • Write: document features, scenarios, or other related topics of the solution. This label appears when describing results as well. Basically, this type of task produces artifacts, similarly to study tasks.

  • Review: reviewing existing writing, mostly tasks based on editing. These tasks appear more commonly within the writing repos.

Code and development related:

  • Development: coding features and adding new functionalities. Exploring new algorithms or ideas.

  • Enhancement: coding enhancements on existing features. Ideally, we make the distinction to keep track of what are new pieces of code and what are improvements. That also helps to preempt tasks that are not necessarily critical in a particular pipeline.

  • Test: related to creating and executing tests (or experiments).

  • Bug: reporting a bug, and eventually working on a fix to the existing solution.

Discussions:

  • Discussions: suggestions made about the solution-problem relation, or a discussion about a particular solution. Commonly, these are used in tasks that become controversial, or in tasks created for planning the next stages, when the discussion should be logged.

Repository maintenance:

  • Maintenance: work dealing with the repository or the project tools. Common tasks are cleaning up the repository and preparing the boards or the CI environments.

Reports

After working, you have some results you want to show to your team. The main idea is to keep the knowledge flowing and reduce the overhead for the team.

Documenting what you are doing is as important as (or even more important than) doing the thing itself. The difference between taking over a tidy, documented project and an undocumented, messy one is huge. You can get up to speed and work much faster on the former than on the latter.

If you did:

  • a study task, you want to share your results (the paper you read and its main ideas and findings) with your team or future readers. Thus, the goal of that task should be to create a report that shows what you learned;
  • a development task, you want to explain the main idea of your features (normally an algorithm, test code, or experiment implementation) as documentation. That means that development and write tasks should go together; or
  • a test (or better, experiment) task, you want to explain your findings and share what you found in your work. You want to share your graphs, figures, and insights with your team.

To do so, you should end up creating a short document that clearly explains your work, findings, and conclusions. More importantly, assume that anyone can take a look at your writing. Hence, you need to be careful to link to corresponding bodies of work, previous results, or related work within your repository or others.

Conclusions

The previous flow is a collection of practices that I have been picking up from several sources, and an attempt to formalize a research process to create a culture of reproducibility and persistence. There are several challenges that we need to overcome that may be different from those of more permanent development teams. For instance, we have a high turnover rate. In this case, having this documentation will be super handy. However, the same turnover creates the problem of how to maintain a culture of work.

Some parts of the flow work better than others (as seen by the number of disclaimers). I’m still exploring and experimenting with the use of these tools and the flow itself, to reduce overhead and improve effectiveness. Sadly, I don’t have metrics to compare and evaluate the process (maybe that is a good next step to look into).

Do you have similar experiences? Do you use these types of workflows? Do you have these problems? Did you solve them? How?!
