Git and Jupyter Notebooks: The Ultimate Guide
Version controlling your Jupyter notebooks with Git has many advantages, but it can be a bit tricky. In this guide, we show you the best practices, workflows, and tools to make Jupyter Notebooks play nicely with Git, GitHub, and Bitbucket.
Git ↔ Jupyter Challenges
Let's briefly list the challenges of using Git with Jupyter Notebooks:
- Notebook Git diffs are hard to review
- Resolving notebook merge conflicts is painful
- Large notebooks fail to render on GitHub
- Notebook code reviews & collaboration are tough
In this article, we'll look at each of these problems & suggest a solution for each.
Using Git with Jupyter
First, let's get the basics out of the way. There are 2 main ways to perform Git operations (clone / pull / push) for Jupyter Notebooks -
If you're new to Git, the above articles give you a step-by-step guide to version controlling notebooks on GitHub. If you're an experienced Git user, just note that the basic Git commands remain the same regardless of the notebook's .ipynb file type.
Git Diffs for Jupyter Notebooks
We use Git diff to see the difference between two versions of the same file. This could be:
- Local Notebook Diff: To review the changes you’ve made on your local machine before committing your work and sharing it with others.
- Commit & PR Diff: To review the changes already pushed upstream in the context of a pull request or commit.
1. Local Notebook Diff
You can use the git diff command to view the difference between the current state of the file you are working on and the most recently checked-in version of the same file. You might use Git diffs to review the changes before committing or to compare different versions of a notebook.
A plain git diff shows you a textual diff for all changed files:
Some issues with git diff output for Jupyter Notebooks:
- Notebooks are JSON under the hood, so git diff shows raw JSON changes without markdown or output rendering, code syntax highlighting, etc.
- Raw diffs of rich output (images, plots, widgets) are impossible to review, e.g. see the encoded image/png data in the above diff.
- Notebook diffs have a lot of noise due to unintended metadata changes which you might not care about (e.g. execution_count).
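To make the noise problem concrete, here is a minimal sketch using only the Python standard library. It assumes the usual notebook JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"); the helper names strip_noise, text_diff, and make_notebook are hypothetical and exist only for this illustration.

```python
import copy
import difflib
import json


def strip_noise(notebook: dict) -> dict:
    """Return a copy of a notebook dict without outputs or execution counts."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def text_diff(old: dict, new: dict) -> list:
    """Unified diff of two notebook dicts serialized as indented JSON."""
    old_lines = json.dumps(old, indent=1, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=1, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))


def make_notebook(execution_count: int, source: str) -> dict:
    """Build a tiny one-cell notebook (illustrative data only)."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "metadata": {},
            "outputs": [],
            "source": [source],
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


# Same code, simply re-run: only the execution counter differs.
before = make_notebook(1, "print('hello')")
after = make_notebook(7, "print('hello')")

raw_diff = text_diff(before, after)          # reports the counter change
clean_diff = text_diff(strip_noise(before), strip_noise(after))  # empty
```

The raw diff reports a change even though the code is identical, while the diff of the stripped copies is empty. This is only a way to see where the noise comes from; it is not a replacement for the notebook-aware tools described next.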
Solution → nbdime
You always want to review local notebook diffs before pushing the changes. But as we saw above, it's quite painful to review notebook diffs with plain git diff on the command line.
One tool that solves this is nbdime. It's a dedicated diff & merge library for Jupyter Notebooks. You can install nbdime and review your local notebook changes in a rich, rendered diff format.
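As a rough sketch of how this could be driven from Python: nbdime installs the command-line tools nbdiff (terminal diff) and nbdiff-web (rendered diff in the browser), each of which accepts two notebook files. The wrapper functions nbdiff_command and run_nbdiff below are hypothetical helpers, not part of nbdime, and the file names in the usage note are placeholders.

```python
import shutil
import subprocess


def nbdiff_command(base: str, remote: str, web: bool = False) -> list:
    """Build the nbdime command that compares two notebook files."""
    tool = "nbdiff-web" if web else "nbdiff"
    return [tool, base, remote]


def run_nbdiff(base: str, remote: str, web: bool = False) -> bool:
    """Run the diff if nbdime is installed; return False when it is not."""
    command = nbdiff_command(base, remote, web)
    if shutil.which(command[0]) is None:
        # nbdime is missing; it can be installed with `pip install nbdime`.
        return False
    subprocess.run(command, check=False)
    return True
```

For example, run_nbdiff("old.ipynb", "new.ipynb", web=True) would open the rendered diff in a browser, assuming both files exist. nbdime also documents a `nbdime config-git --enable --global` command that makes regular git diff notebook-aware.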
Solution → JupyterLab Git Extension
Another option is to use the JupyterLab Git extension to review local notebook diffs. This extension uses the nbdime library underneath, but its key advantage is the ability to review local notebook diffs directly in the JupyterLab UI with the click of a button.
2. Commits & PR Diff
nbdime is great for reviewing local changes, but once you push the changes to a remote repository, your colleagues might want to review your notebook changes & provide feedback.
This is where rich notebook diffs on GitHub are super helpful. Your teammates can simply go to GitHub & open the commit or pull request page to review the notebook changes. While useful, the rich notebook diffs on GitHub have some limitations:
- One can't write comments on rich notebook diffs. So your teammates can't give feedback on your notebook changes.
- GitHub rich diffs fail on large notebooks, even if the actual differences you are reviewing are small. So if your notebooks are large, the GitHub diffs won’t be of much use.
- GitHub doesn’t display interactive widgets or plots. So if your notebook includes interactive visualizations from Plotly or Bokeh, or an ipywidget, it won’t render on GitHub.
Solution → ReviewNB
We saw the limitations of GitHub's rich notebook diff support above. One way to work around them is to use ReviewNB. It's a dedicated app for Jupyter Notebook code reviews. It helps you review notebook pull requests on GitHub & Bitbucket.
You can see rich diffs and write comments on any notebook cell. Unlike GitHub, ReviewNB offers in-line commenting on rich diffs, works with large notebooks, and renders interactive plots & widgets.
Handling Notebook Merge Conflicts
A common Git workflow is:
- Create a new branch, usually from main.
- Work in the branch.
- When your work is ready (usually after a pull request reviewed by your colleagues), merge the new branch back into the main branch.
Sometimes during this process, you may encounter a merge conflict. Merge conflicts usually happen when the branch you are trying to merge back into has changed, causing a conflict that Git needs your input to resolve.
For Jupyter notebooks, merge conflicts can be caused by code changes, but they can also be caused by notebook metadata. Metadata conflicts can be particularly difficult to visualize and fix. You might first notice a merge conflict on a pull request page:
Or on the command line following a merge attempt:
If you start Jupyter and try to open a notebook in this state, you’ll get an error message saying that the notebook is unreadable. It's because the Git conflict markers make the notebook JSON invalid.
Solution → Text Editor
Small notebook git merge conflicts can be resolved without any extra tooling, using either the GitHub UI or any text editor of your choice.
- Use the git status command to check which files have merge conflicts.
- Open the notebook in a text editor like Notepad, and you’ll see why Jupyter can’t read it: The notebook is no longer in standard JSON format because of the merge-conflict text.
- To find the conflicts, search for the conflict markers <<<<<<<, =======, >>>>>>>.
- Manually edit the JSON text to fix the conflict—you can choose the original version, the new version, or change the text to be some combination of both.
- Remove the conflict markers when you have fixed the conflicting contents.
- Mark the conflict as resolved by staging the notebook with git add, then commit it.
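The search-and-verify part of the steps above can be sketched in a few lines of standard-library Python. The function names find_conflict_markers and is_valid_json are hypothetical, and the two strings are made-up stand-ins for a conflicted and a resolved notebook file.

```python
import json

CONFLICT_PREFIXES = ("<<<<<<<", "=======", ">>>>>>>")


def find_conflict_markers(text: str) -> list:
    """Return (line_number, line) pairs for lines starting with a conflict marker."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(CONFLICT_PREFIXES)
    ]


def is_valid_json(text: str) -> bool:
    """Check whether the text parses as JSON at all."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Made-up example of a notebook file after a failed merge.
conflicted = """{
 "cells": [],
 "metadata": {},
<<<<<<< HEAD
 "nbformat_minor": 4,
=======
 "nbformat_minor": 5,
>>>>>>> feature-branch
 "nbformat": 4
}"""

# The same file after keeping the incoming version and removing the markers.
resolved = """{
 "cells": [],
 "metadata": {},
 "nbformat_minor": 5,
 "nbformat": 4
}"""

print(find_conflict_markers(conflicted))  # marker lines 4, 6 and 8
print(is_valid_json(conflicted), is_valid_json(resolved))
```

Running the same check on the resolved text before committing is a cheap way to catch a stray comma or brace. Note that parsing as JSON is a necessary condition, not proof that the file is a well-formed notebook.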
Solution → nbdime / nbdev / JupyterLab Git Extension
Small (1-2 line) conflicts can be resolved easily in a text editor, but large conflicts are difficult. The biggest challenge is ensuring that the end state is valid notebook JSON. Miss a comma or a double quote or a closing brace, and suddenly your notebook is not valid anymore.
If your merge conflict is more than a line or two, you would do better to use a tool like nbdime, nbdev, or the JupyterLab Git extension. Here's a step-by-step guide on resolving merge conflicts using the JupyterLab Git extension:
- When we try to merge conflicting git branches in the JupyterLab Git extension, the merge fails midway & the offending notebook appears in the “Conflicted” section of the left-hand Git panel (e.g. see NASA_Sea_level.ipynb in the screenshot below).
- Double-click on the notebook to view the conflict. You will see the three versions: Current, Incoming & the Common Ancestor.
- Change the bottom cell to manually resolve the conflict. Either choose one of the three versions above or some combination thereof.
- Click on “Mark as resolved”. You will see the notebook file move from the “Conflicted” section to the “Staged” section. It’s ready to commit. Use the bottom commit box to create & push the commit, and you’re done!
Large Notebook Rendering on GitHub
GitHub correctly renders small Jupyter Notebook files & diffs. Unfortunately, GitHub fails to render large notebooks, showing errors like “Unable to render rich display” or “The notebook took too long to render”.
Why does notebook rendering fail on GitHub?
There are two reasons for this:
- GitHub rich diff functionality uses nbdime under the hood. And nbdime diffs can be very slow for large notebooks (as mentioned by others here and here).
- For performance reasons, GitHub imposes a 5-second limitation on rich diff rendering. So in cases where nbdime can’t render a diff in 5 seconds, GitHub won’t render it at all.
This tells us why notebook diff rendering fails on GitHub. We assume similar reasons exist for the failure to render standalone notebooks. The bottom line is that, at GitHub's scale, they don't want to wait tens of seconds for files to render on their platform.
Solution → NBviewer
NBviewer is a free online service from the Jupyter project that renders static Jupyter Notebooks from URLs. It also lets you browse the structure of a GitHub repository and select specific notebooks for viewing.
NBviewer requires your notebooks to be publicly accessible. If you're operating within a private GitHub repository, NBviewer cannot be used.
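As a small illustration, the link for a notebook in a public GitHub repository can be built from the repository coordinates. The URL pattern below is an assumption based on the links nbviewer.org shows for GitHub-hosted notebooks, and nbviewer_url, example-user, and example-repo are hypothetical names.

```python
from urllib.parse import quote


def nbviewer_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build an nbviewer link for a notebook in a public GitHub repository.

    Assumed pattern: https://nbviewer.org/github/<owner>/<repo>/blob/<ref>/<path>
    """
    return "https://nbviewer.org/github/{}/{}/blob/{}/{}".format(
        owner, repo, quote(ref), quote(path)
    )


# Placeholder repository and notebook path, for illustration only.
url = nbviewer_url("example-user", "example-repo", "main", "notebooks/Sea Level.ipynb")
print(url)
```

Spaces and other special characters in the notebook path are percent-encoded so that the link stays valid.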
Solution → Binder
Binder is another Jupyter subproject that offers a free service enabling the sharing of executable notebooks on an online platform. To utilize Binder, simply provide the path to your repository, and it will generate a JupyterLab instance for you.
Binder is particularly advantageous when your notebook includes interactive elements like plots or widgets. GitHub's notebook viewer disables such interactivity for security reasons. Like NBviewer, mybinder.org is limited to public repositories. If you wish to access notebooks from private repositories, you will need to deploy your own BinderHub.
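A launch link for mybinder.org can be assembled in a similar way. This is a sketch that assumes the v2/gh URL scheme and the labpath query parameter that the mybinder.org form generates for GitHub repositories; binder_url and the repository names are hypothetical.

```python
from urllib.parse import quote


def binder_url(owner: str, repo: str, ref: str = "HEAD", notebook_path: str = "") -> str:
    """Build a mybinder.org launch link for a public GitHub repository.

    Assumed pattern: https://mybinder.org/v2/gh/<owner>/<repo>/<ref>?labpath=<path>
    """
    url = "https://mybinder.org/v2/gh/{}/{}/{}".format(owner, repo, quote(ref, safe=""))
    if notebook_path:
        # Open this notebook in JupyterLab once the environment has started.
        url += "?labpath=" + quote(notebook_path, safe="")
    return url


# Placeholder repository and notebook path, for illustration only.
print(binder_url("example-user", "example-repo"))
print(binder_url("example-user", "example-repo", notebook_path="notebooks/demo.ipynb"))
```

The ref can be a branch, tag, or commit. A self-hosted BinderHub would use its own host name in place of mybinder.org.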
Solution → ReviewNB
ReviewNB is a SaaS for Jupyter Notebook code reviews & collaboration on GitHub & Bitbucket. ReviewNB can render large notebook files that GitHub won't render natively. Below we see the ~30 MB notebook that GitHub couldn't render natively, rendering fine in ReviewNB.
ReviewNB can be used with both public and private repositories. ReviewNB also supports Google-Doc style commenting & discussion on Jupyter notebook cells.
Lastly, ReviewNB also shows you rich diffs for commits & PRs with the ability to write comments on the rich diff. It also renders all interactive elements like plots & widgets, making it easy to review cell outputs.
During our exploration of using Jupyter notebooks with Git/GitHub, we focused on addressing specific challenges such as Git diffs, merge conflicts, code reviews, and rendering large notebooks on GitHub. For each of these problems, we covered tools that address it, such as nbdime, the JupyterLab Git extension, NBviewer, Binder, and ReviewNB.
Whether you are new to Git or looking to enhance your Jupyter Notebook collaboration, this ultimate guide serves as a valuable resource for maximizing productivity and efficiency in your data science projects.