Adam Bronte

Thoughts, stories, and data.

05 Jul 2021

Running Jupyter Notebooks in Production

Jupyter notebooks have quickly become an industry standard in data science and engineering. Their REPL-style nature fits these workloads perfectly, since you tend to iterate on a single code block at a time. Pretty much every major cloud provider has its own hosted version of Jupyter.

Using Jupyter in a development environment is great, but what happens when you need to go to production? As you may already know, Jupyter notebooks aren't regular source files. A notebook is one big JSON file containing each cell's code as plain text. Since you can't run notebooks directly, what are the options?
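To make that concrete, here's a minimal sketch of the .ipynb structure. The field names are part of the notebook format; the cell contents are just illustrative.

```python
import json

# A minimal notebook: each cell stores its source as a list of strings.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A heading, stored as plain text\n"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["x = 1\n", "print(x)\n"],
        },
    ],
}

# Joining a code cell's source list gives you back runnable code.
code = "".join(notebook["cells"][1]["source"])
print(code)
```

On disk this is exactly what `json.dumps(notebook)` would produce, which is why a notebook can be parsed with nothing but the standard library.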

Nbconvert

Nbconvert is a library for converting and executing Jupyter notebooks programmatically. It's a great off-the-shelf solution that covers most use cases, and it has some nice features like writing cell outputs back into the notebook.
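For reference, executing a notebook through nbconvert's Python API looks roughly like this. This is a sketch assuming nbconvert, nbformat, and a Python kernel are installed; it builds a throwaway notebook in memory rather than loading one from disk, and `timeout` and `kernel_name` are the usual knobs.

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Build a tiny one-cell notebook in memory for the demo.
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("x = 40 + 2\nprint(x)"))

# Run every cell with a 10-minute per-cell timeout.
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})

# The executed notebook now carries each cell's outputs.
print(nb.cells[0].outputs[0]["text"])
```

To persist the executed notebook (outputs included), you'd follow this with `nbformat.write(nb, some_path)`.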

I used nbconvert to run my notebooks in production for a long time but eventually ran into some issues. The main one is that nbconvert doesn't provide clean stack traces when an error occurs, which makes debugging difficult. The other is that it batches up all the stdout while running the cells, so if you print a lot you can run into issues. This behavior also made it hard to follow the notebook as it was being executed.

Papermill

I haven't actually used Papermill, but it seems like a promising project. It looks very similar to nbconvert but adds the ability to parameterize notebooks. I think this is a project to keep an eye on.

Run the notebook yourself

At the end of the day, there's nothing special about notebooks. A notebook is just code living in a file like any other, so why not treat it as such?

The way I execute Jupyter notebooks in production is to parse out the JSON and run each cell in an exec namespace. This lets you see each block being executed and gives you a lot more control when things go wrong. In particular, you can produce a clean stack trace when an error occurs.

For example, here is a dead-simple notebook executor.

import json

path = 'path/to/notebook.ipynb'
with open(path) as f:
    nb = json.load(f)

# isolated namespace to exec the notebook in
namespace = {}

print('Executing notebook %s' % path)

for cell in nb['cells']:
    # skip markdown and raw cells
    if cell['cell_type'] != 'code':
        continue

    # a cell's source is stored as a list of lines
    code = ''.join(cell['source'])

    try:
        exec(code, namespace)
    except Exception:
        # print the offending cell, then re-raise with the original traceback
        print(code)
        raise

Since I don’t care about saving the output back into the notebook, this works very well for me. If you need to save the cell output or do anything more involved, this probably isn’t the best solution. I’d suggest exploring Papermill or nbconvert in that case.
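That said, if you only need per-cell stdout rather than full notebook output, a rough sketch (my own addition, not part of the executor above) could capture it with `contextlib.redirect_stdout` from the standard library:

```python
import contextlib
import io

# Two toy "cells", as they'd look after joining each cell's source list.
cells = ["x = 2 + 2\nprint(x)", "print(x * 10)"]

namespace = {}
outputs = []

for code in cells:
    # capture whatever the cell prints so it can be stored per cell
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)
    outputs.append(buf.getvalue())

print(outputs)  # each entry is one cell's captured stdout
```

From there, writing the captured text back into each cell's `outputs` list would get you partway to what nbconvert does.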