Avoid notebooks in production
Like many people, we both love and hate notebooks such as Jupyter (https://jupyter.org/). Data science and the initial stages of model/algorithm development are creative processes, requiring lots of visualisations and quick pivoting between modelling approaches. For this rapid analysis of data and prototyping of algorithms, notebooks are excellent, and they are the tool of choice for many data scientists. However, they have a number of features which make them difficult to use in production:
- Notebook files contain both code and outputs. The outputs can be large (e.g. images) and can also contain important business or even personal data. When notebooks are used with version control such as Git, that data is committed to the repo by default. You can work round this, but it is all too easy to inadvertently pass data to where it shouldn’t be. Storing outputs alongside code also makes it difficult or impossible to see exactly what changes have been made to the code from one commit to the next.
- Notebook cells can be run out of order, meaning that the same notebook can produce different results depending on the order in which you run the cells.
- Variables can stay in the kernel after the code which created them has been deleted. Variables can also be shared between notebooks using magic commands, creating dependencies that are not visible in the code.
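- Not all Python features work reliably in a notebook. For example, multiprocessing can fail in Jupyter, because worker processes may be unable to import functions that were defined in notebook cells.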
- The format of notebooks does not lend itself easily to testing - there are no intuitive test frameworks for notebooks.
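The first point above notes that you can work round outputs being committed. A minimal sketch of one such workaround is shown below, assuming the notebook is stored in the standard version 4 `.ipynb` JSON layout (a top-level `cells` list, with `outputs` and `execution_count` on each code cell). The function names `strip_outputs` and `strip_file` are illustrative, not part of any library; dedicated tools such as nbstripout exist for the same job and are likely a better choice in practice.

```python
import json
from pathlib import Path


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (as parsed JSON) with cell outputs removed."""
    # Round-tripping through JSON gives a cheap deep copy, so the
    # caller's notebook dict is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop images, tables, printed data
            cell["execution_count"] = None  # drop the run-order counter
    return cleaned


def strip_file(path: str) -> None:
    """Rewrite a .ipynb file in place without its outputs (hypothetical helper)."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    text = json.dumps(strip_outputs(notebook), indent=1, ensure_ascii=False)
    nb_path.write_text(text + "\n", encoding="utf-8")
```

Running something like this before each commit keeps data out of the repo and makes diffs show only code changes, but it relies on everyone remembering to do it, which is exactly the kind of fragility described above.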
In some cases we have used tools like papermill to run notebooks in production. Most of the time, though, moving to standard modular code once an initial prototype has been created will make the code more testable and easier to move into production, and will probably speed up your algorithm development as well.
I first came into contact with a Jupyter notebook while working on a predictive maintenance machine learning project, after a number of years as a production software developer. In this scenario, I found notebooks to be an invaluable resource. The ability to organise your code into segments with full markdown support and charts showing your thinking and output at each stage made demos and technical discussions simple and interactive. In addition, the tight integration with Amazon SageMaker and S3 meant I could work with relative freedom and with computing power on tap while remaining in the client’s estate.
However, as our proof of concept got more complicated, with a multi-stage ELT pipeline, varying data normalisation techniques and so on, I found myself maintaining a block of core ELT code that was approaching 500 lines of untested spaghetti. I had tried, with some success, to functionalise it so it wasn’t just one script and I could employ some DRY principles. However, I couldn’t easily call the functions from one notebook to another, so I resorted to copy and paste. Often I would make a small change somewhere and introduce a regression that made my algorithm performance drop off a cliff, and I would lose half a day trying to figure out where I had gone wrong. Or I’d restart my code in the morning and it wouldn’t work because it relied on some globally scoped variable that I’d created and lost with my kernel the night before. If there had been tests, I could have spotted these regressions and fixed them quickly, which would have saved me far more time in lost productivity than the tests would have taken to write in the first place.
In retrospect, when I come to do work like this in the future, I would opt for a hybrid approach. I would write the initial code for each stage in a notebook where I could make changes in an interactive way and design an initial process that I was happy with. Then, as my code ‘solidified’, I would create an installable package in a separate Git repository where I could make use of more traditional software development practices.
Using this approach has a number of advantages:
- You can import your code into any notebook by a simple pip install. You can use the same tested and repeatable ELT pipeline in a number of notebooks with differing algorithms with confidence.
- You can write and run tests and make use of CI tools, linting and all the other goodies software developers have created to make our code more manageable.
- You can reduce your notebook’s size, so that when you’re doing presentations and demos you don’t need 1,000 lines of boilerplate before you get to the good stuff.
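As a rough illustration of what moving a pipeline step into a tested package might look like, the sketch below shows one small normalisation function together with tests that a runner such as pytest could discover. The module path `my_pipeline/normalise.py`, the function `min_max_normalise` and the choice of min-max scaling are all hypothetical, chosen only for the example; they are not taken from the project described above.

```python
from typing import List

# Hypothetical module in the installable package, e.g. my_pipeline/normalise.py


def min_max_normalise(values: List[float]) -> List[float]:
    """Scale values linearly into the range [0, 1]."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; return zeros rather than
        # dividing by zero.
        return [0.0 for _ in values]
    span = high - low
    return [(v - low) / span for v in values]


# Hypothetical tests, e.g. tests/test_normalise.py


def test_min_max_normalise_scales_to_unit_range():
    assert min_max_normalise([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_min_max_normalise_handles_constant_input():
    assert min_max_normalise([3.0, 3.0]) == [0.0, 0.0]
```

Once the package is installed, each notebook would shrink to an import along the lines of `from my_pipeline.normalise import min_max_normalise`, and a regression in the shared step would show up as a failing test rather than as an unexplained drop in algorithm performance.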
The final advantage of this approach, in a world of deadlines where proofs of concept far too often become production solutions, is that you productionise your code as you go. This means that when the time comes for your code to be used in production, standardising it doesn’t seem like such an insurmountable task.
Equal Experts, UK