Machine Learning and Testing

Mar 8, 2019

We write machine learning code in a very specific context. But from what I have seen so far, nothing has convinced me that machine learning code is fundamentally different from any other type of code.

This means that standard development practices apply, and testing is a very important one of them.

The rewards of testing can be immense, but so can the price you pay for testing poorly or not at all.

Let’s take a closer look at what testing looks like in the context of various machine learning applications.

Scenario 1 — writing single purpose code

This is the bread and butter of ML work. You work on a problem that is broadly defined by the type of data you have and by the objective. Your cost function might change, your data might change, but you don’t expect an NLP project to suddenly transform into image recognition and vice versa.

A Kaggle competition is a good approximation of this. I have worked on quite a few of them with some success and recently started to share my code.

But have I written a single test?

No, and I don’t plan on writing any.

ML code can be quite challenging to write. It can also be hard to troubleshoot, even with outstanding tools at our disposal.

Given all this, automated testing might seem like the way to go, but for this kind of work the solution that the community has converged on is far superior and more efficient.

It captures the benefits of automated testing and is achieved by approaching a project in a specific fashion, and by performing a lot of little actions along the way: actions that we all perform if we want a chance at delivering a working solution. These actions have specific names in software development, but the ML community has yet to adopt those names.

So how does it all work in practice? How do you work on an ML project and maintain your sanity?

You start small and develop a fully working solution. One that starts with reading in the data and delivers a result at the end. It can be very simple, for instance predicting the most common class for all the examples in the validation set.
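To make this concrete, here is a minimal sketch of such a first pipeline. The label lists and the helper names (fit_majority_class, accuracy) are made up for illustration; in a real project the labels would come from your own data loading code.

```python
from collections import Counter


def fit_majority_class(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical toy labels standing in for a real dataset.
train_labels = ["cat", "dog", "cat", "cat", "bird"]
valid_labels = ["cat", "dog", "cat", "bird"]

# "Train" by picking the majority class, then predict it for every example.
majority = fit_majority_class(train_labels)
baseline_preds = [majority] * len(valid_labels)
baseline_acc = accuracy(baseline_preds, valid_labels)

print(majority, baseline_acc)
```

Nothing here is clever, and that is the point: the whole path from data to a validation metric runs, so every later change has a number to be compared against.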

This establishes a baseline that you can fall back on. You add new functionality in small increments, rerunning the entire pipeline every now and then.

Is there an error, or did the performance drop? Not a problem: you have added only a small chunk of code since the last time you ran the pipeline, so finding the issue is going to be easy.

This is called integration testing.
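A rough sketch of what rerunning the entire pipeline after a small change might look like is below. Everything in it is an assumption made for illustration: the toy data, the threshold "model", and the names load_data, train_threshold, evaluate and run_pipeline. What matters is the shape: one function runs everything end to end, and the new score is compared with the last known good one.

```python
def load_data():
    """Hypothetical toy data: (feature, label) pairs."""
    train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
    valid = [(0.2, 0), (0.8, 1), (0.7, 1), (0.3, 0)]
    return train, valid


def train_threshold(train):
    """Toy 'model': the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def evaluate(threshold, valid):
    """Validation accuracy of the threshold model."""
    correct = sum((x > threshold) == bool(y) for x, y in valid)
    return correct / len(valid)


def run_pipeline(preprocess=lambda x: x):
    """Run everything end to end: load, preprocess, train, evaluate."""
    train, valid = load_data()
    train = [(preprocess(x), y) for x, y in train]
    valid = [(preprocess(x), y) for x, y in valid]
    return evaluate(train_threshold(train), valid)


# Last known good result, before the new increment.
baseline_score = run_pipeline()

# Small increment: a new preprocessing step. Rerun the whole pipeline.
new_score = run_pipeline(preprocess=lambda x: x * 10)

# If this fails, the culprit is the small chunk of code just added.
assert new_score >= baseline_score
```

In a notebook the comparison is usually done by eye rather than with an assert, but the logic is the same.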

But there is more. Depending on the complexity of the code you write, you might quite often need to test every line of it. A Jupyter notebook makes this super easy.

Say I would like to normalize my inputs. Not a problem. I write the code and run it on my data. Esc, b opens a new cell below. A quick calculation of the new mean and std dev (or a quick glance at the resulting image) and I can tell whether the code works correctly or not. Esc, d, d and the cell is gone.

This is equivalent to unit testing.
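The throwaway cell from the normalization example could look roughly like this. It is a sketch using only the standard library; the normalize function and the pixel values are made up, and in practice you would more likely be working with NumPy arrays or tensors.

```python
from statistics import fmean, pstdev


def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical inputs, e.g. a handful of pixel intensities.
pixels = [12.0, 50.0, 200.0, 255.0, 90.0]
normalized = normalize(pixels)

# The ad hoc check: mean should be about 0 and std about 1.
# Look at the two numbers, then delete the cell.
print(fmean(normalized), pstdev(normalized))
```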

If we test, wouldn’t automating it be better? It might seem to be the case, but both automating an action and organizing and retaining code come at a price. One that is generally not worth paying in the context of single purpose code.

Working in a Jupyter notebook, I probably run hundreds if not thousands of ad hoc tests. Once a test passes, I have some level of confidence the code works and I don’t have to test it again.

I don’t plan to refactor the code, move it around, add functionality. This code needs to work with the data provided within the constraints of the problem.

I get the code to run on inputs of a specific form at some point in time. I verify that it works through the ultimate system test: loss on a properly constructed validation set. And I move on.

The constraints of the problem are usually such that we don’t expect to have to make frequent, major changes. And even if we make a change, the code will still likely have to serve a single purpose and solve just the problem at hand.

Because I don’t expect the code to change much or often, manual testing might be enough. But what about writing a library? One that can be used in ways we can barely foresee? Or writing an application to address the needs of a living, evolving organization?

Today we write a training loop to be used on tabular data, but a month from now our boss tells us that Department X needs to also train on images. What now? Oh, BTW, your colleagues from the 1st floor need to train on tabular data, but from a relational database, not CSVs.

This changes the situation completely.

Scenario 2 — writing code that evolves over time and is used in unforeseen contexts

The fact that code on a long-running project will change is the main reason for automated testing. Whenever you make a change, you need to test.

Adding new functionality is a source of entropy in your codebase. To fight it, you refactor. You cannot refactor safely without tests. As a consequence, without testing your application falls apart. And the reason for investing in automated tests is to lower the costs over time.
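As an illustration, here is roughly what a retained, automated test could look like, written with Python's built-in unittest module. The train_valid_split helper is hypothetical; the point is that the checks outlive the notebook session and can be rerun after every refactor.

```python
import random
import unittest


def train_valid_split(items, valid_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle deterministically, return (train, valid)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


class TrainValidSplitTest(unittest.TestCase):
    def test_sizes(self):
        train, valid = train_valid_split(range(10), valid_fraction=0.2)
        self.assertEqual((len(train), len(valid)), (8, 2))

    def test_nothing_lost_or_duplicated(self):
        train, valid = train_valid_split(range(10))
        self.assertEqual(sorted(train + valid), list(range(10)))

    def test_same_seed_gives_same_split(self):
        first = train_valid_split(range(10), seed=1)
        second = train_valid_split(range(10), seed=1)
        self.assertEqual(first, second)
```

Saved in a test module, these can be run with python -m unittest. If a later refactor of train_valid_split breaks any of these properties, you find out right away.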

This situation is exactly the same as it would be for any other type of code, and ML code is not unique in this regard.

But coming back to the topic at hand: are there any other reasons for testing reusable code? It turns out the very features that make your code easy to test are the same ones that make it extensible and composable! If you write your code to achieve the primary purpose, you get the secondary benefit for free.
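One way to picture this is a sketch where the "training" code accepts any iterable of examples instead of reading a file itself. The names iter_csv_rows and fit_mean_predictor are made up, and the mean predictor is only a stand-in for a real training loop.

```python
import csv
import io


def iter_csv_rows(text_stream):
    """Yield (features, target) pairs from CSV text; the last column is the target."""
    for row in csv.reader(text_stream):
        *features, target = (float(cell) for cell in row)
        yield features, target


def fit_mean_predictor(examples):
    """Stand-in for a training loop: consumes any iterable of (features, target)."""
    total, count = 0.0, 0
    for _features, target in examples:
        total += target
        count += 1
    return total / count


# In a test, plain in-memory data is enough. No files, no database.
in_memory = [([1.0, 2.0], 3.0), ([4.0, 5.0], 5.0)]
model_from_memory = fit_mean_predictor(in_memory)

# The same function works unchanged on a CSV source.
csv_source = io.StringIO("1,2,3\n4,5,5\n")
model_from_csv = fit_mean_predictor(iter_csv_rows(csv_source))

print(model_from_memory, model_from_csv)
```

Because fit_mean_predictor does not care where the examples come from, it is trivial to test, and a new source such as a relational database would only need its own small iterator.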

If you would like to learn more about the technical aspects of testing, here is a great conference talk on this subject.

Scenario 3 — writing code that affects lives

Whatever the code is, whoever writes it, code that has life-impacting potential should be thoroughly tested. We probably should not only test that the code functions properly and will function properly in the wild. As algorithms start playing a more and more significant role in how our society functions, we might also need to start testing the impact code changes will have on the surrounding world.

If YouTube is pushing a change to its recommendation algorithm that will drive engagement (and ad revenue) but will do so by promoting conspiracy videos, should it test the impact on society? Yes. There is no absolution through the fact of not having considered something. The more powerful algorithms become in shaping our reality, the more responsibility we as the authors need to take. You cannot be driven by profit and say you are agnostic to the impact that you are having. That is the definition of an externality. Whether a corporation destroys an ecosystem by dumping waste into a river, or poisons people’s minds through promoting malicious content, the end result is the same.

In many ways, this post is a continuation of one that I wrote while taking a fastai Deep Learning course. Many of the ideas on structuring ML code are taken directly from the lectures.

Thank you for tuning in! I rarely blog but I tweet quite often — if you would be interested in staying in touch you can find me on Twitter here.

Written by Radek Osmulski