Update 2/6: After reporting our findings back to the client, we found out that the dataset we used was biased: not in the typical "target = 1" sense, but in that the training data did not reflect reality well. The model suffered accordingly. "Garbage in, garbage out." The team has learned its lesson, and the next dataset will be holistic and representative.
Well - the title pretty much summarizes what I have been doing in the past week.
Long story short:
- There was a business problem to solve. (Due to intellectual property rules, I cannot say more beyond this; think of it as a binary classification problem.)
- We (data scientists) came up with a solution (a model, for sure).
- Before deploying the model into production, we tested it out for a certain period and collected the results.
- Due to the time lags in some additional processing, the results were not available until last week.
- We were then shocked by the news: the top three groups, which were supposed to have a precision of 40%, came in at only 9%.
Seriously, that was my first thought. Moreover, we (data scientists) did not see any metrics other than the summarized statistics (basically one statement on a slide: for the top 3 groups, the precision is 9%; model precision: 40%). Though the training set we used was not the most recent, we certainly did not expect such a dramatic difference (a fluctuation of around 5% would be understandable). But we had told the engineers that they should start implementing this model in two weeks! We had already missed the deadline once due to the developers' limited capacity; if we waited again, it would be another three months. Plus, once our managers knew, they would demand a model diagnostic write-up.
To save our butts, I reached out to the analyst who had collected the testing results and calculated the performance figures. She sent me back a spreadsheet with unique identifiers (the objects we are predicting) and some attributes. Without firing up any weapons from my toolkit, I spotted the first problem: the attributes reported in the validation report do not match the target requirement.
For instance, let's say that we would like to build a model to predict whether a given day would be ideal for a day trip. For a day to be qualified as "ideal", it must meet the following two criteria:
- The expected amount ($) I spend would be less than $500
- The weather between 10 am and 12 pm is not rainy
But the report only contains information such as "weather between 8 am and 11 am" and "the expected $ the team spent". These two pieces of information were then put together and treated as the target. You can't say there is no value in recording the two attributes; in fact, there is an overlap. However, this is not what we agreed upon. When gauging model performance, we need to refer back to the model definition and capture exactly the features it requires. Otherwise, it's not an apples-to-apples comparison.
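To sketch the mismatch (all column names here are hypothetical, not the real attributes), you can recompute the target twice, once from the proxy attributes the report used and once from the agreed definition, and count the disagreements:

```python
import pandas as pd

# Hypothetical validation report; every column name is illustrative.
report = pd.DataFrame({
    "trip_id": [1, 2, 3, 4],
    "spend_team_usd": [450, 600, 300, 520],       # team spend, not the agreed attribute
    "rain_8_to_11": [False, False, True, False],  # wrong time window
    "spend_usd": [400, 550, 250, 480],            # what the definition actually needs
    "rain_10_to_12": [False, True, True, False],
})

# Target as the report computed it, from the proxy attributes:
proxy_target = (report["spend_team_usd"] < 500) & ~report["rain_8_to_11"]

# Target as the agreed model definition requires:
true_target = (report["spend_usd"] < 500) & ~report["rain_10_to_12"]

# Any disagreement means the reported precision is not apples-to-apples.
mismatch = int((proxy_target != true_target).sum())
print(f"{mismatch} of {len(report)} labels differ between proxy and true target")
```

Even a small disagreement rate here invalidates the precision figure, because the metric is being computed against the wrong ground truth.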
Another problem lies in the model attribute distributions. I compared how the distributions of selected features differ across the training set, the testing set, and (crucially) the holdout testing set. Very few features had similar distributions. A machine learning model learns only what was presented to it at training time. This is not even a generalization problem: over time, some features may have drifted, and that is a call for more recent data. Consider Twitter spam behaviors: if the model was trained on 2014 data, of course it will fail to find 2020 spammers. There is no guarantee that the features in our model have constant distributions over time.
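One quick way to quantify this kind of drift is a two-sample Kolmogorov-Smirnov statistic between the training and field distributions of each feature. The sketch below uses simulated data with a deliberate shift; the distributions and the "drifted" threshold are purely illustrative:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    all_vals = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_vals, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_vals, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)   # feature at training time
field_feature = rng.normal(0.8, 1.3, 5000)   # same feature in the field, drifted

d = ks_statistic(train_feature, field_feature)
print(f"KS statistic: {d:.3f}")  # near 0 = similar; large = drifted
```

Running this per feature over the train vs. holdout sets makes "very few features have similar distributions" a concrete, rankable list instead of an eyeball judgment.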
The third problem is the most interesting one, and it deserves a section itself.
Be careful with Pipelines...
When I trained a model on the server and immediately used it to predict on the training dataset (caution: THIS PRACTICE IS NOT ENCOURAGED), each instance got an assigned probability. I then saved the model to my local workstation, loaded the same training dataset, and used the saved model to generate predictions. The probabilities changed for the same instances! Why? I had the seed set, and the model configuration (i.e., the parameters of the model) was all I needed. The way I calculated the model performance metrics (accuracy, precision, etc.) was correct. The model pickle file was not contaminated with random noise. Could it be the different environment setups? I was under the impression that a trained ML model should always spit out the same output given the same input! After spending a whole morning debugging (in a Jupyter Notebook), I was pulling my hair out, crying.
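That impression is, in fact, correct: serialization itself is deterministic. A sanity check like the one below (a toy model, not our actual one) rules out the pickling step as the culprit, which is what eventually pointed me elsewhere:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(random_state=0).fit(X, y)
probs_before = model.predict_proba(X)

# Round-trip through pickle, as when moving the model server -> workstation.
restored = pickle.loads(pickle.dumps(model))
probs_after = restored.predict_proba(X)

# A trained model is deterministic at inference:
# identical input must yield identical output.
assert np.allclose(probs_before, probs_after)
print("Predictions match after pickling")
```

If this check passes but production-style predictions still differ, the discrepancy almost certainly lives in what happens to the data *before* it reaches `predict_proba`.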
At first glance, this seemed to be a problem with recent scikit-learn updates. My local machine was running 0.22.1 while the server had 0.21.3. It must have been a breaking update, because my local machine refused to load the saved pickle file at all. For compatibility, I had to downgrade scikit-learn on my end. The local environment then loaded the saved pickle file correctly, but the predictions were still all over the place!
Side note: instead of saving just the model (e.g. SVM, GBM, logistic regression, etc.), I saved the whole scikit-learn Pipeline object. The pipeline contains some preprocessing steps, such as handling missing values, but these steps are the bare minimum. Before training the model, I had to take an extra step to clean the training data: removing dots and dollar signs, binning values into groups, keeping the top n values, etc. These preprocessing steps were considered "data transformations" and thus were not included in the model pipeline.
... You probably have an idea of what went wrong after reading my notes above. And that's exactly the problem: when I unpickled the saved pipeline object locally, I fed it the raw training data directly. The data transformation step was omitted completely at inference time. During the training phase, the model saw a feature split nicely into 10 bins; at inference time, even though the input data was the same, the missing data transformation step meant the model had to deal with the raw values instead of the transformed 10 bins. After I added all the required data transformations, the saved model pickle returned the same result for a given instance.
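One way to prevent this class of bug is to fold the external cleaning into the Pipeline itself, for instance with a `FunctionTransformer`, so the pickle carries the transformation along. The column name, cleaning rules, and labels below are illustrative, not our real features:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def clean_raw(df):
    """The cleaning that used to live OUTSIDE the pipeline:
    strip dollar signs and bin the amount into groups."""
    out = pd.DataFrame()
    out["amount"] = df["amount"].str.replace("$", "", regex=False).astype(float)
    out["amount_bin"] = pd.cut(out["amount"], bins=10, labels=False)
    return out

pipe = Pipeline([
    ("clean", FunctionTransformer(clean_raw)),    # now part of the saved object
    ("impute", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression()),
])

# Raw strings in, predictions out: no external preprocessing to forget.
raw = pd.DataFrame({"amount": [f"${v}" for v in range(100, 600, 50)]})
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
pipe.fit(raw, y)
preds = pipe.predict(raw)
```

One caveat with this approach: pickling a pipeline that references `clean_raw` requires that function to be importable wherever the pickle is later loaded.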
So what could I do?
1. Fix the Pipeline and pickle it again, so that the saved pickle contains the required data transformations. In production, the model then handles raw data naturally.
2. Have the engineers transform the data before sending it to our model.
Option 2 is preferred by our developers, as they have some clever procedures for tweaking data transformations in real time. So all we need to do now is apply the transformation to the holdout (beta test) data and collect the model outputs. Since the field test has already concluded, with the correct target information we can measure how well the model is actually performing.
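The final measurement then looks roughly like this: rank the holdout by model score, split it into groups, and report the positive rate ("precision") in the top groups, mirroring the field-test report. The scores and labels below are simulated stand-ins for the real data:

```python
import numpy as np

# Simulated holdout: model scores and corrected binary labels.
rng = np.random.default_rng(1)
scores = rng.random(300)
labels = (rng.random(300) < scores).astype(int)  # higher score -> more positives

# Rank instances by score (descending) and split into 10 equal groups.
order = np.argsort(-scores)
groups = np.array_split(labels[order], 10)

# Positive rate in the top 3 groups, as in the field-test summary.
for i, g in enumerate(groups[:3], start=1):
    print(f"Group {i}: precision = {g.mean():.2f}")
```

A well-calibrated model should show the positive rate falling off monotonically from the top group downward; a flat profile would be another red flag.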
Don't be afraid if your model performs poorly on unseen data. Take steps to debug it (which, I have to say, is different from debugging code; I miss the VS debugger!):
- If you are not the one who collected the test results, ask the relevant party for details: specifically, what attributes were captured and how the performance metrics were calculated.
- Compare against your training data to see whether attribute distributions have changed over time (think of Twitter spamming behaviors: spammers change frequently to evade rules).
- If a data pipeline or preprocessing is involved, check whether raw data was exposed to the model directly, with critical preprocessing steps missing.
- Keep asking questions! (Even under tight deadlines)
I should have taken a closer look at the data transformation step earlier, but I naively thought it was already handled by the sklearn Pipeline. It wasn't until I dissected the pickle file that I realized how the data transformation came into play. Also: don't feed your trained model its training data again. It's not a good way to judge model performance, since overfitting is inevitable, unless you know what you're doing (in this case, I did). Happy coding! Hopefully I can update more frequently in 2020; I do feel that mistakes are my main driving force for writing blogs nowadays😂😂😂