Building machine learning models may seem straightforward, but the real challenge lies in operationalizing them effectively in production and addressing errors and biases. Model debugging becomes even more complex when the training data is biased or incorrectly labeled to begin with. Recently, a study from Stanford University and UC Berkeley reported that GPT-4’s performance on some tasks had degraded compared to an earlier version, prompting calls to re-evaluate Large Language Models (LLMs). As more and more training data is generated synthetically by AI, the problem of model degradation and bias is amplified.
In a recent conversation with a machine learning practitioner from a Fortune 50 company, the topic of debugging ML models came up. The practitioner described using a range of MLOps tools, including model-agnostic interpretability tools and post hoc Explainable AI (XAI) tools. While these tools effectively identified the features contributing to model outcomes and detected data and model drift, they had a significant drawback: none of them provided actionable recommendations to remedy the issues they identified. The common suggestion was to retrain the model, which was less than ideal because of the significant resources required and the risk of amplifying bias from flawed training datasets. The practitioner questioned the need for XAI tools at all when a predetermined model retraining schedule could be followed instead.
Common issues with post hoc model explainability
- Any inherent bias in the training dataset is amplified by retraining and further degrades model performance.
- High error rates in labeling the training dataset (especially for image classification models).
- Spurious correlations among features lead to incorrect inferences; this is again especially common in image classification, as illustrated in the study referenced below.
- Model explainability tools fall short on actionable steps to address model degradation beyond model retraining.
- Constant retraining of models is inherently cost-prohibitive.
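To make the labeling issue above more concrete, here is a minimal sketch of one simple heuristic for surfacing possibly mislabeled training examples: flag any example where the model confidently predicts a class other than the given label. This is an illustration only, not a method taken from the tools or the study discussed here. The function name `flag_suspect_labels`, the 0.9 threshold, and the toy probabilities are all assumptions, and flagged examples would still need human review.

```python
from typing import List, Sequence


def flag_suspect_labels(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.9,
) -> List[int]:
    """Return indices where the model confidently disagrees with the given label."""
    suspects = []
    for i, (row, given) in enumerate(zip(probs, labels)):
        # Index of the class with the highest predicted probability.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted != given and row[predicted] >= threshold:
            suspects.append(i)
    return suspects


# Toy data: three classes, four examples (all values are made up).
probs = [
    [0.95, 0.03, 0.02],  # labeled 0, model agrees
    [0.02, 0.96, 0.02],  # labeled 0, model confidently predicts 1: suspect
    [0.40, 0.35, 0.25],  # labeled 2, model is unsure: not flagged
    [0.01, 0.01, 0.98],  # labeled 2, model agrees
]
labels = [0, 0, 2, 2]

suspects = flag_suspect_labels(probs, labels)
print(suspects)
```

A heuristic like this only narrows down where to look. If the model itself has learned a bias or a spurious correlation, its confident disagreements may be wrong, so the flagged indices are candidates for relabeling rather than confirmed errors.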
A study on model debugging techniques
Are there better alternatives to post hoc model analysis or to using interpretable models? A recent thesis by ML researcher Julius Adebayo from MIT's EECS department addresses the limitations of post hoc explanation methods for model debugging and offers recommendations for overcoming some of the associated challenges. The study focuses on the prediction phase, whereas other approaches focus on the training phase, and it is a great read for understanding the limitations of current post hoc analysis methods. The thesis introduces the concept of “model guiding,” proposing the use of an audit set as a form of ‘unit’ test for the pre-trained model. However, it remains to be seen how approaches like model guiding can be implemented in practice.
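As a rough illustration of the "audit set as unit test" idea, the sketch below runs a model against a small, hand-curated audit set and reports pass or fail per data slice. It is not the procedure from the thesis; it only shows what treating an audit set like a test suite could look like. The names `run_audit` and `toy_model`, the slice names, the features, and the 0.9 accuracy threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Each audit example: (input features, expected label, slice name).
AuditExample = Tuple[Dict[str, float], int, str]


def run_audit(
    predict: Callable[[Dict[str, float]], int],
    audit_set: List[AuditExample],
    min_accuracy: float = 0.9,
) -> Dict[str, dict]:
    """Compute per-slice accuracy on the audit set and mark each slice pass/fail."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for features, expected, slice_name in audit_set:
        total[slice_name] += 1
        if predict(features) == expected:
            correct[slice_name] += 1
    report = {}
    for name in total:
        acc = correct[name] / total[name]
        report[name] = {"accuracy": acc, "passed": acc >= min_accuracy}
    return report


def toy_model(features: Dict[str, float]) -> int:
    # Stand-in for a pre-trained model. It looks only at the background,
    # which mimics a spurious shortcut learned from the training data.
    return 1 if features["background_green"] > 0.5 else 0


# Hypothetical audit set: label 1 means "bird", label 0 means "not a bird".
audit_set = [
    ({"background_green": 0.9, "has_wings": 1.0}, 1, "typical"),
    ({"background_green": 0.1, "has_wings": 0.0}, 0, "typical"),
    ({"background_green": 0.1, "has_wings": 1.0}, 1, "unusual_background"),
    ({"background_green": 0.9, "has_wings": 0.0}, 0, "unusual_background"),
]

report = run_audit(toy_model, audit_set)
failed = [name for name, result in report.items() if not result["passed"]]
print(report)
print(failed)
```

Here the toy model passes the "typical" slice but fails "unusual_background", where the background no longer matches the label. A failing slice tells you where the model breaks, which is more actionable than a blanket recommendation to retrain, although deciding how to correct the model is a separate step.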
How is your organization addressing the limitations associated with model debugging? Are there better tools or methods to address these challenges? Please share your experiences and insights in the comments below!