Machine Learning In Production – Skew and Drift

In this post we will go over two closely related concepts that matter when you deploy Machine Learning models in production: training–serving skew (a mismatch between the data a model was trained on and the data it sees in production) and drift.

Drift: Drift, or concept drift, refers to the phenomenon where the statistical properties of the target variable or the input features change over time. In other words, the relationship between the input variables and the target variable is no longer stable. This can happen for many reasons: changes in the underlying data-generating process, shifts in user behaviour, or changes in the environment. Concept drift can significantly degrade the performance of machine learning models, because they were trained on historical data that may no longer represent the current state of the world. Models need to be continuously monitored and retrained to adapt, or specialized techniques for handling drift, such as online learning or ensemble methods, can be employed.
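As a toy illustration (entirely synthetic data, not a real pipeline): suppose a model learned the rule "predict 1 when the feature is positive". If the real-world relationship P(y|x) later flips while the feature distribution P(x) stays the same, accuracy collapses even though the inputs look identical — which is exactly why concept drift is hard to catch by looking at features alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# A toy "model" trained on historical data: predict 1 whenever x > 0.
model = lambda x: (x > 0).astype(int)

# Before drift: the real-world rule matches what the model learned.
x_old = rng.normal(0, 1, 10_000)
y_old = (x_old > 0).astype(int)

# After drift: P(x) is unchanged, but the relationship P(y|x) has flipped.
x_new = rng.normal(0, 1, 10_000)
y_new = (x_new <= 0).astype(int)

acc_before = accuracy(y_old, model(x_old))  # perfect agreement
acc_after = accuracy(y_new, model(x_new))   # total disagreement
print(f"accuracy before drift: {acc_before:.2f}, after drift: {acc_after:.2f}")
```

Monitoring only the input distribution would miss this kind of drift entirely; monitoring model performance (or proxies for it) would catch it immediately.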

To measure this type of skew, you can use several statistical techniques:

  1. Feature Comparison: Calculate summary statistics (such as mean, median, and variance) for each feature in the training dataset and the production dataset, and compare them to identify any significant differences. To quantify the gap between the two distributions, use a statistical distance such as the Kolmogorov-Smirnov test or the Jensen-Shannon divergence.
  2. Domain Expertise: Consult with domain experts or stakeholders who are familiar with the data and understand the expected distribution of features. They can provide insights into potential skewness or changes in feature distributions that might be critical to consider.
  3. Monitoring and Drift Detection: Implement a monitoring system that continuously tracks the distribution of features in the production environment. Various drift detection algorithms are available, such as the Drift Detection Method (DDM) or the Page-Hinkley test. These methods analyze incoming data over time and flag significant changes or shifts in the feature distributions.
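The feature-comparison step above can be sketched in a few lines with SciPy. The feature values below are synthetic stand-ins for a real training set and production stream; the bin count for the Jensen-Shannon histogram is an arbitrary choice:

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(7)

# Hypothetical feature values: training data vs. a shifted production stream.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_feature = rng.normal(loc=0.5, scale=1.2, size=5000)

# Summary statistics: a cheap first comparison.
for name, arr in [("train", train_feature), ("prod", prod_feature)]:
    print(f"{name}: mean={arr.mean():.3f} median={np.median(arr):.3f} var={arr.var():.3f}")

# Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")

# Jensen-Shannon: bin both samples on a common grid, then compare histograms.
bins = np.histogram_bin_edges(np.concatenate([train_feature, prod_feature]), bins=50)
p, _ = np.histogram(train_feature, bins=bins, density=True)
q, _ = np.histogram(prod_feature, bins=bins, density=True)
js_distance = jensenshannon(p, q)  # 0 = identical distributions
print(f"Jensen-Shannon distance={js_distance:.3f}")
```

In practice you would run a check like this per feature on a schedule, and alert when the KS p-value falls below (or the JS distance rises above) a threshold tuned to your tolerance for false alarms.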

By combining these techniques, you can gain insight into the skew between the training and production feature distributions. Detecting and addressing such skew early is crucial for maintaining the performance and reliability of machine learning models in real-world scenarios.
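A drift detector such as the Page-Hinkley test mentioned above fits in a few lines of pure Python. This is a minimal sketch, and the `delta` and `threshold` parameters are illustrative, not recommended defaults:

```python
class PageHinkley:
    """Minimal Page-Hinkley drift detector (a sketch, not production code).

    Flags drift when the monitored value (e.g. a feature mean or an error
    rate) rises persistently above its running mean.
    """

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change tolerated as noise
        self.threshold = threshold  # alarm level (often called lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0              # cumulative deviation m_t
        self.min_cum = 0.0          # minimum of m_t seen so far

    def update(self, x):
        """Feed one observation; return True when drift is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold

detector = PageHinkley()
stream = [0.1] * 200 + [1.5] * 100   # the monitored value jumps at t=200
drift_at = next((t for t, x in enumerate(stream) if detector.update(x)), None)
print(f"drift detected at t={drift_at}")
```

The detector fires a few observations after the jump, once the cumulative deviation from the running mean exceeds the threshold. Libraries such as river ship tested implementations of this and related detectors, which are preferable to rolling your own in production.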
