How to include domain knowledge in a model
Why is this needed?
Imagine that you are given a labeled dataset and your task is to label a new one. What are you going to do? You will probably first train a machine learning model to find rules for labeling the new data. What’s next?
A machine learning model is convenient, but it is often hard to understand why it makes a particular prediction. You also cannot inject domain knowledge into such a model directly.
Is there a way to set data labeling rules based on your own knowledge, rather than relying on a machine learning model’s predictions?
This is where human-learn comes in handy.
What is human-learn?
human-learn is a Python package for building systems based on rules that are easy to build and compatible with scikit-learn.
To install human-learn, run the command:
pip install human-learn
We will learn how to create a model from a simple function. You can try and fork the source code for this article at this link:
To have a baseline for evaluating the rule-based model, let’s start by predicting the dataset with a machine learning model.
Machine learning model
As an example, let’s use the Occupancy Detection dataset from the UCI Machine Learning Repository.
Our task is to predict whether a room is occupied based on temperature, humidity, light, and carbon dioxide concentration. The room is vacant if Occupancy=0 and occupied if Occupancy=1.
After downloading, unzip the archive and read the data:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Get train and test data
train = pd.read_csv("occupancy_data/datatraining.txt").drop(columns="date")
test = pd.read_csv("occupancy_data/datatest.txt").drop(columns="date")
# Get X and y
target = "Occupancy"
train_X, train_y = train.drop(columns=target), train[target]
val_X, val_y = test.drop(columns=target), test[target]
Look at the first ten records of the train dataset:
train.head(10)
Train a RandomForestClassifier [random forest classifier] from scikit-learn on the training dataset and use this model to predict the test dataset:
# Train
forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(train_X, train_y)
# Predict
machine_preds = forest_model.predict(val_X)
# Evaluate
print(classification_report(val_y, machine_preds))
The prediction quality is pretty good. However, we don’t know how the model arrives at these predictions. Let’s see whether we can label new data with simple rules instead.
Rule-based model
Here are four steps for creating data labeling rules. You need to:
- Put forward a hypothesis.
- Examine the data to confirm the hypothesis.
- Start with simple rules that are based on observations.
- Improve the rules.
Putting forward a hypothesis
The light in the room is an important indicator of whether the room is occupied. Thus, it can be assumed that the brighter the room, the more likely it is to be occupied.
Let’s take a look at the data and see if that’s the case.
Exploring the data
To test the assumption, let’s use a boxplot to find the difference between the illumination in an occupied (Occupancy=1) and an empty (Occupancy=0) room.
import plotly.express as px
feature = "Light"
px.box(data_frame=train, x=target, y=feature)
You can see a significant difference in the median between occupied and empty rooms.
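To put numbers on that difference, a quick check (my addition, not from the original article) is to compute the median Light value for each class:
# Median light level for vacant (0) and occupied (1) rooms
train.groupby(target)[feature].median()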
Starting with simple rules
Now we will create rules for determining the occupancy of a room by its illumination. For example, if the amount of light exceeds a certain value, then Occupancy=1, otherwise Occupancy=0.
But what threshold value should we choose? Let’s start with a value of 100 and see what happens.
To create a rule-based model with human-learn, we:
- write a simple Python function that encodes the rules;
- use FunctionClassifier to turn this function into a scikit-learn model.
import numpy as np
from hulearn.classification import FunctionClassifier
def create_rule(data: pd.DataFrame, col: str, threshold: float = 100):
    # Predict occupied (1) when the value in col exceeds the threshold
    return np.array(data[col] > threshold).astype(int)
mod = FunctionClassifier(create_rule, col="Light")
Let’s predict the test dataset and evaluate the predictions:
mod.fit(train_X, train_y)
preds = mod.predict(val_X)
print(classification_report(val_y, preds))
The accuracy of the rule-based model is higher than that of the RandomForestClassifier!
Improving the rules
Now let’s see if we can get better accuracy by experimenting with the thresholds. To analyze the relationship between a specific value of illumination and the occupancy of the premises, we use parallel coordinates.
from hulearn.experimental.interactive import parallel_coordinates
parallel_coordinates(train, label=target, height=200)
Visualization in parallel coordinates shows that the probability of occupancy of a room with illumination of more than 250 lux is high. The optimal threshold separating an occupied room from an empty one seems to be somewhere between 250 and 750 lux.
Find the best threshold value in this range using GridSearchCV from scikit-learn.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(mod, cv=2, param_grid={"threshold": np.linspace(250, 750, 1000)})
grid.fit(train_X, train_y)
We get the best threshold value:
best_threshold = grid.best_params_["threshold"]
best_threshold
> 364.61461461461465
Now let’s draw this value on the box plot:
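A minimal sketch of such a chart, assuming we reuse the earlier plotly box plot and overlay the tuned threshold as a horizontal line:
fig = px.box(data_frame=train, x=target, y=feature)
fig.add_hline(y=best_threshold)  # tuned threshold over the Light distribution
fig.show()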
We use the model with the best threshold value to predict the test dataset:
human_preds = grid.predict(val_X)
print(classification_report(val_y, human_preds))
A threshold value of 365 gives better accuracy than a threshold value of 100.
Combining a machine learning model and a rule-based model
Creating a rule-based model is a good way to apply domain knowledge, but the approach has disadvantages:
- it is difficult to generalize the model to unseen data;
- it is difficult to come up with rules for complex data;
- no feedback to improve the model.
Therefore, the combination of a rule-based model and a machine learning model will help data scientists scale and improve the model while retaining the ability to use domain expertise.
One easy way to combine these two models is to decide whether we need to reduce false negatives (FN) or false positives (FP).
Reducing the number of false negatives
You should probably decrease FN in a case like predicting whether a patient has cancer: it is better to mistakenly tell patients they have cancer than the other way around, to tell them there is no cancer when there is.
To decrease FN, select the positive predictions where the two models give different answers:
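A minimal sketch of this logic, reusing the machine_preds and human_preds arrays computed above (the combined variable name is mine): treating a sample as positive whenever either model predicts positive catches more true positives, at the cost of possibly more FP.
# Positive whenever at least one of the two models predicts positive
fn_reduced_preds = np.where((machine_preds == 1) | (human_preds == 1), 1, 0)
print(classification_report(val_y, fn_reduced_preds))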
Reducing the number of false positives
You should probably reduce FP in a case like video recommendations for kids: it is better to mistakenly withhold a kid-friendly video than to recommend a violent one.
To reduce the number of FPs, select the negative predictions where the two models give different answers:
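A matching sketch for this case, again with a variable name of my own: treating a sample as positive only when both models agree on positive suppresses FP, at the cost of possibly more FN.
# Positive only when both models predict positive
fp_reduced_preds = np.where((machine_preds == 1) & (human_preds == 1), 1, 0)
print(classification_report(val_y, fp_reduced_preds))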
Other, more complex strategies can be used to choose between the two models’ predictions.
For more details on how to combine a machine learning model with rule-based models, I recommend watching Jeremy Jordan’s excellent video:
Conclusion
Congratulations! You just learned what a rule-based model is and how to combine it with a machine learning model. I hope this article gives you the knowledge you need to develop your own rule-based model.
Star this repository if you want to try the code from my articles.
Data source:
Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings, Volume 112, 15 January 2016, pp. 28-39.