Building a text classification pipeline

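import pandas as pd
from tqdm import tqdm

# prepare_text, del_stopwords, lemmatize and topics_dict are helpers not shown in this section.
# Read the source file in chunks of 10,000 rows to limit memory use.
df = pd.read_csv('file_directory', chunksize=10000)
filtered_chunk_list = []
for chunk in tqdm(df):
    # Clean each text, then join the chunk into one "|"-separated string
    # so that stop-word removal and lemmatization run once per chunk.
    chunk['text'] = chunk['text'].apply(lambda x: prepare_text(str(x)))
    all_texts = "|".join(chunk['text'].tolist())
    clean_texts = del_stopwords(all_texts)
    chunk['text'] = lemmatize(clean_texts)

    # Same steps for the titles.
    chunk['title'] = chunk['title'].apply(lambda x: prepare_text(str(x)))
    all_titles = "|".join(chunk['title'].tolist())
    clean_titles = del_stopwords(all_titles)
    chunk['title'] = lemmatize(clean_titles)

    # Map each topic to its class label via topics_dict.
    chunk['topic'] = chunk['topic'].map(topics_dict)
    filtered_chunk_list.append(chunk)

# Combine the processed chunks and save the prepared dataset.
model_df = pd.concat(filtered_chunk_list)
model_df.to_csv('text_prepare.csv', index=False)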
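The helpers used in the loop above are not shown in this post. As a rough illustration of the join-and-split pattern, here is a minimal sketch of what `prepare_text` and `del_stopwords` might look like. The function bodies, the tiny stop-word set, and the `split_back` helper are all assumptions made for this example, and lemmatization is left out because it depends on a library the post does not name.

```python
import re

# Hypothetical stop-word list; the original helpers and their resources are not shown.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def prepare_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Everything that is not a letter, digit, or whitespace becomes a space.
    # This also removes any "|" so it is safe to use as a separator later.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def del_stopwords(joined: str, sep: str = "|") -> str:
    """Drop stop words from every document in a sep-joined string."""
    docs = joined.split(sep)
    cleaned = [" ".join(w for w in doc.split() if w not in STOPWORDS) for doc in docs]
    return sep.join(cleaned)


def split_back(joined: str, sep: str = "|") -> list:
    """Turn the joined string back into one entry per row."""
    return joined.split(sep)


raw = ["The price of Oil is rising!", "A new phone, and a tablet."]
prepared = [prepare_text(t) for t in raw]
joined = "|".join(prepared)
result = split_back(del_stopwords(joined))
print(result)
```

The number of documents is preserved because the separator cannot occur inside a cleaned text, so the result can be assigned back to the column row by row.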

The following function helps to determine the best hyperparameters for the selected model:

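# HalvingRandomSearchCV is experimental and must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

def search_best_estimator(pipeline, param_grid, x, y):
    """Run a successive-halving random search and return the best refitted pipeline."""
    hrs = HalvingRandomSearchCV(
        estimator=pipeline,
        param_distributions=param_grid,
        scoring='f1_weighted',
        cv=3,
        n_candidates="exhaust",
        factor=5,
        n_jobs=-1,
    )
    hrs.fit(x, y)
    return hrs.best_estimator_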
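To give an intuition for what `factor=5` means in the search above, here is a simplified sketch of a successive-halving schedule: after each round only about one in `factor` candidates survives, while the amount of data given to each survivor grows by the same factor. The helper `halving_schedule` and the numbers are illustrative assumptions; scikit-learn's actual schedule also depends on further settings such as `min_resources`.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return (candidates, samples per candidate) for each halving round."""
    schedule = []
    n_resources = min_resources
    while n_candidates >= 1 and n_resources <= max_resources:
        schedule.append((n_candidates, n_resources))
        if n_candidates == 1:
            break
        # Keep roughly the best 1/factor of the candidates...
        n_candidates = math.ceil(n_candidates / factor)
        # ...and give each survivor factor times more samples.
        n_resources *= factor
    return schedule


rounds = halving_schedule(n_candidates=25, min_resources=100, max_resources=2500, factor=5)
for i, (candidates, samples) in enumerate(rounds):
    print(f"round {i}: {candidates} candidates, {samples} samples each")
```

With these illustrative numbers the search needs three rounds, and only the last one uses the full 2,500 samples.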

To assess the quality of the model, it helps to complement numeric metrics with visualizations such as graphs and histograms. One such visualization is the confusion matrix (error matrix), which makes it easy to see how many values the model predicted correctly and incorrectly for each class.

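import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_test, y_preds, model):
    # Rows are true classes, columns are predicted classes.
    fig, ax = plt.subplots(figsize=(16, 10))
    cm = confusion_matrix(y_test, y_preds)
    cmp = ConfusionMatrixDisplay(cm, display_labels=model.classes_)
    cmp.plot(ax=ax)
    plt.show()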
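To make clear what the plotted matrix contains, here is a small sketch that builds the same table of counts by hand. The helper `confusion_counts` and the toy labels are made up for this illustration; in practice `sklearn.metrics.confusion_matrix` does this job.

```python
def confusion_counts(y_true, y_pred):
    """Count (true class, predicted class) pairs; rows are true classes."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix


# Toy labels, not taken from the dataset in this post.
y_true = [1, 1, 2, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3]

labels, matrix = confusion_counts(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(label, row)

# Correct predictions sit on the diagonal.
n_correct = sum(matrix[i][i] for i in range(len(labels)))
```

Every off-diagonal cell is an error: the row tells which class the text really belongs to, and the column tells which class the model chose instead.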

Model Training

With the helper functions in place, we can start training the model.

We build a pipeline that combines the text vectorization step with the model used for classification.

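from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# model_df is the prepared DataFrame that was saved to text_prepare.csv.
x, y = model_df['text'].tolist(), model_df['topic'].tolist()
pipeline = Pipeline(
    steps=[("tfidf", TfidfVectorizer()), ("base", RandomForestClassifier())]
)
# Each range yields two candidate values, e.g. range(25, 35, 5) gives 25 and 30.
param_grid = {
    "tfidf__min_df": list(range(25, 35, 5)),
    "base__n_estimators": list(range(150, 250, 50)),
    "base__max_depth": list(range(25, 35, 5)),
    "base__min_samples_split": list(range(6, 10, 2)),
    "base__min_samples_leaf": [2],
}
estimator = search_best_estimator(pipeline, param_grid, x, y)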

We split the sample into training and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    x, y, random_state=42, test_size=0.3, stratify=y
)

We train the model, make a prediction on the test data, and build a confusion matrix on the results.
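The code for this step is not included in the post. The following is a minimal, self-contained sketch of how it might look, using a tiny made-up corpus and fixed hyperparameters instead of the tuned `estimator`; all data and parameter values here are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the prepared texts (hypothetical data).
texts = [
    "stock market price", "bank profit market", "market stock bank", "price profit stock",
    "team match goal", "goal player match", "player team win", "match win goal",
] * 5
topics = ([1] * 4 + [2] * 4) * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, topics, random_state=42, test_size=0.3, stratify=topics
)

# Same two steps as the pipeline in the post, with fixed settings.
model = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("base", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)          # train on the training part only
y_preds = model.predict(X_test)      # predict on the held-out part

f1 = f1_score(y_test, y_preds, average="weighted")
cm = confusion_matrix(y_test, y_preds)
print(f"Weighted F1-score: {f1:.2f}")
print(cm)
```

The score on this toy data is not meaningful, since the corpus consists of repeated sentences. With the real data, the predictions would be passed to `plot_confusion_matrix(y_test, y_preds, model)` to draw the matrix.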

As the confusion matrix shows, the largest number of false positives falls on the second class.

For this multiclass task we chose the weighted F1-score. Instead of a single overall F1-score, it computes an F1-score for each class in a one-vs-rest fashion, as if there were a separate classifier for each class, and then averages these per-class scores weighted by the number of true samples in each class. The classifier's weighted F1-score is 69%.

What could be the reason for this behavior when training the model? Let's look at the number of texts in each class of our training sample.
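A minimal sketch of such a count, using `collections.Counter`. The list `y_train_example` is a hypothetical stand-in for the real training labels, so the numbers below are not the counts from the dataset in this post.

```python
from collections import Counter

# Hypothetical labels standing in for y_train; real counts come from the dataset.
y_train_example = [2, 2, 1, 2, 3, 2, 2, 1, 2, 3]

class_counts = Counter(y_train_example)
total = len(y_train_example)
for label, count in sorted(class_counts.items()):
    print(f"class {label}: {count} texts ({count / total:.0%})")

# The most frequent class and its size.
majority_class, majority_count = class_counts.most_common(1)[0]
print(f"majority class: {majority_class}")
```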

With these counts, everything falls into place: class 2 is the majority class, so during training the model is oriented most strongly toward the texts of this class. How can we train a multiclass classification model when the classes are imbalanced?

This calls for one of the resampling methods. The idea is either to add samples to the under-represented classes (oversampling) or to remove samples from the over-represented classes (undersampling).
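The two directions can be sketched with plain random resampling. This is only an illustration of the idea and not the method used later in this post; the helper `random_resample` and the toy data are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def random_resample(samples, labels, mode="under", seed=42):
    """Balance classes by random undersampling or oversampling (with replacement)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    sizes = [len(items) for items in by_class.values()]
    # Undersampling shrinks every class to the smallest one,
    # oversampling grows every class to the largest one.
    target = min(sizes) if mode == "under" else max(sizes)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)  # drop the extra items
        else:
            extra = rng.choices(items, k=target - len(items))  # duplicate random items
            chosen = items + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


samples = [f"text {i}" for i in range(10)]
labels = [2] * 6 + [1] * 2 + [3] * 2

_, under_labels = random_resample(samples, labels, mode="under")
_, over_labels = random_resample(samples, labels, mode="over")
print(Counter(under_labels))
print(Counter(over_labels))
```

Undersampling discards information from the large classes, while oversampling repeats texts from the small ones, so which direction works better depends on the dataset.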

To improve the quality of the model, we apply one of the undersampling methods, NearMiss.

from imblearn.under_sampling import NearMiss

# Vectorize the texts first: NearMiss works on numeric features.
vectorizer = TfidfVectorizer(min_df=30)
vect_x = vectorizer.fit_transform(x)

# Undersample the larger classes.
nm = NearMiss()
X_res, Y_res = nm.fit_resample(vect_x, y)

# The texts are already vectorized, so the pipeline contains only the classifier.
pipeline2 = Pipeline(steps=[("base", RandomForestClassifier())])
param_grid2 = {
    "base__n_estimators": list(range(200, 300, 50)),
    "base__max_depth": list(range(25, 35, 5)),
    "base__min_samples_split": list(range(8, 12, 2)),
}
estimator2 = search_best_estimator(pipeline2, param_grid2, X_res, Y_res)

X_train, X_test, y_train, y_test = train_test_split(
    vect_x, y, random_state=42, test_size=0.3, stratify=y
)
estimator2.fit(X_res, Y_res)
y_preds2 = estimator2.predict(X_test)
plot_confusion_matrix(y_test, y_preds2, estimator2)
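To illustrate how NearMiss chooses which samples to keep, here is a simplified two-class sketch of the NearMiss-1 heuristic: keep the majority samples whose average distance to their closest minority samples is smallest. The function `near_miss_1` and the 2D points are assumptions for this example; the imbalanced-learn implementation differs in details and also handles several classes and sparse matrices.

```python
import math


def near_miss_1(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their n_neighbors closest minority points."""
    def score(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:n_neighbors]
        return sum(nearest) / len(nearest)

    # Points close to the minority class come first.
    return sorted(majority, key=score)[:n_keep]


# Toy 2D points, not related to the TF-IDF vectors in this post.
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (4.0, 0.0)]

kept = near_miss_1(majority, minority, n_keep=len(minority))
print(kept)
```

The majority points far from the minority class are dropped, so the retained samples are the ones near the class boundary.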

As the confusion matrix shows, undersampling improved the classification: the F1-score rose to 79%.

Next, we train an XGBClassifier on the same resampled data.

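from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# XGBClassifier expects class labels starting at 0, so the labels are shifted down by one.
xgbc = XGBClassifier()
xgbc.fit(X_res, [j - 1 for j in Y_res])

X_train, X_test, y_train, y_test = train_test_split(
    vect_x, y, random_state=42, test_size=0.3, stratify=y
)
# Shift the predictions back to the original labels before scoring.
pred_y = [j + 1 for j in xgbc.predict(X_test)]
f1 = f1_score(y_test, pred_y, average="weighted")
print(f'Model F1-score: {f1}')

# The axis labels come from xgbc.classes_, i.e. the 0-based labels used for training.
plot_confusion_matrix(y_test, pred_y, xgbc)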

The weighted F1-score of the XGBClassifier model reached 92%. The confusion matrix shows that the share of errors in which texts from other classes are assigned to the majority class has decreased significantly.
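For reference, the weighted F1-score used throughout this post can be reproduced by hand. The sketch below computes a one-vs-rest F1-score per class and averages the scores weighted by the number of true samples in each class; the helper `weighted_f1` and the toy labels are made up for illustration, and in practice `f1_score(..., average="weighted")` is used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class one-vs-rest F1, averaged with weights equal to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Classes with more true samples contribute more to the final score.
        score += f1 * n_true / total
    return score


# Toy labels, not taken from the dataset in this post.
y_true = [1, 1, 2, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3]
print(f"Weighted F1-score: {weighted_f1(y_true, y_pred):.3f}")
```

Because the weights follow class sizes, a large class dominates this metric, which is one reason to look at the confusion matrix alongside it.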

Conclusion

In these experiments, we reached a classification quality of 92% by the weighted F1-score. Many models now specialize in Natural Language Processing tasks, including ones based on deep learning, so the models considered here are not a panacea, and their results may not carry over to another dataset. Still, there is always room for improvement. One option is to keep experimenting with the combination of resampling method, vectorization method, and classification model.

It is also worth noting that class imbalance has a significant impact on the accuracy of the model, so I believe it is advisable to train only on balanced data.

Link to Github repository.
