Predicting the quality of extraction of iron oxide from ore using machine learning


Link to the GitHub repository.

About myself

Hello, my name is Ydyrys Olzhas. I am a 3rd-year student at the National Research Technological University “MISIS”, specializing in metallurgy, and in my free time I also study Data Science. I implemented this project to show how effectively machine learning methods can be applied to optimize and improve metallurgical processes. Let me start with a brief theoretical introduction.

Note: It is not always easy to find datasets from real manufacturing plants, especially metallurgical ones. I hope this article will help develop this narrow area and that large enterprises will begin to publish data for educational practice.


Iron ores are rocks and minerals from which metallic iron can be economically extracted. Silica is the main impurity in iron ore: a high silica content can produce a large volume of slag, which in turn leads to environmental pollution. By predicting the impurity content of the ore, we can help plant engineers make the necessary calculations in the early stages of production.

Predicting silica normally involves many chemical analyses that are time-consuming and expensive. Using ML models simplifies the process, solving these problems in one fell swoop.


The data was obtained from the website Kaggle.

# Print information about the dataframe
df.info()

The second and third columns are the quality indicators of the iron ore slurry immediately before it is fed to the flotation plant. Columns four through eight are the most important variables affecting the quality of the product at the end of the process. From column 9 to column 22 we see the process data (level and air flow inside the flotation columns), which also affect the quality of the process. The last two columns are the final laboratory measurements of the quality of the iron ore slurry. The goal is to predict the last column, which represents the percentage of silica in the iron ore concentrate.
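To make the prediction target explicit, the split into features and target can be sketched as follows. This is an illustrative sketch with a tiny hand-made frame: the column names follow the layout of the Kaggle dataset (the real file would be loaded with pandas.read_csv, using `decimal=","` since the source uses comma decimal separators), and the mini-frame here only mimics its structure.

```python
import pandas as pd

# Hypothetical mini-frame mimicking the real dataset's layout; in practice the
# data would be loaded from the Kaggle CSV, e.g.
# df = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv", decimal=",")
df = pd.DataFrame({
    "% Iron Feed": [55.2, 56.1],
    "% Silica Feed": [16.9, 15.4],
    "% Iron Concentrate": [66.9, 67.2],
    "% Silica Concentrate": [1.31, 1.09],
})

# Target: the last column, the percentage of silica in the concentrate.
y = df["% Silica Concentrate"]
# Features: everything except the two final laboratory measurements,
# which are only known after the process has finished.
X = df.drop(columns=["% Iron Concentrate", "% Silica Concentrate"])
print(X.shape, y.shape)
```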

Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their key characteristics, often using visual methods.

# Print summary statistics
df.describe()

First of all, I looked at the summary statistics; for this I applied the df.describe() method. From the output we can see that after flotation the maximum percentage of silica is 5.5% and the minimum is 0.6%, while the percentage of iron concentrate is in the range of 62–68%.

Column histogram

With the help of histograms, we can see the statistical information more clearly.

df.hist(figsize=(20, 20))

Correlation matrix

The correlation coefficient reflects the degree of relationship between two variables. From this diagram we can conclude that there is a relationship between the iron feed and the silica feed, and also between the silica concentrate and the iron concentrate.

sns.heatmap(df.corr(), annot=True)

Building and evaluating the model

In this practical project, I will use LightGBM and Optuna for better model performance. I also use cross-validation to evaluate the model, which helps avoid overfitting.


LightGBM is a framework that provides an implementation of gradient boosted decision trees. It was created by a group of Microsoft researchers and developers. Main advantages:

  • Faster training speed and higher efficiency.

  • Less memory usage.

  • Higher accuracy.

  • Support for parallel, distributed and GPU learning.

  • Ability to work with large amounts of data.


Optuna is an automatic hyperparameter optimization framework designed specifically for machine learning. Its main advantages:

  • Scalability

  • Parallelization of computation

  • Quick visualization of optimization results

from optuna.integration import LightGBMPruningCallback
import optuna
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import lightgbm as lgbm

EPS = 1e-8

# Objective function for Optuna
def objective(trial, X, y):
    # Search space for the training parameters
    param_grid = {
        "verbosity": -1,
        "boosting_type": "gbdt",
        "n_estimators": trial.suggest_categorical("n_estimators", [10000]),
        "learning_rate": trial.suggest_categorical("learning_rate", [0.0125, 0.025, 0.05, 0.1]),
        "num_leaves": trial.suggest_int("num_leaves", 2, 2048),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15),
        "bagging_fraction": min(trial.suggest_float("bagging_fraction", 0.3, 1.0 + EPS), 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "feature_fraction": min(trial.suggest_float("feature_fraction", 0.3, 1.0 + EPS), 1.0),
        "feature_pre_filter": False,
        "extra_trees": trial.suggest_categorical("extra_trees", [True, False]),
    }
    # Cross-validation
    cv = KFold(n_splits=5, shuffle=True)
    # Array where we store the fold scores
    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        # Create a LightGBM regression model
        model = lgbm.LGBMRegressor(**param_grid)
        # Train the model, letting Optuna prune unpromising trials early
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_test, y_test)],
            eval_metric="rmse",
            callbacks=[LightGBMPruningCallback(trial, "rmse")],
        )
        preds = model.predict(X_test)
        # Store the fold's RMSE in the array
        cv_scores[idx] = np.sqrt(mean_squared_error(y_test, preds))

    return np.mean(cv_scores)  # Return the mean score over all folds

# Create a new study
study = optuna.create_study(direction="minimize", study_name="LGBM Regressor")
func = lambda trial: objective(trial, X, y)
# Run optimization of the objective function
study.optimize(func, n_trials=20)

To evaluate the model, I use the RMSE (Root Mean Squared Error) metric.

For each point, the squared difference between the prediction and the target is calculated; these values are then averaged and the square root is taken. The higher this value, the worse the model.
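The metric described above can be sketched in a few lines of NumPy (the helper name `rmse` is mine, for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square the errors, average, take the root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# One error of 2 over three points: sqrt(4 / 3) ≈ 1.1547
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```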

print(f"\tBest value (RMSE): {study.best_value:.5f}")
print("\tBest parameters:")
for key, value in study.best_params.items():
    print(f"\t\t{key}: {value}")

		Best value (RMSE): 0.01053
		Best parameters:
				n_estimators: 10000
				learning_rate: 0.025
				num_leaves: 628
				max_depth: 11
				min_data_in_leaf: 1
				lambda_l1: 1.970304366797382e-06
				lambda_l2: 3.183217431386711e-08
				min_gain_to_split: 0.06980772043041306
				bagging_fraction: 0.9383496311685677
				bagging_freq: 7
				feature_fraction: 0.978126829339409
				extra_trees: False


Based on this research, we can see how effective machine learning methods are compared to laboratory analyses. After about an hour spent writing code and training the model, we obtain a very low prediction error (RMSE ≈ 0.011), which corresponds to the roughly 99% prediction accuracy claimed above.

