A brief look at Rumale, a machine learning library for Ruby
Installation
Open your Gemfile and add the line:
gem 'rumale'
Then run bundle install to install the library:
$ bundle install
To install Rumale without Bundler, use gem install directly:
$ gem install rumale
After installation, load the library in your project:
require 'rumale'
Building and training models in Rumale
We will load data using the Daru and RDatasets libraries.
Linear regression
Linear regression is the basic method for predicting numerical values. Rumale provides the Rumale::LinearModel::LinearRegression class for it:
require 'daru'
require 'rumale'
# load the data set
data = Daru::DataFrame.from_csv('housing_prices.csv')
sizes = data['size'].to_a
prices = data['price'].to_a
# convert the data to the format Rumale expects:
# x is an (n_samples, 1) matrix, y is a vector of length n_samples
x = Numo::DFloat.cast(sizes).reshape(sizes.size, 1)
y = Numo::DFloat.cast(prices)
# build and train the linear regression model
model = Rumale::LinearModel::LinearRegression.new
model.fit(x, y)
# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"
House size and price data is loaded from a CSV file, converted into Numo arrays, and then used to train a linear regression model.
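To make it more concrete what fitting and predicting mean for a single feature, here is a minimal plain-Ruby sketch of ordinary least squares with one input. It does not use Rumale and is not meant to reflect how Rumale is implemented internally; the sizes and prices are made-up numbers, and fit_line is a hypothetical helper name.

```ruby
# Ordinary least squares for one feature: price ≈ slope * size + intercept.
# The data is hypothetical and chosen to lie exactly on the line price = 3 * size.
sizes  = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

# Returns [slope, intercept] of the least-squares line through (xs, ys).
def fit_line(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  # slope = covariance(x, y) / variance(x)
  cov = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var = xs.sum { |x| (x - mean_x)**2 }
  slope = cov / var
  [slope, mean_y - slope * mean_x]
end

slope, intercept = fit_line(sizes, prices)
predict = ->(size) { slope * size + intercept }

puts "slope: #{slope}, intercept: #{intercept}"
puts "predicted price for size 70: #{predict.call(70.0)}"
```

With several features the same idea is expressed with matrices, which is why Rumale expects x as a two-dimensional array even when there is only one column.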
Support Vector Machine (SVM)
Support Vector Machine is an algorithm for classification problems. In Rumale, the linear variant is provided by the class Rumale::LinearModel::SVC:
require 'daru'
require 'rumale'
require 'rdatasets'
# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }
# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]
# build and train the SVM model
# (Rumale::LinearModel::SVC is always linear, so it takes no kernel: argument)
model = Rumale::LinearModel::SVC.new(reg_param: 1.0)
model.fit(x, y)
# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"
The SVM model classifies flowers as setosa or not setosa.
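As a rough illustration of what a trained linear classifier does at prediction time, the sketch below labels a sample by the sign of a weighted sum of its features. It is plain Ruby, not Rumale; the weights, the bias, and the helper names are hypothetical and were picked by hand, not learned from data.

```ruby
# A linear classifier labels a sample by the sign of (w . x + b).
# The weights and bias below are hypothetical, not values learned by Rumale.
weights = [0.0, 0.0, 1.0, 0.0] # only petal length contributes
bias = -2.5

# Weighted sum of the features plus the bias term.
def decision_value(features, weights, bias)
  features.zip(weights).sum { |f, w| f * w } + bias
end

# 0 = setosa, 1 = not setosa, matching the label encoding used above.
def predict_label(features, weights, bias)
  decision_value(features, weights, bias) >= 0 ? 1 : 0
end

# Example rows: sepal length, sepal width, petal length, petal width (cm).
samples = [
  [5.1, 3.5, 1.4, 0.2], # setosa
  [7.0, 3.2, 4.7, 1.4], # versicolor
  [6.3, 3.3, 6.0, 2.5]  # virginica
]
expected = [0, 1, 1]

predicted = samples.map { |row| predict_label(row, weights, bias) }
accuracy = predicted.zip(expected).count { |pr, ex| pr == ex } / expected.size.to_f

puts "predicted: #{predicted.inspect}"
puts "accuracy: #{accuracy}"
```

Training an SVM amounts to choosing the weights and bias so that the boundary separates the classes with the widest possible margin; the prediction step stays this simple.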
Clustering using K-Means
K-Means is a clustering algorithm that groups data based on similarity. Rumale provides the class Rumale::Clustering::KMeans:
require 'daru'
require 'rumale'
require 'rdatasets'
# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
# build and train the K-Means model
model = Rumale::Clustering::KMeans.new(n_clusters: 3, max_iter: 300)
model.fit(x)
# assign each sample to a cluster
labels = model.predict(x)
puts "Clusters: #{labels.to_a}"
Here the Iris data is clustered into three groups with K-Means.
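The idea behind K-Means can be sketched in a few lines of plain Ruby: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignment stops changing. The sketch below is an illustration under simplifying assumptions (tiny made-up 2-D data, fixed starting centroids, hypothetical helper names) and is not Rumale's implementation.

```ruby
# Lloyd's algorithm on a tiny hypothetical 2-D data set.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]] # fixed start so the run is reproducible

def squared_distance(a, b)
  a.zip(b).sum { |p, q| (p - q)**2 }
end

# Assignment step: index of the nearest centroid for every point.
def assign(points, centroids)
  points.map do |pt|
    centroids.each_index.min_by { |k| squared_distance(pt, centroids[k]) }
  end
end

# Update step: move each centroid to the mean of the points assigned to it.
def update(points, labels, centroids)
  centroids.each_index.map do |k|
    members = points.each_index.select { |i| labels[i] == k }.map { |i| points[i] }
    next centroids[k] if members.empty? # leave an empty cluster's centroid in place
    members.transpose.map { |coords| coords.sum / coords.size }
  end
end

labels = nil
10.times do
  new_labels = assign(points, centroids)
  break if new_labels == labels # converged: assignments no longer change
  labels = new_labels
  centroids = update(points, labels, centroids)
end

puts "labels: #{labels.inspect}"
puts "centroids: #{centroids.inspect}"
```

The cap of 10 iterations plays the same role as the max_iter parameter in the Rumale call: it bounds the work if the assignments keep changing.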
Other algorithms
Random Forest:
require 'daru'
require 'rumale'
require 'rdatasets'
# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }
# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]
# build and train the Random Forest model
model = Rumale::Ensemble::RandomForestClassifier.new(n_estimators: 10, max_depth: 3)
model.fit(x, y)
# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"
Gradient Boosting:
require 'daru'
require 'rumale'
require 'rdatasets'
# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }
# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]
# build and train the Gradient Boosting model
model = Rumale::Ensemble::GradientBoostingClassifier.new(n_estimators: 100, learning_rate: 0.1, max_depth: 3)
model.fit(x, y)
# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"
Model evaluation and validation
Metrics for assessing the quality of models
Mean squared error (MSE): the average of the squared differences between predicted and actual values:
require 'numo/narray'
require 'rumale'
# sample data
y_true = Numo::DFloat[3.0, -0.5, 2.0, 7.0]
y_pred = Numo::DFloat[2.5, 0.0, 2.0, 8.0]
# compute the MSE
mse = Rumale::EvaluationMeasure::MeanSquaredError.new
mse_value = mse.score(y_true, y_pred)
puts "MSE: #{mse_value}"
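For reference, the same quantity can be computed by hand in plain Ruby, which makes the definition explicit. This sketch uses the sample values from above and ordinary arrays instead of Numo; it is an illustration of the formula, not of Rumale's internals.

```ruby
# MSE by hand: average of the squared differences between truth and prediction.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

squared_errors = y_true.zip(y_pred).map { |actual, guess| (actual - guess)**2 }
mse = squared_errors.sum / squared_errors.size

puts "squared errors: #{squared_errors.inspect}" # [0.25, 0.25, 0.0, 1.0]
puts "MSE: #{mse}"                               # 0.375
```

Because the errors are squared, the single miss of 1.0 contributes four times as much as each miss of 0.5.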
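Coefficient of determination (R²): measures the proportion of variance explained by the model. A value of 1 is a perfect fit, 0 means the model does no better than always predicting the mean, and the value can be negative for a model that does worse than that: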
# compute R² (the Rumale class is called R2Score)
r2 = Rumale::EvaluationMeasure::R2Score.new
r2_value = r2.score(y_true, y_pred)
puts "R²: #{r2_value}"
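The definition behind this score is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A plain-Ruby sketch on the same sample values, without Rumale or Numo:

```ruby
# R² by hand: 1 - (residual sum of squares) / (total sum of squares).
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mean_true = y_true.sum / y_true.size
ss_res = y_true.zip(y_pred).sum { |actual, guess| (actual - guess)**2 }
ss_tot = y_true.sum { |actual| (actual - mean_true)**2 }
r2 = 1.0 - ss_res / ss_tot

puts "SS_res: #{ss_res}, SS_tot: #{ss_tot}"
puts format('R²: %.4f', r2) # about 0.9486
```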
Cross-validation
Cross-validation allows us to evaluate the generalization ability of a model. One of the most common methods is K-Fold cross-validation.
K-Fold cross-validation:
require 'rumale'
require 'daru'
require 'rdatasets'
# load the Iris data
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]
# define the model
model = Rumale::LinearModel::LogisticRegression.new
# define the evaluation metric
# (with 0/1 labels each squared error is 0 or 1, so this MSE is the
# misclassification rate; Rumale::EvaluationMeasure::Accuracy is the
# more conventional choice for a classifier)
mse = Rumale::EvaluationMeasure::MeanSquaredError.new
# set up K-Fold cross-validation
kf = Rumale::ModelSelection::KFold.new(n_splits: 5, shuffle: true, random_seed: 1)
# run the cross-validation
cv = Rumale::ModelSelection::CrossValidation.new(estimator: model, splitter: kf, evaluator: mse)
report = cv.perform(x, y)
# print the results
mean_score = report[:test_score].sum / kf.n_splits
puts "5-CV MSE: #{mean_score}"
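To show what a K-Fold splitter produces, here is a small plain-Ruby sketch that partitions sample indices into folds, with each fold serving as the test set exactly once. The function name k_fold_indices and its parameters are hypothetical, and the details (for example how leftover samples are distributed) may differ from Rumale's KFold.

```ruby
# Split the indices 0...n_samples into n_splits folds.
# Returns an array of [train_ids, test_ids] pairs, one pair per fold.
def k_fold_indices(n_samples, n_splits, seed: 1)
  indices = (0...n_samples).to_a.shuffle(random: Random.new(seed))
  folds = Array.new(n_splits) { [] }
  # deal the shuffled indices round-robin so fold sizes differ by at most one
  indices.each_with_index { |idx, pos| folds[pos % n_splits] << idx }
  folds.map do |test_ids|
    train_ids = indices - test_ids # everything not held out for testing
    [train_ids, test_ids]
  end
end

splits = k_fold_indices(10, 5)
splits.each_with_index do |(train_ids, test_ids), fold|
  puts "fold #{fold}: train=#{train_ids.sort.inspect} test=#{test_ids.sort.inspect}"
end
```

A cross-validation loop then trains on train_ids, scores on test_ids, and collects one score per fold, which is what report[:test_score] holds.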
After cross-validation or any other evaluation, the results still need to be interpreted correctly.
Mean and standard deviation: these indicators give an idea of the stability and reliability of the model. For example, a low mean error together with a low standard deviation across folds indicates a stable and accurate model:
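# report[:test_score] is a plain Ruby Array, which has no mean or std methods,
# so convert it to a Numo array first
scores = Numo::DFloat.cast(report[:test_score])
mean_score = scores.mean
std_score = scores.stddev
puts "Mean MSE: #{mean_score}, Standard Deviation: #{std_score}"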
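The two statistics are simple enough to compute by hand, which also makes the choice of divisor visible. The sketch below is plain Ruby on hypothetical fold scores (they stand in for report[:test_score] and are not real results); it shows both the population form (divide by n) and the sample form (divide by n - 1), which differ noticeably when there are only five folds.

```ruby
# Hypothetical per-fold MSE values, standing in for report[:test_score].
scores = [0.02, 0.0, 0.04, 0.02, 0.02]

mean = scores.sum / scores.size
squared_dev = scores.sum { |s| (s - mean)**2 }
population_std = Math.sqrt(squared_dev / scores.size)       # divides by n
sample_std     = Math.sqrt(squared_dev / (scores.size - 1)) # divides by n - 1

puts format('mean: %.4f', mean)
puts format('population std: %.4f', population_std)
puts format('sample std: %.4f', sample_std)
```

Whichever form is used, it is worth reporting it alongside the mean, since a standard deviation comparable to the mean itself suggests the score depends heavily on which samples land in the test fold.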
You can also use gnuplot to visualize the scores, which helps in understanding how the model performs on the different folds:
require 'gnuplot'
Gnuplot.open do |gp|
  Gnuplot::Plot.new(gp) do |plot|
    plot.title "K-Fold Cross Validation Scores"
    plot.ylabel "MSE"
    plot.xlabel "Fold"
    plot.data << Gnuplot::DataSet.new(report[:test_score]) do |ds|
      ds.with = "linespoints"
      ds.title = "Fold MSE"
    end
  end
end
You can learn more about this wonderful library here.
And you can always get acquainted with other tools and libraries in the practical online courses run by my colleagues at OTUS.