A brief introduction to Rumale, a machine learning library for Ruby

Installation

Open Gemfile and add the line:

gem 'rumale'

Then run bundle install to install the library:

$ bundle install

If you want to install Rumale without Bundler, you can install it directly with gem install:

$ gem install rumale

Once the library is installed, require it in your project:

require 'rumale'

Building and training models in Rumale

We will load data using the Daru and RDatasets libraries.

Linear regression

Linear regression is the basic tool for predicting numerical values. Rumale provides it as the class Rumale::LinearModel::LinearRegression:

require 'daru'
require 'rumale'

# load the data set
data = Daru::DataFrame.from_csv('housing_prices.csv')
x = data['size'].to_a
y = data['price'].to_a

# convert the data into the format Rumale expects:
# x is a (n_samples, 1) matrix, y is a vector of length n_samples
x = Numo::DFloat.cast(x).reshape(x.size, 1)
y = Numo::DFloat.cast(y)

# build and train the linear regression model
model = Rumale::LinearModel::LinearRegression.new
model.fit(x, y)

# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"

House size and price data are loaded from a CSV file, converted into Numo arrays, and then used to train a linear regression model.
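To make it clearer what "fitting a line" means for a single feature, here is a minimal plain-Ruby sketch of ordinary least squares. The method name fit_line and the numbers are made up for illustration, and this is not necessarily how Rumale solves the problem internally.

```ruby
# Ordinary least squares for one feature, in plain Ruby.
# Illustrative only: the sizes and prices are hypothetical numbers,
# and fit_line is a helper invented for this sketch.
def fit_line(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  # slope = covariance(x, y) / variance(x)
  cov = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var = xs.sum { |x| (x - mean_x)**2 }
  slope = cov / var
  intercept = mean_y - slope * mean_x
  [slope, intercept]
end

sizes  = [50.0, 70.0, 90.0, 110.0]     # hypothetical house sizes
prices = [150.0, 200.0, 250.0, 300.0]  # hypothetical prices

slope, intercept = fit_line(sizes, prices)
puts "price = #{slope} * size + #{intercept}"
puts "Predicted price for size 100: #{slope * 100 + intercept}"
```

A fitted model's predict call does the same thing in the last line: it applies the learned slope and intercept to new inputs.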

Support Vector Machine (SVM)

Support Vector Machine is an algorithm for classification problems. In Rumale the linear variant is represented by the class Rumale::LinearModel::SVC:

require 'daru'
require 'rumale'
require 'rdatasets'

# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
# binary labels: 0 for setosa, 1 for everything else
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }

# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]

# build and train the SVM model
# (this class is a linear SVM, so it takes no kernel option)
model = Rumale::LinearModel::SVC.new(reg_param: 1.0)
model.fit(x, y)

# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"

The SVM model classifies flowers as setosa or not.
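Printing raw predictions says little about how good the classifier is. As a rough sketch, the share of correct answers can be computed in plain Ruby as below. The accuracy helper and the label arrays are hypothetical, and with Rumale output you would pass predicted.to_a and y.to_a.

```ruby
# Fraction of predictions that match the true labels.
# accuracy is a helper invented for this sketch.
def accuracy(y_true, y_pred)
  raise ArgumentError, 'length mismatch' unless y_true.size == y_pred.size

  correct = y_true.zip(y_pred).count { |truth, guess| truth == guess }
  correct.to_f / y_true.size
end

# Hypothetical labels: 0 = setosa, 1 = not setosa
true_labels = [0, 0, 1, 1, 1]
predictions = [0, 1, 1, 1, 1]
puts "Accuracy: #{accuracy(true_labels, predictions)}"
```

Keep in mind that a score measured on the same data the model was trained on tends to be optimistic.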

Clustering using K-Means

K-Means is a clustering algorithm that groups data points by similarity. Rumale provides it as the class Rumale::Clustering::KMeans:

require 'daru'
require 'rumale'
require 'rdatasets'

# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix

# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]

# build and train the K-Means model
model = Rumale::Clustering::KMeans.new(n_clusters: 3, max_iter: 300)
model.fit(x)

# assign each sample to a cluster
labels = model.predict(x)
puts "Clusters: #{labels.to_a}"

Here the Iris data is clustered into three groups with K-Means.
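For intuition about what happens inside fit, here is a minimal plain-Ruby sketch of the classic K-Means loop (Lloyd's algorithm): assign each point to the nearest centroid, then move each centroid to the mean of its points. The points, the starting centroids and the helper names are made up, and Rumale's own implementation differs in details such as initialization.

```ruby
# A minimal K-Means (Lloyd's algorithm) sketch in plain Ruby.
# Illustrative only: all names and data here are hypothetical.

def squared_distance(a, b)
  a.zip(b).sum { |u, v| (u - v)**2 }
end

# Index of the centroid closest to the point.
def nearest(point, centroids)
  centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
end

def kmeans(points, centroids, max_iter: 100)
  labels = []
  max_iter.times do
    # Assignment step: label each point with its nearest centroid.
    new_labels = points.map { |pt| nearest(pt, centroids) }
    break if new_labels == labels # converged: nothing changed

    labels = new_labels
    # Update step: each centroid becomes the mean of its members.
    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| labels[j] == i }.map { |j| points[j] }
      next centroids[i] if members.empty? # keep an empty cluster's centroid

      members.transpose.map { |col| col.sum / col.size }
    end
  end
  [labels, centroids]
end

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
labels, centroids = kmeans(points, [[0.0, 0.0], [6.0, 6.0]])
puts "Labels: #{labels.inspect}"
puts "Centroids: #{centroids.inspect}"
```

The n_clusters and max_iter options in the Rumale call correspond to the number of centroids and the cap on loop iterations in this sketch.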

Other algorithms

Random Forest:

require 'daru'
require 'rumale'
require 'rdatasets'

# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
# binary labels: 0 for setosa, 1 for everything else
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }

# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]

# build and train the Random Forest model
model = Rumale::Ensemble::RandomForestClassifier.new(n_estimators: 10, max_depth: 3)
model.fit(x, y)

# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"

Gradient Boosting:

require 'daru'
require 'rumale'
require 'rdatasets'

# load the Iris data set
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
# binary labels: 0 for setosa, 1 for everything else
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }

# convert the data to Numo::NArray
x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]

# build and train the Gradient Boosting model
model = Rumale::Ensemble::GradientBoostingClassifier.new(n_estimators: 100, learning_rate: 0.1, max_depth: 3)
model.fit(x, y)

# predict (here on the training data itself)
predicted = model.predict(x)
puts "Predicted values: #{predicted.to_a}"

Model evaluation and validation

Metrics for assessing the quality of models

Mean Squared Error (MSE): the average of the squared differences between predicted and actual values. Lower is better, and 0 means the predictions are exact:

require 'numo/narray'
require 'rumale'

# sample data
y_true = Numo::DFloat[3.0, -0.5, 2.0, 7.0]
y_pred = Numo::DFloat[2.5, 0.0, 2.0, 8.0]

# compute MSE
mse = Rumale::EvaluationMeasure::MeanSquaredError.new
mse_value = mse.score(y_true, y_pred)
puts "MSE: #{mse_value}"
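To check the number by hand, the same metric can be written out in plain Ruby. This is only a sketch of the formula applied to the sample values above, and mean_squared_error is a helper name invented for it.

```ruby
# MSE written out by hand: the mean of the squared differences.
def mean_squared_error(y_true, y_pred)
  squared_errors = y_true.zip(y_pred).map { |truth, guess| (truth - guess)**2 }
  squared_errors.sum / squared_errors.size.to_f
end

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# squared errors: 0.25, 0.25, 0.0, 1.0 -> mean 0.375
puts "MSE: #{mean_squared_error(y_true, y_pred)}"
```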

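Coefficient of determination (R²): measures the proportion of variance explained by the model. The value is at most 1, with 1 being a perfect fit. It can drop below 0 when the model predicts worse than simply returning the mean of the actual values:

# compute R² (y_true and y_pred come from the MSE example above)
r2 = Rumale::EvaluationMeasure::R2Score.new
r2_value = r2.score(y_true, y_pred)
puts "R²: #{r2_value}"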
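As a sketch of the formula behind this score, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The r_squared helper is a hypothetical name, and the second data set is made up to show what a poor model looks like.

```ruby
# R^2 = 1 - SS_res / SS_tot, written out in plain Ruby.
def r_squared(y_true, y_pred)
  mean = y_true.sum / y_true.size.to_f
  ss_res = y_true.zip(y_pred).sum { |truth, guess| (truth - guess)**2 }
  ss_tot = y_true.sum { |truth| (truth - mean)**2 }
  1.0 - ss_res / ss_tot
end

# Same sample values as in the MSE example: a good fit, close to 1.
puts "R2 (good fit): #{r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]).round(4)}"

# Hypothetical predictions that run opposite to the data:
# worse than predicting the mean, so the score is negative.
puts "R2 (poor fit): #{r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])}"
```

The second call shows that the score is not confined to the range from 0 to 1: a model that does worse than a constant mean prediction gets a negative value.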

Cross-validation

Cross-validation allows us to evaluate the generalization ability of a model. One of the most common methods is K-Fold cross-validation.

K-Fold cross-validation:

require 'rumale'
require 'daru'
require 'rdatasets'

# load the Iris data
iris = RDatasets.load(:datasets, :iris)
x = iris[0..3].to_matrix
# binary labels: 0 for setosa, 1 for everything else
y = iris['Species'].map { |species| species == 'setosa' ? 0 : 1 }

x = Numo::DFloat[*x.to_a]
y = Numo::Int32[*y]

# define the model
model = Rumale::LinearModel::LogisticRegression.new

# define the evaluation metric
# (on 0/1 labels, MSE equals the share of misclassified samples)
mse = Rumale::EvaluationMeasure::MeanSquaredError.new

# set up K-Fold cross-validation
kf = Rumale::ModelSelection::KFold.new(n_splits: 5, shuffle: true, random_seed: 1)

# run the cross-validation
cv = Rumale::ModelSelection::CrossValidation.new(estimator: model, splitter: kf, evaluator: mse)
report = cv.perform(x, y)

# print the results
mean_score = report[:test_score].sum / kf.n_splits
puts "5-CV MSE: #{mean_score}"
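To see what the splitter does conceptually, here is a minimal plain-Ruby sketch of K-Fold index splitting. The k_fold_indices helper is a hypothetical name, and the shuffling will not reproduce the exact folds Rumale generates for the same seed.

```ruby
# Split the sample indices 0...n_samples into n_splits folds and
# return one [train_indices, test_indices] pair per fold.
# Illustrative only: k_fold_indices is invented for this sketch.
def k_fold_indices(n_samples, n_splits, seed: 1)
  indices = (0...n_samples).to_a.shuffle(random: Random.new(seed))
  fold_size, remainder = n_samples.divmod(n_splits)

  folds = []
  start = 0
  n_splits.times do |i|
    # The first `remainder` folds get one extra sample.
    size = fold_size + (i < remainder ? 1 : 0)
    folds << indices[start, size]
    start += size
  end

  # Each fold is the test set once; the rest is the training set.
  folds.map { |test| [(indices - test).sort, test.sort] }
end

k_fold_indices(10, 5).each_with_index do |(train, test), i|
  puts "Fold #{i + 1}: test=#{test.inspect} train=#{train.inspect}"
end
```

Every sample ends up in a test set exactly once, which is why the averaged score is a fairer estimate than a score measured on the training data.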

After cross-validation or any other evaluation, it is important to interpret the results correctly.

Mean and standard deviation: these indicators give an idea of the stability and reliability of the model. For example, a low mean error combined with a low standard deviation indicates a stable and accurate model:

# report[:test_score] is a plain Ruby Array, which has no mean or std
# methods, so convert it to a Numo array first
scores = Numo::DFloat.cast(report[:test_score])
mean_score = scores.mean
std_score = scores.stddev
puts "Mean MSE: #{mean_score}, Standard Deviation: #{std_score}"

You can also use gnuplot to visualize the scores and see how the model performs on the different folds:

require 'gnuplot'

Gnuplot.open do |gp|
  Gnuplot::Plot.new(gp) do |plot|
    plot.title "K-Fold Cross Validation Scores"
    plot.ylabel "MSE"
    plot.xlabel "Fold"

    plot.data << Gnuplot::DataSet.new(report[:test_score]) do |ds|
      ds.with = "linespoints"
      ds.title = "Fold MSE"
    end
  end
end

You can learn more about this wonderful library here.

You can also explore other tools and libraries in the practical online courses run by my colleagues at OTUS.
