Tabular Classification and Regression in Apple Create ML


One of the most remarkable and important features of neural networks is the ability to work with tabular data. In principle, this follows directly from their nature; in the context of machine learning, however, this property is less obvious. At the same time, it reveals the potential of artificial intelligence in mobile applications: a neural network trained on a large amount of data can make predictions that are very close to reality.

Apple provides tools for quickly creating and effectively using elements of artificial intelligence in your applications.

In the overview article (https://habr.com/ru/post/711400), we looked at the templates included in the “Create ML” application that ships with Xcode. Here we will focus on two of them: “Tabular Classification” and “Tabular Regression”.

The first template lets you predict an event; the second lets you obtain a value on a graph (if you plot the tabular data) in the vicinity of a point that was not itself plotted.

Here you need to understand one very important difference. A point can be placed on a graph for any abstract data, simply by following the behavior of the curve. Predictions, however, should only be made about events that have causal connections, even if those connections are not obvious. For example, it is possible to predict the accident-free operation of an airport from the average daily temperature (because temperature affects flocks of migratory birds and runway icing), but not the probability of drawing a trump card from the color and direction of movement of a crocodile. Even if you obtain a result, it will not make any sense. At the same time, it is easy to relate the characteristics of a crocodile to a card combination on a chart, and just as easy to find points to the left and to the right of any arbitrarily chosen value.

Data on the passengers of the Titanic was used as the source data. This information is reliable and well studied, so the resulting model can be checked against it. The source is the Kaggle website, https://www.kaggle.com; the data files can be downloaded from the following page: https://www.kaggle.com/competitions/titanic-dataset/data

The general sequence of working with the model is as follows:

  1. Data preparation.

  2. Data separation.

  3. Loading data into the “Create ML” application.

  4. Conducting model training.

  5. Exporting the model from the application.

  6. Creating an iOS / macOS application with an imported model.

  7. Implementation of forecasts for the entered parameters.

Data preparation.

The data downloaded from the site is not used as-is, because the Create ML application does not handle empty (null) values properly. Before using the files, you therefore need, first, to remove the extra fields and, second, to replace empty fields with the value “0”. The data files are comma-delimited CSV files, so they can easily be opened and processed in any spreadsheet editor.

We used the following set of fields for analysis:

  • pclass – passenger class

  • sex – gender

  • age – age

  • survived – a flag indicating that the passenger survived the sinking of the liner.

The remaining fields may also help to increase the accuracy of the forecast, but they are redundant for a demonstration.

Initial data in a spreadsheet editor
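
The same preparation can also be scripted instead of done by hand in a spreadsheet. The following is only a minimal sketch under stated assumptions: the helper names `splitCSVLine` and `cleanCSV` are hypothetical, the header is matched case-insensitively, and the two sample rows are illustrative. It keeps the selected columns and writes “0” into empty cells, as described above. Note that a missing age then becomes indistinguishable from an age of 0.

```swift
import Foundation

/// Splits one CSV line into fields, honouring double-quoted fields
/// (passenger names contain commas). Quote characters are dropped.
func splitCSVLine(_ line: String) -> [String] {
    var fields: [String] = []
    var current = ""
    var insideQuotes = false
    for character in line {
        if character == "\"" {
            insideQuotes.toggle()
        } else if character == "," && !insideQuotes {
            fields.append(current)
            current = ""
        } else {
            current.append(character)
        }
    }
    fields.append(current)
    return fields
}

/// Keeps only `columns` (matched case-insensitively against the header)
/// and replaces empty cells with `placeholder`.
func cleanCSV(_ text: String, keeping columns: [String], placeholder: String = "0") -> String {
    let lines = text.split(whereSeparator: \.isNewline).map(String.init)
    guard let headerLine = lines.first else { return "" }
    let header = splitCSVLine(headerLine).map { $0.lowercased() }

    // Position of every requested column in the source file; unknown names are skipped.
    let wanted: [(name: String, index: Int)] = columns.compactMap { name in
        guard let index = header.firstIndex(of: name.lowercased()) else { return nil }
        return (name, index)
    }

    var output = [wanted.map { $0.name }.joined(separator: ",")]
    for line in lines.dropFirst() {
        let fields = splitCSVLine(line)
        let row = wanted.map { column -> String in
            let raw = column.index < fields.count ? fields[column.index] : ""
            let value = raw.trimmingCharacters(in: .whitespaces)
            return value.isEmpty ? placeholder : value
        }
        output.append(row.joined(separator: ","))
    }
    return output.joined(separator: "\n")
}

// Illustrative input: the second passenger has no age.
let sample = """
PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22
6,0,3,"Moran, Mr. James",male,
"""
let cleaned = cleanCSV(sample, keeping: ["pclass", "sex", "age", "survived"])
print(cleaned)
```

For the two sample rows, the result contains the header `pclass,sex,age,survived` followed by `3,male,22,0` and `3,male,0,0`.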

Data separation.

The training of the model takes place in several stages. To improve the quality of the prediction, part of the data is used as validation data, and after the model is created, another part is used for testing. Accordingly, the data is usually divided in a ratio of 60 / 20 / 20. Both the validation data and the testing data can be selected by the “Create ML” application itself. In the case of a tabular model, however, training may not start unless you explicitly specify a file for testing. To simplify the task somewhat, we selected 53 records from the main file and saved them in a separate file, leaving 797 records in the main file. (Incidentally, the data package downloaded from the site includes testing data as a separate file, but that file would have to be prepared separately, which is somewhat redundant for a demonstration.)
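
Setting aside the testing records can be sketched in a few lines. This is an illustration rather than the procedure used for the article: the names `SeededGenerator` and `splitRows` are hypothetical, the rows are placeholder strings, and a small seeded generator (SplitMix64) is used only so that the split is reproducible.

```swift
/// Small deterministic generator (SplitMix64) so the split can be repeated.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Shuffles the data rows and moves `testCount` of them into a separate set.
func splitRows(_ rows: [String], testCount: Int, seed: UInt64)
    -> (training: [String], testing: [String]) {
    var generator = SeededGenerator(seed: seed)
    let shuffled = rows.shuffled(using: &generator)
    let count = min(max(testCount, 0), shuffled.count)
    return (Array(shuffled.dropFirst(count)), Array(shuffled.prefix(count)))
}

// Placeholder rows standing in for the 850 prepared records.
let rows = (1...850).map { "row \($0)" }
let parts = splitRows(rows, testCount: 53, seed: 42)
print(parts.training.count, parts.testing.count)   // prints "797 53"
```

Each part would then be written to its own CSV file together with the header line, since Create ML expects both files in the same format.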

Loading data

After you select the “Tabular Classification” template, a model source (a local database) named “MyTabularClassifier 1” is created inside the application. The name does not commit you to anything, and you can rename it right away to whatever you see fit. The model you create will be exported under this name.

Here you can also add a few more variants of the model. Training on tabular data is quite fast, so this opportunity should not be neglected: several training algorithms are available, and it is not known in advance which one will give the most accurate results.

Click the blue plus sign in the “Training Data” section and select the source data file. After that, you can set the target field to predict in the “Target” section; in our case, this is “survived”. The fields used for the forecast (pclass, sex, age) are set in the “Features” section. Keep in mind that if you select all the fields in the “Features” section, an error may occur during training, since not all of them are filled in. We therefore recommend limiting yourself to the required minimum.

Loading data to train the model

The “Validation Data” section can be left at “Automatic”, but in the “Testing Data” section you need to add the file created at the data separation stage.

It remains to set the training parameters in the “Parameters” section. As before, you can leave everything as it is, but this is precisely the case for which it is worth creating several model sources (local databases) in the application.

Choosing an Algorithm for Model Training

Training

Training can be started once all the necessary settings from the previous step have been made. If the “Train” button is inactive, most likely something has been left out.

After you click the “Train” button, the application automatically switches first to the “Training” tab and then to the “Evaluation” tab. The whole process takes from 1 to 5 seconds. However, this applies only to tabular data. If we are talking about, for example, segmenting photos, training can take hours or even days with the processor fully loaded (the fan will not let you forget that the process is still running).

On the “Evaluation” tab, you can assess the accuracy of the created model. We also recommend returning to the “Settings” tab, which has changed: in the “Parameters” section, you can now see the initial settings with which training was started.

Preset parameters for automatic mode

Once training is complete, these parameters can no longer be changed. You can, however, create another model source, choose a different training algorithm, and set the appropriate parameters. Keep in mind that unwisely chosen constants can spoil the model, for example by overfitting it; in the worst case, you will get the same value for any set of input parameters. For example, “42”.

On the “Preview” tab, you can test your model. To do this, you need a fresh data file in exactly the same format as the one used for training. You can also load the testing file here to see how this mechanism works. At the very bottom there is a row-by-row navigator that lets you move between the records in the file. For example, for the passenger with ID 465, the probability of survival is 90%.

In-App Prediction

On the “Output” tab, the “Prediction” subtab shows the names and types of the model's input and output parameters. These matter when you use the model in your mobile application, since the names and types become properties (fields) of the model's data class.

Input and output parameters

Export

The created model is exported by clicking the “Get” button on the “Output” tab (just below and to the right of the screen title).

As a result, the file “MyTabularClassifier 1.mlmodel” is saved to the file system. The model file is only 14 KB, several times smaller than the file used to train the model.

This file can be opened right away in Xcode. Xcode lets you change some attributes (author, description, license, version) and encrypt the model with the development team's private key (for this, you must be a member of the Apple Developer Program and be signed in to iCloud).

Creating an iOS / macOS application with an imported model.

The application that uses the previously created model is extremely simple; in fact, it can be an ordinary console application. To avoid entering parameters by hand, we decided to make predictions for every gender and age category in every passenger class. For ages from 0 to 70, this gives 3 × 2 × 71 = 426 values in total, which is not much at all.

Most of the source code consists of console output statements.

    import CoreML

    func startProbability() {
        Task {
            let genders = ["male", "female"]
            let config = MLModelConfiguration()
            // TitanicProbability is the class Xcode generates from the imported model file.
            guard let net = try? TitanicProbability(configuration: config) else {
                print("Failed to load the TitanicProbability model")
                return
            }

            print("=======================================")
            print("STARTED")
            print("=======================================")
            var index = 0
            // Enumerate every combination: 3 classes x 2 genders x 71 ages = 426 rows.
            for pclass in 1 ... 3 {
                for gender in genders {
                    for age in 0 ... 70 {
                        let input = TitanicProbabilityInput(pclass: Double(pclass), sex: gender, age: Double(age))
                        do {
                            let result = try net.prediction(input: input)
                            index += 1
                            // survived is the predicted label; survivedProbability holds the per-label probabilities.
                            print("\(String(format: "%03d", index))): cl: \(pclass), \(gender == "male" ? "M" : "F" ), \(age), survived \(result.survived), probability \(result.survivedProbability)")
                        } catch {
                            print("EX: class: \(pclass), gender: \(gender), age:\(age) - \(error.localizedDescription)")
                        }
                    }
                }
            }
            print("=======================================")
        }
    }

When the model was exported, the original file was renamed to “TitanicProbability” and added to the project. Accordingly, the same name, “TitanicProbability”, is used in the source code to access the model. Xcode automatically generates a class for the model file added to the project, built on the Core ML framework (not Vision). From then on, you work with it as with any other class: create an instance, set properties, and call the necessary methods. In our case, this is the prediction method, which takes an instance of the input parameters.

Forecasting.

Since we do not have to enter the input parameters manually through a UI, the whole prediction comes down to three nested loops. As a result, we get in the console a complete table of parameter combinations with a forecast for each combination.

Model Predictions

If there were more input parameters, we would not be able to display all the forecast variants so easily. This brings us smoothly to the idea of OLAP analysis, and to the question of whether Apple will create an appropriate tool, at least in the distant future, or whether developers will once again have to take everything into their own hands.
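
To give a rough idea of what “taking it into your own hands” could look like, the forecast table can be rolled up along chosen dimensions. The sketch below averages the predicted probability per passenger class and gender. The types `Forecast` and `GroupKey` and the function `averageByClassAndSex` are hypothetical, and the probabilities in the usage example are made-up numbers, not output of the model.

```swift
/// One row of the forecast table produced by the nested loops.
struct Forecast {
    let pclass: Int
    let sex: String
    let age: Int
    let probability: Double   // survival probability reported by the model
}

/// Dimensions kept after the roll-up (age is aggregated away).
struct GroupKey: Hashable {
    let pclass: Int
    let sex: String
}

/// Averages the predicted probability for every (class, gender) pair.
func averageByClassAndSex(_ forecasts: [Forecast]) -> [GroupKey: Double] {
    let groups = Dictionary(grouping: forecasts) {
        GroupKey(pclass: $0.pclass, sex: $0.sex)
    }
    return groups.mapValues { items in
        items.map(\.probability).reduce(0, +) / Double(items.count)
    }
}

// Made-up values for illustration only.
let demo = [
    Forecast(pclass: 1, sex: "female", age: 30, probability: 0.75),
    Forecast(pclass: 1, sex: "female", age: 40, probability: 0.25),
    Forecast(pclass: 3, sex: "male", age: 30, probability: 0.5),
]
let summary = averageByClassAndSex(demo)
for (key, mean) in summary {
    print("class \(key.pclass), \(key.sex): \(mean)")
}
```

Grouping by a different key (for example, by age band) gives another slice of the same table without running the model again.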

Regression

Regression analysis repeats almost exactly the same steps as tabular classification. The only difference is that there is a single target (output) parameter, survived, and it is expressed as a real number, here in the range from -1 to 1. Where does a negative number come from? The survival probability itself, of course, lies in the range from 0 to 1. However, since the training and testing data do not contain values for every combination of age and sex, extrapolating the curve beyond the known points can take the result outside that range.
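
If the regressor output is to be read as a probability, one simple option is to clamp it to the interval from 0 to 1 before displaying it. This is an assumption about how the value might be post-processed, not something the generated model class does; the function name `clampedProbability` is hypothetical.

```swift
/// Limits a raw regressor output to the closed interval [0, 1].
func clampedProbability(_ raw: Double) -> Double {
    return min(max(raw, 0.0), 1.0)
}

print(clampedProbability(-0.12))  // 0.0
print(clampedProbability(0.37))   // 0.37
print(clampedProbability(1.4))    // 1.0
```

Clamping only hides the extrapolation; values far outside the range are still a sign that the model is being asked about inputs it has not seen.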

    import CoreML

    func startRegression() {
        Task {
            let genders = ["male", "female"]
            let config = MLModelConfiguration()
            // TitanicRegressor is the class Xcode generates from the exported regression model.
            guard let net = try? TitanicRegressor(configuration: config) else {
                print("Failed to load the TitanicRegressor model")
                return
            }

            print("=======================================")
            print("STARTED")
            print("=======================================")
            var index = 0
            // Same grid as for classification: 3 classes x 2 genders x 71 ages.
            for pclass in 1 ... 3 {
                for gender in genders {
                    for age in 0 ... 70 {
                        let input = TitanicRegressorInput(pclass: Double(pclass), sex: gender, age: Double(age))
                        do {
                            let result = try net.prediction(input: input)
                            index += 1
                            // survived is a real number here and may fall outside 0...1.
                            print("\(String(format: "%03d", index))): class: \(pclass), gender: \(gender), age:\(age), survived \(result.survived)")
                        } catch {
                            print("EX: class: \(pclass), gender: \(gender), age:\(age) - \(error.localizedDescription)")
                        }
                    }
                }
            }
            print("=======================================")
        }
    }
Regression
