Time series processing and Bayesian models for handwriting recognition

Hello everyone! I recently completed the "Machine Learning Advanced" course.
As part of it, I worked on a time series project titled "Application of time series processing algorithms and Bayesian models for extracting symbols from the temporal data of a digital pen". I hope to apply the techniques I tried here in my main field: I usually work in computer vision, where I use neural networks for image processing.
I now work for a company that does automatic document analysis. At one point I was developing a handwriting generator and had to dive deep into the datasets and models that had been collected and built before me.
Digitized handwritten texts come in two types: offline and online. The first type, offline, is a scan or photograph of a manuscript: a raster image. Online manuscripts consist of points tracing the path of a digital pen during writing: a vector representation. Such datasets are created manually using styluses and graphics tablets, which also make it possible to measure the pen's acceleration and its angle relative to, say, the tablet or another writing surface.
However, there is an exception: the 'Stabilo-OnHW' dataset belongs to neither type. Its authors used a special pen. Although it writes on plain paper, it also carries dedicated sensors: two accelerometers, a magnetometer, a gyroscope, and a pressure sensor.
In this way, the pen's behavior is recorded while a person writes out various characters. Collect many such handwriting datasets, and you already have the basis for translating paper notes into digital form on the fly, without any scanning or image recognition.
119 writers were asked to write Latin letters, both lowercase and uppercase, with this pen; a total of 31,275 characters were collected. The sensors recorded readings at a frequency of 200 Hz. The accelerometers, gyroscope, and magnetometer each measure along three axes (x, y, z), and the pressing force was recorded as well. Thus, each character corresponds to 13 time series.
As the figure above shows, the sensor readings change over time. My hypothesis: for the same letter the measurements will behave similarly, and for different letters differently. This is hard to verify by hand, analyzing each letter across the entire dataset, so let's try to build a letter recognizer on top of the sensor measurements instead.
The dataset's authors suggest doing this for two different splits: writer-dependent (WD) and writer-independent (WI). In the first case, a separate classifier is built for each writer. WI is harder: recognition must be performed on the pooled data, without using any information about the writer. The model here will therefore be built for the second split.
Lowercase and uppercase letters are separated into distinct samples; there is also a variant with the letters mixed. This post considers all three options. Let's plot the measurements for a few letters from several writers.
Sensor readings for different letters and for different writers. Selected letters ‘a’, ‘R’ and ‘C’
The figure above shows the sensor readings for the letters 'a', 'R' and 'C' from two randomly selected writers. The pressing-force curves behave similarly for the same letter, and the gyroscope measurements show similar behavior as well.
Presumably, a recognizer can be built from this data.
As mentioned earlier, the sensors take measurements at a fixed frequency, i.e. at predetermined time intervals. Different people write letters differently, moving the pen across the paper at different speeds, so the number of measurements varies from sample to sample. This complicates building a recognition model, since classical machine learning algorithms expect fixed-length inputs.
The graphs above show the distribution of sequence lengths per letter. The lengths vary from letter to letter and from writer to writer; the maximum exceeds 3,400 measurements. In my opinion, based on the left graph, 60 measurements is a reasonable target length. The series are brought to this length with the resampling algorithm from scipy, resample, which "compresses" or "stretches" the original time series to the specified length.
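In code, this step might look like the following (the array layout, one row per time step and one column per sensor channel, is my assumption about how the data is stored):

```python
import numpy as np
from scipy.signal import resample

TARGET_LEN = 60  # chosen from the length distribution above

def to_fixed_length(sample: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Resample one recording of shape (n_measurements, 13) to a fixed length.

    scipy.signal.resample is FFT-based, so it both "compresses" long
    sequences and "stretches" short ones to exactly `target_len` points.
    """
    return resample(sample, target_len, axis=0)

# e.g. samples_fixed = np.stack([to_fixed_length(s) for s in samples])
```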
Having brought the series to one common length, we can plot the average sensor readings for each letter. The figure shows the curves for the letters 'a', 'C' and 'R'. The averaged curves turn out to be not very noisy: their behavior does not vary much over short time intervals, and the average readings differ from letter to letter. So this data can be used to build a recognition model, for which several machine learning algorithms and time series processing methods will be tried.
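The average curves themselves are straightforward to compute once the resampled samples are stacked into one array (again, the exact layout is my assumption):

```python
def mean_curves(X: np.ndarray, y: np.ndarray) -> dict:
    """Average the resampled series over all samples of each letter.

    X has shape (n_samples, TARGET_LEN, 13); y holds the letter labels.
    Returns a dict: letter -> (TARGET_LEN, 13) mean curve.
    """
    return {letter: X[y == letter].mean(axis=0) for letter in np.unique(y)}
```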
The dataset's authors ran recognition experiments of their own, training both "classical" machine learning models and neural networks. For the first kind of model, the data was used both without preprocessing and after a special mini-model that converts the series into a dedicated representation.
The authors' results for the "classical" models are shown in the table above; the metric is accuracy for multiclass classification.
For a baseline solution, I decided to try LogisticRegression and LinearSVM from Scikit-Learn, following what the 'Stabilo-OnHW' developers did.
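A minimal version of that baseline (the train/test index variables and the added scaling step are mine):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Flatten each (60, 13) sample into a single 780-dimensional vector
X_flat = samples_fixed.reshape(len(samples_fixed), -1)

for name, model in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("LinearSVM", LinearSVC())]:
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_flat[train_idx], y[train_idx])
    print(name, clf.score(X_flat[test_idx], y[test_idx]))  # accuracy
```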
The new algorithms were:
- XGBoost with maximum depth 10 and 500 trees
- LightGBM with maximum depth 10 and 500 trees
- A simple Bayesian linear model, implemented with pymc3
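In code, the new models look roughly as follows. The post fixes only the two boosting hyperparameters; the exact form of the Bayesian model is not pinned down anywhere, so the softmax regression with normal priors below, fitted with ADVI, is my sketch of what a simple Bayesian linear model in pymc3 can look like:

```python
import pymc3 as pm
import theano.tensor as tt
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Gradient boosting with the hyperparameters listed above
xgb = XGBClassifier(max_depth=10, n_estimators=500)
lgbm = LGBMClassifier(max_depth=10, n_estimators=500)

def fit_bayesian_linear(X_train, y_train, n_classes):
    """Bayesian multinomial (softmax) regression with normal priors.

    X_train: (n_samples, n_features) array; y_train: integer labels.
    """
    n_features = X_train.shape[1]
    with pm.Model():
        W = pm.Normal("W", mu=0, sigma=1, shape=(n_features, n_classes))
        b = pm.Normal("b", mu=0, sigma=1, shape=n_classes)
        p = tt.nnet.softmax(tt.dot(X_train, W) + b)
        pm.Categorical("y", p=p, observed=y_train)
        approx = pm.fit(n=30000, method="advi")  # variational inference
    return approx.sample(1000)
```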
Additionally, I tried libraries for automatic feature generation from time series: TSFresh and TSFEL.
New features were extracted from the magnetometer and pressure-sensor readings with the TSFresh library. Graphical analysis showed the magnetometer measurements to be stationary, so the MinimalFCParameters set was chosen for them. The pressing force varies more, so it uses the EfficientFCParameters set.
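With TSFresh this looks roughly like the snippet below; the long-format frame and its column names are assumptions about how the data was reshaped:

```python
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

# tsfresh expects long format: one row per measurement, an `id` column
# marking which character the row belongs to, and a `time` column for ordering.
mag_features = extract_features(
    long_df[["id", "time", "mag_x", "mag_y", "mag_z"]],
    column_id="id", column_sort="time",
    default_fc_parameters=MinimalFCParameters(),   # stationary signals
)
force_features = extract_features(
    long_df[["id", "time", "force"]],
    column_id="id", column_sort="time",
    default_fc_parameters=EfficientFCParameters(),  # more variable signal
)
```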
The gyroscope readings for different letters look like sets of periodic functions, and in theory time-frequency information can be extracted from the accelerometer readings. TSFEL is used for these measurements: it offers more variety in the kinds of features it can generate; for example, statistical and spectral features can be extracted.
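And with TSFEL, roughly as follows (fs=200 matches the pen's sampling rate; `gyro_acc` is a placeholder name for one sample's gyroscope and accelerometer channels):

```python
import tsfel

# Spectral features suit the periodic gyroscope/accelerometer readings
cfg = tsfel.get_features_by_domain("spectral")

# `gyro_acc` is one sample's (60, 9) array of gyroscope and accelerometer
# channels; the extractor returns a one-row DataFrame of features.
features = tsfel.time_series_features_extractor(cfg, gyro_acc, fs=200)
```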
The obtained features are then filtered: correlated and constant features are removed, followed by processing with the select_features function from TSFresh. The result is 1,467 features derived from the 60 values of each time series.
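A sketch of that filtering (the 0.95 correlation threshold is my choice, and the multiclass flag of select_features is available in recent tsfresh versions):

```python
import numpy as np
from tsfresh import select_features

# Drop constant columns, then one of every pair of highly correlated columns
features = features.loc[:, features.std() > 0]
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
features = features.drop(columns=[c for c in upper.columns
                                  if (upper[c] > 0.95).any()])

# Hypothesis-test based filtering against the letter labels (a pandas Series)
features_selected = select_features(features, y_series, multiclass=True)
```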
The models I chose were trained on two kinds of input: the resampled series and the automatically generated features.
The table above shows accuracy on the test set. My baseline results match the authors'. Note that the gradient boosting models achieve better quality than the simple models, and LightGBM outperforms XGBoost.
The table also shows that synthesizing new features improves the linear models but hurts the gradient boosting models. The Bayesian model performs better than the linear models but worse than gradient boosting. The PyMC3 library used to implement the Bayesian model could not handle the data with the new features, which is why the Bayesian model is absent from the feature-based comparison.
In their article, the authors also report recognition results with recurrent neural networks. The gradient boosting I used performs only 5% worse than those more complex networks, which suggests that acceptable recognition quality has been achieved.
To analyze the errors, consider the confusion matrices for the results of LightGBM on lowercase, uppercase, and mixed letters.
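The matrices can be produced directly from the trained model's predictions, e.g. with scikit-learn (variable names here are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the trained LightGBM model on the test split
ConfusionMatrixDisplay.from_predictions(
    y_test, lgbm.predict(X_test), xticks_rotation="vertical"
)
plt.show()
```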
The confusion matrix shows that letters like 'c/C', 'o/O', 's/S' and 'z/Z' are problematic for the model. Presumably this is because the lowercase and uppercase forms of these letters are written similarly, and since different writers have different handwriting and stroke sizes, such letters can be indistinguishable from the sensor readings alone.
Within the lowercase and uppercase samples, the model confuses letters that share common elements. For example, 'P' and 'D' both have an arc bulging to the right, so with poor handwriting these letters can come out looking the same. The model may fail to catch the smallest changes in pen behavior and mispredict the letter.
Thanks to the libraries that were new to me, TSFresh and TSFEL, I generated almost 1,500 features on the Stabilo-OnHW set and built handwriting recognition models on top of them. The new features improved the quality of the linear models by as much as 8-10%.
Gradient boosting showed decent results without complex preprocessing, which actually worsens models of this kind. I did not tune the hyperparameters, the maximum tree depth and the number of trees, to keep the comparison fair; but experiments with them would be needed to reach better quality.
A better model could also be built by improving letter-case recognition. I think an ensemble of several models would help here: one model recognizes the case of a letter, lowercase or uppercase, and the others predict the specific character.
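A sketch of the idea (purely hypothetical, nothing like this was trained here): the case classifier routes each sample to one of two 26-class character classifiers.

```python
import numpy as np

class TwoStageRecognizer:
    """Hypothetical ensemble: predict the case first, then the character."""

    def __init__(self, case_clf, lower_clf, upper_clf):
        self.case_clf = case_clf    # binary: 0 = lowercase, 1 = uppercase
        self.lower_clf = lower_clf  # 26-class model for lowercase letters
        self.upper_clf = upper_clf  # 26-class model for uppercase letters

    def predict(self, X):
        case = self.case_clf.predict(X)
        out = np.empty(len(X), dtype=object)
        lower = case == 0
        if lower.any():
            out[lower] = self.lower_clf.predict(X[lower])
        if (~lower).any():
            out[~lower] = self.upper_clf.predict(X[~lower])
        return out
```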
Additionally, it would be worth experimenting with different neural network architectures. Unfortunately, that is already beyond the scope of this project. Still, it would be interesting to compare the performance of such models with the results obtained by the authors of 'Stabilo-OnHW' and build the best possible recognizer.