Predicting the Future: A Neurocomputational Model of Speech Recognition

What is human speech? It is words, whose combinations allow us to express one piece of information or another. The question is: how do we know where one word ends and the next begins? A rather strange question, many will think, because from birth we hear the speech of people around us and learn to speak, read, and write. The accumulated body of linguistic knowledge certainly plays an important role, but beyond it there are neural networks in the brain that divide the stream of speech into its component words and/or syllables. Today we will look at a study in which scientists from the University of Geneva (Switzerland) created a neurocomputational model for decoding speech by predicting words and syllables. Which brain processes formed the basis of the model, what is meant by the big word “prediction”, and how effective is the resulting model? The answers to these questions await us in the scientists' report. Let's go.

Basis of the study

To us humans, human speech is (most often) intelligible and articulate. But to a machine it is just a stream of acoustic information, a continuous signal that must be decoded before it can be understood.

The human brain acts in much the same way; it just happens extremely quickly and imperceptibly to us. Scientists consider certain neural oscillations, and combinations of them, to be the foundation of this and many other brain processes.

In particular, speech recognition is associated with the combination of theta and gamma oscillations, since it allows the encoding of phonemes into syllables to be coordinated hierarchically without prior knowledge of their duration and timing, i.e. bottom-up processing* in real time.

Bottom-up processing* is a type of information processing in which perception is built up from data received from the environment.

Natural speech recognition also depends heavily on contextual cues, which make it possible to predict the content and temporal structure of the speech signal. Previous studies have shown that a prediction mechanism plays an important role in the perception of continuous speech. This process is associated with beta oscillations.

Another important component of speech-signal recognition is predictive coding, in which the brain constantly generates and updates a mental model of the environment. This model is used to generate predictions about sensory input, which are then compared with the actual sensory input. Comparing the predicted and actual signals yields errors that serve to update and revise the mental model.
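The predict-compare-update cycle described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' model: the "mental model" here is a single number, and the learning rate is an arbitrary assumption.

```python
# Minimal sketch of a predictive coding update (illustrative assumptions only):
# the internal estimate is repeatedly corrected by the prediction error
# between the generated (predicted) signal and the actual sensory input.

def predictive_coding_step(estimate, sensory_input, learning_rate=0.1):
    """One update of the internal model from the prediction error."""
    prediction = estimate                     # the generative model's prediction
    error = sensory_input - prediction        # prediction error
    return estimate + learning_rate * error   # revise the internal model

# The estimate converges toward a constant input as the errors shrink.
estimate = 0.0
for _ in range(200):
    estimate = predictive_coding_step(estimate, sensory_input=1.0)
print(round(estimate, 3))  # prints 1.0
```

As the prediction error approaches zero, the mental model stops changing: the brain's "model of the world" has caught up with the sensory evidence.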

In other words, the brain is always learning something new, constantly updating its model of the surrounding world. This process is considered critical for the processing of speech signals.

Scientists note that many theoretical studies support both bottom-up and top-down* approaches to speech processing.

Top-down processing* is analysis that decomposes a system into its components in order to form an idea of its constituent subsystems, in the manner of reverse engineering.

A previously developed neurocomputational model, which coupled realistic theta and gamma excitatory/inhibitory networks, was able to preprocess speech in such a way that it could then be correctly decoded.

Another model, based solely on predictive coding, could accurately recognize individual speech elements (such as words, or complete sentences if they are treated as a single speech element).

Consequently, both models worked, just in different directions: one focused on the real-time aspect of speech analysis, the other on recognizing isolated speech segments (where no real-time analysis is required).

But what if the basic principles of these radically different models were combined into one? According to the authors of the study under consideration, this would improve performance and increase the biological realism of neurocomputational speech-processing models.

In their work, the scientists decided to test whether a speech recognition system based on predictive coding can benefit from neural oscillation processes.

They developed the neurocomputational model PRECOSS (from predictive coding and oscillations for speech), built on a predictive coding framework to which theta- and gamma-oscillatory functions were added in order to cope with the continuous nature of natural speech.

The specific goal of this work was to answer the question of whether a combination of predictive coding and neural oscillations can be beneficial for the rapid identification of the syllabic components of natural sentences. In particular, the mechanisms by which theta oscillations can interact with bottom-up and top-down information flows were examined, and the impact of this interaction on the efficiency of syllable decoding was evaluated.

Architecture of the PRECOSS model

An important requirement for the model is the ability to use the temporal cues present in continuous speech to determine syllable boundaries. The scientists suggested that internal generative models, including temporal predictions, should benefit from such cues. To accommodate this hypothesis, as well as the recurrent processes that occur during speech recognition, a continuous predictive coding model was used.

The developed model clearly separates “what” and “when”. “What” refers to the identity of a syllable and its spectral representation (not a temporal signal, but an ordered sequence of spectral vectors); “when” refers to the prediction of syllable timing and duration.

As a result, the predictions take two forms: syllable onset, signaled by the theta module; and syllable duration, signaled by exogenous/endogenous theta oscillations, which set the duration of the sequence of gamma-synchronized units (diagram below).
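The “what”/“when” separation can be illustrated as a plain data structure. All names here (`SyllablePrediction`, `onset_ms`, and so on) are ours, invented for illustration; only the dimensions (eight spectral vectors of six components, a mean duration of 182 ms) come from the article.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SyllablePrediction:
    # "what": the syllable's identity plus an ordered (not time-stamped)
    # sequence of spectral vectors
    identity: str
    spectral_vectors: List[List[float]] = field(default_factory=list)
    # "when": onset signaled by the theta module, duration by theta oscillations
    onset_ms: float = 0.0
    duration_ms: float = 0.0

# A hypothetical prediction: 8 spectral vectors (one per gamma unit),
# 6 components each (one per frequency channel), at the mean syllable duration.
p = SyllablePrediction("ba", [[0.1] * 6 for _ in range(8)],
                       onset_ms=120.0, duration_ms=182.0)
print(len(p.spectral_vectors), p.duration_ms)  # prints: 8 182.0
```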

Image No. 1

PRECOSS infers the sensory signal from internal representations of its source by means of a generative model. Here, the sensory input corresponds to the slow amplitude modulation of the speech signal and a 6-channel auditory spectrogram of a full natural sentence, which the model generates internally from four components:

  • a theta oscillation;
  • a slow-amplitude-modulation unit in the theta module;
  • a pool of syllable units (as many syllables as are present in the input natural sentence, i.e. from 4 to 25);
  • a bank of eight gamma units in the spectro-temporal module.

Together, the syllable units and gamma oscillations generate top-down predictions about the input spectrogram. Each of the eight gamma units represents a phase within a syllable; they are activated sequentially, and the whole activation sequence then repeats. Accordingly, each syllable unit is associated with a sequence of eight vectors (one per gamma unit) of six components each (one per frequency channel). The acoustic spectrogram of an individual syllable is generated by activating the corresponding syllable unit for the entire duration of the syllable.
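A minimal sketch of how a syllable's spectrogram could be unrolled from its eight gamma-phase vectors. The unit and channel counts (8 gamma units, 6 frequency channels) follow the article; the pattern values and the number of samples per phase are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gamma, n_channels = 8, 6
# Each syllable unit stores 8 spectral vectors (one per gamma unit),
# 6 components each (one per frequency channel). Values here are random stand-ins.
syllable_pattern = rng.random((n_gamma, n_channels))

def generate_spectrogram(pattern, samples_per_phase=4):
    """Unroll the 8-phase pattern in time: each sequentially activated gamma
    unit holds its spectral vector for a fraction of the syllable's duration."""
    return np.repeat(pattern, samples_per_phase, axis=0)

spec = generate_spectrogram(syllable_pattern)
print(spec.shape)  # prints: (32, 6) -- 8 phases x 4 samples, 6 channels
```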

While a syllable unit encodes a specific acoustic pattern, the gamma units deploy the corresponding spectral prediction in time over the duration of the syllable. Information about syllable duration is supplied by the theta oscillation, since its instantaneous rate affects the rate/duration of the gamma sequence.

Finally, the accumulated evidence about the current syllable must be erased before the next syllable is processed. To do this, the last (eighth) gamma unit, which encodes the final part of the syllable, resets all syllable units to a common low activation level, which allows new evidence to be gathered.

Image No. 2

The performance of the model depends on whether the gamma sequence coincides with syllable onset, and whether its duration matches the duration of the syllable (50–600 ms, mean = 182 ms).

The model's estimate of the current syllable is carried by the syllable units, which together with the gamma units generate the expected spectro-temporal patterns (the model's output); these are compared with the input spectrogram. The model updates its estimates of the current syllable so as to minimize the difference between the generated and actual spectrograms. Activity increases in those syllable units whose spectrogram matches the sensory input and decreases in the others. In the ideal case, minimizing prediction errors in real time leads to elevated activity in the single syllable unit corresponding to the input syllable.
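The evidence-accumulation and reset dynamics described above can be sketched as follows. This is a loose illustration, not the PRECOSS equations: the patterns, noise level, and update rate are all made up, but the qualitative behavior (the matching syllable unit wins, then everything is reset to a common low level) follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_syllables, n_channels = 5, 6
# Each syllable unit's predicted spectrum (random stand-ins for illustration).
patterns = rng.random((n_syllables, n_channels))
true_idx = 2  # the syllable actually present in the input

activity = np.full(n_syllables, 0.1)  # common low starting activation
for _ in range(50):  # frames within one syllable
    # Noisy observation of the true syllable's spectrum.
    observed = patterns[true_idx] + 0.05 * rng.standard_normal(n_channels)
    # Per-unit mismatch between prediction and input (prediction error).
    errors = np.sum((patterns - observed) ** 2, axis=1)
    # Units whose prediction matches the input grow; the rest decay.
    activity += 0.1 * (np.exp(-errors) - activity)

winner = int(np.argmax(activity))
print(winner)  # index of the winning unit (matches true_idx)

# Reset after the eighth gamma unit: ready to gather evidence for the next syllable.
activity[:] = 0.1
```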

Simulation results

The model presented above includes physiologically motivated theta oscillations, which are driven by the slow amplitude modulations of the speech signal and convey information about syllable onset and duration to the gamma component.

This theta-gamma coupling ensures temporal alignment of the internally generated predictions with the syllable boundaries detected in the input (variant A in image No. 3).

Image No. 3

To assess the relevance of syllable synchronization based on slow amplitude modulation, model A was compared with variant B, in which theta activity is not modeled as an oscillation but arises from the self-repetition of the gamma sequence.

In model B, the duration of the gamma sequence is no longer controlled exogenously (by external factors) through theta oscillations; instead, it is set endogenously (by internal factors) at a preferred gamma rate which, as the sequence repeats, gives rise to an internal theta rhythm. As with the theta oscillation, the duration of the gamma sequence has a preferred rate in the theta range, which can potentially adapt to variable syllable durations. This makes it possible to test a theta rhythm that arises from repetition of the gamma sequence.

To more precisely evaluate the specific effects of the theta-gamma coupling and of erasing the accumulated evidence in the syllable units, additional versions of models A and B were built.

Variants C and D differed in lacking a preferred gamma rate. Variants E and F additionally differed from variants C and D in lacking the reset of accumulated syllable evidence.

Of all the model variants, only A has a true theta-gamma coupling, in which gamma activity is driven by the theta module; in model B, the gamma rate is set endogenously.

It remained to establish which version of the model is the most effective, for which their results were compared on common input data (natural sentences). The graph in the image above shows the average performance of each model.

There were significant differences between the variants. Compared with models A and B, performance was markedly lower in models E and F (by 23% on average) and in C and D (by 15%). This indicates that erasing the accumulated evidence about the previous syllable before processing a new one is a critical factor in encoding the syllable stream of natural speech.

Comparing variants A and B with variants C and D showed that the theta-gamma coupling, whether stimulus-driven (A) or endogenous (B), significantly improves model performance (by 8.6% on average).

Generally speaking, the experiments with the different model versions showed that the model worked best when the syllable units were reset after each completed gamma-unit sequence (based on internal information about the spectral structure of the syllable), and when the gamma rate was determined by the theta-gamma coupling.

The model's performance on natural sentences therefore depends neither on precise signaling of syllable onsets by stimulus-driven theta oscillations, nor on the exact mechanism of the theta-gamma coupling.

As the scientists themselves admit, this is a rather surprising finding. On the other hand, the absence of a performance difference between stimulus-driven and endogenous theta-gamma coupling reflects the fact that syllable durations in natural speech are very close to the model's expectations, in which case a theta signal driven directly by the input offers no advantage.

To better understand this unexpected turn of events, the scientists ran another series of experiments, but with compressed speech signals (x2 and x3). As behavioral studies show, comprehension of speech compressed twofold is practically unchanged, but it drops significantly at threefold compression.
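As a rough illustration of the compression manipulation, a signal can be time-compressed x2 or x3 by naive decimation. This is only a sketch (the study used proper time compression of natural sentences); the sampling rate and the test signal below are assumptions.

```python
import numpy as np

def compress(signal, factor):
    """Naive time compression by keeping every `factor`-th sample,
    which shortens every syllable by the same factor."""
    return signal[::factor]

fs = 1000  # Hz (assumed sampling rate)
t = np.arange(1000) / fs  # one second of signal
signal = np.sin(2 * np.pi * 5 * t)  # a 5 Hz "theta-like" slow modulation

x2 = compress(signal, 2)
x3 = compress(signal, 3)
print(len(signal), len(x2), len(x3))  # prints: 1000 500 334
```

After compression, the same slow modulation cycles arrive two or three times faster, pushing syllable durations outside the rates an endogenous theta rhythm expects.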

In this case, a stimulus-driven theta-gamma coupling could prove extremely useful for parsing and decoding syllables. The simulation results are presented below.

Image No. 4

As expected, overall performance fell as the compression ratio increased. At x2 compression there was still no significant difference between the stimulus-driven and endogenous theta-gamma couplings. But at x3 compression there was a significant difference, suggesting that a stimulus-driven theta oscillation driving the theta-gamma coupling was more beneficial for syllable coding than an endogenously set theta rate.

It follows that natural speech can be processed with a relatively fixed endogenous theta generator, but more challenging input signals (i.e., when the speech rate is constantly changing) require a driven theta generator that conveys accurate timing information about the syllables (syllable onset and duration) to the gamma encoder.

The model's ability to accurately recognize syllables in an input sentence does not account for the varying complexity of the compared models. Therefore, the Bayesian Information Criterion (BIC) was computed for each model. This criterion quantifies the trade-off between a model's accuracy and its complexity (image No. 5).
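For reference, the classical BIC is k·ln(n) − 2·ln(L), where k is the number of parameters, n the number of observations, and L the model likelihood; in this form lower values are better (the article reports BIC in a form where higher values indicate the better model, which corresponds to flipping the sign). The log-likelihoods and parameter counts below are invented purely for illustration.

```python
import numpy as np

def bic(log_likelihood, n_params, n_observations):
    """Classical BIC: penalizes poor fit and excess parameters (lower = better)."""
    return n_params * np.log(n_observations) - 2 * log_likelihood

# Hypothetical comparison: model "B" fits slightly better but uses more
# parameters, so the complexity penalty outweighs the gain in fit.
n = 2888  # number of decoded syllables, as in the article
bic_A = bic(log_likelihood=-1500.0, n_params=10, n_observations=n)
bic_B = bic(log_likelihood=-1490.0, n_params=25, n_observations=n)
print(bic_A < bic_B)  # prints: True -- the simpler model wins despite the worse fit
```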

Image No. 5

Variant A showed the highest BIC values. The earlier comparison of models A and B could not reliably distinguish their performance; however, the BIC made it apparent that variant A provides more confident syllable recognition than the model without stimulus-driven theta oscillations (model B).

For a more detailed acquaintance with the nuances of the study, I recommend taking a look at the scientists' report and its supplementary materials.


Summarizing the above results, we can say that the model's success depends on two main factors. The first and most important is the erasure of accumulated evidence based on the model's information about the content of the syllable (in this case, its spectral structure). The second factor is the coupling between the theta and gamma processes, which confines gamma activity to a theta cycle corresponding to the expected syllable duration.

At its core, the developed model imitates the functioning of the human brain. Sound entering the system is modulated by a theta wave resembling neuronal activity, which makes it possible to determine syllable boundaries. Faster gamma waves then help encode the syllable. Along the way, the system proposes possible syllables and corrects its choice when necessary. Shuttling between the two levels (theta and gamma), the system discovers the correct version of the syllable and then resets itself to start the process anew for the next syllable.

In practical tests, the model successfully decoded 2888 syllables (220 sentences of natural speech; the language used was English).

This study not only combined two opposing theories, putting them into practice as a single system, but also made it possible to better understand how our brain perceives speech signals. It seems to us that we perceive speech “as is”, i.e. without any complicated supporting processes. However, the simulation results suggest that neural theta and gamma oscillations allow our brain to make small predictions about which syllable we are hearing, on the basis of which speech perception is formed.

Say what you will, but the human brain sometimes seems far more mysterious and incomprehensible than the unexplored corners of the universe or the fathomless depths of the oceans.

Thank you for your attention, remain curious and have a good working week, guys. 🙂
