Neural stack (neural stack machine)

This article briefly discusses the architecture of a neural stack.
The neural stack is one of the models of neural networks with external memory.

You can read more about neural networks with external memory here.

Part 1: What is a neural stack?

The basic idea is to train a recurrent neural network (specifically an LSTM) to manipulate a stack of numbers.

The model is trained using backpropagation; since a recurrent neural network is used, the variant is called backpropagation through time (BPTT).

The developers of this model tackled the problem of teaching a neural network to push numbers onto a stack and pop them off.

In the original article, the authors consider not only the neural stack but also the neural queue and the neural deque.

Let the external memory (the neural stack) be a differentiable structure into which real values are pushed and from which they are popped. We deliberately make these discrete operations (push and pop) continuous by letting push and pop take a real-valued coefficient in the interval (0, 1).

Intuitively, we can interpret these values as the degree of confidence with which the controller wants to push the value v onto the stack or pop the top of the stack.

Formulas describing the operation of the neural stack

Neural stack as a recurrent layer

Illustration of neural stack operations, recurrent structure and control

Formally, a stack fully parameterized by an embedding size m is described at time step t by a value matrix V_t \in \mathbb{R}^{t \times m} and a strength vector s_t \in \mathbb{R}^t.

This matrix forms the core of the recurrent layer, which is acted upon by the controller. The neural stack receives from the controller a value v_t \in \mathbb{R}^m, a pop signal u_t \in (0, 1), and a push signal d_t \in (0, 1).

In turn, the stack outputs the read vector r_t \in \mathbb{R}^m to the controller.

The recurrence of this layer comes from the fact that it receives the pair (V_{t−1}, s_{t−1}) and outputs the pair (V_t, s_t) as its next state, following the dynamics described below.

Here V_t[i] denotes the i-th row (an m-dimensional vector) of V_t, and s_t[i] denotes the i-th value of s_t.

Equation 1:

V_t[i] = \begin{cases} V_{t-1}[i] & \text{if } 1 \le i < t \\ v_t & \text{if } i = t \end{cases}

Equation 1 shows the update of the recurrent layer's value matrix V. The number of rows grows by one at each time step.

The values pushed onto the stack at each time step are preserved (whether they are logically still on the stack or not). New values are appended at the bottom of the matrix (the top of the stack) and are never modified afterwards.
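A minimal sketch of Equation 1 in NumPy (the helper name update_values is an assumption of this article, not code from the original paper):

```python
import numpy as np

def update_values(V_prev, v_t):
    """Equation 1: append the pushed vector v_t as a new row.
    Earlier rows are kept and never modified."""
    # V_prev has shape (t-1, m); the result has shape (t, m).
    return np.vstack([V_prev, v_t])
```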

Equation 2:

s_t[i] = \begin{cases} \max(0,\, s_{t-1}[i] - \max(0,\, u_t - \sum_{j=i+1}^{t-1} s_{t-1}[j])) & \text{if } 1 \le i < t \\ d_t & \text{if } i = t \end{cases}

Equation 2 shows how the push and pop signals update the strength vector s_{t−1} to obtain s_t.

  1. First, the pop operation removes objects from the stack.

We can treat the pop value u_t as the initial deletion quantity for the operation.

a) We traverse the strength vector s_{t−1} from the highest index j down to the lowest.

b) If the next strength scalar s_j is less than the remaining deletion quantity u_t, it is subtracted from the remaining quantity u_t and its value is set to 0.

c) If the remaining deletion quantity u_t is less than the next value s_j, then the remaining quantity is subtracted from that value and the deletion stops.

  2. Then the push signal d_t is written as the strength of the value added at the current time step (see the sketch below).
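The whole pop-then-push update of the strength vector can be sketched as follows (NumPy again; update_strengths is a hypothetical helper name):

```python
import numpy as np

def update_strengths(s_prev, u_t, d_t):
    """Equation 2: the pop signal u_t erodes strengths from the top of
    the stack downwards, then the push signal d_t becomes the new top."""
    s = np.empty(len(s_prev) + 1)
    for i in range(len(s_prev)):
        # Strength sitting above row i, which the pop consumes first.
        above = s_prev[i + 1:].sum()
        s[i] = max(0.0, s_prev[i] - max(0.0, u_t - above))
    s[-1] = d_t  # the pushed value enters with strength d_t
    return s
```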

Equation 3:

r_t = \sum_{i=1}^{t} \min(s_t[i],\, \max(0,\, 1 - \sum_{j=i+1}^{t} s_t[j])) \cdot V_t[i]

Equation 3 shows the dynamics of the read operation, which is similar to the pop operation.

A fixed initial read quantity, equal to 1, is set at the top of a temporary copy of the strength vector s_t.

The vector s_t is traversed from the largest index to the smallest.

If the next strength value is less than the remaining read quantity, its value is kept for this operation and subtracted from the remaining read quantity.

If not, then it is temporarily set equal to the remaining read quantity, and the strength scalars at all lower indices are temporarily set to 0.

The result r_t of the read operation is a weighted sum of the rows of V_t, scaled by the temporary scalar values produced during the traversal.
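In code, the read can be sketched like this (NumPy; the helper name read is, again, an assumption):

```python
import numpy as np

def read(V, s):
    """Equation 3: read a total quantity of 1 from the top of the
    stack downwards, as a weighted sum of the rows of V."""
    r = np.zeros(V.shape[1])
    for i in range(len(s)):
        # Read budget left for row i once the rows above it are consumed.
        remaining = max(0.0, 1.0 - s[i + 1:].sum())
        r += min(s[i], remaining) * V[i]
    return r
```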

An example of computing a read from the stack over three time steps, after pushes and pops as described above, is shown in Figure 1a.

The third step shows how setting the value s_3[2] to 0 for V_3[2] logically deletes v_2 from the stack, and how it is then ignored during the read.

This completes the description of the forward dynamics of the neural stack, viewed as a recurrent layer, as shown in Figure 1b. All operations described in this section are differentiable. The equations describing the backward dynamics are given below.
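Putting the pieces together, here is a small worked trace over two time steps (it reuses the hypothetical helpers update_values, update_strengths and read sketched above; the vectors and signal values are arbitrary illustrations):

```python
import numpy as np

m = 3
V, s = np.empty((0, m)), np.empty(0)

# t = 1: push v_1 = [1, 0, 0] with strength d_1 = 0.8, no pop.
V = update_values(V, np.array([1.0, 0.0, 0.0]))
s = update_strengths(s, u_t=0.0, d_t=0.8)

# t = 2: pop u_2 = 0.5, then push v_2 = [0, 1, 0] with strength d_2 = 0.6.
V = update_values(V, np.array([0.0, 1.0, 0.0]))
s = update_strengths(s, u_t=0.5, d_t=0.6)

print(s)           # [0.3 0.6]: the pop eroded v_1's strength by 0.5
print(read(V, s))  # 0.3 * v_1 + 0.6 * v_2 = [0.3 0.6 0. ]
```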

The architectures of the neural queue and neural deque may be discussed in a later article.

Analysis of the backward dynamics of the neural stack

Here we will look at the partial derivatives of the output with respect to the input.

We will use the Kronecker delta \delta_{ij} (\delta_{ij} = 1 if i = j, 0 otherwise).

The equations below are valid for all valid row indices i and n.

[Equations for the partial derivatives of the stack's outputs with respect to its inputs.]

All partial derivatives other than those obtained by the chain rule can be taken to be 0.
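Since every operation above is composed of sums, max and min, these derivatives are exactly what reverse-mode automatic differentiation produces. Below is a minimal sketch of a single stack step in PyTorch (an illustrative assumption; the original work did not use PyTorch), confirming that gradients flow back to the pop and push signals:

```python
import torch

def stack_step(s_prev, u, d, V):
    """Differentiable strength update (Equation 2) followed by a read
    (Equation 3); autograd recovers the partial derivatives above."""
    rows = []
    for i in range(len(s_prev)):
        above = s_prev[i + 1:].sum()
        rows.append(torch.clamp(s_prev[i] - torch.clamp(u - above, min=0.0), min=0.0))
    s = torch.stack(rows + [d])
    r = torch.zeros(V.shape[1])
    for i in range(len(s)):
        remaining = torch.clamp(1.0 - s[i + 1:].sum(), min=0.0)
        r = r + torch.minimum(s[i], remaining) * V[i]
    return r

u = torch.tensor(0.5, requires_grad=True)   # pop signal
d = torch.tensor(0.6, requires_grad=True)   # push signal
s_prev = torch.tensor([0.8])                # one value already on the stack
V = torch.eye(2)                            # two pushed vectors as rows
stack_step(s_prev, u, d, V).sum().backward()
print(u.grad, d.grad)
```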

Initialization of model parameters

The original article reports that the model achieves its best results with the following initialization: the bias for the pop operation is set to a negative number.
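A minimal sketch of such an initialization (PyTorch; the layer name, size, and the exact bias value are illustrative assumptions, since the article only says the bias should be negative):

```python
import torch.nn as nn

hidden_size = 256  # assumed controller width

# Controller head producing the raw pop signal; a sigmoid maps it into (0, 1).
pop_head = nn.Linear(hidden_size, 1)

# A negative bias makes the initial pop strength close to 0, so the
# untrained network tends to keep what it has pushed.
nn.init.constant_(pop_head.bias, -2.0)  # sigmoid(-2) ≈ 0.12; value is illustrative
```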

Model results

Let's consider the task of printing a string in reverse order.

Example:

As an example of input and target data, assume '$' is the start symbol, '|' is a delimiter, and '&' is a terminator. The input might be:

$A string to reverse|

and the expected result will be as follows:

esrever ot gnirts A&
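Training pairs for this task are easy to generate; here is a minimal sketch (the helper name and alphabet are assumptions, only the '$', '|' and '&' markers come from the article):

```python
import random
import string

def make_reversal_example(max_len=16):
    """Build one (input, target) pair: '$' starts the input, '|'
    separates it from the answer, '&' terminates the output."""
    body = ''.join(random.choices(string.ascii_letters + ' ',
                                  k=random.randint(1, max_len)))
    return '$' + body + '|', body[::-1] + '&'

print(make_reversal_example())
```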

The results of the models

You can find out more about this model here and here.
