[Popular science with code] What is a “liquid” neural network, and how do you teach it to play Atari?

While the algorithms at the heart of traditional networks are tuned during training, when huge amounts of data are fed in to calibrate the best values of their weights, liquid (“fluid”) neural networks are more adaptable.

“They are able to change their underlying equations based on the input they observe,” in particular by varying the speed at which neurons react, says Daniela Rus, director of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology.

One of the first tests of this ability was steering an autonomous car. A conventional neural network analyzes visual data from the car’s camera only at fixed intervals, whereas a liquid network of just 19 neurons and 253 synapses (tiny by machine-learning standards) can be far more responsive.

“Our model can sample more often, for example, when the road is winding,” says a co-author of this and other papers on liquid networks.

The model succeeded in keeping the car on track, but, according to Lechner, it had one drawback: “It was really slow,” because of the nonlinear equations representing synapses and neurons, which usually cannot be solved without a computer performing several iterations before converging on a solution. This work is typically delegated to dedicated software packages, solvers, which have to be applied to each synapse and neuron separately.

In a 2022 paper, the scientists demonstrated a liquid neural network that gets around this bottleneck. The network was based on equations of the same type, but the key advance was Hasani’s discovery that they did not need to be solved with time-consuming computer calculations. Instead, the network could operate using a near-exact, or “closed-form,” solution that could in principle be worked out on paper with a pencil. As a rule, such nonlinear equations have no closed-form solutions, but Hasani stumbled upon a good enough approximation.

“A closed-form solution is an equation solution where you plug in parameter values, then do some simple math, and you get the answer, in one shot.”

This speeds up calculations and reduces energy costs.
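To make the difference concrete, here is a minimal sketch under a toy assumption: a simple linear ODE rather than the actual CfC equations. It contrasts the iterative work a numerical solver performs with a closed-form solution that you evaluate in one shot:

import numpy as np

# Toy leaky-integrator ODE: dx/dt = (I - x) / tau  (illustration only, not the CfC model)
tau, I, x0, t = 0.5, 1.0, 0.0, 1.0

# Iterative route: many small Euler steps, the kind of work a numerical ODE solver does
x = x0
dt = 1e-3
for _ in range(int(t / dt)):
    x += dt * (I - x) / tau
print("iterative:", x)            # ~0.8650

# Closed-form route: plug the parameters into the known solution x(t) = I + (x0 - I)*exp(-t/tau)
x_closed = I + (x0 - I) * np.exp(-t / tau)
print("closed form:", x_closed)   # ~0.8647

One pass of simple arithmetic instead of a thousand solver steps: that is the kind of saving a closed-form solution buys.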

Liquid neural networks offer an “elegant and compact alternative,” says Ken Goldberg, a roboticist at the University of California, Berkeley. Experiments already show that these networks can be faster and more accurate than other so-called continuous-time neural networks, which model systems that change over time, he says.

Ramin Hasani and Mathias Lechner, the initiators of the new architecture, realized years ago that C. elegans could be the perfect organism for figuring out how to build resilient neural networks that can adapt to the unexpected. This worm is one of the few creatures with a fully mapped nervous system, and it is capable of a range of behaviors: moving, finding food, sleeping, mating, and even learning from experience. “It lives in the real world, where there is constant change, and it can perform well in almost any environment,” Lechner said.

Respect for the lowly worm led Lechner and Hasani to their new networks, in which each neuron is governed by an equation that predicts its behavior over time. And just as neurons are connected to one another, these equations depend on one another. The network effectively solves an ensemble of coupled equations, which makes it possible to characterize the state of the system at any moment, unlike traditional neural networks, which produce results only at certain points in time.

“[Conventional neural networks] can tell you what’s going on at one, two, or three seconds,” Lechner said. “But a continuous-time model like ours can describe what happens at 0.53 seconds, 2.14 seconds, or whatever time you choose.”

Liquid networks also differ in how they handle synapses, the connections between artificial neurons. The strength of such a connection in a standard neural network is expressed by a single number, its weight. In liquid networks, the exchange of signals between neurons is a probabilistic process governed by a “nonlinear” function: responses to inputs are not always proportional. Doubling the input, for example, can shift the output much more, or much less. It is this built-in variability that gives the networks the name “liquid”: a neuron’s response can vary depending on the input it receives.

“Their approach beats the competition by several orders of magnitude without sacrificing accuracy,” said Sayan Mitra, a computer scientist at the University of Illinois, Urbana-Champaign.

Their latest networks are not only faster but also unusually stable, Hasani says, meaning the system can handle enormous inputs without going haywire. “The main contribution here is that stability and other nice properties are built into these systems by their sheer structure,” says Sriram Sankaranarayanan, a computer scientist at the University of Colorado, Boulder. Liquid networks seem to operate in a “sweet spot”: they are complex enough for something interesting to happen, but not so complex as to lead to chaotic behavior.

MIT scientists are now testing their latest network on an autonomous drone. Although the drone was trained to navigate a forest, it was moved to Cambridge’s urban environment to see how it copes with new conditions. Lechner considers the preliminary results encouraging.

The team is working on improving the architecture of their network. The next step, according to Lechner, is “figuring out how many or how few neurons it takes to solve the problem.”

The scientists also want to work out the best way to connect neurons. Right now every neuron is connected to every other neuron, but the synaptic connections of C. elegans work differently: they are more selective. Through studies of the roundworm’s nervous system, the researchers hope to determine which neurons in their system should be connected to which.

In addition to autonomous driving and flying, liquid networks are well suited to analyzing power grids, financial transactions, weather, and other phenomena that change over time. According to Hasani, the latest version of liquid networks can be used “to model brain activity on a scale not previously possible.”

Mitra finds this especially intriguing. “In a way, it’s poetic to show that this research can come full circle,” he says (a reference to the closed form of the equations). “Neural networks are evolving to the point where the very ideas we gleaned from nature may soon help us understand nature better.”


Let’s move on to practice: we will walk through an example taken directly from the documentation of the package that implements liquid neural networks and, of course, see it in action.

Technical details

Neural circuit policies (NCPs) are recurrent neural network models inspired by the nervous system of the nematode C. elegans. Compared to standard ML models, NCPs have:

  1. Neurons modeled by an ordinary differential equation;
  2. Sparse and structured connections of neurons;

Neuron models

Today the package provides two neuron models, LTC and CfC, built on neurons defined as differential equations connected by sigmoidal synapses. The term “liquid time constant” comes from a property of LTC: the temporal behavior of these neurons adapts to the input data (the speed of the reaction to a given stimulus can depend on the particular input signal). LTCs are ordinary differential equations, so their behavior can only be described over time.
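For reference, the LTC dynamics, roughly as written in the papers by Hasani and colleagues (notation simplified here), look like this:

\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f\big(x(t), I(t), t, \theta\big)\right] x(t) + f\big(x(t), I(t), t, \theta\big)\, A

Here x(t) is the hidden state, I(t) the input, f a bounded nonlinearity with learnable parameters θ, and A a bias vector. The effective time constant depends on the input through f, which is exactly where the “liquid” behavior comes from.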

LTCs are universal approximators and implement causal dynamic models. The LTC model has one significant drawback: it requires a numerical differential-equation solver to compute its output, which seriously slows down training and inference. Closed-form continuous-time (CfC) models eliminate this bottleneck: they contain an approximate closed-form solution of the differential equation.

Both LTC and CfC models are recurrent neural networks: they carry a temporal state and therefore apply only to sequential or time-series data.

Neural connections

Both of the models above can be used with a fully connected wiring diagram. To do this, we simply pass the number of neurons, just as in standard models such as LSTM, GRU, MLP, or transformers.

from ncps.torch import CfC

# a fully connected CfC network
rnn = CfC(input_size=20, units=50)
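
A quick usage sketch, modeled on the package’s quickstart (the tensor shapes here are illustrative assumptions): the torch CfC takes a batch of sequences of shape (batch, time, features) and returns the output sequence together with the final hidden state.

import torch
from ncps.torch import CfC

rnn = CfC(input_size=20, units=50)   # the same fully connected network as above
x = torch.randn(2, 3, 20)            # (batch, time, input features)
output, hx = rnn(x)                  # output: (batch, time, units), hx: final hidden state
print(output.shape)                  # torch.Size([2, 3, 50])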

Sparse, structured connections can be specified via an ncps.wirings.Wiring object. The Neural Circuit Policy (NCP) is the most interesting wiring paradigm in this package: it implements a four-layer principle of recurrent connections between sensory, intermediate, command, and motor neurons.

(Figure: NCP wiring diagram of sensory, intermediate, command, and motor neurons)

The easiest way to create NCP neural connections is the AutoNCP class, which only requires the total number of neurons and the number of motor neurons, that is, the size of the output.

from ncps.torch import CfC
from ncps.wirings import AutoNCP

wiring = AutoNCP(28, 4) # 28 neurons, 4 outputs
input_size = 20
rnn = CfC(input_size, wiring)
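
A quick sanity check, continuing the snippet above (same assumptions about the input layout): for a wired model the output feature dimension should equal the number of motor neurons, 4 in this case.

import torch

x = torch.randn(2, 3, input_size)    # (batch, time, features), input_size = 20
output, hx = rnn(x)                  # output features correspond to the 4 motor neurons
print(output.shape)                  # expected: torch.Size([2, 3, 4])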


Teaching a liquid neural network to play Atari

Below, we will teach an NCP to play an Atari game using reinforcement learning. The code is written in TensorFlow, and ray[rllib] is used for training. We will use the proximal policy optimization (PPO) algorithm: a solid baseline that works in both discrete and continuous action spaces.

(Figure: the Atari Breakout environment)

Installation and Requirements

First you need to install some packages:

pip3 install ncps tensorflow "ale-py==0.7.4" "ray[rllib]" "gym[atari,accept-rom-license]==0.23.1"

Define the model

The model consists of a convolutional block followed by a CfC recurrent neural network. For compatibility with rllib, we create a subclass of ray.rllib.models.tf.recurrent_net.RecurrentNetwork.

Our Conv-CfC network has two output tensors:

  • The distribution over possible actions (the policy);
  • A scalar estimate of the value function.

The second tensor is required by our PPO algorithm. Learning both the policy and the value function in the same network often benefits from shared features.

import numpy as np
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.models.tf.recurrent_net import RecurrentNetwork
from ray.rllib.utils.annotations import override
import tensorflow as tf
from ncps.tf import CfC

class ConvCfCModel(RecurrentNetwork):
    """Example of using the Keras functional API to define a RNN model."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        cell_size=64,
    ):
        super(ConvCfCModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name
        )
        self.cell_size = cell_size

        # Define input layers
        input_layer = tf.keras.layers.Input(
            # rllib flattens the input
            shape=(None, obs_space.shape[0] * obs_space.shape[1] * obs_space.shape[2]),
            name="inputs",
        )
        state_in_h = tf.keras.layers.Input(shape=(cell_size,), name="h")
        seq_in = tf.keras.layers.Input(shape=(), name="seq_in", dtype=tf.int32)

        # Preprocess observation with a hidden layer and send to CfC
        self.conv_block = tf.keras.models.Sequential([
            tf.keras.Input(
                (obs_space.shape[0] * obs_space.shape[1] * obs_space.shape[2])
            ),  # batch dimension is implicit
            tf.keras.layers.Lambda(
                lambda x: tf.cast(x, tf.float32) / 255.0
            ),  # normalize input
            # unflatten the input image that has been done by rllib
            tf.keras.layers.Reshape((obs_space.shape[0], obs_space.shape[1], obs_space.shape[2])),
            tf.keras.layers.Conv2D(
                64, 5, padding="same", activation="relu", strides=2
            ),
            tf.keras.layers.Conv2D(
                128, 5, padding="same", activation="relu", strides=2
            ),
            tf.keras.layers.Conv2D(
                128, 5, padding="same", activation="relu", strides=2
            ),
            tf.keras.layers.Conv2D(
                256, 5, padding="same", activation="relu", strides=2
            ),
            tf.keras.layers.GlobalAveragePooling2D(),
        ])
        self.td_conv = tf.keras.layers.TimeDistributed(self.conv_block)

        dense1 = self.td_conv(input_layer)
        cfc_out, state_h = CfC(
            cell_size, return_sequences=True, return_state=True, name="cfc"
        )(
            inputs=dense1,
            mask=tf.sequence_mask(seq_in),
            initial_state=[state_in_h],
        )

        # Postprocess CfC output with another hidden layer and compute values
        logits = tf.keras.layers.Dense(
            self.num_outputs, activation=tf.keras.activations.linear, name="logits"
        )(cfc_out)
        values = tf.keras.layers.Dense(1, activation=None, name="values")(cfc_out)

        # Create the RNN model
        self.rnn_model = tf.keras.Model(
            inputs=[input_layer, seq_in, state_in_h],
            outputs=[logits, values, state_h],
        )
        self.rnn_model.summary()

    @override(RecurrentNetwork)
    def forward_rnn(self, inputs, state, seq_lens):
        model_out, self._value_out, h = self.rnn_model([inputs, seq_lens] + state)
        return model_out, [h]

    @override(ModelV2)
    def get_initial_state(self):
        return [
            np.zeros(self.cell_size, np.float32),
        ]

    @override(ModelV2)
    def value_function(self):
        return tf.reshape(self._value_out, [-1])

Once the model is defined, it can be registered with rllib:

from ray.rllib.models import ModelCatalog

ModelCatalog.register_custom_model("cfc", ConvCfCModel)

Define the reinforcement learning algorithm and its hyperparameters

Each RL algorithm relies on a dozen hyperparameters that can have a huge impact on training performance, and PPO is no exception. Luckily, the authors of rllib provide a configuration that works decently for PPO on Atari environments. We will use it:

import argparse
import os
import gym
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPO
import time
import ale_py
from ray.rllib.env.wrappers.atari_wrappers import wrap_deepmind

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", type=str, default="ALE/Breakout-v5")
    parser.add_argument("--cont", default="")
    parser.add_argument("--render", action="store_true")
    parser.add_argument("--hours", default=4, type=int)
    args = parser.parse_args()

    register_env("atari_env", lambda env_config: wrap_deepmind(gym.make(args.env)))
    config = {
        "env": "atari_env",
        "preprocessor_pref": None,
        "gamma": 0.99,
        "num_gpus": 1,
        "num_workers": 16,
        "num_envs_per_worker": 4,
        "create_env_on_driver": True,
        "lambda": 0.95,
        "kl_coeff": 0.5,
        "clip_rewards": True,
        "clip_param": 0.1,
        "vf_clip_param": 10.0,
        "entropy_coeff": 0.01,
        "rollout_fragment_length": 100,
        "sgd_minibatch_size": 500,
        "num_sgd_iter": 10,
        "batch_mode": "truncate_episodes",
        "observation_filter": "NoFilter",
        "model": {
            "vf_share_layers": True,
            "custom_model": "cfc",
            "max_seq_len": 20,
            "custom_model_config": {
                "cell_size": 64,
            },
        },
        "framework": "tf2",
    }

    algo = PPO(config=config)

When the algorithm runs, it creates checkpoints that we can restore later. We will save these checkpoints in the rl_ckpt folder and add the option of restoring from a checkpoint whose ID is passed via the --cont argument:

os.makedirs(f"rl_ckpt/{args.env}", exist_ok=True)
if args.cont != "":
    algo.load_checkpoint(f"rl_ckpt/{args.env}/checkpoint-{args.cont}")

Visualization of the interaction between policy and environment

To show exactly how the trained policy plays the Atari game, we need to write a function that enables the environment’s render_mode and runs the policy in a closed loop.

To compute actions, we use the algorithm object’s compute_single_action method, but we need to take care of initializing the RNN hidden state ourselves:

def run_closed_loop(algo, config):
    env = gym.make(args.env, render_mode="human")
    env = wrap_deepmind(env)
    rnn_cell_size = config["model"]["custom_model_config"]["cell_size"]
    obs = env.reset()
    state = init_state = [np.zeros(rnn_cell_size, np.float32)]
    while True:
        action, state, _ = algo.compute_single_action(
            obs, state=state, explore=False, policy_id="default_policy"
        )
        obs, reward, done, _ = env.step(action)
        if done:
            obs = env.reset()
            state = init_state

Launching the PPO

Now we run the reinforcement learning algorithm. The program renders the network’s gameplay if the --render argument is specified:

if args.render:
    run_closed_loop(
        algo,
        config,
    )
else:
    start_time = time.time()
    last_eval = 0
    while True:
        info = algo.train()
        if time.time() - last_eval > 60 * 5:  # every 5 minutes print some stats
            print(f"Ran {(time.time()-start_time)/60/60:0.1f} hours")
            print(
                f"    sampled {info['info']['num_env_steps_sampled']/1000:0.0f}k steps"
            )
            print(f"    policy reward: {info['episode_reward_mean']:0.1f}")
            last_eval = time.time()
            ckpt = algo.save_checkpoint(f"rl_ckpt/{args.env}")
            print(f"    saved checkpoint '{ckpt}'")

        elapsed = (time.time() - start_time) / 60  # in minutes
        if elapsed > args.hours * 60:
            break

You can find the full source code here.

On a modern desktop computer, it takes about an hour to reach a return of 20 and about four hours to reach a return of 50.

For Atari environments, rllib distinguishes between two returns: the episodic return (i.e. with one life in the game) and the game return (with three lives), so the return reported by rllib may differ from the one you get when evaluating the model in the closed loop.

The output of the script looks like this:

> Ran 0.0 hours
>     sampled 4k steps
>     policy reward: nan
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-1'
> Ran 0.1 hours
>     sampled 52k steps
>     policy reward: 1.9
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-13'
> Ran 0.2 hours
>     sampled 105k steps
>     policy reward: 2.6
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-26'
> Ran 0.3 hours
>     sampled 157k steps
>     policy reward: 3.4
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-39'
> Ran 0.4 hours
>     sampled 210k steps
>     policy reward: 6.7
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-52'
> Ran 0.4 hours
>     sampled 266k steps
>     policy reward: 8.7
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-66'
> Ran 0.5 hours
>     sampled 323k steps
>     policy reward: 10.5
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-80'
> Ran 0.6 hours
>     sampled 379k steps
>     policy reward: 10.7
>     saved checkpoint 'rl_ckpt/ALE/Breakout-v5/checkpoint-94'
...
