Optimizing neural networks in TensorFlow?

Recall that computation graphs are a way of representing a computation as a graph; I think everyone has seen what mathematical graphs look like. Nodes are operations: addition, subtraction, matrix multiplication, activation functions such as ReLU. Edges are data flows; they reflect dependencies between operations. Typically, an edge is directed from a source node to a destination node and carries the result of one node's computation to the next.

Unlike PyTorch, where the graph is built on the fly as the network runs, the graph in TensorFlow is static. Yes, we lose a little room for experimentation. But why do we need that when we are writing a network for real production?
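To make that concrete, here is a tiny sketch, with a made-up function and shapes purely for illustration, of how TensorFlow turns Python code into such a graph:

import tensorflow as tf

# A tiny computation expressed as a graph: matmul and add nodes connected by edges
@tf.function
def affine(x, w, b):
    return tf.add(tf.matmul(x, w), b)

# Tracing produces an explicit graph object whose operations can be inspected
concrete = affine.get_concrete_function(
    tf.TensorSpec([1, 4], tf.float32),
    tf.TensorSpec([4, 2], tf.float32),
    tf.TensorSpec([2], tf.float32),
)
print([op.type for op in concrete.graph.get_operations()])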

Automatic optimization for the lazy

Let's start with perhaps the simplest, semi-automatic methods for optimizing computation graphs, and the first of them is XLA (Accelerated Linear Algebra): a compiler that speeds up linear algebra operations by compiling them into machine code for your GPU or TPU.

It works relatively simply: first the graph is analyzed to find optimization opportunities, such as fusing sequential operations, merging subgraphs, and rewriting or removing operations altogether. Only after these passes does XLA compile the resulting graph into code specialized for your TPU or GPU.

To use the compiler, just pass the jit_compile flag to tf.function:

@tf.function(jit_compile=True)  # ask TensorFlow to compile this function with XLA
def model_inference(input):
    return model(input)

Or apply the flag right before the operations you need. If you really want to enable the compiler globally, set the environment variable:

import os
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
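If you would rather not touch environment variables, there is also a programmatic switch for XLA auto-clustering; a small sketch (exact behavior may vary between TensorFlow versions):

import tensorflow as tf

# Turn on XLA auto-clustering for the whole program
tf.config.optimizer.set_jit(True)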

This approach is automatic and does not always guarantee "high-quality" optimization. So let's think about how to get the most out of optimization, and even shrink the model, using other methods: graph freezing and checkpointing.

Freezing a graph, or how to get the entire computation graph in one file

We mummify/freeze the variables and their weights into constants; this lets us export the graph into a single, self-contained file. Just like a virtual environment, or simply a project with its own dependencies, a computation graph can then be attached to a repository. But the most important thing for us is that freezing helps optimize computation.

The point of freezing is not only to convert variables into constants, but also to remove all graph nodes that are not needed for model inference. And this is the advantage of the method: thanks to it, the file size can be reduced significantly.

But, of course, such a model cannot simply be retrained.

Once we have trained and built the model itself, all that remains is to remove the training nodes and convert all the variables into constants.

# Convert the model's variables to constants and remove training-only nodes

sess = tf.compat.v1.keras.backend.get_session()          # current TensorFlow session
graph_def = sess.graph.as_graph_def()                    # the model's graph definition
output_names = [node.op.name for node in model.outputs]  # names of the output nodes

# Drop nodes that are only used during training
inference_graph_def = tf.compat.v1.graph_util.remove_training_nodes(graph_def)

# Freeze: replace variables with constants holding the trained weights
frozen_graph = tf.compat.v1.graph_util.convert_variables_to_constants(
    sess=sess,
    input_graph_def=inference_graph_def,
    output_node_names=output_names
)

Then we save the finished graph to a file:

# Open the file for binary writing and serialize the frozen graph into it
with tf.io.gfile.GFile('frozen_model.pb', 'wb') as f:
    f.write(frozen_graph.SerializeToString())
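Later the frozen file can be loaded back and imported into a fresh graph; a minimal sketch, assuming the same file name:

# Read the serialized GraphDef back from disk
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('frozen_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Import it into a new graph; tensors are then reachable by name
with tf.Graph().as_default() as graph:
    tf.compat.v1.import_graph_def(graph_def, name='')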

A checkpoint is our spawn point in TensorFlow

Creating a lightweight checkpoint in TensorFlow saves the model state (weights and optimizer) to a file for later restoration and continued training. This is useful when training takes a long time and you want to save the state regularly to avoid losing progress if the run crashes or is interrupted. The same Ctrl+S, only inside the framework.
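In the TF2-style API the same idea looks roughly like this; a sketch, assuming a model and an optimizer already exist and the checkpoint directory is a placeholder:

import tensorflow as tf

# Track the model and optimizer state in one checkpoint object
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory='./ckpts', max_to_keep=3)

# Call this periodically during training; old checkpoints are rotated out automatically
save_path = manager.save()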

Well, why do we need these "saves"? The point is that checkpoints can be managed, so we can squeeze everything we need out of them at save time.

For example, using tf.train.Saver(...) we can specify exactly which variables we want to save when we initialize it. You can, say, pass in just the weights.
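A small sketch of that, assuming a session-based setup (the Saver lives in the compat.v1 namespace in TF2), an existing Keras model, and a placeholder checkpoint path:

import tensorflow as tf

sess = tf.compat.v1.keras.backend.get_session()

# Save only the trainable weights and keep at most the three latest checkpoints
saver = tf.compat.v1.train.Saver(var_list=model.trainable_weights, max_to_keep=3)
filepath = './checkpoints/model.ckpt'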

On the other hand, checkpoints offer more functionality. Thanks to the MetaGraph, we can rebuild the computation graph immediately when loading a model from a checkpoint.
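Restoring via the MetaGraph looks roughly like this (a sketch; the checkpoint paths are placeholders):

import tensorflow as tf

with tf.compat.v1.Session() as sess:
    # The .meta file rebuilds the graph structure, restore() then fills in the weights
    saver = tf.compat.v1.train.import_meta_graph('./checkpoints/model.ckpt.meta')
    saver.restore(sess, './checkpoints/model.ckpt')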

But here's another idea: simply reuse the graph-construction code from training and stop saving the metadata altogether.

ckpt_filepath = saver.save(sess, filepath, write_meta_graph=False)

Here we set write_meta_graph to False and save only the weights and other graph state when building the checkpoint. This can cut the checkpoint size by almost five times, so use it!
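The flip side of skipping the MetaGraph: when restoring, you rebuild the graph yourself and load only the values. A sketch, where build_model is a hypothetical stand-in for your training-time construction code:

# Rebuild the exact same graph as during training
model = build_model()  # hypothetical: duplicates the training-time graph construction

sess = tf.compat.v1.keras.backend.get_session()
saver = tf.compat.v1.train.Saver()
saver.restore(sess, ckpt_filepath)  # only variable values are read from the checkpoint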

But there are a couple of other interesting ways to pump up your network and work with graphs: pruning, which sounds a bit drastic, or quantization.

Pruning, or removing extra neurons from the AI brain

Pruning is how you significantly reduce the size of a model at the cost of a slight drop in accuracy. More precisely, it removes unnecessary parameters: weights or whole neurons. For example, by estimating gradients with respect to parameters, the framework's built-in Grappler or the Pruning API from google-research removes the least useful neurons.

Google's implementation (the Pruning API) works by adding mask and threshold variables, which accumulate information about which weights contribute to correct predictions. The unnecessary weights are then simply zeroed out, and as a result we get the same kind of optimized, frozen graph.

For example,

from tensorflow_model_optimization.sparsity import keras as sparsity

pruning_params = {
    # Pruning schedule: hold 50% sparsity from step 0, updating the masks every 100 steps
    'pruning_schedule': sparsity.ConstantSparsity(0.5, begin_step=0, frequency=100)
}

# Wrap the model's layers so that low-magnitude weights get pruned during training
pruned_model = sparsity.prune_low_magnitude(model, **pruning_params)

First, a Sequential model is created with several fully connected layers. The sparsity hyperparameters are then defined as a schedule and an update frequency. These parameters are passed to the prune_low_magnitude function, which applies pruning to the specified layers of the model. The model is then compiled and trained in the same way as a regular model.
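A minimal end-to-end sketch of that flow (layer sizes and the x_train/y_train data are placeholders, and the tensorflow_model_optimization package is assumed to be installed):

import tensorflow as tf
from tensorflow_model_optimization.sparsity import keras as sparsity

# A small Sequential model with fully connected layers
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

pruning_params = {
    'pruning_schedule': sparsity.ConstantSparsity(0.5, begin_step=0, frequency=100)
}
pruned_model = sparsity.prune_low_magnitude(model, **pruning_params)

pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

# UpdatePruningStep advances the pruning masks as training progresses
pruned_model.fit(x_train, y_train, epochs=2,
                 callbacks=[sparsity.UpdatePruningStep()])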

In theory, of course, the size of the model should go down and the speed should go up.
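In practice, the size win shows up once the pruning wrappers are stripped and the saved file is compressed; a small sketch reusing the names above (file names are placeholders):

# Remove the pruning wrappers, leaving an ordinary Keras model with zeroed weights
final_model = sparsity.strip_pruning(pruned_model)
final_model.save('pruned_model.h5')

# Zeroed weights compress very well, e.g. with gzip
import gzip, shutil
with open('pruned_model.h5', 'rb') as src, gzip.open('pruned_model.h5.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)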

But does this method work for all types of networks? No: even after removing everything unnecessary from the model, you are still left with embeddings, and we do not recommend throwing those away.

Graph Transform Tool and Quantization Method

You can read more about how to use the Graph Transform Tool on GitHub.

Additionally, you need to import the utility.

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

Create a computation graph in TensorFlow and define the list of transformations you want to apply to it. Transformations are passed as a list of strings, each containing the name of a transformation and, optionally, its parameters in parentheses.

For example, let's get the graph of a network:

# Load the SavedModel and take the graph of its serving signature
imported_model = tf.saved_model.load(export_dir)
graph_def = imported_model.signatures["serving_default"].graph.as_graph_def()

And let’s set the transformations directly:

transforms = [
    'fold_constants(ignore_errors=true)'  # fold constant subexpressions into single nodes
]

And apply them to the graph:

optimized_graph_def = TransformGraph(graph_def, input_nodes, output_nodes, transforms)

Here we pass in the graph itself, the lists of input and output node names, and the transformations.

In general, GTT is a powerful tool, but not a simple one, and it is not always needed in development; it all depends on the situation. Let's call it professional optimization mode.

Using the tool, you can also perform quantization: the process of reducing the precision of numbers (usually from 32-bit floats to 8-bit integers). It's simple: the tool has its own transformations for this optimization method.

transforms = [
    'quantize_weights',  # quantize the weights
    'quantize_nodes',    # quantize the computation nodes (activations)
]
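Then, just as before, apply the transforms and write the result out; a sketch reusing graph_def, input_nodes and output_nodes from above (the file name is a placeholder):

quantized_graph_def = TransformGraph(graph_def, input_nodes, output_nodes, transforms)

# Serialize the quantized graph to disk
with tf.io.gfile.GFile('quantized_model.pb', 'wb') as f:
    f.write(quantized_graph_def.SerializeToString())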

Converting values to a lower bit width reduces the size of the file and therefore improves performance, although it slightly reduces the accuracy of the neural network. The effectiveness of the method has not been assessed all that thoroughly, but it is a great option for running a network on mobile devices, where computing power is scarce.

The benefits of optimization are obvious: smaller models, better performance, faster training, and efficient use of resources, your TPU and GPU. Use it and happy coding!
