Features of automatic differentiation in PyTorch. Part 2

Hello! This is the BARS Group team, and we are continuing our series on the PyTorch framework.

PyTorch is an open-source machine learning framework for Python that is widely used for building neural network applications. Machine learning frameworks usually focus on either ease of use or speed; PyTorch aims to combine both. It supports code as a model, simplifies debugging, and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. At the same time, every aspect of PyTorch is a regular Python program under the full control of the user.

This is the second part of our translation of an article by the PyTorch development team (Adam Paszke, Sam Gross, and their colleagues). In the first part, the authors analyzed the key differences between PyTorch and other frameworks and libraries for automatic differentiation, such as DyNet, as well as the features of its interface (variable flags, hooks, extensions). Today's part covers how the framework is implemented: memory management (freeing intermediate values as soon as they become unnecessary), C++ operators, and in-place operations on tensors, including the cases where they invalidate data that differentiation needs.


Internally, a Variable is simply a wrapper around a Tensor that also holds a reference to a graph of Function objects. This graph is an immutable, purely functional representation of the derivative of the computed function. Variables are simply mutable pointers to this graph (they are updated when an in-place operation occurs).
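The pointer from a variable into the graph can be inspected from Python. The sketch below assumes a recent PyTorch release, in which Variable has been merged into Tensor, so a tensor created with `requires_grad=True` plays the role of the Variable described here. The exact class names of the graph nodes are an implementation detail and may differ between versions.

```python
import torch

# A tensor with requires_grad=True stands in for the Variable from the text.
x = torch.ones(3, requires_grad=True)
y = (x * 2).tanh()

# grad_fn is the tensor's reference into the graph of Function objects.
root = y.grad_fn
root_name = type(root).__name__

# next_functions holds (function, input_index) pairs for the nodes that
# produced the inputs of this node; entries may be None for constants.
parent_names = [type(fn).__name__ for fn, _ in root.next_functions if fn is not None]

# A leaf tensor created by the user has no producing function.
leaf_has_no_fn = x.grad_fn is None

print(root_name, parent_names, leaf_has_no_fn)
```

Walking `next_functions` recursively visits the whole graph that backpropagation will later traverse.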

A Function can be thought of as a closure that holds all the context needed to compute vector-Jacobian products. It accepts the gradients of its outputs and returns the gradients of its inputs (formally, the left product including the term for its own operation). A graph of Functions is a single-argument closure that takes a left product and multiplies it by the derivatives of all the operations it contains. The left products passed around are themselves Variables, which makes the evaluation of the graph differentiable.
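The closure view can be illustrated without PyTorch at all. The following is a minimal sketch in plain Python for scalar values, where a vector-Jacobian product reduces to an ordinary multiplication. All names in it (`tanh_with_vjp`, `scale_with_vjp`, `compose`) are hypothetical helpers made up for this illustration and are not part of any PyTorch API.

```python
import math

def tanh_with_vjp(x):
    """Return tanh(x) and a closure mapping an output gradient to an input gradient."""
    y = math.tanh(x)

    def vjp(grad_out):
        # The closure captures y as its context: d tanh(x)/dx = 1 - tanh(x)^2.
        return grad_out * (1.0 - y * y)

    return y, vjp

def scale_with_vjp(x, c):
    """Return c * x and the closure for its derivative."""
    def vjp(grad_out):
        return grad_out * c

    return c * x, vjp

def compose(*vjps):
    """Chain closures (given in forward order) into one single-argument closure."""
    def graph_vjp(grad_out):
        # Apply the closures in reverse order, as backpropagation does.
        for vjp in reversed(vjps):
            grad_out = vjp(grad_out)
        return grad_out

    return graph_vjp

x = 0.5
h, vjp_scale = scale_with_vjp(x, 2.0)   # h = 2x
y, vjp_tanh = tanh_with_vjp(h)          # y = tanh(2x)
backward = compose(vjp_scale, vjp_tanh)
grad_x = backward(1.0)                  # dy/dx = 2 * (1 - tanh(2x)^2)
print(grad_x)
```

In PyTorch the values flowing through these closures are themselves differentiable, which is what makes higher-order derivatives possible; the plain floats used here do not capture that part.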

Memory management

The main use case for PyTorch is training machine learning models on the GPU. Since one of the biggest limitations of GPUs is their small amount of memory, PyTorch takes great care to ensure that all intermediate values are freed as soon as they are no longer needed. Python is well suited for this purpose because it is reference counted by default (the garbage collector is used only to break cycles).
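The effect of reference counting is easy to observe in plain Python. The sketch below assumes the CPython interpreter, where an object is destroyed as soon as its last strong reference disappears; `Intermediate` is a hypothetical stand-in for a large intermediate buffer.

```python
import weakref

class Intermediate:
    """Hypothetical stand-in for a large intermediate value."""
    def __init__(self, name):
        self.name = name

buf = Intermediate("activation")
probe = weakref.ref(buf)           # a weak reference does not keep buf alive

alive_before = probe() is not None
del buf                            # drop the last strong reference
alive_after = probe() is not None  # under CPython the object is already gone

print(alive_before, alive_after)
```

No call to the garbage collector is needed here, which is the property the framework relies on to return memory promptly.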

Variable and Function in PyTorch have to be designed to work well under reference counting. For example, a Function records pointers to the Function that consumes its result, so that a Function subgraph is freed as soon as the output Variable retaining it is no longer used. This is the opposite of conventional closure ownership, where a closure retains the closures it invokes.

Another challenge is avoiding reference cycles. A naive implementation of automatic differentiation can easily introduce such cycles (for example, when a differentiable function wants to keep a reference to its output). PyTorch breaks them by recording not a full-fledged variable but a "saved variable", which in such cases omits the pointer to the function itself.
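The difference between the two designs can be shown with a toy model in plain Python. The classes `Fn`, `Var` and `SavedVar` below are hypothetical and only mimic the ownership structure described above; they are not PyTorch classes. The sketch assumes CPython and temporarily disables the cyclic garbage collector so that only reference counting is at work.

```python
import gc
import weakref

class Fn:
    """Hypothetical function node that may save a reference to its output."""
    def __init__(self):
        self.saved_output = None

class Var:
    """Hypothetical variable: data plus a pointer to the function that produced it."""
    def __init__(self, data, grad_fn):
        self.data = data
        self.grad_fn = grad_fn

class SavedVar:
    """Keeps only the data and omits the pointer back to the function."""
    def __init__(self, var):
        self.data = var.data

def freed_by_refcount(save_full_variable):
    fn = Fn()
    out = Var([1.0, 2.0], fn)
    # Saving the full variable creates the cycle fn -> out -> fn.
    fn.saved_output = out if save_full_variable else SavedVar(out)
    probe = weakref.ref(fn)
    del fn, out
    return probe() is None

gc.disable()
try:
    naive_freed = freed_by_refcount(True)    # cycle: not freed by refcounting
    saved_freed = freed_by_refcount(False)   # no cycle: freed immediately
finally:
    gc.enable()
    gc.collect()                             # clean up the leaked cycle

print(naive_freed, saved_freed)
```

With the full variable saved, the pair stays in memory until the cycle collector runs; with the reduced "saved variable", both objects are released at once.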

C++ operators

Although all operations can be expressed in Python using the extension API, they then suffer from high interpreter overhead. Moving operators to C++ reduces this overhead and lowers the latency of a single differentiable operation dispatched from Python to as little as 3.4 µs, compared to 1.7 µs for a plain tensor operation. An added benefit is that multiple threads can execute the operators in parallel (unlike Python code, whose parallelism is limited by the GIL). This is especially important when working with multiple GPUs, which cannot be saturated by a single CPU thread.

Support for in-place operations

PyTorch users often want to perform operations on a tensor in place, to avoid allocating a new tensor when it is known to be unnecessary. Intuitively, an in-place operation is equivalent to the corresponding out-of-place operation, except that the variable modified in place has its computation history "rebased" to point to the derivative of the in-place operator instead of its previous function (the computation history always remains purely functional). However, in-place operations interact with automatic differentiation in subtle ways.
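The "rebasing" of the history can be observed through the `grad_fn` attribute. The sketch below assumes a recent PyTorch release; the node class names it prints are implementation details, so the code only looks at them for illustration.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2
before = type(y.grad_fn).__name__      # history points to the multiplication

y.add_(1)                              # in-place operation on a non-leaf tensor
after = type(y.grad_fn).__name__       # history now points to the in-place add

# The new node still refers to the old one, so the graph stays functional.
previous = type(y.grad_fn.next_functions[0][0]).__name__

# Multiplication by a constant does not need the old value of y,
# so differentiation still succeeds here.
y.sum().backward()
grad = x.grad.tolist()

print(before, after, previous, grad)
```

The old node is not modified; `y` is simply re-pointed to a new node whose input is the old one.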


An in-place operation can invalidate the data needed to calculate derivatives. Consider the following example:

y = x.tanh()
y.add_(3)
y.backward()

This program instructs PyTorch to perform an in-place operation on y. However, this is unsound if y, whose value is tanh(x), was saved so that it could be used in the backward computation (recall that tanh′(x) = 1 − tanh²(x)).

Making a copy of y when saving it would be inefficient, so instead PyTorch fails with an error at runtime when differentiating this program. The underlying storage of every variable is associated with a version counter that tracks how many in-place operations have been applied to that storage. When a variable is saved, the value of the version counter at that moment is recorded. When the saved variable is later used, an error is raised if the recorded version does not match the current one.
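The version check can be reproduced in a few lines. The sketch below assumes a recent PyTorch release and reads the counter through `_version`, which is an internal attribute (note the leading underscore) and is used here only for inspection. Since `y` is not a scalar, the example reduces it with `sum()` before calling `backward()`.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.tanh()                 # tanh saves its output for the backward pass
version_saved = y._version   # counter value at the time y was saved

y.add_(3)                    # in-place operation bumps the counter
version_now = y._version

try:
    y.sum().backward()       # the saved tanh output is now stale
    failed = False
    message = ""
except RuntimeError as err:
    failed = True
    message = str(err)

print(version_saved, version_now, failed)
print(message)
```

Replacing `y.add_(3)` with the out-of-place `y = y + 3` leaves the saved value untouched, and the backward pass then succeeds.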


PyTorch supports non-trivial aliasing between variables. Operations such as transpose and narrow create new tensors with new sizes and strides that share storage with the original tensors. The problem with aliasing is that it can require non-trivial transformations of the computation histories of many variables. Consider the following example:

y = x[:2]
x.add_(3)
y.backward()

Normally, an in-place operation on x affects only the computation history of x. In this case, however, the in-place addition to x also updates some elements of y, so the computation history of y has changed as well. Supporting this case is rather non-trivial, which is why PyTorch rejects this program, using an extra field in the version counter (see the description of version counters above) to determine that the data is shared.
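The storage sharing behind this problem can be observed directly. The sketch below does not reproduce the rejected program; it assumes a recent PyTorch release and uses tensors that do not require gradients, only to show that a view created with `narrow` sees in-place changes made to its base. The `_version` attribute is internal and is read here purely for illustration.

```python
import torch

x = torch.zeros(4)
y = x.narrow(0, 0, 2)        # view of the first two elements, no copy

same_memory = y.data_ptr() == x.data_ptr()

x.add_(3)                    # in-place update of the base tensor
y_values = y.tolist()        # the view reflects the update

# Base and view report the same version after the in-place operation.
versions = (x._version, y._version)

print(same_memory, y_values, versions)
```

Because every alias observes the same write, an in-place operation on one tensor can silently change the value that another tensor's history was built on.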

In future work, we aim to relax this restriction. The difficulty is that a variable can have arbitrarily many aliases, so it is not feasible to go through them one by one and update their computation histories. However, it may be possible to initialize a computation history lazily, materializing it only when a computation produces a result that is not aliased to the original variable.

In conclusion, a few links for those who want to get to know PyTorch better and study the topic of automatic differentiation on this framework:
