The Gradient Descent We Deserve

This article will consider a slight modification of the gradient descent algorithm, millions of times less prone to getting stuck in a local optimum than the existing ones (and with the possibility of parallelization).

This article was prompted to write the quality of modern publications on the topic of problems of modern machine learning (https://habr.com/ru/company/skillfactory/blog/717360/ Wouter van Heeswijk, PhD).

What’s wrong with that?

Unsatisfactory policy gradient updates. For example, very small steps, when they hang in the local optimum, or large steps, due to which they miss the biggest reward.

In any formulation, we or explore the whole space in small steps (and do not miss the optimum), or we come to the optimum in large steps. In general, it would be more correct to adjust the step size based on some analysis of the error function using the FFT, but this is in the next article.

Algorithms based on the approximation of a utility function systematically choose actions with an overestimated utility.

The guarantee of optimal heuristics in the absence of underestimation of the utility function. In the case of trying to save a million / billion steps of the algorithm, you can get suboptimal heuristics and not get a solution even for a trillion.

It should also be remembered that gradient descent is not the optimal heuristic for all types of error functions (that’s why we came up with training with a moment, and other tricks that at first glance help to find optimal solutions faster). However, it makes no sense to lose optimality even in some subset of the set of parameters.

What to do?

Let’s write our own version of gradient descent, with many starting points and a priority queue.

In general, the algorithm looks like this:

We initialize N starting points, for each of the points we count the error, write everything to the priority queue (you can do it in parallel)
Cycle:
We get a point from the queue, we consider the gradient of the error, we consider the change in position, we consider the error
If the point is stuck in the local optimum (the last N positions differ by less than epsilon) – add the point to the list of those stuck in the local optimum (maybe initialize a new point)
If not, put the point back in the priority queue

Repeat steps 2-4 until the steps are over or we get the required error level

What will it give us?

It does not require proof that the chance to get stuck in the local optimum is N times less, where N is the number of starting points.

The complexity (number of steps required to solve) will be less than N gradient descents (in M steps), or M*N, but greater than M+N (initialization and descent).

Code

python

from queue import PriorityQueue
import numpy as np
import random
from math import sqrt

##Parameters

#learning rate
dt = 0.01

#maximum step count
maxSteps = 1000000

#minimum_error
minError = 0.01
##Specifical parameters

#small number to calculate derivative of target function (for using script with large object functions, like neural nets or so)
delta = 0.001

#target vector len
trace_len = 20

#initilal point count
N = 100000

#size of parameters space
spaceSize=[[-1000,1000],[-1000,1000]]

#for analysis purpose
GetErrorCallCount = 0

localOptimums = []

#taked from https://github.com/behnamasadi/gradient_descent/
def objective_function(vector):
    z=-(4*np.exp(-(vector[0]-4)**2 - (vector[1]-4)**2)+2*np.exp(-(vector[0]-2)**2 - (vector[1]-2)**2))+4
    return z
    
#we will compute exact derivative
def get_objective_function_derivatives(vector,delta,error):
    res = []
    
    initial_error = error
    
    pos_vector = vector[0]

    for n in range(len(pos_vector)):
        vec = pos_vector
        vec[n] = vec[n]+delta
        
        derivative_n = -(GetError(vec)-initial_error)/delta
        res.append(derivative_n)

    
    #change to correct derivative calculation
    return res
    
#returns normalized derivative vector
def get_error_gradient(vector,delta, error):
    derivative = get_objective_function_derivatives(vector, delta, error)
    #recalculate derivative, if take 0 multiply delta by ten
    
    while all(d ==0 for d in derivative):
        # print("get zero derivative")
        # print(derivative)
        # print(delta)
        delta=delta*10
        
        #doesnt help, return ReInitialization flag
        if (delta>1000):
            return derivative
            
            
        derivative = get_objective_function_derivatives(vector, delta, error)
        
    return derivative
    
def initGradDesc(sampleVector, priorityQueue, N):
    res = []
    global spaceSize
    #Let's generate N random vectors with same length
    for num in range(N):

        #Initialize derivatives and vector randomly
        pos = []
        for i in range(len(sampleVector)):
            pos.append(random.uniform(spaceSize[i][0],spaceSize[i][1]))
        
        error = GetError(pos)

        res.append([error, [pos]])
    
    #Now we have N random vectors, please note that we must have normaziled data everywhere )
    #lets add all those vectors in priorityQueue with our GetError Function

    for x in res:
        priorityQueue.put(x)

#TODO: rewrite euler integration and calculation of derivatives

#use simpliest Euler method to find change of coordinates with some derivatives
def calculateNewPosition(stateVector, error_grad):
    delta = []

    delta = [x[0]+x[1]*dt for x in zip(stateVector[0],error_grad)]

    return delta

    
#returns error for current vector as a solution
def GetError(vector):
    #update call count
    global GetErrorCallCount
    GetErrorCallCount = GetErrorCallCount+1;
    return objective_function(vector)
    
def getLengthBetween(item, newPoint):
    l = sqrt(sum((i[0]-i[1])*(i[0]-i[1]) for i in zip(item,newPoint)))
    return l
    
def makeGradDesc(priorityQ, point):
    global delta
    
    errorInit = point[0]
    state_vector = point[1]
 
    #Note that point structure is [vec, derivatives_1,...,derivatives_N]
    error_grad = get_error_gradient(state_vector,delta,errorInit)
   
    newPoint = calculateNewPosition(state_vector, error_grad)
    
    error = GetError(newPoint)
    
    #merge data into one list
    res = [newPoint]

    curr_trace = 0;
    for positions in state_vector:
        res.append(positions)
        curr_trace = curr_trace+1
        if (curr_trace>=trace_len):
            break
        
    #calculate error 
    res = [error,res]
        
    #Check for local optimum (cost function dont change last N steps), initialize new point
    if (all(getLengthBetween(item, newPoint) < dt for item in state_vector)):
        localOptimums.append(res)
        initGradDesc(newPoint, priorityQ, 1)
    else:
        priorityQ.put(res)
    
#main gradient descent function that runs for MaxStepCount or while minimal error is not reached    
def PreformDradDesc(priorityQ, MaxStepCount,minError):
    point = priorityQ.get()
    error = point[0]
    steps = 0
    while ((abs(error)>=abs(minError)) and (steps<MaxStepCount)):
        steps = steps + 1
        makeGradDesc(priorityQ, point)
        point = priorityQ.get()
        error =  point[0]
        
    return [steps, error, point]

#Where to store statistics about steps count/result error
research_results = []

#Queue needed for our A*
priorityQ = PriorityQueue()

initGradDesc(spaceSize, priorityQ, N+1)

print("Data example: ")
point = priorityQ.get()
print(point)


#Print results
print("Cost function calls for initialization: ")
print(GetErrorCallCount)

minimum_error = PreformDradDesc(priorityQ, maxSteps, minError)
print("Minimum error: ")
print(minimum_error[1])
print("Result point: ")
print(minimum_error[2][1][1])

print("Initial points: ")
print(N)
print("Total steps: ")
print(minimum_error[0])
print("Maximum steps: ")
print(maxSteps)

print("Cost function calls: ")
print(GetErrorCallCount)

print("Local minimums queue len: ")
print(len(localOptimums))
# for item in localOptimums:
    # print(item) 
    
print("Raw results: ")
for item in research_results:
    print(item)

Separately, it should be told about the calculation of the gradient. If the gradient is zero and the error is still large, just multiply the infinitesimal increment by 10.
This allows you to slightly expand the range of calculations and find even small values of derivatives (which do not fit in float32-64).