This article considers a slight modification of the gradient descent algorithm that is far less prone to getting stuck in a local optimum than the standard version (and that can be parallelized).
This article was prompted by the quality of modern publications on the problems of modern machine learning (https://habr.com/ru/company/skillfactory/blog/717360/ Wouter van Heeswijk, PhD).
What’s wrong with that?
Unsatisfactory policy gradient updates. For example, steps that are too small, which leave the search hanging in a local optimum, or steps that are too large, which overshoot the biggest reward.
In any formulation, we either explore the whole space in small steps (and do not miss the optimum), or we reach the optimum quickly in large steps. Strictly speaking, it would be better to adjust the step size based on some analysis of the error function using the FFT, but that is a topic for the next article.
Algorithms based on the approximation of a utility function systematically choose actions with an overestimated utility.
A heuristic is guaranteed to be optimal only if the utility function is never underestimated. By trying to save a million or a billion algorithm steps, you can end up with a suboptimal heuristic and fail to reach a solution even after a trillion.
It should also be remembered that gradient descent is not the optimal heuristic for every type of error function (which is why momentum-based training and other tricks were invented that, at first glance, help find optimal solutions faster). Still, there is no reason to give up optimality even on some subset of the parameter space.
What to do?
Let’s write our own version of gradient descent, with many starting points and a priority queue.
In general, the algorithm looks like this:
We initialize N starting points, compute the error for each of them, and push everything into the priority queue (this can be done in parallel)
We pop a point from the queue, compute the error gradient, compute the new position, and compute the new error
If the point is stuck in a local optimum (its last few recorded positions differ by less than epsilon), we add it to the list of stuck points (and optionally initialize a new point)
If not, we put the point back into the priority queue
We repeat steps 2-4 until the step budget is exhausted or the required error level is reached
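The steps above can be sketched as a minimal multi-start descent driven by a priority queue. Everything here is an illustrative assumption, not the article's actual code: a made-up 1-D objective with one local and one global minimum, a `heapq`-based queue, and arbitrary learning rate and bounds.

```python
import heapq
import random

# assumption: toy objective with a local minimum near x = 1.1
# and the global minimum near x = -1.3
def f(x):
    return x**4 - 3*x**2 + x

def grad(x, h=1e-6):
    # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

def multi_start_descent(n_points=20, lr=0.01, eps=1e-8, max_steps=5000):
    # step 1: initialize n_points random points, keyed by error in a heap
    heap = [(f(x), x) for x in (random.uniform(-2, 2) for _ in range(n_points))]
    heapq.heapify(heap)
    stuck = []  # points that stalled in some optimum
    for _ in range(max_steps):
        err, x = heapq.heappop(heap)             # step 2: most promising point first
        new_x = x - lr * grad(x)                 # gradient step
        if abs(new_x - x) < eps:                 # step 3: stalled -> record, re-seed
            stuck.append((err, x))
            new_x = random.uniform(-2, 2)
        heapq.heappush(heap, (f(new_x), new_x))  # step 4: back into the queue
    return min(heap + stuck)                     # best (error, position) seen
```

Note that the queue always hands out the currently most promising point, so the step budget is spent where it pays off most; stalled points free their budget for fresh random restarts.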
What will it give us?
It does not require proof that the chance of getting stuck in a local optimum is roughly N times lower, where N is the number of starting points.
The complexity (the number of steps needed to reach a solution) will be less than that of N independent gradient descents of M steps each, i.e. M*N, but greater than M+N (initialization plus a single descent).
from queue import PriorityQueue
import numpy as np
import random
from math import sqrt

## Parameters
# learning rate
dt = 0.01
# maximum step count
maxSteps = 1000000
# minimum error
minError = 0.01

## Specific parameters
# small increment for numerical derivatives (useful when the objective
# is expensive or large, like a neural network)
delta = 0.001
# length of the stored position trace
trace_len = 20
# initial point count
N = 100000
# size of the parameter space
spaceSize = [[-1000, 1000], [-1000, 1000]]

# for analysis purposes
GetErrorCallCount = 0
localOptimums = []

# taken from https://github.com/behnamasadi/gradient_descent/
def objective_function(vector):
    z = -(4*np.exp(-(vector[0]-4)**2 - (vector[1]-4)**2)
          + 2*np.exp(-(vector[0]-2)**2 - (vector[1]-2)**2)) + 4
    return z

# numerical partial derivatives of the objective function;
# the minus sign gives the direction of steepest descent
def get_objective_function_derivatives(vector, delta, error):
    res = []
    initial_error = error
    for n in range(len(vector)):
        vec = list(vector)
        vec[n] = vec[n] + delta
        derivative_n = -(GetError(vec) - initial_error) / delta
        res.append(derivative_n)
    return res

# returns the descent-direction derivative vector
def get_error_gradient(vector, delta, error):
    derivative = get_objective_function_derivatives(vector, delta, error)
    # if every component came out zero, multiply delta by ten and recalculate
    while all(d == 0 for d in derivative):
        delta = delta * 10
        # that did not help either: return the zero vector as a re-initialization flag
        if delta > 1000:
            return derivative
        derivative = get_objective_function_derivatives(vector, delta, error)
    return derivative

def initGradDesc(sampleVector, priorityQueue, N):
    global spaceSize
    res = []
    # generate N random vectors of the same length as sampleVector
    for num in range(N):
        pos = []
        for i in range(len(sampleVector)):
            pos.append(random.uniform(spaceSize[i][0], spaceSize[i][1]))
        error = GetError(pos)
        res.append([error, [pos]])
    # note that the data must be normalized everywhere;
    # add all the vectors to the priority queue, keyed by error
    for x in res:
        priorityQueue.put(x)

# simplest Euler step along the descent direction
def calculateNewPosition(stateVector, error_grad):
    return [x + g * dt for x, g in zip(stateVector, error_grad)]

# returns the error for the current vector as a solution
def GetError(vector):
    # update the call count
    global GetErrorCallCount
    GetErrorCallCount = GetErrorCallCount + 1
    return objective_function(vector)

def getLengthBetween(item, newPoint):
    return sqrt(sum((a - b) * (a - b) for a, b in zip(item, newPoint)))

def makeGradDesc(priorityQ, point):
    global delta
    # point structure: [error, [pos_newest, pos_previous, ...]]
    errorInit = point[0]
    trace = point[1]
    state_vector = trace[0]
    error_grad = get_error_gradient(state_vector, delta, errorInit)
    newPoint = calculateNewPosition(state_vector, error_grad)
    error = GetError(newPoint)
    # prepend the new position and keep at most trace_len entries
    new_trace = [newPoint]
    for position in trace:
        new_trace.append(position)
        if len(new_trace) >= trace_len:
            break
    res = [error, new_trace]
    # local optimum check: the last trace_len positions all lie within dt
    # of the new point; if so, store the point and initialize a fresh one
    if len(new_trace) >= trace_len and all(getLengthBetween(item, newPoint) < dt for item in trace):
        localOptimums.append(res)
        initGradDesc(newPoint, priorityQ, 1)
    else:
        priorityQ.put(res)

# main loop: run for MaxStepCount steps or until the minimal error is reached
def performGradDesc(priorityQ, MaxStepCount, minError):
    point = priorityQ.get()
    error = point[0]
    steps = 0
    while (abs(error) >= abs(minError)) and (steps < MaxStepCount):
        steps = steps + 1
        makeGradDesc(priorityQ, point)
        point = priorityQ.get()
        error = point[0]
    return [steps, error, point]

# where to store statistics about step counts / resulting errors
research_results = []
# the queue needed for our A*-like point selection
priorityQ = PriorityQueue()
initGradDesc(spaceSize, priorityQ, N + 1)

print("Data example: ")
point = priorityQ.get()
print(point)
# return the inspected point to the queue
priorityQ.put(point)

print("Cost function calls for initialization: ")
print(GetErrorCallCount)

result = performGradDesc(priorityQ, maxSteps, minError)
print("Minimum error: ")
print(result[1])
print("Result point: ")
print(result[2])
print("Initial points: ")
print(N)
print("Total steps: ")
print(result[0])
print("Maximum steps: ")
print(maxSteps)
print("Cost function calls: ")
print(GetErrorCallCount)
print("Local minimums queue len: ")
print(len(localOptimums))
# for item in localOptimums:
#     print(item)
print("Raw results: ")
for item in research_results:
    print(item)
The gradient calculation deserves a separate note. If the computed gradient is zero while the error is still large, we simply multiply the finite-difference increment by 10 and try again.
This slightly extends the dynamic range of the calculation and lets us detect even very small derivative values (ones that would otherwise vanish in float32/float64 arithmetic).
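A toy illustration of this trick. The function below is a made-up example whose slope (1e-13) is so small that the forward difference cancels to zero at the default increment; growing the increment recovers a nonzero estimate.

```python
# grow the increment until the forward difference stops cancelling to zero
def finite_diff(f, x, delta=1e-3, max_delta=1000.0):
    d = (f(x + delta) - f(x)) / delta
    while d == 0.0 and delta <= max_delta:
        delta = delta * 10
        d = (f(x + delta) - f(x)) / delta
    return d

# assumption: toy function whose slope is far below float64 epsilon near f = 1
f = lambda x: 1.0 + 1e-13 * x

naive = (f(0.001) - f(0.0)) / 0.001  # 1 + 1e-16 rounds to 1.0, so this is 0.0
robust = finite_diff(f, 0.0)         # delta grows until the difference survives
```

With delta = 0.001 the increment 1e-16 is below half an ulp of 1.0 and is lost; at delta = 0.01 the increment 1e-15 is representable, and the estimate lands within about 10% of the true slope.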
In 10 out of 10 launches, the global minimum was found (which is not bad in itself 🙂).
The local-minimum detector works both for a genuine local minimum and for a plateau.
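A minimal version of such a detector. The traces below are synthetic, made up only to show that the same distance test fires on a plateau (where positions barely move) but not during an active descent.

```python
from math import sqrt

def stalled(trace, new_point, eps):
    # a stall is flagged when every recent position lies within eps of the new one
    return all(
        sqrt(sum((a - b) ** 2 for a, b in zip(p, new_point))) < eps
        for p in trace
    )

# on a plateau the gradient is ~0, so consecutive positions barely change
plateau_trace = [[1.0 + 1e-6 * i, 2.0] for i in range(20)]
# during a normal descent the positions are still far apart
moving_trace = [[0.1 * i, 2.0] for i in range(20)]

print(stalled(plateau_trace, [1.0, 2.0], eps=0.01))  # True
print(stalled(moving_trace, [2.0, 2.0], eps=0.01))   # False
```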