Next: Backpropagation Up: Network training Previous: Network training   Contents


Gradient descent

In order to measure how far the network output is from the true value, we introduce a cost function, which is a function of all weights $w_{ij}$:

$E = \frac{1}{2} \sum_i \left( t_i - o_i \right)^2$   (3.2)

where $t_i$ is the desired output, i.e. the training target, and $o_i$ the network output. Thus, this function approaches zero as the network output approaches the training target.

The main task is now to adjust the weights such that the cost function becomes minimal. However, the surface of the cost function is in general very complicated. Ideally, the training procedure should find the global minimum of this surface, but unless the specific task is very simple, this is in general impossible. Instead, the training procedure has to end in a local minimum. This illustrates the main difficulty: a neighbouring valley (local minimum) could be much deeper than the present one, but the training procedure has no way to find this out.
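As a rough illustration of such a cost function, the sketch below assumes a sum-of-squares form, $\frac{1}{2}\sum_i (t_i - o_i)^2$. The function name `quadratic_cost` and the sample numbers are hypothetical and chosen only to show that the cost shrinks to zero as the outputs approach the targets.

```python
import numpy as np

def quadratic_cost(targets, outputs):
    """Sum-of-squares cost 0.5 * sum_i (t_i - o_i)^2 (assumed form)."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return 0.5 * np.sum((targets - outputs) ** 2)

# Hypothetical training targets and three increasingly accurate network outputs.
targets = np.array([1.0, 0.0, 1.0])
candidate_outputs = [
    [0.0, 1.0, 0.0],  # completely wrong
    [0.8, 0.2, 0.9],  # close to the targets
    [1.0, 0.0, 1.0],  # identical to the targets
]

for outputs in candidate_outputs:
    print(outputs, quadratic_cost(targets, outputs))
```

The printed cost decreases from the first to the last candidate and vanishes when the outputs equal the targets.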

The gradient descent algorithm changes each weight $w_{ij}$ in proportion to the negative gradient of the cost function $E$ at the current point:

$\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$   (3.3)

where $\eta$ is the learning rate, which has to be chosen "appropriately", i.e. there is no general rule for what its value should be. If $\eta$ is too small, the algorithm will be very slow; if $\eta$ is too large, the algorithm may oscillate wildly. Ideally, each weight should have its own learning rate. Then connections far away from the minimum can change rapidly, whereas connections already close to the minimum change only slightly.
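The influence of the learning rate can be sketched on a toy problem. The example below is a minimal illustration, not a network training: it assumes a one-dimensional cost $w^2$ with gradient $2w$, and the names `gradient_descent` and `toy_gradient` as well as the three values of $\eta$ are hypothetical choices.

```python
import numpy as np

def gradient_descent(grad, w0, eta, n_steps):
    """Repeat the update w <- w - eta * grad(w) and record every iterate."""
    w = np.array(w0, dtype=float)
    history = [w.copy()]
    for _ in range(n_steps):
        w = w - eta * grad(w)  # step against the gradient, scaled by eta
        history.append(w.copy())
    return np.array(history)

def toy_gradient(w):
    """Gradient of the toy cost w^2, whose minimum is at w = 0."""
    return 2.0 * w

# Same start point and number of steps, three different learning rates.
runs = {
    eta: gradient_descent(toy_gradient, w0=1.0, eta=eta, n_steps=20)
    for eta in (0.01, 0.4, 1.1)
}

for eta, history in runs.items():
    print(f"eta={eta}: final w = {history[-1]:.3g}")
```

With the smallest rate the weight has barely moved towards the minimum after 20 steps, with the intermediate rate it has essentially reached it, and with the largest rate it changes sign at every step while its magnitude grows. Because the update is element-wise, `eta` may also be passed as an array with one entry per weight, which gives each weight its own learning rate.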

Note that other choices for the cost function are possible, see e.g. section 3.2.7.


Ulrich Kerzel 2002-08-27