Online Local Gain Adaptation for Multi-Layer Perceptrons

N. N. Schraudolph. Online Local Gain Adaptation for Multi-Layer Perceptrons. Technical Report IDSIA-09-98, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, 1998.


Abstract

We introduce a new method for adapting the step size of each individual weight in a multi-layer perceptron trained by stochastic gradient descent. Our technique derives from the K1 algorithm for linear systems (Sutton, 1992), which in turn is based on a diagonalized Kalman Filter. We expand upon Sutton's work in two regards: K1 is a) extended to nonlinear systems, and b) made more efficient by linearizing an exponentiation operation. The resulting ELK1 (extended, linearized K1) algorithm is computationally little more expensive than alternative proposals (Zimmermann, 1994; Almeida et al., 1999), and does not require an arbitrary smoothing parameter. On a first benchmark problem ELK1 clearly outperforms these alternatives, as well as stochastic gradient descent with momentum, even when the number of floating-point operations required per weight update is taken into account. Unlike the method of Almeida et al., ELK1 does not require statistical independence between successive training patterns, and handles large initial learning rates well.
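
The abstract does not state the ELK1 update equations, so the following is only a minimal sketch of the general idea of per-weight gain adaptation with a linearized exponentiation. It uses an IDBD-style rule for a single linear unit (a related gain-adaptation scheme from Sutton's 1992 work, not K1 or ELK1 itself), and it replaces the exponential gain update with a clipped first-order approximation. The names `linearized_gain_step`, `train_demo`, and `meta_rate`, the clipping floor of 0.5, and the toy regression task are all assumptions made for illustration.

```python
import random


def linearized_gain_step(w, gain, h, x, target, meta_rate=0.01):
    """One online update of a linear unit with a separate gain per weight.

    Hypothetical illustration (IDBD-style), not the report's ELK1 rule.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - y  # prediction error on this pattern
    for i, xi in enumerate(x):
        # The multiplicative gain update exp(meta_rate * delta * xi * h[i])
        # is replaced by its first-order approximation 1 + u; the floor of
        # 0.5 (an assumed safeguard) keeps the gain positive.
        gain[i] *= max(0.5, 1.0 + meta_rate * delta * xi * h[i])
        step = gain[i] * delta * xi
        w[i] += step
        # h[i] is a decaying trace of the recent steps taken on weight i.
        h[i] = h[i] * max(0.0, 1.0 - gain[i] * xi * xi) + step
    return delta


def train_demo(steps=2000, seed=0):
    """Fit a noise-free linear target online; returns (weights, gains)."""
    rng = random.Random(seed)
    true_w = [2.0, -1.0, 0.5]  # assumed toy target, not from the report
    w = [0.0] * 3
    gain = [0.05] * 3
    h = [0.0] * 3
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        target = sum(t * xi for t, xi in zip(true_w, x))
        linearized_gain_step(w, gain, h, x, target)
    return w, gain
```

Calling `train_demo()` returns the learned weights and the final per-weight gains. In this sketch the clipped linear factor avoids one call to an exponential function per weight per update, which is the kind of saving the abstract attributes to linearizing the exponentiation.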

BibTeX Entry

@techreport{olga,
     author = {Nicol N. Schraudolph},
      title = {\href{http://nic.schraudolph.org/pubs/olga.pdf}{
               Online Local Gain Adaptation for Multi-Layer Perceptrons}},
     number = {IDSIA-09-98},
institution = {Istituto Dalle Molle di Studi sull'Intelligenza Artificiale},
    address = {Galleria 2, CH-6928 Manno, Switzerland},
       year =  1998,
   b2h_type = {Other},
  b2h_topic = {>Stochastic Meta-Descent},
   abstract = {
    We introduce a new method for adapting the step size of each individual
    weight in a multi-layer perceptron trained by stochastic gradient descent.
    Our technique derives from the K1 algorithm for linear systems
    (Sutton, 1992), which in turn is based on a diagonalized Kalman Filter.
    We expand upon Sutton's work in two regards: K1 is a) extended
    to nonlinear systems, and b) made more efficient by linearizing
    an exponentiation operation.  The resulting ELK1 (extended,
    linearized K1) algorithm is computationally little more expensive
    than alternative proposals (Zimmermann, 1994; Almeida et al., 1999),
    and does not require an arbitrary smoothing parameter.  On a first
    benchmark problem ELK1 clearly outperforms these alternatives, as well
    as stochastic gradient descent with momentum, even when the number of
    floating-point operations required per weight update is taken into
    account.  Unlike the method of Almeida et al., ELK1 does not require
    statistical independence between successive training patterns, and
    handles large initial learning rates well.
}}