Reinforcement Learning with Self-Modifying Policies
J. Schmidhuber, J. Zhao, and N. N. Schraudolph. Reinforcement Learning with Self-Modifying Policies. In S. Thrun and L. Pratt, editors, Learning to Learn, pp. 293–309, Kluwer Academic Publishers, Norwell, MA, 1998.
Download
PDF: http://nic.schraudolph.org/pubs/SchZhaSch98.pdf
Abstract
A learner's modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves etc. They are of interest in situations where the initial learning algorithm itself can be improved by experience---this is what we call "learning to learn". How can we force some (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner's life-time, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call---this evaluates the long-term effects of SMP-modifications setting the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they build the basis for additional SMP-modifications. Solely by self-modifications our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far bigger than most reported in the POE literature.
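To make the backtracking step concrete, the following Python sketch shows one plausible way the success-story check could be implemented. It is not code from the paper: the names (Checkpoint, ssa_call) and the exact form of the criterion (each surviving self-modification must have been followed by a strictly higher reward rate than the one before it, and than the lifetime average) are illustrative assumptions based on the description above.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Checkpoint:
        # A surviving self-modification: when it occurred, the cumulative
        # reward at that time, and a thunk restoring the pre-modification policy.
        time: float
        reward: float
        undo: Callable[[], None]

    def rate(since: Checkpoint, now: float, total_reward: float) -> float:
        # Reward per time unit accumulated since the given checkpoint.
        return (total_reward - since.reward) / max(now - since.time, 1e-12)

    def criterion_holds(history: List[Checkpoint], now: float, total_reward: float) -> bool:
        # Assumed success-story criterion: each later modification must have
        # accelerated reward intake, i.e.
        # lifetime average < rate since 1st checkpoint < rate since 2nd < ...
        prev = total_reward / max(now, 1e-12)
        for cp in history:
            r = rate(cp, now, total_reward)
            if r <= prev:
                return False
            prev = r
        return True

    def ssa_call(history: List[Checkpoint], now: float, total_reward: float) -> None:
        # Backtrack: undo the most recent modifications until every surviving
        # one belongs to a "success story" of accelerating lifelong reward.
        while history and not criterion_holds(history, now, total_reward):
            history.pop().undo()

In this reading, the checkpoints that remain on the stack after each ssa_call are exactly the "lifelong success history" mentioned above, and they form the basis on which further self-modifications are attempted.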
BibTeX Entry
@incollection{SchZhaSch98,
  author    = {J\"urgen Schmid\-huber and Jieyu Zhao and Nicol N. Schraudolph},
  title     = {\href{http://nic.schraudolph.org/pubs/SchZhaSch98.pdf}{Reinforcement Learning with Self-Modifying Policies}},
  pages     = {293--309},
  editor    = {Sebastian Thrun and Lorien Pratt},
  booktitle = {Learning to Learn},
  publisher = {Kluwer Academic Publishers},
  address   = {Norwell, MA},
  year      = 1998,
  b2h_type  = {Book Chapters},
  b2h_topic = {Reinforcement Learning},
  abstract  = {A learner's modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves etc. They are of interest in situations where the initial learning algorithm itself can be improved by experience\,---\,this is what we call "learning to learn". How can we force some (stochastic) SMP to trigger better and better self-modifications? The {\em success-story algorithm}\/ (SSA) addresses this question in a lifelong reinforcement learning context. During the learner's life-time, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call\,---\,this evaluates the long-term effects of SMP-modifications setting the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they build the basis for additional SMP-modifications. Solely by self-modifications our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far bigger than most reported in the POE literature.}
}