The mathematics behind TD
The temporal difference (TD) model (Sutton
& Barto, 1990) is an extension of the ideas underlying the RW
model (Rescorla & Wagner, 1972). Most
notably, the TD model abandons the construct of a “trial”, favoring
instead time-based formulations. Also notable is the introduction of
eligibility traces, which allow the model to bridge temporal gaps and
address the credit assignment problem.
Implementation note: As of calmr version 0.6.2, stimulus
representation in TD is based on complete serial compounds (i.e.,
time-specific stimulus elements entirely discriminable from each
other), and the eligibility traces are of the replacing type.
General note: There are several descriptions of the TD model out
there; however, all of the ones I found were opaque when it comes to
implementation. Hence, the following description of the model focuses
on implementation details.
1 - Maintaining stimulus representations
TD maintains stimulus traces as eligibility traces. The eligibility
of stimulus $i$ at time $t$, $e_i^t$, is given by:

$$e_i^t = e_i^{t-1}\sigma\gamma + x_i^t$$

where $\sigma$ and $\gamma$ are decay and discount parameters,
respectively, and $x_i^t$ is the activation of stimulus $i$ at time
$t$ (1 or 0 for present and absent stimuli, respectively).
Internally, $e_i$ is represented as a vector of length $d$, where
$d$ is the number of stimulus compounds (not in the general sense of
the word compound, but in terms of complete serial compounds, or CSC).
For example, a 2s stimulus in a model with a time resolution of 0.5s
will have $d = 4$, and the second entry in that vector represents the
eligibility of the compound active after the stimulus has been present
for 1s.
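To make the bookkeeping concrete, below is a minimal sketch of the
trace update for a single 2s stimulus at a 0.5s resolution. The Python
code and the parameter values are illustrative; only the update
equation itself comes from the model description above.

```python
import numpy as np

sigma, gamma = 0.9, 0.95  # illustrative decay and discount parameters
d = 4                     # CSC elements for a 2 s stimulus at 0.5 s resolution
e = np.zeros(d)           # eligibility trace vector for the stimulus

# On each time step, exactly one CSC element of the stimulus is active.
for step in range(d):
    x = np.zeros(d)
    x[step] = 1.0
    e = e * sigma * gamma + x  # e_i^t = e_i^(t-1) * sigma * gamma + x_i^t
    print(f"t = {step * 0.5:.1f}s, e = {np.round(e, 3)}")
```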
Similarly, $x_i^t$ denotes the specific compound of stimulus $i$ at
time $t$, not the general activation of $i$ at that time. For
example, suppose two 2s stimuli, $A$ and $B$, are presented with an
overlap of 1s, with $A$’s onset occurring first. Can you guess what
stimulus compounds will be active at $t = 2$ with a time resolution of
0.5s? (The sketch below works through this example.)
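For a concrete answer, here is a sketch under two illustrative
assumptions: a stimulus is active on the interval [onset, onset +
duration), and its k-th CSC element is active on its k-th time step of
presentation.

```python
resolution = 0.5
duration = 2.0                   # both stimuli last 2 s
onsets = {"A": 0.0, "B": 1.0}    # B's onset 1 s after A's (1 s overlap)
d = int(duration / resolution)   # 4 CSC elements per stimulus

def active_elements(t):
    """Return the (stimulus, element) pairs active at time t."""
    active = []
    for stim, onset in onsets.items():
        step = int(round((t - onset) / resolution))
        if 0 <= step < d:
            active.append((stim, step + 1))  # elements are 1-indexed
    return active

print(active_elements(2.0))  # [('B', 3)]: only B's third element
```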
2 - Generating expectations
The TD model generates stimulus expectations based on the presented
stimuli, not on the strength of eligibility traces. The
expectation of stimulus $j$ at time $t$, $V_j^t$, is given by:

$$V_j^t = \mathbf{w}_j^{t\prime}\mathbf{x}^t = \sum_i^K w_{i,j}^t x_i^t$$

where $\mathbf{w}_j^t$ is the column of the weight matrix at time $t$
pointing towards $j$, $\prime$ denotes transposition, and $w_{i,j}$
denotes an entry in that square matrix: the association from $i$ to
$j$. As with the eligibility traces above, the entries in the matrix
are the weights of specific stimulus compounds.
Internally, $\mathbf{w}^t$ is constructed on a trial-by-trial,
step-by-step basis, depending on the stimulus compounds active at the
time.
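As a sketch of this computation, assume two stimuli with four CSC
elements each, so the weights form an 8 × 8 matrix over all elements
(the weight values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                            # total CSC elements (2 stimuli x 4)
w = rng.uniform(0, 0.1, (K, K))  # w[i, j]: association from i to j

x = np.zeros(K)
x[0] = 1.0                       # only A's first element is present

V = w.T @ x                      # V_j = sum_i w[i, j] * x[i]
print(np.round(V, 3))            # expectation of every element j
```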
3 - Learning associations
As its name suggests, the TD model updates associations based on a
temporally discounted prediction of upcoming stimuli. This temporal
difference error term is given by:

$$\delta_j^t = \lambda_j^t + \gamma V_j^t - V_j^{t-1}$$

where $\lambda_j^t$ is the value of stimulus $j$ at time $t$, which
also determines the asymptote for stimulus weights towards $j$.
The temporal difference error term is used to update $\mathbf{w}$ via:

$$w_{i,j}^t = w_{i,j}^t + \alpha_i \beta(x_j^t) \delta_j^t e_i^t$$

where $\alpha_i$ is a learning rate parameter for stimulus $i$, and
$\beta(x_j)$ is a function that returns one of two learning rate
parameters ($\beta_{on}$ or $\beta_{off}$), depending on whether $j$
is being presented or not at time $t$.
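Below is a sketch of a single weight update under the equations above.
The parameter values, the weight state, and the stimulus patterns are
all illustrative, not calmr defaults:

```python
import numpy as np

K, gamma = 8, 0.95
alpha = np.full(K, 0.1)        # per-element learning rates
beta_on, beta_off = 0.2, 0.05  # rates for present vs. absent targets
w = np.zeros((K, K))           # element-to-element weights
e = np.full(K, 0.5)            # example eligibility traces

x_prev = np.zeros(K)
x_prev[0] = 1.0                # A's first element active at t - 1
x_now = np.zeros(K)
x_now[4] = 1.0                 # B's first element active at t
lam = x_now.copy()             # lambda_j: value of present stimuli

V_now = w.T @ x_now            # V_j^t
V_prev = w.T @ x_prev          # V_j^(t-1)
delta = lam + gamma * V_now - V_prev

beta = np.where(x_now > 0, beta_on, beta_off)
# w[i, j] += alpha_i * beta(x_j) * delta_j * e_i, as an outer product
w += np.outer(alpha * e, beta * delta)
print(np.round(w[:, 4], 4))    # weights toward B's first element grew
```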
4 - Generating responses
As with many associative learning models, the transformation between
stimulus expectations and responding is left unspecified, in the hands
of the user. The TD model does not return a response vector, but it
suffices to assume that responding is the identity function on the
expected stimulus values:

$$r_j^t = V_j^t$$
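To close, here is a minimal end-to-end sketch combining the trace,
expectation, and learning equations above: repeated pairings of a 2s
stimulus A with a US at A’s offset, at a 0.5s resolution. The
parameter values, the trial structure, the restriction of nonzero
$\lambda$ to the US, and the zeroing of self-associations are all
simplifying assumptions for illustration, not calmr’s behavior:

```python
import numpy as np

gamma, sigma = 0.95, 0.9     # discount and trace decay
dA = 4                       # A: four CSC elements (2 s at 0.5 s)
K = dA + 1                   # plus a single element for the US
alpha = np.full(K, 0.3)      # per-element learning rates
beta_on, beta_off = 0.3, 0.2
w = np.zeros((K, K))         # element-to-element associations

for trial in range(100):
    e = np.zeros(K)          # traces reset between trials here
    V_prev = np.zeros(K)
    for step in range(6):    # 4 steps of A, 1 US step, 1 empty step
        x = np.zeros(K)
        if step < dA:
            x[step] = 1.0    # A's element for this step
        if step == dA:
            x[dA] = 1.0      # the US arrives at A's offset
        e = e * sigma * gamma + x
        V_now = w.T @ x
        lam = np.zeros(K)
        lam[dA] = x[dA]      # only the US has a nonzero lambda here
        delta = lam + gamma * V_now - V_prev
        beta = np.where(x > 0, beta_on, beta_off)
        w += np.outer(alpha * e, beta * delta)
        np.fill_diagonal(w, 0.0)  # assume no self-associations
        V_prev = V_now

# Responding as the identity on the expected US value at A's onset:
x = np.zeros(K)
x[0] = 1.0
print(round(float((w.T @ x)[dA]), 3))
```

Under these assumptions, the expected US value at A’s onset grows
across trials, the TD analogue of acquisition.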
References
Rescorla, R. A., & Wagner, A. R. (1972). A theory of
Pavlovian conditioning: Variations in the
effectiveness of reinforcement and nonreinforcement. In A. H. Black
& W. F. Prokasy (Eds.), Classical conditioning II:
Current research and theory (pp. 64–99).
Appleton-Century-Crofts.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of
Pavlovian reinforcement. In M. Gabriel & J. W. Moore
(Eds.), Learning and computational neuroscience (pp. 497–537).
MIT Press.