# 11.1 Learning in Rate Models

We would like to understand how activity-dependent learning rules influence the formation of connections between neurons in the brain. We will see that plasticity is controlled by the statistical properties of the presynaptic input that is impinging on the postsynaptic neuron. Before we delve into the analysis of the elementary Hebb rule we therefore need to recapitulate a few results from statistics and linear algebra.

## 11.1.1 Correlation Matrix and Principal Components

A principal component analysis (PCA) is a standard technique to describe statistical properties of a set of high-dimensional data points and is usually performed in order to find those components of the data that show the highest variability within the set. If we think of the input data set as a cloud of points in a high-dimensional vector space centered around the origin, then the first principal component is the direction of the longest axis of the ellipsoid that encompasses the cloud; cf. Fig. 11.1. If the data points consisted of, say, two separate clouds then the first principal component would give the direction of a line that connects the center points of the two clouds. A PCA can thus be used to break a large data set into separate clusters. In the following, we will quickly explain the basic idea and show that the first principal component gives the direction where the variance of the data is maximal.

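Let us consider an ensemble of data points $\{\boldsymbol{\xi}^1,\dots,\boldsymbol{\xi}^p\}$ drawn from a (high-dimensional) vector space, for example $\boldsymbol{\xi}^\mu \in \mathbb{R}^N$. For this set of data points we define the correlation matrix $C_{ij}$ as

$$C_{ij} = \frac{1}{p}\sum_{\mu=1}^{p} \xi_i^\mu\,\xi_j^\mu = \langle \xi_i^\mu\,\xi_j^\mu \rangle_\mu \,. \tag{11.1}$$

Angular brackets $\langle\cdot\rangle_\mu$ denote an average over the whole set of data points. Similar to the variance of a single random variable we can also define the covariance matrix $V_{ij}$ of our data set,

$$V_{ij} = \big\langle (\xi_i^\mu - \langle\xi_i^\mu\rangle_\mu)\,(\xi_j^\mu - \langle\xi_j^\mu\rangle_\mu) \big\rangle_\mu \,. \tag{11.2}$$

In the following we will assume that the coordinate system is chosen so that the center of mass of the set of data points is located at the origin, i.e., $\langle\xi_i^\mu\rangle_\mu = \langle\xi_j^\mu\rangle_\mu = 0$. In this case, correlation matrix and covariance matrix are identical.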

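The principal components of the set $\{\boldsymbol{\xi}^1,\dots,\boldsymbol{\xi}^p\}$ are defined as the eigenvectors of the covariance matrix $V$. Note that $V$ is symmetric, i.e., $V_{ij} = V_{ji}$. The eigenvalues of $V$ are thus real-valued and different eigenvectors are orthogonal (Horn and Johnson, 1985). Furthermore, $V$ is positive semi-definite since

$$\mathbf{y}^{\rm T}\,V\,\mathbf{y} = \sum_{ij} y_i\,\langle \xi_i^\mu\,\xi_j^\mu\rangle_\mu\,y_j = \Big\langle \Big[\sum_i y_i\,\xi_i^\mu\Big]^2 \Big\rangle_\mu \ge 0 \tag{11.3}$$

for any vector $\mathbf{y} \in \mathbb{R}^N$. Therefore, all eigenvalues of $V$ are non-negative.

We can sort the eigenvectors $\mathbf{e}_i$ according to the size of the corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$. The eigenvector with the largest eigenvalue is called the first principal component. It points in the direction where the variance of the data is maximal. To see this we calculate the variance of the projection of $\boldsymbol{\xi}^\mu$ onto an arbitrary direction $\mathbf{y}$ that we write as $\mathbf{y} = \sum_i a_i\,\mathbf{e}_i$ with $\sum_i a_i^2 = 1$ so that $\|\mathbf{y}\| = 1$. The variance $\sigma_y^2$ along $\mathbf{y}$ is

$$\sigma_y^2 = \big\langle [\boldsymbol{\xi}^\mu \cdot \mathbf{y}]^2 \big\rangle_\mu = \mathbf{y}^{\rm T}\,V\,\mathbf{y} = \sum_i \lambda_i\,a_i^2 \,. \tag{11.4}$$

The right-hand side is maximal under the constraint $\sum_i a_i^2 = 1$ if $a_1 = 1$ and $a_i = 0$ for $i = 2, 3,\dots, N$, that is, if $\mathbf{y} = \mathbf{e}_1$.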
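The statement that the variance is maximal along the first principal component can be checked numerically. The following is a minimal sketch in plain Python, assuming a synthetic two-dimensional data cloud; the helper names (`covariance`, `first_principal_component`, `variance_along`) and all parameter values are illustrative choices, and power iteration is used here only as one simple way to obtain the leading eigenvector.

```python
import math
import random

def covariance(points):
    """Covariance matrix V_ij = <xi_i xi_j> of zero-mean data points."""
    p, n = len(points), len(points[0])
    return [[sum(x[i] * x[j] for x in points) / p for j in range(n)]
            for i in range(n)]

def first_principal_component(V, steps=500):
    """Power iteration: apply V repeatedly and renormalize to unit length."""
    n = len(V)
    y = [1.0 / math.sqrt(n)] * n
    for _ in range(steps):
        y = [sum(V[i][j] * y[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in y))
        y = [c / norm for c in y]
    return y

def variance_along(points, y):
    """Variance of the projections xi . y for a unit vector y, cf. Eq. (11.4)."""
    return sum(sum(a * b for a, b in zip(x, y)) ** 2 for x in points) / len(points)

random.seed(0)
# Elongated cloud: large spread along (1, 1), small spread along (1, -1).
raw = []
for _ in range(2000):
    s, t = random.gauss(0.0, 3.0), random.gauss(0.0, 0.5)
    raw.append([(s + t) / math.sqrt(2), (s - t) / math.sqrt(2)])

# Shift the center of mass to the origin so that C and V coincide.
mean = [sum(x[i] for x in raw) / len(raw) for i in range(2)]
data = [[x[i] - mean[i] for i in range(2)] for x in raw]

V = covariance(data)
e1 = first_principal_component(V)
var_e1 = variance_along(data, e1)

# Variance along 180 other unit directions, one per degree.
angles = [math.pi * k / 180 for k in range(180)]
var_others = [variance_along(data, [math.cos(a), math.sin(a)]) for a in angles]
print(e1, var_e1, max(var_others))
```

For this cloud the estimated direction `e1` should lie close to the diagonal $(1,1)/\sqrt{2}$, and no other tested direction should carry more variance than `var_e1`.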

## 11.1.2 Evolution of synaptic weights

In the following we analyze the evolution of synaptic weights using the Hebbian learning rules that have been described in Chapter 10. To do so, we consider a highly simplified scenario consisting of an analog neuron that receives input from $N$ presynaptic neurons with firing rates $\nu_i^{\rm pre}$ via synapses with weights $w_i$; cf. Fig. 11.2A. We think of the presynaptic neurons as 'input neurons', which, however, do not have to be sensory neurons. The input layer could, for example, consist of neurons in the lateral geniculate nucleus (LGN) that project to neurons in the visual cortex. We will see that the statistical properties of the input control the evolution of synaptic weights.

For the sake of simplicity, we model the presynaptic input as a set of static patterns. Let us suppose that we have a total of $p$ patterns $\{\boldsymbol{\xi}^\mu;\, 1 \le \mu \le p\}$. At each time step one of the patterns $\boldsymbol{\xi}^\mu$ is selected at random and presented to the network by fixing the presynaptic rates at $\nu_i^{\rm pre} = \xi_i^\mu$. We call this the static-pattern scenario. The presynaptic activity drives the postsynaptic neuron and the joint activity of pre- and postsynaptic neurons triggers changes of the synaptic weights. The synaptic weights are modified according to a Hebbian learning rule, i.e., according to the correlation of pre- and postsynaptic activity; cf. Eq. (10.3). Before the next input pattern is chosen, the weights are changed by an amount

$$\Delta w_i = \gamma\,\nu^{\rm post}\,\nu_i^{\rm pre} \,. \tag{11.5}$$

Here, $0 < \gamma \ll 1$ is a small constant called 'learning rate'. The learning rate in the static-pattern scenario is closely linked to the correlation coefficient $c_2^{\rm corr}$ in the continuous-time Hebb rule introduced in Eq. (10.3). In order to highlight the relation, let us assume that each pattern $\boldsymbol{\xi}^\mu$ is applied during an interval $\Delta t$. For $\Delta t$ sufficiently small, we have $\gamma = c_2^{\rm corr}\,\Delta t$.

In a general rate model, the firing rate $\nu^{\rm post}$ of the postsynaptic neuron is given by a nonlinear function of the total input

$$\nu^{\rm post} = g\Big(\sum_i w_i\,\nu_i^{\rm pre}\Big)\,; \tag{11.6}$$

cf. Fig. 11.2B. For the sake of simplicity, we restrict our discussion in the following to a linear rate model with

$$\nu^{\rm post} = \sum_i w_i\,\nu_i^{\rm pre} \,. \tag{11.7}$$

Obviously, this is a highly simplified neuron model, but it will serve our purpose of gaining some insight into the evolution of synaptic weights.

If we combine the learning rule (11.5) with the linear rate model of Eq. (11.7) we find after the presentation of pattern $\boldsymbol{\xi}^\mu$

$$\Delta w_i = \gamma \sum_j w_j\,\nu_j^{\rm pre}\,\nu_i^{\rm pre} = \gamma \sum_j w_j\,\xi_j^\mu\,\xi_i^\mu \,. \tag{11.8}$$

The evolution of the weight vector $\mathbf{w} = (w_1,\dots, w_N)$ is thus determined by the iteration

$$w_i(n + 1) = w_i(n) + \gamma \sum_j w_j(n)\,\xi_j^{\mu_n}\,\xi_i^{\mu_n} \,, \tag{11.9}$$

where $\mu_n$ denotes the pattern that is presented during the $n$th time step.
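The iteration of Eq. (11.9) is easy to simulate directly. The sketch below assumes a toy set of four zero-mean patterns in two dimensions; the function name `hebb_step`, the pattern values, the initial weights and the learning rate are illustrative choices and not taken from the text.

```python
import random

def hebb_step(w, xi, gamma):
    """One update of Eq. (11.9): w_i += gamma * (sum_j w_j xi_j) * xi_i."""
    nu_post = sum(wj * xj for wj, xj in zip(w, xi))   # linear neuron, Eq. (11.7)
    return [wi + gamma * nu_post * xi_i for wi, xi_i in zip(w, xi)]

random.seed(1)
# Toy pattern set with zero mean (assumed for illustration).
patterns = [[1.0, 0.8], [-1.0, -0.8], [0.2, -0.3], [-0.2, 0.3]]
w = [0.5, -0.5]
gamma = 0.01

norms = []
for n in range(2000):
    xi = random.choice(patterns)          # pattern mu_n, selected at random
    w = hebb_step(w, xi, gamma)
    norms.append((w[0] ** 2 + w[1] ** 2) ** 0.5)

print(w, norms[0], norms[-1])
```

Each Hebbian step adds $\gamma\,(\nu^{\rm post})^2 \ge 0$ to $\mathbf{w}\cdot\mathbf{w}$ at first order, so the norm recorded in `norms` never shrinks, while the ratio `w[1] / w[0]` settles near the value set by the leading eigenvector of the pattern correlation matrix (about 0.81 for these patterns).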

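We are interested in the long-term behavior of the synaptic weights. To this end we assume that the weight vector evolves along a more or less deterministic trajectory with only small stochastic deviations that result from the randomness with which new input patterns are chosen. This is, for example, the case if the learning rate is small so that a large number of patterns has to be presented in order to induce a substantial weight change. In such a situation it is sensible to consider the expectation value of the weight vector, i.e., the weight vector $\langle\mathbf{w}(n)\rangle$ averaged over the sequence $(\boldsymbol{\xi}^{\mu_1},\boldsymbol{\xi}^{\mu_2},\dots,\boldsymbol{\xi}^{\mu_n})$ of all patterns that so far have been presented to the network. From Eq. (11.9) we find

$$\begin{aligned}
\langle w_i(n + 1)\rangle &= \langle w_i(n)\rangle + \gamma \sum_j \big\langle w_j(n)\,\xi_j^{\mu_n}\,\xi_i^{\mu_n}\big\rangle \\
&= \langle w_i(n)\rangle + \gamma \sum_j \langle w_j(n)\rangle\,\big\langle \xi_j^{\mu_n}\,\xi_i^{\mu_n}\big\rangle \\
&= \langle w_i(n)\rangle + \gamma \sum_j C_{ij}\,\langle w_j(n)\rangle \,.
\end{aligned} \tag{11.10}$$

The angular brackets denote an ensemble average over the whole sequence of input patterns $(\boldsymbol{\xi}^{\mu_1},\boldsymbol{\xi}^{\mu_2},\dots)$. The second equality is due to the fact that input patterns are chosen independently in each time step, so that the average over $w_j(n)$ and $(\xi_j^{\mu_n}\,\xi_i^{\mu_n})$ can be factorized. In the final expression we have introduced the correlation matrix $C_{ij}$,

$$C_{ij} = \frac{1}{p}\sum_{\mu=1}^{p} \xi_i^\mu\,\xi_j^\mu = \langle \xi_i^\mu\,\xi_j^\mu\rangle_\mu \,. \tag{11.11}$$

Expression (11.10) can be written in a more compact form using matrix notation,

$$\langle\mathbf{w}(n + 1)\rangle = (\mathbb{1} + \gamma\,C)\,\langle\mathbf{w}(n)\rangle = (\mathbb{1} + \gamma\,C)^{n+1}\,\langle\mathbf{w}(0)\rangle \,, \tag{11.12}$$

where $\mathbf{w}(n) = \big(w_1(n),\dots, w_N(n)\big)$ is the weight vector and $\mathbb{1}$ is the identity matrix.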

If we express the weight vector in terms of the eigenvectors $\mathbf{e}_k$ of $C$,

$$\langle\mathbf{w}(n)\rangle = \sum_k a_k(n)\,\mathbf{e}_k \,, \tag{11.13}$$

we obtain an explicit expression for $\langle\mathbf{w}(n)\rangle$ for any given initial condition $a_k(0)$, viz.,

$$\langle\mathbf{w}(n)\rangle = \sum_k (1 + \gamma\,\lambda_k)^n\,a_k(0)\,\mathbf{e}_k \,. \tag{11.14}$$

Since the correlation matrix is positive semi-definite all eigenvalues $\lambda_k$ are real and non-negative. Therefore, the weight vector grows exponentially along every direction with $\lambda_k > 0$, but the growth will soon be dominated by the eigenvector with the largest eigenvalue, i.e., the first principal component,

$$\langle\mathbf{w}(n)\rangle \xrightarrow{\;n\to\infty\;} (1 + \gamma\,\lambda_1)^n\,a_1(0)\,\mathbf{e}_1 \,; \tag{11.15}$$

cf. Section 11.1.1. Recall that the output of the linear neuron model (11.7) is proportional to the projection of the current input pattern $\boldsymbol{\xi}^\mu$ on the direction $\mathbf{w}$. For $\mathbf{w} \propto \mathbf{e}_1$, the output is therefore proportional to the projection on the first principal component of the input distribution. A Hebbian learning rule such as Eq. (11.8) is thus able to extract the first principal component of the input data.
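The averaged dynamics can be compared with its eigenvector expansion in a few lines. This is a sketch under stated assumptions: the $2\times 2$ correlation matrix `C` is a made-up example whose eigenvalues (3 and 1) and eigenvectors are known analytically, and `gamma`, `n_steps` and `w0` are arbitrary illustrative values.

```python
import math

C = [[2.0, 1.0], [1.0, 2.0]]     # toy correlation matrix (assumed)
eigvals = [3.0, 1.0]             # eigenvalues of this C, known analytically
s = 1.0 / math.sqrt(2.0)
eigvecs = [[s, s], [s, -s]]      # e_1 and e_2
gamma, n_steps = 0.05, 100
w0 = [1.0, 0.0]

# Iterate Eq. (11.12): <w(n+1)> = (1 + gamma C) <w(n)>.
w = list(w0)
for _ in range(n_steps):
    w = [w[i] + gamma * sum(C[i][j] * w[j] for j in range(2)) for i in range(2)]

# Closed form, Eq. (11.14): sum_k (1 + gamma lambda_k)^n a_k(0) e_k.
a0 = [sum(e[i] * w0[i] for i in range(2)) for e in eigvecs]
w_closed = [sum((1 + gamma * lam) ** n_steps * a * e[i]
                for lam, a, e in zip(eigvals, a0, eigvecs))
            for i in range(2)]

# Cosine of the angle between <w(n)> and the first principal component e_1.
overlap = (w[0] * s + w[1] * s) / math.hypot(w[0], w[1])
print(w, w_closed, overlap)
```

Both routes give the same weight vector up to rounding. After 100 steps the component along $\mathbf{e}_1$ has grown by $(1.15)^{100}$ and the one along $\mathbf{e}_2$ only by $(1.05)^{100}$, so `overlap` is indistinguishable from 1 while the norm keeps growing.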

From a data-processing point of view, the extraction of the first principal component of the input data set by a biologically inspired learning rule seems to be very compelling. There are, however, a few drawbacks and pitfalls. First, the above statement about the Hebbian learning rule is limited to the expectation value of the weight vector. We will see below that, if the learning rate is sufficiently low, then the actual weight vector is in fact very close to the expected one.

Second, while the direction of the weight vector moves in the direction of the principal component, the norm of the weight vector grows without bounds. We will see below in Section 11.1.3 that suitable variants of Hebbian learning allow us to control the length of the weight vector without changing its direction.

Third, principal components are only meaningful if the input data is normalized, i.e., distributed around the origin. This requirement is not consistent with a rate interpretation because rates are usually positive. This problem, however, can be overcome by learning rules such as the covariance rule of Eq. (10.10) that are based on the deviation of the rates from a certain mean firing rate. We will see in Section 11.2.4 that a spike-based learning rule can be devised that is sensitive only to deviations from the mean firing rate and can thus find the first principal component even if the input is not properly normalized.

### 11.1.2.1 Self-averaging (*)

So far, we have derived the behavior of the expected weight vector, $\langle\mathbf{w}\rangle$. Here we show that explicit averaging is not necessary provided that learning is slow enough. In this case, the weight vector is the sum of a large number of small changes. The weight dynamics is thus 'self-averaging' and the weight vector $\mathbf{w}$ can be well approximated by its expectation value $\langle\mathbf{w}\rangle$.

We start from the formulation of Hebbian plasticity in continuous time,

$$\frac{\mathrm{d}}{\mathrm{d}t}\,w_i = c_2^{\rm corr}\,\nu^{\rm post}\,\nu_i^{\rm pre}\,; \tag{11.16}$$

cf. Eq. (10.3). Each pattern $\boldsymbol{\xi}^\mu$ is presented for a short period of duration $\Delta t$. We assume that the weights change during the presentation by a small amount only, i.e., $\int_t^{t+\Delta t} [\mathrm{d}w_j(t')/\mathrm{d}t']\,\mathrm{d}t' \ll w_j(t)$. This condition can be met either by a short presentation time $\Delta t$ or by a small learning coefficient $c_2^{\rm corr}$. Under this condition, we can take the postsynaptic firing rate $\nu^{\rm post}(t) = \sum_j w_j(t)\,\nu_j^{\rm pre}(t)$ as constant for the duration of one presentation. The total weight change induced by the presentation of pattern $\boldsymbol{\xi}^\mu$ to first order in $\Delta t$ is thus

$$\Delta w_i(t) = w_i(t + \Delta t) - w_i(t) = \gamma \sum_j w_j(t)\,\xi_j^\mu\,\xi_i^\mu + \mathcal{O}(\Delta t^2) \,, \tag{11.17}$$

with $\gamma = c_2^{\rm corr}\,\Delta t$; cf. Eq. (11.8).

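In the next time step a new pattern $\boldsymbol{\xi}^{\mu'}$ is presented so that the weight is changed to

$$w_i(t + 2\Delta t) = w_i(t + \Delta t) + c_2^{\rm corr}\,\Delta t \sum_j w_j(t + \Delta t)\,\xi_j^{\mu'}\,\xi_i^{\mu'} + \mathcal{O}(\Delta t^2) \,. \tag{11.18}$$

Since we keep only terms to first order in $\Delta t$, we may set $w_j(t + \Delta t) = w_j(t)$ in the sum on the right-hand side of Eq. (11.18). Let us suppose that in the interval $[t, t + p\,\Delta t]$ each of the $p$ patterns has been applied exactly once. Then, to first order in $\Delta t$,

$$w_i(t + p\,\Delta t) - w_i(t) = c_2^{\rm corr}\,\Delta t \sum_j w_j(t) \sum_{\mu=1}^{p} \xi_j^\mu\,\xi_i^\mu + \mathcal{O}(\Delta t^2) \,. \tag{11.19}$$

For $c_2^{\rm corr}\,\Delta t \ll 1$, all higher-order terms can be neglected. Division by $p\,\Delta t$ yields

$$\frac{w_i(t + p\,\Delta t) - w_i(t)}{p\,\Delta t} = c_2^{\rm corr} \sum_j w_j(t)\,C_{ij} \,. \tag{11.20}$$

The left-hand side can be approximated by a differential operator $\mathrm{d}w/\mathrm{d}t$,

$$\frac{\mathrm{d}}{\mathrm{d}t}\,w_i(t) = c_2^{\rm corr} \sum_j w_j(t)\,C_{ij} \,. \tag{11.21}$$

We thus recover our previous result that weights are driven by the correlations in the input but with the additional advantage that no explicit averaging step is necessary (Sanders and Verhulst, 1985).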
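The self-averaging argument can be illustrated by presenting the patterns one after the other for a short time $\Delta t$ each and comparing the outcome with the solution of the averaged equation (11.21). The sketch below assumes a toy pattern set whose correlation matrix has eigenvalues 1 and 0.25 along $(1,1)$ and $(1,-1)$; the pattern values, `c_corr`, `T` and `w0` are illustrative assumptions, and the patterns are presented cyclically so that each is applied equally often.

```python
import math

# Toy zero-mean pattern set (assumed); its correlation matrix has
# eigenvalue 1 along (1, 1) and eigenvalue 0.25 along (1, -1).
patterns = [[1.0, 1.0], [-1.0, -1.0], [0.5, -0.5], [-0.5, 0.5]]
c_corr = 1.0
T = 2.0
w0 = [1.0, 0.0]

def present_patterns(dt):
    """Apply Eq. (11.17) pattern by pattern, cycling through all p patterns."""
    w = list(w0)
    for n in range(round(T / dt)):
        xi = patterns[n % len(patterns)]
        nu_post = w[0] * xi[0] + w[1] * xi[1]
        w = [w[i] + c_corr * dt * nu_post * xi[i] for i in range(2)]
    return w

def averaged_solution():
    """Exact solution of Eq. (11.21) at time T for this correlation matrix."""
    a_sum, a_diff = (w0[0] + w0[1]) / 2, (w0[0] - w0[1]) / 2
    g1, g2 = math.exp(c_corr * 1.0 * T), math.exp(c_corr * 0.25 * T)
    return [a_sum * g1 + a_diff * g2, a_sum * g1 - a_diff * g2]

def relative_error(dt):
    w, ref = present_patterns(dt), averaged_solution()
    return math.dist(w, ref) / math.hypot(*ref)

err_coarse, err_fine = relative_error(0.01), relative_error(0.001)
print(err_coarse, err_fine)
```

The deviation from the averaged solution shrinks roughly in proportion to $\Delta t$, which is the sense in which slow learning makes the explicit averaging step unnecessary.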

## 11.1.3 Weight Normalization

We have seen in Section 11.1.2 that the simple learning rule (10.3) leads to exponentially growing weights. Since this is biologically not plausible, we must use a modified Hebbian learning rule that includes weight decrease and saturation; cf. Chapter 10.2. Particularly interesting are learning rules that lead to a normalized weight vector. Normalization is a desirable property since it leads to a competition between synaptic weights $w_{ij}$ that converge on the same postsynaptic neuron $i$. Competition means that if a synaptic efficacy increases, it does so at the expense of other synapses that must decrease.

For a discussion of weight vector normalization two aspects are important, namely what is normalized and how the normalization is achieved. First, learning rules can be designed to normalize either the sum of weights, $\sum_j w_{ij}$, or the quadratic norm, $\|\mathbf{w}\|^2 = \sum_j w_{ij}^2$ (or any other norm on $\mathbb{R}^N$). In the first case, the weight vector is constrained to a plane perpendicular to the diagonal vector $\mathbf{n} = (1,\dots, 1)$; in the second case it is constrained to a hyper-sphere; cf. Fig. 11.4.

Second, the normalization of the weight vector can either be multiplicative or subtractive. In the former case all weights are multiplied by a common factor so that large weights $w_{ij}$ are corrected by a larger amount than smaller ones. In the latter case a common constant is subtracted from each weight. Usually, subtractive normalization is combined with hard bounds $0 \le w_{ij} \le w^{\rm max}$ in order to avoid runaway of individual weights. Finally, learning rules may or may not fall into the class of local learning rules that we have considered in Chapter 10.2.

A systematic classification of various learning rules according to the above three criteria has been proposed by Miller and MacKay (1994). Here we restrict ourselves to two instances of learning with normalization properties which we illustrate in the examples below. We start with the subtractive normalization of the summed weights $\sum_j w_{ij}$ and turn then to a discussion of Oja's rule as an instance of a multiplicative normalization of $\sum_j w_{ij}^2$.

### 11.1.3.1 Example: Subtractive Normalization of $\sum_i w_i$

In a subtractive normalization scheme the sum over all weights, $\sum_i w_i$, can be kept constant by subtracting the average total weight change, $N^{-1}\sum_i \Delta\tilde{w}_i$, from each synapse after the weights have been updated according to a Hebbian learning rule with $\Delta\tilde{w}_i = \gamma \sum_j w_j\,\xi_j^\mu\,\xi_i^\mu$. Altogether, the learning rule is of the form

$$\Delta w_i = \Delta\tilde{w}_i - N^{-1}\sum_j \Delta\tilde{w}_j = \gamma \sum_j w_j\,\xi_j^\mu\,\xi_i^\mu - \gamma\,N^{-1} \sum_k \sum_j w_j\,\xi_j^\mu\,\xi_k^\mu \,, \tag{11.22}$$

where $\Delta\tilde{w}_i$ denotes the weight change that is due to the pure Hebbian learning rule without the normalization. It can easily be verified that $\sum_i \Delta w_i = 0$ so that $\sum_i w_i = {\rm const}$. The temporal evolution of the weight vector $\mathbf{w}$ is thus restricted to a hyperplane perpendicular to $(1,\dots, 1) \in \mathbb{R}^N$. Note that this learning rule is non-local because the change of weight depends on the activity of all presynaptic neurons.

In a similar way as in the previous section, we calculate the expectation of the weight vector $\langle\mathbf{w}(n)\rangle$, averaged over the sequence of input patterns $(\boldsymbol{\xi}^{\mu_1},\boldsymbol{\xi}^{\mu_2},\dots)$,

$$\langle w_i(n + 1)\rangle = \langle w_i(n)\rangle + \gamma \sum_j C_{ij}\,\langle w_j(n)\rangle - \gamma\,N^{-1} \sum_k \sum_j C_{kj}\,\langle w_j(n)\rangle \,, \tag{11.23}$$

or explicitly, using matrix notation

$$\langle\mathbf{w}(n)\rangle = \big[\mathbb{1} + \gamma\,(C - \bar{C})\big]^n\,\langle\mathbf{w}(0)\rangle \,, \tag{11.24}$$

with $\bar{C}_{ij} = N^{-1} \sum_k C_{kj}$. The evolution of the weight vector is thus determined by eigenvectors of the matrix $(C - \bar{C})$ that are in general different from those of the correlation matrix $C$. Hebbian learning with subtractive normalization is driven by the correlations of the input in the subspace orthogonal to the diagonal vector $(1,\dots, 1)$. Though the sum of the weights stays constant, individual weights keep growing. It is thus necessary to adopt an additional criterion to stop the learning process and to prevent some components of the weight vector from growing beyond all bounds. A subtractive weight normalization is usually combined with hard boundaries for the weights; cf. Section . With these constraints, the weight vector converges to a final state where (almost) all weights are saturated at the upper or lower bound (Miller and MacKay, 1994); cf. Fig. 11.5A.
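A single update of Eq. (11.22) can be sketched as follows. The code assumes randomly generated Gaussian patterns and omits the hard bounds mentioned above, so it only illustrates that the summed weight is conserved; `subtractive_step` and all parameter values are hypothetical names and choices made for this illustration.

```python
import random

def subtractive_step(w, xi, gamma):
    """Eq. (11.22): Hebbian change minus its average over all N synapses."""
    nu_post = sum(wj * xj for wj, xj in zip(w, xi))
    dw_hebb = [gamma * nu_post * x for x in xi]        # pure Hebbian change
    mean_dw = sum(dw_hebb) / len(w)                    # N^-1 sum_j dw_j
    return [wi + d - mean_dw for wi, d in zip(w, dw_hebb)]

random.seed(2)
N = 4
# Random input patterns (assumed for illustration).
patterns = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(20)]
w = [0.25] * N
total_before = sum(w)

for _ in range(200):
    w = subtractive_step(w, random.choice(patterns), gamma=0.01)

print(total_before, sum(w), w)
```

The sum of the weights stays at its initial value up to rounding, while individual weights drift apart and can change sign. This is why the scheme needs hard bounds in practice.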

### 11.1.3.2 Example: Multiplicative Normalization of $\|\mathbf{w}\|$

Normalization of the sum of the weights, $\sum_i w_i$, needs an additional criterion to prevent individual weights from perpetual growth. A more elegant way is to require that the sum of the squared weights, i.e., the squared length of the weight vector, $\sum_i w_i^2$, remains constant. This restricts the evolution of the weight vector to a sphere in the $N$-dimensional weight space. In addition, we can employ a multiplicative normalization scheme where all weights are multiplied by a common factor instead of subtracting a common constant. The advantage of multiplicative compared to subtractive normalization is that small weights will not change their sign during the normalization step.

In order to formalize the above idea we first calculate the 'naïve' weight change $\Delta\tilde{\mathbf{w}}(n)$ in time step $n$ according to the common Hebbian learning rule,

$$\Delta\tilde{\mathbf{w}}(n) = \gamma\,\big[\mathbf{w}(n) \cdot \boldsymbol{\xi}^\mu\big]\,\boldsymbol{\xi}^\mu \,. \tag{11.25}$$

The update of the weights is accompanied by a normalization of the norm of the weight vector to unity, i.e.,

$$\mathbf{w}(n + 1) = \frac{\mathbf{w}(n) + \Delta\tilde{\mathbf{w}}(n)}{\|\mathbf{w}(n) + \Delta\tilde{\mathbf{w}}(n)\|} \,. \tag{11.26}$$

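If we assume that the weights are changed only by a very small amount during each step ($\gamma \ll 1$), we can calculate the new weights $\mathbf{w}(n + 1)$ to first order in $\gamma$,

$$\mathbf{w}(n + 1) = \mathbf{w}(n) + \Delta\tilde{\mathbf{w}}(n) - \mathbf{w}(n)\,\big[\mathbf{w}(n) \cdot \Delta\tilde{\mathbf{w}}(n)\big] + \mathcal{O}(\gamma^2) \,. \tag{11.27}$$

The 'effective' weight change $\Delta\mathbf{w}(n)$ including normalization to leading order in $\gamma$ is thus

$$\Delta\mathbf{w}(n) = \Delta\tilde{\mathbf{w}}(n) - \mathbf{w}(n)\,\big[\mathbf{w}(n) \cdot \Delta\tilde{\mathbf{w}}(n)\big] \,, \tag{11.28}$$

which corresponds to the vector component of $\Delta\tilde{\mathbf{w}}$ that is orthogonal to the current weight vector $\mathbf{w}$. This is exactly what we would have expected because the length of the weight vector must stay constant; cf. Fig. 11.4B.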
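The first-order expansion that leads from the normalized update to the effective weight change can be spelled out as a short worked step. It assumes that the current weight vector is already normalized, $\|\mathbf{w}(n)\| = 1$, and that $\Delta\tilde{\mathbf{w}}(n)$ is of order $\gamma$; arguments $(n)$ are dropped for brevity.

$$\begin{aligned}
\|\mathbf{w} + \Delta\tilde{\mathbf{w}}\|^{-1}
 &= \big[\,1 + 2\,\mathbf{w}\cdot\Delta\tilde{\mathbf{w}} + \|\Delta\tilde{\mathbf{w}}\|^2\,\big]^{-1/2}
  = 1 - \mathbf{w}\cdot\Delta\tilde{\mathbf{w}} + \mathcal{O}(\gamma^2) \,, \\
\frac{\mathbf{w} + \Delta\tilde{\mathbf{w}}}{\|\mathbf{w} + \Delta\tilde{\mathbf{w}}\|}
 &= \big(\mathbf{w} + \Delta\tilde{\mathbf{w}}\big)\,\big(1 - \mathbf{w}\cdot\Delta\tilde{\mathbf{w}}\big) + \mathcal{O}(\gamma^2)
  = \mathbf{w} + \Delta\tilde{\mathbf{w}} - \mathbf{w}\,\big[\mathbf{w}\cdot\Delta\tilde{\mathbf{w}}\big] + \mathcal{O}(\gamma^2) \,.
\end{aligned}$$

Taking the scalar product of the resulting weight change with $\mathbf{w}$ gives $\mathbf{w}\cdot\Delta\tilde{\mathbf{w}} - (\mathbf{w}\cdot\mathbf{w})\,(\mathbf{w}\cdot\Delta\tilde{\mathbf{w}}) = 0$, which confirms that the change is orthogonal to $\mathbf{w}$ at this order.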

We may wonder whether Eq. (11.28) is a local learning rule. In order to answer this question, we recall that the 'naïve' weight change $\Delta\tilde{w}_j = \gamma\,\nu^{\rm post}\,\nu_j^{\rm pre}$ uses only pre- and postsynaptic information. Hence, we can rewrite Eq. (11.28) in terms of the firing rates,

$$\Delta w_j = \gamma\,\nu^{\rm post}\,\nu_j^{\rm pre} - \gamma\,w_j(n)\,\big(\nu^{\rm post}\big)^2 \,. \tag{11.29}$$

In the second term on the right-hand side we have made use of the linear neuron model, i.e., $\nu^{\rm post} = \sum_k w_k\,\nu_k^{\rm pre}$. Since the weight change depends only on pre- and postsynaptic rates, Eq. (11.29), which is known as Oja's learning rule (Oja, 1982), is indeed local; cf. Eq. (10.11).

In order to see that Oja's learning rule selects the first principal component we show that the eigenvectors $\{\mathbf{e}_1,\dots,\mathbf{e}_N\}$ of $C$ are fixed points of the dynamics but that only the eigenvector $\mathbf{e}_1$ with the largest eigenvalue is stable. For any fixed weight vector $\mathbf{w}$ we can calculate the expectation of the weight change in the next time step by averaging over the whole ensemble of input patterns $\{\boldsymbol{\xi}^1,\boldsymbol{\xi}^2,\dots\}$. With $\langle\Delta\tilde{\mathbf{w}}(n)\rangle = \gamma\,C\,\mathbf{w}$ we find from Eq. (11.28)

$$\langle\Delta\mathbf{w}\rangle = \gamma\,C\,\mathbf{w} - \gamma\,\mathbf{w}\,\big[\mathbf{w} \cdot C\,\mathbf{w}\big] \,. \tag{11.30}$$

We claim that any normalized eigenvector $\mathbf{e}_i$ of the correlation matrix $C$ is a fixed point of Eq. (11.30). Indeed, if we substitute $\mathbf{w} = \mathbf{e}_i$ in the above equation we find that $\langle\Delta\mathbf{w}\rangle = 0$. In order to investigate the stability of this fixed point we consider a small perturbation $\mathbf{w} = \mathbf{e}_i + c\,\mathbf{e}_j$ in the direction of $\mathbf{e}_j$. Here, $|c| \ll 1$ is the amplitude of the perturbation. If we substitute $\mathbf{w} = \mathbf{e}_i + c\,\mathbf{e}_j$ into Eq. (11.30) we find

$$\langle\Delta\mathbf{w}\rangle = c\,\gamma\,(\lambda_j - \lambda_i)\,\mathbf{e}_j + \mathcal{O}(c^2) \,. \tag{11.31}$$

The weight vector will thus evolve in the direction of the perturbation $\mathbf{e}_j$ if $\lambda_j > \lambda_i$ so that the initial perturbation will increase. In this case, $\mathbf{e}_i$ is unstable. On the other hand, if $\lambda_j < \lambda_i$ the averaged weight change tends to decrease the perturbation and $\mathbf{e}_i$ is stable. Consequently, the eigenvector of $C$ with the largest eigenvalue, viz., the first principal component, is the sole stable fixed point of the dynamics generated by the learning rule of Eq. (11.26). Figure 11.5B shows a simple example.
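The convergence of Oja's rule to a normalized first principal component can be sketched in a short simulation. The code assumes Gaussian input with standard deviation 2 along $(1,1)/\sqrt{2}$ and 0.5 along $(1,-1)/\sqrt{2}$; the function name `oja_step`, the learning rate and the sample size are illustrative assumptions.

```python
import math
import random

def oja_step(w, xi, gamma):
    """Eq. (11.29): dw_j = gamma * nu_post * xi_j - gamma * w_j * nu_post**2."""
    nu_post = sum(wj * xj for wj, xj in zip(w, xi))    # linear neuron
    return [wj + gamma * nu_post * (xj - wj * nu_post) for wj, xj in zip(w, xi)]

random.seed(3)
s = 1.0 / math.sqrt(2.0)
# Zero-mean input: large variance along (1, 1), small variance along (1, -1).
data = []
for _ in range(5000):
    a, b = random.gauss(0.0, 2.0), random.gauss(0.0, 0.5)
    data.append([s * (a + b), s * (a - b)])

w = [1.0, 0.0]
for xi in data:
    w = oja_step(w, xi, gamma=0.005)

norm = math.hypot(w[0], w[1])
overlap = abs(w[0] * s + w[1] * s) / norm    # |cos| of the angle to e_1
print(w, norm, overlap)
```

In contrast to the plain Hebb rule the norm stays close to one, and the direction of `w` fluctuates around the first principal component with an amplitude that shrinks with the learning rate.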

## 11.1.4 Receptive Field Development

Most neurons of the visual system respond only to stimulation from a narrow region within the visual field. This region is called the receptive field of that neuron. Depending on the precise position of a narrow bright spot within the receptive field the corresponding neuron can either show an increase or a decrease of the firing rate relative to its spontaneous activity at rest. The receptive field is subdivided accordingly into 'ON' and 'OFF' regions in order to further characterize neuronal response properties. Bright spots in an ON region increase the firing rate whereas bright spots in an OFF region inhibit the neuron.

Different neurons have different receptive fields, but as a general rule, neighboring neurons have receptive fields that 'look' at about the same region of the visual field. This is what is usually called the retinotopic organization of the neuronal projections: neighboring points in the visual field are mapped to neighboring neurons of the visual system.

The visual system forms a complicated hierarchy of interconnected cortical areas where neurons show increasingly complex response properties from one layer to the next. Neurons from the lateral geniculate nucleus (LGN), which is the first neuronal relay of visual information after the retina, are characterized by so-called center-surround receptive fields. These are receptive fields that consist of two concentric parts, an ON region and an OFF region. LGN neurons come in two flavors, as ON-center and OFF-center cells. ON-center cells have an ON-region in the center of their receptive field that is surrounded by a circular OFF-region. In OFF-center cells the arrangement is the other way round; a central OFF-region is surrounded by an ON-region; cf. Fig. 11.6.

Neurons from the LGN project to the primary visual cortex (V1), which is the first cortical area involved in the processing of visual information. In this area neurons can be divided into 'simple cells' and 'complex cells'. In contrast to LGN neurons, simple cells have asymmetric receptive fields, which results in a selectivity with respect to the orientation of a visual stimulus. The optimal stimulus for a neuron with a receptive field such as that shown in Fig. 11.6D, for example, is a light bar tilted by about 45 degrees. Any other orientation would also stimulate the OFF region of the receptive field, leading to a reduction of the neuronal response. Complex cells have even more intriguing properties and show responses that are, for example, selective for movements with a certain velocity and direction (Hubel, 1995).

It is still a matter of debate how the response properties of simple cells arise. The original proposal by Hubel and Wiesel (1962) was that orientation selectivity is a consequence of the specific wiring between LGN and V1. Several center-surround cells with slightly shifted receptive fields should converge on a single V1 neuron so as to produce the asymmetric receptive field of simple cells. Alternatively (or additionally), the intra-cortical dynamics can generate orientation selectivity by enhancing small asymmetries in neuronal responses; cf. Section 9.1.3. In the following, we pursue the first possibility and try to understand how activity-dependent processes during development can lead to the required fine-tuning of the synaptic organization of projections from the LGN to the primary visual cortex (Miller, 1995,1994; Miller et al., 1989; Linsker, 1986c,b,a; Wimbauer et al., 1997a,b; MacKay and Miller, 1990).

### 11.1.4.1 Model architecture

We are studying a model that consists of a two-dimensional layer of cortical neurons (V1 cells) and two layers of LGN neurons, namely one layer of ON-center cells and one layer of OFF-center cells; cf. Fig. 11.7A. In each layer, neurons are labeled by their position and projections between the neurons are given as a function of their positions. Intra-cortical projections, i.e., projections between cortical neurons, are denoted by $w^{\rm V1,V1}(\mathbf{x},\mathbf{x}')$, where $\mathbf{x}'$ and $\mathbf{x}$ are the positions of the pre- and the postsynaptic neuron, respectively. Projections from ON-center and OFF-center LGN neurons to the cortex are denoted by $w^{\rm V1,ON}(\mathbf{x},\mathbf{x}')$ and $w^{\rm V1,OFF}(\mathbf{x},\mathbf{x}')$, respectively.

In the following we are interested in the evolution of the weight distribution of projections from the LGN to the primary visual cortex. We thus take $w^{\rm V1,ON}(\mathbf{x},\mathbf{x}')$ and $w^{\rm V1,OFF}(\mathbf{x},\mathbf{x}')$ as the dynamic variables of the model. Intra-cortical projections are supposed to be constant and dominated by short-range excitation, e.g.,

$$w^{\rm V1,V1}(\mathbf{x},\mathbf{x}') \propto \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma_{\rm V1,V1}^2}\right) . \tag{11.32}$$

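As in the previous section we consider - for the sake of simplicity - neurons with a linear gain function. The firing rate $\nu^{\rm V1}(\mathbf{x})$ of a cortical neuron at position $\mathbf{x}$ is thus given by

$$\nu^{\rm V1}(\mathbf{x}) = \sum_{\mathbf{x}'} w^{\rm V1,ON}(\mathbf{x},\mathbf{x}')\,\nu^{\rm ON}(\mathbf{x}') + \sum_{\mathbf{x}'} w^{\rm V1,OFF}(\mathbf{x},\mathbf{x}')\,\nu^{\rm OFF}(\mathbf{x}') + \sum_{\mathbf{x}'} w^{\rm V1,V1}(\mathbf{x},\mathbf{x}')\,\nu^{\rm V1}(\mathbf{x}') \,, \tag{11.33}$$

where $\nu^{\rm ON/OFF}(\mathbf{x}')$ is the firing rate of a neuron in the ON/OFF layer of the LGN.

Due to the intra-cortical interaction the cortical activity $\nu^{\rm V1}$ shows up on both sides of the equation. Since this is a linear equation it can easily be solved for $\nu^{\rm V1}$. To do so we write $\nu^{\rm V1}(\mathbf{x}) = \sum_{\mathbf{x}'} \delta_{\mathbf{x},\mathbf{x}'}\,\nu^{\rm V1}(\mathbf{x}')$, where $\delta_{\mathbf{x},\mathbf{x}'}$ is the Kronecker $\delta$ that is one for $\mathbf{x} = \mathbf{x}'$ and vanishes otherwise. Equation (11.33) can thus be rewritten as

$$\sum_{\mathbf{x}'} \big[\delta_{\mathbf{x},\mathbf{x}'} - w^{\rm V1,V1}(\mathbf{x},\mathbf{x}')\big]\,\nu^{\rm V1}(\mathbf{x}') = \sum_{\mathbf{x}'} w^{\rm V1,ON}(\mathbf{x},\mathbf{x}')\,\nu^{\rm ON}(\mathbf{x}') + \sum_{\mathbf{x}'} w^{\rm V1,OFF}(\mathbf{x},\mathbf{x}')\,\nu^{\rm OFF}(\mathbf{x}') \,. \tag{11.34}$$

If we read the left-hand side as a multiplication of the matrix $M(\mathbf{x},\mathbf{x}') \equiv \big[\delta_{\mathbf{x},\mathbf{x}'} - w^{\rm V1,V1}(\mathbf{x},\mathbf{x}')\big]$ and the vector $\nu^{\rm V1}(\mathbf{x}')$ we can define the inverse $I$ of $M$ by

$$\sum_{\mathbf{x}'} I(\mathbf{x}'',\mathbf{x}')\,M(\mathbf{x}',\mathbf{x}) = \delta_{\mathbf{x}'',\mathbf{x}} \tag{11.35}$$

and solve Eq. (11.34) for $\nu^{\rm V1}(\mathbf{x}'')$. We find

$$\nu^{\rm V1}(\mathbf{x}'') = \sum_{\mathbf{x}'} \bar{w}^{\rm V1,ON}(\mathbf{x}'',\mathbf{x}')\,\nu^{\rm ON}(\mathbf{x}') + \sum_{\mathbf{x}'} \bar{w}^{\rm V1,OFF}(\mathbf{x}'',\mathbf{x}')\,\nu^{\rm OFF}(\mathbf{x}') \,, \tag{11.36}$$

which relates the input $\nu^{\rm ON/OFF}$ to the output via the 'effective' weights

$$\bar{w}^{\rm V1,ON/OFF}(\mathbf{x}'',\mathbf{x}') \equiv \sum_{\mathbf{x}} I(\mathbf{x}'',\mathbf{x})\,w^{\rm V1,ON/OFF}(\mathbf{x},\mathbf{x}') \,. \tag{11.37}$$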
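Solving the linear equation for the cortical rates can be sketched with a small one-dimensional example. Everything specific here is an assumption made for illustration: a chain of five cortical neurons, Gaussian lateral weights scaled by 0.2 without self-coupling, and a feed-forward drive `h` that stands for the summed LGN input to each cortical neuron. Instead of inverting the matrix explicitly, the sketch iterates the implicit rate equation, which sums the Neumann series of the inverse and converges as long as the lateral coupling is weak enough.

```python
import math

n_ctx = 5        # toy one-dimensional "cortex" (assumed)
sigma = 1.0      # range of the lateral coupling (assumed)

# Short-range excitatory lateral weights, no self-coupling; the factor 0.2
# keeps the recurrent loop gain below one so that the iteration converges.
W = [[0.0 if x == y else 0.2 * math.exp(-(x - y) ** 2 / sigma ** 2)
      for y in range(n_ctx)] for x in range(n_ctx)]

# Feed-forward drive: summed, weighted LGN input to each cortical neuron.
h = [0.0, 0.0, 1.0, 0.0, 0.0]

# Fixed-point iteration nu <- h + W nu.
nu = list(h)
for _ in range(200):
    nu = [h[x] + sum(W[x][y] * nu[y] for y in range(n_ctx))
          for x in range(n_ctx)]

# Residual of the linear equation (1 - W) nu = h; it should vanish.
residual = max(abs(nu[x] - sum(W[x][y] * nu[y] for y in range(n_ctx)) - h[x])
               for x in range(n_ctx))
print(nu, residual)
```

Although only the central neuron receives feed-forward input, its neighbors become active through the lateral connections. This spread of activity is what the effective weights capture.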

### 11.1.4.2 Plasticity

We expect that the formation of synapses between LGN and V1 is driven by correlations in the input. In the present case, these correlations are due to the retinotopic organization of projections from the retina to the LGN. Neighboring LGN neurons receiving stimulation from similar regions of the visual field are thus correlated to a higher degree than neurons that are more separated. If we assume that the activity of individual photoreceptors on the retina is uncorrelated and that each LGN neuron integrates the input from many of these receptors then the correlation of two LGN neurons can be calculated from the form of their receptive fields. For center-surround cells the correlation is a Mexican hat-shaped function of their distance (Miller, 1994; Wimbauer et al., 1997a), e.g.,

$$C^{\rm ON,ON}(\mathbf{x},\mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma_{\rm LGN}^2}\right) - \frac{1}{c^2}\,\exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{c^2\,\sigma_{\rm LGN}^2}\right) , \tag{11.38}$$

where $c$ is a form factor that describes the depth of the modulation. $C^{\rm ON,ON}$ is the correlation between two ON-center type LGN neurons. For the sake of simplicity we assume that OFF-center cells have the same correlation, $C^{\rm OFF,OFF} = C^{\rm ON,ON}$. Correlations between ON-center and OFF-center cells, however, have the opposite sign, $C^{\rm ON,OFF} = C^{\rm OFF,ON} = - C^{\rm ON,ON}$.

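In the present formulation of the model each LGN cell can contact every neuron in the primary visual cortex. In reality, each LGN cell sends one axon to the cortex. Though this axon may split into several branches its synaptic contacts are restricted to a small region of the cortex; cf. Fig. 11.7B. We take this limitation into account by defining an arborization function $A(\mathbf{x},\mathbf{x}')$ that gives the a priori probability that a connection between an LGN cell at location $\mathbf{x}'$ and a cortical cell at $\mathbf{x}$ is formed (Miller et al., 1989). The arborization is a rapidly decaying function of the distance, e.g.,

$$A(\mathbf{x},\mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma_{\rm V1,LGN}^2}\right) . \tag{11.39}$$

To describe the dynamics of the weight distribution we adopt a modified form of Hebb's learning rule that is completed by the arborization function,

$$\frac{\mathrm{d}}{\mathrm{d}t}\,w^{\rm V1,ON/OFF}(\mathbf{x},\mathbf{x}') = \gamma\,A(\mathbf{x},\mathbf{x}')\,\nu^{\rm V1}(\mathbf{x})\,\nu^{\rm ON/OFF}(\mathbf{x}') \,. \tag{11.40}$$

If we express the cortical rate $\nu^{\rm V1}$ through the effective weights and assume that learning is slow enough so that we can rely on the correlation functions to describe the evolution of the weights, we find

$$\frac{\mathrm{d}}{\mathrm{d}t}\,w^{\rm V1,ON}(\mathbf{x},\mathbf{x}') = \gamma\,A(\mathbf{x},\mathbf{x}') \sum_{\mathbf{x}''} \big[\bar{w}^{\rm V1,ON}(\mathbf{x},\mathbf{x}'') - \bar{w}^{\rm V1,OFF}(\mathbf{x},\mathbf{x}'')\big]\,C^{\rm ON,ON}(\mathbf{x}'',\mathbf{x}') \tag{11.41}$$

and a similar equation for $w^{\rm V1,OFF}$.

Expression (11.41) is still a linear equation for the weights and nothing exciting can be expected. A prerequisite for pattern formation is competition between the synaptic weights. Therefore, the above learning rule is extended by a term $-\,w^{\rm V1,ON/OFF}(\mathbf{x},\mathbf{x}')\,\big[\nu^{\rm V1}(\mathbf{x})\big]^2$ that leads to weight vector normalization and competition; cf. Oja's rule, Eq. (10.11).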
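The ingredients of the averaged learning equation can be sketched for a single cortical neuron above a one-dimensional LGN layer. Several things are assumed here and not taken from the text: the Mexican hat is modeled as a difference of two Gaussians with form factor `c`, the arborization as a Gaussian of width `sigma_arbor`, lateral cortical coupling is ignored so that effective and actual weights coincide, and the normalization term is left out. Function names and parameter values are hypothetical.

```python
import math

def mexican_hat(d, sigma=1.0, c=2.0):
    """Assumed ON-ON correlation: narrow positive center, broad negative surround."""
    return (math.exp(-d ** 2 / sigma ** 2)
            - math.exp(-d ** 2 / (c * sigma) ** 2) / c ** 2)

def arbor(d, sigma_arbor=3.0):
    """Assumed arborization: a priori connection probability vs. distance."""
    return math.exp(-d ** 2 / sigma_arbor ** 2)

positions = range(-5, 6)     # 1-D LGN layer; the cortical neuron sits above x = 0

def drift_on(w_on, w_off, gamma=1.0):
    """Averaged Hebbian drift of the ON weights of one cortical neuron,
    using C_OFF,ON = -C_ON,ON and neglecting lateral coupling."""
    return [gamma * arbor(abs(xp)) *
            sum((w_on[i] - w_off[i]) * mexican_hat(abs(xpp - xp))
                for i, xpp in enumerate(positions))
            for xp in positions]

for d in (0.0, 1.0, 2.0, 4.0):
    print(d, round(mexican_hat(d), 4), round(arbor(d), 4))

# A single ON synapse at the center: nearby ON weights are pushed up,
# ON weights at intermediate distance are pushed down.
w_on = [1.0 if x == 0 else 0.0 for x in positions]
w_off = [0.0] * len(positions)
print([round(v, 4) for v in drift_on(w_on, w_off)])
```

Because ON and OFF inputs are anticorrelated, only the difference between ON and OFF weights drives the dynamics: identical ON and OFF weight profiles produce no drift at all.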

### 11.1.4.3 Simulation results

Many of the standard techniques for nonlinear systems that we have already encountered in the context of neuronal pattern formation in Chapter 9 can also be applied to the present model (Wimbauer et al., 1998; MacKay and Miller, 1990). Here, however, we will just summarize some results from a computer simulation consisting of an array of 8×8 cortical neurons and two times 20×20 LGN neurons. Figure 11.8 shows a typical outcome of such a simulation. Each of the small rectangles shows the receptive field of the corresponding cortical neuron. A bright color means that the neuron responds with an increased firing rate to a bright spot at that particular position within its receptive field; dark colors indicate inhibition.

There are two interesting aspects. First, the evolution of the synaptic weights has led to asymmetric receptive fields, which give rise to orientation selectivity. Second, the structure of the receptive fields of neighboring cortical neurons is similar; neuronal response properties thus vary continuously across the cortex. The neurons are said to form a map for, e.g., orientation.

The first observation, the breaking of the symmetry of LGN receptive fields, is characteristic of all pattern formation phenomena. It results from the instability of the homogeneous initial state and the competition between individual synaptic weights. The second observation, the smooth variation of the receptive fields across the cortex, is a consequence of the excitatory intra-cortical couplings. During development, neighboring cortical neurons tend to be either simultaneously active or quiescent and, due to the activity-dependent learning rule, similar receptive fields are formed.

Gerstner and Kistler
Spiking Neuron Models. Single Neurons, Populations, Plasticity
Cambridge University Press, 2002