We would like to understand how activity-dependent learning rules influence the formation of connections between neurons in the brain. We will see that plasticity is controlled by the statistical properties of the presynaptic input that is impinging on the postsynaptic neuron. Before we delve into the analysis of the elementary Hebb rule we therefore need to recapitulate a few results from statistics and linear algebra.
A principal component analysis (PCA) is a standard technique to describe statistical properties of a set of high-dimensional data points and is usually performed in order to find those components of the data that show the highest variability within the set. If we think of the input data set as a cloud of points in a high-dimensional vector space centered around the origin, then the first principal component is the direction of the longest axis of the ellipsoid that encompasses the cloud; cf. Fig. 11.1. If the data points consisted of, say, two separate clouds then the first principal component would give the direction of a line that connects the center points of the two clouds. A PCA can thus be used to break a large data set into separate clusters. In the following, we will quickly explain the basic idea and show that the first principal component gives the direction where the variance of the data is maximal.
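Let us consider an ensemble of data points {\boldsymbol{\xi}^{1},...,\boldsymbol{\xi}^{p}} drawn from a (high-dimensional) vector space, for example \boldsymbol{\xi}^{\mu} \in \mathbb{R}^{N}. For this set of data points we define the correlation matrix C_{ij} as
C_{ij} = \frac{1}{p} \sum_{\mu=1}^{p} \xi_{i}^{\mu} \xi_{j}^{\mu} = \langle \xi_{i}^{\mu} \xi_{j}^{\mu} \rangle_{\mu} ,
where angular brackets \langle \cdot \rangle_{\mu} denote an average over the whole set of data points. In analogy to the variance of a single random variable, the covariance matrix of the data set is V_{ij} = \langle (\xi_{i}^{\mu} - \langle \xi_{i}^{\mu} \rangle_{\mu}) (\xi_{j}^{\mu} - \langle \xi_{j}^{\mu} \rangle_{\mu}) \rangle_{\mu}. In the following we assume that the center of mass of the data lies at the origin, \langle \xi_{i}^{\mu} \rangle_{\mu} = 0, so that correlation matrix and covariance matrix are identical.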
The principal components of the set {\boldsymbol{\xi}^{1},...,\boldsymbol{\xi}^{p}} are defined as the eigenvectors of the covariance matrix V. Note that V is symmetric, i.e., V_{ij} = V_{ji}. The eigenvalues of V are thus real-valued and eigenvectors that belong to different eigenvalues are orthogonal (Horn and Johnson, 1985). Furthermore, V is positive semi-definite since, for any vector \mathbf{y},
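\mathbf{y}^{T} V \mathbf{y} = \langle \sum_{ij} y_{i} \xi_{i}^{\mu} \xi_{j}^{\mu} y_{j} \rangle_{\mu} = \langle [ \sum_{i} y_{i} \xi_{i}^{\mu} ]^{2} \rangle_{\mu} \ge 0 . (11.3)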
We can sort the eigenvectors \mathbf{e}_{i} according to the size of the corresponding eigenvalues \lambda_{1} \ge \lambda_{2} \ge ... \ge 0. The eigenvector with the largest eigenvalue is called the first principal component. It points in the direction where the variance of the data is maximal. To see this we calculate the variance of the projection of \boldsymbol{\xi}^{\mu} onto an arbitrary direction \mathbf{y} that we write as \mathbf{y} = \sum_{i} a_{i} \mathbf{e}_{i} with \sum_{i} a_{i}^{2} = 1 so that |\mathbf{y}| = 1. The variance \sigma_{\mathbf{y}}^{2} along \mathbf{y} is
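\sigma_{\mathbf{y}}^{2} = \langle [ \boldsymbol{\xi}^{\mu} \cdot \mathbf{y} ]^{2} \rangle_{\mu} = \mathbf{y}^{T} V \mathbf{y} = \sum_{i} \lambda_{i} a_{i}^{2} . (11.4)
Under the constraint \sum_{i} a_{i}^{2} = 1 the right-hand side is maximal if a_{1} = 1 and a_{i} = 0 for i = 2,...,N, that is, if \mathbf{y} = \mathbf{e}_{1}.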
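The statement of Eq. (11.4) can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic Gaussian data cloud; the data set, the sample sizes and all variable names are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: 5000 points from an elongated Gaussian cloud in 3 dimensions.
data = rng.normal(size=(5000, 3)) * np.array([3.0, 1.0, 0.3])
data = data - data.mean(axis=0)            # put the center of mass at the origin

# Covariance matrix V_ij = <xi_i xi_j> (equals the correlation matrix for centered data).
V = data.T @ data / len(data)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(V)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort in descending order
e1 = eigvecs[:, 0]                                   # first principal component


def variance_along(direction):
    """Variance of the data projected onto the normalized direction."""
    y = direction / np.linalg.norm(direction)
    return np.mean((data @ y) ** 2)


var_e1 = variance_along(e1)
# No randomly chosen direction should exceed the variance along e1.
var_best_random = max(variance_along(rng.normal(size=3)) for _ in range(200))

print("eigenvalues:", eigvals)
print("variance along e1:", var_e1)
print("largest variance among random directions:", var_best_random)
```

The variance along e1 coincides with the largest eigenvalue, and none of the randomly drawn unit vectors reaches a larger projected variance.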
In the following we analyze the evolution of synaptic weights using the Hebbian learning rules that have been described in Chapter 10. To do so, we consider a highly simplified scenario consisting of an analog neuron that receives input from N presynaptic neurons with firing rates \nu_{i}^{pre} via synapses with weights w_{i}; cf. Fig. 11.2A. We think of the presynaptic neurons as `input neurons', which, however, do not have to be sensory neurons. The input layer could, for example, consist of neurons in the lateral geniculate nucleus (LGN) that project to neurons in the visual cortex. We will see that the statistical properties of the input control the evolution of synaptic weights.
For the sake of simplicity, we model the presynaptic input as a set of static patterns. Let us suppose that we have a total of p patterns {\boldsymbol{\xi}^{\mu}; 1 \le \mu \le p}. At each time step one of the patterns \boldsymbol{\xi}^{\mu} is selected at random and presented to the network by fixing the presynaptic rates at \nu_{i}^{pre} = \xi_{i}^{\mu}. We call this the static-pattern scenario. The presynaptic activity drives the postsynaptic neuron and the joint activity of pre- and postsynaptic neurons triggers changes of the synaptic weights. The synaptic weights are modified according to a Hebbian learning rule, i.e., according to the correlation of pre- and postsynaptic activity; cf. Eq. (10.3). Before the next input pattern is chosen, the weights are changed by an amount
\Delta w_{i} = \gamma \nu^{post} \nu_{i}^{pre} , (11.5)
where 0 < \gamma \ll 1 is a small constant called the learning rate.
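In a general rate model, the firing rate \nu^{post} of the postsynaptic neuron is given by a nonlinear function g of the total input,
\nu^{post} = g( \sum_{i} w_{i} \nu_{i}^{pre} ) .
For the sake of simplicity, we restrict the discussion in the following to a linear rate model with
\nu^{post} = \sum_{i} w_{i} \nu_{i}^{pre} . (11.7)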
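If we combine the learning rule (11.5) with the linear rate model of Eq. (11.7) we find after the presentation of pattern \boldsymbol{\xi}^{\mu}
\Delta w_{i} = \gamma \sum_{j} w_{j} \nu_{j}^{pre} \nu_{i}^{pre} = \gamma \sum_{j} w_{j} \xi_{j}^{\mu} \xi_{i}^{\mu} .
The evolution of the weight vector \mathbf{w} = (w_{1},...,w_{N}) is thus determined by the iteration
w_{i}(n + 1) = w_{i}(n) + \gamma \sum_{j} w_{j}(n) \xi_{j}^{\mu_{n}} \xi_{i}^{\mu_{n}} , (11.9)
where \mu_{n} denotes the pattern that is presented during the nth time step.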
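We are interested in the long-term behavior of the synaptic weights. To this end we assume that the weight vector evolves along a more or less deterministic trajectory with only small stochastic deviations that result from the random choice of input patterns. This is, for example, the case if the learning rate \gamma is small so that a large number of patterns has to be presented in order to induce a substantial weight change. In such a situation it is sensible to consider the expectation value of the weight vector, i.e., the weight vector \langle \mathbf{w}(n) \rangle averaged over the sequence (\boldsymbol{\xi}^{\mu_{1}}, \boldsymbol{\xi}^{\mu_{2}},...,\boldsymbol{\xi}^{\mu_{n}}) of all patterns that so far have been presented to the network. Since each new pattern is chosen independently of the current weights, the average over the product of weights and patterns factorizes, and from Eq. (11.9) we find
\langle w_{i}(n + 1) \rangle = \langle w_{i}(n) \rangle + \gamma \sum_{j} C_{ij} \langle w_{j}(n) \rangle ,
where C_{ij} = \langle \xi_{i}^{\mu} \xi_{j}^{\mu} \rangle_{\mu} is the correlation matrix of the input patterns. In matrix notation this reads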
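\langle \mathbf{w}(n + 1) \rangle = (\mathbf{1} + \gamma C) \langle \mathbf{w}(n) \rangle = (\mathbf{1} + \gamma C)^{n+1} \langle \mathbf{w}(0) \rangle , (11.12)
where \mathbf{1} denotes the identity matrix and \gamma the learning rate.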
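If we express the weight vector in terms of the eigenvectors \mathbf{e}_{k} of C,
\langle \mathbf{w}(n) \rangle = \sum_{k} a_{k}(n) \mathbf{e}_{k} , (11.13)
we obtain an explicit expression for \langle \mathbf{w}(n) \rangle for any given initial condition a_{k}(0), viz.,
\langle \mathbf{w}(n) \rangle = \sum_{k} (1 + \gamma \lambda_{k})^{n} a_{k}(0) \mathbf{e}_{k} . (11.14)
Here \lambda_{k} is the eigenvalue that belongs to \mathbf{e}_{k}. Since the correlation matrix is positive semi-definite, all \lambda_{k} are real and non-negative. The weight vector therefore grows exponentially, and the growth is soon dominated by the eigenvector with the largest eigenvalue, i.e., by the first principal component; cf. Section 11.1.1. Because the output of the linear neuron (11.7) is proportional to the projection of the input pattern onto \mathbf{w}, Hebbian learning thus extracts the first principal component of the input data.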
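The expected weight dynamics of Eqs. (11.12) and (11.14) can be illustrated with a short numerical sketch. It assumes NumPy; the pattern set, the learning rate of 0.01 and the initial weights are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical zero-mean input patterns with most variance along the first axis.
patterns = rng.normal(size=(200, 4)) * np.array([2.0, 1.0, 0.5, 0.25])
patterns = patterns - patterns.mean(axis=0)
C = patterns.T @ patterns / len(patterns)      # correlation matrix of the patterns

gamma = 0.01                                   # learning rate (assumed value)
n_steps = 500
w0 = np.full(4, 0.5)                           # arbitrary initial weight vector

# Iterate the expected dynamics <w(n+1)> = (1 + gamma C) <w(n)>, Eq. (11.12).
w = w0.copy()
for _ in range(n_steps):
    w = w + gamma * (C @ w)

# Closed form, Eq. (11.14): each eigen-component grows by (1 + gamma lambda_k) per step.
eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
a0 = eigvecs.T @ w0                            # coefficients a_k(0)
w_closed = eigvecs @ ((1.0 + gamma * eigvals) ** n_steps * a0)

e1 = eigvecs[:, -1]                            # eigenvector with the largest eigenvalue
alignment = abs(w @ e1) / np.linalg.norm(w)    # |cos| of the angle between w and e1
rel_error = np.linalg.norm(w - w_closed) / np.linalg.norm(w)

print("alignment with first principal component:", alignment)
print("norm of w before and after:", np.linalg.norm(w0), np.linalg.norm(w))
```

The iterated and the closed-form weight vectors agree, the direction of the weight vector approaches the first principal component, and its norm keeps growing, which is the unbounded growth discussed in the text.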
From a data-processing point of view, the extraction of the first principal component of the input data set by a biologically inspired learning rule seems to be very compelling. There are, however, a few drawbacks and pitfalls. First, the above statement about the Hebbian learning rule is limited to the expectation value of the weight vector. We will see below that, if the learning rate is sufficiently low, then the actual weight vector is in fact very close to the expected one.
Second, while the direction of the weight vector moves in the direction of the principal component, the norm of the weight vector grows without bounds. We will see below in Section 11.1.3 that suitable variants of Hebbian learning allow us to control the length of the weight vector without changing its direction.
Third, principal components are only meaningful if the input data is normalized, i.e., distributed around the origin. This requirement is not consistent with a rate interpretation because rates are usually positive. This problem, however, can be overcome by learning rules such as the covariance rule of Eq. (10.10) that are based on the deviation of the rates from a certain mean firing rate. We will see in Section 11.2.4 that a spike-based learning rule can be devised that is sensitive only to deviations from the mean firing rate and can thus find the first principal component even if the input is not properly normalized.
So far, we have derived the behavior of the expected weight vector, \langle \mathbf{w} \rangle. Here we show that explicit averaging is not necessary provided that learning is slow enough. In this case, the weight vector is the sum of a large number of small changes. The weight dynamics is thus `self-averaging' and the weight vector \mathbf{w} can be well approximated by its expectation value \langle \mathbf{w} \rangle.
We start from the formulation of Hebbian plasticity in continuous time,
\frac{d}{dt} w_{i} = c^{corr}_{2} \nu^{post} \nu_{i}^{pre} ; (11.16)
cf. Eq. (10.3).
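Each pattern \boldsymbol{\xi}^{\mu} is presented for a short interval of duration \Delta t, during which the weights change only by a small amount, so that the postsynaptic rate \nu^{post}(t) = \sum_{j} w_{j}(t) \nu_{j}^{pre} can be taken as constant. The weight change induced by the presentation of pattern \boldsymbol{\xi}^{\mu} is then, to first order in \Delta t,
\Delta w_{i}(t) = w_{i}(t + \Delta t) - w_{i}(t) = \gamma \sum_{j} \xi_{i}^{\mu} \xi_{j}^{\mu} w_{j}(t) + O(\Delta t^{2}) , (11.17)
with \gamma = c^{corr}_{2} \Delta t.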
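In the next time step a new pattern \boldsymbol{\xi}^{\mu'} is presented so that the weight is changed to
w_{i}(t + 2 \Delta t) = w_{i}(t + \Delta t) + c^{corr}_{2} \Delta t \sum_{j} \xi_{i}^{\mu'} \xi_{j}^{\mu'} w_{j}(t + \Delta t) + O(\Delta t^{2}) .
Since we keep only terms of first order in \Delta t, we may replace w_{j}(t + \Delta t) by w_{j}(t) in the sum on the right-hand side. Let us suppose that in the interval [t, t + p \Delta t] each of the p patterns has been applied exactly once. Then, to first order in \Delta t,
w_{i}(t + p \Delta t) - w_{i}(t) = c^{corr}_{2} \Delta t \sum_{j} \sum_{\mu=1}^{p} \xi_{i}^{\mu} \xi_{j}^{\mu} w_{j}(t) + O(\Delta t^{2}) . (11.19)
For c^{corr}_{2} \Delta t \ll 1 the higher-order terms can be neglected. Division by p \Delta t yields
\frac{w_{i}(t + p \Delta t) - w_{i}(t)}{p \Delta t} = c^{corr}_{2} \sum_{j} w_{j}(t) C_{ij} , (11.20)
where C_{ij} = p^{-1} \sum_{\mu} \xi_{i}^{\mu} \xi_{j}^{\mu} is the correlation matrix of the patterns. The left-hand side can be approximated by a time derivative, so that
\frac{d}{dt} w_{i}(t) = c^{corr}_{2} \sum_{j} w_{j}(t) C_{ij} . (11.21)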
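The self-averaging property can be illustrated by comparing single stochastic runs of the Hebbian update with the expected trajectory. The sketch below assumes NumPy; the pattern set, the two learning rates and the number of runs are hypothetical choices. Both settings cover the same product of learning rate and number of steps, so that only the step size differs.

```python
import numpy as np

p, N = 50, 3
rng = np.random.default_rng(2)
# Hypothetical static patterns; C is their correlation matrix.
patterns = rng.normal(size=(p, N)) * np.array([1.5, 1.0, 0.5])
C = patterns.T @ patterns / p


def relative_deviation(gamma, total_time, run_rng):
    """Relative distance between one stochastic run and the expected trajectory."""
    n_steps = int(round(total_time / gamma))
    w_stoch = np.ones(N)
    w_mean = np.ones(N)
    for _ in range(n_steps):
        xi = patterns[run_rng.integers(p)]               # pick one pattern at random
        w_stoch = w_stoch + gamma * (w_stoch @ xi) * xi  # Hebbian update with this pattern
        w_mean = w_mean + gamma * (C @ w_mean)           # update of the expected weights
    return np.linalg.norm(w_stoch - w_mean) / np.linalg.norm(w_mean)


def mean_deviation(gamma, n_runs=20, total_time=1.0, seed=3):
    """Average the relative deviation over several independent pattern sequences."""
    run_rng = np.random.default_rng(seed)
    return np.mean([relative_deviation(gamma, total_time, run_rng) for _ in range(n_runs)])


dev_fast = mean_deviation(gamma=0.1)      # 10 large steps
dev_slow = mean_deviation(gamma=0.001)    # 1000 small steps

print("mean relative deviation, fast learning:", dev_fast)
print("mean relative deviation, slow learning:", dev_slow)
```

With the smaller learning rate the individual runs stay closer to the expected trajectory, in line with the argument that slow learning averages over many pattern presentations.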
We have seen in Section 11.1.2 that the simple learning rule (10.3) leads to exponentially growing weights. Since this is biologically not plausible, we must use a modified Hebbian learning rule that includes weight decrease and saturation; cf. Chapter 10.2. Particularly interesting are learning rules that lead to a normalized weight vector. Normalization is a desirable property since it leads to a competition between synaptic weights w_{ij} that converge on the same postsynaptic neuron i. Competition means that if a synaptic efficacy increases, it does so at the expense of other synapses that must decrease.
For a discussion of weight vector normalization two aspects are important, namely what is normalized and how the normalization is achieved. First, learning rules can be designed to normalize either the sum of weights, \sum_{j} w_{ij}, or the quadratic norm, |\mathbf{w}|^{2} = \sum_{j} w_{ij}^{2} (or any other norm on \mathbb{R}^{N}). In the first case, the weight vector is constrained to a plane perpendicular to the diagonal vector \mathbf{n} = (1,..., 1); in the second case it is constrained to a hyper-sphere; cf. Fig. 11.4.
Second, the normalization of the weight vector can either be multiplicative or subtractive. In the former case all weights are multiplied by a common factor so that large weights w_{ij} are corrected by a larger amount than smaller ones. In the latter case a common constant is subtracted from each weight. Usually, subtractive normalization is combined with hard bounds 0 \le w_{ij} \le w^{max} in order to avoid runaway of individual weights. Finally, learning rules may or may not fall into the class of local learning rules that we have considered in Chapter 10.2.
A systematic classification of various learning rules according to the above three criteria has been proposed by Miller and MacKay (1994). Here we restrict ourselves to two instances of learning with normalization properties which we illustrate in the examples below. We start with the subtractive normalization of the summed weights \sum_{j} w_{ij} and turn then to a discussion of Oja's rule as an instance of a multiplicative normalization of \sum_{j} w_{ij}^{2}.
In a subtractive normalization scheme the sum over all weights, \sum_{i} w_{i}, can be kept constant by subtracting the average total weight change, N^{-1} \sum_{i} \Delta\tilde{w}_{i}, from each synapse after the weights have been updated according to a Hebbian learning rule with \Delta\tilde{w}_{i} = \gamma \sum_{j} w_{j} \xi_{j}^{\mu} \xi_{i}^{\mu}. Altogether, the learning rule is of the form
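\Delta w_{i} = \Delta\tilde{w}_{i} - N^{-1} \sum_{k} \Delta\tilde{w}_{k}
= \gamma [ \sum_{j} w_{j} \xi_{j}^{\mu} \xi_{i}^{\mu} - N^{-1} \sum_{k} \sum_{j} w_{j} \xi_{j}^{\mu} \xi_{k}^{\mu} ] , (11.22)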
In a similar way as in the previous section, we calculate the expectation of the weight vector \langle \mathbf{w}(n) \rangle, averaged over the sequence of input patterns (\boldsymbol{\xi}^{\mu_{1}}, \boldsymbol{\xi}^{\mu_{2}},...),
\langle w_{i}(n + 1) \rangle = \langle w_{i}(n) \rangle + \gamma [ \sum_{j} C_{ij} \langle w_{j}(n) \rangle - N^{-1} \sum_{k} \sum_{j} C_{kj} \langle w_{j}(n) \rangle ] , (11.23)
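That the rule of Eq. (11.22) conserves the summed weight can be verified directly. The following is a minimal sketch, assuming NumPy, Gaussian input patterns and a learning rate of 0.01; the function name and all parameter values are hypothetical, and the hard bounds mentioned in the text are deliberately left out.

```python
import numpy as np


def hebb_subtractive_step(w, xi, gamma):
    """Hebbian update followed by subtraction of the mean weight change."""
    dw_naive = gamma * (w @ xi) * xi       # pure Hebbian change for a linear neuron
    return w + dw_naive - dw_naive.mean()  # subtract N^-1 times the summed change


rng = np.random.default_rng(3)
N = 5
w = rng.uniform(0.0, 1.0, size=N)          # arbitrary initial weights
initial_sum = w.sum()

for _ in range(200):
    xi = rng.normal(size=N)                # hypothetical zero-mean input pattern
    w = hebb_subtractive_step(w, xi, gamma=0.01)

final_sum = w.sum()
print("sum of weights before and after:", initial_sum, final_sum)
print("individual weights:", w)
```

The sum of the weights stays constant up to rounding errors, while individual weights are free to drift apart, which is why subtractive normalization is usually combined with hard bounds.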
Normalization of the sum of the weights, \sum_{i} w_{i}, needs an additional criterion to prevent individual weights from perpetual growth. A more elegant way is to require that the sum of the squared weights, i.e., the squared length of the weight vector, \sum_{i} w_{i}^{2}, remains constant. This restricts the evolution of the weight vector to a sphere in the N-dimensional weight space. In addition, we can employ a multiplicative normalization scheme where all weights are multiplied by a common factor instead of subtracting a common constant. The advantage of multiplicative compared to subtractive normalization is that small weights will not change their sign during the normalization step.
In order to formalize the above idea we first calculate the `naïve' weight change \Delta\tilde{\mathbf{w}}(n) in time step n according to the common Hebbian learning rule,
\Delta\tilde{\mathbf{w}}(n) = \gamma [ \mathbf{w}(n) \cdot \boldsymbol{\xi}^{\mu} ] \boldsymbol{\xi}^{\mu} . (11.25)
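The update of the weights is accompanied by a normalization of the weight vector to unit length, i.e., \mathbf{w}(n + 1) = [ \mathbf{w}(n) + \Delta\tilde{\mathbf{w}}(n) ] / | \mathbf{w}(n) + \Delta\tilde{\mathbf{w}}(n) |. If the weights change only by a very small amount during each step (\gamma \ll 1), we can calculate the new weights to first order in \gamma,
\mathbf{w}(n + 1) = \mathbf{w}(n) + \Delta\tilde{\mathbf{w}}(n) - \mathbf{w}(n) [ \mathbf{w}(n) \cdot \Delta\tilde{\mathbf{w}}(n) ] + O(\gamma^{2}) . (11.27)
The `naïve' weight change is thus replaced by the effective weight change
\Delta\mathbf{w}(n) = \Delta\tilde{\mathbf{w}}(n) - \mathbf{w}(n) [ \mathbf{w}(n) \cdot \Delta\tilde{\mathbf{w}}(n) ] , (11.28)
which is the component of \Delta\tilde{\mathbf{w}}(n) that is orthogonal to the current weight vector \mathbf{w}(n).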
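We may wonder whether Eq. (11.28) is a local learning rule. In order to answer this question, we recall that the `naïve' weight change \Delta\tilde{w}_{j} = \gamma \nu^{post} \nu_{j}^{pre} uses only pre- and postsynaptic information. Hence, we can rewrite Eq. (11.28) in terms of the firing rates,
\Delta w_{j} = \gamma \nu^{post} \nu_{j}^{pre} - \gamma w_{j} (\nu^{post})^{2} ,
where the second term on the right-hand side makes use of the linear neuron model, \nu^{post} = \sum_{k} w_{k} \nu_{k}^{pre}. The weight change of synapse j depends only on w_{j}, \nu_{j}^{pre}, and \nu^{post}, so the rule is local. It is known as Oja's learning rule; cf. Eq. (10.11).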
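In order to see that Oja's learning rule selects the first principal component we show that the eigenvectors {\mathbf{e}_{1},...,\mathbf{e}_{N}} of C are fixed points of the dynamics but that only the eigenvector with the largest eigenvalue is stable. For any fixed weight vector \mathbf{w} we can calculate the expectation of the weight change in the next time step by averaging over the whole ensemble of input patterns {\boldsymbol{\xi}^{1}, \boldsymbol{\xi}^{2},...}. With \langle \Delta\tilde{\mathbf{w}}(n) \rangle = \gamma C \mathbf{w} we find from Eq. (11.28)
\langle \Delta\mathbf{w} \rangle = \gamma C \mathbf{w} - \gamma \mathbf{w} [ \mathbf{w} \cdot C \mathbf{w} ] .
Any normalized eigenvector \mathbf{e}_{i} of C is a fixed point of this dynamics: with C \mathbf{e}_{i} = \lambda_{i} \mathbf{e}_{i} and \mathbf{e}_{i} \cdot C \mathbf{e}_{i} = \lambda_{i} the two terms on the right-hand side cancel for \mathbf{w} = \mathbf{e}_{i}.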
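In order to investigate the stability of the fixed point \mathbf{e}_{i} we consider a small perturbation in the direction of another eigenvector, \mathbf{w} = \mathbf{e}_{i} + c \mathbf{e}_{j} with |c| \ll 1, and find
\langle \Delta\mathbf{w} \rangle = c \gamma (\lambda_{j} - \lambda_{i}) \mathbf{e}_{j} + O(c^{2}) . (11.31)
The perturbation grows if \lambda_{j} > \lambda_{i} and decays if \lambda_{j} < \lambda_{i}. Hence only the eigenvector with the largest eigenvalue, i.e., the first principal component, is a stable fixed point.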
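The convergence of Oja's rule to the first principal component can be illustrated with a small simulation. The sketch assumes NumPy, a linear neuron, and zero-mean Gaussian inputs whose largest variance lies along the first coordinate axis; the learning rate of 0.002 and the number of steps are hypothetical choices.

```python
import numpy as np


def oja_step(w, xi, gamma):
    """One step of Oja's rule for a linear neuron with nu_post = w . xi."""
    nu_post = w @ xi
    # Hebbian term gamma*nu_post*xi minus the decay term gamma*nu_post^2*w.
    return w + gamma * nu_post * (xi - nu_post * w)


rng = np.random.default_rng(4)
scales = np.array([2.0, 1.0, 0.5])     # input standard deviation per axis; e_1 is the first axis
w = np.ones(3) / np.sqrt(3.0)          # initial weight vector of unit length

for _ in range(20000):
    xi = rng.normal(size=3) * scales   # hypothetical zero-mean input pattern
    w = oja_step(w, xi, gamma=0.002)

norm = np.linalg.norm(w)
alignment = abs(w[0]) / norm           # |cos| of the angle between w and e_1 = (1, 0, 0)

print("final weight vector:", w)
print("norm:", norm, "alignment with e_1:", alignment)
```

The weight vector ends up close to unit length and nearly parallel (or antiparallel) to the axis of largest input variance, without any explicit renormalization step.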
Most neurons of the visual system respond only to stimulation from a narrow region within the visual field. This region is called the receptive field of that neuron. Depending on the precise position of a narrow bright spot within the receptive field the corresponding neuron can either show an increase or a decrease of the firing rate relative to its spontaneous activity at rest. The receptive field is subdivided accordingly into `ON' and `OFF' regions in order to further characterize neuronal response properties. Bright spots in an ON region increase the firing rate whereas bright spots in an OFF region inhibit the neuron.
Different neurons have different receptive fields, but as a general rule, neighboring neurons have receptive fields that `look' at about the same region of the visual field. This is what is usually called the retinotopic organization of the neuronal projections - neighboring points in the visual field are mapped to neighboring neurons of the visual system.
The visual system forms a complicated hierarchy of interconnected cortical areas where neurons show increasingly complex response properties from one layer to the next. Neurons from the lateral geniculate nucleus (LGN), which is the first neuronal relay of visual information after the retina, are characterized by so-called center-surround receptive fields. These are receptive fields that consist of two concentric parts, an ON region and an OFF region. LGN neurons come in two flavors, as ON-center and OFF-center cells. ON-center cells have an ON region in the center of their receptive field that is surrounded by a circular OFF region. In OFF-center cells the arrangement is the other way round; a central OFF region is surrounded by an ON region; cf. Fig. 11.6.
Neurons from the LGN project to the primary visual cortex (V1), which is the first cortical area involved in the processing of visual information. In this area neurons can be divided into `simple cells' and `complex cells'. In contrast to LGN neurons, simple cells have asymmetric receptive fields, which results in a selectivity with respect to the orientation of a visual stimulus. The optimal stimulus for a neuron with a receptive field such as that shown in Fig. 11.6D, for example, is a light bar tilted by about 45 degrees. Any other orientation would also stimulate the OFF region of the receptive field, leading to a reduction of the neuronal response. Complex cells have even more intriguing properties and show responses that are, for example, selective for movements with a certain velocity and direction (Hubel, 1995).
It is still a matter of debate how the response properties of simple cells arise. The original proposal by Hubel and Wiesel (1962) was that orientation selectivity is a consequence of the specific wiring between LGN and V1. Several center-surround cells with slightly shifted receptive fields should converge on a single V1 neuron so as to produce the asymmetric receptive field of simple cells. Alternatively (or additionally), the intra-cortical dynamics can generate orientation selectivity by enhancing small asymmetries in neuronal responses; cf. Section 9.1.3. In the following, we pursue the first possibility and try to understand how activity-dependent processes during development can lead to the required fine-tuning of the synaptic organization of projections from the LGN to the primary visual cortex (Miller, 1995,1994; Miller et al., 1989; Linsker, 1986c,b,a; Wimbauer et al., 1997a,b; MacKay and Miller, 1990).
We are studying a model that consists of a two-dimensional layer of cortical neurons (V1 cells) and two layers of LGN neurons, namely one layer of ON-center cells and one layer of OFF-center cells; cf. Fig. 11.7A. In each layer, neurons are labeled by their position and projections between the neurons are given as a function of their positions. Intra-cortical projections, i.e., projections between cortical neurons, are denoted by w_{V1, V1}(\mathbf{x}, \mathbf{x}'), where \mathbf{x}' and \mathbf{x} are the positions of the pre- and the postsynaptic neuron, respectively. Projections from ON-center and OFF-center LGN neurons to the cortex are denoted by w_{V1, ON}(\mathbf{x}, \mathbf{x}') and w_{V1, OFF}(\mathbf{x}, \mathbf{x}'), respectively.
In the following we are interested in the evolution of the weight distribution of projections from the LGN to the primary visual cortex. We thus take w_{V1, ON}(\mathbf{x}, \mathbf{x}') and w_{V1, OFF}(\mathbf{x}, \mathbf{x}') as the dynamic variables of the model. Intra-cortical projections are supposed to be constant and dominated by short-range excitation, e.g.,
w_{V1, V1}(\mathbf{x}, \mathbf{x}') \propto \exp( - |\mathbf{x} - \mathbf{x}'|^{2} / \sigma_{V1, V1}^{2} ) . (11.32)
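As in the previous section we consider, for the sake of simplicity, neurons with a linear gain function. The firing rate \nu_{V1}(\mathbf{x}) of a cortical neuron at position \mathbf{x} is thus given by
\nu_{V1}(\mathbf{x}) = \sum_{\mathbf{x}'} w_{V1, ON}(\mathbf{x}, \mathbf{x}') \nu_{ON}(\mathbf{x}') + \sum_{\mathbf{x}'} w_{V1, OFF}(\mathbf{x}, \mathbf{x}') \nu_{OFF}(\mathbf{x}') + \sum_{\mathbf{x}' \ne \mathbf{x}} w_{V1, V1}(\mathbf{x}, \mathbf{x}') \nu_{V1}(\mathbf{x}') ,
where \nu_{ON/OFF}(\mathbf{x}') is the firing rate of a neuron at position \mathbf{x}' in the ON/OFF layer of the LGN.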
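Due to the intra-cortical interaction the cortical activity \nu_{V1} shows up on both sides of the equation. Since this is a linear equation it can easily be solved for \nu_{V1}. To do so we write \nu_{V1}(\mathbf{x}) = \sum_{\mathbf{x}'} \delta_{\mathbf{x}, \mathbf{x}'} \nu_{V1}(\mathbf{x}'), where \delta_{\mathbf{x}, \mathbf{x}'} is the Kronecker \delta that is one for \mathbf{x} = \mathbf{x}' and vanishes otherwise. The rate equation can thus be rewritten as
\sum_{\mathbf{x}'} [ \delta_{\mathbf{x}, \mathbf{x}'} - w_{V1, V1}(\mathbf{x}, \mathbf{x}') ] \nu_{V1}(\mathbf{x}') = \sum_{\mathbf{x}'} w_{V1, ON}(\mathbf{x}, \mathbf{x}') \nu_{ON}(\mathbf{x}') + \sum_{\mathbf{x}'} w_{V1, OFF}(\mathbf{x}, \mathbf{x}') \nu_{OFF}(\mathbf{x}') ,
with the convention w_{V1, V1}(\mathbf{x}, \mathbf{x}) = 0. If we read the left-hand side as the product of the matrix M(\mathbf{x}, \mathbf{x}') \equiv \delta_{\mathbf{x}, \mathbf{x}'} - w_{V1, V1}(\mathbf{x}, \mathbf{x}') and the vector \nu_{V1}(\mathbf{x}'), we can define the inverse I of M by
\sum_{\mathbf{x}'} I(\mathbf{x}'', \mathbf{x}') M(\mathbf{x}', \mathbf{x}) = \delta_{\mathbf{x}'', \mathbf{x}} (11.33)
and solve for the cortical rate,
\nu_{V1}(\mathbf{x}) = \sum_{\mathbf{x}'} \bar{w}_{V1, ON}(\mathbf{x}, \mathbf{x}') \nu_{ON}(\mathbf{x}') + \sum_{\mathbf{x}'} \bar{w}_{V1, OFF}(\mathbf{x}, \mathbf{x}') \nu_{OFF}(\mathbf{x}') , (11.34)
which relates the LGN input to the cortical output via the `effective' weights
\bar{w}_{V1, ON/OFF}(\mathbf{x}, \mathbf{x}') \equiv \sum_{\mathbf{x}''} I(\mathbf{x}, \mathbf{x}'') w_{V1, ON/OFF}(\mathbf{x}'', \mathbf{x}') . (11.35)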
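Solving the linear rate equation by matrix inversion can be sketched numerically. The example below assumes NumPy and, for brevity, a one-dimensional arrangement of 6 cortical and 10 LGN neurons per layer; the layer sizes, the coupling strength of 0.2, and the random feedforward weights and rates are all hypothetical and serve only to illustrate the algebra.

```python
import numpy as np

rng = np.random.default_rng(5)
n_v1, n_lgn = 6, 10                        # hypothetical layer sizes (1-D for brevity)

# Short-range excitatory intra-cortical weights without self-coupling.
pos = np.arange(n_v1, dtype=float)
w_v1_v1 = 0.2 * np.exp(-(pos[:, None] - pos[None, :]) ** 2)
np.fill_diagonal(w_v1_v1, 0.0)

# Random feedforward weights and LGN rates, purely for illustration.
w_on = rng.uniform(0.0, 0.1, size=(n_v1, n_lgn))
w_off = rng.uniform(0.0, 0.1, size=(n_v1, n_lgn))
nu_on = rng.uniform(0.0, 1.0, size=n_lgn)
nu_off = rng.uniform(0.0, 1.0, size=n_lgn)

M = np.eye(n_v1) - w_v1_v1                 # M = Kronecker delta minus intra-cortical weights
I_mat = np.linalg.inv(M)                   # inverse of M
w_eff_on = I_mat @ w_on                    # effective ON weights
w_eff_off = I_mat @ w_off                  # effective OFF weights

# Cortical rates expressed through the effective weights.
nu_v1 = w_eff_on @ nu_on + w_eff_off @ nu_off

# Self-consistency check against the original implicit rate equation.
residual = nu_v1 - (w_on @ nu_on + w_off @ nu_off + w_v1_v1 @ nu_v1)
print("largest residual:", np.max(np.abs(residual)))
```

The explicit inverse mirrors the derivation in the text; if only the rates are needed, `np.linalg.solve(M, w_on @ nu_on + w_off @ nu_off)` gives the same result without forming the inverse.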
We expect that the formation of synapses between LGN and V1 is driven by correlations in the input. In the present case, these correlations are due to the retinotopic organization of projections from the retina to the LGN. Neighboring LGN neurons receiving stimulation from similar regions of the visual field are thus correlated to a higher degree than neurons that are more separated. If we assume that the activity of individual photoreceptors on the retina is uncorrelated and that each LGN neuron integrates the input from many of these receptors then the correlation of two LGN neurons can be calculated from the form of their receptive fields. For center-surround cells the correlation is a Mexican hat-shaped function of their distance (Miller, 1994; Wimbauer et al., 1997a), e.g.,
In the present formulation of the model each LGN cell can contact every neuron in the primary visual cortex. In reality, each LGN cell sends one axon to the cortex. Though this axon may split into several branches, its synaptic contacts are restricted to a small region of the cortex; cf. Fig. 11.7B. We take this limitation into account by defining an arborization function A(\mathbf{x}, \mathbf{x}') that gives the a priori probability that a connection between an LGN cell at location \mathbf{x}' and a cortical cell at \mathbf{x} is formed (Miller et al., 1989). The arborization is a rapidly decaying function of the distance, e.g.,
To describe the dynamics of the weight distribution we adopt a modified form of Hebb's learning rule that is completed by the arborization function,
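\frac{d}{dt} w_{V1, ON/OFF}(\mathbf{x}, \mathbf{x}') = \gamma A(\mathbf{x}, \mathbf{x}') \nu_{V1}(\mathbf{x}) \nu_{ON/OFF}(\mathbf{x}') . (11.37)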
If we use Eq. (11.34) and assume that learning is slow enough so that we can rely on the correlation functions to describe the evolution of the weights, we find
Expression (11.41) is still a linear equation for the weights and nothing exciting can be expected. A prerequisite for pattern formation is competition between the synaptic weights. Therefore, the above learning rule is extended by a term - w_{V1, ON/OFF}(\mathbf{x}, \mathbf{x}') \nu_{V1}(\mathbf{x})^{2} that leads to weight vector normalization and competition; cf. Oja's rule, Eq. (10.11).
Many of the standard techniques for nonlinear systems that we have already encountered in the context of neuronal pattern formation in Chapter 9 can also be applied to the present model (Wimbauer et al., 1998; MacKay and Miller, 1990). Here, however, we will just summarize some results from a computer simulation consisting of an array of 8×8 cortical neurons and two times 20×20 LGN neurons. Figure 11.8 shows a typical outcome of such a simulation. Each of the small rectangles shows the receptive field of the corresponding cortical neuron. A bright color means that the neuron responds with an increased firing rate to a bright spot at that particular position within its receptive field; dark colors indicate inhibition.
There are two interesting aspects. First, the evolution of the synaptic weights has led to asymmetric receptive fields, which give rise to orientation selectivity. Second, the structures of the receptive fields of neighboring cortical neurons are similar; neuronal response properties thus vary continuously across the cortex. The neurons are said to form a map for, e.g., orientation.
The first observation, the breaking of the symmetry of LGN receptive fields, is characteristic of all pattern formation phenomena. It results from the instability of the homogeneous initial state and the competition between individual synaptic weights. The second observation, the smooth variation of the receptive fields across the cortex, is a consequence of the excitatory intra-cortical couplings. During development, neighboring cortical neurons tend to be either simultaneously active or simultaneously quiescent, and due to the activity-dependent learning rule similar receptive fields are formed.
© Cambridge University Press
This book is in copyright. No reproduction of any part
of it may take place without the written permission
of Cambridge University Press.