Supervised learning in a single-layer neural network
Let's consider a single-layer neural network with b inputs and c
outputs:

W_{ij} = weight from input i to unit j in the output layer;
W_{j} = vector of all the weights of the jth neuron in the output layer.

I^{p} = input vector (pattern p) = (I_{1}^{p},
I_{2}^{p}, ..., I_{b}^{p}).

T^{p} = target output vector (pattern p) = (T_{1}^{p},
T_{2}^{p}, ..., T_{c}^{p}).

A^{p} = Actual output vector (pattern p) = (A_{1}^{p},
A_{2}^{p}, ..., A_{c}^{p}).

g() = sigmoid activation function: g(a) = [1 + exp(-a)]^{-1}
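As a minimal sketch of these definitions, the forward pass of the network can be written as follows. The function names `sigmoid` and `forward`, and the layout `W[i][j]` (input i, output unit j) as a list of lists, are illustrative assumptions and not part of the applet.

```python
import math

def sigmoid(a):
    """Sigmoid activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(W, I):
    """Return the actual outputs A_j = g(sum_i W[i][j] * I[i]).

    W is a b x c list of lists (W[i][j]: weight from input i to output unit j);
    I is a list of b input values.
    """
    b, c = len(W), len(W[0])
    return [sigmoid(sum(W[i][j] * I[i] for i in range(b))) for j in range(c)]

# Two inputs, one output unit, all weights zero: the output is g(0) = 0.5.
print(forward([[0.0], [0.0]], [1.0, -1.0]))  # [0.5]
```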
Supervised learning
We have seen that different weights of a neural network produce
different functions of the input. To train a network, we can present some sample inputs and compare the actual output to the desired results. The
difference is called the error.
The different learning rules tell us which way to adjust the weights to
reduce this error. We say that training has converged when this
error
reaches some small, acceptable level.
Often the learning rule takes the following form:
W_{ij}(t+1) = W_{ij}(t) + eta . err(p)
where 0 <= eta < 1 is a parameter that controls the learning rate, and
err(p) is the error when input pattern p is presented.
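The general shape of such a rule, together with a convergence test, might be sketched as below for a single weight. The function `train`, its parameters `tol` and `max_epochs`, and the stopping criterion (largest error term in an epoch below `tol`) are assumptions chosen for illustration; the applet may use a different criterion.

```python
def train(w, patterns, err, eta=0.1, tol=1e-6, max_epochs=10_000):
    """Apply w <- w + eta * err(w, x, t) pattern by pattern.

    Stops when the largest error term seen in one epoch falls below tol
    (a simple stand-in for "the error has reached an acceptable level").
    Returns the final weight and the number of epochs used.
    """
    for epoch in range(1, max_epochs + 1):
        worst = 0.0
        for x, t in patterns:
            e = err(w, x, t)
            w = w + eta * e
            worst = max(worst, abs(e))
        if worst < tol:
            return w, epoch
    return w, max_epochs

# One weight, linear unit A = w * x; the error term used here is (t - w * x) * x.
patterns = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical data following t = 2 * x
w, epochs = train(0.0, patterns, lambda w, x, t: (t - w * x) * x)
print(round(w, 3), epochs)
```

With this data the weight approaches 2, the value that reproduces both targets exactly.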
[Back
to the Adaline/Perceptron/Backprop applet page]
Adaline learning
ADALINE is an acronym for ADAptive LINear Element (or ADAptive LInear
NEuron).
It was developed by Bernard Widrow and Marcian Hoff (1960).
The adaline learning rule (also known as the least-mean-squares rule,
the delta rule, and the Widrow-Hoff rule) is a training rule that
minimises the output error using (approximate) gradient descent. After
each training pattern I^{p} is presented, the correction to apply to the
weights is proportional to the error. The correction is calculated
before the thresholding step, using the error of output unit j:
err_{j}(p) = T_{j}^{p} - W_{j} . I^{p}
Thus, the weights are adjusted by
W_{ij}(t+1) = W_{ij}(t) + eta (T_{j}^{p} - W_{j} . I^{p}) I_{i}^{p}
This corresponds to gradient descent on the quadratic error surface
E_{j} = Sum_{p} [T_{j}^{p} - W_{j} . I^{p}]^{2}
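One online pass of this rule could look like the sketch below. The function name `adaline_epoch`, the toy data set, and the use of a constant third input as a bias are assumptions made for the example only.

```python
def adaline_epoch(W, patterns, eta):
    """One online pass of the delta rule over (I, T) pairs; updates W in place.

    W[i][j] is the weight from input i to output unit j.
    Returns the summed squared error, measured before each update.
    """
    b, c = len(W), len(W[0])
    sse = 0.0
    for I, T in patterns:
        for j in range(c):
            # Linear output W_j . I, taken before any thresholding step.
            y = sum(W[i][j] * I[i] for i in range(b))
            err = T[j] - y
            sse += err * err
            for i in range(b):
                W[i][j] += eta * err * I[i]
    return sse

# Hypothetical data: learn T = I_1 - I_2, with a constant third input as bias.
patterns = [([0, 0, 1], [0]), ([0, 1, 1], [-1]), ([1, 0, 1], [1]), ([1, 1, 1], [0])]
W = [[0.0], [0.0], [0.0]]
errors = [adaline_epoch(W, patterns, eta=0.1) for _ in range(200)]
print(errors[0], errors[-1])
```

Because the targets here are an exact linear function of the inputs, the error per epoch shrinks towards zero; with noisy targets it would level off at the least-squares minimum instead.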
Perceptron learning
In perceptron learning, the weights are adjusted only when a pattern
is misclassified. The correction to the weights after applying the
training pattern p is
W_{ij}(t+1) = W_{ij}(t) + eta (T_{j}^{p} - A_{j}^{p}) I_{i}^{p}
With outputs coded as -1 and +1, this corresponds to gradient descent
on the error surface
E(W_{j}) = Sum_{misclassified p} [(W_{j} . I^{p}) A_{j}^{p}]
where the factor 2 in T_{j}^{p} - A_{j}^{p} is absorbed into eta.
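A minimal sketch of this rule for a single threshold unit follows. It assumes targets and outputs coded as -1 and +1, a threshold at zero, and a constant third input acting as the bias; `perceptron_train` and its parameters are illustrative names.

```python
def perceptron_train(patterns, b, eta=0.5, max_epochs=100):
    """Train one threshold unit with targets in {-1, +1}.

    Returns (w, epochs_used); stops after the first epoch with no mistakes.
    """
    w = [0.0] * b
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for I, T in patterns:
            a = sum(wi * xi for wi, xi in zip(w, I))
            A = 1 if a > 0 else -1  # thresholded actual output
            if A != T:              # (T - A) is zero for correct patterns
                mistakes += 1
                w = [wi + eta * (T - A) * xi for wi, xi in zip(w, I)]
        if mistakes == 0:
            return w, epoch
    return w, max_epochs

# Logical AND, with a constant third input acting as the bias.
AND = [([0, 0, 1], -1), ([0, 1, 1], -1), ([1, 0, 1], -1), ([1, 1, 1], 1)]
w, epochs = perceptron_train(AND, b=3)
print(w, epochs)
```

Since AND is linearly separable, the loop ends with an epoch in which no pattern is misclassified.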
Pocket algorithm
The perceptron learning algorithm does not terminate if the learning
set is not linearly separable. In many real-world cases, however, we
want to find the "best" linear separation even when the learning sets
are not ideal. The pocket algorithm is a modification of the perceptron
rule proposed by S. I. Gallant (1990). It stores the best weight vector
found so far in a "pocket" while continuing to learn. The pocket weights
are replaced only when a better weight vector is found.
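The idea can be sketched as follows. This variant scores every candidate on the whole training set (often described as the pocket algorithm with ratchet), whereas Gallant's original formulation tracks runs of consecutive correct classifications; the function names, the random pattern selection, and the -1/+1 coding are assumptions of this sketch.

```python
import random

def predict(w, I):
    """Thresholded output in {-1, +1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, I)) > 0 else -1

def count_correct(w, patterns):
    return sum(1 for I, T in patterns if predict(w, I) == T)

def pocket_train(patterns, b, eta=0.5, steps=1000, seed=0):
    """Perceptron updates on randomly drawn patterns, keeping the best weights seen.

    Returns (pocket_w, pocket_score), where pocket_score is the number of
    training patterns that pocket_w classifies correctly.
    """
    rng = random.Random(seed)
    w = [0.0] * b
    pocket_w, pocket_score = list(w), count_correct(w, patterns)
    for _ in range(steps):
        I, T = rng.choice(patterns)
        A = predict(w, I)
        if A != T:
            w = [wi + eta * (T - A) * xi for wi, xi in zip(w, I)]
            score = count_correct(w, patterns)
            if score > pocket_score:  # only a strictly better vector enters the pocket
                pocket_w, pocket_score = list(w), score
    return pocket_w, pocket_score

# XOR with a bias input is not linearly separable: at most 3 of 4 can be correct.
XOR = [([0, 0, 1], -1), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], -1)]
w, score = pocket_train(XOR, b=3)
print(w, score)
```

The working vector `w` keeps changing for as long as mistakes occur, but the returned pocket vector never gets worse on the training set.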
Backpropagation
The backpropagation algorithm was developed for training multilayer
perceptron networks; in this applet, we study how it works for a
single-layer network. The algorithm was popularized by Rumelhart, Hinton
and Williams (1986), although similar ideas had been developed previously
by others (Werbos, 1974; Parker, 1985). The idea is to train a network by
propagating the output errors backward through the layers. The errors
serve to evaluate the derivatives of the error function with respect to
the weights, which can then be adjusted.
The backpropagation algorithm for a single-layer network using the
sum-of-squares error function consists of two phases:

Feedforward - apply an input; evaluate the activations a_{j}
and store the error delta_{j} at each node j:
a_{j} = Sum_{i} (W_{ij}(t) I_{i}^{p})
A_{j}^{p} = g(a_{j})
delta_{j} = A_{j}^{p} - T_{j}^{p}
(For the exact gradient of the sum-of-squares error through the sigmoid
g, this difference is additionally multiplied by
g'(a_{j}) = A_{j}^{p} (1 - A_{j}^{p}).)
Backpropagation - compute the adjustments and update the weights.
Since there is just one layer, the output layer, we compute
W_{ij}(t+1) = W_{ij}(t) - eta delta_{j} I_{i}^{p}
(This is called "online" learning, because the weights are adjusted
each time a new input is presented. In "batch" learning, the weights
are adjusted after summing over all the patterns in the training set.)
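The two phases can be sketched for online learning as shown here. `backprop_step` and the `exact_gradient` flag are hypothetical names: with the flag set, delta_j includes the sigmoid derivative A_j (1 - A_j), which the exact gradient of the sum-of-squares error requires; with it unset, delta_j is the plain output error A_j - T_j.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_step(W, I, T, eta, exact_gradient=True):
    """One online update for a single pattern; modifies W in place.

    Returns the sum-of-squares error 0.5 * sum_j (A_j - T_j)^2 before the update.
    """
    b, c = len(W), len(W[0])
    # Feedforward: activations a_j, outputs A_j, and the error delta_j per node.
    a = [sum(W[i][j] * I[i] for i in range(b)) for j in range(c)]
    A = [sigmoid(aj) for aj in a]
    delta = [A[j] - T[j] for j in range(c)]
    if exact_gradient:
        # Assumption of this sketch: include g'(a_j) = A_j * (1 - A_j).
        delta = [delta[j] * A[j] * (1.0 - A[j]) for j in range(c)]
    # Backpropagation: with one layer, the update uses delta_j and I_i directly.
    for i in range(b):
        for j in range(c):
            W[i][j] -= eta * delta[j] * I[i]
    return 0.5 * sum((A[j] - T[j]) ** 2 for j in range(c))

# Hypothetical data: logical OR, with a constant third input as bias.
OR = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [1])]
W = [[0.0], [0.0], [0.0]]
history = [sum(backprop_step(W, I, T, eta=0.5) for I, T in OR) for _ in range(2000)]
print(history[0], history[-1])
```

A batch version would accumulate `eta * delta[j] * I[i]` over all patterns and apply the summed correction once per pass.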
Optimal Perceptron learning
For linearly separable problems, a perceptron can find many
different solutions.
It is therefore of interest to find the hyperplane that ensures
the maximal safety tolerance (margin).
The margins of that hyperplane touch a limited number of special
points, which define the hyperplane and are called the support vectors.
The perceptron has to determine the samples for which [...]. The
remaining samples with [...] are the support vectors sv.
[...] represents the distance between a sample and [...].
z_{-} and z_{+} represent the projections of the critical points on
the axis defined by [...].
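To make the notion of a safety margin concrete, the sketch below measures how far the closest samples lie from a given separating hyperplane. It is not the Optimal Perceptron algorithm itself; the hyperplane form w . x + bias = 0, the function names, and the data are assumptions made for illustration.

```python
import math

def margins(w, bias, samples):
    """Signed distances T * (w . x + bias) / ||w|| of labelled samples to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [T * (sum(wi * xi for wi, xi in zip(w, x)) + bias) / norm
            for x, T in samples]

def closest_samples(w, bias, samples, tol=1e-9):
    """Return (margin, list of samples lying at that minimal distance)."""
    d = margins(w, bias, samples)
    m = min(d)
    return m, [s for s, di in zip(samples, d) if di - m <= tol]

# Hypothetical 2-D data that many different lines separate.
samples = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 0.0), 1), ((4.0, 1.0), 1)]
print(closest_samples((1.0, 0.0), -2.0, samples))  # vertical line x = 2
print(closest_samples((1.0, 0.0), -1.5, samples))  # vertical line x = 1.5
```

Both lines separate the data, but the line x = 2 leaves a margin of 1 and is touched by one sample from each class, while x = 1.5 leaves only 0.5.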
Algorithm of the Optimal Perceptron:
Further reading

C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon
Press, Oxford, 1995. pp. 95-103 (adaline and perceptron); pp. 140-148
(backprop)

J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory
of Neural Computation. Addison-Wesley, Redwood City CA, 1991. pp. 89-111

R. Rojas. Neural Networks: A Systematic Introduction. Springer-Verlag,
Berlin, 1996. pp. 84-91 (perceptron learning); pp. 159-162 (backprop)