Mini-project
Assistant: Laurent Badel
Deadline: Friday 25 June 2004
Last questions: Wednesday 23 June 2004
Derivation of the backpropagation algorithm
The aim of this mini-project is to give you a feeling for a realistic
application of artificial neural networks. Keep your solution as simple as
possible (you can, for example, hard-code lots of things; we don't care).
In our experience, many students spend too much time on the user interface
and other irrelevant details.
We will not read your code, nor will we care about a demonstration of your
user-interface. Consider yourself warned!
As always in supervised learning, you start from a database
{(x^{mu}, t^{mu}); mu = 1, ..., p}.
You will train a neural network that generates output values x_{out}^{mu} which should be as close as possible to the target values t^{mu}. The task can be either the prediction of a value (the target values are real numbers) or the assignment to a class (the target values are binary numbers).
The project consists of the following steps:
1. Choose a task and download the corresponding database.
2. Preprocess the data.
3. Implement and train a backpropagation network with a momentum term.
4. Regularise the network by early stopping and by a penalty term.
5. Measure the final performance on the test set and write a report.
These steps are described below.
DELVE and the UCI Machine Learning Repository host collections of databases. Most of these databases have several thousand entries (p > 1000). You can therefore use them for realistic predictions.
An overview of the databases can be found in the DELVE Datasets Summary Table and in the UCI Machine Learning Repository Content Summary.
The general philosophy of the DELVE database is described on the DELVE Task-Array page.
For the mini-project, choose one of the following four tasks, for which DELVE or UCI provides databases. (For each database, look at the 'detailed documentation' to find out what exactly is to be done.)
Don't touch the data in the test set during training. It is reserved for the final performance measure. Think of it as data hidden to you during all of your work.
In the Delve database, results are reported as a function of the number of data points in the training set for learning and validation together. Check out the result data page for your task to learn what sizes you should use for your training set. You can find the results in the DELVE summary table.
Normally, you could use all the remaining data for testing. However, Delve often tells you how many samples to take for testing. Read the task specification which can be found in the prototask.spec file.
In the UCI database, the data pages are already split into a training set and a test set.
Preprocessing is slightly different depending on whether the attributes are real-valued or categorical.
Real-valued attributes.
Take your training set and normalise each ordinal and real-valued attribute
to zero mean and unit standard deviation.
Note that you are not allowed to look at the test set when calculating these
parameters (mean and variance)! For the test set, apply the same
preprocessing as for the training set, i.e. reuse the parameters you
calculated on the training set. Do not recalculate them by inspecting the
test set; that would be completely wrong.
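As an illustration, this is what the normalisation could look like in Python (the data values and variable names are invented for illustration; NumPy is assumed):

```python
import numpy as np

# Hypothetical training/test matrices: rows are samples, columns are
# real-valued attributes (the numbers are made up for illustration).
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# Compute mean and standard deviation on the TRAINING set only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_norm = (train - mean) / std
# Reuse the SAME parameters for the test set -- never recompute them there.
test_norm = (test - mean) / std
```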
Categorical attributes.
Map each categorical attribute to a 1-of-n coding, thus increasing the number
of attributes.
If there are missing values, fill them with reasonable replacements. For ordinal and real-valued inputs, replace missing values with zero (the mean value after normalisation) and for categorical input, add an extra bit to the 1-of-n coding to encode a missing value.
For more information about the encoding, see section 7.3 of the Delve User Manual
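A minimal sketch of the 1-of-n coding with the extra missing-value bit (the category names and the function name are invented for illustration):

```python
# A hypothetical categorical attribute with three known categories; a value
# of None stands for a missing entry.
categories = ["red", "green", "blue"]

def one_of_n(value, categories):
    """Map a categorical value to a 1-of-n code with an extra 'missing' bit."""
    code = [0] * (len(categories) + 1)
    if value is None:
        code[-1] = 1          # the extra bit flags a missing value
    else:
        code[categories.index(value)] = 1
    return code

print(one_of_n("green", categories))  # [0, 1, 0, 0]
print(one_of_n(None, categories))     # [0, 0, 0, 1]
```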
Write a program which implements the backpropagation method with momentum term as discussed in the lectures. You may use any programming language you want. Split your training set into a learning set and a validation set of comparable size for cross-validation.
If you want, you can look at the C program available at http://lslwww.epfl.ch/~aperez/NN_tutorial/Bkprop.c or at the program bpsim at http://diwww.epfl.ch/mantra/tutorial/english/mlpc/html/index.html.
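As a compact starting point, here is one possible sketch of a single backpropagation step with a momentum term for a one-hidden-layer network with tanh hidden units and a linear output (all names, sizes, and parameter values are our own illustration, not a prescribed design; note the bias unit with constant activity -1, as in the lectures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer network; the sizes are arbitrary illustration values.
n_in, n_hid = 3, 4
W1 = rng.normal(scale=0.1, size=(n_hid, n_in + 1))  # last column: thresholds
W2 = rng.normal(scale=0.1, size=(1, n_hid + 1))
V1, V2 = np.zeros_like(W1), np.zeros_like(W2)       # momentum buffers

eta, alpha = 0.05, 0.9   # learning rate and momentum parameter

def forward(x, W1, W2):
    a = np.append(x, -1.0)          # bias unit with constant activity -1
    h = np.tanh(W1 @ a)
    hb = np.append(h, -1.0)
    return a, h, hb, float(W2 @ hb)

def backprop_step(x, t):
    global W1, W2, V1, V2
    a, h, hb, y = forward(x, W1, W2)
    delta_out = y - t                         # derivative of squared error
    grad2 = delta_out * hb[None, :]
    delta_hid = (W2[0, :n_hid] * delta_out) * (1 - h**2)
    grad1 = np.outer(delta_hid, a)
    V2 = alpha * V2 - eta * grad2             # momentum update
    V1 = alpha * V1 - eta * grad1
    W2 += V2
    W1 += V1
    return 0.5 * delta_out**2

# One invented training pattern, iterated a few hundred times.
x, t = np.array([0.5, -0.2, 0.1]), 0.3
errors = [backprop_step(x, t) for _ in range(200)]
```

In a real run you would of course loop over all patterns of the learning set in each epoch rather than over a single pattern.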
TASK 1: In a situation where you have overfitting, plot the learning and validation errors as a function of time (epochs). This is the training curve and will be called plot 1.
Some hints:
Don't change the number of hidden units, but keep them at the same number as in plot 1.
Compare early stopping with a penalty-term based regularisation method such as weight decay or weight elimination.
For early stopping, you return after learning to the point on the learning curve where the generalisation error is minimal. You may want to do several runs and take the best network.
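Returning to the best point on the learning curve can be implemented by simply remembering the weights with the lowest validation error seen so far; a minimal sketch (here `net`, `train_epoch`, and `val_error` are placeholders for your own network object and routines, not a fixed API):

```python
import copy

def train_with_early_stopping(net, train_epoch, val_error, n_epochs):
    """Return the network seen at the epoch with minimal validation error."""
    best_err, best_net = float("inf"), copy.deepcopy(net)
    for epoch in range(n_epochs):
        train_epoch(net)                 # one pass over the learning set
        err = val_error(net)             # error on the validation set
        if err < best_err:               # remember the best point so far
            best_err, best_net = err, copy.deepcopy(net)
    return best_net, best_err
```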
TASK 2: plot learning and validation error as a function of time. Show at what time you freeze the weights. (plot 2)
The penalty term for weight decay is:
penalty = lambda * Sum_{i} (w_{i}^{2})
where the sum runs over all weights in all layers, but not over the thresholds. (Remember that the thresholds are treated as additional weights attached to a bias unit with constant activity x_{i} = -1. Thus in terms of the learning algorithm the thresholds are treated as further weights. They are, however, not included in the sums of the penalty term.)
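The penalty adds the term 2 * lambda * w_{i} to the gradient of each ordinary weight. A minimal sketch of one such update, assuming (as an invented convention) that the last column of the weight matrix holds the thresholds:

```python
import numpy as np

def decay_update(W, grad, eta=0.1, lam=0.01):
    """One gradient step with weight decay; the LAST column of W is assumed
    to hold the thresholds (bias weights) and is excluded from the penalty."""
    penalty_grad = 2 * lam * W
    penalty_grad[:, -1] = 0.0          # thresholds are not penalised
    return W - eta * (grad + penalty_grad)
```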
Similarly the term for weight elimination is:
penalty = lambda * Sum_{i} (w_{i}^{2} / (c + w_{i}^{2}))
Again, the sum does not include the thresholds. Take, for example, c = 1/N, where N is the total number of weights (thresholds excluded).
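For the gradient you need the derivative d/dw [w^2/(c + w^2)] = 2*c*w / (c + w^2)^2. A small sketch of the penalty and its gradient (here W is assumed to contain only the ordinary weights, with the thresholds already left out):

```python
import numpy as np

def elimination_penalty(W, lam, c):
    """Weight-elimination penalty: lambda * sum_i w_i^2 / (c + w_i^2)."""
    return lam * np.sum(W**2 / (c + W**2))

def elimination_grad(W, lam, c):
    # d/dw [w^2 / (c + w^2)] = 2*c*w / (c + w^2)^2
    return lam * 2 * c * W / (c + W**2) ** 2
```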
For penalty-term regularisation, optimise the value of the penalty parameter lambda. For both methods, you should use cross-validation (i.e. you train on the learning set and evaluate the quality of the result on the validation set). After you have found the optimal lambda, retrain your network several times and retain the best one.
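The lambda optimisation can be as simple as a grid search over candidate values, with a few restarts per value; a hypothetical sketch in which `train` and `validation_error` stand in for your own routines:

```python
import math

def select_lambda(train, validation_error, lambdas, n_restarts=3):
    """Pick the lambda (and network) with the lowest validation error."""
    best = (math.inf, None, None)      # (validation error, lambda, network)
    for lam in lambdas:
        for _ in range(n_restarts):    # several runs per candidate lambda
            net = train(lam)           # train on the learning set only
            err = validation_error(net)
            if err < best[0]:
                best = (err, lam, net)
    return best
```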
TASK 3: plot learning and validation error as a function of lambda (plot 3a). In an additional graph (plot 3b), plot the errors versus time for the optimal lambda you found. Add comments on all graphs plotted so far.
Some hints for the lambda optimisation:
You should have three networks now: the overfitted network from Task 1, the early-stopping network from Task 2, and the penalty-term network from Task 3.
This is the point of no return. Once you look at the test set, you are not allowed to change your networks. If you're ready, you can now pass all samples of the test set to measure the performance of your solutions.
TASK 4: Report the performance error of your networks and compare them with the performance error of other methods as reported in the DELVE data base.
To download the DELVE results sheet, click in the column 'view results' of the DELVE summary table. On this sheet, the bars indicate the normalised error (see below) for different methods. The numbers at the top indicate the training set size (learning and validation together). You can ignore the boxes at the bottom.
For final error measurements, DELVE uses the following procedure: For real and ordinal targets, the loss function is the squared difference between the predicted and the true target value. The baseline method is the prediction of the mean value of the test set.
For categorical targets, the loss is zero if the prediction is correct and one if it is incorrect. The baseline method is to always predict the class with the largest number of members in the test set.
Thus, the normalised expected loss is:
prediction tasks:
E = Sum_{mu} [t^{mu} - x_{out}^{mu}]^{2} / Sum_{mu} [t^{mu} - <t>]^{2}, mu = 1, ..., p,
where <t> is the mean of the target values.
classification tasks:
E = Sum_{mu} |t^{mu} - sgn(x_{out}^{mu})| / Sum_{mu} |t^{mu} - t_{max}|,
where t_{max} is the target value of the biggest class.
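These two formulas translate directly into code; a small sketch (the function names are ours, and the classification version assumes -1/+1-coded targets):

```python
import numpy as np

def normalised_prediction_error(t, y):
    """Squared error normalised by the error of predicting the mean target."""
    return np.sum((t - y) ** 2) / np.sum((t - t.mean()) ** 2)

def normalised_classification_error(t, y, t_max):
    """0/1 loss normalised by always predicting the biggest class t_max
    (targets t and outputs y are assumed to be -1/+1 coded)."""
    return np.sum(np.abs(t - np.sign(y))) / np.sum(np.abs(t - t_max))
```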
For more information about DELVE's error measurements, see sections 5.2 (loss functions) and 8.3 (loss and normalisation) of the Delve User Manual
TASK 5: Compare the performance of the two regularisation methods that you have chosen. Which one is better for your data set?
In the report you should:
Please respect the following rules: