EPFL
Artificial Neural Networks, Summer term 2003-2004
Mini-project

Assistant: Laurent Badel

Deadline: Friday 25 June 2004

Last questions: Wednesday 23 June 2004

Derivation of the backpropagation algorithm

The aim of this mini-project is to give you a feeling for a realistic application of artificial neural networks. Try to stay as simple as possible (you can, for example, hard-code lots of things; we don't care). From our experience, many students spend too much time on the user interface and other non-relevant things.
We will not read your code, nor will we care about a demonstration of your user-interface. Consider yourself warned!



As always in supervised learning, you start from a database

{ (x^mu, t^mu) ;  mu = 1, ..., p }.

You will train a neural network that generates output values x_out^mu, which should be as close as possible to the real target values t^mu. The task can either be prediction of a value (the target values are real numbers) or assignment to a class (the target values are binary numbers).

The project consists of the following steps:

  1. Choose a database
  2. Split the data into a training and a test set
  3. Preprocess your data
  4. Train a neural network using the backpropagation algorithm
  5. Minimise the generalisation error using two regularisation methods.
  6. Measure the performance on the test set
  7. Write a short report (2 - 3 pages maximum).

These steps are described below.


  1. Choose a database:

    DELVE and the UCI Machine Learning Repository host a collection of databases. Most of these databases have several thousand entries (p > 1000). You can therefore use them for realistic predictions.

    An overview of the databases can be found in the DELVE Datasets Summary Table and in the UCI Machine Learning Repository Content Summary.

    The general philosophy of the DELVE database is described on the DELVE Task-Array page.

    For the mini-project, choose one of the following four tasks, for which DELVE or UCI provides databases. (For each of the databases, look at the 'detailed documentation' to find out what exactly is to be done.)


  2. Split the data set into two subsets:
    1. Training set (for learning and validation)
    2. Test set (for prediction and final error measure)

    Don't touch the data in the test set during training. It is reserved for the final performance measure. Think of it as data hidden to you during all of your work.

    In the Delve database, results are reported as a function of the number of data points in the training set for learning and validation together. Check out the result data page for your task to learn what sizes you should use for your training set. You can find the results in the DELVE summary table.

    Normally, you could use all the remaining data for testing. However, Delve often tells you how many samples to take for testing. Read the task specification which can be found in the prototask.spec file.

    In the UCI database, the data pages are already split into a training set and a test set.
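    If you have to make the split yourself (e.g. for a DELVE data set), it takes only a few lines. The Python sketch below is only an illustration: the function name and data layout are our own choices, and the values of n_train and n_test must come from the DELVE task specification, not be chosen freely.

```python
import random

def split_data(samples, n_train, n_test, seed=0):
    """Shuffle and split a list of samples into a training set
    (learning + validation) and a held-out test set.

    n_train and n_test should follow the sizes specified by DELVE
    (see the prototask.spec file)."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    return train, test

# toy usage: 10 samples, 6 for training/validation, 4 held out for testing
data = list(range(10))
train, test = split_data(data, 6, 4)
```

    Once the test set is written out, do not look at it again until the final performance measurement.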


  3. Preprocess your data:

    Preprocessing is slightly different depending on whether attributes are real-valued or categorical.

    1. Real-valued attributes.
      Take your training set and normalise each ordinal and real-valued attribute to zero mean and unit standard deviation. Note that you are not allowed to look at the test set to calculate these parameters (mean and variance)! To the test set you apply the same preprocessing as to the training set, i.e. you reuse the parameters calculated on the training set. Do not recalculate them by inspecting the test set; that would be completely wrong.

    2. Categorical attributes.
      Map each categorical attribute to a 1-of-n coding, thus increasing the number of attributes.

    3. If there are missing values, fill them with reasonable replacements. For ordinal and real-valued inputs, replace missing values with zero (the mean value after normalisation) and for categorical input, add an extra bit to the 1-of-n coding to encode a missing value.

    For more information about the encoding, see section 7.3 of the Delve User Manual
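    As an illustration of the rules above, here is a minimal Python sketch (the function names are our own, and the data layout, a list of rows, is an assumption). Note that the normalisation parameters are computed on the training set only and then reused for the test set:

```python
import math

def normalisation_params(train_rows, col):
    """Mean and standard deviation of one real-valued column,
    computed on the training set ONLY."""
    values = [row[col] for row in train_rows]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def normalise(rows, col, mean, std):
    """Apply (x - mean) / std with parameters from the training set."""
    for row in rows:
        row[col] = (row[col] - mean) / std

def one_of_n(value, categories):
    """1-of-n coding; an extra last bit flags a missing value (None)."""
    code = [0] * (len(categories) + 1)
    if value is None:
        code[-1] = 1
    else:
        code[categories.index(value)] = 1
    return code

train = [[2.0], [4.0], [6.0]]
test = [[4.0]]
m, s = normalisation_params(train, 0)
normalise(train, 0, m, s)
normalise(test, 0, m, s)   # same parameters; never recomputed on the test set
```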


  4. Train a neural network to predict (guess) the output value(s):

    Write a program which implements the backpropagation method with momentum term as discussed in the lectures. You may use any programming language you want. Split your training set into a learning set and a validation set of comparable size for cross-validation.
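    A minimal sketch of such a program is given below, in Python rather than C. It implements online backpropagation with a momentum term for one hidden layer of tanh units and a single linear output; all names, the learning rate and the momentum parameter are our own choices, and the toy data set is only there to show the training loop.

```python
import math
import random

class MLP:
    """Minimal one-hidden-layer network trained by online backpropagation
    with a momentum term. Hidden units use tanh; the output is linear.
    Thresholds are treated as weights attached to a bias unit with
    constant activity -1, as in the lectures."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # one extra column per unit for the bias weight (threshold)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.dw1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw2 = [0.0] * (n_hidden + 1)

    def forward(self, x):
        self.xb = x + [-1.0]                       # append bias activity
        self.h = [math.tanh(sum(w * xi for w, xi in zip(row, self.xb)))
                  for row in self.w1]
        self.hb = self.h + [-1.0]
        self.out = sum(w * hi for w, hi in zip(self.w2, self.hb))
        return self.out

    def backward(self, target, eta=0.02, alpha=0.9):
        # delta of the linear output unit for squared error
        delta_out = self.out - target
        # hidden deltas must use the OLD output weights; tanh' = 1 - tanh^2
        deltas_h = [delta_out * self.w2[j] * (1.0 - self.h[j] ** 2)
                    for j in range(len(self.h))]
        for j in range(len(self.w2)):
            step = -eta * delta_out * self.hb[j] + alpha * self.dw2[j]
            self.dw2[j] = step                     # momentum: remember step
            self.w2[j] += step
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                step = -eta * deltas_h[j] * self.xb[i] + alpha * self.dw1[j][i]
                self.dw1[j][i] = step
                row[i] += step

# toy usage: learn y = x on three points
net = MLP(n_in=1, n_hidden=4)
data = [([-1.0], -1.0), ([0.0], 0.0), ([1.0], 1.0)]
for epoch in range(2000):
    for x, t in data:
        net.forward(x)
        net.backward(t)
err = sum((net.forward(x) - t) ** 2 for x, t in data)
```

    For the project you would, of course, loop over your learning set, and after each epoch measure the error on both the learning set and the validation set to obtain the training curve of plot 1.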

    If you want you can look at the C program which is available at http://lslwww.epfl.ch/~aperez/NN_tutorial/Bkprop.c or at the program bpsim at http://diwww.epfl.ch/mantra/tutorial/english/mlpc/html/index.html.

    TASK 1: in a situation where you have overfitting, plot learning and validation error as a function of time (epochs). This is the training curve and will be called plot 1.

    Some hints:


  5. Minimise the generalisation error using two regularisation methods:

    Don't change the number of hidden units; keep them at the same number as in plot 1.

    Compare early stopping with a penalty-term based regularisation method such as weight decay or weight elimination.

    1. Early stopping :

      For early stopping, after learning you return to the point on the learning curve where the generalisation error is minimal. You may want to do several runs and take the best network.

      TASK 2: plot learning and validation error as a function of time. Show at what time you freeze the weights. (plot 2)
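      The bookkeeping for early stopping can be written generically. In the Python sketch below, train_fn and val_error_fn are hypothetical stand-ins for your own one-epoch training routine and validation-error measurement; the loop simply remembers the weights at the validation minimum and restores them at the end:

```python
import copy

def train_early_stopping(net, train_fn, val_error_fn, max_epochs=1000):
    """Train epoch by epoch, remember the network state with the lowest
    validation error, and return that snapshot together with the epoch
    at which the weights were frozen."""
    best_state, best_epoch, best_err = None, 0, float('inf')
    for epoch in range(max_epochs):
        train_fn(net)                         # one epoch on the learning set
        err = val_error_fn(net)               # error on the validation set
        if err < best_err:
            best_state = copy.deepcopy(net)   # snapshot of the best weights
            best_epoch, best_err = epoch, err
    return best_state, best_epoch, best_err

# toy usage: a dummy "network" whose validation error dips at epoch 2
class Dummy:
    def __init__(self):
        self.epoch = -1

errors = [5.0, 3.0, 1.0, 2.0, 4.0]
best, best_epoch, best_err = train_early_stopping(
    Dummy(),
    lambda net: setattr(net, 'epoch', net.epoch + 1),
    lambda net: errors[net.epoch],
    max_epochs=5)
```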

    2. Penalty methods :

      The penalty term for weight decay is:

      penalty = lambda * Sum_i ( w_i^2 )

      where the sum runs over all weights in all layers, but not over the thresholds. (Remember that the thresholds are treated as additional weights attached to a bias unit with constant activity x = -1. Thus, as far as the learning algorithm is concerned, the thresholds are treated as further weights. They are, however, not included in the sums of the penalty term.)

      Similarly, the penalty term for weight elimination is:

      penalty = lambda * Sum_i ( w_i^2 / (c + w_i^2) )

      Again, the sum does not include the thresholds. Take, for example, c = 1/N, where N is the total number of weights (thresholds excluded).
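      Both penalty terms and their gradients are easy to implement. A small Python sketch (function names our own; pass only the ordinary weights, not the thresholds). During training you subtract eta times the penalty gradient from each weight, in addition to the usual backpropagation step:

```python
def weight_decay(weights, lam):
    """Weight-decay penalty lambda * sum w^2 and its gradient 2*lambda*w.
    The thresholds are excluded: pass only the ordinary weights."""
    penalty = lam * sum(w * w for w in weights)
    grad = [2.0 * lam * w for w in weights]
    return penalty, grad

def weight_elimination(weights, lam, c):
    """Weight-elimination penalty lambda * sum w^2 / (c + w^2)
    and its gradient 2*lambda*c*w / (c + w^2)^2."""
    penalty = lam * sum(w * w / (c + w * w) for w in weights)
    grad = [2.0 * lam * c * w / (c + w * w) ** 2 for w in weights]
    return penalty, grad
```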

      For penalty-term regularisation, optimise the value of the regularisation parameter lambda. For both methods, you should use cross-validation (i.e. you train on the learning set and evaluate the quality of the result on the validation set). After you have found the optimal lambda, retrain your network several times and retain the best one.

      TASK 3: plot learning and validation error as a function of lambda (plot 3a). In an additional graph (plot 3b), plot the errors versus time for the optimal lambda you found. Add comments on all graphs plotted so far.

      some hints for lambda optimisation:

      • Typically, the optimal lambda is small (e.g. 0.0001). First scan a wide range at low resolution; then increase the resolution within the promising range to find the optimal lambda.
      • For each value of lambda, you have to run the whole training run until convergence. Then you measure the learning and validation error; this gives one point in the error vs. lambda plot (i.e. the last point on the training curve).
      • You may want to average the errors over several runs for a fixed lambda.
      • If your training set is small (e.g. 128 samples), you may get better (more stable) results using leave-one-out cross-validation: take each sample in turn as the validation set, then average the validation error over all your measurements.
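      The coarse-to-fine lambda search from the hints can be sketched as follows (Python; train_and_validate is a hypothetical stand-in for your own routine that trains until convergence and returns the final validation error):

```python
def best_lambda(train_and_validate, candidates, n_runs=3):
    """Coarse search over lambda: average the validation error over
    several runs per candidate and return the minimiser together with
    the whole error table (useful for plot 3a). A finer grid around
    the winner can then be searched in exactly the same way."""
    table = {}
    for lam in candidates:
        errs = [train_and_validate(lam) for _ in range(n_runs)]
        table[lam] = sum(errs) / len(errs)
    return min(table, key=table.get), table

# toy usage: a fake validation error whose minimum lies at lambda = 0.001
best, table = best_lambda(lambda lam: (lam - 0.001) ** 2,
                          [0.0001, 0.001, 0.01, 0.1], n_runs=1)
```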


  6. Measure the performance on the test set:

    You should have three networks now.

    This is the point of no return. Once you look at the test set, you are not allowed to change your networks. When you are ready, pass all samples of the test set through your networks to measure the performance of your solutions.

    TASK 4: Report the performance error of your networks and compare them with the performance error of other methods as reported in the DELVE database.

    To download the DELVE results sheet, click in the column 'view results' of the DELVE summary table. On this sheet, the bars indicate the normalised error (see below) for different methods. The numbers at the top indicate the training set size (learning and validation together). You can ignore the boxes at the bottom.

    For final error measurements, DELVE uses the following procedure:
    1. Choose the evaluation function appropriate for your task. (The evaluation function in classification tasks is different from the one in real-valued prediction tasks.) This function takes two arguments: the predicted value of a method (x_out^mu) and the true value of the target (t^mu). The evaluation function represents the error when predicting one example; it is also called the "loss".
    2. Estimate the expected value of the loss. This is done by taking the mean loss over all samples in the test set.
    3. If you predict several targets (eg. when categorising into multiple classes), the loss is the sum of the losses over all targets (outputs).
    4. Normalise the expected loss. You do that by dividing your loss by a "standard loss" which comes from a baseline method with almost pre-specified performance. Don't worry, it only sounds complicated.

    For real and ordinal attributes, the loss function is the square of the difference of the predicted and the real target value. The baseline method is the prediction of the mean value of the test set.

    For categorical attributes, the loss is zero if the prediction is correct and one if it is incorrect. The baseline method is to always predict the class with the highest number of members in the test set.

    Thus, the normalised expected loss evaluates to:

    prediction tasks: E = ( Sum_mu [t^mu - x_out^mu]^2 ) / ( Sum_mu [t^mu - <t>]^2 ) ;  mu = 1, ..., p ;  where <t> is the mean of the target values.
    classification tasks: E = ( Sum_mu |t^mu - sgn[x_out^mu]| ) / ( Sum_mu |t^mu - t_max| ) ;  where t_max is the target value of the biggest class.
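    The two normalised errors can be computed directly from these formulas. A Python sketch, assuming targets coded as +1/-1 in the classification case (function names our own):

```python
def normalised_prediction_error(targets, outputs):
    """Squared-error loss normalised by the loss of predicting the
    mean target of the test set (the DELVE baseline)."""
    mean_t = sum(targets) / len(targets)
    loss = sum((t - y) ** 2 for t, y in zip(targets, outputs))
    baseline = sum((t - mean_t) ** 2 for t in targets)
    return loss / baseline

def sgn(y):
    return 1.0 if y >= 0 else -1.0

def normalised_classification_error(targets, outputs, t_max):
    """0/1 loss (targets coded +1/-1) normalised by the loss of always
    predicting the biggest class t_max."""
    loss = sum(abs(t - sgn(y)) for t, y in zip(targets, outputs))
    baseline = sum(abs(t - t_max) for t in targets)
    return loss / baseline
```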

    For more information about DELVE's error measurements, see sections 5.2 (loss functions) and 8.3 (loss and normalisation) of the Delve User Manual

    TASK 5: Compare the performance of the two regularisation methods that you have chosen. Which one is better for your data set?


  7. Write a short report (2 - 3 pages maximum):

    In the report you should:

    Please respect the following rules:

    Thank you.

[Neural Java home page]
This page is: http://diwww.epfl.ch/mantra/tutorial/english/project.html
Last updated: 03-April-03
Maintainer: Laurent Badel