MANTRA EPFL

Modern Approaches to Machine Learning 
Graduate School in Computer Science
Winter term 2003/2004
Exercises: Mini-project

Assistant: Denis Sheynikhovich

The aim of this mini-project is to give you a feeling for a realistic application of artificial neural networks. Try to keep things as simple as possible (you can, for example, hard-code lots of things; we don't care). From our experience, many students spend too much time on the user interface and other irrelevant things.
We will not read your code, nor will we care about a demonstration of your user interface. Consider yourself warned!

As always in supervised learning, you start from a database

{(x^μ, t^μ);  μ = 1, ..., p}

You will use a neural network that generates output values x_out^μ, which should be as close as possible to the target values t^μ.

The project consists of the following steps:

  1. Choose a database
  2. Preprocess your data
  3. Compare different learning algorithms on your dataset
  4. Write a short report (3 - 4 pages maximum)
These steps are described below.
  1. Choose a database:

    UCI's Machine Learning page hosts a collection of databases. The number of examples varies from around one hundred to several thousand. Most of the databases are natural ones, so you will be dealing with real-world data!

    An overview of the databases can be found here.

    For the Mini-project, you will be assigned one of the databases listed in the following table. Look at the documentation to find out where the data comes from, what the attributes mean and what range their values lie in.
     

    Data set                              # attributes    # classes   # examples     missing   # attr.   perf. (%)
                                          discr.   cont.              train   test   data
    banana                                   -        2        2        803      -                        88.8 - 93.4
    wisconsin breast cancer                  -        9        2        699      -     yes        10      94.1 - 96.1
    wisconsin breast cancer (diagnostic)     -       30        2        569      -                        97.5
    dermatology                              -       34        6        366      -     yes        34      92.1 - 95.7
    ecoli                                    -        7        8        336      -                        92.1 - 95.7
    glass                                    -        9        6        214      -                        56.9 - 65.4
    heart (cleveland)                        -       13        2        303      -     yes        13      77.5 - 81.9
    ionosphere                               -       34        2        351      -                        85.6 - 93.9
    iris                                     -        4        3        150      -                        92.9 - 96.9
    optical digit                            -       64       10       3823      -                        93.1 - 95.9
    pen                                      -       16       10       7494      -                        96.6 - 99.2
    pima indians diabetes                    -        8        2        768      -                        72.5 - 74.6
    satimage                                 -       36        6       4435   2000                        84.9 - 88.9
    image segmentation                       -       19        7       2310      -                        89.1 - 92.6
    sonar                                    -       60        2        208      -                        63.3 - 81.6
    tic-tac-toe                              -        9        2        958      -                        76.7 - 90.0
    soybean                                 35        -       19        683      -     yes       134      85.3 - 92.9
    vowel                                    -       10       11        990      -                        72.7 - 88.0
    waveform                                 -       21        3        600      -                        74.3 - 84.1
    waveform+noise                           -       40        3        600      -                        66.7 - 80.6
    Selected Databases and their properties (Thanks to Perry Moerland).
    For those datasets with no fixed training/test split, you should divide the whole set into three parts for training, validation, and testing, taking the class distributions into account (a sketch of such a stratified split follows below).
    # attr. gives the number of attributes after the preprocessing of missing data.
    The last column shows the range of correct-classification percentages obtained by Perry Moerland with different EM algorithms.
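
    A minimal sketch of such a stratified split, assuming NumPy and a dataset already loaded into arrays X (inputs) and t (class labels); the variable names and split fractions are hypothetical, not prescribed:

    import numpy as np

    rng = np.random.default_rng(0)

    def stratified_split(X, t, fractions=(0.5, 0.25, 0.25)):
        # Split (X, t) into train/validation/test parts, preserving the
        # class proportions of the full dataset in each part.
        parts = [[], [], []]
        for c in np.unique(t):
            idx = rng.permutation(np.where(t == c)[0])
            n_train = int(fractions[0] * len(idx))
            n_val = int(fractions[1] * len(idx))
            parts[0].append(idx[:n_train])
            parts[1].append(idx[n_train:n_train + n_val])
            parts[2].append(idx[n_train + n_val:])
        return [(X[np.concatenate(p)], t[np.concatenate(p)]) for p in parts]

    (X_train, t_train), (X_val, t_val), (X_test, t_test) = stratified_split(X, t)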

  2. Preprocess your data


    Normalize each ordinal and real-valued attribute to zero mean and unit standard deviation. Note that you are not allowed to look at the test set to calculate these parameters (mean and variance)! For the test set, apply the same preprocessing as to the training set, i.e. use the same parameters you calculated before; recalculating them by inspecting the test set would be completely wrong.

    Finally, missing values have to be filled with reasonable replacements. For ordinal and real-valued inputs, replace missing values with zero (the mean value after normalization).
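
    A minimal sketch of this preprocessing, continuing from the arrays of the split sketch above and assuming missing values are encoded as NaN (the actual encoding differs per database, so check the documentation):

    import numpy as np

    # Compute the normalization parameters on the training set ONLY.
    mean = np.nanmean(X_train, axis=0)
    std = np.nanstd(X_train, axis=0)
    std[std == 0.0] = 1.0                     # guard against constant attributes

    def preprocess(X):
        # Normalize with the TRAINING parameters, then fill missing values
        # with zero, which is the mean after normalization.
        X = (X - mean) / std
        return np.where(np.isnan(X), 0.0, X)

    X_train, X_val, X_test = preprocess(X_train), preprocess(X_val), preprocess(X_test)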


  3. Compare different learning algorithms on your dataset


    Write a computer program implementing the following learning methods (minimal sketches of all three are given after the note below):
     
    1. Back-propagation with early stopping and one of the penalty-based regularization methods.
    2. Expectation-Maximization (EM) algorithm.
    3. Support Vector Machines (Kernel-Adatron algorithm*).

    *Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann Publishers.
    !! NOTE !!: THERE ARE ERRORS IN SOME OF THESE PAPERS: the correct algorithm can be found here [pdf].
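
    To fix ideas, here are minimal sketches of the three methods for a two-class problem, continuing from the preprocessed arrays above; all network sizes, learning rates and other constants are hypothetical starting points, not recommendations.

    For method 1, a one-hidden-layer network with tanh hidden units and a sigmoid output, trained by gradient descent with a weight-decay (L2) penalty; early stopping is implemented in its simplest form by keeping the weights that do best on the validation set (t_train and t_val are assumed to be 0/1 labels):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = X_train.shape[1], 10
    W1 = rng.normal(0.0, 0.1, (n_in, n_hid))
    W2 = rng.normal(0.0, 0.1, (n_hid, 1))
    eta, lam = 0.1, 1e-4                     # learning rate, weight-decay strength

    def forward(X):
        h = np.tanh(X @ W1)                  # hidden activations
        y = 1.0 / (1.0 + np.exp(-(h @ W2)))  # sigmoid output in (0, 1)
        return h, y

    best_err, best_weights = np.inf, None
    for epoch in range(2000):
        h, y = forward(X_train)
        # Gradients of the cross-entropy error with a sigmoid output unit.
        delta2 = y - t_train.reshape(-1, 1)
        delta1 = (delta2 @ W2.T) * (1.0 - h**2)
        W2 -= eta * (h.T @ delta2 / len(X_train) + lam * W2)
        W1 -= eta * (X_train.T @ delta1 / len(X_train) + lam * W1)
        # Early stopping: monitor the validation error, keep the best weights.
        err = np.mean((forward(X_val)[1] > 0.5).ravel() != t_val)
        if err < best_err:
            best_err, best_weights = err, (W1.copy(), W2.copy())
    W1, W2 = best_weights

    For method 2, one EM-trained mixture of spherical Gaussians can be fitted per class, and a new input classified via Bayes' rule from the class-conditional densities; a sketch of EM for the examples X_c of a single class, with a hypothetical number of components K:

    def em_gaussian_mixture(X_c, K=3, n_iter=100):
        n, d = X_c.shape
        pi = np.full(K, 1.0 / K)                    # mixing proportions
        mu = X_c[rng.choice(n, K, replace=False)]   # means, initialized from the data
        var = np.full(K, X_c.var())                 # one variance per component
        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] = P(component k | x_i).
            d2 = ((X_c[:, None, :] - mu[None, :, :])**2).sum(axis=2)
            logp = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - d2 / (2 * var)
            r = np.exp(logp - logp.max(axis=1, keepdims=True))
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate the parameters from the responsibilities.
            nk = r.sum(axis=0)
            pi = nk / n
            mu = (r.T @ X_c) / nk[:, None]
            d2 = ((X_c[:, None, :] - mu[None, :, :])**2).sum(axis=2)
            var = (r * d2).sum(axis=0) / (d * nk)
        return pi, mu, var

    For method 3, a batch variant of the basic Kernel-Adatron update with a Gaussian kernel and no bias term, for labels y_train in {-1, +1} (e.g. y_train = 2*t_train - 1); kernel width, learning rate and iteration count are hypothetical, and you should follow the corrected algorithm linked in the note above for details such as the stopping criterion:

    def rbf_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :])**2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma**2))

    def kernel_adatron(X, y, eta=0.1, n_iter=500):
        K = rbf_kernel(X, X)
        alpha = np.zeros(len(X))
        for _ in range(n_iter):
            z = K @ (alpha * y)                                # outputs on the training points
            alpha = np.maximum(0.0, alpha + eta * (1.0 - y * z))
        return alpha

    alpha = kernel_adatron(X_train, y_train)
    pred = np.sign(rbf_kernel(X_test, X_train) @ (alpha * y_train))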


    You may choose any programming language you like. Compare the performance of these algorithms on your database.


    The references to additional literature that may help you are given on the course page.


  4. Write a short report (3 - 4 pages maximum).


    In the report you should state what you did in each of the steps above and which results you obtained. In addition to writing a report, you will also be asked to present the results in a short seminar talk.

Last updated: 03-Dec-03 by Denis Sheynikhovich