Modern Approaches to Machine Learning
Graduate School in Computer Science
Winter term 2003/2004
Exercises: Mini-project
Assistant: Denis Sheynikhovich
The aim of this mini-project is to give you a feel for a realistic
application of artificial neural networks. Keep things as simple as possible
(you can, for example, hard-code lots of things; we don't care). In our experience,
many students spend too much time on the user interface and other irrelevant
things.
We will not read your code, nor will we care about a demonstration of your
user interface. Consider yourself warned!
As always in supervised learning, you start from a database
$\{(x^\mu, t^\mu);\ \mu = 1, \dots, p\}$.
You will use a neural network that generates output values $x_{out}^\mu$,
which should be as close as possible to the target values $t^\mu$.
The project consists of the following steps:
- Choose a database
- Preprocess your data
- Compare different learning algorithms on your dataset
- Write a short report (3 - 4 pages maximum)
These steps are described below.
- Choose a database:
UCI's Machine Learning page
hosts a collection of databases. The number of examples ranges from around
100 to several thousand. Most of the databases are natural ones, so you will
be dealing with real-world data!
An overview of the databases can be found here.
For the Mini-project, you will be assigned one of the databases listed
in the following table. Look at the documentation to find out where the data
comes from, what the attributes mean and which range their values are in.
| Data set | # attributes (discr.) | # attributes (cont.) | # classes | # examples (train) | # examples (test) | missing data | # attr. | perf. |
|---|---|---|---|---|---|---|---|---|
| banana | - | 2 | 2 | 803 | - | | | 88.8 - 93.4 |
| wisconsin breast cancer | - | 9 | 2 | 699 | - | yes | 10 | 94.1 - 96.1 |
| wisconsin breast cancer (diagnostic) | - | 30 | 2 | 569 | - | | | 97.5 |
| dermatology | - | 34 | 6 | 366 | - | yes | 34 | 92.1 - 95.7 |
| ecoli | - | 7 | 8 | 336 | - | | | 92.1 - 95.7 |
| glass | - | 9 | 6 | 214 | - | | | 56.9 - 65.4 |
| heart (cleveland) | - | 13 | 2 | 303 | - | yes | 13 | 77.5 - 81.9 |
| ionosphere | - | 34 | 2 | 351 | - | | | 85.6 - 93.9 |
| iris | - | 4 | 3 | 150 | - | | | 92.9 - 96.9 |
| optical digit | - | 64 | 10 | 3823 | - | | | 93.1 - 95.9 |
| pen | - | 16 | 10 | 7494 | - | | | 96.6 - 99.2 |
| pima indians diabetes | - | 8 | 2 | 768 | - | | | 72.5 - 74.6 |
| satimage | - | 36 | 6 | 4435 | 2000 | | | 84.9 - 88.9 |
| image segmentation | - | 19 | 7 | 2310 | - | | | 89.1 - 92.6 |
| sonar | - | 60 | 2 | 208 | - | | | 63.3 - 81.6 |
| tic-tac-toe | - | 9 | 2 | 958 | - | | | 76.7 - 90.0 |
| soybean | 35 | - | 19 | 683 | - | yes | 134 | 85.3 - 92.9 |
| vowel | - | 10 | 11 | 990 | - | | | 72.7 - 88.0 |
| waveform | - | 21 | 3 | 600 | - | | | 74.3 - 84.1 |
| waveform+noise | - | 40 | 3 | 600 | - | | | 66.7 - 80.6 |
Selected Databases and their properties
(Thanks to Perry Moerland).
For those datasets with no fixed training/test set size, you should
divide the whole set into three parts for training, validation, and testing,
taking the class distributions into account (each part should contain the
classes in roughly the same proportions as the whole set).
"# attr." indicates the number of attributes after preprocessing missing data.
The last column ("perf.") shows the range of the percentage of correct
classifications obtained by Perry Moerland for different EM algorithms.
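A three-way split that respects the class distributions could be sketched as follows. This is only an illustration: the function name `stratified_split`, the 60/20/20 proportions, and the fixed random seed are assumptions, not course requirements.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists with similar class proportions.

    The 60/20/20 fractions are an illustrative assumption.
    """
    rng = random.Random(seed)
    # Group example indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # split each class separately, in random order
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains of this class
    return train, validation, test

# Toy labels: three classes of different sizes.
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because each class is split on its own, a rare class cannot end up entirely in one of the three parts by accident.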
- Preprocess your data
Normalize each ordinal and real-valued attribute to zero mean and unit
standard deviation. Note that you are not allowed to look at the test
set to calculate these parameters (mean and variance)! For the test set,
apply the same preprocessing as for the training set, i.e. reuse the
parameters you calculated before. Do not recalculate them by inspecting
the test set; this would be completely wrong!
Finally, missing values have to be filled with reasonable replacements.
For ordinal and real-valued inputs, replace missing values with zero (the
mean value after normalization).
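The preprocessing described above could look roughly like this. It is a minimal sketch: the function names `fit_normalizer` and `apply_normalizer`, and the use of `None` to mark a missing value, are assumptions made for the example.

```python
import math

def fit_normalizer(train_rows):
    """Compute per-attribute mean and standard deviation from the training rows only.

    Missing values are represented as None (an assumption) and are skipped.
    """
    n_attributes = len(train_rows[0])
    means, stds = [], []
    for j in range(n_attributes):
        values = [row[j] for row in train_rows if row[j] is not None]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(variance) or 1.0)  # avoid dividing by zero for constant attributes
    return means, stds

def apply_normalizer(rows, means, stds):
    """Normalize rows with the stored parameters; missing values become 0.0."""
    return [
        [0.0 if v is None else (v - m) / s for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

train = [[1.0, 10.0], [3.0, None], [5.0, 30.0]]
test = [[3.0, 20.0], [None, 40.0]]

means, stds = fit_normalizer(train)            # parameters come from the training set only
train_n = apply_normalizer(train, means, stds)
test_n = apply_normalizer(test, means, stds)   # same parameters reused for the test set
```

The test set never enters `fit_normalizer`; it is only passed through `apply_normalizer` with the parameters estimated on the training set.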
- Compare different learning algorithms on your dataset
Write a computer program implementing the following learning methods:
- Back-propagation with early stopping and one of the penalty-based
regularization methods.
- Expectation-Maximization (EM) algorithm.
- Support Vector Machines (Kernel-Adatron algorithm*).
*Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel adatron
algorithm: a fast and simple learning procedure for support vector machines.
In Proceedings of the 15th International Conference on Machine Learning.
Morgan Kaufmann Publishers.
Note: there are errors in some of these papers. The
correct algorithm can be found here [pdf].
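For orientation only, here is a sketch of the commonly cited Kernel-Adatron update in its simplest form (no bias term, hard margin). It is not a substitute for the corrected algorithm linked above, which takes precedence wherever the two differ. The Gaussian kernel, its width `sigma`, the learning rate `eta`, the number of epochs, and all function names are assumptions chosen for the example.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel; the width sigma is an illustrative choice."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_adatron(xs, ys, kernel=rbf_kernel, eta=0.5, epochs=200):
    """Train the multipliers alpha with clipped gradient steps (no bias term).

    xs: list of input vectors, ys: labels in {-1, +1}.
    """
    n = len(xs)
    gram = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            z = sum(alpha[j] * ys[j] * gram[i][j] for j in range(n))
            # Push the margin y_i * z towards 1, but keep alpha_i non-negative.
            alpha[i] = max(0.0, alpha[i] + eta * (1.0 - ys[i] * z))
    return alpha

def predict(x, xs, ys, alpha, kernel=rbf_kernel):
    """Classify x by the sign of the kernel expansion."""
    z = sum(a * y * kernel(x, xi) for a, y, xi in zip(alpha, ys, xs))
    return 1 if z >= 0 else -1

# XOR: not linearly separable in input space, separable with the Gaussian kernel.
xs = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
ys = [-1, -1, 1, 1]
alpha = kernel_adatron(xs, ys)
predictions = [predict(x, xs, ys, alpha) for x in xs]
```

For a multi-class database you would additionally need a scheme such as one classifier per class; that choice is left open here.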
You may choose any programming language you like. Compare the performance
of these algorithms on your database.
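As an illustration of the first method, the sketch below combines on-line back-propagation with weight decay (one possible penalty-based regularizer) and early stopping on a validation set. Everything specific in it is an assumption: one hidden layer of three tanh units, a single linear output, the learning rate `eta`, the decay constant, the `patience` of 20 epochs, and the toy data. For simplicity the decay is also applied to the bias weights.

```python
import math
import random

def forward(x, w1, w2):
    """One hidden tanh layer, one linear output; the last weight of each row is the bias."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w1]
    output = sum(w * h for w, h in zip(w2, hidden + [1.0]))
    return hidden, output

def mean_squared_error(data, w1, w2):
    return sum((forward(x, w1, w2)[1] - t) ** 2 for x, t in data) / len(data)

def train(train_set, val_set, n_hidden=3, eta=0.05, decay=1e-4,
          max_epochs=500, patience=20, seed=0):
    rng = random.Random(seed)
    n_in = len(train_set[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    # Remember the weights with the lowest validation error seen so far.
    best = (mean_squared_error(val_set, w1, w2), [r[:] for r in w1], w2[:])
    history, epochs_since_best = [], 0
    for _ in range(max_epochs):
        for x, t in train_set:
            hidden, output = forward(x, w1, w2)
            delta_out = output - t  # gradient of 0.5 * (output - t)^2
            # Back-propagate through the tanh units, using w2 before it is updated.
            delta_hidden = [delta_out * w2[k] * (1.0 - hidden[k] ** 2)
                            for k in range(n_hidden)]
            for k, h in enumerate(hidden + [1.0]):
                w2[k] -= eta * (delta_out * h + decay * w2[k])
            for k in range(n_hidden):
                for j, v in enumerate(x + [1.0]):
                    w1[k][j] -= eta * (delta_hidden[k] * v + decay * w1[k][j])
        val_error = mean_squared_error(val_set, w1, w2)
        history.append((mean_squared_error(train_set, w1, w2), val_error))
        if val_error < best[0]:
            best, epochs_since_best = (val_error, [r[:] for r in w1], w2[:]), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # early stopping
                break
    return best, history

# Toy two-class problem: the target is the sign of x1 * x2.
rng = random.Random(1)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(60)]
data = [(p, 1.0 if p[0] * p[1] > 0 else -1.0) for p in points]
(best_val_error, best_w1, best_w2), history = train(data[:40], data[40:])
```

The `history` list holds one (training error, validation error) pair per epoch, which is the raw material for a learning graph.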
The references to additional literature that may help you are given on the
course page.
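To make the EM item more concrete, here is a sketch of EM for a mixture of Gaussians with a spherical covariance constraint (one variance per component). How the mixture is turned into a classifier, for instance one mixture per class, is not specified on this page and is left open. The function names, the initial means, the number of iterations, and the toy data are assumptions made for the example.

```python
import math
import random

def log_gauss(x, mean, var):
    """Log density of a spherical Gaussian with variance var in every dimension."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * var) + sq_dist / var)

def em_spherical(data, means, n_iter=30):
    """Fit a mixture of spherical Gaussians, starting from the given means."""
    n, d, k = len(data), len(data[0]), len(means)
    means = [m[:] for m in means]
    variances, priors = [1.0] * k, [1.0 / k] * k
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        resp, total = [], 0.0
        for x in data:
            logs = [math.log(priors[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)  # log-sum-exp trick for numerical stability
            norm = top + math.log(sum(math.exp(l - top) for l in logs))
            resp.append([math.exp(l - norm) for l in logs])
            total += norm
        log_likelihoods.append(total)
        # M-step: re-estimate priors, means, and one variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = [sum(r[j] * x[i] for r, x in zip(resp, data)) / nj
                        for i in range(d)]
            sq = sum(r[j] * sum((xi - mi) ** 2 for xi, mi in zip(x, means[j]))
                     for r, x in zip(resp, data))
            variances[j] = sq / (d * nj)
            priors[j] = nj / n
    return means, variances, priors, log_likelihoods

# Toy data: two well-separated clusters in two dimensions.
rng = random.Random(0)
data = ([[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(50)]
        + [[rng.gauss(4.0, 0.5), rng.gauss(4.0, 0.5)] for _ in range(50)])
means, variances, priors, log_likelihoods = em_spherical(data, [[1.0, 0.0], [3.0, 3.0]])
```

A diagonal constraint would instead keep one variance per component and per dimension. On real data you would also want to guard against a component collapsing onto a single point, which this sketch does not do.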
- Write a short report (3 - 4 pages maximum).
In the report you should state:
- The database you have chosen.
- The network structure
- for Backprop, the number of nodes/layers
- for EM: the number of Gaussians and type of constraint (e.g. spherical,
diagonal)
- for Support Vector Machines: the number and type of kernels
- Show learning graphs (error on the training set and on the validation set
as a function of training time) for all algorithms (Backprop, EM, SVM).
- State your results.
In addition to writing a report, you will also be asked to present the results
in a short seminar talk.
Project schedule
- The project will be described and the databases will be assigned after
the lecture on the 3rd of December 2003
- The first question session will be held on the 17th of December 2003, after the class.
- There will be another session in order to follow the development of the
project. It will take place in the regular classroom at 10:00 on the 14th
of January 2004.
- Apart from this session you can, of course, come to my office (AAB122)
at any time you want, or drop me an email.
- The date of the oral exams will be assigned nearer the time.
Last updated: 03-Dec-03 by Denis Sheynikhovich