Computational Learning - Project #2
Submitted by: Chaim Linhart
July 1998
Abstract
In this project we construct radial basis function (RBF) networks for classifying seismic data according to source region and type. We apply both supervised and hybrid (two-stage) training techniques to train ensembles of networks, and discuss the influence of the various parameters on their performance. We also compare the results with those achieved by regular feedforward networks (the MLPs from project #1). The data mapping produced by unsupervised clustering of RBFs, under various preprocessing procedures, is discussed in detail.
1. Introduction

In project #1 we presented the problem of classifying seismic data according to the source region (one of six regions) or type (quarry explosion or earthquake). The structure of the input data was described, as well as the preprocessing procedures applied to it. The MLPs we constructed achieved reasonable results, and we analyzed the influence of the various parameters (preprocessing steps, size of network, number of training iterations) on them.
Sections 2 and 3 of this paper give a short description of the input data and the preprocessing procedures we apply to it. We then construct leave-one-out ensembles of RBF networks using supervised training, run the test data through them, and analyze the results. This is described in section 4, where we also explain, as in the previous project, what influences the results and how. In section 5 we repeat the analysis for RBFs with hybrid training; the emphasis there is on interpreting the data mapping achieved by the kernels for different preprocessing steps. Finally, a short summary is given in section 6.
2. The Input Data

The "homework\data" directory contains 53 text files. Each file is a recording of one seismic event at one of three stations: Amirim, Parod, and Shefer. 39 recordings (three recordings of each of 13 events) comprise the training sample set, and the other 14 are our test sample set, as shown in table 1.
Region      | Quarry/Earthquake | Distance (km) | Training sets  | Test samples
1: Kadarim  | Quarry            | 3             | 2, 10          | 1, 14
2: Amiad    | Quarry            | 9             | 4              | 12
3: Galilee  | Earthquake        | 17            | 11             |
4: Golan    | Earthquake        | 25            | 5, 7           | 6, 13
5: Yehiam   | Quarry            | 25            | 8b, 12, 14, 15 |
6: Hanaton  | Quarry            | 26            | 1, 3, 13       | 3, 7, 8, 11
7: Kinneret | Earthquake        | 20, 50        |                | 2, 9, 10
8: Lebanon  | Quarry            | 70            |                | 5
8: Lebanon  | Earthquake        | 70            |                | 4
Table 1 - The training and test data (each training set includes 3 recordings: "Amirim_xx.txt", "Parod_xx.txt", and "Shefer_xx.txt"; each test sample is one recording: "test_xx.txt").
In order to evaluate the performance of our networks, we will give each classification a score. For source-type classification, this is simply 0 for a wrong result and 1 for a correct one. For the source-region problem, we wish to distinguish "close" answers from totally "far" ones. Therefore, an earthquake from Galilee that is misclassified as a Golan event, or vice versa, earns the network half a point. The same goes for Kadarim and Amiad (the close quarries); Yehiam and Hanaton (the far quarries); and the Kinneret events when classified as Galilee, Golan, or "unknown" (all three results are reasonable, since this site does not appear in the training data, and it is rather similar to regions 3 and 4). A special bonus of 0.7 is given for the Lebanon events when classified as "unknown", as this site is far from all six regions in the training data.
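The scoring rule above can be written compactly. The following is a sketch in Python (the project's own code is in Matlab); region numbers follow table 1, and the networks answer a region 1-6 or "unknown":

```python
# A sketch of the scoring rule (Python; the project's code is Matlab).
# Region numbers follow table 1; a network answers a region 1-6 or 'U'.
HALF_POINT_PAIRS = {frozenset({1, 2}),   # Kadarim <-> Amiad (close quarries)
                    frozenset({3, 4}),   # Galilee <-> Golan
                    frozenset({5, 6})}   # Yehiam  <-> Hanaton (far quarries)

def score_one(true_region, answer):
    if true_region == 7:                  # Kinneret: similar to regions 3, 4
        return 0.5 if answer in (3, 4, 'U') else 0.0
    if true_region == 8:                  # Lebanon: far from all six regions
        return 0.7 if answer == 'U' else 0.0
    if answer == true_region:
        return 1.0
    if answer != 'U' and frozenset((true_region, answer)) in HALF_POINT_PAIRS:
        return 0.5
    return 0.0
```

Summing this rule over the 14 test samples (with the true regions taken from table 1) reproduces the per-row scores reported in tables 2 and 7 below.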
The seismic recordings have already been (partially) preprocessed for us, and are given as sonograms: spectral images (matrices of "pixels") of the events. Each spectral image contains 60*11 energy values, covering a duration of 60 seconds and a log-frequency scale from 25 Hz down to 0.5 Hz (top-down). Energy is scaled by log2 and coded in ASCII ("a"=1, "b"=2, "c"=3, ...; "." means 0, i.e., no detectable signal energy). The first 5 lines of each file contain additional comments, such as the recording station, the time, the target values for classification, and more. Each event starts at the 10th sample, i.e., with a 9-second noise pre-run.
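Reading such a file can be sketched as follows (Python here; the project's own code is Matlab). The assumption that each text row after the comment lines is one frequency band is illustrative, not stated explicitly above:

```python
# Decode the ASCII energy coding: '.' -> 0, 'a' -> 1, 'b' -> 2, ...
def decode_char(c):
    return 0 if c == '.' else ord(c) - ord('a') + 1

# Read one recording into a matrix of log2-energies. We assume (an
# illustrative detail) that each text row after the 5 comment lines is
# one frequency band, top-down from 25 Hz.
def read_sonogram(lines):
    return [[decode_char(c) for c in row.strip()]
            for row in lines[5:] if row.strip()]
```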
Which factors affect the sonogram, and how?
Earthquakes and explosions produce different seismic patterns, and our networks can utilize this fact to classify the source type. Event duration depends on the distance and the magnitude of the explosion/earthquake: far events give longer seismograms (the position of the S onset is a linear function of the distance). The distance and the magnitude also determine the recorded energies: the energy is proportional to 1/d, where d is the distance of the event, and to m², where m is its magnitude.
We will consider these properties in the preprocessing stage, and perform transformations on the data to achieve the required invariance.
3. Preprocessing

Preprocessing refers to the manipulation of the data before it enters the neural network. It is a very important phase in the development of a NN solution, and often has a crucial effect on its performance, i.e., the percentage of misclassifications on test or real data, the size of the network needed to achieve it, etc.
As mentioned in the previous section, some preprocessing has already been applied to the original data, to transform the (almost) continuous seismograms into discrete spectral images with fixed translation (all events start at the 10th second of the sonograms). However, it is not enough.
Since we want our classifications to be independent of the magnitude of the events, we apply a magnitude normalization to our input. This is achieved by simply subtracting the maximal energy in a sonogram from all other nonzero energies (recall that this is done in logarithmic scale, so we are actually dividing the energies so that the largest peak in all seismograms has the same amplitude).
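This step can be sketched as (Python; the project's code is Matlab):

```python
import numpy as np

# Sketch of the magnitude normalization: subtract the maximal log2-energy
# from every nonzero entry, so the largest peak of every sonogram ends up
# at the same level; zero ("no signal") pixels are left untouched.
def normalize_magnitude(sonogram):
    s = np.asarray(sonogram, dtype=float)
    return np.where(s > 0, s - s.max(), 0.0)
```

After this step the nonzero energies are at most 0; the later rescaling to [0,1] brings them back into a positive range.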
When classifying by source type, we also apply a distance normalization to the input, by "stretching" each sonogram according to the distance of the corresponding event. Thus, our networks work on data that is distance-invariant, and as a result close events are classified the same way as far ones.
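One way such a stretching could look is sketched below (Python). The linear resampling, the reference distance, and the convention that nearer events are stretched more are all illustrative assumptions; the report does not specify these details:

```python
import numpy as np

# Hypothetical sketch of the distance normalization: resample each
# frequency band along the time axis by a factor proportional to the
# event's distance. The reference distance (ref_km) and the exact
# stretching law are illustrative assumptions, not taken from the report.
def stretch_time(sonogram, distance_km, ref_km=70.0):
    s = np.asarray(sonogram, dtype=float)        # rows: frequency bands
    factor = ref_km / distance_km                # near events stretched more
    t_in = np.arange(s.shape[1], dtype=float)
    t_out = np.clip(np.linspace(0.0, t_in[-1] / factor, s.shape[1]),
                    0.0, t_in[-1])
    return np.vstack([np.interp(t_out, t_in, band) for band in s])
```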
In addition, we shall always rescale the input to [0,1], since the NN implementation we use works better when the data is given in this range.
Another important aspect of preprocessing is reducing the dimensionality of the input, which makes the network's task easier, since its efforts are focused on the most informative features. Furthermore, smaller networks converge faster and are less vulnerable to overfitting. Dimensionality reduction can be performed in many ways: for instance, by applying a Fourier transform and selecting the strongest frequencies, or by taking the locations of the maximal energies in each sonogram as the features, etc. Here, we shall use the averaging technique introduced in the first project, which captures the general pattern of the sonogram. The idea is to average rectangular areas in the spectral image, i.e., to calculate a smaller spectral image in which each pixel is the average of a corresponding submatrix of the original image. The averaged rectangles (submatrices) are smaller in the left part of the image, since the first seconds of the seismogram contain more information and tend to be more variable.
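The averaging step can be sketched as (Python; the project's code is Matlab; the edge positions below are supplied by the caller, and the exact edges used in the project are not reproduced here):

```python
import numpy as np

# Sketch of the averaging step: each output pixel is the mean of a
# rectangular submatrix of the sonogram. Narrower column rectangles at
# the left keep more detail in the first seconds of the event.
def block_average(sonogram, row_edges, col_edges):
    s = np.asarray(sonogram, dtype=float)
    return np.array([[s[r0:r1, c0:c1].mean()
                      for c0, c1 in zip(col_edges, col_edges[1:])]
                     for r0, r1 in zip(row_edges, row_edges[1:])])
```

For example, col_edges = [0, 3, 8, 15, 25, 40, 60] (illustrative values) would average 3-second blocks at the start of the sonogram and 20-second blocks at its end.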
Further dimensionality reduction is achieved by discarding the first 9 seconds in each sonogram, which contain only noise, i.e. no information about the seismic event itself (note that one may want to leave this data as a "noise reference" for the networks).
Finally, we perform a principal components analysis (PCA) to project the n-dimensional input vectors onto the subspace spanned by the d eigenvectors that correspond to the d largest eigenvalues (where d < n). In other words, the input vectors are described by a new (smaller) set of coordinates that preserves much of their variability (and, hopefully, the information stored in them).
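This projection can be sketched as (Python; the project's code is Matlab):

```python
import numpy as np

# Sketch of the PCA step: keep the d eigenvectors of the sample
# covariance matrix with the largest eigenvalues, and express every
# input vector in those coordinates.
def pca_project(X, d):
    Xc = X - X.mean(axis=0)                    # center the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]
    return Xc @ top
```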
Using these feature selection techniques, the input dimension is reduced from 660 (11x60) to an arbitrarily small dimension, usually in the range 2 to 10. Though we could experiment with many different imaginative preprocessing procedures, as we did in the previous project, this is not the goal here. We shall apply the same preprocessing steps as in the first project, where we trained two-layer perceptron networks (MLPs) for the same classification tasks. This will enable us to compare the performance of our RBFs with that of the MLPs.
A more detailed explanation of the preprocessing routines, along with various variants of them and examples, can be found in project #1.
4. Supervised Training

Supervised training refers to the iterative process of updating the network's parameters so that its actual output fits the predefined target values. In other words, the error on the given training set is brought to a local minimum. The most common method for updating the network is the delta rule, which can be extended to multilayered networks using backpropagation. Faster and more reliable convergence can be achieved using various optimization methods, such as conjugate gradient descent. In project #1 we used this technique to train our MLPs for the source-region and source-type classification problems.
The same approach can be applied to RBFs. We iteratively update the network's parameters until it converges to a solution. In each iteration, the second-layer multiplicative weights are adjusted as in the MLPs; the error is then backpropagated, and the first-layer weights (the Gaussians' centers and covariances) are updated in a similar way. The exact formulas are obtained by differentiating the error function, usually sum-of-squares, with respect to each parameter.
It can be shown that the average sum-of-squares error of a given estimator f over a given data set D is a sum of two terms: the first term depends on neither f nor D, and the second is the sum of the squared bias and the variance of the estimator f. Minimizing the error therefore translates to minimizing bias² + variance. When using ensembles of networks (i.e., averaging the output of several estimators), the variance decreases as the size of the ensemble grows. The best results are achieved when the estimators, or experts, are independent.
Constructing an ensemble with relatively small interdependency can be accomplished using the leave-one-out procedure. In each iteration, the three training samples from the same seismic event (which are highly correlated) are left out, and a network is trained on the remaining 36 recordings. Thus, we get 13 networks, each trained on a different subset of the training sample set. Combining these experts will hopefully give a "smart" ensemble, as explained above (we'll call this the "leave-one-out ensemble"). In our case of classification, as opposed to probability density estimation for instance, it does not always make sense to average the networks' outputs. For example, if half of the experts classify an input as source region "1", and the rest classify it as "3", then we clearly do not want the final answer to be "2". Instead, we apply a simple vote: the output value that received the largest number of votes is the final classification result of the ensemble. In order to deal with problematic votes, we decide that if no output value received at least N/2+1 votes (where N is the size of the ensemble, 13 in our case), the final answer is "unknown".
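The voting rule can be sketched as (Python; the project's code is Matlab):

```python
from collections import Counter

# Sketch of the ensemble vote: the most frequent answer wins, but only
# with a strict majority of at least N/2 + 1 of the N votes; otherwise
# the ensemble answers 'U' ("unknown").
def ensemble_vote(outputs):
    answer, votes = Counter(outputs).most_common(1)[0]
    return answer if votes >= len(outputs) // 2 + 1 else 'U'
```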
First, let us deal with the source-region classification problem. Figure 1 shows the average scores of ensembles with 4, 6, and 8 RBFs, for various numbers of selected features (150 training iterations were used). The networks with 6 kernels performed slightly better than the others, giving a score of 8.7 (which is 67% of the maximal score) when 4 features were selected in the preprocessing stage. An ensemble of MLPs with 6 hidden units yields less satisfying classifications, with an average score of 7.5. It is clear from the figure, though, that supervised RBF networks do not cope well with high-dimensional input spaces. With 10 features, the RBF networks usually receive a score of less than 5, while the corresponding MLP ensembles reach 7. The problem with supervised RBFs is that the Gaussians sometimes "explode", i.e., their variance grows too much. This causes a small number of kernels to "take over", thus degrading the performance. Fortunately, in our case 4 features seem to be enough for a reasonable classification.
Table 2 shows typical results of the leave-one-out ensembles, with 4, 6, and 8 RBFs, utilizing 4 features. The first row contains the correct classification, whose score is 12.9. This is the most we can expect from our networks.
Figure 1 - Average scores of leave-one-out ensembles of RBF networks, using supervised training, with 4 (blue line), 6 (red), and 8 (green) RBFs.
   #     N |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 |    S
 Correct   |  1  7  6  8  8  4  6  6  7  7  6  2  4  1 | 12.9
   1     4 |  1  4  5  5  5  4  5  5  U  U  5  4  4  1 |  7.5
   2     6 |  1  4  5  U  5  4  5  6  3  U  5  4  4  1 |  8.7
   3     8 |  1  4  5  3  3  4  5  6  3  3  5  4  4  1 |  8
Table 2 - Typical source-region classifications (of the 14 test samples) by ensembles of RBF networks with N kernels (utilizing 4 input features), using supervised training. The first column numbers the ensembles; the first row is the correct classification. "S" stands for "score", "U" for "unknown".
In order to improve our solution, we can combine several ensembles into one giant ensemble, whose final output is determined by a democratic vote. However, since the ensembles are highly dependent, we usually do not get better results. This can be solved by adding zero-mean Gaussian noise to the training samples. Thus, each ensemble learns a slightly different data set, and the ensembles are therefore less correlated. Of course, the noise shouldn't be too "loud". Usually a small noise, with a standard deviation of about 1, did improve the overall performance, especially for the networks that used more input features. We will discuss a more detailed example in the next section. The main shortcoming of such an "ensemble of ensembles" is that it is computationally heavy.
When constructing an ensemble of ensembles, one might wonder if a democratic vote is indeed the best policy. Perhaps a weighted vote is better. For instance, we may give each ensemble a weight proportional to its score (meaning we prefer the opinion of smart experts), or according to its variability (each ensemble consists of several networks, and if they can't make up their minds, why should we listen to them?). These approaches were not researched in this paper, mainly because of the enormous amount of computer time they require (also, a larger test set would be needed).
An additional improvement to our networks may be to bound the variances of the radial kernels, so that they won't "explode". This can be done, for instance, by adding a term proportional to the variances to the overall error. Differentiating the new error function forces the gradient descent to choose solutions with smaller kernel variances, which are usually better. This may enable us to work with more input features, i.e., more information.
The source-type classification problem is much easier. A leave-one-out ensemble that consists of networks with 2 RBFs and 2 input features receives a score of 12 (out of 14). The most common errors are on "test_12" (a very difficult test sample), and on "test_4" and "test_5" (both events are from Lebanon, a region that is much farther away than those in the training data). This score was also achieved by the MLPs in project #1.
Matlab code:
The Matlab function "seismonet" constructs ensembles for source-region or source-type classification, trains them on the training sample set, and gives the classification results for the test data. Each ensemble is built using the leave-one-out procedure, and is composed of networks with user-defined characteristics. The user controls the type of networks (RBFs with supervised or hybrid training, or standard MLPs), and the numbers of hidden units, training iterations, and features to select. Gaussian noise can be added to the training data to increase the independence between the ensembles. For more details, type 'help seismonet', or browse the code.
5. Hybrid Training

Hybrid training is composed of two separate stages. In the first stage, the kernels' parameters (the Gaussians' centers and covariances) are determined using an unsupervised training approach. Then, the second-layer multiplicative weights are trained using the regular supervised technique.
The first stage can be performed with the EM algorithm, for instance, which is basically a probability-relaxation technique for optimizing interdependent parameters. In each iteration, the new prior class probabilities are calculated from the current posterior probabilities; then, the Gaussians' centers and variances are calculated using the new priors; finally, new posteriors are extracted from these new parameters, and are used in the next iteration, and so on. Usually, this process converges after a relatively small number of iterations (fewer than 30 in our case).
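This loop can be sketched as follows (Python; the project's code is Matlab). Spherical Gaussians and a farthest-point initialization are illustrative simplifications; full covariances follow the same E-step/M-step pattern:

```python
import numpy as np

# Sketch of the EM loop described above, for a Gaussian mixture with
# spherical covariances (an illustrative simplification).
def em_gmm(X, k, iters=30):
    n, dim = X.shape
    # farthest-point initialization of the centers (an illustrative choice)
    idx = [0]
    for _ in range(k - 1):
        d = ((X[:, None, :] - X[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(d.argmax()))
    centers = X[idx].copy()
    variances = np.full(k, X.var() + 1e-6)
    priors = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior probability of each kernel for each sample
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)       # (n, k)
        log_p = (np.log(priors) - 0.5 * dim * np.log(variances)
                 - 0.5 * d2 / variances)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        post = p / p.sum(axis=1, keepdims=True)
        # M-step: new priors, then centers, then variances
        nk = post.sum(axis=0)
        priors = nk / n
        centers = (post.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        variances = (post * d2).sum(axis=0) / (dim * nk) + 1e-9
    return centers, variances, priors, post.argmax(axis=1)
```

The returned hard assignment (argmax of the posteriors) is what the cluster tables below report.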
The second stage is equivalent to solving a system of linear equations, in which the second-layer weights are the unknowns, the hidden units' outputs are their coefficients, and the target output values form the right-hand side. To avoid numerical problems, the pseudo-inverse technique can be applied, which yields a least-squares solution.
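This stage can be sketched as (Python; the project's code is Matlab; the bias column is an illustrative addition):

```python
import numpy as np

# Sketch of the second stage: with the kernels fixed, each hidden unit's
# response to the training inputs forms one column of a design matrix Phi;
# the output weights are the least-squares solution of Phi @ W = T,
# obtained via the pseudo-inverse.
def rbf_design_matrix(X, centers, variances):
    d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
    Phi = np.exp(-0.5 * d2 / variances)
    return np.hstack([Phi, np.ones((len(X), 1))])   # bias column (assumed)

def train_output_weights(X, T, centers, variances):
    return np.linalg.pinv(rbf_design_matrix(X, centers, variances)) @ T
```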
The hybrid training approach has several advantages. It is much faster than the supervised method, and is easier to interpret (as we shall demonstrate). It is especially useful when labelled data is in short supply, since the first stage can be performed on all the data (not only the labelled part).
The hidden units in an RBF network are actually the component densities of a Gaussian mixture model (GMM), and the second-layer weights are their mixing coefficients. The first-stage unsupervised training determines the Gaussians, thus partitioning the data into multidimensional radial clusters. The second stage of the training sets the mixing coefficients, in order to map the Gaussians (actually, the distances of the input vector from the Gaussians) to valid output values. Hybrid training of RBF networks can therefore be studied by simply examining the clustering it forms in various cases.
Table 3 describes the typical clustering when using two RBFs, for three preprocessing procedures. When only feature selection is applied to the data (by default, we select 15 features), we get a magnitude/distance clustering: all the recordings in cluster #1 are of nearby quarry blasts (the Kadarim and Amiad regions), with a maximal magnitude of at least 20. All of the remaining events are weaker (except "test_8", which has a magnitude of 20, but is from a quarry farther away) and more distant (except "test_1", which is also from Kadarim, but is weaker, with a magnitude of 18). So basically, the data is clustered according to the magnitude of the recordings (further examination reveals that the "boundary" cases of "test_8" and "Parod_10", which have a magnitude of 20, are indeed farthest from the clusters' centers).
The reason is simply that this is the most obvious clustering of the data in our feature space: the two aforementioned clusters turn out to be the most distinct partition of the given input.
When applying a magnitude-invariance normalization, we get regional clusters: one cluster contains all recordings from the Yehiam and Hanaton regions, except for "Shefer_8b" (which is somewhat of an outlier), and the other cluster consists of the rest of the data. After normalizing the input according to the magnitudes, this is the most obvious way to separate the data into two subsets. The reason is that the recordings from Yehiam and Hanaton are very similar; in fact, the classification networks tend to mix them up. The centers of the two RBFs thus formed are very close in all coordinates but one, which implies that a simple half-plane could probably yield the same partition. It is important to emphasize that the clustering varies, and is highly influenced by the number of features we select in the preprocessing phase. For instance, when selecting 10 features instead of 15, the data is partitioned into a group with training sets 5 and 11 (the Galilee and Golan earthquakes, excluding set 7) and the test samples from the Kinneret and Lebanon, versus all the rest.
Using both magnitude and distance normalizations yields a simple clustering of earthquakes vs. quarry blasts, with the exception of test samples 1, 4, and 12, which are included in the earthquake cluster. These are reasonable "errors"; in particular, "test_12" is a very problematic recording. Fortunately, this intuitive partition according to source type is also the most obvious one the EM algorithm finds, once magnitude and distance differences are ignored.
Pre-processing             | Cluster #1                                                                          | Cluster #2
1. None                    | Training sets: 2, 4, 10; Test samples: 14                                           | All the rest
2. Magnitude normalization | Training sets: 2, 4, 5, 7, 10, 11, Shefer_8b; Test samples: 1, 2, 4, 5, 6, 9, 10, 12-14 | All the rest
3. Magnitude and distance  | Training sets: 5, 7, 11; Test samples: 1, 2, 4, 5, 6, 9, 10, 12, 13                | All the rest
Table 3 - Typical clusterings of 2 RBFs.
Tables 4, 5, and 6 show typical clusterings for 4, 6, and 8 RBFs, respectively. Basically, the same phenomena we have just observed repeat here, with additional refinements. In each of the preprocessing cases, the two clusters formed with 2 RBFs are broken into subclusters when using 4 Gaussians: each additional kernel divides one of the old clusters into two subgroups. The same applies when increasing the number of RBFs to 6, and finally to 8. For example, when no normalization is performed in the preprocessing, cluster #2 in table 3 is split into 3 clusters in table 4 (clusters 1, 2, and 4), and cluster 4 in table 4 is further broken into clusters 1, 2, and 6 in table 5.
Another example: when magnitude and distance normalizations are applied, cluster #2 in table 3, which consists of the quarry recordings, is split into 3 clusters when two RBFs are added. The first roughly includes the blasts from Kadarim and Amiad (cluster 3 in table 4); the second is composed of only one recording (cluster 1 - a good example of how a kernel might collapse onto a single input point, the opposite problem of the kernel "explosions" we witnessed in the supervised training); and the third includes mainly the blasts from Yehiam and Hanaton. This shows that after the primary discrimination (quarry vs. earthquake) was detected by two RBFs, a "fine-tuning" was performed by the two additional kernels; in this case, the quarry blasts were divided according to their source region (close vs. far). Using more RBFs results in smaller clusters that are closer to one another, and their interpretation becomes less obvious.
Pre-processing            | Cluster #1                                 | Cluster #2                                        | Cluster #3                   | Cluster #4
1. None                   | Training: 7; Test: 1, 2, 13                | Training: 5, 11; Test: 6, 10, 12                  | Training: 2, 4, 10; Test: 14 | All the rest
2. Magnitude              | Training: 5, 7, S8b; Test: 1, 2, 6, 12, 13 | Training: 11; Test: 4, 5, 9, 10                   | Training: 2, 4, 10; Test: 14 | All the rest
3. Magnitude and distance | Training: P3; Test: -                      | Training: 1, A3, S3, A8b, P8b, 12-15; Test: 8, 11 | Training: 2, 4, 10; Test: 3, 7, 14 | All the rest
Table 4 - Typical clusterings of 4 RBFs ("A" stands for Amirim, "P" for Parod, "S" for Shefer).
Pre-processing            | Cluster #1                 | Cluster #2                    | Cluster #3                          | Cluster #4                                                     | Cluster #5                         | Cluster #6
1. None                   | Tr: -; Te: 4, 5, 9         | Tr: S12, S14; Te: 3, 7, 8     | Tr: 7; Te: 1, 2, 13                 | Tr: 5, 11; Te: 6, 10, 12                                       | Tr: 2, 4, 10; Te: 14               | All the rest
2. Magnitude              | Tr: 7; Te: 13              | Tr: 5, S11; Te: 2, 6, 12      | Tr: A11, P11; Te: 4, 5, 9, 10       | Tr: S1, 3, 8b, A12, S12, 13, A14, P14, A15, P15; Te: -         | Tr: 2, A4, A10, S10; Te: 1, 14     | All the rest
3. Magnitude and distance | Tr: P11, S11; Te: 2, 6, 12 | Tr: P3; Te: -                 | Tr: P8b; Te: -                      | Tr: 5, 7, S8b, A11; Te: 1, 4, 5, 9, 10, 13                     | Tr: 2, 4, 10; Te: 3, 7, 14         | All the rest
Table 5 - Typical clusterings of 6 RBFs ("A" stands for Amirim, "P" for Parod, "S" for Shefer; "Tr" training sets, "Te" test samples).
Pre-processing            | C #1          | C #2                   | C #3                                 | C #4                                                          | C #5                                       | C #6                  | C #7                          | C #8
1. None                   | Tr: P3; Te: - | Tr: -; Te: 4, 5        | Tr: P4, 10; Te: 14                   | Tr: 2, A4, S4; Te: -                                          | Tr: 7; Te: 1, 2, 9, 13                     | Tr: 5, 11; Te: 6, 10, 12 | Tr: S12, S14; Te: 3, 7, 8  | All the rest
2. Magnitude              | Tr: P3; Te: - | Tr: A7, P7; Te: 13     | Tr: 2, A4, A10, S10; Te: 1, 14       | Tr: A3, A8b, P8b, A12, S12, A13, P13, A14, P14; Te: -         | Tr: 5, S7, S11; Te: 2, 6, 12               | Tr: P4, S4, S8b; Te: - | Tr: A11, P11; Te: 4, 5, 9, 10 | All the rest
3. Magnitude and distance | Tr: P8b; Te: - | Tr: A1, S1, P3, S3; Te: - | Tr: A5, S5, P11, S11; Te: 2, 6, 9 | Tr: A2, S2, A8b, P14, S15; Te: 3, 8, 11                       | Tr: P5, A7, S7, S8b, A11; Te: 1, 10, 12, 13 | Tr: P7; Te: 4, 5      | Tr: P2, 4, 10; Te: 7, 14      | All the rest
Table 6 - Typical clusterings of 8 RBFs ("A" stands for Amirim, "P" for Parod, "S" for Shefer; "Tr" training sets, "Te" test samples).
The cluster mappings we have just studied can help us choose the best architecture for our classification problems. Clearly, the ideal number of RBFs for the source-type problem is 2, as can be seen in table 3 (when both normalizations are applied): this already gives a very good partition of the data sets, with only a few errors. Using more RBFs does not eliminate these errors, and in fact sometimes introduces new ones due to overfitting. And indeed, an ensemble constructed using the leave-one-out method, as in the previous section, gave 11 correct answers, and misclassified only test samples 1, 5, and 12 (the problematic input). Combining 5 such ensembles, each utilizing a different number of features between 10 and 20, and with random noise (with a standard deviation of 0.7) added to the data, gives a better output, with 12 correct classifications (the two errors are on test_1 and test_12).
Choosing the ideal number of Gaussians for the source-region problem is more difficult. From the tables above, it seems that a good choice is four: this yields a fair separation, and does not introduce overfitting (the clusters don't collapse), as the larger models do. Actually, the best results were obtained by ensembles of networks with only 3 RBFs, as can be seen in table 7. The table shows typical classifications of leave-one-out ensembles with 2, 3, 4, 5, and 6 hidden units.
Network #6 in table 7 is a combination of 9 ensembles, with various numbers of RBFs (3, 4, and 5) and input features, and with different random noise added to the data of each ensemble. This network received an amazing score of 9.9! It had only two wrong answers: test_3 and test_12 (well, nobody's perfect). This is a good example of the benefits of a large ensemble: each of the constituent ensembles received a medium score, ranging from 5.5 to 9.5, but together they gave an excellent output. Note that this high result is not typical; however, it may imply that large ensembles that utilize networks of different architectures are indeed much better (due to the fact that they are more independent). Further research is needed to determine whether this is true.
   #     N |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 |    S
 Correct   |  1  7  6  8  8  4  6  6  7  7  6  2  4  1 | 12.9
   1     2 |  1  U  5  6  6  U  6  5  6  U  5  U  U  1 |  6.5
   2     3 |  1  4  5  U  6  4  6  5  U  4  5  4  4  1 |  8.7
   3     4 |  1  4  5  6  6  4  6  5  4  4  5  4  4  1 |  8
   4     5 |  1  3  5  3  3  3  5  5  3  3  5  4  4  1 |  7
   5     6 |  1  3  5  3  3  4  5  5  3  3  5  4  4  1 |  7.5
   6     - |  1  4  U  U  U  4  6  6  3  4  6  4  4  1 |  9.9
Table 7 - Typical source-region classifications of ensembles of RBF networks with N kernels (utilizing 15 input features), using hybrid training. The first column numbers the networks; the first row is the correct classification; network #6 is the combined ensemble described in the text. "S" stands for "score", "U" for "unknown".
Matlab code:
The Matlab function "cluster" constructs a GMM, trains it on the seismic data using the EM algorithm, and prints the clusters. The user controls the number of Gaussians and training iterations, as well as the preprocessing procedures. Type 'help cluster' for more information, or see the code.
The leave-one-out ensembles were constructed using the same "seismonet" function as in the previous section.
6. Summary

In this paper, we have demonstrated the performance of ensembles of networks, applied to the source-region and source-type seismic classification problems. The ensembles consisted of RBF networks, and were constructed using the leave-one-out procedure.
Both supervised and hybrid training techniques achieved good results: 8.7 for the source-region problem (better than the performance of the MLPs from project #1), and 12 for the source-type classification. In the supervised training approach, the kernels tend to "explode" when the input dimension is relatively large; the best results for the region classification were obtained by a network with 6 kernels and only 4 input features. On the other hand, when the hybrid training procedure is applied with many RBFs, kernels tend to "collapse", and we get overfitting; in this case, the best network had only 3 RBFs, and utilized 15 features.
Other important differences between the two training approaches are that the hybrid method is much faster, and that it can also use the test samples for the first (unsupervised) step. In my opinion, this makes it much more attractive in most cases.
One may wish to combine different networks (perhaps with different architectures) into one large ensemble. We have performed some such experiments, and the results were occasionally excellent. Further research is needed to determine the best strategies for constructing these ensembles: how to choose the networks, exactly what noise to add, how to combine the outputs of the networks into a final result, etc.
We have also discussed the mapping achieved by the unsupervised training of a GMM. We saw that the clustering that was formed was rather easy to interpret, and was usually the most distinct, intuitive partitioning of the data, given the specific preprocessing steps that were applied.
Chaim Linhart
chaim@math.tau.ac.il
July 1998