Computational Learning - Project #2

Submitted by: Chaim Linhart

July 1998

 

Abstract

In this project we construct radial basis function (RBF) networks for classifying seismic data according to source region and type. We apply both supervised and hybrid (two-stage) training techniques to train ensembles of networks, and discuss the influence of the various parameters on their performance. We also compare the results with those achieved by regular feed-forward networks (the MLPs from project #1). The data mapping produced by the unsupervised clustering of the RBFs, under various pre-processing procedures, is discussed in detail.

 

  1. Introduction

In project #1 we presented the problem of classifying seismic data according to the source region (one of six regions) or type (quarry explosion or earthquake). The structure of the input data was described, as well as the pre-processing procedures applied to it. The MLPs we constructed achieved reasonable results, and we analyzed the influence of the various parameters (pre-processing steps, size of network, number of training iterations) on their performance.

Sections 2 and 3 of this paper provide a short description of the input data and the pre-processing procedures we apply to it. We then construct leave-one-out ensembles of RBF networks using supervised training, run the test data through them, and analyze the results. This is described in section 4, where, as in the previous project, we also examine which factors influence the results, and how. In section 5 we repeat the analysis for RBFs with hybrid training. The emphasis there is on the interpretation of the data mapping achieved by the kernels under different pre-processing steps. Finally, a short summary is given in section 6.

 

  2. The Input Data

The "homework\data" directory contains 53 text files. Each file is a recording of one seismic event, at one of three stations - Amirim, Parod, and Shefer. 39 recordings (three recordings of each of 13 events) comprise the training sample set, and the other 14 are our test sample set, as shown in table 1.

 

Region        Quarry/Earthquake   Distance (km)   Training sets     Test samples

1: Kadarim    Quarry               3              2, 10             1, 14
2: Amiad      Quarry               9              4                 12
3: Galilee    Earthquake          17              11                -
4: Golan      Earthquake          25              5, 7              6, 13
5: Yehiam     Quarry              25              8b, 12, 14, 15    -
6: Hanaton    Quarry              26              1, 3, 13          3, 7, 8, 11
7: Kinneret   Earthquake          20, 50          -                 2, 9, 10
8: Lebanon    Quarry              70              -                 5
   Lebanon    Earthquake          70              -                 4

Table 1 - The training and test data (each training set includes 3 recordings - "Amirim_xx.txt", "Parod_xx.txt", and "Shefer_xx.txt"; each test sample is one recording - "test_xx.txt").

 

In order to evaluate the performance of our networks, we will give each classification a score. In the case of source-type classification, this will simply be 0 for a wrong result, and 1 for a correct one. For the source-region problem, we wish to distinguish "close" answers from totally "far" ones. Therefore, an earthquake from Galilee that is mis-classified as a Golan event, or vice versa, will award the network half a point. The same goes for Kadarim and Amiad (the close quarries); Yehiam and Hanaton (the far quarries); and the Kinneret events, when classified as Galilee, Golan, or "unknown" (all three results are reasonable, since this site does not appear in the training data, and is rather similar to regions 3 and 4). A special bonus of 0.7 will be given to the Lebanon events, when classified as "unknown", as this site is far from all six regions in the training data.
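A minimal Matlab sketch of this scoring rule (the function name and the numeric codes - 7 for Kinneret, 8 for Lebanon, 0 for "unknown" - are ours):

    % regionScore - score one source-region classification.
    % truth  - true region: 1-6, 7 = Kinneret, 8 = Lebanon.
    % answer - network output: 1-6, or 0 for "unknown".
    function s = regionScore(truth, answer)
    if truth == answer
        s = 1;                                    % exact match
    elseif truth == 8 && answer == 0
        s = 0.7;                                  % Lebanon -> "unknown" bonus
    elseif truth == 7 && any(answer == [3 4 0])
        s = 0.5;                                  % Kinneret -> Galilee/Golan/unknown
    elseif isequal(sort([truth answer]), [1 2]) || ...
           isequal(sort([truth answer]), [3 4]) || ...
           isequal(sort([truth answer]), [5 6])
        s = 0.5;                                  % a "close" region confusion
    else
        s = 0;                                    % a total miss
    end
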

The seismic recordings have already been (partially) pre-processed for us, and are given as sonograms - spectral images (matrices of "pixels") of the events. Each spectral image in an input file contains 60x11 energy values, covering a duration of 60 seconds and a log-frequency scale from 25 Hz down to 0.5 Hz (top-down). Energy is scaled by log2 and coded in ASCII ("a"=1, "b"=2, "c"=3, ...; "." means 0, i.e. no detectable signal energy). The first 5 lines in a file contain additional comments, such as the recording station, time, and target values for classification. Each event starts at the 10th sample, i.e. with a 9-second noise prerun.
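For illustration, decoding one such ASCII row in Matlab might look like this (a sketch; the sample line is made up):

    line = 'ab.c';                          % one row of a spectral image
    e = double(line) - double('a') + 1;     % 'a' -> 1, 'b' -> 2, ...
    e(line == '.') = 0;                     % '.' = no detectable energy
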

Which factors affect the sonogram, and how?

Earthquakes and explosions produce different seismic patterns, and our networks can utilize this fact to classify the source-type. Event duration depends on the distance and the magnitude of the explosion/earthquake - far events give longer seismograms (the position of the S onset is a linear function of the distance). The distance and the magnitude also determine the recorded energies - the energy is proportional to 1/d, where d is the distance of the event, and to m², where m is its magnitude.

We will consider these properties in the pre-processing stage, and perform transformations on the data to achieve the required invariance.

 

  3. Pre-Processing

Pre-processing refers to the manipulation applied to the data before it enters the neural network. It is a very important phase in the development of a NN solution, and often has a crucial effect on its performance - e.g., the percentage of mis-classifications of test or real data, the size of the network needed to achieve it, etc.

As mentioned in the previous section, some pre-processing has already been applied to the original data, to transform the (almost) continuous seismograms into discrete spectral images with a fixed translation (all events start at the 10th second of the sonograms). However, this is not enough.

Since we want our classifications to be independent of the magnitude of the events, we shall apply a magnitude normalization to our input - this is achieved by simply subtracting the maximal energy in a sonogram from all other non-zero energies (recall that this is done on a logarithmic scale, so we are actually dividing the energies so that the largest peak in all seismograms has the same amplitude).
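A minimal Matlab sketch of this normalization (S holds the log2 energies of one spectral image; the variable names are ours):

    nz = (S > 0);                    % touch only detected (non-zero) energies
    S(nz) = S(nz) - max(S(:));       % subtract the maximal log-energy, i.e.
                                     %   divide all amplitudes by the peak
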

When classifying by source type, we also apply a distance normalization to the input, by "stretching" each sonogram according to the distance of the corresponding event. Thus, our networks will work on data that is distance-invariant, and as a result close events will be classified the same way as far ones.
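For illustration only, one plausible way to implement the stretching in Matlab (the resampling direction, the reference distance dRef, and all names are our assumptions, not necessarily what the project's code does):

    % S is an 11-by-60 sonogram (columns = seconds); d is the event's
    % distance; dRef is an arbitrary reference distance.
    nCols = size(S, 2);
    src = 1 + (0:nCols-1) * (d / dRef);   % read far events (d > dRef) at
    src = min(src, nCols);                %   larger steps, compressing them
    S = interp1((1:nCols)', S', src', 'linear')';   % resample each freq. row
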

In addition, we shall always re-scale the input to [0,1], since the NN implementation we use works better when the data is given in this range.

Another important aspect of pre-processing is to reduce the dimensionality of the input, making the network's task easier, since its efforts are focused on the most informative features. Furthermore, smaller networks converge faster, and are less vulnerable to over-fitting. Dimensionality reduction can be performed in many ways - for instance, by applying a Fourier transform and selecting the strongest frequencies, or by taking the locations of the maximal energies in each sonogram as the features. Here, we shall use the averaging technique introduced in the first project, which captures the general pattern of the sonogram. The idea is to average rectangular areas in the spectral image, i.e., to compute a smaller spectral image, in which each pixel is the average of a corresponding sub-matrix in the original image. The averaged rectangles (sub-matrices) are smaller in the left part of the image, since the first seconds of the seismogram contain more information and tend to be more variable.
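As an illustration, such a reduction might look as follows in Matlab (the particular grid of rectangles below is made up for the example; the exact grid is the one described in project #1):

    % S is an 11-by-60 spectral image; finer time bins at the start.
    colEdges = [1 3 5 7 9 11 13 16 21 31 46 61];
    rowEdges = [1 6 12];                           % two frequency bands
    R = zeros(length(rowEdges)-1, length(colEdges)-1);
    for i = 1:length(rowEdges)-1
        for j = 1:length(colEdges)-1
            block = S(rowEdges(i):rowEdges(i+1)-1, colEdges(j):colEdges(j+1)-1);
            R(i,j) = mean(block(:));               % average each rectangle
        end
    end
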

Further dimensionality reduction is achieved by discarding the first 9 seconds in each sonogram, which contain only noise, i.e. no information about the seismic event itself (note that one may want to leave this data as a "noise reference" for the networks).

Finally, we perform a principal components analysis to project the n-dimensional input vectors onto the sub-space spanned by the d eigenvectors that correspond to the d largest eigenvalues (where d<n). In other words, the input vectors are described by a new (smaller) set of coordinates that preserves much of their variability (and, hopefully, the information stored in them).
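A minimal Matlab sketch of this projection (X holds one input vector per row; the names are ours):

    d = 4;                                      % number of features to keep
    Xc = X - repmat(mean(X,1), size(X,1), 1);   % center the data
    [V, D] = eig(cov(Xc));                      % eigen-decomposition of the
                                                %   covariance matrix
    [ev, idx] = sort(diag(D), 'descend');       % order by eigenvalue
    P = V(:, idx(1:d));                         % top-d principal directions
    Y = Xc * P;                                 % the d-dimensional features
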

Using these feature selection techniques, the input dimension is reduced from 660 (11x60) to an arbitrarily small dimension, usually in the range 2-10. Though we can experiment with many different imaginative pre-processing procedures, as we have done in the previous project, this is not the goal here. We shall apply the same pre-processing steps as in the first project, where we trained two-layer perceptron networks (MLP’s) for the same classification tasks. This will enable us to compare the performance of our RBF’s with that of the MLP’s.

A more detailed explanation of the pre-processing routines, along with various variants of them and examples, can be found in project #1.

 

  4. Supervised Training

Supervised training refers to the iterative process of updating the network's parameters so that its actual output fits the pre-defined target values. In other words, the error on the given training set is brought to a local minimum. The most common update method is the δ-rule, which can be extended to multi-layered networks using back-propagation. Faster and more reliable convergence can be achieved using various optimization methods, such as conjugate gradient descent. In project #1 we used this technique to train our MLP's for the source-region and source-type classification problems.

The same approach can be applied to RBF's. We iteratively update the network's parameters until it converges to a solution. In each iteration, the second-layer multiplicative weights are adjusted as in the MLP's; the error is then back-propagated, and the first-layer parameters - the gaussians' centers and covariances - are updated in a similar way. The exact formulas are obtained by differentiating the error function (usually a sum of squares) with respect to each parameter.

It can be shown that the average sum-of-squares error of a given estimator f for a given data set D is a sum of two terms - the first term does not depend on the estimator, and the second is the sum of the squared bias and the variance of f. Minimizing the error therefore translates to minimizing bias² + variance. When using ensembles of networks (i.e., averaging the output of several estimators), the variance decreases as the size of the ensemble grows. The best results are achieved when the estimators, or experts, are independent.
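In LaTeX notation, this standard decomposition of the expected squared error of f at an input x reads:

    \mathrm{E}_D\left[(f(x;D)-t)^2\right]
      = \underbrace{\mathrm{E}\left[(t-\mathrm{E}[t|x])^2\right]}_{\text{noise}}
      + \underbrace{\left(\mathrm{E}_D[f(x;D)]-\mathrm{E}[t|x]\right)^2}_{\text{bias}^2}
      + \underbrace{\mathrm{E}_D\left[(f(x;D)-\mathrm{E}_D[f(x;D)])^2\right]}_{\text{variance}}

where the first term (the intrinsic noise) is independent of the estimator, and the remaining two are the squared bias and the variance.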

Constructing an ensemble with relatively small inter-dependence can be accomplished using the leave-one-out procedure. In each iteration, the three training samples from the same seismic event (which are highly correlated) are left out, and a network is trained on the remaining 36 recordings. Thus, we get 13 networks, each trained on a different subset of the training sample set. Combining these experts will hopefully give a "smart" ensemble, as explained above (we'll call this the "leave-one-out ensemble"). In our case of classification, as opposed to probability density estimation for instance, it does not always make sense to average the networks' outputs. For example, if half of the experts classify an input as source-region "1", and the rest classify it as "3", then we clearly do not want the final answer to be "2". Instead, we apply simple voting - the output value that received the largest number of votes is the final classification of the ensemble. In order to deal with problematic votes, we decide that if no output value received at least N/2+1 votes (where N is the size of the ensemble - 13 in our case), the final answer is "unknown".
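A minimal Matlab sketch of this voting rule (the function name is ours; we use 0 to encode "unknown"):

    % ensembleVote - combine an ensemble's answers by majority vote.
    % votes is a vector with one answer per network; 0 stands for "unknown".
    function out = ensembleVote(votes)
    N = length(votes);
    labels = unique(votes);
    counts = zeros(size(labels));
    for k = 1:length(labels)
        counts(k) = sum(votes == labels(k));   % count each label's votes
    end
    [best, k] = max(counts);
    if best >= floor(N/2) + 1
        out = labels(k);                       % a clear majority exists
    else
        out = 0;                               % no majority -> "unknown"
    end
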

 

First, let us deal with the source-region classification problem. Figure 1 shows the average scores of ensembles with 4, 6, and 8 RBF’s, for various numbers of features selected (150 training iterations were used). The networks with 6 kernels performed slightly better than the others, giving a score of 8.7 (which is 67% of the maximal score) when 4 features were selected in the pre-processing stage. An ensemble of MLP’s with 6 hidden units yields less satisfying classifications, with an average score of 7.5. It is clear from the figure, though, that supervised RBF networks do not cope well with high-dimensional input spaces. Using 10 features, the RBF networks usually receive a score of less than 5, while the corresponding MLP ensembles reach 7. The problem with supervised RBF’s is that the gaussians sometimes "explode" - i.e., their variance grows too much. This causes a small number of kernels to "take over", thus degrading the performance. Fortunately, in our case 4 features seem to be enough for a reasonable classification.

Table 2 shows typical results of the leave-one-out ensembles, with 4, 6, and 8 RBF’s, utilizing 4 features. The first row contains the correct classification, whose score is 12.9. This is the most we can expect from our networks.

 

Figure 1 - Average scores of leave-one-out ensembles of RBF networks, using supervised training, with 4 (blue line), 6 (red), and 8 (green) RBF’s.

 

 

 

N         1   2   3   4   5   6   7   8   9  10  11  12  13  14      S

Correct   1   7   6   8   8   4   6   6   7   7   6   2   4   1   12.9

4         1   4   5   5   5   4   5   5   U   U   5   4   4   1    7.5
6         1   4   5   U   5   4   5   6   3   U   5   4   4   1    8.7
8         1   4   5   3   3   4   5   6   3   3   5   4   4   1    8

Table 2 - Typical source-region classifications (of the 14 test samples) by ensembles of RBF networks with N kernels (utilizing 4 input features), using supervised training. "S" stands for "score", "U" for "Unknown".

 

In order to improve our solution, we can combine several ensembles into one giant ensemble, whose final output is determined by a democratic vote. However, since the ensembles are highly dependent, we usually do not get better results. This can be solved by adding zero-mean gaussian noise to the training samples. Thus, each ensemble learns a slightly different data set, and the ensembles are therefore less correlated. Of course, the noise shouldn't be too "loud". Usually, a small noise with a standard deviation of about 1 did improve the overall performance, especially in the networks that used more input features. We will discuss a more detailed example in the next section. The main shortcoming of such an "ensemble of ensembles" is that it is computationally heavy.
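A Matlab sketch of the noise injection (X holds the training vectors, one per row; sigma is the noise level, e.g. 1; the names are ours):

    Xnoisy = X + sigma * randn(size(X));   % a slightly different training
                                           %   set for every ensemble
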

When constructing an ensemble of ensembles, one might wonder if a democratic vote is indeed the best policy. Perhaps a weighted vote is better. For instance, we may give each ensemble a weight proportional to its score (meaning we prefer the opinion of smart experts), or according to its variability (each ensemble consists of several networks, and if they can't make up their mind, why should we listen to them?). These suggested approaches have not been investigated in this paper, mainly because of the enormous amount of computer time they require (also, a larger test set is needed).

An additional improvement to our networks may be to bound the variances of the radial kernels, so that they won't "explode". This can be done, for instance, by adding a term proportional to the variances to the overall error. Differentiating the new error function will force the gradient descent to choose solutions with smaller kernel variances, which are usually better. This may enable us to work with more input features, i.e. more information.
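In LaTeX notation, one such penalized error could read (this exact form is our illustration; λ is a small positive constant, and σ_j is the width of the j-th kernel):

    \tilde{E} = E + \lambda \sum_{j=1}^{K} \sigma_j^2
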

 

The source-type classification problem is much easier. A leave-one-out ensemble that consists of networks with 2 RBF’s and 2 input features receives a score of 12 (out of 14). The most common errors are on "test_12" (a very difficult test sample), and "test_4", "test_5" (both events are from Lebanon, a region that is much farther than those in the training data). This score was also achieved by the MLP’s in project #1.

 

Matlab code:

The Matlab function "seismonet" constructs ensembles for source-region or source-type classification, trains them on the training sample set, and gives the classification results for the test data. Each ensemble is built using the leave-one-out procedure, and is composed of networks with user-defined characteristics. The user controls the type of networks (RBF's with supervised or hybrid training, or standard MLP's), as well as the number of hidden units, training iterations, and features to select. Gaussian noise can be added to the training data to increase the independence between the ensembles. For more details, type 'help seismonet', or browse the code.

 

  5. Hybrid Training

Hybrid training is composed of two separate stages. In the first stage, the kernels’ parameters (the gaussians’ centers and covariances) are determined using an unsupervised training approach. Then, the second layer multiplicative weights are trained using the regular supervised technique.

The first stage can be performed with the EM-algorithm, for instance, which is basically a probability-relaxation technique for optimizing inter-dependent parameters. In each iteration, the new prior class probabilities are calculated from the current posterior probabilities; then, the gaussians' centers and variances are calculated using the new priors; finally, new posteriors are computed from these new parameters, and are used in the next iteration, and so on. Usually, this process converges after a relatively small number of iterations (fewer than 30 in our case).
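For illustration, a minimal EM iteration for a spherical Gaussian mixture might look as follows in Matlab (a generic sketch, not the code of our "cluster" function; all names are ours):

    % X is N-by-dim (one input vector per row); K is the number of kernels.
    [N, dim] = size(X);
    mu = X(ceil(rand(K,1)*N), :);            % centers, on random data points
    s2 = ones(K, 1);                         % variances
    pr = ones(K, 1) / K;                     % priors (mixing proportions)
    for it = 1:30                            % usually converges in < 30 steps
        P = zeros(N, K);                     % E-step: posterior probabilities
        for j = 1:K
            d2 = sum((X - repmat(mu(j,:), N, 1)).^2, 2);
            P(:,j) = pr(j) * exp(-d2/(2*s2(j))) / (2*pi*s2(j))^(dim/2);
        end
        P = P ./ repmat(sum(P,2), 1, K);
        for j = 1:K                          % M-step: re-estimate parameters
            w = P(:,j);
            pr(j) = mean(w);
            mu(j,:) = (w' * X) / sum(w);
            d2 = sum((X - repmat(mu(j,:), N, 1)).^2, 2);
            s2(j) = (w' * d2) / (dim * sum(w));
        end
    end
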

The second stage is equivalent to solving a system of linear equations, in which the second-layer weights are the unknowns, the hidden units' outputs are their coefficients, and the target output values form the right-hand side. To avoid numerical problems, the pseudo-inverse technique can be applied, which yields a least-squares solution.
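A minimal Matlab sketch of this step (assuming Phi is the N-by-K matrix of hidden-unit outputs on the N training samples, and T is the N-by-c matrix of target values; the names are ours):

    Phi1 = [Phi, ones(size(Phi,1), 1)];   % append a bias column
    W = pinv(Phi1) * T;                   % least-squares second-layer weights,
                                          %   via the pseudo-inverse
    Y = Phi1 * W;                         % the network's training outputs
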

The hybrid training approach has several advantages. It is much faster than the supervised method, and is easier to interpret (as we shall demonstrate). It is especially useful when labelled data is in short supply, since the first stage can be performed on all the data (not only the labelled part).

The hidden units in an RBF network are actually the component densities of a Gaussian mixture model (GMM), and the second-layer weights are their mixing coefficients. The first-stage unsupervised training determines the gaussians, thus partitioning the data into multi-dimensional radial clusters. The second stage of the training sets the mixing coefficients, in order to map the gaussians' activations (actually, the distance of the input vector from each gaussian) to valid output values. Hybrid training of RBF networks can therefore be studied by simply examining the clustering it forms in various cases.

 

Table 3 describes the typical clustering when using two RBF’s, for three pre-processing procedures. When only feature selection is applied to the data (by default, we select 15 features), we get a magnitude/distance clustering - all the recordings in cluster #1 are of nearby quarry blasts (Kadarim and Amiad regions), with a maximal magnitude of at least 20. All of the remaining events are weaker (except "test_8", which has a magnitude of 20, but is from a quarry farther away), and more distant (except "test_1", which is also from Kadarim, but is weaker, and has a magnitude of 18). So basically, the data is clustered according to the magnitude of the recordings (further examination reveals that the "boundary" cases of "test_8" and "Parod_10", which have a magnitude of 20, are indeed farthest from the clusters’ centers).

The reason is simply that this is the most obvious clustering of the data in our feature space - the two aforementioned clusters turn out to be the most distinct partition of the given input.

When applying a magnitude-invariance normalization, we get regional clusters - one cluster contains all recordings from the Yehiam and Hanaton regions, except for "Shefer_8b" (which is somewhat of an outlier), and the other cluster consists of the rest of the data. After normalizing the input according to the magnitudes, this is the most obvious way to separate the data into two subsets. The reason is that the recordings from Yehiam and Hanaton are very similar; in fact, the classification networks tend to mix them up. The centers of the two RBF's thus formed are very close in all coordinates but one, which implies that a simple half-plane could probably yield the same partition. It is important to emphasize that the clustering varies, and is highly influenced by the number of features we select in the pre-processing phase. For instance, when selecting 10 features instead of 15, the data is partitioned into a group with training sets 5 and 11 (Galilee and Golan earthquakes, excluding set 7) and the test samples from the Kinneret and Lebanon, versus all the rest.

Using both magnitude and distance normalizations yields a simple clustering of earthquakes vs. quarry blasts, with the exception of test samples 1, 4, and 12, which are included in the earthquake cluster. These are reasonable "errors"; in particular, "test_12" is a very problematic recording. Fortunately, this intuitive partition according to source-type is also the most obvious one the EM-algorithm found, once magnitude and distance differences are ignored.

 

Pre-Processing                  Cluster #1                                     Cluster #2

1 - None                        Training sets: 2, 4, 10                        All the rest
                                Test samples:  14

2 - Magnitude normalization     Training sets: 2, 4, 5, 7, 10, 11, Shefer_8b   All the rest
                                Test samples:  1, 2, 4, 5, 6, 9, 10, 12-14

3 - Magnitude and distance      Training sets: 5, 7, 11                        All the rest
    normalizations              Test samples:  1, 2, 4, 5, 6, 9, 10, 12, 13

Table 3 - Typical clusterings of 2 RBFs.

 

Tables 4, 5, and 6 show typical clusterings for 4, 6, and 8 RBF's, respectively. Basically, the same phenomena we have just observed repeat here, with additional refinements. In each of the pre-processing cases, the two clusters formed with 2 RBF's are broken into sub-clusters when using 4 gaussians - each additional kernel divides one of the old clusters into two sub-groups. The same applies when increasing the number of RBF's to 6, and finally to 8. For example, when no normalization is performed in the pre-processing, cluster #2 in table 3 is split into 3 clusters in table 4 - clusters 1, 2, and 4; and cluster 4 in table 4 is further broken into clusters 1, 2, and 6 in table 5.

Another example: when magnitude and distance normalizations are applied, cluster #2 in table 3, which consists of the quarry recordings, is split into 3 clusters when two RBF's are added: the first roughly includes the blasts from Kadarim and Amiad (cluster 3 in table 4), the second is composed of only one recording (cluster 1 - a good example of how a kernel might collapse onto a single input point - the opposite of the kernel "explosions" we witnessed in the supervised training), and the third mainly includes the blasts from Yehiam and Hanaton. This shows that after the primary discrimination (quarry vs. earthquake) was detected by two RBF's, a "fine-tuning" was performed by the two additional kernels; in this case, the quarry blasts were divided according to their source-region (close vs. far). Using more RBF's results in smaller clusters that are closer to one another, and their interpretation becomes less obvious.

 

1 - None:
    Cluster #1 - Training: 7            Test: 1, 2, 13
    Cluster #2 - Training: 5, 11        Test: 6, 10, 12
    Cluster #3 - Training: 2, 4, 10     Test: 14
    Cluster #4 - All the rest

2 - Magnitude:
    Cluster #1 - Training: 5, 7, S8b    Test: 1, 2, 6, 12, 13
    Cluster #2 - Training: 11           Test: 4, 5, 9, 10
    Cluster #3 - Training: 2, 4, 10     Test: 14
    Cluster #4 - All the rest

3 - Magnitude and Distance:
    Cluster #1 - Training: P3           Test: -
    Cluster #2 - Training: 1, A3, S3, A8b, P8b, 12-15    Test: 8, 11
    Cluster #3 - Training: 2, 4, 10     Test: 3, 7, 14
    Cluster #4 - All the rest

Table 4 - Typical clusterings of 4 RBFs ("A" stands for Amirim, "P" for Parod, "S" for Shefer).

 

1 - None:
    Cluster #1 - Training: -            Test: 4, 5, 9
    Cluster #2 - Training: S12, S14     Test: 3, 7, 8
    Cluster #3 - Training: 7            Test: 1, 2, 13
    Cluster #4 - Training: 5, 11        Test: 6, 10, 12
    Cluster #5 - Training: 2, 4, 10     Test: 14
    Cluster #6 - All the rest

2 - Magnitude:
    Cluster #1 - Training: 7            Test: 13
    Cluster #2 - Training: 5, S11       Test: 2, 6, 12
    Cluster #3 - Training: A11, P11     Test: 4, 5, 9, 10
    Cluster #4 - Training: S1, 3, 8b, A12, S12, 13, A14, P14, A15, P15    Test: -
    Cluster #5 - Training: 2, A4, A10, S10    Test: 1, 14
    Cluster #6 - All the rest

3 - Magnitude and Distance:
    Cluster #1 - Training: P11, S11     Test: 2, 6, 12
    Cluster #2 - Training: P3           Test: -
    Cluster #3 - Training: P8b          Test: -
    Cluster #4 - Training: 5, 7, S8b, A11     Test: 1, 4, 5, 9, 10, 13
    Cluster #5 - Training: 2, 4, 10     Test: 3, 7, 14
    Cluster #6 - All the rest

Table 5 - Typical clusterings of 6 RBFs ("A" stands for Amirim, "P" for Parod, "S" for Shefer).

 

 

1 - None:
    Cluster #1 - Training: P3           Test: -
    Cluster #2 - Training: -            Test: 4, 5
    Cluster #3 - Training: P4, 10       Test: 14
    Cluster #4 - Training: 2, A4, S4    Test: -
    Cluster #5 - Training: 7            Test: 1, 2, 9, 13
    Cluster #6 - Training: 5, 11        Test: 6, 10, 12
    Cluster #7 - Training: S12, S14     Test: 3, 7, 8
    Cluster #8 - All the rest

2 - Magnitude:
    Cluster #1 - Training: P3           Test: -
    Cluster #2 - Training: A7, P7       Test: 13
    Cluster #3 - Training: 2, A4, A10, S10    Test: 1, 14
    Cluster #4 - Training: A3, A8b, P8b, A12, S12, A13, P13, A14, P14    Test: -
    Cluster #5 - Training: 5, S7, S11   Test: 2, 6, 12
    Cluster #6 - Training: P4, S4, S8b  Test: -
    Cluster #7 - Training: A11, P11     Test: 4, 5, 9, 10
    Cluster #8 - All the rest

3 - Magnitude and Distance:
    Cluster #1 - Training: P8b          Test: -
    Cluster #2 - Training: A1, S1, P3, S3     Test: -
    Cluster #3 - Training: A5, S5, P11, S11   Test: 2, 6, 9
    Cluster #4 - Training: A2, S2, A8b, P14, S15    Test: 3, 8, 11
    Cluster #5 - Training: P5, A7, S7, S8b, A11     Test: 1, 10, 12, 13
    Cluster #6 - Training: P7           Test: 4, 5
    Cluster #7 - Training: P2, 4, 10    Test: 7, 14
    Cluster #8 - All the rest

Table 6 - Typical clusterings of 8 RBFs.

 

The cluster mappings we have just studied can help us choose the best architecture for our classification problems. Clearly, the ideal number of RBF's for the source-type problem is 2, as can be seen in table 3 (when both normalizations are applied) - this already gives a very good partition of the data sets, with only a few errors. Using more RBF's does not eliminate these errors, and in fact sometimes introduces new ones due to over-fitting. And indeed, an ensemble constructed using the leave-one-out method, as in the previous section, gave 11 correct answers, and misclassified only test samples 1, 5, and 12 (the problematic input). Combining 5 such ensembles, each utilizing a different number of features between 10 and 20, and having random noise (with a standard deviation of 0.7) added to the data, gives a better output, with 12 correct classifications (the two errors are for test_1 and test_12).

Choosing the ideal number of gaussians for the source-region problem is more difficult. It seems from the tables above that a good choice is four - this yields a fair separation, and does not introduce over-fitting (the clusters don't collapse), as the larger models do. Actually, the best results were obtained by ensembles of networks with only 3 RBF's, as can be seen in table 7. The table shows typical classifications of leave-one-out ensembles with 2, 3, 4, 5, and 6 hidden units.

Network #6 in table 7 is a combination of 9 ensembles, with various numbers of RBF's (3, 4, and 5) and input features, and with different random noise added to the data of each ensemble. This network received an amazing score of 9.9! It had only two wrong answers - test_3 and test_12 (well, nobody's perfect). This is a good example of the benefits of a large ensemble - each of the constituent networks received a medium score, ranging from 5.5 to 9.5, but together they gave an excellent output. Note that this high result is not typical; however, it may imply that large ensembles that utilize networks of different architectures are indeed much better (because they are more independent). Further research is needed to determine if this is true.

 

 

 #   N        1   2   3   4   5   6   7   8   9  10  11  12  13  14      S

 -   Correct  1   7   6   8   8   4   6   6   7   7   6   2   4   1   12.9

 1   2        1   U   5   6   6   U   6   5   6   U   5   U   U   1    6.5
 2   3        1   4   5   U   6   4   6   5   U   4   5   4   4   1    8.7
 3   4        1   4   5   6   6   4   6   5   4   4   5   4   4   1    8
 4   5        1   3   5   3   3   3   5   5   3   3   5   4   4   1    7
 5   6        1   3   5   3   3   4   5   5   3   3   5   4   4   1    7.5
 6   -        1   4   U   U   U   4   6   6   3   4   6   4   4   1    9.9

Table 7 - Typical source-region classifications of ensembles of RBF networks with N kernels (utilizing 15 input features), using hybrid training. "S" stands for "score", "U" for "Unknown".

 

Matlab code:

The Matlab function "cluster" constructs a GMM, trains it on the seismic data using the EM-algorithm, and prints the clusters. The user controls the number of gaussians and training iterations, as well as the pre-processing procedures. Type ‘help cluster’ for more information, or see the code.

The leave-one-out ensembles were constructed using the same "seismonet" function as in the previous section.

 

  6. Summary

In this paper, we have demonstrated the performance of ensembles of networks, when applied to the source-region and source-type seismic classification problems. The ensembles consisted of RBF networks that were constructed using the leave-one-out procedure.

Both supervised and hybrid training techniques achieved good results - 8.7 for the source-region problem (better than the performance of the MLP’s from Project #1), and 12 for the source-type classification. In the supervised training approach, the kernels tend to "explode" when the input dimension is relatively large. The best results for the region classification were received by a network with 6 kernels, and only 4 input features. On the other hand, when the hybrid training procedure is applied with many RBF’s, kernels tend to "collapse", and we get over-fitting. In this case, the best network had only 3 RBF’s, and utilized 15 features.

Other important differences between the two training approaches are that the hybrid method is much faster, and that it can also use the test samples for the first (unsupervised) stage. In my opinion, this makes it much more attractive for most cases.

One may wish to combine different networks (perhaps with different architectures) into a large ensemble. We have performed some such experiments, and the results were occasionally excellent. Further research is needed to determine the best strategies for constructing these ensembles - how to choose the networks, what noise to add, how to combine the outputs of the networks into a final result, etc.

We have also discussed the mapping achieved by the unsupervised training of a GMM. We saw that the clustering that was formed was rather easy to interpret, and was usually the most distinct, intuitive partitioning of the data, given the specific pre-processing steps that were applied.

 

 

 

 

 

Chaim Linhart

chaim@math.tau.ac.il
July 1998.