Wednesday, July 23, 2008

Fixing the Bug in Threshold Variation

In yesterday's experiment results, we noted that the prediction rate dropped below the required rate even though the minimum prediction rate was set to 0.95 and the threshold was varied accordingly. The threshold variation worked properly in only a few cases and failed in the others. The figure below shows a result where training worked properly with threshold variation at the node.

I made some code changes, and the bug disappeared. Over a series of experiments, each node of the classifier was able to achieve the desired true positive rate. There were still a few instances where the classifier went amok. The graph below shows the performance of the classifier in one of these bad instances.


Analyzing why the false positive rate suddenly increases, I observed the following in this type of failure. The histogram of the output of the boosted classifier (at a particular node) is shown below. The three histograms show the following:
  1. Blue: Entire distribution of PredictedY [760 0 .. 0 99 14 0 .. 0 727]
  2. Green: Distribution of PredictedY at Y = 1 [ .. 0 0 0 26 5 0 .. 0 727]
  3. Red: Distribution of PredictedY at Y = 0 [760 0 .. 0 73 9 0 .. 0 0]
Here we see that the threshold-changing algorithm is not able to find a suitable threshold to achieve the required true positive rate, because we change the threshold only to the distinct values of PredictedY (i.e. 0, 3.3620, 3.5603, 6.9223) and not to any intermediate values. The classifier would have achieved its true positive target if it had tried a threshold of, say, 2.00.


The fix for the above bug was done as follows. (Note: the values differ from those above because this is a different experiment.) For uniqueSortedPredictedY = 0, 3.1102, 3.6728, 6.7830, intermediate values were also chosen as the midpoints between consecutive values: intermediateValuesOfPredY = 1.5551, 3.3915, 5.2279. These were also included in the list of possible thresholds, and the required true positive rate was achieved for the classifier. I believe this closes the known bugs in the threshold variation algorithm. Next we will move ahead with the linear classifier implementation. All the experiments were run on the 10D Gaussian dataset.
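
As a rough illustration of the candidate list described above, here is a minimal Python sketch. It is not the code in Boost.m. The function names, the "score >= threshold means positive" rule, and the "pick the highest threshold that still meets the target" rule are all assumptions made for the example. One thing the sketch does show is that a midpoint threshold classifies every sample the same way whether the comparison is > or >=, since no sample sits exactly on a midpoint.

```python
def candidate_thresholds(predicted_y):
    """Distinct predicted values plus the midpoints between consecutive ones."""
    unique_sorted = sorted(set(predicted_y))
    midpoints = [(a + b) / 2 for a, b in zip(unique_sorted, unique_sorted[1:])]
    return sorted(unique_sorted + midpoints)


def true_positive_rate(predicted_y, y, threshold):
    """Fraction of Y = 1 samples with score >= threshold (assumed decision rule)."""
    positives = [s for s, label in zip(predicted_y, y) if label == 1]
    if not positives:
        raise ValueError("no positive samples")
    return sum(s >= threshold for s in positives) / len(positives)


def pick_threshold(predicted_y, y, target_tpr):
    """Highest candidate threshold that still meets the target (assumed rule), or None."""
    feasible = [t for t in candidate_thresholds(predicted_y)
                if true_positive_rate(predicted_y, y, t) >= target_tpr]
    return max(feasible) if feasible else None


# The four distinct output values quoted in the post.
unique_sorted_predicted_y = [0, 3.1102, 3.6728, 6.7830]
print([round(t, 4) for t in candidate_thresholds(unique_sorted_predicted_y)])
```

The printed list contains the four original values interleaved with the three intermediate values from the post.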

Brodatz dataset: After this bug fix, the experiment was run on the tough Brodatz pattern-distinguishing dataset. Because the feature values had difficulty classifying at such a high true positive rate, the training failed at the first node itself. The distribution of PredictedY at a node is shown below. A lower target was set and the classifier started to build, but it failed to move beyond the 2nd node.
Synapse dataset: Running the code on the Synapse dataset gave the following training and test results. The true positive target was set at 0.9.
The bug-fixed code can be found in the /usr/sci/crcnsdata/CRCNS/Synapses/Code/Matlab/ML_Boosting4_DT3 directory. Boost.m was modified.
