Sunday, October 26, 2008

The new gentle boost classifier

After the kNN classifier experiments, we will try a new classifier inspired by one of the posters at MIABB 2008, "Mitochondria detection in electron microscopy images". The salient features of this classification mechanism are the following:
  1. Pre-processing: The image is histogram equalized and median filtered. Histogram equalization helps reduce intensity variations across the image, and median filtering reduces salt-and-pepper noise.
  2. Features: Histograms and Gabor filter responses for multiple window sizes and multiple Gabor filter frequencies are used as features for the classifier.
  3. Ground truth markup: The entire synapse region is marked up for the experiment, and all the marked-up pixels are used as positive examples for training the classifier.
  4. Learning algorithm: The learning algorithm uses Gentle Boost (a variation of AdaBoost) to train the classifier. The classifier used is not specified in the poster.
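As a rough illustration of the learning step, here is a minimal Gentle Boost sketch in plain Python. Since the poster does not say which classifier it boosts, the regression stumps used here are an assumption (they are the common textbook choice), and the names fit_stump, gentle_boost and predict are made up for this example. Labels are expected to be -1 or +1.

```python
import math

def fit_stump(X, y, w):
    """Weighted least-squares regression stump: f(x) = a if x[j] > t else b."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            w_hi = sum(wi for row, wi in zip(X, w) if row[j] > t)
            w_lo = sum(wi for row, wi in zip(X, w) if row[j] <= t)
            # a and b are the weighted means of the labels on each side.
            a = sum(wi * yi for row, yi, wi in zip(X, y, w) if row[j] > t) / w_hi
            b = sum(wi * yi for row, yi, wi in zip(X, y, w) if row[j] <= t) / w_lo
            err = sum(wi * (yi - (a if row[j] > t else b)) ** 2
                      for row, yi, wi in zip(X, y, w))
            if best is None or err < best[0]:
                best = (err, j, t, a, b)
    if best is None:
        raise ValueError("every feature is constant; no split is possible")
    return best[1:]

def gentle_boost(X, y, n_rounds=10):
    """Return a list of stumps (j, t, a, b); y must contain -1 / +1 labels."""
    n = len(X)
    w = [1.0 / n] * n
    stumps = []
    for _ in range(n_rounds):
        j, t, a, b = fit_stump(X, y, w)
        stumps.append((j, t, a, b))
        # Gentle Boost update: w_i <- w_i * exp(-y_i * f_m(x_i)), then renormalize.
        w = [wi * math.exp(-yi * (a if row[j] > t else b))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return stumps

def predict(stumps, x):
    """Sign of the summed stump outputs."""
    score = sum(a if x[j] > t else b for j, t, a, b in stumps)
    return 1 if score >= 0 else -1
```

In the real experiment each row of X would be the histogram and Gabor feature vector of a pixel; the sketch only shows the boosting loop itself.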

Friday, October 17, 2008

kNN with clustered test points

The setup for this experiment is as follows:
  1. The data is divided into 4 folds (the 4 quadrants of the image).
  2. Positive training examples are all the SIFT key points within 10 pixels of the converged synapse markup points.
  3. Negative training examples are all the SIFT key points farther than 27.2 pixels from the converged synapse markup points. These points are clustered so that the training data is balanced.
  4. Test points are all the converged SIFT points in that quadrant. The test points are clustered.
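The positive/negative selection in steps 2 and 3 can be sketched as below. This is only an illustration in Python, not the Matlab code used for the experiment: the function name label_keypoints is made up, points are assumed to be (x, y) tuples, and treating a distance of exactly 10 pixels as positive is an assumption.

```python
import math

def label_keypoints(keypoints, markups, pos_radius=10.0, neg_radius=27.2):
    """Split key points into positives and negatives by distance to the nearest markup.

    Points that fall between the two radii are left out of both sets.
    """
    positives, negatives = [], []
    for kp in keypoints:
        nearest = min(math.dist(kp, m) for m in markups)
        if nearest <= pos_radius:
            positives.append(kp)
        elif nearest > neg_radius:
            negatives.append(kp)
    return positives, negatives
```

The gap between the two radii leaves an unlabeled band around each markup, so ambiguous points near a synapse are used neither as positives nor as negatives.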
Observations on why this experiment could fail:
  1. Some markups are placed very close together, but because of clustering the entire region might have only one representative point, which could be closer to either markup. Many markups (151/468) are closer than 27.5 pixels (the disk radius) to another markup. The histogram below shows the distance between a markup and the next closest markup point. (Note: the histogram shows only the markup points that are less than 50 pixels from their nearest neighbor [214/468], not all of them.)
  2. After choosing a representative point from each cluster, the histogram of distances between each markup point and its nearest test point is shown below. It is clear that, after choosing representative points by this method, we do not even have test points near the actual markup neighborhood.
Thus this method of clustering does not work for generating representative test points.

Monday, October 13, 2008

kNN with Unclustered Test points

In this setup the image is divided into 4 quadrants.
Test Data: The SIFT key points in each quadrant are used as test points.
Training Data: The SIFT key points from the other quadrants are used to train the classifier. The negative examples are clustered so that the dataset is more balanced.

The statistics for the points in each fold are:
Fold 1
Training points Total = 5027, Positive points = 2119
Test points Total = 28216, Positive points = 848
Test Result: Positive fold

Wednesday, October 8, 2008

Another kNN setup

In this setup the image is divided into 4 quadrants. The SIFT key points in each quadrant are used as test points, and the SIFT key points from the other quadrants are used to train the classifier.
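A small Python sketch of this quadrant split is given here for illustration. The function names and the quadrant numbering (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right) are assumptions, and points are taken to be (x, y) tuples.

```python
def quadrant_of(point, width, height):
    """Return the quadrant index (0..3) of a point in a width x height image."""
    x, y = point
    col = 0 if x < width / 2 else 1
    row = 0 if y < height / 2 else 1
    return 2 * row + col

def quadrant_folds(points, width, height):
    """Build 4 (train, test) pairs: each quadrant is the test set exactly once."""
    folds = []
    for q in range(4):
        test = [p for p in points if quadrant_of(p, width, height) == q]
        train = [p for p in points if quadrant_of(p, width, height) != q]
        folds.append((train, test))
    return folds
```

Because the split is spatial, no test point shares a quadrant with any training point.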

The statistics for the points in each fold are:
Fold 1
Total points = 1797, Positive points = 848
Fold 2
Total points = 1831, Positive points = 877
Fold 3
Total points = 1499, Positive points = 486
Fold 4
Total points = 1732, Positive points = 756

The definition of FP_rate has been changed; the other definitions remain the same:
#TP = number of ground truth positives (synapses marked by the Mark lab) with at least one marking done by the classifier within some radius (10 pixels).
#GTP = number of ground truth points (synapses marked by the Mark lab)
=> TP_rate = #TP / #GTP
#FP = number of positions marked by the classifier - #TP.
=> FP_rate = #FP / (#nSIFTpoints - #GTP)
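The definitions above translate fairly directly into code. Here is a sketch in Python; the function name detection_rates and the (x, y) tuple representation are assumptions made for this example, and a detection at exactly the radius is counted as a hit.

```python
import math

def detection_rates(ground_truth, detections, n_sift_points, radius=10.0):
    """Compute (TP_rate, FP_rate) following the definitions in this post."""
    # #TP: ground truth synapses with at least one classifier marking within the radius.
    n_tp = sum(
        1 for g in ground_truth
        if any(math.dist(g, d) <= radius for d in detections)
    )
    n_gtp = len(ground_truth)
    # #FP: number of positions marked by the classifier minus #TP.
    n_fp = len(detections) - n_tp
    tp_rate = n_tp / n_gtp
    fp_rate = n_fp / (n_sift_points - n_gtp)
    return tp_rate, fp_rate
```

For example, with 2 ground truth points, 3 classifier markings of which one lies within 10 pixels of a synapse, and 12 SIFT points in total, this gives TP_rate = 1/2 and FP_rate = 2/10.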

Monday, October 6, 2008

ROC Curve

On visual inspection, the results actually looked better than the confusion matrix values suggested. The reason is that there were multiple SIFT points near each expert markup, and of those only a few were detected (e.g. 2 of 10). That is why the true positive count was so low. So instead of taking the true positives to be the TPs of the kNN classifier, we use the definitions below.

#TP = number of ground truth positives (synapses marked by the Mark lab) with at least one marking done by the classifier within some radius (10 pixels).
#GTP = number of ground truth points (synapses marked by the Mark lab)
=> TP_rate = #TP / #GTP
#FP = number of positions marked by the classifier - #TP.
=> FP_rate = #FP / #{markings by classifier}

This was the definition used for the true positive rate and the false positive rate. Using this definition, ROC curves were plotted for different values of k (1, 3, 5, ..., 51) in the kNN classifier and different weights for the positive examples (1.0, 1.1, 1.2, ..., 2.0). The figure below is one such example.

The weird-looking graph plotted for all ks :)

Wednesday, October 1, 2008

New kNN setup

In the new kNN experiment setup we will use the old dataset, since it has many more examples than the Dr. Marc marked examples. There will be a 5-fold validation of the kNN classifier. In this classifier the disk size used is 55, rather than the 35 used in the previous experiment.

Step 1: Generating SIFT key points
Number of SIFT key points = 447947

Step 2: Converging the SIFT key points
After this filtering the number of points = 24944
After converging the points the number of unique points = 21844

Step 3: Clustering the points
Separation distance = 27.5 pixels
Number of clusters identified = 8395

Step 4: Positive & Negative examples : Separate Clustering
Separate clustering of points was done for positive and negative examples
Positive SIFT points are ones less than 10 pixels away from the Converged Synapses. (Positives = 608, Clusters = 336, ClustersFromConvergedSynapses = 377)
Negative SIFT points are ones greater than 55 pixels away from the Converged Synapses. (Negatives = 17688, Clusters = 6198)
One observation is that the number of clusters identified depends on the first point chosen.
For the negative data set, the points are clustered and representative points are taken. For the positives we use the entire set, because clustering them halves the number of positive examples and would turn the already skewed data set (1:10) into an even more skewed one (1:20).
The data points are stored in /usr/sci/crcnsdata/CRCNS/Synapses/Code/Matlab/kmeans/PositiveNegSIFTPoints.mat
To add another twist to the tale: when doing the clusterToPoint2 reduction for the negative examples, it ends up with 5784 points. On repeating the same procedure 3 times, the reduction stops and the number of points is 4625.
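The dependence on the first point chosen is easy to reproduce with a toy example. The Python sketch below is not the clusterToPoint2 code (which is not shown in this post); it is a simple greedy stand-in, assumed here for illustration, that keeps a point as a new cluster center only if it is at least the separation distance away from every center kept so far.

```python
import math

def greedy_cluster(points, separation=27.5):
    """Greedy distance-threshold clustering; returns the chosen cluster centers.

    Points are visited in the given order, so the result depends on that order.
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) >= separation for c in centers):
            centers.append(p)
    return centers

# Three points on a line, 20 pixels apart.
print(len(greedy_cluster([(0, 0), (20, 0), (40, 0)])))  # starts at an end point
print(len(greedy_cluster([(20, 0), (0, 0), (40, 0)])))  # starts at the middle point
```

Starting from an end point keeps two centers, while starting from the middle point keeps only one, because the middle point is within 27.5 pixels of both neighbors. Unlike the reduction described above, this stand-in does not shrink any further when it is run again on its own output.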

Step 5: Set up for 5-fold validation
Positive points (uniqueConvergedSIFTpointsLT10) = 608 and negative points (NegativesClusterCenters4) = 4625, both taken from the PositiveNegSIFTPoints.mat file.

The function /usr/sci/crcnsdata/CRCNS/Synapses/Code/Matlab/kmeans/createFolds.m creates the folds for the data.
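For readers without access to createFolds.m, here is a rough Python sketch of what a 5-fold split could look like. It is not a port of the Matlab function; the names create_folds and train_test_split, the shuffling, and the fixed seed are all assumptions made for this example.

```python
import random

def create_folds(items, k=5, seed=0):
    """Shuffle the items and deal them into k folds of nearly equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def train_test_split(folds, test_index):
    """Use one fold for testing and the remaining folds for training."""
    test = folds[test_index]
    train = [x for i, fold in enumerate(folds) if i != test_index for x in fold]
    return train, test

# Fold the positives and negatives separately so each fold keeps the class ratio.
positive_folds = create_folds(range(608), k=5)
negative_folds = create_folds(range(4625), k=5)
```

Folding the 608 positives and 4625 negatives separately keeps roughly the same positive-to-negative ratio in every fold.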

Step 6: Circular Region Extraction
The entire setup for the experiment is done by the function "generateFeatures('synapse1-5fold')". The testing for the individual folds can be done by running the scripts foldXXRun (XX = 1,2,3,4,5). The results of the individual fold runs are found in foldXXResults.

Test Results:
A quantitative examination of the results is shown in the bar graph of confusion matrix entries for all folds.
Qualitative analysis:
For the qualitative analysis the test patches for the 5 folds have been extracted and stored in an image.