Abstract – Finding the best features for classification in high-dimensional cancer microarray data is challenging. This paper proposes a new algorithm based on iterative qualitative mutual information that selects an optimal feature set with reliability, stability, and the best classification results. It computes a qualitative (i.e. utility) score for each feature with the help of the Random Forest algorithm and combines it with the mutual information between each feature and the class variable. Adding a qualitative measure to mutual information can improve robustness and help identify redundant features in the data. The proposed algorithm is compared with other representative methods on ten microarray-based cancer datasets in terms of the number of selected features and the classification accuracy of three well-known classifiers: Naïve Bayes, IB1 and C4.5. Experimental results show that the proposed approach is effective in producing an optimal feature subset and improves classification accuracy on these datasets.
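The idea of pairing a utility score with mutual information can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes features are already discretized, that the utility is the importance score normalized to sum to one, and that the two quantities are combined by a simple product. The function names, the toy data, and the importance values (standing in for Random Forest output) are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """I(X; C) in bits for a discretized feature and its class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    pc = Counter(labels)
    return sum(
        (nxc / n) * log2(nxc * n / (px[x] * pc[c]))
        for (x, c), nxc in joint.items()
    )


def share_of_preference(importances):
    """Normalize raw importance scores so they sum to 1 (assumed form)."""
    total = sum(importances.values())
    return {name: score / total for name, score in importances.items()}


def qualitative_scores(features, labels, importances):
    """Utility-weighted mutual information per feature (assumed: a product)."""
    utility = share_of_preference(importances)
    return {
        name: utility[name] * mutual_information(values, labels)
        for name, values in features.items()
    }


# Toy, made-up data: three discretized "genes" over six samples.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1],  # perfectly tracks the class
    "g2": [0, 1, 0, 1, 0, 1],  # only weakly related to the class
    "g3": [1, 1, 1, 1, 1, 1],  # constant, carries no information
}
# Hypothetical importance scores, standing in for Random Forest output.
importances = {"g1": 0.6, "g2": 0.3, "g3": 0.1}

scores = qualitative_scores(features, labels, importances)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy case the constant feature receives a score of zero regardless of its importance value, which is the kind of filtering a combined measure is meant to provide.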

Introduction – Much of the data available today is high-dimensional, and this has become a significant issue. Available datasets are large in terms of both the number of features per sample and the number of samples. A great deal of research addresses high-dimensional microarray data. Microarray datasets characteristically have a large number of genes but very few samples, which makes the application of classification methods unsound. Many researchers have shown that high-dimensional datasets contain redundant and irrelevant genes and that these redundant genes greatly affect classifier performance. Feature selection has therefore become an important task for microarray data, both to improve classifier performance and to remove redundant features.

Table 1 – Dataset Description 

Dataset      Samples  Original genes  Preprocessed genes  Classes
Colon_1      37       22883           8826                2
Prostate     102      12600           5966                2
Breast       97       24482           5000                2
Colon        62       2000            2000                2
SRBCT        83       2308            2308                4
Endometrium  42       8872            3000                4
Leukemia     72       7129            7129                3
Melanoma     38       8076            8076                3
CNS-v1       34       7129            2277                2
Lung         32       12533           12533               2

Conclusion – This paper proposes and implements a feature subset selection algorithm for microarray data. Algorithm 1 uses a qualitative mutual information measure to remove irrelevant as well as redundant features from the data. It also yields a robust and stable gene subset. The Random Forest algorithm is used within the proposed algorithm to compute an importance score for each feature. This importance score is useful because it helps capture correlation as well as interaction between features. Using this property, the paper calculates the share of preference for each variable, which serves as the utility measure alongside the mutual information between each variable and the class/target. The algorithm also addresses a shortcoming of Random Forest by balancing each dataset before passing it to the Random Forest algorithm. Classification results on ten microarray datasets show that Algorithm 1 improves accuracy compared with other feature selection methods on seven of the ten datasets. For two of the datasets, melanoma and leukemia, the proposed RFST algorithm performs better. Algorithm 1 is also effective in producing a reliable gene subset. The Friedman statistical test applied to the algorithms also shows that Algorithm 1 is significantly better than the other feature selection algorithms. Compared with the latest gene selection algorithms, Algorithm 1 is on par with them.
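For readers unfamiliar with the Friedman test, the following minimal sketch shows how its statistic is obtained from a table of per-dataset accuracies: each method is ranked within each dataset, and the mean ranks are compared. The function names are hypothetical, the formula omits the tie correction, and the accuracy values are invented toy numbers, not results from this paper.

```python
def average_ranks(row):
    """Rank one dataset's accuracies (1 = best), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square for an N-datasets x k-methods table (no tie correction)."""
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    sum_sq = sum(r * r for r in mean_ranks)
    chi2 = 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


# Toy accuracies (invented for illustration): 4 datasets x 3 methods.
table = [
    [0.90, 0.85, 0.80],
    [0.88, 0.84, 0.79],
    [0.92, 0.80, 0.86],
    [0.75, 0.70, 0.65],
]
chi2, mean_ranks = friedman_statistic(table)
print(chi2, mean_ranks)
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom, where k is the number of methods; a lower mean rank indicates a method that tends to be more accurate across datasets.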