Abstract – Selecting the best features for classification from high-dimensional cancer microarray data is a difficult problem. This paper proposes a new feature selection algorithm based on iterative qualitative mutual information, which aims to produce an optimal feature set that is reliable, stable, and gives the best classification results. The algorithm obtains a qualitative (i.e. utility) score for each feature using the Random Forest algorithm and combines it with the mutual information between that feature and the class variable. Adding a qualitative measure to mutual information can improve robustness and help identify redundant features in the data. The proposed algorithm is compared with other representative methods on ten microarray-based cancer datasets in terms of the number of selected features and the classification accuracy of three well-known classifiers: Naïve Bayes, IB1 and C4.5. Experimental results show that the proposed approach is effective in producing an optimal feature subset and improves classification accuracy on these datasets.
Introduction – Much of the data available today is high-dimensional, and this has become a significant issue. Datasets can be large in terms of both the number of features per sample and the number of samples. High-dimensional microarray data in particular is the subject of much ongoing research. Microarray datasets characteristically have a large number of genes but very few samples, which makes the application of classification methods unsound. Many researchers have shown that high-dimensional datasets contain redundant and irrelevant genes, and that these genes greatly degrade the performance of the classifier. Feature selection has therefore become an important task for microarray data, both to remove redundant features and to improve classifier performance.
Table 1 – Dataset Description
Dataset | Samples | Original genes | Preprocessed genes | Classes |
Colon_1 | 37 | 22883 | 8826 | 2 |
Prostate | 102 | 12600 | 5966 | 2 |
Breast | 97 | 24482 | 5000 | 2 |
Colon | 62 | 2000 | 2000 | 2 |
SRBCT | 83 | 2308 | 2308 | 4 |
Endometrium | 42 | 8872 | 3000 | 4 |
Leukemia | 72 | 7129 | 7129 | 3 |
Melanoma | 38 | 8076 | 8076 | 3 |
CNS-v1 | 34 | 7129 | 2277 | 2 |
Lung | 32 | 12533 | 12533 | 2 |
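To make the "many genes, few samples" characteristic concrete, the following minimal sketch computes the number of preprocessed genes per sample for each dataset in Table 1. The values are transcribed from the table; the names `TABLE_1` and `genes_per_sample` are illustrative and are not part of the proposed algorithm.

```python
# Rows transcribed from Table 1: (dataset, samples, preprocessed genes).
TABLE_1 = [
    ("Colon_1", 37, 8826),
    ("Prostate", 102, 5966),
    ("Breast", 97, 5000),
    ("Colon", 62, 2000),
    ("SRBCT", 83, 2308),
    ("Endometrium", 42, 3000),
    ("Leukemia", 72, 7129),
    ("Melanoma", 38, 8076),
    ("CNS-v1", 34, 2277),
    ("Lung", 32, 12533),
]


def genes_per_sample(rows):
    """Return {dataset: preprocessed genes / samples} (illustrative helper)."""
    return {name: genes / samples for name, samples, genes in rows}


ratios = genes_per_sample(TABLE_1)

# List the datasets from the most to the least skewed gene-to-sample ratio.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {ratio:8.1f} genes per sample")
```

Even the least skewed dataset in the table (SRBCT) has roughly 28 genes per sample, and the most skewed (Lung) has roughly 392, so the number of features exceeds the number of samples by a wide margin in every dataset.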
Conclusion – This paper proposes and implements a feature subset selection algorithm for microarray data. Algorithm 1 uses a qualitative mutual information measure to remove both irrelevant and redundant features from the data, and it produces a robust and stable gene subset. The Random Forest algorithm is used within the proposed algorithm to compute an importance score for each feature. This importance score is used because it helps capture both the correlation and the interaction between features. Using this property, the paper calculates a share of preference for each variable, which serves as the utility measure alongside the mutual information between each variable and the class/target. The algorithm also addresses a shortcoming of Random Forest by balancing each dataset before passing it to the Random Forest algorithm. Classification results on the ten microarray datasets show that Algorithm 1 improves accuracy compared with other feature selection methods on seven of the ten datasets. For two of the datasets, melanoma and leukemia, the proposed RFST algorithm performs better. Algorithm 1 is also effective in producing a reliable gene subset. The Friedman statistical test applied to the algorithms also shows that Algorithm 1 is significantly better than the other feature selection algorithms. Compared with the latest gene selection algorithms, Algorithm 1 is on par with them.
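The combination of a utility score with mutual information summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the importance scores are hypothetical placeholders rather than output of an actual Random Forest, features are assumed to be already discretized, share of preference is assumed to be each importance divided by the sum of all importances, and the two quantities are assumed to be combined by multiplication. The dataset balancing step and the iterative selection loop are not shown.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information I(F; C) in bits for a discretized feature."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_f = Counter(feature)
    count_c = Counter(labels)
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        # p(f, c) / (p(f) * p(c)) written in terms of raw counts
        mi += p_fc * math.log2(count * n / (count_f[f] * count_c[c]))
    return mi


def share_of_preference(importances):
    """Assumption: utility = importance normalized to sum to 1."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return {name: score / total for name, score in importances.items()}


def qualitative_mi(features, labels, importances):
    """Assumption: score = utility * mutual information with the class."""
    utility = share_of_preference(importances)
    return {
        name: utility[name] * mutual_information(values, labels)
        for name, values in features.items()
    }


# Toy data: six samples, two classes, three discretized "genes".
labels = [0, 0, 0, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1],  # perfectly aligned with the class
    "g2": [0, 1, 0, 1, 0, 1],  # weakly related to the class
    "g3": [0, 0, 0, 0, 0, 0],  # constant, carries no information
}
# Hypothetical importance scores standing in for Random Forest output.
importances = {"g1": 6.0, "g2": 3.0, "g3": 1.0}

scores = qualitative_mi(features, labels, importances)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy example the constant feature receives a score of zero regardless of its importance, because its mutual information with the class is zero, while the utility weight separates features whose mutual information alone might be similar.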