University of Ottawa - Carleton University
Ottawa-Carleton Institute for Computer Science (OCICS) Presentation
November 9, 2012 @ 10:00 a.m.
Machine Learning from Imbalanced Medical Data Sets

Speaker: Sabo Yanosh
Location: 3101 CB (Canal Building)
ABSTRACT

Most data mining classification algorithms assume that the data sets they are applied to are balanced and that misclassification costs are equal; real-world data, however, are usually imbalanced, with one class dominating the other(s). Therefore, when traditional classification algorithms are applied to these complex imbalanced data sets, they fail to represent the distributive features of the data satisfactorily and, as a result, do not provide adequate accuracy across the classes. Any data set in which the classes are distributed unequally can be called imbalanced. Typical domains where the imbalance problem occurs include bioinformatics, fraud detection, medical diagnosis, and risk management. When applied to imbalanced data, traditional data mining classification methods fail to predict the crucial minority class successfully because they are overwhelmed by the majority-class examples (they become too specific and insensitive to the minority-class examples). Since classification approaches aim to maximize accuracy and minimize the error rate, an unequal class distribution in the training data prevents a classification model from performing well.
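The accuracy pitfall above can be sketched with a toy example (not from the talk): a degenerate classifier that always predicts the majority "healthy" class scores high accuracy on an imbalanced clinical data set while detecting none of the "sick" patients.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary label list."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

# 95 "healthy" (0) vs 5 "sick" (1) patients; the model always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)             # 0.95 -- looks excellent
recall = tp / (tp + fn) if tp + fn else 0.0    # 0.0  -- misses every "sick" patient
print(accuracy, recall)
```

This is why class-sensitive metrics such as recall (and precision, F-measure, or AUC) are preferred over plain accuracy when the classes are imbalanced.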
Clinical data sets are mostly composed of examples of "healthy" patients with only a small percentage of "sick" ones, which leads to the class imbalance problem. In such cases, training a classifier on all the data as-is will normally bias learning toward the majority class. To deal with this issue, different strategies can be employed: over-sampling the minority class and under-sampling the majority one to balance the data sets. Furthermore, after balancing the class sizes, ensemble machine learning methods are introduced that leverage the power of multiple models to achieve better prediction accuracy than any individual model could on its own. The basic goal when designing an ensemble is to achieve better accuracy. Different data sets are employed to illustrate the approaches presented in the project, and various evaluation metrics are used to evaluate the results produced by the classification models.
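The simplest form of the over-sampling strategy described above is random over-sampling: duplicate minority-class examples (drawn with replacement) until both classes have equal size. A minimal sketch, assuming a binary-labelled data set held in parallel Python lists (the talk does not specify which sampling variant is used):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Balance a data set by duplicating minority-class examples at random.

    X: list of feature vectors; y: parallel list of class labels.
    Returns a new (X, y) in which the minority class has been over-sampled
    until both classes are the same size.
    """
    rng = random.Random(seed)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi != minority_label]
    # Draw (with replacement) enough extra minority examples to match the majority.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    Xb, yb = zip(*balanced)
    return list(Xb), list(yb)

# 90 "healthy" (0) vs 10 "sick" (1) patients -> balanced 90/90
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
Xb, yb = random_oversample(X, y, minority_label=1)
print(yb.count(0), yb.count(1))  # 90 90
```

Random under-sampling is the mirror image: discard randomly chosen majority-class examples instead of duplicating minority ones.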
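One common way to combine the multiple models in an ensemble is plurality voting over their predicted labels; a minimal sketch (the hypothetical per-model prediction lists here are illustrative, not results from the project):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by plurality vote.

    predictions: list of per-model prediction lists, all the same length.
    Returns one prediction list, taking the most common label per example.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Three hypothetical models disagree on some examples; the vote decides.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # [1, 0, 1, 1]
```

Voting helps because independent models tend to make different mistakes, so an error by one model can be outvoted by the others.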