Data Science, Machine Learning & Artificial Intelligence: classification


Sunday, April 2, 2017

Python function to compute K-category correlation coefficient

I created a Python function to compute the K-category correlation coefficient. K-category correlation is a measure of classification performance and may be considered a multiclass generalization of the Matthews correlation coefficient.

Python function to compute K-category correlation coefficient at GitHub.
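For readers who want to see the idea without following the link, here is a minimal sketch of the coefficient written from the confusion-matrix form of the formula. It is not the function linked above; the name `k_category_correlation` and the choice to return 0.0 when the denominator is zero are my own assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def k_category_correlation(y_true, y_pred):
    """Sketch of the K-category correlation coefficient (R_K).

    Hypothetical helper, not the linked GitHub function.
    y_true and y_pred are equal-length sequences of category labels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)                                          # total number of samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)     # correctly assigned samples
    t_counts = Counter(y_true)                               # how often each category truly occurs
    p_counts = Counter(y_pred)                               # how often each category is predicted

    labels = set(t_counts) | set(p_counts)
    sum_pt = sum(p_counts[k] * t_counts[k] for k in labels)
    sum_pp = sum(v * v for v in p_counts.values())
    sum_tt = sum(v * v for v in t_counts.values())

    denom = sqrt((s * s - sum_pp) * (s * s - sum_tt))
    if denom == 0:
        # Assumption: treat the undefined case (only one category present) as 0.
        return 0.0
    return (c * s - sum_pt) / denom


# Three-category usage example (labels are made up for illustration).
truth = ["paired", "paired", "unpaired", "unpaired", "unknown", "unknown"]
guess = ["paired", "unpaired", "unpaired", "unpaired", "unknown", "paired"]
print(k_category_correlation(truth, guess))
```

With only two categories this expression reduces to the ordinary Matthews correlation coefficient, which is a handy sanity check.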

Academic paper:

Comparing two K-category assignments by a K-category correlation coefficient

Abstract

Predicted assignments of biological sequences are often evaluated by Matthews correlation coefficient. However, Matthews correlation coefficient applies only to cases where the assignments belong to two categories, and cases with more than two categories are often artificially forced into two categories by considering what belongs and what does not belong to one of the categories, leading to the loss of information. Here, an extended correlation coefficient that applies to K-categories is proposed, and this measure is shown to be highly applicable for evaluating prediction of RNA secondary structure in cases where some predicted pairs go into the category “unknown” due to lack of reliability in predicted pairs or unpaired residues. Hence, predicting base pairs of RNA secondary structure can be a three-category problem. The measure is further shown to be well in agreement with existing performance measures used for ranking protein secondary structure predictions.

The paper author's server and software are available at http://rk.kvl.dk/

Monday, September 29, 2014

AdaBoost

Note: AdaBoost is very sensitive to mislabeled samples in your data. For example, if you are trying to classify transactions as either "fraud" or "not fraud" and even a few of them are mislabeled, the classifier can put more and more weight on those bad samples, overfit to them, and perform poorly. There are other versions of boosting algorithms that try to overcome this, but if you cannot be sure of the labels in your data, consider using some other method.

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the prestigious "Gödel Prize" in 2003 for their work. It can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems, however, it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (i.e., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner.

While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder to classify examples.

http://en.wikipedia.org/wiki/AdaBoost
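To make the description above concrete, here is a minimal sketch of discrete AdaBoost with one-feature threshold "stumps" as the weak learners. It is a toy written for this post, not any of the implementations linked here; the function names (`train_stump`, `adaboost_fit`, `adaboost_predict`) and the tiny dataset are assumptions for illustration. Labels must be +1 or -1.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity


def adaboost_fit(X, y, n_rounds=10):
    """Fit a list of (alpha, stump) pairs; y must contain only +1 / -1."""
    n = len(X)
    w = [1.0 / n] * n                       # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)          # avoid division by zero for a perfect stump
        if err >= 0.5:                      # weak learner is no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single threshold can separate.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(model, row) for row in X])
```

The weight update is also where the label-noise problem from the note at the top comes from: a mislabeled sample keeps getting misclassified, so its weight keeps growing in later rounds.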

Very nice AdaBoost slide deck:

http://cmp.felk.cvut.cz/~sochmj1/adaboost_talk.pdf

Matlab and C++ implementations:

http://graphics.cs.msu.ru/en/science/research/machinelearning/adaboosttoolbox

Friday, September 19, 2014

Random Forest Tutorial

This is a slide deck from a lecture. It is a good introduction to random decision forests (RDFs), with their advantages and disadvantages compared with other methods.

www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf

Wednesday, September 17, 2014

Decision Forests for Computer Vision and Medical Image Analysis book from Microsoft Research

This is the best resource I have found for understanding and using decision-tree-based machine learning algorithms. It is very thorough on both theory and practical use, with comparisons with other algorithms such as SVMs and AdaBoost.

There are also a bunch of supplemental materials available for free including a very nice PowerPoint with great explanations. The supplemental materials include C++ and C# code.

Decision Forests - Microsoft Research

Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning

This technical report from Microsoft Research is an A-to-Z tutorial on how decision tree machine learning algorithms work. It includes in-depth explanations of random forests, extra-trees classifiers, random ferns and other variations for both classification and regression.

It is in report format and compares decision forests to other types of machine learning algorithms such as SVMs. Some simple toy problems cover the basics, and some real-life applications such as body position recognition and medical image analysis are included.

There is also an accompanying PowerPoint with some nice animations.

http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf (link no longer available)

Alternatives to support vector machines in neuroimaging ensembles of decision trees for classification and information mapping with predictive models

This is a nice tutorial on using random decision forests to classify medical images. There is a comparison with some other methods, especially SVMs.

http://web.stanford.edu/~richiard/slides/PRNI2013Tutorial_export.pdf (link no longer available)

Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

This paper has a good discussion of the use of machine learning for intrusion detection and network security.

In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

IEEE Xplore Full-Text HTML : Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

Support Vector Machines (SVMs) organization

This site is a good compilation of everything related to SVMs. There are links to many academic papers, tutorials, applications and much more. There are also links for learning the mathematics necessary to really understand how SVMs work. I am impressed that the site creators are not just cheerleaders for SVMs: they do a good job of stating the advantages and disadvantages of SVMs as well as comparing them to competing machine learning methods.

Support Vector Machines

Texturecam: Autonomous Image Analysis For Astrobiology Survey

This is a paper about a project to include software on robotic rover spacecraft that uses a random forest algorithm to allow the rover to autonomously classify rocks by texture. This helps the rover to search for signs of life.

ml.jpl.nasa.gov/papers/thompson/thompson-2012-lpsc.pdf

LIBSVM -- A Library for Support Vector Machines

This is a library for support vector machines (SVMs). It has interfaces for many different languages and environments, including C++, Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, OCaml, LabVIEW, and PHP. C# .NET code and a CUDA extension are also available.

If you are new to SVMs, there is a cool Java applet and a JavaScript toy that will show you how they work.

LIBSVM -- A Library for Support Vector Machines

DTREG SVM - Support Vector Machines

This is a commercial machine learning package that I have not used. The linked page contains a very good explanation of how support vector machines (SVMs) work.

SVM - Support Vector Machines

Exploiting tree-based variable importances to selectively identify relevant variables

This paper proposes a novel statistical procedure based on permutation tests for extracting a subset of truly relevant variables from multivariate importance rankings derived from tree-based supervised learning methods. It shows also that the direct extension of the classical approach based on permutation tests for estimating false discovery rates of univariate variable scoring procedures does not extend very well to the case of multivariate tree-based importance measures.

jmlr.org/proceedings/papers/v4/huynhthu08a/huynhthu08a.pdf

Large-scale prediction of long disordered regions in proteins using random forests

Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.

BMC Bioinformatics | Full text | Large-scale prediction of long disordered regions in proteins using random forests

Extremely randomized trees by Pierre Geurts, Damien Ernst, and Louis Wehenkel

This paper proposes a new tree-based ensemble method for supervised classification and regression problems. It essentially consists of randomizing strongly both attribute and cut-point choice while splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter, and we also provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.

orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf
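The node-splitting idea in the abstract above (random attributes, random cut-points, one tunable parameter) can be sketched roughly as follows. This is only an illustration under my own assumptions, not the authors' code: the function name `extra_trees_split` is made up, `k` stands in for the paper's randomization parameter, and weighted Gini impurity is swapped in as the split score for simplicity.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def extra_trees_split(X, y, k, rng):
    """Pick a split for one node; returns (score, feature, cut) or None.

    Hypothetical sketch: draw up to k non-constant features at random, draw one
    uniform random cut-point per feature, and keep the candidate with the
    lowest weighted Gini impurity (an assumed stand-in for the paper's score).
    """
    n_features = len(X[0])
    candidates = [j for j in range(n_features)
                  if min(row[j] for row in X) < max(row[j] for row in X)]
    if not candidates:
        return None                          # nothing left to split on

    best = None
    for j in rng.sample(candidates, min(k, len(candidates))):
        lo = min(row[j] for row in X)
        hi = max(row[j] for row in X)
        cut = rng.uniform(lo, hi)            # the cut-point is random, not optimized
        left = [yi for row, yi in zip(X, y) if row[j] < cut]
        right = [yi for row, yi in zip(X, y) if row[j] >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, cut)
    return best


# Usage: two features, the second one carries no information.
X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y = [0, 0, 1, 1]
print(extra_trees_split(X, y, k=2, rng=random.Random(0)))
```

With `k=1` the chosen feature and cut-point never look at `y`, which corresponds to the "totally randomized trees" extreme mentioned in the abstract; larger `k` lets the labels influence which random candidate wins.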