Data Science, Machine Learning & Artificial Intelligence: RDF

Thursday, December 4, 2014

Visualizing decision trees in scikit-learn

For single decision trees:

http://scikit-learn.org/dev/modules/tree.html

http://stackoverflow.com/questions/10570042/visualizing-a-decision-tree-example-from-scikit-learn
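
As a rough illustration of what those links describe, here is a minimal sketch that fits a small tree and exports it in Graphviz DOT format. It assumes a reasonably recent scikit-learn, where `export_graphviz` returns the DOT source as a string when `out_file=None`. The iris dataset and the `max_depth=3` setting are just illustrative choices, not something the linked pages prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Illustrative data and settings (assumptions, chosen only to keep the tree small).
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# With out_file=None the DOT description of the tree is returned as a string.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

# Show the first few lines of the DOT source.
print("\n".join(dot_source.splitlines()[:5]))
```

If the string is saved to a file such as `iris_tree.dot`, the Graphviz command `dot -Tpng iris_tree.dot -o iris_tree.png` should render it as an image, provided Graphviz is installed.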

Hints on how to do it for a Random Forest or Extra Tree classifier:

http://stackoverflow.com/questions/17057139/how-to-find-key-trees-features-from-a-trained-random-forest

http://stackoverflow.com/questions/17362576/random-forest-implementation-in-python
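
A hedged sketch of one way to approach this: a fitted forest in scikit-learn keeps its individual trees in `estimators_` and exposes impurity-based scores in `feature_importances_`, so each tree can be exported on its own and the features can be ranked. The dataset and `n_estimators=10` are assumptions made for the example, and this is not necessarily the approach the linked answers take.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Illustrative data and forest size (assumptions for this sketch).
iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# Rank features by the forest's impurity-based importance scores.
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

# Every member of the ensemble is an ordinary decision tree,
# so each one can be exported to DOT individually.
dot_sources = [
    export_graphviz(tree, out_file=None, feature_names=iris.feature_names)
    for tree in forest.estimators_
]
print(f"exported {len(dot_sources)} trees")
```

The same pattern should work for `ExtraTreesClassifier`, which also stores its fitted trees in `estimators_`.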

Friday, September 19, 2014

Random Forest Tutorial

This is a slide deck from a lecture. It is a good introduction to random decision forests (RDFs), covering their advantages and disadvantages compared with other methods.

www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf

Wednesday, September 17, 2014

Decision Forests for Computer Vision and Medical Image Analysis book from Microsoft Research

This is the best resource I have found for understanding and using decision-tree-based machine learning algorithms. It is very thorough on both theory and practical use, with comparisons to other algorithms such as SVMs and AdaBoost.

A number of supplemental materials are also available for free, including a very nice PowerPoint with great explanations as well as C++ and C# code.

Decision Forests - Microsoft Research

Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning

This technical report from Microsoft Research is an A-Z tutorial on how decision tree machine learning algorithms work. It includes in-depth explanations of random forests, extra tree classifiers, random ferns and other variations for both classification and regression.

It is in report format and compares decision forests to other types of machine learning algorithms such as SVMs. Simple toy problems cover the basics, and real-life applications such as body position recognition and medical image analysis are included.

There is also an accompanying PowerPoint with some nice animations.

http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf

Alternatives to support vector machines in neuroimaging: ensembles of decision trees for classification and information mapping with predictive models

This is a nice tutorial on using random decision forests to classify medical images. It includes a comparison with some other methods, especially SVMs.

http://web.stanford.edu/~richiard/slides/PRNI2013Tutorial_export.pdf

Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey

This journal article discusses the application of various machine learning methods to malware detection and information security.

This research synthesizes a taxonomy for classifying detection methods of new malicious code by Machine Learning (ML) methods based on static features extracted from executables. The taxonomy is then operationalized to classify research on this topic and pinpoint critical open research issues in light of emerging threats. The article addresses various facets of the detection challenge, including: file representation and feature selection methods, classification algorithms, weighting ensembles, as well as the imbalance problem, active learning, and chronological evaluation. From the survey we conclude that a framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining low false positives (i.e. misclassifying benign files as malicious). The framework should include training of multiple classifiers on various types of features (mainly OpCode and byte n-grams and Portable Executable Features), applying weighting algorithm on the classification results of the individual classifiers, as well as an active learning mechanism to maintain high detection accuracy. The training of classifiers should also consider the imbalance problem by generating classifiers that will perform accurately in a real-life situation where the percentage of malicious files among all files is estimated to be approximately 10%.

Nonlinear regression in environmental sciences by support vector machines combined with evolutionary strategy

A hybrid algorithm combining support vector regression with evolutionary strategy (SVR-ES) is proposed for predictive models in the environmental sciences. SVR-ES uses uncorrelated mutation with p step sizes to find the optimal SVR hyper-parameters. Three environmental forecast datasets used in the WCCI-2006 contest – surface air temperature, precipitation and sulphur dioxide concentration – were tested. We used multiple linear regression (MLR) as benchmark and a variety of machine learning techniques including bootstrap-aggregated ensemble artificial neural network (ANN), SVR-ES, SVR with hyper-parameters given by the Cherkassky–Ma estimate, the M5 regression tree, and random forest (RF). We also tested all techniques using stepwise linear regression (SLR) first to screen out irrelevant predictors. We concluded that SVR-ES is an attractive approach because it tends to outperform the other techniques and can also be implemented in an almost automatic way. The Cherkassky–Ma estimate is a useful approach for minimizing the mean absolute error and saving computational time related to the hyper-parameter search. The ANN and RF are also good options to outperform multiple linear regression (MLR). Finally, the use of SLR for predictor selection can dramatically reduce computational time and often help to enhance accuracy.
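
To make "uncorrelated mutation with p step sizes" a little more concrete, here is a minimal sketch of that evolution-strategy operator: each parameter carries its own step size, the step sizes are perturbed log-normally, and the parameters are then perturbed with the new step sizes. The learning rates follow a common textbook choice, and the three log-scaled hyper-parameters, the `toy_error` objective and the simple elitist loop are all hypothetical stand-ins. The abstract does not give the paper's actual settings, and a real run would replace `toy_error` with a cross-validated SVR error.

```python
import math
import random


def mutate(params, sigmas, rng):
    """Uncorrelated mutation with one step size per parameter."""
    p = len(params)
    tau_global = 1.0 / math.sqrt(2.0 * p)            # learning rate shared by all coordinates
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(p))  # per-coordinate learning rate
    shared = rng.gauss(0.0, 1.0)                     # one draw reused for every step size
    new_sigmas = [
        max(s * math.exp(tau_global * shared + tau_local * rng.gauss(0.0, 1.0)), 1e-8)
        for s in sigmas
    ]
    new_params = [x + s * rng.gauss(0.0, 1.0) for x, s in zip(params, new_sigmas)]
    return new_params, new_sigmas


def toy_error(log_params):
    """Hypothetical stand-in for a cross-validated SVR error (not from the paper)."""
    target = [1.0, -2.0, 0.5]
    return sum((x - t) ** 2 for x, t in zip(log_params, target))


def evolve(generations=200, offspring=10, seed=0):
    """Elitist search: a child replaces the parent only if its error is lower."""
    rng = random.Random(seed)
    best, sigmas = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
    best_err = toy_error(best)
    for _ in range(generations):
        for _ in range(offspring):
            child, child_sigmas = mutate(best, sigmas, rng)
            err = toy_error(child)
            if err < best_err:
                best, sigmas, best_err = child, child_sigmas, err
    return best, best_err


best, best_err = evolve()
print(best, best_err)
```

Because the step sizes are inherited along with the parameters, the search can shrink or grow each coordinate's step independently as it goes, which is the point of keeping p separate step sizes rather than a single one.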

Texturecam: Autonomous Image Analysis For Astrobiology Survey

This paper describes a project to put software on robotic rover spacecraft that uses a random forest algorithm to classify rocks by texture autonomously. This helps the rover search for signs of life.

ml.jpl.nasa.gov/papers/thompson/thompson-2012-lpsc.pdf

Large-scale prediction of long disordered regions in proteins using random forests

Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.

BMC Bioinformatics | Full text | Large-scale prediction of long disordered regions in proteins using random forests