Sunday, March 11, 2018

Principal Component Analysis (PCA)

This is a collection of links for Principal Component Analysis (PCA) that I have found useful. I plan to update this post in the future.


PCA Must Read:

This answer by amoeba on Cross Validated, the statistics site at StackExchange, is not only the best PCA explanation I've seen; it is the best answer I have ever seen on any forum about anything. It offers explanations at several levels, from complete layman to mathematician.

https://stats.stackexchange.com/a/140579/39873

Keep in mind that PCA does not always work! Video
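If you want to experiment alongside the reading, here is a minimal scikit-learn sketch (the toy data and sizes are arbitrary, purely for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 samples of 5 correlated features (rank 2 by construction).
rng = np.random.RandomState(0)
X = rng.randn(200, 2) @ rng.randn(2, 5)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # project onto the top 2 components
print(pca.explained_variance_ratio_)  # fraction of variance each captures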


PCA Articles:


PCA Code:


PCA Useful Forum Discussions:


Sunday, April 2, 2017

Python function to compute K-category correlation coefficient

I created a Python function to compute the K-category correlation coefficient. K-category correlation is a measure of classification performance and may be considered a multiclass generalization of the Matthews correlation coefficient.

Python function to compute the K-category correlation coefficient at GitHub.
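For reference, below is a minimal sketch of the statistic as defined in the paper (Gorodkin's R_K, computed from the confusion matrix). The function name and layout here are my own and may differ from the GitHub version; it assumes integer class labels 0..K-1.

import numpy as np

def k_category_correlation(y_true, y_pred):
    """Gorodkin's R_K: a K-category generalization of the Matthews
    correlation coefficient, computed from the confusion matrix."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1  # assumes labels 0..K-1
    # C[i, j] counts samples whose true class is i and predicted class is j.
    C = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    s = C.sum()          # total samples
    c = np.trace(C)      # correctly classified samples
    t_k = C.sum(axis=1)  # how often each class truly occurs
    p_k = C.sum(axis=0)  # how often each class is predicted
    numerator = c * s - t_k @ p_k
    denominator = np.sqrt((s**2 - p_k @ p_k) * (s**2 - t_k @ t_k))
    return numerator / denominator if denominator else 0.0

# Example: three classes, one misclassification.
print(k_category_correlation([0, 1, 2, 2, 1, 0],
                             [0, 2, 2, 2, 1, 0]))  # about 0.78

Newer scikit-learn releases compute the same multiclass generalization in matthews_corrcoef, which is handy for cross-checking.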


Academic paper:

Comparing two K-category assignments by a K-category correlation coefficient

Abstract

Predicted assignments of biological sequences are often evaluated by Matthews correlation coefficient. However, Matthews correlation coefficient applies only to cases where the assignments belong to two categories, and cases with more than two categories are often artificially forced into two categories by considering what belongs and what does not belong to one of the categories, leading to the loss of information. Here, an extended correlation coefficient that applies to K-categories is proposed, and this measure is shown to be highly applicable for evaluating prediction of RNA secondary structure in cases where some predicted pairs go into the category “unknown” due to lack of reliability in predicted pairs or unpaired residues. Hence, predicting base pairs of RNA secondary structure can be a three-category problem. The measure is further shown to be well in agreement with existing performance measures used for ranking protein secondary structure predictions.


The paper author's server and software are available at http://rk.kvl.dk/

Saturday, April 1, 2017

Demo of Blended Model Machine Learning Technique

I started with Emanuele's code and switched to data generated with scikit-learn's make_classification generator. I also added a Jupyter notebook blending demo: https://github.com/denson/kaggle_pbr

The general concept is that if we build multiple different models trained on different samples of our training data, we get multiple predictions that are each substantially better than chance and largely uncorrelated with each other.

In step 1 we take stratified K-fold samples of our training data and build multiple models (in this case RDF-entropy, RDF-gini, ET-entropy, ET-gini, and GBT: random decision forests and extremely randomized trees with entropy and Gini splitting criteria, plus gradient boosted trees) on each fold. We then use the trained models to predict the part of the training sample held out of that fold. It is critical that you never use a model to predict training data that was used to train that model on that fold. We also predict all of the test data with each model. These predictions transform the training data and the test data into a different space, with the predicted probabilities as the transformed information. We take a simple average of the predictions of each type of model (e.g., RDF-gini), and that becomes the transformed data for the next step. With five different models, as here, the input data for step 2 has five columns and the same number of rows as the training set and test set respectively.

In step 2 we train a logistic regression on the transformed training data and use it to predict the transformed test data. We take the predicted probabilities from the logistic regression as our final answer.
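Here is a minimal sketch of the two-step procedure on synthetic data. It uses current scikit-learn APIs rather than the original kaggle_pbr code, and the model settings and fold count are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.ensemble import (RandomForestClassifier,
                              ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a "hard" problem.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0),
    RandomForestClassifier(n_estimators=100, criterion="gini", random_state=1),
    ExtraTreesClassifier(n_estimators=100, criterion="entropy", random_state=2),
    ExtraTreesClassifier(n_estimators=100, criterion="gini", random_state=3),
    GradientBoostingClassifier(random_state=4),
]

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)

# Step 1: one column of transformed data per base model.
train_meta = np.zeros((X_train.shape[0], len(base_models)))
test_meta = np.zeros((X_test.shape[0], len(base_models)))

for j, model in enumerate(base_models):
    test_fold_preds = np.zeros((X_test.shape[0], n_folds))
    for i, (fit_idx, oof_idx) in enumerate(skf.split(X_train, y_train)):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Predict only the held-out part of this fold: a model never
        # scores rows it was trained on.
        train_meta[oof_idx, j] = model.predict_proba(X_train[oof_idx])[:, 1]
        test_fold_preds[:, i] = model.predict_proba(X_test)[:, 1]
    # Simple average of the per-fold test predictions for this model type.
    test_meta[:, j] = test_fold_preds.mean(axis=1)

# Step 2: logistic regression on the transformed data.
blender = LogisticRegression()
blender.fit(train_meta, y_train)
final_probs = blender.predict_proba(test_meta)[:, 1]
print("Blended accuracy:", blender.score(test_meta, y_test))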

This method usually yields an improvement over a single highly tuned model on "hard" problems but not on "simple" ones. By hard I mean that the decision boundary between classes is highly non-linear; overlapping classes and non-linear relationships between features both contribute to making a problem hard.

This academic paper describes the concept:

Stacked Regressions

I found this at Kaggle:

Kaggle competition question


Tuesday, August 25, 2015

3 Wrong Ways to Store a Password (And 5 code samples doing it right)

Mainly for web development but useful in other contexts as well.

http://adambard.com/blog/3-wrong-ways-to-store-a-password/
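For a taste of the "doing it right" side, here is a minimal Python sketch of salted, slow password hashing using only the standard library. The linked article's samples cover bcrypt and friends in more depth, and the iteration count here is illustrative:

import hashlib, hmac, os

def hash_password(password, salt=None):
    # A fresh random salt per password defeats precomputed (rainbow) tables.
    if salt is None:
        salt = os.urandom(16)
    # PBKDF2 with many iterations makes brute-force guessing expensive.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, expected):
    _, digest = hash_password(password, salt)
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(digest, expected)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True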

Choosing Colormaps

From the matplotlib documentation:


The idea behind choosing a good colormap is to find a good representation in 3D colorspace for your data set. The best colormap for any given data set depends on many things including:
  • Whether representing form or metric data 
  • Your knowledge of the data set (e.g., is there a critical value from which the other values deviate?)
  • If there is an intuitive color scheme for the parameter you are plotting
  • If there is a standard in the field the audience may be expecting

For many applications, a perceptual colormap is the best choice — one in which equal steps in data are perceived as equal steps in the color space. Researchers have found that the human brain perceives changes in the lightness parameter as changes in the data much better than, for example, changes in hue. Therefore, colormaps which have monotonically increasing lightness through the colormap will be better interpreted by the viewer.

http://matplotlib.org/users/colormaps.html
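As a quick illustration, matplotlib's viridis is one colormap with monotonically increasing lightness. A minimal sketch (the data is a toy Gaussian bump):

import numpy as np
import matplotlib.pyplot as plt

# Toy 2-D field to visualize.
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-(x**2 + y**2))

# viridis has monotonically increasing lightness, so equal steps in the
# data are perceived as equal steps in color.
plt.imshow(z, cmap="viridis")
plt.colorbar()
plt.show()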


More info from IBM:

http://www.research.ibm.com/people/l/lloydt/color/color.HTM