Data Science, Machine Learning & Artificial Intelligence

Principal Component Analysis (PCA)

2018-03-11T06:48:00.000-07:00

This is a collection of links for Principal Component Analysis (PCA) that I have found useful. I plan to update this post in the future.

PCA Must Read:

This answer by amoeba on the stats forum at StackExchange is not only the best PCA explanation I've seen, it is the best answer I have ever seen on any forum about anything. It offers different levels of explanation, from complete layman to mathematician.

https://stats.stackexchange.com/a/140579/39873

Keep in mind that PCA does not always work! Video

PCA Articles:

PCA Code:
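
As a rough illustration of what PCA computes, here is a minimal sketch using NumPy's SVD. It is not taken from any of the linked resources; the function name `pca` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # PCA works on mean-centered features.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Squared singular values are proportional to the variance along each axis.
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = X_centered @ components.T
    return scores, components, ratio[:n_components]

# Synthetic 2-D data stretched along the diagonal (illustration only).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, t + 0.1 * rng.normal(size=500)])

scores, components, ratio = pca(X, n_components=1)
print("first principal axis:", components[0])
print("fraction of variance explained:", ratio[0])
```

On data like this, nearly all of the variance lies along one direction, so a single component is enough. When the interesting structure is not aligned with the directions of largest variance, PCA may not help.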

PCA Useful Forum Discussions:

Python function to compute K-category correlation coefficient

2017-04-02T15:36:00.000-07:00

I created a Python function to compute the K-category correlation coefficient. K-category correlation is a measure of classification performance and may be considered a multiclass generalization of the Matthews correlation coefficient.

Python function to compute K-category correlation coefficient on GitHub.
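
For readers who want to see the idea without opening the repository, here is a minimal sketch. It is not the function from the GitHub link; it uses the confusion-matrix form in which the multiclass Matthews correlation is commonly written, and the name `k_category_correlation` and the choice to return 0.0 when the denominator is zero are assumptions of this example.

```python
from collections import Counter
from math import sqrt

def k_category_correlation(y_true, y_pred):
    """K-category correlation from two equal-length label sequences."""
    labels = set(y_true) | set(y_pred)
    s = len(y_true)                                   # total number of samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)  # correctly predicted
    t_counts = Counter(y_true)                        # times each class truly occurs
    p_counts = Counter(y_pred)                        # times each class is predicted
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    denominator = sqrt(
        (s ** 2 - sum(p_counts[k] ** 2 for k in labels))
        * (s ** 2 - sum(t_counts[k] ** 2 for k in labels))
    )
    # Assumed convention: report 0.0 when one side uses a single class only.
    return numerator / denominator if denominator else 0.0

print(k_category_correlation([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2]))  # 1.0
print(k_category_correlation([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 1, 2]))
```

With only two categories this reduces to the ordinary Matthews correlation coefficient.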

Academic paper:

Comparing two K-category assignments by a K-category correlation coefficient

Abstract

Predicted assignments of biological sequences are often evaluated by Matthews correlation coefficient. However, Matthews correlation coefficient applies only to cases where the assignments belong to two categories, and cases with more than two categories are often artificially forced into two categories by considering what belongs and what does not belong to one of the categories, leading to the loss of information. Here, an extended correlation coefficient that applies to K-categories is proposed, and this measure is shown to be highly applicable for evaluating prediction of RNA secondary structure in cases where some predicted pairs go into the category “unknown” due to lack of reliability in predicted pairs or unpaired residues. Hence, predicting base pairs of RNA secondary structure can be a three-category problem. The measure is further shown to be well in agreement with existing performance measures used for ranking protein secondary structure predictions.

The paper author's server and software are available at http://rk.kvl.dk/

Demo of Blended Model Machine Learning Technique

2017-04-01T07:23:00.000-07:00

I started with Emanuele's code and switched to data generated with scikit-learn's make_classification function. I also added a Jupyter notebook blending demo: https://github.com/denson/kaggle_pbr

The general concept is that if we build multiple different models trained on different samples of our training data we get multiple predictions that are substantially better than chance and that are uncorrelated with each other.

In step 1 we take stratified fold samples of our training data and build multiple models (in this case RDF-entropy, RDF-gini, ET-entropy, ET-gini and GBT) on each fold. We then use the trained models to predict the part of the training sample that was held out of this fold. It is super important that you do not use a given model to predict training data that was used to train that model on that fold. We also predict all the test data with each model. These predictions are a way of transforming the training data and the test data into a different space with the predicted probabilities as the transformed information. For the test data we take a simple average, across folds, of the predictions of each type of model (e.g. RDF-gini), and that becomes the transformed data for the next step. If we have 5 different models, as in this case, our input data for step 2 will have 5 columns and the same number of rows as the training set and test set respectively.

In step 2 we train a logistic regression on the transformed training data and use it to predict the transformed test data. We take the predicted probabilities from the logistic regression as our final answer.
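
The two steps can be sketched roughly as follows with scikit-learn. This is a simplified illustration, not the code from the repository above; the function name `blend`, the dataset sizes and the model settings are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

def blend(X_train, y_train, X_test, models, n_folds=5, seed=0):
    """Step 1: out-of-fold predictions per model. Step 2: logistic regression."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    blend_train = np.zeros((X_train.shape[0], len(models)))
    blend_test = np.zeros((X_test.shape[0], len(models)))
    for j, model in enumerate(models):
        test_preds = np.zeros((X_test.shape[0], n_folds))
        for i, (fit_idx, holdout_idx) in enumerate(skf.split(X_train, y_train)):
            model.fit(X_train[fit_idx], y_train[fit_idx])
            # Only predict rows this fitted model has never seen.
            blend_train[holdout_idx, j] = model.predict_proba(X_train[holdout_idx])[:, 1]
            test_preds[:, i] = model.predict_proba(X_test)[:, 1]
        # Simple average of the test predictions across folds.
        blend_test[:, j] = test_preds.mean(axis=1)
    meta = LogisticRegression()
    meta.fit(blend_train, y_train)
    return blend_train, blend_test, meta.predict_proba(blend_test)[:, 1]

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = [
    RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
    RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0),
    ExtraTreesClassifier(n_estimators=50, criterion="entropy", random_state=0),
    ExtraTreesClassifier(n_estimators=50, criterion="gini", random_state=0),
    GradientBoostingClassifier(random_state=0),
]

blend_train, blend_test, final_proba = blend(X_train, y_train, X_test, models)
print("blended AUC:", roc_auc_score(y_test, final_proba))
```

Each column of `blend_train` and `blend_test` holds one model type's predicted probabilities, so with five models the step 2 input has five columns.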

This method usually results in an improvement over a single highly tuned model for "hard" problems, but not for "simple" ones. By hard I mean that the decision boundary between classes is highly non-linear. Overlapping classes and non-linear relationships between features contribute to making problems hard.

This academic paper describes the concept:

Stacked Regressions

I found this at Kaggle:

Kaggle competition question

A Neural Network in 11 lines of Python

2015-09-04T10:46:00.000-07:00

This is an excellent tutorial on neural networks that does a good job of explaining not only how they work but why they work as well. The Python code is very easy to follow.

There is code for a very simple example and a more advanced one.

https://iamtrask.github.io/2015/07/12/basic-python-network/

see also:

http://denson-data-science.blogspot.com/2015/09/neural-network-step-by-step.html

Neural Network: A Step by Step Backpropagation Example

2015-09-04T10:39:00.000-07:00

This great neural network tutorial goes step-by-step through backpropagation in training a neural network. There is also companion Python code.

http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/comment-page-1/

see also:

http://denson-data-science.blogspot.com/2015/09/a-neural-network-in-11-lines-of-python.html

3 Wrong Ways to Store a Password (And 5 code samples doing it right)

2015-08-25T13:50:00.001-07:00

Mainly for web development but useful in other contexts as well.

http://adambard.com/blog/3-wrong-ways-to-store-a-password/

Choosing Colormaps

2015-08-25T13:46:00.002-07:00

From the matplotlib documentation:

The idea behind choosing a good colormap is to find a good representation in 3D colorspace for your data set. The best colormap for any given data set depends on many things including:

Whether representing form or metric data

Your knowledge of the data set (e.g., is there a critical value from which the other values deviate?)

If there is an intuitive color scheme for the parameter you are plotting

If there is a standard in the field the audience may be expecting

For many applications, a perceptual colormap is the best choice — one in which equal steps in data are perceived as equal steps in the color space. Researchers have found that the human brain perceives changes in the lightness parameter as changes in the data much better than, for example, changes in hue. Therefore, colormaps which have monotonically increasing lightness through the colormap will be better interpreted by the viewer.

http://matplotlib.org/users/colormaps.html

More info from IBM:

http://www.research.ibm.com/people/l/lloydt/color/color.HTM

FastML: a great resource for machine learning

2015-08-25T12:58:00.000-07:00

There are many useful articles at this site. It is useful for everyone from novice to advanced:

http://fastml.com/

How to Select the Correct Encryption Approach

2015-08-25T12:56:00.000-07:00

This article is a pretty good start at selecting an encryption method.

http://www.itbusinessedge.com/articles/how-to-select-the-correct-encryption-approach.html?google_editors_picks=true

Shannon Entropy

2015-08-25T12:54:00.000-07:00

In information theory, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message received. 'Messages' don't have to be text; in this context a 'message' is simply any flow of information. The entropy of the message is its amount of uncertainty; it increases when the message is closer to random, and decreases when it is less random. The idea here is that the less likely (i.e. more random) an event is, the more information it provides when it occurs. This seems backwards at first: it seems like messages which have more structure would contain more information, but this is not true. For example, the message 'aaaaaaaaaa' (which appears to be very structured and not random at all [although in fact it could result from a random process]) contains much less information than the message 'alphabet' (which is somewhat structured, but more random) or even the message 'axraefy6h' (which is very random). In information theory, 'information' doesn't necessarily mean useful information; it simply describes the amount of randomness of the message, so in the example above the first message has the least information and the last message has the most information, even though in everyday terms we would say that the middle message, 'alphabet', contains more information than a stream of random letters. Therefore, we would say in information theory that the first message has low entropy, the second has higher entropy, and the third has the highest entropy.
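
The three example messages can be compared with a few lines of Python. This is a minimal sketch that estimates entropy from the character frequencies within each message, which is an assumption of this example: it treats every character as an independent draw and ignores any structure in their order.

```python
from collections import Counter
from math import log2

def shannon_entropy(message):
    """Entropy in bits per character, from the message's own character counts."""
    counts = Counter(message)
    n = len(message)
    # Sum of p * log2(1 / p) over the distinct characters.
    return sum((c / n) * log2(n / c) for c in counts.values())

for message in ["aaaaaaaaaa", "alphabet", "axraefy6h"]:
    print(f"{message!r}: {shannon_entropy(message):.3f} bits per character")
```

The ordering matches the description: the repeated letter scores 0 bits, 'alphabet' scores 2.75 bits, and the random-looking string scores the highest of the three.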

https://en.wikipedia.org/wiki/Entropy_(information_theory)

Non-technical article:

http://gizmodo.com/if-it-werent-for-this-equation-you-wouldnt-be-here-1719514472?google_editors_picks=true

Contrast Limited Adaptive Histogram Equalization (CLAHE)

2015-08-25T12:45:00.002-07:00

CLAHE is a useful tool for preprocessing images (or video) for computer vision/pattern recognition tasks. It more or less helps you "see" areas of the image that are in shadows.

There are many available implementations of this, but I like the one in OpenCV:

http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_histograms/py_histogram_equalization/py_histogram_equalization.html

Note: it is usually better to convert images to HSV colorspace first.

Before

After

Additional info:

http://fiji.sc/wiki/index.php/Enhance_Local_Contrast_(CLAHE)

Get Much Smarter About Machine Learning in 2 Minutes

2015-07-30T14:07:00.001-07:00

This is a great presentation by Stephanie Yee and Tony Chu. It is targeted at people new to the concept/field of machine learning. There are excellent animations that make things very clear.

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

nolearn and lasagne tutorial

2015-04-07T13:12:00.002-07:00

This short notebook is meant to help you get started with nolearn and lasagne in order to train a neural net and make a submission to the Otto Group Product Classification Challenge.

http://nbviewer.ipython.org/github/ottogroup/kaggle/blob/master/Otto_Group_Competition.ipynb

An Intuitive Explanation of Bayes' Theorem

2015-04-01T13:26:00.000-07:00

This is a great introduction to Bayes' Theorem and strong evidence that a large majority of medical doctors are not scientists.

About 85% of doctors get this problem wrong!

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
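
Plugging the numbers from the problem into Bayes' theorem is a short calculation. The variable names below are my own labels for the three quantities given in the problem.

```python
prior = 0.01                 # P(cancer)
sensitivity = 0.80           # P(positive | cancer)
false_positive_rate = 0.096  # P(positive | no cancer)

# Total probability of a positive mammography.
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: P(cancer | positive).
posterior = sensitivity * prior / p_positive
print(f"{posterior:.1%}")  # 7.8%
```

The answer is low because the 9.6% false positive rate applies to the 99% of women without cancer, so false positives greatly outnumber true positives.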

http://www.yudkowsky.net/rational/bayes

quicksort visualization

2015-03-13T23:41:00.000-07:00

This is a great visualization of the quicksort algorithm:

https://www.youtube.com/watch?v=aXXWXz5rF64

Quicksort (sometimes called partition-exchange sort) is an efficient sorting algorithm, serving as a systematic method for placing the elements of an array in order. Developed by Tony Hoare in 1960, it is still a very commonly used algorithm for sorting. When implemented well, it can be about two or three times faster than its main competitors, merge sort and heapsort.
Quicksort is a comparison sort, meaning that it can sort items of any type for which a "less-than" relation (formally, a total order) is defined. In efficient implementations it is not a stable sort, meaning that the relative order of equal sort items is not preserved. Quicksort can operate in-place on an array, requiring small additional amounts of memory to perform the sorting.
Mathematical analysis of quicksort shows that, on average, the algorithm takes O(n log n) comparisons to sort n items. In the worst case, it makes O(n^2) comparisons, though this behavior is rare.
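
For reference, here is a minimal in-place sketch in Python. It is one of several common variants (random pivot with a Lomuto-style partition), chosen here for readability rather than speed.

```python
import random

def quicksort(items, lo=0, hi=None):
    """Sort items[lo..hi] in place using only the "less-than" comparison."""
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:
        return
    # A random pivot makes the O(n^2) worst case unlikely on any fixed input.
    pivot_index = random.randint(lo, hi)
    items[pivot_index], items[hi] = items[hi], items[pivot_index]
    pivot = items[hi]
    store = lo
    for i in range(lo, hi):
        if items[i] < pivot:
            items[i], items[store] = items[store], items[i]
            store += 1
    # Put the pivot between the smaller and the not-smaller elements.
    items[store], items[hi] = items[hi], items[store]
    quicksort(items, lo, store - 1)
    quicksort(items, store + 1, hi)

data = [5, 2, 9, 1, 5, 6]
quicksort(data)
print(data)  # [1, 2, 5, 5, 6, 9]
```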

http://en.wikipedia.org/wiki/Quicksort

The Halting Problem

2015-03-13T23:34:00.000-07:00

This is a great video that describes the halting problem.

https://www.youtube.com/watch?v=92WHN-pAFCs

What is the halting problem you ask?

In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running or continue to run forever.
Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist. A key part of the proof was a mathematical definition of a computer and program, which became known as a Turing machine; the halting problem is undecidable over Turing machines. It is one of the first examples of a decision problem.
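
The core of the argument can be sketched in Python. This is an informal illustration, not a proof: it assumes programs take no input, and `halts` stands for any candidate decider someone might propose. The names `make_contrarian` and `decider_is_wrong` are made up for this example.

```python
def make_contrarian(halts):
    """Build a program that does the opposite of what `halts` predicts for it."""
    def contrarian():
        if halts(contrarian):
            while True:       # predicted to halt, so loop forever
                pass
        return "halted"       # predicted to loop, so halt immediately
    return contrarian

def decider_is_wrong(halts):
    """Return True if `halts` mispredicts its own contrarian program."""
    contrarian = make_contrarian(halts)
    predicted_to_halt = halts(contrarian)
    # By construction, contrarian halts exactly when the prediction says it won't.
    actually_halts = not predicted_to_halt
    return predicted_to_halt != actually_halts

def always_yes(program):
    return True

def always_no(program):
    return False

print(decider_is_wrong(always_yes), decider_is_wrong(always_no))  # True True
```

Whatever `halts` answers about its contrarian program, the program does the opposite, so no candidate decider can be right on every program.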

http://en.wikipedia.org/wiki/Halting_problem

Easily distributing a parallel IPython Notebook on a cluster

2015-02-18T12:34:00.003-08:00

I haven't tried this yet, but it is high on my todo list:

http://twiecki.github.io/blog/2014/02/24/ipython-nb-cluster/

Natural Language Processing with Python

2015-01-08T13:11:00.002-08:00

This is a book about Natural Language Processing. By "natural language" we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing — or NLP for short — in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.
Online book:
http://www.nltk.org/book/ch00.html

NLTK 3.0 documentation:

http://www.nltk.org/

Python for Data Science

2015-01-08T13:08:00.003-08:00

This short primer on Python is designed to provide a rapid "on-ramp" to enable computer programmers who are already familiar with concepts and constructs in other programming languages to learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.
http://nbviewer.ipython.org/github/gumption/Python_for_Data_Science/blob/master/1_Introduction.ipynb

The Great, Big List of LATEX Symbols

2015-01-08T13:01:00.004-08:00

LaTeX symbols reference:

http://www.rpi.edu/dept/arc/training/latex/LaTeX_symbols.pdf

Enlightening Symbols: A Short History of Mathematical Notation and Its Hidden Powers

2015-01-08T13:00:00.003-08:00

While all of us regularly use basic math symbols such as those for plus, minus, and equals, few of us know that many of these symbols weren't available before the sixteenth century. What did mathematicians rely on for their work before then? And how did mathematical notations evolve into what we know today? In Enlightening Symbols, popular math writer Joseph Mazur explains the fascinating history behind the development of our mathematical notation system. He shows how symbols were used initially, how one symbol replaced another over time, and how written math was conveyed before and after symbols became widely adopted.
Traversing mathematical history and the foundations of numerals in different cultures, Mazur looks at how historians have disagreed over the origins of the numerical system for the past two centuries. He follows the transfigurations of algebra from a rhetorical style to a symbolic one, demonstrating that most algebra before the sixteenth century was written in prose or in verse employing the written names of numerals. Mazur also investigates the subconscious and psychological effects that mathematical symbols have had on mathematical thought, moods, meaning, communication, and comprehension. He considers how these symbols influence us (through similarity, association, identity, resemblance, and repeated imagery), how they lead to new ideas by subconscious associations, how they make connections between experience and the unknown, and how they contribute to the communication of basic mathematics.
From words to abbreviations to symbols, this book shows how math evolved to the familiar forms we use today.

http://www.amazon.com/Enlightening-Symbols-History-Mathematical-Notation/dp/0691154635/

Article about the book:

http://www.theguardian.com/science/alexs-adventures-in-numberland/2014/may/21/notation-history-mathematical-symbols-joseph-mazur

Quantifying Uncertainty: Modern Computational Representation of Probability and Applications

2015-01-08T12:35:00.003-08:00

This is a link to a pdf file containing a tutorial on modeling uncertainty:

http://www.wire.tu-bs.de/forschung/talks/06_Opatija.pdf

Many descriptions (especially of future events) contain elements which are uncertain and not precisely known.

For example, future rainfall or discharge from a river.

More generally, action from the surrounding environment.

The system itself may contain only incompletely known parameters, processes or fields (not possible or too costly to measure).

There may be small, unresolved scales in the model; they act as a kind of background noise.

All these introduce some uncertainty in the model.

Uncertainty may be aleatoric, which means random and not reducible, or epistemic, which means due to incomplete knowledge.

Useful Pandas Features

2015-01-08T12:29:00.001-08:00

A tutorial on 10 useful Pandas features:

http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/

pandas Ecosystem

2015-01-08T12:25:00.004-08:00

Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. This is encouraging because it means pandas is not only helping users to handle their data tasks but also that it provides a better starting point for developers to build powerful and more focused data tools. The creation of libraries that complement pandas’ functionality also allows pandas development to remain focused around its original requirements.

This is a non-exhaustive list of projects that build on pandas in order to provide tools in the PyData space.

http://pandas.pydata.org/pandas-docs/version/0.15.0/ecosystem.html

Seaborn

2015-01-08T12:22:00.000-08:00

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

https://github.com/mwaskom/seaborn

Examples/tutorial:

http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb