Data Science, Machine Learning & Artificial Intelligence: 2015

Friday, September 4, 2015

A Neural Network in 11 lines of Python

This is an excellent tutorial on neural networks that explains not only how they work but also why they work. The Python code is very easy to follow.

There is code for a very simple example and a more advanced one.

https://iamtrask.github.io/2015/07/12/basic-python-network/

see also:

http://denson-data-science.blogspot.com/2015/09/neural-network-step-by-step.html

Neural Network: A Step by Step Backpropagation Example

This great tutorial goes step by step through backpropagation in training a neural network. There is also companion Python code.

http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/comment-page-1/

see also:

http://denson-data-science.blogspot.com/2015/09/a-neural-network-in-11-lines-of-python.html

Tuesday, August 25, 2015

3 Wrong Ways to Store a Password (And 5 code samples doing it right)

Mainly for web development but useful in other contexts as well.

http://adambard.com/blog/3-wrong-ways-to-store-a-password/

Choosing Colormaps

From the matplotlib documentation:

The idea behind choosing a good colormap is to find a good representation in 3D colorspace for your data set. The best colormap for any given data set depends on many things including:

Whether representing form or metric data

Your knowledge of the data set (e.g., is there a critical value from which the other values deviate?)

If there is an intuitive color scheme for the parameter you are plotting

If there is a standard in the field the audience may be expecting

For many applications, a perceptual colormap is the best choice — one in which equal steps in data are perceived as equal steps in the color space. Researchers have found that the human brain perceives changes in the lightness parameter as changes in the data much better than, for example, changes in hue. Therefore, colormaps which have monotonically increasing lightness through the colormap will be better interpreted by the viewer.

http://matplotlib.org/users/colormaps.html

More info from IBM:

http://www.research.ibm.com/people/l/lloydt/color/color.HTM

FastML: a great resource for machine learning

This site has many useful articles for everyone from novices to advanced practitioners:

http://fastml.com/

How to Select the Correct Encryption Approach

This article is a pretty good starting point for selecting an encryption method.

http://www.itbusinessedge.com/articles/how-to-select-the-correct-encryption-approach.html?google_editors_picks=true

Shannon Entropy

In information theory, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message received. 'Messages' don't have to be text; in this context a 'message' is simply any flow of information. The entropy of the message is its amount of uncertainty; it increases when the message is closer to random, and decreases when it is less random. The idea here is that the less likely (i.e. more random) an event is, the more information it provides when it occurs. This seems backwards at first: it seems like messages which have more structure would contain more information, but this is not true. For example, the message 'aaaaaaaaaa' (which appears to be very structured and not random at all [although in fact it could result from a random process]) contains much less information than the message 'alphabet' (which is somewhat structured, but more random) or even the message 'axraefy6h' (which is very random). In information theory, 'information' doesn't necessarily mean useful information; it simply describes the amount of randomness of the message, so in the example above the first message has the least information and the last message has the most information, even though in everyday terms we would say that the middle message, 'alphabet', contains more information than a stream of random letters. Therefore, we would say in information theory that the first message has low entropy, the second has higher entropy, and the third has the highest entropy.
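To make the three examples above concrete, here is a minimal sketch that estimates the per-character entropy of each message from its own character frequencies. The function name shannon_entropy is just an illustrative choice, and treating each character as an independent symbol is a simplifying assumption; it is not the only way to model a message.

```python
from collections import Counter
from math import log2


def shannon_entropy(message):
    """Return the empirical per-symbol entropy of `message`, in bits.

    Assumption: symbol probabilities are estimated from the character
    frequencies within the message itself.
    """
    counts = Counter(message)
    total = len(message)
    # Sum p * log2(1/p) over the observed symbols, where p = n / total.
    return sum((n / total) * log2(total / n) for n in counts.values())


for msg in ("aaaaaaaaaa", "alphabet", "axraefy6h"):
    print(msg, round(shannon_entropy(msg), 3))
```

Under this estimate the first message scores zero bits per character, and the scores increase from the first message to the third, matching the ordering described above.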

https://en.wikipedia.org/wiki/Entropy_(information_theory)

Non-technical article:

http://gizmodo.com/if-it-werent-for-this-equation-you-wouldnt-be-here-1719514472?google_editors_picks=true

Contrast Limited Adaptive Histogram Equalization (CLAHE)

CLAHE is a useful tool for preprocessing images (or video) for computer vision/pattern recognition tasks. It more or less helps you "see" areas of the image that are in shadows.

There are many implementations available, but I like the one in OpenCV:

http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_imgproc/py_histograms/py_histogram_equalization/py_histogram_equalization.html

Note: for color images, it is usually better to convert to HSV colorspace first and apply CLAHE to the value (brightness) channel only.

[Images: before and after applying CLAHE]

Additional info:

http://fiji.sc/wiki/index.php/Enhance_Local_Contrast_(CLAHE)

Thursday, July 30, 2015

Get Much Smarter About Machine Learning in 2 Minutes

This is a great presentation by Stephanie Yee and Tony Chu. It is targeted at people new to the concept/field of machine learning. There are excellent animations that make things very clear.

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Tuesday, April 7, 2015

nolearn and lasagne tutorial

This short notebook is meant to help you get started with nolearn and lasagne in order to train a neural net and make a submission to the Otto Group Product Classification Challenge.

http://nbviewer.ipython.org/github/ottogroup/kaggle/blob/master/Otto_Group_Competition.ipynb

Wednesday, April 1, 2015

An Intuitive Explanation of Bayes' Theorem

This is a great introduction to Bayes' Theorem and strong evidence that a large majority of medical doctors are not scientists.

About 85% of doctors get this problem wrong!

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
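The problem above can be worked through with Bayes' theorem in a few lines. This is a minimal sketch; the function and parameter names (posterior, prior, true_positive_rate, false_positive_rate) are my own labels for the three numbers in the problem, not terms from the linked article.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    # Total probability of a positive test, from both groups.
    p_positive = prior * true_positive_rate + (1 - prior) * false_positive_rate
    return prior * true_positive_rate / p_positive


# Numbers taken from the problem statement above.
p = posterior(prior=0.01, true_positive_rate=0.80, false_positive_rate=0.096)
print(f"{p:.1%}")
```

The result is roughly 7.8%: because the condition is rare, the false positives from the large healthy group outnumber the true positives.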

http://www.yudkowsky.net/rational/bayes

Friday, March 13, 2015

Quicksort Visualization

This is a great visualization of the quicksort algorithm:

https://www.youtube.com/watch?v=aXXWXz5rF64

Quicksort (sometimes called partition-exchange sort) is an efficient sorting algorithm, serving as a systematic method for placing the elements of an array in order. Developed by Tony Hoare in 1960, it is still a very commonly used algorithm for sorting. When implemented well, it can be about two or three times faster than its main competitors, merge sort and heapsort.
Quicksort is a comparison sort, meaning that it can sort items of any type for which a "less-than" relation (formally, a total order) is defined. In efficient implementations it is not a stable sort, meaning that the relative order of equal sort items is not preserved. Quicksort can operate in-place on an array, requiring small additional amounts of memory to perform the sorting.
Mathematical analysis of quicksort shows that, on average, the algorithm takes O(n log n) comparisons to sort n items. In the worst case, it makes O(n^2) comparisons, though this behavior is rare.
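As a rough illustration of the in-place behavior described above, here is a minimal sketch using the Lomuto partition scheme with the last element as the pivot. That pivot choice is an assumption made for simplicity (it hits the worst case on already-sorted input); production implementations pick pivots more carefully.

```python
def quicksort(a, lo=0, hi=None):
    """Sort the list `a` in place between indices lo and hi (inclusive)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return  # zero or one element: already sorted
    pivot = a[hi]  # simplifying assumption: last element is the pivot
    i = lo  # next position for an element smaller than the pivot
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    # Place the pivot between the smaller and the not-smaller elements.
    a[i], a[hi] = a[hi], a[i]
    quicksort(a, lo, i - 1)
    quicksort(a, i + 1, hi)


data = [3, 6, 1, 8, 2, 9, 2]
quicksort(data)
print(data)
```

Only the "less-than" comparison is used, so the same function works for any items that define it.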

http://en.wikipedia.org/wiki/Quicksort

The Halting Problem

This is a great video that describes the halting problem.

https://www.youtube.com/watch?v=92WHN-pAFCs

What is the halting problem, you ask?

In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running or continue to run forever.
Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist. A key part of the proof was a mathematical definition of a computer and program, which became known as a Turing machine; the halting problem is undecidable over Turing machines. It is one of the first decision problems proven to be undecidable.

http://en.wikipedia.org/wiki/Halting_problem

Wednesday, February 18, 2015

Easily distributing a parallel IPython Notebook on a cluster

I haven't tried this yet, but it is high on my to-do list:

http://twiecki.github.io/blog/2014/02/24/ipython-nb-cluster/

Thursday, January 8, 2015

Natural Language Processing with Python

This is a book about Natural Language Processing. By "natural language" we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing — or NLP for short — in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.
Online book:
http://www.nltk.org/book/ch00.html

NLTK 3.0 documentation:

http://www.nltk.org/

Python for Data Science

This short primer on Python is designed to provide a rapid "on-ramp" to enable computer programmers who are already familiar with concepts and constructs in other programming languages to learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.
http://nbviewer.ipython.org/github/gumption/Python_for_Data_Science/blob/master/1_Introduction.ipynb

The Great, Big List of LaTeX Symbols

LaTeX symbols reference:

http://www.rpi.edu/dept/arc/training/latex/LaTeX_symbols.pdf

Enlightening Symbols: A Short History of Mathematical Notation and Its Hidden Powers

While all of us regularly use basic math symbols such as those for plus, minus, and equals, few of us know that many of these symbols weren't available before the sixteenth century. What did mathematicians rely on for their work before then? And how did mathematical notations evolve into what we know today? In Enlightening Symbols, popular math writer Joseph Mazur explains the fascinating history behind the development of our mathematical notation system. He shows how symbols were used initially, how one symbol replaced another over time, and how written math was conveyed before and after symbols became widely adopted.
Traversing mathematical history and the foundations of numerals in different cultures, Mazur looks at how historians have disagreed over the origins of the numerical system for the past two centuries. He follows the transfigurations of algebra from a rhetorical style to a symbolic one, demonstrating that most algebra before the sixteenth century was written in prose or in verse employing the written names of numerals. Mazur also investigates the subconscious and psychological effects that mathematical symbols have had on mathematical thought, moods, meaning, communication, and comprehension. He considers how these symbols influence us (through similarity, association, identity, resemblance, and repeated imagery), how they lead to new ideas by subconscious associations, how they make connections between experience and the unknown, and how they contribute to the communication of basic mathematics.
From words to abbreviations to symbols, this book shows how math evolved to the familiar forms we use today.

http://www.amazon.com/Enlightening-Symbols-History-Mathematical-Notation/dp/0691154635/

Article about the book:

http://www.theguardian.com/science/alexs-adventures-in-numberland/2014/may/21/notation-history-mathematical-symbols-joseph-mazur

Quantifying Uncertainty: Modern Computational Representation of Probability and Applications

This is a link to a PDF file containing a tutorial on modeling uncertainty:

http://www.wire.tu-bs.de/forschung/talks/06_Opatija.pdf

Many descriptions (especially of future events) contain elements, which are uncertain and not precisely known.

For example future rainfall, or discharge from a river.

More generally, action from surrounding environment.

The system itself may contain only incompletely known parameters, processes or fields (not possible or too costly to measure)

There may be small, unresolved scales in the model, they act as a kind of background noise.

All these introduce some uncertainty in the model.

Uncertainty may be aleatoric, which means random and not reducible, or epistemic, which means due to incomplete knowledge.

Useful Pandas Features

A tutorial on 10 useful Pandas features:

http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/

pandas Ecosystem

Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. This is encouraging because it means pandas is not only helping users to handle their data tasks but also that it provides a better starting point for developers to build powerful and more focused data tools. The creation of libraries that complement pandas' functionality also allows pandas development to remain focused around its original requirements.

This is a non-exhaustive list of projects that build on pandas in order to provide tools in the PyData space.

http://pandas.pydata.org/pandas-docs/version/0.15.0/ecosystem.html

Seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

https://github.com/mwaskom/seaborn

Examples/tutorial:

http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb

Vincent: A Python to Vega translator

The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.

https://github.com/wrobstory/vincent

Bokeh

Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

http://bokeh.pydata.org/en/latest/tutorial/index.html

ggplot

ggplot is an extremely un-pythonic package for doing exactly what ggplot2 does. The goal of the package is to mimic the ggplot2 API. This makes it super easy for people coming over from R to use, and prevents you from having to re-learn how to plot stuff.

https://github.com/yhat/ggplot

Qgrid

Qgrid is an IPython extension which uses SlickGrid to render pandas DataFrames within an IPython notebook. It's being developed for use in Quantopian's hosted research environment, and this repository holds the latest source code:

https://github.com/quantopian/qgrid

Demo:

http://nbviewer.ipython.org/github/quantopian/qgrid/blob/master/qgrid_demo.ipynb

Graceful Tree conjecture

In graph theory, a graceful labeling of a graph with m edges is a labeling of its vertices with some subset of the integers between 0 and m inclusive, such that no two vertices share a label, and such that each edge is uniquely identified by the positive, or absolute, difference between its endpoints. A graph which admits a graceful labeling is called a graceful graph.
The name "graceful labeling" is due to Solomon W. Golomb; this class of labelings was originally given the name β-labelings by Alex Rosa in a 1967 paper on graph labelings.
A major unproven conjecture in graph theory is the Graceful Tree conjecture or Ringel–Kotzig conjecture, named after Gerhard Ringel and Anton Kotzig, which hypothesizes that all trees are graceful. The Ringel–Kotzig conjecture is also known as the "graceful labeling conjecture". Kotzig once called the effort to prove the conjecture a "disease".
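The definition above is easy to check mechanically. Here is a minimal sketch that tests whether a given labeling is graceful; the function name is_graceful and the representation (a list of edge pairs plus a dict from vertex to label) are assumptions chosen for illustration. It only verifies a labeling, it does not search for one.

```python
def is_graceful(edges, labels):
    """Return True if `labels` (vertex -> int) gracefully labels the graph.

    `edges` is a list of (u, v) vertex pairs; m is the number of edges.
    """
    m = len(edges)
    values = list(labels.values())
    # Vertex labels must be distinct integers between 0 and m inclusive.
    if len(set(values)) != len(values):
        return False
    if any(not 0 <= v <= m for v in values):
        return False
    # The m absolute differences must be distinct, i.e. exactly 1..m.
    diffs = {abs(labels[u] - labels[v]) for u, v in edges}
    return diffs == set(range(1, m + 1))


# A path on four vertices (a tree with three edges).
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(is_graceful(path, {"a": 0, "b": 3, "c": 1, "d": 2}))
print(is_graceful(path, {"a": 0, "b": 1, "c": 2, "d": 3}))
```

The first labeling gives edge differences 3, 2 and 1, so it is graceful; the second gives the difference 1 on every edge, so it is not.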

http://en.wikipedia.org/wiki/Graceful_labeling

Web page with a JavaScript program that generates graceful labels for a user-generated tree:

http://bl.ocks.org/NPashaP/7683252