Showing posts with label python.

Sunday, April 2, 2017

Python function to compute K-category correlation coefficient

I created a Python function to compute the K-category correlation coefficient. K-category correlation is a measure of classification performance and may be considered a multiclass generalization of the Matthews correlation coefficient.

Python function to compute K-category correlation coefficient at Github.
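
For reference, the statistic can be computed directly from the K-by-K confusion matrix. Below is a minimal sketch of that calculation (my own illustration of the standard formula, not the code in the repository above); y_true and y_pred are assumed to be equal-length label sequences.

import numpy as np

def k_category_correlation(y_true, y_pred):
    """Sketch of the K-category correlation (Gorodkin's R_K) from a confusion matrix."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {lab: i for i, lab in enumerate(labels)}
    k = len(labels)
    C = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        C[index[t], index[p]] += 1
    n = C.sum()            # total number of samples
    trace = np.trace(C)    # correctly classified samples
    t_k = C.sum(axis=1)    # true counts per category
    p_k = C.sum(axis=0)    # predicted counts per category
    num = n * trace - np.dot(t_k, p_k)
    den = np.sqrt(n**2 - np.dot(p_k, p_k)) * np.sqrt(n**2 - np.dot(t_k, t_k))
    return num / den if den != 0 else 0.0   # reduces to the MCC when K == 2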


Academic paper:

Comparing two K-category assignments by a K-category correlation coefficient

Abstract

Predicted assignments of biological sequences are often evaluated by Matthews correlation coefficient. However, Matthews correlation coefficient applies only to cases where the assignments belong to two categories, and cases with more than two categories are often artificially forced into two categories by considering what belongs and what does not belong to one of the categories, leading to the loss of information. Here, an extended correlation coefficient that applies to K-categories is proposed, and this measure is shown to be highly applicable for evaluating prediction of RNA secondary structure in cases where some predicted pairs go into the category “unknown” due to lack of reliability in predicted pairs or unpaired residues. Hence, predicting base pairs of RNA secondary structure can be a three-category problem. The measure is further shown to be well in agreement with existing performance measures used for ranking protein secondary structure predictions.


The paper author's server and software are available at http://rk.kvl.dk/


Tuesday, April 7, 2015

nolearn and lasagne tutorial

This short notebook is meant to help you get started with nolearn and lasagne so you can train a neural net and make a submission to the Otto Group Product Classification Challenge.

http://nbviewer.ipython.org/github/ottogroup/kaggle/blob/master/Otto_Group_Competition.ipynb
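
The heart of the notebook is a NeuralNet definition roughly along these lines; this is a condensed sketch from memory, and the synthetic X and y here merely stand in for the real Otto data (93 features, 9 classes).

import numpy as np
from lasagne.layers import DenseLayer, InputLayer
from lasagne.nonlinearities import softmax
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

# Synthetic stand-in for the Otto data.
X = np.random.rand(1000, 93).astype(np.float32)
y = np.random.randint(0, 9, size=1000).astype(np.int32)

net = NeuralNet(
    layers=[('input', InputLayer),     # nolearn builds the lasagne layers from this list
            ('hidden', DenseLayer),
            ('output', DenseLayer)],
    input_shape=(None, 93),
    hidden_num_units=200,
    output_num_units=9,
    output_nonlinearity=softmax,
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=20,
    verbose=1,
)
net.fit(X, y)
proba = net.predict_proba(X)   # class probabilities, the format the competition expects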

Thursday, January 8, 2015

Natural Language Processing with Python

This is a book about Natural Language Processing. By "natural language" we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing — or NLP for short — in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.
Online book:
http://www.nltk.org/book/ch00.html



NLTK 3.0 documentation:

http://www.nltk.org/
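
As a taste of the word-frequency end of that spectrum, here is a minimal NLTK sketch (it assumes the punkt tokenizer data has already been downloaded):

import nltk
from nltk import FreqDist

# nltk.download('punkt')  # one-time download of the tokenizer models

text = ("Natural language is messy. Natural language is also, "
        "for exactly that reason, interesting to process.")
tokens = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
fdist = FreqDist(tokens)
print(fdist.most_common(5))   # most frequent words, e.g. ('natural', 2), ('language', 2), ...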

Python for Data Science

This short primer on Python is designed to provide a rapid "on-ramp" for computer programmers who are already familiar with the concepts and constructs of other programming languages, so they can learn enough Python to make effective use of open-source and proprietary Python-based machine learning and data science tools.
http://nbviewer.ipython.org/github/gumption/Python_for_Data_Science/blob/master/1_Introduction.ipynb

Useful Pandas Features


A tutorial on 10 useful Pandas features:

http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/

pandas Ecosystem

Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. This is encouraging because it means pandas is not only helping users handle their data tasks but also providing a better starting point for developers to build powerful and more focused data tools. The creation of libraries that complement pandas' functionality also allows pandas development to remain focused on its original requirements.

This is a non-exhaustive list of projects that build on pandas to provide tools in the PyData space.

http://pandas.pydata.org/pandas-docs/version/0.15.0/ecosystem.html

Seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

https://github.com/mwaskom/seaborn

Examples/tutorial:

http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb
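
A minimal example of the kind of distribution plot the notebook above covers (function names have shifted a bit across seaborn versions, so treat this as a sketch):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()                           # apply seaborn's default matplotlib styling
data = np.random.normal(size=200)   # synthetic sample
sns.kdeplot(data)                   # smoothed (kernel density) view of the distribution
plt.show()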

Vincent: A Python to Vega translator

The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.


https://github.com/wrobstory/vincent

Bokeh

Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

http://bokeh.pydata.org/en/latest/tutorial/index.html
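
The canonical Bokeh starter looks roughly like this; it writes the plot to a standalone HTML page rather than rendering inline.

from bokeh.plotting import figure, output_file, show

output_file("lines.html")   # the plot is saved as a standalone HTML page
p = figure(title="simple line example", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)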

ggplot

ggplot is an extremely un-pythonic package for doing exactly what ggplot2 does. The goal of the package is to mimic the ggplot2 API. This makes it super easy for people coming over from R to use, and prevents you from having to re-learn how to plot stuff.

https://github.com/yhat/ggplot
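
From memory of the project's README-era examples, usage looked roughly like this; the exact names and rendering behavior may differ across versions of the package.

import pandas as pd
from ggplot import ggplot, aes, geom_point, geom_line

df = pd.DataFrame({'x': range(10), 'y': [v ** 2 for v in range(10)]})
p = ggplot(aes(x='x', y='y'), data=df) + geom_point() + geom_line()
print(p)   # displaying/printing the object renders the plot (details varied by version)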

Qgrid

Qgrid is an IPython extension which uses SlickGrid to render pandas DataFrames within an IPython notebook. It's being developed for use in Quantopian's hosted research environment, and this repository holds the latest source code:

https://github.com/quantopian/qgrid


Demo:

http://nbviewer.ipython.org/github/quantopian/qgrid/blob/master/qgrid_demo.ipynb

Monday, September 29, 2014

Random sample consensus (RANSAC)

Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data which contains outliers. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981.

A basic assumption is that the data consist of "inliers", i.e., data whose distribution can be explained by some set of model parameters (though possibly subject to noise), and "outliers", which are data that do not fit the model. The outliers can come, e.g., from extreme values of the noise, from erroneous measurements, or from incorrect hypotheses about the interpretation of the data. RANSAC also assumes that, given a (usually small) set of inliers, there exists a procedure which can estimate the parameters of a model that optimally explains or fits this data.

http://en.wikipedia.org/wiki/RANSAC
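
To make the loop concrete, here is a toy from-scratch sketch (my own illustration, not code from any of the links below) that fits a line y = a*x + b to 2-D points containing outliers:

import numpy as np

def ransac_line(points, n_iters=200, threshold=0.1, min_inliers=10, seed=None):
    """Toy RANSAC fit of y = a*x + b to an (N, 2) array of points with outliers."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, None
    for _ in range(n_iters):
        # 1. Sample the minimal set needed to fit the model (2 points for a line).
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2. Count how many points agree with this candidate model.
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < threshold
        # 3. Keep the candidate with the largest consensus set.
        if inliers.sum() >= min_inliers and (best_inliers is None or inliers.sum() > best_inliers.sum()):
            best_model, best_inliers = (a, b), inliers
    # 4. Refit on all inliers (least squares) for the final estimate.
    if best_inliers is not None:
        a, b = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
        best_model = (a, b)
    return best_model, best_inliers

# Usage with synthetic data: a clean line plus 20 grossly corrupted points.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
pts = np.column_stack([x, 2.0 * x + 1.0 + rng.normal(0, 0.05, 100)])
pts[:20, 1] += rng.uniform(-20, 20, 20)
model, mask = ransac_line(pts, threshold=0.5)
print(model)   # close to (2.0, 1.0) despite the outliers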

Tutorial:

http://vision.ece.ucsb.edu/~zuliani/Research/RANSAC/docs/RANSAC4Dummies.pdf



Matlab code:

http://www.mathworks.com/discovery/ransac.html

Python code:

http://wiki.scipy.org/Cookbook/RANSAC

Just for fun, a RANSAC song:

https://www.youtube.com/watch?v=1YNjMxxXO-E

Corner Detector

Detecting corners is often a good first step in computer vision. If you can match corners between two images, for example, you are well on your way to figuring out how the images fit together.

Corner detection is an approach used within computer vision systems to extract certain kinds of features and infer the contents of an image. Corner detection is frequently used in motion detection, image registration, video tracking, image mosaicing, panorama stitching, 3D modelling and object recognition. Corner detection overlaps with the topic of interest point detection.

http://en.wikipedia.org/wiki/Corner_detection


Lecture slide decks:


http://www.cse.psu.edu/~rcollins/CSE486/lecture06.pdf

http://courses.cs.washington.edu/courses/cse577/05sp/notes/harris.pdf

Tutorials:

Python/OpenCV

http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_feature2d/py_features_harris/py_features_harris.html
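
The core of the Harris-corner example in the Python/OpenCV tutorial above boils down to a few calls; the image file names here are placeholders.

import cv2
import numpy as np

img = cv2.imread('chessboard.png')                 # placeholder input image
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))
dst = cv2.cornerHarris(gray, 2, 3, 0.04)           # block size 2, Sobel aperture 3, Harris k = 0.04
img[dst > 0.01 * dst.max()] = [0, 0, 255]          # mark strong corner responses in red
cv2.imwrite('corners.png', img)                    # placeholder output image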

YouTube video:

https://www.youtube.com/watch?v=vkWdzWeRfC4

Matlab code:

http://www.mathworks.com/matlabcentral/fileexchange/9272-harris-corner-detector


Restricted Boltzmann machine

Learning to use RBMs is on my to-do list; I'll update this post when I get around to it. RBMs are just one technique used in deep learning.

The Restricted Boltzmann Machine (RBM) has become increasingly popular of late after its success in the Netflix Prize competition and other competitions. Most of the inventive work behind RBMs was done by Geoffrey Hinton, in particular the training of RBMs with an algorithm called contrastive divergence (CD), which is similar in spirit to gradient descent. A nice consequence of CD training is the model's ability to "dream", i.e., to generate samples that resemble the training data; few other machine learning methods have this capacity baked in.

http://bayesianthink.blogspot.com/2013/05/the-restricted-boltzmann-machine-rbm.html#.VCnWzikijjI

This is some Matlab code that someone wrote for a class he was taking. It is probably not polished, but if you are working in Matlab it is probably better than starting from scratch:

https://code.google.com/p/matrbm/

RBM tutorial:

http://deeplearning.net/tutorial/rbm.html#rbm


RBM in scikit-learn:

http://scikit-learn.org/stable/modules/neural_networks.html
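
A minimal scikit-learn sketch with synthetic binary data (real inputs would be something like binarized image pixels):

import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
X = (rng.rand(500, 64) > 0.5).astype(float)   # 500 samples of 64 binary "pixels"

rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)                   # trained with (persistent) contrastive divergence
hidden = rbm.transform(X)    # hidden-unit activation probabilities, usable as learned features
print(hidden.shape)          # (500, 32)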



A Practical Guide to Training Restricted Boltzmann Machines:


http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf

Friday, September 19, 2014

Genetic Algorithms: Cool Name & Damn Simple

Nice GA tutorial.

Genetic algorithms are a mysterious-sounding technique in a mysterious-sounding field: artificial intelligence. This is the problem with naming things appropriately. When the field was labeled artificial intelligence, it meant using mathematics to artificially create the semblance of intelligence, but self-aggrandizing researchers and Isaac Asimov redefined it as robots.

The name genetic algorithms does sound complex and has a faintly magical ring to it, but it turns out that they are one of the simplest and most intuitive concepts you'll encounter in A.I.




Genetic Algorithms: Cool Name & Damn Simple - Irrational Exuberance
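
For a feel of how little machinery is involved, here is a toy GA (my own illustration, not code from the tutorial) that evolves a bitstring toward all ones:

import random

def evolve(genome_len=20, pop_size=50, generations=100, mutation_rate=0.02):
    """Tiny GA maximizing the number of 1s in a bitstring."""
    fitness = lambda g: sum(g)
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                    # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)         # single-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - bit if random.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(sum(best), best)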

Curve fitting with Pyevolve

This is a very nice tutorial for genetic algorithms. It uses Pyevolve, but the tutorial is useful even if you are using a different language or GA implementation.

A Coder's Musings: Curve fitting with Pyevolve
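
For orientation, Pyevolve's own hello-world looked roughly like this (reproduced from memory of its documentation, so treat the exact names as approximate):

from pyevolve import G1DList, GSimpleGA

def eval_func(chromosome):
    # Fitness = how many genes are zero (Pyevolve maximizes by default).
    return sum(1 for value in chromosome if value == 0)

genome = G1DList.G1DList(20)               # a genome of 20 integer genes
genome.setParams(rangemin=0, rangemax=10)
genome.evaluator.set(eval_func)

ga = GSimpleGA.GSimpleGA(genome)
ga.setGenerations(100)
ga.evolve(freq_stats=20)                   # print statistics every 20 generations
print(ga.bestIndividual())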

pygene - simple python genetic algorithms/programming library

I played around with this a bit before I decided on pyevolve instead.  However, pygene might suit your needs better.

pygene - simple python genetic algorithms/programming library

blaa/PyGene · GitHub