Thursday, January 8, 2015

Natural Language Processing with Python

This is a book about Natural Language Processing. By "natural language" we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing — or NLP for short — in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.
Online book:
http://www.nltk.org/book/ch00.html



NLTK 3.0 documentation:

http://www.nltk.org/
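
The "simple extreme" the book mentions, counting word frequencies to compare writing styles, is easy to try directly with NLTK. A minimal sketch (assuming NLTK is installed and the Gutenberg sample corpus has been downloaded; the text chosen is just an example):

    # Minimal word-frequency sketch with NLTK (assumes `pip install nltk`).
    import nltk
    from nltk.corpus import gutenberg
    from nltk import FreqDist

    nltk.download('gutenberg', quiet=True)   # fetch the sample corpus if missing

    # Lowercase alphabetic tokens from one of the bundled texts.
    words = [w.lower() for w in gutenberg.words('austen-emma.txt') if w.isalpha()]
    fdist = FreqDist(words)

    # The 10 most common words give a crude fingerprint of the writing style.
    for word, count in fdist.most_common(10):
        print(word, count)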

Python for Data Science

This short primer on Python is designed to provide a rapid "on-ramp" to enable computer programmers who are already familiar with concepts and constructs in other programming languages to learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.
http://nbviewer.ipython.org/github/gumption/Python_for_Data_Science/blob/master/1_Introduction.ipynb

The Great, Big List of LaTeX Symbols

LaTeX symbols reference:

http://www.rpi.edu/dept/arc/training/latex/LaTeX_symbols.pdf

Enlightening Symbols: A Short History of Mathematical Notation and Its Hidden Powers

While all of us regularly use basic math symbols such as those for plus, minus, and equals, few of us know that many of these symbols weren't available before the sixteenth century. What did mathematicians rely on for their work before then? And how did mathematical notations evolve into what we know today? In Enlightening Symbols, popular math writer Joseph Mazur explains the fascinating history behind the development of our mathematical notation system. He shows how symbols were used initially, how one symbol replaced another over time, and how written math was conveyed before and after symbols became widely adopted.
Traversing mathematical history and the foundations of numerals in different cultures, Mazur looks at how historians have disagreed over the origins of the numerical system for the past two centuries. He follows the transfigurations of algebra from a rhetorical style to a symbolic one, demonstrating that most algebra before the sixteenth century was written in prose or in verse employing the written names of numerals. Mazur also investigates the subconscious and psychological effects that mathematical symbols have had on mathematical thought, moods, meaning, communication, and comprehension. He considers how these symbols influence us (through similarity, association, identity, resemblance, and repeated imagery), how they lead to new ideas by subconscious associations, how they make connections between experience and the unknown, and how they contribute to the communication of basic mathematics.
From words to abbreviations to symbols, this book shows how math evolved to the familiar forms we use today.

http://www.amazon.com/Enlightening-Symbols-History-Mathematical-Notation/dp/0691154635/


Article about the book:

http://www.theguardian.com/science/alexs-adventures-in-numberland/2014/may/21/notation-history-mathematical-symbols-joseph-mazur

Quantifying Uncertainty: Modern Computational Representation of Probability and Applications

This is a link to a pdf file containing a tutorial on modeling uncertainty:


http://www.wire.tu-bs.de/forschung/talks/06_Opatija.pdf



Many descriptions (especially of future events) contain elements which are uncertain and not precisely known.
  • For example, future rainfall or discharge from a river.
  • More generally, actions from the surrounding environment.
  • The system itself may contain only incompletely known parameters, processes, or fields (not possible or too costly to measure).
  • There may be small, unresolved scales in the model; they act as a kind of background noise.

All these introduce some uncertainty in the model.
  • Uncertainty may be aleatoric, which means random and not reducible, or
  • epistemic, which means due to incomplete knowledge.
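
As a concrete illustration (not taken from the slides), the sketch below propagates an uncertain input such as rainfall through a toy discharge model by Monte Carlo sampling; the model and its parameters are invented for the example.

    # Illustrative sketch: propagate an uncertain input through a toy model
    # by Monte Carlo sampling.  The "discharge" model and its parameters are
    # hypothetical, not from the tutorial.
    import numpy as np

    rng = np.random.default_rng(0)

    def discharge(rainfall_mm):
        """Toy model: river discharge as a simple function of rainfall."""
        return 0.8 * rainfall_mm + 5.0

    # Aleatoric uncertainty in the input: rainfall treated as a random variable.
    rainfall_samples = rng.normal(loc=50.0, scale=10.0, size=10_000)
    discharge_samples = discharge(rainfall_samples)

    print("mean discharge:", discharge_samples.mean())
    print("std  discharge:", discharge_samples.std())
    print("95% interval:  ", np.percentile(discharge_samples, [2.5, 97.5]))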

Useful Pandas Features


A tutorial on 10 useful Pandas features:

http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/
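
By way of illustration, here is a small sketch of two features such tutorials typically highlight, groupby and pivot_table, applied to a made-up DataFrame (the data is illustrative only, not from the linked post):

    import pandas as pd

    df = pd.DataFrame({
        'city':  ['Boston', 'Boston', 'Austin', 'Austin'],
        'year':  [2013, 2014, 2013, 2014],
        'sales': [100, 120, 90, 130],
    })

    # Aggregate sales per city.
    print(df.groupby('city')['sales'].sum())

    # Reshape into a city-by-year table.
    print(df.pivot_table(values='sales', index='city', columns='year'))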

pandas Ecosystem

Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. This is encouraging because it means pandas is not only helping users to handle their data tasks but also that it provides a better starting point for developers to build powerful and more focused data tools. The creation of libraries that complement pandas’ functionality also allows pandas development to remain focused on its original requirements.

This is a non-exhaustive list of projects that build on pandas in order to provide tools in the PyData space.

http://pandas.pydata.org/pandas-docs/version/0.15.0/ecosystem.html

Seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

https://github.com/mwaskom/seaborn

Examples/tutorial:

http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb
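
A minimal sketch of that high-level interface (not taken from the linked notebook), drawing a kernel density estimate of some random data:

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set()                       # apply seaborn's default matplotlib styling
    data = np.random.randn(1000)    # illustrative data

    sns.kdeplot(data)               # one call draws a smoothed distribution
    plt.title("Kernel density estimate")
    plt.show()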

Vincent: A Python to Vega translator

The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.


https://github.com/wrobstory/vincent
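
Roughly, the workflow looks like the sketch below; the calls shown (vincent.Bar, axis_titles, to_json) are assumed from the project's README of that era and may differ between versions.

    # Rough sketch of the Vincent workflow: build a Vega chart spec in Python,
    # then emit the JSON for Vega/D3 to render.  API calls are assumed from the
    # project's README and may have changed.
    import vincent

    data = [10, 20, 15, 30, 25]            # illustrative data
    bar = vincent.Bar(data)                # build a Vega bar-chart specification
    bar.axis_titles(x='Index', y='Value')
    bar.to_json('bar_chart.json')          # write the Vega JSON spec to disk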

Bokeh

Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

http://bokeh.pydata.org/en/latest/tutorial/index.html
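
A minimal sketch with the bokeh.plotting interface (the data and output file name are illustrative):

    import numpy as np
    from bokeh.plotting import figure, output_file, show

    x = np.linspace(0, 10, 500)
    y = np.sin(x)

    output_file("sine.html")                  # write a standalone HTML page
    p = figure(title="Interactive sine curve")
    p.line(x, y, line_width=2)                # add a line glyph
    show(p)                                   # open the plot in a browser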

ggplot

ggplot is an extremely un-pythonic package for doing exactly what ggplot2 does. The goal of the package is to mimic the ggplot2 API. This makes it super easy for people coming over from R to use, and prevents you from having to re-learn how to plot stuff.

https://github.com/yhat/ggplot
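
A sketch of the ggplot2-style grammar the package mimics, close to its README example; it assumes the bundled meat example dataset and may not run against current pandas versions.

    # Grammar-of-graphics style, as in R: map columns to aesthetics, add a layer.
    # Assumes the `meat` demo dataset that ships with the yhat ggplot package.
    from ggplot import *

    p = ggplot(aes(x='date', y='beef'), data=meat) + geom_line()
    print(p)   # printing the ggplot object renders the figure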

Qgrid

Qgrid is an IPython extension which uses SlickGrid to render pandas DataFrames within an IPython notebook. It's being developed for use in Quantopian's hosted research environment, and this repository holds the latest source code:

https://github.com/quantopian/qgrid


Demo:

http://nbviewer.ipython.org/github/quantopian/qgrid/blob/master/qgrid_demo.ipynb
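
A minimal usage sketch inside a notebook; qgrid.show_grid is assumed from the project's documented usage and may differ between versions.

    # Render a DataFrame as an interactive grid in an IPython/Jupyter notebook.
    # qgrid.show_grid is assumed from the project's docs; the data is made up.
    import pandas as pd
    import qgrid

    df = pd.DataFrame({'symbol': ['AAPL', 'MSFT', 'GOOG'],
                       'price':  [109.3, 46.8, 501.1]})

    qgrid.show_grid(df)   # displays a sortable, filterable SlickGrid view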

Graceful Tree conjecture

In graph theory, a graceful labeling of a graph with m edges is a labeling of its vertices with some subset of the integers between 0 and m inclusive, such that no two vertices share a label, and such that each edge is uniquely identified by the absolute difference between the labels of its endpoints. A graph which admits a graceful labeling is called a graceful graph.
The name "graceful labeling" is due to Solomon W. Golomb; this class of labelings was originally given the name β-labelings by Alex Rosa in a 1967 paper on graph labelings.
A major unproven conjecture in graph theory is the Graceful Tree conjecture or Ringel–Kotzig conjecture, named after Gerhard Ringel and Anton Kotzig, which hypothesizes that all trees are graceful. The Ringel-Kotzig conjecture is also known as the "graceful labeling conjecture". Kotzig once called the effort to prove the conjecture a "disease".

http://en.wikipedia.org/wiki/Graceful_labeling
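
To make the definition concrete, here is a small checker (not from the article or the page below) that tests whether a given labeling of a graph is graceful:

    # A labeling of a graph with m edges is graceful if the vertex labels are
    # distinct integers in [0, m] and the absolute differences across the edges
    # are exactly the values 1..m, each occurring once.
    def is_graceful(edges, labels):
        """edges: list of (u, v) pairs; labels: dict mapping vertex -> label."""
        m = len(edges)
        vertex_labels = list(labels.values())
        if len(set(vertex_labels)) != len(vertex_labels):
            return False                              # labels must be distinct
        if not all(0 <= lab <= m for lab in vertex_labels):
            return False                              # labels drawn from 0..m
        diffs = {abs(labels[u] - labels[v]) for (u, v) in edges}
        return diffs == set(range(1, m + 1))          # edge differences are 1..m

    # Example: a path on 4 vertices (3 edges) with a graceful labeling.
    path_edges = [('a', 'b'), ('b', 'c'), ('c', 'd')]
    print(is_graceful(path_edges, {'a': 0, 'b': 3, 'c': 1, 'd': 2}))  # True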

Web page with a Javascript program that generates graceful labels for a user generated tree:

http://bl.ocks.org/NPashaP/7683252