PyCon Montreal 2015 and Motivation

[PyCon 2015 logo]

I just got back from a fun week in Montreal for PyCon 2015. Since relocating to Seattle for work and leaving behind the San Diego Data Science Meetup I organized, I’ve been concerned that I was losing touch with the data science and general Python community. I figured an international conference would force me to get out of town, plus I love combining conference trips with a vacation. My last international conference was useR! 2013 in Spain (http://kldavenport.com/the-r-user-conference-2013-albacete-spain/).

The Montréal-Python group hosted a meetup (http://montrealpython.org/en/2015/03/mp53/) the night after the conference ended, where Olivier Grisel spoke about what’s new in scikit-learn 0.16 (http://scikit-learn.org/stable/whats_new.html). Scikit-learn is my favorite package in any language, and the talk reminded me that I haven’t done anything meaningful with it in about a year. Combine that with the Computational Photography graduate class I’ve been plugging away at this spring (mostly OpenCV), and I know it’s time to write about image work with scikit-learn.

Keep scrolling past the repeated content at the beginning of the IPython notebook below; I still need to find time to migrate to Pelican!

Click here to view the notebook in its own window instead of this iframe.

The 35-hour Workweek with Python

[35-hour workweek header image]

I was prompted to write this post after reading the NYT’s In France, New Review of 35-Hour Workweek. For those not familiar with the 35-hour workweek, France adopted it in February 2000 with the support of then Prime Minister Lionel Jospin and the Minister of Labour Martine Aubry. Simply stated, the goal was to increase quality of life by reducing the hours worked per employee, requiring corporations to hire more workers to maintain the same output as before. In theory this would also reduce the country’s historically high unemployment rate of around 10%.

I mostly write about ML, but I’ve been meaning to write about Pandas’ latest features and its tight integration with SciPy, such as data imputation and statistical modeling, and the actual working hours of EU countries will serve as a fun source of data for my examples. I found data on the annual average hours worked per EU country from 1998 to 2013 on the website of The Economic Observation and Research Center for the Development of the Economy and Enterprise Development. This notebook won’t serve as in-depth research on the efficacy of the policy; it’s more of a tutorial on data exploration, although a follow-up post exploring the interactions of commonly tracked economic factors and the externalities of the policy might be fun.

In this IPython notebook we’ll generate descriptive statistics on our hours-worked dataset, then work through the processes of interpolation and extrapolation as defined below:

Interpolation is the estimation of a value within two known values in a sequence of values. For this data, that might mean filling in missing average-hours values at dates that fall between the minimum and maximum observations.

Extrapolation is the estimation of a value based on extending a known sequence of values beyond the range that is certainly known. Extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results. For this data, where the latest observed year is 2013, we might want to extrapolate what the values could look like out to 2015.
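To make the two operations concrete, here is a minimal pandas sketch with made-up numbers (my own toy example, not the notebook’s EU data): interior gaps are filled with Series.interpolate, and the years beyond 2013 are extrapolated by fitting a simple linear trend with NumPy.

```python
import numpy as np
import pandas as pd

# Annual average hours worked for one hypothetical country, 1998-2013,
# with a few interior gaps (values invented for illustration).
years = list(range(1998, 2014))
hours = pd.Series([1670, np.nan, 1591, 1579, np.nan, 1563, 1561, 1559,
                   np.nan, 1553, 1560, 1554, 1562, 1556, 1550, 1545],
                  index=years)

# Interpolation: fill the interior gaps from their neighbours.
filled = hours.interpolate(method='linear')

# Extrapolation: fit a linear trend to the observed years and evaluate it
# at 2014 and 2015. Far less trustworthy than filling interior gaps.
slope, intercept = np.polyfit(filled.index, filled.values, deg=1)
future = pd.Series(slope * np.array([2014, 2015]) + intercept,
                   index=[2014, 2015])

print(filled.tail())
print(future)
```

The extrapolated 2014 and 2015 values are only as good as the assumption that the recent trend continues, which is exactly the extra uncertainty described above.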

Click here to view the notebook in its own window instead of this iframe.


Regularized Logistic Regression Intuition

[Regularized logistic regression header image]

In this notebook we’ll manually implement regularized logistic regression in order to build intuition about the algorithm’s underlying math and to demonstrate how regularization can address overfitting or underfitting. We’ll then implement logistic regression in a practical manner utilizing the ubiquitous scikit-learn package. The post assumes the reader is familiar with the concepts of optimization, cross validation, and non-linearity.
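As a rough sketch of that underlying math (my own NumPy version, not necessarily the notebook’s exact code), the regularized cost is the usual cross-entropy loss plus an L2 penalty on every parameter except the intercept, assuming X already carries a leading column of ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Cross-entropy loss plus an L2 penalty (intercept excluded)."""
    m = len(y)
    h = sigmoid(X.dot(theta))
    cross_entropy = -(y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h))) / m
    penalty = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    return cross_entropy + penalty

def regularized_gradient(theta, X, y, lam):
    """Gradient of the cost above, again leaving the intercept unpenalized."""
    m = len(y)
    h = sigmoid(X.dot(theta))
    grad = X.T.dot(h - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return grad
```

The regularization parameter lam is the dial: set it to zero and the model is free to overfit; crank it up and the learned weights shrink toward zero.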

Andrew Ng’s excellent Machine Learning class on Coursera is not only an excellent primer on the theoretical underpinnings of ML, it also introduces its students to practical implementations via coding. Unfortunately for data enthusiasts who are pickled in Python or R, the class exercises are in Matlab or Octave. Ng’s class introduces logistic regression with regularization early on, which is logical as its methods underpin more advanced concepts. If you’re interested, check out Caltech’s more theory-heavy Learning From Data course on edX, which covers similar ground but digs much deeper into topics like the perceptron.

Using training set X and label set y (data from Ng’s class), we’ll use logistic regression to estimate the probability that an observation belongs to class 1 or class 0. Specifically, we’ll predict whether computer microchips will pass a QA test based on two measurements, X[0] and X[1].

We’ll generate additional features from the original QA data (feature engineering) in order to create a richer feature space to learn from and achieve a better fit. One potential pitfall is that we may overfit the training data, which decreases the learned model’s ability to generalize and causes poor performance on new, unseen observations. Regularization aims to keep the influence of the new features in check.
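For a taste of what the practical scikit-learn version can look like (a minimal sketch with stand-in data, not the notebook’s code), we can chain a polynomial feature expansion and an L2-regularized logistic regression in a pipeline, with C controlling the regularization strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data shaped like Ng's chip QA set: two measurements per chip
# and a binary pass/fail label (the notebook uses the actual course data).
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(118, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# Degree-6 polynomial expansion is the feature engineering step;
# C is the inverse regularization strength (smaller C = stronger penalty).
model = make_pipeline(PolynomialFeatures(degree=6),
                      LogisticRegression(C=1.0))
model.fit(X, y)

# Estimated probabilities of class 0 and class 1 for the first five chips.
print(model.predict_proba(X[:5]))
```

Shrinking C strengthens the penalty and smooths the decision boundary; growing it lets the degree-6 features dominate and invites exactly the overfitting described above.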

View the notebook at nbviewer instead of this iframe.

Dynamic Time-Series Modeling

[Time-series simple moving average (SMA) plot]

Today’s article will showcase a subset of Pandas’ time-series modeling capabilities. I’ll be using financial data to demonstrate them, but the same functions can be applied to any time-series data (application logs, netflow, biometrics, etc.). The focus will be on moving or sliding window methods. These dynamic models account for time-dependent changes in the state of a system, whereas steady-state or static models are time-invariant, naively treating the system as if it were in equilibrium.
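As a quick illustration of the difference (a minimal sketch over a synthetic price series, not the notebook’s data), compare a single whole-period mean with rolling simple moving averages computed by pandas:

```python
import numpy as np
import pandas as pd

# A synthetic daily closing-price series stands in for a real equity here.
idx = pd.date_range('2014-01-02', periods=250, freq='B')
rng = np.random.RandomState(42)
close = pd.Series(100 + rng.randn(250).cumsum(), index=idx, name='close')

# Static (steady-state) summary: one number for the whole period.
overall_mean = close.mean()

# Dynamic (sliding window) summary: a new value for every day,
# here 20-day and 50-day simple moving averages.
sma = pd.DataFrame({'close': close,
                    'sma_20': close.rolling(window=20).mean(),
                    'sma_50': close.rolling(window=50).mean()})

print(overall_mean)
print(sma.tail())
```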

In correlation modeling (Pearson, Spearman, or Kendall) we look at the co-movement between changes in two arrays of data, in this case time-series arrays. A dynamic implementation uses a rolling correlation that returns a series of new values, whereas a static implementation returns a single value representing the correlation “all at once”. This distinction will become clearer with the visualizations below.

Let’s suppose we want to take a look at how SAP and Oracle vary together. One approach is to simply overlay the time-series plots of both equities. A better method is to use a rolling or moving correlation, as it can help reveal trends that would otherwise be hard to detect. Let’s take a look below:
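Here’s a minimal sketch of that idea using synthetic random walks in place of real SAP and Oracle closing prices: the static correlation collapses the whole period into one number, while a 60-day rolling correlation produces a series we can plot and inspect over time.

```python
import numpy as np
import pandas as pd

# Synthetic random walks stand in for the two tickers' daily closes.
idx = pd.date_range('2013-01-02', periods=500, freq='B')
rng = np.random.RandomState(7)
sap = pd.Series(100 + rng.randn(500).cumsum(), index=idx)
orcl = pd.Series(35 + 0.5 * rng.randn(500).cumsum() + 0.3 * (sap - 100),
                 index=idx)

# Work with daily returns rather than raw prices.
returns = pd.DataFrame({'SAP': sap.pct_change(),
                        'ORCL': orcl.pct_change()}).dropna()

# Static correlation: a single number for the whole period.
static_corr = returns['SAP'].corr(returns['ORCL'])

# Rolling correlation: a 60-day window yields a whole series, revealing
# stretches where the relationship strengthens or breaks down.
rolling_corr = returns['SAP'].rolling(window=60).corr(returns['ORCL'])

print(static_corr)
print(rolling_corr.dropna().tail())
```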

View the notebook at nbviewer instead of this iframe.