A wild dataset has appeared! Now what?


Where do we start when we stumble across a dataset we don’t know much about? Let’s say one where we don’t necessarily understand the underlying generative process for some or all of the variables. Let’s assume for now we’re sure there aren’t one-off interventions or level shifts in the data, and we don’t know anything about the distribution of the features, trends, seasonality, model parameters, variance, etc.

I tend to start with the simplest, most interpretable models first, regardless of whether the problem requires classification, regression, or causal modeling. This allows me to assess how difficult the problem is before wasting time applying a complex solution.

The IPython notebook below will outline exploratory analysis in terms of 1) Histograms and Aggregation, 2) Correlation Structure, and 3) Dimensionality Reduction. Note that this isn’t meant to be an exhaustive effort to enumerate all types of imputation and pre-processing, but a quick examination of some best practices.
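
As a rough illustration of those three steps, here is a minimal sketch on synthetic data. It is not the notebook's code: the feature names, the random data, and the choice of NumPy (with PCA done via SVD) are all assumptions made for the example.

```python
import numpy as np

# Synthetic data, purely for illustration (not from any real dataset).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)  # deliberately correlated with x1
x3 = rng.exponential(size=n)              # skewed, independent feature
X = np.column_stack([x1, x2, x3])

# 1) Histograms and aggregation: bin counts plus per-column summaries.
counts, edges = np.histogram(X[:, 2], bins=10)
summary = {"mean": X.mean(axis=0), "std": X.std(axis=0, ddof=1)}

# 2) Correlation structure: pairwise Pearson correlations between columns.
corr = np.corrcoef(X, rowvar=False)

# 3) Dimensionality reduction: PCA via SVD of the standardized data.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)  # share of variance per component
scores = Z @ Vt.T                # observations projected onto the components
```

A skewed histogram, a near-duplicate pair in `corr`, or a first component that dominates `explained` are each a cheap early hint about how much structure the data has before any model is fit.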

Examining Your Presence on Twitter with Python

My Evil The Following with absoluteBLACK’s direct mount oval ring.


The purpose of this post is to show how a sponsorship/marketing manager might track their athletes or brand ambassadors. The code we’re writing below can be used for many other applications, such as tracking general trends across locales or HR insidiously monitoring whether employees are discussing the company in a manner consistent with social media policies (just kidding!). Since we’re doing something specific, I hope this post doesn’t get lost in the sea of yet another web scraping or Twitter data mining post using yet another beginner’s abstracted-away R or Python package.

Twitter helps me stay up to date with the #pydata community through following prominent contributors and hashtags. I also use Twitter (and Facebook) to see what famous athletes and brands are doing in the mountain bike world; they seem to get instant Instagram followers. I discover events and great places to visit that I wouldn’t have stumbled across in other ways. Most of these major brands sponsor professional athletes and, to a different extent, ambassadors who might be competitive amateurs or supporters of a local community, such as trail builders and K-12 instructors. These companies do so to spread word of their brand, technology, and experience. One especially well-known and sought-after sponsorship program is from Patagonia (ski patrol and surf competitors take note!).

There are plenty of general social media management companies out there, like https://hootsuite.com/ and https://buffer.com/, that pair CRM tools with data analytics, but there aren’t many sports-specific platforms, such as http://www.hookit.com/, that enable marketing managers to track and maximize the value of their athletes and sports marketing programs. These sports-specific solutions differ in that they consider an athlete’s reach via their placement in race events and their travel coverage, the latter being an intelligent metric, as doing 100 races in your backyard isn’t the same as doing 10 in a different country.

I was picked as one of absoluteBLACK’s (a sweet component manufacturer out of England) ambassadors for 2016, and similar to other brands’ programs, they expect their ambassadors to use social media to share photos or videos of their products in a lifestyle or action shot accompanied by relevant tags. I thought it would be interesting to take a look at absoluteBLACK’s presence on Twitter by examining who is talking about them, what they are posting, and from where. If you were in sponsorship or marketing, you could run this easy-to-configure script weekly and start examining your presence.

Note: for fullscreen, click the gist link below this embedded view.
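
For a feel of the "who, what, and from where" summary, here is a minimal sketch that is separate from the gist and makes no API calls. It assumes the tweets have already been fetched and are shaped like Twitter REST API tweet objects (a `user` with `screen_name` and `location`, and `entities` with `hashtags`). The sample records and the `summarize` helper are hypothetical, made up for illustration.

```python
from collections import Counter

# Made-up records shaped like Twitter REST API tweet objects; illustrative only.
tweets = [
    {"text": "New oval ring day!",
     "user": {"screen_name": "rider_a", "location": "Boulder, CO"},
     "entities": {"hashtags": [{"text": "MTB"}, {"text": "ovalring"}]}},
    {"text": "Race report is up.",
     "user": {"screen_name": "rider_b", "location": ""},
     "entities": {"hashtags": [{"text": "mtb"}]}},
    {"text": "Trail work this weekend.",
     "user": {"screen_name": "rider_a", "location": "Boulder, CO"},
     "entities": {"hashtags": []}},
]

def summarize(tweets):
    """Count authors, self-reported locations, and hashtags (hypothetical helper)."""
    authors = Counter(t["user"]["screen_name"] for t in tweets)
    # Profile locations are free text and often empty, so bucket blanks together.
    locations = Counter(t["user"].get("location") or "unknown" for t in tweets)
    hashtags = Counter(
        h["text"].lower()
        for t in tweets
        for h in t.get("entities", {}).get("hashtags", [])
    )
    return authors, locations, hashtags

authors, locations, hashtags = summarize(tweets)
print(authors.most_common(1), locations.most_common(1), hashtags.most_common(1))
```

Lowercasing hashtags before counting keeps `#MTB` and `#mtb` from being tallied as different tags, which matters when ambassadors are asked to use specific ones.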

Lending Club Data Analysis Revisited with Python

2.5 years ago I analyzed Lending Club’s issued loans data (yikes! I was using R back then!). It was the most visited blog post on my site from 2013 through 2014. Today it’s still number 5. Reddit picked up my simple “35-hour work week with Python” post, which is now #1.


Lending Club is the first peer-to-peer lending company to register its offerings as securities with the Securities and Exchange Commission (SEC). Their operational statistics are public and available for download. It has been a while since I’ve posted an end-to-end solution blog post, and I would like to replicate the earlier post with a bit more sophistication in Python, using the latest dataset from lendingclub.com. In summary, let’s examine all the attributes Lending Club collects on users and how they influence the interest rates issued.