2.5 years ago I analyzed Lending Club’s issued loans data (yikes! I was using R back then!). It was the most visited blog post on my site from 2013 through 2014, and today it’s still number 5. Reddit picked up my simple “35-hour work week with Python” post, which is now #1.
Lending Club is the first peer-to-peer lending company to register its offerings as securities with the Securities and Exchange Commission (SEC). Their operational statistics are public and available for download. It has been a while since I’ve posted an end-to-end solution, so I would like to replicate that post with a bit more sophistication in Python, using the latest dataset from lendingclub.com. In summary, let’s examine all the attributes Lending Club collects on users and how they influence the interest rates issued.
By now we all know what Random Forests are. We know about their great off-the-shelf performance, their ease of tuning and parallelization, and their importance measures. It’s easy for engineers implementing RF to forget about its underpinnings. Unlike some of their more modern and advanced contemporaries, decision trees are easy to interpret. A neural net might obtain great results, but it is difficult to work backwards from and explain to stakeholders, as the weights of the connections between two neurons have little meaning on their own. Decision trees won’t be a great choice for a feature space with complex relationships between numerical variables, but they’re great for data with a simpler mix of numerical and categorical variables.
I recently dusted off one of my favorite books, Programming Collective Intelligence by Toby Segaran (2007), and was quickly reminded how much I loved all the pure Python explanations of optimization and modeling. It was never enough for me to read about something and work out proofs on paper; I had to implement the abstract idea in code to truly learn it.
I went through some of the problem sets and, to my dismay, realized that the code examples were no longer hosted on his personal site. A quick Google search revealed that multiple kind souls had not only shared their old copies on GitHub, but even corrected mistakes and updated Python methods.
I’ll be using some of this code as inspiration for an intro to decision trees with Python.
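To give a flavor of that pure Python style, here is a minimal sketch of the core step in growing a decision tree: trying every candidate split and keeping the one with the highest information gain. The function names (`entropy`, `divide`, `best_split`) and the toy rows are my own illustrative assumptions, not the code from the book or from Lending Club’s data.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Shannon entropy of the labels; the label is the last item in each row."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def divide(rows, column, value):
    """Split rows in two: numeric values are tested with >=, others with ==."""
    numeric = isinstance(value, (int, float))

    def matches(row):
        return row[column] >= value if numeric else row[column] == value

    true_rows = [row for row in rows if matches(row)]
    false_rows = [row for row in rows if not matches(row)]
    return true_rows, false_rows


def best_split(rows):
    """Return (gain, column, value) for the split with the highest information gain."""
    current = entropy(rows)
    best = (0.0, None, None)
    for column in range(len(rows[0]) - 1):
        for value in {row[column] for row in rows}:
            true_rows, false_rows = divide(rows, column, value)
            if not true_rows or not false_rows:
                continue  # a split that leaves one side empty tells us nothing
            p = len(true_rows) / len(rows)
            gain = current - p * entropy(true_rows) - (1 - p) * entropy(false_rows)
            if gain > best[0]:
                best = (gain, column, value)
    return best


# Made-up toy rows: (home ownership, annual income in $k, loan grade)
toy_rows = [
    ("rent", 40, "C"),
    ("rent", 55, "C"),
    ("own", 60, "B"),
    ("own", 95, "A"),
    ("mortgage", 120, "A"),
    ("mortgage", 45, "C"),
]

gain, column, value = best_split(toy_rows)
print(f"best split: column {column}, value {value!r}, gain {gain:.3f}")
```

On these made-up rows the best split is on the income column at 60, since every row below that threshold has grade C. A full tree would apply `best_split` recursively to each side until no split improves the gain.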
I just got back from a fun week in Montreal for PyCon 2015. Due to my work commitments since relocating to Seattle and leaving behind the San Diego Data Science Meetup I organized, I’ve been concerned that I was losing touch with the data science and general Python community. I figured an international conference would force me to get out of town, plus I love combining conference trips with a vacation. My last international conference was useR! 2013 in Spain http://kldavenport.com/the-r-user-conference-2013-albacete-spain/.
The Montréal-Python group hosted a meetup http://montrealpython.org/en/2015/03/mp53/ the night after the conference ended where Olivier Grisel spoke about what’s new in scikit-learn 0.16 http://scikit-learn.org/stable/whats_new.html. Scikit-learn is my favorite package across any language, and the talk reminded me that I haven’t done anything meaningful with it in about a year. Combine this fact with the Computational Photography graduate class I’ve been plugging away at this Spring (mostly OpenCV), and I now know it’s time to write about image work with scikit-learn.