Gradient Boosting: Analysis of LendingClub’s Data

I originally posted this article last year. I am reposting it as a primer for my upcoming article, where I take a second look at LendingClub, this time with the Python data stack.

An old 5.75% CD of mine recently matured, and seeing that those interest rates are gone forever, I figured I'd take a statistical look at LendingClub's data. LendingClub is the first peer-to-peer lending company to register its offerings as securities with the Securities and Exchange Commission (SEC). Their operational statistics are public and available for download.

The latest dataset consisted of 119,573 entries, and some of its attributes included:

  1. Amount Requested
  2. Amount Funded by Investors*
  3. Interest Rate
  4. Term (Loan Length)
  5. Purpose of Loan
  6. Debt/Income Ratio **
  7. State
  8. Rent or Own Home
  9. Monthly Income
  10. FICO Low
  11. FICO High
  12. Credit Lines Open
  13. Revolving Balance
  14. Inquiries in Last 6 Months
  15. Length of Employment

*LendingClub states that the amount funded by investors has no effect on the final interest rate assigned to a loan.

** DTI ratio takes all of your monthly liabilities and divides the total by your gross monthly income.

Once I had the .csv loaded as a dataframe in R, I had a little data munging to accomplish. I wasn't sure what method I was going to use at this point, but I always address missing data. In this instance I replaced NA entries with averages and converted interest rate to a numeric column (the % sign caused R to import the column as a string). I also created a debt-to-income ratio variable by dividing annual income by 12 and dividing the result by revolving balance.
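For reference, here is a minimal sketch of those munging steps. It assumes column names matching LendingClub's export (int_rate, annual_inc, revol_bal), so treat it as illustrative rather than the exact code I ran:

# Interest rate imports as a string/factor because of the "%" sign
df$int_rate = as.numeric(sub("%", "", as.character(df$int_rate)))

# Replace NA entries in numeric columns with the column average
num_cols = sapply(df, is.numeric)
df[num_cols] = lapply(df[num_cols], function(x) { x[is.na(x)] = mean(x, na.rm = TRUE); x })

# Debt-to-income variable as described above (monthly income relative to revolving balance)
df$DTIratio = (df$annual_inc / 12) / df$revol_bal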

I wanted to create a FICO range column (inherently a factor or character variable in R due to the "-" in an entry like "675-725") and then create a numeric MeanFICO column.

df$FICO.Range = paste(df$fico_range_low, df$fico_range_high, sep = '-')
# e.g. "675-725" -> (675 + 725) / 2 = 700
FICOMEAN = function(x) (as.numeric(substr(x, 1, 3)) + as.numeric(substr(x, 5, 7))) / 2
df$MeanFICO = sapply(df$FICO.Range, FICOMEAN)

I try to identify confounders early when dredging through a dataset. Mathbabe aka Catherine O'Neil has a great post on confounders here. A simple ANOVA screen (sketched after the quote below) listed Amount Requested, Debt/Income Ratio, Rent or Own Home, Inquiries in Last 6 Months, Length of Loan, and Purpose of Loan as being significantly associated with MeanFICO and Interest Rate. Fair Isaac itself states the following:

Fair Isaac’s research shows that opening several credit accounts in a short period of time represents greater credit risk. When the information on your credit report indicates that you have been applying for multiple new credit lines in a short period of time (as opposed to rate shopping for a single loan, which is handled differently as discussed below), your FICO score can be lower as a result.

This, coupled with their FICO score breakdown information, further confirms that Debt/Income Ratio and Inquiries in Last 6 Months are definite confounders.
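As a sketch, that ANOVA screen can be run variable by variable along these lines (column names as in the gbm call further down; illustrative rather than the exact screen I ran):

# e.g., do loan purpose categories explain a significant share of the variance
# in interest rate and in MeanFICO?
summary(aov(int_rate ~ purpose, data = df))
summary(aov(MeanFICO ~ purpose, data = df))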

The next step was quantitatively expressing the importance of each variable in determining a loan's interest rate. From experience I would hypothesize that FICO score and loan length are key factors, but let's find out for sure. The current base rate is 5.05%. Below is a quick screenshot of a subset of the current rates by risk; essentially, the point of this post is to examine how LendingClub obtains their "Adjustment for Risk & Volatility" values.

[Image: LendingClub's current rates by risk grade, including the "Adjustment for Risk & Volatility" values]

Since there are so many variables, some of which may be highly correlated and some of which affect Interest Rate in a non-linear manner, I decided against multiple linear regression and opted for gradient boosting (library(gbm)). Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (the base learner) to the current pseudo-residuals by least squares at each iteration.[1]

Overfitting is a big concern when modeling data, so I used 10-fold cross-validation (which might be overkill considering the 90k+ training observations). I also used the gbm.perf() function to estimate "the optimal number of boosting iterations for a gbm object" and summary() to report the relative influence of each variable in the gbm object. Below is a graphical summary of the relative influence of each variable.

[Image: gbm relative-influence bar graph]

I fit a gbm (gradient boosted model) to a subset of the data (the training set) to generate a list describing how much each variable reduced the squared error. According to this output the three most important variables were MeanFICO, term, and amount requested; together they accounted for roughly 93% of the relative influence on interest rate.

library(gbm)

gbm_fit = gbm(int_rate ~ ., data = training,
              distribution = "gaussian",
              n.trees = 1000,
              shrinkage = 0.01,
              interaction.depth = 7,
              bag.fraction = 0.9,
              cv.folds = 10,
              n.minobsinnode = 50)
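The cross-validated iteration estimate and the relative-influence summary mentioned above can then be pulled from the fitted object; a minimal sketch:

best_iter = gbm.perf(gbm_fit, method = "cv")   # optimal number of boosting iterations
summary(gbm_fit, n.trees = best_iter)          # relative influence of each variable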

I calculated the root-mean-square (RMS) error % of the model to assess its predictive power. Basically, the RMS error is a measure of the differences between the values predicted by a model/estimator and the values actually observed. The model's RMS error % was 15.48%, which isn't bad.

[Image: RMSE calculation]
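As a hedged sketch, such a figure might be computed along these lines, assuming a held-out test set and normalizing the RMSE by the mean observed rate (the exact normalization behind the screenshot isn't shown here):

preds = predict(gbm_fit, newdata = test, n.trees = best_iter)
rmse = sqrt(mean((preds - test$int_rate)^2))
rmse_pct = 100 * rmse / mean(test$int_rate)   # RMS error expressed as a percentage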

Below is the output of summary() on the gbm object.

Variable         Relative influence (%)
MeanFICO         64.93
term             23.36
loan_amnt         4.90
funded_amnt       3.78
purpose           1.09
annual_inc        1.07
revol_bal         0.29
home_ownership    0.27
addr_state        0.19
DTIratio          0.11
emp_length        0.01

It's easy to understand the relative weakness of the individual variables that were rolled up into the FICO score. What is interesting, however, is how little weight home ownership and loan purpose carried in determining interest rates. The contrast between a loan that finances the purchase of Jet Skis (assets that don't build equity) and a pre-mortgage-application debt consolidation loan is stark. I am also surprised by the lack of weight given to home ownership, especially when you consider LendingClub's maximum loan amount of $35,000.

Below is a graph generated with the full dataset (119,573 obs). It demonstrates the relationship between FICO, Interest Rate, and Term. I used a gam smoother (formula: y ~ s(x, bs = "cs")) as opposed to a simple lm (linear model) line because I wanted to show how the slope flattens as it approaches higher FICO scores, i.e., the diminishing returns of higher FICO scores.

[Figure: MeanFICO vs. interest rate by term, with GAM smoothers]
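A sketch of how a plot like this might be produced with ggplot2 (column names as above; the original figure's exact aesthetics may differ):

library(ggplot2)
ggplot(df, aes(x = MeanFICO, y = int_rate, colour = factor(term))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) +
  labs(x = "MeanFICO", y = "Interest rate (%)", colour = "Term")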

For comparison, a simple linear model (sketched after the list below) estimates that:

  • Every 1-point increase in MeanFICO results in a 0.096-point decrease in interest rate.
  • Every $1,000 increase in amount requested results in a 0.28-point increase in interest rate.
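A fit along these lines yields coefficients of that kind (a sketch, with the amount-requested term rescaled to $1,000 units):

lm_fit = lm(int_rate ~ MeanFICO + I(loan_amnt / 1000), data = df)
coef(lm_fit)   # slope on MeanFICO and on amount requested (per $1,000)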

LendingClub's models could, and perhaps should, be much more complicated than this. They could employ text analytics to assess volatility and default risk based on the purpose summary written by the borrower, although there is no way to verify the intent. LendingClub could also examine the microeconomic climate of each state or zip code and factor in housing availability, rent control, socioeconomic factors tied to geography, race, gender, etc. It would be interesting to learn how the government would assess the "fair and equal" nature of this type of lending.

[1] Jerome H. Friedman, http://www-stat.stanford.edu/~jhf/ (link posted for you, Chris Rice)

The Cost Function of K-Means


When exploring a novel dataset, I believe most analysts run through the familiar steps of generating summary statistics and/or plotting distributions and feature interactions. Clustering and PCA are a great way to continue assessing data, as they can reveal "natural" grouping or ordering as well as attribute variance to certain features. When working with large datasets you have to start considering the scalability and complexity of the algorithm underlying a given method; that is to say, even if you were merely a "consumer of statistics" before, you now have to consider the math. The many clustering approaches available today belong to either non-parametric/hierarchical or parametric methods, with the former differentiated into agglomerative (bottom-up) versus divisive (top-down) approaches and the latter into reconstructive versus generative methods.

A strength of the K-Means clustering algorithm (parametric, reconstructive) is its efficiency: its time complexity is O(nkt), where n, k, and t are the number of data points, clusters, and iterations, respectively. A potential strength or weakness is that K-Means converges to a local optimum, which is easier than solving for the global optimum but can lead to a poorer solution. Well-known weaknesses of K-Means include the required specification of the number of clusters (k) and sensitivity to initialization; however, both have many options for mitigation (X-means, k-means++, PCA, etc.).

The steps of K-Means are intuitive: after random initialization of the cluster centroids, the algorithm's inner loop iterates over two steps (a minimal sketch in R follows the list):

  1. Assign each observation x_i to the closest cluster centroid u_j.
  2. Update each centroid to the mean of the points assigned to it.
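Here is a minimal base-R sketch of that inner loop, where X is an n x d matrix of observations and centroids is a k x d matrix of randomly initialized centers (purely illustrative):

kmeans_step = function(X, centroids) {
  # 1. Assignment step: squared distance from every point to every centroid
  d2 = sapply(1:nrow(centroids), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
  assignments = max.col(-d2)   # index of the closest centroid for each observation
  # 2. Update step: move each centroid to the mean of its assigned points
  for (j in 1:nrow(centroids)) {
    if (any(assignments == j)) {
      centroids[j, ] = colMeans(X[assignments == j, , drop = FALSE])
    }
  }
  list(centroids = centroids, assignments = assignments)
}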

Logically, K-Means attempts to minimize distortion, defined as the sum of the squared distances between each observation and its closest centroid. The stopping criterion is the change in distortion: once the change falls below a pre-established threshold, the algorithm ends. In scikit-learn the default threshold is 0.0001. Alternatively, a maximum number of iterations can be set prior to initialization. More interesting approaches to threshold determination are outlined in Kulis and Jordan's paper "Revisiting k-means: New Algorithms via Bayesian Nonparametrics."

In the notation above, the cost (distortion) is J = sum_i || x_i - u_c(i) ||^2, where c(i) indexes the centroid closest to observation x_i.

The notion isn't mentioned much in practical use, but K-Means is fundamentally a coordinate descent algorithm. Coordinate descent minimizes a multivariate function along one direction at a time. The inner loop of k-means repeatedly minimizes the cost with respect to the cluster assignments c while holding the centroids μ fixed, and then minimizes with respect to μ while holding c fixed. This means the cost must monotonically decrease, so its value must converge. The distortion function is non-convex, however, meaning the coordinate descent is not guaranteed to reach the global minimum and the algorithm is susceptible to local optima. This is why it is good practice to run k-means many times with different random initializations of the clusters and then select the run with the lowest distortion (cost).

Since there isn't a general theoretical approach to finding the optimal number of clusters k for a given dataset, a simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion (the elbow method, BIC/Schwarz criterion, etc.). Any such approach must be taken with caution, as increasing k always yields smaller error function values but also increases the chance of overfitting.
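For example, a quick elbow-style comparison using base R's kmeans() might look like this (X again being a numeric feature matrix):

wss = sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")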

If you haven’t already, I recommend taking a look at Mini-Batch K-Means and K-Means++ leveraging the built-in joblib parallelization functions (n_jobs). The joblib dispatch method works on K-Means by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

Mahalanobis Distance and Outliers

[Image: scatterplot of multivariate data illustrating the axes used by the Mahalanobis distance]

I wrote a short article on Absolute Deviation Around the Median a few months ago after having a conversation with Ryon regarding robust parameter estimators. I am excited to see a wet-lab scientist take a big interest in ways to challenge the usual bench-stats paradigm of t-tests, means, and z-scores. Since then I've been prototyping a couple of SOFM and KNN variants at PacketSled and found it odd that some of the applied literature defaulted to Cartesian coordinates without explanation. My understanding is that how much a model benefits from one coordinate system versus another depends on the natural groupings or clusters present in the data. Taking simple k-means as an example, your choice of coordinate system depends on the full covariance of your clusters. Using the Euclidean distance assumes that clusters have identity covariances, i.e., all dimensions are statistically independent and the variance along each dimension (column) is equal to one.
Note: for more on identity covariances, check out this excellent post from Dustin Stansbury (surprising to see so much MATLAB when Fernando Pérez is at Cal):
http://theclevermachine.wordpress.com/tag/identity-covariance/

Simple Conceptualization:
If we were to visualize this characteristic (clusters with identity covariances) in 2-dimensional space, our clusters would appear as circles. Conversely, if the covariances of the clusters in the data are not identity matrices, the clusters might appear as ellipses. I suspect some of the affinity for simpler distance measures in these papers comes down to maintaining computational efficiency (time/numerical integrity), as the Mahalanobis distance requires inverting the covariance matrix, which can be costly.

Consider the image above to be a scatterplot of multivariate data where the axes are defined by the data itself rather than the data being plotted in a predetermined coordinate space. The center (black dot) is the point whose coordinates are the averages of the coordinates of all points in the figure. The black line that runs along the direction of highest variance through the field of points can be called the x-axis. The second axis (red line) is orthogonal to the first. The scale can be determined by the 68-95-99.7 (three-sigma) rule, which dictates that about 68% of the points should lie within one unit of the center (origin). Essentially, the Mahalanobis distance is a Euclidean distance that accounts for the covariance of the data by down-weighting the axes with higher variance.
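To keep this concrete, here is a hedged base-R sketch of flagging outliers by Mahalanobis distance (the Python version lives in the notebook linked below; X is a numeric feature matrix):

d2 = mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared Mahalanobis distances
# Under approximate multivariate normality, d2 ~ chi-squared with ncol(X) degrees of freedom
outliers = which(d2 > qchisq(0.975, df = ncol(X)))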

I’ll move on to a quick Python implementation of an outlier detection function based on the Mahalanobis Distance calculation. See below for the IPython notebook:

View the notebook at nbviewer.

Quick Look: Facebook’s Kaggle Competition

Following Friday's news of yhat's ggplot port (which I hope they promptly rename to avoid search-engine conflation with other variants), I thought it'd be fun to explore the large Stack Overflow dataset Facebook provided (9.7 GB) for their latest Kaggle competition. I discovered that the ggplot port is off to a great start and will only get better as they address the missing core features (..count.., ..density.., etc.). I skipped using it for the specific visualization I wanted and instead utilized matplotlib and R's ggplot2 via rmagic/rpy2.

Below is work I generated in IPython with Pandas, NumPy, and R. When dealing with data this large, I've utilized Pandas' HDF5 capabilities or imported the data into my local instance of MongoDB to then be queried from Python. Once this is in place I can leverage some of NLTK's features such as sentence and word tokenization, part-of-speech tagging, and text classification. NLTK lets you use classification algorithms such as Naive Bayes, Maximum Entropy, Logistic Regression, Decision Trees, and SVMs. I've only had the opportunity to use NLTK on one project involving a client's presence on Twitter (substantially smaller data), and I'd love to see how it handles larger datasets.

On the topic of sentiment analysis, check out Stanford’s incredible new API:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html

View the notebook at nbviewer.