Regularized Logistic Regression Intuition

kldavenport_com_regularized_logistic_regressionIn this notebook we’ll manually implement regularized logistic regression in order to facilitate intuition about the algorithm’s underlying math and to demonstrate how regularization can address overfitting or underfitting. We’ll then implement logistic regression in a practical manner utilizing the ubiquitous scikit-learn package. The post assumes the reader is familiar with the concepts of optimization, cross validation, and non-linearity.

Andrew Ng’s excellent Machine Learning class on Coursera is not only an excellent primer on the theoretical underpinnings of ML, but it also introduces it’s students to practical implementations via coding. Unfortunately for data enthusiasts that are pickled in Python or R, the class excercises are in Matlab or Octave. Ng’s class introduces logistic regression with regularization early on which is logical as it’s methods underpin more advanced concepts. If you’re interested, check out Caltech’s more theory heavy Learning From Data course on edX which digs a little deeper in does the same but REALLY dives deep into perceptron.

Using training set X and label set y (data from Ng’s class), we’ll use Logistic Regression to estimate the probability that an observation is of label class 1 or 0. Specifically we’ll predict whether computer microchips will pass a QA test based on two measurements X[0] and X[1].

We’ll generate additional features from the original QA data (feature engineering) in order to create a better feature space to learn from and achieve a better fit. One potential pitfall is that we may overfit the training data which decreases the learned models ability to generalize causing poor performance on new unseen observations. Regularization aims to keep the influence of the new features in check.

View the notebook at nbviewer instead of this iframe.

Dynamic Time-Series Modeling

time-series SMA

Today’s article will showcase a subset of Pandas’ time-series modeling capabilities. I’ll be using financial data to demonstrate the capabilities, however, the functions can be applied to any time-series data (application logs, netflow, bio-metrics, etc). The focus will be on moving or sliding window methods. These dynamic models account for time-dependent changes for any given state in a system whereas steady-state or static models are time-invariant as they naively calculate the system in equilibrium.

In correlation modeling (Pearson, Spearman, or Kendall) we look at the co-movement between the changes in two arrays of data, in this case time-series arrays. A dynamic implementation would include a rolling-correlation that would return a series or array of new data whereas a static implementation would return a single value that represents the correlation “all at once”. This distinction will become clearer with the visualizations below.

Let’s suppose we want to take a look at how SAP and Oracle vary together. One approach is to simply overlay the time-series plots of both the equities. A better method is to utilize a rolling or moving correlation as it can help reveal trends that would otherwise be hard to detect. Let’s take a look below:

View the notebook at nbviewer instead of this iframe.

A Real World Introduction to Information Entropy


I’ve been using IPython notebook so much that it might finally be time to stand up a Pelican based site on this server in order to utilize Jake Vanderplas’ IPython integration method. This post might be my last iframe crime against proper web design principles.

The intent of this post is to generally explore information entropy applied to a toy problem in network security. I’ll outline a common problem and the basic concepts of entropy then show a practical implementation using the the Kullback-Leibler divergence and the Python data stack.

In network security the latest malware botnet threat paradigm utilizes peer-to-peer (P2P) communication methods and domain generating algorithms (DGAs). This method avoids any single point of failure and evades many countermeasures as the command and control framework is embedded in the botnets themselves instead of the outdated paradigm of relying on external servers.

A potential method of minimizing the impact of these threats is imploying a profiler that detects attributes consistent with DGA and P2P.

View the notebook at nbviewer.

Header image from, google “pachinko entropy” for some interesting links.

The Cost Function of K-Means


When exploring a novel dataset, I believe most analysts will run through the familiar steps of generating summary statistics and/or plotting distributions and feature interactions. Clustering and PCA are a great way to continue assessing data as they can explain “natural” grouping or ordering as well as attributing variance to certain features. When working with large datasets you’ll have to start considering scalability and complexity of the algorithm underlying a given method, that is to say that even if you were merely a “consumer of statistics”, you now have to consider the math. The many clustering approaches available today belong to either non-parametric/hierarchical or parametric methods, with differentiation in the former being agglomerative (bottom-up) versus divisive (top-down) and reconstructive versus generative methods for the latter.

A strength of the K-Means clustering algorithm (Parametric-Reconstructive) is it’s efficiency (Big O Notation) with  onkt  where n, k, t equal the number of iterations, clusters, and data points. A potential strength or weakness is that K-Means converges to a local optimum which is easier than solving for the global optimum but can lead to less optimal convergence. Well known weaknesses of K-Means include required cluster (k) specification and sensitivity to initialization, however both have many options for mitigation (x-means, k-means ++, PCA, etc.)

The steps of K-Means tend to be intuitive, after random initialization of cluster centroids, the algorithm’s inner-loop iterates over two steps:

  1. Assign each observation  x_i to the closest cluster centroid u_j
  2. Update each centroid to the mean of the points assigned to it.

Logically, K-Means attempts to minimize distortion defined by the the sum of the squared distances between each observation and its closest centroid. The stopping criterion is the change in distortion, once the change is lower than a pre-established threshold, the algorithm ends. In sci-kit learn the default threshold is 0.0001. Alternatively a max number of iterations can be set prior to initialization. More interesting approaches to threshold determination are outlined in Kulls and Jordan’s paper: New Algorithms via Bayesian Nonparametrics.

k-means cost

The notion isn’t mentioned much in practical use, but K-Means is fundamentally a coordinate descent algorithm. Coordinate descent serves to minimize a multivariate function along one direction at a time. The inner-loop of k-means repeatedly minimizes the function with respect to k while holding μ fixed, and then minimizes with respect to μ while holding k fixed. This means the function must monotonically decrease and that values must converge. This distortion function is a non-convex function, meaning the coordinate descent is not guaranteed to converge to the global minimum and the algorithm can be susceptible to local optima. This is why it is optimal to run k-means many times using random initialization values for the clusters, then selecting the run with lowest distortion or cost.

Since there isn’t a general theoretical approach to find the optimal number of k for a given data set, a simple approach is to compare the results of multiple runs with different k classes and choose the best one according to a given criterion (Elbow, BIC, Schwarz Criterion, etc.). Any approach must be taken with caution as increasing k results in smaller error function values, but also increases the chance of overfitting.

If you haven’t already, I recommend taking a look at Mini-Batch K-Means and K-Means++ leveraging the built-in joblib parallelization functions (n_jobs). The joblib dispatch method works on K-Means by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.