Dynamic Time-Series Modeling

[Header image: time-series simple moving average (SMA)]

Today’s article showcases a subset of Pandas’ time-series modeling capabilities. I’ll be using financial data for the demonstration; however, the functions can be applied to any time-series data (application logs, netflow, biometrics, etc.). The focus will be on moving or sliding window methods. These dynamic models account for time-dependent changes in the state of a system, whereas steady-state or static models are time-invariant in that they naively treat the system as being in equilibrium.
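To make the idea concrete, here is a minimal sketch of a static summary versus a rolling (simple moving average) one. It assumes a recent Pandas with the `.rolling()` API (older releases exposed `pd.rolling_mean` instead), and the simulated prices below stand in for real data:

```python
import numpy as np
import pandas as pd

# Simulated daily closing prices, purely for illustration; the notebook uses
# real financial data.
idx = pd.date_range("2014-01-01", periods=250, freq="B")
close = pd.Series(np.random.default_rng(0).normal(size=250).cumsum() + 100,
                  index=idx, name="close")

overall_mean = close.mean()               # static: one number for the whole series
sma_30 = close.rolling(window=30).mean()  # dynamic: a 30-day simple moving average series
```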

In correlation modeling (Pearson, Spearman, or Kendall) we look at the co-movement between the changes in two arrays of data, in this case time-series arrays. A dynamic implementation would use a rolling correlation that returns a new series of values, whereas a static implementation returns a single value that represents the correlation “all at once”. This distinction will become clearer with the visualizations below.

Let’s suppose we want to take a look at how SAP and Oracle vary together. One approach is to simply overlay the time-series plots of both equities. A better method is to use a rolling or moving correlation, as it can help reveal trends that would otherwise be hard to detect. Let’s take a look below:
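Before the notebook, here is a hedged sketch of the same idea on simulated return series (the real notebook uses actual SAP and Oracle price history):

```python
import numpy as np
import pandas as pd

# Simulated return series standing in for SAP and Oracle; the notebook pulls
# actual price history.
idx = pd.date_range("2014-01-01", periods=500, freq="B")
rng = np.random.default_rng(1)
sap = pd.Series(rng.normal(size=500), index=idx)
orcl = pd.Series(0.6 * sap.values + rng.normal(size=500), index=idx)

static_corr = sap.corr(orcl)                      # a single number, the "all at once" view
rolling_corr = sap.rolling(window=60).corr(orcl)  # a new series: 60-day windowed correlation
rolling_corr.plot(title="60-day rolling correlation")
```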

View the notebook at nbviewer instead of this iframe.

A Real World Introduction to Information Entropy

[Header image: pachinko entropy]

I’ve been using the IPython Notebook so much that it might finally be time to stand up a Pelican-based site on this server in order to utilize Jake Vanderplas’ IPython integration method. This post might be my last nbviewer.org iframe crime against proper web design principles.

The intent of this post is to explore information entropy as applied to a toy problem in network security. I’ll outline a common problem and the basic concepts of entropy, then show a practical implementation using the Kullback-Leibler divergence and the Python data stack.

In network security, the latest botnet threat paradigm utilizes peer-to-peer (P2P) communication methods and domain generation algorithms (DGAs). This approach avoids any single point of failure and evades many countermeasures, as the command-and-control framework is embedded in the botnet itself rather than relying on external servers as in the older paradigm.

A potential method of minimizing the impact of these threats is employing a profiler that detects attributes consistent with DGA and P2P behavior.
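This isn’t the notebook’s exact implementation, but a minimal sketch of the underlying idea: score a domain by the Kullback-Leibler divergence between its character distribution and a baseline distribution built from domains we assume to be legitimate. The alphabet, baseline string, and example domains below are arbitrary illustrative choices:

```python
from collections import Counter
import numpy as np
from scipy.stats import entropy

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

def char_distribution(domain):
    """Return the character frequency distribution of a domain string."""
    counts = Counter(c for c in domain.lower() if c in ALPHABET)
    dist = np.array([counts.get(c, 0) for c in ALPHABET], dtype=float)
    dist += 1e-9                      # smoothing so the KL divergence stays finite
    return dist / dist.sum()

# Hypothetical baseline built from domains assumed to be legitimate.
baseline = char_distribution("googlefacebookamazonwikipediayoutubetwitter")

for domain in ["wikipedia", "xkqzjwtrvbq"]:          # the second string mimics DGA output
    kl = entropy(char_distribution(domain), baseline)  # KL(P || Q)
    print(domain, round(kl, 3))
```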

View the notebook at nbviewer.

Header image from http://archive.wired.com/magazine/2010/11/pl_decode_pachinko/all/; Google “pachinko entropy” for some interesting links.

The Cost Function of K-Means

[Header image: k-means]

When exploring a novel dataset, I believe most analysts run through the familiar steps of generating summary statistics and/or plotting distributions and feature interactions. Clustering and PCA are a great way to continue assessing data, as they can reveal “natural” groupings or orderings as well as attribute variance to certain features. When working with large datasets you’ll have to start considering the scalability and complexity of the algorithm underlying a given method; that is to say, even if you were merely a “consumer of statistics”, you now have to consider the math. The many clustering approaches available today belong to either non-parametric/hierarchical or parametric methods, with the former differentiated into agglomerative (bottom-up) versus divisive (top-down) approaches and the latter into reconstructive versus generative methods.

A strength of the K-Means clustering algorithm (parametric, reconstructive) is its efficiency, with a time complexity of O(nkt), where n, k, and t are the number of data points, clusters, and iterations respectively. A potential strength or weakness is that K-Means converges to a local optimum, which is easier than solving for the global optimum but can yield a poorer solution. Well-known weaknesses of K-Means include the required specification of the number of clusters (k) and sensitivity to initialization; however, both have many options for mitigation (x-means, k-means++, PCA, etc.).

The steps of K-Means tend to be intuitive. After random initialization of the cluster centroids, the algorithm’s inner loop iterates over two steps (a minimal code sketch follows the list):

  1. Assign each observation x_i to the closest cluster centroid μ_j
  2. Update each centroid to the mean of the points assigned to it.
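
A minimal NumPy sketch of that inner loop, using random toy data and no empty-cluster handling (in practice you would reach for scikit-learn’s KMeans):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                         # toy data
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]   # random initialization

for _ in range(100):
    # Step 1: assign each observation to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):          # converged
        break
    centroids = new_centroids
```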

Logically, K-Means attempts to minimize distortion, defined as the sum of the squared distances between each observation and its closest centroid. The stopping criterion is the change in distortion: once the change falls below a pre-established threshold, the algorithm ends. In scikit-learn the default threshold (tol) is 0.0001. Alternatively, a maximum number of iterations can be set prior to initialization. More interesting approaches to threshold determination are outlined in Kulis and Jordan’s paper, Revisiting k-means: New Algorithms via Bayesian Nonparametrics.
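Written out, that distortion (cost) function is commonly expressed as:

```latex
J(c, \mu) = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c_i} \right\rVert^{2}
```

where c_i indexes the centroid currently assigned to observation x_i.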

[Figure: K-Means cost (distortion) function]

The notion isn’t mentioned much in practical use, but K-Means is fundamentally a coordinate descent algorithm. Coordinate descent minimizes a multivariate function along one direction (or block of coordinates) at a time. The inner loop of k-means repeatedly minimizes the distortion with respect to the cluster assignments c while holding the centroids μ fixed, and then minimizes with respect to μ while holding c fixed. This means the distortion must monotonically decrease and its value must converge. The distortion function is non-convex, so coordinate descent is not guaranteed to converge to the global minimum and the algorithm can be susceptible to local optima. This is why it is common practice to run k-means many times with random initialization values for the clusters, then select the run with the lowest distortion or cost.
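scikit-learn bakes this restart strategy into the estimator itself; a small sketch with arbitrary data and parameter values:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))

# n_init controls how many random initializations are tried; the fit with the
# lowest distortion (inertia_) is the one that is kept.
km = KMeans(n_clusters=3, init="random", n_init=25).fit(X)
print(km.inertia_)   # final distortion of the best run
```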

Since there isn’t a general theoretical approach to finding the optimal number of clusters k for a given data set, a simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion (the elbow method, BIC/Schwarz criterion, etc.). Any approach must be taken with caution, as increasing k results in smaller error function values but also increases the chance of overfitting.
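A hedged sketch of the elbow approach, again on arbitrary toy data: compute the inertia for a range of k values and look for the point where the marginal improvement flattens out.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))

# Inertia (distortion) always drops as k grows; look for the "elbow" where
# the improvement levels off.
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    print(k, round(km.inertia_, 1))
```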

If you haven’t already, I recommend taking a look at Mini-Batch K-Means and k-means++ initialization, leveraging the built-in joblib parallelization (n_jobs). The joblib dispatch method works on K-Means by breaking the pairwise matrix down into n_jobs even slices and computing them in parallel.
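A small sketch of the mini-batch variant with k-means++ initialization (whether KMeans itself still accepts n_jobs depends on your scikit-learn version, so the example sticks to parameters that have been stable across releases):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(100_000, 10))

# Mini-batch K-Means trades a small amount of cluster quality for a large
# speedup by updating centroids from random subsets of the data.
mbk = MiniBatchKMeans(n_clusters=8, init="k-means++", batch_size=1024).fit(X)
print(mbk.inertia_)
```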

Mahalanobis Distance and Outliers

[Header image: Mahalanobis distance]

I wrote a short article on Absolute Deviation Around the Median a few months ago after having a conversation with Ryon regarding robust parameter estimators. I am excited to see a wet-lab scientist take a big interest in ways to challenge the usual bench-stats paradigm of t-tests, means, and z-scores. Since then I’ve been prototyping a couple of SOFM and KNN variants at PacketSled and found it odd that some of the applied literature defaulted to Cartesian coordinates without explanation. My understanding is that how much a model’s efficacy improves by using one coordinate system over another depends on the natural groupings or clusters present in the data. Taking simple k-means as an example, your choice of coordinate system depends on the full covariance of your clusters. Using the Euclidean distance assumes that clusters have identity covariances, i.e. all dimensions are statistically independent and the variance along each dimension (column) is equal to one.
Note: for more on identity covariances, check out this excellent post from Dustin Stansbury (surprising to see so much MATLAB when Fernando Pérez is at Cal):
http://theclevermachine.wordpress.com/tag/identity-covariance/

Simple Conceptualization:
If we were to visualize this characteristic (clusters with identity covariances) in 2-dimensional space, our clusters would appear as circles. Conversely, if the covariances of the clusters in the data are not identity matrices, the clusters might appear as ellipses. I suspect some of the affinity for simpler distance measures in these papers comes down to maintaining computational efficiency (time/numerical integrity), as the Mahalanobis distance requires the inversion of the covariance matrix, which can be costly.

Consider the image above to be a scatterplot of multivariate data where the axes are defined by the data itself rather than by a predetermined coordinate space. The center (black dot) is the point whose coordinates are the average values of the coordinates of all points in the figure. The black line that runs along the direction of highest variance through the field of points can be called the x-axis. The second axis (red line) is orthogonal to the first. The scale can be determined by the 68-95-99.7 (three-sigma) rule, which dictates that roughly 68% of points should lie within one unit of the center (origin). Essentially, the Mahalanobis distance is a Euclidean distance that accounts for the covariance of the data by down-weighting the axes with higher variance.
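This isn’t the notebook’s function, just a minimal sketch of the calculation it builds on; the simulated data and the cutoff of 3 are arbitrary illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 1]], size=500)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix

# Mahalanobis distance of every observation from the sample mean; large
# values flag candidate outliers.
d = np.array([mahalanobis(x, mu, cov_inv) for x in X])
outliers = X[d > 3]   # three-sigma-style cutoff, purely illustrative
```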

I’ll move on to a quick Python implementation of an outlier detection function based on the Mahalanobis Distance calculation. See below for the IPython notebook:

View the notebook at nbviewer.