I discovered Multivariate Adaptive Regression Splines (MARS) two years ago when I started visiting the Machine Learning Task View page to learn about R’s predictive analytics packages. Up to that point I had only used traditional modeling tools such as logistic and multivariate regression in R, and the k-nearest neighbor and C4.5 classifiers in WEKA. Coincidentally, about one week later I read on Twitter about Salford Systems’ data conference in San Diego. The first thing I noticed was their corporate brand logos for algorithms such as MARS, CART, and Random Forests. I thought to myself, “I guess their corporate employees, and not academics, created these algorithms and donated portions of the code for open source implementations”. I checked out their site and found that Dr. Jerome Friedman of Stanford’s Statistics department was involved in some capacity in the creation of Salford Systems. Perhaps Dr. Friedman worked out a licensing agreement that allows Salford to say in their FAQ section that they have the “truest implementation” of MARS. Oddly enough, there is even a STATA implementation called MARSplines. I also found Dr. Breiman’s Berkeley site, with a notice of Salford’s exclusive rights to the Random Forests branding here.

This all somehow reminded me of the R vs SAS fiasco, where a SAS exec misrepresented R, and even open source endeavors in general, during a NYT interview. She continued to do so in the subsequent “apology” (the comments section is pretty good, and the gap between R and SAS has only widened since then). Back to my point: I was curious how open source implementations of MARS such as earth can exist given this licensing, but perhaps it’s only the name MARS that’s licensed. If MARS was developed and published on Stanford University’s time, would it not belong to Stanford the same way a Qualcomm electrical engineer’s FPGA design belongs to Qualcomm? I’d be interested in learning more about the intricacies of algorithm IP.

One of the things I love most about data science is the community contributions: the Apache Software Foundation’s projects (Cassandra, Hadoop, Hive, Pig, etc.) and the Python data stack (NumPy, SciPy, pandas, NLTK, etc.), to name a few. This shift toward open source is obvious in data science, data management, and academia, as these tools are quickly surpassing any offerings from SAS and SPSS.

*2013-03-08 update*: I attended a Salford webinar a few days ago, and during the Q&A chat I asked about the differences between commercial and open source implementations of MARS but didn’t get an answer. A couple of days later I received an email stating, “The original Friedman version which is available through Salford remains the fastest and the most complete but there are other implementations in R.” I replied asking if they could quantify this “completeness” somehow. Perhaps the improvements come from thread optimization or some kind of enhanced boosting. I don’t know, but I’d like to find out.