Adapted from Biel 2011
I found Professor Julian McAuley’s work at UCSD while searching for academic work on identifying the ontology and utility of products on Amazon. Professor McAuley and his students have done impressive work inferring networks of substitutable and complementary items. They constructed a browsable graph of related products and discovered topics, or ‘microcategories’, that are associated with those product relationships. Much of this work relies on topic modeling, and since I’ve never applied it in academia or at work, this blog post will be a practical intro to Latent Dirichlet Allocation (LDA) through code.
More broadly, what can we do with LDA, and what do we need to know about it?
- It is an unsupervised learning technique that assumes documents are produced from a mixture of topics
- LDA extracts key topics and themes from a large corpus of text
- Each topic is a distribution over words, usually shown as an ordered list of representative words (ordered by each word’s importance to the topic)
- LDA describes each document in the corpus by its allocation to the extracted topics
- There are many domain-specific methods for creating training datasets
- It is easy to use for exploratory analysis
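To make the bullets above concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python so every step is visible. It is an illustration under stated assumptions, not the method used in the work cited above: the function name `fit_lda`, the hyperparameter defaults, and the four toy "reviews" are all made up for this example, and in practice you would more likely reach for an existing topic modeling library.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=100, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents.

    Hypothetical helper for illustration; alpha and beta are the usual
    symmetric Dirichlet priors on document-topic and topic-word mixtures.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # token counts per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic: (how much the document likes topic k)
                # times (how much topic k likes this word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Each document as a smoothed mixture over topics (each row sums to 1).
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Each topic as its most frequent words, most important first.
    top_words = [
        [w for w, c in counts.most_common(3) if c > 0] for counts in topic_word
    ]
    return doc_mix, top_words


# Made-up toy corpus, only to exercise the function.
docs = [
    "oil filter engine oil change".split(),
    "engine oil filter leak".split(),
    "wiper blade rain windshield".split(),
    "windshield wiper blade streak rain".split(),
]
doc_mix, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
for mix in doc_mix:
    print([round(p, 2) for p in mix])
```

The two return values map onto the bullets: `top_words` is the ordered word list for each topic, and `doc_mix` is each document’s allocation to the extracted topics. Results depend on the random seed and on the number of topics you ask for, which is part of why LDA suits exploratory work better than anything requiring a single definitive answer.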
We’ll be using a subset (reviews_Automotive_5.json.gz) of the 142.8 million reviews spanning May 1996 – July 2014 that Julian and his team have compiled and made conveniently available on their site.
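As a rough sketch of how a file like this could be read, the snippet below streams a gzipped file with one JSON object per line and tokenizes the review text. I am assuming that each line is strict JSON and that the review body lives in a `reviewText` field; check both against the file you actually download. The helper names `read_reviews` and `tokenize` are my own.

```python
import gzip
import json
import re


def read_reviews(path):
    """Yield one dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())
```

Because `read_reviews` is a generator, you can peek at the first thousand reviews without loading the whole file, for example with `itertools.islice(read_reviews("reviews_Automotive_5.json.gz"), 1000)`, and then pass `tokenize(review["reviewText"])` for each review to whatever topic model you are fitting.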
Where do we start when we stumble across a dataset we don’t know much about? Let’s say one where we don’t necessarily understand the underlying generative process for some or all of the variables. Let’s assume for now that we’re sure there aren’t one-off interventions or level shifts in the data, and that we don’t know anything about the distribution of the features, trends, seasonality, model parameters, variance, etc.
I tend to start with the simplest, most interpretable models first, regardless of whether the problem requires classification, regression, or causal modeling. This allows me to assess how difficult the problem is before wasting time applying a complex solution.
The IPython notebook below will outline exploratory analysis in terms of 1) Histograms and Aggregation, 2) Correlation Structure, 3) Dimensionality Reduction. Note this isn’t meant to be an exhaustive effort to enumerate all types of imputation and pre-processing, but a quick examination of some best practices.
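For readers who want a feel for those three steps before opening the notebook, here is a small NumPy sketch. It is not the notebook’s code: the data is synthetic (three made-up features, the third built to track the first), and PCA via SVD is just one common choice for the dimensionality reduction step.

```python
import numpy as np

# Synthetic stand-in data: 200 rows, 3 features.
# The third feature is deliberately almost a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# 1) Histograms and aggregation: bin one feature, summarize all of them.
counts, edges = np.histogram(X[:, 0], bins=10)
means = X.mean(axis=0)
stds = X.std(axis=0, ddof=1)

# 2) Correlation structure: pairwise Pearson correlations between columns.
corr = np.corrcoef(X, rowvar=False)

# 3) Dimensionality reduction: PCA via SVD of the mean-centered data.
centered = X - means
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
projected = centered @ components[:2].T  # rows projected onto the top 2 components

print("histogram counts:", counts)
print("correlation matrix:\n", corr.round(2))
print("explained variance ratio:", explained.round(3))
```

With data built this way, the correlation matrix should flag the first and third features as nearly redundant, and the explained variance ratios should show that two components capture most of the variation. The features here are on similar scales; with real data you would usually standardize the columns before the PCA step.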
My Evil The Following with absoluteBLACK’s direct mount oval ring.
The purpose of this post is to show how a sponsorship/marketing manager might track their athletes or brand ambassadors. The code we’re writing below can be used for many other applications, such as tracking general trends across locales or HR insidiously monitoring whether employees are discussing the company in a manner consistent with social media policies (just kidding!). Since we’re doing something specific, I hope this post doesn’t get lost in the sea of web scraping and Twitter data mining posts built on yet another beginner-friendly R or Python package that abstracts everything away.
Twitter helps me stay up to date with the #pydata community by following prominent contributors and hashtags. I also use Twitter (and Facebook) to see what famous athletes and brands are doing in the mountain bike world, and I discover events and great places to visit that I wouldn’t have stumbled across in other ways. Most of these major brands sponsor professional athletes and, to a different extent, ambassadors who might be competitive amateurs or supporters of a local community, such as trail builders and K-12 instructors. These companies do so to spread word of their brand, technology, and experience. One especially well-known and sought-after sponsorship program is from Patagonia (ski patrol and surf competitors take note!).
There are plenty of general social media management companies out there, like https://hootsuite.com/ and https://buffer.com/, that pair CRM tools with data analytics, but there aren’t many sports-specific platforms, such as http://www.hookit.com/, that enable marketing managers to track and maximize the value of their athletes and sports marketing programs. These solutions differ in that they consider an athlete’s reach via their placement in race events and their travel coverage. The latter is an intelligent metric, since doing 100 races in your back yard isn’t the same as doing 10 in a different country.
I was picked as one of the 2016 ambassadors for absoluteBLACK (a sweet component manufacturer out of England). As with other brands’ programs, they expect their ambassadors to use social media to share photos or videos of their products in a lifestyle or action shot, accompanied by relevant tags. I thought it would be interesting to take a look at absoluteBLACK’s presence on Twitter by examining who is talking about them, what they are posting, and from where. If you were in sponsorship or marketing, you could run this easy-to-configure script weekly and start examining your presence.
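To give a sense of what the “who, what, and where” questions look like in code, here is a small sketch that summarizes tweets you have already collected. It does not call the Twitter API. The flat record layout (`user`, `text`, `location`), the helper name `summarize_mentions`, and the sample tweets are all made up for illustration, so you would need to map whatever your collection step returns onto this shape.

```python
import re
from collections import Counter


def summarize_mentions(tweets, brand):
    """Count authors, hashtags, and locations for tweets mentioning a brand.

    `tweets` is a list of dicts with hypothetical keys: user, text, location.
    """
    needle = brand.lower()
    matching = [t for t in tweets if needle in t["text"].lower()]
    authors = Counter(t["user"] for t in matching)
    hashtags = Counter(
        tag.lower() for t in matching for tag in re.findall(r"#\w+", t["text"])
    )
    # Skip tweets whose location is missing or empty.
    locations = Counter(t["location"] for t in matching if t.get("location"))
    return {"authors": authors, "hashtags": hashtags, "locations": locations}


# Made-up sample records, only to show the shape of the input.
sample = [
    {"user": "rider_a", "text": "New oval ring from absoluteBLACK #mtb #ovalring", "location": "Colorado"},
    {"user": "rider_b", "text": "Trail day! #mtb", "location": "Utah"},
    {"user": "rider_a", "text": "absoluteblack chainring after 500 miles #MTB", "location": "Colorado"},
    {"user": "rider_c", "text": "Loving my AbsoluteBlack ring", "location": ""},
]
summary = summarize_mentions(sample, "absoluteBLACK")
print(summary["authors"].most_common())
print(summary["hashtags"].most_common())
print(summary["locations"].most_common())
```

Matching on the brand name as a case-insensitive substring is deliberately crude. Running a summary like this on each week’s collection and comparing the counters over time is one simple way to see whether ambassadors are actually posting.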