Can machines find fans of Star Wars?

Sep 6, 2020

Star Wars is one of the most famous symbols of pop culture and the entertainment industry. It has expanded into films and other media, including television series, video games, novels, comic books, theme park attractions, and themed areas. There are millions of Star Wars fans all over the world. Can we find them automatically with machine learning? Can machine learning identify the common language among those fans? This article presents my study on applying AI and machine learning techniques to find out.

The data for this study came from reddit.com, where user-submitted posts are organized by subject into “subreddits”. Submissions from the ‘starwars’ and ‘marvelstudios’ subreddits were collected through the ‘pushshift’ API. Using these posts, a variety of classification models were built to predict which subreddit a post came from.
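
The collection step could look roughly like the sketch below. It is not the code used in the study: the endpoint URL, the `subreddit`, `size` and `before` parameters, and the `data`, `title` and `created_utc` response fields are assumptions based on the public Pushshift submission search as it worked around 2020, and the helper names `build_query` and `parse_page` are hypothetical. A hand-written payload stands in for a live response so the sketch runs offline.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public Pushshift submission search as it existed
# around 2020. It may have changed or require authorization since then.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query(subreddit, before=None, size=100):
    """Build the request URL for one page of submissions from a subreddit."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # Unix timestamp: only fetch older posts
    return PUSHSHIFT_URL + "?" + urlencode(params)


def parse_page(payload):
    """Return (titles, oldest_timestamp) from one decoded JSON response."""
    posts = payload.get("data", [])
    titles = [p["title"] for p in posts if p.get("title")]
    oldest = min((p["created_utc"] for p in posts), default=None)
    return titles, oldest


# Hand-written payload in the assumed response shape (not real Reddit data).
sample = {"data": [
    {"title": "Favorite lightsaber duel?", "created_utc": 1599350400},
    {"title": "Mandalorian season 2 trailer", "created_utc": 1599264000},
]}
titles, oldest = parse_page(sample)

# The oldest timestamp of one page becomes the `before` cursor of the next.
next_url = build_query("starwars", before=oldest)
print(next_url)
```

In a real run, each URL would be fetched and decoded as JSON, and the loop would continue page by page until the desired number of posts per subreddit is reached.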

Data preparation

20,000 posts were collected from each of the two subreddits (starwars and marvelstudios)
Only the titles of the posts were used for this study, and duplicate titles were discarded
Final posts kept: 19,044 from ‘starwars’ and 19,184 from ‘marvelstudios’
The relationship between prediction accuracy and the number of posts was investigated using a Multinomial Naive Bayes model. For different numbers of posts (half from the starwars subreddit and half from the marvelstudios subreddit), the prediction accuracy on the train and test datasets was plotted in the chart below:

As the chart above shows, when the number of posts was small, there was a large gap in prediction accuracy between the train and test datasets, which meant more posts were needed. Once the number of posts exceeded 10,000, the train and test accuracies moved closer together. Considering both model quality and computing cost, 20,000 posts (10,000 from the starwars subreddit and 10,000 from the marvelstudios subreddit) were used for the rest of the study.
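
An experiment of this kind can be sketched with scikit-learn as follows. This is a minimal illustration, not the study's code: the titles are synthetic stand-ins generated from a tiny made-up vocabulary, and the function name `accuracy_by_size`, the default 75/25 split and the sample sizes are all assumptions.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def accuracy_by_size(sw_titles, mv_titles, sizes, seed=42):
    """Return (n_posts, train_accuracy, test_accuracy) for each size."""
    rows = []
    for n in sizes:
        half = n // 2  # half of the posts from each subreddit
        titles = sw_titles[:half] + mv_titles[:half]
        labels = [1] * half + [0] * half  # 1 = starwars, 0 = marvelstudios
        X_train, X_test, y_train, y_test = train_test_split(
            titles, labels, stratify=labels, random_state=seed)
        vec = CountVectorizer()
        model = MultinomialNB()
        model.fit(vec.fit_transform(X_train), y_train)
        rows.append((n,
                     model.score(vec.transform(X_train), y_train),
                     model.score(vec.transform(X_test), y_test)))
    return rows


# Synthetic stand-in titles (NOT the Reddit data used in the study).
rng = random.Random(0)
SHARED = ["movie", "new", "best", "scene", "trailer", "fan", "theory"]
SW_WORDS = ["jedi", "sith", "vader", "skywalker", "lightsaber"]
MV_WORDS = ["avengers", "thanos", "stark", "thor", "endgame"]


def fake_titles(topic_words, count):
    """Three shared words plus one topic word per title."""
    return [" ".join(rng.choices(SHARED, k=3) + rng.choices(topic_words, k=1))
            for _ in range(count)]


sw_titles = fake_titles(SW_WORDS, 500)
mv_titles = fake_titles(MV_WORDS, 500)
curve = accuracy_by_size(sw_titles, mv_titles, sizes=[40, 200, 1000])
for n, train_acc, test_acc in curve:
    print(f"{n:>5} posts: train={train_acc:.3f} test={test_acc:.3f}")
```

Plotting the two accuracy columns against the number of posts gives a chart of the kind described above. A gap that narrows as the number of posts grows suggests that adding more data will bring little further benefit.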

Text Vectorizers comparison

For a machine to read human language, the words must first be converted into numbers. This is called text vectorization, and there are many ways to do it. In this study, I tried two vectorizers, the Count Vectorizer and the Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer, before building Multinomial Naive Bayes models. The models gave similar prediction results with both vectorizers, so the Count Vectorizer was used to convert text into numbers in the rest of this study.
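
To make the difference between the two vectorizers concrete, here is a small sketch using scikit-learn's `CountVectorizer` and `TfidfVectorizer` with their default settings. The three titles are invented for illustration, and the study may have used different settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented example titles (not from the collected data).
titles = [
    "the jedi and the sith",
    "the avengers assemble",
    "the jedi return",
]

# Count vectorizer: each column holds how often a word occurs in a title.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(titles).toarray()

# TF-IDF vectorizer: counts are re-weighted so that words appearing in
# many titles (such as "the") matter less than rarer words.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(titles).toarray()

# vocabulary_ maps each word to its column index.
the_col = count_vec.vocabulary_["the"]
avengers_col = count_vec.vocabulary_["avengers"]

print("counts :", counts[1, the_col], counts[1, avengers_col])
print("tf-idf :", round(weights[1, the_col], 3),
      round(weights[1, avengers_col], 3))
```

In the second title, "the" and "avengers" each occur once, so their counts are equal. Under TF-IDF, "the" receives a lower weight because it appears in every title. Both vectorizers produce a matrix of the same shape, with one row per title and one column per word.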

Classification models comparison

Several machine learning models were tried in this study: Multinomial Naive Bayes (MNB), k-nearest neighbors (KNN), logistic regression, random forest, and support vector machine (SVM). The accuracy of these models is summarized in the table below:

From the above table, we can see:

The MNB and logistic regression models gave good scores on both the training and testing datasets
The random forest and support vector machine models gave good scores on the training datasets, but their scores on the testing datasets were not as good
The KNN model did not perform as well as the other models
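
A comparison like this can be sketched by running each model through the same pipeline and recording both train and test accuracy. The code below is only an illustration under assumptions: it uses default hyperparameters (the study's settings are not stated), a hypothetical helper `compare_models`, and a tiny invented dataset, so its scores say nothing about the real results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def compare_models(titles, labels, seed=42):
    """Return {model name: (train accuracy, test accuracy)}."""
    X_train, X_test, y_train, y_test = train_test_split(
        titles, labels, stratify=labels, random_state=seed)
    models = {
        "MNB": MultinomialNB(),
        "KNN": KNeighborsClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Random forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
    }
    scores = {}
    for name, model in models.items():
        # Same count vectorizer in front of every classifier.
        pipe = make_pipeline(CountVectorizer(), model)
        pipe.fit(X_train, y_train)
        scores[name] = (pipe.score(X_train, y_train),
                        pipe.score(X_test, y_test))
    return scores


# Tiny invented dataset (NOT the Reddit data): 1 = starwars, 0 = marvelstudios.
SW_WORDS = ["jedi", "sith", "vader", "yoda", "lightsaber"]
MV_WORDS = ["avengers", "thanos", "stark", "thor", "endgame"]
titles = ([f"best {w} scene" for w in SW_WORDS] * 4
          + [f"best {w} scene" for w in MV_WORDS] * 4)
labels = [1] * 20 + [0] * 20

scores = compare_models(titles, labels)
for name, (train_acc, test_acc) in scores.items():
    print(f"{name:<20} train={train_acc:.3f} test={test_acc:.3f}")
```

A train score well above the test score for the same model is the usual sign of overfitting, which is the pattern reported above for the random forest and SVM models.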

Top driving words

Thanks to the interpretability of the logistic regression model, we can easily extract the words that drive the machine to classify a post as coming from the Star Wars subreddit. The top 15 driving words and their coefficients in the logistic regression model were plotted in the following chart. Each coefficient is the weight a word contributes toward classifying a post as from the Star Wars subreddit: the larger the coefficient, the higher the chance that a post containing that word is classified as Star Wars. These top driving words could also be considered popular common words among Star Wars fans.
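
Extracting the driving words can be sketched as below, assuming the Star Wars class is labeled 1 so that positive coefficients push a post toward Star Wars. The helper name `top_driving_words` and the toy titles are invented for illustration. Each additional occurrence of a word with coefficient c multiplies the model's odds of "Star Wars" by e^c.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def top_driving_words(titles, labels, k=15):
    """Return the k words with the largest positive coefficients."""
    vec = CountVectorizer()
    X = vec.fit_transform(titles)
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    # vocabulary_ maps word -> column index; sort words by column so they
    # line up with the coefficient vector.
    words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
    # For a binary model, coef_[0] holds one weight per word; positive
    # weights favor the class labeled 1.
    ranked = sorted(zip(model.coef_[0], words), reverse=True)
    return [(word, float(coef)) for coef, word in ranked[:k]]


# Tiny invented dataset (NOT the Reddit data): 1 = starwars, 0 = marvelstudios.
SW_WORDS = ["jedi", "sith", "vader", "yoda", "lightsaber"]
MV_WORDS = ["avengers", "thanos", "stark", "thor", "endgame"]
titles = ([f"best {w} scene" for w in SW_WORDS] * 4
          + [f"best {w} scene" for w in MV_WORDS] * 4)
labels = [1] * 20 + [0] * 20

top = top_driving_words(titles, labels, k=5)
for word, coef in top:
    print(f"{word:<12} {coef:+.3f}")
```

On this toy data, the words shared by both classes ("best", "scene") end up with coefficients near zero, while the words that occur only in the Star Wars titles receive the largest positive weights.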

To see the impact of the driving words on the machine learning model, I removed the top 15 driving words from all the posts and built the logistic regression model again. The prediction accuracy dropped from 90.3% to 85.3%.
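
One way to run this kind of ablation is to pass the words to drop as `stop_words` to `CountVectorizer`, so they never reach the model. The sketch below assumes that approach (the article does not say how the words were removed), and it uses a hypothetical helper `fit_without` and an invented toy dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def fit_without(titles, labels, removed_words, seed=42):
    """Fit logistic regression with some words dropped; return (pipe, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        titles, labels, stratify=labels, random_state=seed)
    # Words listed in stop_words are ignored when building the vocabulary.
    vec = CountVectorizer(stop_words=list(removed_words) or None)
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    pipe.fit(X_train, y_train)
    return pipe, pipe.score(X_test, y_test)


# Tiny invented dataset (NOT the Reddit data): 1 = starwars, 0 = marvelstudios.
SW_WORDS = ["jedi", "sith", "vader", "yoda", "lightsaber"]
MV_WORDS = ["avengers", "thanos", "stark", "thor", "endgame"]
titles = ([f"best {w} scene" for w in SW_WORDS] * 4
          + [f"best {w} scene" for w in MV_WORDS] * 4)
labels = [1] * 20 + [0] * 20

baseline_pipe, baseline_acc = fit_without(titles, labels, removed_words=[])
ablated_pipe, ablated_acc = fit_without(titles, labels,
                                        removed_words=["jedi", "sith"])
print(f"baseline={baseline_acc:.3f} without driving words={ablated_acc:.3f}")
```

Comparing the two test accuracies shows how much the model relies on the removed words. The modest drop reported above suggests that many other words still carry signal.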

Summary

This study demonstrated that machine learning models can identify, with high accuracy, which of two subreddits a post came from. The work could be extended to classify more than two classes. With the interpretability of logistic regression models, we can also find the most common words of a subreddit. As mentioned at the beginning, these classification models have the potential to be used for finding Star Wars fans and sending recommendations to them.