Predicting the Number of Upvotes for a Headline on Hacker News


Introduction

You can download the dataset here (stories_cleaned.csv).
The dataset contains more than 1 million rows; for simplicity, we will use only a random sample of 3,000 rows to make our exciting predictions!


Background

Hacker News is a social news website focusing on computer science and entrepreneurship.
Users can upvote stories, and the stories with the most upvotes make it to the front page, making them more visible to the community.


Technologies

Python, pandas, NumPy, scikit-learn, natural language processing (bag-of-words model).


Goal

We want to predict the number of upvotes a headline would receive from the Hacker News community.
We will accomplish this in three high-level steps:

  1. Read in and clean the dataset
  2. Train the model
  3. Make a prediction

Let's begin!



Step 1: Read in and clean the dataset

Let's first import pandas and NumPy.

import pandas as pd
import numpy as np

Now, let's read in and clean the dataset. We will take a random sample of 3,000 rows from the million-plus entries.
stories = pd.read_csv("stories_cleaned.csv")
stories = stories.dropna()          # drop rows which contain missing values
stories = stories.sample(n=3000)    # randomly sample only 3000 rows
stories.columns = ["num_upvotes", "headline"]    # set column names
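Optionally, a quick sanity check (not part of the original walkthrough) confirms the sample looks right:

# optional sanity check
print(stories.shape)    # should be (3000, 2) after sampling and renaming
print(stories.head())   # peek at the first few rows: num_upvotes and headline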



Step 2: Train the model

To keep this demo short and sweet, I'll skip some explanations of how the code works. However, I've added comments to help you follow along.
One important term to know, though, is "token". Think of a token as an individual word in a headline; each individual word is also a "feature" in a machine learning model.
So, "token" = "feature" = an individual word.

First, we prepare the data we need for model training:

### tokenization - breaking headlines down into individual words ###
lists_of_tokens = []                    # will be a list of lists
for hl in stories["headline"]:          # for each headline in the column
    lists_of_tokens.append(hl.split())  # split on whitespace

### lowercasing and removing punctuations ###
punctuations = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
lists_of_tokens_cleaned = []

for token_list in lists_of_tokens:
    cur_cleaned_list = []
    for token in token_list:       # for each token in the list
        token = token.lower()      # convert to all lowercase
        for punc in punctuations:  # remove all punctuations
            token = token.replace(punc, "")
        cur_cleaned_list.append(token)  # append the cleaned token to cur list
    lists_of_tokens_cleaned.append(cur_cleaned_list)  # append the cleaned list to l_o_t_cleaned
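If you want to sanity-check the cleaning (purely optional), you can peek at one cleaned headline; the exact output depends on which rows were sampled:

# optional: peek at the first cleaned headline's tokens (varies with the random sample)
print(lists_of_tokens_cleaned[0])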

### initiate a df with 0s ###
# find the tokens we want to keep: only tokens that occur more than once
unique_tokens = []  # tokens that have occurred more than once
once_tokens = []    # tokens that have occurred exactly once so far

for token_list in lists_of_tokens_cleaned:
    for token in token_list:
        if token not in once_tokens:
            once_tokens.append(token)
        elif token not in unique_tokens:  # ...but is in once_tokens[] then..
            unique_tokens.append(token)

# create the dataframe: each row = a headline, each column = a token.
counts = pd.DataFrame(0, index=np.arange(len(lists_of_tokens_cleaned)), columns=unique_tokens)

### fill in counts for each cell ###
for row_index, token_list in enumerate(lists_of_tokens_cleaned):  # enumerate() gives us the row index
    for token in token_list:
        if token in unique_tokens:
            counts.loc[row_index, token] += 1   # increment the count for this token

### remove noisy tokens - those that occur too seldom or too often ###
word_counts = counts.sum(axis=0)     # sum each column; word_counts is a Series
counts = counts.loc[:, (word_counts >= 10) & (word_counts <= 100)]  # keep only tokens that occur between 10 and 100 times (inclusive)
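If you're curious how aggressive this filter is (this check isn't in the original walkthrough), you can compare the number of token columns before and after; the exact numbers vary with each random sample:

# optional: see how many tokens survived the filter (numbers vary per sample)
print(len(word_counts))    # unique tokens before filtering
print(counts.shape[1])     # tokens kept as model features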

We now split the dataset into training and test sets.
# split dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, stories["num_upvotes"], test_size=0.2)  # X is the counts of tokens

Then we train our model!
# train
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
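The walkthrough goes straight to making a prediction, but if you'd like a rough sense of fit, a quick check on the held-out test set might look like this (mean squared error is just one possible metric, and this step isn't part of the original code):

# optional: evaluate the model on the held-out test set
from sklearn.metrics import mean_squared_error

test_predictions = lr.predict(X_test)
print(mean_squared_error(y_test, test_predictions))    # average squared error in upvotes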



Step 3: Make a prediction

Let's create a function that takes in a headline and spits out a predicted number of upvotes based on our model's knowledge :).

def upvote_predictor(headline):
    # tokenization
    tokenized_headline = headline.split()

    # lowercasing and removing punctuations
    punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
    tokenized_headline_cleaned = []
    for token in tokenized_headline:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokenized_headline_cleaned.append(token)

    # create prediction input
    single_counts = pd.DataFrame(0, index=[0], columns=X_test.columns)
    for token in tokenized_headline_cleaned:
        if token in single_counts.columns:
            single_counts[token] += 1

    prediction = lr.predict(single_counts)
    return prediction[0]    # lr.predict returns an array; return the single predicted value

All we need to do now is to call our custom function to make a prediction!
# calling our custom function
print(upvote_predictor("How many upvotes would I get?!"))

Output: a single predicted number of upvotes (the exact value will vary from run to run).



Wrap Up

If you run the code multiple times, you will notice that we get a different predicted number of upvotes on every run. That's because our model is trained on a different random set of 3,000 rows out of the original million-plus entries each time!
There are certainly many optimizations that could improve our model's prediction accuracy: using the entire dataset, adding other useful features such as the length of the headline, trying different thresholds for removing noisy features, and so on.
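As a rough sketch of the headline-length idea (the column name headline_length is hypothetical, and this would need to happen before the train/test split so the model can learn from it):

# hypothetical sketch: add the number of tokens per headline as an extra feature
# (run this before train_test_split so the model can use the new column)
counts["headline_length"] = [len(token_list) for token_list in lists_of_tokens_cleaned]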

But that's it for now. Thank you for accompanying me on this exciting exploration. I hope you enjoyed it as much as I did!