
Predictive Models

Interviewer-AI

twitter.com/interviewpy

The Interviewer is a machine learning project built on OpenAI’s GPT-2 text generation and Twitter’s Activity API. Its goal is to create and share questions that could potentially be asked in an interview. It uses a combined dataset of questions from sites like LinkedIn, Glassdoor, and Indeed, plus Google results from searching “Interview questions filetype:pdf” and grabbing the text with OCR (optical character recognition). The model was fine-tuned with Max Woolf’s gpt-2-simple (https://github.com/minimaxir/gpt-2-simple) over 3-4 hours of training on Google Colab. Twitter’s Activity API, paired with a NodeJS webhook (exposed through Ngrok), watches for new answers submitted by “interviewees” on Twitter. Upon submission, a read receipt, typing indicator, and auto-reply message are sent to confirm a successful answer submission. Answers are then retweeted anonymously for all to see and interact with. The purpose of this project is to collaborate and see how others might answer unique or bizarre questions, all while being challenged and having fun!
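For context, fine-tuning and sampling with gpt-2-simple looks roughly like the sketch below; the dataset filename, step count, and sampling parameters are placeholders, not the exact values used for the bot.

```python
import gpt_2_simple as gpt2

# Download the base 124M GPT-2 checkpoint (smallest model, fastest to fine-tune on Colab)
gpt2.download_gpt2(model_name="124M")

# Fine-tune on the combined interview-question dataset (filename is a placeholder)
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="interview_questions.txt",
              model_name="124M",
              steps=1000)          # a few hours on a Colab GPU

# Sample brand-new interview questions from the fine-tuned model
questions = gpt2.generate(sess,
                          length=60,
                          temperature=0.8,
                          nsamples=5,
                          return_as_list=True)
print(questions)
```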

I originally wrote the Interviewer in 2018 as a tool to help me become more familiar with interview questions. When I was asked a question I hadn’t prepared for or hadn’t heard in a while, I often became anxious. This was something I wanted to improve on, and I used my programming skills to build a solution. The result was a simple Twitter bot that tweeted interview questions daily and used the old Twitter API to gather the latest Direct Message and retweet it anonymously later. This “new-answer check” was hooked up to a cron job that fired every two minutes due to Twitter API request limits. The limitation became a real issue when multiple “interviewees” answered within the same two-minute window, in which case only the latest answer was retweeted.

After some time the dataset of questions became repetitive, and I wanted to create new ones, but I couldn’t write them myself without biasing the dataset. That’s when I knew incorporating machine learning would solve the issue and really create something interesting. The combination of GPT-2’s text generation and Twitter’s Activity API solved both obstacles: generating one-of-a-kind questions and monitoring Direct Messages in real time.


 

SMARTX: A STOCKX PROJECTION MODEL

This time-series model forecasts future sneaker resale prices on StockX using Facebook’s Prophet model. The goals of this project are to:

  • Help consumers buy products at a desired/lowest price

  • Help investors maximize their profits by forecasting increases in value

The StockX Projection Model scrapes every sale made on a given sneaker (a specific size or all sizes) using an unofficial API and organizes them into a dataframe. From there, the Prophet model takes the ‘ds’ and ‘y’ (date-stamp and price) columns and creates a prediction X days into the future (X can be set to any value).
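A minimal fit-and-forecast sketch, using a dummy dataframe in place of the scraped sales and an illustrative 90-day horizon (the package is `prophet`; older installs import it as `fbprophet`):

```python
import pandas as pd
from prophet import Prophet   # "fbprophet" on older installs

# Placeholder for sales scraped from StockX: one row per sale,
# ds = sale date and y = sale price (Prophet's required column names).
sales = pd.DataFrame({
    "ds": pd.date_range("2019-01-01", periods=365, freq="D"),
    "y":  220 + pd.Series(range(365)) * 0.1,   # dummy prices, not real data
})

m = Prophet()
m.fit(sales)

# Extend the timeline X days into the future and predict (X = 90 here, configurable)
future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```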

The model also incorporates “holidays”, either from Prophet’s built-in library of country-based holidays or from user-generated dates. In this case, a scraper grabs the dates of past “Seller Fee Promotions” to include. Other dates (“Valentine’s Day”, “Tax Day”, “Graduation”, “Back-to-School”) were added to look for correlations; a sketch of both approaches follows the list below.

  • Left: User-generated dates with the addition of other dates

  • Right: Prophet's built-in holidays
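A minimal sketch of wiring in both kinds of holidays; the promotion and event dates below are illustrative placeholders, not the values the scraper actually collected:

```python
import pandas as pd
from prophet import Prophet

# User-generated "holidays": scraped Seller Fee Promotion dates plus hand-picked events
promos = pd.DataFrame({
    "holiday": "seller_fee_promo",
    "ds": pd.to_datetime(["2019-03-15", "2019-08-02", "2019-11-29"]),
    "lower_window": 0,
    "upper_window": 2,      # let the promo effect carry a couple of days past the date
})
extra = pd.DataFrame({
    "holiday": ["valentines_day", "tax_day", "graduation", "back_to_school"],
    "ds": pd.to_datetime(["2019-02-14", "2019-04-15", "2019-06-01", "2019-08-15"]),
    "lower_window": 0,
    "upper_window": 1,
})
m = Prophet(holidays=pd.concat([promos, extra]))

# Alternatively, use Prophet's built-in country holiday calendar
# m = Prophet()
# m.add_country_holidays(country_name="US")
```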

After fitting and forecasting, the output shows changepoints throughout the graph that indicate major price changes. The graph GIF at the very top of this page, a Plotly graph, gives an example of deeper interaction and exploration.

Calling the plot_components method provides charts showing the impact of “holidays” as well as the daily/weekly/yearly seasonalities.
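Continuing from the fit-and-forecast sketch above, the plotting calls look roughly like this (`plot_plotly` additionally requires the plotly package):

```python
from prophet.plot import add_changepoints_to_plot, plot_plotly

# Overlay detected changepoints (major shifts in the resale-price trend) on the forecast
fig = m.plot(forecast)
add_changepoints_to_plot(fig.gca(), m, forecast)

# Interactive Plotly version, like the graph GIF at the top of the page
plot_plotly(m, forecast)

# Holiday impact plus daily/weekly/yearly seasonality panels
m.plot_components(forecast)
```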

Lastly, another scrape grabs all sales across every size. Once organized into a dataframe, a categorical plot visualizes the frequency and price of each sale for each available size.
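A sketch of that plot with seaborn, using a tiny dummy dataframe in place of the scraped all-sizes data:

```python
import pandas as pd
import seaborn as sns

# Placeholder for the all-sizes scrape: every sale with its shoe size and price
all_sales = pd.DataFrame({
    "size":  ["8", "8", "9", "9.5", "10", "10", "11"],
    "price": [230, 245, 260, 250, 300, 310, 280],
})

# Frequency and price of sales per size as a strip-style categorical plot
sns.catplot(data=all_sales, x="size", y="price", kind="strip", height=5, aspect=2)
```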


 

South Korea Covid-19 “Probability of Death” Predictive Model

With stay-at-home orders in place, my parents were becoming a little anxious in quarantine and were trying to inch their way out to see friends and family. At the time, the US didn’t have much public data to work with; most of it was disorganized and full of NULLs. Then I found this dataset hosted on Kaggle and thought a predictive model might be something interesting to build.

First things first: what are we working with, and what do I like? A few columns really grabbed my attention, some of which I later learned were removed by the uploader for privacy reasons. These columns were age (rounded into 10-year brackets), disease (an underlying-disease boolean), and contact_number (the number of people the infected person had been in close contact with). Unfortunately the contact_number column was full of NULLs, and removing those rows would have gutted the dataset, so I decided against using it. I chose the following columns and began with standard exploratory data analysis (EDA):
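A first-look sketch, assuming the Kaggle file is the patient-level CSV (the filename here is a guess) and using the column names described above:

```python
import pandas as pd

# Load the patient-level data; filename and exact column names are assumptions
df = pd.read_csv("PatientInfo.csv")

# Keep only the columns of interest; contact_number is skipped because it is mostly NULL
cols = ["sex", "age", "disease", "confirmed_date", "released_date", "deceased_date"]
df = df[cols]

# Standard first pass of EDA
print(df.info())
print(df.isnull().sum())
print(df.describe(include="all"))
```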

I decided to work with the released_date and deceased_date columns to measure symptom duration starting from the confirmed_date. Combined with pd.to_datetime, this let me create days_diff_deceased and days_diff_recovered columns.
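Continuing the sketch above, the date handling looks roughly like this:

```python
import pandas as pd

# Parse the date columns, then measure duration from the confirmation date
for col in ["confirmed_date", "released_date", "deceased_date"]:
    df[col] = pd.to_datetime(df[col])

df["days_diff_recovered"] = (df["released_date"] - df["confirmed_date"]).dt.days
df["days_diff_deceased"]  = (df["deceased_date"] - df["confirmed_date"]).dt.days
```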

After that I turned the sex and deceased columns into booleans, renamed a few columns for clarity, and called up sns.heatmap to show the correlation between values. This was followed by a series of graphs visualizing the data to better represent the numbers.
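A sketch of that encoding and heatmap step; how the deceased flag was originally derived, and whether the age brackets come in as strings like "60s", are assumptions here:

```python
import pandas as pd
import seaborn as sns

# Boolean encodings (deriving deceased from deceased_date is an assumption)
df["male"] = (df["sex"] == "male").astype(int)
df["deceased"] = df["deceased_date"].notnull().astype(int)

# Age is bracketed by decade; handle a string form like "60s" just in case
df["age"] = pd.to_numeric(df["age"].astype(str).str.rstrip("s"), errors="coerce")

# Correlation heatmap across the numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
```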

After cleaning up the dataset, I used scikit-learn’s train_test_split and K-Nearest Neighbors to fit a true/false (deceased or not) predictive model.
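A minimal classifier sketch, reusing the encoded columns from above; the feature set and k value are illustrative, not the tuned choices:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features named in the write-up: sex, age bracket, underlying disease
X = df[["male", "age", "disease"]].fillna(0).astype(float)
y = df["deceased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)   # k is illustrative
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```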

Finally, to create the probability model, a logistic regression approach was used to calculate the probability. The input required values for sex, age range, and the presence of an underlying disease, and the output was the likelihood of death.
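Roughly, using the same train/test split as the KNN sketch (the example input values are made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Probability of death for, e.g., a male in his 60s with an underlying disease
sample = pd.DataFrame([[1, 60, 1]], columns=["male", "age", "disease"])
print("P(deceased):", logreg.predict_proba(sample)[0][1])
```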

There were a few more features I wanted to incorporate, such as mode of travel, commute routes, and regions, but these were removed for privacy. A major challenge in creating this model was how limited the dataset was, since new numbers only came in every three weeks or so.

But using what was given, I was able to create a decent “Probability of Death” predictive model, which surprisingly did help convince my parents to stay indoors and learn how to use Zoom :D