The idea behind this project was to build a prototype for a cocktail recommendation app. Recommendations are made using cocktail ingredients normalized by amount and a list of cocktails the user has rated. Additionally, the prototype is capable of unsupervised clustering using a cocktail similarity matrix.
The data was scraped from thecocktaildb using its JSON API.
List of libraries used in this project:
library(dplyr); library(tidyr); library(data.table)
library(ggplot2); library(stringr); library(knitr);
library(jsonlite); library(broom); library(curl)
It always starts with collecting proper data before one can test any ideas or hypotheses. This project is no exception, so let's scrape the data we need from the web. Since we don't know which IDs contain cocktail data, IDs are requested sequentially and every non-empty entry is appended to the bottom of the dataframe. Through trial and error, the ID range for all cocktails in the database was determined to lie within 10000 to 20000.
database <- data.frame()
link <- "http://www.thecocktaildb.com/api/json/v1/1/lookup.php?i="
for (i in 10000:20000) {
  z <- paste(link, i, sep = "")
  data <- fromJSON(z)
  # Skip IDs that return an empty entry
  if (is.null(data$drinks)) next
  database <- rbind(database, as.data.frame(data))
  print(i)
}
# Save the resulting dataframe to a csv file
fwrite(database, file = "database.csv")
Before we proceed to cleaning the data, let's quickly explore the structure of the scraped data.
# Read previously saved data, set blank spaces to NAs.
data_base<- read.csv("data/database.csv", header=T,
na.strings=c(""))
# Remove the row numbers column
data_base<-data_base[,-1]
# Display first 5 rows
# head(data_base, n=5)
# Display column names
colnames(data_base)
## [1] "drinks.idDrink" "drinks.strDrink"
## [3] "drinks.strCategory" "drinks.strAlcoholic"
## [5] "drinks.strGlass" "drinks.strInstructions"
## [7] "drinks.strDrinkThumb" "drinks.strIngredient1"
## [9] "drinks.strIngredient2" "drinks.strIngredient3"
## [11] "drinks.strIngredient4" "drinks.strIngredient5"
## [13] "drinks.strIngredient6" "drinks.strIngredient7"
## [15] "drinks.strIngredient8" "drinks.strIngredient9"
## [17] "drinks.strIngredient10" "drinks.strIngredient11"
## [19] "drinks.strIngredient12" "drinks.strIngredient13"
## [21] "drinks.strIngredient14" "drinks.strIngredient15"
## [23] "drinks.strMeasure1" "drinks.strMeasure2"
## [25] "drinks.strMeasure3" "drinks.strMeasure4"
## [27] "drinks.strMeasure5" "drinks.strMeasure6"
## [29] "drinks.strMeasure7" "drinks.strMeasure8"
## [31] "drinks.strMeasure9" "drinks.strMeasure10"
## [33] "drinks.strMeasure11" "drinks.strMeasure12"
## [35] "drinks.strMeasure13" "drinks.strMeasure14"
## [37] "drinks.strMeasure15" "drinks.dateModified"
The columns are: the drink’s ID, the drink’s name, its category (beer-based / cocktail / soft drink), the type of drink (alcoholic / non-alcoholic), the type of glass to serve it in, preparation instructions, a link to the drink’s image (if available), its ingredients (up to 15 of them!), the corresponding measure for each ingredient, and the date the entry was last modified.
Since we are going to use ingredients and their measures, let’s keep only these columns, along with the cocktail’s name.
db <- data_base %>%
select(cocktail.name=drinks.strDrink,
contains("Ingredient"),
contains("Measure"))
#head(db,n=5)
The data entries don’t look quite homogeneous and most likely were originally pulled from different sources. Additionally, the data contains stray symbols which cannot be seen in the regular RStudio viewer but can be found by calling a cell directly (by its row and column). There are whitespaces, double whitespaces, tab characters and newline symbols. Let’s replace them with NAs.
db[ db == "" ] <- NA
db[ db == " " ] <- NA
db[ db == " "]<- NA
db[ db == "\t"] <- NA
db[ db == "\n"] <- NA
# Convert every column to character type
db <- as.data.frame(lapply(db, as.character), stringsAsFactors=FALSE)
#head(db,n=10)
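As a more general alternative, a single regular expression can catch any cell consisting only of whitespace. This is just a sketch of that idea (same effect as the replacements above, assuming all columns are already character, as they are after the conversion):
# Alternative sketch: set any whitespace-only cell to NA in one pass
db[] <- lapply(db, function(x) ifelse(grepl("^\\s*$", x), NA, x))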
Now let’s reshape the data into a long format with only 3 columns: “cocktail name”, “ingredient” and “measure”. To do that, we will first gather the ingredients and their measures into two separate dataframes and then recombine them into a single data set.
db_ing_tidy <- db %>%
select(cocktail.name, contains("Ingredient"))%>%
gather(ingredients_number, ingredient, - cocktail.name)%>%
arrange(desc(cocktail.name))
head(db_ing_tidy, n=10)
## cocktail.name ingredients_number ingredient
## 1 Zorro drinks.strIngredient1 Sambuca
## 2 Zorro drinks.strIngredient2 Bailey's irish cream
## 3 Zorro drinks.strIngredient3 White Creme de Menthe
## 4 Zorro drinks.strIngredient4 <NA>
## 5 Zorro drinks.strIngredient5 <NA>
## 6 Zorro drinks.strIngredient6 <NA>
## 7 Zorro drinks.strIngredient7 <NA>
## 8 Zorro drinks.strIngredient8 <NA>
## 9 Zorro drinks.strIngredient9 <NA>
## 10 Zorro drinks.strIngredient10 <NA>
db_m_tidy <- db %>%
select(cocktail.name, contains("Measure"))%>%
gather(measure_number, amount, - cocktail.name)%>%
rename(cocktail=cocktail.name)%>%
arrange(desc(cocktail))
head(db_m_tidy, n=10)
## cocktail measure_number amount
## 1 Zorro drinks.strMeasure1 2 cl
## 2 Zorro drinks.strMeasure2 2 cl
## 3 Zorro drinks.strMeasure3 2 cl
## 4 Zorro drinks.strMeasure4 <NA>
## 5 Zorro drinks.strMeasure5 <NA>
## 6 Zorro drinks.strMeasure6 <NA>
## 7 Zorro drinks.strMeasure7 <NA>
## 8 Zorro drinks.strMeasure8 <NA>
## 9 Zorro drinks.strMeasure9 <NA>
## 10 Zorro drinks.strMeasure10 <NA>
Since most of the cocktails don’t have all 15 ingredients filled, there are going to be multiple rows with ingredient = NA. Additionally, there might be repeated ingredients for the same cocktail. We will deal with both of these issues by first selecting distinct rows and then removing all rows containing NA in the ingredient column.
db_join<-bind_cols(db_ing_tidy,db_m_tidy)%>%
select(cocktail.name,ingredient,amount)
# Keep distinct rows only. Remove rows where ingredient is NA.
db_tidy <- db_join %>%
distinct()%>%
filter(!is.na(ingredient))%>%
filter(ingredient!="na")%>%
filter(ingredient!="NA")
head(db_tidy,n=10)
## cocktail.name ingredient amount
## 1 Zorro Sambuca 2 cl
## 2 Zorro Bailey's irish cream 2 cl
## 3 Zorro White Creme de Menthe 2 cl
## 4 Zorbatini Vodka 1 1/4 oz Stoli
## 5 Zorbatini Ouzo 1/4 oz
## 6 Zombie #3 Gold rum 1 1/2 oz
## 7 Zombie #3 Lime juice 3 tsp
## 8 Zombie #3 Rum 1 tblsp Jamaican
## 9 Zombie #3 White rum 1 tblsp
## 10 Zombie #3 Pineapple juice 1 tblsp
Let’s check if we have any NAs left in the dataframe.
cat("cocktail name/", "NAs:", sum(is.na(db_tidy$cocktail.name)),"\n")
## cocktail name/ NAs: 0
cat("ingredient/", "NAs:", sum(is.na(db_tidy$ingredient)),"\n")
## ingredient/ NAs: 0
cat("amount/", "NAs:", sum(is.na(db_tidy$amount)),"\n")
## amount/ NAs: 901
There are 901 occurrences of NA in the amount column, which indicates ingredients without a measure. Let’s list the most frequent ingredients missing their measure to get an idea of what value to replace the missing measures with.
db_na<-db_tidy%>%
filter(is.na(amount))%>%
group_by(ingredient)%>%
summarise(N=n())%>%
arrange(desc(N))
head(db_na,20)
## # A tibble: 20 × 2
## ingredient N
## <chr> <int>
## 1 Ice 90
## 2 Carbonated water 64
## 3 Orange juice 41
## 4 Nutmeg 32
## 5 Pineapple juice 28
## 6 Coca-Cola 26
## 7 Grenadine 24
## 8 Ginger ale 23
## 9 Cherry 21
## 10 Vodka 21
## 11 Lemon peel 19
## 12 Milk 19
## 13 Whipped cream 17
## 14 Lemon 15
## 15 Lime 15
## 16 Soda water 15
## 17 Salt 14
## 18 Cranberry juice 13
## 19 Bailey's irish cream 11
## 20 Kahlua 11
In most cases it’s ice (90), followed by carbonated water (64) and orange juice (41). The list consists mostly of liquids, although there are some solid ingredients like nutmeg (32), cherry (21) and salt (14). Let’s now find the most frequent measure to further estimate what a good approximation should be.
db_amount_mean<-db_tidy%>%
group_by(amount)%>%
summarise(N=n())%>%
arrange(desc(N))
head(db_amount_mean,10)
## # A tibble: 10 × 2
## amount N
## <chr> <int>
## 1 1 oz 1225
## 2 1/2 oz 1135
## 3 <NA> 901
## 4 2 oz 627
## 5 1 1/2 oz 621
## 6 1 part 442
## 7 3/4 oz 362
## 8 1 332
## 9 1 tsp 326
## 10 1 dash 213
The most frequent measure is 1 oz, which is equal to 29.57 mL. So a reasonable estimate would be somewhere around 10 mL: enough liquid to matter, without ruining the drink with too much salt (when salt is used). Taking salt’s crystal density of 2.16 g/cm3, 10 mL corresponds to roughly 21 g if packed solid, somewhat less for granular salt, so on the order of a few teaspoons. That might be considered somewhat high, depending on personal taste, but one needs a good supply of minerals and electrolytes when drinking!
db_tidy$amount[is.na(db_tidy$amount)]<-10
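As a quick back-of-the-envelope check on the salt example above (a sketch; 2.16 g/cm3 is the crystal density, so granular salt would weigh a bit less per mL):
# Rough mass of the 10 mL default if the ingredient happens to be solid salt
default_ml   <- 10
salt_density <- 2.16               # g/cm^3, crystal density of NaCl
default_ml * salt_density          # ~21.6 g, i.e. a few teaspoons at most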
Next, there are probably not that many people, except professional mixologists, who will bother making a cocktail containing 7 or more ingredients. So let’s identify those cocktails and drop them from our dataframe.
db_rm<-db_tidy%>%
group_by(cocktail.name)%>%
summarise(N.ings=n())%>%
filter(N.ings>=7)
head(db_rm,n=10)
## # A tibble: 10 × 2
## cocktail.name N.ings
## <chr> <int>
## 1 1-900-FUK-MEUP 8
## 2 151 Florida Bushwacker 8
## 3 3-Mile Long Island Iced Tea 9
## 4 Absinthe #1 7
## 5 Adam Bomb 8
## 6 Agent Orange 8
## 7 Alien Urine Sample 7
## 8 Aloha Fruit punch 7
## 9 Amaretto Liqueur 11
## 10 Angelica Liqueur 12
list_rm<-db_rm$cocktail.name
db_tidy<-db_tidy%>%
filter(!cocktail.name%in%list_rm)
Now only cocktails with fewer than 7 ingredients remain in our dataframe. I did, however, save the recipe of “1-900-F*K-MEUP" on my to-do list for, well… personal investigation.
As the last step in data preparation, let’s trim whitespace from both sides of every column and convert everything to lower case.
# Trim both sides from whitespaces (if any), convert everything to the lower case
db_tidy <- as.data.frame(sapply(db_tidy, str_trim, side = "both"))
db_tidy <- db_tidy%>%
mutate_each(funs(tolower))
# Remove intermediate dataframes
rm(list=setdiff(ls(), "db_tidy"))
# Call garbage collector
gc()
Let’s do a bit of data exploration and look for the most common ingredients.
db_top <- db_tidy %>%
group_by(ingredient) %>%
summarise(N = n()) %>%
arrange(desc(N))
head(db_top,n=15)
## # A tibble: 15 × 2
## ingredient N
## <chr> <int>
## 1 vodka 584
## 2 gin 423
## 3 orange juice 360
## 4 ice 272
## 5 lemon juice 256
## 6 grenadine 241
## 7 sugar 211
## 8 pineapple juice 204
## 9 triple sec 200
## 10 light rum 185
## 11 lemon 181
## 12 bailey's irish cream 178
## 13 kahlua 173
## 14 tequila 169
## 15 dry vermouth 166
Vodka is, without a doubt, the most popular ingredient in cocktail making (present in 584 cocktails). That actually makes sense, since it does not have a strong aftertaste and packs a lot of alcohol per mL. It is followed closely by one of my favorite liquors: gin (423).
Third place is occupied by orange juice (360), which is a great way of balancing your daily vitamin C intake. And, as expected, many cocktails (272) have to be made with ice.
There are quite diverse units of measure in the amount column: ounces (oz), milliliters (mL), tablespoons, teaspoons, jiggers, pints, etc. Let’s convert all quantifiable units to mL.
First, let’s separate every amount (for example, “4 oz”) into its quantity (4) and its unit (oz). Since the quantity can be specified as a range like “2 - 3”, we split it into two possible measures, 2 and 3, and use the first (usually the smaller) one. To deal with compound measures like “1 1/2”, we first convert them to a formula expression, “(1+1/2)”, and then use R’s parse and eval functions to calculate the numeric value.
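To make the parse-and-evaluate step concrete, here is a minimal sketch on a single measure string (the pipeline below applies the same idea column-wise, with extra regex work to isolate the numeric part):
# Turn a compound measure like "1 1/2" into the expression "(1+1/2)" and evaluate it
m <- "1 1/2"                                    # numeric part of, e.g., "1 1/2 oz"
expr <- paste0("(", gsub("\\s+", "+", m), ")")  # -> "(1+1/2)"
eval(parse(text = expr))                        # 1.5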
db_tidyz <- db_tidy %>%
mutate(number=gsub("[^[:digit:],^[:punct:] ]", "", amount)) %>%
# Trim both sides from whitespaces
mutate(numbe=str_trim(number,side="both"))%>%
# Replace repeated whitespace with a single whitespace
mutate(numb=gsub("\\s+"," ",numbe))%>%
# Select all numbers matching the following patterns
mutate(num=str_extract(numb,"[:digit:]+\\s+[:digit:]+[:punct:]+[:digit:]|[:digit:]+[:punct:]+[:digit:]+|[:digit:]+|[:digit:]+[:punct:]+[:digit:]+[:digit:]+|[:digit:]+[:punct:]+[:digit:]+[:punct:]+[:digit:]+"))%>%
# Replace all whitespaces with + sign
mutate(nu=str_replace_all(num," ","+")) %>%
# Separate everything that has "-" into two columns
separate(nu,c("nu","nuu"), sep="-") %>%
# Select text only (with white spaces)
mutate(unit=gsub("[[:digit:]|[:punct:]]","",amount))%>%
# Trim both sides from whitespaces
mutate(unit=str_trim(unit,side="both")) %>%
# Replace repeated whitespace with a single whitespace
mutate(unit=gsub("\\s+"," ",unit))
head(db_tidyz[,c(2,3,8,10)],n=10)
## ingredient amount nu unit
## 1 sambuca 2 cl 2 cl
## 2 bailey's irish cream 2 cl 2 cl
## 3 white creme de menthe 2 cl 2 cl
## 4 vodka 1 1/4 oz stoli 1+1/4 oz stoli
## 5 ouzo 1/4 oz 1/4 oz
## 6 light rum 1 oz 1 oz
## 7 creme de almond 1/2 oz 1/2 oz
## 8 sweet and sour 1 1/2 oz 1+1/2 oz
## 9 triple sec 1/2 oz 1/2 oz
## 10 orange juice 1 1/2 oz 1+1/2 oz
In the next step we are going to convert all quantifiable units (oz, shots, jiggers, etc.) into mL using the proper conversion factors. There can be multiple spellings for the same unit; for example, a teaspoon can be spelled “teaspoon”, “tsp” or even “ts p”. So we have to catch them all!
# Select only cocktail name, ingredient, nu number and unit column
db_tidyx <- db_tidyz %>%
select(cocktail.name,ingredient,nu,unit)
# Replace units with proper conversion to mL
db_tidyx <- db_tidyx %>%
mutate(unit=str_replace(unit,"ozjamaican","oz")) %>%
mutate(unit=str_replace(unit,"oz","29.5")) %>%
mutate(unit=str_replace(unit,"shot","29.5")) %>%
mutate(unit=str_replace(unit,"jigger","44.5")) %>%
mutate(unit=str_replace(unit,"cup","257")) %>%
mutate(unit=str_replace(unit,"tblsp","11.1")) %>%
mutate(unit=str_replace(unit,"tsp","3.7")) %>%
mutate(unit=str_replace(unit,"ts p","3.7")) %>%
mutate(unit=str_replace(unit,"teaspoon","3.7")) %>%
mutate(unit=str_replace(unit,"cl","10")) %>%
mutate(unit=str_replace(unit,"dl","100")) %>%
mutate(unit=str_replace(unit,"litre","1000")) %>%
mutate(unit=str_replace(unit,"liter","1000")) %>%
mutate(unit=str_replace(unit,"dash","0.9")) %>%
mutate(unit=str_replace(unit,"splash","3.7")) %>%
mutate(unit=str_replace(unit,"twist","15")) %>%
mutate(unit=str_replace(unit,"twistof","15")) %>%
mutate(unit=str_replace(unit,"can","355")) %>%
mutate(unit=str_replace(unit,"cube","12")) %>%
mutate(unit=str_replace(unit,"part","29.5")) %>%
mutate(unit=str_replace(unit,"pint","473")) %>%
mutate(unit=str_replace(unit,"glass","473")) %>%
mutate(unit=str_replace(unit,"bottles","473")) %>%
mutate(unit=str_replace(unit,"gal","3785"))
# Check if missing something important like glass or a pint
# unique(db_tidyx$unit)
All quantifiable units have now been converted to mL. However, there are still subjective units, like a “handful”; for those we will reuse the earlier 10 mL estimate for a missing measure.
Everything is ready, so let’s parse and evaluate the measures and multiply them by the unit conversion factor added in the previous step.
db_tidyc<-db_tidyx %>%
# Select all numbers matching the following patterns
mutate(unit=str_extract(unit,"[:digit:]+[:digit:]+[:punct:]+[:digit:]|[:digit:]+[:punct:]+[:digit:]+|[:digit:]+|[:digit:]+[:punct:]+[:digit:]+[:digit:]+"))%>%
# Convert unit column to numeric
mutate(unit=as.numeric(unit))%>%
# replace comma , with dot .
mutate(nu=gsub(",",".",nu)) %>%
# Add left and right brackets to the nu column for proper parsing and evaluation
mutate(nu=sub("^", "(",nu)) %>%
mutate(nu=sub("$", ")",nu))
# set NAs to 10 mL
db_tidyc[is.na(db_tidyc)]<-10
# parse and evaluate nu
m2 <- sapply(db_tidyc$nu ,function(x) eval(parse(text=x)))
# add m2 as a new column
db_tidyc$nup<-m2
# compare nu and nup for sanity check
# head(db_tidyc,20)
# multiply nup and conversion factor for the measure's unit
db_tidyc<- db_tidyc %>%
mutate(measure=nup*unit)
head(db_tidyc,10)
## cocktail.name ingredient nu unit nup measure
## 1 zorro sambuca (2) 10.0 2.00 20.000
## 2 zorro bailey's irish cream (2) 10.0 2.00 20.000
## 3 zorro white creme de menthe (2) 10.0 2.00 20.000
## 4 zorbatini vodka (1+1/4) 29.5 1.25 36.875
## 5 zorbatini ouzo (1/4) 29.5 0.25 7.375
## 6 zombie #1 light rum (1) 29.5 1.00 29.500
## 7 zombie #1 creme de almond (1/2) 29.5 0.50 14.750
## 8 zombie #1 sweet and sour (1+1/2) 29.5 1.50 44.250
## 9 zombie #1 triple sec (1/2) 29.5 0.50 14.750
## 10 zombie #1 orange juice (1+1/2) 29.5 1.50 44.250
# Select cocktail name, ingredient and measure (now in mL)
db_clean_tidy <- db_tidyc %>%
select(cocktail.name,ingredient,measure)
#head(db_clean_tidy,n=20)
# Save tidy and clean cocktails data
# fwrite(db_clean_tidy,"Data/db_clean_tidy.csv")
# Remove intermediate dataframes
rm(db_tidyx,db_tidyc, db_tidyz)
We are going to use cocktail similarity, in terms of ingredient composition, together with a list of cocktails the user has rated, to recommend cocktails they might like. For that, let’s develop a methodology to compare two cocktails and quantify how similar they are.
First, we will represent every cocktail as a vector in a high-dimensional space by transforming the data from long to wide format, where every dimension (column) is a separate cocktail ingredient. There are a total of 452 possible ingredients, so if a cocktail contains a specific ingredient, the value in the corresponding column is set to the ingredient’s amount (in mL); if it does not, the value is set to 0.
For example, the Mad Scientist cocktail, which is made from Bailey’s Irish cream, blueberry schnapps, grenadine and raspberry schnapps, will have non-zero values only in those columns and 0s everywhere else (the rest of the ingredient space). The first column holds the cocktail name.
# Hadley's trick for the spread by adding unique identifiers:
db_clean_tidy$row <- 1:nrow(db_clean_tidy)
# Let's spread it!
db_spread <- db_clean_tidy %>%
spread(ingredient, measure)
# Remove the row column
db_spread<- db_spread %>%
select(-row)
# Replace any previously uncaught NAs with 0s
db_spread[is.na(db_spread)]<-0
# Group by cocktail name
db_spread_comb <- db_spread %>%
group_by(cocktail.name) %>%
summarise_each(funs(sum))
rm(db_spread)
#fwrite(db_spread_comb,"Data/db_spreaded.csv")
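To see what a single row of this wide representation looks like, here is a quick peek (a sketch, assuming the Mad Scientist entry mentioned earlier survived the filtering; names are lower-cased at this point):
# Show only the non-zero columns of one cocktail's wide row
ms <- db_spread_comb[db_spread_comb$cocktail.name == "mad scientist", ]
ms[, sapply(ms, function(x) any(x != 0))]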
Now let’s normalize the cocktail vectors (rows), so that the inner product of each vector with itself equals 1 (i.e. each vector has unit Euclidean length).
# Normalize cocktails (vectors) using the L2 (type "2") norm
#db_spread_comb<-read.csv("Data/db_spreaded.csv")
#db_spread_comb<-db_spread_comb[,-1]
cocktail.name<-db_spread_comb$cocktail.name
dtf <- db_spread_comb[,-1]
dtf <- sapply(dtf, as.numeric)
for(i in 1:nrow(dtf)){
dtf[i,]<-dtf[i,]/norm(dtf[i,],type="2")
}
# Recombine with the cocktail names
db_norm<-data.frame(cocktail.name,dtf)
#colnames(db_norm)
# Spread is ready and normalized! Let's save it!
# fwrite(db_norm,"Data/db_spreaded_normalized.csv")
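A quick sanity check (a sketch) that the normalization behaved as intended: every row of the normalized matrix should now have unit length.
# The squared L2 norm of every cocktail vector should be ~1 after normalization
summary(rowSums(dtf^2))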
As the next step in our methodology, we are going to calculate inner products between cocktails, i.e. multiply the matrix of cocktail vectors by its transpose. Since the vectors are normalized and non-negative, this gives an easy-to-interpret cross-similarity metric (0: nothing in common, 1: identical) for every pair of cocktails in the database.
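Before running this on the full matrix, here is a toy sketch with two hypothetical three-ingredient “cocktails” to illustrate how the inner product behaves as a similarity score:
# Two normalized vectors sharing one of their two ingredients
a <- c(vodka = 30, orange.juice = 30, lime.juice = 0)
b <- c(vodka = 30, orange.juice = 0,  lime.juice = 30)
a <- a / sqrt(sum(a^2))
b <- b / sqrt(sum(b^2))
sum(a * b)   # 0.5: identical recipes would give 1, recipes with no shared ingredients give 0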
#db_norm<-read.csv("Data/db_spreaded_normalized.csv")
#db_norm<-db_norm[,-1]
dtf<-db_norm[,-1]
dtf[is.na(dtf)]<-0
# Convert everything to numeric
dtf <- sapply(dtf, as.numeric)
# Calculate inner product
x<-dtf
y<-t(x)
sumi<-x %*% y
# Replace NAs (if any) with 0
sumi[is.na(sumi)]<-0
# Convert to a dataframe and set row and column names to the cocktail names
sumidf<-as.data.frame(sumi)
colnames(sumidf)<-db_norm$cocktail.name
rownames(sumidf)<-colnames(sumidf)
# Save as db_innerproduct_matrix.csv
# fwrite(sumidf,"Data/db_innerproduct_matrix.csv")
# Get the histogram
hist(sumi[sumi>=0.0001],main="Histogram of Inner Products (Similarities)",xlab="Inner Product (>= 0.0001)")
# Ratio of pairs with moderate similarity (0.3-0.75) to pairs with high similarity (0.75-0.9999)
length(sumi[sumi<0.75&sumi>0.3])/length(sumi[sumi<0.9999&sumi>=0.75])
## [1] 5.357429
# Ratio of pairs with very low similarity (0.0001-0.3) to pairs with moderate-to-high similarity (0.3-0.9999)
length(sumi[sumi<0.3&sumi>0.0001])/length(sumi[sumi<0.9999&sumi>=0.3])
## [1] 2.054668
The majority of cocktail pairs have nothing in common (inner product = 0). However, if we remove the complete zeros (by keeping only inner products of at least 0.0001), the histogram shows an exponential-like distribution with a slow decay. There are about 5 times more pairs with moderate similarity (0.3 to 0.75) than with high similarity (0.75 to 0.9999), and about twice as many pairs in the very low range (0.0001 to 0.3) as in the moderate-to-high range (0.3 to 0.9999). This is, in fact, quite promising in terms of the chances of finding a good recommendation.
Let’s do a quick test to see how well the inner product gauges similarity between cocktails, starting with the margarita:
# Choose Margarita
x<-sumidf %>%
select(margarita)
x<-x[order(-x$margarita), , drop = FALSE]
head(x,n=7)
## margarita
## margarita 1.0000000
## midori margarita 0.9917645
## jimpop's margarita 0.9688518
## blue margarita 0.9596115
## margarita #2 0.9332931
## strawberry margarita 0.8832629
## headcrush 0.6607850
We see that our approach is capable of finding variations of the margarita recipe (very high similarity). It can also find moderately similar drinks, like the headcrush, which has tequila and salt too but differs in its other ingredients: tabasco sauce and whipped cream. Well, that does sound yummy! Let’s now test something more interesting, like the White Russian.
# Choose White Russian
x<-sumidf %>%
select(`white russian`)
x<-x[order(-x$'white russian'), , drop = FALSE]
head(x,n=7)
## white russian
## white russian 1.0000000
## almond joy 0.7870075
## brown bomber 0.7870075
## foxy lady 0.7870075
## dutch velvet 0.7856355
## cappucino cocktail 0.7662873
## orgasm 0.7643810
All of these are quite similar in taste to a White Russian. You can notice that the top 3 cocktails (not including the White Russian itself) have exactly the same similarity value, which suggests that in our database they are essentially the same cocktail under multiple names. That is, in fact, true for the almond joy and foxy lady, but the brown bomber has peanut liqueur instead of amaretto. All three are related to the White Russian through a large amount of light cream. The orgasm has both vodka and light cream (no pun intended), but the proportion of light cream is different.
Let’s explore hidden relationships between cocktails in the inner product matrix by clustering them into separate categories. For that we will use the unsupervised machine learning algorithm known as K-means, and determine a reasonable number of clusters by plotting the within-cluster sum of squares as a function of the number of clusters. In the ideal case the within-cluster sum of squares is 0, i.e. the points in each cluster sit at exactly the same location in feature space and the total sum of squares equals the between-cluster sum of squares. In practice this almost never happens, except when the number of clusters equals the number of distinct elements (observations) in the data. Obviously that does not provide any additional information: each element sits in its own cluster, which is as uninformative as the single-cluster case where all elements share one cluster. So to extract reliable information about internal patterns in the data, one has to consider this trade-off and be cautious about over- or under-estimating the number of clusters. A reasonable way to choose the number of clusters is to find an “elbow” on the graph (a pivot point, indicating a change of regime), beyond which increasing the number of clusters no longer decreases the within-cluster sum of squares very rapidly.
# Function for within sum of squares plot (if it wasn't loaded previously)
wssplot <- function(data, nc=15, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
#print(i)
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i,nstart=25, iter.max = 10)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
}
# Let's determine reasonable number of clusters using Within Sum of Squares
wssplot(sumidf)
From the plot, my best estimate for the “elbow” location lands at 6 clusters. Let’s use that to do the actual clustering and get the corresponding statistics.
# Perform unsupervised k-means clustering on the inner product matrix with 6 clusters
set.seed(1234)
km.out.ip <- kmeans(sumidf, 6, nstart =25,iter.max=100)
# Tidy the k-means results
x.km<- tidy(km.out.ip)%>%
select(cluster,size,withinss)
x.km
## cluster size withinss
## 1 1 203 3584.198
## 2 2 321 6441.378
## 3 3 247 5119.297
## 4 4 202 5863.246
## 5 5 1720 44083.070
## 6 6 313 5002.519
Most of the resulting clusters are comparable in size and within-cluster sum of squares, except for the 5th cluster, which the K-means algorithm apparently used as a “dump” to group the remaining cocktails into one big cluster. We could, of course, try to split this big group into smaller chunks by increasing the number of clusters, but we would risk grouping together cocktails that don’t have much in common.
Let’s take a quick look into the clusters by finding the most frequent ingredient in each one.
ip.x <-as.data.frame(km.out.ip$cluster)
ip.x <-data.frame(cluster=ip.x$`km.out.ip$cluster`,cocktail.name=rownames(ip.x))%>%
arrange(cluster)
head(ip.x,n=10)
## cluster cocktail.name
## 1 1 a piece of ass
## 2 1 a true amaretto sour
## 3 1 absolut evergreen
## 4 1 absolut limousine
## 5 1 absolutely cranberry smash
## 6 1 amaretto sunset
## 7 1 amer picon punch
## 8 1 aqua fodie
## 9 1 arctic fish
## 10 1 arctic mouthwash
db_clust<-merge(db_clean_tidy,ip.x,by="cocktail.name")
db_clust_top<-db_clust%>%
group_by(cluster,ingredient)%>%
summarise(N = n())%>%
filter(N==max(N))%>%
arrange(cluster,desc(N))
head(db_clust_top,n=20)
## Source: local data frame [6 x 3]
## Groups: cluster [6]
##
## cluster ingredient N
## <int> <chr> <int>
## 1 1 ice 203
## 2 2 vodka 323
## 3 3 orange juice 247
## 4 4 light rum 125
## 5 5 bailey's irish cream 141
## 6 6 gin 313
# Save the per-cluster top ingredients
#fwrite(db_clust_top,"Data/clust_compare.csv")
As can be seen, the clusters are overall quite distinct in their most frequent ingredient, which means K-means did a decent job of grouping cocktails based on their inner product values. So if, as an extra feature of our app, we ever need to group cocktails by an additional parameter, such as an overall flavor group or maybe the mood they create, unsupervised clustering can serve as a good starting point.
Finally, let’s use the inner product matrix to generate a ranked recommendation list from a small sample of user ratings. There are 15 rated cocktails in the user1.csv file, each rated on a scale from 1 to 5 (cocktail-shaped stars!). We will use the rating as a weight:
1 star = -1 (strongly dislike),
2 stars = -0.5 (prefer not to drink),
3 stars = 0.25 (drinkable),
4 stars = 0.75 (almost perfect),
5 stars = 1 (can drink it all day - all night!)
and calculate an average “preferences” vector in the inner product space.
To get the actual recommendations, we will use the k-d tree algorithm (FNN package) to find the 50 nearest neighbors of this averaged vector of user preferences.
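For reference, the star-to-weight mapping above can be written as a simple named lookup vector; this is just a sketch, since the code below builds the same mapping with a small replacement table instead:
# Star rating -> weight lookup
star_weights <- c(`1` = -1, `2` = -0.5, `3` = 0.25, `4` = 0.75, `5` = 1)
star_weights[as.character(c(5, 3, 1))]   #  1.00  0.25 -1.00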
Let’s start with loading the list of rated cocktails and converting ratings into weights.
library(FNN)
library(dplyr)
library(tidyr)
# Load preferences
user1<-read.csv("user1.csv", header = F)
colnames(user1)<-c('cocktail','rating')
user1<- mutate_each(user1, funs(tolower))
# Convert stars to weights (1 star = -1, 2 stars = -0.5, 3 stars = 0.25, 4 stars = 0.75, 5 stars = 1)
replDF <- data.frame(
rating = unique(user1$rating),
weight = NA)
replDF$weight[replDF$rating==5] <- 1
replDF$weight[replDF$rating==4] <- 0.75
replDF$weight[replDF$rating==3] <- 0.25
replDF$weight[replDF$rating==2] <- -0.5
replDF$weight[replDF$rating==1] <- -1
user1_list <- merge(user1, replDF,
by = "rating")
# Get cocktails ID numbers
for (i in 1:nrow(user1_list)){
user1_list$cocktail.id[i]<-match(user1_list$cocktail[i],colnames(sumidf))
}
user1_list<-user1_list%>%
arrange(cocktail)
head(user1_list,n=15)
## rating cocktail weight cocktail.id
## 1 3 black russian 0.25 303
## 2 4 bloody mary 0.75 329
## 3 4 blue hawaiian 0.75 342
## 4 4 blue lagoon 0.75 345
## 5 5 mai tai 1.00 1639
## 6 4 mimosa 0.75 1735
## 7 2 mint julep #1 -0.50 1740
## 8 3 mojito 0.25 1759
## 9 2 moscow mule -0.50 1791
## 10 3 old-fashioned 0.25 1874
## 11 3 sangria #1 0.25 2288
## 12 4 sex on the beach 0.75 2362
## 13 4 tequila sunrise 0.75 2677
## 14 1 whiskey sour -1.00 2934
## 15 5 white russian 1.00 2944
Now we will pull the inner product vector for each cocktail in the user’s rated list, multiply each by the appropriate weight, and average the results to get a single “preferences” vector.
# Select user rated choices from inner product matrix
user_list_raw<-sumidf[user1_list$cocktail.id,]
# Pull the user's rating weights
user_ratings<-user1_list$weight
# Apply ratings as weights to the user choices
user_list<-user_list_raw*user_ratings
# Set all 0 inner product values to NAs to do proper averaging
user_list[user_list==0]<-NA
# Calculate an average (vector) of user preferences
user_vector<-as.data.frame(t(colMeans(user_list, na.rm = TRUE)))
# Set NAs to 0
user_vector[is.na(user_vector)]<-0
#colnames(user_vector)<-cocktail.name
As the last step, we will search for the 50 nearest neighbors of this averaged “preferences” vector using the k-d tree algorithm.
# Find the 50 nearest neighbors using a k-d tree
user_vector_knnx<-get.knnx(sumidf,user_vector,k=50,algorithm = "kd_tree")
# Get the names of the suggestions and their distances
user_recc <-colnames(sumidf)[user_vector_knnx$nn.index]
user_recc_distances<-data.frame(user_recc,t(user_vector_knnx$nn.dist))
user_recc
## [1] "lucky driver" "red ox"
## [3] "love juice (lj)" "party slush punch #2"
## [5] "ocean drive" "maèek"
## [7] "climax" "spiced peach punch"
## [9] "simpson bronco" "the pineapple drink"
## [11] "mind over marny" "alfie cocktail"
## [13] "monsoon" "izayoi"
## [15] "reptile (orginal)" "orgasm"
## [17] "rootbeer floatie" "warm witch's blood"
## [19] "foxy lady" "almond joy"
## [21] "dutch velvet" "belle melon"
## [23] "lava flow" "brown bomber"
## [25] "banana colada #1" "jeweler's hammer"
## [27] "mock pink champagne #1" "monarchy luau punch"
## [29] "stardust" "yellow parakeet"
## [31] "winegum" "peach treat"
## [33] "italian sombrero" "outrigger"
## [35] "skinny dip" "sombrero"
## [37] "pink squirrel" "beelzebub"
## [39] "wet dream" "long island iced tea #5"
## [41] "golden glow punch" "hot springs cocktail"
## [43] "peach bunny" "island girl"
## [45] "coffee grasshopper" "pink panther #1"
## [47] "canadian cherry" "amaretto sweet & sour"
## [49] "phillips screwdriver" "pan galactic gargle blaster #2"
# Save user recommendations
# fwrite(user_recc_distances,"user1_reccomendations.csv")
As the outcome, I got a ranked list of cocktail recommendations to try! I made and consumed the first 3 recommended cocktails (for the sake of science, of course!) and I can confirm they hit the right spot both in terms of flavor and alcohol content. There are, of course, a lot of things that could be improved, starting with migrating to the Absolut Drinks Database API and ending with developing a minimum viable product (web, iPhone or Android app) to run tests and collect real-user data to quantify the algorithm’s performance at a larger scale. But before that, I have 47 cocktails on my list to rate. So stay tuned for updates, and cheers!