Kotlin: Tips Dataset


The tips dataset is a data frame with 244 rows and 7 variables. It contains tipping data recorded by one waiter, who noted information about each tip he received over a period of a few months working in one restaurant.


For this example, we will import the following packages:

%use multik
%use dataframe
%use lets-plot

Let’s load the “Tips” dataset and show its first 5 rows:

val tips = DataFrame.readCSV("../resources/example-datasets/datasets/tips.csv")
tips.head()

DataFrame: rowsCount = 5, columnsCount = 7

The dataset has 7 variables:

total_bill: bill amount in dollars
tip: tip amount in dollars
sex: sex of the bill payer
smoker: whether there were smokers in the party
day: day of the week
time: time of day
size: number of people in the party

While loading the dataset, some values may be mapped to the wrong data type (e.g. a Date can be loaded as a String if it is not well formatted).

With the schema() method it’s possible to see how values have been parsed.

tips.schema()

total_bill: Double
tip: Double
sex: String
smoker: Boolean
day: String
time: String
size: Int

We can analyze some statistics of categorical data (String and Boolean columns):

tips.describe { colsOf<String>() and colsOf<Boolean>() }

DataFrame: rowsCount = 4, columnsCount = 10
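To make the idea of summarizing a categorical column concrete, here is a minimal sketch in plain Kotlin (standard library only). It computes the number of values, the number of distinct values, the most frequent value and its frequency for one column. The `CategoricalSummary` type, the `summarize` function and the sample values are hypothetical names and data introduced for illustration; this is not the dataframe library's implementation, and the exact columns reported by `describe` may differ.

```kotlin
// Hypothetical summary type for illustration; not part of the dataframe library.
data class CategoricalSummary(val count: Int, val unique: Int, val top: String, val freq: Int)

// Summarize one categorical column given as a plain list of values.
fun summarize(values: List<String>): CategoricalSummary {
    require(values.isNotEmpty()) { "column must not be empty" }
    // Map each distinct value to the number of times it occurs.
    val counts = values.groupingBy { it }.eachCount()
    // Pick the entry with the highest count (the first one wins on ties).
    val (top, freq) = counts.entries.maxByOrNull { it.value }!!.toPair()
    return CategoricalSummary(values.size, counts.size, top, freq)
}

// Made-up sample values, not rows taken from tips.csv.
val sampleDays = listOf("Sat", "Sun", "Sat", "Thur", "Fri", "Sat")
val daySummary = summarize(sampleDays)
```

For the made-up sample, `daySummary` holds 6 values, 4 distinct values, and "Sat" as the most frequent value with 3 occurrences.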

As seen above, the Tips dataset has four categorical variables. To visualize them better, we can plot, for example, the number of tables served on each day of the week, broken down by party size.

ggplot(tips.select { day and size }.sortBy("size").toMap()) { x = "day" } +
    geomBar(stat = Stat.count(), position = positionDodge(), alpha = 0.8) {
        y = "..count.."; fill = asDiscrete("size")
    } +
    ggtitle("Tables served by Day and party Size")

Fridays are the quietest days. Saturdays are the busiest, followed by Sundays, meaning that there are more customers on the weekend.
The most common party size is by far 2, and there are very few lone diners.

val p1 = ggplot(tips.select { day and smoker}.toMap()) { x = "day" } +
    geomBar(
        stat = Stat.count(),
        position = positionFill(),
        alpha = 0.8,
        tooltips = layerTooltips("smoker")
            .format("..prop..", ".1%")
            .line("perc. |@..prop..")
            
    ) { y = "..prop.."; fill = "smoker"} +
    scaleYContinuous(format=".1%") +
    ggtitle("Percentage of smokers for each day")

val p2 = ggplot(tips.select { day and sex }.toMap()) { x = "day" } +
    geomBar(
        stat = Stat.count(),
        position = positionFill(),
        alpha = 0.8,
        tooltips = layerTooltips("sex")
            .format("..prop..", ".1%")
            .line("perc. |@..prop..")
        ) { y = "..prop.." ; fill = "sex" } +
    coordFlip() +
    scaleYContinuous(format=".1%") +
    ggtitle("Percentage of bill payers' sex for each day")

GGBunch().addPlot(p1, 0, 0, 400, 400).addPlot(p2, 400, 0, 400, 400)

It is now easy to notice that:

Roughly equal numbers of men and women pay the bill on weekdays, but the share of male payers increases on the weekend.
Non-smokers are the majority on most days, but on Friday, the day with the fewest customers, most parties include smokers.
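The percentages in the two charts above are shares within each day. As a rough illustration of that calculation, the following plain-Kotlin sketch computes the fraction of smoker rows per day. The `Visit` type, the `smokerShareByDay` function and the sample rows are hypothetical and made up for this example; the sketch does not reproduce how the plotting library computes `..prop..` internally.

```kotlin
// Hypothetical minimal row type with only the two columns needed here.
data class Visit(val day: String, val smoker: Boolean)

// For each day, the fraction of rows where smoker is true.
fun smokerShareByDay(rows: List<Visit>): Map<String, Double> =
    rows.groupBy { it.day }
        .mapValues { (_, visits) -> visits.count { it.smoker }.toDouble() / visits.size }

// Made-up sample rows, not taken from tips.csv.
val sampleVisits = listOf(
    Visit("Fri", true), Visit("Fri", true), Visit("Fri", false), Visit("Fri", true),
    Visit("Sat", false), Visit("Sat", true), Visit("Sat", false), Visit("Sat", false),
)
val smokerShares = smokerShareByDay(sampleVisits)
```

With this sample, the share is 0.75 for "Fri" (3 of 4 rows) and 0.25 for "Sat" (1 of 4 rows).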

Let’s now analyze the quantitative variables: total_bill and tip.

ggplot(tips.toMap()) { x = "total_bill" } +
    geomHistogram(bins = 25, fill="white", color="black") { y = "..density.." } +
    geomArea(stat = Stat.density(), fill = "orange", alpha = 0.2) +
    geomVLine(xintercept = tips.total_bill.mean(), color="red", linetype = "dashed", size = 1.0) +
    ggtitle("Total bill amounts frequencies")

This histogram shows that most bill amounts fall between 10 and 25 dollars, with the mean at about 20 dollars (red dashed line at 19.8).

We can make the same plot for tips:

ggplot(tips.toMap()) { x = "tip" } +
    geomHistogram(bins = 25, fill="white", color="black") { y = "..density.." } +
    geomArea(stat = Stat.density(), fill = "dark-green", alpha = 0.2) +
    geomVLine(xintercept = tips.tip.mean(), color="red", linetype = "dashed", size = 1.0) +
    ggtitle("Tips amounts frequencies")

As shown above, tips peak at about two dollars, while the mean is about three dollars.
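Both histograms above map y to a density rather than a raw count, so that the bar areas sum to 1 and the histogram can share an axis with the density curve. A minimal sketch of that normalization, assuming equal-width bins spanning the minimum to the maximum value, is shown below. The `densityHistogram` function and the sample values are hypothetical; Lets-Plot may place bin edges differently, so this illustrates the formula (count divided by n times bin width) rather than the library's exact output.

```kotlin
// Heights of a density-normalized histogram with equal-width bins over [min, max].
fun densityHistogram(values: List<Double>, bins: Int): List<Double> {
    require(values.isNotEmpty() && bins > 0) { "need at least one value and one bin" }
    val min = values.minOrNull()!!
    val max = values.maxOrNull()!!
    val width = (max - min) / bins
    require(width > 0.0) { "values must not all be equal" }
    val counts = IntArray(bins)
    for (v in values) {
        // The maximum value would fall just past the last bin, so clamp it into that bin.
        val index = ((v - min) / width).toInt().coerceAtMost(bins - 1)
        counts[index]++
    }
    // density = count / (n * binWidth), so the sum of height * width equals 1.
    return counts.map { it / (values.size * width) }
}

// Made-up bill amounts, not taken from tips.csv.
val sampleBills = listOf(10.0, 12.0, 14.0, 18.0, 20.0, 26.0, 30.0, 50.0)
val densityHeights = densityHistogram(sampleBills, bins = 4)
```

Here the bins are 10 dollars wide and contain 4, 2, 1 and 1 values, so multiplying each height by the bin width and summing gives 1.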

It would be more interesting to see the distribution of tips relative to the total bill.

var data = tips.add("tip_pct") { tip / total_bill }
data.head()

DataFrame: rowsCount = 5, columnsCount = 8

ggplot(data.toMap()) { x="tip_pct" } +
    geomHistogram(
        bins = 25,
        fill="gray",
        tooltips = layerTooltips("tip_pct")
            .format("tip_pct", ".1%")
    ) { y = "..density.." } +
    geomVLine(
        xintercept = data.tip_pct.mean(),
        linetype = "dashed",
        color = "red",
        size = 1.0,
    ) +
    scaleXContinuous(format = ".1%") +
    xlab("Tips Percentage") +
    ggtitle("Tips percentage on Total Bill amount")

We can see that the peak is at about 15% of the total bill. There are also some outliers; let’s look at their details in the dataframe.

data.sortBy { tip_pct.desc() }.head(5)

DataFrame: rowsCount = 5, columnsCount = 8
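The derived column and the outlier lookup above boil down to a division followed by a descending sort. The sketch below shows the same two steps in plain Kotlin on a few made-up rows. The `Bill` type, its `tipPct` property and the `topTippers` function are hypothetical names chosen for this illustration and are not part of the notebook's dataframe.

```kotlin
// Hypothetical row type; tipPct is the tip as a fraction of the total bill.
data class Bill(val totalBill: Double, val tip: Double) {
    val tipPct: Double get() = tip / totalBill
}

// The n rows with the highest tip fraction, largest first.
fun topTippers(bills: List<Bill>, n: Int): List<Bill> =
    bills.sortedByDescending { it.tipPct }.take(n)

// Made-up sample rows, not taken from tips.csv.
val sampleTipBills = listOf(Bill(20.0, 3.0), Bill(10.0, 5.0), Bill(40.0, 4.0))
val topTwo = topTippers(sampleTipBills, 2)
val meanTipPct = sampleTipBills.map { it.tipPct }.average()
```

In this sample, the 5 dollar tip on a 10 dollar bill ranks first, and a single generous tip like that pulls the mean tip fraction well above the other two values, which is why outliers are worth inspecting separately.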

It can also be interesting to analyze the amount of money spent per person in a group:

// adding Bill Per Person col
data = data.add("bill_pp") { total_bill / size }
data.head()

DataFrame: rowsCount = 5, columnsCount = 9

As before, we plot the distribution:

ggplot(data.toMap()) { x="bill_pp" } +
    geomHistogram(
        bins = 25,
        fill="gray",
    ) { y = "..density.." } +
    geomVLine(
        xintercept = data["bill_pp"].cast<Double>().mean(),
        linetype = "dashed",
        color = "red",
        size = 1.0,
    ) +
    xlab("Bill per Person") +
    ggtitle("Distribution of Bill per Person")

It can be useful to see the bill per person with total_bill in the same plot.

ggplot(data.toMap()) +
    geomHistogram(bins = 25, fill = "blue", color = "white", alpha = 0.2) {
        x = "total_bill"; y = "..density.."
    } +
    geomLine(stat = Stat.density(), color = "blue", size = 1.0) { x = "total_bill" } +
    geomHistogram(bins = 25, fill = "red", color = "white", alpha = 0.2) {
        x = "bill_pp"; y = "..density.."
    } +
    geomLine(stat = Stat.density(), color = "red", size = 1.0) { x = "bill_pp" } +
    ggtitle("Total bill amount and bill per person distributions")

We want to see whether tip percentage is related to smoking and group size:

val smokersData = 
    data.groupBy { size }
        .pivot { smoker }
        .mean { tip_pct }
        .sortBy { size }
smokersData

DataFrame: rowsCount = 6, columnsCount = 2
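Conceptually, the group, pivot and mean steps above produce one row per group size with one mean tip fraction per smoker value. A minimal plain-Kotlin sketch of that aggregation follows. The `Party` type, the `meanTipBySizeAndSmoker` function and the sample rows are hypothetical and made up; the sketch mirrors the result's shape as nested maps and is not the dataframe library's implementation.

```kotlin
// Hypothetical row type with only the columns used in the aggregation.
data class Party(val size: Int, val smoker: Boolean, val tipPct: Double)

// Result shape: size -> (smoker flag -> mean tip fraction), ordered by size.
fun meanTipBySizeAndSmoker(rows: List<Party>): Map<Int, Map<Boolean, Double>> =
    rows.groupBy { it.size }
        .toSortedMap()
        .mapValues { (_, group) ->
            group.groupBy { it.smoker }
                .mapValues { (_, members) -> members.map { it.tipPct }.average() }
        }

// Made-up sample rows, not taken from tips.csv.
val sampleParties = listOf(
    Party(4, true, 0.10), Party(2, false, 0.20), Party(2, false, 0.10),
    Party(2, true, 0.12), Party(4, false, 0.16),
)
val meanTips = meanTipBySizeAndSmoker(sampleParties)
```

For the sample, parties of size 2 average about 0.15 for non-smokers and 0.12 for smokers, and the keys come out in ascending size order (2, then 4).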

To make the data easier to plot, we rearrange it as follows:

val data = smokersData.flatten().gather("false", "true").into("smoker", "tip_pct")
data

DataFrame: rowsCount = 12, columnsCount = 3
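The rearrangement above turns a wide table (one column per smoker value) into a long one (one row per size and smoker value), which is why 6 rows become 12. The following plain-Kotlin sketch shows the same reshaping on nested maps. The `LongRow` type, the `gatherLong` function and the sample numbers are hypothetical; they only illustrate the wide-to-long idea and do not reproduce the dataframe library's `gather` internals.

```kotlin
// Hypothetical long-format row: one (size, smoker) pair with its mean tip fraction.
data class LongRow(val size: Int, val smoker: String, val tipPct: Double)

// Wide input: size -> (column name -> value). Output: one row per size and column name.
fun gatherLong(wide: Map<Int, Map<String, Double>>): List<LongRow> =
    wide.flatMap { (size, byColumn) ->
        byColumn.map { (smoker, pct) -> LongRow(size, smoker, pct) }
    }

// Made-up wide table with two sizes and two value columns.
val sampleWide = mapOf(
    1 to mapOf("false" to 0.14, "true" to 0.22),
    2 to mapOf("false" to 0.16, "true" to 0.17),
)
val sampleLong = gatherLong(sampleWide)
```

Two wide rows with two value columns each yield four long rows, one per combination of size and smoker value. The long format suits plotting because a single column can then be mapped to the fill color.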

ggplot(data.toMap()) { x = "size" ; y="tip_pct" } +
    geomBar(
        stat = Stat.identity,
        position=positionDodge(0.3),
        alpha=0.6,
        tooltips = layerTooltips("smoker", "tip_pct")
            .format("tip_pct", ".1%")
    ) { fill="smoker" } +
    ylab("Tip Percentage") +
    scaleYContinuous(format=".1%") +
    xlab("Group Size") +
    ggtitle("Smoker analysis with Tip Percentage and Group Size")

We can see that smokers’ tip percentage is generally lower than non-smokers’. Even on Friday, the day with the largest share of smokers, non-smokers tip more.