Data Grouping and Aggregation with Kotlin#

%use dataframe

Kotlin DataFrame provides a very similar interface to the one supplied by pandas. The methods that we will use more will be groupBy() and pivot(), that we have already discussed in a previous chapter.

Having no such thing as pandas.Index, with Kotlin DataFrame the structure of the tabular data is more flexible, but in the same time it lacks all the power that a labeled index or multi index could give. The tabular object represented by a Kotlin DataFrame is slightly different from the pandas one, and so we have to think differently sometimes when it comes to replicate some pandas behaviors, especially when pivoting and grouping by a variable.

Split#

The “Split” operation in both pandas and DataFrame, is performed when invoking the groupBy or pivot operation. The result of those methods are not a dataset, but a temporary data structure that needs an aggregate operation to combine all the groups that are been created.

Let’s consider the following dataset:

val rand = java.util.Random(100)
val letters = sequenceOf("A", "B", "C")

val df = mapOf(
    "letter" to List(10) { letters.shuffled().find { true } },
    "num" to List(10) { rand.nextGaussian() }
).toDataFrame()

df

DataFrame: rowsCount = 10, columnsCount = 2

If we call a groupBy with the letter column, we will split the dataset into three groups, one for each letter. In Kotlin DataFrame, we can print easily the content of the grouping operation, whereas in python we have to loop through the content of each group, because calling groupby will not compute anything except some intermediate data about the group key.

df.groupBy { letter }

GroupBy

And we can see all the groups that are being formed. The groupBy method accept any column expression provided, so the following will be still legit:

df.groupBy { letter.map { it == "A"} }

GroupBy

This is very useful when manipulating date objects and we want to group by year/month/day.

val dates = listOf(
    LocalDate(1998, 1, 2),
    LocalDate(1999, 11, 21),
    LocalDate(1999, 3, 12),
    LocalDate(1998, 6, 22),
    LocalDate(1998, 12, 25),
    LocalDate(1999, 12, 24),
    LocalDate(1999, 2, 9),
    LocalDate(1999, 6, 24),
    LocalDate(1998, 1, 20),
    LocalDate(1999, 12, 20),
).toColumn("date")

val dfd = df.add(dates)
dfd.groupBy { date.map { it.year } }

GroupBy

groupBy accepts any number of column names to group by with.

dfd.groupBy { letter and date.map { it.year } }

GroupBy

We can also override the name of the group key column in the resulting dataframe using the named keyword.

Before computing aggregation or reduction operations, we can apply some transformation to the groups, like sorting, or computing other groups modifications operations. Among those methods, there’s the concat method, that unions all data groups of groupBy into original DataFrame preserving new order of rows produced by grouping.

A very common pattern, is to use the concatenation of groupBy and pivot operations:

dfd.groupBy { date.map { it.year } named "year" }.pivot { letter }

PivotGroupBy

The result we have computed are not directly usable: as said before, the split-apply-combine strategy require that grouped data are aggregated and reduced with an application of a combining operation.

Aggregation (Apply and Combine)#

With grouped data, we can compute a large variety of operations on groups. The most common operations are defined in Kotlin DataFrame library, like mean(), sum() or count().

For example, if we take the dataframe we grouped above, we can compute the mean of the values inside each group:

val groupedDf = dfd.groupBy { date.map { it.year } named "year" }
    .pivot { letter }

groupedDf.mean()

DataFrame: rowsCount = 2, columnsCount = 2

The same exact result can be archived in python with a very similar instruction:

dfd.groupby([df['date'].dt.year, 'letter'])\
    .mean() \
    .unstack('letter')

Note that the unstack operation is used to move “letters” from the MultiIndex object created with “year”, to columns. It is also possible to call pivot_table instead of unstack, specifying what will be in place of indices, columns keys and values.

Sometimes is useful to flatten the group result, in favor of a better handling of indices. We can use flatten(), which affects every groups of the dataframe, or ungroup() for selecting only one group.

groupedDf.mean().ungroup("letter")

DataFrame: rowsCount = 2, columnsCount = 3

Every time that we create a GroupBy or Pivot object, and after applying the needed transformations, we can aggregate or reduce all data with several functions all computed in the same time, inside the aggregate function. Each operation can then be mapped in the corresponding column.

If we want to compute the sum, the mean and the number of rows of each group, we can write:

groupedDf.aggregate { 
        sum { num } into "sum"
        mean { num } into "mean"
        count() into "count"
    }.ungroup("letter")

DataFrame: rowsCount = 2, columnsCount = 3

And this approach can be more straightforward than pandas use of agg function. If we consider the example above, a similar behavior can be computed in python as follows:

df.groupby([df['date'].dt.year, 'letter']) \
    .agg([('sum', 'sum'), ('mean', 'mean'), ('count', 'count')]) \
    .unstack('letter')

Where inside agg function, we specify the name of the column and the function we want to apply.

Both in python and Kotin, inside the aggregation function a custom function can be called. For example let’s consider this function:

fun peakToPeak(arr: List<Double>): Double = arr.max() - arr.min()

We can the call the function inside the aggregation method:

groupedDf.aggregate { peakToPeak(num.toList()) }

DataFrame: rowsCount = 2, columnsCount = 2

And the same thing can be done with pandas agg easily with:

def peak_to_peak(arr):
    return arr.max() - arr.min()

groupedDf.agg(peak_to_peak)

Kotlin DataFrame is very similar in behavior with pandas, and the transition between the two platform is very natural because of the same strategies implementations.

apply#

In python, the agg function performs row-wise operations, whereas the apply function is a more general purpose. In Kotlin we are more limited with the use of aggregate, but with some turnarounds, we can stille get the same results, effortlessly.

Consider this python function:

def top(df, n=5, column):
    return df.sort_values(by=column)[-n:]

That returns the top n rows with the largest value in column.

With apply, we can call this operation to every group:

df.groupby(df['year'].dt.year, 'letter').apply(top, n=1, 'num')

and the “dataframe” operation is computed along each groups.

The same behavior can be accomplished in kotlin with:

fun top(df: AnyFrame, col: String, n: Int = 5): AnyFrame = df.sortByDesc(col).head(n)
groupedDf.aggregate { top(it, col="num", n=5) }

DataFrame: rowsCount = 2, columnsCount = 2

Where the function invocation is a little more explicit than the panda’s one.

In contrast to Kotlin DataFrame, pandas groupby and apply function are way more powerful when grouping and applying functions when slicing data up into buckets with bins or by sample quantiles (using the function cut and qcut). Kotlin DataFrame does not provide a way to handle Categorical object easily as in pandas, but some turnarounds can be made in order to emulate the behavior of pandas.

Pivot Tables and Cross Tabulation#

The pandas method pivot_table summarizes the joint use of groupBy and pivot in Kotlin Dataframe:

  • The list of values specified in the pivot table, are the ones used for compute an aggregate function in Kotlin DataFrame’s aggregate function body.

  • The index list provided to pivot_table, is represented by the list of column names inside the Kotlin DataFrame groupBy method.

  • The columns list is represented by the list of column names inside the Kotlin DataFrame’s pivot function.

  • The aggfunc parameter in Kotlin is specified inside the aggregate function body.

Let’s consider some examples with the tips dataset. (More about this dataset in the examples at the end of this document)

val tips = DataFrame.readCSV("../resources/example-datasets/datasets/tips.csv")
tips.head()

DataFrame: rowsCount = 5, columnsCount = 7

tips.groupBy { day and smoker }.sortBy { day and smoker }.mean()

DataFrame: rowsCount = 8, columnsCount = 5

In python we could have accomplished the same result with the use of pivot table as follows:

tips.pivot_table(index=['day', 'smoker'])

and by default, the aggregation function used is the mean of the data.

Note that in the following examples we are using the sortBy method for sticking to the output that pandas would provide.

We could now create a more sophisticated pivot table with this python snippet:

tips['tips_pct'] = tips['tip'] / tips['total_bill']
tips.pivot_table(['tips_pct', 'size'], index=['time', 'day'],
                 columns='smoker')

And in Kotlin that pivot table can be created with:

tips.add("tips_pct") { tip / total_bill }
    .groupBy { time and day }.sortBy { time and day }
    .pivot { smoker }
    .aggregate {
        mean("tips_pct") into "tips_pct" 
        mean { size } into "size"
    }

DataFrame: rowsCount = 6, columnsCount = 3

So a deep understanding of the pandas pivot_table method, can really helps a lot when it comes to learning how to pivot and rearrange the dataset using Kotlin DataFrame!

Conclusion#

This process of data aggregation and the application of the split-apply-combine strategy can help both data cleaning as well as modeling or statistical analysis work, and as we have seen in this chapter, Kotlin DataFrame offers almost the same pandas capabilities when it come to those operations, especially when pivoting tables.