Data Grouping and Aggregation with Kotlin#
%use dataframe
Kotlin DataFrame provides a very similar interface to the one supplied by pandas. The methods that we will use more will be groupBy()
and pivot()
, that we have already discussed in a previous chapter.
Having no such thing as pandas.Index
, with Kotlin DataFrame the structure of the tabular data is more flexible, but in the same time it lacks all the power that a labeled index or multi index could give.
The tabular object represented by a Kotlin DataFrame
is slightly different from the pandas one, and so we have to think differently sometimes when it comes to replicate some pandas behaviors, especially when pivoting and grouping by a variable.
Split#
The “Split” operation in both pandas and DataFrame, is performed when invoking the groupBy
or pivot
operation. The result of those methods are not a dataset, but a temporary data structure that needs an aggregate operation to combine all the groups that are been created.
Let’s consider the following dataset:
val rand = java.util.Random(100)
val letters = sequenceOf("A", "B", "C")
val df = mapOf(
"letter" to List(10) { letters.shuffled().find { true } },
"num" to List(10) { rand.nextGaussian() }
).toDataFrame()
df
DataFrame: rowsCount = 10, columnsCount = 2
If we call a groupBy
with the letter
column, we will split the dataset into three groups, one for each letter. In Kotlin DataFrame, we can print easily the content of the grouping operation, whereas in python we have to loop through the content of each group, because calling groupby
will not compute anything except some intermediate data about the group key.
df.groupBy { letter }
GroupBy
And we can see all the groups that are being formed.
The groupBy
method accept any column expression provided, so the following will be still legit:
df.groupBy { letter.map { it == "A"} }
GroupBy
This is very useful when manipulating date
objects and we want to group by year/month/day.
val dates = listOf(
LocalDate(1998, 1, 2),
LocalDate(1999, 11, 21),
LocalDate(1999, 3, 12),
LocalDate(1998, 6, 22),
LocalDate(1998, 12, 25),
LocalDate(1999, 12, 24),
LocalDate(1999, 2, 9),
LocalDate(1999, 6, 24),
LocalDate(1998, 1, 20),
LocalDate(1999, 12, 20),
).toColumn("date")
val dfd = df.add(dates)
dfd.groupBy { date.map { it.year } }
GroupBy
groupBy
accepts any number of column names to group by with.
dfd.groupBy { letter and date.map { it.year } }
GroupBy
We can also override the name of the group key column in the resulting dataframe using the named
keyword.
Before computing aggregation or reduction operations, we can apply some transformation to the groups, like sorting, or computing other groups modifications operations. Among those methods, there’s the concat
method, that unions all data groups of groupBy
into original DataFrame
preserving new order of rows produced by grouping.
A very common pattern, is to use the concatenation of groupBy
and pivot
operations:
dfd.groupBy { date.map { it.year } named "year" }.pivot { letter }
PivotGroupBy
The result we have computed are not directly usable: as said before, the split-apply-combine strategy require that grouped data are aggregated and reduced with an application of a combining operation.
Aggregation (Apply and Combine)#
With grouped data, we can compute a large variety of operations on groups. The most common operations are defined in Kotlin DataFrame library, like mean()
, sum()
or count()
.
For example, if we take the dataframe we grouped above, we can compute the mean of the values inside each group:
val groupedDf = dfd.groupBy { date.map { it.year } named "year" }
.pivot { letter }
groupedDf.mean()
DataFrame: rowsCount = 2, columnsCount = 2
The same exact result can be archived in python with a very similar instruction:
dfd.groupby([df['date'].dt.year, 'letter'])\
.mean() \
.unstack('letter')
Note that the unstack
operation is used to move “letters” from the MultiIndex
object created with “year”, to columns. It is also possible to call pivot_table
instead of unstack
, specifying what will be in place of indices, columns keys and values.
Sometimes is useful to flatten the group result, in favor of a better handling of indices. We can use flatten()
, which affects every groups of the dataframe, or ungroup()
for selecting only one group.
groupedDf.mean().ungroup("letter")
DataFrame: rowsCount = 2, columnsCount = 3
Every time that we create a GroupBy
or Pivot
object, and after applying the needed transformations, we can aggregate or reduce all data with several functions all computed in the same time, inside the aggregate
function. Each operation can then be mapped in the corresponding column.
If we want to compute the sum, the mean and the number of rows of each group, we can write:
groupedDf.aggregate {
sum { num } into "sum"
mean { num } into "mean"
count() into "count"
}.ungroup("letter")
DataFrame: rowsCount = 2, columnsCount = 3
And this approach can be more straightforward than pandas use of agg
function. If we consider the example above, a similar behavior can be computed in python as follows:
df.groupby([df['date'].dt.year, 'letter']) \
.agg([('sum', 'sum'), ('mean', 'mean'), ('count', 'count')]) \
.unstack('letter')
Where inside agg
function, we specify the name of the column and the function we want to apply.
Both in python and Kotin, inside the aggregation function a custom function can be called. For example let’s consider this function:
fun peakToPeak(arr: List<Double>): Double = arr.max() - arr.min()
We can the call the function inside the aggregation method:
groupedDf.aggregate { peakToPeak(num.toList()) }
DataFrame: rowsCount = 2, columnsCount = 2
And the same thing can be done with pandas agg
easily with:
def peak_to_peak(arr):
return arr.max() - arr.min()
groupedDf.agg(peak_to_peak)
Kotlin DataFrame is very similar in behavior with pandas, and the transition between the two platform is very natural because of the same strategies implementations.
apply
#
In python, the agg
function performs row-wise operations, whereas the apply
function is a more general purpose. In Kotlin we are more limited with the use of aggregate
, but with some turnarounds, we can stille get the same results, effortlessly.
Consider this python function:
def top(df, n=5, column):
return df.sort_values(by=column)[-n:]
That returns the top n
rows with the largest value in column
.
With apply
, we can call this operation to every group:
df.groupby(df['year'].dt.year, 'letter').apply(top, n=1, 'num')
and the “dataframe” operation is computed along each groups.
The same behavior can be accomplished in kotlin with:
fun top(df: AnyFrame, col: String, n: Int = 5): AnyFrame = df.sortByDesc(col).head(n)
groupedDf.aggregate { top(it, col="num", n=5) }
DataFrame: rowsCount = 2, columnsCount = 2
Where the function invocation is a little more explicit than the panda’s one.
In contrast to Kotlin DataFrame, pandas groupby
and apply
function are way more powerful when grouping and applying functions when slicing data up into buckets with bins or by sample quantiles (using the function cut
and qcut
). Kotlin DataFrame does not provide a way to handle Categorical
object easily as in pandas, but some turnarounds can be made in order to emulate the behavior of pandas.
Pivot Tables and Cross Tabulation#
The pandas method pivot_table
summarizes the joint use of groupBy
and pivot
in Kotlin Dataframe:
The list of values specified in the pivot table, are the ones used for compute an aggregate function in Kotlin DataFrame’s
aggregate
function body.The
index
list provided topivot_table
, is represented by the list of column names inside the Kotlin DataFramegroupBy
method.The
columns
list is represented by the list of column names inside the Kotlin DataFrame’spivot
function.The
aggfunc
parameter in Kotlin is specified inside theaggregate
function body.
Let’s consider some examples with the tips
dataset. (More about this dataset in the examples at the end of this document)
val tips = DataFrame.readCSV("../resources/example-datasets/datasets/tips.csv")
tips.head()
DataFrame: rowsCount = 5, columnsCount = 7
tips.groupBy { day and smoker }.sortBy { day and smoker }.mean()
DataFrame: rowsCount = 8, columnsCount = 5
In python we could have accomplished the same result with the use of pivot table as follows:
tips.pivot_table(index=['day', 'smoker'])
and by default, the aggregation function used is the mean of the data.
Note that in the following examples we are using the sortBy
method for sticking to the output that pandas would provide.
We could now create a more sophisticated pivot table with this python snippet:
tips['tips_pct'] = tips['tip'] / tips['total_bill']
tips.pivot_table(['tips_pct', 'size'], index=['time', 'day'],
columns='smoker')
And in Kotlin that pivot table can be created with:
tips.add("tips_pct") { tip / total_bill }
.groupBy { time and day }.sortBy { time and day }
.pivot { smoker }
.aggregate {
mean("tips_pct") into "tips_pct"
mean { size } into "size"
}
DataFrame: rowsCount = 6, columnsCount = 3
So a deep understanding of the pandas pivot_table
method, can really helps a lot when it comes to learning how to pivot and rearrange the dataset using Kotlin DataFrame!
Conclusion#
This process of data aggregation and the application of the split-apply-combine strategy can help both data cleaning as well as modeling or statistical analysis work, and as we have seen in this chapter, Kotlin DataFrame offers almost the same pandas capabilities when it come to those operations, especially when pivoting tables.