Data Wrangling with Kotlin#
In this chapter we will explore some tools and strategies that Kotlin DataFrame offers for all the tasks involved in data wrangling, alongside their python’s counterparts.
As always, we will import Kotlin DataFrame with the magic command:
%use dataframe
Before digging into data wrangling techniques that DataFrame offers, there is one Column type that has not been covered extensively yet: ColumnGroup
.
ColumnGroup
and FrameColumn
#
They are a special kind of columns that contains a series of column (in ColumnGroup
s) or a DataFrame
.
The power of those structures is the ability to store and organize data in a hierarchical way. This is essential when dealing with JSON serialization and deserialization.
Dealing with “nested” objects can also occur very often when using grouping and pivoting operations (discussed in next chapter), and a minimum comprehension is required before dealing with those operations.
Let’s consider a Dataframe of people with the following informations:
val name by columnOf(
"Woody Allen",
"Bob Dylan",
"Charlie Chaplin",
"John Coltrane",
"Bob Marley",
"Linus Torvalds",
"Charlie Parker",
)
val age by columnOf(15, 45, 20, 30, 15, 22, 57)
val city by columnOf(
"Rome",
"Moscow",
"Tirana",
"Sarajevo",
"Cesena",
null,
"Kyoto",
)
val weight by columnOf(55, 70, null, 80, null, null, 90)
val isDied by columnOf(false, false, true, true, true, false, true)
val people = dataFrameOf(name, age, city, weight, isDied)
people
DataFrame: rowsCount = 7, columnsCount = 5
Creating a group of columns is pretty straightforward:
people.group { age and city }.into("group")
DataFrame: rowsCount = 7, columnsCount = 4
We can also create a nested column, for example, splitting the name in a firstName
and a lastName
column:
val groupedDf = people.split { name }.by(' ').inward("firstName", "lastName")
groupedDf
DataFrame: rowsCount = 7, columnsCount = 5
Using the inward()
method splits the columns into the provided column names, nesting the inside the original column, creating a ColumnGroup
.
groupedDf.name.javaClass
class org.jetbrains.kotlinx.dataframe.impl.columns.ColumnGroupImpl
We can always access the fields of the ColumnGroup
with the .
notation
groupedDf.name.firstName
DataColumn: name = "firstName", type = String, size = 7
As said above, most of the time we will have to deal with these nested structures when using pivot
or groupBy
methods.
We can, for example, pivot the table to create columns that contains a DataFrame
: FrameColumns
groupedDf.pivot{ name.firstName }
Pivot
As the prompt above, the dataframe suggests us that this is a Pivot
object, and it should be a temporary object before applying an aggregate function or other manipulations. We will cover pivot
and groupBy
extensively in the chapter 7.
These nested structures can resemble to a pandas.MultiIndex
: they both express the concept of organizing data in a hierarchical way.
DataFrame multilevel structures differs from pandas because they do not have an explicit concept of Index
, and operations like pandas.dataframe.stack()/unstack()
would make no sense. In some ways that result can be accomplished with some trickery, but DataFrame’s ColumnGroup
or FrameColumn
are not intended to substitute pandas.MultiIndex
, even if they’re goal is very similar.
Working with Multiple DataFrames#
DataFrame provides three methods for operating with multiple DataFrame
s:
add
: adds new columns to theDataFrame
.concat
: returns the union of the providedDataFrame
s.join
: SQL-like join of twoDataFrame
s by key columns.
we already have seen an application of the add
method, but it is possible to add multiple columns all at once:
groupedDf
.convert { weight }.toDouble()
.dropNA { weight }
.add {
"year of birth" from 2023 - age
age gt 18 into "is adult"
"details" {
"weight"<Double>() / 6.35 into "weight (approx. stones)"
"full name" from { name.firstName + " " + name.lastName }
}
}
DataFrame: rowsCount = 4, columnsCount = 8
When applying concat
, it concatenates the rows of the provided DataFrame
s or DataColumn
s.
val df1 = dataFrameOf("a", "b", "c").fill(5) { it }
val df2 = dataFrameOf("a", "b", "c").fill(2) { it - 10 }
df1.concat(df2)
DataFrame: rowsCount = 7, columnsCount = 3
When concatenating dataframes with different column keys, the result is like a full join in the database world, where non matching values of the two collections are filled with null
.
val df1 = dataFrameOf("a", "b", "c").fill(10) { it }
val df2 = dataFrameOf("a", "c", "d").fill(2) { it - 10 }
df1.concat(df2)
DataFrame: rowsCount = 12, columnsCount = 4
We can also use the concat
method providing a List
object
listOf(df1, df2).concat()
DataFrame: rowsCount = 12, columnsCount = 4
The concat
method is similar to pandas.concat
method, with the difference that in pandas you can specify which axis to merge, having the Index
object that can provide a merging key when choosing axis=0
. On the other hand, When using axis=1
, merging two pandas DataFrame
s will produce a similar result to what Kotlin DataFrame
provides.
If we want to use sql like join
operations, we can use the join
method provided by Kotlin DataFrame
.
join
’s method signature is the following:
join(otherDf, type = JoinType.Inner) [ { joinColumns } ]
Having the join columns as optional, and the default join is set to Inner (only matched columns from left and right DataFrame
s).
val df1 = dataFrameOf("a", "b", "c").fill(10) { it }
val df2 = dataFrameOf("a", "c", "d").fill(2) { it }
df1.join(df2)
DataFrame: rowsCount = 2, columnsCount = 4
We can specify which column to match with the match
DSL keyword
df1.join(df2, type = JoinType.Full) { a match right.a }
DataFrame: rowsCount = 10, columnsCount = 5
And you can sport that any column that matched during the join, but not included in the joinColumn
clause, are duplicated with a new column key.
The match
keyword is used in all those cases where we can apply the join because of matching row values, but the columns keys differs by name.
Consider the next example:
people
DataFrame: rowsCount = 7, columnsCount = 5
and let’s suppose we have a new dataset with new data that can be joined with the previous one
val newPeopleDf = people.head(2) // pick just the first two
.rename("name").into("fullName") // renameing join column
.add("stonesWeight") { // add new dummy column
weight!! / 6.35
}
.select("fullName", "stonesWeight")
newPeopleDf
DataFrame: rowsCount = 2, columnsCount = 2
We can use the match
keyword for specifying which columns to use for the join operation:
people.join(newPeopleDf, type = JoinType.Left) { name match right.fullName }
DataFrame: rowsCount = 7, columnsCount = 6
There are handy shortcuts for specifying which type of join we want to perform for each kind of join. The previous code can be rewritten to:
people.leftJoin(newPeopleDf) { name match right.fullName }
DataFrame: rowsCount = 7, columnsCount = 6
See the full reference for supported types of join.
Reshaping and Pivoting#
Reshaping and Pivoting a datasets is a very common operation that is being made during Data Analysis, and Kotlin DataFrame provides a series of methods that can help the developer in the creation of different views of the same DataFrame
.
The most common operations that are used when pivoting and reshaping a dataset, are pivot
and groupBy
, and very often they’re used chained together for distributing data along rows or columns.
We will run all the examples with the macrodata
dataset
val df = DataFrame.readCSV("../resources/example-datasets/datasets/macrodata.csv")
df.head(3)
DataFrame: rowsCount = 3, columnsCount = 14
Consider the following example:
val longFormat = df.groupBy { year and quarter }
.values { realgdp and infl and unemp }
.gather { realgdp and infl and unemp }.into("item", "value")
.explode("value")
longFormat.head(10)
DataFrame: rowsCount = 10, columnsCount = 4
In Kotlin DataFrame, groupBy
operation is not only used for grouping and aggregating data, but it can be very useful for rearranging data. groupBy
takes a list of columns to group by, and produces a DataFrame
where each group key is placed in a distinct row, with its associated group (a FrameColumn
).
If we run the code above row by row, we can see how the result DataFrame
has been formed:
df.head(4) // using head just to limit the output
.groupBy { year and quarter }
GroupBy
On the other hand, if we call pivot
, all the provided columns will be the column keys of the resulting DataFrame
, creating another group of columns.
So for example, if we want to pick only all the data from 1995 to 2000, and we want to display one column for each year, we can use the pivot
method.
df.filter { year >= 1995 && year <= 2000 }.pivot { year }
Pivot
After a pivot
or groupBy
operation, we can use the method values
for selecting only some columns of the group.
df.head(4)
.groupBy { year and quarter }
.values { realgdp and infl and unemp }
DataFrame: rowsCount = 4, columnsCount = 5
Now, if we want to display data in the so called long format, with each row containing the year, quarter, item name and value, we have to make item’s columns to be mapped as rows.
With the help of the gather
we can map a set of columns to two columns: “key” containing names of the original columns and “value” containing values of the original columns.
In a certain way, gather
is the opposite of pivot
, that splits rows of a dataframe and groups them horizontally into new columns.
If we apply gather
, the result will be:
df.head(4)
.groupBy { year and quarter }
.values { realgdp and infl and unemp }
.gather { realgdp and infl and unemp }
.into("item", "value")
DataFrame: rowsCount = 12, columnsCount = 4
Lastly, the square brackets suggest us that the value
column contains a series of list, and we can flatten it with the explode
method.
In contrast to Kotlin DataFrame, pandas has more “ad-hoc” methods for both pivoting and reshaping datasets.
Having the Index
object, every DataFrame
has the ability to easily reindex
itself, or using the pivot
operation to swap the order of both index and columns, specifying which will be the columns of values.
For example, in pandas the pipeline we created above could be translated as follows:
df.pivot_table(index=['year', 'quarter'], values=['realgdp', 'infl', 'unemp']) \
.stack() \
.reset_index() \
.rename(columns={ "level_2": "item", 0: "value"})
Where the pivot_table
method is used to rearrange Index
objects and value columns (it’s a more generalized version of pivot
method).
In pandas, stack
and unstack
operations are very useful when collapsing several columns in one (like DataFrame’s gather
), or when distributing data contained in one column across several (like DataFrame’s pivot
). These operations, combined with the indices manipulations techniques (like reset
/set_index
, reindex
(works also with columns)), makes pandas more powerful and more precise when it comes to data wrangling.
Moreover, pandas Index
object can be of multiple types, for example CategoricalIndex
, DatetimeIndex
, PeriodIndex
, MultiIndex
. In this example, a PeriodIndex
would have fit the data perfectly, because we can create a range of dates from a year and a quarter: pd.PeriodIndex(year=df['year'], quarter=df['quarter'], name='date')
.
We now understand that the biggest difference between pandas and Kotlin DataFrame is the presence of an explicit Index Object that let us perform reshaping of the dataframe in a more precise way.
Note that the operation after stack
are used only to recreate the example, the Series
created with stack
is perfectly usable as is.
Conclusions#
In the chapter we have explored how Kotlin DataFrame has a full support for data wrangling tasks. The differences with pandas are sometimes remarkable, and our way to compute some operations can be very different between the two platforms. Anyhow, with a little bit of practicing, a lot of what can be done in pandas can be archived with the use of Kotlin DataFrame.