Data Wrangling with Kotlin#

In this chapter we will explore some tools and strategies that Kotlin DataFrame offers for all the tasks involved in data wrangling, alongside their python’s counterparts.

As always, we will import Kotlin DataFrame with the magic command:

%use dataframe

Before digging into data wrangling techniques that DataFrame offers, there is one Column type that has not been covered extensively yet: ColumnGroup.

ColumnGroup and FrameColumn#

They are a special kind of columns that contains a series of column (in ColumnGroups) or a DataFrame.

The power of those structures is the ability to store and organize data in a hierarchical way. This is essential when dealing with JSON serialization and deserialization.

Dealing with “nested” objects can also occur very often when using grouping and pivoting operations (discussed in next chapter), and a minimum comprehension is required before dealing with those operations.

Let’s consider a Dataframe of people with the following informations:

val name by columnOf(
    "Woody Allen",
    "Bob Dylan",
    "Charlie Chaplin",
    "John Coltrane",
    "Bob Marley",
    "Linus Torvalds",
    "Charlie Parker",
)
val age by columnOf(15, 45, 20, 30, 15, 22, 57)
val city by columnOf(
    "Rome",
    "Moscow",
    "Tirana",
    "Sarajevo",
    "Cesena",
    null,
    "Kyoto",
)

val weight by columnOf(55, 70, null, 80, null, null, 90)
val isDied by columnOf(false, false, true, true, true, false, true)

val people = dataFrameOf(name, age, city, weight, isDied)
people

DataFrame: rowsCount = 7, columnsCount = 5

Creating a group of columns is pretty straightforward:

people.group { age and city }.into("group")

DataFrame: rowsCount = 7, columnsCount = 4

We can also create a nested column, for example, splitting the name in a firstName and a lastName column:

val groupedDf = people.split { name }.by(' ').inward("firstName", "lastName")
groupedDf

DataFrame: rowsCount = 7, columnsCount = 5

Using the inward() method splits the columns into the provided column names, nesting the inside the original column, creating a ColumnGroup.

groupedDf.name.javaClass
class org.jetbrains.kotlinx.dataframe.impl.columns.ColumnGroupImpl

We can always access the fields of the ColumnGroup with the . notation

groupedDf.name.firstName

DataColumn: name = "firstName", type = String, size = 7

As said above, most of the time we will have to deal with these nested structures when using pivot or groupBy methods. We can, for example, pivot the table to create columns that contains a DataFrame: FrameColumns

groupedDf.pivot{ name.firstName }

Pivot

As the prompt above, the dataframe suggests us that this is a Pivot object, and it should be a temporary object before applying an aggregate function or other manipulations. We will cover pivot and groupBy extensively in the chapter 7.

These nested structures can resemble to a pandas.MultiIndex: they both express the concept of organizing data in a hierarchical way.

DataFrame multilevel structures differs from pandas because they do not have an explicit concept of Index, and operations like pandas.dataframe.stack()/unstack() would make no sense. In some ways that result can be accomplished with some trickery, but DataFrame’s ColumnGroup or FrameColumn are not intended to substitute pandas.MultiIndex, even if they’re goal is very similar.

Working with Multiple DataFrames#

DataFrame provides three methods for operating with multiple DataFrames:

  • add: adds new columns to the DataFrame.

  • concat: returns the union of the provided DataFrames.

  • join: SQL-like join of two DataFrames by key columns.

we already have seen an application of the add method, but it is possible to add multiple columns all at once:

groupedDf
    .convert { weight }.toDouble()
    .dropNA { weight }
    .add {
    "year of birth" from 2023 - age
    age gt 18 into "is adult"
    "details" {
         "weight"<Double>() / 6.35 into "weight (approx. stones)"
        "full name" from { name.firstName + " " + name.lastName }
    }
}

DataFrame: rowsCount = 4, columnsCount = 8

When applying concat, it concatenates the rows of the provided DataFrames or DataColumns.

val df1 = dataFrameOf("a", "b", "c").fill(5) { it }
val df2 = dataFrameOf("a", "b", "c").fill(2) { it  - 10 }

df1.concat(df2)

DataFrame: rowsCount = 7, columnsCount = 3

When concatenating dataframes with different column keys, the result is like a full join in the database world, where non matching values of the two collections are filled with null.

val df1 = dataFrameOf("a", "b", "c").fill(10) { it }
val df2 = dataFrameOf("a", "c", "d").fill(2) { it - 10 }

df1.concat(df2)

DataFrame: rowsCount = 12, columnsCount = 4

We can also use the concat method providing a List object

listOf(df1, df2).concat()

DataFrame: rowsCount = 12, columnsCount = 4

The concat method is similar to pandas.concat method, with the difference that in pandas you can specify which axis to merge, having the Index object that can provide a merging key when choosing axis=0. On the other hand, When using axis=1, merging two pandas DataFrames will produce a similar result to what Kotlin DataFrame provides.

If we want to use sql like join operations, we can use the join method provided by Kotlin DataFrame.

join’s method signature is the following:

join(otherDf, type = JoinType.Inner) [ { joinColumns } ]

Having the join columns as optional, and the default join is set to Inner (only matched columns from left and right DataFrames).

val df1 = dataFrameOf("a", "b", "c").fill(10) { it }
val df2 = dataFrameOf("a", "c", "d").fill(2) { it }
df1.join(df2)

DataFrame: rowsCount = 2, columnsCount = 4

We can specify which column to match with the match DSL keyword

df1.join(df2, type = JoinType.Full) { a match right.a }

DataFrame: rowsCount = 10, columnsCount = 5

And you can sport that any column that matched during the join, but not included in the joinColumn clause, are duplicated with a new column key.

The match keyword is used in all those cases where we can apply the join because of matching row values, but the columns keys differs by name.

Consider the next example:

people

DataFrame: rowsCount = 7, columnsCount = 5

and let’s suppose we have a new dataset with new data that can be joined with the previous one

val newPeopleDf = people.head(2) // pick just the first two
        .rename("name").into("fullName") // renameing join column
        .add("stonesWeight") { // add new dummy column
            weight!! / 6.35
        }
        .select("fullName", "stonesWeight")
        
newPeopleDf

DataFrame: rowsCount = 2, columnsCount = 2

We can use the match keyword for specifying which columns to use for the join operation:

people.join(newPeopleDf, type = JoinType.Left) { name match right.fullName }

DataFrame: rowsCount = 7, columnsCount = 6

There are handy shortcuts for specifying which type of join we want to perform for each kind of join. The previous code can be rewritten to:

people.leftJoin(newPeopleDf) { name match right.fullName }

DataFrame: rowsCount = 7, columnsCount = 6

See the full reference for supported types of join.

Reshaping and Pivoting#

Reshaping and Pivoting a datasets is a very common operation that is being made during Data Analysis, and Kotlin DataFrame provides a series of methods that can help the developer in the creation of different views of the same DataFrame.

The most common operations that are used when pivoting and reshaping a dataset, are pivot and groupBy, and very often they’re used chained together for distributing data along rows or columns.

We will run all the examples with the macrodata dataset

val df = DataFrame.readCSV("../resources/example-datasets/datasets/macrodata.csv")
df.head(3)

DataFrame: rowsCount = 3, columnsCount = 14

Consider the following example:

val longFormat = df.groupBy { year and quarter }
    .values { realgdp and infl and unemp }
    .gather { realgdp and infl and unemp }.into("item", "value")
    .explode("value")

longFormat.head(10)

DataFrame: rowsCount = 10, columnsCount = 4

In Kotlin DataFrame, groupBy operation is not only used for grouping and aggregating data, but it can be very useful for rearranging data. groupBy takes a list of columns to group by, and produces a DataFrame where each group key is placed in a distinct row, with its associated group (a FrameColumn).

If we run the code above row by row, we can see how the result DataFrame has been formed:

df.head(4) // using head just to limit the output
    .groupBy { year and quarter }

GroupBy

On the other hand, if we call pivot, all the provided columns will be the column keys of the resulting DataFrame, creating another group of columns.

So for example, if we want to pick only all the data from 1995 to 2000, and we want to display one column for each year, we can use the pivot method.

df.filter { year >= 1995 && year <= 2000 }.pivot { year }

Pivot

After a pivot or groupBy operation, we can use the method values for selecting only some columns of the group.

df.head(4)
    .groupBy { year and quarter }
    .values { realgdp and infl and unemp }

DataFrame: rowsCount = 4, columnsCount = 5

Now, if we want to display data in the so called long format, with each row containing the year, quarter, item name and value, we have to make item’s columns to be mapped as rows.

With the help of the gather we can map a set of columns to two columns: “key” containing names of the original columns and “value” containing values of the original columns.

In a certain way, gather is the opposite of pivot, that splits rows of a dataframe and groups them horizontally into new columns.

If we apply gather, the result will be:

df.head(4)
    .groupBy { year and quarter }
    .values { realgdp and infl and unemp }
    .gather { realgdp and infl and unemp }
    .into("item", "value")

DataFrame: rowsCount = 12, columnsCount = 4

Lastly, the square brackets suggest us that the value column contains a series of list, and we can flatten it with the explode method.

In contrast to Kotlin DataFrame, pandas has more “ad-hoc” methods for both pivoting and reshaping datasets. Having the Index object, every DataFrame has the ability to easily reindex itself, or using the pivot operation to swap the order of both index and columns, specifying which will be the columns of values. For example, in pandas the pipeline we created above could be translated as follows:

df.pivot_table(index=['year', 'quarter'], values=['realgdp', 'infl', 'unemp']) \
    .stack() \
    .reset_index() \
    .rename(columns={ "level_2": "item", 0: "value"})

Where the pivot_table method is used to rearrange Index objects and value columns (it’s a more generalized version of pivot method). In pandas, stack and unstack operations are very useful when collapsing several columns in one (like DataFrame’s gather), or when distributing data contained in one column across several (like DataFrame’s pivot). These operations, combined with the indices manipulations techniques (like reset/set_index, reindex (works also with columns)), makes pandas more powerful and more precise when it comes to data wrangling. Moreover, pandas Index object can be of multiple types, for example CategoricalIndex, DatetimeIndex, PeriodIndex, MultiIndex. In this example, a PeriodIndex would have fit the data perfectly, because we can create a range of dates from a year and a quarter: pd.PeriodIndex(year=df['year'], quarter=df['quarter'], name='date').

We now understand that the biggest difference between pandas and Kotlin DataFrame is the presence of an explicit Index Object that let us perform reshaping of the dataframe in a more precise way.

Note that the operation after stack are used only to recreate the example, the Series created with stack is perfectly usable as is.

Conclusions#

In the chapter we have explored how Kotlin DataFrame has a full support for data wrangling tasks. The differences with pandas are sometimes remarkable, and our way to compute some operations can be very different between the two platforms. Anyhow, with a little bit of practicing, a lot of what can be done in pandas can be archived with the use of Kotlin DataFrame.