Data handling with DataFrame#

Kotlin DataFrame was largely inspired by pandas’ data structures, but it has its own characteristics that make data analysis understandable and concise, and, more importantly, type-safe. Most of its readability comes from functional chains of transformations (the data pipeline) and from a DSL that brings its syntax closer to natural language.

In this chapter we will cover the basics to work with Kotlin DataFrame’s data structures and its basic operations.

Overview#

Let’s see what a simple workflow can look like using Kotlin DataFrame.

DataFrame is built into the Kotlin Jupyter kernel, so the magic command %use will load the needed packages.

%use dataframe

To create a dataframe from scratch, we can follow several approaches; the most used are:

  1. Creating Column objects to store the data, and assembling them into a DataFrame

val names by columnOf("foo", "bar", "baz")
val numbers by columnOf(1, 2, 3)

val df = dataFrameOf(names, numbers)
df
names  numbers
foo    1
bar    2
baz    3

DataFrame: rowsCount = 3, columnsCount = 2

  2. Providing a series of Pair<String, Any> or a Map<String, Any>

val df = dataFrameOf(
    "names" to listOf("foo", "bar", "baz"),
    "numbers" to listOf(1, 2, 3)
)

df
names  numbers
foo    1
bar    2
baz    3

DataFrame: rowsCount = 3, columnsCount = 2

  3. Providing an Iterable<Column>

val values = (1..5).map { List(10) { x -> (x - it) * 10 }.toColumn("$it") }

dataFrameOf(values)
1    2    3    4    5
-10  -20  -30  -40  -50
0    -10  -20  -30  -40
10   0    -10  -20  -30
20   10   0    -10  -20
30   20   10   0    -10
40   30   20   10   0
50   40   30   20   10
60   50   40   30   20
70   60   50   40   30
80   70   60   50   40

DataFrame: rowsCount = 10, columnsCount = 5

We can load a dataset and compute some statistics:

val df = DataFrame.readCSV(
            "../resources/example-datasets/datasets/stock_px.csv",
            header = listOf("date", "AAPL", "MSFT", "XOM", "SPX"),
            skipLines = 1
        )
df.head(5)
date              AAPL  MSFT   XOM    SPX
2003-01-02T00:00  7.40  21.11  29.22  909.03
2003-01-03T00:00  7.45  21.14  29.24  908.59
2003-01-06T00:00  7.45  21.52  29.96  929.01
2003-01-07T00:00  7.43  21.93  28.95  922.93
2003-01-08T00:00  7.28  21.31  28.83  909.93

DataFrame: rowsCount = 5, columnsCount = 5

df.groupBy { date.map { it.month } }.mean() // mean for each month, for each stock
date       AAPL        MSFT       XOM        SPX
JANUARY    111.954696  24.842265  57.842873  1173.562486
FEBRUARY   109.541628  23.511686  58.943140  1164.714244
MARCH      114.255829  23.015176  58.378945  1156.397487
APRIL      120.849140  23.710699  60.144301  1186.491237
MAY        126.349468  23.413245  59.813404  1203.199202
JUNE       129.076753  23.191959  59.641598  1191.002268
JULY       133.117884  23.631640  59.988095  1185.564656
AUGUST     137.710754  23.700302  59.219497  1182.175678
SEPTEMBER  142.788387  24.046344  59.703387  1188.879570
OCTOBER    130.348441  24.271774  59.754677  1182.235806
NOVEMBER   122.318712  25.068650  60.125583  1190.728466
DECEMBER   125.574444  25.307193  61.365731  1202.463216

DataFrame: rowsCount = 12, columnsCount = 5

val yearMeans = df.groupBy { date.map { it.year} }.mean() // compute the means for each year
    // mapping each stock and its value into two separate columns
    .gather { AAPL and MSFT and XOM and SPX}.into("stock", "value") 
    .rename { date }.into("year")
yearMeans.head(10)
year  stock  value
2003  AAPL   9.272619
2003  MSFT   20.595119
2003  XOM    30.211111
2003  SPX    965.227540
2004  AAPL   17.763889
2004  MSFT   21.850437
2004  XOM    38.875437
2004  SPX    1130.649444
2005  AAPL   46.675952
2005  MSFT   23.072421

DataFrame: rowsCount = 10, columnsCount = 3

We can then plot those statistics with the lets-plot library (we will explore plotting in chapter five):

%use lets-plot
val p1 = ggplot(yearMeans.toMap()) { x="year" ; y="value" ; color="stock" } +
    geomLine(stat = Stat.identity, position = positionDodge(0.5), alpha = 0.7) +
    geomPoint(size=3.0, shape = 5) +
    scaleYLog10()
    
val p2 = ggplot(yearMeans.toMap()) { x="stock" ; y="value" } +
    geomBoxplot() { fill = "stock"} +
    scaleYLog10()
    

gggrid(listOf(p1, p2))
[Two plots: (left) yearly mean value per stock as dodged lines with points, log-scaled y-axis; (right) boxplot of values per stock, log-scaled y-axis]
ggplot(yearMeans.filter { stock == "AAPL"}.toMap() )+
    geomLine(stat = Stat.identity) { x="year" ; y="value" } +
    geomPoint(color="red", size=3.5, shape = 2) { x="year" ; y="value" } +
    ylab("Value ($)") + 
    xlab("Year") +
    ggtitle("Apple Inc. Stock Price (2003-2011)")
    
[Plot: line chart titled “Apple Inc. Stock Price (2003-2011)”, Value ($) vs Year]

DataFrame Architecture#

Working inside a Jupyter Notebook#

The strength of Kotlin DataFrame is its ability to reconcile the dynamic nature of data with Kotlin’s strong typing, resulting in a type-safe library for working with data. In contrast to pandas, when we perform operations on a dataframe, we know the types of its columns, and of the results, at compile time.

This is true when working in Jupyter Notebooks, because every time a dataframe is loaded and its cell executed, a new DataSchema is created for that dataframe. The DataSchema defines and manages the metadata of a DataFrame, including column names, types, and nullability flags. It is used to ensure that the data in a DataFrame is consistent and can be processed correctly.

So if we create a dataframe with one column of names like:

val names by columnOf("foo", "bar", "baz")
val df = dataFrameOf(names)

the following code is executed implicitly (you can see this output using the magic command %trackExecution):

Executing:
@DataSchema interface _DataFrameType1

val ColumnsContainer<_DataFrameType1>.names: DataColumn<String> @JvmName("_DataFrameType1_names") get() = this["names"] as DataColumn<String>
val DataRow<_DataFrameType1>.names: String @JvmName("_DataFrameType1_names") get() = this["names"] as String
val ColumnsContainer<_DataFrameType1?>.names: DataColumn<String?> @JvmName("Nullable_DataFrameType1_names") get() = this["names"] as DataColumn<String?>
val DataRow<_DataFrameType1?>.names: String? @JvmName("Nullable_DataFrameType1_names") get() = this["names"] as String?

df.cast<_DataFrameType1>()

A custom type for the dataframe is created and used to define the column containers. For each of them, the extension properties above are generated. Note that this chain of operations runs even when reading a dataset from a file (a full explanation of how the data schema is dynamically created can be found here).

With this mechanism, each time we execute a cell the new column types are defined as above, so that in subsequent cells we know the data types of the columns at compile time.

Working inside an IDE#

When working inside an IDE, we can ensure type safety in two ways:

  • Defining a custom DataSchema and using it inside a Gradle project (see the documentation for more).

  • Defining column objects.

In both cases, the extension properties API can be used, providing strong type checking at compile time (we will discuss the various column accessor APIs later).
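As a sketch of the first approach (the interface and field names here are illustrative, not from this chapter’s datasets), a custom DataSchema is just an annotated interface; the Gradle plugin then generates typed extension properties for it:

```kotlin
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema

// Illustrative schema: the column names and types we expect in the data
@DataSchema
interface Person {
    val name: String
    val age: Int
}

// After casting a dataframe to the schema, the generated extension
// properties give compile-time checks, e.g.:
// DataFrame.readCSV("people.csv").cast<Person>().filter { age > 18 }
```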

DataFrame Data Structures#

The DataFrame library defines the following data abstractions:

  • DataColumn: is a named, typed and ordered collection of elements.

  • DataFrame: consists of one or several DataColumns with unique names and equal size.

  • DataRow is a single row of a DataFrame and provides a single value for every DataColumn.

Because we are dealing with structured data, Dataframe provides hierarchical data structures using two special types of columns:

  • ColumnGroup is a group of columns

  • FrameColumn is a column of dataframes

This makes it easy to build tree structures in the data (very handy when reading JSON files). We can view ColumnGroup and FrameColumn as analogous to pandas MultiIndex objects: both express a hierarchical structure of the data.

By nature, data frames are dynamic objects: column labels depend on the input source, and new columns can be added or deleted while wrangling. Kotlin, in contrast, is a statically typed language in which all types are defined and verified ahead of execution.

For this reason, the Kotlin DataFrame library provides four different ways to access columns:

  • String API

  • Columns Accessors API

  • KProperties API

  • Extension Properties API

For detailed usage, refer to the official documentation of column accessors.

The String API is the simplest and the least safe of them all. Its main advantage is that it can be used at any time, including when accessing new columns in chained calls, so that this call can be made:

df.add("age") { ... }
    .sum("age")

If you’re not working in a Jupyter Notebook, the Column Accessors API provides type-safe access to columns, but it does not ensure that the columns actually exist in a particular dataframe. Similarly, when working in an IDE, the KProperties API is useful when you’ve already declared classes in your application’s business logic with fields that correspond to columns of a DataFrame.

Otherwise, if you’re working inside a notebook, you can use the Extension Properties API, which is the safest and most convenient to use, at the cost of some execution speed when the properties are generated.
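To make the trade-offs concrete, here is the same filter written with three of the APIs (a sketch: the dataframe and its "age" column are invented for the example, and the extension-property form compiles only where those properties have been generated):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val df = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 20,
)

// String API: column referenced by name, no compile-time checks
val viaString = df.filter { "age"<Int>() >= 18 }

// Column Accessors API: typed, but existence in df is not verified
val age by column<Int>()
val viaAccessor = df.filter { age() >= 18 }

// Extension Properties API (notebook / Gradle plugin):
// df.filter { age >= 18 }
```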

DataColumn#

A DataColumn object is the equivalent of a pandas Series object; both represent a one-dimensional array of data with a specific data type. Every DataColumn has a unique name, a type, and its data mapped into rows.

As pointed out above, we can create a column object with the by keyword

val col by columnOf("a", "b", "c")

Following this approach, the name of the column is the name we gave to the variable, and we can use the column accessor API for better type safety.

Similarly, we can explicitly bind the column to a KType, and the result will be a ColumnAccessor:

val col by column<Double>("values")

A ColumnAccessor can be converted to a DataColumn using the withValues function:

val age by column<Int>()
val ageCol1 = age.withValues(15, 20)
val ageCol2 = age.withValues(10..20)

List and Set from the standard library have an extension function that converts the collection to a column with the provided name.

val col = List(5) { it * 2 }.toColumn("values")

A column can be of three types:

  • ValueColumn: stores primitive data (currently, the underlying structure is a List)

  • ColumnGroup: stores nested columns

    • For referencing a nested column we can use

val name by columnGroup()
val firstName by name.column<String>()
  • FrameColumn: stores a nested DataFrame
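For instance, a ColumnGroup can be assembled from existing columns with columnOf (a sketch with made-up data):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val firstName by columnOf("John", "Jane")
val lastName by columnOf("Doe", "Smith")

// columnOf applied to columns produces a ColumnGroup named after the variable
val fullName by columnOf(firstName, lastName)
val age by columnOf(30, 25)

val people = dataFrameOf(fullName, age)
// The nested column is reachable as a path:
// people.getColumnGroup("fullName")["firstName"]
```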

DataFrame#

A DataFrame represents a list of DataColumns. Columns in a DataFrame must have equal sizes and distinct names.

The simplest way to create a DataFrame is using the function dataFrameOf

val df = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 20,
    "Charlie", 25
)
df
name     age
Alice    15
Bob      20
Charlie  25

DataFrame: rowsCount = 3, columnsCount = 2

For all the methods for building a dataframe, see the official documentation

Unlike pandas, Kotlin DataFrame does not implement an explicit Index object, so operations that are natural on a pandas.DataFrame can look very different on a Kotlin DataFrame object. The nature of the dataframe is quite different between the two libraries, and each approach has its pros and cons.

In the following sections we will see most of the operations that can be performed on a DataFrame.

Operations Overview#

As mentioned before, data transformation pipelines are designed in a functional style, so the whole processing can be represented as a sequential chain of operations. DataFrames are immutable, and every operation returns a new instance, reusing the underlying data structures as much as possible.
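A small sketch of this immutability (the dataframe here is invented for illustration):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val df = dataFrameOf("x")(1, 2, 3)

// filter returns a new DataFrame; the original is left untouched
val filtered = df.filter { "x"<Int>() > 1 }
// df still has 3 rows, filtered has 2
```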

Operations can be divided in three categories:

  • General Operations: all basic operations that can be called on a dataframe (e.g. schema(), sum(), move(), map(), filter(), …)

  • Multiplex Operations: more complex operations that do not return a new DataFrame immediately; instead, they provide an intermediate object used for further configuration. Every multiplex operation follows this schema:

    1. Use a column selector to select target columns

    2. Additional configuration functions

    3. Terminal function that returns the modified DataFrame

    • Most of these operations end with into or with, and the following convention is used:

      • into defines column names for storing the result.

      • with defines the row-wise data transformation.

  • Shortcut Operations: shortcuts for more general operations (e.g. rename is a special case of move, fillNA a special case of update, …)
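The three-step multiplex schema can be sketched with move, grouping two columns under a new ColumnGroup (the column names here are invented for the example):

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val df = dataFrameOf("city", "street", "zip")(
    "Rome", "Via Nazionale", "00184",
)

// 1) select target columns; 2) no extra configuration here;
// 3) terminal function `under` returns the modified DataFrame
val nested = df.move { "street" and "zip" }.under("address")
// nested now has two top-level columns: city, address { street, zip }
```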

Essential Functionalities#

This section will walk you through the fundamental mechanics of interacting with the data contained in DataFrames.

Indexing, Selection and Filtering#

val obj = dataFrameOf(
    "letters" to listOf("a", "b", "c", "d"),
    "nums" to listOf(0.0, 1.0, 2.0, 3.0)
)
obj
letters  nums
a        0.0
b        1.0
c        2.0
d        3.0

DataFrame: rowsCount = 4, columnsCount = 2

We can access a row by index:

obj[1]
letters  nums
b        1.0

DataRow: index = 1, columnsCount = 2

And we can access a column by name:

obj["nums"]
nums
0.0
1.0
2.0
3.0

DataColumn: name = "nums", type = Double, size = 4

We can select multiple columns with the usual notation:

obj["nums", "letters"]
nums  letters
0.0   a
1.0   b
2.0   c
3.0   d

DataFrame: rowsCount = 4, columnsCount = 2

We can also select ranges (note that the upper bound of the range is included):

obj[0..2] // obj[0 until 3]
letters  nums
a        0.0
b        1.0
c        2.0

DataFrame: rowsCount = 3, columnsCount = 2

The [...] operator calls the get() method, so:

obj[0] == obj.get(0) // true
obj["nums"] == obj.getColumn(1) // true
obj.getColumn(1) == obj.getColumn("nums") // true

Unlike Python, Kotlin does not provide filtering inside square brackets, but it offers the filter method, which can be more readable:

obj.filter { nums.toInt() % 2 == 0 }
letters  nums
a        0.0
c        2.0

DataFrame: rowsCount = 2, columnsCount = 2

instead of Python’s:

obj[(obj["nums"] % 2 == 0)]

Arithmetic Operations#

Unlike pandas.DataFrame, arithmetic operations between dataframes are not defined by default in Kotlin DataFrame.

We can still compute row-wise operations with the update() method:

obj.update { nums }.with { it * 100 }
letters  nums
a        0.0
b        100.0
c        200.0
d        300.0

DataFrame: rowsCount = 4, columnsCount = 2

But the following code will not compile:

// obj + obj
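If element-wise arithmetic between two columns is needed, one workaround (a sketch, not the only option) is to drop down to plain Kotlin collections and rebuild a column:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val a by columnOf(1.0, 2.0, 3.0)
val b by columnOf(10.0, 20.0, 30.0)

// zip the raw values, then wrap the result back into a named column
val sum = a.toList().zip(b.toList()) { x, y -> x + y }.toColumn("sum")
// sum holds 11.0, 22.0, 33.0
```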

Function Application and Mapping#

Much like NumPy’s element-wise ufuncs, every Kotlin collection (and therefore every DataColumn of a DataFrame) provides the map method. On a dataframe, we can apply a function to each column with the map operator.

Most of the time, when we want to compute row-wise operations, the update() and convert() methods suit our needs perfectly.

val frame = dataFrameOf("b", "d", "e").randomDouble(3)
    .update { colsOf<Double>() }.with { it * 7 }
frame
b         d         e
5.063779  5.833231  5.659028
2.594955  2.103203  5.466144
6.243192  1.803812  4.375719

DataFrame: rowsCount = 3, columnsCount = 3

We can now apply a function with map on one, some, or all columns and get the result as a List. Additionally, it’s possible to store the result of the function application in a new column using mapToColumn(). Lastly, mapToFrame() maps the function application along multiple columns into a new dataframe.

frame.map { Math.ceil(it.b) }
[6.0, 3.0, 7.0]
frame.mapToColumn("ceiling_b") { Math.ceil(it.b) }
ceiling_b
6.0
3.0
7.0

DataColumn: name = "ceiling_b", type = Double, size = 3

frame.mapToFrame { 
    "new_b" from b ;
    d gt 1.0 into "b_gt_1"
}
new_b     b_gt_1
5.063779  true
2.594955  true
6.243192  true

DataFrame: rowsCount = 3, columnsCount = 2

Conclusions#

In this chapter, we introduced Kotlin DataFrame, a library for working with tabular data with all the advantages that the Kotlin language provides. Coming from Python’s pandas, its behavior is very similar and intuitive, but the major differences are:

  • DataFrame ensures type safety at compile time (especially when working with Jupyter Notebook).

  • The lack of an Index object to manipulate (e.g. a DateTimeIndex for indexing).

  • pandas’ use of NumPy ensures optimized arithmetic operations (through vectorization), whereas Kotlin DataFrame does not use an optimized library for representing vectors and matrices, only the Kotlin standard collections.

While pandas offers a more comprehensive set of features and is a more mature library, Kotlin DataFrame offers a useful subset of its capabilities. In the following chapters, we will explore further the capabilities of Kotlin DataFrame and demonstrate its use in various data analysis tasks, from data cleaning to grouping strategies. Overall, Kotlin DataFrame is a valuable addition to the toolkit of any data analyst or scientist working in Kotlin.