Data handling with DataFrame#
Kotlin DataFrame was largely inspired by pandas structures, but it has its own characteristics that make data analysis understandable and concise and, more importantly, type safe. Most of its readability comes from functional chains of transformations (the data pipeline) and from a DSL that brings its syntax closer to natural language.
In this chapter we will cover the basics to work with Kotlin DataFrame’s data structures and its basic operations.
Overview#
Let’s see what a simple workflow can look like using Kotlin DataFrame.
DataFrame is built into the Kotlin Jupyter kernel, so the magic command %use
will load the needed packages.
%use dataframe
To create a dataframe from scratch, we can follow several approaches; the most common are:
Creating Column objects for storing data, and assigning the columns to a DataFrame:
val names by columnOf("foo", "bar", "baz")
val numbers by columnOf(1, 2, 3)
val df = dataFrameOf(names, numbers)
df
names | numbers |
---|---|
foo | 1 |
bar | 2 |
baz | 3 |
DataFrame: rowsCount = 3, columnsCount = 2
Providing a series of Pair<String, Any> or a Map<String, Any>:
val df = dataFrameOf(
"names" to listOf("foo", "bar", "baz"),
"numbers" to listOf(1, 2, 3)
)
df
names | numbers |
---|---|
foo | 1 |
bar | 2 |
baz | 3 |
DataFrame: rowsCount = 3, columnsCount = 2
Providing an Iterable<Column>:
val values = (1..5).map { List(10) { x -> (x - it) * 10 }.toColumn("$it") }
dataFrameOf(values)
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
-10 | -20 | -30 | -40 | -50 |
0 | -10 | -20 | -30 | -40 |
10 | 0 | -10 | -20 | -30 |
20 | 10 | 0 | -10 | -20 |
30 | 20 | 10 | 0 | -10 |
40 | 30 | 20 | 10 | 0 |
50 | 40 | 30 | 20 | 10 |
60 | 50 | 40 | 30 | 20 |
70 | 60 | 50 | 40 | 30 |
80 | 70 | 60 | 50 | 40 |
DataFrame: rowsCount = 10, columnsCount = 5
We can load a dataset and compute some statistics:
val df = DataFrame.readCSV(
"../resources/example-datasets/datasets/stock_px.csv",
header = listOf("date", "AAPL", "MSFT", "XOM", "SPX"),
skipLines = 1
)
df.head(5)
date | AAPL | MSFT | XOM | SPX |
---|---|---|---|---|
2003-01-02T00:00 | 7.40 | 21.11 | 29.22 | 909.03 |
2003-01-03T00:00 | 7.45 | 21.14 | 29.24 | 908.59 |
2003-01-06T00:00 | 7.45 | 21.52 | 29.96 | 929.01 |
2003-01-07T00:00 | 7.43 | 21.93 | 28.95 | 922.93 |
2003-01-08T00:00 | 7.28 | 21.31 | 28.83 | 909.93 |
DataFrame: rowsCount = 5, columnsCount = 5
df.groupBy { date.map { it.month } }.mean() // mean for each month, for each stock
date | AAPL | MSFT | XOM | SPX |
---|---|---|---|---|
JANUARY | 111.954696 | 24.842265 | 57.842873 | 1173.562486 |
FEBRUARY | 109.541628 | 23.511686 | 58.943140 | 1164.714244 |
MARCH | 114.255829 | 23.015176 | 58.378945 | 1156.397487 |
APRIL | 120.849140 | 23.710699 | 60.144301 | 1186.491237 |
MAY | 126.349468 | 23.413245 | 59.813404 | 1203.199202 |
JUNE | 129.076753 | 23.191959 | 59.641598 | 1191.002268 |
JULY | 133.117884 | 23.631640 | 59.988095 | 1185.564656 |
AUGUST | 137.710754 | 23.700302 | 59.219497 | 1182.175678 |
SEPTEMBER | 142.788387 | 24.046344 | 59.703387 | 1188.879570 |
OCTOBER | 130.348441 | 24.271774 | 59.754677 | 1182.235806 |
NOVEMBER | 122.318712 | 25.068650 | 60.125583 | 1190.728466 |
DECEMBER | 125.574444 | 25.307193 | 61.365731 | 1202.463216 |
DataFrame: rowsCount = 12, columnsCount = 5
val yearMeans = df.groupBy { date.map { it.year} }.mean() // compute the means for each year
// mapping each stock and its value into two separate columns
.gather { AAPL and MSFT and XOM and SPX}.into("stock", "value")
.rename { date }.into("year")
yearMeans.head(10)
year | stock | value |
---|---|---|
2003 | AAPL | 9.272619 |
2003 | MSFT | 20.595119 |
2003 | XOM | 30.211111 |
2003 | SPX | 965.227540 |
2004 | AAPL | 17.763889 |
2004 | MSFT | 21.850437 |
2004 | XOM | 38.875437 |
2004 | SPX | 1130.649444 |
2005 | AAPL | 46.675952 |
2005 | MSFT | 23.072421 |
DataFrame: rowsCount = 10, columnsCount = 3
We can then plot those statistics with the lets-plot library (we will explore plotting later, in chapter five):
%use lets-plot
val p1 = ggplot(yearMeans.toMap()) { x="year" ; y="value" ; color="stock" } +
geomLine(stat = Stat.identity, position = positionDodge(0.5), alpha = 0.7) +
geomPoint(size=3.0, shape = 5) +
scaleYLog10()
val p2 = ggplot(yearMeans.toMap()) { x="stock" ; y="value" } +
geomBoxplot() { fill = "stock"} +
scaleYLog10()
gggrid(listOf(p1, p2))
ggplot(yearMeans.filter { stock == "AAPL"}.toMap() )+
geomLine(stat = Stat.identity) { x="year" ; y="value" } +
geomPoint(color="red", size=3.5, shape = 2) { x="year" ; y="value" } +
ylab("Value ($)") +
xlab("Year") +
ggtitle("Apple Inc. Stock Price (2003-2011)")
DataFrame Architecture#
Working inside a Jupyter Notebook#
The strength of Kotlin DataFrame is its ability to reconcile the dynamic nature of data with Kotlin’s strong typing, resulting in a type-safe library for working with data. In contrast to pandas, when we perform operations on a dataframe we know the types of its columns, and of their results, at compile time.
This is true when working in Jupyter notebooks, because every time a dataframe is loaded and its cell is executed, a new DataSchema is created for that dataframe. The DataSchema defines and manages the metadata of a DataFrame, including column names, types, and nullability flags, and it ensures that the data in a DataFrame is consistent and can be processed correctly.
So if we were to create a dataframe with one column of names, like:
val names by columnOf("foo", "bar", "baz")
val df = dataFrameOf(names)
The following code is executed implicitly (you can see this output using the magic command %trackExecution).
Executing:
@DataSchema interface _DataFrameType1
val ColumnsContainer<_DataFrameType1>.names: DataColumn<String> @JvmName("_DataFrameType1_names") get() = this["names"] as DataColumn<String>
val DataRow<_DataFrameType1>.names: String @JvmName("_DataFrameType1_names") get() = this["names"] as String
val ColumnsContainer<_DataFrameType1?>.names: DataColumn<String?> @JvmName("Nullable_DataFrameType1_names") get() = this["names"] as DataColumn<String?>
val DataRow<_DataFrameType1?>.names: String? @JvmName("Nullable_DataFrameType1_names") get() = this["names"] as String?
df.cast<_DataFrameType1>()
A custom type for the dataframe is created and used to define the column containers; for each of them, the extension properties above are generated. Note that this chain of operations happens even when reading a dataset from a file (a full explanation of how the data schema is dynamically created can be found here).
Using this method, each time we execute a cell the new column types are defined as above, so that in the following cells the data types of the columns are known at compile time.
Working inside an IDE#
When working inside an IDE, we can ensure type safety in two ways:
Defining a custom DataSchema and using it inside a Gradle project (see the documentation for more).
Defining column objects.
In both cases, the extension properties API can be used, providing strong type checking at compile time (we will discuss the various column accessor APIs later).
DataFrame Data Structures#
The DataFrame library defines the following data abstractions:
DataColumn: a named, typed and ordered collection of elements.
DataFrame: one or several DataColumns with unique names and equal size.
DataRow: a single row of a DataFrame, providing a single value for every DataColumn.
Because we are dealing with structured data, DataFrame provides hierarchical data structures through two special types of columns:
ColumnGroup: a group of columns.
FrameColumn: a column of dataframes.
This makes it easy to build tree structures in the data (very handy when reading JSON files). We can think of ColumnGroup and FrameColumn as analogous to pandas MultiIndex objects: both express a hierarchical structure of the data.
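To make the idea concrete, here is a minimal standard-library sketch (toy names, not the real DataFrame API) of how a column group nests named columns inside a parent table:

```kotlin
// Toy model: a "column group" is just a named bundle of columns,
// and a table can mix plain columns with grouped ones.
val nameGroup = mapOf(                       // plays the role of a ColumnGroup
    "firstName" to listOf("Ada", "Alan"),
    "lastName" to listOf("Lovelace", "Turing")
)
val age = listOf(36, 41)                     // a plain value column

val table = mapOf("name" to nameGroup, "age" to age)

// nested access: first the group, then the inner column
println(nameGroup["firstName"])  // [Ada, Alan]
```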
By nature, data frames are dynamic objects: column labels depend on the input source, and new columns may be added or deleted while wrangling. Kotlin, in contrast, is a statically typed language where all types are defined and verified ahead of execution.
For this reason, the Kotlin DataFrame library provides four different ways to access columns:
String API
Columns Accessors API
KProperties API
Extension Properties API
For detailed usage, refer to the official documentation of column accessors.
The String API is the simplest and the least safe of them all. Its main advantage is that it can be used at any time, including when accessing new columns in call chains, so the following call can be made:
df.add("age") { ... }
.sum("age")
If you’re not working in a Jupyter notebook, the Column Accessors API provides type-safe access to columns, but it does not guarantee that the columns actually exist in a particular dataframe. Similarly, when working in an IDE, the KProperties API is useful when you have already declared classes in your application’s business logic with fields that correspond to the columns of a DataFrame.
Otherwise, if you’re working inside a notebook, you can use the Extension Properties API, which is the safest and most convenient, with the trade-off of the execution time needed to generate the properties.
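The trade-off between string-keyed and typed access can be illustrated with plain Kotlin (a toy sketch, not the DataFrame API itself): string keys erase types and need unchecked casts, while typed accessors are verified by the compiler:

```kotlin
// String-keyed access: the compiler only sees Any?, so we must cast;
// a wrong key or wrong type fails only at runtime
val byString: Map<String, Any?> = mapOf("name" to "Alice", "age" to 15)
val age1 = byString["age"] as Int

// Typed access: the property and its type are checked at compile time
data class Person(val name: String, val age: Int)
val age2 = Person("Alice", 15).age

println(age1 + age2)  // 30
```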
DataColumn#
A DataColumn object is the equivalent of a pandas Series object; both represent a one-dimensional array of data with a specific data type. Every DataColumn object has a name, a unique type, and data mapped into rows.
As pointed out above, we can create a column object with the by keyword:
val col by columnOf("a", "b", "c")
Following this approach, the name of the column is the name we gave to the variable, and we can use the column accessor API for better type safety.
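How does by pick up the variable name? Kotlin’s property delegation passes the property’s metadata to the delegate. A minimal standard-library sketch (toy types, not the real columnOf) shows the mechanism:

```kotlin
import kotlin.reflect.KProperty

// A toy column that captures its name from the variable it is delegated to
class ToyColumn<T>(val values: List<T>) {
    var name: String = ""
        private set
    operator fun getValue(thisRef: Any?, property: KProperty<*>): ToyColumn<T> {
        name = property.name  // the delegate learns the variable's name
        return this
    }
}

fun <T> toyColumnOf(vararg values: T) = ToyColumn(values.toList())

val col by toyColumnOf("a", "b", "c")
println(col.name)    // prints "col"
```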
Similarly, we can explicitly specify the column type as a KType, and the result will be a ColumnAccessor:
val col by column<Double>("values")
A ColumnAccessor can be converted to a DataColumn using the withValues function:
val age by column<Int>()
val ageCol1 = age.withValues(15, 20)
val ageCol2 = age.withValues(10..20)
List and Set from the standard library have an extension function that converts the collection to a column with the provided name:
val col = List(5) { it * 2 }.toColumn("values")
A column can be one of three kinds:
ValueColumn: stores primitive data (currently, the underlying structure is a List).
ColumnGroup: stores nested columns. For referencing a nested column we can use:
val name by columnGroup()
val firstName by name.column<String>()
FrameColumn: stores a nested DataFrame.
DataFrame#
A DataFrame represents a list of DataColumns. Columns in a DataFrame must have equal sizes and unique names.
The simplest way to create a DataFrame is with the function dataFrameOf:
val df = dataFrameOf("name", "age")(
"Alice", 15,
"Bob", 20,
"Charlie", 25
)
df
name | age |
---|---|
Alice | 15 |
Bob | 20 |
Charlie | 25 |
DataFrame: rowsCount = 3, columnsCount = 2
For all the methods for building a dataframe, see the official documentation.
Unlike pandas, Kotlin DataFrame does not implement an explicit Index object, so operations we would perform on a pandas.DataFrame can look very different with a Kotlin DataFrame. The nature of the dataframe is quite different between the two libraries, and each has its pros and cons.
In the following sections we will see most of the operations that can be performed on a DataFrame.
Operations Overview#
As said before, data transformation pipelines are designed in a functional style, so the whole processing can be represented as a sequential chain of operations.
DataFrames are immutable, and every operation returns a copy of the object instance, reusing the underlying data structures as much as possible.
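The same functional, copy-on-transform style exists in Kotlin’s standard collections, which makes for a handy mental model (a plain stdlib sketch, no DataFrame involved):

```kotlin
val original = listOf(3, 1, 4, 1, 5)

// each step returns a new list; `original` is never modified
val processed = original
    .filter { it > 1 }   // [3, 4, 5]
    .map { it * 10 }     // [30, 40, 50]

println(original)   // [3, 1, 4, 1, 5] - unchanged
println(processed)  // [30, 40, 50]
```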
Operations can be divided into three categories:
General Operations: all the basic operations that can be called on a dataframe (e.g. schema(), sum(), move(), map(), filter(), …).
Multiplex Operations: more complex operations that do not return a new DataFrame immediately; instead, they provide an intermediate object that is used for further configuration. Every multiplex operation follows this schema:
1. Use a column selector to select the target columns
2. Additional configuration functions
3. A terminal function that returns the modified DataFrame
Most of these operations end with into or with, and the following convention is used: into defines the column names for storing the result, while with defines a row-wise data transformation.
Shortcut Operations: shortcuts for more general operations (e.g. rename is a special case of move, fillNA is a special case of update, …).
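The multiplex shape — an intermediate object configured and then closed by a terminal call — can be sketched in a few lines of plain Kotlin (the names update/with here are a hypothetical miniature, not the real implementation):

```kotlin
// the intermediate object returned by the first call
class UpdateClause<T>(private val items: List<T>) {
    // terminal function: applies the row-wise transformation
    fun with(transform: (T) -> T): List<T> = items.map(transform)
}

fun <T> List<T>.update(): UpdateClause<T> = UpdateClause(this)

val result = listOf(1, 2, 3).update().with { it * 100 }
println(result)  // [100, 200, 300]
```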
Essential Functionalities#
This section will walk you through the fundamental mechanics of interacting with the data contained in DataFrames.
Indexing, Selection and Filtering#
val obj = dataFrameOf(
"letters" to listOf("a", "b", "c", "d"),
"nums" to listOf(0.0, 1.0, 2.0, 3.0)
)
obj
letters | nums |
---|---|
a | 0.0 |
b | 1.0 |
c | 2.0 |
d | 3.0 |
DataFrame: rowsCount = 4, columnsCount = 2
We can access a single row with:
obj[1]
letters | nums |
---|---|
b | 1.0 |
DataRow: index = 1, columnsCount = 2
And we can access a column with:
obj["nums"]
nums |
---|
0.0 |
1.0 |
2.0 |
3.0 |
DataColumn: name = "nums", type = Double, size = 4
We can select multiple columns with the usual notation:
obj["nums", "letters"]
nums | letters |
---|---|
0.0 | a |
1.0 | b |
2.0 | c |
3.0 | d |
DataFrame: rowsCount = 4, columnsCount = 2
We can also select ranges (note that the upper bound of the range is included):
obj[0..2] // obj[0 until 3]
letters | nums |
---|---|
a | 0.0 |
b | 1.0 |
c | 2.0 |
DataFrame: rowsCount = 3, columnsCount = 2
The [...] operator calls the get() method, so:
obj[0] == obj.get(0) // true
obj["nums"] == obj.getColumn(1) == obj.getColumn("nums") // true
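This works because Kotlin translates square brackets into calls to a get operator function. A toy sketch of the convention (not the real DataRow type):

```kotlin
class ToyRow(private val cells: Map<String, Any?>) {
    // `row[name]` compiles down to `row.get(name)`
    operator fun get(name: String): Any? = cells[name]
}

val row = ToyRow(mapOf("letters" to "b", "nums" to 1.0))
println(row["nums"])                     // 1.0
println(row["nums"] == row.get("nums"))  // true
```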
Unlike Python, Kotlin does not provide filtering inside square brackets, but it offers the filter method, which can be more readable:
obj.filter { nums.toInt() % 2 == 0 }
letters | nums |
---|---|
a | 0.0 |
c | 2.0 |
DataFrame: rowsCount = 2, columnsCount = 2
instead of python’s:
obj[(obj["nums"] % 2 == 0)]
Arithmetic Operations#
Unlike pandas.DataFrame, arithmetic operations between dataframes are not defined by default in Kotlin DataFrame.
We can still compute row-wise operations with the update()
method:
obj.update { nums }.with { it * 100 }
letters | nums |
---|---|
a | 0.0 |
b | 100.0 |
c | 200.0 |
d | 300.0 |
DataFrame: rowsCount = 4, columnsCount = 2
But the following code will not compile:
// obj + obj
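If element-wise arithmetic between two columns is needed, it can be spelled out manually; a plain-list sketch using the stdlib zip function:

```kotlin
val a = listOf(0.0, 1.0, 2.0, 3.0)
val b = listOf(0.0, 1.0, 2.0, 3.0)

// pair up elements by position and add them
val sums = a.zip(b) { x, y -> x + y }
println(sums)  // [0.0, 2.0, 4.0, 6.0]
```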
Function Application and Mapping#
Much like NumPy’s ufunc element-wise operations, every Kotlin collection (and therefore every DataColumn) provides the map method. In the case of a dataframe, we can apply a function to each row with the map operator.
Most of the time, when we want to compute row-wise operations, the update() and convert() methods suit our needs perfectly.
val frame = dataFrameOf("b", "d", "e").randomDouble(3)
.update { colsOf<Double>() }.with { it * 7 }
frame
b | d | e |
---|---|---|
5.063779 | 5.833231 | 5.659028 |
2.594955 | 2.103203 | 5.466144 |
6.243192 | 1.803812 | 4.375719 |
DataFrame: rowsCount = 3, columnsCount = 3
We can now apply a function with map to one, some, or all columns and get the result as a List. Additionally, it’s possible to store the result of the function application in a new column using the mapToColumn() method. Lastly, mapToFrame() collects the function applications over multiple columns into a new dataframe.
frame.map { Math.ceil(it.b) }
[6.0, 3.0, 7.0]
frame.mapToColumn("ceiling_b") { Math.ceil(it.b) }
ceiling_b |
---|
6.0 |
3.0 |
7.0 |
DataColumn: name = "ceiling_b", type = Double, size = 3
frame.mapToFrame {
"new_b" from b ;
d gt 1.0 into "b_gt_1"
}
new_b | b_gt_1 |
---|---|
5.063779 | true |
2.594955 | true |
6.243192 | true |
DataFrame: rowsCount = 3, columnsCount = 2
Conclusions#
In this chapter, we have introduced Kotlin DataFrame, a library for working with tabular data with all the advantages that the Kotlin language provides. Coming from Python’s pandas, its behavior is very similar and intuitive, but the major differences are:
DataFrame ensures type safety at compile time (especially when working in Jupyter notebooks).
There is no Index object to manipulate (e.g. no DateTimeIndex for indexing).
pandas’ use of NumPy ensures optimized arithmetic operations (through vectorization), whereas Kotlin DataFrame does not use an optimized library for representing vectors and matrices, relying only on the Kotlin standard collections.
While pandas offers a more comprehensive set of features and is a more mature library, Kotlin DataFrame offers a useful subset of its capabilities. In the following chapters, we will explore further the capabilities of Kotlin DataFrame and demonstrate its use in various data analysis tasks, from data cleaning to grouping strategies. Overall, Kotlin DataFrame is a valuable addition to the toolkit of any data analyst or scientist working in Kotlin.