Data Preparation with Kotlin#
In the following chapter we will see the basic operations for preparing data for further analysis. Most of the covered Kotlin DataFrame functions are very similar to their pandas counterparts.
%use dataframe
Handling null values#
Missing data occurs commonly in many datasets and data analysis applications.
Thanks to Kotlin's nullable types, we can have a column of a nullable type like:
val col by columnOf<String?>("a", "b", null)
col
DataColumn: name = "col", type = String?, size = 3
We can then check, for each row, whether the value is missing, similar to pandas DataFrame.isnull():
col.map { it.isNullOrEmpty() }
DataColumn: name = "col", type = Boolean, size = 3
The great advantage of using Kotlin in this kind of situation is that we have complete control over how we handle missing data. Unlike Python, Kotlin provides out-of-the-box support for handling null values without raising a NullPointerException, as long as the developer keeps the context safe with safe-call operators (?.) or explicit null checks.
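As a plain-Kotlin illustration of these tools (a stdlib sketch, independent of the DataFrame API), the usual null-handling idioms look like this:

```kotlin
// Stdlib sketch of Kotlin's null-handling tools (no DataFrame dependency).
fun main() {
    val col: List<String?> = listOf("a", "b", null)

    // Safe-call operator: yields null instead of throwing a NullPointerException.
    val lengths = col.map { it?.length }
    println(lengths)   // [1, 1, null]

    // Elvis operator: provide a default for missing values.
    val filled = col.map { it ?: "?" }
    println(filled)    // [a, b, ?]

    // Drop nulls entirely, narrowing the type to List<String>.
    val present: List<String> = col.filterNotNull()
    println(present)   // [a, b]
}
```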
Moreover, DataFrame offers a series of methods for filtering or filling null values.
val df = dataFrameOf(
"0" to listOf(1.0, null, null, 2.0),
"1" to listOf(3.5, 6.0, 4.0, null),
"2" to listOf(1.0, null, 9.6, 10.0)
)
df
DataFrame: rowsCount = 4, columnsCount = 3
By default, the method dropNulls() drops all the rows that contain a null value.
df.dropNulls()
DataFrame: rowsCount = 1, columnsCount = 3
It is important to notice that DataFrame provides three methods for dropping rows with missing values:

- dropNulls(): drops every row with a null value
- dropNaNs(): drops rows with Double.NaN or Float.NaN values
- dropNA(): removes rows with null, Double.NaN or Float.NaN values
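The distinction between null, NaN and NA can be illustrated with plain Kotlin collections (a stdlib analogy, not the DataFrame API itself):

```kotlin
// Stdlib sketch of the null / NaN / NA distinction (not the DataFrame API).
fun main() {
    val xs: List<Double?> = listOf(1.0, null, Double.NaN, 2.0)

    val withoutNulls = xs.filterNotNull()                       // drops null only
    val withoutNaNs = xs.filter { it == null || !it.isNaN() }   // drops NaN only
    val withoutNAs = xs.filterNotNull().filter { !it.isNaN() }  // drops both

    println(withoutNulls) // [1.0, NaN, 2.0]
    println(withoutNaNs)  // [1.0, null, 2.0]
    println(withoutNAs)   // [1.0, 2.0]
}
```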
For each method, we can choose which columns we want to check for nulls, for example:
df.dropNA(whereAllNA = true) // removes only the rows where ALL values are NA
DataFrame: rowsCount = 4, columnsCount = 3
df.dropNA("0") // drop all rows that have an NA value in column "0"
DataFrame: rowsCount = 2, columnsCount = 3
// remove rows where both "0" and "2" are null or NaN
df.dropNA(whereAllNA = true) { "0" and "2" }
DataFrame: rowsCount = 3, columnsCount = 3
Instead of dropping null values, we may need to fill in missing values. Just like pandas, DataFrame offers a similar API.
var df = dataFrameOf("a", "b", "c").randomDouble(7)
// dummy Double.NaN filling
df = df.update { a }.at(1..4).with { Double.NaN }
.update { b }.at(1, 2).with { Double.NaN }
df
DataFrame: rowsCount = 7, columnsCount = 3
df.fillNA { all() }.withZero()
DataFrame: rowsCount = 7, columnsCount = 3
pandas offers the filling methods ffill and bfill, which fill missing values with the preceding or the following row's value, respectively.
We can simulate that behavior with:
df.fillNA { all() }.perRowCol { row, col -> row.prev()?.get(col) }
DataFrame: rowsCount = 7, columnsCount = 3
The example above does not take into account the new values computed during the filling itself. If we want to fill ALL missing values in a column with the first non-null preceding value, we must specify the column we want to modify and use the newValue() method.
df.fillNA { a }.with { prev()?.newValue() }
DataFrame: rowsCount = 7, columnsCount = 3
With fillNA (or fillNulls, or fillNaNs) you can pass any kind of function inside the with construct, for example the column mean (remember to pass skipNA = true when computing the mean):
df.fillNaNs { colsOf<Double>() }
    .perCol { it.mean(skipNA = true) }
DataFrame: rowsCount = 7, columnsCount = 3
Data Transformation#
Data transformation includes filtering, cleaning, converting and updating values.
Thanks to the underlying usage of Kotlin's collections for storing data, most collection manipulation methods can be called on columns or rows.
Suppose we have a dataframe containing data about meats and their quantities:
val df = dataFrameOf("food", "ounces")(
"bacon", 4,
"pulled pork", 3,
"bacon", 12,
"pastrami", 6,
"corned beef", 7.5,
"bacon", 8,
"pastrami", 3,
"honey ham", 5,
"nova lox", 6,
)
df
DataFrame: rowsCount = 9, columnsCount = 2
We could map each meat to its corresponding animal:
val meatToAnimal = mapOf(
"bacon" to "pig",
"pulled pork" to "pig",
"pastrami" to "cow",
"corned beef" to "cow",
"honey ham" to "pig",
"nova lox" to "salmon"
)
Using mapToFrame lets us perform column-wise operations, creating a new dataframe with the provided set of instructions.
Mapping the whole food column to the corresponding animal is the same as applying a map function to a Kotlin collection.
df.mapToFrame {
    food.map { meatToAnimal[it] } into "animal"
    +food
    +ounces
}
DataFrame: rowsCount = 9, columnsCount = 3
Notice that the three map methods offer three ways to yield the result of the mapping of the provided row expression:

- map returns a List from the provided row expression.
- mapToColumn returns a new DataColumn from the provided row expression.
- mapToFrame returns a new DataFrame from the provided column mappings. Here we can keep the old columns in the new frame with the +<col_name> operator.
It is advised to learn more about DataRows to understand which useful methods they offer for creating row expressions or row conditions.
df.map { meatToAnimal[it.food]}
[pig, pig, pig, cow, cow, pig, cow, pig, salmon]
df.mapToColumn("animal") { meatToAnimal[it.food] }
DataColumn: name = "animal", type = String, size = 9
In python, the API is very similar:
df['animal'] = df['food'].map(lambda x: meat_to_animal[x])
Replacing Values#
Value replacing can be accomplished with the update or convert methods. They differ only in the return type: update modifies the data in each row while keeping its type; convert modifies the data in each row, possibly changing its type.
val values by columnOf(1.0, -999.0, 2.0, -999.0, -1000.0, 3.0)
val df = dataFrameOf(values)
df.update { values }.where { it == -999.0 }.withZero()
DataFrame: rowsCount = 6, columnsCount = 1
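The update/convert distinction can be sketched with plain Kotlin collections: one transformation keeps the element type, the other changes it (a stdlib analogy, not the DataFrame API itself):

```kotlin
// Stdlib analogy for update vs convert (not the DataFrame API itself).
fun main() {
    val values = listOf(1.0, -999.0, 2.0, -999.0, -1000.0, 3.0)

    // "update"-like: replace sentinel values, type stays List<Double>
    val updated: List<Double> = values.map { if (it == -999.0) 0.0 else it }
    println(updated)   // [1.0, 0.0, 2.0, 0.0, -1000.0, 3.0]

    // "convert"-like: change the element type, here Double -> Int
    val converted: List<Int> = values.map { it.toInt() }
    println(converted) // [1, -999, 2, -999, -1000, 3]
}
```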
Detecting and Filtering Outliers#
Sometimes it’s very useful to spot and filter out outliers before doing some computation.
Consider a Dataframe with some randomly distributed data:
val df = dataFrameOf("a", "b", "c", "d")
    .randomDouble(1000)
    .update { colsOf<Double>() }.with { it * 5 }
df.head()
DataFrame: rowsCount = 5, columnsCount = 4
df.describe()
DataFrame: rowsCount = 4, columnsCount = 12
Suppose we want to find all values exceeding 4.5. Unlike pandas boolean indexing, with DataFrame we apply a row-wise filter.
df.filter { it.values().any { it as Double > 4.5 } }
... showing only top 20 of 354 rows
DataFrame: rowsCount = 354, columnsCount = 4
Now let’s cap all values greater than 4.5 so that everything lies inside that range.
df.update { colsOf<Double>() }
.where { it > 4.5 }
.withValue(4.5).head()
DataFrame: rowsCount = 5, columnsCount = 4
Recap#
DataFrame's power comes from the update and convert methods, which let us perform any kind of row-wise transformation on data. Those two clauses allow a pre-filtering operation before updating cell values, using the where method and providing a row condition. A more precise manipulation of the single cell values can be achieved with the perCol or perRowCol methods, allowing more complex expressions to be written.
Since the underlying structure of DataColumns is Kotlin's collections, all their standard methods (like map, filter, all Iterable properties and so on) can be called and used when manipulating rows or columns.
More real world examples can be found in the examples section.