Data Preparation with Kotlin#

In this chapter we look at the basic operations for preparing data for further analysis. Most of the Kotlin DataFrame functions covered here are very similar to their pandas counterparts.

%use dataframe

Handling null values#

Missing data occurs commonly in many datasets and data analysis applications.

Thanks to Kotlin’s nullable types, we can declare a column of a nullable type:

val col by columnOf<String?>("a", "b", null)
col

DataColumn: name = "col", type = String?, size = 3

We can then check, for each row, whether the value is null (or empty), similar to pandas DataFrame.isnull():

col.map { it.isNullOrEmpty() }

DataColumn: name = "col", type = Boolean, size = 3

The great advantage of using Kotlin in this kind of situation is that we have complete control over how missing data is handled. Unlike Python, Kotlin provides out-of-the-box ways to handle null values without risking a NullPointerException, as long as the developer keeps the code null-safe with safe-call operators (?.) or explicit null checks.
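As a stdlib-only sketch (no DataFrame involved), safe calls and the elvis operator make handling of missing values explicit:

```kotlin
// Plain Kotlin sketch: handling nullable values without a NullPointerException.
val values: List<String?> = listOf("a", "b", null)

// Safe call (?.): the result is null wherever the element is null.
val lengths: List<Int?> = values.map { it?.length }

// Elvis operator (?:): substitute a default for missing entries.
val filled: List<String> = values.map { it ?: "missing" }
```

The compiler forces one of these choices at every use site, which is exactly the control over missing data described above.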

Moreover, DataFrame offers a series of methods for filtering out or filling in null values.

val df = dataFrameOf(
    "0" to listOf(1.0, null, null, 2.0),
    "1" to listOf(3.5, 6.0, 4.0, null),
    "2" to listOf(1.0, null, 9.6, 10.0)
)

df

DataFrame: rowsCount = 4, columnsCount = 3

By default, the method dropNulls() drops all the rows that contain a null value.

df.dropNulls()

DataFrame: rowsCount = 1, columnsCount = 3

Note that DataFrame provides three methods for dropping possibly missing values:

  • dropNulls(): drops every row with a null value

  • dropNaNs(): drops rows with Double.NaN or Float.NaN values

  • dropNA(): removes rows with null, Double.NaN, or Float.NaN values
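The null/NaN distinction these methods draw also exists in plain Kotlin, where the two kinds of missing value must be filtered separately; a stdlib-only sketch:

```kotlin
// Plain Kotlin sketch: null and Double.NaN are different kinds of "missing".
val raw: List<Double?> = listOf(1.0, null, Double.NaN, 2.0)

val withoutNulls = raw.filterNotNull()                        // like dropNulls: NaN survives
val withoutNaNs = raw.filter { it == null || !it.isNaN() }    // like dropNaNs: null survives
val withoutNAs = raw.filterNotNull().filterNot { it.isNaN() } // like dropNA: both removed
```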

For each method, we can choose which columns we want to check for nulls, for example:

df.dropNA(whereAllNA = true) // removes the rows where ALL values are null or NaN

DataFrame: rowsCount = 4, columnsCount = 3

df.dropNA("0") // drop all rows that have null in the "0" column

DataFrame: rowsCount = 2, columnsCount = 3

// remove rows where both "0" and "2" are null or NaN
df.dropNA(whereAllNA = true) { "0" and "2" }

DataFrame: rowsCount = 3, columnsCount = 3

Instead of dropping rows, we may need to fill in missing values. Here too, DataFrame offers an API similar to pandas.

var df = dataFrameOf("a", "b", "c").randomDouble(7)
// dummy Double.NaN filling
df = df.update { a }.at(1..4).with { Double.NaN }
  .update { b }.at(1, 2).with { Double.NaN }
df

DataFrame: rowsCount = 7, columnsCount = 3

df.fillNA { all() }.withZero()

DataFrame: rowsCount = 7, columnsCount = 3

pandas offers the filling methods ffill and bfill, which fill missing values with the preceding or following row’s value.

We can simulate that behavior with:

df.fillNA { all() }.perRowCol { row, col -> row.prev()?.get(col) }

DataFrame: rowsCount = 7, columnsCount = 3

The example above does not take into account the new values computed during the fill: each missing cell gets the previous row’s original value. If we want to fill ALL missing values in a column from the first non-null preceding row, we must specify the column we want to modify and use the method newValue().

df.fillNA { a }.with { prev()?.newValue() }

DataFrame: rowsCount = 7, columnsCount = 3
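The cumulative behavior of prev()?.newValue() can be sketched with plain Kotlin collections, carrying the last non-NaN value forward (the ffill name here is illustrative, not part of the DataFrame API):

```kotlin
// Plain Kotlin sketch of a pandas-style forward fill (ffill) over a Double column.
fun ffill(column: List<Double>): List<Double> {
    var last: Double? = null
    return column.map { v ->
        if (v.isNaN()) last ?: v // no preceding non-NaN value yet: keep NaN
        else { last = v; v }
    }
}

ffill(listOf(1.0, Double.NaN, Double.NaN, 4.0)) // [1.0, 1.0, 1.0, 4.0]
```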

With fillNA (or fillNulls, or fillNaNs) you can pass any kind of expression inside the with construct. For example, we can fill each column with its mean (remember to pass skipNA = true when computing the mean):

df.fillNaNs { colsOf<Double>() }
    .perCol { it.mean(skipNA = true) }

DataFrame: rowsCount = 7, columnsCount = 3
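The effect of skipNA can be seen with a stdlib-only sketch: without it, a single NaN makes the whole mean NaN.

```kotlin
// Plain Kotlin sketch: computing a column mean while skipping NaN values.
val column = listOf(1.0, Double.NaN, 3.0)

val naiveMean = column.average()                           // NaN: it propagates through the sum
val skipNaMean = column.filterNot { it.isNaN() }.average() // 2.0
```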

Data Transformation#

Data transformation includes filtering, cleaning, converting and updating values.

Since Kotlin’s collections are used under the hood for storing data, most collection manipulation methods can be called on columns or rows.

Suppose we have a dataframe containing meats and their quantities:

val df = dataFrameOf("food", "ounces")(
    "bacon", 4,
    "pulled pork", 3,
    "bacon", 12,
    "pastrami", 6,
    "corned beef", 7.5,
    "bacon", 8,
    "pastrami", 3,
    "honey ham", 5,
    "nova lox", 6,
)
df

DataFrame: rowsCount = 9, columnsCount = 2

We could map each meat to its corresponding animal:

val meatToAnimal = mapOf(
    "bacon" to "pig",
    "pulled pork" to "pig",
    "pastrami" to "cow",
    "corned beef" to "cow",
    "honey ham" to "pig",
    "nova lox" to "salmon"
)

Using mapToFrame lets us perform column-wise operations, creating a new dataframe from the provided set of column mappings.

Mapping the whole food column to the corresponding animal is the same as applying a map function to a Kotlin collection.

df.mapToFrame {
    food.map { meatToAnimal[it] } into "animal"
    +food
    +ounces
}

DataFrame: rowsCount = 9, columnsCount = 3

Notice that the three map methods offer three ways to yield the result of the provided row expression.

  • map returns a List from the provided row expression.

  • mapToColumn returns a new Column from the provided row expression.

  • mapToFrame returns a new DataFrame from the provided column mappings. Here we can choose to keep old columns in the new frame with the +<col_name> operator.

It is worth learning more about DataRow, to understand which useful methods it offers for creating row expressions and row conditions.

df.map { meatToAnimal[it.food] }
[pig, pig, pig, cow, cow, pig, cow, pig, salmon]
df.mapToColumn("animal") { meatToAnimal[it.food] }

DataColumn: name = "animal", type = String, size = 9

In Python, the API is very similar:

df['animal'] = df['food'].map(lambda x: meat_to_animal[x])

Replacing Values#

Value replacing can be accomplished with the update or convert methods. They differ only in the return type: update modifies data in each row while keeping its type; convert modifies data in each row, possibly changing its type.

val values by columnOf(1.0, -999.0, 2.0, -999.0, -1000.0, 3.0)
val df = dataFrameOf(values)
df.update { values }.where { it == -999.0 }.withZero()

DataFrame: rowsCount = 6, columnsCount = 1
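In stdlib terms, the replacement above is a conditional map over the column values; a convert-style step is the same map with a type change (a sketch, not the DataFrame API itself):

```kotlin
// Plain Kotlin sketch of the update/where/withZero replacement above.
val values = listOf(1.0, -999.0, 2.0, -999.0, -1000.0, 3.0)
val replaced = values.map { if (it == -999.0) 0.0 else it } // type preserved, like update

// A convert-style step may change the element type (Double -> Int here).
val asInts: List<Int> = replaced.map { it.toInt() }
```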

Detecting and Filtering Outliers#

Sometimes it’s very useful to spot and filter out outliers before doing further computation.

Consider a DataFrame with some randomly distributed data:

val df = dataFrameOf("a", "b", "c", "d")
    .randomDouble(1000)
    .update { colsOf<Double>() }.with { it * 5}
df.head()

DataFrame: rowsCount = 5, columnsCount = 4

df.describe()

DataFrame: rowsCount = 4, columnsCount = 12

Suppose we want to find all rows containing a value exceeding 4.5. Unlike pandas boolean indexing, with DataFrame we apply a row-wise filter.

df.filter { it.values().any { it as Double > 4.5 } }

... showing only top 20 of 354 rows

DataFrame: rowsCount = 354, columnsCount = 4

Now let’s cap all values > 4.5 so they stay inside that range.

df.update { colsOf<Double>() }
    .where { it > 4.5 }
    .withValue(4.5).head()

DataFrame: rowsCount = 5, columnsCount = 4
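The capping above corresponds to the stdlib coerceAtMost; a plain-Kotlin sketch of detecting and then capping outliers:

```kotlin
// Plain Kotlin sketch: detect values above a threshold, then cap them to it.
val threshold = 4.5
val column = listOf(0.3, 4.9, 2.2, 5.0)

val outliers = column.filter { it > threshold }        // detection
val capped = column.map { it.coerceAtMost(threshold) } // capping
```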

Recap#

Much of DataFrame’s power comes from update and convert, which let us perform any kind of row-wise transformation on data. Both clauses allow a pre-filtering step before updating cell values, using the where method with a row condition. More precise manipulation of individual cell values can be achieved with the perCol and perRowCol methods, which allow more complex expressions to be written.

Since DataColumns are backed by Kotlin collections, all the standard collection methods (map, filter, the Iterable extension properties, and so on) can be called when manipulating rows or columns.

More real world examples can be found in the examples section.