Kotlin NDArrays#
Multik Engines#
Multik engines are a key feature of this library: they provide a way to perform computations on arrays using different backends.
In the context of Multik, a backend is a software component that implements specific array operations (e.g. a backend might implement the algorithm for matrix multiplication). Multik lets the user select a backend depending on their specific needs and requirements (a default one is loaded when nothing is specified). Currently, several Multik engines are available, including:
JVM Engine: default engine, runs on the JVM and provides basic array operations.
Apache Arrow Engine
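When more than one engine is on the classpath, the backend can also be switched at runtime. The sketch below is an assumption based on the Multik README for version 0.2+, where mk.setEngine and the engine-type objects exist; older versions select the engine purely through the Gradle dependency:

```kotlin
import org.jetbrains.kotlinx.multik.api.KEEngineType
import org.jetbrains.kotlinx.multik.api.mk
import org.jetbrains.kotlinx.multik.api.setEngine

// Switch to the pure-Kotlin/JVM engine (assumed API, Multik 0.2+);
// a NativeEngineType backed by OpenBLAS is also documented there.
mk.setEngine(KEEngineType)
```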
Thanks to Multik and the standard Collections library, it’s easy to create, access and manipulate N-dimensional arrays and matrices.
You can import the Multik library and its dependencies using Jupyter’s magic command %use:
%use multik
We can then create a simple 1D array as follows:
val arr: NDArray<Int, D1> = mk.ndarray(mk[1, 2, 3])
arr
[1, 2, 3]
Notice the type definition of the array, where we must specify:
Type: any Kotlin subtype of Number (Boolean is not permitted).
Dimension: Multik defines 5 Dimension types: D1, D2, D3, D4 and DN. DN is used in all cases where the dimension is not known in advance, or when the number of dimensions is greater than 4.
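When the number of dimensions must be erased, an array of any fixed dimension can be converted to a DN one with asDNArray() (the variable names below are just illustrative):

```kotlin
val fixed: NDArray<Int, D1> = mk.ndarray(mk[1, 2, 3])

// Erase the static dimension: useful for code that must accept
// arrays of any dimensionality
val dynamic: NDArray<Int, DN> = fixed.asDNArray()
println(dynamic.shape.size) // number of dimensions, here 1
```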
In the same way, we can create a 2-dimensional array, but this time we let the compiler infer the type and dimension:
val arr = mk.ndarray(mk[mk[1, 2, 3], mk[4, 5, 6]])
arr
[[1, 2, 3],
[4, 5, 6]]
We can also create an array from Kotlin collections, like a List or a Set:
val arr = mk.ndarray(listOf(1, 2, 3))
arr
[1, 2, 3]
Already from these examples, it’s quite easy to spot the similarities with the NumPy API. At the same time, Kotlin’s static typing can feel a bit tedious at first compared to NumPy, but the effort will often pay off.
Creation#
We will go through simple NDArray creation methods. Most of the time they’re very similar to NumPy’s methods, so the following examples should already look familiar.
Creating equally spaced arrays with arange and linspace:
val a = mk.linspace<Double>(0, 1, 5)
val b = mk.arange<Int>(0, 10, 2)
println("a -> $a")
println("b -> $b")
a -> [0.0, 0.25, 0.5, 0.75, 1.0]
b -> [0, 2, 4, 6, 8]
Generating arrays of zeros and ones:
val zrs = mk.zeros<Double>(10)
val ons = mk.ones<Double>(10)
println("Zeros ->$zrs")
println("Ones -> $ons")
Zeros ->[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Ones -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
We can then map the array to a matrix using the reshape method, which takes as input an arbitrary number of dimension sizes along which the array will be reshaped.
mk.zeros<Int>(50).reshape(2, 5, 5) // three dimensional matrix with (z, x, y) = (2, 5, 5)
[[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]]]
It’s also possible to create an NDArray by providing a lambda that constructs it; the lambda receives the element’s flat index, as if the array were built from an arange vector.
mk.d3array(2, 5, 5) { it % 2 }
[[[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0]],
[[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]]]
This is the same as:
mk.arange<Int>(50).map { it % 2 }.reshape(2, 5, 5)
[[[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0]],
[[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]]]
Arithmetic with NDArrays#
In the current version of Multik, an NDArray object supports all arithmetic operations, enabling what are called vectorized operations.
Vectorization is very convenient because it lets the user express batch operations on data without writing any for loops (and it can also improve performance).
val arr = mk.ndarray(mk[mk[1.0, 2.0, 3.0], mk[4.0, 5.0, 6.0]])
arr
[[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]]
arr * arr
[[1.0, 4.0, 9.0],
[16.0, 25.0, 36.0]]
1.0 / arr
[[1.0, 0.5, 0.3333333333333333],
[0.25, 0.2, 0.16666666666666666]]
arr dot arr.transpose()
[[14.0, 32.0],
[32.0, 77.0]]
Multik provides two separate packages: api.linalg for basic linear algebra operations, and api.math for vectorized math operations.
mk.math.cumSum(arr, axis = 1)
[[1.0, 3.0, 6.0],
[4.0, 9.0, 15.0]]
val sqrMat = mk.ones<Double>(3, 3)
// Frobenius norm of the matrix (default when calling `norm()`)
mk.linalg.norm(sqrMat, norm = Norm.Fro)
3.0
mk.math.sin(arr)
[[0.8414709848078965, 0.9092974268256817, 0.1411200080598672],
[-0.7568024953079283, -0.9589242746631385, -0.27941549819892586]]
The linalg package also contains methods for solving linear matrix equations and for matrix decompositions.
// solve the linear system
val a = mk.d2array(3, 3) { it * it }.asType<Double>()
val b = mk.ndarray(mk[54.0, 396.0, 1062.0])
val res = mk.linalg.solve(a, b).map { it.roundToInt() }
println(" x = $res")
// check
(a dot res.asType<Double>()) == b
x = [0, 6, 12]
true
linalg also offers matrix inversion and QR decomposition:
val X = mk.rand<Double>(5, 5)
val mat = X.transpose().dot(X)
mk.linalg.inv(mat)
val (q, r) = mk.linalg.qr(mat)
q
[[-0.6434522522184294, 0.38641698097305016, 0.20534345094725628, 0.4802349982187798, -0.40479566397082917],
[-0.45474352025096354, -0.47859182494124414, 0.4129962773533749, -0.5871315544067359, -0.2210629064346973],
[-0.3292296576854662, -0.6546065678319505, -0.1733503466261196, 0.510129349187332, 0.41571117169724725],
[-0.4300144650309776, 0.1468132807302342, -0.8164473380454641, -0.35606292314874416, -0.012898039486490235],
[-0.29304082014153776, 0.414207537193111, 0.30105073254919396, -0.19400547972214482, 0.7837662432974303]]
r
[[-3.008139142773184, -2.475365375022795, -1.9932339679809647, -2.437935083612442, -1.253512427444805],
[0.0, -0.5818782649415262, -0.6984592874394522, -0.05435701240587266, 0.23201919273065758],
[0.0, 0.0, -0.10852908316489254, -0.7022880594833871, 0.0653178642796647],
[0.0, 0.0, 0.0, -0.2289228194117369, -0.016211823468891426],
[0.0, 0.0, 0.0, 0.0, 0.050274994433414676]]
Indexing and Slicing#
There are many ways to select a subset of the data of an NDArray.
For one-dimensional arrays it is straightforward:
val arr = mk.arange<Int>(10)
arr
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
// basic indexing
arr[2]
2
// range indexing
println(arr[1..3])
// the behavior is different using kotlin keyword `until`
println(arr[1 until 3])
[1, 2, 3]
[1, 2]
For multidimensional arrays the range operator can be used like this:
val a = mk.linspace<Int>(0, 20, 10).reshape(5, 2)
// single element selection
a[3, 1]
// row selection
a[0]
// range selection, end-inclusive
a[0..2] // != a[0 until 2], which excludes the end index
// selecting the first column of the first 3 rows
a[0..2, 0]
[0, 4, 8]
Warning
Slices are copies of the source array, unlike NumPy, where slicing generates a view on the original array.
var b = a[0..2, 0]
a[0..2, 0] === b // false!
NDArrays do not support boolean indexing, but filters can be applied with the filter() method, like with every Collection in Kotlin’s standard library.
For each “indexed functional method” that can be applied to Kotlin collections, like mapIndexed, forEachIndexed and filterIndexed, Multik offers a counterpart for multidimensional arrays (of the form *MultiIndexed()), where the index is an IntArray holding the indices of the current element.
A full list of those methods, and more, can be found in the ndarray.operations package.
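A short sketch of both approaches; note that filter() flattens its result into a one-dimensional array regardless of the input’s shape:

```kotlin
val m = mk.ndarray(mk[mk[1, 2, 3], mk[4, 5, 6]])

// filter() plays the role of NumPy's boolean-mask indexing,
// returning a flattened D1 array of the matching elements
m.filter { it > 3 }  // [4, 5, 6]

// mapMultiIndexed() passes the element's indices as an IntArray
// and preserves the matrix structure: here, zero out the first row
m.mapMultiIndexed { idx, e -> if (idx[0] == 0) 0 else e }
```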
inplace context#
Multik provides an inplace context for operating directly on an array, modifying its contents in place.
Inside it, we can open the default math context to apply in-place mathematical transformations to the elements of the array.
val a = mk.d1array(10) { it * 10 }.asType<Double>()
val b = mk.arange<Double>(10)
println("Before -> $a")
a.inplace {
math {
(this - b) * b
abs()
}
}
println("After -> $a")
Before -> [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
After -> [0.0, 9.0, 36.0, 81.0, 144.0, 225.0, 324.0, 441.0, 576.0, 729.0]
Note that this approach mutates the array even though it was declared with Kotlin’s val keyword: val only prevents reassignment of the reference, not mutation of the object it points to.
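This is the same behavior as any mutable Kotlin object held through a val reference, for example:

```kotlin
val xs = mutableListOf(1, 2, 3)
xs[0] = 99             // allowed: the list's contents are mutable
// xs = mutableListOf() // does not compile: the reference is fixed
println(xs)            // [99, 2, 3]
```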
Mathematical and Statistical Methods#
Both api.stat and api.math provide several methods for computing basic statistics, like mean, median, max, argMax, sum and cumSum. Some of them can be called either as instance methods or through the top-level Multik functions.
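For example, sum is available both ways (a small sketch; it assumes the extension functions are in scope, as they are after %use multik):

```kotlin
val v = mk.ndarray(mk[1, 2, 3, 4])
println(v.sum())         // instance-style extension
println(mk.math.sum(v))  // top-level function, same result
```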
If we want to compute the mean of a matrix, we must specify whether we want the mean of the whole matrix or a row/column-wise mean, using mean() and meanD2() respectively (for higher-dimensional arrays, the meanD* variants cover dimensions up to 4; beyond that, meanDN has to be used).
val arr = mk.rand<Double>(5, 4)
println("mean of the whole matrix -> ${mk.stat.mean(arr)}")
println("mean across columns -> ${mk.stat.meanD2(arr, 0)}")
println("mean across rows -> ${mk.stat.meanD2(arr, 1)}")
mean of the whole matrix -> 0.5788003662292656
mean across columns -> [0.6074798009411675, 0.49930172509369386, 0.5479675749694699, 0.6604523639127308]
mean across rows -> [0.4931613167485237, 0.36150513066399875, 0.7832244650967224, 0.698053726239088, 0.5580571923979946]
Example: Random Walks#
This example has been taken from the book Python for Data Analysis.
A simple pure Kotlin implementation can be summarized as follows:
import java.util.Random
var position = 0
val walk = mutableListOf(position)
val rand = Random(123456)
val steps = 1000
for (i in 0 until steps) {
val step = if (rand.nextBoolean()) 1 else -1
position += step
walk.add(position)
}
We can plot that walk
%use lets-plot
ggplot() { x=(0..100).toList() ; y= walk.slice(0..100) } +
geomLine()
We can get the same result by computing the cumulative sum of the random steps (note that a Kotlin List can be used for creating a one-dimensional vector).
val nsteps = 1000
val draws = mk.d1array(nsteps) { rand.nextInt(0, 2)}
val steps = mk.d1array(draws.size) { if (draws[it] > 0) 1 else -1 }
val walk = steps.cumSum()
walk.min()
-20
walk.max()
9
We can get the index at which the walk first crosses 10 steps above or below the origin, even if we have to go through a list, because Multik does not support Boolean arrays.
val first10 = abs(walk).toList().map { it >= 10 }.indexOfFirst { it }
first10
55
And we can plot the first 100 steps similarly to what we did before, highlighting the point we computed above:
ggplot() { x = mk.arange<Int>(100).toList() ; y = walk[0..99].toList() } +
geomLine() +
geomPoint(x = first10, y = walk[first10], color="red", size=5.0)
Multik’s ndarrays come in handy in case we want to simulate many random walks at once, representing them in a matrix where each row is a walk and each column of that row is a step:
val nwalks = 5000
val nsteps = 1000
val draws = mk.d2array(nwalks, nsteps) { rand.nextInt(0, 2)}
val steps = draws.map { if (it > 0) 1 else -1 }
val walks = mk.math.cumSum(steps, axis = 1)
walks.min()
-105
walks.max()
107
We can count the runs that reach 50 steps above the origin using mapMultiIndexed(), which preserves the matrix structure:
mk.math.maxD2(
walks.mapMultiIndexed { _, elem -> if (elem >= 50) 1 else 0 }, 1).sum()
604
Broadcasting#
Arrays with different shapes can be broadcast if the following condition is satisfied (from the NumPy documentation):
In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
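The rule can be sketched as a plain Kotlin check over the trailing axes of two shapes (a hypothetical helper for illustration, not part of Multik or KMath):

```kotlin
// Walk the shapes from the last axis backwards: each pair of sizes
// must be equal, or one of them must be 1
fun broadcastable(a: IntArray, b: IntArray): Boolean {
    val n = minOf(a.size, b.size)
    for (i in 1..n) {
        val da = a[a.size - i]
        val db = b[b.size - i]
        if (da != db && da != 1 && db != 1) return false
    }
    return true
}

println(broadcastable(intArrayOf(2, 4, 5), intArrayOf(4, 5))) // true
println(broadcastable(intArrayOf(2, 4, 5), intArrayOf(3, 5))) // false
```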
Multik does not have a direct way to broadcast two arrays with different shapes, but the KMath tensors module provides an implementation in the DoubleTensorAlgebra context.
The KMath modules needed for broadcasting are not available through the %use magic command, but we can tell Kotlin’s Jupyter kernel to import them. KMath also provides a set of APIs for transforming a Multik NDArray into a tensor, an object more familiar to KMath.
We use the @file:DependsOn annotation to import the core, multik and tensors modules of KMath into the project:
@file:DependsOn("space.kscience:kmath-core:0.3.1")
@file:DependsOn("space.kscience:kmath-tensors:0.3.1")
@file:DependsOn("space.kscience:kmath-multik:0.3.1")
To be able to use them, we import the needed packages:
import space.kscience.kmath.multik.MultikTensor
import space.kscience.kmath.tensors.core.DoubleTensorAlgebra
import space.kscience.kmath.tensors.core.tensorAlgebra
import space.kscience.kmath.tensors.core.withBroadcast
import space.kscience.kmath.tensors.core.DoubleTensor
Of course we could have used KMath tensors for all the computation, but KMath tensors are somewhat harder to manipulate than Multik’s NDArrays, so in this example we will see how to perform broadcasting starting from two NDArrays.
Thanks to Kotlin extension functions, we can map the + operator between two MultikTensor objects so that it performs broadcasting when needed.
fun NDArray<Double, DN>.asMultikTensor(): MultikTensor<Double> = MultikTensor(this)
fun brAdd(a: MultikTensor<Double>, b: MultikTensor<Double>): DoubleTensor? {
return Double.tensorAlgebra.withBroadcast {
a + b
}
}
operator fun MultikTensor<Double>.plus(other: MultikTensor<Double>): DoubleTensor? = brAdd(this, other)
With those functions, the following code produces the expected output.
val a: NDArray<Double, DN> = mk.arange<Double>(40).reshape(2, 4, 5).asDNArray()
val b: NDArray<Double, DN> = mk.arange<Double>(20).reshape(4, 5).asDNArray()
a.asMultikTensor() + b.asMultikTensor()
DoubleTensor(
[[[ 0.0 , 2.0 , 4.0 , 6.0 , 8.0 ],
  [ 1.0e+1 , 1.2e+1 , 1.4e+1 , 1.6e+1 , 1.8e+1 ],
  [ 2.0e+1 , 2.2e+1 , 2.4e+1 , 2.6e+1 , 2.8e+1 ],
  [ 3.0e+1 , 3.2e+1 , 3.4e+1 , 3.6e+1 , 3.8e+1 ]],
 [[ 2.0e+1 , 2.2e+1 , 2.4e+1 , 2.6e+1 , 2.8e+1 ],
  [ 3.0e+1 , 3.2e+1 , 3.4e+1 , 3.6e+1 , 3.8e+1 ],
  [ 4.0e+1 , 4.2e+1 , 4.4e+1 , 4.6e+1 , 4.8e+1 ],
  [ 5.0e+1 , 5.2e+1 , 5.4e+1 , 5.6e+1 , 5.8e+1 ]]]
)
There is no such thing as np.newaxis for specifying the missing dimension to broadcast along. We must reshape the arrays in order to broadcast properly.
Consider this example, where we compute the mean of every row and want to add the means back to the original matrix.
According to the broadcasting rules, if we want to broadcast across axes other than axis 0, the “broadcast dimensions” must be 1 in the smaller array. With b of shape (4, 5) and the mean vector of shape (4,), we must reshape the vector and insert a “1” in the broadcasting dimension, the second one.
val mean = mk.stat.meanD2(b.asD2Array(), 1)
.reshape(b.shape[0], 1)
.asDNArray().asMultikTensor()
b.asMultikTensor() + mean
DoubleTensor(
[[ 2.0 , 3.0 , 4.0 , 5.0 , 6.0 ],
[ 1.2e+1 , 1.3e+1 , 1.4e+1 , 1.5e+1 , 1.6e+1 ],
[ 2.2e+1 , 2.3e+1 , 2.4e+1 , 2.5e+1 , 2.6e+1 ],
[ 3.2e+1 , 3.3e+1 , 3.4e+1 , 3.5e+1 , 3.6e+1 ]]
)
Notes:
The withBroadcast context is marked as unstable and could change in the future.
For large arrays, the chain of conversions can be very slow in performance-critical code.
No direct way to convert a DoubleTensor back to a Multik NDArray was found: the only option is to manually copy the elements from the DoubleTensor into a new NDArray.
Conclusions#
This chapter presented Kotlin’s Multik library for working with N-dimensional arrays. Its set of data structures and methods is very close to NumPy’s, which makes it very easy to learn for a developer with basic knowledge of NumPy. However, NumPy has more than 20 years of development and fine tuning behind it, resulting in one of the best libraries for scientific computation: the better support for linear algebra operations, array broadcasting and the possibility to use multidimensional arrays with non-numeric data types make it a very flexible and powerful library. On the other hand, Multik can be combined with libraries like KMath and Apache Commons Math, or any other library developed for the JVM, making it very powerful when paired with the right set of components for the developer’s needs (e.g. array broadcasting and more advanced linear algebra can be accomplished with KMath). Note that Multik is very efficient for basic mathematical computation, and the KMath connectors for Multik can really help in achieving more demanding tasks (e.g. integration).
The Kotlin ecosystem for scientific computation is still young, but with the joint use of the right libraries, a lot can be achieved.