Kotlin NDArrays#
Multik Engines#
Multik engines are a key feature of this library: they provide a way to perform computations on arrays using different backends.
In the context of Multik, a backend is a software component that implements specific array operations (e.g. a backend might implement the algorithm for matrix multiplication). Multik lets the user select a backend depending on their specific needs and requirements (a default one is loaded when nothing is specified). Currently, several Multik engines are available, including:
JVM Engine: default engine, runs on the JVM and provides basic array operations.
Apache Arrow Engine
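When more than one engine is on the classpath, the backend can also be switched at runtime. The sketch below is an assumption based on the Multik README for version 0.2+, where mk.setEngine and the engine-type objects exist; older versions select the engine purely through the Gradle dependency:

```kotlin
import org.jetbrains.kotlinx.multik.api.KEEngineType
import org.jetbrains.kotlinx.multik.api.mk
import org.jetbrains.kotlinx.multik.api.setEngine

// Switch to the pure-Kotlin/JVM engine (assumed API, Multik 0.2+);
// a NativeEngineType backed by OpenBLAS is also documented there.
mk.setEngine(KEEngineType)
```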
Thanks to Multik and the standard Collections library, it’s easy to create, access and manipulate N-dimensional arrays and matrices.
You can import the Multik library and its dependencies using Jupyter’s magic command %use:
%use multik
We can then create a simple 1D array as follows:
val arr: NDArray<Int, D1> = mk.ndarray(mk[1, 2, 3])
arr
[1, 2, 3]
Notice the type definition of the array, where we must specify:
Type: any Kotlin subtype of Number (Boolean is not permitted).
Dimension: Multik defines 5 Dimension types: D1, D2, D3, D4 and DN. DN is used in all cases where the dimension is not known in advance, or when the number of dimensions is greater than 4.
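When the number of dimensions must be erased, an array of any fixed dimension can be converted to a DN one with asDNArray() (the variable names below are just illustrative):

```kotlin
val fixed: NDArray<Int, D1> = mk.ndarray(mk[1, 2, 3])

// Erase the static dimension: useful for code that must accept
// arrays of any dimensionality
val dynamic: NDArray<Int, DN> = fixed.asDNArray()
println(dynamic.shape.size) // number of dimensions, here 1
```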
In the same way, we can create a 2-dimensional array, but this time we let the compiler infer the type and dimension:
val arr = mk.ndarray(mk[mk[1, 2, 3], mk[4, 5, 6]])
arr
[[1, 2, 3],
[4, 5, 6]]
We can also create an array from Kotlin collections, like a List or a Set:
val arr = mk.ndarray(listOf(1, 2, 3))
arr
[1, 2, 3]
Already from these examples, it’s quite easy to spot the similarities with the NumPy API. At the same time, Kotlin’s static typing can feel a bit tedious at first compared to NumPy, but the effort will often pay off.
Creation#
We will go through simple NDArray creation methods. Most of the time they’re very similar to NumPy’s methods, so the following examples should already look familiar.
Creating equally spaced arrays with arange and linspace:
val a = mk.linspace<Double>(0, 1, 5)
val b = mk.arange<Int>(0, 10, 2)
println("a -> $a")
println("b -> $b")
a -> [0.0, 0.25, 0.5, 0.75, 1.0]
b -> [0, 2, 4, 6, 8]
Generating arrays of zeros and ones:
val zrs = mk.zeros<Double>(10)
val ons = mk.ones<Double>(10)
println("Zeros ->$zrs")
println("Ones -> $ons")
Zeros ->[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Ones -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
We can then map the array to a matrix using the reshape method, which takes as input an arbitrary number of dimension sizes along which the array will be reshaped.
mk.zeros<Int>(50).reshape(2, 5, 5) // three dimensional matrix with (z, x, y) = (2, 5, 5)
[[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]]]
It’s also possible to create an NDArray by providing a lambda that constructs it; the lambda receives the element’s flat index, as if the array were built from an arange vector.
mk.d3array(2, 5, 5) { it % 2 }
[[[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0]],
[[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]]]
This is the same as:
mk.arange<Int>(50).map { it % 2 }.reshape(2, 5, 5)
[[[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0]],
[[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1]]]
Arithmetic with NDArrays#
In the current version of Multik, an NDArray object supports all arithmetic operations, enabling what are called vectorized operations.
Vectorization is very convenient because it lets the user express batch operations on data without writing any for loops (and it can also improve performance).
val arr = mk.ndarray(mk[mk[1.0, 2.0, 3.0], mk[4.0, 5.0, 6.0]])
arr
[[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]]
arr * arr
[[1.0, 4.0, 9.0],
[16.0, 25.0, 36.0]]
1.0 / arr
[[1.0, 0.5, 0.3333333333333333],
[0.25, 0.2, 0.16666666666666666]]
arr dot arr.transpose()
[[14.0, 32.0],
[32.0, 77.0]]
Multik provides two separate packages: api.linalg for basic linear algebra operations, and api.math for vectorized math operations.
mk.math.cumSum(arr, axis = 1)
[[1.0, 3.0, 6.0],
[4.0, 9.0, 15.0]]
val sqrMat = mk.ones<Double>(3, 3)
// Frobenius norm of the matrix (default when calling `norm()`)
mk.linalg.norm(sqrMat, norm = Norm.Fro)
3.0
mk.math.sin(arr)
[[0.8414709848078965, 0.9092974268256817, 0.1411200080598672],
[-0.7568024953079283, -0.9589242746631385, -0.27941549819892586]]
The linalg package also contains methods for solving linear matrix equations and for matrix decompositions.
// solve the linear system
val a = mk.d2array(3, 3) { it * it }.asType<Double>()
val b = mk.ndarray(mk[54.0, 396.0, 1062.0])
val res = mk.linalg.solve(a, b).map { it.roundToInt() }
println(" x = $res")
// check
(a dot res.asType<Double>()) == b
x = [0, 6, 12]
true
linalg also offers matrix inversion and QR decomposition:
val X = mk.rand<Double>(5, 5)
val mat = X.transpose().dot(X)
mk.linalg.inv(mat)
val (q, r) = mk.linalg.qr(mat)
q
[[-0.6434522522184294, 0.38641698097305016, 0.20534345094725628, 0.4802349982187798, -0.40479566397082917],
[-0.45474352025096354, -0.47859182494124414, 0.4129962773533749, -0.5871315544067359, -0.2210629064346973],
[-0.3292296576854662, -0.6546065678319505, -0.1733503466261196, 0.510129349187332, 0.41571117169724725],
[-0.4300144650309776, 0.1468132807302342, -0.8164473380454641, -0.35606292314874416, -0.012898039486490235],
[-0.29304082014153776, 0.414207537193111, 0.30105073254919396, -0.19400547972214482, 0.7837662432974303]]
r
[[-3.008139142773184, -2.475365375022795, -1.9932339679809647, -2.437935083612442, -1.253512427444805],
[0.0, -0.5818782649415262, -0.6984592874394522, -0.05435701240587266, 0.23201919273065758],
[0.0, 0.0, -0.10852908316489254, -0.7022880594833871, 0.0653178642796647],
[0.0, 0.0, 0.0, -0.2289228194117369, -0.016211823468891426],
[0.0, 0.0, 0.0, 0.0, 0.050274994433414676]]
Indexing and Slicing#
There are many ways to select a subset of the data of an NDArray.
For one-dimensional arrays it is straightforward:
val arr = mk.arange<Int>(10)
arr
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
// basic indexing
arr[2]
2
// range indexing
println(arr[1..3])
// the behavior is different using kotlin keyword `until`
println(arr[1 until 3])
[1, 2, 3]
[1, 2]
For multidimensional arrays the range operator can be used like this:
val a = mk.linspace<Int>(0, 20, 10).reshape(5, 2)
// single element selection
a[3, 1]
// row selection
a[0]
// range selection, end-inclusive
a[0..2] // != a[0 until 2], which excludes the end index
// selecting the first column of the first 3 rows
a[0..2, 0]
[0, 4, 8]
Warning
Slices are copies of the source array, unlike NumPy, where slicing generates a view on the original array.
var b = a[0..2, 0]
a[0..2, 0] === b // false!
NDArrays do not support boolean indexing, but filters can be applied with the filter() method, like with every Collection in Kotlin’s standard library.
For each “indexed functional method” that can be applied to Kotlin collections, like mapIndexed, forEachIndexed and filterIndexed, Multik offers a counterpart for multidimensional arrays (of the form *MultiIndexed()), where the index is an IntArray holding the indices of the current element.
A full list of those methods, and more, can be found in the ndarray.operations package.
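A short sketch of both approaches; note that filter() flattens its result into a one-dimensional array regardless of the input’s shape:

```kotlin
val m = mk.ndarray(mk[mk[1, 2, 3], mk[4, 5, 6]])

// filter() plays the role of NumPy's boolean-mask indexing,
// returning a flattened D1 array of the matching elements
m.filter { it > 3 }  // [4, 5, 6]

// mapMultiIndexed() passes the element's indices as an IntArray
// and preserves the matrix structure: here, zero out the first row
m.mapMultiIndexed { idx, e -> if (idx[0] == 0) 0 else e }
```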
inplace context#
Multik provides an inplace context for operating directly on an array, modifying its contents in place.
Inside it, we can open the default math context to apply in-place mathematical transformations to the elements of the array.
val a = mk.d1array(10) { it * 10 }.asType<Double>()
val b = mk.arange<Double>(10)
println("Before -> $a")
a.inplace {
math {
(this - b) * b
abs()
}
}
println("After -> $a")
Before -> [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
After -> [0.0, 9.0, 36.0, 81.0, 144.0, 225.0, 324.0, 441.0, 576.0, 729.0]
Note that this approach mutates the array even though it was declared with Kotlin’s val keyword: val only prevents reassignment of the reference, not mutation of the object it points to.
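This is the same behavior as any mutable Kotlin object held through a val reference, for example:

```kotlin
val xs = mutableListOf(1, 2, 3)
xs[0] = 99             // allowed: the list's contents are mutable
// xs = mutableListOf() // does not compile: the reference is fixed
println(xs)            // [99, 2, 3]
```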
Mathematical and Statistical Methods#
Both api.stat and api.math provide several methods for computing basic statistics, like mean, median, max, argMax, sum and cumSum. Some of them can be called either as instance methods or through the top-level Multik functions.
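For example, sum is available both ways (a small sketch; it assumes the extension functions are in scope, as they are after %use multik):

```kotlin
val v = mk.ndarray(mk[1, 2, 3, 4])
println(v.sum())         // instance-style extension
println(mk.math.sum(v))  // top-level function, same result
```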
If we want to compute the mean of a matrix, we must specify whether we want the mean of the whole matrix or a row/column-wise mean, using mean() and meanD2() respectively (for higher-dimensional arrays, the meanD* variants cover dimensions up to 4; beyond that, meanDN has to be used).
val arr = mk.rand<Double>(5, 4)
println("mean of the whole matrix -> ${mk.stat.mean(arr)}")
println("mean across columns -> ${mk.stat.meanD2(arr, 0)}")
println("mean across rows -> ${mk.stat.meanD2(arr, 1)}")
mean of the whole matrix -> 0.5788003662292656
mean across columns -> [0.6074798009411675, 0.49930172509369386, 0.5479675749694699, 0.6604523639127308]
mean across rows -> [0.4931613167485237, 0.36150513066399875, 0.7832244650967224, 0.698053726239088, 0.5580571923979946]
Example: Random Walks#
This example has been taken from the book Python for Data Analysis.
A simple pure Kotlin implementation can be summarized as follows:
import java.util.Random
var position = 0
val walk = mutableListOf(position)
val rand = Random(123456)
val steps = 1000
for (i in 0 until steps) {
val step = if (rand.nextBoolean()) 1 else -1
position += step
walk.add(position)
}
We can plot that walk
%use lets-plot
ggplot() { x=(0..100).toList() ; y= walk.slice(0..100) } +
geomLine()
We can get the same result by computing the cumulative sum of the random steps (note that a Kotlin List can be used for creating a one-dimensional vector).
val nsteps = 1000
val draws = mk.d1array(nsteps) { rand.nextInt(0, 2)}
val steps = mk.d1array(draws.size) { if (draws[it] > 0) 1 else -1 }
val walk = steps.cumSum()
walk.min()
-20
walk.max()
9
We can get the index at which the walk first crosses 10 steps above or below the origin, even if we have to go through a list, because Multik does not support Boolean arrays.
val first10 = abs(walk).toList().map { it >= 10 }.indexOfFirst { it }
first10
55
And we can plot the first 100 steps similarly to what we did before, highlighting the point we computed above:
ggplot() { x = mk.arange<Int>(100).toList() ; y = walk[0..99].toList() } +
geomLine() +
geomPoint(x = first10, y = walk[first10], color="red", size=5.0)
Multik’s ndarrays come in handy in case we want to simulate many random walks at once, representing them in a matrix where each row is a walk and each column of that row is a step:
val nwalks = 5000
val nsteps = 1000
val draws = mk.d2array(nwalks, nsteps) { rand.nextInt(0, 2)}
val steps = draws.map { if (it > 0) 1 else -1 }
val walks = mk.math.cumSum(steps, axis = 1)
walks.min()
-105
walks.max()
107
We can count the runs that reach 50 steps above the origin using mapMultiIndexed(), which preserves the matrix structure:
mk.math.maxD2(
walks.mapMultiIndexed { _, elem -> if (elem >= 50) 1 else 0 }, 1).sum()
604
Broadcasting#
Arrays with different shapes can be broadcast if the following condition is satisfied (from the NumPy documentation):
In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
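The rule can be sketched as a plain Kotlin check over the trailing axes of two shapes (a hypothetical helper for illustration, not part of Multik or KMath):

```kotlin
// Walk the shapes from the last axis backwards: each pair of sizes
// must be equal, or one of them must be 1
fun broadcastable(a: IntArray, b: IntArray): Boolean {
    val n = minOf(a.size, b.size)
    for (i in 1..n) {
        val da = a[a.size - i]
        val db = b[b.size - i]
        if (da != db && da != 1 && db != 1) return false
    }
    return true
}

println(broadcastable(intArrayOf(2, 4, 5), intArrayOf(4, 5))) // true
println(broadcastable(intArrayOf(2, 4, 5), intArrayOf(3, 5))) // false
```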
Multik does not have a direct way to broadcast two arrays with different shapes, but the KMath tensors module provides an implementation in the DoubleTensorAlgebra context.
The KMath modules needed for broadcasting are not available through the %use magic command, but we can tell Kotlin’s Jupyter kernel to import them. KMath also provides a set of APIs for transforming a Multik NDArray into a tensor, an object more familiar to KMath.
We use the @file:DependsOn annotation to import the core, multik and tensors modules of KMath into the project:
@file:DependsOn("space.kscience:kmath-core:0.3.1")
@file:DependsOn("space.kscience:kmath-tensors:0.3.1")
@file:DependsOn("space.kscience:kmath-multik:0.3.1")
To be able to use them, we import the needed packages:
import space.kscience.kmath.multik.MultikTensor
import space.kscience.kmath.tensors.core.DoubleTensorAlgebra
import space.kscience.kmath.tensors.core.tensorAlgebra
import space.kscience.kmath.tensors.core.withBroadcast
import space.kscience.kmath.tensors.core.DoubleTensor
Of course we could have used KMath tensors for all the computation, but KMath tensors are somewhat harder to manipulate than Multik’s NDArrays, so in this example we will see how to perform broadcasting starting from two NDArrays.
Thanks to Kotlin extension functions, we can map the + operator between two MultikTensor objects so that it performs broadcasting when needed.
fun NDArray<Double, DN>.asMultikTensor(): MultikTensor<Double> = MultikTensor(this)
fun brAdd(a: MultikTensor<Double>, b: MultikTensor<Double>): DoubleTensor? {
return Double.tensorAlgebra.withBroadcast {
a + b
}
}
operator fun MultikTensor<Double>.plus(other: MultikTensor<Double>): DoubleTensor? = brAdd(this, other)
With those functions, the following code produces the expected output.
val a: NDArray<Double, DN> = mk.arange<Double>(40).reshape(2, 4, 5).asDNArray()
val b: NDArray<Double, DN> = mk.arange<Double>(20).reshape(4, 5).asDNArray()
a.asMultikTensor() + b.asMultikTensor()
DoubleTensor(
[[[ 0.0 , 2.0 , 4.0 , 6.0 , 8.0 ],
  [ 1.0e+1 , 1.2e+1 , 1.4e+1 , 1.6e+1 , 1.8e+1 ],
  [ 2.0e+1 , 2.2e+1 , 2.4e+1 , 2.6e+1 , 2.8e+1 ],
  [ 3.0e+1 , 3.2e+1 , 3.4e+1 , 3.6e+1 , 3.8e+1 ]],
 [[ 2.0e+1 , 2.2e+1 , 2.4e+1 , 2.6e+1 , 2.8e+1 ],
  [ 3.0e+1 , 3.2e+1 , 3.4e+1 , 3.6e+1 , 3.8e+1 ],
  [ 4.0e+1 , 4.2e+1 , 4.4e+1 , 4.6e+1 , 4.8e+1 ],
  [ 5.0e+1 , 5.2e+1 , 5.4e+1 , 5.6e+1 , 5.8e+1 ]]]
)
There is no such thing as np.newaxis for specifying the missing dimension to broadcast along. We must reshape the arrays in order to broadcast properly.
Consider this example, where we compute the mean of every row and want to add the means back to the original matrix.
According to the broadcasting rules, if we want to broadcast across axes other than axis 0, the “broadcast dimensions” must be 1 in the smaller array. With b of shape (4, 5) and the mean vector of shape (4,), we must reshape the vector and insert a “1” in the broadcasting dimension, the second one.
val mean = mk.stat.meanD2(b.asD2Array(), 1)
.reshape(b.shape[0], 1)
.asDNArray().asMultikTensor()
b.asMultikTensor() + mean
DoubleTensor(
[[ 2.0 , 3.0 , 4.0 , 5.0 , 6.0 ],
[ 1.2e+1 , 1.3e+1 , 1.4e+1 , 1.5e+1 , 1.6e+1 ],
[ 2.2e+1 , 2.3e+1 , 2.4e+1 , 2.5e+1 , 2.6e+1 ],
[ 3.2e+1 , 3.3e+1 , 3.4e+1 , 3.5e+1 , 3.6e+1 ]]
)
Notes:
The withBroadcast context is marked as unstable and could change in the future.
For large arrays, the chain of conversions can be very slow in performance-critical code.
No direct way to convert a DoubleTensor back to a Multik NDArray was found: the only option is to manually copy the elements from the DoubleTensor into a new NDArray.
Conclusions#
This chapter presented Kotlin’s Multik library for working with N-dimensional arrays. Its set of data structures and methods is very close to NumPy’s, which makes it very easy to learn for a developer with basic knowledge of NumPy. However, NumPy has more than 20 years of development and fine tuning behind it, resulting in one of the best libraries for scientific computation: the better support for linear algebra operations, array broadcasting and the possibility to use multidimensional arrays with non-numeric data types make it a very flexible and powerful library. On the other hand, Multik can be combined with libraries like KMath and Apache Commons Math, or any other library developed for the JVM, making it very powerful when paired with the right set of components for the developer’s needs (e.g. array broadcasting and more advanced linear algebra can be accomplished with KMath). Note that Multik is very efficient for basic mathematical computation, and the KMath connectors for Multik can really help in achieving more demanding tasks (e.g. integration).
The Kotlin ecosystem for scientific computation is still young, but with the joint use of the right libraries, a lot can be achieved.