Data-centric Metaprogramming

39:13 341 views 100% Published 5 years ago

This video was recorded at Scala Days Berlin 2016
Follow us on Twitter @ScalaDays or visit our website for more information

We can compose data structures like LEGO bricks: a relational employee table can be modelled as a `Vector[Employee]`, where we use the standard `Vector` collection. Yet, few programmers know just how inefficient this is: iterating requires dereferencing a pointer for each employee and a good part of the memory is occupied by redundant bookkeeping information.

Data-centric metaprogramming is a technique that allows developers to tweak how their data structures are stored in memory, thus improving performance. For example, we can use `Vector[Employee]` throughout the program, despite its inefficiency. Then, when performance starts to matter, we simply instruct the compiler how to store the `Vector[Employee]` more efficiently, using separate arrays (or vectors) for each component. In turn, the compiler uses this information to optimize our code, automatically switching to the improved memory layout for the `Vector[Employee]`. This makes premature optimization redundant: we write the code using our favorite abstractions and, only when necessary, we tune them after the fact.

There are many usecases for data-centric metaprogramming. For example, applied to Spark, it can produce 40% speedups. The Scala compiler plugin that enables data-centric metaprogramming is developed at and is documented on

This way we avoid the misinterpretation that the newly introduced whole-wold Dataset optimisation (URL: relies on the data-centric metaprogramming approach, which is not the case.

Watch on YouTube