Update 2015-10-08: The optimization “hack” described in this post still works; however, we no longer use it in production. With a careful parallelism configuration, the overhead introduced by distributed models is negligible.
You probably already know how awesome Spark is for doing machine learning on Big Data. However, I’m pretty sure no one told you how bad (slow) it can be on Small Data.
As I mentioned in my previous post, we use Spark extensively for machine learning and audience modeling. It turned out that in some cases, for example when we start optimization for a new client or campaign, we simply don’t have enough positive examples to construct a dataset big enough for Spark to make sense.
Spark ML from 10000 feet
Essentially every machine learning algorithm is a function minimization, where the function value depends on some calculation over data in an RDD.
For example, logistic regression can evaluate the objective function 1000 times before it converges and finds the optimal parameters, which means it will compute some RDD 1000 times. In the case of LogisticRegression this is done with RDD.treeAggregate, which is super efficient, but it is still a distributed computation.
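To make the pattern concrete, here is a rough sketch of what one optimizer iteration does. This is not the actual MLlib code; the (label, features) layout and the per-example logistic loss and gradient are just illustrative:

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch of one optimizer iteration: aggregate the gradient and loss over the
// whole RDD with treeAggregate. `data` holds (label, features) pairs and
// `weights` are the current coefficients; both are illustrative, not MLlib internals.
def gradientStep(data: RDD[(Double, Vector)], weights: BDV[Double]): (BDV[Double], Double) =
  data.treeAggregate((BDV.zeros[Double](weights.length), 0.0))(
    seqOp = { case ((grad, loss), (label, features)) =>
      val x = BDV(features.toArray)
      val margin = weights.dot(x)
      val prob = 1.0 / (1.0 + math.exp(-margin))                    // sigmoid
      val exampleLoss = math.log1p(math.exp(-margin)) + (1.0 - label) * margin
      (grad + x * (prob - label), loss + exampleLoss)               // logistic gradient & loss
    },
    combOp = { case ((g1, l1), (g2, l2)) => (g1 + g2, l1 + l2) }
  )
```

Every call like this is a distributed job: the closure (including the current weights) is serialized and shipped to the executors, and the partial results are merged back, once per iteration.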
Now imagine that all the data you have is 50,000 rows spread across, say, 1000 partitions. That means each partition holds only 50 rows, and every RDD.treeAggregate call, on every iteration, serializes closures, sends them to the partitions, and collects the results back. It’s HUGE OVERHEAD and a huge load on the driver.
Throw Away Spark and use Python/R?
It’s definitely an option, but we don’t want to build separate systems for data of different sizes. Spark ML pipelines are an awesome abstraction, and we want to use them for all machine learning jobs. We also want to use the same algorithm, so that results stay consistent when a dataset crosses the boundary between small and big data.
Run LogisticRegression in ‘Local Mode’
What if Spark could run the same machine learning algorithm, but store the input data in plain Arrays instead of an RDD?
It solves all the problems: you get a consistent model, computed 10-20x faster because it doesn’t need distributed computation.
That’s exactly the approach I used in Spark Ext; it’s called LocalLogisticRegression. It’s mostly copy-pasta from Spark’s LogisticRegression, but when the input data frame has only a single partition, it runs the function optimization on one of the executors using the mapPartitions function, essentially using Spark as a distributed executor service.
This approach is much better than collecting the data to the driver, because you are not limited by the driver’s computational resources.
When the DataFrame has more than one partition, it simply falls back to the default distributed logistic regression.
The code for the new train method looks like this:
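(What follows is a minimal sketch of that dispatch rather than the verbatim Spark Ext code: the standalone method signature is assumed, and trainLocally stands for a hypothetical single-machine solver, e.g. an in-memory LBFGS over the rows of the only partition.)

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical single-machine solver: builds an in-memory dataset from the rows
// of the only partition and runs the optimizer there. Left unimplemented here.
def trainLocally(rows: Iterator[Row]): LogisticRegressionModel = ???

// Dispatch: single partition => run the whole optimization on one executor via
// mapPartitions; otherwise fall back to the regular distributed LogisticRegression.
def train(dataset: DataFrame, distributed: LogisticRegression): LogisticRegressionModel =
  if (dataset.rdd.partitions.length == 1) {
    dataset.rdd
      .mapPartitions(rows => Iterator(trainLocally(rows)))
      .collect()
      .head
  } else {
    distributed.fit(dataset)
  }
```

Because the model is fitted on an executor and only the resulting LogisticRegressionModel is shipped back, the driver never has to hold the training data or run the optimizer itself.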
If the input dataset has fewer than 100,000 rows, it is placed inside a single partition, and the regression model is trained in local mode.
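That sizing decision can live in the preprocessing step; one possible way to express it (assumed for illustration, not quoted from Spark Ext):

```scala
import org.apache.spark.sql.DataFrame

// If the dataset is small enough, squeeze it into one partition so that the
// local-mode branch above kicks in; otherwise leave the partitioning alone.
// The 100,000-row threshold comes from the text; tune it for your setup.
def maybeLocal(df: DataFrame, threshold: Long = 100000L): DataFrame =
  if (df.count() < threshold) df.coalesce(1) else df
```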
Results
With a little ingenuity (and copy-paste) Spark becomes a perfect tool for machine learning on both Small and Big Data. The most awesome thing is that this new LocalLogisticRegression can be used as a drop-in replacement in Spark ML pipelines, producing exactly the same LogisticRegressionModel at the end.
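For example, a pipeline can swap the estimator and nothing else. The column names and the trainingDf DataFrame below are assumed for illustration, and LocalLogisticRegression is assumed to expose the same setters as Spark’s LogisticRegression (its Spark Ext import is omitted):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// Assemble assumed raw columns into a feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "visits"))   // assumed feature columns
  .setOutputCol("features")

// Drop-in replacement: only the estimator class changes.
val lr = new LocalLogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setMaxIter(100)

val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model = pipeline.fit(trainingDf)      // yields the usual LogisticRegressionModel
```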
It might be an interesting idea to use this approach in Spark itself, since that would make it possible to implement without so much code duplication. I’d love to hear if anyone else has had the same problem, and how they solved it.
More cool Spark things on GitHub.