Parameter Server

Parameter Server

Scaling machine learning algorithms to large realistic problems is hard. The parameter server makes it easy. It can help you to run your jobs on:

  • Hundreds of terabytes of data (e.g. for sparse logistic regression)

  • Billions of documents (e.g. for Latent Dirichlet Allocation)

  • Thousands of machines (on a real production cluster)

  • Hundreds of billions of parameters (on real advertising data)

  • Unreliable underlying infrastructure (killed, tested, and self-repaired)

  • Over a billion keys/s throughput (on 15 machines)

  • With the consistency guarantees you need (eventual, sequential, bounded delay, or anything else)

  • An open source platform with BSD license

It works by separating the problems of processing data and the problem of communicating and synchronizing them between different machines. Processing works in any streaming or batch setting that a user might like. Synchronization is implemented by user-defined semantics for push and pull with regard to a common shared parameter storage. A sophisticated set of filters ensures efficient use of bandwidth and avoids the need for expensive message scheduling by preventing unnecessary messages in the first place.


Core Features

Chord-style Key Layout

Most distributed (key,value) stores for machine learninguse data layout similar to memcached, that is, load balancing over machines works by distributing keys at random between them. While this allows for almost perfect load balancing, it makes synchronization really difficult as it requires inverting a hash. As a result, range-based synchronization, efficient key aggregation into larger message blocks, and repair are nontrivial to accomplish.

Instead, we use Chord style key distribution as is also used in Amazon Dynamo. That is, keys and servers are inserted into a hash ring. Each server takes care of its adjacent segment. By laying out data in memory in the same fashion, it is very easy to synchronize entire ranges. Moreover, it makes things like vector clocks really easy since they are now applied to entire ranges. For some more context see this blog post.

Replication, Fault Tolerance and Repair

Machines fail. Data analysis needs to take this into account when deploying algorithms on an industrial scale. Our strategy for fault tolerance is to replicate keys on the ring. That is, adjacent servers on the ring hold replicas of the same parameters in memory. As a consequence when a server fails, failover is instantaneous since it only requires a status update rather than actual data transfer. This is crucial when building systems that are capable of front end serving.

Flexible Consistency

Many systems argue for their specific choice of consistency. For instance, it is easy to build bulk synchronous tools in Hadoop. Likewise, we proposed a fully asynchronous model in Yahoo LDA. Graphlab argues for consistency by clever execution of vertex updates. However, what happens if your algorithm doesn't fit a particular model? E.g. if you want stronger consistency initially but rela it later. Or consistency only between some of the operations? The parameter server can accommodate all of this, simply by defining a dependency graph that is processed in a distributed fashion.

High Performance

We are not aware of any other open source system that can handle half a petabyte of data and over one hundred billion parameters for logistic regression, 4 billion documents for LDA or can deliver over 1 billion keys per second throughput on only 15 machines.

Batteries Included

Right now just a sparse logistic regression implementation but much more coming in the next releases …


Amazon Baidu Google Intel Microsoft NSF CMU