Tuesday, November 30, 2010

Troubles with Sharding

Troubles with Sharding - What can we learn from the Foursquare Incident? (Page last updated October 2010, Added 2010-10-28, Author Todd Hoff, Publisher highscalability.com). Tips:

  • [Although the article is about shards, the tuning suggestions are generic and really apply to many systems]
  • Use more powerful servers - scaling up is often the best solution.
  • Spread your load over more cores and servers.
  • Design (or enable) components to be movable (ideally dynamically), so that each can be moved separately to a system with lower load.
  • Can your data be moved to another node fast enough to fix overload problems?
  • Monitor request queues, memory, and fragmentation, and set limits that trigger a contingency plan when they are exceeded (see the overload-check sketch after this list).
  • Prioritize requests so that management traffic can still be received by nodes even when the system is thrashing. Requests should be droppable, prioritizable, load balanceable, etc., rather than being allowed to take down an entire system (see the prioritized queue sketch after this list).
  • Have the ability to turn off parts of the system so that load can be reduced enough for the system to recover.
  • Consider whether a read-only or otherwise restricted backup version of the system could keep limited operations going after a catastrophic failure while repairs are under way.
  • Use an even distribution mechanism to spread load better (one such mechanism, consistent hashing, is sketched after this list).
  • Replicate the data and load balance requests across replicas.
  • Monitor resource usage so you can take preventative action. Look for spikes in disk operations per second, increased request queue sizes and request latency.
  • Build in automated elasticity, sharding, and failure handling.
  • Enable background reindexing and defragmentation.
  • Capacity plan continually so that resources can be adjusted to fit projected needs.
  • Test your system under realistic high load and overload conditions. Integrate testing into your build system.
  • Use incremental algorithms that just need an event and a little state to calculate the next state, e.g. a running average algorithm rather than one that requires all values for every calculation (see the sketch after this list). This reduces data flow requirements as data can be discarded from caches more quickly.
  • Separate historical and real-time data.
  • Use more compact data structures.
  • If you expect that not all of your data will fit in RAM, make sure your I/O system isn't the bottleneck.
  • Single points of failure are not the end of the world, but you need to know about them and factor them into your contingency planning.
  • Figure out how you will handle downtime. Let people know what's happening and they will still love you.
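
[A minimal Java sketch of the "limits that trigger a contingency plan" idea from the monitoring tip above. The thresholds, the queue being watched and the contingency action are assumptions made here for illustration, not details from the article.]

public class OverloadMonitor {
    private static final int MAX_QUEUE_DEPTH = 5000;          // assumed limit
    private static final double MAX_HEAP_USED_FRACTION = 0.9; // assumed limit

    private final java.util.Queue<?> requestQueue; // hypothetical request queue to watch

    public OverloadMonitor(java.util.Queue<?> requestQueue) {
        this.requestQueue = requestQueue;
    }

    // Called periodically, e.g. from a ScheduledExecutorService.
    public void check() {
        Runtime rt = Runtime.getRuntime();
        double heapUsed = (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
        if (requestQueue.size() > MAX_QUEUE_DEPTH || heapUsed > MAX_HEAP_USED_FRACTION) {
            triggerContingencyPlan();
        }
    }

    private void triggerContingencyPlan() {
        // Placeholder action: shed low-priority load, alert operators, start moving data off hot nodes.
        System.err.println("Overload limits exceeded - contingency plan triggered");
    }
}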
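
[A sketch of the "prioritize requests" tip using the JDK's PriorityBlockingQueue: management traffic is always accepted and served first, while other traffic is dropped once an assumed size limit is exceeded rather than being allowed to grow without bound. The priority levels and the limit are illustrative assumptions.]

import java.util.concurrent.PriorityBlockingQueue;

public class PrioritizedRequestQueue {

    // Lower ordinal = more urgent, so management traffic is served first.
    enum Priority { MANAGEMENT, NORMAL, BULK }

    static class Request implements Comparable<Request> {
        final Priority priority;
        final Runnable work;
        Request(Priority priority, Runnable work) {
            this.priority = priority;
            this.work = work;
        }
        public int compareTo(Request other) {
            return priority.compareTo(other.priority);
        }
    }

    private static final int MAX_QUEUED = 10000; // assumed overload limit
    private final PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<Request>();

    // Returns false (request dropped) for non-management traffic when the node is overloaded.
    public boolean submit(Request request) {
        if (request.priority != Priority.MANAGEMENT && queue.size() >= MAX_QUEUED) {
            return false;
        }
        return queue.offer(request);
    }

    // Worker threads take the most urgent outstanding request.
    public Request next() throws InterruptedException {
        return queue.take();
    }
}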
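
[One common "even distribution mechanism" is consistent hashing with virtual nodes, sketched below. The article does not prescribe a particular scheme; the hash function and the virtual-node idea here are illustrative choices. Each physical node appears on the ring many times so keys spread evenly, and adding or removing a node only remaps a small fraction of the keys.]

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // A key maps to the first node clockwise from its hash position on the ring.
    public String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        Integer position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(position);
    }

    private int hash(String s) {
        // String.hashCode() is a weak spreader; a stronger hash (e.g. MD5 or Murmur) would be used in practice.
        int h = s.hashCode();
        h ^= (h >>> 16) ^ (h >>> 8);
        return h & 0x7fffffff; // keep positions non-negative
    }
}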
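
[A minimal sketch of the incremental "running average" example: only the count and the current average are kept, so individual values can be discarded as soon as they have been folded in.]

public class RunningAverage {
    private long count = 0;
    private double average = 0.0;

    // Fold in the next value without storing the full history:
    // newAverage = oldAverage + (value - oldAverage) / newCount
    public void add(double value) {
        count++;
        average += (value - average) / count;
    }

    public double get() {
        return average;
    }
}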
