The Tail at Scale
Paper Name: The Tail at Scale
Context for the picture: https://twitter.com/JeffDean/status/1613735033122050048?s=20
Link: https://research.google/pubs/pub40801/
History: Published in 2013 in Communications of the ACM by Jeff Dean and Luiz André Barroso at Google.
Jeff is known for his work on Spanner, MapReduce, Bigtable, etc. We will look at these systems in detail in the future.
Problem Statement: In a distributed setup built on commodity hardware, a single request can depend on 100+ components. Rare slowdowns and failures in individual components compound, so the latency and reliability of the overall system take a hit. The paper examines techniques for making commodity hardware work well at large scale despite this variability.
Summary: This paper is considered one of the most foundational papers in distributed systems. Google identified the tail-latency problem and developed multiple solutions that make it possible to run distributed software on hundreds of commodity machines.
The paper talks about techniques for handling variable latency in distributed systems, and about how latency can be inflated by uncommon faults. It discusses various within-request and cross-request ways to mitigate the high-latency problem, such as:
Hedged requests: Reduce latency variability by sending the same request to multiple replicas and using the response from whichever replica answers first. This technique is effective because tail latency is often caused by interference (shared resources, background activity) rather than by the specific request. By delaying the secondary request until the first request has been outstanding for longer than the 95th-percentile expected latency, the additional load is limited to around 5% while substantially shortening the latency tail. Hedged requests can also be tagged as lower priority than primary requests to further reduce overhead. They are a useful technique, but must be implemented carefully to avoid adding excessive load; a sketch of the idea follows.
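To make the mechanism concrete, here is a minimal asyncio sketch of a hedged request. The function call_replica, the 10 ms hedge delay, and the simulated latencies are all illustrative assumptions, not details taken from the paper:

```python
import asyncio, random

async def call_replica(replica: str, request: str) -> str:
    # Stand-in for a real RPC: most replies are fast, ~5% hit the tail.
    # Replace this with your actual client call.
    latency = random.choice([0.002] * 19 + [0.200])
    await asyncio.sleep(latency)
    return f"{replica} answered {request!r}"

async def hedged_request(replicas, request, hedge_delay=0.010):
    """Issue the request to replicas[0]; if no reply arrives within
    hedge_delay (e.g. the 95th-percentile expected latency), also issue
    it to replicas[1] and return whichever response comes back first."""
    primary = asyncio.create_task(call_replica(replicas[0], request))
    try:
        # shield() keeps the primary running if the timeout fires
        return await asyncio.wait_for(asyncio.shield(primary), hedge_delay)
    except asyncio.TimeoutError:
        secondary = asyncio.create_task(call_replica(replicas[1], request))
        done, pending = await asyncio.wait(
            {primary, secondary}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()  # best effort: drop the slower duplicate
        return next(iter(done)).result()

print(asyncio.run(hedged_request(["replica-a", "replica-b"], "get:key42")))
```

In a real system, the secondary call is where the lower-priority tag mentioned above would be attached.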
Tied requests: Instead of delaying before sending out a hedged request, enqueue the request simultaneously on multiple servers, and tie the copies together by telling each server who else has the request in its queue. When the first server starts processing the request, it tells the others to cancel their copies. Since the cancellation messages take time to traverse the network, there is a small window in which both servers may start the same work; the paper mitigates this by inserting a delay of roughly twice the average network message delay between sending the two copies.
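Below is a toy single-process sketch of tying, where a shared claimed flag stands in for the cross-server cancellation message; the server names, queues, and delays are all made up for illustration:

```python
import threading, queue, time

class TiedRequest:
    """One logical request enqueued on two servers; the first server to
    start it atomically claims it, which cancels the other copy."""
    def __init__(self, req_id):
        self.req_id, self.lock, self.claimed = req_id, threading.Lock(), False

    def try_claim(self) -> bool:
        with self.lock:
            if self.claimed:
                return False
            self.claimed = True  # doubles as the cancellation signal
            return True

def server_loop(name, q, results):
    req = q.get()
    if req.try_claim():       # the losing server sees claimed=True and skips
        time.sleep(0.005)     # pretend to do the work
        results[req.req_id] = f"served by {name}"

q_a, q_b, results = queue.Queue(), queue.Queue(), {}
req = TiedRequest("req-1")
q_a.put(req)
time.sleep(0.002)             # paper-style gap of ~2x the network delay
q_b.put(req)

threads = [threading.Thread(target=server_loop, args=(n, q, results))
           for n, q in (("A", q_a), ("B", q_b))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                # exactly one server served the request
```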
Micro-partitions: Create many more partitions than servers and assign several partitions to each machine, so that load imbalances can be corrected by moving individual small partitions rather than whole machine-sized chunks; a sketch follows.
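A small sketch of this idea, with made-up partition and server names; the point is that load moves between machines in fine, partition-sized increments:

```python
from collections import Counter

# Far more partitions than servers, assigned round-robin.
partitions = [f"part-{i}" for i in range(60)]
servers = ["s1", "s2", "s3"]
owner = {p: servers[i % len(servers)] for i, p in enumerate(partitions)}

# If s1 becomes slow or overloaded, shed load in small increments by
# reassigning a few of its partitions instead of draining the machine.
for p in [p for p, s in owner.items() if s == "s1"][:4]:
    owner[p] = "s2"

print(Counter(owner.values()))   # Counter({'s2': 24, 's3': 20, 's1': 16})
```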
Canary requests: Before fanning a request out to thousands of leaf servers, send it first to one or two servers as a probe; the full fan-out proceeds only if the canaries respond successfully. This protects newly added code or servers from a single bad request taking down every machine at once; a sketch follows.
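A rough sketch of canary probing, where send is a hypothetical RPC stub that reports whether the server handled the request without trouble:

```python
import random

def send(server: str, request: str) -> bool:
    # Hypothetical RPC stub: returns False if the server choked.
    return random.random() > 0.001

def fan_out(request, servers, num_canaries=2):
    """Probe a couple of canary servers first; fan out to the rest only
    if the canaries respond, so one bad request cannot crash every leaf."""
    for s in servers[:num_canaries]:
        if not send(s, request):
            raise RuntimeError(f"canary {s} failed; fan-out aborted")
    return [send(s, request) for s in servers[num_canaries:]]

leaves = [f"leaf-{i}" for i in range(100)]
print(sum(fan_out("new-query-type", leaves)), "leaves answered")
```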
The paper also points out that adding latency tolerance to systems that already have fault tolerance incurs minimal additional overhead, since the same replication machinery serves both purposes.
The paper explains in detail why latency is variable in large systems, covering shared resources, background daemons, maintenance activities, and other factors. This helps the reader understand the context of the problem better.
The paper discusses how building latency tolerance on top of existing fault-tolerance machinery reduces overhead. This is good design, as it employs separation of concerns. The idea of canary requests, where a small number of requests probe newly added code or servers, is particularly interesting: it introduces more intelligence into the system and helps with both latency tolerance and fault tolerance.
Negative Points (With Suggestions):
The paper does not say much about how latency can be improved for updates (writes) in a large system. It mentions quorum-based algorithms but not in detail; more clarification would have helped in understanding whether the techniques in the paper apply to update-heavy workloads as well.
The paper does not discuss how fluctuations in the network affect latency, or whether the algorithms are resilient to them. Experiments under faulty network conditions would have helped uncover this.
There is no discussion of the overhead, in request size and number of requests, introduced by techniques like hedged and tied requests. Quantifying this overhead would help a user decide which technique best fits their workload.
Thanks for reading :)