- Xiaohu Li: Manager, Backend Engineering
- Frank Ren: Director, Backend Engineering
- Devin Thomson: Lead, Backend Engineer
- Daniel Geng: Backend Engineer
We covered the sharding mechanism in our previous post, which laid the theoretical foundation of geosharded clusters. In part 2 we are going to explain how we built a high-performing, scalable and balanced infrastructure to support our business needs, as well as discuss some interesting engineering challenges we have overcome, and the considerations behind them.
After we finalized the sharding algorithm and implemented the abstraction layer as an internal microservice, we had the high level cluster architecture seen below:
Once a user’s location changes, it goes through the user’s profile, hits location services to figure out whether the location change resulted in a shard move, and then it hits the data node with specific route info to perform the shard move.
When users fetch recommendations, our recommendation engine uses the logic layer to figure out how many shards to query (based on user’s current location and distance filter) and then queries indices with that info. The results are collected and aggregated across these shards and sent back to clients.
At this step we encountered following open questions:
- What would be the best engineering implementation of a “shard” in our index?
- How do we decide the optimal setup for a geosharding cluster?
- How do we balance the load?
Multi-index v.s. Multi-cluster
The word “shard” we have been talking about up to this point is only a logical term representing an isolation of index based on the user’s geolocation, but we haven’t decided what the engineering implementation of it would be. In the world of Elasticsearch, we have one index for each shard, but all indices belong to one cluster, or, we could have multiple clusters, each one representing a shard. Here are the pros and cons for each scenario:
Pros: Elasticsearch supports multi-index search out of box, easier to maintain from operational point of view
Cons: Harder to scale up/down a shard once it becomes hot, might impact another shard
Pros: A lower level of separation, easier to scale up/down per cluster
Cons: There is no multi-cluster query supported yet; we have to explicitly fire them in parallel, same for shard move situation
We made a careful decision to go with multi-index scenario because it significantly simplified the logic on our server side, as well as operational costs. From there, we developed a a load balancing strategy to overcome the cons of this scenario. And we can always scale up a shard by adding more replicas to it.
Finding the Right Size
The next engineering challenge is to pick the right cluster size, especially when there are so many variables we can tune. Some questions about variables include:
- What is the best size for each shard?
- How many hosts do we need in total?
- How much computation power do we need for each host?
- How many replicas do we need for each shard?
- What are the optimal Elasticsearch/jvm configurations for our setup?
To find out answers to the questions above, extensive performance testing is needed. We used JMeter to set up our performance test framework, which contains a coordinating node, and a data node, and a JMeter remote server node. The JMeter remote server node is responsible for firing out requests to the coordinating node, which essentially gets forwarded the traffic to the data node, and then collects the results from it and returns it back to the remote server node for benchmarking.
Most of the time, the variables we consider are not isolated from each other. For example, instance type, shard size and replica count impact each other: the larger shard size, the fewershards you can fit within the same instance type, resulting in a the lower replica count you can have within the same host count. Instead of testing all the possible combinations of these dimensions, we decided to determine them one by one, starting from shard size, in terms of both number of users and geological areas.
In order to get a sense of that, we aggregate our user locations and map it to find out the most dense area of Tinder population. From there, we tentatively put the whole area into one shard, so users within that dense area will have less of a chance to perform cross shard query and shard move. We then prepared the test data by seeding user documents into the data node, the amount of which depends on the projected user count per shard. We also choose a typical shard area size when seeding the geoposition of each user document so they are randomly - but uniformly - distributed within a square area. We also take the actual distribution of our user’s age and age filter and apply randomness during feeding, trying our best to simulate the traffic pattern and hit size for each query.
Starting from a small volume of traffic and ramping up to the point where either the query node or data node have maxed out their CPU or memory, we performance-test different scenarios (changing shard size, replica count, AWS instance type, jvm configurations, etc.) and identify the optimal setup for our geosharded cluster in the production environment.
Dealing with Time Zones
Users within the same geoshard are usually in the same or adjacent time zones because they are geographically adjacent. This nature implicates unbalanced traffic because the peak and off-peak hours will differ due to time zones, and also different use patterns for users per geolocation. This graph shows the hypothetical traffic pattern of two geoshards in a 24-hour time span:
As you can see, their peak traffic is roughly the same, but they happen at a different time of day. If we only allocate one shard in one physical box, the load will be unbalanced across different nodes. The biggest difference in terms of requests per second can be more than 10x at any given time between different shards.
Our goal is to build a geosharded cluster that can handle spiking load per shard and make it as balanced as possible so that our global user base can experience similar latency, regardless of which geoshard they are in, and how many geoshards they are potentially hitting. To achieve this goal, one solution is to manually allocate multiple shards into the same nodes to make sure their load is similar to each other at any given time of day. We believe this can produce the most balanced load in theory; however, it requires a lot of manual intervention to maintain the dynamic cluster and it is an NP hard problem to solve. Additionally, when we need to reshard due to user base growth, we need to recalculate this problem again. That’s why we eventually decided to leverage the power of randomness. Here is how we did it:
- Using a replica count, assume at any given point the max traffic per shard per replica is X rps,
- Categorize all shards into 5 different traffic groups by their actual traffic. For example, all shards with traffic from 0~0.2X rps go to Group 1, 0.2X ~ 0.4X rps go to Group 2, etc.
- Calculate the possibility of the host with the largest load taking more than one shard from the high traffic group. In this case, it would be the group with 0.8X ~ 1.0X load).
By modeling like this, we simplified an NP hard problem into a typical statistics problem: to put total N balls (N = number of shards * number of replica per shard) of five different colors into M drawers (M = number of physical hosts), what is the rate of having two balls of the same color in any given drawer. We iterated with a different value of N by adjusting the number of replicas per shard and calculating the possibilities. Eventually, we established the sweet spot for the number of replicas. This is essentially a trade-off: the more replicas we have, the more balanced it will be; but write capacity and memory will suffer per host basis.
The Final Stage
After all these tweaks and testing, here is how our geosharded cluster architecture looks:
Under three master nodes, we have two ASG, one contains only coordinating nodes (this is the one where we send all the requests) another ASG contains all data nodes. Each data node keeps randomly distributed four indices, some of them are primary of a certain shard while others are replicas of some other shards. Depending on how many geoshards a query is hitting, the coordinating node will distribute requests to corresponding data nodes that contains the target geoshards. In this way we are able to host all geosharded indices in one cluster and balance the load.
As of now, we have a large-scale, balanced index cluster. According to our measurement in production, geosharded indices handles 20 times more computations compared with to previous setup.
- Performance testing is a critical element for rolling out new infrastructure
- Consider leveraging randomness to solve hard problems in engineering practice
Are you interested in building solutions to complex problems like these on a global scale? We are always looking for top talent and inventive problem solvers to join our engineering team. Check out our openings here!