Architecture Review: Adding Scalability and Availability to an Existing Service


 I recently received a task at work. We have a microservice that acts as a push message broker: it receives messages from service servers and forwards them to GCM, APNS, and other push gateways. The service handles requests well and everything is fine.

 However, scalability and availability were not considered when the service was first implemented. Since I now use Auto Scaling Groups (ASG) in our AWS environment, I need to think about adding these features to the service.

 After the review below, I deferred using an ASG for this service. Let me explain why.


Current Architecture


 As a push message broker, the service consists of two modules: a producer and a consumer. The service also has to store push messages somewhere; currently, Cassandra serves as the message storage. So basically, producers put push messages into Cassandra, and consumers take them out.

 To deal with the massive traffic from service servers, I run the service in three AWS regions, and each region consists of multiple instances. Push messages are also sharded into keys from 0 to 5000, which serve as the row key for Cassandra partitioning.

 For example, if I have 10 instances in the EU region, each instance is responsible for a specific shard range: 0 to 499 for instance 0, 500 to 999 for instance 1, and so forth.
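The fixed shard-to-instance assignment can be sketched in a few lines of Python. This is an illustrative model only; the shard count and function names are assumptions, not the service's actual code.

```python
# Hypothetical sketch of the fixed shard-to-instance mapping described above.
TOTAL_SHARDS = 5000  # push messages are sharded into keys 0..4999

def shards_for_instance(instance_index: int, instance_count: int) -> range:
    """Return the contiguous shard range a given instance is responsible for."""
    per_instance = TOTAL_SHARDS // instance_count
    start = instance_index * per_instance
    # The last instance absorbs any remainder so every shard is covered.
    end = TOTAL_SHARDS if instance_index == instance_count - 1 else start + per_instance
    return range(start, end)

# With 10 instances, instance 0 owns the first 500 shards, instance 1 the next 500, etc.
print(shards_for_instance(0, 10))  # range(0, 500)
print(shards_for_instance(1, 10))  # range(500, 1000)
```

Because this mapping is fixed at deployment time, adding or removing an instance (as an ASG would do) forces every instance's shard range to change, which is exactly why the scheme resists auto-scaling.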

 As we all know, using Cassandra as a queue is an anti-pattern. A fixed shard key also makes it difficult to adopt an AWS ASG strategy. But there were two reasons I adopted Cassandra. First, I need to relay push messages with a delay, and the delay time varies from service to service. Second, while a push message is sitting in Cassandra, any identical push message that arrives should be ignored; in other words, identical push messages are deduplicated.

 To achieve these two goals, I had to rely on Cassandra, because I could fulfill both requirements by adding an extra timestamp column for the delay and an extra table for deduplication.
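The two Cassandra tricks can be modeled in a minimal in-memory sketch: a timestamp column for delayed delivery and a separate table for deduplication. The class and field names here are assumptions for illustration, not the real schema.

```python
import time

# In-memory model of the two Cassandra tricks described above:
# a timestamp column for the delay, and an extra table for dedup.
class MessageStore:
    def __init__(self):
        self.messages = []       # models the message table: (msg_id, due_at, payload)
        self.dedup_keys = set()  # models the extra dedup table keyed by message id

    def put(self, msg_id, payload, delay_sec=0, now=None):
        now = time.time() if now is None else now
        if msg_id in self.dedup_keys:
            return False         # identical message already stored: ignore it
        self.dedup_keys.add(msg_id)
        self.messages.append((msg_id, now + delay_sec, payload))
        return True

    def take_due(self, now=None):
        """Take out messages whose delay has elapsed, analogous to a
        WHERE clause limiting the timestamp range in Cassandra."""
        now = time.time() if now is None else now
        due = [m for m in self.messages if m[1] <= now]
        self.messages = [m for m in self.messages if m[1] > now]
        return [(m[0], m[2]) for m in due]
```

The key point is that both the per-message delay and the dedup check fall out naturally from this storage shape, which is what the alternatives below struggle to reproduce.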


Reviewing several alternative architectures

1. Kafka with Redis

 Many companies adopt Kafka as a message streaming solution, so Kafka was the first option that came to mind when I got down to the task. Kafka is famous for high throughput, scalability, and reliability based on its log-append mechanism.

 How can I deduplicate messages after the first one has been produced to Kafka? I searched a lot, but Kafka has no built-in function to check for an existing message. There is some controversy about Kafka's exactly-once delivery, but that concerns consuming a message exactly once rather than deduplicating on the producer side. This leads to using Redis. I found an example on the Tapjoy blog: they use Kafka with Redis for deduplication.
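The producer-side dedup idea can be sketched as a Redis check before producing. This is a hedged sketch, not the Tapjoy implementation: `redis_client` and `producer` are assumed objects, where any Redis client with `set(..., nx=True, ex=...)` and any Kafka producer with `send()` would fit.

```python
# Producer-side dedup sketch: claim the message id in Redis before producing
# to Kafka. SET with NX and EX is atomic in Redis, so two concurrent
# producers cannot both claim the same message id.
def produce_once(redis_client, producer, topic, msg_id, payload, ttl_sec=3600):
    """Produce `payload` to Kafka only if `msg_id` has not been seen recently."""
    claimed = redis_client.set(f"dedup:{msg_id}", "1", nx=True, ex=ttl_sec)
    if not claimed:
        return False  # duplicate within the TTL window: drop it
    producer.send(topic, payload)
    return True
```

Note that the dedup window here is bounded by the Redis TTL, whereas the Cassandra table dedups for as long as the message row exists.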

 Then, how can I guarantee the delay of a message? Right now, I fetch push messages from Cassandra using a WHERE condition that limits the timestamp range. Kafka is a log-append based message streaming solution. How can I expect a push message with a 10-second delay property to be delivered exactly ten seconds after it is produced to Kafka?

 One possible option I found while googling was to manage several Kafka topics; Uber described a reliable reprocessing queue built this way. For example, topic_one would be a queue for push messages with a one-second delay and topic_ten for ten-second messages. But this means I would have to manage a topic for every distinct delay parameter. To keep it simple, I would have to change the API spec and let service developers choose from a fixed set of delay values. However, even then, Kafka does not hold messages back by a delay option. A consumer thread has to check the delay parameter of each push message and decide whether it should wait. In this setup, there is no guarantee that a push message with a ten-second delay will be consumed exactly ten seconds later.
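The consumer-side delay check described above can be sketched as follows. The message field names are assumptions for illustration; the point is that Kafka delivers immediately, so the consumer itself must sleep until the message is due.

```python
import time

def seconds_until_due(produced_at: float, delay_sec: float, now: float) -> float:
    """How long the consumer must still wait before handling the message.
    Returns 0 if the message is already due."""
    return max(0.0, produced_at + delay_sec - now)

# Sketch of a per-message handler in the consumer thread. Sleeping here
# blocks the partition, so later messages queue up behind the delayed one —
# which is why delivery "exactly N seconds later" cannot be guaranteed.
def handle(message, send_push, now=None):
    now = time.time() if now is None else now
    wait = seconds_until_due(message["produced_at"], message["delay_sec"], now)
    if wait > 0:
        time.sleep(wait)
    send_push(message["payload"])
```

This blocking behavior is precisely why the per-delay-topic scheme exists: messages within one topic share the same delay, so head-of-line blocking does not reorder them.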

 Kafka with Redis looks like a great combination for streaming with deduplication. Nonetheless, it does not meet the requirements.

2. AWS SQS with Redis

 Even though the service uses Cassandra as a queue, it is not operated as a traditional queue. That is why I had never used SQS, although I depend heavily on AWS. But I gave it a try and found that SQS supports a delay feature! So I examined SQS closely.

 Standard queues support nearly unlimited TPS. They also support a delay option on a per-message basis, and the maximum number of inflight messages is 120,000. However, one problem is that it is possible to receive an already deleted message again; the FAQ page says this happens under rare circumstances. It means I would have to run two Redis instances for caching: one for deduplication on enqueue and the other for deduplication on dequeue.
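The per-message delay on a standard queue maps to the `DelaySeconds` parameter of SQS's `SendMessage` call (0 to 900 seconds). A minimal sketch, where the queue URL and message body are placeholders:

```python
# Sketch of enqueueing with a per-message delay on an SQS standard queue.
# DelaySeconds is a real SQS parameter (0-900 seconds); the queue URL and
# message body used in the usage example are illustrative assumptions.
def build_send_params(queue_url: str, body: str, delay_sec: int) -> dict:
    """Build the kwargs for boto3's sqs.send_message call."""
    if not 0 <= delay_sec <= 900:
        raise ValueError("SQS DelaySeconds must be between 0 and 900")
    return {"QueueUrl": queue_url, "MessageBody": body, "DelaySeconds": delay_sec}

# Usage (requires AWS credentials):
#   import boto3
#   sqs = boto3.client("sqs")
#   sqs.send_message(**build_send_params(queue_url, '{"payload": "hi"}', 10))
```

The 900-second cap is worth noting: any service-level delay longer than 15 minutes would need extra handling on top of SQS.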

 A FIFO queue provides exactly-once delivery, but its delay option is set per queue, not per message. Its TPS is also quite low: a FIFO queue supports 300 TPS, including enqueuing, dequeuing, and deleting messages. Another limitation is that the maximum number of inflight messages is 20,000.

 Above all, the maximum number of messages I can receive in one SDK request is 10, and each request is an HTTP call to the SQS endpoint. I have not tested it, but this could have an impact on performance.
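The 10-message cap means a consumer must drain the queue one HTTP round trip per batch, which is the performance concern above. A sketch, where `sqs` is assumed to be a boto3 SQS client and `handle` the push-sending callback:

```python
# Drain an SQS queue 10 messages at a time: ReceiveMessage caps
# MaxNumberOfMessages at 10, so each batch costs one HTTP round trip.
def drain(sqs, queue_url, handle):
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # hard SQS limit per request
            WaitTimeSeconds=20,      # long polling to cut empty round trips
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            handle(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Long polling helps with empty receives, but at high throughput the per-batch HTTP latency still bounds how fast a single consumer can drain the queue.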


Conclusion


 After reviewing the alternative architectures above, I decided to defer adding scalability and availability to the current architecture. Kafka clearly does not fit my service, as described. AWS SQS with two Redis instances would be a workable way to add scalability and availability, but a heavier dependency on AWS is not a good strategy, and I cannot ensure the throughput if I move to that architecture. If I were starting from scratch, I might adopt SQS. But the current architecture is also solid, and the data is safe in Cassandra. All of this was considered for the sake of automation: auto-scalability and auto-reliability by removing the sharding dependency. Until a suitable solution comes along, I will keep the current architecture. If you have any ideas, please let me know. :)
