Modern microservices increasingly face challenging scenarios, where service resiliency is not a nice-to-have: service downtime is unacceptable and must be avoided.
Some of these challenges are shared across microservices, especially when they use the same infrastructure and technical stack.
We are a leading eCommerce marketing platform (PaaS), and each of our products is consumed as a SaaS product. We tend to build microservices at the speed of light with PLG in mind. The intention is to make each service reliable, with zero manual intervention during the service lifecycle.
Product-led growth (PLG) is an end-user-focused growth model that relies on the product itself as the primary driver of customer acquisition, conversion, and expansion.
In this post, I’ll be covering one of our hottest domain microservices, where we’ve tackled multiple challenges with industry standards and very creative solutions.
Before we dive into the challenges we faced building the email service, let’s look at an overview of the system, and the lifecycle of an event in that system.
The email service is the backbone of the marketing automation domain, so it must be highly available, able to handle high load with no manual intervention, and provide clarity about the system state.
For a service capable of handling millions of email requests per hour, we wanted to build an idempotent, resilient service with zero intervention. Idempotency, in the software sense, means that no matter how many times we receive the same email message for delivery, the end result is the same: the message ends up in one of the following final states:
Delivered, the email was sent successfully
Failed, e.g. the email template is faulty
Skipped, due to smart sending business requirements
Now let’s discover which components made our service resilient…
The components that played a role in making that happen:
Database entity state
Auto-retryable HTTP client in place (WebClient/RestTemplate)
In-memory exponential backoff retry mechanism
Kafka retry service
Database entity state
Each email request that enters the system is handled according to a state!
The system handles each message like a state machine: the remaining processing actions run from its current state until the message reaches its final state.
So, when implementing such a service, we should bear in mind use cases where the DB entry might already exist and hold some initial state.
You can think of it as an ActionState.
A transitionable state that executes one or more actions when entered. Once the actions have executed, this state responds to their results to decide which state to transition to next.
Below is the simplified flow of handling an email request in which the actions are determined based on the state.
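To make the idea concrete, here is a plain-Java sketch of such an ActionState-driven flow. Only the final states (Delivered / Failed / Skipped) come from the service described above; the intermediate states and the transition rules are illustrative assumptions, not our actual implementation.

```java
// Hypothetical sketch of state-driven email handling. The final states
// (DELIVERED, FAILED, SKIPPED) are from the post; the intermediate
// states and transitions are illustrative assumptions.
enum EmailState {
    RECEIVED, SENDING, DELIVERED, FAILED, SKIPPED;

    boolean isFinal() {
        return this == DELIVERED || this == FAILED || this == SKIPPED;
    }
}

class EmailStateMachine {
    /** Decide the next state from the current one, like an ActionState. */
    static EmailState next(EmailState current, boolean actionSucceeded, boolean smartSendingSkip) {
        switch (current) {
            case RECEIVED:
                if (smartSendingSkip) return EmailState.SKIPPED; // business rule
                return actionSucceeded ? EmailState.SENDING : EmailState.FAILED;
            case SENDING:
                return actionSucceeded ? EmailState.DELIVERED : EmailState.FAILED;
            default:
                return current; // final states never transition
        }
    }
}
```

Because a final state never transitions, re-running the flow on a message that was already handled is a no-op, which is exactly the idempotency property described earlier.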
Auto retryable HTTP client
To complete the email request, the email service is required to reach other services using an HTTP client. For example, getting the email profile — the persona, store settings, etc.
Since our microservices are Spring Boot services, we have implemented a common library in which an automatically retrying WebClient/RestTemplate bean is autowired with a quick configuration.
These beans are fully autonomous and retry on all status codes provided in the configuration file, in addition to network exceptions.
The most awesome thing about it is that it’s transparent to the user: the request is already retried with a backoff retry policy. This mechanism is very handy when failures are caused by a temporarily unavailable service, a restart, etc.
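As an illustration, a minimal version of such an auto-retrying bean could look like the following sketch, assuming Spring WebFlux and Reactor. The bean name, the set of retryable status codes, and the backoff values are assumptions here; the real common library reads them from the configuration file.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.Set;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpStatus;
import org.springframework.web.reactive.function.client.ClientResponse;
import org.springframework.web.reactive.function.client.WebClient;
import org.springframework.web.reactive.function.client.WebClientResponseException;

import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

// Sketch only: an auto-retrying WebClient bean with exponential backoff.
@Configuration
public class RetryingWebClientConfig {

    // Status codes worth retrying (assumed; the library reads them from config)
    private static final Set<HttpStatus> RETRYABLE = Set.of(
            HttpStatus.TOO_MANY_REQUESTS, HttpStatus.BAD_GATEWAY, HttpStatus.SERVICE_UNAVAILABLE);

    @Bean
    public WebClient retryingWebClient(WebClient.Builder builder) {
        return builder
                .filter((request, next) -> next.exchange(request)
                        // Surface retryable statuses as errors so retryWhen can see them
                        .flatMap(response -> {
                            if (RETRYABLE.contains(response.statusCode())) {
                                return response.createException()
                                        .flatMap(ex -> Mono.<ClientResponse>error(ex));
                            }
                            return Mono.just(response);
                        })
                        // Up to 3 retries with exponential backoff, starting at 250 ms
                        .retryWhen(Retry.backoff(3, Duration.ofMillis(250))
                                .filter(ex -> ex instanceof WebClientResponseException
                                        || ex instanceof IOException)))
                .build();
    }
}
```

Any service that injects this bean gets the retry behavior for free, which is what makes the mechanism transparent to callers.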
In-memory exponential backoff
For some use cases, where no HTTP call is required at all, or where the HTTP client’s retry is not enough because the operation is meant to be quick and finish within a few hundred milliseconds, the in-memory retry mechanism kicks in.
The in-memory retry should also be transparent to the user, with minimal impact on the application.
We use the Spring Retry annotation and RetryTemplate for that.
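Under the hood, the behavior we get from Spring Retry is conceptually the following loop (a plain-Java sketch with illustrative names, not the actual RetryTemplate internals; delays are recorded instead of slept so the sketch stays fast):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Conceptual sketch of what Spring Retry's RetryTemplate does for us:
// retry a task with exponentially growing delays until it succeeds or
// the attempts are exhausted. Delays are collected instead of slept so
// the example is easy to test; all names are illustrative.
class ExponentialBackoffRetry {
    final List<Long> observedDelays = new ArrayList<>();

    <T> T execute(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    observedDelays.add(delay); // real code would Thread.sleep(delay)
                    delay *= 2;                // exponential backoff
                }
            }
        }
        throw last; // retries exhausted; the Kafka retry service takes over
    }
}
```

With Spring Retry, the equivalent is a `@Retryable` annotation or a configured `RetryTemplate`, so application code stays free of this boilerplate.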
So what happens when the retry is exhausted? I will cover it in the next section…
Kafka retry service
As with any modern microservice that uses a messaging queue, the message delivery terms exactly once and at least once should have come up during the service lifecycle.
Exactly once means, as the name suggests, that the message is delivered precisely once; at least once means the message could be delivered more than once.
The message duplication might happen for several reasons, such as:
it was written twice to the same topic, or
it was consumed twice by the same consumer group, e.g. when the message was not ack’d (offset not committed), or because of a consumer group rebalance
To properly handle duplication, and to prevent starving the other messages in the queue, a message can’t be retried endlessly. So after the HTTP client retries and the backoff retries are exhausted, the message is delivered to a retry topic.
From there, the same message will be consumed again by the service. However, this time each retry is counted, and the number of retries is limited. The retry counter is updated in a message header; when it reaches the maximum number of retries defined in the service, the message is marked as failed in the database and reaches its final destination: the DLQ (Dead Letter Queue) topic. There it rests for the log retention period, in case there is a need to replay it; the ability to replay messages from the DLQ is provided.
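The routing decision can be sketched as follows. The header key, topic names, and retry limit are illustrative assumptions, and headers are simplified to a map for the sketch:

```java
import java.util.Map;

// Sketch of the retry-topic routing decision: bump the retry counter in
// the message header, and park the message in the DLQ once the counter
// reaches the maximum. Names and values are illustrative assumptions.
class RetryRouter {
    static final String RETRY_HEADER = "x-retry-count";
    static final int MAX_RETRIES = 5;

    /** Returns the topic the failed message should be published to next. */
    static String route(Map<String, Integer> headers) {
        int retries = headers.getOrDefault(RETRY_HEADER, 0);
        if (retries >= MAX_RETRIES) {
            // here the real service also marks the message as Failed in the DB
            return "email-dlq";
        }
        headers.put(RETRY_HEADER, retries + 1); // bump the counter in the header
        return "email-retry";
    }
}
```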
Deduplication is handled through Kafka and the DB state. When the same message, with the same record key, reaches the same consumer, it is processed sequentially; hence, there is no parallel consumption of the exact same (duplicate) message. In addition, each message is handled according to its state, and by the time a duplicated message reaches the consumer, the state of that message should already be final (Delivered / Failed / Skipped). In that case the consumer won’t process the message; it is discarded, since it was handled earlier.
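The state-based half of that guard reduces to a small check, sketched here (the enum mirrors the final states above; the non-final state name is an illustrative assumption):

```java
// Sketch of the consumer-side idempotency guard: a duplicate whose DB
// state is already final is discarded instead of being reprocessed.
class DedupGuard {
    enum State { RECEIVED, DELIVERED, FAILED, SKIPPED }

    static boolean shouldProcess(State dbState) {
        // A final state means the message was already handled earlier
        return dbState != State.DELIVERED
            && dbState != State.FAILED
            && dbState != State.SKIPPED;
    }
}
```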
As the service matures over its lifecycle, we tend to find and tackle new challenges that we identify as shared across multiple services. I will briefly describe the challenges currently being worked on.
Calling multiple services within the same message-processing time slot is an anti-pattern, and the calls add up to the message processing time.
We are rethinking the flow, and how choreography can decouple this service from other services as well as enhance the throughput.
Creating a service capable of telling whether an exception belongs to a predefined set of expected exceptions to recover from. This should enhance exception and retry handling.
Retrying the request against all microservices might be problematic, since some of the services might run business logic that changes their state. We must prevent data inconsistency, or strive for eventual consistency across these services.
Building a consumer wrapper process for managing Kafka’s latest handled offsets per partition, to prevent duplication without working against a DB entry for each Kafka record.
Building the first microservice might take a while when the intention is to handle all the points raised above. However, with a common library and good infrastructure that handle most of them, it becomes a no-brainer. It’s crucial to define the microservice boundaries, responsibilities, and technical stack, as well as to know which method to use to overcome obstacles when you encounter them.
Building a resilient microservice is challenging and satisfying at the same time.