An approach to designing an Express Logistics system from a high-availability and scalability standpoint.
End-to-End Express Logistics as a Monolith
I am a huge proponent of the paradigm that when it comes to creating large software products, it is better to start with a monolith than with a microservice model, especially when you are building in the domain for the first time. Having started with a monolith, there may come a time when it becomes increasingly pressing to break the code into smaller, separate components that can be managed, tested, and deployed independently.
This is where our team found ourselves at the start of 2020, building the Express Logistics product. The application had become too complex and the velocity of our Sprints had dropped significantly. Deployments slowed down because the impact of bug fixes and new features was no longer fully understood and required lengthy manual testing. The conversation about switching to microservices reached a tipping point when we hit 12–15 hours of delay per week in deploying code to production.
This is not an out-of-the-ordinary situation; rather, it is a common juncture that most software product companies reach as they grow. There is something to be said about the timing of re-architecture efforts for a startup, and about the stories of startups that almost died attempting that jump, but I’ll reserve that for another article.
While the deployment delays were the final trigger, our primary reasons to consider the microservice architecture were:
To Reduce Risk while maintaining Speed - Availability and latency risk reduction for the core functionality was the highest priority (explained in the section below), as driven by our Customer Discovery efforts. Microservices offered the flexibility for individual services to be iterated, tested, and deployed independently, each optimized for its own objectives and pace.
Scalability - the ability for different services to scale independently based on localized demand and on their importance within how our product was structured.
Vision for the product - The core services were the foundation of the product and the foot in the door for the business; however, the high-revenue-impact features lay in extending and leveraging the data generated by the core. These features had different objectives and could use their own iteration-deployment cycles.
Talent pool - Flexibility in hiring technologists closer to a particular Business Capability. For example, the Analytics module could be built with a stack the Data Science team is familiar with, rather than with the stack chosen for Order Management, which was selected with different objectives in mind.
On the priority of Reducing Risk and Improving Scalability
As we spoke to the significant number of companies we had identified as potential reference customers, a clear pattern emerged. A majority of these customers had started building their software systems in the late 90s and early 2000s and had been adding patchwork on top whenever they needed a new capability. The average solution looked like a hairball of about 15–20 systems built over the previous two decades with different-era technologies, resulting in the following:
Significant downtime every month, ranging from a day to 4–5 days, resulting in lost revenue (unable to accept new orders) or reduced margins (unable to control operational costs). This was largely attributable to race conditions between systems developed by different vendors and the cascading failures they triggered.
High latency under peak load, as most older systems were built with vertical scaling in mind and use older tech for inter-process communication. Logistics being a very seasonal market, a peak day such as Singles’ Day can carry 50x the traffic of an average day.
Understanding this customer pain pushed us to make Availability Risk Reduction and Scalability our sacred objectives in building this product. We understood that only by standing on these pillars could the rest of the business impact of our value proposition be realized.
Refactoring the Code
Our approach to refactoring the monolith was to extract one or two business capabilities with clear boundaries to start with and extend from there. The Routing and Pricing components had very few dependencies on the rest of the application and provided a good foundation for evaluating the advantages we were going for with this change.
Design Patterns worth mentioning on the path to Full Refactor
Separating service functionality based on Business Capability and grouping the services under umbrellas reflecting how they were sold and used - Core, Platform, and Addon services - each with dedicated teams.
Eventual Consistency using an Event-Driven approach - Kafka was an ideal candidate for the Message Broker, based on the team’s experience with it as well as features like automatic replication and ZooKeeper-managed cluster state.
Orchestrated Sagas for distributed transactions - a centralized approach to managing transactions that span services, using compensating actions rather than ACID guarantees.
Continuous Semantic Monitoring - a form of synthetic, transactional functional testing. Since a single service can be part of a number of business flows, semantic monitoring can help test the critical flows end to end. While low-level metrics are definitely useful, they also produce quite a bit of noise when a large number of systems is involved. Semantic Monitoring, on the other hand, alerts the team to issues in high-level flows and provides a bigger picture.
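To make the orchestrated-saga pattern above concrete, here is a minimal sketch in Python. The step names and the flow are hypothetical illustrations, not our actual services: the idea is simply that a central coordinator runs each step in order and, on failure, runs the compensating action for every step that already succeeded.

```python
# Minimal orchestrated saga: forward actions plus compensations,
# rolled back in reverse order when a step fails.

class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # forward operation
        self.compensation = compensation  # undo operation

class SagaOrchestrator:
    def __init__(self, steps):
        self.steps = steps

    def execute(self):
        completed = []
        for step in self.steps:
            try:
                step.action()
                completed.append(step)
            except Exception:
                # Compensate in reverse order of completion.
                for done in reversed(completed):
                    done.compensation()
                return False
        return True

# Hypothetical order flow: capacity is reserved, then pricing fails,
# so the reservation is released again.
log = []
steps = [
    SagaStep("reserve_capacity",
             lambda: log.append("capacity reserved"),
             lambda: log.append("capacity released")),
    SagaStep("price_order",
             lambda: (_ for _ in ()).throw(RuntimeError("pricing down")),
             lambda: log.append("pricing reversed")),
]
ok = SagaOrchestrator(steps).execute()
# ok is False; log == ["capacity reserved", "capacity released"]
```

In a real deployment the forward and compensating actions would be calls (or Kafka messages) to the owning services, and the orchestrator would persist saga state so it can resume after a crash; this sketch only shows the control flow.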
The following diagram shows our approach to the separation of business capabilities and categorization into teams.
Lessons learned
Operational Complexity: Managing and monitoring complexity aside, the boundary definitions of the microservices have a profound impact on how services work together and on how easy or hard it is to debug an issue spanning multiple services.
Microservice API versioning: Any breaking change to a service API requires a corresponding change in every service that depends on it; this can be mitigated with API versioning and by running all in-use versions side by side.
Failure Resiliency can be a complex problem: While Netflix OSS Hystrix provides a good solution to fault tolerance via the Circuit Breaker pattern, it takes a whole lot more to make an application resilient.
Shifting complexity: Microservices do not result in a simpler system; they shift the complexity to areas that can be owned and managed better than within a large monolith.
Increased Team Ownership and Velocity: Product squads organized around business capabilities run with a high degree of ownership vs a team that switches context often, and this is more pronounced when the boundaries are clearer.
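The Circuit Breaker pattern mentioned in the resiliency lesson above can be sketched in a few lines. This is a simplified, hypothetical stand-in for what a library like Hystrix provides (not its actual API): after a threshold of consecutive failures the circuit opens and calls fail fast, and after a timeout one trial call is let through to probe recovery.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected fast."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result

# Demo: two consecutive failures open the circuit; the third call
# is rejected immediately instead of hitting the failing dependency.
breaker = CircuitBreaker(max_failures=2, reset_timeout=60.0)

def flaky():
    raise RuntimeError("downstream service down")

results = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        results.append("fast-fail")
    except RuntimeError:
        results.append("error")
# results == ["error", "error", "fast-fail"]
```

As the lesson says, this alone does not make an application resilient: timeouts, bulkheads, retries with backoff, and fallbacks all have to work together with the breaker.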
Good Reads
Fallacies of Distributed Computing : https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
Martin Fowler on Microservices : https://martinfowler.com/microservices/
Article by Bilgin Ibryam on resiliency in MSA : https://developers.redhat.com/blog/2017/05/16/it-takes-more-than-a-circuit-breaker-to-create-a-resilient-application/