Chaos Engineering Makes Disciplined Microservices

 

Chaos and discipline sound like opposites, so you might be wondering: how can chaos make microservices disciplined?


But discipline means the absence of chaos, so until you have experienced chaos you cannot really be disciplined.


If we think in terms of entropy, chaos is the high-entropy state and discipline is the low-entropy state. Left alone, disciplined services degrade into chaotic ones, because the natural direction of flow is from the low-entropy state (discipline) to the high-entropy state (chaos). So chaos is inevitable.


Now, if we want our services to stay in the low-entropy (disciplined) state throughout, we need to adopt a few special techniques. By the laws of physics, the flow from a low-entropy state to a high-entropy state does not reverse on its own, so we have to work against entropy. Call it reverse entropy (watch Christopher Nolan's masterpiece Tenet!). A refrigerator is an everyday example: it spends energy to keep its inside cold. The crux is that to maintain discipline in our services we need to adopt a resilience strategy. But how do we decide which resilience strategy to adopt? For that, we have to experience chaos in production and act accordingly.


This is the essence of Chaos Engineering: inject mild faults into the system, experience the chaos, and then put preventive measures and self-healing in place against it.


Today I am talking about implementing chaos in production!


After hearing this you might think: what is he saying? Is he insane? Production is the most emotional and sensitive area for a developer. We all pray that whatever errors exist show up before production, because if something goes wrong in production, your organization's reputation is at stake and it can lose its user base and revenue. And here I am encouraging you to inject faults there.


But the irony is that we have the wrong mindset. Our mindset should be “Failure is inevitable and we must prepare for it”. In this tutorial, I am advocating for this culture.


A simple Microservice definition::


Microservice architecture is distributed in nature and consists of a suite of small services that can be scaled and deployed independently.


If we unpack the above statement, we find three important things.


Because microservices are distributed, they communicate over the network, and the network is unreliable. So how can your microservices be reliable?

Because microservices call each other over the network, they depend on each other, and a service can fail when a service it depends on fails.

Microservices are scaled and deployed on infrastructure, so if the infrastructure fails, your microservices fail.


These points justify the “Failure is inevitable and we must prepare for it” statement.

But the question is: how do we prepare for it?

The answer is Chaos Engineering.


What is Chaos Engineering?


Chaos Engineering is a technique for measuring the resilience of your architecture. We inject a fault (for example, increased load or added delay) and then observe how the services react and how resilient they are. This is called FIT (Failure Injection Testing). If a service is not resilient, we identify the weakness and fix it so that the service can handle real errors in production.





Now let's discuss, in terms of microservices, which areas we should chaos test to see whether our microservices are resilient, and which resilience strategies we must follow to avoid downtime.

 
Network, Resilience, and Chaos Engineering::

Chaos in Network::

Because microservices communicate over the network, we can encounter many types of failure: an unstable network, network congestion, a DDoS attack, delays during calls, and so on.

Resilience Technique::
Hystrix is a suitable tool. With Hystrix, if a service is down, we can take a default route instead of hammering that service, which gives it time to recover.

Chaos Testing::
We can introduce load, congestion, or network delay in production using the Netflix FIT framework to check how services react. Note that with FIT we must have a separate canary path for the test, where we experiment with roughly 1% of the real user load. If the service turns out not to be resilient, we terminate the test then and there and reroute to the real path so that users do not experience any errors.
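
The canary idea can be sketched in a few lines. This is only an illustration in Python, not the FIT framework itself: the names `bucket`, `choose_path`, and `CANARY_PERCENT` are made up for this example, and hashing the user id is just one assumed way to pick a stable 1% slice.

```python
import hashlib

CANARY_PERCENT = 1  # assumed share of users sent to the fault-injection path


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_path(user_id: str, experiment_active: bool) -> str:
    """Return "canary" for the sampled slice while the experiment runs, else "main"."""
    if experiment_active and bucket(user_id) < CANARY_PERCENT:
        return "canary"
    return "main"
```

Setting `experiment_active` to `False` is the "terminate and reroute" switch: every user immediately goes back to the main path.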



Dependency, Resilience, and Chaos Engineering::


Chaos in dependency::

Microservices call each other to fulfill a business capability, so dependencies between microservices are a major factor. We can predict many types of failure: a dependent service is not available; a service is not in a state to receive requests; one service fails and, as a cascading effect, the whole chain of microservices fails and crashes; the distributed cache is unavailable; cache memory crashes; or some component is a single point of failure.


We will talk about the resilience strategy for all the above cases.


Resilience Technique::

Hystrix is a Netflix OSS library for implementing the circuit breaker pattern in services. If a dependent service fails for several requests in a row, it is better to stop calling it for a while and take a default route, so that the whole chain of microservice calls does not break and the user experience is not interrupted. (Hystrix is now in maintenance mode; Resilience4j is a commonly used alternative that implements the same pattern.)
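
To make the circuit breaker idea concrete, here is a minimal hand-rolled sketch in Python. It is not the Hystrix API; the class name, the threshold, and the timeout values are assumptions chosen for illustration, and a production library does much more (metrics, thread isolation, and so on).

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the failing dependency entirely
            # cool-down elapsed: half-open, let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # success: close the circuit and forget earlier failures
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the failing service is not called at all, which is what gives it time to recover.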


Another resilience technique is to identify the critical services that are the heart of the business and make sure that, if other services fail, these services keep running and give users a minimal experience to carry on with, rather than halting the whole user experience.

In a real architecture we do not call the persistence layer on every request, because that adds latency. Also, not every business feature is stateless, and we need to share data across multiple microservices. So we use caching and do some orchestration in our code: when a request comes in, we first check the cache; if the data is not there, we call the persistence layer and add the result to the cache for later requests.
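
The cache-then-persistence flow described above is often called cache-aside. A minimal sketch, assuming a dict-like cache and a hypothetical `load_from_store` function standing in for the real database call:

```python
class CacheAsideRepository:
    """Check the cache first, fall back to the store, then populate the cache."""

    def __init__(self, cache, load_from_store):
        self.cache = cache                      # any dict-like cache client
        self.load_from_store = load_from_store  # function that queries the persistence layer

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit: no database round trip
        value = self.load_from_store(key)   # cache miss: go to the persistence layer
        if value is not None:
            self.cache[key] = value         # keep the result for later requests
        return value
```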


Now, if the cache fails or acts as a single point of failure, our services will fail. To avoid that, we use distributed caching with replication, and the data must be replicated to different availability zones so that if one zone fails, the data can be retrieved from another zone.


We need to adopt a multi-region strategy for the persistence layer as well as for the microservices. If your business is spread across geographies, then we must have data centers in different regions, say US North, US East, Asia Pacific, Europe, and so on.


Now the interesting thing: suppose each data center serves one region, say the Asia Pacific data center serves Asia. If that data center goes down, all your Asia Pacific users are affected. So we must have a multi-region failover strategy, so that if one region goes down, its user requests can be served by another region and the system stays resilient.
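
A failover rule like this can be as simple as a preference list per home region. The sketch below is an assumption-laden illustration: the region names and the fallback order in `REGION_FALLBACKS` are invented for the example, and in practice this decision usually lives in DNS or a global load balancer rather than in application code.

```python
REGION_FALLBACKS = {          # hypothetical preference order per home region
    "asia-pacific": ["asia-pacific", "europe", "us-east"],
    "europe": ["europe", "us-east", "asia-pacific"],
    "us-east": ["us-east", "europe", "asia-pacific"],
}


def pick_region(home_region, healthy_regions):
    """Return the first healthy region in the home region's preference order."""
    for candidate in REGION_FALLBACKS[home_region]:
        if candidate in healthy_regions:
            return candidate
    raise RuntimeError("no healthy region available")
```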



Chaos Testing::

To check how services react when a dependency is unavailable or slow, we can take the dependency down or introduce latency into the calls. Tools such as Chaos Monkey and Chaos Toolkit can be used for this kind of testing.
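
Before reaching for a tool, the idea of injecting latency or failure into a call can be tried with a small wrapper. This is a hand-written sketch, not the Chaos Monkey or Chaos Toolkit API; `with_chaos` and its parameters are hypothetical names.

```python
import random
import time


def with_chaos(operation, delay_seconds=0.0, failure_rate=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap a call so an experiment can add latency or fail a fraction of requests."""
    def wrapped(*args, **kwargs):
        if delay_seconds > 0:
            sleep(delay_seconds)                      # simulated network delay
        if rng() < failure_rate:
            raise ConnectionError("injected fault")   # simulated unavailable dependency
        return operation(*args, **kwargs)
    return wrapped
```

Wrapping a client call as `with_chaos(fetch_price, delay_seconds=2.0)` (where `fetch_price` is whatever function calls the dependency) lets you watch whether timeouts and fallbacks in the caller actually kick in.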


To check whether your critical services keep working, you can block all other services and allow only the critical ones, then see how the system behaves when only the critical services are up.


We can use Chaos Monkey to kill random nodes, and Chaos Gorilla to take out an entire availability zone, to see how services react, assuming multiple instances of each service are deployed across multiple nodes.


We have Chaos Kong, which can take down an entire region, to check the multi-region failover strategy.



Scaling, Resilience, and Chaos Engineering::


Chaos in Scaling::

In microservices we generally adopt X-, Y-, and Z-axis scaling. While scaling, we can face many types of issues, such as zone unavailability, server affinity, sticky sessions, and cache unavailability.


Resilience Technique::


Generally, when we build a microservices or distributed architecture, we say we will build stateless microservices, but it does not always happen. Sometimes we have to maintain state because of business requirements such as cart functionality, so we adopt techniques like server affinity or sticky sessions: once a request reaches a node, the load balancer makes sure that the user's further requests are processed by that node only. This technique has a downside. Although your system is distributed and has multiple instances, if that particular node goes down, all the users served by that node will experience errors, which is not acceptable.


So we need to make sure our services do not have any server affinity. We can use a distributed cache to store the stateful data or session. We can also carry the state in the request payload, so that data populated by different microservices is available throughout the request cycle.


For the cache too, as I mentioned earlier, data needs to be replicated across multiple regions. Otherwise, if one cache region goes down, all the users whose data is stored in that region are impacted. With replication, the data remains available from other regions.


Another important thing: when we adopt horizontal scaling, we must make sure the system can auto-scale based on load. We need to treat each node as cattle, not as a pet, so that if a node malfunctions we can kill it (I am against animal cruelty) and spawn another one automatically. If your services have server affinity, or the load balancer uses a sticky session strategy, auto-scaling will not work cleanly, because replacing a node loses the state pinned to it, so fix that first. Your organization's infrastructure must be capable of self-healing and of spawning new nodes based on load, so it is advisable to use containerization and a container orchestrator such as Kubernetes or Docker Swarm to handle this.


Chaos Testing::

To check whether your service has server affinity, you can randomly kill nodes with Chaos Monkey or Chaos Toolkit and observe the outcome.
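
What such an experiment reveals can be shown with a toy simulation. Everything here is hypothetical (the `Node` class, the cart example, three nodes); it only illustrates why state held in node memory is lost when the node is killed, while state in a shared store survives.

```python
import random


class Node:
    def __init__(self, name, shared_store=None):
        self.name = name
        # With no shared store the cart lives in this node's memory (server affinity).
        self.carts = shared_store if shared_store is not None else {}

    def add_to_cart(self, user, item):
        self.carts.setdefault(user, []).append(item)

    def cart(self, user):
        return self.carts.get(user, [])


def survives_node_kill(shared_store, rng=random):
    """Add an item on one node, kill that node, then read the cart from a survivor."""
    nodes = [Node(f"node-{i}", shared_store) for i in range(3)]
    victim = rng.choice(nodes)
    victim.add_to_cart("alice", "book")
    nodes.remove(victim)                       # the chaos step: the node disappears
    return nodes[0].cart("alice") == ["book"]
```

Calling `survives_node_kill(None)` models sticky sessions, and `survives_node_kill({})` models a shared (distributed) session store.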


Infrastructure, Resilience, and Chaos Engineering::


Chaos in Infrastructure::

While adopting microservices you need strong infrastructure support. Most organizations either use a cloud provider or run their own internal cloud or data centers. Whatever the case, your microservices architecture depends on infrastructure: if the infrastructure fails, your microservices fail too. So we must think about infrastructure failure and design our architecture so that we can mitigate it.


Resilience Technique::


For infrastructure, we need a multi-region strategy and a failover mechanism, so that if one region's infrastructure goes down, another region can serve that region's users.


Also, we need to choose data center locations wisely. The physical distance between two data centers should not be too large, because more network hops mean higher response times. But they must still be far enough apart that a natural calamity or a terrorist attack does not impact both data centers.



Chaos Testing::


Using Chaos Kong we can take down the entire region to see how failover works.





How to adopt Chaos Engineering?

Chaos Engineering is a compelling idea, because it prepares your developers to tackle real production incidents. But it is not a free lunch: it needs a proper DevOps pipeline, an auto-scaling architecture, and a resilient system to be successful.

So before adopting it you need to spend money on infrastructure, DevOps, and building a scalable architecture.

Below are some pointers to think about while adopting Chaos Engineering in your system.

  1. You must have a well-designed DevOps pipeline that can route roughly 1% of the load to the chaos engineering path, so you can test how your system reacts.
  2. You need well-designed container orchestration, where you can manage containers, autoscaling, failover, networking rules, and so on.
  3. Chaos Engineering is meant for complex business systems. If your system is simple enough, do not adopt it; it will only increase overhead and cost.
  4. After adopting Chaos Engineering you must hold Game Days, the days on which you run chaos tests in production, and hold them at regular intervals.
  5. You need automation platforms and chaos quality gates that run all types of failure testing. If your team has to do this manually, in the long run the team will lose interest and it will get out of control.
  6. If you identify a new failure in production during a chaos experiment, terminate the test and reroute traffic to the original route so that users do not experience any errors.
  7. When chaos testing identifies a new bug, the team must do a root cause analysis (RCA), fix it, and then find a way to automate the check and add it to the chaos quality gates.
  8. The organization must have a chaos checklist. Every service needs to pass that checklist, and only then is it promoted to production.
  9. Chaos Engineering in production is risky if your team is not skilled enough. It is better to start in lower environments and move to production once the team has acquired the skill.
  10. While testing in production it is important to minimize the blast radius. Giving customers unnecessary pain is not a good way to experiment with chaos, so the chaos engineering team must keep each experiment within a minimal blast radius and fall back to the original route if something goes wrong.
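
Points 6 and 10 above amount to an abort rule: stop the experiment as soon as it hurts more than you allowed. A minimal sketch of such a rule follows; the function name, the 5% error limit, and the minimum sample of 20 requests are all assumptions for illustration, not values from any chaos tool.

```python
def run_experiment(requests, inject_fault, handle, max_error_rate=0.05, min_sample=20):
    """Inject a fault per request and abort once the observed error rate breaches the limit."""
    errors = 0
    processed = 0
    for request in requests:
        processed += 1
        try:
            handle(inject_fault(request))
        except Exception:
            errors += 1
        # Abort early so the blast radius stays small.
        if processed >= min_sample and errors / processed > max_error_rate:
            return {"status": "aborted", "processed": processed, "errors": errors}
    return {"status": "completed", "processed": processed, "errors": errors}
```

An "aborted" result is the signal to reroute traffic to the original path and start the RCA described in point 7.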



Conclusion:

Chaos Engineering is about checking how ready your system and your developers are to handle real issues. Often our system is not battle-tested against heavy load, region unavailability, or the loss of critical services. We do have some load testing and integration testing in lower environments, but that is not enough, and the reason is:

“Our lower environment does not mimic production infrastructure”

So developers often have to fight failures in production, in real time, without any preparation. There is no process in place, so developers are confused about what to do, and that is why production failure is still a fear factor for developers and infrastructure engineers.

Having said that, with Chaos Engineering we give developers and infrastructure engineers the chance to prepare themselves in production itself. They become seasoned players who can handle production errors without fear. This is the mindset every organization needs to adopt, because we are moving very fast: new frameworks and tools are created every day, and organizations adopt them to break away from old systems. That gives an organization flexibility in scaling and resilience, but on the other hand it complicates the architecture, and without chaos engineering such an architecture is hard to sustain.

“Chaos Engineering is the vaccine for complicated architecture: it helps the system create antibodies and remember them, so that if the same issues happen in the future, they are mitigated immediately.”







10 Commandments of Microservice Decomposition


When we talk about microservices, we talk a lot about Domain-Driven Design, Event-Driven Architecture, core domains, subdomains, bounded contexts, the anti-corruption layer, and so on.

Whether you are working on a brownfield or a greenfield project, if your organization wants to adopt microservices (assuming it has a compelling reason to, as microservices are not a free lunch), you need to understand the above terms in detail to properly decompose your business domain logic (the business space) and map it to a microservices architecture (the code space), so that you gain the benefits of microservices.


In this article, I will try to cover the purpose of each of the above terms in decomposing microservices, and fit them under one umbrella concept.


Explaining each term from the root would take one or more articles per term. So I will be concise here and give you pointers to use when you apply a microservice decomposition strategy in your organization.


Let’s begin with the 10 Commandments of Microservice Decomposition.


1. View Your Business Domain In terms of Bounded Context and Ubiquitous Language::


Before taking any step towards decomposition, the first thing to do is reduce the gap between product owners and developers. Product owners do not understand technical terms, and the technical team does not understand what a term means to the business or how the business interprets it. It is like a Portuguese speaker and an American each talking in their native language: neither understands the conversation. To bridge the gap we need to take the steps below.


  1. Gather in front of a drawing board and start a discussion with the product owners: what is the objective of the business, who are the actors in a particular feature, and what terms do they use when defining the feature? At every step, ask more questions until you figure out which terms conflict. For example, a Customer in the Order context is different from a Customer in the Infrastructure Support context.

  2. Once you understand the conflicting terms and have grouped the related features, draw a context boundary so that within each context every domain entity name is unambiguous.

  3. Define a ubiquitous language for each context, so that the business team and the tech team are in sync and use a common language when they communicate.

  4. Start with coarse-grained bounded contexts. Divide a bounded context later only if there is a compelling reason; I would recommend not dividing it unless there is a business reason.
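
In code, the conflicting "Customer" term from step 1 typically ends up as two separate models, one per bounded context. The sketch below is purely illustrative; the class names and fields are assumptions, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderCustomer:
    """'Customer' in the Order context: someone who buys and receives goods."""
    customer_id: str
    shipping_address: str


@dataclass(frozen=True)
class SupportCustomer:
    """'Customer' in the Infrastructure Support context: someone who raises tickets."""
    customer_id: str
    support_tier: str
    open_tickets: int = 0
```

The two models share only an identifier, so each context can evolve its own idea of a customer without breaking the other.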



2. Identify the Core Domain and apply Innovative Idea:: 


The core domain is the domain that brings revenue to your business. For online shopping, say, the shopping cart module is the core domain: it is the platform that lets the business sell and consumers buy (B2C). Understand what your core module is and think about how you can improve it in ways your competitors have not. Any automation or innovation there adds an advantage and boosts your revenue, so pay attention, do R&D, and invest money in the core domain to stay ahead of the competition.


3. Do Cost Optimization on Generic Domains::


Generic domains are domains that are common to every business in the niche, and for which third parties already sell solutions in the market, such as a notification module or an ad campaign module. The best strategy is not to spend money on an in-house project that reinvents the wheel, unless you have a compelling reason. Prefer adopting an affordable third-party solution.


4. Think on Support Domains::


The core domain needs the help of support domains to enrich itself, and in some cases a support domain can drive revenue and become a core domain in the future. So it is important to think about support domains and decide where investing in them can generate revenue. For example, in a shopping cart domain, inventory management is a support domain, but it pays to invest in expanding inventory locations to cut shipping costs, and in algorithms that identify the nearest inventory location for a customer order.


5. Introduce Anti Corruption Layer:: 


The anti-corruption layer (ACL) is an integral part of microservice design: it protects microservices from malformed models in the outside world. In a real legacy project you will always encounter old systems built on a mainframe or in some other language. While you are migrating, they remain an important source of input data for the microservices and live side by side with the microservices architecture, and for various reasons you cannot decompose them.


So it is a good idea to create a facade between the legacy system and the microservices, rather than consuming data directly from the legacy system and coupling the microservices to the legacy architecture.


Think about the generic domains too, since they rely on third-party solutions. Rather than consuming and publishing data directly according to the third party's contract, introduce an anti-corruption layer that insulates the microservices from the external API contract, in the style of the Ports and Adapters pattern. Instead of being driven by their contract, we define our own contract, and the ACL acts as a translator between the microservice and the third-party contract. This makes it easier to switch to a different third-party solution in the future.
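
The translator role of the ACL can be sketched as a single mapping function. The legacy field names (`CUST_NO`, `FNAME`, `LNAME`, `STATUS_CD`) are invented for this example; a real legacy or third-party contract will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """The microservice's own model; nothing here depends on the legacy format."""
    customer_id: str
    full_name: str
    active: bool


def translate_legacy_customer(record: dict) -> Customer:
    """Anti-corruption layer: map a legacy record (hypothetical field names) to our model."""
    return Customer(
        customer_id=record["CUST_NO"].strip(),
        full_name=f'{record["FNAME"].strip()} {record["LNAME"].strip()}',
        active=record["STATUS_CD"] == "A",
    )
```

If the legacy system or the vendor changes, only this translation function changes; the rest of the microservice keeps working with `Customer`.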


6. Identify the Data Communication patterns:: 


Once you have decomposed the microservices by feature, and each core service encapsulates its own database or persistence layer (database per service), the next important thing to understand is how your UI views and components communicate to complete a feature. Is it a sequential flow, where the user must complete the whole feature in one shot? Or can it be asynchronous, where the user completes part of the functionality and creates an intermediate state, another system acts on that state, and then calls back or notifies the user to resume?


7. Introducing Event-Driven Architecture (EDA)::


In a real application, your business cases have complex workflows with many branches, and the workflow takes a different path depending on how the state of the data changes. If you expose all of this through REST APIs, you create a chatty network. Worse, each microservice becomes coupled to the others, and you end up with spaghetti code and a distributed big ball of mud.


So we need a clean architecture where each microservice can operate independently without creating coupling. Here event-driven architecture plays a vital role. Each event wraps a change of state, and the microservices follow a pub/sub model: one microservice publishes its state change, wrapping the necessary data in an event, and other microservices listen for that event and choose their path based on the data it carries. Because events are immutable, they also hold the history of an entity or aggregate, so if you adopt an event store and event sourcing, you can generate statistics and reports from the events.
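
The pub/sub model above can be sketched with an in-memory bus. This stands in for a real message broker and is only an illustration; `EventBus`, `Event`, and the event names in the usage note are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    name: str
    payload: dict = field(default_factory=dict)


class EventBus:
    """In-memory stand-in for a message broker."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.history = []                      # append-only log of published events

    def subscribe(self, event_name, handler):
        self.subscribers[event_name].append(handler)

    def publish(self, event):
        self.history.append(event)
        for handler in self.subscribers[event.name]:
            handler(event)
```

A shipping service could call `bus.subscribe("OrderPlaced", handler)` without the order service knowing it exists, and the `history` list hints at how an event store can later be used for reports.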


8. Make API contract Clean and concise::


In microservices, the API you publish acts as a contract. When you publish an API, make sure it does not expose internal state; think about encapsulation. Think about network calls too: publish the API in such a way that other services get enough information to carry on their flow and do not have to come back multiple times for derived information. Also think about which events you should publish and which must remain internal. Often you can publish one coarse-grained event rather than several small internal events.


Example: say that internally you have an AddressChangeEvent and a PersonalInfoChangeEvent. Rather than publishing both in the API contract, publish one coarse-grained event called CustomerUpdateEvent.
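
One way to picture this is a small function that folds the internal events into the single public one. The dictionary shape of the events here is an assumption made for the sketch; only the three event names come from the example above.

```python
def to_public_event(internal_events):
    """Fold fine-grained internal events into one CustomerUpdateEvent for the API contract."""
    changes = {}
    customer_id = None
    for event in internal_events:
        customer_id = event["customer_id"]
        changes.update(event["changes"])   # later events win if a field repeats
    return {"type": "CustomerUpdateEvent", "customer_id": customer_id, "changes": changes}
```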


9. Merge Related Microservices to a Bigger Service:: 


After decomposing, if you find that a few microservices always change together whenever a feature is added or updated, you know you decomposed them the wrong way. They should not have been split into small services; they are part of the same logical unit, so it is wise to merge them into a single service. That reduces unnecessary coupling and network calls.


10.  Introduce Supporting tool for Seamless development:: 


Microservices are not a free lunch. If you adopt microservices, first be ready to spend on supporting software. We adopt microservices for scaling, resiliency, high availability, and reduced time to market, but they are distributed and work over the network, so failure is inevitable and you need to catch it as early as possible. That is not possible without spending on infrastructure.


Only if you spend well do microservices open up different options for you and help your organization grow further.


So spend on a CI/CD pipeline, adopt cloud infrastructure, use a tracing tool, use a log aggregator to search logs, use chaos tools to check how prepared you are for failure, and so on.






Conclusion ::

The above points are all necessary when you are decomposing microservices. I will write an article on each topic, covering how it plays a pivotal role in adopting a microservices architecture.


I would also like to hear from you about the challenges you faced while decomposing into microservices.


Microservices and Scaling Strategy.

 

I have heard the questions below many times:

"How do I scale Microservices?"

or

"What type of scaling do Microservices unlock?"

So I thought I would write a crisp article on scaling.


A distributed system can be scaled in a 3D space, i.e. along the X axis, Y axis, and Z axis. We scale a distributed system to manage load and keep the website highly available, and of course to manage cost efficiently by using servers and other resources in an optimal way.


X-axis scaling spawns more copies of the environment based on load. It is the traditional way of scaling a distributed system: increase the number of instances behind a load balancer. In principle it can scale without limit.


Y-axis scaling isolates business functionality, also known as functional decomposition. If one function has more load, usage, or priority than the others, we scale only that function and so keep resource costs under control. A particular function can in principle be scaled without limit, while the number of functions is finite at any time and grows as you add functionality.


Z-axis scaling partitions by business parameters. It unlocks things like serving premium customers or special requests separately, or geographic distribution. It too can scale without limit along a business parameter, while the set of partitioning rules is finite and grows as you add rules.


Keep one thing in mind: a monolith can also be distributed, and it is also logically decomposed by function through multi-module projects or package structure. But it is packaged and deployed as a single artifact (an EAR or JAR), so it does not unlock the Y axis: there is no physical functional decomposition, and each function does not have its own process space or environment (one container or server per function). Microservices unlock that by nature, so microservices can use all three dimensions of scaling.


 



Microservices use all three types of scaling; they follow the scale cube described in The Art of Scalability.


Y-Axis Scaling:: The main focus of microservices is functional decomposition, and they do it very well. Each function is wrapped by a few microservices, so if one feature in your system is dealing with high load, you can scale only that function without touching the others.


Z-axis Scaling:: It helps you partition your data by zone. If your business is distributed geographically, you can place data centers by zone, and each data center serves the requests for its zone, which helps you respond quickly. If the load in one zone increases, you can scale only that zone. You can also apply logic based on a request parameter and send the request to separate servers: say you have premium customers whom you want to serve faster, you can do that.


X-axis Scaling:: You can spawn multiple instances of a microservice based on load. You can use the cloud or containers, which give you the whole environment and not only the artifact, and that reduces startup time.

So, if you think of microservices scaling as a hierarchy, it starts with Y-axis scaling, and then each microservice can be scaled along the X and Z axes. You can adopt all types of scaling through a microservices architecture.
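
Z-axis scaling is the least familiar of the three, so here is a small routing sketch for it. The pool names and the `tier` and `zone` request fields are hypothetical; the point is only that the routing decision comes from business attributes of the request rather than from load alone.

```python
PREMIUM_POOL = "premium-pool"      # hypothetical pool reserved for premium customers
ZONE_POOLS = {                     # hypothetical pools per geographic zone
    "asia": "asia-pool",
    "europe": "europe-pool",
}
DEFAULT_POOL = "us-pool"


def route(request):
    """Z-axis routing: pick a server pool from business attributes of the request."""
    if request.get("tier") == "premium":
        return PREMIUM_POOL
    return ZONE_POOLS.get(request.get("zone"), DEFAULT_POOL)
```

Each pool returned here can then be scaled on its own along the X axis.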

What is a Microservice?


Although the question is simple, it is tough to answer, because there is no de facto standard for microservices. Many people have many perspectives on microservices, and so there are many definitions.




A compelling definition is:

Microservices are a suite of services where each service is bounded by a bounded context and can be run, deployed, and scaled independently without impacting other services.

To make the above statement true in reality, organizations that have adopted microservices follow a few common characteristics, so

Javaonfly recommends going by characteristics rather than by definition.

Characteristics to follow to create a successful microservices architecture:

Componentization of services:: Each service can be upgraded, evolved, deployed, scaled, and tested independently.

Database per core service:: Remove coupling through the database by making each database private to its core service, and create an API contract through which other services get the data.

Test cases are first-class citizens:: Because microservices (MS) are a collaboration of multiple services, unit tests and integration tests must run often to identify issues. A fail-fast strategy is key to success, so you need full unit test and integration test coverage, with automation.

CI/CD Pipeline and Automation:: To succeed, we need a CI/CD pipeline with provision to deploy to UAT and SIT, and maybe automated deployment to PROD as well. Also, if we use PaaS or containerization to treat resources like cattle, not pets, we can spin up servers on the fly for X-axis scaling.

You build it, you run it:: Make sure teams are cross-functional agile teams (UI, backend, DBA, QA), and that each team is responsible for implementing a business capability, for example a Registration team, Login team, Order team, or Payment team. Depending on the organization's team strength, one scrum team may handle both the aggregator and the core services for that capability. But once a feature is built, that team is the sole owner of it: all bugs, support, deployment, delivery, database tuning, and testing are done by that team.

Strategy for failure is a must:: A microservice architecture deals with chains of service calls over the network, which is not in the team's control, so it is bound to fail. But we create MS for high availability, so a failure strategy is a must, not a nice-to-have. Always design plans A, B, C, and D for failure, and a fallback as the last resort.

No Single Point of Failure:: Never design a component that cannot be scaled along the X axis. Replication is a must for MS, for every component: service, database, cache, config, or load balancer. Use the CAP theorem to choose the trade-off that fits your business needs.

Reduce Chattiness by Aggregator:: Because microservices call each other over the network and a capability spans multiple services, design an aggregator that collects information from the core services and returns a UI-specific response. Use the aggregator carefully, as it can turn into a God Service antipattern.

Tracing and Logging:: Because microservices call each other over the network and a capability spans multiple services, think about how developers will debug when something goes wrong, and use a tracing mechanism. Also, since MS use X-axis scaling, it is not possible to check every server; you need a log aggregator like ELK or Splunk, so that a developer can trace a call using a correlation id.

Smart endpoints, dumb pipes:: The pipe is only a carrier, i.e. the network only carries the payload, and all the logic lives in the MS itself, unlike a service bus, where the middleware did all the work for you!

Externalization of configuration:: Configuration must be extracted out of the MS so that we can change it without restarting the services, which helps achieve high uptime.
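
As an illustration of the Tracing and Logging characteristic, here is a minimal sketch of passing a correlation id along a chain of calls. The header name `X-Correlation-Id` is a common convention but is an assumption here, and the two helper functions are hypothetical; real projects usually rely on a tracing library for this.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"   # a common convention, assumed here


def with_correlation_id(headers):
    """Reuse the caller's correlation id, or mint one at the edge of the system."""
    outgoing = dict(headers)
    outgoing.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return outgoing


def log_line(service, headers, message):
    """Format a log entry that a log aggregator can search by correlation id."""
    return f"[{headers[CORRELATION_HEADER]}] {service}: {message}"
```

Every service copies the incoming id into its outgoing calls and its log lines, so one search in the log aggregator shows the whole journey of a request.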

If the above characteristics are present, we can safely say it is a microservice. If one or more are missing, it is not yet a microservice, but we can say it is moving towards one.


Tips:: Favor Composition Over Inheritance is not a Universal Mantra!!

 


"Composition and Inheritance are the atoms of pure OOP design.

One cannot replace the other; each has its own purpose. Most of the time developers use inheritance in the wrong way, and then composition looks like the better option.

One sign of using inheritance the wrong way is suppressing a parent method by throwing a "not implemented" exception or by providing a silent, empty implementation.

To prevent such mistakes, seniors often repeat the above mantra without thinking about the context, but it does not always fit."
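
The warning sign mentioned in the tip looks like this in code. The sketch is in Python for brevity and uses made-up classes (`Bird`, `Penguin`, `Animal`); it shows one case where composition fits better, not a rule that it always does.

```python
class Bird:
    def fly(self):
        return "flying"


class Penguin(Bird):
    """Misused inheritance: the subclass has to disable behaviour it inherited."""
    def fly(self):
        raise NotImplementedError("penguins cannot fly")


class FlyingAbility:
    def move(self):
        return "flying"


class SwimmingAbility:
    def move(self):
        return "swimming"


class Animal:
    """Composition: behaviour is supplied as a part, so nothing has to be suppressed."""
    def __init__(self, name, ability):
        self.name = name
        self.ability = ability

    def move(self):
        return self.ability.move()
```

When every subclass genuinely is a full substitute for its parent, inheritance remains the simpler choice.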

To know more, please follow #javaOnFly; we will publish an article soon!


javaOnFly has started a new segment called Tips Corner. Every Saturday we will post tips to make you a better programmer.


If you like the tips and want more on the topic, you can read a whole bunch of detailed Java articles on javaOnFly. If you like the content, you can subscribe to javaonfly.


Still not satisfied? I love to discuss Java, patterns, microservices, clean code, and the latest trends with enthusiastic people like you. I am just one click away.


Follow me on

javaonfly :: https://javaonfly.blogspot.com

Twitter :: https://twitter.com/Shami83

Linked In :: https://www.linkedin.com/in/shamik-mitra-05b66227/

Instagram :: https://www.instagram.com/shamik.mitra/

Dzone :: https://dzone.com/users/1211805/mitrashamik.html

Facebook Group :: https://www.facebook.com/groups/javachamp

FaceBook Page :: https://www.facebook.com/artofjavaprog

