5 best practices for Building and Deployment Microservices

Many organizations who are going to adopt or adopted Microservices culture they often think Microservices means breaking a big business functionality into small independent services, which can be scaled automatically. But this is the half part of the story, Another part is often ignored but the other part is equally important and if you do not implement the other part please go back and sit with your architects as you are not getting the full advantages of Microservice culture. The other part is an efficient way to handle build and Deployment so developer code is always production ready.

Today I will discuss 5 best practices for build and development for Microservices.

  1. Independent Deployment : When an organization adopting a Microservice culture, The first rule has to be set is, each and every Microservices should be independent. Here what I mean by independent is, Each Microservice should have a separate code base, publish/consume well-defined API, Separate persistence layer, Separate build, and deployment script, and separate deployable artifacts.  So Developer of the Microservices own the process end to end, They must have complete control over their services in terms of technical and management, which help them to protect themselves from any external impediments. When you building a Microservice you should know Other Microservices API which you will go to consume, And also publish a API so that another Microservices Consumes your Service by doing that you can become independent as you know the request/response for the services you consume so you can MOCK them and work independently, also as you publish your API , there is well-defined method signature so unless you change the API signature, whatever you can do with your codebase, even you can deploy it thousand times in a day that will not affect others. Also in Management perspective Team must own the build and deployment pipeline, administrator access in each and every environment even production!!! Should have complete control everything related to Administration, If your organization does not allow this then you are not doing Microservice and not get its essence. Sometimes dependency with Other services is unavoidable( Not management related those are in our hand), or to cope future changes you need to revamp your API, But those are the Exception cases, in my previous article I had a talk about how to handle the same, But again that is exception, By rule every service is independent has its sole identity, Team related to this service owns everything they are all in all of this service. This brings the speed of development and as an organization, you can publish features to end user easily.

  1. CI/CD : One of the main causes of slowness in publishing feature in production is dependency between Dev team and Ops team, Dev team holds the technical part and ops holds the environment for it , This causes a barrier like( communication, availability of environment/artifact etc) as they depend on each other, In a competitive market even a delay of one hour matters, This is not the way to adopt Microservice culture, Ideal Microservice environment would be anytime any code from developers environment  can be production ready , If developers code pass all criteria then it can be published to production within no time and there may be a manual intervention to publish it in production but else should be automated, Microservice culture welcomes DevOps culture automatically. Without DevOps Organisation not utilize the power of Microservice fullest. There should be a build pipeline whenever user checks in a code it will trigger a build then artifacts should be in UAT environment where unit testing will be done then it moves to SIT and integration testing will be fired at last it can publish to production with or without manual intervention.

  1. Environment Replication : In Project Management lifecycle, our traditional approach is to create the different environments like UAT environment, SIT environment, PROD environment etc. This is really a good approach but the problem is that environments are not identical, In SIT we have few servers , A Load balancer manages them but in Production the Infrastructure is totally changed there may be a Primary and Secondary pool of servers for Blue/green deployment or they may have different data center based on geographical affinity. So one artifact different Topologies. One artifact tested against multiple environments, as a result,  we can’t predict how the artifacts behave in Production as we do not have production like infrastructure to test on. The most common bug comes from production is how artifacts behave when there is a heavy load. So to identify how exactly code behaves it needs an Identical environment like production with Load testing capability. It can be done if organization adopt IAAS, Where we can spawn machines seamlessly and clone the developer's environment. So our artifacts should not be only jar, war, gems. An Artifacts should contain the whole environment i.e the OS, Application server, persistence layer all ancillary settings. Nowadays we can achieve it through IAAS(Infrastructure As A Service). Also, Docker is a very popular option to spawn a container and by Docker file, we can instruct what should be packaged inside the container. Puppet and Chef also a good option to achieve that. If you want it should be managed by the third party you can go for AWS or Cloud Foundry, BlueMix etc.

  1. Artifact Repository :  Although it comes with CI/CD I want to highlight this feature. Why is an Organization adopting Microservice architecture in terms of Business perspective?
The answers are
  1. They want to release their feature quickly to end user to stay ahead of competitors.
  2. No Downtime which increases reputation.

    To achieve this organization has to have a robust framework where it can store all the artifacts with its version number as if any production issues occur with current release either we can patch the fix and deploy it to production or if it takes time to fix we can easily revert the changes and deploy the old artifacts so user does not have to wait for fix for next release. So it is utmost important to maintains the artifact repository and it should be very easy to deploy in production, No cumbersome processes so that it needs a special expertise, anyone can do this it should be like a command in CLI.

Like deploy <ServiceName><Version ID><Environment> where
Service ID: Logical Name of the Microservice.
Version ID is the Version you want to deploy.

Environment: Which has a logical name like prod1,prod2,sit1 internally it contains all the information of the servers and apply artifacts to all servers.

5. Scrum of Scrums : We know that one Microservice holds one Domain information. So When an organization wants to build a feature which distributed over multiple domains naturally the work should be distributed in multiple Agile teams. Although in Microservice culture each Team is independent as they can mock the other services, but make functionality works fully it needs an Integration.  So it is utmost important every team's Work Item should be in Sync so that In a defined timeframe all teams did their jobs and all the works can be integrated and tested in the Integrated environment. So Organization must have to conduct a Scrum of Scrums where each team's Scrum master and Product owner sit together and decide the priority items so that business functionality published in production according to the announced release.

Conclusion : Breaking a Complex Business domain into small moving parts and each part work independent is just one side of the coin, To make and publish new features in a hassle-free manner we need to focus on the Management structure of the organization also.  

I am curious to know from The organization who adopted or about to adopt microservice culture how they change their Management policy or what are the other management related challenges they faced when they move into Microservice Culture.

5 Best Practices for REST based MIcroservice

In this tutorial, we will discuss 5 best practices by which you can make your Microservice architecture developer friendly so they can manage and track the error easily.

  1. User Agent : It is very important to provide a meaningful name in request header so if any problem like slowness, huge memory access, spike etc occurs developers can understand from which microservice this request is originated. It is the best practice to provide the logical name/{service id } in the User-Agent property in the Request header.
Ex : User-Agent:EmployeeSearchService

  1. API Version Control :  In REST based Microservices architecture, One Microservice access another Microservices resources via API. So API acts as a Facade to other Microservices. The utmost important part is -- written API in a judicious way so that API is not changed frequently because other Microservices consume the same so any changes in the API method signature is breaking other codes, but the Irony is Change is inevitable as we don’t know the future so today's business strategy may become old in a few days so API must be revamped. As an architect the main challenge is how to cope with these changes, Answer is maintaining the Version, For a major changes you can update the API version and provide notice to your consumers that new Version is arrived so please migrate to the new version within a defined timeframe -- In this time frame as an API provider you have to maintain both versions and constantly send high important notice to your consumer to migrate there codebase call to new version and after that period you can decommission your old version.

  1. Correlate ID :We know that  a business functionality in a MIcroservice architecture is distributed over multiple microservices, So one request from a client internally fans out to many separate requests  and it is highly probable that one of the services is slow or down, But as a developer how to track which service is slow in the Microservice forest, So  somehow we need to group the requests logically. So it is always a good practice to generate a Random UUID for a client request and pass that UUID to every internal request so in a log we can track by the UUID and find the call trace accordingly.

  1. ELK Implementation : Microservice is meant for autoscaling, so in a complex business domain it is very hard to manage Log files for Microservices, say In a system there are 50 microservices involved and each has 10 instances so there will be 50*10=500 logs file will be generated so as a developer it is not possible to log into each instance and collect logs for investigate an Issue. So we need a Centralized mechanism where all logs are dumped and we can do some intelligent search over that log like find out Error or a particular exception or search by the host, correlate ID etc. ELK provide the same where E stands for ElasticSearch L is Logstash and K is Kibana. ElasticSearch dumps the logs and provide a fuzzy search capability Logstash used for collect logs from different sources and transform it. Kibana is a Graphical User Interface where a developer can search the logs as there need. Alternatively, you can use Splunk or any other opensource framework for centralizinig and analysis the log .

  1. Resiliency Implementation : As I said earlier in a Microservice architecture -- many microservices are involved to complete a business functionality. So it is very common that one of the services is not responding which will stop the whole flow to handle such scenario Resiliency is utmost important and each and every service should implement resiliency to provide a seamless experience to the end user.When a service is not responding a certain amount of time there should be a fallback path so user are not wait for the response but immediately notified about the internal problem with a default response. It is the essence of a responsive design.

Conclusion: when building a REST-based Microservices think about two perspective
1. User experience
2. Developer perspective.
A User does not like to wait and S/he needs a clear nontechnical message what is going wrong if there is an internal problem occurs so S/he can communicate with help desk properly. On another hand when it's time for support the microservice we need the all important information in the log as the incident is already happened and based on the log we are analyzing it. So always think about support perspective while developing a Microservice so you can track a flow and get sufficient information from a log.

Deep Dive to Distributed Service Registry

In my previous article, I discussed how to maintain Resiliency in Microservice/Distributed Architecture. In this tutorial, I discussed Distributed Service Registry.

What is a Distributed Service Registry ?

In a service registry pattern, all the services are registered in a Registry(A Map Data Structure). If any service needs an instance of an another service it contacts Registry and gets the service instance. Very simple isn’t it But when we deal with a distributed environment it's getting highly complex.

Let’s discuss why it is getting so complex in a distributed environment. When we are talking about the distributed environment we assume that each service is deployed in a separate web server/application server to be very simple each service runs in a separate JVM. per service per JVM. As Service registry holds the service name and its instances in a key/Value pair so we need a centralized system which only maintains this service registry and performs CRUD operations on it and each service contact this registry for getting required services, so far so good. But the problem is when we make service registry centralized i.e a separate service or JVM we bring a Single point of failure in our architecture.  Think about the scenario if the Service registry is unavailable then our application falls down as no service can contact service registry so it not gets the desired service and fails. So makes service registry centralize is a bad idea.

See the following picture and think how can we improve the architecture.

Service Registry
One thing we can do to improve the architecture if we maintain multiple instances of  Service registry so if one is unavailable other can serve the purpose then there will be no SPF(Single point of failures).

Distributed Service Registry Algorithm

The idea is like there will be one Leader Node(Service registry) all the CRUD operation applied on this node, and it is leader node duty to distribute the updated state of the registry to another Service registry(peer nodes). Although by reading it seems very easy and robust solution but it is very complex to implement and a new set of challenges comes with this architecture.

I am trying to talk about the challenges one by one

Leader Election: As the Service registries are connected to each other How we can select one node as a Leader also if that node fails basis on which we will elect another leader who will take over the responsibility?

State Synchronization: As there are multiple service registry nodes now the question is if Leader registry got updated by adding a new instance or deleting an instance then how leader node distribute that state to others and how it becomes to know that all nodes are updated with the latest state?

To solve these problems Distributed registry take helps from Algorithm.

Bully Algorithm for Leader Election :  In Bully algorithm each node/Service registry has a process Id, greater process Id will select as the Leader node. Suppose assume Leader node is not available then How Bully Algorithm selects the next leader.

Step 1: When a node discovers that Leader node is unavailable it starts the election by sending a claim message to all other nodes who have greater id than this node.

Step 2:  When greater nodes are received this claim message, they either show interest or not provide the response for the message, If they are interested they send previous node a response to stop as they will take it from here and send claim message to all other nodes which have greater Id than this.

Step 3: If it does not receive any response from the higher node, then it selects itself as Leader node and sends message to all other nodes by saying it is the Leader.

Bully Algorithm

State Synchronization by Consensus: If any instances are added or deleted from the leader node it has to be reflected on other nodes so every service registry got the same value. To decide the correct state of the registry majority of the node has to agree on a certain value/latest value.  And faulty nodes needs to update its state to be sync with other nodes and system will be eventually consistent.

Consensus works on two major criteria

Value proposition : Each node/Service registry propose a value..
Agreement: Majority of the node should agree on the proposed value.

A consensus algorithm works in following way

    Every node decides some value.
    If all registry proposes the same state say  X, then all correct nodes decide X as an updated state.
    Every correct node decides at most one value, and if it decides some value X then X  must have been proposed by Leader Node.
    Every correct process must agree on the same value X.

Conclusion:  To maintains a Highly available and fault tolerant service registry in a distributed system Consensus and Leader election is the main techniques to be followed.

Deep Dive to Hystrix Resiliency Maintenance

In my previous Microservice Tutorial , I have shown How We can use Hystrix as a circuit Breaker and gives our service breathing room to recover itself.

In this tutorial, we will deep dive into Hystrix architecture and will eventually get to know How it manages resiliency in a complex system.

Let assume there are more than 100 Microservices In a complex system and to perform any business functionality more or less it depends on 5-10 underlying Microservices(approximately).

Generally, There are three types of services are found in aforesaid Microservice based architecture
  1. Core Services
  2. Aggregator service.
  3. Edge Service.

Where the Edge service takes the request from UI/client then it forwards to an Aggregator service, Then Aggregator service fans out the request and sends the calls to core-services and Eventually collects responses from Core services and sends back to the Edge Service, Which actually sends the response to UI/Client.

hystrix circuit breaker architechture
Imagine we have a complex microservice based system where a single incoming request distributed over 10 dependent services, so one request then split to 10 internal requests (Fans out) and goes to individual core service, Now if there are 100 concurrent requests per seconds in a peak time.

So there are 100*10 = 10000 internal requests will be created per second.  Think about the load of the system, Even a minute of delay response from one of the core services can create a bottleneck as in web server(Tomcat, Jetty) there are fixed numbers of Threads in the Thread pool.
and we have 60*100*1=6000 requests are waiting for an individual service for a minute, So We can assume failure is inevitable even if you have 99.99% of uptime for a service-- because when we do a service call we have to depend on two external entities Network and Socket and the sad part is --The control of Network not in developers hand. To make the system resilience we have to deal with this phenomena and have to plan according to it.

Now let see the options we have to deal the above scenario.
Non Blocking request: If Requests are not waiting for the service irrespective of the service is available or not.Only then all request will be either served or rejected. If we follow this then there will be no requests in the waiting queue as if the service is not available or response is getting delayed immediately that request will be rejected and a fallback/default response will be provided to the caller service.  

So request has been short-circuited and the fallback path invoked -- dependent service gets a chance to recover itself. We know Hystrix do this stuff for us. But it is not a very easy task to implement. Internally a complex flow has been maintained by Hystrix to offer Resilient Microservice Architecture.

We will discuss that workflow now.

Deep Dive to Hystrix Workflow :

ThreadPool : Hystrix maintains a Thread pool so that when a service calls to another service --The call will be assigned to one of the thread from that Thread pool, Basically to handle the concurrency It offers Worker Threads. Each Thread has a timeout limit, If the response is not returned within the Thread time out mentioned then the request will be treated as a failure and fallback path will be invoked.

Network and Socket Timeout :  When a Service call to another service, Two things are happening, the data travels through Network and service talks each other via socket. When data travels through Network it has to cross multiple hops so if a hop is slow or down then we can feel network slowness so we have to calculate an average time for a request/response travel time as well as Socket read/write time to calculate thread timeouts in an optimum way.

Hystrix Workflow :  

  1. when a service calls a dependent service Hystrix stepped in and checks Hystrix Circuit is open or not if it is open then it returns to the fallback path.
  2. If Circuit is not open then  Hystrix checks are all the worker Threads in the pools are in use if so it returns immediately and the fallback path is invoked.
  3. If threads are available in pools then it assigns one free thread and waits for the response from dependent service If response time is greater than thread timeout then it again invokes fallback path.
  4. If all is well then the actual response is back to the caller service.
  5. If for a certain amount of time (default 10 sec) if 50% of the request is failed then Hystrix opens the circuit.

    Hystrix workflow

It is very important to choose Thread pool and Thread time out wisely unless necessary implications are coming into the picture, if Thread pool count is large say it is greater the database connection thread pool  then in spite of the hystrix all the Connection are consumed by the service again if the Thread pools count is less then we can’t serve many request concurrently may be that will cause a performance hit. So choose the Thread pool count based on your system configuration.

Same for Timeouts if the timeout is big then there will more incoming requests in the queue but if Timeout is short then all calls are timeouts even if your service is healthy because the average response time of your service is greater than the Timeout.

Conclusion :  
Nowadays whatever architecture (Microservices/Monolith) you opt for your project, Resiliency is the foremost criteria, it is not an add on feature good to have it is the first class citizen now.
So Resiliency is mandatory -- when you are planning and developing your project also think about Resilience spend time on it, think how you can offer a Resilient Architecture. There are many strategies to achieve resilience like  Request Timeout, Maximum Retries, failing rate etc. So while developing your project Identify the resource intensive areas try to build it in that fashion, so that there is no resource hogging happens. Also, provide a Fallback mechanism so you can avoid cascade failures and eventually achieve a robust system.

Microservices Communication: Hystrix As The Jon Snow

In the previous  Microservice Tutorial ,we have learned about How to use Zuul API gateway. In this Tutorial , we will learn about Hystrix which act as a Circuit breaker of the services. Circuit breaker -- the term is new to you in terms of Software Architecture? Don’t  worry I will discuss in detail regarding the same.

But before that Let's discuss with a well-known incident who are working in Support Project(Monolith).

Birth of the Night’s King:

Folks who are in On call Support how many times it happens, You got a call in the night saying, System is not responding, it is a priority 1 issue. You wake up in odd time opens your laptop,
Check health check pages found some servers are down, Some servers have a huge memory spike.So immediately you take a thread dump and all the necessary details then restarted all the server in the pool. After restarting you will found things are quite normal and go to sleep if you are lucky enough then you got a good sleep but if you unlucky again in the morning you may face the same scenario.

So, Next day when you and your team researching why this happened , what is the root cause of the birth of White walkers, which ate up all precious resources and the server eventually got unresponsive.

You may find there is a resource leak in somewhere may be in code level--Someone forgot to close a precious resource, like connection. Or there were an unnecessary open threads. Or there is a blocking session in the database etc.

But hold on why we can’t find this resource leak/birth of Night King at the first time? Why the Night’s king grows up silently and when he is in action then we got notified?

So, It opens our eyes that there is a problem in our Architecture(King's Landing), there are no techniques for early detection of a Resource leak( No Jon Snow to Watch the Wall!!!).

A practical Scenario.
Let examine a simple scenario which may cause this type of scenario, Say we have an architecture where service A and Service B dependent on Service C.  Both Service A and B query  Service C API to get some result. Now Service C is used the underlying database to fetch result but unfortunately, programmer does not close the connection in finally block he does it in the try block.

Now in production, if any error occurs in Service C regarding Database connection/query, It does not release the connection so Connections are not back in Connection pools(Connection pool has finite resources).  But Service A and B does not aware of this scenario it queries Service C as the request comes and Service C ate up one by one free Connection from Connections pool. So after a certain time, all Connections are eaten up by Service C and there is no connection available as free in Connection pool and Night Walkers(Service C) eaten up your System. After restarting all the server its gives you relief for sometimes but if the Service C error continues (Programming fault) then again you might have to wake up in the morning (Night King’s is back).

It all happens due to Service A and B, They are not aware the Service C is not responding the way it should be. If they aware they just simply stop the querying then We should not have faced this situation. Here the concept of the Circuit breaker(In GOT NightWatch) comes up.

Resourse Leak--Birth of Night King

Circuit Breaker Pattern :

The Circuit breaker concept is same as an electrical circuit When the Circuit is closed electrons flow through the circuit but if any unusual thing happens it trips the circuit, Circuit is opened up so there is no flow of electrons through the circuit. It provides the circuit to recover itself after a certain amount of time, Circuit closes and flows of the electrons continues.

Netflix hystrix is such a framework which works on the same principle.

It always monitoring the calls so if any dependent service response is greater than the threshold limit it trips the circuit,  so no further calls will not flow to the dependent service. It gives dependent service to recover itself. In that time there is a fallback policy, all the request goes to that fallback path. After a certain amount of time again the circuit is closed and request flows as it is.

Please note that we can enable Hystrix(Jon Snow-- The King of North) in Spring cloud. previously, it supports only Service and Component level @Service or @Component. With the latest, it supports in @Controller also.

Hystrinx As JonSnow

Coding Time:

Lets recap the EmployeeDashBoardService , It calls EmployeeSearchService to find employee based on the id. Currently ig EmployeeSearch service is unavailable then EmployeeDashBoard Service does not got the result and show the error. But we want to show a Default Employee Value if EmployeeSeracgservice is not available so to incorporate the change in EmployeeDashboardService we have to do the following changes.

Step 1 : Add Hystrix plugin into pom.xml


Step 2 : Add @EnableCircuitBreaker on top of  EmployeeDashBoardService, to enable Hystrix for this service.

package com.example.EmployeeDashBoardService;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.netflix.feign.EnableFeignClients;
import org.springframework.context.annotation.Bean;
import org.springframework.web.client.RestTemplate;

public class EmployeeDashBoardService {

   public static void main(String[] args) {
      SpringApplication.run(EmployeeDashBoardService.class, args);

   public RestTemplate restTemplate(RestTemplateBuilder builder) {
      return builder.build();

Step 3 :  Now we will change the EmployeeInfoController.java so it can be Hystrix enable.

package com.example.EmployeeDashBoardService.controller;

import java.util.Collection;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import com.example.EmployeeDashBoardService.domain.model.EmployeeInfo;
import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.EurekaClient;
import com.netflix.discovery.shared.Application;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

public class EmployeeInfoController {
    private RestTemplate restTemplate;
    private EurekaClient eurekaClient;
    private String employeeSearchServiceId;

   public EmployeeInfo findme(@PathVariable Long myself){
      Application application = eurekaClient.getApplication(employeeSearchServiceId);
       InstanceInfo instanceInfo = application.getInstances().get(0);
       String url = "http://"+instanceInfo.getIPAddr()+ ":"+instanceInfo.getPort()+"/"+"employee/find/"+myself;
       System.out.println("URL" + url);
       EmployeeInfo emp = restTemplate.getForObject(url, EmployeeInfo.class);
       System.out.println("RESPONSE " + emp);
       return emp;
   private EmployeeInfo defaultMe(Long id){
      EmployeeInfo info = new EmployeeInfo();
      info.setName("Hystrix fallback");
      return info;
   public  Collection<EmployeeInfo> findPeers(){
      Application application = eurekaClient.getApplication(employeeSearchServiceId);
       InstanceInfo instanceInfo = application.getInstances().get(0);
       String url = "http://"+instanceInfo.getIPAddr()+ ":"+instanceInfo.getPort()+"/"+"employee/findall";
       System.out.println("URL" + url);
       Collection<EmployeeInfo> list= restTemplate.getForObject(url, Collection.class);
        System.out.println("RESPONSE " + list);
       return list;

Carefully note the method named findme, It actually calls the EmployeeService, So I use a
@HystrixCommand(fallbackMethod="defaultMe") annotation on top of this method, by doing we instruct Spring to proxy this method, so that if any error will occur or Employee Service is not available it goes through the fallback method and called it, and shows the default value rather than showing an error

For that, we add the attribute fallbackmethod=defaultMe where default me is the default method, Please note that method signature and return type must be the same of the findme method. Unless you facing an error no Such method found. It internally uses Spring AOP which intercept the method call.

If the EmployeeService is not available then it calls defaultMe Method and returns the default employee.

Let's check the same,

Start Config server, Eureka server, and EmployeeDashBoard service, intentionally I not started the EmployeeSearchService so it is unavailable when we call findme method

If you hit the following URL

You will see the following response as the Actual EmployeeSearchService is down.

  "name":"Hystrix fallback",