Deep Dive to Hystrix Resiliency Maintenance

In my previous Microservice Tutorial , I have shown How We can use Hystrix as a circuit Breaker and gives our service breathing room to recover itself.

In this tutorial, we will deep dive into Hystrix architecture and will eventually get to know How it manages resiliency in a complex system.

Let assume there are more than 100 Microservices In a complex system and to perform any business functionality more or less it depends on 5-10 underlying Microservices(approximately).

Generally, There are three types of services are found in aforesaid Microservice based architecture
  1. Core Services
  2. Aggregator service.
  3. Edge Service.


Where the Edge service takes the request from UI/client then it forwards to an Aggregator service, Then Aggregator service fans out the request and sends the calls to core-services and Eventually collects responses from Core services and sends back to the Edge Service, Which actually sends the response to UI/Client.

hystrix circuit breaker architechture
Imagine we have a complex microservice based system where a single incoming request distributed over 10 dependent services, so one request then split to 10 internal requests (Fans out) and goes to individual core service, Now if there are 100 concurrent requests per seconds in a peak time.

So there are 100*10 = 10000 internal requests will be created per second.  Think about the load of the system, Even a minute of delay response from one of the core services can create a bottleneck as in web server(Tomcat, Jetty) there are fixed numbers of Threads in the Thread pool.
and we have 60*100*1=6000 requests are waiting for an individual service for a minute, So We can assume failure is inevitable even if you have 99.99% of uptime for a service-- because when we do a service call we have to depend on two external entities Network and Socket and the sad part is --The control of Network not in developers hand. To make the system resilience we have to deal with this phenomena and have to plan according to it.



Now let see the options we have to deal the above scenario.
Non Blocking request: If Requests are not waiting for the service irrespective of the service is available or not.Only then all request will be either served or rejected. If we follow this then there will be no requests in the waiting queue as if the service is not available or response is getting delayed immediately that request will be rejected and a fallback/default response will be provided to the caller service.  

So request has been short-circuited and the fallback path invoked -- dependent service gets a chance to recover itself. We know Hystrix do this stuff for us. But it is not a very easy task to implement. Internally a complex flow has been maintained by Hystrix to offer Resilient Microservice Architecture.

We will discuss that workflow now.

Deep Dive to Hystrix Workflow :

ThreadPool : Hystrix maintains a Thread pool so that when a service calls to another service --The call will be assigned to one of the thread from that Thread pool, Basically to handle the concurrency It offers Worker Threads. Each Thread has a timeout limit, If the response is not returned within the Thread time out mentioned then the request will be treated as a failure and fallback path will be invoked.


Network and Socket Timeout :  When a Service call to another service, Two things are happening, the data travels through Network and service talks each other via socket. When data travels through Network it has to cross multiple hops so if a hop is slow or down then we can feel network slowness so we have to calculate an average time for a request/response travel time as well as Socket read/write time to calculate thread timeouts in an optimum way.

Hystrix Workflow :  

  1. when a service calls a dependent service Hystrix stepped in and checks Hystrix Circuit is open or not if it is open then it returns to the fallback path.
  2. If Circuit is not open then  Hystrix checks are all the worker Threads in the pools are in use if so it returns immediately and the fallback path is invoked.
  3. If threads are available in pools then it assigns one free thread and waits for the response from dependent service If response time is greater than thread timeout then it again invokes fallback path.
  4. If all is well then the actual response is back to the caller service.
  5. If for a certain amount of time (default 10 sec) if 50% of the request is failed then Hystrix opens the circuit.

    Hystrix workflow


It is very important to choose Thread pool and Thread time out wisely unless necessary implications are coming into the picture, if Thread pool count is large say it is greater the database connection thread pool  then in spite of the hystrix all the Connection are consumed by the service again if the Thread pools count is less then we can’t serve many request concurrently may be that will cause a performance hit. So choose the Thread pool count based on your system configuration.

Same for Timeouts if the timeout is big then there will more incoming requests in the queue but if Timeout is short then all calls are timeouts even if your service is healthy because the average response time of your service is greater than the Timeout.

Conclusion :  
Nowadays whatever architecture (Microservices/Monolith) you opt for your project, Resiliency is the foremost criteria, it is not an add on feature good to have it is the first class citizen now.
So Resiliency is mandatory -- when you are planning and developing your project also think about Resilience spend time on it, think how you can offer a Resilient Architecture. There are many strategies to achieve resilience like  Request Timeout, Maximum Retries, failing rate etc. So while developing your project Identify the resource intensive areas try to build it in that fashion, so that there is no resource hogging happens. Also, provide a Fallback mechanism so you can avoid cascade failures and eventually achieve a robust system.

Microservices Communication: Hystrix As The Jon Snow

In the previous  Microservice Tutorial ,we have learned about How to use Zuul API gateway. In this Tutorial , we will learn about Hystrix which act as a Circuit breaker of the services. Circuit breaker -- the term is new to you in terms of Software Architecture? Don’t  worry I will discuss in detail regarding the same.


But before that Let's discuss with a well-known incident who are working in Support Project(Monolith).


Birth of the Night’s King:

Folks who are in On call Support how many times it happens, You got a call in the night saying, System is not responding, it is a priority 1 issue. You wake up in odd time opens your laptop,
Check health check pages found some servers are down, Some servers have a huge memory spike.So immediately you take a thread dump and all the necessary details then restarted all the server in the pool. After restarting you will found things are quite normal and go to sleep if you are lucky enough then you got a good sleep but if you unlucky again in the morning you may face the same scenario.

So, Next day when you and your team researching why this happened , what is the root cause of the birth of White walkers, which ate up all precious resources and the server eventually got unresponsive.

You may find there is a resource leak in somewhere may be in code level--Someone forgot to close a precious resource, like connection. Or there were an unnecessary open threads. Or there is a blocking session in the database etc.

But hold on why we can’t find this resource leak/birth of Night King at the first time? Why the Night’s king grows up silently and when he is in action then we got notified?

So, It opens our eyes that there is a problem in our Architecture(King's Landing), there are no techniques for early detection of a Resource leak( No Jon Snow to Watch the Wall!!!).


A practical Scenario.
Let examine a simple scenario which may cause this type of scenario, Say we have an architecture where service A and Service B dependent on Service C.  Both Service A and B query  Service C API to get some result. Now Service C is used the underlying database to fetch result but unfortunately, programmer does not close the connection in finally block he does it in the try block.

Now in production, if any error occurs in Service C regarding Database connection/query, It does not release the connection so Connections are not back in Connection pools(Connection pool has finite resources).  But Service A and B does not aware of this scenario it queries Service C as the request comes and Service C ate up one by one free Connection from Connections pool. So after a certain time, all Connections are eaten up by Service C and there is no connection available as free in Connection pool and Night Walkers(Service C) eaten up your System. After restarting all the server its gives you relief for sometimes but if the Service C error continues (Programming fault) then again you might have to wake up in the morning (Night King’s is back).

It all happens due to Service A and B, They are not aware the Service C is not responding the way it should be. If they aware they just simply stop the querying then We should not have faced this situation. Here the concept of the Circuit breaker(In GOT NightWatch) comes up.

Resourse Leak--Birth of Night King


Circuit Breaker Pattern :

The Circuit breaker concept is same as an electrical circuit When the Circuit is closed electrons flow through the circuit but if any unusual thing happens it trips the circuit, Circuit is opened up so there is no flow of electrons through the circuit. It provides the circuit to recover itself after a certain amount of time, Circuit closes and flows of the electrons continues.

Netflix hystrix is such a framework which works on the same principle.

It always monitoring the calls so if any dependent service response is greater than the threshold limit it trips the circuit,  so no further calls will not flow to the dependent service. It gives dependent service to recover itself. In that time there is a fallback policy, all the request goes to that fallback path. After a certain amount of time again the circuit is closed and request flows as it is.

Please note that we can enable Hystrix(Jon Snow-- The King of North) in Spring cloud. previously, it supports only Service and Component level @Service or @Component. With the latest, it supports in @Controller also.

Hystrinx As JonSnow



Coding Time:

Lets recap the EmployeeDashBoardService , It calls EmployeeSearchService to find employee based on the id. Currently ig EmployeeSearch service is unavailable then EmployeeDashBoard Service does not got the result and show the error. But we want to show a Default Employee Value if EmployeeSeracgservice is not available so to incorporate the change in EmployeeDashboardService we have to do the following changes.


Step 1 : Add Hystrix plugin into pom.xml



<dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-starter-hystrix</artifactId>
</dependency>


Step 2 : Add @EnableCircuitBreaker on top of  EmployeeDashBoardService, to enable Hystrix for this service.

package com.example.EmployeeDashBoardService;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.netflix.feign.EnableFeignClients;
import org.springframework.context.annotation.Bean;
import org.springframework.web.client.RestTemplate;

@EnableDiscoveryClient
@EnableCircuitBreaker
@EnableFeignClients
@SpringBootApplication
public class EmployeeDashBoardService {

   public static void main(String[] args) {
      SpringApplication.run(EmployeeDashBoardService.class, args);
   }

   @Bean
   public RestTemplate restTemplate(RestTemplateBuilder builder) {
      return builder.build();
   }
}


Step 3 :  Now we will change the EmployeeInfoController.java so it can be Hystrix enable.

package com.example.EmployeeDashBoardService.controller;

import java.util.Collection;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import com.example.EmployeeDashBoardService.domain.model.EmployeeInfo;
import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.EurekaClient;
import com.netflix.discovery.shared.Application;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

@RefreshScope
@RestController
public class EmployeeInfoController {
   
    @Autowired
    private RestTemplate restTemplate;
   
    @Autowired
    private EurekaClient eurekaClient;
   
    @Value("${service.employyesearch.serviceId}")
    private String employeeSearchServiceId;


   @RequestMapping("/dashboard/{myself}")
   @HystrixCommand(fallbackMethod="defaultMe")
   public EmployeeInfo findme(@PathVariable Long myself){
      Application application = eurekaClient.getApplication(employeeSearchServiceId);
       InstanceInfo instanceInfo = application.getInstances().get(0);
       String url = "http://"+instanceInfo.getIPAddr()+ ":"+instanceInfo.getPort()+"/"+"employee/find/"+myself;
       System.out.println("URL" + url);
       EmployeeInfo emp = restTemplate.getForObject(url, EmployeeInfo.class);
       System.out.println("RESPONSE " + emp);
       return emp;
   }
   
   private EmployeeInfo defaultMe(Long id){
      EmployeeInfo info = new EmployeeInfo();
      info.setEmployeeId(id);
      info.setName("Hystrix fallback");
      info.setCompanyInfo("Netfilx");
      info.setDesignation("Fallback");
      return info;
   }
   
   
   @RequestMapping("/dashboard/peers")
   public  Collection<EmployeeInfo> findPeers(){
      Application application = eurekaClient.getApplication(employeeSearchServiceId);
       InstanceInfo instanceInfo = application.getInstances().get(0);
       String url = "http://"+instanceInfo.getIPAddr()+ ":"+instanceInfo.getPort()+"/"+"employee/findall";
       System.out.println("URL" + url);
       Collection<EmployeeInfo> list= restTemplate.getForObject(url, Collection.class);
        System.out.println("RESPONSE " + list);
       return list;
   }
}




Carefully note the method named findme, It actually calls the EmployeeService, So I use a
@HystrixCommand(fallbackMethod="defaultMe") annotation on top of this method, by doing we instruct Spring to proxy this method, so that if any error will occur or Employee Service is not available it goes through the fallback method and called it, and shows the default value rather than showing an error

For that, we add the attribute fallbackmethod=defaultMe where default me is the default method, Please note that method signature and return type must be the same of the findme method. Unless you facing an error no Such method found. It internally uses Spring AOP which intercept the method call.

If the EmployeeService is not available then it calls defaultMe Method and returns the default employee.

Let's check the same,

Start Config server, Eureka server, and EmployeeDashBoard service, intentionally I not started the EmployeeSearchService so it is unavailable when we call findme method

If you hit the following URL




You will see the following response as the Actual EmployeeSearchService is down.

{
  "employeeId":2,
  "name":"Hystrix fallback",
  "practiceArea":null,
  "designation":"Fallback",
  "companyInfo":"Netfilx"
}


Microservices Communication: Zuul API Gateway

In the previous Microservice tutorial, we have learned How Microservices handling Client side Load balancing.

Here, we will discuss Zuul proxy.

What is a Zuul Proxy ?

The Crux of Microservices pattern is to create an Independent service which can be scaled, deployed independently. So in a complex Business Domain-- more than  50-100 Microservices is very common. Let's Imagine a System where we have fifty Microservices now we have to implement a UI which is kind of a dashboard -- So it calls Multiple services to fetch the important information and show that in UI.

From UI developer perspective --  to collect information from fifty underlying Microservices it has to call fifty Rest API as each Microservice exposes a Rest API for Communication. So the client has to know the details of all Rest API and URL pattern /port to call them. Certainly, it is not sound like a good design. It is kind of a breaching of  Encapsulation, UI has to know all Microservices server/port details to query the services.

Moreover think about the Common aspects of a Web programming like CORS, Authentication, Security, Monitoring in terms of this design -- Each Microservice team has to develop all these aspects into its own service so same code has been replicated over fifty Microservices, Change in the Authentication requirements or CORS policy rippled overall services. It is against of DRY principle. So This type of design is very error prone and rigid. To make it robust It has to be changed in such way so that we have only one entry point where all common aspects code are written and client communicates with that common service. Here the Zuul(The Gatekeeper/demigod ) Concept pops up.


microservices architecture-Zuul




Edge Service: Zuul acts as an API gateway or Edge service. It receives all the request comes from UI and then delegates the request to internal Microservices. So we have to create a brand new Microservice which is Zuul enabled and this service sits on top of all other Microservices. It acts as an Edge service or Client facing service. Its Service API should be exposed to client/UI, Client calls this service as a proxy for Internal Microservice then this service delegates the request to appropriate service.

The advantage of this type of design is common aspects like CORS, Authentication, Security can be put into a centralized service, so all common aspects will be applied on each request, and if any changes occur in the future we just have to update the business logic of this Edge Service.


Also, we can implement any Routing rules or any Filter implementation says we want to append a special tag into the request header before it reaches to internal Microservices we can do it in the Edge service.

As Edge service itself is a Microservice so it can be independently scalable, deployable. So we can perform some load testing also.





edge services








Coding Time:

Create a project using http://start.spring.io/ named it as EmployeeZuluService select Zuul and Eureka Discovery module as Edge service itself a Eureka client.

The pom.xml  will look like following


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>com.example</groupId>
   <artifactId>EmployeeZuulService</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <packaging>jar</packaging>

   <name>EmployeeZuulService</name>
   <description>Demo project for Spring Boot</description>

   <parent>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-parent</artifactId>
      <version>1.5.6.RELEASE</version>
      <relativePath/> <!-- lookup parent from repository -->
   </parent>

   <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
      <java.version>1.8</java.version>
      <spring-cloud.version>Dalston.SR2</spring-cloud.version>
   </properties>

   <dependencies>
      <dependency>
          <groupId>org.springframework.cloud</groupId>
          <artifactId>spring-cloud-starter-eureka</artifactId>
      </dependency>
      <dependency>
          <groupId>org.springframework.cloud</groupId>
          <artifactId>spring-cloud-starter-zuul</artifactId>
      </dependency>

      <dependency>
          <groupId>org.springframework.boot</groupId>
          <artifactId>spring-boot-starter-test</artifactId>
          <scope>test</scope>
      </dependency>
   </dependencies>

   <dependencyManagement>
      <dependencies>
          <dependency>
              <groupId>org.springframework.cloud</groupId>
              <artifactId>spring-cloud-dependencies</artifactId>
              <version>${spring-cloud.version}</version>
              <type>pom</type>
              <scope>import</scope>
          </dependency>
      </dependencies>
   </dependencyManagement>

   <build>
      <plugins>
          <plugin>
              <groupId>org.springframework.boot</groupId>
              <artifactId>spring-boot-maven-plugin</artifactId>
          </plugin>
      </plugins>
   </build>


</project>



Next, rename the application.properties to bootstrap.properties and write down the following configuration there.




spring.application.name=EmployeeAPIGateway
eureka.client.serviceUrl.defaultZone:http://localhost:9091/eureka
server.port=8084
security.basic.enable: false   
management.security.enabled: false
zuul.routes.employeeUI.serviceId=EmployeeDashBoard
zuul.host.socket-timeout-millis=30000

As this is a Microservice so its need to be registered in Eureka server so it can be aware of other services.

Also, this service runs on port 8084.

Pay attention to lats two properties
zuul.routes.employeeUI.serviceId=EmployeeDashBoard

By this, we say if any request comes to API gateway in form of

/employeeUI it will redirect to EmployeeDashBoard Microservice

So if you hit the following URL

It will redirect to


Note that UI developer only aware of the Gateway service port, it is Gateway service responsibility to route the service to appropriate Microservice.

NB: Zuul can be implemented without Eureka server, In that case, you have to provide the exact URL of the service where it will be redirected.



zuul.host.socket-timeout-millis=30000 -- by this we instruct Spring boot to wait response for 30000 ms unless Zuuls internal Hystrix timeout will kickoff and showing you the error.

Now add @EnableZuulproxy and @EnableDiscoveryClient on top of  EmployeeZuulServiceApplication class

package com.example.EmployeeZuulService;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.netflix.zuul.EnableZuulProxy;

@EnableZuulProxy
@EnableDiscoveryClient
@SpringBootApplication
public class EmployeeZuulServiceApplication {

   public static void main(String[] args) {
      SpringApplication.run(EmployeeZuulServiceApplication.class, args);
   }
}



Our API gateway service is ready.

Noe starts the config server, Eureka server, EmployeeSearchService, EmployeeDasboradService and, EmployeeZuulService respectively.

If you hit the following URL


You will see the following output

{
 "employeeId": 1,
 "name": "Shamik  Mitra",
 "practiceArea": "Java",
 "designation": "Architect",
 "companyInfo": "Cognizant"
}