OFBizian: Hystrix

Showing posts with label Hystrix. Show all posts

It takes more than a Circuit Breaker to create a resilient application

Tuesday, May 16, 2017 1 comment

(This post was originally published on Red Hat Developers, the community to learn, code, and share faster. To read the original post, click here.)

Topics such as application resiliency, self-healing, antifragility are my area of interest. I've been trying to distinguish, define, and visualize these concepts, and create solutions with these characteristics.

Software characteristics

However, I notice over and over again, that there are various conference talks about resiliency, self-healing, and antifragility and often they lazily conclude that Neflix OSS Hystrix is the answer to all of that. It is important to remember that conferences speakers are overly optimistic, wishful thinkers, and it takes more than a Circuit Breaker to create a resilient and self-healing application.

Conference level Resiliency

So what does a typical resiliency pitch look like: use timeouts, isolate in bulkheads, and of course apply the circuit breaker pattern. Having implemented the circuit breaker pattern twice in Apache Camel (first a homegrown version, then using Hystrix) I have to admit that circuit breaker is a perfect conference material with nice visualization options and state transitions. (I will spare explaining to you how a circuit breaker works here, I'm sure you will not mind). And typically, such a pitch concludes that the answer to all of the above concerns is Hystrix. Hurrah!

Get out of the Process

I agree with all the suggestions above such as timeout, bulkhead and circuit breaker. But that is a very narrow sighted view. It is not possible to make an application resilient and self-healing (not to mention antifragile) only from within. For a truly resilient and self-healing architecture you need also isolation, external monitoring, and autonomous decision making. What do I mean by that?

If you read Release It book carefully, you will realize that bulkhead pattern is not about thread pools. In my Camel Design Patterns book, I've explained that there are multiple levels to isolate and apply the bulkhead pattern. Thread Pools with Hystrix is only the first level.

Tools for bulkhead pattern

Hystrix uses thread pools to ensure that the CPU time dedicated to your application process is better distributed among the different threads of the application. This will prevent a CPU intensive failure from spreading beyond a thread pool and other parts of the service still gets some CPU time.
But what about any other kind of failure that can happen in an application that is not contained in a thread pool? What about if there is a memory leak in the application or some sort of infinite loop or a fork bomb? For these kinds of failures, you need to isolate the different instances of your service through processes resource isolation. Something that is provided by modern container technologies and used as the standard deployment unit nowadays. In practical term, this means isolating processes on the same host using containers by setting memory and CPU limits.

Once you have isolated the different service instances and ensured failure containment among the different service processes through containers, the next step is to protect from VM/Node/Host failures. In a cloud environment, VMs can come and go even more often, and with that, all process instances on the VM would also vanish. That requires distributing the different instances of your service into different VMs and contain VMs failures from bringing down the whole application.

All VMs run on some kind of hardware and it is also important to isolate hardware failures too. If an application is spread across multiple VMs but all of them depend on a shared hardware unit, a failure on the hardware can still affect the whole application.
A container orchestrator such as Kubernetes can spread the service instances on multiple nodes using anti-affinity feature. Even further, anti-affinity can spread the instances of a service across hardware racks, availability zones, or any other logical grouping of hardware to reduce correlated failures.

Self-Healing from What?

The circuit breaker pattern has characteristics for auto-recovery and self-healing. An open or half-open circuit breaker will periodically let certain requests reach the target endpoint and if these succeed, the circuit breaker will transition to its healthy state.
But a circuit breaker can protect and recover only from failures related to service interactions. To recover from other kinds of failures that we mentioned previously, such as memory leaks, infinite loops, fork bombs or anything else that may prevent a service from functioning as intended, we need some other means of failure detection, containment, and self-healing. This where container health checks come into the picture.
Health checks such as Kubernetes liveness and readiness probes will monitor and detect failures in the services and restart them if required. That is a pretty powerful feature, as it allows polyglot services to be monitored and recovered in a unified way.
Restarting a service will help only to recover from failures. But what about coping with other kinds of behavior such as high load? Kubernetes can scale up and down the services horizontally or even the underlying infrastructure as demonstrated here.

AWS outage handled by Kubernetes

Health checks and container restarts can help with individual services failures, but what happens if the whole node or rack fails? This is where the Kubernetes scheduler kicks in and places the services on other hosts that have enough capacity to run them.
As you can see here, in order to have a system that can self-heal from different kinds of failures, there is a need for a way more resiliency primitives than a circuit breaker. The integrated toolset in Kubernetes in the form of container resource isolation, health checks, graceful termination and start up, container placement, autoscaling, etc do help achieve application resiliency, self-healing and even blend into antifragility.

Let the Platform Handle it

There are many examples of developer and application responsibilities that have shifted from the application into the platform. With Kubernetes some examples are:

Application health checks and restarts are handled by the platform.
Application placements are automated and performed by the scheduler.
The act of updating a service with a newer version is covered by Deployments.
Service discovery, which was an application level concern has moved into the platform (through Services).
Managing Cron jobs has shifted from being an application responsibility to the platform (through Kuberneres CronJobs).

In a similar fashion, the act of performing timeouts, retries, circuit breaking is shifting from the application into the platform. There is a new category of tools referred to as Service Mesh and with the more popular members at this moment being:

These tools provide features such as:

Retry
Circuit-breaking
Latency and other metrics
Failure- and latency-aware load balancing
Distributed tracing
Protocol upgrade
Version aware routing
Cluster failover, etc

That means, very soon, we won't need an implementation of the circuit breaker as part of every microservice. Instead, we will be using one as a sidecar pattern or host proxy. In either case, these new tools will shift all of the network-related concerns where they belong: from L7 to L4/5.

Image from Christian Posta

When we talk about Microservices at scale, that is the only possible way to manage complexity: automation and delegation to the platform. My colleague and friend @christianposta has blogged about Service Mesh in depth here.

A Resiliency Toolkit

Without scaring you death, below is a collection of practises and patterns for achieving a resilient architecture by Uwe Friedrichsen.

Resiliency patterns by Uwe Friedrichsen

Do not try to use all of them, and do not try to use Hystrix all the time. Consider which of these patterns will apply to your application and use them cautiously, only when a pattern benefit outweighs its cost.
At the next conference, when somebody tries to sell you a circuit breaker talk, tell them that this is only the starter and ask for the main course.
Follow me @bibryam for future blog posts on related topics.

From Fragile to Antifragile Software

Thursday, July 21, 2016 1 comment

(This post was originally published on Red Hat Developers, the community to learn, code, and share faster. To read the original post, click here.)

One of my favourite books is Antifragile by Nassim Taleb where the author talks about things that gain from disorder. Nacim introduces the concept of antifragility which is similar to hormesis in biology or creative destruction (accessible version creative-destruction) in economics and analyses it characteristics in great details. If you find this topic interesting, there are also other authors who have examined the same phenomenon in different industries such as Gary Hamel, C. S. Holling, Jan Husdal. The concept of antifragile is the opposite of the fragile. A fragile thing such as a package of wine glasses is easily broken when dropped but an antifragile object would benefit from such stress. So rather than marking such a box with "Handle with Care", it would be labelled "Please Mishandle" and the wine would get better with each drop (would be awesome woulnd't it).

It didn't take long for the concept of antifragility to be used also for describing some of the software development principles and architectural styles. Some would say that SOLID prinsiples are antifragile, some would say that microservices are antifragile, and some would say software systems cannot be antifragile ever. This article is my take on the subject.

According to Taleb, fragility, robustness, resilience and antifragility are all very different. Fragility involves loss and penalisation from disorder. Robustness is enduring to stress with no harm nor gain. Resilience involves adapting to stress and staying the same. And antifragility involves gain and benefit from disorder. If we try to relate these concepts and their characteristics to software systems, one way to define them would be as the following.

Disorder

Different systems are affected by different kind of disorder, such as stress, time, change, volatility, debt, etc. For software systems, the main disorder is the change. The business is in a constant changing environment and the software needs to adapt to the business needs quickly. That is implementing new requirements, changes to existing functionality, even creating new business opportunities through innovation. A software system has to change all the time, otherwise it is obsolete.
Apart from development time challenges, there are also runtime challenges for software systems too. Software systems are created and exists to add value by running in a production environment. And while doing so, they are under stress by end users and other systems. This is another kind of disorder, that software systems have to deal with.

Fragile

This property describes systems that suffer when put under stress. Imagine a software project that is not easy to change at development time. For example if it is not easy to extend, modify and deploy to the production environment. Or a system that is not able to handle unexpected user inputs or external system failures and breaks easily. That's is a fragile system that is harmed by stress and penalised by change, a good example of fragile.

Robust

This is a system that can continue functioning in the presence of internal and external challenges without adaptation. Every system is robust up to a level. For example a bottle is robust until it reaches the level of breaking point. A software system can be made robust to handle unanticipated user input, or failures in external systems. For example handling NPE in Java, using try-catch-finally statements to handle unreliable invocations, having a thread pool to handle concurrent users, creating network connections using timeouts, are all examples of robustness for a software system. But a robust system doesn't adapt to a chaining environment and when the stress and change threshold is reached it would break. The qualities that define the robustness vary from system to system. An ATM for example needs to be robust and not fail in the middle of a transaction, whereas, a media streaming service can drop a frame or two, as long as it continues streaming under stress.

Resilient

A system is resilient when it can adapt to internal and external challenges by changing its method of operation. The key here is that the system is responding to stress by changing its internal behaviour rather than resisting stress with a predefined buffer.

A typical example here is the circuit breaker pattern which changes its internal state to adapt to the external system behaviour to protect itself.
Another example would be using a retry mechanism with some backoff algorithm to handle transient failures in external systems.
A different technique for creating resilient systems is through graceful degradation, both on the UI and the server side. Rendering UI based on the user agent capabilities, or failing fast on the server side and fallback to some default values are commonly used techniques to adapt to failures.
Systems with self healing and autorepair capabilities are another example of resiliency. These systems are self-aware and can detect abnormalities and take corrective actions. For example Kubernetes/OpenShift will perform regular liveness checks for the running Docker containers and if they detect any anomalies they will restart the container and perform necessary backoffs until the application stabilises. This is another mechanism to cope with stress and improve application resilience.

Overview

Before looking at the next level of software evolution - the antifragile systems, let's visualise and summarise different kind of software system characteristics.

A fragile system is difficult to modify, and cannot cope with a changing environment. Even if it provides some value when used in a stable non changing environment, when faced with further stress and change, it quickly turns into a liability. Many organisations have applications (mainframes for example) which are impossible to change, very expensive to maintain, but still running on high cost as they are very critical to the business.
A robust system is the one that is implemented with certain buffers to handle change and stress. So when the stress level increases, it can withstand it for up to a level without losing its capabilities and still provide a good value. But a robust system does not adapt, and if the stress and change levels continue raising, such a system can also stops providing benefit and may turn into liability and run on loss.
A resilient system can handle more stress and change as it is designed and implemented with stress in mind and adaptability features. Even if it is not benefiting from stress, it can survive lot's of different kind of stress and change and provide value up to a greater degree.
An antifragile system is created with change in mind and it feeds from stress and change. It is much harder to create such a system (it is not a software system but a social-technical system) but once it is in place, it drives the business based on change, and even creates the change.

Antifragile

Many things in life are antifragile, such as the human body. When stressed at the right level, a muscle or bone wold come back stronger. But can a software system be antifragile? Certainly there are some tools, platforms, architectural styles, methodologies that can help create software with antifragile characteristics. Let's see some of the more popular ones.

Auto scaling feature allows applications to handle increasing load by creating more instances of the application. To achieve that, the software system have to able to measure and then react to change and stress. Some good examples here are AWS autoscaling of EC2 instances at infrastructure level, and OpenShift autoscaling of application containers for the application level. This is a feature that transitions applications from resiliency to antifragility since the software system is shifting resources from one part of the system into another to respond to stress.

Microservices. According to Taleb, at times of stress, the large is doomed to breaking. And that phenomen has been observed with mammals, corporations, administrations, etc. In software and large projects, this behaviour has been observed even more often. The bigger a software project is, the harder it becomes to change and react to stress. Microservices is an architecture styles that allows easier change by having autonomous services with well defined APIs - features that allow change. Russ Miles is a strong believer and proponent of Antifragile Software through Microservices (here is an intro video from him).

Chaos engineering is a technique to create antifragility by evolving systems to survive chaos. Rather than waiting for things to break at the worst possible time, the idea of chaos engineering is to proactively inject failures in order to be prepared when disaster strikes. Netflix’s Simian Army is a very good materialization of this technique, designed to generate failures and help isolate system’s weaknesses.

Continuous deployments to a production environment creates continuous partial system failures and forces organisations to react better to failures through redundancy, rolling upgrades, rollbacks, and avoiding single points of failure. Other techniques such as canary release, blue-green deployments, are used to reduce the risk of introducing new software into production environment. Some other methods such as A/B testing even allows experimenting with change and measuring its effect, in order to gain from change.

The human element

Antifragility is not an universal characteristics. Different systems are antigragile towards different kind of disorder. For example Chaos Monkey will make your system antifragile towards EC2 deaths, and autoscaller will make your system respond to specific type of load. But your systems will not be antifragile towards other kinds of stress. And if you look above at the different ways of introdusing antifragility into software systems, all of them are means for making the social-technical system antifragile (and not only the software system):

Chaos engineering forces human feedback to the injected randomness and makes the system antifragile.
Microservices alone do not make a software system antifragile. But microservices combined with appropriate organisational and team structure enables antifragility. If microservices are a way for architecting applications into autonomous structures, DevOps is a means for organizing teams into similar structures. You need both in order to benefit from them and gain from disorder.
Continuous deployment pipeline is a tool that allows teams to react to stress faster, introduce or retract change faster and generally use the stress as the driver.
Similarly, iterative development is not enough to benefit from changing environment. But iterative development with open and honest retrospective rituals is.

Innovate

If it takes few weeks to create a new developer environment, you can not react to change. If it takes three months to release a new feature, you can not react to change. If your ops team is watching the metrics dashboard and manually scaling applications up and down, you cannot embrace change. If the team is hardly catching up with the change, there is no way to gain from change. But once you put an appropriate organisational structure, the right tools and culture in place, then you can start gaining from change. Then you can afford having Friday Hackathons, then you can start exploring open source projects and start contributing to them, then you can start open sourcing your internal projects and benefit from a community, and generally be the change itself. And why not the Netflix or the Amazon of tomorrow.

Create Resilient Camel applications with Hystrix DSL

Wednesday, June 22, 2016 No comments

(This post was originally published on Red Hat Developers, the community to learn, code, and share faster. To read the original post, click here.)

Apache Camel is a mature integration library (over 9 years old now) that implements all the patterns from Enterprise Integration Patterns book. But Camel is not only an EIP implementation library, it is a modern framework that constantly evolves, adds new patterns and adapts to the changes in the industry. Apart from tens of connectors added in each release, Camel also goes hand-in-hand with the new features provided by the new versions of Java language itself and other Java frameworks. With time some architectural styles such as SOA and ESB lose attraction and new architectural styles such as REST, Microservices get popular. To enable developers do integrations using these new trends, Camel responds by adding new DSLs such the REST DSL and new patterns such as the Circuit Breaker, and components such as Spring Boot. And that's not all and we are nowhere near done. With technologies such as Docker containers and Kubernetes, the IT industry is moving forward even faster now, and Camel is evolving too in order to ease the developers as it always has been. To get an idea of kinda tools you would need to develop and run applications on Docker and Kubernetes, check out the Fabric8 project and specifically tools such as the Docker Maven plugin, Kubernetes CDI extension, Kubernetes Java client, Arquilian tests for Kubernetes, etc. Exciting times ahead with lot's of cool technologies, so let's have a look at one of those: Camel and the Hystrix based circuit breaker.

Two Circuit Breakers in Camel, which one to chose?

Two years ago, when I first read Release It from Michael Nygard, I couldn't stop myself implementing the Circuit Breaker pattern in Camel. Usually I drive my contributions by real customer needs, but Circuit Breaker pattern is so elegant, I had to do it. To implement it in a non-intrusive manner, I have added it as a Camel Load Balancer strategy. Here how simple it is:
The DSL above is self describing: if the number of MyCustomExceptions thrown by mock:result endpoint reaches the threshold number, the CircuitBreaker goes to open state and starts rejecting all requests. After 1000ms it moves to halfOpenAfter state and the result of the first request in this state will define its next state as closed or open. It is the simplest possible implementation of the CircuitBreker you can imagine, but still useful.

Since then, the Microseservices architecture has became more popular, and so is the Circuit Breaker Pattern and its java implementation Hystrix. At some point Raúl Kripalani started the Hystrix implementation in Camel and put all the ground work in place, but with time it lost momentum. Then seeing the same request again and again from different customers, I took the relay and continued the work and pushed to Camel a Hystrix component. Seeing the feedback from the community, it still didn't feel as elegant as it could be. Then Claus stepped in and made Hystrix part of the Camel DSL by turning it into an EIP (rather than component). So how does it look like to create a Hystrix based Circuit Breaker in Camel now?

In the example above, you can see only very few of the available options for a Circuit Breaker, to see all of them checkout the offical documents and try out the example application Claus put in place.

Based on this example you may think that Hystrix is part of Camel core, but it is not. Camel core is still this light and without dependencies to third party libraries. If you want to use the Hystrix based Circuit Breaker, you need to add camel-hystrix dependency to your dependencies as it is with any other non-core component and make it available at runtime.

Fail Fast, Fallback, Bulkhead, Timeout and more

The Hystrix library implements more than Circuit Breaker patter. It also does bulkheading, request caching, timeouts, request collapsing, etc. The implementation in Camel does not support request collapsing and caching as these are done using other patterns and components available in Camel already. Request collapsing in Camel can be done using Aggregator EIP and caching can be done using cache components such as Redis, Inifinspan, Hazelcast, etc. The Hystrix DSL in Camel offers around 30 configuration options supported by Hystrix, also exposes metrics over JMX and/or REST for the Hystrix Dashboard.

As a final note, don't foget that to create a true resilient application, you need more than Hystrix. Hystrix will do bulkheading at thread pool level, but that is not enough if you don't apply the same principle at process, host and phisical machine level. To create a create a resilient distributed system, you will need to use also Retry, Throttling, Timeout... and other good bractices some of which I have described in Camel Design Patterns book.

To get some hands on feeling of the new pattern, check the example and then start defending your Camel based Microservices with Hystrix.