Bet on a Cloud Native Ecosystem, not a Platform

This is a small extract from a longer post I published at The New Stack. Check the original post here.
Recently I wrote about “The New Distributed Primitives for Developers” provided by cloud-native platforms such as Kubernetes and how these primitives blend with the programming primitives used for application development. For example, have a look below to see how many Kubernetes concepts a developer has to understand and use in order to run a single containerized application effectively:
Kubernetes concepts for Developers
Chances are, developers will have to write as much YAML as application code in the container. More importantly, the application itself will rely on the platform more than it ever did before. The cloud native application expects the platform to perform health checks, deployment, placement, service discovery, periodic task execution (cron jobs), scheduling of atomic units of work (jobs), autoscaling, configuration management, etc. As a result, your application has abdicated and delegated all these responsibilities to the platform and expects them to be handled in a reliable way. And the fact is, your application and the involved teams are now dependent on the platform on so many different levels: code, design, architecture, development practices, deployment and delivery pipelines, support procedures, recovery scenarios, you name it.

Bet on an Ecosystem, not a Platform

The platform is just the tip of the iceberg, and to be successful in the cloud-native world, you will need to become part of a fully integrated ecosystem of tools and companies. So the bet is never about a single platform, or a project or a cool library, or one company. It is about the whole ecosystem of projects that work together in sync, and the whole ecosystem of companies (vendors and customers) that collaborate and are committed to the cause for the next decade or so.  

You can read the full article published on The New Stack here. Follow me @bibryam for future blog posts on related topics.

Fighting Service Latency in Microservices with Kubernetes

(This post was originally published on Red Hat Developers, the community to learn, code, and share faster. To read the original post, click here.)

CPU and network speeds have increased significantly in the last decade, as have memory and disk sizes. Still, one of the possible side effects of moving from a monolithic architecture to Microservices is an increase in service latency. Here are a few quick ideas on how to fight it using Kubernetes.

It is not the network

In recent years, networks have transitioned to more efficient protocols and moved from 1 Gbit/s to 10 Gbit/s and even 25 Gbit/s links. Applications send much smaller payloads with less verbose data formats. With all that in mind, chances are the bottleneck in a distributed application is not in the network interactions, but somewhere else, such as the database. In that case, we can safely ignore the rest of this article and go back to tuning the storage system :)

Kubernetes scheduler and service affinity

If two services (deployed as Pods in the Kubernetes world) are going to interact a lot, the first approach to reducing network latency would be to politely ask the scheduler to place the Pods as close together as possible using the affinity features (node affinity and inter-pod affinity). How close depends on our high availability requirements (covered by anti-affinity), but it can mean co-locating in the same region, availability zone, rack, or even on the same host.

Run services in the same Pod

Containers/Service co-located in the same Pod

The deployment unit in Kubernetes (the Pod) is what allows a service to be independently updated, deployed, and scaled. But if performance is a higher priority, we could put two services in the same Pod, as long as that is a deliberate decision. Both services would still be independently developed, tested, and released as containers, but they would share the same runtime lifecycle in the same deployment unit. That would allow the services to talk to each other over localhost rather than through the service layer, or to use the file system, shared memory, or some other high-performance IPC mechanism on the shared host.

Run services in the same process

If co-locating two services on the same host is not good enough, we could have a hybrid between microservices and a monolith by sharing the same process for multiple services. That means we are back to a monolith at runtime, but we could still use some of the principles of Microservices, keep development-time independence, and make a compromise in favour of performance on rare occasions.
We could develop and release two services independently by two different teams, but place them in the same container and share the runtime.
For example, in the Java world that would mean placing two .jar files in the same Tomcat, WildFly, or Karaf server. At runtime, the services can find each other and interact using a public static field that is accessible from any application in the same JVM. The same approach is used by the Apache Camel direct-vm component, which allows synchronous in-memory interaction of Camel routes from different .jar files by sharing the same JVM.

Other areas to explore

If none of the above approaches seems like a good idea, maybe you are exploring in the wrong direction. It might be better to explore whether alternative approaches such as using a cache, data compression, HTTP/2, or something else might help overall application performance. Service mesh tools such as Envoy, Linkerd, and Traefik can also help by providing latency-aware load balancing and routing. A completely new area to explore.


It takes more than a Circuit Breaker to create a resilient application

(This post was originally published on Red Hat Developers, the community to learn, code, and share faster. To read the original post, click here.)

Topics such as application resiliency, self-healing, and antifragility are areas of interest of mine. I've been trying to distinguish, define, and visualize these concepts, and to create solutions with these characteristics.

Software characteristics
However, I notice over and over again that there are various conference talks about resiliency, self-healing, and antifragility, and often they lazily conclude that Netflix OSS Hystrix is the answer to all of that. It is important to remember that conference speakers are overly optimistic, wishful thinkers, and that it takes more than a Circuit Breaker to create a resilient and self-healing application.

Conference level Resiliency

So what does a typical resiliency pitch look like? Use timeouts, isolate in bulkheads, and of course apply the circuit breaker pattern. Having implemented the circuit breaker pattern twice in Apache Camel (first a homegrown version, then using Hystrix), I have to admit that the circuit breaker is perfect conference material, with nice visualization options and state transitions. (I will spare you an explanation of how a circuit breaker works here; I'm sure you will not mind.) And typically, such a pitch concludes that the answer to all of the above concerns is Hystrix. Hurrah!

Get out of the Process

I agree with all the suggestions above, such as timeouts, bulkheads, and circuit breakers. But that is a very narrow-sighted view. It is not possible to make an application resilient and self-healing (not to mention antifragile) only from within. For a truly resilient and self-healing architecture you also need isolation, external monitoring, and autonomous decision making. What do I mean by that?

If you read the book Release It! carefully, you will realize that the bulkhead pattern is not about thread pools. In my Camel Design Patterns book, I've explained that there are multiple levels at which to isolate and apply the bulkhead pattern. Thread pools with Hystrix are only the first level.

Tools for bulkhead pattern
Hystrix uses thread pools to ensure that the CPU time dedicated to your application process is better distributed among the different threads of the application. This prevents a CPU-intensive failure from spreading beyond a thread pool, so other parts of the service still get some CPU time.
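To make the thread-pool level of the bulkhead pattern a bit more concrete, here is a minimal sketch in Python rather than Java. It is not the Hystrix API; the Bulkheads class and the dependency names ("payments", "catalog") are hypothetical and only illustrate the idea that each dependency gets its own bounded pool, so a dependency that hangs can exhaust only its own threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded thread pool per dependency (hypothetical helper, not Hystrix)."""

    def __init__(self, pool_sizes):
        # pool_sizes maps a dependency name to the number of threads it may use.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, dependency, fn, *args):
        # Work for one dependency can only occupy that dependency's threads.
        return self._pools[dependency].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


if __name__ == "__main__":
    gate = threading.Event()
    bulkheads = Bulkheads({"payments": 2, "catalog": 2})
    # Saturate the "payments" pool with calls that hang until the gate opens.
    stuck = [bulkheads.submit("payments", gate.wait) for _ in range(5)]
    # The "catalog" pool is unaffected and still answers.
    print(bulkheads.submit("catalog", lambda: "catalog ok").result(timeout=2))
    gate.set()  # release the hanging calls so the program can exit
    bulkheads.shutdown()
```

Note that this only contains failures that stay inside a pool; it does nothing for a memory leak or a runaway process, which is exactly the limitation discussed next.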
But what about any other kind of failure that can happen in an application and that is not contained in a thread pool? What if there is a memory leak in the application, some sort of infinite loop, or a fork bomb? For these kinds of failures, you need to isolate the different instances of your service through process-level resource isolation, something that is provided by modern container technologies and used as the standard deployment unit nowadays. In practical terms, this means isolating processes on the same host using containers by setting memory and CPU limits.

Once you have isolated the different service instances and ensured failure containment among the different service processes through containers, the next step is to protect against VM/Node/Host failures. In a cloud environment, VMs can come and go even more often, and with that, all process instances on the VM vanish as well. That requires distributing the different instances of your service across different VMs and containing VM failures so that they do not bring down the whole application.

All VMs run on some kind of hardware, and it is important to isolate hardware failures too. If an application is spread across multiple VMs but all of them depend on a shared hardware unit, a failure of that hardware can still affect the whole application.
A container orchestrator such as Kubernetes can spread the service instances across multiple nodes using the anti-affinity feature. Even further, anti-affinity can spread the instances of a service across hardware racks, availability zones, or any other logical grouping of hardware to reduce correlated failures.
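The effect of spreading instances across failure domains can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Kubernetes scheduler is implemented; the spread function and the node and zone names are assumptions made up for the example.

```python
from collections import Counter


def spread(replicas, node_zones):
    """Place replicas so that zones, then nodes, stay as evenly loaded as possible.

    node_zones maps a node name to its failure domain (zone, rack, ...).
    Returns the list of chosen node names, one per replica.
    """
    per_zone = Counter()
    per_node = Counter()
    placement = []
    for _ in range(replicas):
        # Prefer the least-loaded zone first, then the least-loaded node;
        # sorting the names makes ties deterministic.
        node = min(
            sorted(node_zones),
            key=lambda n: (per_zone[node_zones[n]], per_node[n]),
        )
        placement.append(node)
        per_zone[node_zones[node]] += 1
        per_node[node] += 1
    return placement


if __name__ == "__main__":
    nodes = {"a1": "zone-a", "a2": "zone-a", "b1": "zone-b", "c1": "zone-c"}
    print(spread(3, nodes))  # ['a1', 'b1', 'c1']: one replica per zone
```

With three replicas and three zones, losing any single zone takes out only one instance, which is the correlated-failure reduction that anti-affinity is after.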

Self-Healing from What?

The circuit breaker pattern has characteristics for auto-recovery and self-healing. An open or half-open circuit breaker will periodically let certain requests reach the target endpoint and if these succeed, the circuit breaker will transition to its healthy state.
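For readers who want to see the recovery behaviour in code, the following is a minimal Python sketch of the state transitions. It is a simplified model and not the Hystrix or Camel implementation; the class name, the failure_threshold and reset_timeout parameters, and the injectable clock are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock        # injectable so the behaviour can be tested
        self._failures = 0
        self._opened_at = None     # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"     # timeout elapsed: allow a trial request
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial request, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote invocation, for example breaker.call(fetch_price, item_id), where fetch_price stands for any hypothetical function that talks to another service.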
But a circuit breaker can protect and recover only from failures related to service interactions. To recover from the other kinds of failures that we mentioned previously, such as memory leaks, infinite loops, fork bombs, or anything else that may prevent a service from functioning as intended, we need some other means of failure detection, containment, and self-healing. This is where container health checks come into the picture.
Health checks such as Kubernetes liveness and readiness probes monitor the services and detect failures: a failing liveness probe gets the container restarted, while a failing readiness probe takes the instance out of traffic until it recovers. That is a pretty powerful feature, as it allows polyglot services to be monitored and recovered in a unified way.
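The difference between the two kinds of probes can be modelled with a small Python simulation. This is a sketch of the semantics only and not kubelet code; the Instance class and the reconcile function are hypothetical, and the default of three consecutive failures is an assumption that mirrors the failureThreshold setting of Kubernetes probes.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    restarts: int = 0
    ready: bool = False
    liveness_failures: int = 0


def reconcile(instance, alive, ready, failure_threshold=3):
    """Apply the result of one probe round to an instance.

    alive and ready are the outcomes of the liveness and readiness checks.
    Returns "restarted", "ready", or "not-ready".
    """
    if alive:
        instance.liveness_failures = 0
    else:
        instance.liveness_failures += 1
        if instance.liveness_failures >= failure_threshold:
            # Too many consecutive liveness failures: restart the instance.
            instance.restarts += 1
            instance.liveness_failures = 0
            instance.ready = False
            return "restarted"
    # Only instances that are alive and ready receive traffic.
    instance.ready = alive and ready
    return "ready" if instance.ready else "not-ready"
```

The point of the sketch is that the decision is made outside the service process, based only on the probe results, so it works the same way regardless of the language the service is written in.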
Restarting a service will help only to recover from failures. But what about coping with other kinds of behavior such as high load? Kubernetes can scale up and down the services horizontally or even the underlying infrastructure as demonstrated here.

AWS outage handled by Kubernetes
Health checks and container restarts can help with individual services failures, but what happens if the whole node or rack fails? This is where the Kubernetes scheduler kicks in and places the services on other hosts that have enough capacity to run them.
As you can see here, in order to have a system that can self-heal from different kinds of failures, there is a need for way more resiliency primitives than a circuit breaker. The integrated toolset in Kubernetes, in the form of container resource isolation, health checks, graceful termination and startup, container placement, autoscaling, etc., does help achieve application resiliency and self-healing, and even blends into antifragility.

Let the Platform Handle it

There are many examples of developer and application responsibilities that have shifted from the application into the platform. With Kubernetes some examples are:
  • Application health checks and restarts are handled by the platform.
  • Application placements are automated and performed by the scheduler.
  • The act of updating a service with a newer version is covered by Deployments.
  • Service discovery, which was an application level concern has moved into the platform (through Services).
  • Managing Cron jobs has shifted from being an application responsibility to the platform (through Kubernetes CronJobs).
In a similar fashion, the act of performing timeouts, retries, and circuit breaking is shifting from the application into the platform. There is a new category of tools, referred to as Service Mesh, that takes on these concerns.
These tools provide features such as:
  • Retry
  • Circuit-breaking
  • Latency and other metrics
  • Failure- and latency-aware load balancing
  • Distributed tracing
  • Protocol upgrade
  • Version aware routing
  • Cluster failover, etc
That means, very soon, we won't need an implementation of the circuit breaker as part of every microservice. Instead, we will be using one deployed as a sidecar or as a host proxy. In either case, these new tools will shift all of the network-related concerns where they belong: from L7 to L4/5.
Image from Christian Posta
When we talk about Microservices at scale, that is the only possible way to manage complexity: automation and delegation to the platform. My colleague and friend @christianposta has blogged about Service Mesh in depth here.

A Resiliency Toolkit

Without scaring you to death, below is a collection of practices and patterns for achieving a resilient architecture by Uwe Friedrichsen.

Resiliency patterns by Uwe Friedrichsen
Do not try to use all of them, and do not try to use Hystrix all the time. Consider which of these patterns apply to your application and use them cautiously, only when a pattern's benefit outweighs its cost.
At the next conference, when somebody tries to sell you a circuit breaker talk, tell them that this is only the starter and ask for the main course.

Some IT Wisdom Quotes from Twitter

I believe the way we interact with Twitter reflects the mood and mindset we are in. Here I have collected some of the tweets I've liked and enjoyed reading recently. Let me know if you have others.

The price for free software is your time.

Kelsey Hightower @kelseyhightower

If you don’t end up regretting your early technology decisions, you probably overengineered.

Randy Shoup @randyshoup

Optimize to be Wrong, not Right.

Barry O'Reilly @BarryOReilly

Most decisions should probably be made with somewhere around 70% of the information you wish you had. If you wait for 90%, in most cases, you're probably being slow.

Jeff Bezos, Amazon CEO @JeffBezos

You can't understand the problem up front. The act of writing the software is what gives us insight into it. Embrace not knowing.

Sarah Mei @sarahmei

I love deadlines. I like the whooshing sound they make as they fly by.

Douglas Adams

It is the cloud, it is not heaven.

Everything is a tradeoff... just make them intentionally.

Matt Ranney, Chief Architect Uber @mranney

Microservices simplifies code. It trades code complexity for operational complexity.

Do not strive for reusability, and instead aim for replaceability.

Fred Brooks, @ufried

Signing up for Microservices is signing up for evolutionary architecture. There’s no point where you’re just done.

Josh Evans from Netflix

Inverse bus factor: how many people must be hit by a bus for the project to make progress.

Erich Eichinger @oakinger

If you think good architecture is expensive, try bad architecture.

Brian Foote & Joseph Yoder

API Design is easy ... Good API Design is HARD.

David Carver

If we don’t create the thing that kills Facebook, someone else will.

Facebook’s Little Red Book

The Job of the deployment pipeline is to prove that the release candidate is unreleasable.

Jez Humble @jezhumble

Wait... Isn't forking what #opensource is all about? Nope. The power isn't the fork; it's the merge.

It is not necessary to change. Survival is not mandatory.

W. Edwards Deming

You can sell your time, but you can never buy it back. So the price of everything in life is the amount of time you spend on it.

Hope reading this post was worth the time you spent on it :)
