From Fragile to Antifragile Software

(This post was originally published on Red Hat Developers, the community to learn, code, and share faster. To read the original post, click here.)

One of my favourite books is Antifragile by Nassim Taleb, in which the author talks about things that gain from disorder. Taleb introduces the concept of antifragility, which is similar to hormesis in biology or creative destruction in economics, and analyses its characteristics in great detail. If you find this topic interesting, other authors have examined the same phenomenon in different industries, such as Gary Hamel, C. S. Holling, and Jan Husdal. The antifragile is the opposite of the fragile. A fragile thing, such as a package of wine glasses, is easily broken when dropped, but an antifragile object would benefit from such stress. So rather than marking such a box with "Handle with Care", it would be labelled "Please Mishandle", and the wine would get better with each drop (that would be awesome, wouldn't it?).

It didn't take long for the concept of antifragility to be applied to software development principles and architectural styles as well. Some would say that the SOLID principles are antifragile, some would say that microservices are antifragile, and some would say that software systems can never be antifragile. This article is my take on the subject.

According to Taleb, fragility, robustness, resilience and antifragility are all very different. Fragility involves loss and penalisation from disorder. Robustness is enduring stress with neither harm nor gain. Resilience involves adapting to stress and staying the same. And antifragility involves gain and benefit from disorder. If we try to relate these concepts and their characteristics to software systems, one way to define them would be as follows.

Disorder 

Different systems are affected by different kinds of disorder, such as stress, time, change, volatility, debt, etc. For software systems, the main disorder is change. The business operates in a constantly changing environment, and the software needs to adapt to business needs quickly. That means implementing new requirements, changing existing functionality, and even creating new business opportunities through innovation. A software system has to change all the time, otherwise it becomes obsolete.
Apart from development-time challenges, software systems face runtime challenges too. Software systems are created and exist to add value by running in a production environment. While doing so, they are put under stress by end users and other systems. This is another kind of disorder that software systems have to deal with.

Fragile 

This property describes systems that suffer when put under stress. Imagine a software project that is not easy to change at development time, for example one that is not easy to extend, modify, and deploy to the production environment. Or a system that cannot handle unexpected user input or external system failures and breaks easily. That is a fragile system: it is harmed by stress and penalised by change.

Robust

This is a system that can continue functioning in the presence of internal and external challenges without adaptation. Every system is robust up to a certain level. For example, a bottle is robust until it reaches its breaking point. A software system can be made robust enough to handle unanticipated user input or failures in external systems. Handling a NullPointerException (NPE) in Java, using try-catch-finally statements around unreliable invocations, having a thread pool to serve concurrent users, and creating network connections with timeouts are all examples of robustness in a software system. But a robust system doesn't adapt to a changing environment, and when the stress and change threshold is reached, it breaks. The qualities that define robustness vary from system to system. An ATM, for example, needs to be robust and must not fail in the middle of a transaction, whereas a media streaming service can drop a frame or two as long as it continues streaming under stress.
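To make the idea of a predefined buffer concrete, here is a minimal sketch in Python of handling unanticipated user input. The function name `parse_quantity`, its default, and its upper bound are all hypothetical and chosen only for illustration. Note that the function resists bad input but never adapts its behaviour, which is what separates robustness from resilience.

```python
def parse_quantity(raw, default=0, maximum=1000):
    """Turn untrusted user input into a safe integer quantity.

    Hypothetical helper: missing or malformed input falls back to
    `default`, and valid numbers are clamped into [0, maximum].
    """
    if raw is None:
        # The equivalent of guarding against a null reference.
        return default
    try:
        value = int(str(raw).strip())
    except ValueError:
        # Malformed input is absorbed instead of crashing the caller.
        return default
    # The predefined buffer: anything outside the range is clamped.
    return max(0, min(value, maximum))
```

The limits are fixed up front, so once real-world input stops fitting them the function has nothing else to offer.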

Resilient

A system is resilient when it can adapt to internal and external challenges by changing its method of operation. The key here is that the system is responding to stress by changing its internal behaviour rather than resisting stress with a predefined buffer.
  • A typical example here is the circuit breaker pattern which changes its internal state to adapt to the external system behaviour to protect itself. 
  • Another example would be using a retry mechanism with some backoff algorithm to handle transient failures in external systems.
  • A different technique for creating resilient systems is graceful degradation, both on the UI and on the server side. Rendering the UI based on the user agent's capabilities, or failing fast on the server side and falling back to default values, are commonly used techniques for adapting to failures.
  • Systems with self healing and autorepair capabilities are another example of resiliency. These systems are self-aware and can detect abnormalities and take corrective actions. For example Kubernetes/OpenShift will perform regular liveness checks for the running Docker containers and if they detect any anomalies they will restart the container and perform necessary backoffs until the application stabilises. This is another mechanism to cope with stress and improve application resilience.
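As an illustration of the first technique in the list above, here is a minimal circuit breaker sketch in Python. It is not taken from any particular library. The class name, the threshold, the timeout, and the injectable `clock` parameter are assumptions made for this example. The point to notice is that the object changes its own state (closed, open, half-open) in response to how the external system behaves.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Fail fast and protect both sides from more load.
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: allow one trial call through.
            self.state = "half_open"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.state == "half_open"
                    or self.failures >= self.failure_threshold):
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: the external system looks healthy again.
        self.failures = 0
        self.state = "closed"
        return result
```

A caller would wrap each remote invocation, for example `breaker.call(fetch_price, "ABC")` where `fetch_price` is whatever unreliable function is being protected, and treat `CircuitOpenError` as a signal to use a fallback value.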

Overview

Before looking at the next level of software evolution - the antifragile systems - let's summarise the different kinds of software system characteristics.
  • A fragile system is difficult to modify and cannot cope with a changing environment. Even if it provides some value when used in a stable, unchanging environment, it quickly turns into a liability when faced with further stress and change. Many organisations have applications (mainframes, for example) which are impossible to change and very expensive to maintain, but which keep running at high cost because they are critical to the business.
  • A robust system is one that is implemented with certain buffers to handle change and stress. When the stress level increases, it can withstand it up to a certain level without losing its capabilities and still provide good value. But a robust system does not adapt, and if the stress and change levels continue rising, such a system can also stop providing benefit, turn into a liability, and run at a loss.
  • A resilient system can handle more stress and change, as it is designed and implemented with stress in mind and has adaptability features. Even if it does not benefit from stress, it can survive many different kinds of stress and change and provide value to a greater degree.
  • An antifragile system is created with change in mind and feeds on stress and change. It is much harder to create such a system (it is not a software system but a socio-technical system), but once it is in place, it drives the business based on change, and even creates the change.

Antifragile

Many things in life are antifragile, such as the human body. When stressed at the right level, a muscle or bone comes back stronger. But can a software system be antifragile? Certainly there are tools, platforms, architectural styles, and methodologies that can help create software with antifragile characteristics. Let's look at some of the more popular ones.

Auto scaling allows applications to handle increasing load by creating more instances of the application. To achieve that, the software system has to be able to measure and then react to change and stress. Some good examples here are AWS autoscaling of EC2 instances at the infrastructure level, and OpenShift autoscaling of application containers at the application level. This is a feature that moves applications from resiliency towards antifragility, since the software system shifts resources from one part of the system to another in response to stress.
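The "measure, then react" loop can be sketched in a few lines of Python. This is only an illustration, not how AWS or OpenShift implement it. The function name, the parameters, and the replica limits are assumptions. The proportional rule used here resembles the one documented for the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Suggest how many instances to run for the measured load.

    `current_load` and `target_load` are per-instance metrics in the
    same unit (for example average CPU percent); both are assumptions
    of this sketch.
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("need at least one replica and a positive target")
    # Scale proportionally to how far the metric is from its target.
    raw = math.ceil(current_replicas * current_load / target_load)
    # Stay within the configured bounds.
    return max(min_replicas, min(raw, max_replicas))
```

For example, two instances running at 90% against a 60% target would be scaled to three, and the same rule scales back down when the load drops.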

Microservices. According to Taleb, at times of stress, the large is doomed to break. That phenomenon has been observed with mammals, corporations, administrations, etc. In software and large projects, this behaviour has been observed even more often. The bigger a software project is, the harder it becomes to change and to react to stress. Microservices is an architectural style that makes change easier by having autonomous services with well-defined APIs. Russ Miles is a strong believer in and proponent of Antifragile Software through Microservices (here is an intro video from him).

Chaos engineering is a technique to create antifragility by evolving systems to survive chaos. Rather than waiting for things to break at the worst possible time, the idea of chaos engineering is to proactively inject failures in order to be prepared when disaster strikes. Netflix's Simian Army is a very good materialisation of this technique, designed to generate failures and help isolate a system's weaknesses.
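The core of failure injection is small enough to sketch. The Python snippet below is a toy illustration and not how the Simian Army works. The function name, the `terminate` callback, and the probability are hypothetical, and in a real setup `terminate` would call whatever API stops an instance.

```python
import random


def unleash_monkey(instances, terminate, probability=0.2, rng=None):
    """Occasionally terminate one randomly chosen instance.

    Returns the name of the terminated instance, or None if nothing
    was terminated on this run.
    """
    if rng is None:
        rng = random.Random()
    if not instances or rng.random() >= probability:
        return None
    # Sort first so a seeded rng picks reproducibly.
    victim = rng.choice(sorted(instances))
    terminate(victim)
    return victim
```

Running something like this on a schedule during working hours is what forces the team to see and fix weaknesses while everyone is available to react.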

Continuous deployment to a production environment creates continuous partial system failures and forces organisations to react better to failures through redundancy, rolling upgrades, rollbacks, and avoiding single points of failure. Other techniques, such as canary releases and blue-green deployments, are used to reduce the risk of introducing new software into the production environment. Some other methods, such as A/B testing, even allow experimenting with change and measuring its effect, in order to gain from change.
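One common way to implement a canary release or an A/B split is to assign each user to a stable bucket and send only some buckets to the new version. The sketch below assumes string user ids and a percentage-based split. The function names are hypothetical, and real routers usually do this at the load balancer or gateway.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` percent of users to the canary."""
    return "canary" if bucket_for(user_id) < canary_percent else "stable"
```

Because the bucket depends only on the user id, a user keeps seeing the same version between requests, and raising the percentage only ever moves more users to the canary, never back.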

The human element

Antifragility is not a universal characteristic. Different systems are antifragile towards different kinds of disorder. For example, Chaos Monkey will make your system antifragile towards EC2 deaths, and an autoscaler will make your system respond to a specific type of load. But your systems will not be antifragile towards other kinds of stress. And if you look at the different ways of introducing antifragility into software systems described above, all of them are means of making the socio-technical system antifragile (and not only the software system):
  • Chaos engineering forces human feedback to the injected randomness and makes the system antifragile.
  • Microservices alone do not make a software system antifragile. But microservices combined with an appropriate organisational and team structure enable antifragility. If microservices are a way of architecting applications into autonomous structures, DevOps is a means of organising teams into similar structures. You need both in order to benefit from them and gain from disorder.
  • A continuous deployment pipeline is a tool that allows teams to react to stress faster, introduce or retract change faster, and generally use stress as the driver.
  • Similarly, iterative development is not enough to benefit from a changing environment. But iterative development with open and honest retrospective rituals is.

Innovate

If it takes a few weeks to create a new developer environment, you cannot react to change. If it takes three months to release a new feature, you cannot react to change. If your ops team is watching the metrics dashboard and manually scaling applications up and down, you cannot embrace change. If the team is barely keeping up with change, there is no way to gain from it. But once you put an appropriate organisational structure, the right tools, and the right culture in place, you can start gaining from change. Then you can afford to have Friday hackathons, then you can start exploring open source projects and contributing to them, then you can start open sourcing your internal projects and benefit from a community, and generally be the change itself. And why not become the Netflix or the Amazon of tomorrow?

Visualizing Integration Applications

(This post was originally published on Red Hat Middleware Blog. To read the original post, click here.)
Since I changed roles and started performing architect duties, I have to draw more boxes and arrows than write code. There are ways to fight that, like contributing to open source projects during sleepless nights, POCs, and demos, but drawing boxes to express architectures and designs is still a big part of the job. This post is about visualising distributed messaging/SOA/microservices applications in agile (this term has lost its meaning, but I cannot find a better one in this case) environments. What I like about the software industry in recent years is that the majority of organisations I've worked with value the principles behind lean and agile software development methodologies. As long as it is practical, everyone strives to deliver working software (rather than documentation), deliver fast (rather than plan for a long time), eliminate waste, respond to change, etc. There are management practices such as Scrum and Kanban, technical practices from the Extreme Programming (XP) methodology such as unit testing and pair programming, and other practices such as CI, CD, and DevOps to help implement the aforementioned principles. In this line of thinking, I decided to put together a summary of the design tools and diagrams I find useful in my day-to-day job while working with distributed systems.

Issues with 4+1 View Model and death by UML

Every project kicks off with big ambitions, but there is never enough time to do things perfectly, and in the end we have to deliver whatever works. And that is a good thing: it is the way the environment helps us avoid gold plating and stick to principles such as YAGNI and KISS, so we do just enough and adapt to change.

Looking back, I can say that most of the diagrams I've seen around are inspired by the 4+1 view model of Philippe Kruchten, which has Logical, Development, Process and Physical views.
4+1 View Model
I quite like the ideas and the motivation behind this framework: using separate views and perspectives to address specific sets of constraints and to target different stakeholders. That is a great way of describing complex software architectures. But I have two issues with using this model for integration applications.

Diagram applicability

Typically these views are expressed through UML, and for each view you have to use one or more UML diagrams. The fact that I have to use 15 types of UML diagrams to communicate and express a system architecture in an accessible way defeats the purpose.
Death by UML
With such complexity, the chances are that there are only one or two people in the whole organisation who have the tools to create these diagrams and the ability to understand and maintain them. And having hard-to-interpret, out-of-date diagrams is as useful as having out-of-date, gibberish documentation. These diagrams are too complex and of limited value, and very quickly they turn into a liability that you have to maintain rather than an asset expressing the state of a constantly changing system.
Another big drawback is that the existing UML diagram types are primarily focused on describing object-oriented architectures rather than Pipes and Filters architectures. The essence of messaging applications is interaction styles, routing, and data flow rather than structure. Class, object, component, package, and other structural diagrams are of less value for describing Pipes and Filters based processing flows. Behavioural UML diagrams such as activity and sequence diagrams get closer, but still cannot easily express concepts such as filtering and content-based routing, which are a fundamental part of integration applications.

View applicability

Having different sets of views for a system, to address different concerns, is a great way of expressing intent. But the existing views of the 4+1 model don't reflect the way we develop and deploy software nowadays. The idea that you have a logical view first, which then leads to the development and process views, and that those lead to the physical view, does not always hold. The systems development life cycle does not follow the (waterfall) sequence of requirements gathering, designing, implementing and maintaining.
Software Development Lifecycle
Instead, other development methodologies such as agile, prototyping, synchronise and stabilise, and spike and stabilise are used too. In addition to the process, the stakeholders are changing as well. With practices such as DevOps, developers have to know about the final physical deployment model, and operations have to know about the application processing flows. Modern architectures such as microservices affect the views too. Knowing one microservice in a plethora of services is not very useful. Knowing too much about all the services is not practical either. Having the right abstraction level, so that you get a system-wide view with just enough detail, becomes vital.

Practical Visualisation for Integration Applications

There are some good tips for drawing in general. The closest thing that has been working for me is described by Simon Brown as the C4 model. (You should also get a free copy of Simon's awesome The Art of Visualising Software Architecture book.) In his model, Simon talks about the importance of a common set of abstractions rather than a common notation (such as UML), and then using a simple set of diagrams for different levels of abstraction: system context diagram, container diagram, component diagram and class diagram. I quite like this "Outside-In" approach, where you first have a 10,000-foot view and then go deeper with more detailed views at each level.
C4 is not an exact match for middleware/integration applications either, but it is getting closer. If we were to use the C4 model, the system context diagram would be one box that says ESB (or middleware, MOM, microservices) with tens of arrows from north to south. Not very useful. The container diagram is quite close, but the term container is so overloaded (VM, application container, Docker container) that it is less useful for communication. Component and class diagrams are also not a good fit, as Pipes and Filters architectures are focused around Enterprise Integration Patterns rather than classes and packages.
So in the end, what is it that worked for me? It is the following three types of diagrams, which abbreviate to SSD (not as cool as C4): System Context Diagram, Service Design Diagram and Deployment Diagram.

System Context Diagram

The aim of this diagram is to show all the services (whether they are SOA services or microservices) with their inputs and outputs. Ideally, you have the external systems in the north, the services in the middle section, and the internal services in the south. Or you could place both external and internal services on both sides of the middleware layer. Having the protocol (such as HTTP, JMS, file) on the arrows, together with the data format (XML, JSON, CSV), gives useful context too, but it is not mandatory. If there are too many services, you can leave the protocol and the data format for the service level diagrams. I use the direction of the arrow to indicate which service is initiating the call rather than the data flow direction.
System Context Diagram
Having such a diagram gives a good overview of the scope of a distributed system. We can see all the services, the internal and external dependencies, the types of interaction (with protocol and data format), and the call initiator.
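The information carried by such a diagram is simple enough to keep as data and render on demand. The sketch below is one possible way to do that in Python and is not part of the tooling I describe later. The `Call` type, the `to_dot` function, and the system names in the usage note are hypothetical. It emits Graphviz DOT text, with each arrow pointing from the call initiator to the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    initiator: str          # the service that starts the call
    target: str
    protocol: str = ""      # optional, e.g. HTTP, JMS, file
    data_format: str = ""   # optional, e.g. XML, JSON, CSV


def to_dot(calls):
    """Render the calls as a Graphviz DOT digraph."""
    lines = ["digraph system_context {", "  rankdir=TB;"]
    for call in calls:
        # Protocol and data format are not mandatory on the arrows.
        parts = [p for p in (call.protocol, call.data_format) if p]
        attr = ' [label="%s"]' % "/".join(parts) if parts else ""
        lines.append('  "%s" -> "%s"%s;' % (call.initiator, call.target, attr))
    lines.append("}")
    return "\n".join(lines)
```

For example, `to_dot([Call("Web Shop", "Order Service", "HTTP", "JSON")])` produces a small graph that the Graphviz `dot` tool can turn into an image.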

Service Design Diagram

The aim of this diagram is to show what is going on in each box representing a middleware service in the System Context Diagram. The best way to do this is to use EIP icons and connect them as message flows. A service may have a number of flows, support a number of protocols, and implement real-time and/or batch behaviour.

Service Design Diagram
At this level, we want to show all possible data flows implemented by a specific service, from any source to any destination.
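To show what such a flow amounts to in code, here is a minimal Pipes and Filters sketch in Python with two of the patterns mentioned earlier, a message filter and a content-based router. It is a toy model with hypothetical function and channel names. A real integration service would use a framework such as Apache Camel rather than hand-written functions.

```python
def message_filter(predicate):
    """Build a step that passes on only the messages matching `predicate`."""
    def step(messages):
        return (m for m in messages if predicate(m))
    return step


def content_based_router(routes, default):
    """Build a router from (predicate, channel name) pairs.

    The first matching predicate wins; unmatched messages go to `default`.
    """
    def route(message):
        for predicate, channel in routes:
            if predicate(message):
                return channel
        return default
    return route


def run_flow(messages, filters, router):
    """Pipe messages through the filters, then route them to channels."""
    stream = iter(messages)
    for step in filters:
        stream = step(stream)
    channels = {}
    for message in stream:
        channels.setdefault(router(message), []).append(message)
    return channels
```

Each box on a Service Design Diagram corresponds to one such step, and the arrows between the icons correspond to the order in which `run_flow` chains them.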

Deployment Diagram

The previous two diagrams are logical views of the system as a whole and of each service separately. With the deployment diagram, we want to show where each service is going to be deployed. Maybe there will be multiple instances of the same service running on multiple hosts. Maybe some services will be active on one host and passive on another. Maybe there will be a load balancer fronting the services, etc.
Deployment Diagram
The deployment diagram is supposed to show how individual services and the system as a whole relate to the host systems (regardless of whether those are physical or virtual).

What tools do I use?

The System Context and Deployment Diagrams are composed only of boxes and arrows and do not require any special tools. For the Service Design Diagram, you will need a tool that has the Enterprise Integration Pattern icons installed. So far, I have seen the following tools with EIP icon support:
  • Mac OS: OmniGraffle with icons from graffletopia. This is what I used to create all the diagrams for the Camel Design Patterns book.
  • Windows: Enterprise Architect by Sparx Systems with EIP icons by Harald Westphal. There are also MS Visio Stencils here.
  • Web: LucidCharts, which ships EIP icons by default. It is the easiest tool to use and is accessible from everywhere. It is my favourite tool (with MS Visio import/export options for Windows users), and it has free account plans to start with.
  • Web: DrawIO, another web tool with EIP icons. The beauty of this tool is that it makes you use your own storage for the diagrams, such as Google Drive, Dropbox, or local files. So you own the diagrams and keep them safe.
  • Canva: not specifically for EIP diagrams, but in general this site has nice templates and photos to use in various presentations. Check it out.
Other development tools that could also be used for creating EIP diagrams are:

The System Context Diagram is useful for showing the system-wide scope and reach of the services, the Service Design Diagram is good for describing what a service does, and the Deployment Diagram is useful for mapping all that onto something physical. In IT, we can expand work to fill up all the available time with things to do. I'm sure that, given more time, we could invent 10 more useful views. But without the above three, I cannot imagine myself describing an integration application. As Antoine de Saint-Exupery put it long ago: "Perfection is finally attained not when there is no longer anything to add but when there is no longer anything to take away."
