I recently left Red Hat to join Diagrid and work on the Dapr project. I spoke about Dapr when it was initially announced by Microsoft, but hadn’t looked into it since it joined CNCF. Two years later, during my onboarding into the new role, I spent some time looking into it and here are the steps I took in the journey and my impressions so far.
What is Dapr?
TL;DR: Dapr is a distributed systems toolkit in a box. It addresses the peripheral integration concerns of applications and lets developers focus on the business logic. If you are familiar with Apache Camel, Spring Framework in the Java world, or other distributed systems frameworks, you will find a lot of similarities with Dapr. Here are a few parallels with other frameworks:
Similar to Camel, Dapr has connectors (called bindings) that let you connect to various external systems.
Similar to HashiCorp Consul, Dapr offers service discovery, which can be backed by Consul.
Similar to Spring Integration, Spring Cloud (remember Netflix Hystrix?), and many other frameworks, Dapr has error-handling capabilities with retries, timeouts, and circuit breakers, which are called resiliency policies.
Similar to Spring Data KeyValue, Dapr offers Key/Value-based state abstractions.
Similar to Kafka, Dapr offers pub/sub-based service interactions.
Similar to ActiveMQ clients, Dapr offers DLQs, but these are not specific to a messaging technology, which means they can be used even with things such as AWS SQS or Redis.
Similar to Spring Cloud Config, Dapr offers configuration and secret management.
Similar to Zookeeper or Redis clients, Dapr offers distributed locks.
Similar to a Service Mesh, Dapr offers mTLS and additional security between your application and the sidecar.
Similar to Envoy, Dapr offers enhanced observability through automatic metrics, tracing and log collection.
The primary difference between all of these frameworks and Dapr is that the latter offers its capabilities not as a library within your application, but as a sidecar running next to your application. These capabilities are exposed behind well-defined HTTP and gRPC APIs (very creatively called building blocks) where the implementations (called components) can be swapped without affecting your application code.
High-level Dapr architecture
You could say Dapr is a collection of stable APIs exposed through a sidecar and swappable implementations running somewhere else. It is the cloud-native incarnation of integration technologies that makes integration capabilities, previously available only in a few languages, available to everybody and portable everywhere: Kubernetes, on-premise, or literally on the International Space Station (I mean the edge).
Getting started
The project is surprisingly easy to get up and running regardless of your developer background and language of choice. I was able to follow the getting started guides and run various quickstarts in no time on my macOS machine. Here are roughly the steps I followed.
Install Dapr CLI
Dapr CLI is the main tool for performing Dapr-related tasks such as running an application with Dapr, seeing the logs, running Dapr dashboard, or deploying all to Kubernetes.
brew install dapr/tap/dapr-cli
With the CLI installed, we have a few different options for installing and running Dapr. I'll start with the least demanding (and least flexible) option and progress from there.
Option 1: Install Dapr without Docker
This is the lightest but not the most useful way to run Dapr.
dapr init --slim
In this slim mode, only the daprd and placement binaries are installed on the machine, which is sufficient for running Dapr sidecars locally.
Run a Dapr sidecar
The following command will start a Dapr sidecar called no-app listening on HTTP port 3500 and a random gRPC port.
dapr run --app-id no-app --dapr-http-port 3500
Congratulations, you have your first Dapr sidecar running. You can see the sidecar instance through this command:
dapr list
or query its health status:
curl -i http://localhost:3500/v1.0/healthz
Dapr sidecars are supposed to run next to an application and not on their own. Let’s stop this instance and run it with an application.
dapr stop --app-id no-app
Run a simple app with a Dapr sidecar
For this demonstration, we will use a simple NodeJS application. All its /neworder handler does is acknowledge the order:
res.status(200).send("Got a new order! Order ID: " + orderId);
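If you want to follow along without cloning the quickstart repository, a minimal sketch of such an app (assuming Express is installed with npm install express) could look like this:

const express = require("express");
const app = express();
app.use(express.json());
// Acknowledge new orders posted to /neworder
app.post("/neworder", (req, res) => {
  const orderId = (req.body && req.body.orderId) || "unknown";
  res.status(200).send("Got a new order! Order ID: " + orderId);
});
// Listen on the port we will tell Dapr about via --app-port
app.listen(3000, () => console.log("Node app listening on port 3000"));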
The application has one /neworder endpoint listening on port 3000. We can run this application and the sidecar with the following command:
dapr run --app-id nodeapp --app-port 3000 --dapr-http-port 3500 node app.js
The command starts the NodeJS application on port 3000 and the Dapr HTTP endpoint on port 3500. Once you see in the logs that the app has started successfully, we can poke it. But rather than hitting the /neworder endpoint directly on port 3000, we will instead interact with the application through the sidecar. We do that using the Dapr CLI like this:
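In recent CLI versions the invocation looks roughly like this (the exact flag names have varied slightly across releases):

dapr invoke --app-id nodeapp --method neworder --data '{"orderId": "42"}'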
You should see the response from the app. Notice that the CLI only needs the app-id (instead of a host and port) to locate where the service is running. The CLI is just a handy way to interact with your service. If that seems like too much magic, we can use a bare-bones curl command too:
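Assuming the sidecar's HTTP port from earlier, the equivalent raw call goes through Dapr's invoke endpoint:

curl -X POST http://localhost:3500/v1.0/invoke/nodeapp/method/neworder -H "Content-Type: application/json" -d '{"orderId": "42"}'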
This command uses Dapr's service invocation API to synchronously interact with the application. Here is a visual representation of what just happened:
Invoking an endpoint through Dapr sidecar
Now, with Dapr on the request path, we get the Daprized service invocation benefits: resiliency policies (retries, timeouts, circuit breakers, concurrency control), observability enhancements (metrics, tracing, logs), and security enhancements (mTLS, allow lists, etc.). At this point, you can try out the metadata and metrics endpoints, play with the configuration options, or see your single microservice in the Dapr dashboard.
dapr dashboard
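You can also query the sidecar's metadata endpoint directly; assuming the HTTP port from earlier, something like this should work:

curl http://localhost:3500/v1.0/metadata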
The slim mode we are running in is good for the Hello World scenario, but it is not the best setup for local development purposes as it lacks a state store, pub/sub, a metrics server, etc. Let's stop the nodeapp using the command from earlier (or Ctrl+C), and remove the slim Dapr binary:
dapr uninstall
One thing to keep in mind is that this command will not remove the default configuration and component specification files usually located in the ~/.dapr folder. We didn't create any files in the steps so far, but if you follow other tutorials and change those files, they will remain and get applied with every dapr run command in the future (unless overridden). This caused me some confusion, so keep it in mind.
Option 2: Install Dapr with Docker
This is the preferred way for running Dapr locally for development purposes but it requires Docker. Let’s set it up:
dapr init
The command will download and run three containers:
The Dapr placement container, used with actors (I wish this was an optional feature)
Zipkin for collecting tracing information from our sidecars
A single-node Redis container used for the state store, pub/sub, and distributed lock implementations
You can verify that these containers are running with docker ps, and then you are ready to go.
Run the Quickstarts
My next step from here was to try out the quickstarts that demonstrate the building blocks for service invocation, pub/sub, state store, bindings, etc. The awesome thing about these quickstarts is that they demonstrate the same example in multiple ways:
With the Dapr SDK and without any dependency on the Dapr SDK, i.e., using plain HTTP only.
In multiple languages: Java, Javascript, .Net, Go, Python, etc.
You can mix and match different languages and interaction methods (SDK or native) for the same example, which demonstrates Dapr's polyglot nature.
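For a quick feel of the state building block without any SDK, here is a hedged example using plain curl against a running sidecar (assuming the default Redis-backed component named statestore that dapr init creates, and a sidecar on HTTP port 3500):

curl -X POST http://localhost:3500/v1.0/state/statestore -H "Content-Type: application/json" -d '[{"key": "order-1", "value": {"orderId": "42"}}]'
curl http://localhost:3500/v1.0/state/statestore/order-1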
Option 3: Install Dapr on Kubernetes
If you have come this far, you should have a good high-level understanding of what Dapr can do for you. The next step would be to deploy Dapr on Kubernetes where most of the Dapr functionalities are available and closest to a production deployment. For this purpose, I used minikube locally with default settings and no custom tuning.
dapr init --kubernetes --wait
If successful, this command will start the following pods in dapr-system namespace:
dapr-operator: manages all components for state store, pub/sub, configuration, etc
dapr-sidecar-injector: injects dapr sidecars into annotated deployment pods
dapr-placement: required with actors only.
dapr-sentry: manages mTLS between services and acts as a certificate authority.
dapr-dashboard: a simple webapp to explore what is running within a Dapr cluster
These Pods collectively represent the Dapr control plane.
Injecting a sidecar
From here on, adding a Dapr sidecar to an application (this would be Dapr dataplane) is as easy as adding the following annotations to your Kubernetes Deployments:
annotations:
dapr.io/enabled: "true"
dapr.io/app-id: "nodeapp"
dapr.io/app-port: "3000"
The dapr-sidecar-injector service watches for new Pods with the dapr.io/enabled annotation and injects a container with the daprd process within the pod. It also adds DAPR_HTTP_PORT and DAPR_GRPC_PORT environment variables to your container so that it can easily communicate with Dapr without hard-coding Dapr port values.
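For example, from inside the application container, saving a bit of state through the injected sidecar could look roughly like this (a sketch assuming Node 18+ with a global fetch and the default statestore component):

// Use the port injected by the sidecar injector instead of hard-coding it
const daprPort = process.env.DAPR_HTTP_PORT || "3500";
// Save a key/value pair through the state building block
fetch(`http://localhost:${daprPort}/v1.0/state/statestore`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify([{ key: "order-1", value: { orderId: "42" } }]),
}).then((res) => console.log("State saved, status:", res.status));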
To deploy a complete application on Kubernetes, I suggest this step-by-step example. It has provider and consumer services, and it worked the first time for me.
Transparent vs explicit proxy
Notice that Dapr sidecar injection is less intrusive than a typical service mesh with a transparent sidecar such as Istio's Envoy. To inject a transparent proxy, the Pods typically also get injected with an init-container that runs at the start of the Pod and re-configures the Pod's networking rules so that all ingress and egress traffic of your application container goes through the sidecar. With Dapr, that is not the case. There is a sidecar injected, but your application is in control of when and how to interact with Dapr over its well-defined, explicit (non-transparent) APIs. Transparent service mesh proxies operate at lower network layers typically used by operations teams, whereas Dapr provides application-layer primitives needed by developers. If you are interested in this topic, here is a good explanation of the differences and overlaps of Dapr with service meshes.
Summary
And finally, here are some closing thoughts on what I liked more and what I liked less about Dapr so far.
Liked more
I love the fact that Dapr is one of the few CNCF projects targeting developers creating applications, and not only the operations teams running these applications. We need more cloud-native tools for implementing applications.
I love the non-intrusive nature of Dapr where capabilities are exposed over clear APIs and not through some black magic. I prefer transparent actions for instrumentation, observability, and general application insight, but not for altering application behavior.
I loved the polyglot nature of Dapr, offering its capabilities to all programming languages and runtimes. This is what attracted me to Kubernetes and cloud native in the first place.
I loved how easy it is to get started with Dapr and the many permutations of each quickstart. There is something for everyone regardless of where you are coming from into Dapr.
I’m excited about WASM modules and remote components features coming into Dapr. These will open new surface areas for more contributions and integrations.
Liked less
I haven’t used actors before and it feels odd to have a specific programming model included in a generic distributed systems toolkit. Luckily you don’t have to use it if you don’t want to.
The documentation is organized, but spread too thinly across many short pages. Learning a topic requires navigating a lot of pages multiple times, and it is still hard to find what you are looking for.
Follow me at @bibryam to join my journey of learning and using Dapr and shout out with any thoughts and comments.
“We build our computers the way we build our cities—over time, without a plan, on top of ruins.”
Ellen Ullman wrote this in 1998, but it applies just as much today to the way we build modern applications; that is, over time, with short-term plans, on top of legacy software. In this article, I will introduce a few patterns and tools that I believe work well for thoughtfully modernizing legacy applications and building modern event-driven systems.
Application modernization refers to the process of taking an existing legacy application and modernizing its infrastructure—the internal architecture—to improve the velocity of new feature delivery, improve performance and scalability, expose the functionality for new use cases, and so on. Luckily, there is already a good classification of modernization and migration types, as shown in Figure 1.
Figure 1: Three modernization types and the technologies we might use for them.
Depending on your needs and appetite for change, there are a few levels of modernization:
Retention: The easiest thing you can do is to retain what you have and ignore the application's modernization needs. This makes sense if the needs are not yet pressing.
Retirement: Another thing you could do is retire and get rid of the legacy application. That is possible if you discover the application is no longer being used.
Rehosting: The next thing you could do is to rehost the application, which typically means taking an application as-is and hosting it on new infrastructure such as cloud infrastructure, or even on Kubernetes through something like KubeVirt. This is not a bad option if your application cannot be containerized, but you still want to reuse your Kubernetes skills, best practices, and infrastructure to manage a virtual machine as a container.
Replatforming: When changing the infrastructure is not enough and you are doing a bit of alteration at the edges of the application without changing its architecture, replatforming is an option. Maybe you are changing the way the application is configured so that it can be containerized, or moving from a legacy Java EE runtime to an open source runtime. Here, you could use a tool like windup to analyze your application and return a report with what needs to be done.
Refactoring: Much application modernization today focuses on migrating monolithic, on-premises applications to a cloud-native microservices architecture that supports faster release cycles. That involves refactoring and rearchitecting your application, which is the focus of this article.
For this article, we will assume we are working with a monolithic, on-premise application, which is a common starting point for modernization. The approach discussed here could also apply to other scenarios, such as a cloud migration initiative.
Challenges of migrating monolithic legacy applications
Deployment frequency is a common challenge for migrating monolithic legacy applications. Another challenge is scaling development so that more developers and teams can work on a common code base without stepping on each other’s toes. Scaling the application to handle an increasing load in a reliable way is another concern. On the other hand, the expected benefits from a modernization include reduced time to market, increased team autonomy on the codebase, and dynamic scaling to handle the service load more efficiently. Each of these benefits offsets the work involved in modernization. Figure 2 shows an example infrastructure for scaling a legacy application for increased load.
Figure 2: Refactoring a legacy application into event-driven microservices.
Envisioning the target state and measuring success
For our use case, the target state is an architectural style that follows microservices principles using open source technologies such as Kubernetes, Apache Kafka, and Debezium. We want to end up with independently deployable services modeled around a business domain. Each service should own its own data, emit its own events, and so on.
When we plan for modernization, it is also important to consider how we will measure the outcomes or results of our efforts. For that purpose, we can use metrics such as lead time for changes, deployment frequency, time to recovery, concurrent users, and so on.
The next sections will introduce three design patterns and three open source technologies—Kubernetes, Apache Kafka, and Debezium—that you can use to migrate from brown-field systems toward green-field, modern, event-driven services. We will start with the Strangler pattern.
The Strangler pattern
The Strangler pattern is the most popular technique used for application migrations. Martin Fowler introduced and popularized this pattern under the name of Strangler Fig Application, which was inspired by a type of fig that seeds itself in the upper branches of a tree and gradually evolves around the original tree, eventually replacing it. The parallel with application migration is that our new service is initially set up to wrap the existing system. In this way, the old and the new systems can coexist, giving the new system time to grow and potentially replace the old system. Figure 3 shows the main components of the Strangler pattern for a legacy application migration.
Figure 3: The Strangler pattern in a legacy application migration.
The key benefit of the Strangler pattern is that it allows low-risk, incremental migration from a legacy system to a new one. Let’s look at each of the main steps involved in this pattern.
Step 1: Identify functional boundaries
The very first question is where to start the migration. Here, we can use domain-driven design to help us identify aggregates and the bounded contexts where each represents a potential unit of decomposition and a potential boundary for microservices. Or, we can use the event storming technique created by Alberto Brandolini to gain a shared understanding of the domain model. Other important considerations here would be how these models interact with the database and what work is required for database decomposition. Once we have a list of these factors, the next step is to identify the relationships and dependencies between the bounded contexts to get an idea of the relative difficulty of the extraction.
Armed with this information, we can proceed with the next question: Do we want to start with the service that has the least amount of dependencies, for an easy win, or should we start with the most difficult part of the system? A good compromise is to pick a service that is representative of many others and can help us build a good technology foundation. That foundation can then serve as a base for estimating and migrating other modules.
Step 2: Migrate the functionality
For the strangler pattern to work, we must be able to clearly map inbound calls to the functionality we want to move. We must also be able to redirect these calls to the new service and back if needed. Depending on the state of the legacy application, client applications, and other constraints, weighing our options for this interception might be straightforward or difficult:
The easiest option would be to change the client application and redirect inbound calls to the new service. Job done.
If the legacy application uses HTTP, then we’re off to a good start. HTTP is very amenable to redirection and we have a wealth of transparent proxy options to choose from.
In practice, it is likely that our application will not only be using REST APIs, but will have SOAP, FTP, RPC, or some kind of traditional messaging endpoints, too. In this case, we may need to build a custom protocol translation layer with something like Apache Camel.
Interception is a potentially dangerous slippery slope: if we start building a custom protocol translation layer that is shared by multiple services, we risk adding too much intelligence to the shared proxy that services depend on. This would move us away from the "smart endpoints, dumb pipes" mantra. A better option is to use the Sidecar pattern, illustrated in Figure 4.
Figure 4: The Sidecar pattern.
Rather than placing custom proxy logic in a shared layer, make it part of the new service. But rather than embedding the custom proxy in the service at compile-time, we use the Kubernetes sidecar pattern and make the proxy a runtime binding activity. With this pattern, legacy clients use the protocol-translating proxy and new clients are offered the new service API. Inside the proxy, calls are translated and directed to the new service. That allows us to reuse the proxy if needed. More importantly, we can easily decommission the proxy when it is no longer needed by legacy clients, with minimal impact on the newer services.
Step 3: Migrate the database
Once we have identified the functional boundary and the interception method, we need to decide how we will approach database strangulation—that is, separating our legacy database from application services. We have a few paths to choose from.
Database first
In a database-first approach, we separate the schema first, which could potentially impact the legacy application. For example, a SELECT might require pulling data from two databases, and an UPDATE can lead to the need for distributed transactions. This option requires changes to the source application and doesn’t help us demonstrate progress in the short term. That is not what we are looking for.
Code first
A code-first approach lets us get to independently deployed services quickly and reuse the legacy database, but it could give us a false sense of progress. Separating the database can turn out to be challenging and hide future performance bottlenecks. But it is a move in the right direction and can help us discover the data ownership and what needs to be split into the database layer later.
Code and database together
Working on the code and database together can be difficult to aim for from the get-go, but it is ultimately the end state we want to get to. Regardless of how we do it, we want to end up with a separate service and database; starting with that in mind will help us avoid refactoring later.
Figure 4.1: Database strangulation strategies
Having a separate database requires data synchronization. Once again, we can choose from a few common technology approaches.
Triggers
Most databases allow us to execute custom behavior when data is changed. In some cases, that could even be calling a web service and integrating with another system. But how triggers are implemented and what we can do with them varies between databases. Another significant drawback here is that using triggers requires changing the legacy database, which we might be reluctant to do.
Queries
We can use queries to regularly check the source database for changes. The changes are typically detected with implementation strategies such as timestamps, version numbers, or status column changes in the source database. Regardless of the implementation strategy, polling always leads to a dilemma: poll often and create overhead on the source database, or poll rarely and miss frequent updates. While queries are simple to implement and use, this approach has significant limitations. It is unsuitable for mission-critical applications with frequent database interactions.
Log readers
Log readers identify changes by scanning the database transaction log files. Log files exist for database backup and recovery purposes and provide a reliable way to capture all changes, including DELETEs. Using log readers is the least disruptive option because they require no modification to the source database and they don't add query load on it. The main downside of this approach is that there is no common standard for the transaction log files, and we'll need specialized tools to process them. This is where Debezium fits in.
Figure 4.2: Data synchronization patterns
Before moving on to the next step, let's see how using Debezium with the log reader approach works.
Change data capture with Debezium
When an application writes to the database, changes are recorded in log files, then the database tables are updated. For MySQL, the log file is the binlog; for PostgreSQL, it is the write-ahead log; and for MongoDB it's the oplog. The good news is that Debezium has connectors for different databases, so it does the hard work of understanding the format of all of these log files for us. Debezium can read the log files and produce a generic abstract event containing the data changes into a messaging system such as Apache Kafka. Figure 5 shows Debezium connectors as the interface for a variety of databases.
Figure 5: Debezium connectors in a microservices architecture.
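To make this more concrete, here is a hedged sketch of what registering a Debezium MySQL connector with Kafka Connect might look like; the names are made up and the exact property keys differ slightly between Debezium versions:

{
  "name": "legacy-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "legacy-db",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.name": "legacydb",
    "table.include.list": "inventory.orders,inventory.order_lines",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}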
Debezium is the most widely used open source change data capture (CDC) project with multiple connectors and features that make it a great fit for the Strangler pattern.
Why is Debezium a good fit for the Strangler pattern?
One of the most important reasons to consider the Strangler pattern for migrating monolithic legacy applications is reduced risk and the ability to fall back to the legacy application. Similarly, Debezium is completely transparent to the legacy application, and it doesn’t require any changes to the legacy data model. Figure 6 shows Debezium in an example microservices architecture.
Figure 6: Debezium deployment in a hybrid-cloud environment.
With a minimal configuration to the legacy database, we can capture all the required data. So at any point, we can remove Debezium and fall back to the legacy application if we need to.
Debezium features that support legacy migrations
Here are some of Debezium's specific features that support migrating a monolithic legacy application with the Strangler pattern:
Snapshots: Debezium can take a snapshot of the current state of the source database, which we can use for bulk data imports. Once a snapshot is completed, Debezium will start streaming the changes to keep the target system in sync.
Filters: Debezium lets us pick which databases, tables, and columns to stream changes from. With the Strangler pattern, we are not moving the whole application.
Single message transformation (SMT): This feature can act like an anti-corruption layer, protect our new data model from legacy naming and data formats, and even let us filter out obsolete data.
Using Debezium with a schema registry: We can use a schema registry such as Apicurio with Debezium for schema validation, and also use it to enforce version compatibility checks when the source database model changes. This can prevent changes from the source database from impacting and breaking the new downstream message consumers.
Using Debezium with Apache Kafka: There are many reasons why Debezium and Apache Kafka work well together for application migration and modernization. Guaranteed ordering of database changes, message compaction, the ability to re-read changes as many times as needed, and tracking transaction log offsets are all good examples of why we might choose to use these tools together.
Step 4: Releasing services
With that quick overview of Debezium, let’s see where we are with the Strangler pattern. Assume that, so far, we have done the following:
Identified a functional boundary.
Migrated the functionality.
Migrated the database.
Deployed the service into a Kubernetes environment.
Migrated the data with Debezium and kept Debezium running to synchronize ongoing changes.
At this point, there is not yet any traffic routed to the new services, but we are ready to release the new services. Depending on our routing layer's capabilities, we can use techniques such as dark launching, parallel runs, and canary releasing to reduce or remove the risk of rolling out the new service, as shown in Figure 7.
Figure 7: Directing read traffic to the new service.
What we can also do here is to only direct read requests to our new service initially, while continuing to send the writes to the legacy system. This is required as we are replicating changes in a single direction only.
When we see that the read operations are going through without issues, we can then direct the write traffic to the new service. At this point, if we still need the legacy application to operate for whatever reason, we will need to stream changes from the new services toward the legacy application database. Next, we'll want to stop any write or mutating activity in the legacy module and stop the data replication from it. Figure 8 illustrates this part of the pattern implementation.
Figure 8: Directing read and write traffic to the new service.
Since we still have legacy read operations in place, we are continuing the replication from the new service to the legacy application. Eventually, we'll stop all operations in the legacy module and stop the data replication. At this point, we will be able to decommission the migrated module.
We've had a broad look at using the Strangler pattern to migrate a monolithic legacy application, but we are not quite done with modernizing our new microservices-based architecture. Next, let’s consider some of the challenges that come later in the modernization process and how Debezium, Apache Kafka, and Kubernetes might help.
After the migration: Modernization challenges
The most important reason to consider using the Strangler pattern for migration is the reduced risk. This pattern gives value steadily and allows us to demonstrate progress through frequent releases. But migration alone, without enhancements or new "business value," can be a hard sell to some stakeholders. In the longer-term modernization process, we also want to enhance our existing services and add new ones. With modernization initiatives, very often, we are also tasked with setting the foundation and best practices for building the modern applications that will follow. By migrating more and more services, adding new ones, and in general by transitioning to the microservices architecture, new challenges will come up, including the following:
Automating the deployment and operating a large number of services.
Performing dual-writes and orchestrating long-running business processes in a reliable and scalable manner.
Addressing the analytical and reporting needs.
These are all challenges that might not have existed in the legacy world. Let's explore how we can address a few of them using a combination of design patterns and technologies.
Challenge 1: Operating event-driven services at scale
While peeling off more and more services from the legacy monolithic application, and also creating new services to satisfy emerging business requirements, the need for automated deployments, rollbacks, placements, configuration management, upgrades, and self-healing becomes apparent. These are the exact features that make Kubernetes a great fit for operating large-scale microservices. Figure 9 illustrates this.
Figure 9: A sample event-driven architecture on top of Kubernetes.
When we are working with event-driven services, we will quickly find that we need to automate and integrate with an event-driven infrastructure—which is where Apache Kafka and other projects in its ecosystem might come in. Moreover, we can use Kubernetes Operators to help automate the management of Kafka and the following supporting services:
Apicurio Registry provides an Operator for managing Apicurio Schema Registry on Kubernetes.
Strimzi offers Operators for managing Kafka and Kafka Connect clusters declaratively on Kubernetes.
KEDA (Kubernetes Event-Driven Autoscaling) offers workload auto-scalers for scaling up and down services that consume from Kafka. If the consumer lag passes a threshold, the Operator starts more consumers, up to the number of partitions, to catch up with message production (see the sketch after this list).
Knative Eventing offers event-driven abstractions backed by Apache Kafka.
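As a sketch of that KEDA-based scaling (the resource names here are hypothetical), a ScaledObject for a Kafka consumer deployment could look roughly like this:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer-scaler
spec:
  scaleTargetRef:
    name: orders-consumer        # the Deployment consuming from Kafka
  minReplicaCount: 0
  maxReplicaCount: 10            # typically bounded by the number of partitions
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: my-cluster-kafka-bootstrap:9092
        consumerGroup: orders-consumer-group
        topic: orders
        lagThreshold: "50"       # scale out when consumer lag exceeds this value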
Note: Kubernetes not only provides a target platform for application modernization but also allows you to grow your applications on top of the same foundation into a large-scale event-driven architecture. It does that through automation of user workloads, Kafka workloads, and other tools from the Kafka ecosystem. That said, not everything has to run on your Kubernetes. For example, you can use a fully managed Apache Kafka or a schema registry service from Red Hat and automatically bind it to your application using Kubernetes Operators. Creating a multi-availability-zone (multi-AZ) Kafka cluster on Red Hat OpenShift Streams for Apache Kafka takes less than a minute and is completely free during our trial period. Give it a try and help us shape it with your early feedback.
Now, let’s see how we can meet the remaining two modernization challenges using design patterns.
Challenge 2: Avoiding dual-writes
Once you build a couple of microservices, you quickly realize that the hardest part about them is data. As part of their business logic, microservices often have to update their local data store. At the same time, they also need to notify other services about the changes that happened. This challenge is not so obvious in the world of monolithic applications and legacy distributed transactions. How can we avoid or resolve this situation the cloud-native way? The answer is to only modify one of the two resources—the database—and then drive the update of the second one, such as Apache Kafka, in an eventually consistent manner. Figure 10 illustrates this approach.
Figure 10: The Outbox pattern.
Using the Outbox pattern with Debezium lets services execute these two tasks in a safe and consistent manner. Instead of directly sending a message to Kafka when updating the database, the service uses a single transaction to both perform the normal update and insert the message into a specific outbox table within its database. Once the transaction has been written to the database’s transaction log, Debezium can pick up the outbox message from there and send it to Apache Kafka. This approach gives us very nice properties. By synchronously writing to the database in a single transaction, the service benefits from "read your own writes" semantics, where a subsequent query to the service will return the newly persisted record. At the same time, we get reliable, asynchronous, propagation to other services via Apache Kafka. The Outbox pattern is a proven approach for avoiding dual-writes for scalable event-driven microservices. It solves the inter-service communication challenge very elegantly without requiring all participants to be available at the same time, including Kafka. I believe Outbox will become one of the foundational patterns for designing scalable event-driven microservices.
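To hint at the moving parts, the fragment below shows how Debezium's outbox event router transformation might be enabled in a connector configuration; it is a hedged sketch with made-up table names:

# Capture only the outbox table and route its rows to Kafka topics per aggregate type
table.include.list=inventory.outbox
transforms=outbox
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
# The conventional outbox columns are: id, aggregatetype, aggregateid, type, payload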
Challenge 3: Long-running transactions
While the Outbox pattern solves the simpler inter-service communication problem, it is not sufficient alone for solving the more complex long-running, distributed business transactions use case. The latter requires executing multiple operations across multiple microservices and applying consistent all-or-nothing semantics. A common example for demonstrating this requirement is the booking-a-trip use case consisting of multiple parts where the flight and accommodation must be booked together. In the legacy world, or with a monolithic architecture, you might not be aware of this problem as the coordination between the modules is done in a single process and a single transactional context. The distributed world requires a different approach, as illustrated in Figure 11.
Figure 11: The Saga pattern implemented with Debezium.
The Saga pattern offers a solution to this problem by splitting up an overarching business transaction into a series of multiple local database transactions, which are executed by the participating services. Generally, there are two ways to implement distributed sagas:
Choreography: In this approach, one participating service sends a message to the next one after it has executed its local transaction.
Orchestration: In this approach, one central coordinating service coordinates and invokes the participating services.
Communication between the participating services might be either synchronous, via HTTP or gRPC, or asynchronous, via messaging such as Apache Kafka.
The cool thing here is that you can implement sagas using Debezium, Apache Kafka, and the Outbox pattern. With these tools, it is possible to take advantage of the orchestration approach and have one place to manage the flow of a saga and check the status of the overarching saga transaction. We can also combine orchestration with asynchronous communication to decouple the coordinating service from the availability of participating services and even from the availability of Kafka. That gives us the best of both worlds: orchestration and asynchronous, non-blocking, parallel communication with participating services, without temporal coupling.
Combining the Outbox pattern with the Saga pattern is an awesome, event-driven implementation option for the long-running business transactions use case in the distributed services world. See Saga Orchestration for Microservices Using the Outbox Pattern (InfoQ) for a detailed description. Also see an implementation example of this pattern on GitHub.
Conclusion
The Strangler pattern, Outbox pattern, and Saga pattern can help you migrate from brown-field systems, but at the same time, they can help you build green-field, modern, event-driven services that are future-proof.
Kubernetes, Apache Kafka, and Debezium are open source projects that have turned into de facto standards in their respective fields. You can use them to create standardized solutions with a rich ecosystem of supporting tools and best practices.
The one takeaway from this article is the realization that modern software systems are like cities: They evolve over time, on top of legacy systems. Using proven patterns, standardized tools, and open ecosystems will help you create long-lasting systems that grow and change with your needs.
This post was originally published on Red Hat Developers. To read the original post, check here.
These days, there is a lot of excitement around 12-factor apps,
microservices, and service mesh, but not so much around cloud-native
data. The number of conference talks, blog posts, best practices, and
purpose-built tools around cloud-native data access is relatively low.
One of the main reasons for this is that most data access technologies are architected and built for static environments rather than for the dynamic nature of cloud environments and Kubernetes.
In this article, we will explore the different categories of data
gateways, from more monolithic ones to those designed for the cloud and Kubernetes. We will see what technical challenges are introduced by the microservices architecture and how data gateways can complement API gateways to address these challenges in the Kubernetes era.
Application architecture evolutions
Let’s start with what has been changing in the way we manage code and
the data in the past decade or so. I still remember the time when I
started my IT career by creating frontends with Servlets, JSP, and JSFs.
In the backend, EJBs, SOAP, and server-side session management were the state-of-the-art technologies and techniques. But things changed rather quickly with the introduction of REST and the popularization of JavaScript.
REST helped us decouple frontends from backends through a uniform
interface and resource-oriented requests. It popularized stateless
services and enabled response caching by moving all client session state to the clients, and so forth. This new architecture was the answer to
the huge scalability demands of modern businesses.
A similar change happened with the backend services through the
Microservices movement. Decoupling from the frontend was not enough, and the monolithic backend had to be decomposed into bounded contexts, enabling independent, fast-paced releases. These are examples of how architectures, tools, and techniques evolved under the pressure of business needs for fast software delivery of planet-scale applications.
That takes us to the data layer. One of the existential motivations
for microservices is having independent data sources per service. If you
have microservices touching the same data, that sooner or later
introduces coupling and limits independent scalability or releasing. It
is not only an independent database but also a heterogeneous one, so
every microservice is free to use the database type that fits its needs.
Application architecture evolution brings new challenges
While decoupling frontend from backend and splitting monoliths into
microservices gave the desired flexibility, it created challenges not present before. Service discovery and load balancing, network-level
resilience, and observability turned into major areas of technology
innovation addressed in the years that followed.
Similarly, creating a database per microservice, with the freedom to choose among different datastore technologies, is a challenge. It shows itself more and more with the explosion of data and the demand for accessing data not only from the owning services but also for real-time reporting and AI/ML needs.
The rise of API gateways
With the increasing adoption of Microservices, it became apparent
that operating such an architecture is hard. While having every
microservice independent sounds great, it requires tools and practices
that we didn’t need and didn’t have before. This gave rise to more
advanced release strategies such as blue/green deployments, canary
releases, dark launches. Then that gave rise to fault injection and
automatic recovery testing. And finally, that gave rise to advanced
network telemetry and tracing. All of these created a whole new layer
that sits between the frontend and the backend. This layer is occupied
primarily with API management gateways, service discovery, and service
mesh technologies, but also with tracing components, application load
balancers, and all kinds of traffic management and monitoring proxies.
This even includes projects such as Knative with activation and
scaling-to-zero features driven by the networking activity.
With time, it became apparent that creating microservices at a fast pace and operating them at scale requires tooling we didn't need before. Something that was fully handled by a single load balancer had to be replaced with a new advanced management layer. A new technology layer, a new set of practices and techniques, and a new group of users responsible for it were born.
The case for data gateways
Microservices influence the data layer in two dimensions. First, this architecture demands an independent database per microservice. From a practical implementation point of view, this can range from an independent database instance to independent schemas and logical groupings of tables. The main rule here is that only one microservice owns and touches a dataset, and all data is accessed through the APIs or events of the owning microservice. The second way the microservices architecture influences the data layer is through datastore proliferation. Just as it allows microservices to be written in different languages, this architecture gives every microservices-based system the freedom to have a polyglot persistence layer. With this freedom, one microservice can use a relational database, another one can use a document database, and a third one an in-memory key-value store.
While microservices allow you all that freedom, it again comes at a cost. It turns out that operating a large number of datastores is something existing tooling and practices were not prepared for. In the
modern digital world, storing data in a reliable form is not enough.
Data is useful when it turns into insights and for that, it has to be
accessible in a controlled form by many. AI/ML experts, data scientists,
business analysts, all want to dig into the data, but the
application-focused microservices and their data access patterns are
not designed for these data-hungry demands.
API and Data gateways offering similar capabilities at different layers
This is where data gateways can help you. A data gateway is like an
API gateway, but it understands and acts on the physical data layer
rather than the networking layer. Here are a few areas where data
gateways differ from API gateways.
Abstraction
An API gateway can hide implementation endpoints and help upgrade and
roll back services without affecting service consumers. Similarly, a data gateway can abstract a physical data source and its specifics, and help alter, migrate, or decommission it without affecting data consumers.
Security
An API manager secures resource endpoints based on HTTP methods. A
service mesh secures based on network connections. But neither of them can understand the shape of the data that is passing through them, nor secure it. A data gateway, on the other hand, understands the different data
sources and the data model and acts on them. It can apply RBAC per data
row and column, filter, obfuscate, and sanitize the individual data
elements whenever necessary. This is a more fine-grained security model
than networking or API level security of API gateways.
Scaling
API gateways can do service discovery, load-balancing, and assist the
scaling of services through an orchestrator such as Kubernetes. But
they cannot scale data. Data can scale only through replication and
caching. Some data stores can do replication in cloud-native
environments but not all. Purpose-built tools, such as Debezium,
can perform change data capture from the transaction logs of data
stores and enable data replication for scaling and other use cases.
A data gateway, on the other hand, can speed up access to all kinds of data sources by caching data and providing materialized views. It can understand the queries, optimize them based on the capabilities of the data source, and produce the most performant execution plan. The combination of materialized views and the streaming nature of change data capture would be the ultimate data scaling technique, but there are no known cloud-native implementations of this yet.
Federation
In API management, response composition is a common technique for
aggregating data from multiple different systems. In the data space, the
same technique is referred to as heterogeneous data federation.
Heterogeneity is the degree of differentiation in various data sources
such as network protocols, query languages, query capabilities, data
models, error handling, transaction semantics, etc. A data gateway can
accommodate all of these differences as a seamless, transparent
data-federation layer.
Schema-first
API gateways allow contract-first service and client development with
specifications such as OpenAPI. Data gateways allow schema-first data
consumption based on the SQL standard. A SQL schema is for data modeling what an OpenAPI specification is for APIs.
Many shades of data gateways
In this article, I use the terms API and data gateways loosely to
refer to a set of capabilities. There are many types of API gateways
such as API managers, load balancers, service mesh, service registry,
etc. It is similar with data gateways: they range from huge monolithic data virtualization platforms that want to do everything, to data federation libraries, purpose-built cloud services, and end-user query tools.
Let’s explore the different types of data gateways and see which fit
the definition of “a cloud-native data gateway.” When I say a
cloud-native data gateway, I mean a containerized first-class Kubernetes
citizen. I mean a gateway that is open source and uses open standards; a component that can be deployed on hybrid/multi-cloud infrastructures, works with different data sources and data formats, and is applicable to many use cases.
Classic data virtualization platforms
In the very first category of data gateways are the traditional data virtualization platforms such as Denodo and TIBCO/Composite.
While these are the most feature-laden data platforms, they tend to do
too much and want to be everything from API management, to metadata
management, data cataloging, environment management, deployment,
configuration management, and whatnot. From an architectural point of
view, they are very much like the old ESBs, but for the data layer. You
may manage to put them into a container, but it is hard to put them into
the cloud-native citizen category.
Databases with data federation capabilities
Another emerging trend is the fact that databases, in addition to storing data, are also starting to act as data federation gateways and allow access to external data.
For example, PostgreSQL implements the
ANSI SQL/MED specification for a standardized way of handling access to
remote objects from SQL databases. That means remote data stores, such
as SQL, NoSQL, File, LDAP, Web, Big Data, can all be accessed as if they
were tables in the same PostgreSQL database. SQL/MED stands for Management of External Data, and it is also implemented by the MariaDB CONNECT engine, DB2, the Teiid project discussed below, and a few others.
Starting with SQL Server 2019, you can query external data sources without moving or copying the data. The PolyBase engine of a SQL Server instance processes Transact-SQL queries to access external data in SQL Server, Oracle, Teradata, and MongoDB.
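As a small illustration of the PostgreSQL side of this, a foreign data wrapper such as postgres_fdw makes a remote table queryable as if it were local; a rough sketch with made-up names:

CREATE EXTENSION postgres_fdw;
CREATE SERVER legacy_orders_srv FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'legacy-db', port '5432', dbname 'orders');
CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_orders_srv
  OPTIONS (user 'report', password 'secret');
CREATE FOREIGN TABLE remote_orders (id int, status text)
  SERVER legacy_orders_srv OPTIONS (schema_name 'public', table_name 'orders');
-- The remote table can now be joined and queried like any local table
SELECT count(*) FROM remote_orders;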
GraphQL data bridges
Compared to traditional data virtualization, this is a new category of data gateways focused on fast web-based data access. The common thing about Hasura, Prisma, and SpaceUpTech is that they focus on GraphQL data access by offering a lightweight abstraction on top of a few data sources. This is a fast-growing
category specialized for enabling rapid web-based development of
data-driven applications rather than BI/AI/ML use cases.
Open-source data gateways
Apache Drill is a schema-free
SQL query engine for NoSQL databases and file systems. It offers JDBC
and ODBC access to business users, analysts, and data scientists on top
of data sources that don't support such APIs. Again, having uniform SQL-based access to disparate data sources is the driver. While Drill is highly scalable, it relies on Hadoop or Apache ZooKeeper-style infrastructure, which shows its age.
Teiid is a mature data
federation engine
sponsored by Red Hat.
It uses the SQL/MED specification for defining the virtual data models
and relies on the Kubernetes Operator model for the building,
deployment, and management of its runtime. Once deployed, the runtime can scale as any other stateless
cloud-native workload on Kubernetes and integrate with other
cloud-native projects. For example, it can use Keycloak for single sign-on and data roles, Infinispan for distributed caching needs, Prometheus for metrics and monitoring, Jaeger for tracing, and even 3scale for API management. But ultimately, Teiid runs as a single Spring Boot application acting as a data proxy and integrating with other best-of-breed services on OpenShift rather than trying to reinvent everything from scratch.
Architectural overview of Teiid data gateway
On the client side, Teiid offers standard SQL over JDBC/ODBC and OData APIs. Business users, analysts, and data scientists can use standard BI/analytics tools such as Tableau, MicroStrategy, Spotfire, etc. to interact with Teiid. Developers can leverage the REST API or JDBC for custom-built microservices and serverless workloads. In either
case, for data consumers, Teiid appears as a standard PostgreSQL
database accessed over its JDBC or ODBC protocols but offering
additional abstractions and decoupling from the physical data sources.
PrestoDB is another popular
open-source project started by Facebook. It is a distributed SQL query
engine targeting big data use cases through its coordinator-worker
architecture. The Coordinator is responsible for parsing statements,
planning queries, managing workers, fetching results from the workers,
and returning the final results to the client. The worker is responsible
for executing tasks and processing data.
Some time ago, the founders split from PrestoDB and created a fork called Trino (formerly PrestoSQL). Today, PrestoDB is part of The Linux Foundation, and Trino is part of the Trino Software Foundation. Both
distributions of Presto are among the most active and powerful open-source data gateway projects
in this space. To learn more about this technology, here is a good book I found.
Cloud-hosted data gateways services
With a move to the cloud infrastructure, the need for data gateways
doesn’t go away but increases instead. Here are a few cloud-based data
gateway services:
AWS Athena is an ANSI SQL-based interactive query service for analyzing data, tightly integrated with Amazon S3. It is based on PrestoDB and supports additional data sources and federation capabilities too. Another similar service by Amazon is AWS Redshift Spectrum. It is focused around the same functionality, i.e. querying S3 objects using SQL. The main difference is that Redshift Spectrum requires a Redshift cluster, whereas Athena is a serverless offering that doesn't require any servers. BigQuery is a similar service from Google.
These tools require minimal to no setup, they can access on-premise
or cloud-hosted data and process huge datasets. But they couple you to a single cloud provider, as they cannot be deployed on multiple clouds or on-premise. They are ideal for interactive querying rather than acting as a hybrid data frontend for other services and tools to use.
Secure tunneling data-proxies
With cloud-hosted data gateways comes the need for accessing
on-premise data. Data has gravity and also might be affected by
regulatory requirements preventing it from moving to the cloud. It may
also be a conscious decision to keep the most valuable asset (your data)
from cloud-coupling. All of these cases require cloud access to
on-premise data. And cloud providers make it easy to reach your data.
Azure’s On-premises Data Gateway is such a proxy allowing access to on-premise data stores from Azure Service Bus.
In the opposite scenario, accessing cloud-hosted data stores from on-premise clients can be challenging too. Google’s Cloud SQL Proxy provides secure access to Cloud SQL instances without having to whitelist IP addresses or configure SSL.
The Red Hat-sponsored open-source project Skupper takes a more generic approach to these challenges. Skupper solves Kubernetes multi-cluster communication challenges through a layer 7 virtual network that offers advanced routing and secure connectivity capabilities. Rather than being embedded into the business service runtime, Skupper runs as a standalone instance per Kubernetes namespace and
acts as a shared sidecar capable of secure tunneling for data access or
other general service-to-service communication. It is a generic
secure-connectivity proxy applicable for many use cases in the hybrid
cloud world.
Connection pools for serverless workloads
Serverless takes software decomposition a step further than microservices. Rather than splitting services by bounded context, serverless is based on the function model, where every function is short-lived and performs a single operation. These granular software constructs are extremely scalable and flexible but come at a cost that previously wasn't present. It turns out that rapid scaling of functions is a challenge for connection-oriented data sources such as relational databases and message brokers. As a result, cloud providers offer transparent data proxies as a service to manage connection pools
effectively. Amazon RDS Proxy is
such a service that sits between your application and your relational
database to efficiently manage connections to the database and improve
scalability.
Conclusion
Modern cloud-native architectures combined with the microservices
principles enable the creation of highly scalable and independent
applications. The large choice of data storage engines, cloud-hosted
services, protocols, and data formats, gives the ultimate flexibility
for delivering software at a fast pace. But all of that comes at a cost
that becomes increasingly visible with the need for uniform real-time
data access from emerging user groups with different needs. Keeping
microservices data only for the microservice itself creates challenges
that have no good technological and architectural answers yet. Data
gateways, combined with cloud-native technologies offer features similar
to API gateways but for the data layer that can help address these new
challenges. The data gateways vary in specialization, but they tend to
consolidate on providing uniform SQL-based access, enhanced security
with data roles, caching, and abstraction over physical data stores.
Data has gravity, requires granular access control, is hard to scale,
and difficult to move on/off/between cloud-native infrastructures.
Having a data gateway component as part of the cloud-native tooling arsenal, one that is hybrid, works on multiple cloud providers, and supports different use cases, is becoming a necessity.
This article was originally published on InfoQ here.