David Campos - Blog

From Bottleneck to Multiplier

2025-11-17T09:00:00+00:00

TL;DR

Engineering leaders can multiply team intelligence instead of bottlenecking it. This series introduces a framework with five interconnected pillars: psychological safety, clear success criteria, intentional delegation, alignment through documentation, and evidence-driven decisions. When these work together, teams ship faster, retain top talent, and solve problems before they escalate, creating a resilient culture that scales from a single team to an entire organization.

The Problem

Engineering organizations plateau despite having talented teams. You have smart people, decent infrastructure, and established processes. Yet problems emerge: shipping slows (sprint velocity forecasts miss by 20% or more), innovation stalls (only three people propose solutions), and retention drops (mid-level engineers are leaving at a rate 15% or more above last year’s). The bottleneck isn’t talent or technology: IT’S LEADERSHIP.

But not leadership as authority or decision-making. Leadership as multiplication. Liz Wiseman’s research reveals a powerful distinction: multipliers create conditions where team intelligence amplifies. The whole becomes greater than the sum of its parts. Each person thinks more deeply, surfaces problems early, and takes ownership. Diminishers, even well-intentioned ones, do the opposite. Many leaders become accidental diminishers by acting as the primary problem-solver, creating a dependency that hinders team growth. Teams slow down waiting for direction, ideas die in silence, and problems hide until they explode because the culture punishes the messenger.

I learned this the hard way. Early in my career, I led a team responsible for an alarming service for connected heating devices—a critical system where mistakes had real-world consequences. I had transitioned from a developer-like role and believed my job was to ensure nothing went wrong. I involved myself in every design, reviewed every line of code, and mentored every engineer. On the surface, it worked. We shipped reliable software. But beneath the surface, a dependency culture was growing. The team would wait for my input before moving forward, turning me into a bottleneck. I was working late nights just to keep up, not realizing I was the one slowing them down. The turning point came when I started defining clear acceptance criteria for tasks and then deliberately stepped back, empowering two senior engineers to make decisions on new features. There was an initial discomfort, but soon, they were not only making sound technical choices but also mentoring others. They replaced me. By removing myself as the central node, I had multiplied their intelligence instead of just adding my own.

The difference shows up everywhere. Multiplier-led teams ship faster with fewer surprises because decisions don’t bottleneck at the top. They retain top talent because people stay where they can think and own outcomes. They catch issues in design review, not in production, because psychological safety makes problems visible early. Diminisher-led teams drift into service organizations that execute features handed down from product rather than proposing solutions as strategic partners.

This matters now more than ever. AI and tooling are rapidly commoditizing code generation, allowing junior engineers to produce what seniors wrote five years ago. While this shifts the landscape, it doesn’t eliminate the need for deep technical skill. Instead, it elevates the importance of what can’t be automated: robust architecture, systems thinking, product intuition, and operational excellence. The competitive edge is no longer just individual coding ability, but the collective intelligence of the team that applies these skills. The teams that win will combine deep technical expertise with the clearest thinking, the fastest decision cycles, and deepest psychological safety. This framework is built to cultivate that combination.

The Framework

Multiplier leadership is the work of a gardener. A gardener does not force growth. They prepare the soil, water with care, and put up the trellis so the plant can find its shape. In teams, that means creating the conditions where thinking happens and ownership grows. Good intent is not enough. You need clear structures the whole team can trust: decision frameworks, explicit success criteria, a steady communication rhythm, and shared measures of progress. The framework we’ll explore brings these together into one system with five pillars:

THE FOUNDATION Psychological Safety & Bidirectional Communication: Teams that can raise concerns without fear identify problems early, innovate more effectively, and adapt faster. This is where a strong leadership culture begins, fostered by continuous feedback that balances caring personally with challenging directly.
THE STRUCTURE Clear Acceptance Criteria & Measurable Success: When everyone knows what “done” looks like before work begins, autonomy becomes possible. Teams make smarter decisions because the constraints are explicit.
THE GROWTH ENGINE Delegation as Development: Rather than decision-making bottlenecking at the leader, capability multiplies through mentoring, coached delegation, and empowerment. You grow people while delivering.
THE CLARITY LAYER Technical Alignment through Documentation: RFCs, Design Documents, and Architecture Decision Records create asynchronous alignment across time zones and teams. Documentation is not overhead; it is an enabler that distributes leadership.
THE RHYTHM Evidence-Driven Culture: SMART OKRs, SLIs/SLOs, DORA metrics, and tech maturity frameworks replace intuition with shared reality. When teams see data, they collaborate on improvement rather than debate priorities.

Figure: The Engineering Leadership Multiplier Framework.

The framework’s power comes from viewing it not as a list of best practices, but as a coherent system. It can be seen from four interconnected perspectives:

As a Diagnostic Tool: The five pillars directly address the most common failure modes in engineering teams. Psychological safety prevents hidden problems from going unaddressed. Clear success criteria and documentation prevent distributed decision-making from descending into chaos. Intentional delegation resolves leadership bottlenecks, and metrics with context stop teams from optimizing for the wrong outcomes. It provides a map to identify why a team is struggling.
As a Reinforcing System: The pillars are designed to be interconnected, where each one strengthens the others. For example, psychological safety is what allows for honest and critical feedback in RFCs. Clear acceptance criteria are what make effective delegation possible. Documentation enables distributed decision-making, and metrics provide the feedback loop to validate that the entire system is working. This creates a virtuous cycle of continuous improvement.
As a Holistic Philosophy (Systems Thinking): Unlike fragmented leadership advice (‘run good 1:1s’, ‘use OKRs’), this framework emphasizes systems thinking. Individual practices, when applied in isolation, can backfire. For instance, metrics without psychological safety feel like surveillance, and delegation without clear criteria feels like abandonment. The framework’s value lies in its coherence—seeing the practices as an integrated whole where each element supports the others.
As a Scalable Blueprint: The framework is grounded in universal human principles—trust, clarity, growth, and evidence—not rigid tools or processes. This allows it to scale from a single team to a large organization. A five-person team might foster safety through daily stand-ups, while a 100-person organization may need formal skip-levels and RFC processes. The implementation adapts to the context (team size, tools, development methodology), but the underlying principles remain constant.

The Expected Outcome

Teams that operate within this framework show predictable patterns:

They ship faster with better predictability. Decisions are distributed, so they don’t bottleneck at the leader. Documentation reduces rework and context-switching costs. Clear acceptance criteria mean fewer approval cycles. In practical terms, sprint velocity becomes more consistent (80%+ of sprints hit their forecast), deployment frequency increases, and change-lead time drops because decisions aren’t waiting for leadership consensus.
They retain high-potential talent. People stay where they feel they can think, grow, and own outcomes. Organizations applying these principles see mid-level engineer attrition drop significantly within a year, and more importantly, their exit interviews shift. Instead of “I wanted more growth”, you hear “I had genuine ownership” from people who stay.
They catch problems early, not in production. Psychological safety means design flaws surface in review, not in production incidents. Clear criteria catch scope creep in planning, not in integration hell. Regular metrics conversations identify degradation trends before they become fires. As a result, incidents become less frequent, and the ones that do happen are less severe because the underlying problems were already partly solved.
They propose more solutions themselves. Instead of waiting for direction, teams think about what’s possible. They’re not executing a backlog handed down from product, they’re partners in strategy. Concrete markers are: 1) observing more RFCs initiated by individual contributors, 2) higher idea quality in planning sessions, and 3) smaller gaps between problem identification and solution proposals.

Context Matters

This framework reflects my experience leading engineering organizations and conversations with other leaders, but it’s also heavily grounded in research. Amy Edmondson’s work on psychological safety, Liz Wiseman’s multiplier leadership, Kim Scott’s Radical Candor, Google’s Project Aristotle on team dynamics, and DORA research on engineering performance all point to the same conclusion: high-performing teams trust each other, communicate clearly, grow continuously, and measure what matters.

It’s an approach that has worked, not the only way or the right way. Your context is different, namely your team size, market, technology, and constraints. Take what resonates. Adapt what doesn’t. Skip what contradicts your values. The framework’s strength is its coherence, but coherence doesn’t mean rigidity. The goal isn’t to follow the framework perfectly, it’s to think systematically about how the pieces of your leadership system work together.

What’s Next?

Over the next six posts, we’ll unpack the framework one piece at a time. Each of the five pillars will get its own dedicated post, and we’ll conclude with a pragmatic implementation guide—where to start, how to scale, and the pitfalls to avoid.

Each post stands alone but builds on the others. If you’re leading a single team of five, start with safety and clarity. If you’re managing managers, rhythm and metrics will feel urgent. The system clicks when you see how the parts reinforce one another.

Try This Week

Pick the pillar that’s costing you the most right now. Ask yourself: Where did you last lose time? Was it because someone didn’t feel safe raising a concern early? Because “done” wasn’t defined clearly enough? Because you’re still the decision bottleneck? Because tribal knowledge lives in someone’s head instead of documentation? Because you’re arguing about priorities without data? That’s your starting point. Pick one concrete action for this week:

Psychological safety weak? In your next 1:1, ask “What’s one thing we could do better that you haven’t told me?” and genuinely listen.
Unclear criteria? Before your next project starts, write down what “done” looks like and have the team challenge it.
Delegation bottleneck? Identify one decision you made this week that someone else could have made. Next time, coach them through making it.
Documentation gaps? Take the last significant decision your team made and write an ADR. Share it.
Missing metrics? Ask your team: “What’s the one metric that would tell us we’re improving?” Start tracking it manually if you have to.

That’s how systems change, through small and deliberate interventions.

In the next post, we’ll explore the foundation of everything else: psychological safety and bidirectional communication.

COVID-19 corpus of research articles annotated with biomedical entities

2020-03-28T09:00:00+00:00

TL;DR

Created a corpus of research articles related to COVID-19, automatically annotated with 11 biomedical entity types of interest, namely Disorder, Species, Chemical or Drug, Gene or Protein, Enzyme, Anatomy, Biological Process, Molecular Function, Cellular Component, Pathway and microRNA.

The corpus is freely available and can be used to further research on topics related to COVID-19, helping to find insights for a better understanding of the disease, in order to find effective drugs and reduce the impact of the pandemic.

Please follow the progress on GitHub, which already provides the CORD-19 corpus of full-text articles with more than 31 million biomedical annotations.

Download

Download the latest version of the COVID-19 annotated corpus.

Statistics

Overall corpus statistics:

Number of abstracts: 17740
Number of entity occurrences: 683349
Number of unique entities: 29423

Number of annotations per entity type:

Entity	# Occurrences	# Unique
Disorder	183528	4477
Species	128356	2170
Chemical or Drug	70619	2768
Gene or Protein	51114	15025
Enzyme	7892	282
Anatomy	106401	2369
Biological Process	74286	1561
Molecular Function	15089	383
Cellular Component	39451	263
Pathway	6587	97
microRNA	26	28

Structure

Corpus file corpus/pubmed_YYYYMMDD.zip contains the following folders:

json: article in JSON format from Pubmed;
raw: article with text only;
annotations: annotations in A1 format.

In each folder you can find one file per article, with the Pubmed ID in its name.
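
As a rough sketch of how the extracted archive could be walked, the snippet below groups the files of the three folders by article. It assumes the archive has already been unzipped and that the file name without its extension is the Pubmed ID; the exact file extensions are not stated in this post, so the code does not depend on them. The function name group_by_article is hypothetical.

```python
from pathlib import Path

# Folder names as listed above.
FOLDERS = ("json", "raw", "annotations")


def group_by_article(root):
    """Map each Pubmed ID to the files found for it in each folder.

    Assumption: the file stem (name without extension) is the Pubmed ID.
    """
    articles = {}
    for folder in FOLDERS:
        for path in sorted(Path(root, folder).glob("*")):
            if path.is_file():
                articles.setdefault(path.stem, {})[folder] = path
    return articles
```

Calling group_by_article on the extracted directory returns a dictionary such as {"32198088": {"json": ..., "raw": ..., "annotations": ...}}, which makes it easy to spot articles that are missing one of the three representations.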

Articles

To collect articles related to COVID-19 from Pubmed, the following query was applied:

("2000"[Date - Publication] : "3000"[Date - Publication]) AND ((COVID-19) OR (Coronavirus) OR (Corona virus) OR (2019-nCoV) OR (SARS-CoV) OR (MERS-CoV) OR (Severe Acute Respiratory Syndrome) OR (Middle East Respiratory Syndrome) OR (2019 novel coronavirus disease[MeSH Terms]) OR (2019 novel coronavirus infection[MeSH Terms]) OR (2019-nCoV disease[MeSH Terms]) OR (2019-nCoV infection[MeSH Terms]) OR (coronavirus disease 2019[MeSH Terms]) OR (coronavirus disease-19[MeSH Terms]))

To collect the articles in JSON and then extract the text in raw format:

python scripts/pubmed/pubmed.py
python scripts/pubmed/json2raw.py
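
The scripts themselves are not reproduced in this post. Purely as an illustration, and assuming the NCBI E-utilities esearch endpoint is used to run a Pubmed query (the actual scripts may work differently), a search URL could be built as sketched below. The helper name build_esearch_url and the default paging values are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumption: not necessarily what
# scripts/pubmed/pubmed.py uses).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=100, retstart=0):
    """Build an esearch URL that returns matching Pubmed IDs as JSON."""
    params = {
        "db": "pubmed",        # search the Pubmed database
        "term": term,          # the query string, e.g. the one shown above
        "retmode": "json",     # ask for a JSON response
        "retmax": retmax,      # page size
        "retstart": retstart,  # offset of the first returned ID
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

The returned URL can be fetched with any HTTP client; paging through all results means increasing retstart by retmax until no more IDs come back.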

Please note that sentences in languages other than English are currently being discarded.

Resources

The following resources were used to annotate each entity type:

Disorder (DISO): UMLS
Species (SPEC): NCBI Taxonomy
Chemical or Drug (CHED): ChEBI
Gene or Protein (PRGE): NER with CRFs and normalization with UniProt
Enzyme (ENZY): ExPASy
Anatomy (ANAT): Unified Medical Language System (UMLS)
Biological Process (PROC): Gene Ontology (GO) and UMLS
Molecular Function (FUNC): Gene Ontology (GO)
Cellular Component (COMP): Gene Ontology (GO)
Pathway (PATH): NCBI BioSystems
microRNA (MRNA): miRBase

For more details please check the article. Unfortunately, the dictionaries cannot be shared for download due to the UMLS usage license. Nevertheless, keep in mind that Disorder and Species entities were extended to include COVID-19 entities of interest.

Annotation

Neji is the tool used for NER (Named Entity Recognition) and normalization. It is optimized for biomedical scientific articles and provides an easy-to-use CLI. For more details please check the article.

The annotation script is available at scripts/pubmed/annotate.sh.

Visualization

brat is used to visualize the annotations in the articles. Find below the instructions to run the tool, create the corpus for brat and visualize the annotations.

Install and run brat

cd tools
unzip brat-1.3.zip
cd brat-1.3
./install.sh -u
python standalone.py

Create corpus for visualization

./scripts/pubmed/brat.sh
ln -s corpus/pubmed/brat tools/brat-1.3/data/covid19-corpus

Visualize corpus

Go to http://localhost:8001/index.xhtml#/covid19-corpus/ and wait for the articles to load:

Figure: List of articles.

Double-click a document to visualize it:

Figure: Article with annotations visualization.

Example

Find below an example article in the provided document formats: JSON, Raw and A1.

JSON

{
    "pubmed_id": "32198088",
    "title": "Transmission potential and severity of COVID-19 in South Korea.",
    "abstract": "Since the first case of 2019 novel coronavirus (COVID-19) identified on Jan 20, 2020 in South Korea, the number of cases rapidly increased, resulting in 6,284 cases including 42 deaths as of March 6, 2020. To examine the growth rate of the outbreak, we aimed to present the first study to report the reproduction number of COVID-19 in South Korea.\nThe daily confirmed cases of COVID-19 in South Korea were extracted from publicly available sources. By using the empirical reporting delay distribution and simulating the generalized growth model, we estimated the effective reproduction number based on the discretized probability distribution of the generation interval.\nWe identified four major clusters and estimated the reproduction number at 1.5 (95% CI: 1.4-1.6). In addition, the intrinsic growth rate was estimated at 0.6 (95% CI: 0.6, 0.7) and the scaling of growth parameter was estimated at 0.8 (95% CI: 0.7, 0.8), indicating sub-exponential growth dynamics of COVID-19. The crude case fatality rate is higher among males (1.1%) compared to females (0.4%) and increases with older age.\nOur results indicate early sustained transmission of COVID-19 in South Korea and support the implementation of social distancing measures to rapidly control the outbreak.",
    "keywords": [
        "COVID-19",
        "Korea",
        "coronavirus",
        "reproduction number"
    ],
    "journal": "International journal of infectious diseases : IJID : official publication of the International Society for Infectious Diseases",
    "publication_date": "2020-03-22",
    "authors": [
        {
            "lastname": "Shim",
            "firstname": "Eunha",
            "initials": "E",
            "affiliation": "Department of Mathematics, Soongsil University, 369 Sangdoro, Dongjak-Gu, Seoul, 06978 Republic of Korea. Electronic address: alicia@ssu.ac.kr."
        },
        {
            "lastname": "Tariq",
            "firstname": "Amna",
            "initials": "A",
            "affiliation": "Department of Population Health Sciences, School of Public Health, Georgia State University, Atlanta, GA, USA. Electronic address: atariq1@student.gsu.edu."
        },
        {
            "lastname": "Choi",
            "firstname": "Wongyeong",
            "initials": "W",
            "affiliation": "Department of Mathematics, Soongsil University, 369 Sangdoro, Dongjak-Gu, Seoul, 06978 Republic of Korea. Electronic address: chok10004@soongsil.ac.kr."
        },
        {
            "lastname": "Lee",
            "firstname": "Yiseul",
            "initials": "Y",
            "affiliation": "Department of Population Health Sciences, School of Public Health, Georgia State University, Atlanta, GA, USA. Electronic address: ylee97@student.gsu.edu."
        },
        {
            "lastname": "Chowell",
            "firstname": "Gerardo",
            "initials": "G",
            "affiliation": "Department of Population Health Sciences, School of Public Health, Georgia State University, Atlanta, GA, USA. Electronic address: gchowell@gsu.edu."
        }
    ],
    "methods": null,
    "conclusions": null,
    "results": "We identified four major clusters and estimated the reproduction number at 1.5 (95% CI: 1.4-1.6). In addition, the intrinsic growth rate was estimated at 0.6 (95% CI: 0.6, 0.7) and the scaling of growth parameter was estimated at 0.8 (95% CI: 0.7, 0.8), indicating sub-exponential growth dynamics of COVID-19. The crude case fatality rate is higher among males (1.1%) compared to females (0.4%) and increases with older age.",
    "copyrights": "Copyright \u00a9 2020. Published by Elsevier Ltd.",
    "doi": "10.1016/j.ijid.2020.03.031",
    "xml": null
}

Raw

TITLE:
Transmission potential and severity of COVID-19 in South Korea.

ABSTRACT:
Since the first case of 2019 novel coronavirus (COVID-19) identified on Jan 20, 2020 in South Korea, the number of cases rapidly increased, resulting in 6,284 cases including 42 deaths as of March 6, 2020. To examine the growth rate of the outbreak, we aimed to present the first study to report the reproduction number of COVID-19 in South Korea.
The daily confirmed cases of COVID-19 in South Korea were extracted from publicly available sources. By using the empirical reporting delay distribution and simulating the generalized growth model, we estimated the effective reproduction number based on the discretized probability distribution of the generation interval.
We identified four major clusters and estimated the reproduction number at 1.5 (95% CI: 1.4-1.6). In addition, the intrinsic growth rate was estimated at 0.6 (95% CI: 0.6, 0.7) and the scaling of growth parameter was estimated at 0.8 (95% CI: 0.7, 0.8), indicating sub-exponential growth dynamics of COVID-19. The crude case fatality rate is higher among males (1.1%) compared to females (0.4%) and increases with older age.
Our results indicate early sustained transmission of COVID-19 in South Korea and support the implementation of social distancing measures to rapidly control the outbreak.

Annotations

T0	DISO 46 54	COVID-19
N0	Reference T0 UMLS:::DISO	COVID-19
T1	SPEC 106 128	2019 novel coronavirus
N1	Reference T1 NCBI:2697049:T001:SPEC	2019 novel coronavirus
T2	DISO 130 138	COVID-19
N2	Reference T2 UMLS:::DISO	COVID-19
T3	PROC 260 266	deaths
N3	Reference T3 GO:0016265::PROC	deaths
T4	PROC 303 309	growth
N4	Reference T4 UMLS:C1621966:T042:PROC	growth
N5	Reference T4 GO:0040007::PROC	growth
T5	PROC 382 394	reproduction
N6	Reference T5 GO:0000003::PROC	reproduction
T6	DISO 405 413	COVID-19
N7	Reference T6 UMLS:::DISO	COVID-19
T7	DISO 459 467	COVID-19
N8	Reference T7 UMLS:::DISO	COVID-19
T8	PROC 602 620	generalized growth
N9	Reference T8 GO:0040007::PROC	generalized growth
T9	PROC 614 620	growth
N10	Reference T9 UMLS:C1621966:T042:PROC	growth
T10	PROC 655 667	reproduction
N11	Reference T10 GO:0000003::PROC	reproduction
T11	CHED 778 786	clusters
N12	Reference T11 CHEBI:33731:T103:CHED	clusters
T12	PROC 805 817	reproduction
N13	Reference T12 GO:0000003::PROC	reproduction
T13	PROC 878 884	growth
N14	Reference T13 UMLS:C1621966:T042:PROC	growth
N15	Reference T13 GO:0040007::PROC	growth
T14	PROC 949 955	growth
N16	Reference T14 UMLS:C1621966:T042:PROC	growth
N17	Reference T14 GO:0040007::PROC	growth
T15	PROC 1034 1040	growth
N18	Reference T15 UMLS:C1621966:T042:PROC	growth
N19	Reference T15 GO:0040007::PROC	growth
T16	DISO 1053 1061	COVID-19
N20	Reference T16 UMLS:::DISO	COVID-19
T17	DISO 1231 1239	COVID-19
N21	Reference T17 UMLS:::DISO	COVID-19
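
To work with these files programmatically, the following is a minimal sketch of a parser for the lines shown above. It assumes tab-separated fields, one contiguous span per T line, and N lines of the form "Reference <T id> <identifier>", which is what the example contains; other brat standoff features (such as discontinuous spans) are not handled. The names Entity, parse_a1, RAW_SAMPLE and A1_SAMPLE are hypothetical. The samples reuse the first two lines of the Raw example and the first annotation, showing that the offsets index into the raw file, including the "TITLE:" header line.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str
    references: list = field(default_factory=list)


def parse_a1(a1_text):
    """Parse T (entity) and N (normalization) lines into Entity objects."""
    entities = {}
    for line in a1_text.splitlines():
        if not line.strip():
            continue
        ident, body, text = line.split("\t", 2)
        if ident.startswith("T"):
            # e.g. "DISO 46 54": type, start offset, end offset
            etype, start, end = body.split(" ")
            entities[ident] = Entity(ident, etype, int(start), int(end), text)
        elif ident.startswith("N"):
            # e.g. "Reference T0 UMLS:::DISO": attach identifier to its entity
            _, target, reference = body.split(" ", 2)
            entities[target].references.append(reference)
    return list(entities.values())


# First two lines of the raw file and the first annotation from the example.
RAW_SAMPLE = "TITLE:\nTransmission potential and severity of COVID-19 in South Korea.\n"
A1_SAMPLE = "T0\tDISO 46 54\tCOVID-19\nN0\tReference T0 UMLS:::DISO\tCOVID-19\n"

if __name__ == "__main__":
    for entity in parse_a1(A1_SAMPLE):
        # Offsets are character positions in the raw text.
        assert RAW_SAMPLE[entity.start:entity.end] == entity.text
        print(entity)
```

Entities with several N lines, such as T4 in the example, simply end up with more than one entry in references.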

Next steps

Possible next steps to improve the COVID-19 corpus:

Annotate “methods”, “results” and “conclusions” sections from JSON files;
Further optimize resources to target entities related with COVID-19;
Include additional entities of relevance;
Annotate PMC and Elsevier full text articles;
Collect co-occurrences to understand which entities might be related more often;
Index articles and annotations and provide access to search tool.

Conclusion

I hope this annotated corpus helps to understand the COVID-19 disease better, towards finding better medication and to reduce the impact on society as much as possible. Please remember that your comments, suggestions and contributions are more than welcome.

Let’s kick the virus ass! :muscle:

Flexible CI/CD with Kubernetes, Helm, Traefik and Jenkins

2020-03-15T09:00:00+00:00

TL;DR

Let’s create a CI/CD (Continuous Integration and Continuous Deployment) solution on top of Kubernetes, using Jenkins as the build tool and Traefik as ingress for flexible application deployment and routing.

Source code is available on GitHub with an example application and supporting files.

Goal

The main goal is to present a flexible CI/CD solution on top of Kubernetes, with automatic application deployment, host definition and routing per environment. To make this process easy to understand, the following steps are presented and described in detail:

Set up Kubernetes and understand its basic concepts;
Install Traefik, Dashboard and Jenkins using Helm;
Create Kotlin application to show how CI/CD can be used;
Implement Jenkins pipeline to build and deploy application automatically.

To fulfill the mentioned steps and validate the presented CI/CD solution, the architecture with the following components is proposed:

Kubernetes: for containers management and orchestration;
Traefik: as proxy and load balancer to access services;
Kubernetes Dashboard: to manage Kubernetes through a web-based interface;
Jenkins: as automation server to automatically build and deploy application;
GitHub: to manage source code using Git;
DockerHub: as registry to manage the Docker image with the example application;
Application staging: example application deployment for development and testing purposes;
Application production: example application deployment to be used in production.

Figure: Components.

Behind the curtains and as supporting tools, the following technologies are also used:

Docker: for services and applications containerization;
Helm: for simplified services deployment and configuration on Kubernetes;
Kotlin: to develop the example application, which will be automatically built and deployed to Kubernetes.

Regarding the CI/CD solution, this post will focus on two main interaction workflows, which are presented in the sequence diagram below:

Build and deploy application: checkout latest source code version to build application and deploy it on Kubernetes cluster;
Access application: use proxy for standardized access to deployed application on specific hostname.

Figure: Sequence diagram.

Kubernetes

Kubernetes, also known as K8s, is the current standard solution for container orchestration, making it easy to deploy and manage large-scale applications in the cloud with a high level of scalability, availability and automation. Kubernetes was originally developed at Google and has received a lot of attention from the open source community. It is the main project of the Cloud Native Computing Foundation and is supported by some of the biggest players, such as Google, Amazon, Microsoft and IBM. Out of curiosity, Kubernetes is currently one of the top open source projects in terms of activity, alongside Linux. Nowadays, several companies already provide production-ready Kubernetes clusters, such as AWS from Amazon, Azure from Microsoft and GCE from Google. An official list of existing Cloud Providers is provided in the Kubernetes documentation.

Terminology

To understand how applications can be deployed, it is fundamental to introduce some of the core concepts, which are presented and briefly described below:

Namespace: a virtual cluster that can sit on top of the same physical cluster hardware, enabling concern separation across development teams;
Pod: is the smallest deployable unit with a group of containers that share the same resources, such as memory, CPU and IP;
Replica Set: ensures that a specified number of Pod replicas are running at any given time;
Deployment: a set of multiple identical Pods, defining how to run multiple replicas of the application, how to automatically replace any instances that fail or become unresponsive, and how to perform updates;
Service: abstraction of a logical set of Pods, which is the single interface that other applications use to interact with them;
Ingress: to manage how external access to services is provided;
Persistent Volume: a piece of storage used to persist data beyond the lifetime of a Pod.

Figure: Kubernetes deployment concepts.
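
To make the relation between Deployment, Pod and Service more concrete, here is a minimal sketch that builds the two manifests as plain Python dictionaries and prints them as JSON, which kubectl also accepts. The application name example-app, the image example/app:latest and the ports are hypothetical placeholders, not values from the example project. The point to notice is that the Service finds its Pods through the same label that the Deployment puts on its Pod template.

```python
import json


def deployment(name, image, replicas=2, port=8080):
    """Deployment that keeps `replicas` identical Pods of `image` running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels of the Pod template below.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def service(name, port=80, target_port=8080):
    """Service that exposes all Pods labelled app=<name> on one stable port."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            deployment("example-app", "example/app:latest"),
            service("example-app"),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would create both resources; the Replica Set and the Pods are then created by Kubernetes from the Deployment.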

Architecture

Before jumping into installing and configuring Kubernetes, it is important to understand the software and hardware components required to set up a cluster properly. The figure below summarizes the architecture of the required components, together with a brief description of the role of each one:

Master: responsible for maintaining the desired cluster state, being the entry point for administrators to manage the various nodes. The following software components run in the master:
- API Server: REST API that exposes all operations that can be performed on the cluster, such as creating, configuring and removing Pods and Services;
- Scheduler: responsible for assigning tasks to the various cluster nodes;
- Controller-Manager: to make sure that the cluster state is operating as expected, reacting to events triggered by controllers from throughout the cluster;
- etcd: distributed key-value store used to share information regarding cluster state, which can be accessed by all cluster nodes;
Node: physical or virtualized machine that performs a given task, with the following components running:
- Docker: container runtime responsible for starting and managing containers;
- Kubelet: tracks the state of a Pod to ensure that all the containers are running as expected;
- Kube-proxy: routes traffic coming into a node from the service;
UI: user interface application to manage cluster configurations and applications. Kubernetes Dashboard will be used in this post;
CLI: command line interface to manage cluster configurations and applications. Kubectl will be used in this post.

Figure: Kubernetes architecture. Source https://blog.sensu.io/how-kubernetes-works.

To learn more about Kubernetes architecture and terminology, several pages already provide an in-depth description, such as the Official Kubernetes Documentation, the introduction by Digital Ocean and the terminology presentation by Daniel Sanche.

Install

There are several options available that make installing Kubernetes more straightforward, since installing and configuring every single component can be a time-consuming task. Ramit Surana provides an extensive list of such installers. Of special note are kubeadm, kops, minikube and k3s, which are continuously supported and updated by the open source community. Since I am using macOS and want to run Kubernetes locally on a single node, I decided to take advantage of Docker Desktop, which already provides Docker and Kubernetes in a single tool. After installing, one can check the system tray menu to make sure that Kubernetes is running as expected:

Figure: Docker Desktop.

Kubectl

Kubectl is the official CLI tool to completely manage a Kubernetes cluster, which can be used to deploy applications, inspect and manage cluster resources and view logs. Since Docker Desktop already installs kubectl, let’s just check if it is running properly by executing kubectl version, which provides an output similar to:

➜  ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

To understand the available commands and the logic behind them, I would recommend a quick look at the official kubectl cheat sheet. For instance, one can get the list of pods that are running by executing kubectl get pods.

Last but not least, if you use the ZSH shell, remember to enable the kubectl plugin in order to get proper highlighting and auto-completion. To achieve that, just change your ZSH ~/.zshrc init script by adding the kubectl plugin:

plugins=(git kubectl)

Helm

Helm is the package manager for Kubernetes, which helps to create templates describing exactly how an application can be installed. Such templates can be shared with the community and customized for specific installations. Each template is referred to as a Helm chart. Check Helm hub to see whether a chart is already available for the application that you want to run. If you are curious and want to know how charts are implemented, you can also check the GitHub repository with the source code of the official stable and incubator charts. Moreover, if you would like to host your own repository for Helm charts, solutions like Harbor and JFrog Artifactory can be used to store and serve them.

Finally, to install Helm and check that it is properly installed, just run:

brew install helm
helm version

Which should give you something like:

➜  ~ helm version
version.BuildInfo{Version:"v3.1.1", GitCommit:"afe70585407b420d0097d07b21c47dc511525ac8", GitTreeState:"clean", GoVersion:"go1.13.8"}

Traefik

Traefik is a widely used proxy and load balancer for HTTP and TCP applications, natively compliant and optimized for Cloud-based solutions. In summary, Traefik analyzes the infrastructure and services configuration and automatically discovers the right configuration for each one, enabling automatic applications deployment and routing. On top of this, Traefik also supports collecting detailed metrics, logs and traceability.

Figure: Traefik architecture. Source https://docs.traefik.io.

Traefik offers a stable and official Helm chart that can be used for straightforward installation and configuration on Kubernetes. The following configuration values are provided to the chart, in order to configure:

access to the Traefik dashboard through the domain “traefik.localhost”, using admin as both username and password;
enforced SSL for all proxied services, with an automatically generated wildcard SSL certificate for the “*.localhost” domain.

dashboard:
  enabled: true
  domain: traefik.localhost
  auth:
    basic:
      admin: $2y$05$kpCJY2gJWlgG5CUs5tdPx.2xGJ4xyqhWtjiiM/NKfHmj3pfUPsap2
ssl:
  enabled: true
  enforced: true
  permanentRedirect: true
  generateTLS: true
  defaultCN: "*.localhost"

Code: Configuration values to use with Traefik Helm chart.

After saving the configuration values in the file “traefik-values.yml”, Traefik can be installed by executing the following command:

helm install traefik stable/traefik --values traefik-values.yml

If you would like to delete Traefik, the following command should help:

helm uninstall traefik

Check installation progress by checking the status of deployments and pods:

kubectl get deployments
kubectl get pods

When the deployment ready status is “1/1” (1 ready out of 1 required), visit http://traefik.localhost/ to access the Traefik dashboard and log in with the previously defined username and password. In the dashboard one can check the entry points (frontends) available to access the deployed services (backends).

Figure: Traefik dashboard.

Kubernetes Dashboard

Kubernetes Dashboard is an open-source web interface to quickly manage a Kubernetes cluster, providing user-friendly features to manage and troubleshoot deployed applications. Personally, I prefer the Portainer interface and organization; however, at the time of writing it does not support Kubernetes yet. Thus, the following configuration enables the Traefik ingress and makes the dashboard available through http://dashboard.localhost.

enableInsecureLogin: true
service:
  externalPort: 9090
ingress:
  enabled: true
  hosts: 
    - dashboard.localhost
  paths: 
    - /
  annotations:
    kubernetes.io/ingress.class: traefik

Code: Configuration values to use with Kubernetes Dashboard Helm chart.

Similarly to Traefik, the Dashboard can be installed using the official Kubernetes Dashboard Helm chart through the command:

helm install dashboard stable/kubernetes-dashboard --values dashboard-values.yml

To make logging in possible, the Helm chart already creates a service account with the appropriate permissions. The token for that service account is stored in Kubernetes secrets. To get the list of available secrets just run kubectl get secrets:

Figure: Kubernetes secrets.

To get the secret value, let’s describe the secret that contains the dashboard token with kubectl describe secrets dashboard-kubernetes-dashboard-token-sk68z:

Figure: Kubernetes secret with token.

Finally, go to http://dashboard.localhost, and use the previous token value to log in to the Kubernetes Dashboard:

Figure: Kubernetes Dashboard.

Jenkins

Jenkins is the most widely used open-source tool to automatically build, test and deploy software applications. Thus, with Jenkins we can specify a processing pipeline describing exactly how our application will be built and deployed automatically after each commit.

To install Jenkins, we will take advantage of the official Jenkins Helm chart, providing the following configurations to specify login credentials and install the plugins to integrate with GitHub and Kubernetes:

master:
  useSecurity: true
  adminUser: admin
  adminPassword: admin
  numExecutors: 1
  installPlugins:
    - kubernetes:1.21.1
    - workflow-job:2.36
    - workflow-aggregator:2.6
    - credentials-binding:1.20
    - git:3.12.1
    - command-launcher:1.3
    - github-branch-source:2.5.8
    - docker-workflow:1.21
    - pipeline-utility-steps:2.3.1
  overwritePlugins: true
  ingress:
    enabled: true
    hostName: jenkins.localhost
    annotations:
      kubernetes.io/ingress.class: traefik

Code: Configuration values to use with Jenkins Helm chart.

To perform installation, execute the following command and check the progress with kubectl get deployments:

helm install jenkins stable/jenkins --values jenkins-values.yml

When the required pods are running, go to http://jenkins.localhost to access Jenkins and log in with the previously provided credentials:

Figure: Jenkins Dashboard.

Application

Since all required tools are installed and running successfully, we are now ready to create the sample application to be built and deployed automatically. The application is developed in Kotlin using the Spring Boot framework. Spring Initializr is used to create the initial application, with the following configuration:

Figure: Spring Initializr configuration.

The core functionality lives in the GreetingController, which exposes a single GET REST endpoint. The returned greeting combines the name input argument, the value of an environment variable, and a counter that distinguishes one call from the next.

@RestController
class GreetingController {
    val counter = AtomicLong()
    
    @GetMapping("/greeting")
    fun greeting(@RequestParam(value = "name", defaultValue = "World") name: String): Greeting {
        val envVar: String = System.getenv("EXAMPLE_VALUE") ?: "default_value"
        return Greeting(counter.incrementAndGet(), "Hello, $name", envVar)
    }
}

Additionally, keep in mind to add the actuator dependency to enable the health endpoint at /actuator/health, which will be used to provide application health information to Kubernetes:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Dockerfile

To run the application in Kubernetes, a Docker image of the application is required, which can be described with the following Dockerfile:

FROM openjdk:8-jdk-alpine
EXPOSE 8090
ADD /target/k8s-jenkins-example*.jar k8s-jenkins-example.jar
ENTRYPOINT ["java", "-jar", "k8s-jenkins-example.jar"]

Helm chart

To create the helm chart for the sample application, one can take advantage of the helm CLI tool to create a baseline that we can adapt for the sample application. Such baseline can be created by running helm create helm on your terminal, which creates the templates of the required Kubernetes components to run and properly configure the application. Considering our goal, the following files are the ones that require most attention:

Chart.yaml: chart properties such as name, description and version;
values.yaml: default configuration values provided to chart;
templates/deployment.yaml: template of Kubernetes deployment specification, to configure the application pod and replication characteristics;
templates/service.yaml: template of Kubernetes service specification, to configure the application interface for other applications;
templates/ingress.yaml: template of Kubernetes ingress specification, to expose service for external access.

Helm charts use ` {{}} ` for templating, which means that whatever is inside will be evaluated to produce an output value. More details on the templating options can be found in the official guide. For the template that we are creating, the following are the most important examples:

` {{ .Values.replicaCount }} ` to get configuration replicaCount from provided values file;
` {{- toYaml . | nindent 8 }} `: copies the referenced YAML tree (the dot refers to the current scope) into the output, indented by 8 spaces.

The following values were defined to configure the application and are used in the chart templates. Note the Docker image reference, the service port and the ingress configuration to use Traefik:

image:
  repository: davidcampos/k8s-jenkins-example
  tag: latest
  pullPolicy: Always

name: "example"
domain: "localhost"

replicaCount: 1

service:
  port: 8090

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: traefik
  hosts:
    - host:
      paths:
        - /

Code: Helm chart configuration values.

Below you can find the deployment template, which configures the replica set and how it should be updated, sets up the container together with health probes, and finally specifies where the pods should be deployed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.name }}-deployment
spec:
  replicas: {{ .Values.replicaCount }}
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: {{ .Values.name }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
        role: rolling-update
    spec:
    {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
    {{- end }}
      containers:
        - name: {{ .Values.name }}-container
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: {{ .Values.service.port }}
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: {{ .Values.service.port }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
    {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
    {{- end }}
    {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
    {{- end }}

Code: Helm chart deployment template.

The following template provides the service configuration, which refers to the port provided in the deployment:

apiVersion: v1
kind: Service
metadata:
  name: {{ .Values.name }}-service
spec:
  ports:
    - name: http
      targetPort: {{ .Values.service.port }}
      port: {{ .Values.service.port }}
  selector:
    app: {{ .Values.name }}

Code: Helm chart service template.

Finally, the ingress template configures how the service is exposed for external access, specifying matching rules and TLS properties:

{{- if .Values.ingress.enabled -}}
{{- $name := .Values.name -}}
{{- $hostname := printf "%s.%s" .Values.name .Values.domain -}}
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: {{ $name }}-ingress
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
{{- if .Values.ingress.tls }}
  tls:
  {{- range .Values.ingress.tls }}
    - hosts:
      {{- range .hosts }}
        - {{ $hostname | quote}}
      {{- end }}
      secretName: {{ .secretName }}
  {{- end }}
{{- end }}
  rules:
  {{- range .Values.ingress.hosts }}
    - host: {{ $hostname | quote }}
      http:
        paths:
        {{- range .paths }}
          - path: {{ . }}
            backend:
              serviceName: {{ $name }}-service
              servicePort: http
        {{- end }}
  {{- end }}
{{- end }}

Code: Helm chart ingress template.

To check that the Helm chart is working properly, we can install it and verify that its components were deployed properly:

helm install example ./helm
kubectl get deployment
kubectl get pod
kubectl get service
kubectl get ingress

Pipeline

The goal is to build a pipeline that takes full advantage of Kubernetes, building the required artifacts on dedicated agents that run on demand. This approach gives developers flexibility and independence: they are in full control of their build pipelines and do not depend on whatever is installed on the Jenkins host machine. As a result, the Jenkins machine is not polluted with many different tools and versions. For instance, if one team needs Java 8 and another needs Java 13, the Jenkins host machine does not need to have both installed, since each team's pipeline runs on its own Jenkins agent, deployed on demand for each run. To achieve that, we used the Kubernetes Jenkins plugin, which makes it possible to define a pod whose containers carry the required tools. Then, we just have to state that a specific step runs inside a specific container by referencing its name.

Keep in mind that a workspace volume is automatically created and shared between containers in the pod, which means that any change on the workspace will be available for other containers. For instance, if we use the maven container to create the packaged jar file, it will be available for the docker container to create the docker image. Moreover, in order to speed up the building process, do not forget to create a volume for the maven ~/.m2 folder, in order to share downloaded dependencies between job runs.

Since maven, docker and helm tools are required to properly build and deploy the sample application, the following pod specification is provided in the build.yaml file:

apiVersion: v1
kind: Pod
metadata:
  labels:
    some-label: pod
spec:
  containers:
    - name: maven
      image: maven:3.3.9-jdk-8-alpine
      command:
        - cat
      tty: true
      volumeMounts:
        - name: m2
          mountPath: /root/.m2
    - name: docker
      image: docker:19.03
      command:
        - cat
      tty: true
      securityContext:
        privileged: true
      volumeMounts:
        - name: dockersock
          mountPath: /var/run/docker.sock
    - name: helm
      image: lachlanevenson/k8s-helm:v3.1.1
      command:
        - cat
      tty: true
  volumes:
    - name: dockersock
      hostPath:
        path: /var/run/docker.sock
    - name: m2
      hostPath:
        path: /root/.m2

Before jumping into the pipeline, we need to define the credentials that will be used to access GitHub source code and Docker Hub images. Such credentials can be stored on Jenkins credentials, which later can be referenced from the pipeline using respective identifiers:

Figure: Jenkins credentials.

For the pipeline I decided to use the declarative syntax instead of scripted, which is a better fit for simple pipelines and easier to read and understand. However, the more restrictive syntax can be a limitation if we want to perform more advanced tasks. For such cases, a script block can be defined in a declarative pipeline. In summary, the CI/CD declarative pipeline for the sample application will have the following stages:

Build: build application package using maven;
Docker Build: build docker image using previously created Dockerfile;
Docker Publish: publish built docker image to Docker Hub;
Kubernetes Deploy: deploy application using previously created helm chart, by installing or upgrading respective Kubernetes components.

On top of the stages, two deployment environments will be created: production (https://example.localhost) and staging (https://example-staging.localhost), which are tied to the master and develop branches respectively. Thus, if the branch is neither master nor develop, the Docker image is not built and the application is not deployed to Kubernetes. Moreover, all application artifacts share the same version, which is loaded from the POM file using the Pipeline Utility Steps Jenkins plugin.
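The branch rule described above is small enough to sketch outside Jenkins. The following plain Java class is only an illustration: the class and method names (BranchRules, shouldDeploy, releaseName, host) are hypothetical and are not part of the project, and the host name is assumed to be the release name joined with the domain.

```java
// Hypothetical helper mirroring the branch rule described above; not part of the original project.
final class BranchRules {
    private BranchRules() {}

    /** Only master and develop trigger the image build and the deployment. */
    static boolean shouldDeploy(String branch) {
        return "master".equals(branch) || "develop".equals(branch);
    }

    /** master maps to production ("example"); any other branch maps to "example-staging". */
    static String releaseName(String branch) {
        return "master".equals(branch) ? "example" : "example-staging";
    }

    /** Host name assumed to be "<release name>.<domain>", e.g. example.localhost. */
    static String host(String branch, String domain) {
        return releaseName(branch) + "." + domain;
    }

    public static void main(String[] args) {
        for (String branch : new String[] {"master", "develop", "feature/login"}) {
            System.out.printf("%s -> deploy=%b, host=%s%n",
                    branch, shouldDeploy(branch), host(branch, "localhost"));
        }
    }
}
```

A feature branch still resolves to a release name here, but since shouldDeploy is false for it, nothing would be built or deployed.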

Find below the Jenkins declarative pipeline for the sample application, which also sets up the agent using the pod described in the build.yaml file and automatically checks out the source code from GitHub on each job run:

pipeline {
    environment {
        DEPLOY = "${env.BRANCH_NAME == "master" || env.BRANCH_NAME == "develop" ? "true" : "false"}"
        NAME = "${env.BRANCH_NAME == "master" ? "example" : "example-staging"}"
        VERSION = readMavenPom().getVersion()
        DOMAIN = 'localhost'
        REGISTRY = 'davidcampos/k8s-jenkins-example'
        REGISTRY_CREDENTIAL = 'dockerhub-davidcampos'
    }
    agent {
        kubernetes {
            defaultContainer 'jnlp'
            yamlFile 'build.yaml'
        }
    }
    stages {
        stage('Build') {
            steps {
                container('maven') {
                    sh 'mvn package'
                }
            }
        }
        stage('Docker Build') {
            when {
                environment name: 'DEPLOY', value: 'true'
            }
            steps {
                container('docker') {
                    sh "docker build -t ${REGISTRY}:${VERSION} ."
                }
            }
        }
        stage('Docker Publish') {
            when {
                environment name: 'DEPLOY', value: 'true'
            }
            steps {
                container('docker') {
                    withDockerRegistry([credentialsId: "${REGISTRY_CREDENTIAL}", url: ""]) {
                        sh "docker push ${REGISTRY}:${VERSION}"
                    }
                }
            }
        }
        stage('Kubernetes Deploy') {
            when {
                environment name: 'DEPLOY', value: 'true'
            }
            steps {
                container('helm') {
                    sh "helm upgrade --install --force --set name=${NAME} --set image.tag=${VERSION} --set domain=${DOMAIN} ${NAME} ./helm"
                }
            }
        }
    }
}

Job

To finalize, let’s create the Jenkins job to run the pipeline using the sample application source code. To achieve that, go to Jenkins and create a new Multibranch Pipeline job with the following configurations:

Figure: Jenkins job configuration.

After saving the Jenkins job, you should be able to see it in the list, explore its several branches, and check the pipelines executed for each one:

Figure: Jenkins list of jobs, branches and pipeline runs for master branch.

Validate

Now that all pieces are running together and we checked the core functionality, let’s validate if the solution is up for a typical GitFlow development process:

Build master branch Jenkins job;

Check that production deploy is running and provides the expected value:

➜  ~ curl -k -w '\n' --request GET 'https://example.localhost/greeting'
{"id":1,"content":"Hello, World","env":"default_value"}

Create develop branch and build respective Jenkins job;

Check if staging deployment is running properly:

➜  ~ curl -k -w '\n' --request GET 'https://example-staging.localhost/greeting'
{"id":1,"content":"Hello, World","env":"default_value"}

Checkout develop branch, and change the default name argument value of the greeting method from “World” to “World!”;
Commit and wait for Jenkins job to finish, in order to update the staging deployment;

Check that default value is changed on staging deployment:

➜  ~ curl -k -w '\n' --request GET 'https://example-staging.localhost/greeting'
{"id":1,"content":"Hello, World!","env":"default_value"}

Merge develop branch into master branch;
Wait for master Jenkins job to finish and update production deployment;

Check if production deployment is properly updated:

➜  ~ curl -k -w '\n' --request GET 'https://example.localhost/greeting'
{"id":1,"content":"Hello, World!","env":"default_value"}

Yes, everything is working automagically!

Conclusion

The approach presented in this post allows teams to automatically and continuously integrate, deploy, validate and share their work, fostering better product quality, developer independence and team collaboration. It is definitely nothing completely new, but the path to achieve it was not as straightforward as initially expected and required a lot of trial and error. Just out of curiosity, check below the Jenkins project status with the runs it took until the pipeline executed successfully:

Figure: The path to success.

All in all, I hope this post helps you and your team to easily build your CI/CD pipelines with Jenkins and Kubernetes.

Please remember that your comments, suggestions and contributions are more than welcome.

Let’s automate all the things! :sunglasses: :muscle:

Apache Cassandra with JPA: Achilles vs. Datastax vs. Kundera

2019-01-13T09:00:00+00:00

TL;DR

This post uses JPA libraries to communicate with Apache Cassandra, comparing Achilles, Datastax and Kundera. The last one delivers the best processing speeds with the lowest consumption of computational resources.

Source code is available on Github with detailed documentation on how to build and run the tests using Docker.

Goal

With the overwhelming amounts of data generated by today's technological solutions, one of the main challenges is to find the best way to properly store, manage and serve that data. Apache Cassandra is one such solution: a NoSQL database designed for large-scale data management with high availability, consistency and performance. When performing millions of operations per day on top of such databases, every millisecond counts and has a significant impact on overall system behavior.

The main goal of this project is to use different JPA libraries to communicate with Cassandra, comparing usage complexity, processing speeds and resources usage. The following architecture is proposed to achieve the aforementioned goal, which contains the following components and interfaces:

Cassandra: database for large-scale data management;
Datastax Native: Java library to communicate with Cassandra;
Datastax ORM: JPA library to communicate with Cassandra;
Kundera: JPA library to communicate with Cassandra;
Achilles: JPA library to communicate with Cassandra.

Figure: Illustration of the implementation architecture of Cassandra and JPA clients.

The architecture is implemented using the following technologies:

Java 8: main programming language for experiment;
Maven: dependency management and package building;
Docker: components containerization, orchestration and setup;
Docker Compose: simplify running multi-container solutions with dependencies.

Apache Cassandra

Apache Cassandra is an open-source and distributed column-based database, designed for large-scale applications and to handle large amounts of data with high availability with no single point of failure. It was initially developed at Facebook and is currently part of the Apache Software Foundation. Nowadays, Apache Cassandra is one of the most used NoSQL databases, as we can see in the Figure below:

Figure: Popularity of several NoSQL databases from DB-Engines.

When comparing Cassandra with other NoSQL databases, various studies already present a detailed evaluation and comparison, such as: End Point, Altoros, and Çankaya University. Overall, Cassandra presents top results when used with large amounts of data and with multiple nodes, achieving high throughput with low latency. Thus, Cassandra might be recommended when:

It runs on more than one server node, especially with a geographically distributed cluster;
Data can be partitioned via a key, which allows the database to be spread across multiple nodes;
Writes exceed reads by a large margin;
Read access is performed by a known primary key;
Data is rarely updated;
There is no need to perform join or aggregate operations (e.g., sum, min, or max), since they must be pre-computed and stored.

Many companies are effectively using Cassandra as the core data storage and management solution, such as CapitalOne, Coursera, eBay, Hulu and NASA. Such examples show that Cassandra can be used with different types of data and targeting different purposes, such as financial, health, entertainment, web analytics and IoT.

Apache Cassandra is available in major cloud providers, such as Amazon AWS, Microsoft Azure and Google Cloud. However, both Amazon and Microsoft provide their own NoSQL database implementations (DynamoDB and CosmosDB), with support for Cassandra APIs and migration. Other companies provide enterprise support for on-premises or cloud installation and maintenance, such as Datastax and Bitnami.

JPA libraries for Cassandra

The official Cassandra documentation page presents a comprehensive list of available libraries to communicate with Cassandra using Java. A brief analysis shows that only some projects are active and have significant community support:

Achilles: active project with small community;
Astyanax: deprecated and is no longer supported;
Casser: very small community and is not clear if project is still active;
Datastax: active project with large community and enterprise interest;
Kundera: active project with large community;
PlayORM: specific for Play Framework and project does not look active.

Based on such analysis, Achilles, Datastax and Kundera are the JPA libraries that will be considered during this analysis. To have a point of comparison, both Datastax Native and Datastax ORM implementations will be used.

How to compare JPA libraries?

In order to have a fair performance and resources usage comparison of the several JPA libraries for Cassandra, it is important to consider and analyse several questions in detail, such as:

Which database operations should be executed and compared?
What type of data should be considered?
What is the data complexity?
What are the relevant performance indicators?
How to measure and collect the performance indicators?
How to collect consistent results without interferences and outliers?

Taking the previous topics into consideration, the following testing guidelines were defined:

Operation types: write, read, update and delete;
Data type: single table with simple fields and without relations;
Data singularity: all operations should be performed with unique data values to avoid caching;
Performance measurement: elapsed time to perform each operation;
Resources usage measurement: CPU and RAM usage of client and server applications;
Repetition factor: all tests should be repeated several times to collect average values instead of results from single executions.

A simple approach will be followed for the data definition. The following Figure illustrates the User class that will be used during the tests, which contains only four textual attributes (unique identifier, first name, last name and city). In summary, every time an operation is performed, an instance of the User class is written, read, updated or deleted on Cassandra.

Figure: Illustration of the simple User class and respective attributes.

The following pseudocode presents the algorithm applied to collect the processing times for each library and operation types, using a set of users with different attributes. For each library and test cycle, each operation type (write, read, update and delete) will be executed \(O\) times (TOTAL_OPERATIONS), which is repeated \(R\) times (TOTAL_REPETITIONS) to calculate the average of total processing times. If multiple cycles are defined, the previous process is repeated \(C\) times (TOTAL_CYCLES) to collect average values of all repetitions. In the end, average times of all cycles and repetitions are collected per library and operation type. That way, all tasks are repeated to make sure external interferences have no impact on compared processing times.

FOR EACH library in [datastax_native, datastax_orm, kundera, achilles]
	FOR EACH cycle in TOTAL_CYCLES (C)
		SET users of size TOTAL_REPETITIONS*TOTAL_OPERATIONS
		FOR EACH operation type in [write, read, update, delete]
			FOR EACH repetition in TOTAL_REPETITIONS (R)
				FOR EACH operation in TOTAL_OPERATIONS (O)
					GET unique user from users
					CALL operation with unique user instance
					GET operation processing time
				END FOR
				GET total time of all operations
			END FOR
			GET average of repeated total times
		END FOR
		GET average times per operation type
	END FOR
	GET average times per library and operation type
END FOR

Pseudocode: Algorithm defined to perform JPA libraries tests.
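As a rough illustration of the pseudocode above, the nested loops can be sketched in Java for a single library and a single operation type. This is not the project's actual harness: BenchmarkSketch and averageTotalMillis are hypothetical names, the operation is passed in as a plain callback, and the sketch times each repetition as a whole instead of summing per-operation times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Consumer;

// Hypothetical harness: names are illustrative, not the project's actual code.
final class BenchmarkSketch {
    private BenchmarkSketch() {}

    /**
     * Runs the operation O times per repetition, R repetitions per cycle, for C cycles,
     * and returns the average total time (in milliseconds) of one repetition.
     */
    static double averageTotalMillis(int cycles, int repetitions, int operations,
                                     Consumer<UUID> operation) {
        double sumOfCycleAverages = 0;
        for (int c = 0; c < cycles; c++) {
            // Fresh unique identifiers for every cycle, so no value is reused.
            List<UUID> users = new ArrayList<>();
            for (int i = 0; i < repetitions * operations; i++) {
                users.add(UUID.randomUUID());
            }
            double sumOfTotals = 0;
            int next = 0;
            for (int r = 0; r < repetitions; r++) {
                long start = System.nanoTime();
                for (int o = 0; o < operations; o++) {
                    operation.accept(users.get(next++));
                }
                // Total time of all operations in this repetition.
                sumOfTotals += (System.nanoTime() - start) / 1_000_000.0;
            }
            // Average of the repeated total times for this cycle.
            sumOfCycleAverages += sumOfTotals / repetitions;
        }
        return sumOfCycleAverages / cycles;
    }

    public static void main(String[] args) {
        // Stand-in for a real database write: just hash the identifier.
        double avg = averageTotalMillis(2, 3, 1_000, id -> id.hashCode());
        System.out.printf("average total time per repetition: %.3f ms%n", avg);
    }
}
```

In a real run the callback would wrap one write, read, update or delete call of the library under test, and the method would be called once per library and operation type.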

While executing the operations in the Java application, CPU and RAM resources usage will be collected on both client and server applications. By doing this we are able to evaluate if there is any significant impact of each JPA library on the Java application and Cassandra server resources usage.
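The post does not state how the resource figures were gathered, so the following is only one possible way to sample the client side from inside the JVM, using standard java.lang.management calls. ResourceSample is a hypothetical class; it covers the Java application only, and the Cassandra server would need to be observed separately (for example at container level).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical sampler for the client JVM only; an assumption, not the author's method.
final class ResourceSample {
    final long usedHeapBytes;
    final double systemLoadAverage; // negative when the platform does not report it

    private ResourceSample(long usedHeapBytes, double systemLoadAverage) {
        this.usedHeapBytes = usedHeapBytes;
        this.systemLoadAverage = systemLoadAverage;
    }

    /** Takes one snapshot of heap usage and of the system load average. */
    static ResourceSample capture() {
        Runtime rt = Runtime.getRuntime();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return new ResourceSample(rt.totalMemory() - rt.freeMemory(), os.getSystemLoadAverage());
    }

    public static void main(String[] args) {
        ResourceSample sample = capture();
        System.out.printf("used heap: %d bytes, load average: %.2f%n",
                sample.usedHeapBytes, sample.systemLoadAverage);
    }
}
```

Calling capture() before and after a test run, or periodically from a background thread, would give comparable snapshots per library.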

If you would like to check the results right away, you can jump to the Results section below.

Implementation

The Java application implementation was performed to minimize code replication as much as possible. However, different User classes are required to provide the specific Java annotations. Thus, the following Figure illustrates how the User interface is used to make sure different User classes implement the required methods.

Figure: Illustration of the User implementation.
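To make the idea concrete, here is a minimal sketch of what such a shared contract could look like. The method names on User and the PlainUser class are assumptions for illustration; the real interface may expose different methods, and the library-specific classes add their own annotations.

```java
import java.util.UUID;

// Hypothetical shape of the shared contract; the real interface may differ.
interface User {
    UUID getId();
    String getFirstName();
    String getLastName();
    String getCity();
}

// Annotation-free implementation, for illustration only.
final class PlainUser implements User {
    private final UUID id;
    private final String firstName;
    private final String lastName;
    private final String city;

    PlainUser(UUID id, String firstName, String lastName, String city) {
        this.id = id;
        this.firstName = firstName;
        this.lastName = lastName;
        this.city = city;
    }

    /** Builds a user whose values all derive from a fresh UUID, so no two users share data. */
    static PlainUser unique() {
        UUID id = UUID.randomUUID();
        return new PlainUser(id, "first_" + id, "last_" + id, "city_" + id);
    }

    @Override public UUID getId() { return id; }
    @Override public String getFirstName() { return firstName; }
    @Override public String getLastName() { return lastName; }
    @Override public String getCity() { return city; }

    public static void main(String[] args) {
        User user = unique();
        System.out.println(user.getId() + " " + user.getFirstName());
    }
}
```

Test code written against User then works unchanged with any library-specific implementation.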

To minimize complexity and to make sure that the different tests have the same core behavior, the Run abstract class implements methods to run write, read, update and delete tests using the configured number of operations, repetitions and cycles. That way, specific run classes only have to implement core methods to perform atomic operations using each JPA library. The following Figure illustrates such implementation details.

Figure: Illustration of the Run implementation.
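The split between a shared driver and library-specific atomic operations is essentially the template method pattern. The sketch below shows the idea with hypothetical names (RunSketch, InMemoryRun) and an in-memory map standing in for Cassandra; timing, repetitions and cycles are left out to keep the structure visible.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical names throughout: the project's Run class may differ in signatures.
abstract class RunSketch {
    // Atomic operations that each library-specific subclass provides.
    protected abstract void write(UUID id);
    protected abstract void read(UUID id);
    protected abstract void update(UUID id);
    protected abstract void delete(UUID id);

    /** Shared driver: applies the four operation types, in order, to every identifier. */
    final void run(Iterable<UUID> ids) {
        for (UUID id : ids) write(id);
        for (UUID id : ids) read(id);
        for (UUID id : ids) update(id);
        for (UUID id : ids) delete(id);
    }

    public static void main(String[] args) {
        InMemoryRun run = new InMemoryRun();
        run.run(Arrays.asList(UUID.randomUUID(), UUID.randomUUID()));
        System.out.println("successful reads: " + run.reads);
    }
}

// In-memory stand-in for a Cassandra-backed subclass.
final class InMemoryRun extends RunSketch {
    final Map<UUID, String> table = new LinkedHashMap<>();
    int reads;

    @Override protected void write(UUID id) { table.put(id, "John"); }
    @Override protected void read(UUID id) { if (table.containsKey(id)) reads++; }
    @Override protected void update(UUID id) { table.replace(id, "___u"); }
    @Override protected void delete(UUID id) { table.remove(id); }
}
```

Because run is final and lives in the base class, every library goes through exactly the same sequence of operations, which keeps the comparison fair.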

Finally, the main application just needs to take advantage of the run() methods to execute all the designed tests, as presented in the following Figure.

Figure: Illustration of the Main implementation.

Cassandra Server

Before starting with implementation details, it is crucial to have a Cassandra server running in order to develop and test the code. The following Docker Compose YML file is provided to run the Cassandra server with an attached network.

version: '3.6'

networks:
  bridge:
    driver: bridge

services:
  cassandra:
    image: cassandra:3.11
    environment:
      CASSANDRA_START_RPC: "true"
      CASSANDRA_CLUSTER_NAME: cassandra
    networks:
      bridge:
        aliases:
        - cassandra

Code: docker-compose.yml file for running Cassandra.

Unfortunately it was not possible to find any good web-based tool to access and manage Cassandra. In order to validate if operations were performed properly, a RazorSQL trial license was used instead. Let me know if you know any good web-based alternative :blush:.

Finally, the Cassandra server can be started using the docker compose tool as follows:

docker-compose up -d

Datastax Native

To use Datastax Native, the core Java dependency is required and should be defined in the project POM file.

<dependency>
	<groupId>com.datastax.cassandra</groupId>
	<artifactId>cassandra-driver-core</artifactId>
	<version>3.6.0</version>
</dependency>

Code: Maven dependency for Datastax Native implementation.

The code snippet below exemplifies how Datastax Native QueryBuilder can be used to connect, write, read, update and delete User data to/from Cassandra.

// Connect
Cluster cluster = Cluster.builder()
		.addContactPoint(Commons.EXAMPLE_CASSANDRA_HOST)
		.build();
Session session = cluster.connect();

// Write
Insert insert = QueryBuilder
		.insertInto("example", "user")
		.value("id", uuid)
		.value("first_name", "John")
		.value("last_name", "Smith")
		.value("city", "London");
session.execute(insert);

// Read
Select.Where select = QueryBuilder
		.select("id", "first_name", "last_name", "city")
		.from("example", "user")
		.where(QueryBuilder.eq("id", uuid));
ResultSet rs = session.execute(select);

// Update
Update.Where update = QueryBuilder
		.update("example", "user")
		.with(QueryBuilder.set("first_name", "___u"))
		.and(QueryBuilder.set("last_name", "___u"))
		.and(QueryBuilder.set("city", "___u"))
		.where(QueryBuilder.eq("id", uuid));
ResultSet rs = session.execute(update);

// Delete
Delete.Where delete = QueryBuilder
		.delete()
		.from("example", "user")
		.where(QueryBuilder.eq("id", uuid));
session.execute(delete);

Code: Example code to perform connect, write, read, update and delete operations using Datastax Native.

Datastax ORM

In addition to the core Datastax dependency, the mapping dependency is also required to support the ORM implementation:

<dependency>
	<groupId>com.datastax.cassandra</groupId>
	<artifactId>cassandra-driver-mapping</artifactId>
	<version>3.5.1</version>
</dependency>

Code: Maven dependency for Datastax ORM implementation.

The UserDatastax class is defined using the Java annotations provided by Datastax, which allow defining table and column characteristics, such as name and primary key.

@Table(keyspace = "example", name = "user",
        readConsistency = "QUORUM",
        writeConsistency = "QUORUM",
        caseSensitiveKeyspace = false,
        caseSensitiveTable = false)
public class UserDatastax implements User {
    @Column(name = "id")
    @PartitionKey
    private UUID id;

    @Column(name = "first_name")
    private String firstName;

    @Column(name = "last_name")
    private String lastName;

    @Column(name = "city")
    private String city;
...
}

Code: Datastax User implementation.

The code snippet below shows how simple it is to perform connect, write, read, update and delete operations using Datastax ORM.

// Connect
Cluster cluster = Cluster.builder()
		.addContactPoint(Commons.EXAMPLE_CASSANDRA_HOST)
		.build();
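Session session = cluster.connect();

// Create the mapper for the UserDatastax entity
MappingManager mappingManager = new MappingManager(session);
Mapper<UserDatastax> mapper = mappingManager.mapper(UserDatastax.class);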

// Write
UserDatastax user = new UserDatastax(uuid, "John", "Smith", "London");
mapper.save(user);

// Read
UserDatastax user = (UserDatastax) mapper.get(uuid);

// Update
UserDatastax user = users.get(uuid);
user.setFirstName(user.getFirstName() + "___u");
user.setLastName(user.getLastName() + "___u");
user.setCity(user.getCity() + "___u");
mapper.save(user);

// Delete
UserDatastax user = users.get(uuid);
mapper.delete(user);

Code: Example code to perform connect, write, read, update and delete operations using Datastax ORM.

Kundera

The following Java dependencies are added to use Kundera:

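<dependency>
	<groupId>com.impetus.kundera.core</groupId>
	<artifactId>kundera-core</artifactId>
	<version>3.13</version>
</dependency>

<dependency>
	<groupId>com.impetus.kundera.client</groupId>
	<artifactId>kundera-cassandra</artifactId>
	<version>3.13</version>
</dependency>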

Code: Maven dependencies for Kundera implementation.

Persistence configuration of Kundera is performed using the persistence.xml file, in order to specify how connectivity is performed to Cassandra and identify the classes that should be mapped. In order to automatically create the database, change the kundera.ddl.auto.prepare property from update to create.

<persistence xmlns="http://java.sun.com/xml/ns/persistence" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence
	http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd"
             version="2.0">
    <persistence-unit name="cassandra_pu">
        <provider>com.impetus.kundera.KunderaPersistence</provider>
        <class>org.davidcampos.cassandra.kundera.UserKundera</class>
        <exclude-unlisted-classes>true</exclude-unlisted-classes>
        <properties>
            <property name="kundera.nodes" value="cassandra"/>
            <property name="kundera.port" value="9160"/>
            <property name="kundera.keyspace" value="example"/>
            <property name="kundera.dialect" value="cassandra"/>
            <property name="kundera.ddl.auto.prepare" value="update"/>
            <property name="kundera.client.lookup.class"
                      value="com.impetus.client.cassandra.thrift.ThriftClientFactory"/>
        </properties>
    </persistence-unit>
</persistence>

Code: Kundera persistence configuration.

Adding the Cassandra connectivity configurations to persistence.xml reduces the properties required in the UserKundera class. Pay special attention to the schema property, which links the entity to the persistence-unit previously defined in the XML file.

@Entity
@Table(name = "user", schema = "example@cassandra_pu")
public class UserKundera implements User {
    @Id
    @Column(name = "id")
    private UUID id;

    @Column(name = "first_name")
    private String firstName;

    @Column(name = "last_name")
    private String lastName;

    @Column(name = "city")
    private String city;
...
}

Code: Kundera User implementation.

The following code snippet presents how Kundera can be used to perform connect, write, read, update and delete operations on Cassandra.

// Connect
Map<String, String> props = new HashMap<>();
props.put(CassandraConstants.CQL_VERSION, CassandraConstants.CQL_VERSION_3_0);

EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu", props);
EntityManager em = emf.createEntityManager();

// Write
UserKundera user = new UserKundera(uuid, "John", "Smith", "London");
em.persist(user);

// Read
UserKundera user = em.find(UserKundera.class, uuid);

// Update
UserKundera user = users.get(uuid);
user.setFirstName(user.getFirstName() + "___u");
user.setLastName(user.getLastName() + "___u");
user.setCity(user.getCity() + "___u");
em.merge(user);

// Delete
UserKundera user = users.get(uuid);
em.remove(user);

Code: Example code to perform connect, write, read, update and delete operations using Kundera.

By default Kundera provides a considerable amount of logging information, which can be minimized by adding the following logback.xml file to the resources folder.

     level="ERROR">

Achilles

Achilles requires the following Java dependency to be added to the POM file:

<dependency>
	<groupId>info.archinnov</groupId>
	<artifactId>achilles-core</artifactId>
	<version>6.0.0</version>
</dependency>

Code: Maven dependency for Achilles implementation.

As presented below, the UserAchilles class is defined with the respective Java annotations.

@Table(table = "user")
public class UserAchilles implements User {
    @Column(value = "id")
    @PartitionKey
    private UUID id;

    @Column(value = "first_name")
    private String firstName;

    @Column(value = "last_name")
    private String lastName;

    @Column(value = "city")
    private String city;
...
}

Code: Achilles User implementation.

After the entity classes are defined, Achilles requires building the project to automatically generate the manager classes that allow interacting with Cassandra. If any entity class changes, the project needs to be built again to regenerate the manager classes. To enable source code auto-complete for these classes in IntelliJ IDEA, the generated classes need to be added as sources of the project, as shown in the Figure below.

Figure: Project sources configuration on IntelliJ IDEA.

The following code snippet presents how to perform connect, write, read, update and delete operations using the UserAchilles_Manager class generated by Achilles.

// Connect
Cluster cluster = Cluster.builder()
		.addContactPoint(Commons.EXAMPLE_CASSANDRA_HOST)
		.build();

Session session = cluster.connect();

ManagerFactory managerFactory = ManagerFactoryBuilder
		.builder(cluster)
		.withDefaultKeyspaceName("example")
		.doForceSchemaCreation(true)
		.build();

UserAchilles_Manager manager = managerFactory.forUserAchilles();

// Write
UserAchilles user = new UserAchilles(uuid, "John", "Smith", "London");
manager.crud().insert(user).execute();

// Read
UserAchilles user = manager.crud().findById(uuid).get();

// Update
UserAchilles user = users.get(uuid);
user.setFirstName(user.getFirstName() + "___u");
user.setLastName(user.getLastName() + "___u");
user.setCity(user.getCity() + "___u");
manager.crud().update(user).execute();

// Delete
UserAchilles user = users.get(uuid);
manager.crud().delete(user).execute();

Code: Example code to perform connect, write, read, update and delete operations using Achilles.

Elapsed time measurement

The measurement of the elapsed time is performed to check the execution of the atomic operation only. This means that the time required to create or get User objects will not be considered. In the following code example we can check that a Stopwatch is used to measure the elapsed time of the persist operation only.

// Get UUID
UUID uuid = Commons.uuids.get(repetition * Commons.OPERATIONS + i);

// Create user
UserKundera user = new UserKundera(
		uuid,
		"John" + i,
		"Smith" + i,
		"London" + i
);
users.put(uuid, user);

// Store user
Commons.resumeOrStartStopWatch(stopwatch);
em.persist(user);
stopwatch.suspend();

Code: Example code to measure the operation elapsed time.
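To make the resume and suspend pattern more concrete, here is a minimal sketch of a suspendable stopwatch built only on the standard library. The SuspendableStopwatch class and its injectable clock are assumptions made for illustration; they are not the Stopwatch or the Commons.resumeOrStartStopWatch helper used in the project.

```java
import java.util.function.LongSupplier;

// Minimal suspendable stopwatch (hypothetical stand-in for the one used in the post)
final class SuspendableStopwatch {
    private final LongSupplier nanoClock;
    private long accumulatedNanos = 0;
    private long startedAt = 0;
    private boolean running = false;

    SuspendableStopwatch(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    // Starts on the first call and resumes on later calls
    void resumeOrStart() {
        if (!running) {
            startedAt = nanoClock.getAsLong();
            running = true;
        }
    }

    // Stops counting until the next resumeOrStart() call
    void suspend() {
        if (running) {
            accumulatedNanos += nanoClock.getAsLong() - startedAt;
            running = false;
        }
    }

    // Total time spent between resumeOrStart() and suspend() calls
    long elapsedNanos() {
        return running
                ? accumulatedNanos + (nanoClock.getAsLong() - startedAt)
                : accumulatedNanos;
    }

    public static void main(String[] args) {
        SuspendableStopwatch stopwatch = new SuspendableStopwatch(System::nanoTime);
        for (int i = 0; i < 3; i++) {
            String user = "John" + i;   // preparation work: not measured
            stopwatch.resumeOrStart();
            user.hashCode();            // stands in for em.persist(user)
            stopwatch.suspend();
        }
        System.out.println("Measured nanoseconds: " + stopwatch.elapsedNanos());
    }
}
```

Passing the clock as a parameter is only a convenience for testing the sketch with fixed timestamps; in a real run System::nanoTime would be used.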

Main

To get everything together, the Main application is created to run the tests for each JPA library, considering the configurations provided in environment variables.

public class Main {
    public static void main(final String... args) throws InterruptedException {
        RunDatastaxNative runDatastaxNative = new RunDatastaxNative();
        runDatastaxNative.run();

        RunDatastax runDatastax = new RunDatastax();
        runDatastax.run();

        RunKundera runKundera = new RunKundera();
        runKundera.run();

        RunAchilles runAchilles = new RunAchilles();
        runAchilles.run();
    }
}

Code: Main program to run the tests for each JPA library.

Configurations

The following configurations are required to connect with the Apache Cassandra server and configure the tests properly:

“EXAMPLE_CASSANDRA_HOST” and “EXAMPLE_CASSANDRA_PORT”: Cassandra host and port;
“EXAMPLE_OPERATIONS”: number of operations to run;
“EXAMPLE_REPETITIONS”: number of times to repeat operations execution and average values;
“EXAMPLE_CYCLES”: number of times to repeat tests execution and average values.

Such configurations will be loaded from environment variables using the Commons class, which assumes default values if no environment variables are defined. Moreover, unique identifiers are generated upfront so that each operation uses a UUID that was never used before, creating EXAMPLE_OPERATIONS*EXAMPLE_REPETITIONS unique identifiers.

public final static int OPERATIONS = System.getenv("EXAMPLE_OPERATIONS") != null ?
		Integer.parseInt(System.getenv("EXAMPLE_OPERATIONS")) : 1000;

public final static int REPETITIONS = System.getenv("EXAMPLE_REPETITIONS") != null ?
		Integer.parseInt(System.getenv("EXAMPLE_REPETITIONS")) : 5;

public final static int CYCLES = System.getenv("EXAMPLE_CYCLES") != null ?
		Integer.parseInt(System.getenv("EXAMPLE_CYCLES")) : 5;

public static List<UUID> uuids = generateUUIDs();

public final static String EXAMPLE_CASSANDRA_HOST = System.getenv("EXAMPLE_CASSANDRA_HOST") != null ?
		System.getenv("EXAMPLE_CASSANDRA_HOST") : "cassandra";

public final static String EXAMPLE_CASSANDRA_PORT = System.getenv("EXAMPLE_CASSANDRA_PORT") != null ?
		System.getenv("EXAMPLE_CASSANDRA_PORT") : "9160";

Code: Commons class to load project configurations from environment variables.
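The generateUUIDs() method is not shown above. The sketch below is one possible way to combine the "environment variable or default" lookup with the upfront generation of identifiers; ConfigSketch, intSetting and generateUuids are hypothetical names, and the real Commons class may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class ConfigSketch {
    // Reads an integer setting, falling back to the default when it is missing
    static int intSetting(Map<String, String> env, String name, int defaultValue) {
        String raw = env.get(name);
        return raw != null ? Integer.parseInt(raw.trim()) : defaultValue;
    }

    // Pre-generates one random UUID per operation and repetition
    static List<UUID> generateUuids(int operations, int repetitions) {
        int total = operations * repetitions;
        List<UUID> uuids = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            uuids.add(UUID.randomUUID());
        }
        return uuids;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        int operations = intSetting(env, "EXAMPLE_OPERATIONS", 1000);
        int repetitions = intSetting(env, "EXAMPLE_REPETITIONS", 5);
        List<UUID> uuids = generateUuids(operations, repetitions);

        // Same indexing as in the elapsed time example: repetition * OPERATIONS + i
        int repetition = 0;
        int i = 0;
        UUID first = uuids.get(repetition * operations + i);
        System.out.println(uuids.size() + " identifiers, first = " + first);
    }
}
```

Taking the environment as a Map keeps the lookup easy to exercise with custom values, without touching the real process environment.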

Packaging

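To build a fat JAR file with all dependencies included, the Maven Assembly Plugin was used with the following configuration:

<build>
	<plugins>
		<plugin>
			<artifactId>maven-assembly-plugin</artifactId>
			<version>3.1.0</version>
			<configuration>
				<descriptorRefs>
					<descriptorRef>jar-with-dependencies</descriptorRef>
				</descriptorRefs>
				<archive>
					<manifest>
						<mainClass>org.davidcampos.cassandra.main.Main</mainClass>
					</manifest>
				</archive>
			</configuration>
			<executions>
				<execution>
					<id>make-assembly</id>
					<phase>package</phase>
					<goals>
						<goal>single</goal>
					</goals>
				</execution>
			</executions>
		</plugin>
	</plugins>
</build>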

Since several classes have the @Table Java annotation in the same JAR package, Kundera will consider all annotated classes as persistence entities, which will cause an error similar to:

Exception in thread "main" com.impetus.kundera.loader.MetamodelLoaderException: Error while retrieving and storing entity metadata
	at com.impetus.kundera.configure.MetamodelConfiguration.loadEntityMetadata(MetamodelConfiguration.java:238)
	at com.impetus.kundera.configure.MetamodelConfiguration.configure(MetamodelConfiguration.java:112)
	at com.impetus.kundera.persistence.EntityManagerFactoryImpl.configure(EntityManagerFactoryImpl.java:158)
	at com.impetus.kundera.persistence.EntityManagerFactoryImpl.<init>(EntityManagerFactoryImpl.java:135)
	at com.impetus.kundera.KunderaPersistence.createEntityManagerFactory(KunderaPersistence.java:85)
	at javax.persistence.Persistence.createEntityManagerFactory(Persistence.java:79)
	at org.davidcampos.cassandra.kundera.KunderaExample.runWrites(KunderaExample.java:37)
	at org.davidcampos.cassandra.kundera.KunderaExample.main(KunderaExample.java:21)
	at org.davidcampos.cassandra.Main.main(Main.java:12)

To fix the error, please make sure Kundera excludes non-expected entity classes, adding the following configuration to the persistence.xml file:

<exclude-unlisted-classes>true</exclude-unlisted-classes>

Finally, to build the fat JAR, please run mvn clean package in the project folder, which stores the resulting JAR cassandra-jpa-example-1.0-SNAPSHOT-jar-with-dependencies.jar in the target folder.

Docker image

To build the Docker Image for the Java application, the following Dockerfile was built using the OpenJDK image as baseline:

FROM openjdk:8u151-jdk-alpine3.7
MAINTAINER David Campos (david.marques.campos@gmail.com)

# Install Bash
RUN apk add --no-cache bash

# Copy resources
WORKDIR /
COPY wait-for-it.sh wait-for-it.sh
COPY target/cassandra-jpa-example-1.0-SNAPSHOT-jar-with-dependencies.jar cassandra-jpa-example.jar

# Wait for Cassandra to be available and run application
CMD ./wait-for-it.sh -s -t 180 $EXAMPLE_CASSANDRA_HOST:$EXAMPLE_CASSANDRA_PORT -- java -Xmx512m -jar cassandra-jpa-example.jar

Code: Dockerfile to build Java application Docker image.

wait-for-it.sh is used to check if a Cassandra host and port is available and only run the Java application when connectivity is established. To build the docker image, run the following command in the project folder:

docker build -t cassandra-jpa-example .

Docker compose

To create the container that runs the Java application with the tests, the previous Docker Compose YML file should be extended with the application service, including the environment variables that provide the Cassandra and test configurations.

  java:
    image: cassandra-jpa-example
    depends_on:
    - cassandra
    environment:
      EXAMPLE_CASSANDRA_HOST: "cassandra"
      EXAMPLE_CASSANDRA_PORT: "9160"
      EXAMPLE_REQUEST_WAIT: 0
      EXAMPLE_OPERATIONS: 10000
      EXAMPLE_REPETITIONS: 3
    networks:
    - bridge

Code: Part of docker-compose.yml file for running Java application.

CPU and RAM usage

To collect the resource usage of Cassandra and the Java application separately, we decided to take advantage of the docker stats utility, which provides detailed RAM and CPU usage of a target container and allows customizing the output format. The following script continuously collects the resource usage of the Cassandra container and stores the results in the TSV file stats-cassandra.tsv.

#!/usr/bin/env bash
 while true; do docker stats --no-stream cassandra-jpa-example_cassandra_1 --format "\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}" | ts >> stats-cassandra.tsv; done 

Code: Script to collect RAM and CPU usage of a docker container.

Run

Now that everything is in place, it is time to start the containers using the docker-compose tool, passing the -d argument to detach and run the containers in the background:

docker-compose up -d

Such execution will provide detailed feedback regarding the success of creating and running each container and network:

Creating network "cassandra-jpa-example_bridge" with driver "bridge"
Creating cassandra-jpa-example_cassandra_1 ... done
Creating cassandra-jpa-example_java_1      ... done

In order to check if everything is working properly, we can take advantage of the docker logs tool to analyse the output being generated on each container.

docker logs cassandra-jpa-example_java_1 -f

Output should be similar to the following example:

wait-for-it.sh: waiting 180 seconds for cassandra:9160
wait-for-it.sh: cassandra:9160 is available after 17 seconds
17:19:24.002 [main] INFO  org.davidcampos.cassandra.datastax_native.RunDatastaxNative - 	WRITE	3	38102	16434	10255	12700.666666666666
17:20:02.617 [main] INFO  org.davidcampos.cassandra.datastax_native.RunDatastaxNative - 	READ	3	35910	14873	10247	11970.0
17:20:35.775 [main] INFO  org.davidcampos.cassandra.datastax_native.RunDatastaxNative - 	UPDATE	3	30508	11592	9240	10169.333333333334
17:21:08.828 [main] INFO  org.davidcampos.cassandra.datastax_native.RunDatastaxNative - 	DELETE	3	30453	10565	9673	10151.0

In parallel, run the docker-stats-cassandra.sh and docker-stats-java.sh scripts to collect the CPU and RAM usage of the Cassandra and Java application containers. The measurements are stored in TSV files with the following format:

Dec 15 16:53:46 	1.06GiB / 1.952GiB	54.29%	138.51%
Dec 15 16:53:48 	1.087GiB / 1.952GiB	55.71%	160.26%
Dec 15 16:53:50 	1.141GiB / 1.952GiB	58.44%	218.72%
Dec 15 16:53:52 	1.137GiB / 1.952GiB	58.25%	180.90%
Dec 15 16:53:54 	1.137GiB / 1.952GiB	58.25%	6.11%
Dec 15 16:53:56 	1.117GiB / 1.952GiB	57.23%	4.69%
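To post-process these files, each line can be split on tabs. The sketch below is an assumption about how such parsing could look (the StatsLineSketch class is hypothetical and not part of the project); it relies on the layout shown above: timestamp, memory usage, memory percentage and CPU percentage.

```java
// Parses one line of the TSV produced by the docker stats script (illustrative sketch)
final class StatsLineSketch {
    final String timestamp;
    final double memoryPercent;
    final double cpuPercent;

    private StatsLineSketch(String timestamp, double memoryPercent, double cpuPercent) {
        this.timestamp = timestamp;
        this.memoryPercent = memoryPercent;
        this.cpuPercent = cpuPercent;
    }

    // Expected fields: timestamp, memory usage, memory percentage, CPU percentage
    static StatsLineSketch parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 tab-separated fields: " + line);
        }
        return new StatsLineSketch(fields[0].trim(), percent(fields[2]), percent(fields[3]));
    }

    // Converts a value such as "54.29%" into 54.29
    private static double percent(String value) {
        return Double.parseDouble(value.trim().replace("%", ""));
    }

    public static void main(String[] args) {
        StatsLineSketch sample = parse("Dec 15 16:53:46 \t1.06GiB / 1.952GiB\t54.29%\t138.51%");
        System.out.println(sample.timestamp
                + " mem=" + sample.memoryPercent + "%"
                + " cpu=" + sample.cpuPercent + "%");
    }
}
```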

Results

Please keep in mind that the collected results are highly dependent on the preconditions previously described, namely:

Elapsed time measured considering atomic operations only;
Single thread application to perform operations;
Tests executed on macOS using a MacBook Pro with 4 cores @ 2.3 GHz and 16 GB RAM;
Cassandra and Java Application running on top of Docker;
CPU and RAM usage collected using docker stats.

The results were collected with the following configurations:

Number of operations: 1000, 5000 and 10000;
Number of repetitions: 3;
Number of cycles: 3.

The Figure below presents the average of the measured times for the several libraries and operation types. Overall, delete operations are the fastest, followed by writes. As expected from Cassandra's architecture and functionality, read operations take the longest. Comparing the JPA libraries, Kundera presents the fastest times for write, read, update and delete operations, while Achilles presents the worst results. Between the best and the worst library, 10K operations show an average difference of 3.2 seconds. If we extrapolate linearly to 10M operations, this difference can reach almost 1 hour. On average, Kundera is 28% faster than Achilles, 19% faster than Datastax ORM, and 24% faster than Datastax Native. It is quite interesting to see that Datastax ORM presents similar or better times than Datastax Native. Keep in mind that the User data is very simple, so it adds little complexity on top of the native and ORM solutions.

Figure: Comparison of Cassandra JPA libraries processing time for the different operation types.
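The extrapolation from 10K to 10M operations can be reproduced with a few lines of arithmetic. The sketch below assumes that the time difference scales linearly with the number of operations, which is a simplification; the class and method names are illustrative only. Under that assumption, 3.2 seconds for 10K operations becomes about 3,200 seconds for 10M operations, or roughly 53 minutes.

```java
final class ExtrapolationSketch {
    // Scales a measured time difference linearly to a larger number of operations
    static double extrapolateSeconds(double measuredSeconds,
                                     long measuredOperations,
                                     long targetOperations) {
        return measuredSeconds * targetOperations / measuredOperations;
    }

    public static void main(String[] args) {
        double seconds = extrapolateSeconds(3.2, 10_000L, 10_000_000L);
        System.out.printf("Difference at 10M operations: %.0f s (about %.0f min)%n",
                seconds, seconds / 60.0);
    }
}
```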

Moving on to the resource usage analysis, the Figure below presents the CPU and RAM consumption of the Java application and Cassandra while performing the tests. Overall, Kundera presents significantly lower CPU usage on both Cassandra and the Java application. Regarding RAM, there is no significant difference on the Cassandra side, regardless of the JPA library being used. However, Kundera and Achilles seem to use more RAM than the Datastax libraries. For instance, on the 10K operations test, Kundera presents up to 78% less CPU usage on Cassandra and up to 41% less CPU consumption on the Java application, while Kundera and Achilles use 7% more RAM than the Datastax libraries. Such differences might be related to the fact that Kundera holds operations in RAM before submitting them to Cassandra, which has a minor impact on RAM but significantly lowers CPU consumption on both the client and the server. However, it remains to be clarified whether more complex stored data will have a higher impact on RAM usage.

Figure: Comparison of Cassandra JPA libraries resources usage.

Conclusion

In conclusion, Kundera presents up to 28% faster performance results with significantly lower CPU impact on both the client application and the Cassandra server. These results should be considered when designing your next Cassandra and Java project, in order to reduce resource usage and increase processing throughput. Nevertheless, do not forget to evaluate the behavior of Kundera with your specific data and entity characteristics, requirements and complexity.

Please tell me if you had different results using this or other JPA libraries. Your comments, suggestions and contributions are more than welcome.

Happy new and techy 2019! :smile: :fireworks:

Kafka streaming with Spark and Flink

2018-11-01T19:00:00+00:00

TL;DR

Sample project taking advantage of Kafka messages streaming communication platform using:

1 data producer sending random numbers in textual format;
3 different data consumers using Kafka, Spark and Flink to count word occurrences.

Source code is available on GitHub with detailed documentation on how to build and run the different software components using Docker.

Goal

The main goal of this sample project is to mimic the streaming communication of today's large-scale solutions. An infrastructure is required to enable communication between components generating data sent to a centralized infrastructure. Such data is later consumed by other components with different purposes. The “Hello World” example project of such solutions is the Word Count problem, where producers send words to a central back-end and consumers count the occurrences of each word. The following actors are involved:

Producer sends words in textual format to message broker;
Message broker receives messages and serves them to registered consumers;
Consumers process messages and count occurrences of each word.
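Leaving the message broker aside, the counting done by each consumer boils down to the logic sketched below. This is a plain Java illustration with a hypothetical WordCountSketch class, not the Kafka, Spark or Flink consumer code of the project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Counts how many times each word occurs across all received messages
    static Map<String, Integer> count(List<String> messages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String message : messages) {
            for (String word : message.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Messages as a producer could send them: numbers in textual format
        List<String> messages = Arrays.asList("7", "3", "7", "1", "7");
        System.out.println(count(messages));
    }
}
```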

The following architecture is proposed, which contains the following components:

Producer: send words to message broker;
Zookeeper: service for centralized configuration and synchronization of distributed services. In this case it is required to install and configure Kafka;
Kafka: message broker to receive messages from producer and propagate them to consumers;
Kafka Consumer: count word occurrences using Kafka;
Spark Consumer: count word occurrences using Spark;
Flink Consumer: count word occurrences using Flink.

Figure: Illustration of the implementation architecture of the example project.

Such infrastructure will run on top of Docker, which simplifies the orchestration and setup processes. If we want to scale up the example, we can deploy it on a large-scale Docker-based orchestration platform, such as Docker Swarm or Kubernetes. Additionally, the following technologies are also used:

Java 8 as the main programming language for producer and consumers. I actually tried to use Java 10 first, but had several problems with the Scala versions used by Spark and Flink;
Maven for producer and consumers dependency management and build purposes;
Docker Compose to simplify the process of running multi-container solutions with dependencies.

Kafka

Kafka is becoming the de-facto standard messaging platform, enabling large-scale communication between software components producing and consuming streams of data for different purposes. It was originally built at LinkedIn and is currently part of the Apache Software Foundation. The following Figure illustrates the architecture of solutions using Kafka, with multiple components generating data that is consumed by different consumers for different purposes, making Kafka the communication bridge between them.

Figure: Illustration of Kafka capabilities as a message broker between heterogeneous producers and consumers. Source https://kafka.apache.org.

Hundreds of companies already take advantage of Kafka to provide their services, such as Oracle, LinkedIn, Mozilla and Netflix. As a result, it is being used in many different real-life use cases, such as messaging, website activity tracking, metrics collection, log aggregation, stream processing and event sourcing. For instance, in the IoT context, thousands of devices can send streams of operational data to Kafka, which might be processed and stored for many different purposes, such as improved maintenance, enhanced support and functionality optimization. Taking advantage of streaming enables reacting in real time to relevant changes on connected devices.

Kafka anatomy in 1 minute

To better understand how Kafka works, it is important to understand its main concepts:

Record: consists of a key, a value and a timestamp;
Topic: category of records;
Partition: subset of records of a topic that can reside in different brokers;
Broker: service in a node with partitions that allows consumers and producers to access the records of a topic;
Producer: service that puts records into a topic;
Consumer: service that reads records from a topic;
Consumer group: set of consumers sharing a common identifier, making sure that all partitions from a topic are read by a consumer group without consumers overlap.

The figure below illustrates the relation between the aforementioned Kafka concepts. In summary, messages that are sent to Kafka are organized into topics. Thus, a producer sends messages to a specific topic and a consumer reads messages from that topic. Each topic is divided into partitions, which can reside in different nodes and enable multiple consumers to read from a topic in parallel. Consumers are organized in consumer groups to make sure that every partition of a topic is consumed, while each partition is only consumed by a single consumer from the group. Considering the example illustrated in the figure, since Group A has two consumers, each consumer reads records from two different partitions. On the other hand, since Group B has four consumers, each consumer reads records from a single partition only.

Figure: Relation between Kafka producers, topics, partitions, consumers and consumer groups.
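The effect of the group size on the partitions each consumer reads can be imitated with a simple round-robin distribution. The sketch below is a simplification made for illustration: Kafka's real partition assignment strategies are more elaborate, and the class, method and consumer names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionAssignmentSketch {
    // Spreads partitions 0..partitions-1 over the consumers of ONE group, round-robin
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < partitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four partitions, as in the example with Group A and Group B
        System.out.println("Group A: " + assign(4, Arrays.asList("a1", "a2")));
        System.out.println("Group B: " + assign(4, Arrays.asList("b1", "b2", "b3", "b4")));
        // More consumers than partitions: the extra consumers stay idle
        System.out.println("Group C: " + assign(1, Arrays.asList("c1", "c2", "c3")));
    }
}
```

The last call shows why adding consumers to a group beyond the number of partitions does not increase parallelism: each partition still has a single owner within the group.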

Apache Spark and Apache Flink

There are several open-source and commercial tools to simplify and optimize real-time data processing, such as Apache Spark, Apache Flink, Apache Storm, Apache Samza or Apama. Considering the current popularity of Spark and Flink-based solutions and respective stream processing characteristics, these are the tools that will be used in this example. Nevertheless, since the source code is available on GitHub, it is straightforward to add additional consumers using one of the aforementioned tools.

Apache Spark is an open-source platform for distributed batch and stream processing, providing features for advanced analytics with high speed and availability. After its first release in 2014, it has been adopted by dozens of companies (e.g., Yahoo!, Nokia and IBM) to process terabytes of data. On the other hand, Apache Flink is an open-source framework for distributed stream data processing, mostly focused on providing low-latency and highly fault-tolerant data processing. It started from a fork of the Stratosphere distributed execution engine and it was first released in 2015. It has been used by several companies (e.g., eBay, Huawei and Zalando) to process data in real time.

Several blog posts already compare Spark and Flink features, functionality, latency and community. The blog posts from Chandan Prakash, Justin Ellingwood and Ivan Mushketyk present an interesting analysis, highlighting when one solution might provide added value in comparison with the other. In terms of functionality, the main difference is related to the actual stream processing support and implementation. In summary, there are two types of stream processing:

Native streaming (Flink): data records are processed as soon as they arrive, without waiting a specific amount of time for other records;
Micro-batching (Spark): data records are grouped into small batches and processed together with some seconds of delay.

Considering this design difference, if the goal is to react as soon as data is delivered to the back-end infrastructure and every second counts, such behaviour might make a difference. Nonetheless, for most use cases a few seconds of delay is not significantly relevant for business goals.
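The two processing styles can be contrasted with a small plain Java sketch. It is an illustration only and does not use the Spark or Flink APIs; in particular, batches are cut by size here for simplicity, whereas a micro-batching engine groups records by time interval. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

final class StreamingStylesSketch {
    // Native streaming: each item reaches the handler as soon as it arrives
    static <T> void nativeStreaming(List<T> arrivals, Consumer<List<T>> handler) {
        for (T item : arrivals) {
            handler.accept(Collections.singletonList(item));
        }
    }

    // Micro-batching: items are buffered and handed over in small groups
    static <T> void microBatching(List<T> arrivals, int batchSize, Consumer<List<T>> handler) {
        List<T> buffer = new ArrayList<>();
        for (T item : arrivals) {
            buffer.add(item);
            if (buffer.size() == batchSize) {
                handler.accept(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            handler.accept(new ArrayList<>(buffer)); // flush the last partial batch
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("1", "7", "7", "3", "1");
        nativeStreaming(words, batch -> System.out.println("native: " + batch));
        microBatching(words, 2, batch -> System.out.println("batch:  " + batch));
    }
}
```

In the micro-batching variant, the first item of each batch waits until the batch is complete before being handled, which is where the extra delay comes from.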

Kafka Server

Before putting our hands on code, we need to have a Kafka server running, in order to develop and test our code. The following Docker Compose YML file is provided to run Zookeeper, Kafka and Kafka Manager:

version: '3.6'

networks:
  bridge:
    driver: bridge

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 32181
      ZOOKEEPER_TICK_TIME: 2000
    networks:
      bridge:
        aliases:
          - zookeeper

  kafka:
    image: wurstmeister/kafka:latest
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ADVERTISED_HOST_NAME: 0.0.0.0
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:32181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      JMX_PORT: 9999
    networks:
      bridge:
        aliases:
          - kafka
  
  kafka-manager:
    image: sheepkiller/kafka-manager:latest
    environment:
      ZK_HOSTS: "zookeeper:32181"
    ports:
      - 9000:9000
    networks:
      - bridge

Code: docker-compose.yml file for running Zookeeper, Kafka and Kafka Manager.

Kafka Manager is a web-based tool to manage and monitor Kafka configurations, namely clusters, topics and partitions, among others. This tool will be used to monitor Kafka usage and the message processing rate. A bridge network is also included in the compose file to enable communication between the three services, which reach each other through the aliases announced on the network (“zookeeper” and “kafka”). That way, the connection strings provided in environment variables contain only the network alias and not specific IPs, which might vary from deployment to deployment.

To start the Kafka and Kafka Manager services, we use the docker-compose tool passing the -d argument to detach and run the containers in the background:

docker-compose up -d

Such execution will provide detailed feedback regarding the success of creating and running each container and network:

Creating network "kafka-spark-flink-example_bridge" with driver "bridge"
Creating kafka-spark-flink-example_kafka-manager_1 ... done
Creating kafka-spark-flink-example_zookeeper_1     ... done
Creating kafka-spark-flink-example_kafka_1         ... done

After starting the containers, visit http://localhost:9000 to access the Kafka Manager, which should be similar to the one presented in the figure below:

Figure: Kafka Manager interface to manage a topic and get operation feedback.

Now that Kafka is running, we can test the code as we develop it, sending messages and checking whether they are properly delivered. A single project will be created for the producer and the several consumers, varying the execution goal with environment variables. In fact, all configurations will be provided as environment variables, which simplifies the configuration process when executing Docker containers.

Configurations

The following configurations are required:

“EXAMPLE_KAFKA_SERVER”: Kafka server connection string to send and receive messages;
“EXAMPLE_KAFKA_TOPIC”: name of the Kafka topic to send and receive messages;
“EXAMPLE_ZOOKEEPER_SERVER”: Zookeeper server connection string to create Kafka topic.

Such configurations will be loaded from environment variables using the Commons class, which assumes default values if no environment variables are defined.

public class Commons {
    public final static String EXAMPLE_KAFKA_TOPIC = System.getenv("EXAMPLE_KAFKA_TOPIC") != null ?
            System.getenv("EXAMPLE_KAFKA_TOPIC") : "example";
    public final static String EXAMPLE_KAFKA_SERVER = System.getenv("EXAMPLE_KAFKA_SERVER") != null ?
            System.getenv("EXAMPLE_KAFKA_SERVER") : "localhost:9092";
    public final static String EXAMPLE_ZOOKEEPER_SERVER = System.getenv("EXAMPLE_ZOOKEEPER_SERVER") != null ?
            System.getenv("EXAMPLE_ZOOKEEPER_SERVER") : "localhost:32181";
}

Code: Commons class to load project configurations from environment variables.
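
The three fields above repeat the same "environment variable or default" pattern. As a minimal sketch, the lookup could be factored into one helper. The EnvConfig class and its getOrDefault method are hypothetical names that are not part of the original project, and the environment is passed in as a Map only to make the helper easy to test:

```java
import java.util.Map;

// Hypothetical helper (not part of the original project).
final class EnvConfig {
    private EnvConfig() {
    }

    // Returns the value of `name` in `env`, or `fallback` when the variable is not defined.
    static String getOrDefault(Map<String, String> env, String name, String fallback) {
        String value = env.get(name);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        // System.getenv() returns the process environment as an unmodifiable map.
        Map<String, String> env = System.getenv();
        System.out.println(getOrDefault(env, "EXAMPLE_KAFKA_TOPIC", "example"));
        System.out.println(getOrDefault(env, "EXAMPLE_KAFKA_SERVER", "localhost:9092"));
        System.out.println(getOrDefault(env, "EXAMPLE_ZOOKEEPER_SERVER", "localhost:32181"));
    }
}
```

With no variables defined, this prints the same defaults that the Commons class assumes.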

Topic

In this example each consumer has its own consumer group, which means that every message is delivered to every consumer. Within a single consumer group, however, a partition can be read by only one consumer at a time. Since the topic has a single partition, running multiple containers for the same consumer and consumer group results in only one of them receiving messages. For instance, if we run 3 containers for the Kafka consumer, only one of them will receive the messages.

Figure: Illustration of the topic partition and relation with consumer groups and respective consumers.
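
To make the partition rule concrete, here is a deliberately simplified model of how partitions are spread across the consumers of one group. It hands partitions out round-robin, which is an assumption made for illustration and not Kafka's actual assignment algorithm; the class and consumer names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model for illustration only; this is NOT Kafka's real partition assignor.
final class GroupAssignmentSketch {
    // Each partition gets exactly one owner within the group (round-robin).
    static Map<String, List<Integer>> assign(List<String> consumers, int partitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < partitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("consumer-1", "consumer-2", "consumer-3");
        // One partition: only one consumer of the group owns it, the others stay idle.
        System.out.println(assign(group, 1)); // {consumer-1=[0], consumer-2=[], consumer-3=[]}
        // Three partitions: every consumer of the group gets work.
        System.out.println(assign(group, 3)); // {consumer-1=[0], consumer-2=[1], consumer-3=[2]}
    }
}
```

Under this model, adding containers to a group only helps when the topic has at least as many partitions as consumers.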

In order to create the Kafka topic, the Zookeeper Client Java dependency is needed and should be added to the Maven POM file:

<dependency>
    <groupId>com.101tec</groupId>
    <artifactId>zkclient</artifactId>
    <version>0.10</version>
</dependency>

Code: Maven dependency to create a Kafka topic using the Zookeeper client.

The following code snippet implements the logic to create the Kafka topic if it does not exist. To achieve that, the Zookeeper client is used to establish a connection with the Zookeeper server, and afterwards the topic is created with only one partition and one replica.

private static void createTopic() {
    int sessionTimeoutMs = 10 * 1000;
    int connectionTimeoutMs = 8 * 1000;

    // Create Zookeeper Client
    ZkClient zkClient = new ZkClient(
            Commons.EXAMPLE_ZOOKEEPER_SERVER,
            sessionTimeoutMs,
            connectionTimeoutMs,
            ZKStringSerializer$.MODULE$);

    // Create Zookeeper Utils to perform management tasks
    boolean isSecureKafkaCluster = false;
    ZkUtils zkUtils = new ZkUtils(zkClient, new ZkConnection(Commons.EXAMPLE_ZOOKEEPER_SERVER), isSecureKafkaCluster);

    // Create topic if it does not exist
    int partitions = 1;
    int replication = 1;
    Properties topicConfig = new Properties();
    if (!AdminUtils.topicExists(zkUtils, Commons.EXAMPLE_KAFKA_TOPIC)) {
        AdminUtils.createTopic(zkUtils, Commons.EXAMPLE_KAFKA_TOPIC, partitions, replication, topicConfig, RackAwareMode.Safe$.MODULE$);
        logger.info("Topic {} created.", Commons.EXAMPLE_KAFKA_TOPIC);
    } else {
        logger.info("Topic {} already exists.", Commons.EXAMPLE_KAFKA_TOPIC);
    }

    zkClient.close();
}

Code: Connect to Zookeeper and create the Kafka topic if it does not exist yet.

Producer

Now that the topic is created, we are able to create the producer to send messages to it. To accomplish that, the Kafka Clients dependency is required in the Maven POM file:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.1.0</version>
</dependency>

Code: Maven dependency to create a Kafka Producer.

To create the Kafka Producer, four different configurations are required:

Kafka Server: host name and port of Kafka server (e.g., “localhost:9092”)
Producer identifier: unique identifier of the Kafka client (e.g., “KafkaProducerExample”);
Key and Value Serializers: serializers allow defining how objects are translated to and from the byte-stream format used by Kafka. In this example, since both key and values are Strings, the StringSerializer class already provided by Kafka can be used.

The createProducer method provides a Kafka Producer instance properly configured with the previously mentioned properties:

private static Producer<String, String> createProducer() {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, Commons.EXAMPLE_KAFKA_SERVER);
    props.put(ProducerConfig.CLIENT_ID_CONFIG, "KafkaProducerExample");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    return new KafkaProducer<>(props);
}

Code: Create a Kafka Producer.

To finish the Producer logic, we need to continuously send words to Kafka. The following code snippet implements that logic, sending a random word from the words array every EXAMPLE_PRODUCER_INTERVAL milliseconds (default value is 100ms). The code snippet is properly commented to make it self-explanatory.

public static void main(final String... args) {
    // Create topic
    createTopic();

    // Create array of words
    String[] words = new String[]{"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"};

    // Create random
    Random ran = new Random(System.currentTimeMillis());

    // Create producer
    final Producer<String, String> producer = createProducer();

    // Get time interval to send words
    int EXAMPLE_PRODUCER_INTERVAL = System.getenv("EXAMPLE_PRODUCER_INTERVAL") != null ?
            Integer.parseInt(System.getenv("EXAMPLE_PRODUCER_INTERVAL")) : 100;

    try {
        while (true) {
            // Get random word and unique identifier
            String word = words[ran.nextInt(words.length)];
            String uuid = UUID.randomUUID().toString();

            // Build record to send
            ProducerRecord<String, String> record = new ProducerRecord<>(Commons.EXAMPLE_KAFKA_TOPIC, uuid, word);
            
            // Send record to producer
            RecordMetadata metadata = producer.send(record).get();

            // Log record sent
            logger.info("Sent ({}, {}) to topic {} @ {}.", uuid, word, Commons.EXAMPLE_KAFKA_TOPIC, metadata.timestamp());

            // Wait to send next word
            Thread.sleep(EXAMPLE_PRODUCER_INTERVAL);
        }
    } catch (InterruptedException | ExecutionException e) {
        logger.error("An error occurred.", e);
    } finally {
        producer.flush();
        producer.close();
    }
}

Code: Send a random word to Kafka every 100ms (default value).

Consumers

Now that we are able to send words to a specific Kafka topic, it is time to develop the consumers that will process the messages and count word occurrences.

Kafka Consumer

Similar to the producer, the following properties are required to create the Kafka consumer:

Kafka Server: host name and port of Kafka server (e.g., “localhost:9092”)
Consumer Group Identifier: unique identifier of the consumer group (e.g., “KafkaConsumerGroup”);
Key and Value Deserializers: since both key and values are Strings, the StringDeserializer class is used.

After creating the consumer, we need to subscribe to the EXAMPLE_KAFKA_TOPIC topic, in order to receive the messages that are sent to it by the producer:

private static Consumer<String, String> createConsumer() {
    // Create properties
    final Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, Commons.EXAMPLE_KAFKA_SERVER);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "KafkaConsumerGroup");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    // Create the consumer using properties
    final Consumer<String, String> consumer = new KafkaConsumer<>(props);

    // Subscribe to the topic.
    consumer.subscribe(Collections.singletonList(Commons.EXAMPLE_KAFKA_TOPIC));
    return consumer;
}

Code: Create Kafka consumer subscribing to topic.

To continuously collect the records sent to the topic, we can take advantage of the poll method provided by the consumer, which waits up to the given timeout (1000 ms here) for new records and returns whatever is available. After that, we can process each record and count the number of word occurrences using a ConcurrentHashMap.

public static void main(final String... args) {
        // Counters map
        ConcurrentMap<String, Integer> counters = new ConcurrentHashMap<>();

        // Create consumer
        final Consumer<String, String> consumer = createConsumer();
        while (true) {
            // Get records
            final ConsumerRecords<String, String> consumerRecords = consumer.poll(1000);
            consumerRecords.forEach(record -> {
                // Get word
                String word = record.value();

                // Update word occurrences
                int count = counters.containsKey(word) ? counters.get(word) : 0;
                counters.put(word, ++count);

                // Log word number of occurrences
                logger.info("({}, {})", word, count);
            });
            consumer.commitAsync();
        }
    }

Code: Poll the Kafka topic with a 1000 ms timeout and count word occurrences.
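
As a side note on the counting step, the containsKey/get/put sequence can be expressed with a single Map.merge call. The sketch below isolates just that step on a plain list of words, so it runs without a Kafka broker; the WordCounterSketch class is a hypothetical name, not part of the original project:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-alone version of the counting step (no Kafka required).
final class WordCounterSketch {
    // Counts how many times each word occurs in the given list.
    static ConcurrentMap<String, Integer> count(List<String> words) {
        ConcurrentMap<String, Integer> counters = new ConcurrentHashMap<>();
        for (String word : words) {
            // Inserts 1 for a new word, otherwise adds 1 to the current count.
            counters.merge(word, 1, Integer::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Integer> counters = count(Arrays.asList("five", "three", "five"));
        System.out.println(counters.get("five"));  // 2
        System.out.println(counters.get("three")); // 1
    }
}
```

merge also returns the updated count, so inside the consumer loop the returned value could be logged directly.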

Spark Stream Consumer

To create the Spark Consumer, the following Java dependencies are required and should be added to the POM file. Special attention is required to the Scala versions of the dependencies (the version number after the underscore in the artifactId): the project and all dependencies must use the same Scala version (in this case 2.11), otherwise the application fails at runtime with a large number of missing-class exceptions.

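<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-scala_2.11</artifactId>
    <version>2.9.5</version>
</dependency>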

Code: Maven dependencies to create a Spark Consumer.

Since Spark performs micro-batching for stream processing, a batch interval of 5 seconds is defined, which means that the words received in the last 5 seconds are processed together in a single batch. To process the input streams of words, the MapReduce programming model is used, which has two different processing stages:

Map: filter and sort input data converting it to tuples (key/value pairs);
Reduce: processes the tuples from the map method and combines them into a smaller set of tuples considering a specific goal.

In this specific WordCount example, the following logic is followed:

Get input stream of words for the last 5 seconds;
Map words into tuples (e.g., “eight, 1”);
Reduce tuples by word summing the occurrences (e.g., “eight, 10”);
Print reduced set of tuples with words and total number of occurrences.
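
The steps above can be sketched on a single in-memory batch with plain Java streams, before bringing Spark into the picture. This is only an illustration of the map and reduce stages; the BatchWordCountSketch class is a hypothetical name, and a TreeMap is used just to get a stable printing order:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration of map + reduce on one batch of words.
final class BatchWordCountSketch {
    static Map<String, Integer> countBatch(List<String> batch) {
        return batch.stream()
                // Map: turn every word into a (word, 1) tuple.
                .map(word -> new SimpleEntry<String, Integer>(word, 1))
                // Reduce: group tuples by word and sum their values.
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Stand-in for the words received during one 5-second batch.
        List<String> batch = Arrays.asList("eight", "two", "eight", "ten", "eight");
        System.out.println(countBatch(batch)); // {eight=3, ten=1, two=1}
    }
}
```

Each call handles one batch independently, which mirrors the per-batch counts printed by the Spark consumer.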

The following code snippet implements the previously mentioned algorithm, taking advantage of the Spark JavaDStream and JavaPairDStream classes for Streams and MapReduce operations.

public static void main(final String... args) {
    // Configure Spark to connect to Kafka running on local machine
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, Commons.EXAMPLE_KAFKA_SERVER);
    kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, "SparkConsumerGroup");

    // Configure Spark to listen for messages on the configured topic
    Collection<String> topics = Arrays.asList(Commons.EXAMPLE_KAFKA_TOPIC);

    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkConsumerApplication");

    //Read messages in batch of 5 seconds
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Start reading messages from Kafka and get DStream
    final JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.Subscribe(topics, kafkaParams));

    // Read value of each message from Kafka and return it
    JavaDStream<String> lines = stream.map((Function<ConsumerRecord<String, String>, String>) kafkaRecord -> kafkaRecord.value());

    // Break every message into words and return list of words
    JavaDStream<String> words = lines.flatMap((FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).iterator());

    // Take every word and return Tuple with (word,1)
    JavaPairDStream<String, Integer> wordMap = words.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));

    // Count occurrence of each word
    JavaPairDStream<String, Integer> wordCount = wordMap.reduceByKey((Function2<Integer, Integer, Integer>) (first, second) -> first + second);

    //Print the word count
    wordCount.print();

    jssc.start();
    try {
        jssc.awaitTermination();
    } catch (InterruptedException e) {
        logger.error("An error occurred.", e);
    }
}

Code: Spark stream consumer to count word occurrences from the last 5s.

Flink Stream Consumer

To build the Flink consumer, the following dependencies are required in the Maven POM file:

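<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.4.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.4.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.4.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
    <version>1.4.2</version>
</dependency>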

Code: Maven dependencies to create a Flink Consumer.

The Flink consumer also takes advantage of the MapReduce programming model, following the same strategy previously presented for the Spark consumer. In this case, the Flink DataStream class is used, which provides cleaner and easier to understand source code, as we can see below.

public static void main(final String... args) {
    // Create execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Properties
    final Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, Commons.EXAMPLE_KAFKA_SERVER);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "FlinkConsumerGroup");

    DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer010<>(Commons.EXAMPLE_KAFKA_TOPIC, new SimpleStringSchema(), props));
    
    // Split up the lines in pairs (2-tuples) containing: (word,1)
    messageStream.flatMap(new Tokenizer())
            // group by the tuple field "0" and sum up tuple field "1"
            .keyBy(0)
            .sum(1)
            .print();

    try {
        env.execute();
    } catch (Exception e) {
        logger.error("An error occurred.", e);
    }
}

public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}

Code: Setup Flink to continuously consume messages and count and print occurrences per word.
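
The Tokenizer's normalize-and-split step can be tried in isolation, without Flink. The sketch below copies only that logic into a hypothetical TokenizerSketch class and shows why the length check matters: when a line starts with a non-word character, String.split returns an empty leading token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone copy of the Tokenizer's split logic (no Flink required).
final class TokenizerSketch {
    // Lowercases the line, splits it on runs of non-word characters and drops empty tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Eight"));      // [eight]
        // The leading space produces an empty first token, which the check filters out.
        System.out.println(tokenize(" one, Two!")); // [one, two]
    }
}
```

Since the producer in this example sends single lowercase words, every message yields exactly one token.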

Main

To get everything together, the Main application is created to run a specific application depending on the execution goal. The environment variable EXAMPLE_GOAL is used to get the goal of the program, i.e., to run a producer or a consumer with Kafka, Spark or Flink. By doing this we can have a single Docker Image to run the 4 different goals, which might vary with the provided environment variable.

public static void main(final String... args) {
    String EXAMPLE_GOAL = System.getenv("EXAMPLE_GOAL") != null ?
            System.getenv("EXAMPLE_GOAL") : "producer";

    logger.info("Kafka Topic: {}", Commons.EXAMPLE_KAFKA_TOPIC);
    logger.info("Kafka Server: {}", Commons.EXAMPLE_KAFKA_SERVER);
    logger.info("Zookeeper Server: {}", Commons.EXAMPLE_ZOOKEEPER_SERVER);
    logger.info("GOAL: {}", EXAMPLE_GOAL);

    switch (EXAMPLE_GOAL.toLowerCase()) {
        case "producer":
            KafkaProducerExample.main();
            break;
        case "consumer.kafka":
            KafkaConsumerExample.main();
            break;
        case "consumer.spark":
            KafkaSparkConsumerExample.main();
            break;
        case "consumer.flink":
            KafkaFlinkConsumerExample.main();
            break;
        default:
            logger.error("No valid goal to run.");
            break;
    }
}

Code: Main program to select which program to run.
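
As a design alternative to the switch statement, the goals could be kept in a map from goal name to action, so that adding a goal means adding one entry. This is only a sketch: the GoalDispatchSketch class is hypothetical and the actions are placeholders that print a message instead of calling the example classes.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical alternative to the switch-based dispatch; actions are placeholders.
final class GoalDispatchSketch {
    // Runs the action registered for `goal` (case-insensitive); returns false if none matches.
    static boolean run(Map<String, Runnable> goals, String goal) {
        Runnable action = goals.get(goal.toLowerCase(Locale.ROOT));
        if (action == null) {
            return false;
        }
        action.run();
        return true;
    }

    public static void main(String[] args) {
        Map<String, Runnable> goals = new LinkedHashMap<>();
        goals.put("producer", () -> System.out.println("would start the producer"));
        goals.put("consumer.kafka", () -> System.out.println("would start the Kafka consumer"));
        goals.put("consumer.spark", () -> System.out.println("would start the Spark consumer"));
        goals.put("consumer.flink", () -> System.out.println("would start the Flink consumer"));

        String goal = System.getenv("EXAMPLE_GOAL") != null ? System.getenv("EXAMPLE_GOAL") : "producer";
        if (!run(goals, goal)) {
            System.err.println("No valid goal to run.");
        }
    }
}
```

For four fixed goals the switch is perfectly adequate; the map mostly pays off when goals are registered from several places.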

Build package

Finally, in order to build a fat JAR file with all dependencies included, the Maven Shade Plugin was used. I first tried the Maven Assembly Plugin, but had problems gathering all Flink dependencies in the fat JAR.

    
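<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.1.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>jar-with-dependencies</shadedClassifierName>
                <artifactSet>
                    <includes>
                        <include>*:*</include>
                    </includes>
                </artifactSet>
                <transformers>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>reference.conf</resource>
                    </transformer>
                    <transformer
                            implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <manifestEntries>
                            <Main-Class>org.davidcampos.kafka.cli.Main</Main-Class>
                        </manifestEntries>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>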

To build the fat JAR, please run mvn clean package in the project folder, which stores the resulting JAR kafka-spark-flink-example-1.0-SNAPSHOT-jar-with-dependencies.jar in the target folder.

Docker Image

To build the Docker Image for the producer and consumers, the following Dockerfile was built using the OpenJDK image as baseline:

FROM openjdk:8u151-jdk-alpine3.7
MAINTAINER David Campos (david.marques.campos@gmail.com)

# Install Bash
RUN apk add --no-cache bash

# Copy resources
WORKDIR /
COPY wait-for-it.sh wait-for-it.sh
COPY target/kafka-spark-flink-example-1.0-SNAPSHOT-jar-with-dependencies.jar kafka-spark-flink-example.jar

# Wait for Zookeeper and Kafka to be available and run application
CMD ./wait-for-it.sh -s -t 30 $EXAMPLE_ZOOKEEPER_SERVER -- ./wait-for-it.sh -s -t 30 $EXAMPLE_KAFKA_SERVER -- java -Xmx512m -jar kafka-spark-flink-example.jar

Code: Dockerfile for building Docker image.

wait-for-it.sh is used to check whether a specific host and port are available and only runs the provided command once connectivity is established. wait-for-it.sh was developed by Giles Hall and is available at https://github.com/vishnubob/wait-for-it. In this example, producer and consumers should only be started when Kafka is successfully running with connectivity available.
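
For readers curious about what such a check involves, the idea can be sketched in Java with a plain TCP connection attempt that is retried until a deadline. This is not how wait-for-it.sh is implemented (it is a Bash script); the WaitForPortSketch class, the retry delay and the connect timeout are all assumptions made for illustration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical Java illustration of a "wait until host:port accepts connections" check.
final class WaitForPortSketch {
    // Retries a TCP connect until it succeeds or `timeoutMs` elapses.
    static boolean waitFor(String host, int port, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try (Socket socket = new Socket()) {
                // Assumed per-attempt connect timeout of 500 ms.
                socket.connect(new InetSocketAddress(host, port), 500);
                return true;
            } catch (IOException e) {
                // Not reachable yet: wait briefly (assumed 200 ms) and try again.
                Thread.sleep(200);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short timeout for the demo; the Dockerfile above waits up to 30 seconds.
        boolean reachable = waitFor("localhost", 9092, 2_000);
        System.out.println(reachable ? "Kafka port is reachable" : "Timed out waiting for Kafka");
    }
}
```

A successful TCP connection only shows that the port is open, not that the broker is fully ready to serve requests.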

To build the docker image, run the following command in the project folder:

docker build -t kafka-spark-flink-example .

After the build process, check that the image is available by running the command docker images. If the image is available, the output should be similar to the following:

REPOSITORY                  TAG                   IMAGE ID            CREATED             SIZE
kafka-spark-flink-example   latest                3bd70969dacd        4 days ago          253MB

Docker compose

To create the containers running the Producer and the three Consumers, the previous Docker Compose YML file should be extended, adding the configurations for the Kafka Producer, Kafka Consumer, Spark Consumer and Flink Consumer. The environment variables to specify the Kafka Topic, Kafka Server, Zookeeper Server, Execution goal and Messages cadence are also provided.

  kafka-producer:
    image: kafka-spark-flink-example
    depends_on:
      - kafka
    environment:
      EXAMPLE_GOAL: "producer"
      EXAMPLE_KAFKA_TOPIC: "example"
      EXAMPLE_KAFKA_SERVER: "kafka:9092"
      EXAMPLE_ZOOKEEPER_SERVER: "zookeeper:32181"
      EXAMPLE_PRODUCER_INTERVAL: 100
    networks:
      - bridge

  kafka-consumer-kafka:
    image: kafka-spark-flink-example
    depends_on:
      - kafka-producer
    environment:
      EXAMPLE_GOAL: "consumer.kafka"
      EXAMPLE_KAFKA_TOPIC: "example"
      EXAMPLE_KAFKA_SERVER: "kafka:9092"
      EXAMPLE_ZOOKEEPER_SERVER: "zookeeper:32181"
    networks:
      - bridge

  kafka-consumer-spark:
    image: kafka-spark-flink-example
    depends_on:
      - kafka-producer
    ports:
      - 4040:4040
    environment:
      EXAMPLE_GOAL: "consumer.spark"
      EXAMPLE_KAFKA_TOPIC: "example"
      EXAMPLE_KAFKA_SERVER: "kafka:9092"
      EXAMPLE_ZOOKEEPER_SERVER: "zookeeper:32181"
    networks:
      - bridge

  kafka-consumer-flink:
    image: kafka-spark-flink-example
    depends_on:
      - kafka-producer
    environment:
      EXAMPLE_GOAL: "consumer.flink"
      EXAMPLE_KAFKA_TOPIC: "example"
      EXAMPLE_KAFKA_SERVER: "kafka:9092"
      EXAMPLE_ZOOKEEPER_SERVER: "zookeeper:32181"
    networks:
      - bridge

Run

Now that everything is in place, it is time to start the containers using the docker-compose tool, passing the -d argument to detach and run the containers in the background:

docker-compose up -d

Such execution will provide detailed feedback regarding the success of creating and running each container and network:

Creating network "kafka-spark-flink-example_bridge" with driver "bridge"
Creating kafka-spark-flink-example_kafka-manager_1 ... done
Creating kafka-spark-flink-example_zookeeper_1     ... done
Creating kafka-spark-flink-example_kafka_1         ... done
Creating kafka-spark-flink-example_kafka-producer_1 ... done
Creating kafka-spark-flink-example_kafka-consumer-flink_1 ... done
Creating kafka-spark-flink-example_kafka-consumer-kafka_1 ... done
Creating kafka-spark-flink-example_kafka-consumer-spark_1 ... done

To stop and remove all containers, please take advantage of the down option of the docker-compose tool:

docker-compose down

Detailed feedback about stopping and destroying each container and network is also provided:

Stopping kafka-spark-flink-example_kafka-consumer-flink_1 ... done
Stopping kafka-spark-flink-example_kafka-consumer-kafka_1 ... done
Stopping kafka-spark-flink-example_kafka-consumer-spark_1 ... done
Stopping kafka-spark-flink-example_kafka_1                ... done
Stopping kafka-spark-flink-example_zookeeper_1            ... done
Stopping kafka-spark-flink-example_kafka-manager_1        ... done
Removing kafka-spark-flink-example_kafka-consumer-flink_1 ... done
Removing kafka-spark-flink-example_kafka-consumer-kafka_1 ... done
Removing kafka-spark-flink-example_kafka-consumer-spark_1 ... done
Removing kafka-spark-flink-example_kafka-producer_1       ... done
Removing kafka-spark-flink-example_kafka_1                ... done
Removing kafka-spark-flink-example_zookeeper_1            ... done
Removing kafka-spark-flink-example_kafka-manager_1        ... done
Removing network kafka-spark-flink-example_bridge

Validate

In order to check if everything is working properly, we can take advantage of the docker logs tool to analyse the output being generated on each container. In that context, we can check the logs of the producer and consumers to validate that data is being processed properly.

Producer

Run the following command to access producer logs:

docker logs kafka-spark-flink-example_kafka-producer_1 -f

Output should be similar to the following example, where each line represents a word already sent to Kafka:

43:41.355 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (ac8f0337-bbde-4e92-8659-c847aa7b7eaf, four) to topic example @ 1525725821264.
43:41.468 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (6ece8f3c-72b8-40a0-a37f-398d6cb9ee76, six) to topic example @ 1525725821455.
43:41.590 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (9eba2ad9-5926-4eac-b3b4-1bde27209d77, two) to topic example @ 1525725821569.
43:41.768 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (0eb0c80b-760e-47f3-8a73-f86868d83ff4, two) to topic example @ 1525725821694.
43:41.876 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (ca247271-07b4-4bb9-834a-27ec5168e9cf, two) to topic example @ 1525725821869.
43:41.985 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (ab715932-0b28-46cb-9c89-b6965e34619c, eight) to topic example @ 1525725821977.
43:42.103 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (74b60cc9-0849-4468-8125-fd6b368e5e66, three) to topic example @ 1525725822087.
43:42.218 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (51d80cf5-d601-476b-9eb6-33f44ca94716, ten) to topic example @ 1525725822204.
43:42.329 [main] INFO  org.davidcampos.kafka.producer.KafkaProducerExample - Sent (0a9a20b1-06c9-4103-ac3d-66bc48397936, eight) to topic example @ 1525725822318.

Kafka consumer

Run the following command to review Kafka consumer logs:

docker logs kafka-spark-flink-example_kafka-consumer-kafka_1 -f

For every word received the respective total number of occurrences is displayed, as we can see below:

14:43.463 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (five, 27)
14:43.581 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (three, 15)
14:43.709 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (seven, 35)
14:43.822 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (seven, 36)
14:43.931 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (four, 19)
14:44.043 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (four, 20)
14:44.157 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (five, 28)
14:44.273 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (seven, 37)
14:44.386 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (five, 29)
14:44.493 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (nine, 21)
14:44.604 [main] INFO  org.davidcampos.kafka.consumer.KafkaConsumerExample - (one, 12)

Spark consumer

To check Spark consumer logs please run:

docker logs kafka-spark-flink-example_kafka-consumer-spark_1 -f

Every 5s, Spark will output the number of occurrences for each word for that specific period of time, similar to:

-------------------------------------------
Time: 1541082310000 ms
-------------------------------------------
(two,4)
(one,7)
(nine,3)
(six,6)
(three,9)
(five,2)
(four,3)
(seven,4)
(eight,5)
(ten,2)

-------------------------------------------
Time: 1541082315000 ms
-------------------------------------------
(two,4)
(one,8)
(nine,3)
(six,9)
(three,7)
(five,1)
(four,4)
(seven,3)
(ten,7)

Additionally, you can check the Spark UI available at http://localhost:4040. This web-based tool provides relevant information for monitoring and instrumentation, with detailed information about the jobs executed, elapsed time and memory usage, among others.

Figure: Spark interface to check active jobs and respective status.

Flink consumer

Finally, we can check Flink logs by executing:

docker logs kafka-spark-flink-example_kafka-consumer-flink_1 -f

Since Flink processes the stream record by record rather than in micro-batches, it reacts to every message received. As a result, for every word received the respective total number of occurrences is displayed, as we can see below:

1> (ten,85)
4> (nine,104)
1> (ten,86)
4> (five,91)
4> (one,94)
4> (six,90)
1> (three,89)
4> (six,91)
4> (five,92)
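
The shape of this output, one line per message with an ever-growing total, can be reproduced with a small in-memory sketch of a per-record running count. It is plain Java rather than Flink, the RunningCountSketch class is a hypothetical name, and the subtask prefix shown in the Flink logs (such as "1>") is left out:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java illustration of a per-record running count (no Flink required).
final class RunningCountSketch {
    // Emits one "(word,total)" line per incoming word, where total is the count so far.
    static List<String> process(List<String> events) {
        Map<String, Integer> totals = new HashMap<>();
        List<String> output = new ArrayList<>();
        for (String word : events) {
            int total = totals.merge(word, 1, Integer::sum);
            output.add("(" + word + "," + total + ")");
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("ten", "nine", "ten")));
        // [(ten,1), (nine,1), (ten,2)]
    }
}
```

This contrasts with the Spark consumer, whose counts restart with every 5-second batch.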

We did it!

It is done and working properly! The producer is sending the messages to Kafka and all consumers are receiving and processing the messages, showing the number of occurrences for each word.

Scale it up

Just one more thing. What about increasing the number of messages being sent? As a first approach, we can change the time interval between requests. By default, this value is set to 100ms, which means that a message is sent every 100ms. To change this behaviour, set the EXAMPLE_PRODUCER_INTERVAL environment variable to specify the producer time interval between requests to Kafka. Thus, changing the docker-compose.yml accordingly (line 10), we can send a word to Kafka every 10ms.

kafka-producer:
    image: kafka-spark-flink-example
    depends_on:
      - kafka
    environment:
      EXAMPLE_GOAL: "producer"
      EXAMPLE_KAFKA_TOPIC: "example"
      EXAMPLE_KAFKA_SERVER: "kafka:9092"
      EXAMPLE_ZOOKEEPER_SERVER: "zookeeper:32181"
      EXAMPLE_PRODUCER_INTERVAL: 10
    networks:
      - bridge

In order to scale the number of messages even further, two different options can be considered:

Add multi-thread support to producer in order to send multiple messages at the same time;
Have multiple producer containers sending multiple messages at the same time.

Considering the example context, it is more straightforward to take advantage of Docker to run multiple containers of the producer service. In a real-world application, a less resource-intensive approach might be considered. Thus, in order to change the number of replicas for the producer service, we can take advantage of the --scale argument of docker-compose up. It works by specifying the number of containers for the service name, such as --scale SERVICE=NUM. In the next example we request three containers for the kafka-producer service:

docker-compose up -d --scale kafka-producer=3

When you do this, in the output you can check Docker starting three different containers for the producer service:

Creating kafka-spark-flink-example_kafka-producer_1 ... done
Creating kafka-spark-flink-example_kafka-producer_2 ... done
Creating kafka-spark-flink-example_kafka-producer_3 ... done

After a while, instead of receiving ~50 messages every 5s, we receive almost 1000 messages per 5s, which represents a 20x increase with just some small changes. The Spark consumer logs confirm the high number of words received in 5s:

-------------------------------------------
Time: 1525543330000 ms
-------------------------------------------
(two,92)
(one,96)
(nine,83)
(six,113)
(three,88)
(five,82)
(four,100)
(seven,91)
(eight,106)
(ten,88)
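
As a quick sanity check of these numbers, the theoretical upper bound per 5-second window can be computed from the interval and the number of producers, and compared with the sum of the counts in the log above. The ThroughputSketch class is a hypothetical name, and the bound ignores everything except the sleep interval:

```java
// Hypothetical back-of-the-envelope check of the message rates discussed above.
final class ThroughputSketch {
    // Upper bound on messages per window, ignoring send latency and logging time.
    static long expectedPerWindow(long windowMs, long intervalMs, int producers) {
        return (windowMs / intervalMs) * producers;
    }

    public static void main(String[] args) {
        System.out.println(expectedPerWindow(5_000, 100, 1)); // 50
        System.out.println(expectedPerWindow(5_000, 10, 3));  // 1500

        // Word counts copied from the Spark log above.
        int[] observed = {92, 96, 83, 113, 88, 82, 100, 91, 106, 88};
        int total = 0;
        for (int count : observed) {
            total += count;
        }
        System.out.println(total); // 939
    }
}
```

The observed 939 messages sit below the 1500 bound. A plausible explanation, not measured here, is that each loop iteration also waits for the synchronous producer.send(record).get() call and the log statement before sleeping.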

Besides this simple exercise to scale up the example, keep in mind that with only three servers, Jay Kreps was able to write 2 million messages per second into Kafka and read almost 1 million records from Kafka in a single thread. Such example reflects the scalability and processing power of Kafka. For a more detailed analysis, you can access this Kafka benchmark in the LinkedIn Engineering Blog.

Conclusion

I hope this example helps to understand how Kafka can be used as a communication broker between producers and consumers with different purposes. Nonetheless, keep in mind that using Kafka in large-scale applications with hundreds of thousands of producers and consumers requires a high level of expertise to configure the service and keep it running with the expected availability, performance, robustness and fault tolerance. Several companies already provide enterprise Kafka services for large-scale applications, such as Confluent, Amazon, Azure and Cloudera. Not saying it is cheap, but it is definitely an option if such expertise is not in the company portfolio.

Please remember that your comments, suggestions and contributions are more than welcome.

Happy Kafking and Streaming! :smile:

Hello World

2018-09-01T20:00:00+01:00

The goal of the first post is to validate the deployment process on GitHub pages using Jekyll.

Next post will be published soon! :smirk: