Written by
Bregt Coenen
code
Reading time 3 min
8 MAY 2025

In the complex world of modern software development, companies face the challenge of seamlessly integrating diverse applications developed and managed by different teams. A service mesh is an invaluable asset in overcoming this challenge. In this blog article, we delve into Istio Service Mesh and explore why investing in a service mesh like Istio is a smart move.

What is Service Mesh?

A service mesh is a software layer responsible for all communication between applications, referred to as services in this context. It introduces new functionality to manage the interaction between services, such as monitoring, logging, tracing, and traffic control. A service mesh operates independently of the code of each individual service, which enables it to work across network boundaries and collaborate with various management systems. Thanks to a service mesh, developers can focus on building application features without worrying about the complexity of the underlying communication infrastructure.

Istio Service Mesh in Practice

Consider managing a large cluster that runs multiple applications developed and maintained by different teams, each with diverse dependencies like Elasticsearch or Kafka. Over time, this results in a complex ecosystem of applications and containers, overseen by various teams. The environment becomes so intricate that administrators find it increasingly difficult to maintain a clear overview. This leads to a series of pertinent questions:

- What is the architecture like?
- Which applications interact with each other?
- How is the traffic managed?

Moreover, there are specific challenges that must be addressed for each individual application:

- Handling login processes
- Implementing robust security measures
- Managing network traffic directed towards the application
- ...

A service mesh such as Istio offers a solution to these challenges.
Istio acts as a proxy between the various applications (services) in the cluster, with each request passing through a component of Istio.

How Does Istio Service Mesh Work?

Istio introduces a sidecar proxy for each service in the microservices ecosystem. This sidecar proxy manages all incoming and outgoing traffic for the service. Additionally, Istio adds components that handle the incoming and outgoing traffic of the cluster as a whole. Istio's control plane lets you define policies for traffic management, security, and monitoring, which are then applied to these components.

For a deeper understanding of Istio Service Mesh functionality, our blog article, "Installing Istio Service Mesh: A Comprehensive Step-by-Step Guide", provides a detailed, step-by-step explanation of how to install and use Istio.

Why Istio Service Mesh?

- Traffic Management: Istio enables detailed traffic management, allowing developers to easily route, distribute, and control traffic between different versions of their services.
- Security: Istio provides a robust security layer with features such as traffic encryption using its own certificates, Role-Based Access Control (RBAC), and authentication and authorization policies.
- Observability: Through built-in instrumentation, Istio offers deep observability with tools for monitoring, logging, and distributed tracing. This allows IT teams to analyze the performance of services and quickly detect issues.
- Simplified Communication: Istio takes the complexity of service communication off the hands of application developers, allowing them to focus on building application features.

Is Istio Suitable for Your Setup?

While the benefits are clear, it is essential to consider whether the additional complexity of Istio fits your specific setup. Firstly, a sidecar container is required for each deployed service, potentially leading to undesired memory and CPU overhead.
Additionally, your team may lack the specialized knowledge required for Istio. If you are considering adopting Istio Service Mesh, seek guidance from specialists. Feel free to ask our experts for assistance.

More Information about Istio

Istio Service Mesh is a technological game-changer for IT professionals aiming for advanced control, security, and observability in their microservices architecture. Istio simplifies and secures communication between services, allowing IT teams to focus on building reliable and scalable applications.

Need quick answers to all your questions about Istio Service Mesh? Contact our experts.
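To make the traffic management and security points above a bit more tangible, here is a minimal sketch of what routing between two versions of a service could look like in Istio. The service name `orders`, the namespace `shop`, the `version` labels and the 90/10 split are all assumptions made up for this illustration; they do not come from a real setup.

```yaml
# Hypothetical example: names, hosts, labels and weights are illustrative only.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
  namespace: shop
spec:
  host: orders.shop.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # mutual TLS using certificates issued by Istio
  subsets:
    - name: v1
      labels:
        version: v1        # pods labelled version=v1
    - name: v2
      labels:
        version: v2        # pods labelled version=v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
  namespace: shop
spec:
  hosts:
    - orders.shop.svc.cluster.local
  http:
    - route:
        - destination:
            host: orders.shop.svc.cluster.local
            subset: v1
          weight: 90       # 90% of requests stay on the current version
        - destination:
            host: orders.shop.svc.cluster.local
            subset: v2
          weight: 10       # 10% of requests try the new version
```

The application code does not change for this: the sidecar proxies apply the split, which is the "independent of the service code" property described earlier.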

kubernetes aca group
Reading time 7 min
6 MAY 2025

Within ACA, there are multiple teams working on different (or the same!) projects. Every team has its own domains of expertise, such as developing custom software, marketing and communications, mobile development and more. The teams specialized in Atlassian products and in cloud combined their knowledge to create a highly available Atlassian stack on Kubernetes. Not only could we improve our internal processes this way, we could also offer this solution to our customers!

In this blog post, we’ll explain how our Atlassian and cloud teams built a highly available Atlassian stack on top of Kubernetes. We’ll also discuss the benefits of this approach as well as the problems we’ve faced along the way. While we’re damn close, we’re not perfect after all 😉 Lastly, we’ll talk about how we monitor this setup.

The setup of our Atlassian stack

Our Atlassian stack consists of the following products:

- Amazon EKS
- Amazon EFS
- Atlassian Jira Data Center
- Atlassian Confluence Data Center
- Amazon EBS
- Atlassian Bitbucket Data Center
- Amazon RDS

As you can see, we use AWS as the cloud provider for our Kubernetes setup. We create all the resources with Terraform. We’ve written a separate blog post on what our Kubernetes setup exactly looks like. You can read it here! The image below should give you a general idea.

The next diagram should give you an idea about the setup of our Atlassian Data Center.

While there are a few differences between the products and setups, the core remains the same. The application is launched as one or more pods described by a StatefulSet. The pods are called node-0 and node-1 in the diagram above. The first request is sent to the load balancer and will be forwarded to either the node-0 pod or the node-1 pod. Traffic is sticky, so all subsequent traffic from that user will be sent to the same pod.

Both node-0 and node-1 require persistent storage, which is used for the plugin cache and indexes. A different Amazon EBS volume is mounted on each of the pods.
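As an illustration of the "one StatefulSet, one EBS volume per pod" idea described above, a heavily simplified manifest might look like the sketch below. It is not our production configuration: the image reference, namespace, labels, port, mount path, volume size and the `gp2` storage class are assumptions chosen for the example.

```yaml
# Hypothetical sketch: image, names, sizes and storage class are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: node              # pods are named node-0 and node-1
  namespace: jira
spec:
  serviceName: node       # headless Service giving each pod a stable DNS name
  replicas: 2
  selector:
    matchLabels:
      app: jira
  template:
    metadata:
      labels:
        app: jira
    spec:
      containers:
        - name: jira
          image: registry.example.be/atlassian/jira:version   # placeholder image
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: local-home
              mountPath: /var/atlassian/application-data/jira  # assumed local home path
  volumeClaimTemplates:
    # One PersistentVolumeClaim, and therefore one EBS volume, is created per pod.
    - metadata:
        name: local-home
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2        # assumed EBS-backed storage class
        resources:
          requests:
            storage: 50Gi
```

Because the claim is declared in `volumeClaimTemplates`, each pod keeps its own volume across restarts, which is what makes it suitable for node-local data such as caches and indexes.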
Most of the data, like your Jira issues, Confluence spaces, … is stored in a database. The database is shared: node-0 and node-1 both connect to the same database. We usually use PostgreSQL on Amazon RDS.

The node-0 and node-1 pods also need to share large files which we don’t want to store in a database, for example attachments. The same Amazon EFS volume is mounted on both pods. When changes are made, for example when an attachment is uploaded to an issue, the attachment is immediately available on both pods.

We use CloudFront (CDN) to cache static assets and improve the web response times.

The benefits of this setup

By using this setup, we can leverage the advantages of Docker and Kubernetes and of the Data Center versions of the Atlassian tooling. There are a lot of benefits to this kind of setup, but we’ve listed the most important advantages below.

- It’s a self-healing platform: containers and worker nodes are automatically replaced when a failure occurs. In most cases, we don’t even have to do anything and the stack takes care of itself. Of course, it’s still important to investigate any failures so you can prevent them from occurring in the future.
- Zero-downtime deployments: when upgrading the first node within the cluster to a new version, we can still serve the old version to our customers on the second. Once the upgrade is complete, the new version is served from the first node and we can upgrade the second node. This way, the application stays available, even during upgrades.
- Deployments are predictable: we use the same Docker container for development, staging and production. That’s why we are confident the container will be able to start in our production environment after a successful deploy to staging.
- Highly available applications: when a failure occurs on one of the nodes, traffic can be routed to the other node. This way you have time to investigate the issue and fix the broken node while the application stays available.
- It’s possible to sync data from one node to the other. For example, syncing the index from one node to the other to fix a corrupt index can be done in just a few seconds, while a full reindex can take a lot longer.
- You can implement a high level of security on all layers (AWS, Kubernetes, application, …):
  - AWS CloudTrail records activity on AWS, which helps detect unauthorized access, and an alert is sent in case of an anomaly.
  - AWS Config guards against AWS security group changes. You can find out more on how to secure your cloud with AWS Config in our blog post.
  - Terraform makes sure changes to the AWS environment are approved by the team before rollout.
  - Since upgrading Kubernetes master and worker nodes has little to no impact, the stack is always running a recent version with the latest security patches.
  - We use a combination of namespacing and RBAC to make sure applications and deployments can only access resources within their namespace, with least privilege.
  - NetworkPolicies are rolled out using Calico. We deny all traffic between containers by default and only allow specific traffic.
  - We use recent versions of the Atlassian applications and act on Security Advisories whenever they are published by Atlassian.

Interested in leveraging the power of Kubernetes yourself? You can find more information about how we can help you on our website!
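The "deny by default, allow specific traffic" approach mentioned in the security list above can be sketched with two standard Kubernetes NetworkPolicy objects. The namespace `jira`, the `app: jira` label, the `ingress-nginx` namespace and port 8080 are assumptions for the sake of the example, not a copy of our actual policies.

```yaml
# Hypothetical sketch: namespace, labels and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: jira
spec:
  podSelector: {}            # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller-to-jira
  namespace: jira
spec:
  podSelector:
    matchLabels:
      app: jira
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed ingress controller namespace
      ports:
        - protocol: TCP
          port: 8080
```

Note that a default-deny policy covering egress also blocks DNS lookups and database connections, so in practice it needs to be paired with explicit egress rules for those destinations.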
Problems we faced during the setup

Migrating to this stack wasn’t all fun and games. We’ve definitely faced some difficulties and challenges along the way. By discussing them here, we hope we can facilitate your migration to a similar setup!

- Some plugins (usually older plugins) only worked on the standalone version of the Atlassian application. We needed to find an alternative plugin or use vendor support to get the same functionality on Atlassian Data Center.
- We had to make some changes to our Docker containers and network policies (i.e. firewall rules) to make sure both nodes of an application could communicate with each other.
- Most of the applications have some extra tools within the container. For example, Synchrony for Confluence, Elasticsearch for Bitbucket, eazyBI for Jira, and so on. These extra tools all needed to be refactored for a multi-node setup with shared data.
- In our previous setup, each application was running on its own virtual machine. In a Kubernetes context, the applications are spread over a number of worker nodes, so one worker node might run multiple applications. Each node of each application is scheduled on a worker node that has sufficient resources available. We needed to implement good placement policies so each node of each application has sufficient memory available. We also needed to make sure one application could not affect another application when it asks for more resources.
- There were also some challenges regarding load balancing.
  We needed to create a custom template for the nginx ingress-controller to make sure websockets work correctly and all health checks within the application report a healthy status. Additionally, we needed a different load balancer and URL for our Bitbucket SSH traffic compared to our web traffic to the Bitbucket UI.
- Our previous setup contained a lot of data, both on the filesystem and in the database. We needed to migrate all the data to an Amazon EFS volume and a new database in a new AWS account. It was challenging to find a sync process that was consistent and also didn’t take too long, because all applications were down during the migration to prevent data loss. In the end, we were able to meet these criteria and migrate successfully.

Monitoring our Atlassian stack

We use the following tools to monitor all resources within our setup:

- Datadog to monitor all components created within our stack and to centralize logging of all components. You can read more about monitoring your stack with Datadog in our blog post here.
- New Relic for APM of the Java process (Jira, Confluence, Bitbucket) within the container.

If our monitoring detects an anomaly, it creates an alert within Opsgenie. Opsgenie makes sure that this alert is sent to the team or the on-call person responsible for fixing the problem. If the on-call person does not acknowledge the alert in time, the alert is escalated to the team that’s responsible for that specific alert.

Conclusion

In short, we are very happy we migrated to this new stack. Combining the benefits of Kubernetes and the Atlassian Data Center versions of Jira, Confluence and Bitbucket feels like a big step in the right direction. The improvements in self-healing, deploying and monitoring benefit us every day, and maintenance has become a lot easier.

Interested in your own Atlassian stack? Do you also want to leverage the power of Kubernetes? You can find more information about how we can help you on our website! Our Atlassian hosting offering

kubernetes setup
Reading time 6 min
6 MAY 2025

At ACA, we live and breathe Kubernetes. We set up new projects with this popular container orchestration system by default, and we’re also migrating existing customers to Kubernetes. As a result, the number of Kubernetes clusters the ACA team manages is growing rapidly! We’ve had to change our setup multiple times to accommodate more customers, more clusters, more load, less maintenance and so on.

From an Amazon ECS to a Kubernetes setup

In 2016, we had a lot of projects that were running in Docker containers. At that point in time, our Docker containers were either running in Amazon ECS or on Amazon EC2 virtual machines running the Docker daemon. Unfortunately, this setup required a lot of maintenance. We needed a tool that would give us a reliable way to run these containers in production. We longed for an orchestrator that would provide us with high availability, automatic cleanup of old resources, automatic container scheduling and so much more.

→ Enter Kubernetes!

Kubernetes proved to be the perfect candidate for a container orchestration tool. It could reliably run containers in production and reduce the amount of maintenance required for our setup.

Creating a Kubernetes-minded approach

Agile as we are, we proposed the idea of a Kubernetes setup for one of our next projects. The customer saw the potential of our new approach and agreed to be part of the revolution. At the beginning of 2017, we created our very first Kubernetes cluster. At this stage, there were only two certainties: we wanted to run Kubernetes and it would run on AWS. Apart from that, there were still a lot of questions and challenges.

- How would we set up and manage our cluster?
- Can we run our existing Docker containers within the cluster?
- What type of access and information can we provide to the development teams?

We’ve learned that in the end, the hardest task was not the cluster setup.
Instead, creating a new mindset within ACA Group to accept this new approach, and involving the development teams in our next-gen Kubernetes setup, proved to be the harder task at hand.

Apart from getting to know the product ourselves and getting other teams involved as well, we also had some other tasks that required our attention:

- we needed to dockerize every application,
- we needed to be able to set up applications in the Kubernetes cluster that were highly available and, if possible, also self-healing,
- and clustered applications needed to be able to share their state using the available methods within the selected container network interface.

Getting used to this new way of doing things in combination with other tasks, like setting up good monitoring, having a centralized logging setup and deploying our applications in a consistent and maintainable way, proved to be quite challenging. Luckily, we were able to conquer these challenges, and about half a year after we’d created our first Kubernetes cluster, our first production cluster went live (August 2017).

These were the core components of our toolset in 2017:

- Terraform to deploy the AWS VPC, networking components and other dependencies for the Kubernetes cluster
- Kops for cluster creation and management
- An EFK stack for logging, deployed within the Kubernetes cluster
- Heapster, InfluxDB and Grafana in combination with Librato for monitoring within the cluster
- Opsgenie for alerting

Nice! … but we can do better: reducing costs, components and downtime

Once we had completed our first setup, it became easier to use the same topology, and we continued implementing this setup for other customers. Through our infrastructure-as-code approach (Terraform) in combination with a Kubernetes cluster management tool (Kops), the effort to create new clusters was relatively low. However, after a while, we started to notice some possible risks related to this setup.
The amount of work required for the setup, and the impact of updates or upgrades on our Kubernetes stack, was too large. At the same time, the number of customers that wanted their very own Kubernetes cluster was growing. So, we needed to make some changes to reduce the maintenance effort on the Kubernetes part of this setup and keep things manageable for ourselves.

Migration to Amazon EKS and Datadog

At this point, the Kubernetes service from AWS (Amazon EKS) became generally available. We were able to move everything that was managed by Kops to our Terraform code, making things a lot less complex. As an extra benefit, the Kubernetes master nodes are now managed by EKS. This means we now have fewer nodes to manage, and EKS also provides us with cluster upgrades at the touch of a button.

Apart from reducing the workload on our Kubernetes management plane, we’ve also reduced the number of components within our cluster. In the previous setup we were using an EFK (Elasticsearch, Fluentd and Kibana) stack for our logging infrastructure. For our monitoring, we were using a combination of InfluxDB, Grafana, Heapster and Librato. These tools gave us a lot of flexibility but required a lot of maintenance effort, since they all ran within the cluster. We’ve replaced them all with the Datadog agent, reducing our maintenance workload drastically.

Upgrades in 60 minutes

Furthermore, because of the migration to Amazon EKS and the reduction in the number of components running within the Kubernetes cluster, we were able to reduce the cost and availability impact of our cluster upgrades. With the current stack, using Datadog and Amazon EKS, we can upgrade a Kubernetes cluster within an hour. With the previous stack, it would take us about 10 hours on average.

So where are we now?

We currently have 16 Kubernetes clusters up and running, all running the latest available EKS version. Right now, we want to spread our love for Kubernetes wherever we can.
Multiple project teams within ACA Group are now using Kubernetes, so we are organizing workshops to help them get up to speed with the technology quickly. At the same time, we also try to keep up with the latest additions to this rapidly changing platform. That’s why we attended the KubeCon conference in Barcelona and shared our opinions in our KubeCon Afterglow event.

What’s next?

Even though we are very happy with our current Kubernetes setup, we believe there’s always room for improvement. During our KubeCon Afterglow event, we had some interesting discussions with other Kubernetes enthusiasts. These discussions helped us define our next steps, bringing our Kubernetes setup to an even higher level. Some things we’d like to improve in the near future:

- add a service mesh to our Kubernetes stack,
- 100% automatic worker node upgrades without application downtime.

Of course, these are just a few focus points. We’ll implement many new features and improvements whenever they are released!

What about you? Are you interested in your very own Kubernetes cluster? Which improvements do you plan on making to your stack or Kubernetes setup? Or do you have an unanswered Kubernetes question we might be able to help you with? Contact us at cloud@aca-it.be and we will help you out!

Reading time 8 min
23 APR 2023

In the fast-moving world of IT, ACA Group constantly takes the time to explore innovative solutions and tools to provide the best services to our customers. Recently, we shared our experience with Flux, a cloud-native continuous deployment tool that implements GitOps.

What is Harbor?

Harbor is a cloud-native tool designed to leverage the flexibility, scalability and resilience of the cloud. It is a containerized solution that provides advanced features such as vulnerability scanning and an artifact registry. Harbor can run on any Kubernetes-based solution on a public or private cloud, but also on your local Kubernetes cluster. It is a self-managed solution that needs to be deployed on your Kubernetes cluster. You can choose the components to deploy based on your needs. For some features, for example vulnerability scanning, you can choose between different tools.

The benefits of Harbor Registry

We are excited to use Harbor Registry for our projects in the coming months because of the following benefits:

- Harbor is easy to scale: all components can be set up with multiple replicas, preventing unexpected downtime and providing fail-over. This ensures that your container images are always available when you need them.
- Harbor is composable: it allows you to deploy only the workloads for the features you use.
- Harbor is not only a container registry: it can, for example, also be used as a ChartMuseum to store Helm charts.
- Harbor is multi-tenant: project-specific configuration is possible, with specific quotas and policies.

It also has the most complete set of features of any container registry we have ever worked with, such as:

- Connection with OpenID Connect.
- Audit logging for container pull and push actions.
- Policies for container images, like tag retention and tag immutability.
- Replication to other registries (for example ECR, ACR, ...).
- Webhooks that can be triggered when specific actions occur.
- A project can serve as a cache proxy for public images.
Additionally, Harbor has a well-documented API, and we use a Terraform provider to set up resources like projects and users within Harbor using Terraform code.

As you can read, Harbor has a lot to offer. ;-) In the next section, we take a deeper look at some of the features we haven’t covered yet but that are definitely worth mentioning.

Container Image Vulnerability Scanning

One of the most interesting features is the built-in vulnerability scanning, which automatically scans container images for vulnerabilities when they are pushed to the registry. A job is scheduled when a container image is pushed and, once the result is available, it is visible next to the container image. You can also get more details on the specific vulnerabilities that are found.

For us, having these insights into the quality of the containers is a huge improvement compared to the container registry we are currently using. With some additional configuration, we can make the vulnerability scanning experience even better. Some examples:

- Block pulling container images that have Critical CVE vulnerabilities.
- Schedule frequent scans of images already stored in the Harbor registry.
- Allow specific CVEs that can't be fixed at the moment.
- Set up a webhook to take action when a CVE is detected.

Harbor uses Trivy as its default vulnerability scanning tool, but it's easy to switch to another tool by installing it on your cluster and registering it in the Harbor interface. With these simple steps, you can take container security to the next level.

Container Image Signing

Harbor also offers a Container Image Signing feature that allows users to verify the trust of container images. When Notary or Cosign is used to sign container images, Harbor can validate their signatures, ensuring that the images come from your build tools and have not been tampered with by an unauthorized source.
A correctly signed image is marked with a green check mark in Harbor's interface. While this blog post won't cover how this feature works, you can find detailed documentation on the process via this link.

Robot Accounts

In addition to regular users, Harbor also allows for the creation of robot accounts. These system users are not associated with personal accounts and are often used by scripts and processes to authenticate with the Harbor registry. For instance, when building a container image, scripts may use a robot account to push the container image to the Harbor registry. To increase security, it's possible to set an expiration time for robot accounts. Moreover, the access rights can be restricted to a particular project, and even the level of permissions within that project can be customized. The audit log records all activities performed by a robot account, just as it does for regular users.

What does the Harbor registry setup look like?

We run Harbor registry on EKS, the Kubernetes service provided by AWS. Since we are running on AWS, we can use some of the AWS services to provide some of the dependencies.

We have 3 layers of configuration in this setup:

- AWS resources.
- Kubernetes resources.
- Resources within Harbor registry.

⬇️ In the next sections, we will take a deeper dive into these layers of configuration.

1. Setting up the AWS resources

Within the ACA Group, we try to manage all our infrastructure as code. We use Terraform to set up all the dependencies for our Harbor registry:

- Route 53 to provide DNS.
- RDS to provide a multi-AZ PostgreSQL database.
- ElastiCache to provide a highly available Redis cluster for session management.
- EFS to provide multi-zone shared storage for our containers.
- EKS to provide the Kubernetes cluster master layer.
- Node groups deployed over multiple availability zones that serve as compute capacity for our Kubernetes cluster.
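To give an idea of how these AWS dependencies could be wired into Harbor, here is a sketch of a values file for the Harbor Helm chart that is used in the next step. The key names follow the public `harbor/harbor` chart as we understand it, so please verify them against the chart version you deploy. Every hostname, credential, size and the `efs-sc` storage class below is a placeholder invented for this example.

```yaml
# values.yaml (hypothetical sketch; hostnames, credentials and class names are placeholders)
externalURL: https://registry.example.be
expose:
  type: ingress
  ingress:
    hosts:
      core: registry.example.be        # DNS record managed in Route 53
database:
  type: external                        # use RDS instead of the bundled database
  external:
    host: harbor-db.example.internal
    port: "5432"
    username: harbor
    password: change-me                 # placeholder, keep real secrets out of Git
    coreDatabase: registry
    sslmode: require
redis:
  type: external                        # use ElastiCache instead of the bundled Redis
  external:
    addr: harbor-redis.example.internal:6379
persistence:
  persistentVolumeClaim:
    registry:
      storageClass: efs-sc              # assumed EFS-backed storage class
      accessMode: ReadWriteMany         # shared by all registry replicas
      size: 100Gi
# Multiple replicas per component, spread over the availability zones
portal:
  replicas: 2
core:
  replicas: 2
jobservice:
  replicas: 2
registry:
  replicas: 2
trivy:
  enabled: true
  replicas: 2
```

Keeping the stateful parts (database, Redis, storage) outside the cluster is what lets the Harbor pods themselves be scaled and replaced freely.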
Once these resources are available, we can generate the Kubernetes configuration and deploy it to our Kubernetes cluster.

2. Setting up the Kubernetes resources

We use a Helm chart to generate the Kubernetes configuration files that are required to set up Harbor. These are YAML files that are stored in Git repositories. Ultimately, Flux deploys these YAML files to the Kubernetes cluster.

ℹ️ You can read about deployments with Flux in another blog post here.

As a result, the following workloads are created on your Kubernetes cluster:

- Harbor Core.
- Harbor Portal, serves the UI.
- Harbor Registry, manages the container registry.
- Harbor Jobservice, schedules background jobs.
- Harbor Trivy, CVE / vulnerability scanning.
- Notary resources, image signing.

Additionally, various Kubernetes objects are generated, including the Ingress, which exposes the user interface (UI) on your specified URL.

If you want more details, you can install Harbor directly on your local Kubernetes cluster by running the Helm install command:

    # Register the Harbor chart repository first, if you haven't already
    helm repo add harbor https://helm.goharbor.io
    helm install my-release harbor/harbor

3. Setting up resources within Harbor

Now that we have the Harbor registry up and running, we can create resources such as projects, robot accounts and retention policies within it. Once again, we want to manage these resources in code instead of creating them via the UI. This not only helps us track changes effectively, but the pull request mechanism also helps prevent misconfigurations.

With its comprehensive and well-documented API, Harbor allows many tools to build integrations on top of it. Leveraging our expertise in Terraform, we prefer the Terraform Harbor provider to manage the resources within the Harbor registry.
The following example will create a project within Harbor:

    resource "harbor_project" "myproject" {
      name                   = "myproject"
      public                 = false
      vulnerability_scanning = true
      enable_content_trust   = true
      deployment_security    = ""
    }

Using the Harbor registry

After deploying the Harbor registry and creating projects, it becomes a functional container registry that operates similarly to any other container registry. To push container images to the Harbor registry, conventional build jobs can be used. However, the build job requires authentication credentials, usually from a robot account. Following that, you have to update the configuration of your Jenkins, Tekton, Bitbucket Pipelines, GitHub Actions, or similar job to specify the correct project and the Harbor URL, such as registry.example.be/myproject. The push command can also be found in the Harbor interface.

After pushing the container image to the Harbor registry, it can be pulled from other locations. To pull an image to your local machine using Docker, log in to the registry first and then pull the image:

    docker login registry.example.be
    docker pull registry.example.be/myproject/image:version

To utilize the container image in a Kubernetes environment, start by creating a Secret of the "docker-registry" type that includes the credentials needed to pull the container image. As a Secret is specific to a namespace, you need to run this command for each namespace that uses a container from the Harbor registry.

    kubectl -n NAMESPACE create secret docker-registry registry.example.be \
      --docker-server=registry.example.be \
      --docker-username='firstname.lastname' \
      --docker-password='mysupersecurepassword' \
      --docker-email=me@company.be

Now you can point to the container image within your Deployment, StatefulSet, Job, … The imagePullSecrets section points to the Secret created in the step above.
    image: registry.example.be/myproject/image:version
    …
    imagePullSecrets:
    - name: registry.example.be

Conclusion

This blog post provided an overview of the numerous advantages and features of the Harbor registry. We also shared our approach to setting up and utilizing the container registry. At ACA, we use Harbor as the container registry for one of our most important projects and are currently in the process of adopting it as our default registry for new projects. Once it has been set up, we will create a plan to migrate additional active projects. Our goal is to enhance stability, availability and security for our clients. If you would like to learn more about Harbor registry, feel free to contact us!

Reading time 6 min
8 MAR 2023

IT never stands still, which is why ACA Group is constantly investigating innovative solutions and tools. One of those tools is Flux. In this blogpost, our experts share their experience and findings. First, what is Flux? Flux is cloud-native tooling, designed to leverage the flexibility, scalability and resilience of the cloud. It can run on any Kubernetes-based solution on public/private cloud as well as on a local Kubernetes cluster. Flux is a containerized tool that serves only one purpose: implementing Continuous Delivery. To achieve this, it keeps the Kubernetes cluster on which it is deployed in sync with a config source. A typical source would be a GIT repository. The config source is monitored by Flux (pull approach); and if a change is detected, the change will be deployed to the cluster. The change is deployed via a reconciliation method. This means that Flux will not destroy and create all resources, but only make the changes needed to match the state described in the GIT repository. For example, you have a Deployment.yaml and Service.yaml, but only want to change the first one to use a new version of a container image. Then only the Deployment.yaml will be replaced within the cluster; without changing the service. In summary, the process of syncing the content in GIT is called GitOps. What is GitOps? In a GitOps approach, the state of the cluster is fully described in GIT repositories; it contains everything required to deploy an application (Deployment.yaml, Service.yaml, ConfigMap.yaml,…). To match the cluster with the state, we need an automated solution. In this case, Flux will execute a reconciliation when something has changed in GIT. Using GitOps is considered a more developer-centric experience. It is based on tools and principles that developers already know, so additional knowledge is no longer needed. 
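To make the reconciliation idea more concrete, here is a minimal Python sketch. It is not how Flux is implemented; it only illustrates the principle that the desired state (GIT) is compared with the actual state (cluster) and only the differences are applied. The function name `plan_reconciliation` and the two example dictionaries are hypothetical, and removing resources that no longer exist in GIT is deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical model: resource name -> rendered manifest content.
Manifest = Dict[str, str]


def plan_reconciliation(desired: Manifest, actual: Manifest) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps needed to make `actual` match `desired`.

    Resources that are already identical produce no step at all.
    """
    steps: List[Tuple[str, str]] = []
    for name, content in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != content:
            steps.append(("update", name))
    return steps


# Desired state as described in GIT: only the Deployment has a new image version.
git_state = {
    "Deployment/flux-app": "image: app:2.0",
    "Service/flux-app": "port: 80",
}
# Actual state currently running in the cluster.
cluster_state = {
    "Deployment/flux-app": "image: app:1.0",
    "Service/flux-app": "port: 80",
}

print(plan_reconciliation(git_state, cluster_state))
# → [('update', 'Deployment/flux-app')]
```

The Service is untouched because its content is identical in both states, which mirrors the Deployment.yaml / Service.yaml example above.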
The benefits of GitOps

Using the GitOps approach implemented by Flux has a lot of benefits:

- the full Kubernetes cluster status is visible in GIT, making it more understandable for developers
- development teams can work more independently as no complex deploy pipelines are needed
- it is a lot easier to set up additional applications that can be deployed on the Kubernetes cluster
- the use of pull request flows facilitates the monitoring of changes in applications
- branching or versioning strategies make it easy to keep different environments (test, acceptance, production) in sync
- the approach can be uniform across all types of Kubernetes clusters (EKS, Rancher, OpenShift or a local setup)
- Flux is lightweight and can be easily installed on any Kubernetes cluster as the capacity is usually already in place
- Flux has good documentation and an active community

Since the Flux resources are running within our cluster, there is no need to use other deployment tooling (such as Jenkins or Bamboo). This also has some advantages:

- fewer security issues, as we use a pull approach and do not need to store credentials in external tools
- no unexpected downtime caused by external deployment tools
- less overhead, because there is no need to maintain any deployment tools

How to manage applications with Flux

Suppose we have a GIT repo called flux-app that contains the Deployment.yaml we want to deploy. How can we instruct Flux to create this Deployment in the Kubernetes cluster? Before we can deploy our applications, we first need to install Flux.
Flux also uses a GitOps approach to manage its own installation:

- create a GIT repository that will contain the resources for the Flux installation
- download the Flux CLI
- run the bootstrap command

    flux bootstrap git \
      --url=ssh://git@bitbucket.org/sample-repo/flux-installation.git \
      --private-key-file=/Users/yourname/.ssh/flux \
      --branch=main \
      --path=./clusters/rancher-desktop-local

The YAML files are stored in the aforementioned GIT repository and are applied to your cluster. The main resources created are visible in the image below.

The Flux installation has added 4 important components to our Kubernetes cluster:

- source-controller pod
- kustomize-controller pod
- GitRepository CRD (Custom Resource Definition)
- Kustomization CRD (Custom Resource Definition)

⚠️ What are Custom Resource Definitions?

Kubernetes provides a specific set of API resources by default. These are the well-known resources such as Pods, ConfigMaps, Deployments and Secrets. A Custom Resource (CR) is an extension to the Kubernetes API. It provides a way to add new resource types in addition to these existing resources. However, as with the known resources, we need a specification of how to create such a resource. The CustomResourceDefinition (CRD) is basically a blueprint of what the Custom Resource (CR) should look like. Moreover, we need logic that determines what should happen when such a resource is created. This logic usually lives in the application container, which watches Custom Resources of that kind for changes and takes the required actions when they occur. More information about this topic can be found in the official Kubernetes documentation.

In the case of Flux, the source-controller pod has the logic to take action when a GitRepository CR is created or modified. Similarly, the kustomize-controller pod has the logic to take action when a Kustomization CR is created or modified.
Flux provides the CustomResourceDefinition (blueprint) and the logic (running in the containers) to do something with a Custom Resource. The Custom Resource is added to the Kubernetes cluster in the next steps.

Adding GitRepository Custom Resources

Now that we have the GitRepository CustomResourceDefinition and the source-controller, we can start adding GitRepository resources. This resource contains the configuration needed to connect to the GIT repository where your application's YAML files are stored.

    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: GitRepository
    metadata:
      name: flux-app
      namespace: flux-app
    spec:
      gitImplementation: go-git
      interval: 1m0s
      ref:
        branch: main
      secretRef:
        name: bitbucket-cloud-credentials
      timeout: 60s
      url: https://bitbucket.org/sample-repo/application

We create the GitRepository resource.

    kubectl create -f GitRepository.yaml

The resource then reports the following status:

    NAME       URL                                             AGE   READY   STATUS
    flux-app   https://bitbucket.org/sample-repo/application   21m   True    stored artifact for revision 'main/9ad65085cfe584f438f71e361c4ad20ac9d04f55'

Note that the revision points to branch/commit-id.

At this point, the GIT repository is being watched, but nothing is deployed to our cluster. To deploy the YAML files stored in the flux-app GIT repository, we need to create a Kustomization resource.

⚠️ You will have to add a secret for authentication to the Bitbucket repository. Since this can be done in multiple ways, this has not been added to this article. When the secret is created, you need to reference it in your GitRepository YAML file.

    secretRef:
      name: bitbucket-cloud-credentials

More information can be found here!

Adding Kustomization Custom Resources

The Kustomization resource configures which GitRepository resource to watch for changes. Once the GitRepository resource points to a new revision, the kustomize-controller will deploy the current version of the artifacts to the cluster.
    apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    kind: Kustomization
    metadata:
      name: flux-app
      namespace: flux-app
    spec:
      force: true
      interval: 1m0s
      path: ./
      prune: false
      sourceRef:
        kind: GitRepository
        name: flux-app

We create the Kustomization resource.

    kubectl create -f Kustomization.yaml

When we get the status of the created Kustomization resource, we see that it corresponds to the latest revision.

    kubectl -n flux-app get kustomization flux-app

    NAME       AGE   READY   STATUS
    flux-app   37m   True    Applied revision: main/9ad65085cfe584f438f71e361c4ad20ac9d04f55

When checking Bitbucket, we see there is a Deployment file in our GIT repository, and the commit matches the one mentioned by the Kustomization. This Deployment has been applied to our Kubernetes cluster.

    kubectl -n flux-app get pod

    NAME                        READY   STATUS    RESTARTS   AGE
    flux-app-7ddd9dd674-xp24s   1/1     Running   0          5m49s

Putting things together

The design above summarizes the flow:

- a developer creates a pull request
- the pull request is merged into a branch monitored by Flux
- the source-controller monitors the branch; it will notice a new commit id and change the revision of the GitRepository source to 'branch/commit-id'
- the kustomize-controller watches the GitRepository source
- if a new version is noticed, the new version of the files is applied and the revision of the Kustomization is changed to 'branch/commit-id'

Conclusion

In this blog post, we’ve explained how Flux works and how easy it is to use Flux for continuous delivery. Within ACA, we are currently migrating from complex Jenkins deploy pipelines to an easy-to-understand GitOps approach using Flux. We believe that this GitOps approach will become the standard way to deploy workloads in the near future. Want to know more about Flux? Contact us!

Reading time 6 min
16 JUN 2022

I started writing this blog post the day after I came home from KubeCon and CloudNativeCon 2022. The main thing I noticed was that the content of the talks has changed over the last few years.

Kubernetes’ new challenges

When looking at the topics of this year’s KubeCon / CloudNativeCon, it feels like a lot of questions about Kubernetes, types of cloud, logging tools and more are answered for most companies. This makes sense, because more and more organizations have already successfully adopted Kubernetes. Kubernetes is no longer considered the next big thing, but rather the logical choice. However, we’ve noticed (during the talks, but also in our own journey) that new problems and challenges have arisen, leading to other questions:

- How can I implement more automation?
- How can I control/lower the costs for these setups?
- Is there a way to expand on whatever exists and add my own functionalities to Kubernetes?

One of the possible ways to add functionalities to Kubernetes is using Operators. In this blog post, I will briefly explain how Operators work.

How Operators work

The concept of an operator is quite simple. I believe the easiest way to explain it is by actually installing an operator. Within ACA, we use the Istio operator. The exact installation steps depend on the operator you are installing, but usually they’re quite similar. First, install the istioctl binary on the machine that has access to the Kubernetes API. The next step is to run the command to install the operator.

    curl -sL https://istio.io/downloadIstioctl | sh -
    export PATH=$PATH:$HOME/.istioctl/bin
    istioctl operator init

This will create the operator resource(s) in the istio-operator namespace. You should see a pod running.
    kubectl get pods -n istio-operator
    NAMESPACE        NAME                              READY   STATUS    RESTARTS   AGE
    istio-operator   istio-operator-564d46ffb7-nrw2t   1/1     Running   0          20s

    kubectl get crd
    NAME                              CREATED AT
    istiooperators.install.istio.io   2022-05-21T19:19:43Z

As you can see, a new CustomResourceDefinition called istiooperators.install.istio.io is created. This is a blueprint that specifies how resources of this type should be defined when they are added to the cluster. To create config, we need to know what ‘kind’ of config the CRD expects to be created.

    kubectl get crd istiooperators.install.istio.io -oyaml
    …
    status:
      acceptedNames:
        kind: IstioOperator
    …

Let’s create a simple config file.

    kubectl apply -f - <<EOF
    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      namespace: istio-system
      name: istio-controlplane
    spec:
      profile: minimal
    EOF

Once the resource that contains the configuration is added to the cluster, the operator will make sure the resources in the cluster match whatever is defined in the configuration. You’ll see that new resources are created.

    kubectl get pods -A
    istio-system   istiod-7dc88f87f4-rsc42   0/1   Pending   0   2m27s

Since I run a small kind cluster, the istiod pod can’t be scheduled and is stuck in a Pending state. Let me explain the process first before changing this.

The istio-operator will keep watching the IstioOperator configuration for changes. If changes are made, it will only make the changes that are required to update the resources in the cluster to match the state specified in the configuration. This behavior is called reconciliation.

Let’s watch the IstioOperator configuration status. Note that it’s created in the istio-system namespace.

    kubectl get istiooperator -n istio-system
    NAME                 REVISION   STATUS        AGE
    istio-controlplane              RECONCILING   3m

As you can see, this is still reconciling, because the pod can’t start. After some time, it’ll go into an ERROR state.
    kubectl get istiooperator -n istio-system
    NAME                 REVISION   STATUS   AGE
    istio-controlplane              ERROR    6m58s

You can also check the istio-operator log for useful information.

    kubectl -n istio-operator logs istio-operator-564d46ffb7-nrw2t --tail 20
    - Processing resources for Istiod.
    - Processing resources for Istiod. Waiting for Deployment/istio-system/istiod
    ✘ Istiod encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition.

Since I’m running a small demo cluster, I’ll lower the memory request so the pod can be scheduled. This is done within the spec: part of the IstioOperator definition.

    kubectl -n istio-system edit istiooperator istio-controlplane

    spec:
      profile: minimal
      components:
        pilot:
          k8s:
            resources:
              requests:
                memory: 128Mi

The IstioOperator resource will go back to a RECONCILING state.

    kubectl get istiooperator -n istio-system
    NAME                 REVISION   STATUS        AGE
    istio-controlplane              RECONCILING   11m

And after some time, it becomes HEALTHY.

    kubectl get istiooperator -n istio-system
    NAME                 REVISION   STATUS    AGE
    istio-controlplane              HEALTHY   12m

You can see the istiod pod is running.

    NAMESPACE      NAME                      READY   STATUS
    istio-system   istiod-7dc88f87f4-n86z9   1/1     Running

Apart from the istiod deployment, a lot of new CRDs are added as well.
    authorizationpolicies.security.istio.io      2022-05-21T20:08:05Z
    destinationrules.networking.istio.io         2022-05-21T20:08:05Z
    envoyfilters.networking.istio.io             2022-05-21T20:08:05Z
    gateways.networking.istio.io                 2022-05-21T20:08:05Z
    istiooperators.install.istio.io              2022-05-21T20:07:01Z
    peerauthentications.security.istio.io        2022-05-21T20:08:05Z
    proxyconfigs.networking.istio.io             2022-05-21T20:08:05Z
    requestauthentications.security.istio.io     2022-05-21T20:08:05Z
    serviceentries.networking.istio.io           2022-05-21T20:08:05Z
    sidecars.networking.istio.io                 2022-05-21T20:08:05Z
    telemetries.telemetry.istio.io               2022-05-21T20:08:05Z
    virtualservices.networking.istio.io          2022-05-21T20:08:05Z
    wasmplugins.extensions.istio.io              2022-05-21T20:08:06Z
    workloadentries.networking.istio.io          2022-05-21T20:08:06Z
    workloadgroups.networking.istio.io           2022-05-21T20:08:06Z

How the operator works - summary

As you can see, this is a very easy way to quickly set up Istio within our cluster. In short, these are the steps:

- Install the operator.
- One (or more) CustomResourceDefinitions are added that provide a blueprint for the objects that can be created/managed.
- A Deployment is created, which in turn creates a pod that monitors the configurations of the kinds that are specified by the CRD.
- The user adds configuration to the cluster, with its type specified by the CRD.
- The operator pod notices the new configuration and takes all steps that are required to make sure the cluster is in the desired state specified by the configuration.

Benefits of the operator approach

The operator approach makes it easy to package a set of resources like Deployments, Jobs and CustomResourceDefinitions. This way, it’s easy to add additional behavior and capabilities to Kubernetes. A library of available operators can be found at https://operatorhub.io/, counting 255 operators at the moment of writing. The operators are usually installed with just a few commands or lines of code.
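The steps summarized above boil down to a control loop that keeps comparing a custom resource with what is running. The following Python sketch is a simplified illustration under stated assumptions, not the Istio operator's code: `CustomResource`, `Cluster` and `reconcile` are hypothetical names, the cluster is an in-memory dictionary, and the status values simply reuse the RECONCILING and HEALTHY states seen in the walkthrough.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CustomResource:
    """Hypothetical stand-in for a custom resource such as IstioOperator."""
    name: str
    spec: Dict[str, str]
    status: str = "RECONCILING"


@dataclass
class Cluster:
    """Hypothetical in-memory cluster: workload name -> spec that was applied."""
    workloads: Dict[str, Dict[str, str]] = field(default_factory=dict)


def reconcile(resource: CustomResource, cluster: Cluster) -> bool:
    """Bring the cluster in line with the resource spec.

    Returns True when something had to change and False when the cluster
    already matched, so running it repeatedly is harmless.
    """
    changed = cluster.workloads.get(resource.name) != resource.spec
    if changed:
        # Store a copy so later edits to the spec are detected as drift.
        cluster.workloads[resource.name] = dict(resource.spec)
    resource.status = "HEALTHY"
    return changed


cr = CustomResource("istio-controlplane", {"profile": "minimal"})
cluster = Cluster()

first = reconcile(cr, cluster)    # nothing deployed yet, so changes are applied
second = reconcile(cr, cluster)   # already in the desired state, nothing to do
cr.spec["memory"] = "128Mi"       # the user edits the configuration
third = reconcile(cr, cluster)    # the edit is picked up and applied

print(first, second, third, cr.status)
```

A real operator reacts to watch events from the Kubernetes API instead of being called by hand, and it reports an error state when the desired state cannot be reached, as happened with the Pending istiod pod above.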
It’s also possible to create your own operators. It might make sense to package a set of deployments, jobs, CRDs, … that provide a specific functionality as an operator. Such an operator can then be treated like any other piece of software and go through pipelines for CVE validations, E2E tests, rollout to test environments, and more before a new version is promoted to production.

Pitfalls

We have been using Kubernetes for a long time within the ACA Group and have collected some security best practices during this period. We’ve noticed that one-file deployments and Helm charts from the internet are usually not as well configured as we want them to be. Think about RBAC rules that give too many permissions, resources that are not properly namespaced or containers running as root. When using operators from operatorhub.io, you basically trust the vendor or provider to follow security best practices. However, one of the talks at KubeCon 2022 that made the biggest impression on me stated that a lot of operators have security issues. I suggest watching "Tweezering Kubernetes Resources: Operating on Operators" by Kevin Ward (ControlPlane) before installing one.

Another thing we’ve noticed is that using operators can speed up the process of implementing new tools and features. Be sure to read the documentation provided by the creator of an operator before you dive into advanced configuration. It is possible that not all features are actually exposed on the CRD that is created by the operator. However, it is bad practice to directly manipulate the resources that were created by the operator. The operator is not tested against your manual changes, and this might cause inconsistencies. Additionally, new operator versions might (partly) undo your changes, which might also cause problems. At that point, you’re basically stuck, unless you create your own operator that provides additional features.
We’ve also noticed that there is no real ‘rule book’ on how to provide CRDs and documentation is not always easy to find or understand. Conclusion Operators are currently a hot topic within the Kubernetes community. The number of available operators is growing fast, making it easy to add functionality to your cluster. However, there is no rule book or a minimal baseline of quality. When installing operators from the operatorhub, be sure to check the contents or validate the created resources on a local setup. We expect to see some changes and improvements in the near future, but at this point they can be very useful already. AUTHOR Bregt Coenen

Reading time 9 min
16 JAN 2019

Over the past few months, Kubernetes has become a more mature product and setting up a cluster has become a lot easier. Especially with the official release of Amazon Elastic Container Service for Kubernetes (EKS) on Amazon Web Services, another major cloud provider is able to provide a Kubernetes cluster with a few clicks. While the complexity of creating a Kubernetes cluster has decreased drastically, there are still some challenging tasks when setting up the resources within the cluster. The biggest challenge for us has always been providing reliable monitoring and logging for the components within the cluster. Since we’ve migrated to Datadog, things have changed for the better. In this blog post, we’ll teach you how to monitor your Kubernetes cluster with Datadog.

Setting up Datadog monitoring and logging

For this blog post, we’ll assume you have an active Kubernetes setup and kubectl configured. Our cloud services team prefers the following Kubernetes setup:

- Amazon Web Services (AWS) as the cloud provider
- Amazon Elastic Container Service for Kubernetes (EKS), which offers managed Kubernetes
- Terraform to automate the process of creating the required resources within the AWS account:
  - VPC and networking requirements
  - EKS cluster
  - Kubernetes worker nodes
- Datadog for monitoring and log collection
- OpsGenie for alert and incident management

Of course, you’re free to choose your own tools. One requirement, however, is that you must use Datadog (else this whole blog post won’t make a lot of sense). If you’re new to Datadog, you need to create a Datadog account. You can try it out for 14 days for free by clicking here and pressing the “Get started” button. Complete the form and log in to your newly created organization. Time to add some hosts!

Kubernetes DaemonSet for creating Datadog agents

A Kubernetes DaemonSet makes sure that a Docker container running the Datadog agent is created on every worker node (host) that has joined the Kubernetes cluster.
This way, you can monitor the resources for all active worker nodes within the cluster. The YAML file specifies the configuration for all Datadog components we want to enable:

- Datadog Process Agent
- Datadog Log Agent
- Datadog JMX

If you wonder what the file looks like, this is it:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: datadog-agent
      namespace: tools
      labels:
        k8s-app: datadog-agent
    spec:
      selector:
        matchLabels:
          name: datadog-agent
      template:
        metadata:
          labels:
            name: datadog-agent
        spec:
          #tolerations:
          #- key: node-role.kubernetes.io/master
          #  operator: Exists
          #  effect: NoSchedule
          serviceAccountName: datadog-agent
          containers:
          - image: datadog/agent:latest-jmx
            imagePullPolicy: Always
            name: datadog-agent
            ports:
            - containerPort: 8125
              # hostPort: 8125
              name: dogstatsdport
              protocol: UDP
            - containerPort: 8126
              # hostPort: 8126
              name: traceport
              protocol: TCP
            env:
            - name: DD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog
                  key: DATADOG_API_KEY
            - name: DD_COLLECT_KUBERNETES_EVENTS
              value: "true"
            - name: DD_LEADER_ELECTION
              value: "true"
            - name: KUBERNETES
              value: "yes"
            - name: DD_PROCESS_AGENT_ENABLED
              value: "true"
            - name: DD_LOGS_ENABLED
              value: "true"
            - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
              value: "true"
            - name: SD_BACKEND
              value: "docker"
            - name: SD_JMX_ENABLE
              value: "yes"
            - name: DD_KUBERNETES_KUBELET_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            resources:
              requests:
                memory: "400Mi"
                cpu: "200m"
              limits:
                memory: "400Mi"
                cpu: "200m"
            volumeMounts:
            - name: dockersocket
              mountPath: /var/run/docker.sock
            - name: procdir
              mountPath: /host/proc
              readOnly: true
            - name: sys-fs
              mountPath: /host/sys
              readOnly: true
            - name: root-fs
              mountPath: /rootfs
              readOnly: true
            - name: cgroups
              mountPath: /host/sys/fs/cgroup
              readOnly: true
            - name: pointerdir
              mountPath: /opt/datadog-agent/run
            - name: dd-agent-config
              mountPath: /conf.d
            - name: datadog-yaml
              mountPath: /etc/datadog-agent/datadog.yaml
              subPath: datadog.yaml
            livenessProbe:
              exec:
                command:
                - ./probe.sh
              initialDelaySeconds: 60
              periodSeconds: 5
              failureThreshold: 3
              successThreshold: 1
              timeoutSeconds: 3
          volumes:
          - hostPath:
              path: /var/run/docker.sock
            name: dockersocket
          - hostPath:
              path: /proc
            name: procdir
          - hostPath:
              path: /sys/fs/cgroup
            name: cgroups
          - hostPath:
              path: /opt/datadog-agent/run
            name: pointerdir
          - name: sys-fs
            hostPath:
              path: /sys
          - name: root-fs
            hostPath:
              path: /
          - name: datadog-yaml
            configMap:
              name: dd-agent-config
              items:
              - key: datadog-yaml
                path: datadog.yaml

As a whole the file looks a bit overwhelming, so let’s zoom in on some aspects.

    #tolerations:
    #- key: node-role.kubernetes.io/master
    #  operator: Exists
    #  effect: NoSchedule

Since we use EKS, the master plane is maintained by AWS, so we don’t need any Datadog agent pods to run on master nodes. Uncomment this if you want to monitor your master nodes, for example when you are running Kops.

    containers:
    - image: datadog/agent:latest-jmx
      imagePullPolicy: Always
      name: datadog-agent

We use the JMX-enabled version of the Datadog agent image, which is required for the Kafka and Zookeeper integrations. If you don’t need JMX, you should use datadog/agent:latest as this image is less resource-intensive. We specify “imagePullPolicy: Always” so we are sure that on startup, the image labelled “latest” is pulled again. Otherwise, when a new “latest” release is available, it won’t get pulled because an image tagged “latest” is already available on the node.

    env:
    - name: DD_API_KEY
      valueFrom:
        secretKeyRef:
          name: datadog
          key: DATADOG_API_KEY

We use SealedSecrets to store the Datadog API key in a Secret. The environment variable is set to the value of that Secret. If you don’t know how to get an API key from Datadog, you can do that here. Enter a useful name and press the “Create API” button.

    - name: DD_LOGS_ENABLED
      value: "true"

This ensures the Datadog logs agent is enabled.
    - name: SD_BACKEND
      value: "docker"
    - name: SD_JMX_ENABLE
      value: "yes"

This enables autodiscovery and JMX, which we need for our Zookeeper and Kafka integrations to work, as they use JMX to collect data. For more information on autodiscovery, you can read the Datadog docs here.

    resources:
      requests:
        memory: "400Mi"
        cpu: "200m"
      limits:
        memory: "400Mi"
        cpu: "200m"

After enabling JMX, the memory usage of the container increases drastically. If you are not using the JMX version of the image, half of these limits should be fine.

    - name: datadog-yaml
      mountPath: /etc/datadog-agent/datadog.yaml
      subPath: datadog.yaml
    …
    - name: datadog-yaml
      configMap:
        name: dd-agent-config
        items:
        - key: datadog-yaml
          path: datadog.yaml

To add some custom configuration, we need to override the default datadog.yaml configuration file. The ConfigMap has the following content:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: datadogtoken
      namespace: tools
    data:
      event.tokenKey: "0"
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dd-agent-config
      namespace: tools
    data:
      datadog-yaml: |-
        check_runners: 1
        listeners:
          - name: kubelet
        config_providers:
          - name: kubelet
            polling: true
        tags: tst, kubelet, kubernetes, worker, env:tst, environment:tst, application:kubernetes, location:aws

The first ConfigMap, called datadogtoken, is required to have a persistent state when a new leader is elected. The content of the dd-agent-config ConfigMap is used to create the datadog.yaml configuration file. We add some extra tags to the resources collected by the agent, which is useful to create filters later on.

    livenessProbe:
      exec:
        command:
        - ./probe.sh
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      successThreshold: 1
      timeoutSeconds: 3

In Kubernetes clusters with a lot of nodes, we’ve seen containers being stuck in a CrashLoopBackOff status.
It’s therefore a good idea to do a more advanced health check to see whether your containers have actually booted. Make sure the health checks start polling after 60 seconds (initialDelaySeconds: 60), which seems to be the best value.

Once you have gathered all required configuration in your ConfigMap and DaemonSet files, you can create the resources using your Kubernetes CLI.

    kubectl create -f ConfigMap.yaml
    kubectl create -f DaemonSet.yaml

After a few seconds, you should start seeing logs and metrics in the Datadog GUI.

Taking a look at the collected data

Datadog has a range of powerful monitoring features. The host map gives you a visualization of your nodes over the AWS availability zones. The colours in the map represent the relative CPU utilization for each node, green displaying a low level of CPU utilization and orange displaying a busier CPU. Each node is visible in the infrastructure list. Selecting one of the nodes reveals its details. You can monitor containers in the container view and see more details (e.g. graphs which visualize a trend) by selecting a specific container. Last but not least, processes can be monitored separately from the process list, with trends visible for every process. These fine-grained viewing levels make it easy to quickly pinpoint problems and generally lead to faster response times. All data is available to create beautiful dashboards and good monitors to alert on failures. The creation of these monitors can be scripted, making it fairly easy to set up additional accounts and setups. Easy to see why Datadog is indispensable in our solutions… 😉

Logging with Datadog Logs

Datadog Logs is a little bit less mature than the monitoring part, but it’s still one of our favourite logging solutions. It’s relatively cheap and the same agent can be used for both monitoring and logging. Monitors, which are used to trigger alerts, can be created from the log data, and log data can also be visualized in dashboards.
You can view the logs by navigating here and filter them by container, namespace or pod name. It’s also possible to filter your logs by label, which you can add to your Deployment, StatefulSet, …

Setting up additional Datadog integrations

As you’ve noticed, Datadog already provides a lot of data by default. However, extra metric collection and dashboards can easily be added through integrations. Datadog claims to have more than 200 integrations you can enable. Here’s a list of integrations we usually enable on our clusters:

- AWS
- Docker
- Kubernetes
- Kafka
- Zookeeper
- ElasticSearch
- OpsGenie

Installing integrations is usually a very straightforward process. Some of them can be enabled with one click, others require some extra configuration. Let’s take a deeper look at setting up some of the above integrations.

AWS Integration

Setup

This integration should be configured on both the Datadog and the AWS side. First, in AWS, we need to create an IAM Policy and an AssumeRolePolicy to allow access from Datadog to our AWS account.

AssumeRolePolicy:

    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::464622532012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "${var.Datadog_aws_external_id}"
        }
      }
    }

The content for the IAM Policy can be found here. Attach both policies to an IAM Role called DatadogAWSIntegrationRole.

Go to your Datadog account settings and press the “+Available” button under the AWS integration. Go to the configuration tab and replace the variable ${var.Datadog_aws_external_id} in the policy above with the value of AWS External ID. Add the AWS account number and, for the role, use DatadogAWSIntegrationRole as created above. Optionally, you can add tags, which will be added to all metrics gathered by this integration. On the left, limit the selection to the AWS services you use. Lastly, save the integration; your AWS integration (and the integrations for the enabled AWS services) will be shown under “Installed”.
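The statement above is only one element of a trust policy; IAM expects it wrapped in a policy document with a Version and a Statement list. A minimal sketch of generating that document in Python follows. The function name and the sample external ID are hypothetical, and the principal ARN is simply the one quoted in the statement above.

```python
import json

# Principal ARN as quoted in the AssumeRolePolicy statement above.
DATADOG_PRINCIPAL_ARN = "arn:aws:iam::464622532012:root"


def build_trust_policy(external_id: str) -> dict:
    """Wrap the AssumeRole statement in a full IAM policy document."""
    if not external_id:
        raise ValueError("external_id must not be empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": DATADOG_PRINCIPAL_ARN},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # "example-external-id" is a placeholder, not a real Datadog value.
    print(json.dumps(build_trust_policy("example-external-id"), indent=2))
```

The printed JSON could then be supplied as the assume-role policy document when creating DatadogAWSIntegrationRole with whatever infrastructure tooling you use.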
Integration in action

When you go to your dashboard list, you’ll see some interesting new dashboards with new metrics you can use to create monitors, such as:

- Database (RDS) memory usage, load, CPU, disk usage, connections
- Number of available VPN tunnels for a VPN connection
- Number of healthy hosts behind a load balancer
- ...

Docker Integration

Enabling the Docker integration is as easy as pressing the “+Available” button. A “Docker – Overview” dashboard is available as soon as you enable the integration.

Kubernetes Integration

Just like the Docker integration above, enabling the Kubernetes integration is as easy as pressing the “+Available” button, with a “Kubernetes – Overview” dashboard available as soon as you enable the integration. If you want all data for this integration, make sure kube-state-metrics is running within your Kubernetes cluster. More information here.

🚀 Takeaway

The goal of this article was to show you how Datadog can become one of the most indispensable tools in your monitoring and logging infrastructure. Setup is pretty easy, and a lot of information can be collected and visualized effectively.

If you create a good set of monitors so that Datadog alerts you in case of degradation or increased error rates, most incidents can be solved before they become actual problems. You can script the creation of these monitors using the Datadog API, drastically reducing the setup time of your monitoring and alerting framework.

Do you want more information, or could you use some help setting up your own EKS cluster with Datadog monitoring? Don’t hesitate to contact us!
