We learn & share

ACA Group Blog

Read more about our thoughts, views, and opinions on various topics, important announcements, useful insights, and advice from our experts.

Featured

8 MAY 2025
Reading time 5 min

In the ever-evolving landscape of data management, investing in platforms and navigating migrations between them is a recurring theme in many data strategies. How can we ensure that these investments remain relevant and can evolve over time, avoiding endless migration projects? The answer lies in embracing 'composability', a key principle for designing robust, future-proof data (mesh) platforms.

Is there a silver bullet we can buy off-the-shelf?

The data-solution market is flooded with vendor tools positioning themselves as the platform for everything, the all-in-one silver bullet. It's important to know that there is no silver bullet. While opting for a single off-the-shelf platform might seem like a quick and easy solution at first, it can lead to problems down the line. These monolithic off-the-shelf platforms often turn out to be too inflexible to support all use cases, not customizable enough, and eventually outdated. The result is big, complicated migration projects to the next silver-bullet platform, and organizations ending up with multiple all-in-one platforms, causing disruptions in day-to-day operations and hindering overall progress.

Flexibility is key to your data mesh platform architecture

A complete data platform must address numerous aspects: data storage, query engines, security, data access, discovery, observability, governance, developer experience, automation, a marketplace, data quality, and so on. Some vendors claim their all-in-one data solution can tackle all of these. Typically, however, such a platform excels in certain aspects but falls short in others. For example, a platform might offer a high-end query engine but lack depth in the data marketplace included in the solution. To future-proof your platform, it must incorporate the best tools for each aspect and evolve as new technologies emerge.
Today's cutting-edge solutions can be outdated tomorrow, so flexibility and evolvability are essential for your data mesh platform architecture.

Embrace composability: engineer your future

Rather than locking into one single tool, aim to build a platform with composability at its core. Picture a platform where different technologies and tools can be seamlessly integrated, replaced, or evolved, with an integrated and automated self-service experience on top. A platform that is both generic at its core and flexible enough to accommodate the ever-changing landscape of data solutions and requirements. A platform with a long-term return on investment by allowing you to expand capabilities incrementally, avoiding costly, large-scale migrations. Composability enables you to continually adapt your platform capabilities by adding new technologies under the umbrella of one stable core platform layer.

Two key ingredients of composability

Building blocks: the individual components that make up your platform.
Interoperability: all building blocks must work together seamlessly to create a cohesive system.

An ecosystem of building blocks

When building composable data platforms, the key lies in sourcing the right building blocks. But where do we get these? Traditional monolithic data platforms aim to solve all problems in one package, but this stifles the flexibility that composability demands. Instead, vendors should focus on decomposing these platforms into specialized, cost-effective components that excel at addressing specific challenges. By offering targeted solutions as building blocks, they empower organizations to assemble a data platform tailored to their unique needs. In addition to vendor solutions, open-source data technologies also offer a wealth of building blocks. It should be possible to combine both vendor-specific and open-source tools into a data platform tailored to your needs.
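The two ingredients can be made concrete in code. Below is a minimal sketch, assuming a hypothetical catalog interface (the names are illustrative, not from any vendor or published standard): each building block implements a small agreed-upon interface, and the stable core layer composes whichever implementation you plug in.

```python
from typing import Protocol


class DataCatalog(Protocol):
    """The agreed-upon interface: any catalog building block can implement it."""

    def register(self, product_name: str, metadata: dict) -> None: ...
    def search(self, term: str) -> list[str]: ...


class InMemoryCatalog:
    """One interchangeable building block; a vendor or open-source tool could be another."""

    def __init__(self) -> None:
        self._products: dict[str, dict] = {}

    def register(self, product_name: str, metadata: dict) -> None:
        self._products[product_name] = metadata

    def search(self, term: str) -> list[str]:
        return [name for name in self._products if term.lower() in name.lower()]


class PlatformCore:
    """Stable core layer: depends only on the interface, never on a concrete tool."""

    def __init__(self, catalog: DataCatalog) -> None:
        self.catalog = catalog


platform = PlatformCore(InMemoryCatalog())
platform.catalog.register("sales-orders", {"owner": "sales-team"})
print(platform.catalog.search("sales"))  # ['sales-orders']
```

Swapping `InMemoryCatalog` for another implementation changes nothing in the core layer, which is exactly the evolvability composability is after.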
This approach enhances agility, fosters innovation, and allows for continuous evolution by integrating the latest and most relevant technologies.

Standardization as glue between building blocks

To create a truly composable ecosystem, the building blocks must be able to work together; in other words, they must be interoperable. This is where standards come into play, enabling seamless integration between data platform building blocks. Standardization ensures that different tools can operate in harmony, offering a flexible, interoperable platform.

Imagine a standard for data access management that allows seamless integration across various components. It would enable an access management building block to list data products and grant access uniformly. Simultaneously, it would allow data storage and serving building blocks to integrate their data and permission models, ensuring that any access management solution can be effortlessly composed with them. This creates a flexible ecosystem where data access is consistently managed across different systems.

The discovery of data products in a catalog or marketplace can be greatly enhanced by adopting a standard specification for data products. With this standard, each data product can be made discoverable in a generic way. When data catalogs or marketplaces adopt this standard, it provides the flexibility to choose and integrate any catalog or marketplace building block into your platform, fostering a more adaptable and interoperable data ecosystem.

A data contract standard allows data products to specify their quality checks, SLOs, and SLAs in a generic format, enabling smooth integration of data quality tools with any data product. It enables you to combine the best solutions for ensuring data reliability across different platforms.

Widely accepted standards are key to ensuring interoperability through agreed-upon APIs, SPIs, contracts, and plugin mechanisms. In essence, standards act as the glue that binds a composable data ecosystem.
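To illustrate what such a data contract could look like, here is a hedged sketch: the field names and rules below are hypothetical, not taken from any published specification, but they show how a contract expressed as plain data lets any quality tool that understands the format check any data product generically.

```python
# Hypothetical data contract: schema, SLOs, and quality checks in a generic format.
contract = {
    "product": "customer-orders",
    "schema": {"order_id": "string", "amount": "float"},
    "slo": {"freshness_hours": 24},
    "quality_checks": [
        {"column": "order_id", "rule": "not_null"},
        {"column": "amount", "rule": "min", "value": 0},
    ],
}


def validate(rows: list[dict], contract: dict) -> list[str]:
    """Generic checker: works for any data product publishing a contract in this format."""
    violations = []
    for check in contract["quality_checks"]:
        col, rule = check["column"], check["rule"]
        for row in rows:
            if rule == "not_null" and row.get(col) is None:
                violations.append(f"{col}: null value")
            elif rule == "min" and row.get(col) is not None and row[col] < check["value"]:
                violations.append(f"{col}: {row[col]} below minimum {check['value']}")
    return violations


rows = [{"order_id": "A1", "amount": 10.0}, {"order_id": None, "amount": -5.0}]
print(validate(rows, contract))  # ['order_id: null value', 'amount: -5.0 below minimum 0']
```

Because the contract is just data, the checker and the data product can come from different vendors and still compose, which is the point of standardization as glue.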
A strong belief in evolutionary architectures

At ACA Group, we firmly believe in evolutionary architectures and platform engineering, principles that seamlessly extend to data mesh platforms. It's not about locking yourself into a rigid structure but about creating an ecosystem that can evolve, staying at the forefront of innovation. That's where composability comes in. Do you want a data platform that not only meets your current needs but also paves the way for the challenges and opportunities of tomorrow?

Let's engineer it together

Ready to learn more about composability in data mesh solutions? Contact us now!

Read more

All blog posts

Let's talk!

We'd love to talk to you!

Contact us and we'll get you connected with the expert you deserve!


How to build a highly available Atlassian stack on Kubernetes
Reading time 7 min
6 MAY 2025

Within ACA, there are multiple teams working on different (or the same!) projects. Every team has its own domains of expertise, such as developing custom software, marketing and communications, mobile development and more. The teams specialized in Atlassian products and cloud expertise combined their knowledge to create a highly available Atlassian stack on Kubernetes. Not only could we improve our internal processes this way, we could also offer this solution to our customers! In this blog post, we'll explain how our Atlassian and cloud teams built a highly available Atlassian stack on top of Kubernetes. We'll also discuss the benefits of this approach as well as the problems we've faced along the way. While we're damn close, we're not perfect after all 😉 Lastly, we'll talk about how we monitor this setup.

The setup of our Atlassian stack

Our Atlassian stack consists of the following products:

Amazon EKS
Amazon EFS
Atlassian Jira Data Center
Atlassian Confluence Data Center
Amazon EBS
Atlassian Bitbucket Data Center
Amazon RDS

As you can see, we use AWS as the cloud provider for our Kubernetes setup. We create all the resources with Terraform. We've written a separate blog post on what our Kubernetes setup exactly looks like. You can read it here! The image below should give you a general idea.

The next diagram should give you an idea of the setup of our Atlassian Data Center. While there are a few differences between the products and setups, the core remains the same. The application is launched as one or more pods described by a StatefulSet. The pods are called node-0 and node-1 in the diagram above. The first request is sent to the load balancer and forwarded to either the node-0 pod or the node-1 pod. Traffic is sticky, so all subsequent traffic from that user will be sent to the same node. Both node-0 and node-1 require persistent storage, which is used for plugin cache and indexes. A different Amazon EBS volume is mounted on each of the pods.
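The sticky routing with failover described above can be sketched in a few lines. This is a simplified illustration, not our actual load balancer configuration: the balancer pins each session to one node, and falls back to a healthy node when that node fails.

```python
import hashlib

NODES = ["node-0", "node-1"]


def route(session_id: str, healthy: set[str]) -> str:
    """Pin a session to one node (sticky); fail over to a healthy node if needed."""
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    preferred = NODES[digest % len(NODES)]
    if preferred in healthy:
        return preferred
    # Failover: the application stays available on the remaining healthy node(s).
    return next(n for n in NODES if n in healthy)


# Every request for the same session lands on the same node...
assert route("user-42", {"node-0", "node-1"}) == route("user-42", {"node-0", "node-1"})

# ...and when that node fails, traffic is rerouted to the other node.
stuck = route("user-42", {"node-0", "node-1"})
others = {n for n in NODES if n != stuck}
assert route("user-42", others) in others
```

A real ingress or load balancer typically implements stickiness with a session cookie rather than a hash, but the availability property is the same: losing one node never makes the application unreachable.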
Most of the data, like your Jira issues and Confluence spaces, is stored in a database. The database is shared: node-0 and node-1 both connect to the same database. We usually use PostgreSQL on Amazon RDS. The node-0 and node-1 pods also need to share large files which we don't want to store in a database, for example attachments. The same Amazon EFS volume is mounted on both pods. When changes are made, for example when an attachment is uploaded to an issue, the attachment is immediately available on both pods. We use CloudFront (CDN) to cache static assets and improve web response times.

The benefits of this setup

By using this setup, we can leverage the advantages of Docker and Kubernetes and the Data Center versions of the Atlassian tooling. There are a lot of benefits to this kind of setup, but we've listed the most important advantages below.

It's a self-healing platform: containers and worker nodes automatically replace themselves when a failure occurs. In most cases, we don't even have to do anything and the stack takes care of itself. Of course, it's still important to investigate any failures so you can prevent them from occurring in the future.

Zero-downtime deployments: when upgrading the first node within the cluster to a new version, we can still serve the old version to our customers on the second node. Once the upgrade is complete, the new version is served from the first node and we can upgrade the second node. This way, the application stays available, even during upgrades.

Deployments are predictable: we use the same Docker container for development, staging and production. That's why we are confident the container will be able to start in our production environment after a successful deploy to staging.

Highly available applications: when a failure occurs on one of the nodes, traffic can be routed to the other node. This way you have time to investigate the issue and fix the broken node while the application stays available.

It's possible to sync data from one node to the other. For example, syncing the index from one node to the other to fix a corrupt index can be done in just a few seconds, while a full reindex can take a lot longer.

You can implement a high level of security on all layers (AWS, Kubernetes, application, ...). AWS CloudTrail detects unauthorized access on AWS and sends an alert in case of an anomaly. AWS Config prevents AWS security group changes. You can find out more on how to secure your cloud with AWS Config in our blog post. Terraform makes sure changes to the AWS environment are approved by the team before rollout. Since upgrading Kubernetes master and worker nodes has little to no impact, the stack is always running a recent version with the latest security patches. We use a combination of namespacing and RBAC to make sure applications and deployments can only access resources within their namespace, with least privilege. NetworkPolicies are rolled out using Calico: we deny all traffic between containers by default and only allow specific traffic. We use recent versions of the Atlassian applications and implement Security Advisories whenever they are published by Atlassian.

Interested in leveraging the power of Kubernetes yourself? You can find more information about how we can help you on our website!
Problems we faced during the setup

Migrating to this stack wasn't all fun and games. We've definitely faced some difficulties and challenges along the way. By discussing them here, we hope we can facilitate your migration to a similar setup!

Some plugins (usually older ones) only worked on the standalone version of the Atlassian application. We needed to find an alternative plugin or use vendor support to get the same functionality on Atlassian Data Center.

We had to make some changes to our Docker containers and network policies (i.e. firewall rules) to make sure both nodes of an application could communicate with each other. Most of the applications have some extra tools within the container, for example Synchrony for Confluence, Elasticsearch for Bitbucket, eazyBI for Jira, and so on. These extra tools all needed to be refactored for a multi-node setup with shared data.

In our previous setup, each application was running on its own virtual machine. In a Kubernetes context, the applications are spread over a number of worker nodes, so one worker node might run multiple applications. Each node of each application is scheduled on a worker node that has sufficient resources available. We needed to implement good placement policies so each node of each application has sufficient memory available, and we needed to make sure one application cannot affect another when it asks for more resources.

There were also some challenges regarding load balancing.
We needed to create a custom template for the nginx ingress controller to make sure websockets work correctly and all health checks within the application report a healthy status. Additionally, we needed a different load balancer and URL for our Bitbucket SSH traffic compared to the web traffic to the Bitbucket UI.

Our previous setup contained a lot of data, both on the filesystem and in the database. We needed to migrate all the data to an Amazon EFS volume and a new database in a new AWS account. It was challenging to find a consistent sync process that also didn't take too long, because during the migration all applications were down to prevent data loss. In the end, we were able to meet these criteria and migrate successfully.

Monitoring our Atlassian stack

We use the following tools to monitor all resources within our setup:

Datadog to monitor all components created within our stack and to centralize logging of all components. You can read more about monitoring your stack with Datadog in our blog post.
New Relic for APM monitoring of the Java process (Jira, Confluence, Bitbucket) within the container.

If our monitoring detects an anomaly, it creates an alert within Opsgenie. Opsgenie makes sure that this alert is sent to the team or the on-call person responsible for fixing the problem. If the on-call person does not acknowledge the alert in time, the alert is escalated to the team that's responsible for that specific alert.

Conclusion

In short, we are very happy we migrated to this new stack. Combining the benefits of Kubernetes and the Atlassian Data Center versions of Jira, Confluence and Bitbucket feels like a big step in the right direction. The improvements in self-healing, deployment and monitoring benefit us every day, and maintenance has become a lot easier.

Interested in your own Atlassian stack?

Do you also want to leverage the power of Kubernetes? You can find more information about how we can help you on our website!

Read more
What does our Kubernetes setup at ACA look like?
Reading time 6 min
6 MAY 2025

At ACA, we live and breathe Kubernetes. We set up new projects with this popular container orchestration system by default, and we're also migrating existing customers to Kubernetes. As a result, the number of Kubernetes clusters the ACA team manages is growing rapidly! We've had to change our setup multiple times to accommodate more customers, more clusters, more load, less maintenance and so on.

From an Amazon ECS to a Kubernetes setup

In 2016, we had a lot of projects running in Docker containers. At that point in time, our Docker containers were either running in Amazon ECS or on Amazon EC2 virtual machines running the Docker daemon. Unfortunately, this setup required a lot of maintenance. We needed a tool that would give us a reliable way to run these containers in production. We longed for an orchestrator that would provide us with high availability, automatic cleanup of old resources, automatic container scheduling and much more. → Enter Kubernetes! Kubernetes proved to be the perfect candidate for a container orchestration tool. It could reliably run containers in production and reduce the amount of maintenance required for our setup.

Creating a Kubernetes-minded approach

Agile as we are, we proposed the idea of a Kubernetes setup for one of our next projects. The customer saw the potential of our new approach and agreed to be part of the revolution. At the beginning of 2017, we created our very first Kubernetes cluster. At this stage, there were only two certainties: we wanted to run Kubernetes, and it would run on AWS. Apart from that, there were still a lot of questions and challenges. How would we set up and manage our cluster? Could we run our existing Docker containers within the cluster? What kind of access and information could we provide the development teams? We've learned that in the end, the hardest task was not the cluster setup.
Instead, creating a new mindset within ACA Group to accept this new approach, and involving the development teams in our next-gen Kubernetes setup, proved to be the harder task at hand. Apart from getting to know the product ourselves and getting other teams involved, we also had some other tasks that required our attention:

we needed to dockerize every application,
we needed to be able to set up applications in the Kubernetes cluster that were highly available and, if possible, also self-healing,
and clustered applications needed to be able to share their state using the methods available within the selected container network interface.

Getting used to this new way of doing things, in combination with other tasks like setting up good monitoring, having a centralized logging setup and deploying our applications in a consistent and maintainable way, proved to be quite challenging. Luckily, we were able to conquer these challenges, and about half a year after we'd created our first Kubernetes cluster, our first production cluster went live (August 2017). These were the core components of our toolset anno 2017:

Terraform to deploy the AWS VPC, networking components and other dependencies for the Kubernetes cluster
Kops for cluster creation and management
an EFK stack for logging, deployed within the Kubernetes cluster
Heapster, InfluxDB and Grafana in combination with Librato for monitoring within the cluster
Opsgenie for alerting

Nice! ... but we can do better: reducing costs, components and downtime

Once we had completed our first setup, it became easier to use the same topology, and we continued implementing this setup for other customers. Through our infrastructure-as-code approach (Terraform) in combination with a Kubernetes cluster management tool (Kops), the effort to create new clusters was relatively low. However, after a while, we started to notice some possible risks related to this setup.
The amount of work required for the setup and the impact of updates or upgrades on our Kubernetes stack was too large. At the same time, the number of customers that wanted their very own Kubernetes cluster was growing. So, we needed to make some changes to reduce the maintenance effort on the Kubernetes part of this setup and keep things manageable for ourselves. Migration to Amazon EKS and Datadog At this point, the managed Kubernetes service from AWS (Amazon EKS) became generally available. We were able to move everything that was managed by Kops to our Terraform code, making things a lot less complex. As an extra benefit, the Kubernetes master nodes are now managed by EKS. This means we have fewer nodes to manage, and EKS also provides cluster upgrades at the touch of a button. Apart from reducing the workload on our Kubernetes management plane, we’ve also reduced the number of components within our cluster. In the previous setup, we were using an EFK (Elasticsearch, Fluentd and Kibana) stack for our logging infrastructure. For our monitoring, we were using a combination of InfluxDB, Grafana, Heapster and Librato. These tools gave us a lot of flexibility, but required a lot of maintenance effort, since they all ran within the cluster. We’ve replaced them all with the Datadog agent, reducing our maintenance workload drastically. Upgrades in 60 minutes Furthermore, thanks to the migration to Amazon EKS and the reduction in the number of components running within the Kubernetes cluster, we were able to reduce the cost and availability impact of our cluster upgrades. With the current stack, using Datadog and Amazon EKS, we can upgrade a Kubernetes cluster within an hour. With the previous stack, it would have taken us about 10 hours on average. So where are we now? We currently have 16 Kubernetes clusters up and running, all on the latest available EKS version. Right now, we want to spread our love for Kubernetes wherever we can.
Multiple project teams within ACA Group are now using Kubernetes, so we are organizing workshops to help them get up to speed with the technology quickly. At the same time, we try to keep up with the latest additions to this rapidly changing platform. That’s why we attended the KubeCon conference in Barcelona and shared our opinions at our KubeCon Afterglow event. What’s next? Even though we are very happy with our current Kubernetes setup, we believe there’s always room for improvement. During our KubeCon Afterglow event, we had some interesting discussions with other Kubernetes enthusiasts. These discussions helped us define our next steps, bringing our Kubernetes setup to an even higher level. Some things we’d like to improve in the near future: adding a service mesh to our Kubernetes stack, and 100% automatic worker node upgrades without application downtime. Of course, these are just a few focus points. We’ll implement many new features and improvements whenever they are released! What about you? Are you interested in your very own Kubernetes cluster? Which improvements do you plan on making to your stack or Kubernetes setup? Or do you have an unanswered Kubernetes question we might be able to help you with? Contact us at cloud@aca-it.be and we will help you out!

KubeCon / CloudNativeCon 2022 highlights!
Reading time 7 min
5 MAY 2025

Didn’t make it to KubeCon this year? Read along to find out the highlights of this year’s KubeCon / CloudNativeCon conference according to ACA Group’s Cloud Native team! What is KubeCon / CloudNativeCon? KubeCon (Kubernetes Conference) / CloudNativeCon, organized yearly in EMEA by the Cloud Native Computing Foundation (CNCF), is a flagship conference that gathers adopters and technologists from leading open source and cloud native communities in one location. This year, approximately 5,000 physical and 10,000 virtual attendees showed up for the conference. CNCF is the open source, vendor-neutral hub of cloud native computing, hosting projects like Kubernetes and Prometheus to make cloud native universal and sustainable. The conference brings 300+ sessions from partners, industry leaders, users and vendors on topics covering CI/CD, GitOps, Kubernetes, machine learning, observability, networking, performance, service mesh and security. It's clear there's always something interesting to hear about at KubeCon, no matter your area of interest or level of expertise! It's also clear that the cloud native ecosystem has grown into a mature, trend-setting, revolutionizing game-changer in the industry. It all started with the Kubernetes trend and the massive number of organizations that support it, use it and have grown their business by building cloud native products or using them in mission-critical solutions. 2022's major themes What struck us during this year’s KubeCon were the following major themes: The first was the increasing maturity and stabilization of Kubernetes and associated products for monitoring, CI/CD, GitOps, operators, costing and service meshes, plus bug fixing and small improvements. The second was a more elaborate focus on security: making pods more secure, preventing pod trampoline breakouts, end-to-end encryption and making a full analysis of threats for a complete k8s company infrastructure.
The third was sustainability, and a growing awareness that systems running k8s and the apps on them consume a lot of energy, while 60 to 80% of CPU capacity remains unused. Even programming languages can be energy (in)efficient: Java is among the most power efficient, while Python apparently is far less so due to the nature of its interpreter / compiler. Companies all need to plan and work on decreasing the energy footprint of both their applications and their infrastructure. Autoscaling will play an important role in achieving this. Session highlights Sustainability Data centers consume 8% of all generated electricity worldwide. So we'll need to reflect on the effective usage of our infrastructure: avoid idle time when servers are running (on average, CPU utilization is only between 20 and 40%) and make them run as many workloads as possible; shut down resources when they are not needed by applying autoscaling approaches; and consider the coding technology used in your software, since some programming languages use less CPU. CI/CD / GitOps GitOps automates infrastructure updates using a Git workflow with continuous integration (CI) and continuous delivery (CD). When new code is merged, the CI/CD pipeline enacts the change in the environment. Flux is a great example of this. Flux provides GitOps for both apps and infrastructure. It supports the GitRepository, HelmRepository, HelmChart and Bucket CRDs as the single source of truth. With A/B or canary deployments, it makes it easy to deploy new features without impacting all users. When a deployment fails, it can easily roll back. Check out the KubeCon schedule page for more information! Kubernetes Even though Kubernetes 1.24 was released a few weeks before the start of the event, not many talks were focused on the Kubernetes core. Most talks focused on extending Kubernetes (using APIs, controllers, operators, …) or on best practices around security, CI/CD, monitoring, … for whatever runs within the Kubernetes cluster.
If you're interested in the new features that Kubernetes 1.24 has to offer, you can check the official website. Observability Getting insights into how your application is running in your cluster is crucial, but not always practical. This is where eBPF comes into play, which is used by tools such as Pixie to collect data without any code changes. Check out the KubeCon schedule page for more information! FinOps Now that more and more people are using Kubernetes, a lot of workloads have been migrated. All these containers have a footprint. Memory, CPU, storage, … need to be allocated, and they all have a cost. Cost management was a recurring topic during the talks. Using autoscaling (adding but also removing capacity) to match the required resources and identifying unused resources are part of this new movement. New services like 'kubecost' are becoming increasingly popular. Performance One of the most common problems in a cluster is not having enough space or resources. With the help of a Vertical Pod Autoscaler (VPA), this can be a thing of the past. A VPA analyzes and stores memory and CPU metrics to automatically adjust CPU and memory requests and limits. The benefits of this approach: you save money, avoid waste, optimally size the underlying hardware, tune resources on worker nodes and optimize the placement of pods in a Kubernetes cluster. Check out the KubeCon schedule page for more information! Service mesh We all know it's extremely important to know which application is sharing data with other applications in your cluster. A service mesh provides traffic control inside your cluster(s). You can block or permit any request that is sent or received from any application to other applications. It also provides Metrics, Specs, Split, ... information to understand the data flow.
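The block-or-permit behavior just described can be sketched as a tiny allow-list check. This is purely illustrative: the service names and rule format are invented, and real meshes like Linkerd evaluate far richer, mTLS-backed policies.

```python
# Toy model of service mesh traffic policy: permit a request only when an
# explicit rule allows the source service to call the destination.
# Rule format and service names are made up for illustration.
ALLOW_RULES = {
    ("frontend", "orders"),
    ("orders", "payments"),
}

def is_allowed(source, destination):
    # Default-deny: anything without an explicit rule is blocked.
    return (source, destination) in ALLOW_RULES

print(is_allowed("frontend", "orders"))    # True
print(is_allowed("frontend", "payments"))  # False: no direct rule
```

The default-deny stance is the point: the mesh makes every service-to-service call an explicit, auditable decision.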
In the talk Service Mesh at Scale: How Xbox Cloud Gaming Secures 22k Pods with Linkerd, Chris explains why they chose Linkerd and what the benefits of a service mesh are. Check out the KubeCon schedule page for more information! Security Trampoline pods, sounds fun, right? During a talk by two security researchers from Palo Alto Networks, we learned that they aren’t all that fun. In short, these are pods that can be used to gain cluster admin privileges. To learn more about the concept and how to deal with them, we strongly recommend taking a look at the slides on the KubeCon schedule page! Lachlan Evenson from Microsoft gave a clear explanation of Pod Security in his The Hitchhiker's Guide to Pod Security talk. Pod Security is a built-in admission controller that evaluates Pod specifications against a predefined set of Pod Security Standards and determines whether to admit or deny the pod from running. — Lachlan Evenson, Principal Program Manager at Microsoft Pod Security is replacing PodSecurityPolicy starting from Kubernetes 1.23. So if you are using PodSecurityPolicy, now might be a good time to research Pod Security and the migration path. In version 1.25, support for PodSecurityPolicy will be removed. If you aren’t using PodSecurityPolicy or Pod Security yet, it is definitely time to investigate! Another recurring theme of KubeCon 2022 was operators. Operators enable the extension of the Kubernetes API with operational knowledge. This is achieved by combining Kubernetes controllers with watched objects that describe the desired state. They introduce Custom Resource Definitions, custom controllers, Kubernetes or cloud resources, and logging and metrics, making life easier for Dev as well as Ops. However, during a talk by Kevin Ward from ControlPlane, we learned that there are some risks.
Additionally, and more importantly, he also talked about how we can identify those risks with tools such as BadRobot and an operator threat matrix. Check out the KubeCon schedule page for more information! Scheduling Telemetry Aware Scheduling helps you schedule your workloads based on metrics from your worker nodes. You can, for example, set a rule to not schedule new workloads on worker nodes with more than 90% used memory. The cluster will take this into account when scheduling a pod. Another nice feature of this tool is that it can also reschedule pods to make sure your rules are kept in line. Check out the KubeCon schedule page for more information! Cluster autoscaling A great way to scale stateless workloads cost-effectively is to use AWS EC2 Spot, which is spare VM capacity available at a discount. To use Spot instances effectively in a K8S cluster, you should use aws-node-termination-handler. This way, you can move your workloads off of a worker node when Spot decides to reclaim it. Another good tool is Karpenter, which provisions Spot instances just in time for your cluster. With these two tools, you can cost-effectively host your stateless workloads! Check out the KubeCon schedule page for more information! Event-driven autoscaling Using the Horizontal Pod Autoscaler (HPA) is a great way to scale pods based on metrics such as CPU utilization, memory usage, and more. Instead of scaling based on metrics, Kubernetes Event-Driven Autoscaling (KEDA) can scale based on events (Apache Kafka, RabbitMQ, AWS SQS, …), and it can even scale to zero, unlike HPA. Check out the KubeCon schedule page for more information! Wrap-up We had a blast this year at the conference. We left feeling inspired, which we'll no doubt translate into internal projects, apply to new customer projects and discuss with existing customers where applicable. Not only that, but we'll brief our colleagues and organize an afterglow session for those interested back home in Belgium.
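The event-driven scaling idea can be sketched in a few lines. This is a conceptual model, not KEDA's actual algorithm: the `messages_per_pod` target and the replica cap are invented tuning values.

```python
import math

# Sketch of the KEDA idea: derive a replica count from queue depth instead of
# CPU or memory metrics. Unlike HPA, the count may drop to zero when the
# queue is empty.
def desired_replicas(queue_length, messages_per_pod=100, max_replicas=10):
    if queue_length == 0:
        return 0  # scale to zero: no events, no pods
    return min(max_replicas, math.ceil(queue_length / messages_per_pod))

print(desired_replicas(0))    # 0
print(desired_replicas(250))  # 3
```

The key contrast with HPA is the zero branch: metrics-based autoscalers always keep at least one replica alive to produce metrics, while an event source can be observed from outside the workload.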
If you appreciated our blog article, feel free to drop us a small message. We are always happy when the content we publish is of value or interest to you. If you think we can help you or your company in adopting cloud native, drop me a note at peter.jans@aca-it.be . As a final note, we'd like to thank Mona for the logistics, Stijn and Ronny for this opportunity, and the rest of the team who stayed behind to keep an eye on the systems of our valued customers.

Reading time 6 min
8 MAR 2023

IT never stands still, which is why ACA Group is constantly investigating innovative solutions and tools. One of those tools is Flux. In this blog post, our experts share their experience and findings. First, what is Flux? Flux is cloud-native tooling, designed to leverage the flexibility, scalability and resilience of the cloud. It can run on any Kubernetes-based solution, on public or private cloud as well as on a local Kubernetes cluster. Flux is a containerized tool that serves only one purpose: implementing continuous delivery. To achieve this, it keeps the Kubernetes cluster on which it is deployed in sync with a config source. A typical source would be a Git repository. The config source is monitored by Flux (a pull approach); if a change is detected, the change is deployed to the cluster via a reconciliation method. This means that Flux will not destroy and recreate all resources, but only make the changes needed to match the state described in the Git repository. For example, suppose you have a Deployment.yaml and a Service.yaml, but only want to change the first one to use a new version of a container image. Then only the Deployment.yaml will be replaced within the cluster, without touching the Service. This process of syncing a cluster with the content in Git is called GitOps. What is GitOps? In a GitOps approach, the state of the cluster is fully described in Git repositories; they contain everything required to deploy an application (Deployment.yaml, Service.yaml, ConfigMap.yaml, …). To match the cluster with that state, we need an automated solution. In this case, Flux executes a reconciliation when something has changed in Git. GitOps is considered a more developer-centric experience. It is based on tools and principles that developers already know, so little additional knowledge is needed.
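The reconciliation idea described above boils down to a diff between desired and live state. The sketch below is a simplified model: resource names and version hashes are invented, states are plain dicts, and (unlike a Flux Kustomization with pruning disabled) it also deletes resources missing from Git.

```python
# Sketch of Flux-style reconciliation: compare the desired state from Git with
# the live cluster state and compute only the actions needed to converge.
# States are modeled as {resource-name: manifest-version} dicts.
def reconcile(desired, live):
    actions = []
    for name, version in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != version:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))
    return actions

live = {"deployment.yaml": "v1", "service.yaml": "v1"}
desired = {"deployment.yaml": "v2", "service.yaml": "v1"}
print(reconcile(desired, live))  # only the Deployment changes
```

This mirrors the Deployment.yaml / Service.yaml example: only the resource whose desired version differs from the live one is touched.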
The benefits of GitOps The GitOps approach implemented by Flux has a lot of benefits: the full Kubernetes cluster state is visible in Git, making it more understandable for developers; development teams can work more independently, as no complex deploy pipelines are needed; it is a lot easier to set up additional applications to be deployed on the Kubernetes cluster; the use of pull request flows facilitates the monitoring of changes in applications; branching or versioning strategies make it easy to keep different environments (test, acceptance, production) in sync; the approach can be uniform across all types of Kubernetes clusters (EKS, Rancher, OpenShift or a local setup); Flux is lightweight and can be easily installed on any Kubernetes cluster, as the capacity is usually already in place; Flux has good documentation and an active community. Since the Flux resources run within our cluster, there is no need to use other deployment tooling (such as Jenkins or Bamboo). This also has some advantages: fewer security issues, as we use a pull approach and do not need to store credentials in external tools; no unexpected downtime caused by external deployment tools; less overhead, because there is no need to maintain any deployment tools. How to manage applications with Flux Suppose we have a Git repo called flux-app that contains the Deployment.yaml we want to deploy. How can we instruct Flux to create this Deployment in the Kubernetes cluster? Before we can deploy our applications, we first need to install Flux.
Flux also uses a GitOps approach to manage its own installation: create a Git repository that will contain the resources for the Flux installation, download the Flux CLI, and run the bootstrap command:

```shell
flux bootstrap git \
  --url=ssh://git@bitbucket.org/sample-repo/flux-installation.git \
  --private-key-file=/Users/yourname/.ssh/flux \
  --branch=main \
  --path=./clusters/rancher-desktop-local
```

The YAML files are stored in the aforementioned Git repository and are applied to your cluster. The main resources created are visible in the image below. The Flux installation has added 4 important components to our Kubernetes cluster: the source-controller pod, the kustomize-controller pod, the GitRepository CRD (Custom Resource Definition) and the Kustomization CRD (Custom Resource Definition). ⚠️ What are Custom Resource Definitions? Kubernetes provides a specific set of API resources by default. These are the well-known resources such as Pods, ConfigMaps, Deployments and Secrets. A CR, or Custom Resource, is an extension to the Kubernetes API. It provides a way to add new resource types in addition to these existing resources. However, as with the known resources, we need a specification of how to create such a resource. The CustomResourceDefinition (CRD) is basically a blueprint of what the CustomResource (CR) should look like. Moreover, we need logic that defines what should happen when such a resource is created. This logic is usually added to an application container, which monitors the CRD's resources for changes and takes the required actions when they occur. More information about this topic can be found in the official Kubernetes documentation. In the case of Flux, the source-controller pod has the logic to take action when a GitRepository CR is created or modified. Similarly, the kustomize-controller pod has the logic to take action when a Kustomization CR is created or modified.
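The split of responsibilities between these two controllers can be modelled roughly as follows. This is a conceptual sketch, not Flux's actual implementation; the class and method names are invented.

```python
# Simplified model: the source controller tracks the latest revision of a Git
# repository; the kustomize controller applies manifests whenever the revision
# it last applied differs from the source's current revision.
class GitRepositorySource:
    def __init__(self):
        self.revision = None

    def poll(self, remote_commit):
        # source-controller behavior: record the newest commit id
        self.revision = remote_commit

class KustomizationController:
    def __init__(self, source):
        self.source = source
        self.applied_revision = None

    def reconcile(self):
        if self.source.revision != self.applied_revision:
            self.applied_revision = self.source.revision
            return f"applied {self.applied_revision}"
        return "in sync"

src = GitRepositorySource()
kc = KustomizationController(src)
src.poll("main/9ad6508")
print(kc.reconcile())  # applied main/9ad6508
print(kc.reconcile())  # in sync
```

Decoupling "what is the latest revision?" from "is it applied yet?" is what lets each controller be retried and rate-limited independently.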
Flux provides the CustomResourceDefinition (blueprint) and the logic (running in the containers) to do something with a Custom Resource. The Custom Resources themselves are added to the Kubernetes cluster in the next steps. Adding GitRepository Custom Resources Now that we have the GitRepository CustomResourceDefinition and the source-controller, we can start adding GitRepository resources. This resource contains the configuration to connect to the Git repository where your application's YAML files are stored:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: flux-app
  namespace: flux-app
spec:
  gitImplementation: go-git
  interval: 1m0s
  ref:
    branch: main
  secretRef:
    name: bitbucket-cloud-credentials
  timeout: 60s
  url: https://bitbucket.org/sample-repo/application
```

We create the GitRepository resource:

```shell
kubectl create -f GitRepository.yaml
```

```
NAME       URL                                             AGE   READY   STATUS
flux-app   https://bitbucket.org/sample-repo/application   21m   True    stored artifact for revision 'main/9ad65085cfe584f438f71e361c4ad20ac9d04f55'
```

Note that the revision points to branch/commit-id. At this point, the Git repository is watched, but nothing is deployed to our cluster yet. To deploy the YAML files stored in the flux-app Git repository, we need to create a Kustomization resource. ⚠️ You will have to add a secret for authentication to the Bitbucket repository. Since this can be done in multiple ways, it is not covered in this article. When the secret is created, you need to reference it in your GitRepository YAML file via secretRef (name: bitbucket-cloud-credentials). More information can be found here! Adding Kustomization Custom Resources The Kustomization resource configures which GitRepository resource to watch for changes. Once the GitRepository resource points to a new revision, the kustomize-controller will deploy the current version of the artifacts to the cluster.
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: flux-app
  namespace: flux-app
spec:
  force: true
  interval: 1m0s
  path: ./
  prune: false
  sourceRef:
    kind: GitRepository
    name: flux-app
```

We create the Kustomization resource:

```shell
kubectl create -f Kustomization.yaml
```

When we get the status of the created Kustomization resource, we see that it corresponds to the latest revision:

```shell
kubectl -n flux-app get kustomization flux-app
```

```
NAME       AGE   READY   STATUS
flux-app   37m   True    Applied revision: main/9ad65085cfe584f438f71e361c4ad20ac9d04f55
```

When checking Bitbucket, we see there is a Deployment file in our Git repository, and the commit matches the one mentioned by the Kustomization. This Deployment has been applied to our Kubernetes cluster:

```shell
kubectl -n flux-app get pod
```

```
NAME                        READY   STATUS    RESTARTS   AGE
flux-app-7ddd9dd674-xp24s   1/1     Running   0          5m49s
```

Putting things together To summarize, this is the flow: a developer creates a pull request; the pull request is merged to a branch monitored by Flux; the source-controller monitors the branch, notices a new commit id and changes the revision of the GitRepository source to 'branch/commit-id'; the kustomize-controller watches the GitRepository source; if a new revision is noticed, the new version of the files is applied and the revision of the Kustomization is changed to 'branch/commit-id'. Conclusion In this blog post, we’ve explained how Flux works and how easy it is to use Flux for continuous delivery. Within ACA, we are currently migrating from complex Jenkins deploy pipelines to an easy-to-understand GitOps approach using Flux. We believe that this GitOps approach will become the standard way to deploy workloads in the near future. Want to know more about Flux? Contact us!

Reading time 6 min
16 JUN 2022

I started writing this blog post the day after I came home from KubeCon and CloudNativeCon 2022. The main thing I noticed was that the content of the talks has changed over the last few years. Kubernetes’ new challenges Looking at the topics of this year’s KubeCon / CloudNativeCon, it feels like a lot of questions about Kubernetes, types of cloud, logging tools and more have been answered for most companies. This makes sense, because more and more organizations have already successfully adopted Kubernetes. Kubernetes is no longer considered the next big thing, but rather the logical choice. However, we’ve noticed (during the talks, but also in our own journey) that new problems and challenges have arisen, leading to other questions: How can I implement more automation? How can I control or lower the costs of these setups? Is there a way to expand on what exists and add my own functionality to Kubernetes? One of the possible ways to add functionality to Kubernetes is by using operators. In this blog post, I will briefly explain how operators work. How operators work The concept of an operator is quite simple. I believe the easiest way to explain it is by actually installing one. Within ACA, we use the Istio operator. The exact installation steps depend on the operator you are installing, but they’re usually quite similar. First, install the istioctl binary on a machine that has access to the Kubernetes API. The next step is to run the command that installs the operator:

```shell
curl -sL https://istio.io/downloadIstioctl | sh -
export PATH=$PATH:$HOME/.istioctl/bin
istioctl operator init
```

This will create the operator resource(s) in the istio-operator namespace. You should see a pod running.
```shell
kubectl get pods -n istio-operator
```

```
NAMESPACE        NAME                              READY   STATUS    RESTARTS   AGE
istio-operator   istio-operator-564d46ffb7-nrw2t   1/1     Running   0          20s
```

```shell
kubectl get crd
```

```
NAME                              CREATED AT
istiooperators.install.istio.io   2022-05-21T19:19:43Z
```

As you can see, a new CustomResourceDefinition called istiooperators.install.istio.io has been created. This is a blueprint that specifies how resource definitions of this kind should be added to the cluster. To create configuration, we need to know what ‘kind’ of config the CRD expects:

```shell
kubectl get crd istiooperators.install.istio.io -oyaml
```

```
…
status:
  acceptedNames:
    kind: IstioOperator
…
```

Let’s create a simple config file:

```shell
kubectl apply -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  namespace: istio-system
  name: istio-controlplane
spec:
  profile: minimal
EOF
```

Once the resource that contains the configuration is added to the cluster, the operator will make sure the resources in the cluster match whatever is defined in the configuration. You’ll see that new resources are created:

```shell
kubectl get pods -A
```

```
istio-system   istiod-7dc88f87f4-rsc42   0/1   Pending   0   2m27s
```

Since I run a small kind cluster, the istiod pod can’t be scheduled and is stuck in a Pending state. Let me explain the process first before changing this. The istio-operator will keep watching the IstioOperator configuration for changes. If changes are made, it will only make the changes required to update the resources in the cluster to match the state specified in the configuration. This behavior is called reconciliation. Let’s watch the status of the IstioOperator configuration. Note that it’s created in the istio-system namespace:

```shell
kubectl get istiooperator -n istio-system
```

```
NAME                 REVISION   STATUS        AGE
istio-controlplane              RECONCILING   3m
```

As you can see, this is still reconciling, because the pod can’t start. After some time, it’ll go into an ERROR state.
```shell
kubectl get istiooperator -n istio-system
```

```
NAME                 REVISION   STATUS   AGE
istio-controlplane              ERROR    6m58s
```

You can also check the istio-operator log for useful information:

```shell
kubectl -n istio-operator logs istio-operator-564d46ffb7-nrw2t --tail 20
```

```
- Processing resources for Istiod.
- Processing resources for Istiod. Waiting for Deployment/istio-system/istiod
✘ Istiod encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
```

Since I’m running a small demo cluster, I’ll lower the memory request so the pod can be scheduled. This is done within the spec: part of the IstioOperator definition:

```shell
kubectl -n istio-system edit istiooperator istio-controlplane
```

```yaml
spec:
  profile: minimal
  components:
    pilot:
      k8s:
        resources:
          requests:
            memory: 128Mi
```

The IstioOperator resource will go back to a RECONCILING state:

```
NAME                 REVISION   STATUS        AGE
istio-controlplane              RECONCILING   11m
```

And after some time, it becomes HEALTHY:

```
NAME                 REVISION   STATUS    AGE
istio-controlplane              HEALTHY   12m
```

You can see the istiod pod is running:

```
NAMESPACE      NAME                      READY   STATUS
istio-system   istiod-7dc88f87f4-n86z9   1/1     Running
```

Apart from the istiod deployment, a lot of new CRDs are added as well.
```
authorizationpolicies.security.istio.io      2022-05-21T20:08:05Z
destinationrules.networking.istio.io         2022-05-21T20:08:05Z
envoyfilters.networking.istio.io             2022-05-21T20:08:05Z
gateways.networking.istio.io                 2022-05-21T20:08:05Z
istiooperators.install.istio.io              2022-05-21T20:07:01Z
peerauthentications.security.istio.io        2022-05-21T20:08:05Z
proxyconfigs.networking.istio.io             2022-05-21T20:08:05Z
requestauthentications.security.istio.io     2022-05-21T20:08:05Z
serviceentries.networking.istio.io           2022-05-21T20:08:05Z
sidecars.networking.istio.io                 2022-05-21T20:08:05Z
telemetries.telemetry.istio.io               2022-05-21T20:08:05Z
virtualservices.networking.istio.io          2022-05-21T20:08:05Z
wasmplugins.extensions.istio.io              2022-05-21T20:08:06Z
workloadentries.networking.istio.io          2022-05-21T20:08:06Z
workloadgroups.networking.istio.io           2022-05-21T20:08:06Z
```

How the operator works - summary As you can see, this is a very easy way to quickly set up Istio within our cluster. In short, these are the steps: Install the operator. One (or more) CustomResourceDefinitions are added that provide a blueprint for the objects that can be created and managed. A Deployment is created, which in turn creates a pod that monitors configurations of the kinds specified by the CRDs. The user adds configuration to the cluster, with its type specified by the CRD. The operator pod notices the new configuration and takes all steps required to bring the cluster to the desired state specified by that configuration. Benefits of the operator approach The operator approach makes it easy to package a set of resources like Deployments, Jobs and CustomResourceDefinitions. This way, it’s easy to add additional behavior and capabilities to Kubernetes. There’s a library listing the available operators at https://operatorhub.io/ , counting 255 operators at the moment of writing. Operators are usually installed with just a few commands or lines of code.
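The watch-and-reconcile behavior summarized above can be modelled as a tiny loop. This is a conceptual sketch with invented names and dict-based state, not how real operator frameworks are implemented.

```python
# Conceptual operator loop: for each custom resource of the CRD's kind, drive
# the cluster toward the state that resource describes.
def operator_loop(custom_resources, cluster_state):
    for name, desired in custom_resources.items():
        current = cluster_state.get(name)
        if current != desired:
            # stand-in for creating/updating Deployments, Services, CRDs, ...
            cluster_state[name] = desired
    return cluster_state

crs = {"istio-controlplane": {"profile": "minimal"}}
print(operator_loop(crs, {}))
```

Running the loop again on an already-converged cluster changes nothing, which is the idempotency that makes reconciliation safe to retry on every watch event.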
It’s also possible to create your own operators. It might make sense to package a set of deployments, jobs, CRDs, … that provide a specific functionality as an operator. The operator itself can then be handled like any other deliverable: use pipelines for CVE validation, E2E tests, rollout to test environments, and more before a new version is promoted to production.

Pitfalls

We have been using Kubernetes for a long time within the ACA Group and have collected some security best practices during this period. We’ve noticed that one-file deployments and Helm charts from the internet are usually not as well configured as we want them to be. Think about RBAC rules that grant too many permissions, resources not being namespaced, or containers running as root. When using operators from operatorhub.io, you basically trust the vendor or provider to follow security best practices. However, one of the talks at KubeCon 2022 that made the biggest impression on me stated that a lot of operators have issues regarding security. I would suggest watching "Tweezering Kubernetes Resources: Operating on Operators" by Kevin Ward, ControlPlane before installing. Another thing we’ve noticed is that operators can speed up the process of implementing new tools and features. Be sure to read the documentation provided by the creator of an operator before you dive into advanced configuration. Not all features of the underlying tool may actually be exposed through the CRD that is created by the operator. However, it is bad practice to directly manipulate the resources that were created by the operator: the operator is not tested against your manual changes, and this might cause inconsistencies. Additionally, new operator versions might (partly) undo your changes, which might also cause problems. At that point, you’re basically stuck, unless you create your own operator that provides the additional features.
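On the RBAC point above: a quick sanity check before installing a third-party operator is to scan its (Cluster)Role manifests for wildcard grants. This is a toy audit in plain Python over the parsed `rules:` list; real audits should use dedicated tooling:

```python
def risky_rules(rules):
    """Flag RBAC rules that grant wildcard verbs or resources.
    `rules` mirrors the `rules:` list of a (Cluster)Role manifest."""
    findings = []
    for rule in rules:
        if "*" in rule.get("verbs", []) or "*" in rule.get("resources", []):
            findings.append(rule)
    return findings

rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
    {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]},
]
flagged = risky_rules(rules)  # only the wildcard rule is flagged
```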
We’ve also noticed that there is no real ‘rule book’ on how to provide CRDs, and documentation is not always easy to find or understand.

Conclusion

Operators are currently a hot topic within the Kubernetes community. The number of available operators is growing fast, making it easy to add functionality to your cluster. However, there is no rule book or minimal baseline of quality. When installing operators from OperatorHub, be sure to check the contents or validate the created resources on a local setup. We expect to see changes and improvements in the near future, but operators can already be very useful today.

AUTHOR Bregt Coenen

Read more
Reading time 5 min
19 MAR 2021

The problem is that this is all rather ‘standard’. When you already have a fully customized Prometheus/Grafana setup in Rancher 1, as we do, it seems a waste to throw this out the window. The journey from a Rancher 1 ‘cattle’ Prometheus/Grafana to Rancher 2 K8s went very smoothly and was fairly easy. However, with Prometheus you historically had to edit the prometheus.yaml file every time you wanted to scrape a new application, unless you had already added your own custom discovery tool as a scrape target.

Fixing the incomplete data with auto-discovery

A problem that I faced with directly scraping a Longhorn and Spring Boot (or any other) Service in K8s is that only one of the many backend pods behind that Service is scraped. So you end up with incomplete data in Prometheus and hence incomplete dashboards in Grafana. In Prometheus, you can see that only one of three existing Longhorn endpoints is scraped. In Grafana, you can see that only one node is accounted for and the other two are reported as ‘Failed Nodes’. To make matters worse, only one of seven volumes is reported as ‘Total Number of Volumes’. This is where auto-discovery of Kubernetes endpoint services comes in as a true savior. Many web pages describe the various aspects of scraping, but I found none of them complete and others had critical errors. In this blog post, I’ll provide you with a minimal and simple configuration to bring your Prometheus configuration with auto-discovery of Kubernetes endpoint services up to speed.

1. Include configMap additions for Prometheus

Add this to the end of the prometheus.yaml in your Prometheus configMap. The job name is ‘kubernetes-service-endpoints’, as it seemed appropriate.

```yaml
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
#   to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
#   service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
  scrape_interval: 5s
  scrape_timeout: 2s
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: (.+)(?::\d+);(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
```

2. Configure the Services

As the comment block in the prometheus.yaml above explains, you can configure the following annotations. The annotation prometheus.io/scrape: "true" is mandatory if you want to scrape a particular service.
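Taken together, the annotations resolve to a scrape target roughly like this. The helper below is a plain-Python illustration of the relabeling defaults, not part of Prometheus; its name and the way the port is spliced into the address mirror the relabel rules, including the `http` scheme and `/metrics` path defaults:

```python
def scrape_url(annotations, service_ip):
    """Return the scrape URL derived from a Service's annotations,
    or None when the mandatory prometheus.io/scrape annotation is absent."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    scheme = annotations.get("prometheus.io/scheme", "http")
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port")
    host = f"{service_ip}:{port}" if port else service_ip
    return f"{scheme}://{host}{path}"

# The Spring Boot example from this post: non-standard path and port.
springboot = {
    "prometheus.io/scrape": "true",
    "prometheus.io/path": "/actuator/prometheus",
    "prometheus.io/port": "8080",
}
```

Without `prometheus.io/scrape: "true"`, the `keep` rule drops the target entirely, which is exactly what the `None` branch models.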
All the other annotations are optional and explained here:

- prometheus.io/scrape: only scrape services that have a value of `true`
- prometheus.io/scheme: if the metrics endpoint is secured, set this to `https` and most likely set the `tls_config` of the scrape config
- prometheus.io/path: if the metrics path is not `/metrics`, override this
- prometheus.io/port: if the metrics are exposed on a different port than the service, set this appropriately

Let’s look at an example for a Longhorn Service first. (Longhorn is a great replicated storage solution!)

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9500"
    prometheus.io/scrape: "true"
  labels:
    app: longhorn-manager
  name: longhorn-backend
  namespace: longhorn-system
spec:
  ports:
    - name: manager
      port: 9500
      protocol: TCP
      targetPort: manager
  selector:
    app: longhorn-manager
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: ClusterIP
```

Next, let’s look at an example for a Spring Boot application Service. Note the non-standard scrape path /actuator/prometheus.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: springbootapp
  namespace: spring
  labels:
    app: gateway
  annotations:
    prometheus.io/path: "/actuator/prometheus"
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
spec:
  ports:
    - name: management
      port: 8080
    - name: http
      port: 80
  selector:
    app: gateway
  sessionAffinity: None
  type: ClusterIP
```

3. Configure Prometheus roles

ClusterRole

First, change the namespace as needed. Note: possibly this ClusterRole needs to be a little tighter than it currently is.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus
  name: prometheus
  namespace: prometheus
rules:
  - apiGroups:
      - apiextensions.k8s.io
    resources:
      - customresourcedefinitions
    verbs:
      - create
  - apiGroups:
      - apiextensions.k8s.io
    resourceNames:
      - alertmanagers.monitoring.coreos.com
      - podmonitors.monitoring.coreos.com
      - prometheuses.monitoring.coreos.com
      - prometheusrules.monitoring.coreos.com
      - servicemonitors.monitoring.coreos.com
      - thanosrulers.monitoring.coreos.com
    resources:
      - customresourcedefinitions
    verbs:
      - get
      - update
  - apiGroups:
      - monitoring.coreos.com
    resources:
      - alertmanagers
      - alertmanagers/finalizers
      - prometheuses
      - prometheuses/finalizers
      - thanosrulers
      - thanosrulers/finalizers
      - servicemonitors
      - podmonitors
      - prometheusrules
    verbs:
      - '*'
  - apiGroups:
      - apps
    resources:
      - statefulsets
    verbs:
      - '*'
  - apiGroups:
      - ""
    resources:
      - configmaps
      - secrets
    verbs:
      - '*'
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - services
      - services/finalizers
      - endpoints
    verbs:
      - "*"
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - namespaces
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
```

ClusterRoleBinding

Again, change the namespace as needed.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus
  name: prometheus
  namespace: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: default
    namespace: prometheus
```

ServiceAccount

Once more, change the namespace as needed. DO NOT change the name unless you change the ClusterRoleBinding subjects.name as well.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: prometheus
```

Apply

First, apply the ServiceAccount, ClusterRoleBinding, ClusterRole and Services to your K8s cluster. After updating the Prometheus configMap, redeploy Prometheus to make sure the new configMap is activated and loaded.

Results in Prometheus

Go to the Prometheus GUI and navigate to Status - Targets. You’ll see that all the pod endpoints now ‘magically’ pop up under the kubernetes-service-endpoints heading. Any future prometheus.io annotation changes in K8s Services will come into effect immediately after applying them!

Grafana Longhorn dashboard

I used a generic Grafana Longhorn dashboard, which you can find here for yourself. Thanks to auto-discovery, the Grafana Longhorn dashboard now correctly shows three nodes and seven volumes.

Conclusion

After running through all the steps in this blog post, you basically never have to look at your Prometheus configuration again. With auto-discovery of Kubernetes endpoint services, adding and removing Prometheus scrapes for your applications has become almost as simple as unlocking your cell phone! I hope this blog post has helped you out! If you have any questions, reach out to me. Or, if you’d like professional advice and services, see how we can help you out with Kubernetes.

Reading time 6 min
25 SEP 2019

Training machine learning models can take up a lot of time if you don’t have the hardware to support it. For instance, with neural networks you need to calculate the contribution of every neuron to the total error during each training step. This can result in thousands of calculations per step. Complex models and large datasets make for a long training process, and evaluating such models at scale can slow down your application’s performance. Not to mention the hyperparameters you need to tune, restarting the process a few times over. In this blog post I want to talk about how you can tackle these issues by making maximum use of your resources. In particular, I want to talk about TensorFlow, a framework designed for parallel computing, and Kubernetes, a platform able to scale up or down with application usage.

TensorFlow

TensorFlow is an open-source library for building and training machine learning models. Originally a Google project, it has had many successes in the field of AI. It is available in multiple layers of abstraction, which allows you to quickly set up predefined machine learning models. TensorFlow was designed to run on distributed systems: the computations it requires can be run in parallel across multiple devices, using data flow graphs underneath. These represent a series of mathematical equations, with multidimensional arrays (tensors) at their edges. DeepMind used this power to create AlphaGo Zero, using 64 GPU workers and 19 CPU parameter servers to play 4.9 million games of Go against itself in just 3 days.

Kubernetes

Kubernetes is a Google project as well: an open-source platform for managing containerized applications at scale. With Kubernetes, you can easily add more instance nodes and get more out of your available hardware. You can compare Kubernetes to cash registers at the supermarket.
Whenever there’s a long queue of customers waiting, the store quickly opens up a new register to handle a few of those customers. In this analogy, a register is a virtual machine running a service and the customers are consumers of that service. The power of Kubernetes lies in its ease of use. You don’t need to add newly created instances to the load balancer; that’s done automatically. You don’t need to connect the new instance with file storage or networks; Kubernetes does it for you. And if an instance doesn’t behave like it should, Kubernetes kills it off and immediately spins up a new one.

Distributed training

As I mentioned before, you can reduce the time it takes to train a model by doing computations in parallel over different hardware units. Even with a limited configuration, you can reduce your training time to a minimum by distributing it over multiple devices. TensorFlow allows you to use CPUs, GPUs and even TPUs (Tensor Processing Units), chips designed specifically to run TensorFlow operations. You need to define a Strategy and make sure you create and compile your model within the scope of that strategy.

```python
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')
```

The MirroredStrategy above allows you to distribute training over multiple GPUs on the same machine. The model is replicated for every GPU and variable updates are executed for every replica. A more interesting variant of this strategy is the MultiWorkerMirroredStrategy. It gives you the opportunity to distribute the training over multiple machines (workers), each of which may use multiple GPUs. This is where Kubernetes can help fast-track your machine learning: you can create multiple service nodes with Kubernetes according to the need for parameter servers and workers.
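One practical detail of the mirrored strategies above: the batch you feed the training loop is the global batch, which gets split evenly across replicas, so the work per device shrinks as you add devices. This is a plain-Python illustration of that arithmetic, not TensorFlow API:

```python
def per_replica_batch(global_batch_size, num_replicas):
    """Each replica processes an equal slice of the global batch;
    gradients are then combined across replicas."""
    if global_batch_size % num_replicas:
        raise ValueError("global batch size should divide evenly over replicas")
    return global_batch_size // num_replicas

# With a global batch of 64, doubling the replicas halves each replica's share.
sizes = {n: per_replica_batch(64, n) for n in (1, 2, 4)}
```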
Parameter servers keep track of the model parameters; workers calculate the updates to those parameters. In general, you can reduce the bandwidth between the members of the server cluster by adding more parameter servers. To make the setup run, you need to set an environment variable TF_CONFIG, which defines the role of each node and the layout of the rest of the cluster.

```python
os.environ["TF_CONFIG"] = json.dumps({
    'cluster': {
        'worker': ['worker-0:5000', 'worker-1:5000'],
        'ps': ['ps-0:5000']
    },
    'task': {'type': 'ps', 'index': 0}
})
```

To make the setup easier, there’s a GitHub repository with a template for Kubernetes. Note that it doesn’t set TF_CONFIG itself, but passes its content as parameters to the script. These parameters are used to define which devices can be used in a distributed training.

```python
cluster = tf.train.ClusterSpec({
    "worker": ['worker-0:5000', 'worker-1:5000'],
    "ps": ['ps-0:5000']})
server = tf.train.Server(
    cluster, job_name='ps', task_index=0)
```

The ClusterSpec specifies the worker and parameter servers in the cluster; it has the same value on all nodes. The Server contains the definition of the task of the current node, hence a different value per node.

TensorFlow Serving

For distributed inference, TensorFlow contains a package for hosting machine learning models, called TensorFlow Serving. It has been designed to quickly set up and manage machine learning models, and all it needs is a SavedModel representation. SavedModel is a format for saving trained TensorFlow models in a way that they can easily be loaded and restored. A SavedModel can be serialized into a directory, making it portable and easy to share. You can quickly create a SavedModel by using the built-in function save.
```python
model_version = '1'
model_name = 'my_model'
model_path = os.path.join('/path/to/save/dir/', model_name, model_version)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss='mse', optimizer='sgd')
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

tf.saved_model.save(model, model_path)
```

You can use the SavedModel CLI to inspect SavedModel files. Once you have these files in place, TensorFlow Serving can turn them into a gRPC or RESTful interface. The Docker image tensorflow/serving provides the easiest path towards a running server. There are multiple versions of this image, including one for GPU usage. Besides choosing the right image, you only need to provide the path to the directory you just created and a name for your model.

```
$ docker run -t --rm -p 8500:8500 -p 8501:8501 \
    -v "/path/to/save/dir/my_model:/models/my_model" \
    -e MODEL_NAME=my_model \
    tensorflow/serving
```

With Kubernetes you can now create a deployment for this image and scale the number of replicas up or down automatically. Put a LoadBalancer Service in front of it, and your users will be redirected to the right node without anyone noticing. Because inference requires much less computation, you don’t have to distribute it amongst multiple nodes. Note that the save directory path also contains a “version” directory. This is a convention TensorFlow Serving uses to watch the directory for new versions of a SavedModel. When it detects a new one, it loads it automatically, ready to be served. With TensorFlow Serving and Kubernetes, you can handle any amount of load for your classification, regression or prediction models.

🚀 Takeaway

You can gain a lot of time by distributing the necessary computations for your machine learning project. By combining a highly scalable library like TensorFlow with a flexible platform like Kubernetes, you can make optimal use of your resources and your time.
Of course, you can speed up things even more if you have a knowledgeable Kubernetes team at your side, or somebody to help tune your machine learning models. If you’re ready to ramp up your machine learning, we can do exactly that! Interested or questions? Shoot me an email at stijn.vandenenden@acagroup.be

How to monitor your Kubernetes cluster with Datadog
Reading time 9 min
16 JAN 2019

Over the past few months, Kubernetes has become a more mature product and setting up a cluster has become a lot easier. Especially with the official release of Amazon Elastic Container Service for Kubernetes (EKS) on Amazon Web Services, another major cloud provider can now provide a Kubernetes cluster with a few clicks. While the complexity of creating a Kubernetes cluster has decreased drastically, there still are some challenging tasks when setting up the resources within the cluster. The biggest challenge for us has always been providing reliable monitoring and logging for the components within the cluster. Since we’ve migrated to Datadog, things have changed for the better. In this blog post, we’ll teach you how to monitor your Kubernetes cluster with Datadog.

Setting up Datadog monitoring and logging

For this blog post, we’ll assume you have an active Kubernetes setup and kubectl configured. Our cloud services team prefers the following Kubernetes setup:

- Amazon Web Services (AWS) as the cloud provider
- Amazon Elastic Container Service for Kubernetes (EKS), which offers managed Kubernetes
- Terraform to automate the process of creating the required resources within the AWS account (VPC and networking requirements, EKS cluster, Kubernetes worker nodes)
- Datadog for monitoring and log collection, and OpsGenie for alert and incident management

Of course, you’re free to choose your own tools. One requirement, however, is that you must use Datadog (else this whole blog post won’t make a lot of sense). If you’re new to Datadog, you need to create a Datadog account. You can try it out for 14 days for free by clicking here and pressing the “Get started” button. Complete the form and log in to your newly created organization. Time to add some hosts!

Kubernetes DaemonSet for creating Datadog agents

A Kubernetes DaemonSet makes sure that a Docker container running the Datadog agent is created on every worker node (host) that has joined the Kubernetes cluster.
This way, you can monitor the resources of all active worker nodes within the cluster. The YAML file specifies the configuration for all Datadog components we want to enable:

- Datadog Process Agent
- Datadog Log Agent
- Datadog JMX

If you wonder what the file looks like, this is it:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
  namespace: tools
  labels:
    k8s-app: datadog-agent
spec:
  selector:
    matchLabels:
      name: datadog-agent
  template:
    metadata:
      labels:
        name: datadog-agent
    spec:
      #tolerations:
      #- key: node-role.kubernetes.io/master
      #  operator: Exists
      #  effect: NoSchedule
      serviceAccountName: datadog-agent
      containers:
        - image: datadog/agent:latest-jmx
          imagePullPolicy: Always
          name: datadog-agent
          ports:
            - containerPort: 8125
              # hostPort: 8125
              name: dogstatsdport
              protocol: UDP
            - containerPort: 8126
              # hostPort: 8126
              name: traceport
              protocol: TCP
          env:
            - name: DD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog
                  key: DATADOG_API_KEY
            - name: DD_COLLECT_KUBERNETES_EVENTS
              value: "true"
            - name: DD_LEADER_ELECTION
              value: "true"
            - name: KUBERNETES
              value: "yes"
            - name: DD_PROCESS_AGENT_ENABLED
              value: "true"
            - name: DD_LOGS_ENABLED
              value: "true"
            - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
              value: "true"
            - name: SD_BACKEND
              value: "docker"
            - name: SD_JMX_ENABLE
              value: "yes"
            - name: DD_KUBERNETES_KUBELET_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          resources:
            requests:
              memory: "400Mi"
              cpu: "200m"
            limits:
              memory: "400Mi"
              cpu: "200m"
          volumeMounts:
            - name: dockersocket
              mountPath: /var/run/docker.sock
            - name: procdir
              mountPath: /host/proc
              readOnly: true
            - name: sys-fs
              mountPath: /host/sys
              readOnly: true
            - name: root-fs
              mountPath: /rootfs
              readOnly: true
            - name: cgroups
              mountPath: /host/sys/fs/cgroup
              readOnly: true
            - name: pointerdir
              mountPath: /opt/datadog-agent/run
            - name: dd-agent-config
              mountPath: /conf.d
            - name: datadog-yaml
              mountPath: /etc/datadog-agent/datadog.yaml
              subPath: datadog.yaml
          livenessProbe:
            exec:
              command:
                - ./probe.sh
            initialDelaySeconds: 60
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 1
            timeoutSeconds: 3
      volumes:
        - name: dockersocket
          hostPath:
            path: /var/run/docker.sock
        - name: procdir
          hostPath:
            path: /proc
        - name: cgroups
          hostPath:
            path: /sys/fs/cgroup
        - name: pointerdir
          hostPath:
            path: /opt/datadog-agent/run
        - name: sys-fs
          hostPath:
            path: /sys
        - name: root-fs
          hostPath:
            path: /
        - name: datadog-yaml
          configMap:
            name: dd-agent-config
            items:
              - key: datadog-yaml
                path: datadog.yaml
```

As a whole the file looks a bit overwhelming, so let’s zoom in on some aspects.

```yaml
#tolerations:
#- key: node-role.kubernetes.io/master
#  operator: Exists
#  effect: NoSchedule
```

Since we use EKS, the master plane is maintained by AWS. Therefore we don’t want any Datadog agent pods to run on the master nodes. Uncomment this if you do want to monitor your master nodes, for example when you are running Kops.

```yaml
containers:
  - image: datadog/agent:latest-jmx
    imagePullPolicy: Always
    name: datadog-agent
```

We use the JMX-enabled version of the Datadog agent image, which is required for the Kafka and Zookeeper integrations. If you don’t need JMX, you should use datadog/agent:latest, as this image is less resource-intensive. We specify imagePullPolicy: Always so we are sure that on startup, the image labelled “latest” is pulled again. Otherwise, when a new “latest” release is available, it won’t get pulled because an image tagged “latest” is already present on the node.

```yaml
env:
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog
        key: DATADOG_API_KEY
```

We use SealedSecrets to store the Datadog API key; this sets the environment variable to the value of the Secret. If you don’t know how to get an API key from Datadog, you can do that here. Enter a useful name and press the “Create API” button.

```yaml
- name: DD_LOGS_ENABLED
  value: "true"
```

This ensures the Datadog logs agent is enabled.
```yaml
- name: SD_BACKEND
  value: "docker"
- name: SD_JMX_ENABLE
  value: "yes"
```

This enables autodiscovery and JMX, which we need for our Zookeeper and Kafka integrations to work, as they use JMX to collect data. For more information on autodiscovery, you can read the Datadog docs here.

```yaml
resources:
  requests:
    memory: "400Mi"
    cpu: "200m"
  limits:
    memory: "400Mi"
    cpu: "200m"
```

After enabling JMX, the memory usage of the container increases drastically. If you are not using the JMX version of the image, half of these limits should be fine.

```yaml
- name: datadog-yaml
  mountPath: /etc/datadog-agent/datadog.yaml
  subPath: datadog.yaml
…
- name: datadog-yaml
  configMap:
    name: dd-agent-config
    items:
      - key: datadog-yaml
        path: datadog.yaml
```

To add some custom configuration, we need to override the default datadog.yaml configuration file. The ConfigMap has the following content:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadogtoken
  namespace: tools
data:
  event.tokenKey: "0"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dd-agent-config
  namespace: tools
data:
  datadog-yaml: |-
    check_runners: 1
    listeners:
      - name: kubelet
    config_providers:
      - name: kubelet
        polling: true
    tags: tst, kubelet, kubernetes, worker, env:tst, environment:tst, application:kubernetes, location:aws
```

The first ConfigMap, called datadogtoken, is required to keep a persistent state when a new leader is elected. The content of the dd-agent-config ConfigMap is used to create the datadog.yaml configuration file. We add some extra tags to the resources collected by the agent, which is useful for creating filters later on.

```yaml
livenessProbe:
  exec:
    command:
      - ./probe.sh
  initialDelaySeconds: 60
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
  timeoutSeconds: 3
```

When running a Kubernetes cluster with a lot of nodes, we’ve seen containers get stuck in a CrashLoopBackOff status.
It’s therefore a good idea to do a more advanced health check to see whether your containers have actually booted. Make sure the health checks only start polling after 60 seconds, which seems to be the best value. Once you have gathered all required configuration in your ConfigMap and DaemonSet files, you can create the resources using your Kubernetes CLI.

```
kubectl create -f ConfigMap.yaml
kubectl create -f DaemonSet.yaml
```

After a few seconds, you should start seeing logs and metrics in the Datadog GUI.

Taking a look at the collected data

Datadog has a range of powerful monitoring features. The host map gives you a visualization of your nodes over the AWS availability zones. The colours in the map represent the relative CPU utilization of each node: green for low CPU utilization, orange for a busier CPU. Each node is visible in the infrastructure list, and selecting one of the nodes reveals its details. You can monitor containers in the container view and see more details (e.g. graphs which visualize a trend) by selecting a specific container. Last but not least, processes can be monitored separately from the process list, with trends visible for every process. These fine-grained viewing levels make it easy to quickly pinpoint problems and generally lead to faster response times. All data is available to create beautiful dashboards and good monitors to alert on failures. The creation of these monitors can be scripted, making it fairly easy to set up additional accounts and setups. Easy to see why Datadog is indispensable in our solutions… 😉

Logging with Datadog Logs

Datadog Logs is a little less mature than the monitoring part, but it’s still one of our favourite logging solutions. It’s relatively cheap and the same agent can be used for both monitoring and logging. Monitors, which are used to trigger alerts, can be created from the log data, and log data can also be visualized in dashboards.
You can see the logs by navigating here and filtering them by container, namespace or pod name. It’s also possible to filter your logs by label, which you can add to your Deployment, StatefulSet, …

Setting up additional Datadog integrations

As you’ve noticed, Datadog already provides a lot of data by default. However, extra metric collection and dashboards can easily be added through integrations. Datadog claims to have more than 200 integrations you can enable. Here’s a list of integrations we usually enable on our clusters:

- AWS
- Docker
- Kubernetes
- Kafka
- Zookeeper
- ElasticSearch
- OpsGenie

Installing integrations is usually a very straightforward process. Some of them can be enabled with one click, others require some extra configuration. Let’s take a deeper look at setting up some of the above integrations.

AWS Integration Setup

This integration should be configured on both the Datadog and AWS side. First, in AWS, we need to create an IAM Policy and an AssumeRole policy to allow access from Datadog to our AWS account.

AssumeRolePolicy

```json
{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::464622532012:root"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": {
      "sts:ExternalId": "${var.Datadog_aws_external_id}"
    }
  }
}
```

The content for the IAM Policy can be found here. Attach both policies to an IAM Role called DatadogAWSIntegrationRole. Go to your Datadog account settings and press the “+Available” button under the AWS integration. Go to the configuration tab and replace the variable ${var.Datadog_aws_external_id} in the policy above with the value of AWS External ID. Add the AWS account number, and for the role use the DatadogAWSIntegrationRole created above. Optionally, you can add tags which will be added to all metrics gathered by this integration. On the left, limit the selection to the AWS services you use. Lastly, save the integration; your AWS integration (and the integrations for the enabled AWS services) will be shown under “Installed”.
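As a side note, the ${var.Datadog_aws_external_id} placeholder above is a Terraform variable. Its substitution can be mimicked in plain Python to preview the final policy document; the external ID value below is made up for illustration:

```python
import json
from string import Template

# The AssumeRole policy from above, with the Terraform-style placeholder
# rewritten as a string.Template variable.
POLICY = Template(json.dumps({
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::464622532012:root"},
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"sts:ExternalId": "$external_id"}},
}))

def render_policy(external_id):
    """Fill in the external ID shown on the Datadog AWS integration page."""
    return json.loads(POLICY.substitute(external_id=external_id))

doc = render_policy("abc123")
```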
Integration in action

When you go to your dashboard list, you’ll see some new interesting dashboards with new metrics you can use to create new monitors, such as:

- Database (RDS) memory usage, load, CPU, disk usage, connections
- Number of available VPN tunnels for a VPN connection
- Number of healthy hosts behind a load balancer
- ...

Docker Integration

Enabling the Docker integration is as easy as pressing the “+Available” button. A “Docker – Overview” dashboard is available as soon as you enable the integration.

Kubernetes Integration

Just like the Docker integration above, enabling the Kubernetes integration is as easy as pressing the “+Available” button, with a “Kubernetes – Overview” dashboard available as soon as you enable the integration. If you want all data for this integration, you should make sure kube-state-metrics is running within your Kubernetes cluster. More information here.

🚀 Takeaway

The goal of this article was to show how Datadog can become one of the most indispensable tools in your monitoring and logging infrastructure. Setup is pretty easy, and a wealth of information can be collected and visualized effectively. If you create a good set of monitors so that Datadog alerts on degradation or increased error rates, most incidents can be solved before they become actual problems. You can script the creation of these monitors using the Datadog API, reducing the setup time of your monitoring and alerting framework drastically. Do you want more information, or could you use some help setting up your own EKS cluster with Datadog monitoring? Don’t hesitate to contact us!
