

Training machine learning models can take up a lot of time if you don’t have the hardware to support it. For instdataance with neural networks, you need to calculate the addition of every neuron to the total error during the training step. This can result into thousands of calculations per step. Complex models and large datasets can result in a long process of training. Evaluating such models at scale can potentially slow down your application’s performance. Not to mention hyperparameters you need to tune, restarting the process a few times over.
In this blog post I want to talk about how you can tackle these issues by making maximum use of your resources. In particular, I want to talk about TensorFlow, a framework designed for parallel computing and Kubernetes, a platform able to scale up or down in terms of application usage.
TensorFlow

TensorFlow is an open-source library for building and training machine learning models. Originally a Google project, it has had many successes in the field of AI. It is available in multiple layers of abstraction, which allows you to quickly set up predefined machine learning models. TensorFlow was designed to run on distributed systems. The computations it requires can be run in parallel across multiple devices, through data flow graphs underneath. These represent a series of mathematical equations, with multidimensional array representations (tensors) at its edges. Deepmind used this power to create AlphaGo Zero, using 64 GPU workers and 19 CPU parameter servers to play 4.9 million games of GO against itself in just 3 days.
Kubernetes

Kubernetes is Google project as well and is an open-source platform for managing containerized applications at scale. With Kubernetes, you can easily add more instance nodes and get more out of your available hardware. You can compare Kubernetes to cash registers at the supermarket. Whenever there’s a long queue of customers waiting, the store quickly opens up a new register to handle a few of those customers. In reality, this means that Kubernetes (the cash register) is a virtual machine running a service and the customers are consumers of that service.
The power of Kubernetes is in its ease of usage. You don’t need to add newly created instances to the load balancer, it’s done automatically. You don’t need to connect the new instance with file storage or networks, Kubernetes does it for you. And if an instance doesn’t behave like it should, it kills it off and immediately spins up a new one.
Distributed Training

Like I mentioned before, you can reduce the time it takes to train a model by doing computations in parallel over different hardware units. Even with a limited configuration, you can reduce your training time to a minimum by distributing it over multiple devices. TensorFlow allows you to use CPUs, GPUs and even TPUs or Tensor Processing Unit, a chip designed to run TensorFlow operations. You need to define a Strategy and make sure you create and compile your model within the scope of that strategy.
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss='mse', optimizer='sgd')
The MirroredStrategy above allows you to distribute training over multiple GPUs on the same machine. The model is replicated for every GPU and variable updates are being executed for every replica.
A more interesting variant of this strategy is the MultiWorkerMirroredStrategy. It gives you the opportunity to distribute the training over multiple machines (workers), and each of those may use multiple GPUs. This is where Kubernetes can help fast-track your machine learning. You can create multiple service nodes with Kubernetes according to the need for parameter servers and workers. Parameter servers keep track of the model parameters, workers calculate the updates of those parameters. In general, you can reduce the bandwidth between the members of the server cluster by adding more parameter servers. To make the setup run, you need to set an environment variable TS_CONFIG which defines the role of each node and the setup of the rest of the cluster.
os.environ["TF_CONFIG"] = json.dumps({
'cluster': {
'worker': ['worker-0:5000', 'worker-1:5000'],
'ps': ['ps-0:5000']
},
'task': {'type': 'ps', 'index': 0}
})
To make the setup easier, there’s a Github repository with a template for Kubernetes. Note that it doesn’t set up TS_CONFIG itself, but passes its content as parameters to the script. These parameters are used to define which devices can be used in a distributed training.
cluster = tf.train.ClusterSpec({
"worker": ['worker-0:5000', 'worker-1:5000'],
"ps": ['ps-0:5000']})
server = tf.train.Server(
cluster, job_name='ps', task_index='0')
The ClusterSpec specifies the worker and parameter servers in the cluster. It has the same value for all nodes. The Server contains the definition of the task of the current node, hence a different value per node.
TensorFlow Serving

For distributed inference, TensorFlow contains a package for hosting machine learning models. This is called TensorFlow Serving and it has been designed to quickly set up and manage machine learning models. All it needs is a SavedModel representation. SavedModel is a format to save trained Tensorflow models in a way that they can easily be loaded and restored. A SavedModel can be serialized into a directory, making it somewhat portable and easy to share. You can quickly create a SavedModel by using the built-in function save.
model_version = '1'
model_name = 'my_model'
model_path = os.path.join('/path/to/save/dir/', model_name, model_version)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss='mse', optimizer='sgd')
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
tf.saved_model.save(model, model_path)
You can use the SavedModel CLI to inspect SavedModel files. Once you have these files in place, TensorFlow Serving can turn it into a gRPC or RESTful interface. The Docker image tensorflow/serving provides the easiest path towards a running server. There are multiple versions of this image, including one for GPU usage. Besides choosing the right image, you only need to provide the path to the directory you just created and name your model.
$ docker run -t --rm -p 8500:8500 -p 8501:8501 \
-v "/path/to/save/dir/my_model:/models/my_model" \
-e MODEL_NAME=my_model \
tensorflow/serving
Obviously, with Kubernetes you can now create a deployment for this image, and scale up/down the number of replicas automatically. Put a LoadBalancer Service in front of it, and your users will be redirected to the right node without anyone noticing. Because inference requires much less computation, you don’t have to distribute the computation amongst multiple nodes. Note that the save directory path also contains a “version” directory. This is a convention TensorFlow Serving uses to watch the directory for new versions of a SavedModel. When it detects a new one, it loads it automatically, ready to be served. With TensorFlow Serving and Kubernetes, you can handle any amount of load for your classification, regression or prediction models.
🚀 Takeaway
You can gain a lot of time by distributing the necessary computations for your machine learning project. By combining a highly scalable library like TensorFlow with a flexible platform like Kubernetes, you can make optimal use of your resources and your time.
Of course, you can speed up things even more if you have a knowledgeable Kubernetes team at your side, or somebody to help tune your machine learning models. If you’re ready to ramp up your machine learning, we can do exactly that! Interested or questions? Shoot me an email at stijn.vandenenden@acagroup.be

What others have also read


Within ACA, there are multiple teams working on different (or the same!) projects. Every team has their own domains of expertise, such as developing custom software, marketing and communications, mobile development and more. The teams specialized in Atlassian products and cloud expertise combined their knowledge to create a highly-available Atlassian stack on Kubernetes. Not only could we improve our internal processes this way, we could also offer this solution to our customers! In this blogpost, we’ll explain how our Atlassian and cloud teams built a highly-available Atlassian stack on top of Kubernetes. We’ll also discuss the benefits of this approach as well as the problems we’ve faced along the path. While we’re damn close, we’re not perfect after all 😉 Lastly, we’ll talk about how we monitor this setup. The setup of our Atlassian stack Our Atlassian stack consists of the following products: Amazon EKS Amazon EFS Atlassian Jira Data Center Atlassian Confluence Data Center Amazon EBS Atlassian Bitbucket Data Center Amazon RDS As you can see, we use AWS as the cloud provider for our Kubernetes setup. We create all the resources with Terraform. We’ve written a separate blog post on what our Kubernetes setup exactly looks like. You can read it here ! The image below should give you a general idea. The next diagram should give you an idea about the setup of our Atlassian Data Center. While there are a few differences between the products and setups, the core remains the same. The application is launched as one or more pods described by a StatefulSet. The pods are called node-0 and node-1 in the diagram above. The first request is sent to the load balancer and will be forwarded to either the node-0 pod or the node-1 pod. Traffic is sticky, so all subsequent traffic from that user will be sent to node-1. Both pod-0 and pod-1 require persistent storage which is used for plugin cache and indexes. A different Amazon EBS volume is mounted on each of the pods. Most of the data like your JIRA issues, Confluence spaces, … is stored in a database. The database is shared, node-0 and node-1 both connect to the same database. We usually use PostgreSQL on Amazon RDS. The node-0 and node-1 pod also need to share large files which we don’t want to store in a database, for example attachments. The same Amazon EFS volume is mounted on both pods. When changes are made, for example an attachment is uploaded to an issue, the attachment is immediately available on both pods. We use CloudFront (CDN) to cache static assets and improve the web response times. The benefits of this setup By using this setup, we can leverage the advantages of Docker and Kubernetes and the Data Center versions of the Atlassian tooling. There are a lot of benefits to this kind of setup, but we’ve listed the most important advantages below. It’s a self-healing platform : containers and worker nodes will automatically replace themselves when a failure occurs. In most cases, we don’t even have to do anything and the stack takes care of itself. Of course, it’s still important to investigate any failures so you can prevent them from occurring in the future. Exactly zero downtime deployments : when upgrading the first node within the cluster to a new version, we can still serve the old version to our customers on the second. Once the upgrade is complete, the new version is served from the first node and we can upgrade the second node. This way, the application stays available, even during upgrades. Deployments are predictable : we use the same Docker container for development, staging and production. It’s why we are confident the container will be able to start in our production environment after a successful deploy to staging. Highly available applications: when failure occurs on one of the nodes, traffic can be routed to the other node. This way you have time to investigate the issue and fix the broken node while the application stays available. It’s possible to sync data from one node to the other . For example, syncing the index from one node to the other to fix a corrupt index can be done in just a few seconds, while a full reindex can take a lot longer. You can implement a high level of security on all layers (AWS, Kubernetes, application, …) AWS CloudTrail prevents unauthorized access on AWS and sends an alert in case of anomaly. AWS Config prevents AWS security group changes. You can find out more on how to secure your cloud with AWS Config in our blog post. Terraform makes sure changes on the AWS environment are approved by the team before rollout. Since upgrading Kubernetes master and worker nodes has little to no impact, the stack is always running a recent version with the latest security patches. We use a combination of namespacing and RBAC to make sure applications and deployments can only access resources within their namespace with least privilege . NetworkPolicies are rolled out using Calico. We deny all traffic between containers by default and only allow specific traffic. We use recent versions of the Atlassian applications and implement Security Advisories whenever they are published by Atlassian. Interested in leveraging the power of Kubernetes yourself? You can find more information about how we can help you on our website! {% module_block module "widget_3d4315dc-144d-44ec-b069-8558f77285de" %}{% module_attribute "buttons" is_json="true" %}{% raw %}[{"appearance":{"link_color":"light","primary_color":"primary","secondary_color":"primary","tertiary_color":"light","tertiary_icon_accent_color":"dark","tertiary_text_color":"dark","variant":"primary"},"content":{"arrow":"right","icon":{"alt":null,"height":null,"loading":"disabled","size_type":null,"src":"","width":null},"tertiary_icon":{"alt":null,"height":null,"loading":"disabled","size_type":null,"src":"","width":null},"text":"Apply the power of Kubernetes"},"target":{"link":{"no_follow":false,"open_in_new_tab":false,"rel":"","sponsored":false,"url":null,"user_generated_content":false}},"type":"normal"}]{% endraw %}{% end_module_attribute %}{% module_attribute "child_css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "definition_id" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "field_types" is_json="true" %}{% raw %}{"buttons":"group","styles":"group"}{% endraw %}{% end_module_attribute %}{% module_attribute "isJsModule" is_json="true" %}{% raw %}true{% endraw %}{% end_module_attribute %}{% module_attribute "label" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "module_id" is_json="true" %}{% raw %}201493994716{% endraw %}{% end_module_attribute %}{% module_attribute "path" is_json="true" %}{% raw %}"@projects/aca-group-project/aca-group-app/components/modules/ButtonGroup"{% endraw %}{% end_module_attribute %}{% module_attribute "schema_version" is_json="true" %}{% raw %}2{% endraw %}{% end_module_attribute %}{% module_attribute "smart_objects" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "smart_type" is_json="true" %}{% raw %}"NOT_SMART"{% endraw %}{% end_module_attribute %}{% module_attribute "tag" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "type" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "wrap_field_tag" is_json="true" %}{% raw %}"div"{% endraw %}{% end_module_attribute %}{% end_module_block %} Apply the power of Kubernetes Problems we faced during the setup Migrating to this stack wasn’t all fun and games. We’ve definitely faced some difficulties and challenges along the way. By discussing them here, we hope we can facilitate your migration to a similar setup! Some plugins (usually older plugins) were only working on the standalone version of the Atlassian application. We needed to find an alternative plugin or use vendor support to have the same functionality on Atlassian Data Center. We had to make some changes to our Docker containers and network policies (i.e. firewall rules) to make sure both nodes of an application could communicate with each other. Most of the applications have some extra tools within the container. For example, Synchrony for Confluence, ElasticSearch for BitBucket, EazyBI for Jira, and so on. These extra tools all needed to be refactored for a multi-node setup with shared data. In our previous setup, each application was running on its own virtual machine. In a Kubernetes context, the applications are spread over a number of worker nodes. Therefore, one worker node might run multiple applications. Each node of each application will be scheduled on a worker node that has sufficient resources available. We needed to implement good placement policies so each node of each application has sufficient memory available. We also needed to make sure one application could not affect another application when it asks for more resources. There were also some challenges regarding load balancing. We needed to create a custom template for nginx ingress-controller to make sure websockets are working correctly and all health checks within the application are reporting a healthy status. Additionally, we needed a different load balancer and URL for our BitBucket SSH traffic compared to our web traffic to the BitBucket UI. Our previous setup contained a lot of data, both on filesystem and in the database. We needed to migrate all the data to an Amazon EFS volume and a new database in a new AWS account. It was challenging to find a way to have a consistent sync process that also didn’t take too long because during migration, all applications were down to prevent data loss. In the end, we were able to meet these criteria and were able to migrate successfully. Monitoring our Atlassian stack We use the following tools to monitor all resources within our setup Datadog to monitor all components created within our stack and to centralize logging of all components. You can read more about monitoring your stack with Datadog in our blog post here . NewRelic for APM monitoring of the Java process (Jira, Confluence, Bitbucket) within the container. If our monitoring detects an anomaly, it creates an alert within OpsGenie . OpsGenie will make sure that this alert is sent to the team or the on-call person that is responsible to fix the problem. If the on-call person does not acknowledge the alert in time, the alert will be escalated to the team that’s responsible for that specific alert. Conclusion In short, we are very happy we migrated to this new stack. Combining the benefits of Kubernetes and the Atlassian Data Center versions of Jira, Confluence and BitBucket feels like a big step in the right direction. The improvements in self-healing, deploying and monitoring benefits us every day and maintenance has become a lot easier. Interested in your own Atlassian Stack? Do you also want to leverage the power of Kubernetes? You can find more information about how we can help you on our website! Our Atlassian hosting offering
Read more

At ACA, we live and breathe Kubernetes. We set up new projects with this popular container orchestration system by default, and we’re also migrating existing customers to Kubernetes. As a result, the amount of Kubernetes clusters the ACA team manages, is growing rapidly! We’ve had to change our setup multiple times to accommodate for more customers, more clusters, more load, less maintenance and so on. From an Amazon ECS to a Kubernetes setup In 2016, we had a lot of projects that were running in Docker containers. At that point in time, our Docker containers were either running in Amazon ECS or on Amazon EC2 Virtual Machines running the Docker daemon. Unfortunately, this setup required a lot of maintenance. We needed a tool that would give us a reliable way to run these containers in production. We longed for an orchestrator that would provide us high availability, automatic cleanup of old resources, automatic container scheduling and so much more. → Enter Kubernetes ! Kubernetes proved to be the perfect candidate for a container orchestration tool. It could reliably run containers in production and reduce the amount of maintenance required for our setup. Creating a Kubernetes-minded approach Agile as we are, we proposed the idea for a Kubernetes setup for one of our next projects. The customer saw the potential of our new approach and agreed to be part of the revolution. At the beginning of 2017, we created our first very own Kubernetes cluster. At this stage, there were only two certainties: we wanted to run Kubernetes and it would run on AWS . Apart from that, there were still a lot of questions and challenges. How would we set up and manage our cluster? Can we run our existing docker containers within the cluster? What type of access and information can we provide the development teams? We’ve learned that in the end, the hardest task was not the cluster setup. Instead, creating a new mindset within ACA Group to accept this new approach, and involving the development teams in our next-gen Kubernetes setup proved to be the harder task at hand. Apart from getting to know the product ourselves and getting other teams involved as well, we also had some other tasks that required our attention: we needed to dockerize every application, we needed to be able to setup applications in the Kubernetes cluster that were high available and if possible also self-healing, and clustered applications needed to be able to share their state using the available methods within the selected container network interface. Getting used to this new way of doing things in combination with other tasks, like setting up good monitoring, having a centralized logging setup and deploying our applications in a consistent and maintainable way, proved to be quite challenging. Luckily, we were able to conquer these challenges and about half a year after we’d created our first Kubernetes cluster, our first production cluster went live (August 2017). These were the core components of our toolset anno 2017: Terraform would deploy the AWS VPC, networking components and other dependencies for the Kubernetes cluster Kops for cluster creation and management An EFK stack for logging was deployed within the Kubernetes cluster Heapster, influxdb and grafana in combination with Librato for monitoring within the cluster Opsgenie for alerting Nice! … but we can do better: reducing costs, components and downtime Once we had completed our first setup, it became easier to use the same topology and we continued implementing this setup for other customers. Through our infrastructure-as-code approach (Terraform) in combination with a Kubernetes cluster management tool (Kops), the effort to create new clusters was relatively low. However, after a while, we started to notice some possible risks related to this setup. The amount of work required for the setup and the impact of updates or upgrades on our Kubernetes stack was too large. At the same time, the number of customers that wanted their very own Kubernetes cluster was growing. So, we needed to make some changes to reduce maintenance effort on the Kubernetes part of this setup to keep things manageable for ourselves. Migration to Amazon EKS and Datadog At this point the Kubernetes service from AWS (Amazon EKS) became generally available. We were able to move all things that are managed by Kops to our Terraform code, making things a lot less complex. As an extra benefit, the Kubernetes master nodes are now managed by EKS. This means we now have less nodes to manage and EKS also provides us cluster upgrades with a touch of the button. Apart from reducing the workloads on our Kubernetes management plane, we’ve also reduced the number of components within our cluster. In the previous setup we were using an EFK (ElasticSearch, Fluentd and Kibana) stack for our logging infrastructure. For our monitoring, we were using a combination of InfluxDB, Grafana, Heapster and Librato. These tools gave us a lot of flexibility but required a lot of maintenance effort, since they all ran within the cluster. We’ve replaced them all with Datadog agent, reducing our maintenance workloads drastically. Upgrades in 60 minutes Furthermore, because of the migration to Amazon EKS and the reduction in the number of components running within the Kubernetes cluster, we were able to reduce the cost and availability impact of our cluster upgrades. With the current stack, using Datadog and Amazon EKS, we can upgrade a Kubernetes cluster within an hour. If we were to use the previous stack, it would take us about 10 hours on average. So where are we now? We currently have 16 Kubernetes clusters up and running , all running the latest available EKS version. Right now, we want to spread our love for Kubernetes wherever we can. Multiple project teams within ACA Group are now using Kubernetes, so we are organizing workshops to help them get up to speed with the technology quickly. At the same time, we also try to catch up with the latest additions to this rapidly changing platform. That’s why we’ve attended the Kubecon conference in Barcelona and shared our opinions in our Kubecon Afterglow event. What’s next? Even though we are very happy with our current Kubernetes setup, we believe there’s always room for improvement . During our Kubecon Afterglow event, we’ve had some interesting discussions with other Kubernetes enthusiasts. These discussions helped us defining our next steps, bringing our Kubernetes setup to an even higher level. Some things we’d like to improve in the near future: add service mesh to our Kubernetes stack, 100% automatic worker node upgrades without application downtime. Of course, these are just a few focus points. We’ll implement many new features and improvements whenever they are released! What about you? Are you interested in your very own Kubernetes cluster? Which improvements do you plan on making to your stack or Kubernetes setup? Or do you have an unanswered Kubernetes question we might be able to help you with? Contact us at cloud@aca-it.be and we will help you out! {% module_block module "widget_7e6bdbd6-406c-4a0a-8393-27a28f436c6d" %}{% module_attribute "buttons" is_json="true" %}{% raw %}[{"appearance":{"link_color":"light","primary_color":"primary","secondary_color":"primary","tertiary_color":"light","tertiary_icon_accent_color":"dark","tertiary_text_color":"dark","variant":"primary"},"content":{"arrow":"right","icon":{"alt":null,"height":null,"loading":"disabled","size_type":null,"src":"","width":null},"tertiary_icon":{"alt":null,"height":null,"loading":"disabled","size_type":null,"src":"","width":null},"text":"Our Kubernetes services"},"target":{"link":{"no_follow":false,"open_in_new_tab":false,"rel":"","sponsored":false,"url":{"content_id":null,"href":"https://www.acagroup/be/en/services/kubernetes","href_with_scheme":"https://www.acagroup/be/en/services/kubernetes","type":"EXTERNAL"},"user_generated_content":false}},"type":"normal"}]{% endraw %}{% end_module_attribute %}{% module_attribute "child_css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "definition_id" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "field_types" is_json="true" %}{% raw %}{"buttons":"group","styles":"group"}{% endraw %}{% end_module_attribute %}{% module_attribute "isJsModule" is_json="true" %}{% raw %}true{% endraw %}{% end_module_attribute %}{% module_attribute "label" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "module_id" is_json="true" %}{% raw %}201493994716{% endraw %}{% end_module_attribute %}{% module_attribute "path" is_json="true" %}{% raw %}"@projects/aca-group-project/aca-group-app/components/modules/ButtonGroup"{% endraw %}{% end_module_attribute %}{% module_attribute "schema_version" is_json="true" %}{% raw %}2{% endraw %}{% end_module_attribute %}{% module_attribute "smart_objects" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "smart_type" is_json="true" %}{% raw %}"NOT_SMART"{% endraw %}{% end_module_attribute %}{% module_attribute "tag" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "type" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "wrap_field_tag" is_json="true" %}{% raw %}"div"{% endraw %}{% end_module_attribute %}{% end_module_block %}
Read more

Didn’t make it to KubeCon this year? Read along to find out our highlights of the KubeCon / CloudNativeCon conference this year by ACA Group’s Cloud Native team! What is KubeCon / CloudNativeCon? KubeCon (Kubernetes Conference) / CloudNativeCon , organized yearly at EMAE by the Cloud Native Computing Foundation (CNCF), is a flagship conference that gathers adopters and technologists from leading open source and cloud native communities in a location. This year, approximately 5,000 physical and 10,000 virtual attendees showed up for the conference. CNCF is the open source, vendor-neutral hub of cloud native computing, hosting projects like Kubernetes and Prometheus to make cloud native universal and sustainable. Bringing 300+ sessions from partners, industry leaders, users and vendors on topics covering CI/CD, GitOps, Kubernetes, machine learning, observability, networking, performance, service mesh and security. It's clear there's always something interesting to hear about at KubeCon, no matter your area of interest or level of expertise! It's clear that the Cloud Native ecosystem has grown to a mature, trend-setting and revolutionizing game-changer in the industry. All is initiated on the Kubernetes trend and a massive amount of organizations that support, use and have grown their business by building cloud native products or using them in mission-critical solutions. 2022's major themes What struck us during this year’s KubeCon were the following major themes: The first was increasing maturity and stabilization of Kubernetes and associated products for monitoring, CI/CD, GitOps, operators, costing and service meshes, plus bug fixing and small improvements. The second is a more elaborate focus on security . Making pods more secure, preventing pod trampoline breakouts, end-to-end encryption and making full analysis of threats for a complete k8s company infrastructure. The third is sustainability and a growing conscience that systems running k8s and the apps on it consume a lot of energy while 60 to 80% of CPU remains unused. Even languages can be energy (in)efficient. Java is among the most power efficient, while Python apparently is far less due to the nature of the interpreter / compiler. Companies all need to plan and work on decreasing energy footprint in both applications and infrastructure. Autoscaling will play an important role in achieving this. Sessions highlights Sustainability Data centers worldwide consume 8% of all generated electricity worldwide. So we'll need to reflect on the effective usage of our infrastructure and avoid idle time (on average CPU utilization is only between 20 and 40%) when servers are running, make them work with running as many workloads as possible shut down resources when they are not needed by applying autoscaling approaches the coding technology used in your software, some programming languages use less CPU. CICD / GitOps GitOps automates infrastructure updates using a Git workflow with continuous integration (CI) and continuous delivery (CI/CD). When new code is merged, the CI/CD pipeline enacts the change in the environment. Flux is a great example of this. Flux provides GitOps for both apps and infrastructure. It supports GitRepository, HelmRepository, HelmRepository and Bucket CRD as the single source of truth. With A/B or Canary deployments, it makes it easy to deploy new features without impacting all the users. When the deployment fails, it can easily roll back. Checkout the KubeCon schedule page for more information! Kubernetes Even though Kubernetes 1.24 was released a few weeks before the start of the event, not many talks were focused on the Kubernetes core. Most talks were focused on extending Kubernetes (using APIs, controllers, operators, …) or best practices around security, CI/CD, monitoring … for whatever will run within the Kubernetes cluster. If you're interested in the new features that Kubernetes 1.24 has to offer, you can check the official website . Observability Getting insights on how your application is running in your cluster is crucial, but not always practical. This is where eBPF comes into play, which is used by tools such as Pixie to collect data without any code changes. Check out the KubeCon schedule page for more information! FinOps Now that more and more people are using Kubernetes, a lot of workloads have been migrated. All these containers have a footprint. Memory, CPU, storage, … needs to be allocated, and they all have a cost. Cost management was a recurring topic during the talks. Using autoscaling (adding but also removing capacity) to match the required resources and identifying unused resources are part of this new movement. New services like 'kubecost' are becoming increasingly popular. Performance One of the most common problems in a cluster is not having enough space or resources. With the help of a Vertical Pod Autoscaler (VPA) this can be a thing of the past. A VPA will analyze and store Memory and CPU metrics/data to automatically adjust to the right CPU and memory request limits. The benefits of this approach will let you save money, avoid waste, size optimally the underlying hardware, tune resources on worker nodes and optimize placements of pods in a Kubernetes cluster. Check out the KubeCon schedule page for more information! Service mesh We all know it's extremely important to know which application is sharing data with other applications in your cluster. Service mesh provides traffic control inside your cluster(s). You can block or permit any request that is sent or received from any application to other applications. It also provides Metrics, Specs, Split, ... information to understand the data flow. In the talk, Service Mesh at Scale: How Xbox Cloud Gaming Secures 22k Pods with Linkerd , Chris explains why they choose Linkerd and what the benefits are of a service mesh. Check out the KubeCon schedule page for more information! Security Trampoline pods, sounds fun, right? During a talk by two security researchers from Palo Alto Networks, we learned that they aren’t all that fun. In short, these are pods that can be used to gain cluster admin privileges. To learn more about the concept and how to deal with them, we strongly recommend taking a look at the slides on the KubeCon schedule page ! Lachlan Evenson from Microsoft gave a clear explanation of Pod Security in his The Hitchhiker's Guide to Pod Security talk. Pod Security is a built-in admission controller that evaluates Pod specifications against a predefined set of Pod Security Standards and determines whether to admit or deny the pod from running. — Lachlan Evenson , Principal Program Manager at Microsoft P o d Security is replacing PodSecurityPolicy starting fro m Kubernetes 1.23. So if you are using PodSecurityPolicy, now might be a good time to further research Pod Security and the migration path. In version 1.25, support for PodSecurityPolicy will be removed. If you aren’t using PodSecurityPolicy or Pod Security, it is definitely time to further investigate it! Another one of the recurring themes of this KubeCon 2022 were operators. Operators enable the extension of the Kubernetes API with operational knowledge. This is achieved by combining Kubernetes controllers and watched objects that describe the desired state. They introduce Custom Resource Definitions, custom controllers, Kubernetes or cloud resources and logging and metrics, making life easier for Dev as well as Ops. H owever, during a talk by Kevin Ward from ControlPlane, we learned that there are some risks. Additionally, and more importantly, he also talked about how we can identify those risks with tools such as BadRobot and an operator thread matrix . Checkout the KubeCon schedule page for more information! Scheduling Telemetry Aware Scheduling helps you schedule your workloads based on metrics from your worker nodes. You can for example set a rule to not schedule new workloads on worker nodes with more than 90% used memory. The cluster will take this into account when scheduling a pod. Another nice feature of this tool is that it can also reschedule pods to make sure your rules are kept in line. Checkout the KubeCon schedule page for more information! Cluster autoscaling A great way for stateless workloads to scale cost effectively is to use AWS EC2 Spot, which is spare VM capacity available at a discount. To use Spot instances effectively in a K8S cluster, you should use aws-node-termination-handler . This way, you can move your workloads off of a worker node when Spot decides to reclaim it. Another good tool is Karpenter , a tool to provision Spot instances just in time for your cluster. With these two tools, you can cost effectively host your stateless workloads! Check out the KubeCon schedule page for more information! Event-driven autoscaling Using the Horizontal Pod Autoscaler (HPA) is a great way to scale pods based on metrics such as CPU utilization, memory usage, and more. Instead of scaling based on metrics, Kubernetes Event Driven Autoscaling (KEDA) can scale based on events (Apache Kafka, RabbitMQ, AWS SQS, …) and it can even scale to 0 unlike HPA. Check out the KubeCon schedule page for more information! Wrap-up We had a blast this year at the conference. We left with an inspired feeling that we'll no doubt translate into internal projects, apply for new customer projects and discuss with existing customers where applicable. Not only that, but we'll brief our colleagues and organize an afterglow session for those interested back home in Belgium. If you appreciated our blog article, feel free to drop us a small message. We are always happy when the content that we publish is also of any value or interest to you. If you think we can help you or your company in adopting Cloud Native, drop me a note at peter.jans@aca-it.be . As a final note we'd like to thank Mona for the logistics, Stijn and Ronny for this opportunity and the rest of the team who stayed behind to keep an eye on the systems of our valued customers.
Read moreWant to dive deeper into this topic?
Get in touch with our experts today. They are happy to help!

Want to dive deeper into this topic?
Get in touch with our experts today. They are happy to help!

Want to dive deeper into this topic?
Get in touch with our experts today. They are happy to help!

Want to dive deeper into this topic?
Get in touch with our experts today. They are happy to help!


