Komodor: Comprehensive Kubernetes Troubleshooting

Today we are delighted to welcome Ben Ofiri to the show. Ben is Co-Founder & CEO at Komodor, a startup that has built a Kubernetes troubleshooting platform. Ben has a decade-long track record in software development and product management, including six years at Google, where he served as product lead for Google Duplex, the company’s flagship conversational AI project. Ben’s experience at Google familiarized him with small and large-scale Kubernetes deployments, giving him an in-depth understanding of the challenges involved in troubleshooting complex distributed systems.

Watch & Listen to the full interview episode.

Ben, what was it about Kubernetes that made you think you should quit your job and start a company built to help other people better understand Kubernetes?

It wasn’t a single moment when I understood that I wanted to start my own company. Instead, it was an aggregated experience. My partner (Ed. Itiel Schwartz, Co-Founder & CTO at Komodor) and I researched the industry and saw that companies adopting Kubernetes want to equip their developers with tools and processes to troubleshoot systems more effectively.

Today, many organizations still expect developers to check the logs, check pods, check nodes, prevent memory issues, and check whether something changed in the configuration or deployment. That is not a trivial task without expertise in troubleshooting day-to-day problems in a system as complex as Kubernetes.

At Komodor, we see a lot of frustration from developers because, instead of writing code and developing new features, they fix systems they are often clueless about or have to escalate most alerts to the DevOps and SRE teams. In turn, DevOps teams deal with constant alerts instead of developing infrastructure. We realized that it is only going to get worse unless there’s a massive change in how developers and modern DevOps teams troubleshoot, understand, and operate Kubernetes. That’s where we decided to lead the change and build our tool.

Ben Ofiri, Co-Founder & CEO at Komodor

You talk about the “three pillars” of troubleshooting Kubernetes. Can you explain what those three pillars are and how Komodor helps with each?

Those pillars are our interpretation of troubleshooting in modern systems. They are understanding, managing, and preventing.

So, when you get an alert, it’s crucial to understand what’s going on in that massively complex system, which comprises thousands of different microservices with thousands of constantly changing pods. To understand that, you need a lot of context: which services are running, the relationships between services and pods, the status of every component, what the last change was, and whether it was a configuration change or an infrastructure change. With all that context, you can use your knowledge and expertise to determine the cause of a particular issue.

Once you understand what happened, you need to manage the incident by communicating with other team members and resolving the issue, whether by rolling back a change, writing new code, or changing the configuration.

After you have fixed the system and got everything working, you need to prevent similar issues from recurring. At Komodor, we provide significant value in all three pillars by building an efficient troubleshooting process within organizations.

How is your offering different from other tools available on the market?

At Komodor, we don’t compete with any of the tools or platforms our users have adopted. We work together with them and integrate with monitoring tools such as Prometheus, Datadog, and New Relic. We have integrations with logging tools such as Splunk, Sumo Logic, and Logz.io. Komodor can also integrate with incident response platforms such as PagerDuty, VictorOps, and Opsgenie.

We streamline developers’ interactions with all the available tools and reduce the complexity to a few button clicks to give you a comprehensive understanding of what is happening in your systems.

The primary benefit of our tool is its ability to collect data from the code repository and configuration systems, digest this information, and present a coherent story of changes and root causes instead of just sending an alert based on the collected data.

Komodor has also been talking a lot about Change Intelligence. Can you explain to our audience what Change Intelligence is, and does it use AI?

First of all, Change Intelligence means acknowledging that in modern distributed systems, composed of thousands of different sub-components, changes are the key to understanding what’s going on in the system.

When something goes wrong in your system, you need to look at changes across the tools you are using, see what changed in code, in configurations, and in Kubernetes, and perform deep analysis on top of those changes.

In Change Intelligence, we’re using data-driven models. We correlate events in different ways, relying primarily on our knowledge and experience of how Kubernetes works internally and how it works together with other tools like Jenkins, Argo CD, and Datadog. We know how to track a code commit from GitHub to a specific Kubernetes deployment. We know how to collect the different configuration changes that happened and show you everything in a single timeline.
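
(Ed. To make the single-timeline idea concrete, here is a minimal Python sketch. It is not Komodor’s implementation: the ChangeEvent type, the source labels, the sample events, and the 30-minute lookback window are all hypothetical, chosen only to illustrate merging changes from several tools and listing the ones that preceded an alert.)

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change; all fields are illustrative assumptions."""
    timestamp: datetime
    source: str   # hypothetical labels such as "github", "kubernetes", "config"
    service: str
    summary: str


def build_timeline(*streams):
    """Merge per-source event lists into one list ordered by time."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: e.timestamp)


def changes_before(timeline, service, alert_time, window=timedelta(minutes=30)):
    """Return changes to `service` within `window` before the alert, newest first."""
    start = alert_time - window
    hits = [
        e for e in timeline
        if e.service == service and start <= e.timestamp <= alert_time
    ]
    return list(reversed(hits))


# Made-up sample data from three hypothetical sources.
t0 = datetime(2024, 1, 1, 12, 0)
commits = [ChangeEvent(t0, "github", "checkout", "commit merged")]
deploys = [
    ChangeEvent(t0 + timedelta(minutes=5), "kubernetes", "checkout", "deployment rolled out"),
    ChangeEvent(t0 + timedelta(minutes=6), "kubernetes", "search", "deployment rolled out"),
]
configs = [ChangeEvent(t0 - timedelta(hours=2), "config", "checkout", "memory limit lowered")]

timeline = build_timeline(commits, deploys, configs)
suspects = changes_before(timeline, "checkout", t0 + timedelta(minutes=10))

for event in suspects:
    print(event.timestamp.isoformat(), event.source, event.summary)
```

In this sketch, the changes closest to the alert appear first, and the older configuration change falls outside the lookback window. A real system would need far richer correlation than a fixed time window.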

Ben, thanks for your insights and lessons. We hope we can have you on the show again sometime soon.

Stay tuned for more great interviews coming your way!