
Could eBPF have spared us the CrowdStrike incident? • The Register

Interview The CrowdStrike chaos was caused by software running out of control in the Windows kernel after an update made the code crash. eBPF is a useful tool for kernel tracing and observability, but could it have mitigated the CrowdStrike incident?

“It’s interesting,” Tom Wilkie, CTO of observability specialist Grafana Labs, tells The Register, “because there was a vulnerability in the eBPF runtime that, in a specific Red Hat kernel, caused an outage similar to the one triggered by CrowdStrike.”


Wilkie is referring to an incident in June in which Red Hat warned its customers about an issue related to CrowdStrike's Falcon Sensor. The problem paled in comparison to what happened a few weeks later, when a CrowdStrike update left 8.5 million Windows computers around the world stuck in a blue screen boot loop.

eBPF allows software to run in a virtual machine (VM) inside the Linux kernel, allowing developers to add features at runtime. The theory is that an eBPF program cannot crash the kernel because it runs in a sandbox and is checked for safety by a verifier. Due to the low level at which some programs run, this is a popular way to implement observability and security.
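To make the verify-then-run idea concrete, here is a minimal sketch in Python. It is a toy model only: the opcodes, the `MEM_SIZE` limit, and the `verify` and `run` helpers are all hypothetical and bear no relation to real eBPF bytecode or the Linux verifier. It simply shows a checker that rejects out-of-bounds memory access and backward jumps before anything executes.

```python
from typing import List, Tuple

# Toy instruction: (opcode, operand). Hypothetical, not real eBPF bytecode.
Instr = Tuple[str, int]

MEM_SIZE = 8  # size of the sandbox's scratch memory in this toy model


def verify(program: List[Instr]) -> List[str]:
    """Return the reasons to reject a program (an empty list means accepted)."""
    errors = []
    for pc, (op, arg) in enumerate(program):
        if op in ("load", "store"):
            # Memory access must stay inside the sandbox.
            if not 0 <= arg < MEM_SIZE:
                errors.append(f"{pc}: memory index {arg} out of bounds")
        elif op == "jump":
            # Only forward jumps are allowed, so every program terminates.
            if not pc < arg <= len(program):
                errors.append(f"{pc}: jump target {arg} not allowed")
        elif op not in ("add", "exit"):
            errors.append(f"{pc}: unknown opcode {op!r}")
    return errors


def run(program: List[Instr]) -> int:
    """Execute a program, but only after the verifier has accepted it."""
    problems = verify(program)
    if problems:
        raise ValueError("rejected: " + "; ".join(problems))
    mem = [0] * MEM_SIZE
    acc, pc = 0, 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "load":
            acc = mem[arg]
        elif op == "store":
            mem[arg] = acc
        elif op == "add":
            acc += arg
        elif op == "jump":
            pc = arg
            continue
        elif op == "exit":
            break
        pc += 1
    return acc
```

A program such as `[("load", 20), ("jump", 0)]` is refused up front rather than failing mid-run. As Wilkie notes below, the guarantee is only as good as the checker itself.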

Work to implement the technology for Windows is underway.

“So eBPF could be the solution,” Wilkie continued, “but it has also been a historical cause of these problems. I mean, fundamentally, injecting code into running kernels is a risky activity. That was the problem that CrowdStrike had. And you can still have bugs in eBPF; the security guarantees provided by the eBPF runtime and the eBPF verifier are not perfect.

“The concept of eBPF is good, but the implementation – like all implementations – has flaws. Could you use eBPF to detect something like the CrowdStrike incident? Yes. Probably. But honestly, you could also detect it by just doing better testing, and that would be my advice. Better hygiene in software development. And that's the lesson CrowdStrike has already learned.”

CrowdStrike CEO George Kurtz said earlier this month at Goldman Sachs' Communacopia and Technology Conference that the July disaster was caused by a freak incident.

“In this particular case, we had a configuration change where there is no code, just a configuration that the sensor uses. And we went through a validation process and validated all of them. They actually worked. The problem is that we had 21 of them and the sensor understood 20. And that's the simple explanation for what happened.

“So what have we changed process-wise? We now run the configuration changes not only through validation, but also through all of our various code QA processes and then deploy them in stages, giving customers the choice of how they want to deliver the content.”
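The two safeguards Kurtz describes, checking that content matches what the consumer understands and deploying in stages, can be sketched roughly as follows. This is an illustration under stated assumptions, not CrowdStrike's actual process: `EXPECTED_FIELDS`, `validate`, `staged_rollout`, the ring layout, and the `deploy` callback are all hypothetical names.

```python
from typing import Callable, List, Sequence

EXPECTED_FIELDS = 20  # what the hypothetical consumer was built to understand


def validate(fields: Sequence[str], expected: int = EXPECTED_FIELDS) -> None:
    """Reject content whose field count differs from what the consumer expects."""
    if len(fields) != expected:
        raise ValueError(
            f"got {len(fields)} fields, consumer understands {expected}"
        )


def staged_rollout(
    fields: Sequence[str],
    rings: List[List[str]],
    deploy: Callable[[str, Sequence[str]], bool],
) -> List[str]:
    """Deploy ring by ring and stop after the first ring that reports a failure.

    Returns the hosts that received the update.
    """
    validate(fields)  # refuse to ship anything that fails the count check
    updated: List[str] = []
    for ring in rings:
        results = [deploy(host, fields) for host in ring]
        updated.extend(ring)
        if not all(results):
            break  # later rings never see the faulty content
    return updated
```

With this shape, a 21-field update is rejected before it ships, and a failure on a canary host stops the rollout before it reaches the wider fleet.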

Ahead of this week's ObservabilityCON in New York City, where Grafana Labs will announce improvements to its Explore apps and Adaptive features, Wilkie also shared his thoughts on two other hot topics: moving back from the cloud and funding open source development.

Having users working in the cloud is central to Grafana. Wilkie says the company continues to see growing use of its cloud – both in terms of number of users and revenue – but is repatriation happening? “I would agree with that sentiment,” he concedes.

“It feels like there's been a shift in the market in the last year or two, since the end of zero interest rates. People are looking at cloud economics more critically and realizing that many SaaS and infrastructure-as-a-service offerings simply don't add up from a cost perspective.”

Cloud giant AWS recently warned in a letter to the UK Competition and Markets Authority that it is facing strong competition from the very on-premises infrastructure that it itself had dismissed as outdated not so many years ago.

Wilkie says Grafana Labs' solution is to make its cloud more attractive. While there is an on-premises version, features like adaptive metrics and logs are only available in the cloud. Wilkie says that for many applications, customers find it more cost-effective to use Grafana Labs' cloud than to try to build their own, though he would say that, we suspect.

This brings us to the question of how Grafana Labs remains a viable company and how it decides which services to make open source and which remain proprietary.


Wilkie explains: “We call it the 'sniff test'. If a feature can be used generally by a very large group of people, we make it open source; if it is only interesting to a small group of companies or large organizations, we consider keeping it as a commercial differentiation.”

He gives an example: “Grafana has over 200 data sources, so you can connect Grafana to virtually anything, and about 170 of them are open source. Thirty of them are commercial integrations that we sell as part of Grafana Enterprise.

“A good example of a commercial integration would be Datadog. One of our most popular enterprise data sources is our Datadog source. If you pay Datadog to store your metrics and want to visualize it in Grafana, you might as well pay us some money for that! That seems like a fair exchange of value.”

Wilkie also cites Grafana's open source projects. A customer can use them to develop solutions but, as Kelsey Hightower mentioned to El Reg, Grafana would love to sell you a managed service that requires a credit card to get started in minutes. ®