This is Episode 2B of The Observability Cost Crisis. This episode continues directly from Episode 2A.


There is no free lunch

The Whiteboard

Asep stood up.

He already knew where this conversation was going to end. Twenty-three years in infrastructure had a way of collapsing the distance between a proposal and its consequences. He could see the shape of Fajar’s idea — the elegance of it, even — and he could also see exactly where it would fall apart.

But he had learned something over those twenty-three years: telling a smart engineer they are wrong rarely works. What works is walking them, step by step, to the place where they can see it themselves. So he uncapped the marker and turned to the whiteboard — not to argue, but to show.

“Let’s say we migrate to ELK,” he said. “Walk me through what that actually means.”

Fajar didn’t hesitate. He laid it out with the precision of someone presenting a case they believed in. Elasticsearch data nodes to replace the indexer clusters. Dedicated master nodes to replace the cluster managers. Kibana with cross-cluster search to replace the search head clusters. A Logstash cluster to replace the heavy forwarders.

Asep wrote each component on the whiteboard as Fajar spoke. The list grew longer than the whiteboard had comfortable room for.

He stepped back and looked at it.

“This is either a migration plan or a resignation letter,” he said. “I haven’t decided yet.”

Dita laughed. Dito looked up from his laptop with genuine concern. “Do you need HR’s number, Pak?”

“I’m fine, Dito.”

“I have it saved, just in case—”

“I’m fine.”


Migration Plan

The Part Everyone Forgets

“What about the endpoints?” Asep asked.

“Elastic Agent,” Fajar said. “Replaces the Universal Forwarders.”

“Eight thousand of them.”

A beat. “Yes.”

Asep wrote “8,000 endpoint rollout” on the whiteboard. Then he circled it slowly, twice, without comment. The circle did the work a sentence would have overcomplicated.

Dita was flipping back through her notebook. “The K8s clusters,” she said. “Last week, when you flagged the ingestion numbers from cloud that didn’t make sense. How is the telemetry collected from there right now?”

“OTel Collector as a DaemonSet on each node,” Fajar said. “Logs and metrics go from the DaemonSet to the Heavy Forwarders in cloud, then up to the cloud indexer cluster. That path stays entirely in cloud.”

“And on-premises K8s?”

“Same pattern — DaemonSet to on-prem Heavy Forwarders, then to the on-prem indexer cluster. Traces from both environments are handled separately. They go to a dedicated OTel Gateway, a VM sitting on-premises, which forwards everything to the on-prem indexers.”

“So the two indexer clusters — cloud and on-prem — how do the search heads reach both of them?”

“Direct Connect,” Fajar said. “The search head clusters are hybrid. They query both indexer clusters through Direct Connect. From the user’s perspective it’s one search. Under the hood it’s two clusters.”

Dita nodded, writing. Then she looked up.

“The K8s DaemonSet in cloud — what log levels is it collecting from the containers?”

Fajar paused for just a moment. “All of them. Everything written to container stdout. It’s the default configuration. We haven’t changed it since the cluster was onboarded.”

“So debug logs too.”

“Yes.”

“And filtering at the Heavy Forwarder level?”

A slightly longer pause this time. “Not currently configured.”

Dita wrote something in her notebook and said nothing more. Asep noticed. He suspected everyone in the room had noticed, but no one followed the thread. Not yet. There was a larger point still being made at the whiteboard, and some threads were better pulled at the right moment than all at once.
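
For readers who want Dita’s question made concrete, here is a minimal sketch of what level-based filtering at a forwarding tier amounts to. It is plain Python, not Splunk or OpenTelemetry configuration. The record shape, the `should_forward` and `filter_batch` helpers, and the INFO threshold are all assumptions made for illustration, not Kartana’s actual setup.

```python
# Illustrative only: a toy forwarding stage that drops records below a
# minimum severity. The record shape and threshold are assumptions.
SEVERITY = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def should_forward(record: dict, min_level: str = "INFO") -> bool:
    """Return True if the record's level is at or above min_level.

    Records with a missing or unknown level are forwarded, so the filter
    never silently drops something it does not understand.
    """
    level = str(record.get("level", "")).upper()
    if level not in SEVERITY:
        return True
    return SEVERITY[level] >= SEVERITY[min_level]


def filter_batch(records: list[dict], min_level: str = "INFO") -> list[dict]:
    """Keep only the records that pass the severity threshold."""
    return [r for r in records if should_forward(r, min_level)]


if __name__ == "__main__":
    batch = [
        {"level": "debug", "msg": "cache lookup key=abc"},
        {"level": "info", "msg": "request served"},
        {"level": "error", "msg": "upstream timeout"},
        {"msg": "unstructured line with no level"},
    ]
    kept = filter_batch(batch)
    print(f"forwarded {len(kept)} of {len(batch)} records")
```

The point of the sketch is where the decision is made: every record dropped here never reaches the indexers, so it is never counted as ingested. With no filter configured, every debug line written to stdout travels the whole path.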


Schema on Sleep

Schema on Write

“There’s one more thing,” Fajar continued. “It’s not a small thing.”

He pulled up a simple diagram on his laptop and turned it toward the room.

“Splunk is schema on read. You send it raw data — logs, events, anything — and the structure, the field extraction, the parsing, all of that happens at query time. The platform figures out what things mean when you ask a question, not when the data arrives.”

He paused.

“ELK is schema on write. The structure has to be defined before the data arrives. Index templates, field mappings, data types: all of it has to be designed upfront. If the data doesn’t match the schema, it gets rejected, gets misindexed, or lands in a catch-all field that becomes very difficult to query later.”

This was, objectively, the most technically dense thing Fajar had said all afternoon. The room absorbed it at different speeds.

Dita was writing steadily. Asep was nodding slowly. Dito, who had been following the conversation with the focused determination of a man fighting a losing battle against the laws of physics, had reached his threshold. The explanation had been long, the room was slightly warm, and schema mapping was nobody’s idea of high drama. His eyelids negotiated briefly with his will, and lost.

His chin dipped.

Fajar continued, not noticing. “Everything we currently have — years of existing data, field extractions, dashboards built on specific field names — none of it maps cleanly to an ELK schema. We would need to redesign the data model from scratch. For every source. Before we can ingest a single event.”

Dito’s head dropped further.

Then snapped back up.

He straightened in his chair with the composure of someone who had absolutely not just been asleep. The room was looking at him.

“Schema on write means the structure has to be defined before data arrives,” Dito said, with complete calm. “So our existing field extractions don’t carry over. We’d need to redesign the data model for every source before migration can even begin.” He paused. “Which is probably why the eight-month estimate is optimistic.”

Silence.

“Yes,” Fajar said carefully. “That’s exactly right.”

Dito nodded once, as if this had been obvious all along, and looked back at his laptop.

Asep added “data model redesign, all sources” to the whiteboard list. Then he set down the marker and turned to face the room.


The four questions

Everything Still Has a Process

“ELK is a legitimate platform,” Asep said. “Fajar is right about what it can do. At the right scale, with the right team, with proper runway — it is a valid architecture. I want to be clear about that.”

He looked at the whiteboard — the component list, the circled eight thousand, the schema note, the migration timeline.

“That is not our situation today. We have an active cost incident. The director wants an answer in two weeks. A migration of this scale goes through architecture review, procurement, security sign-off, parallel running, cutover planning, and an endpoint rollout across eight thousand machines. In a company our size, that is not a two-week process. That is a six-month project on a good day.” He paused. “You cannot teach a cat to swim just because AI exists now. Everything still has a process.”

Dito raised his hand. “Could Ardi—”

“Dito.”

The hand went down.

“So we fix what we have,” Fajar said. It wasn’t quite a question.

“We fix what we have,” Asep confirmed. “Not because ELK is wrong. Because the problem isn’t the platform.”

The room was quiet for a moment.

“So what is the problem?” Dita asked.

Asep picked up the marker one more time and wrote four questions at the bottom of the whiteboard.

What are we paying for?

What are we actually using?

What would it cost to do this differently?

Do we even know what we have?

He capped the marker. “That last one. That’s where we start.”


What The Comparison Actually Looks Like

The ELK conversation happens in almost every engineering team at some point. The platform costs too much, someone proposes the open-source alternative, and the discussion moves to platforms instead of the underlying problem.

It is worth being honest about what the comparison actually looks like.

|                      | Paid Platform (e.g. Splunk) | Self-Managed Open Source (e.g. ELK) | Cloud-Native Managed (e.g. Datadog)       |
| Licence cost         | High, but predictable       | None                                | Starts low, scales with usage             |
| Infrastructure cost  | Low–Medium                  | Medium–High                         | Low (vendor-managed)                      |
| Engineer time        | Low–Medium                  | High                                | Low                                       |
| Schema model         | On read, flexible           | On write, structured upfront        | Varies                                    |
| Migration complexity |                             | High                                | Medium                                    |
| Cost predictability  | Good                        | Medium                              | Can spike at scale                        |
| Hidden costs         | License tier overages       | Ops overhead, expertise dependency  | Data transfer, cardinality, retention fees |
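
The schema-model row is the distinction Fajar described in the meeting, and it is easier to feel in code than in a table cell. What follows is a toy sketch, not how Splunk or Elasticsearch is implemented: the sample log lines, the `MAPPING` dictionary, and all three function names are hypothetical. Real systems are more nuanced (Elasticsearch, for instance, can infer mappings dynamically), but the place where the parsing work happens is the part that matters here.

```python
import re

# Hypothetical raw log lines; the second one has a non-numeric status.
RAW = [
    "2024-01-01T00:00:00Z status=200 path=/login",
    "2024-01-01T00:00:01Z status=timeout path=/pay",
]

# Hypothetical mapping: field name -> type the field must be coerced to.
MAPPING = {"status": int, "path": str}


def parse_fields(line: str) -> dict:
    """Extract key=value pairs from a raw line."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def ingest_schema_on_write(lines, mapping):
    """Coerce every mapped field at ingest time; reject lines that do not fit."""
    accepted, rejected = [], []
    for line in lines:
        fields = parse_fields(line)
        try:
            doc = {name: cast(fields[name]) for name, cast in mapping.items()}
        except (KeyError, ValueError):
            rejected.append(line)
            continue
        accepted.append(doc)
    return accepted, rejected


def search_schema_on_read(lines, field, value):
    """Keep raw lines untouched; extract fields only when a query runs."""
    return [line for line in lines if parse_fields(line).get(field) == value]


if __name__ == "__main__":
    accepted, rejected = ingest_schema_on_write(RAW, MAPPING)
    print("schema on write accepted:", accepted)
    print("schema on write rejected:", rejected)
    print("schema on read hits:", search_schema_on_read(RAW, "status", "timeout"))
```

In the schema-on-write path, the line with `status=timeout` fails the integer mapping and never becomes a searchable document. In the schema-on-read path, the same line is stored as-is and can still be found, at the cost of parsing on every query. Years of dashboards built on query-time extractions are exactly what does not carry over.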

A note on cloud-native platforms: they are often presented as the modern cost-efficient alternative, and at lower volumes they can be. But usage-based pricing means costs scale directly with your data. For organizations already dealing with high ingestion volumes, cloud-native managed services can end up more expensive than a Splunk license once you factor in data transfer costs, cardinality-based metric pricing, and retention fees for longer data windows. The invoice feels different because it arrives in smaller, more frequent increments. The total often isn’t.
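
The arithmetic behind that claim is simple enough to sketch. Every number below is a placeholder invented for illustration, not a real vendor price, and the model deliberately ignores data transfer, cardinality, and retention charges, all of which would only push the usage-based figure higher. The function names are hypothetical.

```python
def usage_based_cost(gb_per_day: float, price_per_gb: float, days: int = 365) -> float:
    """Annual cost when the bill scales linearly with ingested volume."""
    return gb_per_day * price_per_gb * days


def break_even_gb_per_day(annual_fee: float, price_per_gb: float, days: int = 365) -> float:
    """Daily volume at which usage-based pricing equals a flat annual fee."""
    return annual_fee / (price_per_gb * days)


if __name__ == "__main__":
    # Placeholder figures only: a flat fee of 365,000 per year and a
    # usage price of 1.00 per GB ingested.
    annual_fee = 365_000.0
    price_per_gb = 1.0
    threshold = break_even_gb_per_day(annual_fee, price_per_gb)
    print(f"break-even at {threshold:.0f} GB/day")
    for volume in (250, 1000, 4000):
        cost = usage_based_cost(volume, price_per_gb)
        print(f"{volume} GB/day -> {cost:,.0f} per year (flat fee {annual_fee:,.0f})")
```

Below the break-even volume the usage-based bill is smaller; above it, the same linear pricing overtakes the flat fee. The useful exercise is to plug in your own contract numbers and see which side of that line your ingestion already sits on.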

There is no universally correct answer. The right platform depends on your team’s expertise, your data volumes, your risk tolerance, and how much engineering capacity you are willing to spend running the observability system versus actually using it.

What Kartana had was a platform with rising costs and uncertain value. The question was not which platform to use. The question was whether they were getting value from what they had — and if not, why not.

That was a different problem. And before they could solve it, they needed to understand something more fundamental: not all of what they were collecting was worth collecting in the first place.


In Episode 3 — “Not All Logs Are Created Equal”, Asep’s team starts looking inside the data — and discovers that the most important decision in observability isn’t which platform you use. It’s knowing the difference between the log that saves you at 2am and the log that has been quietly arriving every five minutes for three years and has never once been searched.