Why Infra Companies Are Racing Into Observability and What It Means for 2026

The article examines how SRE and infrastructure teams are converging, why major infra vendors are acquiring observability assets, the rising cost pressures, and how OpenTelemetry combined with Apache Iceberg forms a new standard stack that AI‑driven incident response will rely on in the coming years.

DevOps Coach

Background: Separate SRE and Infra Tracks

Historically, Site Reliability Engineering (SRE) teams handled on‑call duties, incident response, and day‑to‑day troubleshooting, while Infra teams built databases, storage, pipelines, and platforms. The two groups worked with different tools, different budgets, and different discussion topics.

Convergence and Market Moves

By 2026 the separation is weakening. Infra companies are moving directly into the observability space, not by offering better dashboards but by adopting a full‑stack strategy: first control the telemetry data plane, then sell agents that sit on top of it. Recent acquisitions illustrate this trend:

Chronosphere – acquired by Palo Alto Networks.

HyperDX – acquired by ClickHouse.

Observe – acquisition announced by Snowflake.

These deals show that the market is shifting from a pure tool marketplace to a battle for the entry point of observability data.

Observability Costs and Business Drivers

Observability is one of the few infra spend categories that can be justified without heavy marketing, because incidents carry real monetary impact, slow recovery is expensive, and on‑call fatigue wears teams down. Pricing, however, is unpredictable, which leads some teams to hire dedicated cost‑control roles. Companies like Hex openly discuss these challenges in their hiring posts.

AI’s Role in Incident Management

AI adoption in enterprises remains slower than the hype suggests because procurement and security reviews are lengthy. The on‑call domain is an exception: there, AI can reduce mean time to recovery (MTTR), automate triage, and handle remediation steps, which creates clear value for companies such as Resolve AI that target incident response pipelines.

Open Standards: OpenTelemetry + Apache Iceberg

At the ingestion layer, OpenTelemetry is the default choice because it decouples instrumentation from back‑ends, reducing vendor lock‑in. The next critical question is the storage and data‑plane behind it. The emerging “standard stack” combines OpenTelemetry with Apache Iceberg, providing an open data layer that mitigates lock‑in risks.
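
The decoupling idea can be illustrated with a toy sketch in plain Python. This is not the OpenTelemetry SDK: the `Span`, `Exporter`, `InMemoryExporter`, `LineExporter`, and `handle_request` names are hypothetical and exist only to show that application code written against a neutral interface does not change when the back‑end does.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Span:
    """A toy span record; NOT the OpenTelemetry data model."""
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


class Exporter(Protocol):
    """Hypothetical neutral interface that instrumentation code depends on."""

    def export(self, spans: list[Span]) -> None: ...


class InMemoryExporter:
    """Stand-in for one back-end: keeps the raw span objects."""

    def __init__(self) -> None:
        self.received: list[Span] = []

    def export(self, spans: list[Span]) -> None:
        self.received.extend(spans)


class LineExporter:
    """Stand-in for a second back-end: stores formatted text lines."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def export(self, spans: list[Span]) -> None:
        for s in spans:
            self.lines.append(f"{s.name} {s.duration_ms:.1f}ms")


def handle_request(exporter: Exporter) -> None:
    """Application code: emits a span without knowing which back-end receives it."""
    exporter.export([Span("GET /checkout", 12.5, {"http.status_code": 200})])


# Swapping the back-end requires no change to handle_request.
mem = InMemoryExporter()
handle_request(mem)
line = LineExporter()
handle_request(line)
```

In a real deployment the neutral interface is provided by OpenTelemetry itself, so the instrumentation survives a change of vendor.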

Iceberg’s advantages include support for semi‑structured data (via the variant type in v3) and an open table format that keeps data portable across compute engines.

Challenges of Building a Telemetry Pipeline on Iceberg

While Iceberg works well for long‑term storage on object stores, the write path for observability workloads faces four major failure modes:

Lack of true local state – streaming aggregations require intermediate state that must be shuttled through object storage, creating bottlenecks at hundreds of terabytes per day.

High cost and OOM risk of stateful processing – complex aggregations like trace sessionization consume significant memory.

Iceberg write quality – small files, commit frequency, and compaction strategy directly affect downstream query performance.

Need for custom aggregation logic – real observability requires user‑defined aggregation functions beyond simple sums or counts.

Without careful engineering, write paths can explode in cost (e.g., scanning terabytes of data in a cloud warehouse) and destabilize the system.
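
To make the state problem concrete, here is a minimal in‑memory sketch of trace sessionization with a user‑defined aggregate. It is an illustration under stated assumptions, not how any particular engine implements it: the `SpanEvent` and `TraceSessionizer` names, the 30‑second inactivity gap, and the nearest‑rank `p95` function are all hypothetical choices. The point is that every open trace holds its spans in memory until it closes, so state grows with traffic and with the length of the gap.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpanEvent:
    trace_id: str
    end_ts: float       # event time, in seconds
    duration_ms: float


def p95(values):
    """User-defined aggregate: nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


class TraceSessionizer:
    """Buffers spans per trace and closes a trace after gap_s of inactivity."""

    def __init__(self, gap_s=30.0):
        self.gap_s = gap_s
        self.open_traces = defaultdict(list)  # trace_id -> durations (the state)
        self.last_seen = {}                   # trace_id -> latest end_ts seen

    def add(self, ev):
        self.open_traces[ev.trace_id].append(ev.duration_ms)
        previous = self.last_seen.get(ev.trace_id, ev.end_ts)
        self.last_seen[ev.trace_id] = max(previous, ev.end_ts)

    def close_idle(self, now):
        """Emit one summary row per idle trace and free its state."""
        idle = [t for t, ts in self.last_seen.items() if now - ts >= self.gap_s]
        rows = []
        for trace_id in idle:
            durations = self.open_traces.pop(trace_id)
            del self.last_seen[trace_id]
            rows.append({
                "trace_id": trace_id,
                "spans": len(durations),
                "p95_ms": p95(durations),
            })
        return rows

    def state_size(self):
        """Number of span durations currently held in memory."""
        return sum(len(v) for v in self.open_traces.values())
```

Unlike a sum or a count, a percentile cannot be computed from a single running number, which is why this sketch has to keep every duration for a trace until the trace closes.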

Solutions and Emerging Players

RisingWave is designed for native state and streaming aggregation, avoiding the “upload‑download” penalty of moving state through storage. It can batch and shape writes to keep Iceberg tables healthy and supports user‑defined aggregation functions, reducing glue code.
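
The idea of batching and shaping writes can be sketched generically. The code below is not RisingWave's implementation or API: `WriteBatcher`, `files_per_day`, and all thresholds are hypothetical, and a "file" is modeled as a `(row_count, byte_size)` tuple rather than a real Iceberg data file. It shows a buffer that flushes when it reaches a target size or a maximum age, plus a rough upper bound on how many files a given commit interval produces.

```python
class WriteBatcher:
    """Buffers rows and flushes one "file" when a size or age threshold is hit."""

    def __init__(self, target_bytes, max_age_s):
        self.target_bytes = target_bytes
        self.max_age_s = max_age_s
        self.rows = []
        self.buffered_bytes = 0
        self.first_ts = None   # arrival time of the oldest buffered row
        self.files = []        # one (row_count, byte_size) tuple per flushed file

    def add(self, row_bytes, now):
        if self.first_ts is None:
            self.first_ts = now
        self.rows.append(row_bytes)
        self.buffered_bytes += row_bytes
        self.maybe_flush(now)

    def maybe_flush(self, now):
        """Flush if the buffer is large enough or its oldest row is too old."""
        if self.first_ts is None:
            return
        too_big = self.buffered_bytes >= self.target_bytes
        too_old = now - self.first_ts >= self.max_age_s
        if too_big or too_old:
            self.files.append((len(self.rows), self.buffered_bytes))
            self.rows, self.buffered_bytes, self.first_ts = [], 0, None


def files_per_day(writers, commit_interval_s):
    """Upper bound on files per day if every writer emits one file per commit."""
    return writers * (86_400 // commit_interval_s)
```

Under these assumptions, 64 parallel writers committing every 10 seconds produce up to 552,960 files per day, while committing every 5 minutes produces up to 18,432. The age threshold bounds data freshness; the size threshold keeps files from becoming too small to query efficiently.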

Future Market Drivers (2026)

The decisive factors will be:

Control of the data plane that enables agents to safely read telemetry at scale, supporting investigation, replay, correlation, and automated action.

Delivery of a stable, production‑grade standard stack where OpenTelemetry handles data ingestion and Iceberg provides an open, portable storage layer, while specialized systems like RisingWave ensure a performant write path.

Thus, 2026 is poised to be the year infrastructure platforms battle for the AI‑enabled observability entry point, turning observability from a niche UI competition into a core AI data pipeline.

Tags: SRE, Apache Iceberg, AI incident response