Two decades on radio networks leave you with a particular skepticism toward device behavior, vendor benchmarks, and dashboards that stay green while customers notice something is wrong. The perspective is always from the network floor up. Posts cover anomaly detection, 5G SA/NSA, automation, and building observability into live national networks. The signal was always there. Getting to the insight is the harder part.
Why Real-Time Analytics Changed How We Troubleshoot Networks
Analytics Infrastructure · ML · Snowflake / Databricks · 8 min read
Networks don't fail slowly anymore. Modern applications generate short, bursty sessions. Devices attach and detach constantly.
Features interact in ways that never appear in static KPIs. By the time a traditional batch report is generated, the window
to act has already closed — and the conditions that caused the problem have shifted.
The core problem was not lack of metrics. It was latency in insight. Data existed. It arrived too late to be useful.
24h+ · batch report cycle · traditional OSS reporting cadence
15min · counter granularity · finest resolution in most OSS exports
<2min · operational refresh target · real-time pipeline cadence achieved
Why batch reporting failed modern networks
Legacy reporting was built for a different failure mode — one where degradations developed over hours and fixes were applied in daily cycles.
The assumption was that a problem visible yesterday was still the same problem today. That assumption stopped being reliable once network behavior became more dynamic.
Analytics as infrastructure: the architecture shift
Manual data extraction and ad-hoc analysis could not keep pace with live network behavior.
The shift required treating analytics as infrastructure — with the same reliability, latency, and schema consistency expected from any production system.
Fig 1 — Analytics pipeline: from network counters to operational insight
The pipeline components were not novel individually. The value was in how they were connected: consistent vendor-agnostic schemas at ingestion,
versioned baselines in Snowflake enabling before/after comparison at query time, and Databricks processing running continuously rather than on demand.
Dashboards refreshed at operational cadence. Engineers watched the network evolve rather than reviewing what had already happened.
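To make the ingestion side concrete, here is a minimal sketch of what vendor-agnostic normalization can look like in Python, assuming hourly counter exports arrive as DataFrames. The vendor column names, the unified schema, and the mapping table are illustrative placeholders, not the production schema.

import pandas as pd

# Hypothetical per-vendor counter exports mapped into one unified schema.
# Real deployments carry hundreds of counters; two are enough to show the shape.
VENDOR_SCHEMAS = {
    "vendor_a": {"cellName": "cell_id", "dlThput": "dl_throughput_mbps", "ts": "timestamp"},
    "vendor_b": {"CELL_ID": "cell_id", "DL_THP_MB": "dl_throughput_mbps", "PERIOD_START": "timestamp"},
}

def normalize(raw: pd.DataFrame, vendor: str) -> pd.DataFrame:
    # Rename vendor-specific columns and coerce types so everything downstream,
    # including the versioned baselines, is written once against one schema.
    mapping = VENDOR_SCHEMAS[vendor]
    df = raw.rename(columns=mapping)[list(mapping.values())]
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    df["dl_throughput_mbps"] = pd.to_numeric(df["dl_throughput_mbps"], errors="coerce")
    df["source_vendor"] = vendor  # provenance kept for debugging, never for logic
    return df

The design point is the single schema, not the mapping mechanics: once every export looks the same at ingestion, baseline queries never need to know which vendor produced a counter.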
ML-assisted anomaly detection: what it changed
The real inflection point was not faster dashboards.
It was shifting from engineers scanning thousands of counters to models surfacing deviations from expected behavior — and doing it before the deviation became a complaint.
Anomaly detection: model behavior vs manual scanning
Manual scanning:
Engineer reviews dashboard: 200+ KPIs across 40+ markets
Threshold breach triggers review: works for hard failures
Gradual deviation: invisible until it crosses a static threshold
Cross-KPI correlation: manual, hours per incident
ML-assisted detection:
Baseline behavior modeled per cell, per time window, per traffic state
Deviation scored against expected range rather than a static threshold (see the sketch after this list)
Cross-counter correlation embedded in model features
Output: ranked anomaly list with contributing counter signatures
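A minimal sketch of the scoring idea above: model a per-cell, same-hour-of-week baseline and score new samples against it with a robust z-score. The column names and the median/MAD statistic are assumptions standing in for whatever the production models actually used.

import numpy as np
import pandas as pd

def _hour_of_week(df: pd.DataFrame) -> pd.DataFrame:
    # Assumed columns: cell_id, timestamp (hourly, tz-aware), kpi_value.
    out = df.copy()
    out["how"] = out["timestamp"].dt.dayofweek * 24 + out["timestamp"].dt.hour
    return out

def deviation_scores(history: pd.DataFrame, latest: pd.DataFrame) -> pd.DataFrame:
    history, latest = _hour_of_week(history), _hour_of_week(latest)

    # Baseline per cell per hour-of-week; median and MAD stay stable when the
    # history contains the occasional bad interval that would skew mean/std.
    base = (
        history.groupby(["cell_id", "how"])["kpi_value"]
        .agg(median="median", mad=lambda s: (s - s.median()).abs().median())
        .reset_index()
    )

    scored = latest.merge(base, on=["cell_id", "how"], how="left")
    # 1.4826 * MAD approximates one standard deviation under normality;
    # the epsilon keeps perfectly flat counters from dividing by zero.
    scored["z"] = (scored["kpi_value"] - scored["median"]) / (1.4826 * scored["mad"] + 1e-9)
    # Ranked output: engineers see the largest deviations first.
    return scored.sort_values("z", key=np.abs, ascending=False)

The output deliberately stops at a ranked list. It answers "where should an engineer look right now", which is exactly the division of labor described below.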
Example: throughput collapse under mixed NSA/SA traffic
Static KPI: throughput above minimum threshold — no flag
ML model: throughput declining 18% vs same-hour baseline, correlated with an increase in NR SCell addition failure rate and a specific anchor cell load pattern
Flag raised: 14 minutes post-onset
Root cause: anchor parameter interaction post-feature activation
Resolution: before user complaints generated
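One way the cross-counter flag in this example can be expressed, assuming robust z-scores like the ones above are already computed per counter. The counter names and thresholds here are illustrative, not the production rule.

def correlated_throughput_flag(z_throughput: float,
                               z_scell_add_failure: float,
                               z_anchor_load: float,
                               z_crit: float = 3.0) -> bool:
    # A moderate throughput decline that coincides with a clear SCell-addition
    # failure spike on a loaded anchor cell is a stronger signal than any
    # single counter crossing a static threshold on its own.
    return (z_throughput < -z_crit / 2
            and z_scell_add_failure > z_crit
            and z_anchor_load > z_crit / 2)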
What the models were and were not used for
ML application · What it did · What humans retained
Anomaly scoring · Ranked cells and markets by deviation magnitude and correlation strength · Root cause investigation, contextual judgment on whether to act
Regression detection · Flagged post-change KPI trajectories diverging from change-type baseline (sketched below) · Decision to roll back or monitor, based on context the model didn't have
Pattern clustering · Grouped similar failure signatures across markets for systematic review · Determining whether clusters shared a root cause or were coincidental
Capacity projection · Trended utilization per carrier against historical load growth patterns · Prioritization of capacity actions against business and rollout context
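A minimal sketch of the regression-detection row above, assuming post-change KPI samples are aligned hour-by-hour with historical trajectories for the same change type. The divergence statistic and thresholds are illustrative assumptions.

import numpy as np

def regression_flag(post_change: np.ndarray,        # shape (n_hours,): KPI after the change
                    same_change_type: np.ndarray,   # shape (n_cells, n_hours): historical trajectories
                    z_crit: float = 3.0,
                    persist_frac: float = 0.5) -> bool:
    # Baseline trajectory for this change type, hour by hour since activation.
    mean = same_change_type.mean(axis=0)
    std = same_change_type.std(axis=0) + 1e-9
    z = (post_change - mean) / std
    # Flag only persistent divergence: a single noisy hour should not page anyone.
    return bool((np.abs(z) > z_crit).mean() > persist_frac)

The model flags; whether to roll back or keep monitoring stays with the engineer, for exactly the reason the table gives.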
Engineering judgment was not replaced. It was focused.
The model's role was to answer "where should an engineer look right now" — not "what should be done."
That distinction mattered in practice: models that tried to make the second decision were less trusted and less used than those that answered the first one well.
Fig 2 — Insight latency: batch vs real-time pipeline
Analytics became infrastructure when the same reliability expectations applied to data pipelines as to the network itself. Once insight latency dropped below the window of useful action, troubleshooting changed from firefighting to anticipation.
That is the only sustainable operating model for a network at national scale.
The tooling described here — Snowflake for versioned baselines,
Databricks and Python for continuous processing, ML models
for anomaly scoring — was not valuable
because of the names on the stack. It was valuable because it eliminated the gap
between when a problem existed and when an engineer could act on it.
That discipline of making network state continuously observable shaped
every performance platform, automation program, and analytics framework built afterward.
Analytics Infrastructure · Snowflake · Databricks · ML ·
RAN Optimization · 5G · Performance Engineering · Telecommunications