Two decades on radio networks leave you with a particular skepticism: toward device behavior, toward vendor benchmarks, toward dashboards that stay green while customers notice something is wrong. The perspective here is always from the network floor up. Posts cover anomaly detection, 5G SA/NSA, automation, and building observability into live national networks. The signal was always there. Getting to the insight is the harder part.
In earlier generations, a healthy KPI dashboard usually meant a healthy network. By 2021, that assumption quietly stopped being true.
As 5G NSA deployments scaled and traffic patterns shifted, networks that were technically compliant became operationally fragile. KPIs stayed green. Users experienced delays, retries, and intermittent failures that were difficult to reproduce in controlled tests. The problem was not missing counters. It was how existing counters were being interpreted.
What the old KPI model was designed for
Most KPIs in operational use were designed to answer single-layer, binary questions. Did the procedure complete? Was the threshold crossed? These questions made sense when network behavior was relatively sequential and device activity was steady.
| KPI | What it measured | What it could not see |
|---|---|---|
| RRC setup success | Procedure completed without failure code | Setup latency variance, repeated setups by the same device in a short window |
| HO success rate | Handover procedure completed | Execution delay on GBR bearers, UE sync time at target, post-HO stall duration |
| Throughput above threshold | Average PRB throughput met target | Burst availability for bursty applications, scheduling gap under competing load |
| Drop rate | Abnormal release rate within target | Silent session timeouts, retries that recovered technically but degraded experience |
Each KPI was technically accurate. Each described a slice of network behavior. None described how the slices connected under real device behavior in 2021.
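A minimal sketch of that gap, assuming a flat log of setup attempts with hypothetical field names (device, ts, ok): the binary KPI and the repeated-setup pattern it cannot express come from the same rows.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical event log: each row is one RRC setup attempt.
events = [
    {"device": "ue-01", "ts": datetime(2021, 5, 4, 9, 0, 0), "ok": True},
    {"device": "ue-01", "ts": datetime(2021, 5, 4, 9, 0, 7), "ok": True},
    {"device": "ue-01", "ts": datetime(2021, 5, 4, 9, 0, 15), "ok": True},
    {"device": "ue-02", "ts": datetime(2021, 5, 4, 9, 1, 0), "ok": True},
]

# Binary KPI: setup success rate. This is the number that stays green.
success_rate = sum(e["ok"] for e in events) / len(events)

# Sequence view: repeated setups by the same device in a short window,
# a pattern the success rate alone cannot express.
WINDOW = timedelta(seconds=30)
by_device = defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    by_device[e["device"]].append(e["ts"])

repeat_offenders = {
    dev: ts_list
    for dev, ts_list in by_device.items()
    if any(b - a < WINDOW for a, b in zip(ts_list, ts_list[1:]))
}

print(f"RRC setup success rate: {success_rate:.1%}")              # 100.0%, green
print(f"Devices with repeated setups: {list(repeat_offenders)}")  # ['ue-01']
```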
What had changed about user behavior
By 2021, device activity in a 5G NSA network looked fundamentally different from the steady sessions these KPIs were designed around. Control plane and user plane were split across layers. Applications were bursty. Power-saving behavior made devices enter and exit connected mode more aggressively than the KPI framework anticipated.
5G NSA device behavior vs single-KPI assumptions:

- Split anchor: control plane stays on the LTE anchor (eNB) while the user plane rides an NR secondary cell (gNB) where available. Result: RRC success on LTE does not confirm NR user-plane availability.
- Bursty applications: short sessions and background sync. Result: the throughput average masks the scheduling gap during the burst window. The UE wakes and requests resources before the scheduler is ready; a 100-200 ms gap is invisible in an hourly average but visible as application latency (worked through in the sketch below).
- Power saving (C-DRX, I-DRX, PSM): a device in a power-save state misses paging. The registration delay on wake is counted as "idle mode", not "failure", so the aggregate paging success rate is unaffected, while the individual device sees delayed reachability and the user perceives a slow response.
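To make the burst case concrete, here is a toy calculation with illustrative numbers (none of them measured values): a 180 ms scheduler gap barely dents a throughput measurement, yet it is the entire first-packet wait.

```python
# Illustrative numbers only: one bursty application flow.
burst_bytes = 500_000      # bytes per burst
transfer_s = 0.40          # time spent actually transferring once scheduled
grant_gap_s = 0.18         # 180 ms scheduler gap at burst start
target_bps = 5_000_000     # hypothetical throughput threshold

# Throughput KPI, measured over the whole burst: the gap costs a few percent.
tput_bps = burst_bytes * 8 / (transfer_s + grant_gap_s)
print(f"burst throughput: {tput_bps / 1e6:.1f} Mbps "
      f"({'met' if tput_bps >= target_bps else 'missed'} the threshold)")

# What the user actually waits for: the gap is 100% of first-packet latency.
print(f"first-packet latency: {grant_gap_s * 1000:.0f} ms")
```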
What sequence thinking exposed
What started working better was analyzing sequences rather than individual procedure outcomes. Looking at what happened before and after a KPI increment exposed the behavior that the increment alone concealed.
Sequence analysis, RRC transition correlated with scheduler behavior:

Single-KPI view:
- RRC setup success rate: 98.4% (green)
- Throughput above threshold: yes (green)
- No action indicated

Sequence view (same cells, same time window):
- RRC setup: success
- Time to first UL grant: 180 ms (target: 40 ms)
- Scheduler: high competing load at time of grant request
- NR SCell addition: delayed 340 ms post-RRC
- Application layer: first packet delayed 520 ms total
- User perception: "slow" despite "successful" connection

Second sequence:
- RRC setup: success
- Device entered C-DRX immediately post-setup (low activity)
- Paging during DRX cycle: missed
- Re-registration: 800 ms
- KPI: paging success rate unaffected (device recovered)
- User: notification delayed, call setup attempt failed silently

Fig 1: Single-KPI view vs sequence view, same network, same time window. The sketch below shows how a sequence view like the first one can be assembled from existing events.
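A minimal correlation sketch, assuming hypothetical event streams and field names (the join logic, not the schema, is the point): pair each successful RRC setup with the first subsequent UL grant for the same device, then flag the gap against the latency target.

```python
from datetime import datetime, timedelta

# Hypothetical per-device event streams; field names are illustrative.
rrc_setups = [
    {"device": "ue-07", "ts": datetime(2021, 5, 4, 9, 0, 0, 0), "result": "success"},
]
ul_grants = [
    {"device": "ue-07", "ts": datetime(2021, 5, 4, 9, 0, 0, 180_000)},  # +180 ms
]

TARGET = timedelta(milliseconds=40)

def first_grant_after(setup, grants):
    """First UL grant for the same device after RRC setup completes."""
    candidates = [g["ts"] for g in grants
                  if g["device"] == setup["device"] and g["ts"] >= setup["ts"]]
    return min(candidates) if candidates else None

for setup in rrc_setups:
    grant_ts = first_grant_after(setup, ul_grants)
    if setup["result"] == "success" and grant_ts is not None:
        gap = grant_ts - setup["ts"]
        # The KPI counts this row as a clean success; the sequence view
        # flags it when the grant gap exceeds the latency target.
        flag = "DEGRADED" if gap > TARGET else "ok"
        print(f"{setup['device']}: setup ok, first UL grant after "
              f"{gap.total_seconds() * 1000:.0f} ms -> {flag}")
```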
What changed in optimization decisions
Sequence thinking changed what parameters were tuned and what outcomes were targeted. Instead of tuning one parameter to improve a binary KPI, the focus shifted to stability across sequences — even if it meant slightly worse headline numbers in some cases.
Example: C-DRX cycle tuning

Before sequence analysis:
- C-DRX long cycle: 320 ms
- Power-saving counter: excellent, device battery favorable
- RRC setup success: unaffected
- Decision: leave as-is

After sequence analysis:
- C-DRX 320 ms cycle: paging miss rate 4.2% for latency-sensitive apps
- First-packet delay for apps waking the device: avg 680 ms
- Neither metric in the standard KPI set
- Decision: reduce C-DRX cycle to 160 ms for mixed-traffic sectors (modeled in the sketch below)
- Result: paging miss rate -71%, first-packet delay -38%
- Power-saving KPI: minor regression (acceptable tradeoff, documented)
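The direction of that decision can be sanity-checked with a deliberately simplified C-DRX model, assuming uniform page arrivals and a fixed on-duration. The numbers are illustrative, not measurements, but they show why halving the long cycle roughly halves the expected paging wait.

```python
# Back-of-envelope C-DRX model (assumptions, not field measurements):
# pages arrive uniformly in time, the UE only listens during the on-duration,
# and a page landing in the sleep portion waits for the next on-duration.

def expected_paging_wait_ms(cycle_ms: float, on_duration_ms: float) -> float:
    """Mean wait for a page under a uniform-arrival assumption."""
    sleep_ms = cycle_ms - on_duration_ms
    p_sleep = sleep_ms / cycle_ms
    # A page arriving during sleep waits, on average, half the sleep portion.
    return p_sleep * (sleep_ms / 2)

for cycle in (320, 160):
    wait = expected_paging_wait_ms(cycle, on_duration_ms=10)
    print(f"C-DRX {cycle} ms cycle: expected paging wait ~{wait:.0f} ms")
# 320 ms cycle -> ~150 ms; 160 ms cycle -> ~70 ms
```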
The most valuable insight from that period was not how many KPIs were tracked. It was which ones we stopped trusting blindly. A counter that describes a procedure outcome without describing the sequence it belongs to is not wrong — it is incomplete. Incomplete at the scale of 5G NSA device behavior is operationally indistinguishable from wrong.
The shift from single-KPI monitoring to sequence-aware analysis did not require new counters. It required correlating existing ones differently — RRC transitions with scheduler behavior, mobility events with user-plane stalls, retry patterns with device power states. That correlation discipline became the foundation for how anomaly detection and performance intelligence were built in the years that followed.
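As one more illustration of that correlation discipline, here is a sketch of the third pairing, retry patterns against device power states, again with hypothetical schemas: bucket each retry by the power state the device held when it fired.

```python
from collections import Counter
from datetime import datetime

# Hypothetical inputs: retry events and power-state transitions per device.
retries = [
    {"device": "ue-11", "ts": datetime(2021, 5, 4, 9, 0, 2)},
    {"device": "ue-11", "ts": datetime(2021, 5, 4, 9, 0, 9)},
]
power_states = [  # (device, ts, state) transitions, assumed time-ordered
    ("ue-11", datetime(2021, 5, 4, 9, 0, 0), "C-DRX"),
    ("ue-11", datetime(2021, 5, 4, 9, 0, 5), "connected"),
]

def state_at(device: str, ts: datetime) -> str:
    """Last known power state for the device at time ts."""
    state = "unknown"
    for dev, t, s in power_states:
        if dev == device and t <= ts:
            state = s
    return state

# Bucket retries by the power state the device was in when it retried:
# a skew toward power-save states points at reachability, not radio quality.
buckets = Counter(state_at(r["device"], r["ts"]) for r in retries)
print(buckets)  # Counter({'C-DRX': 1, 'connected': 1})
```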