Two decades on radio networks leave you with a particular skepticism: toward device behavior, toward vendor benchmarks, toward dashboards that stay green while customers notice something is wrong. Perspective is always from the network floor up. Posts cover anomaly detection, 5G SA/NSA, automation, and building observability into live national networks. The signal was always there. Getting to the insight is the harder part.
Sleepy Cells Were Never Idle — We Just Didn't Measure Them Right
LTE · 5G · Cell State Management · 7 min read
LTE networks were no longer failing loudly. They were failing quietly. Coverage maps looked clean. KPIs stayed mostly green. Yet field teams kept reporting pockets where devices behaved as if the network was half awake — slow access, delayed paging, inconsistent attach behavior. These weren't outages.
They were sleepy cells.
Fig 1 — The sleepy cell spectrum: not off, not fully on
Why 2021 traffic made the problem visible
Sleepy cell behavior had existed for years. The traffic mix in 2021 made its impact impossible to ignore. Background signaling from IoT devices, intermittent data sessions, and bursty applications stressed cells that were technically on but operationally misaligned with how devices needed to access them.
IoT device: wakes from PSM, expects immediate RACH grant within 40ms
Sleepy cell: scheduler in low-activity state, grant delayed 300-600ms
KPI view: access eventually succeeds, counted as success
Device view: reporting window missed, upstream data lost
The problem was not RF strength or capacity. It lived in state management and scheduler behavior — specifically, in the gap between how long the network took to become fully responsive and how long devices expected to wait.
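Seeing that gap requires per-access records rather than aggregate counters. Below is a minimal sketch of the split that exposes it, assuming hypothetical per-access records that carry the scheduler state at the moment of the RACH attempt; real data of this kind comes from vendor traces, not from standard KPI exports.

```python
# Minimal sketch: split access latency by scheduler state at RACH time.
# Field names (scheduler_state, latency_ms) are hypothetical; the point is
# that one overall success rate merges two very different populations.
from statistics import quantiles

def latency_profile(accesses):
    """Group access latencies by scheduler state and report p50/p95/p99."""
    by_state = {}
    for a in accesses:
        by_state.setdefault(a["scheduler_state"], []).append(a["latency_ms"])
    profile = {}
    for state, lat in by_state.items():
        p = quantiles(lat, n=100)  # cut points for percentiles 1..99
        profile[state] = {"n": len(lat), "p50": p[49], "p95": p[94], "p99": p[98]}
    return profile

# Illustrative data: two populations that a single 97.8% success rate hides.
accesses = (
    [{"scheduler_state": "active", "latency_ms": 25 + i % 15} for i in range(400)]
    + [{"scheduler_state": "low_activity", "latency_ms": 300 + (i * 7) % 300} for i in range(60)]
)
for state, stats in latency_profile(accesses).items():
    print(state, stats)
```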
Sleepy cell state behavior — measured transitions:
Low-activity threshold trigger:
Cell enters low-activity scheduler mode after N idle TTIs
N set aggressively during energy-efficiency tuning
Recovery to full scheduler responsiveness: 80-180ms
Device access during recovery window:
RACH attempt: accepted
Grant scheduling: deferred until scheduler fully active
Total access latency: 300-600ms (vs 20-40ms target)
Not logged as failure — counts as success
Paging misalignment (see the sketch after this block):
eDRX cycle: 5.12s (network config)
Device reachability window: 10ms within cycle
Cell in low-activity state: paging response handler also throttled
Miss rate: 12-18% of paging attempts during low-activity windows
Aggregate paging success rate: 96.4% — within target
Per-device miss rate during window: 1 in 6 attempts
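The aggregate-versus-per-window distinction is the part that is easy to get wrong, so here is a minimal sketch of the difference, assuming hypothetical paging-attempt records that note whether the page landed inside the device's eDRX reachability window and what state the serving cell was in.

```python
# Minimal sketch: aggregate paging success vs. per-device miss rate inside
# the eDRX reachability window. Record fields are hypothetical.
from collections import defaultdict

def paging_stats(attempts):
    """attempts: dicts with device_id, delivered_in_window (bool),
    cell_state ('active' or 'low_activity')."""
    aggregate_success = sum(a["delivered_in_window"] for a in attempts) / len(attempts)

    # Per-device view, counted only during low-activity windows.
    per_device = defaultdict(lambda: [0, 0])  # device_id -> [misses, attempts]
    for a in attempts:
        if a["cell_state"] == "low_activity":
            per_device[a["device_id"]][1] += 1
            if not a["delivered_in_window"]:
                per_device[a["device_id"]][0] += 1
    worst = max((m / n for m, n in per_device.values() if n), default=0.0)
    return aggregate_success, worst

# A device parked on a low-activity cell can miss 1 in 6 pages while the
# aggregate number still looks comfortably within target.
attempts = (
    [{"device_id": "d1", "delivered_in_window": True, "cell_state": "active"}] * 90
    + [{"device_id": "d2", "delivered_in_window": i % 6 != 0, "cell_state": "low_activity"}
       for i in range(12)]
)
agg, worst = paging_stats(attempts)
print(f"aggregate: {agg:.1%}, worst per-device miss rate in low-activity windows: {worst:.1%}")
```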
The observability gap
What made this problem persistent was that it lived below the resolution of standard KPI monitoring. Success counters incremented. Alarms did not trigger. The behavior only emerged when transitions were measured — not outcomes.
| What was measured | What it showed | What it missed |
| --- | --- | --- |
| Access success rate | 97.8% (within target) | Latency distribution of successful accesses during low-activity windows |
| Paging success rate | 96.4% (within target) | Per-device miss rate within eDRX reachability window |
| Scheduler utilization | Low (cell appears underloaded) | Recovery latency when load arrives after low-activity period |
| Cell availability | 100% (no outage recorded) | Responsiveness spectrum between low-activity and fully active states |
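Closing the right-hand column is less about new counters than about new joins: the transition itself has to become a measured object. A rough sketch, assuming event logs that expose low-activity exit and first-grant timestamps (the schema here is hypothetical; vendors expose these events differently, if at all):

```python
# Rough sketch: measure the transition, not the outcome.
# Given time-ordered per-cell events (hypothetical schema), derive how long
# a cell takes to become fully responsive after leaving low-activity state.
def recovery_latencies(events):
    """events: time-ordered (t_ms, cell_id, kind) tuples, where kind is
    'low_activity_exit' or 'first_grant_scheduled'."""
    pending = {}       # cell_id -> timestamp of the most recent exit
    latencies = []
    for t_ms, cell_id, kind in events:
        if kind == "low_activity_exit":
            pending[cell_id] = t_ms
        elif kind == "first_grant_scheduled" and cell_id in pending:
            latencies.append(t_ms - pending.pop(cell_id))
    return latencies

events = [
    (1000, "cellA", "low_activity_exit"),
    (1140, "cellA", "first_grant_scheduled"),   # 140 ms recovery
    (5000, "cellB", "low_activity_exit"),
    (5090, "cellB", "first_grant_scheduled"),   # 90 ms recovery
]
print(recovery_latencies(events))  # [140, 90]
```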
What validation had to shift to
Feature correctness was not the question. The question was whether the cell behaved responsively under the actual traffic mix — including the bursty, low-rate, and intermittent patterns that 2021 networks carried.
Behavioral validation criteria added (a minimal check is sketched after the list):
Access latency distribution (not just success rate):
p50, p95, p99 measured during low-activity windows
Target: p95 below 80ms regardless of scheduler state
Wake-up consistency across device types:
IoT devices, smartphones, background-sync apps tested together
Scheduler responsiveness consistent across first-access events
Paging miss rate within reachability window:
Not aggregate paging success
Specific: does device receive paging within its eDRX window?
Target: miss rate below 2% per device per window
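Turning those criteria into pass/fail checks is mechanical once the right measurements exist. A minimal sketch of the first and third checks, with thresholds taken from the targets above; the per-device-type consistency check and the data collection itself are assumed, not shown.

```python
# Minimal sketch: two of the behavioral validation checks listed above.
# Inputs are assumed to be collected during forced low-activity windows.
from statistics import quantiles

def validate_cell(access_latencies_ms, paging_results):
    """access_latencies_ms: latencies of successful accesses during
    low-activity windows. paging_results: device_id -> list of booleans,
    True if the page landed inside that device's eDRX reachability window."""
    checks = {}

    # 1. Access latency distribution, not just success rate.
    p = quantiles(access_latencies_ms, n=100)
    checks["access_p95_below_80ms"] = p[94] < 80

    # 2. Paging miss rate within the reachability window, per device.
    miss_rates = [
        results.count(False) / len(results)
        for results in paging_results.values() if results
    ]
    checks["paging_window_miss_below_2pct"] = max(miss_rates, default=0.0) < 0.02
    return checks

print(validate_cell(
    access_latencies_ms=list(range(20, 80)),
    paging_results={"iot-1": [True] * 50, "phone-1": [True] * 99 + [False]},
))
```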
Effective changes:
Low-activity threshold raised (less aggressive entry; effect sketched below)
Scheduler pre-warm triggered by paging events, not only data events
eDRX cycles aligned with deployed device reachability expectations
Result:
Access latency p95 at low-activity cells: 580ms down to 65ms
Paging within-window success: 96% up to 99.1%
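Why raising the entry threshold helps is mostly arithmetic: the longer a cell waits before dropping into its low-activity state, the smaller the fraction of first accesses that arrive while the scheduler is still recovering. A back-of-the-envelope sketch, assuming exponentially distributed idle gaps between accesses (a simplification; real traffic is burstier):

```python
# Back-of-the-envelope sketch: fraction of first accesses that hit the
# recovery window, as a function of the low-activity entry threshold.
# Assumes exponentially distributed idle gaps; treat as an illustration only.
import math

def fraction_hitting_recovery(mean_idle_ms, entry_threshold_ms):
    """P(idle gap exceeds the entry threshold), i.e. the probability that
    the next access finds the cell in its low-activity state."""
    return math.exp(-entry_threshold_ms / mean_idle_ms)

mean_idle_ms = 2000  # hypothetical average gap between accesses on a quiet cell
for threshold_ms in (200, 1000, 5000):
    frac = fraction_hitting_recovery(mean_idle_ms, threshold_ms)
    print(f"entry threshold {threshold_ms:>5} ms -> {frac:.0%} of accesses hit recovery")
```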
Networks don't fail when they're off. They fail when they're almost on. Sleep in LTE is not a binary state — it is a spectrum. Unless the transitions are measured, the problem stays invisible to every monitoring tool pointed at outcomes.
That lesson carried forward into 5G power efficiency design, NSA anchor stability, and SA readiness validation. The principle is the same across all of them: a network that is technically available but operationally slow to respond will produce exactly the kind of problems that customers report and dashboards miss. Measuring behavior — not just outcomes — is the only way to see it.
LTE · 5G · Cell State Management · IoT · RAN Optimization · OSS Analytics · Performance Engineering · Telecommunications