Two decades on radio networks leave you with a particular skepticism: toward device behavior, vendor benchmarks, and KPIs that stay green while customers notice something is wrong. The perspective here is always from the network floor up. Posts cover anomaly detection, 5G SA/NSA, automation, and building observability into live national networks. The signal was always there. Getting to the insight is the harder part.
Cat-M Didn't Struggle at Scale - LTE Scheduling Wasn't Built for Machines
Cat-M · LTE IoT · Scheduling · 7 min read
When Cat-M began moving from pilot deployments into production scale, a pattern emerged that didn't fit the usual coverage or capacity narratives. Devices were reachable. Signal levels were acceptable. Yet registrations were slow, inconsistent, and sometimes unpredictable in ways that drive-test campaigns and lab tests never surfaced.
The instinctive reaction was to look at radio conditions or device behavior. The issue sat deeper — in how LTE networks had been optimized long before machine traffic became meaningful. LTE schedulers were built with one dominant assumption: human traffic dominates the network. Cat-M exposed what that assumption cost at scale.
Where the friction came from
Phones behave in bursts, adapt quickly, and tolerate retries. The network was tuned around that tolerance. Machines have different expectations: short transmissions, infrequent access, deterministic timing requirements for reporting cycles. Under load, the scheduler's human-centric behavior created friction that no amount of additional coverage resolved.
Cat-M friction patterns at production scale — not visible in standard LTE KPIs
Access attempt delay (not rejection):
RACH attempts backed off during busy periods
Device waits, retries, waits again
LTE access success rate: unaffected (attempts eventually succeed)
Device reporting cycle: missed deadline, data lost or retransmitted
Paging deprioritization:
Cat-M paging responses scheduled behind broadband paging load
PSM-exiting devices: delayed reachability
LTE paging success rate: within target
Device reachability window: exceeded, device re-enters PSM
Next attempt: minutes later
Retry amplification:
Multiple devices experiencing access delay simultaneously
Retry timers fire in overlapping windows
Contention increases, delays compound
Network load: smooth in aggregate
Device population: synchronized retries creating micro-congestion spikes
None of this appeared in traditional LTE KPIs. Access success rates were acceptable. Paging success was within target. Throughput was not the constraint. From the network's perspective, things looked healthy. From the device's perspective, reporting deadlines were being missed.
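The retry-amplification dynamic is easy to reproduce in a toy model. The sketch below is a deliberately simplified simulation with made-up numbers (device count, per-slot RACH capacity, busy window, and backoff timer are all assumptions, not measured values). It only illustrates the mechanism: mean attempt rate stays low and smooth, but identical backoff timers re-synchronize failed devices into micro-congestion spikes, and per-device access delay blows past the expectation even though every attempt eventually succeeds.

```python
# Toy model of retry amplification: all numbers are assumptions, not
# measurements. Devices that fail access all re-arm the same backoff
# timer, so a busy period turns into a synchronized retry spike after it.
import random

random.seed(7)

SLOTS = 1000                       # one slot ~ one RACH opportunity (1 ms)
DEVICES = 300
CATM_CAPACITY = 3                  # Cat-M attempts served per slot (assumed)
BUSY = range(200, 400)             # broadband load starves Cat-M access here
BACKOFF = 40                       # identical fixed retry timer (assumed)

first_try = {d: random.randrange(0, 500) for d in range(DEVICES)}
next_try = dict(first_try)
granted = {}                       # device -> slot where access succeeded
attempts = [0] * SLOTS

for slot in range(SLOTS):
    cap = 0 if slot in BUSY else CATM_CAPACITY
    due = [d for d, t in next_try.items() if t == slot]
    attempts[slot] = len(due)
    random.shuffle(due)
    for i, d in enumerate(due):
        if i < cap:
            granted[d] = slot                # access succeeded
            del next_try[d]
        else:
            next_try[d] = slot + BACKOFF     # same timer -> retries re-sync

delays = [granted[d] - first_try[d] for d in granted]
print(f"mean attempts/slot: {sum(attempts) / SLOTS:.2f}")      # looks smooth
print(f"peak attempts/slot: {max(attempts)}")                  # micro-spike
print(f"devices late (>10 slots): {sum(dl > 10 for dl in delays)} of {len(granted)}")
```

The aggregate number is the one a standard KPI dashboard shows; the peak and the per-device delay distribution are the ones the device population experiences.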
Why capacity wasn't the answer
Adding carriers or expanding capacity did not change the behavior. The problem was scheduler priority logic, not resource availability. Cat-M devices were competing for access slots against broadband users using the same backoff and retry parameters — which were tuned for broadband traffic tolerances, not machine access determinism.
Cat-M access under standard LTE scheduler — busy hour:
Cat-M device: periodic sensor report, 200-byte payload
Scheduled access window: 10ms (device expectation)
Actual access latency under load: 340-800ms
Cause: RACH resources shared with broadband, backoff applied uniformly
Cat-M retry behavior:
T300 expiry: device retries after fixed interval
Under sustained load: retries compound rather than clear
Backoff parameters: designed for phone-scale sessions
Machine device: exponential backoff applied to deterministic reporting cycle creates cascading delay
Broadband device under same conditions:
TCP retransmission handles the delay transparently
User perceives: slight slowdown
Application: unaffected
Cat-M device under same conditions:
Reporting cycle missed
Upstream application: missing data, timeout triggered
Re-registration initiated in some implementations
Network load: amplified by re-registration traffic
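To make the asymmetry concrete, here is a back-of-envelope sketch. The deadline, per-attempt time, and backoff schedule are assumptions chosen for illustration (real values come from broadcast backoff configuration and device firmware, and T300-style retry timing is simplified to a pure exponential schedule here). The shape of the result is the point: a retry schedule a phone's TCP session absorbs transparently walks straight past a fixed machine reporting deadline.

```python
# Assumed numbers throughout: illustrates cascading delay, not a real config.
DEADLINE_MS = 10_000        # sensor must report within its cycle (assumed)
FIRST_BACKOFF_MS = 200      # first retry interval (assumed)
ATTEMPT_MS = 50             # time consumed by one access attempt (assumed)

elapsed = 0
for attempt in range(1, 9):
    elapsed += ATTEMPT_MS
    status = "within deadline" if elapsed <= DEADLINE_MS else "DEADLINE MISSED"
    print(f"attempt {attempt}: elapsed {elapsed:6d} ms  {status}")
    elapsed += FIRST_BACKOFF_MS * 2 ** (attempt - 1)    # exponential backoff
```

By the seventh attempt the device is past a 10-second reporting window, which is exactly where the upstream timeout and re-registration path in the comparison above begins.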
What the effective changes targeted
The fixes that worked were not capacity changes. They were scheduling logic adjustments that recognized Cat-M access phases as distinct from broadband traffic — and stopped applying broadband scheduler assumptions to machine access behavior.
What was adjusted and its effect on Cat-M behavior, by change area:
RACH resource allocation:
Adjusted: dedicated PRACH resources for Cat-M access phases during busy hours
Effect: access delay reduced from 340-800ms to 40-90ms at peak load
Paging priority:
Adjusted: Cat-M paging window protected from broadband paging preemption
Effect: PSM device reachability consistent and deadline-aligned instead of opportunistic
Backoff parameters:
Adjusted: T300 / T302 timers tuned for the machine access pattern, not the broadband retry pattern
Effect: retry amplification eliminated; congestion spikes from synchronized retries removed
eDRX cycle alignment:
Adjusted: eDRX paging cycles aligned with the reporting intervals of deployed device types
Effect: paging miss rate reduced 68%; re-registration-driven load dropped significantly
These were subtle changes. They did not increase peak throughput or change coverage. They made access behavior deterministic rather than opportunistic — which is what machine traffic requires by design.
Fig 1 — Cat-M access delay: standard vs tuned scheduler
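Of the four changes, eDRX alignment is the easiest to sketch. The version below is a simplification of the idea, not the deployed logic: cycle lengths are the standard LTE eDRX values (5.12 s times powers of two, up to 2621.44 s), held in hundredths of a second so the divisibility check stays exact, and the device names and reporting intervals are hypothetical. Prefer the longest cycle that divides the reporting interval exactly, so a paging occasion lands on every deadline; fall back to the longest cycle that fits when nothing divides evenly.

```python
# Simplified sketch of eDRX-to-reporting-interval alignment.
EDRX_CYCLES_CS = [512 * 2**k for k in range(10)]    # 5.12 s .. 2621.44 s

def pick_edrx_cs(report_cs: int) -> int:
    """Longest cycle not exceeding the reporting interval, preferring one
    that divides it exactly so a paging occasion lands on each deadline."""
    fitting = [c for c in EDRX_CYCLES_CS if c <= report_cs]
    exact = [c for c in fitting if report_cs % c == 0]
    # When nothing divides evenly, a real deployment would more likely
    # nudge the reporting interval onto a cycle multiple instead.
    return max(exact) if exact else max(fitting)

# Hypothetical device types and reporting intervals.
for name, report_s in [("meter-a", 81.92), ("tracker-b", 1310.72),
                       ("camera-c", 900.0)]:
    cycle_cs = pick_edrx_cs(round(report_s * 100))
    print(f"{name}: report every {report_s} s -> eDRX cycle {cycle_cs / 100} s")
```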
What changed in validation
Once access behavior stabilized, validation criteria had to change to match. Lab success and feature enablement were no longer sufficient. The question was no longer "does it attach?" — it was "does it attach on time, under load, with real contention from broadband traffic?"
IoT validation criteria post-tuning:
Required test conditions:
Mixed LTE broadband + Cat-M traffic at busy-hour load ratio
Real paging load (not isolated device test)
eDRX and PSM cycles active, not disabled for test simplicity
Multiple device types with different reporting intervals
Pass criteria:
Access latency: p95 below 120ms at peak load
Paging success within device reachability window: above 97%
Retry amplification under sustained load: not observed
Re-registration rate: below 0.5% per hour
Previously used criteria:
Attach success rate above threshold
Coverage adequate
Feature enabled in configuration
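These criteria are also cheap to automate. A minimal sketch of the pass/fail evaluation: the thresholds are the ones listed above, while the field names and sample values are invented for the example, not taken from a real test run. (The retry-amplification criterion needs the time-series view from the earlier simulation sketch rather than a single threshold, so it is omitted here.)

```python
# Hypothetical load-test summary checked against the stated pass criteria.
from statistics import quantiles

access_latency_ms = [38, 41, 55, 62, 70, 88, 90, 95, 101, 118]
pages_sent = 400                 # pages toward reachable Cat-M devices
pages_in_window = 392            # answered inside the reachability window
rereg_pct_per_hour = 0.3         # re-registration rate across population

p95 = quantiles(access_latency_ms, n=20)[18]    # 95th percentile cut point
checks = {
    "access latency p95 < 120 ms": p95 < 120,
    "paging success in window > 97%": pages_in_window / pages_sent > 0.97,
    "re-registration < 0.5% per hour": rereg_pct_per_hour < 0.5,
}
for name, ok in checks.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
```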
Cat-M didn't expose a flaw in IoT standards. It exposed a mismatch between machine expectations and human-centric network logic. Once scheduling decisions respected that difference, Cat-M behaved exactly as designed. The technology was ready. The assumptions just needed updating.
That lesson carried forward into RedCap, 5G access state design, and how machine-type communication readiness is now evaluated in NR deployments. The pattern is consistent: features designed for a specific traffic type will underperform when the network they run on was optimized for a different one. Finding that mismatch early requires testing under conditions that reflect actual device population behavior — not the device behavior that was convenient to model.