VoLTE Problems That Neither the Core Nor the Radio Owned

VoLTE · LTE · Cross-Layer Diagnostics · 8 min read

As VoLTE moved deeper into production, many service issues could not be attributed cleanly to either the core or the radio layer. Calls failed or degraded even when each domain appeared healthy in isolation. The real problems lived at the interaction boundary, and neither team's tooling was pointed at it.

Signaling said the call was up. The user heard something different.

When signaling succeeds and the call still fails

A recurring pattern involved signaling sequences completing successfully while radio conditions deteriorated underneath. SIP call setup succeeded. Bearers were established. KPIs showed normal attach and setup behavior. Within seconds, packet loss or uplink instability introduced jitter and audio impairment that no signaling metric flagged.

Call flow, signaling perspective:
  INVITE sent
  100 Trying received
  183 Session Progress received
  PRACK / 200 OK exchange: complete
  180 Ringing received
  200 OK (answer): received
  ACK sent
  Call established, SIP dialog active

  SIP metrics: clean
  Bearer setup: success
  eRAB success rate: 100% for this call

Radio layer, same call, same time window:
  UE SINR at call establishment: 12 dB (acceptable)
  UE SINR at t+8 seconds: 4 dB (degraded)
  PUSCH retransmission ratio: 22% (elevated)
  Uplink packet loss (PDCP): 3.8% (above AMR-WB tolerance)
  RTP jitter measured at P-CSCF: 85ms (above 50ms threshold)

  No RLF. No HO failure. No alarm triggered.
  Call active. Audio unusable.

The call never failed by any definition the monitoring framework used. SIP saw a successful dialog. The radio KPI dashboard showed no breach. The user experienced a degraded call from the first few seconds. The evidence for why was split across two systems that were never correlated in real time.
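One way to make that split visible is to evaluate signaling outcome and media quality together for each call. The sketch below is illustrative only: the `CallView` record and the `classify` function are hypothetical names, the 50 ms jitter limit is the threshold quoted in the call example, and the 1% loss limit is an assumed placeholder because no figure for AMR-WB tolerance is given here.

```python
from dataclasses import dataclass

JITTER_LIMIT_MS = 50.0  # jitter threshold quoted in the call example
LOSS_LIMIT_PCT = 1.0    # ASSUMED placeholder; no loss tolerance figure is given


@dataclass
class CallView:
    """Hypothetical per-call record joining signaling and media measurements."""
    sip_dialog_ok: bool
    erab_setup_ok: bool
    rtp_jitter_ms: float
    ul_pdcp_loss_pct: float


def classify(call: CallView) -> str:
    """Label a call using both layers instead of signaling alone."""
    signaling_ok = call.sip_dialog_ok and call.erab_setup_ok
    media_ok = (call.rtp_jitter_ms <= JITTER_LIMIT_MS
                and call.ul_pdcp_loss_pct <= LOSS_LIMIT_PCT)
    if not signaling_ok:
        return "signaling_failure"
    if not media_ok:
        return "established_but_degraded"
    return "healthy"


# Values taken from the example call: clean signaling, 85 ms jitter, 3.8% loss.
print(classify(CallView(True, True, 85.0, 3.8)))
```

A call in the "established_but_degraded" state is exactly the case that neither a SIP success counter nor a radio alarm would surface.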

Signaling timeline vs RAN behavior — the alignment gap

Fixing this required a different starting point. Instead of beginning with SIP traces or radio KPIs independently, the analysis had to align signaling event timelines with RAN counter behavior at the same time resolution.

Cross-layer correlation — what each source contributed
SIP / IMS trace:
  Call setup sequence, timing of each message
  Re-INVITE events (indicative of mid-call renegotiation)
  BYE cause codes (normal vs. abnormal termination)
  SIP response timing (latency in signaling path)

RAN counters (cell level, 15-min granularity):
  PUSCH retransmission ratio at time of call
  Uplink SINR distribution during call window
  HO attempt and outcome during call
  RRC state during bearer activity

NG1 / packet capture (UE level where available):
  RTP packet timing, jitter, loss per flow
  PDCP SDU discard events
  Uplink grant scheduling gaps

None of these alone identified the cause. SIP said normal. RAN said acceptable. Packets showed the failure.

The correlation had to be time-aligned to within the same 15-minute window at minimum, ideally per-call where UE-level traces were available. Hourly OSS aggregates missed the transient events entirely.
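As a rough illustration of the 15-minute alignment, the sketch below floors each SIP event timestamp to its counter window and looks up the RAN counters for the same cell and window. The function names, the tuple layout, and the `pusch_retx_pct` counter key are hypothetical, and the sketch assumes both sources share a synchronized clock and that the serving cell is known for each SIP event.

```python
from datetime import datetime


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute counter window."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def align(sip_events, ran_counters):
    """Attach the matching RAN counter row to each SIP event.

    sip_events:   list of (timestamp, cell_id, event_name) tuples
    ran_counters: dict keyed by (window_start, cell_id) -> dict of counters
    """
    rows = []
    for ts, cell, name in sip_events:
        key = (window_start(ts), cell)
        rows.append({"ts": ts, "cell": cell, "event": name,
                     "ran": ran_counters.get(key)})  # None if no counters exist
    return rows


# Hypothetical sample data for one re-INVITE on one cell.
events = [(datetime(2024, 1, 1, 10, 22, 31), "cell_A", "re-INVITE")]
counters = {(datetime(2024, 1, 1, 10, 15), "cell_A"): {"pusch_retx_pct": 22.0}}
for row in align(events, counters):
    print(row["event"], row["ts"], row["ran"])
```

Per-call alignment would replace the window key with the call's own start and end times, which is only possible where UE-level traces exist.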

Mobility under marginal conditions

In several clusters, calls initiated cleanly but degraded immediately after minor movement. The pattern was consistent: handover preparation succeeded, but execution took place under marginal radio conditions, caused either by delayed measurement reporting or by competing uplink load on the target cell.

Handover sequence during VoLTE call, degradation pattern:
  t=0:     UE on Cell A, SINR 11 dB, call active, audio clean
  t=12s:   UE moves, Cell A SINR drops to 6 dB
           Measurement report triggered (A3 event)
  t=14s:   HO preparation to Cell B: success
  t=15s:   HO execution begins
           Cell B uplink load: 74% PRB utilization at this moment
           UE uplink sync to Cell B: delayed 180ms (above typical 80ms)
  t=15.2s: RTP gap: 180ms
           AMR codec concealment: activated
           Perceived audio: dropout

  HO outcome logged: success
  KPI: handover success rate unaffected
  User perception: call quality broke during the handover

From a signaling perspective, nothing failed. The handover completed. From a user perspective, the 180ms sync delay was enough to trigger codec concealment. The gap between "handover success" as a KPI and "handover quality" as a user experience was not captured anywhere in the monitoring stack.
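A "handover quality" view can be approximated by looking for RTP gaps close to the handover execution time. The following is a minimal sketch, assuming packet arrival times in milliseconds are available from a capture. The function names, the 100 ms gap threshold, and the 500 ms search window are illustrative assumptions, not values from the deployment described here.

```python
def find_rtp_gaps(arrivals_ms, min_gap_ms=100):
    """Return (gap_start_ms, gap_duration_ms) for each inter-arrival gap
    at or above min_gap_ms. Nominal spacing for AMR speech is 20 ms."""
    gaps = []
    for prev, cur in zip(arrivals_ms, arrivals_ms[1:]):
        if cur - prev >= min_gap_ms:
            gaps.append((prev, cur - prev))
    return gaps


def gaps_near_handover(gaps, ho_time_ms, window_ms=500):
    """Keep only the gaps that start within window_ms of the HO execution time."""
    return [(start, dur) for start, dur in gaps
            if abs(start - ho_time_ms) <= window_ms]


# Synthetic arrivals: steady 20 ms spacing, then a 180 ms hole at t=15 s,
# mirroring the handover sequence in the example.
arrivals = list(range(14900, 15001, 20)) + list(range(15180, 15300, 20))
gaps = find_rtp_gaps(arrivals)
print(gaps_near_handover(gaps, ho_time_ms=15000))
```

A handover that is logged as a success but has a gap attached to it in this way is a candidate for the quality problem that the success-rate KPI hides.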

Transient instability — below alarm thresholds, above tolerance

Short-lived spikes in latency or packet loss, lasting only a few hundred milliseconds, were enough to impact voice quality but too brief to trigger alarms or move hourly KPIs. They surfaced only through packet-level analysis time-aligned with the call window.

| Event type | Duration | KPI effect | VoLTE effect |
| --- | --- | --- | --- |
| Uplink packet loss burst | 120-200ms | None (too brief for hourly counter) | 6-10 RTP packets lost, audio dropout, concealment activated |
| RRC re-establishment | 200-350ms | Counted as success if recovery completes | Bearer interruption exceeds AMR-WB 160ms frame tolerance |
| Scheduling gap under load | 80-150ms | Not visible in throughput average | Jitter spike above 50ms P-CSCF threshold, quality flag raised |
| HO execution delay | 150-220ms above typical | HO logged as success | RTP gap triggers codec concealment, perceived dropout |

Each event was technically within acceptable bounds for a data session. Each was a quality failure for a voice bearer. The monitoring framework was calibrated for the former and applied to the latter without adjustment.
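The packet counts for a loss burst follow from the 20 ms speech frame: at one frame per RTP packet, a 120-200 ms burst corresponds to 6 to 10 lost packets. The sketch below recovers such bursts from RTP sequence numbers. It assumes packets are listed in arrival order with no reordering or duplicates, and one speech frame per packet; `loss_bursts` is a hypothetical helper name.

```python
FRAME_MS = 20      # AMR / AMR-WB speech frame duration
SEQ_MOD = 65536    # RTP sequence numbers are 16-bit and wrap around


def loss_bursts(seqs):
    """Return (first_missing_seq, packets_lost, approx_duration_ms) per gap.

    Assumes in-order arrival without duplicates and one frame per packet.
    """
    bursts = []
    for prev, cur in zip(seqs, seqs[1:]):
        missing = (cur - prev - 1) % SEQ_MOD
        if missing:
            bursts.append(((prev + 1) % SEQ_MOD, missing, missing * FRAME_MS))
    return bursts


# Six packets missing between 11 and 18: roughly a 120 ms burst.
print(loss_bursts([10, 11, 18, 19]))
# The same burst size across the 16-bit wraparound.
print(loss_bursts([65533, 65534, 5, 6]))
```

A burst of this size is invisible in an hourly loss percentage but maps directly onto the audible dropout.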

What stabilization required

Stabilizing VoLTE meant treating the network as a single system rather than a collection of domains. Parameter tuning without validating its effect on end-to-end call behavior produced fixes that resolved counters without resolving quality. Signaling-layer fixes that ignored radio variability provided temporary relief that reversed under load.

Domain-isolated fix vs cross-layer fix — same symptom
Symptom: mid-call audio quality degradation, cluster X

Domain-isolated approach:
  Core team: SIP traces clean, no IMS issue found
  RAN team: KPIs within target, no parameter change needed
  Result: issue not owned, persists

Cross-layer approach:
  Aligned SIP re-INVITE timing with RAN HO execution events
  Found: re-INVITEs clustering in 15-min windows of high HO load
  Root cause: HO execution delays on overloaded target cells
              generating RTP gaps that triggered mid-call renegotiation
  Fix: target cell load threshold for HO admission adjusted
       HO margin increased for GBR bearers specifically
  Result: re-INVITE rate dropped 74%, quality complaints resolved

The fix was a RAN parameter change. It was only found by starting with signaling behavior and working back through the radio layer. Neither starting point alone reached the cause.
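The clustering step can be sketched as a comparison of re-INVITE counts between high-handover-load windows and the rest. Everything in this example is hypothetical: the function name, the window labels, the sample numbers, and the `high_load` cutoff of 150 attempts are illustrative and not taken from the cluster described here.

```python
from collections import Counter


def reinvite_rate_by_ho_load(reinvite_windows, ho_attempts, high_load):
    """Compare mean re-INVITEs per window for high-HO-load windows vs the rest.

    reinvite_windows: one window label per observed re-INVITE
    ho_attempts:      dict of window label -> HO attempts in that window
    high_load:        HO attempt count at or above which a window is "high"
    """
    counts = Counter(reinvite_windows)
    high, low = [], []
    for window, attempts in ho_attempts.items():
        bucket = high if attempts >= high_load else low
        bucket.append(counts.get(window, 0))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(high), mean(low)


# Hypothetical data: six re-INVITEs land in the one heavily loaded window.
reinvites = ["10:15"] * 6 + ["10:30"]
ho = {"10:00": 40, "10:15": 210, "10:30": 55}
print(reinvite_rate_by_ho_load(reinvites, ho, high_load=150))
```

A large difference between the two means is only a pointer toward the handover path, not proof of cause; the RTP gap evidence is what ties the re-INVITEs to handover execution delays.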

Service quality depends on the weakest interaction, not the strongest component. VoLTE made this unavoidable because voice has no tolerance for the transient instabilities that data sessions absorb silently. The shift from domain expertise to cross-layer validation was more than a process change: it changed what "diagnosing a problem" meant. That discipline proved essential as networks grew more layered and the interactions between components became harder to reason about from any single domain's perspective.

VoLTE  ·  LTE  ·  SIP  ·  RAN Optimization  ·  Cross-layer Diagnostics  ·  IMS  ·  Performance Engineering  ·  Telecommunications
