VoLTE Problems That Neither the Core Nor the Radio Owned
VoLTE · LTE · Cross-Layer Diagnostics
As VoLTE moved deeper into production, many service issues could not be attributed cleanly to either the core or the radio layer. Calls failed or degraded even when each domain appeared healthy in isolation. The real problems lived at the interaction boundary, and neither team's tooling was pointed at it.
Signaling said the call was up. The user heard something different.
When signaling succeeds and the call still fails
A recurring pattern involved signaling sequences completing successfully while radio conditions deteriorated underneath. SIP call setup succeeded. Bearers were established. KPIs showed normal attach and setup behavior. Within seconds, packet loss or uplink instability introduced jitter and audio impairment that no signaling metric flagged.
Call flow — signaling perspective:
INVITE sent
100 Trying received
183 Session Progress received
PRACK / 200 OK exchange: complete
180 Ringing received
200 OK (answer): received
ACK sent
Call established, SIP dialog active
SIP metrics: clean
Bearer setup: success
E-RAB setup success rate: 100% for this call
Radio layer — same call, same time window:
UE SINR at call establishment: 12 dB (acceptable)
UE SINR at t+8 seconds: 4 dB (degraded)
PUSCH retransmission ratio: 22% (elevated)
Uplink packet loss (PDCP): 3.8% (above AMR-WB tolerance)
RTP jitter measured at P-CSCF: 85 ms (above 50 ms threshold)
No RLF. No HO failure. No alarm triggered.
Call active. Audio unusable.
The call never failed by any definition the monitoring framework used. SIP saw a successful dialog. The radio KPI dashboard showed no breach. The user experienced a degraded call from the first few seconds. The evidence for why was split across two systems that were never correlated in real time.
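As a rough illustration of the gap described above, the sketch below classifies a call from both views at once. The 50 ms jitter limit is taken from the trace; the packet-loss limit is an assumed placeholder, since no numeric AMR-WB tolerance is given here, and the `CallSnapshot` fields are hypothetical names rather than fields from any real monitoring system.

```python
from dataclasses import dataclass

# 50 ms comes from the trace above. The loss limit is an ASSUMED placeholder:
# the article states 3.8% is above tolerance but gives no numeric limit.
JITTER_LIMIT_MS = 50.0
ASSUMED_LOSS_LIMIT_PCT = 1.0


@dataclass
class CallSnapshot:
    """Hypothetical per-call record combining signaling and media views."""
    sip_dialog_established: bool
    erab_setup_ok: bool
    rlf_or_ho_failure: bool
    uplink_loss_pct: float
    rtp_jitter_ms: float


def classify(call: CallSnapshot) -> str:
    """Return 'failed', 'healthy', or 'active_but_degraded'."""
    signaling_ok = (call.sip_dialog_established
                    and call.erab_setup_ok
                    and not call.rlf_or_ho_failure)
    media_ok = (call.uplink_loss_pct <= ASSUMED_LOSS_LIMIT_PCT
                and call.rtp_jitter_ms <= JITTER_LIMIT_MS)
    if not signaling_ok:
        return "failed"
    return "healthy" if media_ok else "active_but_degraded"


# Values from the example call: clean signaling, 3.8% loss, 85 ms jitter.
example = CallSnapshot(True, True, False, uplink_loss_pct=3.8, rtp_jitter_ms=85.0)
print(classify(example))
```

A signaling-only framework has just the first two outcomes. The third state only exists once media measurements are evaluated alongside the dialog state.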
Signaling timeline vs RAN behavior — the alignment gap
Fixing this required a different starting point. Instead of beginning with SIP traces or radio KPIs independently, the analysis had to align signaling event timelines with RAN counter behavior at the same time resolution.
Cross-layer correlation — what each source contributed
SIP / IMS trace:
Call setup sequence, timing of each message
Re-INVITE events (indicative of mid-call renegotiation)
BYE cause codes (normal vs. abnormal termination)
SIP response timing (latency in signaling path)
RAN counters (cell level, 15-min granularity):
PUSCH retransmission ratio at time of call
Uplink SINR distribution during call window
HO attempt and outcome during call
RRC state during bearer activity
NG1 / packet capture (UE level where available):
RTP packet timing, jitter, loss per flow
PDCP SDU discard events
Uplink grant scheduling gaps
None of these alone identified the cause.
SIP said normal. RAN said acceptable. Packets showed the failure.
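The per-flow jitter figure listed under packet capture is commonly computed with the smoothed interarrival estimator from RFC 3550. A minimal sketch follows, assuming send and receive times are already available in milliseconds (a real capture would first convert RTP timestamps using the codec clock rate). The function name and sample values are illustrative only.

```python
def interarrival_jitter_ms(send_ms, recv_ms):
    """RFC 3550 style running jitter estimate, in milliseconds.

    For consecutive packets i-1 and i:
        D = (recv[i] - recv[i-1]) - (send[i] - send[i-1])
        J = J + (|D| - J) / 16
    """
    jitter = 0.0
    for i in range(1, len(send_ms)):
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


send = [0, 20, 40, 60]          # 20 ms speech frames
steady = [50, 70, 90, 110]      # constant transit delay
delayed = [50, 70, 170, 190]    # third packet arrives 80 ms late

print(interarrival_jitter_ms(send, steady))
print(interarrival_jitter_ms(send, delayed))
```

Because the estimator is smoothed by a factor of 16, a single late packet moves it only slightly. Brief events are therefore easier to see in raw inter-arrival gaps than in the averaged jitter value.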
The correlation had to be time-aligned to within the same 15-minute window at minimum, ideally per-call where UE-level traces were available. Hourly OSS aggregates missed the transient events entirely.
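One way to picture the minimum level of alignment is to floor each call's start time to its 15-minute counter window and look up the matching cell-level row. The sketch below assumes simple dictionaries for both sources; the field names, cell name, dates, and counter values are hypothetical and not taken from any specific OSS export.

```python
from datetime import datetime, timedelta


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute counter window."""
    return ts - timedelta(minutes=ts.minute % 15,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def join_calls_to_counters(calls, counters):
    """Attach the matching RAN counter row to each call record.

    calls:    list of dicts with 'call_id', 'cell', 'start' (datetime)
    counters: dict keyed by (cell, window_start) -> dict of counter values
    """
    joined = []
    for call in calls:
        key = (call["cell"], window_start(call["start"]))
        joined.append({**call, "ran": counters.get(key)})
    return joined


# Hypothetical sample data for illustration only.
counters = {
    ("cellA", datetime(2024, 1, 1, 10, 0)): {"pusch_retx_pct": 6.0},
    ("cellA", datetime(2024, 1, 1, 10, 15)): {"pusch_retx_pct": 22.0},
}
calls = [{"call_id": "c1", "cell": "cellA",
          "start": datetime(2024, 1, 1, 10, 17, 42)}]

for row in join_calls_to_counters(calls, counters):
    print(row["call_id"], row["ran"])
```

This sketch keys only on the call start time. A call that spans a window boundary, or moves between cells, would need one lookup per window and per serving cell.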
Mobility under marginal conditions
In several clusters, calls initiated cleanly but degraded immediately after minor movement. The pattern was consistent: handover preparation succeeded, but execution took place under marginal radio conditions, either because measurement reporting was delayed or because competing uplink load on the target cell slowed the UE's access.
Handover sequence during VoLTE call — degradation pattern:
t=0: UE on Cell A, SINR 11 dB, call active, audio clean
t=12s: UE moves, Cell A SINR drops to 6 dB
Measurement report triggered (A3 event)
t=14s: HO preparation to Cell B: success
t=15s: HO execution begins
Cell B uplink load: 74% PRB utilization at this moment
UE uplink sync to Cell B: delayed 180ms (above typical 80ms)
t=15.2s: RTP gap: 180ms
AMR codec concealment: activated
Perceived audio: dropout
HO outcome logged: success
KPI: handover success rate unaffected
User perception: call quality broke during the handover
From a signaling perspective, nothing failed. The handover completed. From a user perspective, the 180ms sync delay was enough to trigger codec concealment. The gap between "handover success" as a KPI and "handover quality" as a user experience was not captured anywhere in the monitoring stack.
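A "handover quality" measure could be approximated by looking at the largest RTP inter-arrival gap that overlaps the handover execution window. The sketch below is one possible formulation. The 100 ms gap limit is an assumed value chosen for illustration, and the arrival times are synthetic data shaped like the sequence above.

```python
def max_gap_in_window(arrivals_ms, win_start_ms, win_end_ms):
    """Largest gap between consecutive packets that overlaps the window."""
    worst = 0
    for prev, cur in zip(arrivals_ms, arrivals_ms[1:]):
        overlaps = prev <= win_end_ms and cur >= win_start_ms
        if overlaps:
            worst = max(worst, cur - prev)
    return worst


def handover_quality(arrivals_ms, ho_start_ms, ho_end_ms, gap_limit_ms=100):
    """Report the worst RTP gap around a handover against an ASSUMED limit."""
    gap = max_gap_in_window(arrivals_ms, ho_start_ms, ho_end_ms)
    return {"max_gap_ms": gap, "quality_ok": gap <= gap_limit_ms}


# Synthetic arrivals: 20 ms spacing, then a 180 ms hole at t=15.0 s.
arrivals = list(range(14900, 15001, 20)) + list(range(15180, 15301, 20))

print(handover_quality(arrivals, ho_start_ms=15000, ho_end_ms=15200))
```

A handover can log as successful and still return `quality_ok: False` here, which is the distinction the success-rate KPI does not make.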
Transient instability — below alarm thresholds, above tolerance
Short-lived spikes in latency or packet loss, lasting only a few hundred milliseconds, were enough to impact voice quality but too brief to trigger alarms or move hourly KPIs. They surfaced only through packet-level analysis time-aligned with the call window.
| Event type | Duration | KPI effect | VoLTE effect |
| --- | --- | --- | --- |
| Uplink packet loss burst | 120-200 ms | None (too brief for hourly counter) | 6-10 RTP packets lost, audio dropout, concealment activated |
| RRC re-establishment | 200-350 ms | Counted as success if recovery completes | Bearer interruption exceeds AMR-WB 160 ms frame tolerance |
| Scheduling gap under load | 80-150 ms | Not visible in throughput average | Jitter spike above 50 ms P-CSCF threshold, quality flag raised |
| HO execution delay | 150-220 ms above typical | HO logged as success | RTP gap triggers codec concealment, perceived dropout |
Each event was technically within acceptable bounds for a data session. Each was a quality failure for a voice bearer. The monitoring framework was calibrated for the former and applied to the latter without adjustment.
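The arithmetic behind "too brief for an hourly counter" can be made concrete. The sketch below finds runs of missing RTP sequence numbers and compares a 10-packet burst with the hourly loss percentage it would produce. It assumes 20 ms speech frames, packets listed in arrival order with no duplicates or reordering, and an hour of continuous speech with no silence suppression. The sequence numbers are made up.

```python
FRAME_MS = 20  # one speech frame per RTP packet, assumed


def loss_bursts(seqs):
    """Return (first_missing_seq, count) for each run of missing packets.

    Assumes in-order arrival without duplicates; handles 16-bit wraparound.
    """
    bursts = []
    for prev, cur in zip(seqs, seqs[1:]):
        missing = (cur - prev - 1) % 65536
        if missing:
            bursts.append(((prev + 1) % 65536, missing))
    return bursts


seqs = [100, 101, 102, 113, 114]          # packets 103..112 never arrived
bursts = loss_bursts(seqs)
first_missing, lost = bursts[0]

burst_ms = lost * FRAME_MS                 # duration of the audio hole
packets_per_hour = 3600 * 1000 // FRAME_MS
hourly_loss_pct = 100.0 * lost / packets_per_hour

print(bursts, burst_ms, packets_per_hour)
print(f"{hourly_loss_pct:.4f}%")
```

Under these assumptions a 200 ms hole is 10 packets out of 180,000 in the hour, well under a hundredth of a percent, so the aggregate barely moves while the listener hears a dropout.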
What stabilization required
Stabilizing VoLTE meant treating the network as a single system rather than a collection of domains. Parameter tuning without validating its effect on end-to-end call behavior produced fixes that resolved counters without resolving quality. Signaling-layer fixes that ignored radio variability provided temporary relief that reversed under load.
Domain-isolated fix vs cross-layer fix — same symptom
Symptom: mid-call audio quality degradation, cluster X
Domain-isolated approach:
Core team: SIP traces clean, no IMS issue found
RAN team: KPIs within target, no parameter change needed
Result: issue not owned, persists
Cross-layer approach:
Aligned SIP re-INVITE timing with RAN HO execution events
Found: re-INVITEs clustering in 15-min windows of high HO load
Root cause: HO execution delays on overloaded target cells
generating RTP gaps that triggered mid-call renegotiation
Fix: target cell load threshold for HO admission adjusted
HO margin increased for GBR bearers specifically
Result: re-INVITE rate dropped 74%, quality complaints resolved
The fix was a RAN parameter change. It was only found by starting with signaling behavior and working back through the radio layer. Neither starting point alone reached the cause.
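The alignment step in the cross-layer approach can be sketched as counting how many re-INVITEs follow shortly after a handover execution. This is only an illustration: the 5-second lag is an assumed value, the timestamps are invented, and a real analysis would group events per call before comparing them.

```python
from bisect import bisect_right


def reinvites_near_handover(reinvite_ts, ho_ts, max_lag_s=5.0):
    """Count re-INVITEs occurring within max_lag_s seconds after a handover.

    Timestamps are seconds on a shared clock. max_lag_s is an ASSUMED value.
    """
    ho_sorted = sorted(ho_ts)
    hits = 0
    for t in reinvite_ts:
        i = bisect_right(ho_sorted, t) - 1   # latest handover at or before t
        if i >= 0 and t - ho_sorted[i] <= max_lag_s:
            hits += 1
    return hits


# Hypothetical event times for one call leg.
handovers = [15.0, 130.0, 400.0]
reinvites = [16.2, 90.0, 131.5, 402.0]

hits = reinvites_near_handover(reinvites, handovers)
print(f"{hits} of {len(reinvites)} re-INVITEs follow a handover within 5 s")
```

A high share of re-INVITEs landing just after handovers is a hint, not proof, that renegotiation is driven by the radio layer. It indicates where to look next in the RAN data.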
Service quality depends on the weakest interaction, not the strongest component. VoLTE made this unavoidable because voice has no tolerance for the transient instabilities that data sessions absorb silently. The shift from domain expertise to cross-layer validation was more than a process change; it changed what "diagnosing a problem" meant. That discipline proved essential as networks grew more layered and the interactions between components became harder to reason about from any single domain's perspective.