Integrating Two Live Networks Without Breaking the Customer

Network integration  ·  5G  ·  observability  ·  8 min read

Large-scale network integrations are often described as a cutover problem. In reality, they are a behavioral problem. When two live mobile networks are stitched together, the hardest issues rarely come from radios or core elements in isolation. They emerge at the edges, where assumptions from one network collide with the operational realities of another.

One of the earliest lessons during a nationwide integration effort was that roaming logic and native-network logic behave very differently under load. What works well for a roaming footprint can expose weaknesses quickly when millions of devices begin behaving as if the network is home. The issues that surfaced were not configuration gaps. They were assumption gaps.

What surfaced only under live conditions

None of the hardest problems appeared in lab testing. They appeared when real devices, real applications, and real mobility entered the picture simultaneously. The combination was always what mattered, never any single element in isolation.

Integration failure classes -- visible only under live conditions

Mobility path fragility
- Lab validation: handover between Network A and Network B cells passed
- Live condition: same handover path under load with mixed device types
- Observed: 14% failure rate at specific boundary zones
- Cause: measurement reporting thresholds not aligned across networks
- Not visible in either network's standalone KPIs

Legacy constraint exposure
- Throttling caps from pre-integration roaming agreements still active for devices now treated as native
- Effect: 8-12% of devices hitting an artificial throughput ceiling
- Symptom: users reporting "slow" despite strong signal and low load
- Visible in telemetry: throughput floor inconsistent with PRB utilization

Timer mismatch under load
- T3412 / T3324 values differed between the two networks
- Devices crossing the boundary encountered inconsistent idle behavior
- Paging gaps: 6-9% of attempts during boundary mobility
- Aggregate paging success rate: 95.8% -- within target
- Per-device failure rate at boundary: 1 in 8 attempts
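That last pair of numbers is worth dwelling on. A minimal sketch of why the aggregate hid the problem: if boundary attempts are a small share of total paging volume, a healthy overall rate can coexist with devices failing 1 in 8 attempts at the boundary. The boundary share and split rates below are invented to reproduce the published aggregate; only the 95.8% and the 1-in-8 figure come from the telemetry above.

```python
# Illustrative only: a healthy aggregate paging success rate can mask a
# high failure rate for the small slice of attempts made during boundary
# mobility. All volumes and shares here are hypothetical.

total_attempts = 1_000_000
boundary_share = 0.04                       # assumed share of paging at the boundary
boundary_attempts = int(total_attempts * boundary_share)
interior_attempts = total_attempts - boundary_attempts

boundary_success = 0.875                    # ~1 in 8 failing at the boundary
interior_success = 0.961                    # healthy away from the boundary

successes = (boundary_attempts * boundary_success
             + interior_attempts * interior_success)
aggregate_rate = successes / total_attempts

print(f"aggregate paging success: {aggregate_rate:.1%}")       # ~95.8%, "within target"
print(f"boundary failure rate:    {1 - boundary_success:.1%}")  # 12.5%, i.e. 1 in 8
```

The aggregate KPI is arithmetically correct and operationally useless here; the failure lives in a slice too small to move the average.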
The shift from site-centric to path-centric
Fig 1 -- Site-centric vs path-centric: same network, different question. Site view shows all nodes healthy. Path view shows where the device actually experiences instability.

The question that unlocked most of the hard problems was not "is this cell healthy?" It was "what does the end-to-end path actually look like for a device right now?" Cells were healthy. Nodes were healthy. Paths were not. The instability lived in the transitions between them, and it was invisible to any monitoring tool pointed at individual elements.

Path-level analysis -- what it exposed that cell KPIs did not

Element-level view (each healthy in isolation):
- Cell A handover success rate: 96.2% (within target)
- Transport node utilization: 44% (healthy)
- Core anchor availability: 99.7% (healthy)

Path-level view (device crossing all three during mobility):
- Anchor change event: triggered
- Transport routing update: 340 ms
- Core re-anchor: 280 ms
- NR SCell re-addition: 420 ms
- Total path re-establishment: 1,040 ms

During this window the user plane rode a fallback path only. The application layer perceived a 1 s interruption. The KPI view recorded no failure at any individual node.
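A minimal sketch of what stitching that path view together looks like, assuming per-device event records tagged with a domain and duration. The PathEvent record and its field names are hypothetical; the three durations are the ones quoted above. The interruption only appears once the transport, core, and RAN events for one device are summed along the path:

```python
from dataclasses import dataclass

@dataclass
class PathEvent:
    domain: str       # "transport", "core", "ran" -- hypothetical tagging scheme
    name: str
    duration_ms: int

# One device's anchor-change sequence, stitched across three domains.
# Each element would pass its own node-level KPI check in isolation.
events = [
    PathEvent("transport", "routing_update",  340),
    PathEvent("core",      "re_anchor",       280),
    PathEvent("ran",       "nr_scell_re_add", 420),
]

total_ms = sum(e.duration_ms for e in events)
print(f"path re-establishment: {total_ms} ms")   # 1,040 ms -- the ~1 s the user feels

# No single domain exceeds its own threshold; only the path-level sum does.
for e in events:
    print(f"  {e.domain:<9} {e.name:<16} {e.duration_ms} ms  (healthy per node KPI)")
```

The design point is the join key: events are grouped by device and ordered in time, not grouped by element, which is why no element-scoped dashboard could have shown the sum.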
Progressive normalization as the integration model
Fig 2 -- Progressive normalization: integration treated as a continuous behavioral process, not a cutover event.

The approach that worked was not aggressive re-engineering. Forcing immediate convergence would have replaced one set of assumptions with another, just as untested. Progressive normalization meant making mobility decisions predictable before making them faster, and confirming each step through live telemetry before taking the next one.

| Normalization phase | What was targeted | How success was confirmed |
| --- | --- | --- |
| Constraint removal | Legacy throttling caps, stale routing logic, and roaming-era rate limits still applied to native devices | Throughput distribution shifted: floor devices moved into the normal range; telemetry confirmed no adverse interaction |
| Timer alignment | T3412, T3324, and inactivity timers synchronized across both networks, at boundary cells first | Paging gap rate at boundary fell from 6.8% to 1.2%; per-device failure rate reduced before full rollout |
| Mobility harmonization | Measurement reporting thresholds aligned at inter-network boundary zones | Boundary HO failure rate fell from 14% to 2.8%; confirmed under load before expanding to the full boundary |
| Path validation | End-to-end re-establishment time measured per device across all transition types | p95 path re-establishment below 400 ms for all device population segments |
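One way to make "confirm each step before taking the next" concrete is a telemetry gate: a change rolls forward only if the per-device metric at the pilot boundary cells improves past a threshold. A hedged sketch, with hypothetical function names, attempt counts, and gate criteria; the 6.8% to 1.2% shift comes from the timer-alignment row above.

```python
# Hypothetical rollout gate: compare the boundary paging-gap rate before
# and after a timer-alignment change at pilot cells, and only expand the
# change if per-device behavior actually improved.

def paging_gap_rate(attempts: int, gaps: int) -> float:
    return gaps / attempts if attempts else 0.0

def gate_passes(before: float, after: float,
                target: float = 0.02, min_improvement: float = 0.5) -> bool:
    """Expand only if the rate is under target AND has at least halved."""
    return after <= target and after <= before * (1 - min_improvement)

before = paging_gap_rate(attempts=52_000, gaps=3_536)   # ~6.8% pre-alignment
after  = paging_gap_rate(attempts=49_000, gaps=588)     # ~1.2% post-alignment

if gate_passes(before, after):
    print(f"gate passed: {before:.1%} -> {after:.1%}; expand to next boundary zone")
else:
    print(f"gate failed: {before:.1%} -> {after:.1%}; hold and re-diagnose")
```

The thresholds here are placeholders; the structure is the point: the gate measures behavior, not task completion.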
Three working principles held across the program.

01
Every large-scale change was treated as a hypothesis that needed behavioral confirmation under live conditions. Not a milestone to be checked off. The question after each change was not "did it complete?" but "does the path behave differently for actual devices?"
02
The problems that spanned RAN, transport, and core required teams from all three domains to look at the same path-level view simultaneously. No single team had enough visibility to solve them alone. The telemetry infrastructure was what made joint diagnosis possible rather than a series of isolated investigations that blamed each other's layer.
03
Customer experience was the only yardstick that mattered. Feature completion and traffic migration milestones were useful markers, but neither confirmed that users experienced fewer interruptions during movement, faster recovery from transient events, or consistent performance regardless of which legacy footprint they originated from. That confirmation required measuring at the device level, not the network level.
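Measured at the device level, the check from the normalization table becomes a per-segment percentile rather than a network-wide average. A sketch under stated assumptions: the samples and the segment labels ("legacy_a", "legacy_b" for devices originating from each pre-merger footprint) are invented; the 400 ms p95 target comes from the path-validation row.

```python
import math
from collections import defaultdict

# (device_segment, path_re_establishment_ms) samples -- values illustrative.
samples = [
    ("legacy_a", 180), ("legacy_a", 240), ("legacy_a", 390), ("legacy_a", 210),
    ("legacy_b", 200), ("legacy_b", 310), ("legacy_b", 520), ("legacy_b", 280),
]

P95_TARGET_MS = 400   # threshold from the path validation phase

def p95(values: list) -> int:
    """Nearest-rank 95th percentile: robust even for small samples."""
    vals = sorted(values)
    rank = math.ceil(0.95 * len(vals))   # 1-based nearest rank
    return vals[rank - 1]

by_segment = defaultdict(list)
for segment, ms in samples:
    by_segment[segment].append(ms)

# A network-wide mean (~291 ms here) would pass; the per-segment percentile
# shows which legacy population still sees long interruptions.
for segment, values in sorted(by_segment.items()):
    v = p95(values)
    status = "within target" if v <= P95_TARGET_MS else "FAILS target"
    print(f"{segment}: p95 = {v} ms ({status})")
```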

Network integrations do not fail because engineers lack tools. They fail when small inconsistencies compound at scale in ways that no one planned for, because the assumptions behind each inconsistency were never made visible. Handled with patience and high-fidelity observability, integration becomes more than a merger task. It becomes a forcing function that improves the network for everyone who follows.

The most valuable outcome was not a unified network. It was a stronger operational discipline. Observability, cross-domain coordination, and the habit of confirming behavioral hypotheses before scaling them -- these carried forward into every subsequent program. The integration did not go perfectly. None of them do. What mattered was that when assumptions broke, the visibility existed to find them quickly and the discipline existed to fix them carefully.

Network Integration  ·  5G  ·  Observability  ·  Telemetry  ·  RAN Optimization  ·  Performance Engineering  ·  Telecommunications
