GSM / WCDMA  ·  OSS analysis

Networks could look perfectly healthy on OSS dashboards while subscribers kept complaining about dropped calls and poor voice quality. Call setup success rate, overall drop rate, and congestion KPIs stayed comfortably within target. Real users were still experiencing frequent interruptions.

The disconnect almost always came down to one thing: aggregation hiding real problems.

How averages conceal cell-level failure

Performance KPIs were typically calculated and reported as averages at BSC or cluster level. In dense urban environments, a handful of problematic cells could fail repeatedly without moving the needle on the aggregated numbers. Their impact was diluted by volume. From the customer's perspective, those few cells were the only thing that mattered.

Example: cluster of 40 cells, BSC-level KPI reporting

  Call drop rate (cluster average): 1.8%  -- within target
  Call drop rate (worst 3 cells): 12%, 9%, 11%  -- far outside target
  These 3 cells carry ~8% of cluster traffic
  Their failures are absorbed by the remaining 37 cells
  Dashboard stays green
  Affected subscribers see roughly 6x the drop rate of the cluster average
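The arithmetic of the dilution is easy to reproduce. A minimal Python sketch, assuming the three worst cells split their ~8% traffic share evenly and the other 37 cells sit at an assumed ~1.0% drop rate (a value chosen to match the figures above):

# Traffic-weighted cluster average, with assumed per-cell values.
bad_cells = [0.12, 0.09, 0.11]   # drop rates of the worst 3 cells
bad_share = 0.08                 # ~8% of cluster traffic, split evenly
healthy_rate = 0.010             # assumed drop rate of the other 37 cells

weighted_bad = sum(r * bad_share / len(bad_cells) for r in bad_cells)
cluster_avg = weighted_bad + (1 - bad_share) * healthy_rate
print(f"cluster average: {cluster_avg:.1%}")   # ~1.8% -- dashboard stays green

Three cells failing at 9-12% barely move the cluster number because they carry so little of the total traffic.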

The real issues surfaced only when we stopped looking at cluster averages and drilled down to per-cell counters. Many cells carrying normal traffic volumes showed abnormally high handover failure rates, frequent abnormal TCH releases, or recurring timing advance outliers. None of this was visible at the BSC rollup.
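As a rough illustration of that drill-down, the sketch below flags per-cell busy-hour outliers. The counter names, cell IDs, and the 2% threshold are illustrative, not tied to any vendor's OSS:

# Hypothetical per-cell busy-hour counters:
# (tch_drops, tch_seizures, ho_attempts, ho_failures)
busy_hour = {
    "CELL_017A": (11, 1040, 420, 9),
    "CELL_023C": (96, 810, 330, 58),   # one bad cell, invisible in the rollup
    "CELL_031B": (14, 1220, 510, 12),
}

DROP_TARGET = 0.02   # illustrative 2% per-cell target

for cell, (drops, seizures, ho_att, ho_fail) in busy_hour.items():
    drop_rate = drops / seizures
    ho_fail_rate = ho_fail / ho_att if ho_att else 0.0
    if drop_rate > DROP_TARGET:
        print(f"{cell}: drop {drop_rate:.1%}, HO fail {ho_fail_rate:.1%}")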

Where these cells were typically found
Common locations for masked cell-level failures
Sector boundaries: coverage overlap, competing pilots, inconsistent dominance
Indoor-penetration zones: basement offices, dense concrete buildings
High-mobility corridors: highways, rail lines, fast UE state transitions
Cell edge areas: timing advance limits, uplink budget constraints
Recently modified cells: parameter changes not validated post-deployment

These were not random. Once the pattern became clear, knowing where to look cut investigation time significantly. The cells in these locations needed individual attention, not cluster-average treatment.

Uniform parameters in non-uniform environments

A frequent contributor was the use of identical parameter templates across sites. Handover margins, power control settings, assignment thresholds, and frequency reuse patterns applied uniformly regardless of local RF conditions. Clutter type, building density, and user mobility varied widely across a cluster. The parameters did not.

Template parameter applied uniformly:
  HO margin: 6 dB (reasonable for open suburban)

Applied to dense urban high-rise sector:
  UE indoor, deep penetration loss
  Serving cell RXLEV already marginal
  6 dB margin: handover fires too late
  Result: radio link failure before candidate cell triggered

Same template, highway sector:
  Fast-moving UE, short cell dwell time
  6 dB margin: handover fires too slow for velocity
  Result: drop at cell edge, every peak hour
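Rough numbers make the highway case concrete. The back-of-the-envelope sketch below assumes a path-loss gradient and UE speed that are illustrative, not field measurements:

# How long a fixed handover margin takes to trigger at a given speed.
def seconds_to_trigger(margin_db, gradient_db_per_m, speed_mps):
    """Time for the neighbor-vs-serving level difference to cross the margin."""
    return margin_db / (gradient_db_per_m * speed_mps)

MARGIN_DB = 6.0

# Highway sector: ~120 km/h (33 m/s), assumed shallow gradient of 0.05 dB/m.
t = seconds_to_trigger(MARGIN_DB, 0.05, 33.0)
print(f"highway: {t:.1f} s to trigger")   # ~3.6 s, before measurement averaging
                                          # -- the UE can exit the overlap zone first

# Dense urban indoor is the opposite failure: speed is low, but serving RXLEV
# starts near the radio-link-failure floor, so waiting for a full 6 dB
# difference often means the link fails before any candidate triggers.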

Under actual traffic, these mismatches produced late handovers, failed channel assignments, and unstable calls. Aggregated KPIs rarely flagged them because the total volume of affected calls was small relative to the cluster.

Cell-level forensic analysis

The approach that worked was per-cell counter analysis combined with failure classification. Not cluster averages. Not BSC rollups. Individual cells, sorted by failure rate, with failure type broken out.

Counter pull: per-cell, busy hour only (not 24hr average)

Sort by:
  abnormal TCH release rate
  HO failure rate (outgoing + incoming separately)
  TA distribution outliers (cells with high % TA > 60)
  SDCCH drop rate (often precedes TCH problems)

Classify failure type per cell (sketched below):
  radio link timeout -- coverage or interference
  HO failure -- neighbor or parameter
  congestion release -- capacity
  TA timeout -- coverage boundary issue

Each type points to a different root cause.
Each requires a different fix.
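A minimal sketch of that classification step, assuming per-cell failure counts have already been broken out by release cause (names and counts here are illustrative):

# Map each failure type to the root-cause bucket it points at.
ROOT_CAUSE = {
    "radio_link_timeout": "coverage or interference",
    "ho_failure":         "neighbor list or handover parameters",
    "congestion_release": "capacity",
    "ta_timeout":         "coverage boundary",
}

def classify(cell_failures):
    """Return the root-cause bucket for the cell's dominant failure type."""
    dominant = max(cell_failures, key=cell_failures.get)
    return f"{dominant} -> {ROOT_CAUSE[dominant]}"

print(classify({"radio_link_timeout": 84, "ho_failure": 12,
                "congestion_release": 3, "ta_timeout": 9}))
# radio_link_timeout -> coverage or interference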

Fixing a radio link timeout with a parameter change designed for handover failures wastes time and makes the network less stable. The classification step was not optional.

The lesson from this period was that green dashboards are not a reliable indicator of customer experience. Aggregate KPIs describe the average behavior of the majority of traffic. The customers experiencing poor service are almost always in the minority of cells that averages conceal. Finding them required leaving the dashboard and working at cell level, with counters broken out by failure type, pulled during the hours when the problems actually occurred.

