When one vendor's VoIP media/trunking gateway is talking to another
of the same type, it uses 5 ms packetization (ptime) by default. That
is, each RTP packet carries only 5 ms of audio. At first this
sounds crazy. The efficiency is awful! The bandwidth across Ethernet
for G.711 goes to 188 kbps per call per direction! It also means that
a single call imposes 400 packets per second (PPS) -- 200 in each
direction -- on a router along the path.
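The arithmetic behind those numbers can be sketched in a few lines. This is a back-of-the-envelope calculation, not vendor-measured data; the header sizes are the usual IP/UDP/RTP and Ethernet framing constants, including preamble and inter-frame gap (which is why the figure comes out near 188 kbps rather than ~157 kbps).

```python
G711_RATE_BPS = 64_000           # G.711 codec bit rate
IP_UDP_RTP = 20 + 8 + 12         # IPv4 + UDP + RTP header bytes
ETHERNET = 14 + 4 + 8 + 12       # header + FCS + preamble + inter-frame gap

def call_load(ptime_ms):
    """Return (packets/sec, wire kbps) for one direction of a G.711 call."""
    pps = 1000 // ptime_ms
    payload = G711_RATE_BPS // 8 * ptime_ms // 1000   # audio bytes per packet
    wire_bytes = payload + IP_UDP_RTP + ETHERNET
    return pps, pps * wire_bytes * 8 / 1000

pps, kbps = call_load(5)
print(pps, kbps)   # 200 pps per direction (400 both ways), ~188.8 kbps
```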
20 ms packetization is "normal". But consider one bane of VoIP
providers: echo. The greater the delay between when I talk and when I
hear myself, the more annoying the echo is.
The ptime contributes to the delay. In the best case, total echo delay
due to VoIP will be 2*ptime+rtt, where rtt is the round-trip time
across the IP and TDM networks from one VoIP endpoint to the other.
A typical jitter buffer size is 3*ptime, as well. Since there are two
jitter buffers, that adds 2*3*ptime to the echo delay.
All added together, the echo tail duration could be 8*ptime+rtt.
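Spelled out as a tiny function (the 3*ptime-per-jitter-buffer figure is the typical sizing assumed above, not a universal constant):

```python
def echo_tail_ms(ptime_ms, rtt_ms):
    # 2*ptime for packetization at the two ends, plus a 3*ptime jitter
    # buffer at each end, plus the network round-trip time
    return 2 * ptime_ms + 2 * 3 * ptime_ms + rtt_ms

print(echo_tail_ms(20, 30))  # 190 ms
print(echo_tail_ms(5, 30))   # 70 ms
```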
With the standard 20 ms ptime, and an engineered network with 30 ms
network RTT, my echo tail length could be 190 ms. Imagine talking in
a room with hard walls, floors, and ceilings, 106 feet across. You'd
hear yourself echo back.
If we drop this to 5 ms ptime, we drop the talker-echo delay to 70 ms.
Now the echoey room size has dropped to 39 feet across, and you
probably don't really notice that echo. But we've significantly
increased the demands on the network -- a four-fold increase in PPS,
and roughly 200 kbps more bandwidth per call.
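The room-size analogy is just the echo delay converted to distance. A quick sanity check, assuming sound travels about 1116 ft/s (a common sea-level figure; the exact value shifts the answer a few feet):

```python
SPEED_OF_SOUND_FT_S = 1116

def room_width_ft(echo_delay_ms):
    # The sound travels to the far wall and back, so halve the distance.
    return SPEED_OF_SOUND_FT_S * echo_delay_ms / 1000 / 2

print(round(room_width_ft(190)))  # ~106 ft, the 20 ms ptime case
print(round(room_width_ft(70)))   # ~39 ft, the 5 ms ptime case
```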
Now let's throw in a brand-new Foundry MLX network with 10-Gigabit-
Ethernet pipes and 2-Billion PPS capacity. Now what do you think about
188 kbps / call?