Path MTU Discovery and Filtering ICMP
Marc Slemko <[email protected]>
Created: Thursday, January 18 1998
Last Modified:

This document explains the details of how path MTU discovery (PMTU-D) combined with filtering ICMP messages can result in connectivity problems. If you are familiar with the terms discussed

Let's start by defining what we are talking about

MTU
The maximum transmission unit is a link layer restriction on the maximum number of bytes of data in a single transmission (ie. frame, cell, packet, depending on the terminology). The below table shows some typical values for MTUs, taken from RFC-1191:
MTUWhere Commonly Used
65535Hyperchannel
1791416 Mbit/sec token ring
8166Token Bus (IEEE 802.4)
44644 Mbit/sec token ring (IEEE 802.5)
1500Ethernet
1500PPP (typical; can vary widely)
576X.25 Networks
Path MTU
The smallest MTU of any link on the current path between two hosts. This may change over time since the route between two hosts, especially on the Internet, may change over time. It is not necessarily symmetric and can even vary for different types of traffic from the same host.
Fragmentation
When a packet is too large to be sent across a link as a single unit, a router can fragment the packet. This means that it splits it into multiple parts which contain enough information for the receiver to glue them together again. Note that this is not done on a hop-by-hop basis, but once fragmented a packet will not be put back together until it reaches its destination. Fragmentation is undesirable for numerous reasons, including:
DF (Don't Fragment) bit
This is a bit in the IP header that can be set to indicate that the packet should not be fragmented by routers, but instead an ICMP "can't fragment" error is returned sent to the sender and the packet is dropped.
ICMP Can't Fragment Error
This error (type 3 (destination unreachable), code 4 (fragmentation needed but don't-fragment bit set)) is returned by a router when it receives a packet that is too large for it to forward and the DF bit is set. The packet is dropped and the ICMP error is sent back to the origin host. Normally, this tells the origin host that it needs to reduce the size of its packets if it wants to get through. Recent systems also include the MTU of the next hop in the ICMP message so the source knows how big its packets can be. Note that this error is only sent if the DF bit is set; otherwise, packets are just fragmented and passed through.
MSS
The MSS is the maximum segment size. It can be announced during the establishment of a TCP connection to indicate to the other end the largest amount of data in one packet that should be sent by the remote system. Normally the packet generated will be 40 bytes larger than this; 20 bytes for the IP header and 20 for the TCP header. Most systems announce a MSS that is determined from the MTU on the interface that the traffic to the remote system passes out from the system through.
Path MTU Discovery (PMTU-D)
Now you know that Path MTUs vary. You know that fragmentation is bad. The solution? Well, one solution is Path MTU Discovery. The idea behind it is to send packets that are as large as possible while still avoiding fragmentation. A host does this by starting by sending packets that have a maximum size of the lesser of the local MTU or the MSS announced by the remote system. These packets are sent with the DF bit set. If there is some MTU between the two hosts which is too small to pass the packet successfully, then an ICMP can't fragment error will be sent back to the source. It will then know to lower the size; if the ICMP message includes the next hop MTU, it can pick the correct size for that link immediately, otherwise it has to guess.

The exact process that systems go through is somewhat more complicated to account for special circumstances. For full details, see RFC-1191.

A good indication of if a system is trying to do PMTU-D is to watch the packets it is sending with something like tcpdump or snoop and see if they have the DF bit set; if so, it is most likely trying to do PMTU-D.

Now, to the problem with ICMP filtering and PMTU-D

Now we get to the problem. Many network administrators have decided to filter ICMP at a router or firewall. There are valid (and many invalid) reasons for doing this, however it can cause problems. ICMP is an integral part of the Internet and can not be filtered without due consideration for the effects.

In this case, if the ICMP can't fragment errors can not get back to the source host due to a filter, the host will never know that the packets it is sending are too large. This means it will keep trying to send the same large packet, and it will keep being dropped--silently dropped from the view of any system on the other side of the filter. While a small handful of systems that implement PMTU-D also implement a way to detect such situations, most don't and even for those that do it has a negative impact on performance and the network.

If this is happening, typical symptoms include the ability for small packets (eg. request a very small web page) to get through, but larger ones (eg. a large web page) will simply hang. This situation can be confusing to the novice administrator because they obviously have some connectivity to the host, but it just stops working for no obvious reason on certain transfers.

There is one solution, and several workarounds, for this problem. They include:

So how can using RFC 1918 addresses for router links cause problems?

On many routers, a separate IP address in the same subnet is required for each end of a point to point link. This can use address space if there are a large number of such links. Since the actual address of the links doesn't appear to impact much, many people use RFC 1918 private address space for such links. The blocks included in this are:

10.0.0.0 - 10.255.255.255 (10/8 prefix)
172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
192.168.0.0 - 192.168.255.255 (192.168/16 prefix)

If you are using such addresses, then ICMP messages (including "can't fragment" errors) will normally be generated using such addresses. Since many networks filter incoming traffic from such reserved addresses, the net result is the same as if all ICMP were being filtered and can cause the same problems.