A couple of months ago I spent an evening playing with Transmission (the BitTorrent client for Mac), and I made a number of changes to my home network in order to get it working: updating the firmware in my router, putting my DSL modem into bridging mode, enabling UPnP, and disabling the built-in firewall on my MacBook. The situation got a little better in that UPnP was working, but a couple of days ago I realized I couldn't do a Mercurial update of the source code from www.linuxtv.org without the tool hanging indefinitely mid-download. In fact, I discovered that I couldn't even download the tarball from there without it hanging in Firefox. I confirmed it was not the linuxtv.org website by "borrowing" some wireless from my neighbor, and confirmed it was not any particular workstation, since it happened on both my Linux desktop and my Mac laptop.
Tonight, I had had enough, so I sat down and started to troubleshoot the problem. If you don't enjoy fun stories of network troubleshooting, you might as well stop reading.
My basic setup is a Westell 6100 DSL modem, which has built-in NAT and a firewall, connected to my Belkin F5D7230-4 wireless router, behind which sit my various PCs. I started by looking at the logs in the Belkin router, since that is the device closest to the PCs. Here's what I found:
Hmmm.... For some reason the router is renewing its DHCP address very frequently. In fact, it does it exactly every 30 seconds. And why does it think the DHCP lease is only 300 seconds? And why does it do a DHCP release followed by a renew, instead of just a DHCP renew? I logged in to the DSL modem and confirmed the internal DHCP server is set to issue leases of 1 day (51000 seconds).
Well, let's start by seeing if there is any new firmware. I updated the firmware two months ago to a release from July, so I guess it's possible that there's something newer. In fact, release 9.0.10 came out in November, and guess what the release notes say about the fixes:
Yeah, duh. Ok. So I updated the firmware, and it no longer renews every 30 seconds. Now it renews every 46 seconds. I guess we're moving in the right direction at least.
Now I'm starting to wonder about that 300 second lease time. Those of you familiar with RFC 2131 will know that the "lease time" is actually the worst case before the address can be issued to another party, and that client implementations are supposed to renew well before it runs out. Section 4.4.5 gives a default renewal time of (lease * 0.5), but implementations may vary. So, I break out an old 10Mb hub, stick it between the DSL modem and the router, and turn on Wireshark:
Yup, the router really is doing a release/renew, and the server really is issuing a 300 second lease.
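To put those numbers in perspective, here is a minimal sketch of the default timers from RFC 2131 section 4.4.5, where T1 (renewal) defaults to 0.5 of the lease and T2 (rebinding) to 0.875 of the lease. The function name dhcp_timers is my own invention for illustration, and a server is free to hand out different T1/T2 values, so treat this as the defaults only:

```python
def dhcp_timers(lease_seconds):
    """Return the RFC 2131 default (T1, T2) timers for a lease, in seconds.

    dhcp_timers is a hypothetical helper, not part of any router firmware.
    """
    t1 = lease_seconds * 0.5    # T1: when the client should start renewing
    t2 = lease_seconds * 0.875  # T2: when the client falls back to rebinding
    return t1, t2


# The two lease durations seen in this post.
for lease in (300, 51000):
    t1, t2 = dhcp_timers(lease)
    print(f"lease={lease}s T1={t1}s T2={t2}s")
```

Even with the short 300 second lease, the default renewal point would be 150 seconds, so renewing every 30 or 46 seconds is far more aggressive than the RFC defaults call for.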
From my time doing development on 3Com's routers, I have seen cases where a router that implements both DHCP and NAT will flush the NAT table entries associated with a given IP when the client does a DHCP release. They do this to ensure that one PC doesn't inherit the NAT entries of a previous PC that was bound to that IP address. A side effect, though, is that if the same client gets the address back, all of the state for its sessions in progress is gone.
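That flush-on-release behavior can be sketched with a toy model. This is only an illustration of the idea, not how the Westell or any particular router implements it; the ToyNat class, its method names, and the addresses are all made up for the example:

```python
class ToyNat:
    """Toy model of a box that does both NAT and DHCP (hypothetical)."""

    def __init__(self):
        # Maps (client_ip, client_port) -> remote endpoint for live sessions.
        self.sessions = {}

    def open_session(self, client_ip, client_port, remote):
        self.sessions[(client_ip, client_port)] = remote

    def dhcp_release(self, client_ip):
        # Drop every entry tied to the released address so the next holder
        # of that IP cannot inherit someone else's sessions.
        self.sessions = {
            key: remote
            for key, remote in self.sessions.items()
            if key[0] != client_ip
        }

    def is_active(self, client_ip, client_port):
        return (client_ip, client_port) in self.sessions


nat = ToyNat()
# Assumed addresses and ports, purely for illustration.
nat.open_session("192.168.2.10", 50000, ("www.linuxtv.org", 80))
nat.open_session("192.168.2.11", 50001, ("www.linuxtv.org", 80))

# The first client releases its lease, then immediately asks for it back.
nat.dhcp_release("192.168.2.10")

print(nat.is_active("192.168.2.10", 50000))  # the download in progress lost its state
print(nat.is_active("192.168.2.11", 50001))  # other clients are unaffected
```

In this model a client that releases and re-acquires its address every 46 seconds would lose any transfer that takes longer than that, which matches the hanging downloads.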
Also, you might notice that the address issued isn't a private address on the 192.168.2.x network. It looks like a real public network address, and in fact it is. So back to the Westell DSL modem interface. It looks like even though I have DHCP configured for a 1-day lease, this doesn't apply to the DMZ. I had moved the router into the DMZ when I was playing with Transmission because I figured the NAT was flaky and I wanted to just use the modem as a bridge.
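If you want to check an address like this without eyeballing it, Python's standard ipaddress module can tell private from public. A small sketch follows; both addresses are placeholders I picked for the example (the real ones from my capture aren't reproduced here), and note that is_private covers more than just the RFC 1918 ranges, so "public" is used loosely:

```python
import ipaddress


def describe(addr):
    """Label an address as 'private' or 'public' (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    return "private" if ip.is_private else "public"


# Placeholder addresses: one on the 192.168.2.x LAN, one well-known public IP.
print(describe("192.168.2.10"))
print(describe("8.8.8.8"))
```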
Turn off the DMZ, and like magic the lease time jumps back to 51000 and it stops renewing every 46 seconds. Mercurial can download source code. Firefox doesn't hang on downloads. All is well in the world. :-)
So why does the router do a release/renew instead of just a renew, causing the IP to be released and an interruption of service? Well, probably because it's stupid.
Why doesn't the Westell modem follow its DHCP configuration for addresses in the DMZ like it does for private addresses? Well, probably because it's stupid.
In the end I found a known firmware bug in the router, a protocol violation in the router's DHCP, and what appears to be a configuration bug in the DSL modem's DHCP server. I was really expecting it to be one relatively simple problem.