The previous approach would desynchronize the state machine if the
carrier is paused after receiving the lease but before sending the
announce, since we have received a lease already.
This change is an improvement but is still not ideal.
If changing interface properties fails after getting a lease, it is
possible under some strange conditions for the failure to be
persistent. This seems to happen if the carrier cycles off and on
several times during ndhc initialization.
Since this issue is very hard to replicate, the most conservative
thing to do here is to simply have ndhc exit so that it can
be respawned by a process supervisor.
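The policy described above can be sketched roughly as follows. This is only an illustration: the enum, the function names, and the log text are hypothetical and are not taken from ndhc's source.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result codes for an interface-change request;
 * ndhc's real interface differs. */
enum ifch_result { IFCH_OK = 0, IFCH_FAIL = -1 };

/* Returns 1 if the daemon should terminate after this result. */
static int should_exit_after_ifchange(enum ifch_result r)
{
    return r != IFCH_OK;
}

/* Rather than retrying a failure that may be persistent, exit with
 * a failure status and rely on the process supervisor to respawn us
 * with fresh state. */
static void handle_ifchange_result(enum ifch_result r)
{
    if (should_exit_after_ifchange(r)) {
        fprintf(stderr, "failed to set interface properties; exiting\n");
        exit(EXIT_FAILURE);
    }
}
```

This only helps when ndhc actually runs under a supervisor that restarts it on a nonzero exit.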
Logs of the issue in practice:
(carrier is down while the daemon is started here, it seems)
16:57:09.638979845 ndhc-ifch seccomp filter installed. Please disable seccomp if you
16:57:09.638989136 Discovering DHCP servers...
16:57:09.638991371 (send_dhcp_raw) carrier down; sendto would fail
16:57:09.638993318 Failed to send a discover request packet.
...
16:57:13.636519925 Discovering DHCP servers...
16:57:13.651462476 Received IP offer: X from server Y via Z
...
16:57:13.912592571 wan0: Gateway router set to: A
16:57:13.912607463 wan0: arp: Searching for dhcp server and gw addresses...
16:57:14.635532676 wan0: Carrier down.
17:04:32.983897760 wan0: arp: Still looking for gateway hardware address...
17:04:32.984158226 wan0: arp: Still looking for DHCP agent hardware address...
17:04:32.984781255 wan0: Interface is back. Revalidating lease...
17:04:32.985585501 wan0: arp: Gateway hardware address B
17:04:32.985590436 wan0: arp: DHCP agent hardware address C
17:04:38.234857403 wan0: arp: Still waiting for gateway to reply to arp ping...
17:04:38.235109016 wan0: arp: Still waiting for DHCP agent to reply to arp ping...
16:57:24.165620224 wan0: arp: Still waiting for gateway to reply to arp ping...
16:57:29.169621070 wan0: arp: DHCP agent and gateway didn't reply. Getting new lease.
16:57:29.217710616 wan0: Discovering DHCP servers...
16:57:29.249645130 wan0: Received IP offer: X from server Y via Z
16:57:29.249657203 wan0: Sending a selection request for X...
16:57:29.285632973 wan0: Received ACK: X from server Y via Z
16:57:29.297717159 wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.360249458 wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.435114526 wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.500473345 wan0: Lease of X obtained. Lease time is D seconds.
16:57:29.500485894 wan0: Failed to set the interface IP address and properties!
...
And the final two errors repeat. Restarting ndhc by hand instantly
fixes the issue.
So there's a lot going on -- bizarre clock skew, and carrier flickering
on and off.
This corrects a bug where stale dhcp packets would get reprocessed,
causing very bad behavior; an issue that was introduced in the
coroutine conversion.
This change makes it much easier to reason about ndhc's behavior
and properly handle errors.
It is a very large changeset, but there is no way to make this
sort of change incrementally. Lease acquisition has been tested
and works.
It is highly likely that some bugs were both introduced and
squashed here. Some obvious code cleanups will quickly follow.
If a packet send failed because the carrier went down without a
netlink notification, then assume the hardware carrier was lost while
the machine was suspended (e.g., ethernet cable pulled during suspend).
Simulate a netlink carrier down event and freeze the dhcp state
machine until a netlink carrier up event is received.
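A minimal sketch of this behavior, assuming a much simplified client state. The struct, the enum, and all function names here are hypothetical and do not reflect ndhc's real data structures.

```c
#include <stdbool.h>

/* Hypothetical, simplified client state. */
struct client {
    bool carrier_up; /* carrier state as last reported (or simulated) */
    bool frozen;     /* true while the DHCP state machine is paused */
};

enum send_status { SEND_OK, SEND_CARRIER_DOWN, SEND_ERROR };

/* Shared handlers: used for real netlink events and simulated ones. */
static void on_carrier_down(struct client *c)
{
    c->carrier_up = false;
    c->frozen = true;
}

static void on_carrier_up(struct client *c)
{
    c->carrier_up = true;
    c->frozen = false;
}

/* Called after every send attempt. If the send failed because the
 * carrier is down but we still believe it is up, netlink never told
 * us, so synthesize the carrier-down event ourselves. */
static void after_send(struct client *c, enum send_status s)
{
    if (s == SEND_CARRIER_DOWN && c->carrier_up)
        on_carrier_down(c);
}

/* The DHCP state machine only advances while not frozen. */
static bool may_run_state_machine(const struct client *c)
{
    return !c->frozen;
}
```

Routing the simulated event through the same handler as a real netlink event keeps the two paths from diverging.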
The ARP code is not yet handling this issue everywhere, but the
window of opportunity for it to happen there is much shorter.
Linux will quietly proceed as if the data were sent even if the carrier
is down and nothing actually happened. There is still a tiny race
condition where the carrier could drop between the check and the actual
write, but we really can't do anything about that and it is a very
small race.
Mostly reverts the previous commit and instead teaches ndhc to properly
handle the case when it is communicating with a DHCP relay agent on
its local segment rather than directly with a DHCP server.
The network fingerprinting would never complete if the DHCP server was
on a different segment before this change, since it would be impossible
for the ARP messages sent by ndhc to ever reach the DHCP server
(and vice-versa).
Now just give up trying to find the hardware address after two tries
and assume that the DHCP server cannot be reached by ARP.
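The give-up rule can be sketched as a small counter. The limit of two tries comes from the text; the struct and function names are hypothetical and not ndhc's own.

```c
#include <stdbool.h>

#define ARP_SERVER_MAX_TRIES 2 /* give up after two unanswered tries */

/* Hypothetical bookkeeping for the DHCP server hardware address lookup. */
struct arp_lookup {
    int tries;        /* requests that timed out without a reply */
    bool resolved;    /* a reply with the hardware address arrived */
    bool unreachable; /* concluded that ARP cannot reach the server */
};

/* Called when an ARP request for the DHCP server times out.
 * Returns true if another request should be sent, false if the
 * lookup is finished (either resolved or given up on). */
static bool arp_server_timeout(struct arp_lookup *l)
{
    if (l->resolved || l->unreachable)
        return false;
    if (++l->tries >= ARP_SERVER_MAX_TRIES) {
        /* Most likely the server sits behind a relay agent on
         * another segment; stop waiting for a reply. */
        l->unreachable = true;
        return false;
    }
    return true;
}
```

With this accounting, the first timeout triggers one retry and the second timeout marks the server as unreachable, so fingerprinting can proceed without its hardware address.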
An alternative would be to fingerprint the relay agent instead, but
to do so would require a lot more work as the giaddr field is only
meaningful in the client->server message path, not in the
server->client path. Thus it would require gathering the source IP
for DHCP replies sent by unicast or broadcast and ferrying along
this information to the ARP checking code where it would be used
in place of the DHCP server address.
This is entirely possible to do, but is quite a bit more work.