If carrier is lost before network fingerprinting is complete, we
have a few problems; first, we don't know whether the network has
changed underneath us. Second, we've not yet configured the
interface properties, and it is not unlikely that doing so will
fail as the underlying network device may have been destroyed
and recreated during this time (eg, if ethtool has been run at
start-up time).
Thus, the safest reaction is to terminate and force a supervisor
respawn. It is best to do this once carrier recovers, not when
the carrier is lost, as it is more likely to minimize delays.
ARP packets aren't split across multiple receive events, so
reply_offset is pointless, and we implicitly assume that the
previous ARP packet data is still available after a forced sleep.
The gateway/router MAC fingerprinting could perhaps be done more
robustly in the face of suspend or carrier loss, but the time window
in which things could get confused is very small and I would rather
just rely on supervisor respawn in that case.
Even this case I don't think I've ever seen.
The previous approach would desynchronize the state machine if the
carrier is paused after receiving the lease but before sending the
announce, since we have received a lease already.
This change is an improvement but is still not ideal.
We instead check carrier status as needed. This approach is more
robust. For a simple example, imagine link state changes that happen
while the machine is suspended.
This is fairly tricky, but fopen() almost surely interally calls malloc
when it creates the FILE* that it returns. I did promise that
ndhc doesn't call malloc after initialization, besides what libc may
do internally, but it feels a bit dishonest given that fopen() is
basically sure to do so on any general-purpose libc.
Thus, eliminate it and just use direct POSIX i/o.
With the previous pidfile changes, ndhc doesn't use C fopen() at all.
In practice, this change won't really be noticeable as most libcs,
particularly with dynamic linking, will end up calling malloc themselves
during program initialization before main() is invoked.
There was no way to disable writing pidfiles before.
pidfiles are an unreliable method of tracking processes, anyway; process
supervisors are strongly recommended. If a pidfile is really needed, it
can be explicitly specified.
Before the exit code would be success and no error message would print,
and it required a bit of control flow tracing to determine what would
actually happen.
No direct functional change (unless the supervising process cares about
the return code of ndhc on exit).
If changing interface properties fails after getting a lease, it is
possible under some strange conditions for the failure to be
persistent. This seems to happen if the carrier cycles off and on
several times during ndhc initialization.
Since this issue is very hard to replicate, the most conservative
thing to do here is to simply have ndhc suicide itself so it can
be respawned by a process supervisor.
Logs of the issue in practice:
(carrier is down while the daemon is started here, it seems)
16:57:09.638979845 ndhc-ifch seccomp filter installed. Please disable seccomp if you
16:57:09.638989136 Discovering DHCP servers...
16:57:09.638991371 (send_dhcp_raw) carrier down; sendto would fail
16:57:09.638993318 Failed to send a discover request packet.
...
16:57:13.636519925 Discovering DHCP servers...
16:57:13.651462476 Received IP offer: X from server Y via Z
...
16:57:13.912592571 wan0: Gateway router set to: A
16:57:13.912607463 wan0: arp: Searching for dhcp server and gw addresses...
16:57:14.635532676 wan0: Carrier down.
17:04:32.983897760 wan0: arp: Still looking for gateway hardware address...
17:04:32.984158226 wan0: arp: Still looking for DHCP agent hardware address...
17:04:32.984781255 wan0: Interface is back. Revalidating lease...
17:04:32.985585501 wan0: arp: Gateway hardware address B
17:04:32.985590436 wan0: arp: DHCP agent hardware address C
17:04:38.234857403 wan0: arp: Still waiting for gateway to reply to arp ping...
17:04:38.235109016 wan0: arp: Still waiting for DHCP agent to reply to arp ping...
16:57:24.165620224 wan0: arp: Still waiting for gateway to reply to arp ping...
16:57:29.169621070 wan0: arp: DHCP agent and gateway didn't reply. Getting new lease.
16:57:29.217710616 wan0: Discovering DHCP servers...
16:57:29.249645130 wan0: Received IP offer: X from server Y via Z
16:57:29.249657203 wan0: Sending a selection request for X...
16:57:29.285632973 wan0: Received ACK: X from server Y via Z
16:57:29.297717159 wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.360249458 wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.435114526 wan0: arp: Probing for hosts that may conflict with our lease...
16:57:29.500473345 wan0: Lease of X obtained. Lease time is D seconds.
16:57:29.500485894 wan0: Failed to set the interface IP address and properties!
...
And the final two errors repeat. Restarting ndhc by hand instantly
fixes the issue.
So there's a lot going on -- bizzare clock skew, and carrier flickering
on and off.
This requires execute_buffer() and its callers to distinguish between
fatal and non-fatal errors. The -99 return value was already used for
non-recoverable errors that should force the daemon to restart, but the
execute_buffer() callers treated any non-success return as a fatal
error.
There a judgement call here on how to handle various error types. I
choose to assume that failures to set the IP address, netmask, broadcast
address, or default router are fatal errors. ndhc should be run from
process supervision, and this will trigger a daemon restart, which will
allow the machine to recover as soon as the problem (probably on the
dhcp server or local kernel state outside of ndhc's control) is
corrected.
This change corrects errors such as:
Discovering DHCP servers...
(process_client_socket) received invalid commands: 'carrier:;'
(send_dhcp_raw) carrier down; sendto would fail
Failed to send a discover request packet.
which happened if ndhc is started on a machine where the network
interface is down. After this change, ndhc should function as intended
by going to sleep until the carrier returns rather than terminating
itself to be restarted by the process supervisor until carrier returns.
Thus signed/unsigned conversion issues can't be a problem.
The extra range provided by ull isn't useful, either, since
we're dealing with uint32_t time_t seconds converted to ns,
which doesn't come close to exhausting the amount of post-epoch
ns that will be consumed until after 2100AD.
Lease times and arp timeouts are all calculated using long long,
but epoll takes its timeout argument as an int. Guard against
timeouts > INT_MAX but < UINT_MAX wrapping and causing spuriously
short timeouts when converted to a signed int.
This problem has been observed in the wild. Thanks to thypon
for a detailed strace that pointed me towards this issue.
Somewhere along the line it quit being set at the start of discovery
and was always 0. This is clearly not desired behavior.
Found by manual examination of packets while fuzzing the options
parser.
This corrects a bug where stale dhcp packets would get reprocessed,
causing very bad behavior; an issue that was introduced in the
coroutine conversion.
Some networks have multiple DHCP servers that don't respect the
serverid that is specified in DHCPREQUESTs.
Instead we simply check to see that the yiaddr matches.
While we're at it, ignore DHCP NAK during REQUESTING state. It
doesn't really make sense. Instead we should just wait for
timeout.