Suppose a system call such as bind() fails in the sockd subprocess in
request_sockd_fd(). sockd will suicide(). This will send a SIGCHLD to
the master process, which the master process should respond to by
calling suicide(), forcing a process supervisor to respawn the entire
ndhc program.
But, this doesn't reliably happen prior to this commit because of the
interaction between request_sock_fd() and signalfd() [or equivalently
self-pipe-trick] signal handling.
request_sock_fd() makes ndhc-master synchronously wait for a response
from sockd via safe_recvmsg(). The normal goto-like signal handling
path is suppressed when using signalfd() , so when SIGCHLD is received,
it will not be handled until io is dispatched for the signalfd or pipe.
But such code will never be reached because ndhc-master is waiting in
safe_recvmsg() and thus never polls signal fd status.
So, revert to using traditional POSIX sigaction() for SIGCHLD, which
provides exactly the required behavior for proper functioning.
Apparently I had forgotten the counter-intuitive semantics of
signalfd(): it's necessary to BLOCK the signals that will be
handled exclusively by signalfd() so that the default POSIX
signal handling mechanism won't intercept the signals first.
The lack of response to ctrl+c is a legitimate bug that is
now properly fixed; ba046c02c729c fixed that issue, but
regressed the handling of other signals.
If we get no response to three renews (unicast), switch to sending
rebinds (broadcast). Servers are supposed to always reply with
a DHCPACK or DHCPNAK even if the server doesn't update its internal
lease duration database, so this behavior should be RFC compliant.
'(c)' may not be a valid substitute for 'Copyright' in some legal
domains/interpretations. So be safe, since I obviously am asserting
copyright on my legal work.
It breaks with the existing whitelists on the latest glibc and is
just too much maintenance burden. It also causes the most questions
for new users.
Something like openbsd's pledge() would be fine, but I have no
intention of maintaining such a thing.
Most of the value-gain would come from disallowing high-risk
syscalls like ptrace() and the perf syscalls, anyway.
ndhc already uses extensive defense-in-depth and wasn't using
seccomp on non-(x86|x86-64) platforms, so it's not a huge loss.
If carrier is lost before network fingerprinting is complete, we
have a few problems; first, we don't know whether the network has
changed underneath us. Second, we've not yet configured the
interface properties, and it is not unlikely that doing so will
fail as the underlying network device may have been destroyed
and recreated during this time (eg, if ethtool has been run at
start-up time).
Thus, the safest reaction is to terminate and force a supervisor
respawn. It is best to do this once carrier recovers, not when
the carrier is lost, as it is more likely to minimize delays.
ARP packets aren't split across multiple receive events, so
reply_offset is pointless, and we implicitly assume that the
previous ARP packet data is still available after a forced sleep.
The gateway/router MAC fingerprinting could perhaps be done more
robustly in the face of suspend or carrier loss, but the time window
in which things could get confused is very small and I would rather
just rely on supervisor respawn in that case.
Even this case I don't think I've ever seen.
The previous approach would desynchronize the state machine if the
carrier is paused after receiving the lease but before sending the
announce, since we have received a lease already.
This change is an improvement but is still not ideal.
We instead check carrier status as needed. This approach is more
robust. For a simple example, imagine link state changes that happen
while the machine is suspended.