emo/ndhc - ndhc - Project Segfault Git

emo/ndhc

Author	SHA1	Message	Date
Nicholas J. Kain	1c2a39c544	Reorder client_config_t members to reduce structure padding.	2017-01-19 05:20:29 -05:00
Nicholas J. Kain	4fdde404aa	Remove unused client_config_t foreground variable.	2017-01-19 05:18:04 -05:00
Nicholas J. Kain	c38fd2be9b	Convert logical booleans in client_config_t to bool type.	2017-01-19 05:13:30 -05:00
Nicholas J. Kain	571b22c4b2	Rename client_state_t init variable to program_init. Easier to grep. No functional change.	2017-01-19 05:05:35 -05:00
Nicholas J. Kain	931530786b	Convert logically boolean client_state_t variables from uint8_t to bool.	2017-01-19 05:01:23 -05:00
Nicholas J. Kain	b8ee0bd5c2	Update copyright dates to 2017.	2017-01-13 20:15:27 -05:00
Nicholas J. Kain	29498f5341	Remove ifsPrevState and set non-infinite timeout on a send error. We instead check carrier status as needed. This approach is more robust. For a simple example, imagine link state changes that happen while the machine is suspended.	2017-01-13 20:15:27 -05:00
Nicholas J. Kain	c47630ffca	Rename check_carrier() to carrier_isup() and use bool return.	2017-01-12 05:25:15 -05:00
Nicholas J. Kain	1fc7bd3144	Eliminate fopen() in cfg.rl. Use unbuffered i/o instead. This is fairly tricky, but fopen() almost surely interally calls malloc when it creates the FILE* that it returns. I did promise that ndhc doesn't call malloc after initialization, besides what libc may do internally, but it feels a bit dishonest given that fopen() is basically sure to do so on any general-purpose libc. Thus, eliminate it and just use direct POSIX i/o. With the previous pidfile changes, ndhc doesn't use C fopen() at all. In practice, this change won't really be noticeable as most libcs, particularly with dynamic linking, will end up calling malloc themselves during program initialization before main() is invoked.	2016-05-06 18:21:13 -04:00
Nicholas J. Kain	04ec7c8f4b	Update to latest write_pid semantics and don't write pidfile by default. There was no way to disable writing pidfiles before. pidfiles are an unreliable method of tracking processes, anyway; process supervisors are strongly recommended. If a pidfile is really needed, it can be explicitly specified.	2016-05-06 15:00:31 -04:00
Nicholas J. Kain	a47a2feea1	Support resolv_conf.head and resolv_conf.tail. These will be used if they exist at initialization. Operations are all done by acting on pre-opened fds.	2016-05-06 06:45:16 -04:00
Nicholas J. Kain	313da14518	Update nl.[ch] to latest. No functional change for ndhc.	2016-05-06 06:36:01 -04:00
Nicholas J. Kain	a4ea869a54	README: Trivial style edits.	2015-12-27 01:00:18 -05:00
Nicholas J. Kain	5ab36719f1	If a fd closes unexpectedly in epoll, print error and exit with failure. Before the exit code would be success and no error message would print, and it required a bit of control flow tracing to determine what would actually happen. No direct functional change (unless the supervising process cares about the return code of ndhc on exit).	2015-12-25 02:44:54 -05:00
Nicholas J. Kain	ae16e26d00	arp: Fix case where changing interface properties consistently fails. If changing interface properties fails after getting a lease, it is possible under some strange conditions for the failure to be persistent. This seems to happen if the carrier cycles off and on several times during ndhc initialization. Since this issue is very hard to replicate, the most conservative thing to do here is to simply have ndhc suicide itself so it can be respawned by a process supervisor. Logs of the issue in practice: (carrier is down while the daemon is started here, it seems) 16:57:09.638979845 ndhc-ifch seccomp filter installed. Please disable seccomp if you 16:57:09.638989136 Discovering DHCP servers... 16:57:09.638991371 (send_dhcp_raw) carrier down; sendto would fail 16:57:09.638993318 Failed to send a discover request packet. ... 16:57:13.636519925 Discovering DHCP servers... 16:57:13.651462476 Received IP offer: X from server Y via Z ... 16:57:13.912592571 wan0: Gateway router set to: A 16:57:13.912607463 wan0: arp: Searching for dhcp server and gw addresses... 16:57:14.635532676 wan0: Carrier down. 17:04:32.983897760 wan0: arp: Still looking for gateway hardware address... 17:04:32.984158226 wan0: arp: Still looking for DHCP agent hardware address... 17:04:32.984781255 wan0: Interface is back. Revalidating lease... 17:04:32.985585501 wan0: arp: Gateway hardware address B 17:04:32.985590436 wan0: arp: DHCP agent hardware address C 17:04:38.234857403 wan0: arp: Still waiting for gateway to reply to arp ping... 17:04:38.235109016 wan0: arp: Still waiting for DHCP agent to reply to arp ping... 16:57:24.165620224 wan0: arp: Still waiting for gateway to reply to arp ping... 16:57:29.169621070 wan0: arp: DHCP agent and gateway didn't reply. Getting new lease. 16:57:29.217710616 wan0: Discovering DHCP servers... 16:57:29.249645130 wan0: Received IP offer: X from server Y via Z 16:57:29.249657203 wan0: Sending a selection request for X... 16:57:29.285632973 wan0: Received ACK: X from server Y via Z 16:57:29.297717159 wan0: arp: Probing for hosts that may conflict with our lease... 16:57:29.360249458 wan0: arp: Probing for hosts that may conflict with our lease... 16:57:29.435114526 wan0: arp: Probing for hosts that may conflict with our lease... 16:57:29.500473345 wan0: Lease of X obtained. Lease time is D seconds. 16:57:29.500485894 wan0: Failed to set the interface IP address and properties! ... And the final two errors repeat. Restarting ndhc by hand instantly fixes the issue. So there's a lot going on -- bizzare clock skew, and carrier flickering on and off.	2015-10-28 20:20:21 -04:00
Nicholas J. Kain	e0b5ff8eaf	perform_carrier() should not cause ifchd to terminate on failure. This requires execute_buffer() and its callers to distinguish between fatal and non-fatal errors. The -99 return value was already used for non-recoverable errors that should force the daemon to restart, but the execute_buffer() callers treated any non-success return as a fatal error. There a judgement call here on how to handle various error types. I choose to assume that failures to set the IP address, netmask, broadcast address, or default router are fatal errors. ndhc should be run from process supervision, and this will trigger a daemon restart, which will allow the machine to recover as soon as the problem (probably on the dhcp server or local kernel state outside of ndhc's control) is corrected. This change corrects errors such as: Discovering DHCP servers... (process_client_socket) received invalid commands: 'carrier:;' (send_dhcp_raw) carrier down; sendto would fail Failed to send a discover request packet. which happened if ndhc is started on a machine where the network interface is down. After this change, ndhc should function as intended by going to sleep until the carrier returns rather than terminating itself to be restarted by the process supervisor until carrier returns.	2015-07-03 00:49:13 -04:00
Nicholas J. Kain	12d8af4c67	Convert remaining ull time-types to ll. Thus signed/unsigned conversion issues can't be a problem. The extra range provided by ull isn't useful, either, since we're dealing with uint32_t time_t seconds converted to ns, which doesn't come close to exhausting the amount of post-epoch ns that will be consumed until after 2100AD.	2015-05-27 23:40:28 -04:00
Nicholas J. Kain	277f0f67c5	When converting timeout from ll to int, also guard against underflow.	2015-05-27 15:00:02 -04:00
Nicholas J. Kain	e02d30dcc5	renewTime, rebindTime should be unsigned long long. While we're at it, rearrange the client state struct to save some space lost to padding.	2015-05-27 13:11:23 -04:00
Nicholas J. Kain	abb1b54bfe	Fix an overflow that can cause spuriously short epoll timeouts. Lease times and arp timeouts are all calculated using long long, but epoll takes its timeout argument as an int. Guard against timeouts > INT_MAX but < UINT_MAX wrapping and causing spuriously short timeouts when converted to a signed int. This problem has been observed in the wild. Thanks to thypon for a detailed strace that pointed me towards this issue.	2015-05-27 12:58:42 -04:00
Nicholas J. Kain	ba875d4b2e	Failsafe should only trigger is the new timeout is also zero. This is what I get for rushing!	2015-05-27 12:35:16 -04:00
Nicholas J. Kain	f061a78a18	Fix dumb mistake in patch before last; epoll timeout is in ms, not s.	2015-05-27 12:29:46 -04:00
Nicholas J. Kain	2057d7cd18	Improve compatibility with clang.	2015-05-27 12:23:50 -04:00
Nicholas J. Kain	8273b383ab	Add a failsafe to prevent epoll busy-spin.	2015-05-27 12:23:16 -04:00
Nicholas J. Kain	6c136c3f85	Make add_option_(vendor\|hostname)() not use ndhc internals. This change makes it easier to fuzz test, but should have no functional effect on ndhc's behavior.	2015-02-20 03:58:25 -05:00
Nicholas J. Kain	646931a2bf	xid should be regenerated whenever we start a new discover cycle. Somewhere along the line it quit being set at the start of discovery and was always 0. This is clearly not desired behavior. Found by manual examination of packets while fuzzing the options parser.	2015-02-20 03:48:13 -05:00
Nicholas J. Kain	b8df1b33f5	Fix the timeouts for rebinding and renewing. Now renew/rebind requests will be sent every 60+/-10s.	2015-02-18 11:03:58 -05:00
Nicholas J. Kain	b3bd13d45f	Fix the return values of dhcp_packet_get and arp_packet_get. This corrects a bug where stale dhcp packets would get reprocessed, causing very bad behavior; an issue that was introduced in the coroutine conversion.	2015-02-18 11:02:13 -05:00
Nicholas J. Kain	a627e9dc9c	Minor changes to log messages and comments.	2015-02-18 09:56:38 -05:00
Nicholas J. Kain	3ede5fbe33	Handle the release and renew signals again.	2015-02-18 07:31:19 -05:00
Nicholas J. Kain	731bd14f0a	reinit_selecting() can't fail; convert it to void return.	2015-02-18 07:12:35 -05:00
Nicholas J. Kain	89d77313c6	Deconfigure the interface if we've failed a fingerprint check. We've moved to a new network, so keeping the old information is counterproductive.	2015-02-18 07:12:22 -05:00
Nicholas J. Kain	29da5d21a4	Tolerate DHCP ACKs in REQUESTING state that don't match the serverid. Some networks have multiple DHCP servers that don't respect the serverid that is specified in DHCPREQUESTs. Instead we simply check to see that the yiaddr matches. While we're at it, ignore DHCP NAK during REQUESTING state. It doesn't really make sense. Instead we should just wait for timeout.	2015-02-18 06:52:59 -05:00
Nicholas J. Kain	f28495a58b	selecting_packet() shouldn't take the xid from the packet. This change actually has no effect because incoming dhcp packets that differ from our xid are dropped, so the xid is always changed to the same value that it already has.	2015-02-18 06:01:43 -05:00
Nicholas J. Kain	da7d34a300	Bump version to 2.0. When the coroutine restructuring is fully tested and debugged, it will definitely be a significant enough change for a full version bump.	2015-02-18 05:41:31 -05:00
Nicholas J. Kain	69cf41f1b1	Only process one epoll event at a time. If ndhc were a high-performance program that handled lots of events, this change would harm performance. But it is not, and it implicitly believes that events come in one at a time. Processing batches would make it harder to assure correctness while also never allocating memory at runtime. The previous structure was fine when everything was handled immediately by callbacks, but it isn't now.	2015-02-18 05:36:13 -05:00
Nicholas J. Kain	99ce918a31	Use a coroutine instead of several callback state machines. This change makes it much easier to reason about ndhc's behavior and properly handle errors. It is a very large changeset, but there is no way to make this sort of change incrementally. Lease acquisition is tested to work. It is highly likely that some bugs were both introduced and squashed here. Some obvious code cleanups will quickly follow.	2015-02-18 05:31:13 -05:00
Nicholas J. Kain	37aa866ae4	Move action dispatch out of main epoll loop.	2015-02-15 06:48:49 -05:00
Nicholas J. Kain	61387408d0	Separate event state gathering from action dispatch in main epoll loop. This is the first step towards using coroutines.	2015-02-15 06:38:03 -05:00
Nicholas J. Kain	658d2954b8	Reduce log spam.	2015-02-15 02:58:51 -05:00
Nicholas J. Kain	e874373dcd	Check link carrier via ifch and netlink instead of ioctl. Thus, ioctl can once again be removed from the ndhc seccomp whitelist.	2015-02-15 02:50:29 -05:00
Nicholas J. Kain	6c9ca9eecd	If ifchd commands fail, propagate the failure back to ndhc.	2015-02-15 02:29:37 -05:00
Nicholas J. Kain	33aa5b13de	Recommend using a process supervisor for high reliability in README.	2015-02-14 20:50:46 -05:00
Nicholas J. Kain	5b82be8b00	If ifchd interactions fail, terminate. Ideally we would pause and resume state, but for now just bail out. If ndhc is process-supervised, it will recover to the proper state quickly.	2015-02-14 20:47:14 -05:00
Nicholas J. Kain	170f87c0e7	Propagate returns through ifchange_(deconfig\|bind). While doing so remove unnecessary argument null checks and make sure not to alter the stored interface state if the ifch requests failed.	2015-02-14 19:10:23 -05:00
Nicholas J. Kain	44175bd77c	Make ifch requests synchronous just like sockd requests. This change paves the way for allowing ifch to notify the core ndhc about failures. It would be far too difficult to reason about the state machine if the requests to ifch were asynchronous. Currently ndhc assumes that ifch requests never fail, but this is not always true because of eg, rfkill.	2015-02-14 16:49:50 -05:00
Nicholas J. Kain	61a48b0fb6	Fix the rfkill waiting.	2015-02-14 15:33:02 -05:00
Nicholas J. Kain	56cc05599a	Add error handling for un-notified carrier downs in ifup_action.	2015-02-14 05:39:15 -05:00
Nicholas J. Kain	b6b778831c	Add error handling for un-notified carrier downs when sending packets. If a packet send failed because the carrier went down without a netlink notification, then assume the hardware carrier was lost while the machine was suspended (eg, ethernet cable pulled during suspend). Simulate a netlink carrier down event and freeze the dhcp state machine until a netlink carrier up event is received. The ARP code is not yet handling this issue everywhere, but the window of opportunity for it to happen there is much shorter.	2015-02-14 05:20:04 -05:00
Nicholas J. Kain	d0d8bcf3ff	options.[ch] must still be valid C++11 because it is used in ndhs. That means the [static 1] arg specifier cannot be used.	2015-02-14 02:14:30 -05:00

1 2 3 4 5 ...

583 Commits