This is an attempt at a more coherent writeup of my post on Whirlpool.
I've been debugging a problem at a rural location that uses an NBN interim satellite internet connection. The connection was frequently unreliable, particularly with SSL connections (HTTPS, POP3S), but also with ssh and, when tried, even telnet. It was, of course, sporadic. It was less obvious with cleartext web browsing (plain HTTP), but I believe this is because browsers silently retry failed connections.
Poking around on Whirlpool found little information; I currently suspect this is because of the fairly small share of iOS/MacOSX equipment in rural locations, where Windows is very prevalent.
My current hypothesis, supported by packet traces, is that this unreliability comes from a combination of TCP acceleration in the modem (a Gilat SkyEdge IP II), overly aggressive SYN resend timing in MacOSX and iOS, and satellite latency.
The current workaround is to divert all outbound TCP traffic through a proxy on the firewall/router. The Mac connects to the proxy on the firewall (a low latency local LAN connection), and the firewall makes the outbound TCP connection to the target host using far saner SYN resend timing, which gives reliable connections.
Implementation
Most people will do this via a web proxy (e.g. squid or an equivalent program, possibly built into their router). This requires each client machine to be configured to use it, and will only deal with HTTP and HTTPS traffic.
I'm doing it with relayd on the firewall to forward any TCP connection, and a PF rule in pf.conf to divert outbound connections to it.
The network looks like this:
Mac --(wifi)-> airport -> switch -> firewall -> sat-modem -> internet
The PF rule reads like this:
pass in log quick on $if_lan inet proto tcp to !<local_nets> divert-to 127.0.0.1 port 8888
which causes TCP connections arriving on the local LAN interface and directed to non-local networks to be diverted to the relayd listening on 127.0.0.1, port 8888.
The relayd.conf looks like this:
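relayd_addr="127.0.0.1"
relayd_port="8888"

protocol mytcp {
    # enable TCP_NODELAY on the relayed connections
    tcp nodelay
    ##tcp no sack
    # do not use socket splicing
    tcp no splice
}

relay proxy {
    protocol mytcp
    listen on $relayd_addr port $relayd_port
    # connect to the original destination of the diverted connection
    forward to destination retry 3
}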
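Because the failures are sporadic, one way to check whether the diversion helps is to count failures over a batch of connection attempts, with and without the PF rule in place. Here is a rough Python sketch of such a probe. The function name, the attempt count and the settle time are my own choices for illustration, and the host in the usage comment is a placeholder. It only opens the connection and waits briefly, on the assumption that a reset from the modem arrives within moments of the SYN/ACK.

```python
import socket


def probe(host, port, attempts=20, settle=1.0):
    """Open `attempts` TCP connections and count how each one ends.

    After connecting, wait up to `settle` seconds for data or a reset.
    A timeout while waiting means the connection survived and counts as ok.
    """
    counts = {"ok": 0, "reset": 0, "failed": 0}
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=10) as sock:
                sock.settimeout(settle)
                try:
                    sock.recv(1)
                except socket.timeout:
                    pass  # nothing arrived, but no reset either
            counts["ok"] += 1
        except ConnectionResetError:
            counts["reset"] += 1
        except OSError:
            counts["failed"] += 1
    return counts


# Example (placeholder host):
#   print(probe("mail.example.com", 995))
```

Comparing the "reset" count from a Mac with the diversion enabled and disabled should show whether the proxy is making a difference.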
Packet Traces and Discussion
The following traces are taken from the firewall on the local LAN interface. As mentioned, the network looks like this:
Mac --(wifi)-> airport -> switch -> firewall -> sat-modem -> internet
The firewall runs OpenBSD with stateful rules.
The satellite modem performs both HTTP and TCP acceleration. Gilat offers a little description here. The HTTP acceleration supports an upstream driven prefetch, but I believe it to be irrelevant here. The same Gilat page indicates that the TCP acceleration passes the SYN:SYN/ACK:ACK TCP setup packets as-is, but then collects the data for established connections and sends it over the satellite portion of the link using a more satellite-optimised protocol. Also, because the modem ACKs data locally, the client (my Mac) is encouraged to send more data promptly instead of doing a slow ramp up with high latency ACKs from the remote host. Conversely, the remote host can also send data more rapidly.
The TCP acceleration is all very cool (and surprisingly effective) but has a misfeature/bug in its implementation as shown below.
Here is a successful POP3S connection:
16:47:47.896426 {MACBOOK}.50142 > X.X.X.X.995: S 2233798023:2233798023(0) win 65535 (DF)
16:47:48.597903 {MACBOOK}.50142 > X.X.X.X.995: S 2233798023:2233798023(0) win 65535 (DF)
16:47:48.643500 X.X.X.X.995 > {MACBOOK}.50142: S 0:0(0) ack 2233798024 win 13312
16:47:48.644794 {MACBOOK}.50142 > X.X.X.X.995: . ack 1 win 65535 (DF)
16:47:48.645639 {MACBOOK}.50142 > X.X.X.X.995: P 1:307(306) ack 1 win 65535 (DF)
This shows an initial SYN packet, then a resend about 700ms later, then a SYN/ACK response from the far end 747ms after the first SYN. And then our ACK and normal data traffic.
Round trip latency over satellite is at best about 650ms.
Here is an unsuccessful POP3S connection:
16:48:11.094536 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:11.797141 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:12.099393 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:12.400934 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:12.702356 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:13.003581 X.X.X.X.995 > {MACBOOK}.50144: S 0:0(0) ack 2181400206 win 13312
16:48:13.005832 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:13.006593 X.X.X.X.995 > {MACBOOK}.50144: R 1:1(0) win 0
16:48:13.007379 {MACBOOK}.50144 > X.X.X.X.995: . ack 1 win 65535 (DF)
16:48:13.007778 {MACBOOK}.50144 > X.X.X.X.995: P 1:307(306) ack 1 win 65535 (DF)
16:48:13.008437 X.X.X.X.995 > {MACBOOK}.50144: R 1:1(0) win 0
16:48:13.009220 X.X.X.X.995 > {MACBOOK}.50144: R 1:1(0) win 0
This shows an initial SYN and resends at about 700ms, 1000ms, 1300ms and 1600ms. We may infer that the network is congested. Finally the SYN/ACK from the far end arrives at about 1900ms. At this point (I believe) the TCP acceleration in the satellite modem has established state for the connection.
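The offsets can be recomputed from the trace timestamps. Here is a minimal Python sketch, assuming tcpdump lines in the format shown above. The function name is mine, and the sample is the first seven lines of the failed trace.

```python
from datetime import datetime

# First seven packets of the unsuccessful POP3S connection, copied from above.
TRACE = """\
16:48:11.094536 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:11.797141 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:12.099393 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:12.400934 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:12.702356 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
16:48:13.003581 X.X.X.X.995 > {MACBOOK}.50144: S 0:0(0) ack 2181400206 win 13312
16:48:13.005832 {MACBOOK}.50144 > X.X.X.X.995: S 2181400205:2181400205(0) win 65535 (DF)
"""


def offsets_ms(trace):
    """Return (offset_ms, sender, flag) per packet, relative to the first one."""
    rows = []
    first = None
    for line in trace.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a packet line
        stamp = datetime.strptime(fields[0], "%H:%M:%S.%f")
        if first is None:
            first = stamp
        delta_ms = (stamp - first).total_seconds() * 1000.0
        rows.append((round(delta_ms), fields[1], fields[4]))
    return rows


for offset, sender, flag in offsets_ms(TRACE):
    print(f"{offset:5d}ms  {sender:20s} {flag}")
```

The offsets come out as 0, 703, 1005, 1306, 1608, 1909 and 1911ms: the first resend comes about 700ms after the initial SYN, and later resends follow at roughly 300ms intervals.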
In an unfortunate (but annoyingly common) situation, there is another SYN resend from the Mac, arriving at the firewall 2ms after the SYN/ACK was dispatched. Notably, the Mac will have sent it before it had seen the SYN/ACK. And herein lies the Gilat modem bug.
The modem believes the connection is ready; it has seen the SYN/ACK. It now sees the extra SYN resend, and decides that this is an invalid attempt to make a new connection from the same source on the Mac to the same target host. It declares the connection invalid and sends an RST in response to the extra SYN. Meanwhile the Mac sees the SYN/ACK and sends an ACK as normal, and follows up with a PSH of the first data packet. Both of these also get RST packets sent back. The Mac reports "connection reset by peer". Badness.
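To make the hypothesis concrete, here is a toy Python model of the behaviour I am attributing to the modem. It is only my reading of the traces, not Gilat's actual implementation; the state names and the function are invented for illustration.

```python
def modem_replies(packets):
    """Return the replies a modem following my hypothesis sends to the client.

    `packets` is a list of (direction, kind) tuples in the order the modem
    sees them: direction is "out" (client to server) or "in" (server to client).
    """
    state = "setup"
    replies = []
    for direction, kind in packets:
        if state == "setup":
            # SYNs and SYN resends pass through untouched during setup.
            if direction == "in" and kind == "SYN/ACK":
                state = "established"
        elif state == "established":
            # Hypothesis: a late SYN resend is treated as a bogus new connection.
            if direction == "out" and kind == "SYN":
                state = "invalid"
                replies.append("RST")
        elif direction == "out":
            # Once invalid, everything else from the client is reset too.
            replies.append("RST")
    return replies


good = [("out", "SYN"), ("out", "SYN"), ("in", "SYN/ACK"),
        ("out", "ACK"), ("out", "PSH")]
bad = [("out", "SYN")] * 5 + [("in", "SYN/ACK"), ("out", "SYN"),
                              ("out", "ACK"), ("out", "PSH")]

print(modem_replies(good))  # []
print(modem_replies(bad))   # ['RST', 'RST', 'RST']
```

Fed the packet order from the two traces, the model produces no resets for the successful connection and three for the unsuccessful one, which matches what the firewall saw.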
Better behaviour from the modem would be to accommodate a SYN resend from the client Mac, at least during a small window after connection setup.
Better behaviour from the Mac would be to send SYNs far less often. A more normal TCP stack (such as that in the firewall) sends its first resend 6 whole seconds after the first SYN. The only use for a faster resend is to get snappier recovery in the event of the first SYN being dropped. In these days of buffer bloat it looks like dropping packets is no longer a common response to congestion. Instead, SYNs can clearly be delayed rather than lost.
By diverting outbound TCP through relayd we are effectively replacing the Mac's overly eager SYN timing with the saner timing from the OpenBSD host, and the problem basically goes away.
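To put rough numbers on the difference, this small Python sketch counts how many SYN resends a client sends before the SYN/ACK turns up. The Mac figures (first resend at about 700ms, then every 300ms) are approximations read off the traces above. The 6 second first resend for the firewall comes from the text; the interval after that is a placeholder of mine and does not matter at these delays.

```python
def resends_before_synack(synack_ms, first_resend_ms, interval_ms):
    """Count the SYN resends sent strictly before the SYN/ACK arrives."""
    count = 0
    t = first_resend_ms
    while t < synack_ms:
        count += 1
        t += interval_ms
    return count


# SYN/ACK delays seen in the two traces, in milliseconds.
for synack_ms in (747, 1909):
    mac = resends_before_synack(synack_ms, 700, 300)
    firewall = resends_before_synack(synack_ms, 6000, 6000)
    print(f"SYN/ACK at {synack_ms}ms: Mac resends {mac}, firewall resends {firewall}")
```

With the Mac's timing there is at least one duplicate SYN in flight on every connection, and every one of them is a chance to race the SYN/ACK. With a 6 second first resend there are none unless the SYN/ACK takes longer than 6 seconds.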