tl;dr - I'm running into some FIB bug on Linux > 4.14
Ever since I upgraded excalibur (bare metal IPv6 router running BGP and tunnels) from Linux past version 4.14 I've been having some weird IPv6 FIB problems. The host takes a few full IPv6 BGP feeds (currently ~60K routes) and puts them into the Linux FIB via FRR. It also terminates a few OpenVPN and 6in4 tunnels for friends & family and happens to also host dax, the VM where prolixium.com web content is hosted.
The problems started when I upgraded to Linux 4.19 that was packaged by Debian in testing. About 90 minutes after the reboot and after everything had converged, I started seeing reachability issues to some IPv6 destinations. The routes were in the RIB (FRR) and in the FIB but traffic was being bitbucketed. Even direct routes were affected. Here's excalibur's VirtualBox interface to dax going AWOL from an IPv6 perspective:
(excalibur:11:02:EST)% ip -6 addr show dev vboxnet0
15: vboxnet0: mtu 1500 state UP qlen 1000
    inet6 2620:6:200f:3::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::800:27ff:fe00:0/64 scope link
       valid_lft forever preferred_lft forever
(excalibur:11:02:EST)% ip -6 ro | grep 2620:6:200f:3
2620:6:200f:3::/64 dev vboxnet0 proto kernel metric 256 pref medium
(excalibur:11:02:EST)% ip -6 route get 2620:6:200f:3::2
2620:6:200f:3::2 from :: dev vboxnet0 proto kernel src 2620:6:200f:3::1 metric 256 pref medium
(excalibur:11:02:EST)% ping6 -c4 2620:6:200f:3::2
PING 2620:6:200f:3::2(2620:6:200f:3::2) 56 data bytes

--- 2620:6:200f:3::2 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3ms
(excalibur:11:02:EST)%
In the above case, the route was there and Netlink confirmed it, but no traffic would flow. The fix here was to either bounce the interface, restart FRR, or reboot.
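For reference, "bouncing the interface" here just means a down/up cycle with ip(8) (requires root; vboxnet0 is the interface from the example above):

```
# Force the kernel to tear down and re-create the connected IPv6
# routes on the affected interface.
ip link set dev vboxnet0 down
ip link set dev vboxnet0 up
```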
Other times Netlink provides a negative response:
(excalibur:12:30:EST)% ip -6 route get 2620:6:200e:8100::
RTNETLINK answers: Invalid argument
(excalibur:12:30:EST)% ip -6 ro | grep 2620:6:200e:8100::/56
2620:6:200e:8100::/56 via 2620:6:200e::2 dev tun2 proto static metric 20 pref medium
In this case, the route appeared to be there but Netlink had some issue when querying it. Traffic to that prefix was being bitbucketed. The fix was to re-add the static route in FRR:
(excalibur:12:32:EST)# vtysh -c "conf t" -c "no ipv6 route 2620:6:200e:8100::/56 2620:6:200e::2"
(excalibur:12:32:EST)# vtysh -c "conf t" -c "ipv6 route 2620:6:200e:8100::/56 2620:6:200e::2"
(excalibur:12:32:EST)% ip -6 route get 2620:6:200e:8100::
2620:6:200e:8100:: from :: via 2620:6:200e::2 dev tun2 proto static src 2620:6:200e::1 metric 20 pref medium
Downgrading from 4.19 to 4.16 seemed to make the situation much better, but didn't fix it completely. Instead of 50% of routes failing after 90 minutes, only a handful of prefixes break. I'm not sure how many a handful is, but it's more than one. I ran 4.14 for about six months without a problem, so I might just downgrade to that for now.
I did try reproducing this on a local VM running 4.19, FRR, and two BGP feeds but the problem isn't manifesting itself. I'm wondering if this is traffic or load related or maybe even related to the existence of tunnels. I don't think it's FRR's fault but it certainly might be doing something funny with its Netlink socket that triggers the kernel bug. I also don't know how to debug this further, so I'm going to need to do some research.
Update 2019-02-16
I started playing with that local VM running 4.19 and can successfully cause IPv6 connectivity to "hiccup" if I do the following on it:
% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get
This basically walks the IPv6 Linux FIB and does an "ip -6 route get" for each prefix (first address in each). After exactly 4,261 prefixes Netlink just gives me network unreachable:
[...]
2001:df0:456:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
2001:df0:45d:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
2001:df0:465:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
[...]
It's funny because it's always at the same exact point. The route after 2001:df0:465::/48 is 2001:df0:467::/48, which I can query just fine outside of the loop:
(nltest:11:12:PST)% ip -6 route get 2001:df0:467::
2001:df0:467:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
(nltest:11:12:PST)%
The only possible explanation I can come up with is that I'm hitting some Netlink limit and messages are getting dropped. If I don't Ctrl-C the script and just let it sit there spewing unreachable messages on the screen, eventually all IPv6 connectivity to my VM hiccups and causes my BGP sessions to bounce. I can see this when running an adaptive ping to the VM:
[...]
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1215 ttl=64 time=0.386 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1216 ttl=64 time=0.372 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1217 ttl=64 time=0.143 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1218 ttl=64 time=0.383 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1235 ttl=64 time=1022 ms <--- segments 1219..1234 gone
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1236 ttl=64 time=822 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1237 ttl=64 time=621 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1238 ttl=64 time=421 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1239 ttl=64 time=221 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1240 ttl=64 time=20.6 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1241 ttl=64 time=0.071 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1242 ttl=64 time=0.078 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1243 ttl=64 time=0.081 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1244 ttl=64 time=0.076 ms
[...]
Next step here is to downgrade the VM to 4.14 and run the same thing. It's possible I'm just burning out Netlink and this is normal, but I'm suspicious.
Update 2 2019-02-16
Downgrading my local VM to Linux 4.14 and running the same shell fragment above produces no network unreachable messages from Netlink, does not disturb IPv6 connectivity at all, and no BGP sessions bounce:
(nltest:11:44:PST)% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null
(nltest:11:45:PST)%
Something definitely changed or got bugged in Netlink after 4.14.
Update 3 2019-02-16
After testing a few kernels, it seems this was introduced in Linux 4.18. More investigation needed.
Update 4 2019-11-17
It looks like I may have found something. I upgraded to 5.3.0-2-amd64 (Debian kernel) and ran the same test above. I got the same results but this time I saw something interesting in dmesg output:
[  119.460300] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
[  120.666697] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
[  121.668727] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Apparently, net.ipv6.route.max_size was set very low:
(netlink:19:17:PST)% sudo sysctl -A | grep max_size
net.ipv4.route.max_size = 2147483647
net.ipv6.route.max_size = 4096
Well, I certainly have more than 4,096 routes. So, I increased it to 1048576. It WORKED!
(netlink:19:18:PST)% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null
(netlink:19:23:PST)%
No output means no RTNETLINK errors.
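To make the fix survive a reboot, the sysctl can be applied once with sysctl -w and persisted via a sysctl.d fragment (the file name here is arbitrary; 1048576 is just the value I picked, comfortably above my route count):

```
# One-shot, as root:
#   sysctl -w net.ipv6.route.max_size=1048576

# Persistent, e.g. in /etc/sysctl.d/90-ipv6-route-max-size.conf:
net.ipv6.route.max_size = 1048576
```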
This net.ipv6.route.max_size key is present and set to 4096 on my production routers, which carry ~77K IPv6 routes on 4.14 kernels without issue. So, I have lots of questions here.
More research is needed but at least there's a way forward with kernels > 4.17.
It took a while, but I finally converted the last two software routers (well, hosts that run routing protocols) on my network that were running Quagga to FRR:
bazooka.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
centauri.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
evolution.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
excalibur.prolixium.com.: Version: 4.0-1~debian9+1
exodus.prolixium.com.: Version: 1.6.3-3
firefly.prolixium.com.: Version: 1.6.3-3
mercury.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
nat.prolixium.com.: Version: 4.1-dev-1.0-1~debian9+1
nox.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
pathfinder.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
proteus.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
remus.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
scimitar.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
sprint.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
starfire.prolixium.com.: Version: 4.1-dev-1.0-1~debian9+1
storm.prolixium.com.: Version: 1.6.3-3
tachyon.prolixium.com.: Version: 3.1-dev
tiny.prolixium.com.: Version: 4.0-1~debian9+1
trident.prolixium.com.: Version: 3.1-dev
trill.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
valen.prolixium.com.: Version: 4.1-dev-1.0-1~debian9+1
orca.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
The -dev versions above are hand-rolled from the latest source code, since there are no pre-built Debian packages for i386.
The 1.6 versions above are actually BIRD.
I started messing with BIRD the other day to work around some IPv6 issues with Quagga. The configuration is fairly simple, but I ran into a weird issue where it picks the wrong interface IPv4 address for some of my OpenVPN tunnels. For example, I've got these interfaces:
(storm:0:06:EST)% ip a s tun0
5: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1456 qdisc pfifo_fast state UNKNOWN group default qlen 100
    link/none
    inet 10.3.254.44 peer 10.3.254.43/32 scope global tun0
       valid_lft forever preferred_lft forever
(storm:0:06:EST)% ip a s tun1
4: tun1: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1456 qdisc pfifo_fast state UNKNOWN group default qlen 100
    link/none
    inet 10.3.254.81 peer 10.3.254.80/32 scope global tun1
       valid_lft forever preferred_lft forever
Here's what BIRD sees:
bird> show route protocol direct1
10.3.4.64/32       dev lo [direct1 23:53:06] * (240)
10.3.254.43/32     dev tun0 [direct1 23:53:06] * (240)
10.3.254.80/32     dev tun1 [direct1 23:53:06] * (240)
192.168.150.0/24   dev eth0 [direct1 23:53:06] * (240)
The addresses shown are the remote ends of the OpenVPN tunnels, not the local ends, which are what I'd expect.
Why?
Update: Well, this should have been obvious:
(storm:0:34:EST)% ip r s p kernel
10.3.254.43 dev tun0 scope link src 10.3.254.44
10.3.254.80 dev tun1 scope link src 10.3.254.81
192.168.150.0/24 dev eth0 scope link src 192.168.150.105
I suppose I'll have to figure out how to get the link source to be visible to BIRD so I can advertise it, which is one of my odd requirements here.
OpenSSH 7.0 was released a few months ago and disabled, by default, the ssh-dss (DSA) keys used for SSHv2 public key authentication. I haven't found a definitive source stating why this key type was deprecated other than some issue with entropy (which doesn't make sense to me, because that sounds like a machine-specific problem). I, unfortunately, still use these keys quite a bit and it's not feasible to completely convert to one of the newer key types.
OpenSSH 7.0 appeared in FreeBSD ports pretty quickly and recently made its way into Debian testing (stretch).
So, what are the newer key types supported? From what I can tell, it's just ecdsa and ed25519 for SSHv2. From the OpenSSH 6.2 ssh-keygen(1) manpage:
-t type
        Specifies the type of key to create.  The possible values are
        ``rsa1'' for protocol version 1 and ``dsa'', ``ecdsa'' or ``rsa''
        for protocol version 2.
From the OpenSSH 7.1 ssh-keygen(1) manpage:
-t dsa | ecdsa | ed25519 | rsa | rsa1
        Specifies the type of key to create.  The possible values are
        "rsa1" for protocol version 1 and "dsa", "ecdsa", "ed25519", or
        "rsa" for protocol version 2.
Even though dsa and rsa keys are still listed in the 7.1 man page as types that can be created, ssh-dss keys are no longer accepted by default by ssh or sshd:
debug1: Next authentication method: publickey
debug1: Skipping ssh-dss key /home/prox/.ssh/id_dsa for not in PubkeyAcceptedKeyTypes
(if you're wondering, as I was: DSS, the Digital Signature Standard, is the document that describes the creation of DSA keys, as answered here)
Running ssh -Q key will also dump the list of keys acceptable by ssh:
(nox:11:46:CST)% ssh -Q key
ssh-ed25519
ssh-ed25519-cert-v01@openssh.com
ssh-rsa
ssh-dss
ecdsa-sha2-nistp256
ecdsa-sha2-nistp384
ecdsa-sha2-nistp521
ssh-rsa-cert-v01@openssh.com
ssh-dss-cert-v01@openssh.com
ecdsa-sha2-nistp256-cert-v01@openssh.com
ecdsa-sha2-nistp384-cert-v01@openssh.com
ecdsa-sha2-nistp521-cert-v01@openssh.com
The solution seems obvious, just throw out the old keys and use an ecdsa key, right? Sure, that'll work for OpenSSH versions that support it. However, sometimes I have to log into a few legacy boxes that only support RSA and DSA keys (Solaris, IRIX, random network devices, etc. - I've got some old stuff!).
What about keeping around the old keys? Sure, we can use PubkeyAcceptedKeyTypes in sshd_config and ssh_config like this:
PubkeyAcceptedKeyTypes ssh-dss,ssh-rsa
The only problem is that this option only exists in 7.0 and above. I use a common ~/.ssh/ssh_config for all of my systems and OpenSSH 6.x barfs on that line.
What's the solution? Well, one is to not upgrade to OpenSSH 7.0, but that's just delaying the inevitable. My solution may just be to use two keys, one for modern systems and one for very old systems that don't support ecdsa or ed25519. Regardless, it's pretty annoying, but security always is, right?
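One possible middle ground worth noting: OpenSSH 6.3 and newer support the IgnoreUnknown client option, which makes ssh skip configuration keywords it doesn't recognize. Placed before the new option in a shared ~/.ssh/config, it should keep 6.x from barfing (client side only; it doesn't help with sshd_config):

```
# Shared ~/.ssh/config; IgnoreUnknown must precede the options it guards.
IgnoreUnknown PubkeyAcceptedKeyTypes
# Note: this replaces the default list, so include any newer key types
# you also use.
PubkeyAcceptedKeyTypes ssh-dss,ssh-rsa
```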
Update 20151229: This page highlights some of these differences and workarounds, too.
Among many other bugs, Android 5.x seems to have a DHCPv4 bug that prevents it from getting addresses from some embedded systems. For me, this includes the Wi-Fi functions on the GoPro HERO3 and Canon EOS 70D.
We all remember DORA (Discover, Offer, Request, Ack) from our networking 101 classes, right?
Unfortunately, with Android 5.x this turns into DORN (Discover, Offer, Request, NAK) when talking to some DHCP servers.
That's right, the server sends a NAK for the DHCP request, which usually indicates the requested address is invalid or in use.
I realized this while on my honeymoon earlier this year and was not pleased at all. I routinely use the Wi-Fi capabilities of my GoPro HERO3 camera with a "selfie stick" so I can frame photos accurately. The HERO3 turns itself into a Wi-Fi access point and allows a single client to connect in order to use the GoPro application, which is used to control the camera and view live video. Usually this works without a hitch, but not this time. When connecting to the camera, the Wi-Fi status hung in the "Obtaining IP address..." phase. After debugging a bit, I realized that the camera's DHCP server was sending NAKs for the DHCP request from my phone, and that I had upgraded my OnePlus One to CyanogenMod 12.0 (Android 5.0) a few weeks earlier. Here's what logcat says:
I/dhcpcd (25750): wlan0: broadcasting for a lease
I/dhcpcd (25750): wlan0: offered 10.5.5.109 from 10.5.5.9
W/dhcpcd (25750): wlan0: NAK: via 10.5.5.9
I/dhcpcd (25750): wlan0: broadcasting for a lease
I/dhcpcd (25750): wlan0: offered 10.5.5.109 from 10.5.5.9
W/dhcpcd (25750): wlan0: NAK: via 10.5.5.9
I/dhcpcd (25750): wlan0: broadcasting for a lease
I/dhcpcd (25750): wlan0: offered 10.5.5.109 from 10.5.5.9
W/dhcpcd (25750): wlan0: NAK: via 10.5.5.9
I/dhcpcd (25750): wlan0: broadcasting for a lease
I/dhcpcd (25750): wlan0: offered 10.5.5.109 from 10.5.5.9
W/dhcpcd (25750): wlan0: NAK: via 10.5.5.9
(the HERO3 uses 10.5.5/24 with the camera 10.5.5.9 and the clients starting around 10.5.5.109 or so)
The entry that's missing is the DHCP request debug entry, where Android should be sending a message to the DHCP server asking for 10.5.5.109. This is actually happening and can be seen via a packet capture:
20:48:52.890280 IP (tos 0x0, ttl 64, id 21568, offset 0, flags [none], proto UDP (17), length 330)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from c0:ee:fb:24:8c:59 (oui Unknown), length 302, xid 0x1d11a2f4, Flags [none]
      Client-Ethernet-Address c0:ee:fb:24:8c:59 (oui Unknown)
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        Client-ID Option 61, length 19: hardware-type 255, fb:24:8c:59:00:01:00:01:1c:c4:48:b1:c0:ee:fb:24:8c:59
        MSZ Option 57, length 2: 1500
        Vendor-Class Option 60, length 12: "dhcpcd-5.5.6"
        Hostname Option 12, length 5: "omega"
        Parameter-Request Option 55, length 10:
          Subnet-Mask, Static-Route, Default-Gateway, Domain-Name-Server
          Domain-Name, MTU, BR, Lease-Time
          RN, RB
20:48:52.896211 IP (tos 0x0, ttl 64, id 81, offset 0, flags [none], proto UDP (17), length 326)
    10.5.5.9.bootps > 10.5.5.109.bootpc: BOOTP/DHCP, Reply, length 298, xid 0x1d11a2f4, Flags [none]
      Your-IP 10.5.5.109
      Server-IP 10.5.5.9
      Client-Ethernet-Address c0:ee:fb:24:8c:59 (oui Unknown)
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Offer
        Server-ID Option 54, length 4: 10.5.5.9
        Subnet-Mask Option 1, length 4: 255.255.255.0
        Lease-Time Option 51, length 4: 10368000
        TTL Option 23, length 1: 64
        MTU Option 26, length 2: 1500
        RN Option 58, length 4: 10368000
        RB Option 59, length 4: 10371600
        Domain-Name Option 15, length 3: "lan"
        BR Option 28, length 4: 10.5.5.255
        Default-Gateway Option 3, length 4: 10.5.5.9
20:48:52.897006 IP (tos 0x0, ttl 64, id 41212, offset 0, flags [none], proto UDP (17), length 342)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from c0:ee:fb:24:8c:59 (oui Unknown), length 314, xid 0x1d11a2f4, Flags [none]
      Client-Ethernet-Address c0:ee:fb:24:8c:59 (oui Unknown)
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Request
        Client-ID Option 61, length 19: hardware-type 255, fb:24:8c:59:00:01:00:01:1c:c4:48:b1:c0:ee:fb:24:8c:59
        Requested-IP Option 50, length 4: 10.5.5.109
        Server-ID Option 54, length 4: 10.5.5.9
        MSZ Option 57, length 2: 1500
        Vendor-Class Option 60, length 12: "dhcpcd-5.5.6"
        Hostname Option 12, length 5: "omega"
        Parameter-Request Option 55, length 10:
          Subnet-Mask, Static-Route, Default-Gateway, Domain-Name-Server
          Domain-Name, MTU, BR, Lease-Time
          RN, RB
20:48:52.902836 IP (tos 0x0, ttl 64, id 82, offset 0, flags [none], proto UDP (17), length 272)
    10.5.5.9.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 244, xid 0x1d11a2f4, Flags [none]
      Client-Ethernet-Address c0:ee:fb:24:8c:59 (oui Unknown)
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: NACK
Except for some absurdly high lease times, there's nothing invalid that I can see in the DHCP request above. So, why is the HERO3 sending a NAK? Here's what I know:
Since the 70D's and HERO3's DHCP servers are closed, I can't look at their debug logs to see what's happening. The only workaround I have so far is to assign a static IPv4 address to my OPO when connecting to the HERO3: 10.5.5.109/24 with a default gateway of 10.5.5.9. I don't believe the default gateway is used for the application functionality, but it does provide some fake HTTP responses that cause Android's and iOS's Internet connection quality tests to succeed.
I'd submit a bug for this but I'm worried it'll go nowhere or be tagged WONTFIX like this request for DHCPv6 support.
Rather than go into my opinion about the FCC reclassifying broadband networks in the US as common carriers under Title II, I figured I'd just pose some questions that I haven't seen answers to, so far. In fact, I haven't seen many "gory technical details" at all.
First, which ISPs are going to be reclassified? What is the definition of broadband Internet nowadays, anyway? Is it the 25 Mbps / 3 Mbps throughput requirement, the multi-media requirement, both, or neither? Does the multi-media component require circuit switching (e.g., ATSC + DOCSIS), or can it be extended to different kinds of services (TV, phone, data, etc.) over the same packet-switched protocol? If so, this extends the definition of broadband to many more ISPs' offerings, including commercial ones.
I've heard that blocking certain TCP and UDP ports constitutes a net neutrality violation. Some popular examples are VPN-related traffic like UDP/4500 (IPsec NAT-T), IP protocol 50 (IPsec ESP), and UDP/1194 (OpenVPN), or BitTorrent-related ports (traditionally TCP/6881-6889), but how about the less-common ports? How about TCP/135 or TCP/139? These are routinely blocked by many residential ISPs since they have a long history of abuse and are hardly ever legitimately used over the Internet. Would blocking TCP/135 be considered a net neutrality violation? What if a huge amplification attack vector is discovered on some UDP service that happens to be listening on most home routers: can an ISP block that without someone screaming about a net neutrality violation? Assuming those "bad" ports aren't considered net neutrality violations, what if I decided to run a web server on TCP/135? Would that bring TCP/135 back into the scope of a violation?
To go even further than just blocking ports, how about broadband ISPs that only hand out unroutable IPv4 address space (RFC 1918, squat space, or other junk) and use NAT+PAT to provide Internet access? Without some sort of UPnP, there's no way for such a host to receive unsolicited traffic from the Internet at large, and peer-to-peer "stuff" breaks. Does that addressing choice constitute a net neutrality violation? How about mobile networks that offer IPv6 but firewall all inbound connections (hello, Verizon Wireless)? IPv6 address space is typically publicly routable, so the inbound filtering is certainly a net neutrality violation... or is it?
What is the real definition of "fast lane" as it relates to net neutrality? The easy [naïve] answer might be something like "providing a faster connection to Facebook than to Google," but it's not that simple. Speaking only of the network infrastructure, the definition of "fast" depends on general variables like link speed, RTT, and network congestion. While it's conceivable that link speed and congestion can be made somewhat equal for a few networks (i.e., Google, Facebook, etc.), it's far less likely that the RTT will be equal: the chance that both the interconnect locations and the destinations on the remote networks are equivalent from an RTT perspective is very small. For example, is Comcast providing a "fast lane" to Google in a service area that happens to be closer to a peering point with Google than to one with Facebook? The content providers' network architecture makes a big difference here, too. If Google has caches at every peering point but Facebook doesn't, how does an ISP provide equally fast lanes?
How do on-net caches play into this? Google and Netflix are two examples of content providers that offer on-net caches so ISPs don't have to eat transit costs to get content to their customers. It also provides a much better experience due to the lower latency—is this also considered a "fast lane"? To add an additional twist on this, how about Akamai's on-net caches? Well, wouldn't this favor only content providers that pay Akamai to host objects on their CDN?
How about peering vs. transit vs. customers? Does a peering connection (how many peering locations?) constitute a net neutrality violation? What if a small content provider has one transit provider and decides to get another one, would other customers of that second transit provider now have a "fast lane" to that content provider? There are many permutations.
I doubt I'll ever hear definitive answers to all of these questions. It's possible many of these questions will become invalid, too.
Ever since 2004 or 2005 I've wanted to see real virtual router functionality on Linux. It would make setting up a lightweight networking lab with Quagga (or BIRD) a cinch and also allow me to leverage multiple Internet connections for VPN isolation, amongst other things.
You might say, "Hey, Linux has that in the form of tables and rules that can be manipulated by ip(8)!" It does, sort of. It's possible to set up another routing table (optionally naming it in /etc/iproute2/rt_tables), add arbitrary routes, and then add rules telling the kernel to use a specific routing table for all packets coming from a certain IP address (or many other things, if you use iptables MARK). This only "sort of" works because there's no way (from what I can tell) to actually bind interfaces to a particular routing table. There's also the issue of overlapping IP space: how do I tell ssh(1), for example, to use a particular routing table if there are two interfaces with the same IP address? The -b argument won't do me much good. DHCP is also problematic, because heavy modifications are needed in dhclient-script and they'd be mostly implementation-specific.
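As a concrete sketch of that tables-and-rules approach (the table name and addresses are made up for illustration; these commands require root):

```
# Give table 100 a friendly name (optional, purely cosmetic):
echo "100 vpn" >> /etc/iproute2/rt_tables

# Populate the alternate table and steer packets from one source
# address into it:
ip route add default via 192.0.2.1 dev eth1 table vpn
ip rule add from 192.0.2.10/32 lookup vpn

# Verify:
ip rule show
ip route show table vpn
```

Note that nothing here binds eth1 itself to the table; the interface remains visible to the main table, which is exactly the limitation described above.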
So, although I've used multiple routing tables with rules, they don't really fit my definition of a virtual router (VRF-lite, in Cisco parlance): an isolated construct that has exclusive access to a set of interfaces and their addresses.
I've been messing with Linux containers (LXC) recently and I think they might provide the exact functionality I've been looking for all these years. With LXC it's possible to fire up a small instance that has its own routing table, interfaces, and applications. There's no need to use rules or -b arguments anymore. DHCP and Quagga don't require any hacks and work the way they should.
Networking with LXC is about what I'd expect: interfaces are dedicated to the container. Overlapping IP addresses are certainly possible, if there's a need for that. Connecting a container to the host or to other containers can be achieved using a virtual Ethernet (veth) interface with a bridge. This makes it easy to set up multiple Linux "virtual routers" without ever having to mess with VMs, routing tables, or rules. A good article on various LXC networking modes can be found here.
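For illustration, the veth-plus-bridge wiring is just a few lines in the container config (LXC 1.x keys; the bridge name br0 is an assumption on my part, and LXC fills in the x's of the MAC template with random octets):

```
# /var/lib/lxc/<container>/config (network portion only)
lxc.network.type   = veth
lxc.network.link   = br0            # host bridge to attach the veth peer to
lxc.network.name   = eth0           # interface name inside the container
lxc.network.flags  = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
```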
If you're familiar with Junos, Linux containers in this role are almost analogous to logical systems. Logical systems, unlike VRFs, have their own copy of RPD, the daemon that handles all dynamic routing protocols. If you're using Quagga with containers, the architecture is similar.
A slight drawback I can see with containers as virtual routers is the disk space usage. Each container has, by default, its filesystem stored in a separate directory in /var/lib/lxc. There's quite a bit of redundant data if you fire up many containers using the same distribution (e.g., Debian). I'm sure there is some way to de-duplicate this (which would help with package upgrades, too!) but I haven't really looked into it because storage is so cheap nowadays and most of us have plenty of it. A fairly fully-featured Debian container I've got is not that large, anyway:
(vega:17:04)# du -hcs /var/lib/lxc/*
4.0K    /var/lib/lxc/lxc-monitord.log
488M    /var/lib/lxc/soran
488M    total
So, in summary, for general-use virtual routers, I think LXC is pretty great. The best part is that the only thing required to use LXC is a recent kernel with cgroups enabled and mounted properly.
We've all heard about the recent NTP reflection attacks. Last night I noticed a higher-than-normal traffic volume on nox, so I checked it out with tcpdump:
Note, the first and second octets have been anonymized to protect the victim.
21:07:07.999600 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
21:07:07.999608 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
21:07:07.999617 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
21:07:07.999625 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
21:07:07.999712 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
21:07:07.999722 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
21:07:07.999730 IP 100.44.89.82.26528 > 64.16.214.60.123: NTPv3, Client, length 48
Yes, nox is a public NTP server. It's a member of the NTP Pool Project. No, it's not susceptible to an NTP reflection attack. It looks like some poor soul at 100.44.89.82 (looked like a SonicWALL when I poked around) was being attacked and the traffic above was being spoofed with the intention of having my server send back a reply that's much larger than the request. Here's a decode of one of the packets:
21:07:07.772681 IP (tos 0x0, ttl 53, id 0, offset 0, flags [DF], proto UDP (17), length 76)
    100.44.89.82.10084 > 64.16.214.60.123: [udp sum ok] NTPv3, length 48
      Client, Leap indicator: clock unsynchronized (192), Stratum 0 (unspecified), poll 10s, precision -19
      Root Delay: 1.000000, Root dispersion: 1.000000, Reference-ID: (unspec)
        Reference Timestamp:  0.000000000
        Originator Timestamp: 0.000000000
        Receive Timestamp:    0.000000000
        Transmit Timestamp:   3604450027.692652940 (2014/03/21 21:07:07)
          Originator - Receive Timestamp:  0.000000000
          Originator - Transmit Timestamp: 3604450027.692652940 (2014/03/21 21:07:07)
What's odd about this is that the packet above looks like just a normal NTP query. Unlike most NTP reflection attacks, which exploit the monlist or similar commands, this wasn't really going to have the desired effect. And, of course, if you look at the traffic volume (before I blocked that source address with iptables), it certainly did not.
The desired effect, of course, should have been an outbound traffic volume that was greater than the inbound traffic volume, or amplified. In this case, my server was just sending back a 48 byte packet for every 48 byte packet coming in, albeit apparently slightly ratelimited by the NTP daemon.
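Put another way, the amplification factor is just bytes out over bytes in; with symmetric 48-byte NTP client packets and replies there's no gain at all (contrast with monlist-style responses, which can be hundreds of times larger than the query):

```shell
# Amplification factor = response bytes / request bytes.
# Both the NTPv3 query and its reply here are 48 bytes of payload.
awk 'BEGIN { printf "%.1fx\n", 48 / 48 }'
# prints "1.0x", i.e. no amplification
```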
Was this a misconfigured DDoS bot? Did the attacker really not know what he or she was doing, or miss DDoS 101? Or was this traffic not actually spoofed, but the result of some broken NTP client? Maybe.
Regardless, if this wasn't a misconfigured NTP client, BCP 38 would have prevented this from happening to begin with. I don't know where the traffic was originating, but I do know that it was from a network that probably doesn't implement BCP 38.
Anyway, I thought this was a little odd so I figured I would share.