Page MenuHomeVyOS Platform

Closing IPV6CP by client closes PPPoE link completely, even if IPv6 is optional
Open, HighPublicBUG

Description

Some background - https://forum.vyos.io/t/pppoe-server-dual-stack-ipv4-ipv6-buggy-soho-routers-that-refuse-ipv6/9010/4
traffic capture - http://91.224.224.43/phicomm/phicomm6.cap
RFC1661 - 3.7 Link Termination Phase - Implementation Note:

The closing of the link by LCP is sufficient.  There is no need
for each NCP to send a flurry of Terminate packets.  Conversely,
the fact that one NCP has Closed is not sufficient reason to cause
the termination of the PPP link, even if that NCP was the only NCP
currently in the Opened state.

Some cheap buggy unsupported SOHO routers (specifically, Phicomm KE2M) as PPPoE clients don't support IPv6, but try to negotiate it anyway and then fail, client sends IPV6CP terminate request, and VyOS PPPoE server responds by terminating LCP, even though it could still work fine as IPv4 only. This makes IPv6 deployment difficult, as changing ppp-options ipv6 "deny" to "allow" breaks existing IPv4 customers who happen to have such buggy routers.

Details

Difficulty level
Unknown (require assessment)
Version
1.3.1-S1
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Behavior change
Issue type
Bug (incorrect behavior)

Event Timeline

Log messages - http://91.224.224.43/phicomm/phicomm6.log
PPPoE server config:

set service pppoe-server authentication mode 'radius'
set service pppoe-server authentication protocols 'pap'
set service pppoe-server authentication protocols 'chap'
set service pppoe-server authentication protocols 'mschap'
set service pppoe-server authentication protocols 'mschap-v2'
set service pppoe-server authentication radius acct-timeout '0'
set service pppoe-server authentication radius nas-ip-address '91.224.224.11'
set service pppoe-server authentication radius server 91.224.224.XXX key '********'
set service pppoe-server authentication radius source-address '91.224.224.11'
set service pppoe-server authentication radius timeout '5'
set service pppoe-server client-ip-pool subnet '185.38.221.112/28'
set service pppoe-server client-ipv6-pool delegate 2a04:7840:dd70::/48 delegation-prefix '56'
set service pppoe-server client-ipv6-pool prefix 2a04:7840:dd71::/48 mask '64'
set service pppoe-server gateway-address '91.224.224.11'
set service pppoe-server interface eth2.49
set service pppoe-server interface eth2.50
set service pppoe-server interface eth2.55
set service pppoe-server interface eth2.59
set service pppoe-server mtu '1500'
set service pppoe-server name-server '91.224.224.55'
set service pppoe-server name-server '91.224.224.44'
set service pppoe-server ppp-options ipv4 'require'
set service pppoe-server ppp-options ipv6 'allow'
set service pppoe-server ppp-options ipv6-accept-peer-intf-id
set service pppoe-server ppp-options ipv6-intf-id 'random'
set service pppoe-server ppp-options ipv6-peer-intf-id 'calling-sid'
set service pppoe-server ppp-options lcp-echo-failure '3'
set service pppoe-server ppp-options lcp-echo-interval '10'
set service pppoe-server ppp-options min-mtu '1280'
set service pppoe-server ppp-options mppe 'deny'

client pools only defined to suppress a warning during "commit", all IPv4/v6 addresses/prefixes are actually static assigned by RADIUS.
Simply change ppp-options ipv6 "allow" to "deny" - and all customers with Phicomm routers stop disconnecting, work fine as IPv4 only.
The setup is proven working otherwise - 10/10 on test-ipv6.com and 19/20 on ipv6-test.com (-1 for lack of IPv6 revDNS) via properly configured MikroTik router as PPPoE client.

See also https://github.com/accel-ppp/accel-ppp/issues/57
Testing this patch, PPPoE session with the Phicomm router now stays up, the missing part after "else" is to remove IPv6 configuration from ppp interface (not sure how to do it properly).

diff
diff --git a/accel-pppd/ppp/ppp_ipv6cp.c b/accel-pppd/ppp/ppp_ipv6cp.c
index 1194b31..2bac31b 100644
--- a/accel-pppd/ppp/ppp_ipv6cp.c
+++ b/accel-pppd/ppp/ppp_ipv6cp.c
@@ -738,7 +738,10 @@ static void ipv6cp_recv(struct ppp_handler_t*h)
                        if (conf_ppp_verbose)
                                log_ppp_info2("recv [IPV6CP TermReq id=%x]\n", hdr->id);
                        ppp_fsm_recv_term_req(&ipv6cp->fsm);
-                       ap_session_terminate(&ipv6cp->ppp->ses, TERM_USER_REQUEST, 0);
+                       if (conf_ipv6 == IPV6_REQUIRE)
+                               ap_session_terminate(&ipv6cp->ppp->ses, TERM_USER_REQUEST, 0);
+                       else
+                               ppp_layer_passive(ipv6cp->ppp, &ipv6cp->ld);
                        break;
                case TERMACK:
                        if (conf_ppp_verbose)

As a temporary workaround, I use the script below. For some reason /etc/rc.local no longer runs automatically on VyOS 1.3.2, so I run it manually after each reboot for now. Until it is run, Phicomm routers keep disconnecting due to failed IPV6CP negotiation incorrectly triggering complete PPPoE session termination. I have two PPPoE servers at different locations for redundancy, both rebooting at the same time is very unlikely, so I can live with it for now.

#! /bin/sh
set -e

# block ipv6cp from broken phicomm routers (MAC prefix: d8c8e9, fc7c02)
PPPOE_IFACES="eth2 eth4"

for i in $PPPOE_IFACES ; do
  tc qdisc del dev $i ingress || true
  tc qdisc add dev $i ingress
  tc filter add dev $i parent ffff: u32 match u32 0xd8c8e900 0xffffff00 at -8 match u16 0x8864 0xffff at -2 match u16 0x8057 0xffff at 6 action drop
  tc filter add dev $i parent ffff: u32 match u32 0xfc7c0200 0xffffff00 at -8 match u16 0x8864 0xffff at -2 match u16 0x8057 0xffff at 6 action drop
done

To verify the script worked:

$ tc filter show dev eth2 ingress
filter parent ffff: protocol all pref 49151 u32 chain 0
filter parent ffff: protocol all pref 49151 u32 chain 0 fh 801: ht divisor 1
filter parent ffff: protocol all pref 49151 u32 chain 0 fh 801::800 order 2048 key ht 801 bkt 0 terminal flowid ??? not_in_hw
  match fc7c0200/ffffff00 at -8
  match 00008864/0000ffff at -4
  match 00008057/0000ffff at 4
        action order 1: gact action drop
         random type none pass val 0
         index 2 ref 1 bind 1

filter parent ffff: protocol all pref 49152 u32 chain 0
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid ??? not_in_hw
  match d8c8e900/ffffff00 at -8
  match 00008864/0000ffff at -4
  match 00008057/0000ffff at 4
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1

Hello @dmbaturin - it's now about 2 years since I've discovered this bug, any chances to review my patch?
I don't claim it's 100% correct, in simple tests it does prevent LCP termination but the closing IPV6CP part (after "else") needs review by someone with better understanding of accel-ppp internals.
I still have a few users of these buggy routers in my network, can help with testing and possibly backporting the fix to Equuleus.
Thanks!

@marekm Can you add the PR to the accel-ppp repo? I guess it will be better to fix it in upstream.
https://github.com/accel-ppp/accel-ppp

I think it's not yet ready for PR - needs review by someone more experienced with accel-ppp internals.
Issue already reported there - https://github.com/accel-ppp/accel-ppp/issues/57
Isn't upstream effectively the same now (accel-ppp owned by VyOS)?
Some issues I reported in accel-ppp were moved to discussions but met with silence otherwise. It might (or not, I don't know) have to do with some personal disagreements - someone other than me might have better luck.

Add PR on accell-ppp repo or patch in the vyos-build via PR https://github.com/vyos/vyos-build/tree/current/packages/linux-kernel/patches/accel-ppp
There are no other options for review.

@marekm Do you plan to make a PR?
If you are overloaded, we can import a patch ourselves, but a PR would be nice.

Yes I am overloaded (who isn't), and yes I plan to make a PR but want to test it a bit more first, to be reasonably sure it causes no regression (potential resource leak if something allocated by the incomplete IPv6 configuration is not freed - not sure enough about accel-ppp internals). I'm working to rebuild replacement accel-ppp package (based on the same commit as in equuleus with just my patch applied, no other changes) and run it for a week or two while watching memory usage. Testing a fairly complex config in production environment, so not brave enough to try rolling or even sagitta just yet.

The fix is still work in progress, I need to dig deeper into accel-ppp history that predates the github repo. Not sure why some code was commented out over 10 years ago, IPV6CP restart timer is not stopped and disconnects in 3 seconds. Looking at older history in sourceforge but not sure if the answer is even there, sourceforge seems to have lost old history before 2011 (broken links).