OSPF Neighbor Flapping
Closed, Invalid | Public | BUG

Assigned To

None

Authored By

	ekim
	Jul 19 2017, 4:34 PM

Description

When the physical interface that carries IPsec uses DHCP and OSPF runs over a VTI, the OSPF neighbor reaches Full/DROther, stays up for about a minute, then drops and re-establishes the adjacency. If I change the interface from DHCP to a static IP address without changing anything else, OSPF works as expected.

Logs:

Jul 19 13:03:12 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:03:59 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
Jul 19 13:04:02 corp-fw-b ospfd[2298]: Packet[DD]: Neighbor 1.1.1.1: Initial DBD from Slave, ignoring.
Jul 19 13:04:02 corp-fw-b ospfd[2298]: Packet[DD]: Neighbor 1.1.1.1 Negotiation done (Master).
Jul 19 13:04:02 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:04:49 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
Jul 19 13:04:52 corp-fw-b ospfd[2298]: Packet[DD]: Neighbor 1.1.1.1: Initial DBD from Slave, ignoring.
Jul 19 13:04:52 corp-fw-b ospfd[2298]: Packet[DD]: Neighbor 1.1.1.1 Negotiation done (Master).
Jul 19 13:04:52 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:05:39 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
Jul 19 13:05:42 corp-fw-b ospfd[2298]: Packet[DD]: Neighbor 1.1.1.1: Initial DBD from Slave, ignoring.
Jul 19 13:05:42 corp-fw-b ospfd[2298]: Packet[DD]: Neighbor 1.1.1.1 Negotiation done (Master).
Jul 19 13:05:42 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:06:29 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination

Config:

interfaces {
    ethernet eth0 {
        address dhcp
        duplex auto
        hw-id 0c:c4:7a:db:ef:8c
        smp_affinity auto
        speed auto
    }
    loopback lo {
        address 10.126.1.10/32
    }
    vti vti0 {
        address 10.126.2.1/30
        ip {
            ospf {
                network point-to-point
            }
        }
    }
}

protocols {
    ospf {
        area 0.0.0.0 {
            network 10.126.2.0/30
            network 10.126.1.10/32
        }
        log-adjacency-changes {
        }
        neighbor 10.126.2.2 {
        }
        parameters {
            router-id 10.126.1.10
        }
    }
}

Details

Version: 1.1.7

Event Timeline

ekim created this task. Jul 19 2017, 4:34 PM

Hey,

Can you verify using tcpdump or other means how long the dhcp lease is?

tcpdump -i eth0 -n port 67 and port 68

The above should show whether the DHCP lease is short; dhclient, the DHCP server, or both might be causing this issue.

DHCP lease is 1 hour. Is there a lower/upper bounds that DHCP leases should stay inside of?

1 hour is an incredibly short DHCP lease. 8 hours would be common on something like a guest Wi-Fi network, and a day or more would be appropriate for a more static configuration.

Static would make a lot more sense for this sort of configuration, but if that's not an option, I would at least increase the lease time.

Having said that, I've never had much luck with OSPF over any sort of WAN link. Would a simpler protocol like RIP or a stateful protocol like BGP make more sense?

As for why the session might be dropping:

Are there any firewall rules that might be dropping the traffic?

You have a VTI but it does not look like you adjusted the MTU to account for the encapsulation. It's possible packets are being dropped on the floor.
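To put a rough number on the encapsulation overhead mentioned above, here is a minimal Python sketch of the arithmetic. The ticket does not say which ESP cipher or integrity algorithm is in use, so the defaults below (AES-CBC with a 16-byte IV and block size, HMAC-SHA1-96 with a 12-byte ICV, IPv4 outer header, 1500-byte link MTU) are assumptions, and `max_inner_packet` is a hypothetical helper name, not part of VyOS.

```python
def max_inner_packet(link_mtu=1500, iv_len=16, icv_len=12, block=16, nat_t=False):
    """Largest inner IP packet that fits in one ESP tunnel-mode packet.

    All defaults are assumptions (AES-CBC + HMAC-SHA1-96 over IPv4);
    adjust them to match the actual IPsec proposal.
    """
    OUTER_IP = 20     # outer IPv4 header without options
    ESP_HEADER = 8    # SPI + sequence number
    UDP_NAT_T = 8     # extra UDP header when NAT traversal is active
    ESP_TRAILER = 2   # pad-length + next-header bytes

    overhead = OUTER_IP + ESP_HEADER + iv_len + icv_len
    if nat_t:
        overhead += UDP_NAT_T

    # The encrypted part (inner packet + padding + trailer) must be a
    # multiple of the cipher block size, so round the room down.
    encrypted_room = link_mtu - overhead
    encrypted_room -= encrypted_room % block
    return encrypted_room - ESP_TRAILER


print(max_inner_packet())            # 1438
print(max_inner_packet(nat_t=True))  # 1422
```

Under these assumptions, a VTI left at an MTU of 1500 would advertise about 60 to 80 bytes more than the tunnel can carry without fragmentation.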

I would suggest you tcpdump both sides and verify that every OSPF packet that leaves one host actually makes it to the other.

Hi, to me it's definitely an MTU problem: DBD packets are filled completely with prefixes, so they need more space.
You could also try switching "network point-to-point" to broadcast. It could be a limitation of ospfd, but that's probably not the case.
Anyway, Full/DROther on a point-to-point link looks weird to me; there should be no election on a link with only two peers...

Perhaps only one side is set point-to-point? That would also cause ... interesting behavior.

1 hour is incredibly short, I agree. However, in some cases we are unable to change this parameter as I don't control the device.

Agreed, static would make more sense; however, I'm at the whim of the service provider at these sites.

Since this is being routed over an IPsec tunnel and we want a link-state routing protocol, OSPF makes the most sense. I've never had issues running OSPF over IPsec tunnels.

There are no firewall rules dropping any traffic.

I would concede that MTU could be an issue if the problem had persisted after I changed the IP address to a static one. Additionally, it would be weird if the issue were caused by MTU but the adjacency still formed, as identical MTU is a requirement for OSPF adjacencies.

I have verified that every OSPF packet that leaves is received by its neighbor.

I thought Full/DROther was weird as well, however, all devices at all sites are set to p2p.

Whether or not the MTU is causing the issue, it should obviously be set anyway. MTU issues always seem to cause intermittent and hard-to-track-down problems.

As for the OSPF issue: if there is even a one-byte difference between the LSAs in the static and DHCP cases, the larger packet could get dropped because of an MTU problem. The adjacency could still form because both sides _think_ the MTU is 1500, but the moment the prefixes start flowing, you hit the limit. If you have verified that every single packet makes it through, then that's a different ball of wax.

The 1-minute cycle makes it seem like it's expecting something like a hello but the timer is expiring, though there are lots of possibilities.
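The "about a minute" figure can be checked against the timestamps in the logs from the description. The following is a minimal Python sketch that measures how long the neighbor stays in Full before each drop; the `full_durations` helper and its regular expression are illustrative assumptions written for this specific log format, and the year is supplied by hand because syslog timestamps omit it.

```python
import re
from datetime import datetime

# nsm_change_state lines copied from the ticket description.
LOG = """\
Jul 19 13:03:12 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:03:59 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
Jul 19 13:04:02 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:04:49 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
Jul 19 13:04:52 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:05:39 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
Jul 19 13:05:42 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Loading -> Full): scheduling new router-LSA origination
Jul 19 13:06:29 corp-fw-b ospfd[2298]: nsm_change_state(1.1.1.1, Full -> Init): scheduling new router-LSA origination
"""

PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*nsm_change_state\((\S+), (\w+) -> (\w+)\)"
)


def full_durations(log_text, year=2017):
    """Return the seconds spent in Full before each transition out of it."""
    durations = []
    went_full = None
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        old_state, new_state = match.group(3), match.group(4)
        if new_state == "Full":
            went_full = stamp
        elif old_state == "Full" and went_full is not None:
            durations.append(int((stamp - went_full).total_seconds()))
            went_full = None
    return durations


print(full_durations(LOG))  # [47, 47, 47, 47]
```

The adjacency lasts exactly 47 seconds each time in this excerpt, which points to something periodic rather than random loss. Note also that the logged transition is Full -> Init, which the neighbor state machine in RFC 2328 associates with the 1-WayReceived event (a hello that no longer lists this router), whereas an expired inactivity timer would move the neighbor to Down.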

Is that the maximum debugging level? Have you tried checking dmesg for other messages? (I had an interface that kept dropping which was causing protocol resets but did not show up in the protocol logs).

@ekim Technically the DHCP lease should not affect network traffic at all; the renewal should be transparent if the IP stays the same.
Since the issue appears after about a minute and the lease is 1 hour, I believe the lease is fine and probably not the cause of the issue.
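The mismatch between the lease timers and the flap period can be made concrete with a minimal Python sketch. It assumes the server does not override the renewal (T1) and rebinding (T2) times, so the RFC 2131 defaults of 0.5 and 0.875 of the lease apply; `dhcp_timers` is a hypothetical helper name.

```python
def dhcp_timers(lease_seconds, t1_factor=0.5, t2_factor=0.875):
    """Return (T1, T2) in seconds using the RFC 2131 default factors.

    Assumption: the DHCP server does not send explicit T1/T2 options.
    """
    return lease_seconds * t1_factor, lease_seconds * t2_factor


t1, t2 = dhcp_timers(3600)  # the 1-hour lease reported in this ticket
print(t1, t2)               # 1800.0 3150.0

# Consecutive Full -> Init transitions in the logs are 50 seconds apart
# (13:03:59, 13:04:49, 13:05:39, 13:06:29).
flap_period = 50
print(t1 / flap_period)     # 36.0 flaps would occur before the first renewal
```

With a first renewal expected around the 30-minute mark, ordinary lease renewal does not line up with a drop every 50 seconds.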

syncer triaged this task as Low priority. Jul 25 2017, 12:07 AM

Will recommend retest with 1.2 nightlies

pasik subscribed. Oct 1 2018, 9:53 AM

syncer edited projects, added VyOS 1.2 Crux (VyOS 1.2.0-rc4); removed VyOS 1.2 Crux. Oct 13 2018, 10:09 AM

no feedback

syncer edited projects, added Invalid; removed VyOS 1.2 Crux (VyOS 1.2.0-rc4). Oct 15 2018, 5:45 AM
