Bug #355

Missing TT entry leads to inconsistent but stable TT state

Added by Sven Eckelmann over 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee: batman-adv developers
Start date: 05/10/2018
% Done: 0%


Description

The bug report is actually from Leonardo Mörlein <>:

We have three nodes, which have the following topology:

NDS-Agenda21Haus70 <-> sn04 <-> sn03

Something happened that brought them into a stable (for hours) but inconsistent
state of their translation tables:

- sn03 has 11 entries in its local translation table and crc = 0x2b6d458c
- sn04 has only 10 entries in its table for sn03, but still crc = 0x2b6d458c
- NDS-Agenda21Haus70 also has 10 entries in its table for sn03, but crc = 0x74fb4f96

It seems as if something (erroneously) removed the entry 33:33:00:00:00:01
from the table on node sn04 without recalculating the CRC sum.

The following outputs show the table states:

root@NDS-Agenda21Haus70:~# batctl tr 88:e6:40:20:30:01
> traceroute to 88:e6:40:20:30:01 (88:e6:40:20:30:01), 50 hops max, 20 byte packets
>  1: 88:e6:40:20:40:01  16.567 ms  20.593 ms  16.324 ms
>  2: 88:e6:40:20:30:01  20.760 ms  19.826 ms  19.840 ms
> 
> root@NDS-Agenda21Haus70:~# batctl tg | grep 88:e6:40:20:30:01 
>  * 33:33:ff:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 33:33:ff:00:18:32   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    33:33:00:00:00:02   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    01:00:5e:00:00:fc   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    01:00:5e:00:00:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    33:33:ff:00:00:00   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 33:33:ff:00:30:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 2a:d7:9a:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 33:33:ff:00:01:08   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    33:33:00:01:00:03   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
> 
> [root@sn04]:~ # batctl tg | grep 88:e6:40:20:30:01
>  * 33:33:ff:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 33:33:ff:00:18:32   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    33:33:00:00:00:02   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    01:00:5e:00:00:fc   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    01:00:5e:00:00:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    33:33:ff:00:00:00   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 33:33:ff:00:30:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 2a:d7:9a:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 33:33:ff:00:01:08   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    33:33:00:01:00:03   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
> 
> [root@sn03]:~ # batctl tl
> [B.A.T.M.A.N. adv 2018.0-3-g4b2b8c68, MainIF/MAC: mesh_fastd/88:e6:40:20:30:01 (bat0/2a:d7:9a:06:3e:25 BATMAN_IV), TTVN: 3]
> Client             VID Flags    Last seen (CRC       )
> 33:33:ff:06:3e:25   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:18:32   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:00:00:00:02   -1 [.P....]   0.000   (0x2b6d458c)
> 01:00:5e:00:00:fc   -1 [.P....]   0.000   (0x2b6d458c)
> 01:00:5e:00:00:01   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:00:00   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:30:01   -1 [.P....]   0.000   (0x2b6d458c)
> 2a:d7:9a:06:3e:25   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:01:08   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:00:01:00:03   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:00:00:00:01   -1 [.P....]   0.000   (0x2b6d458c)

In the attachment I added a capture, taken on NDS-Agenda21Haus70, of the full
table response received from sn04 (sn04 -> NDS-Agenda21Haus70). I checked that
this really is an intermediate response from sn04 and not a direct response
from sn03 (sn03 is not receiving queries from NDS-Agenda21Haus70).

missing_tg_entry.pcap (210 Bytes) Sven Eckelmann, 05/10/2018 08:35 AM

History

#1 Updated by Sven Eckelmann over 1 year ago

  • Description updated (diff)

#2 Updated by Sven Eckelmann over 1 year ago

Wild and untested idea:

  • two events happen at the same time at sn04
  • both are processed at the same time in different contexts
    • context a sees the list (hash) of entries not yet updated (11 entries) and calculates crc 0x2b6d458c
    • context b sees the list (hash) of entries which was already updated (10 entries) and calculates crc 0x74fb4f96
  • batadv_tt_global_update_crc does the crc calculation without a lock and thus operations can happen interleaved
    • context b finishes early (seen from crc memory write/commit perspective) and saves crc 0x74fb4f96
    • context a finishes late (seen from crc memory write/commit perspective) and saves crc 0x2b6d458c

The hash (bat_priv->tt.local_hash) will then have only 10 entries but the CRC of 11 entries. It will therefore not recover itself using the data from the originator to which the global entries belong (because the CRC appears correct). It will still forward (as a "knowing" intermediate node) the wrong number of entries to any node querying the full TT table for this originator.

It looks to me like the tt.crc access must always be protected by a lock covering the complete hash (bat_priv->tt.local_hash or bat_priv->tt.global_hash).

#3 Updated by Sven Eckelmann over 1 year ago

Antonio just mentioned that TT responses or OGM messages could trigger recalculations of the CRC, and multiple recalculations for the same originator are prevented by orig_node->tt_lock (only different originators can calculate their CRCs at the same time).

#4 Updated by Linus Lüssing about 1 year ago

From looking at the code it seems to me that a node wrongly adding a ROAM or TEMP flag to a multicast TT entry in an OGM or TT Response might cause similar issues in tt_global_add() either here:

or here:

Either case would delete a multicast TT entry for the same address on any other node, without updating any CRC.

As we already observed bogus multicast TT entries with the 'W' and 'I' flags set in Leonardo's setup, it seems likely that this/those broken node(s) might inject a ROAM or TEMP flag in multicast TT entries, too.

If we can verify that this is indeed Leonardo's issue, then a simple fix should be to add a "!is_multicast_ether_addr()" check for the two referenced cases, to avoid interpreting these flag bits as ROAM or TEMP for multicast TT entries.

I'll hopefully be able to verify this this weekend.

#5 Updated by Linus Lüssing about 1 year ago

These two patches hopefully fix this issue:

Leonardo, would be great if you could check that this fixes it for you, too.

#7 Updated by Sven Eckelmann about 1 year ago

  • Status changed from Resolved to Closed

batman-adv 2018.2 was just released.
