Project

General

Profile

Actions

Bug #217

closed

Oops: "Unable to handle kernel paging request" in batadv_tt_local_remove

Added by A Z almost 9 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Target version:
Start date:
06/04/2015
Due date:
% Done:

0%

Estimated time:

Description

[  543.321946] Unable to handle kernel paging request for data at address 0x00100104
[  543.329455] Faulting instruction address: 0xf36267d0
[  543.334447] Oops: Kernel access of bad area, sig: 11 [#1]
[  543.339839] SMP NR_CPUS=2 P1020 RDB
[  543.343325] Modules linked in: iptable_nat ath9k nf_nat_ipv4 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute carl9170 ath9k_htc ath9k_common xt_time xt_tcpudp xt_state xt_nat xt_multipo
rt xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat_ftp nf_nat nf_log_ipv4 nf_log_common nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_ftp
 nf_conntrack macvlan mac80211 lzo_decompress lzo_compress lz4_decompress lz4_compress iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkttype ebt_nflog ebt_mark_m ebt_ma
rk ebt_log ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 crc8 crc7 crc_itu_t crc_ccitt ath9k_hw ath sch_teql sch_tbf sch_sfq sch_red sch_prio sch_htb sch_gred sch_dsmark sch_codel em_text em_nbyte 
em_meta em_cmp cls_basic act_police act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress i2c_dev ledtrig_morse ledtrig_heartbeat ledtrig_gpio booke_wdt batman_adv libcrc32c x
t_set x_tables ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark
 ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set msdos dummy vfat fat nls_utf8 ipv6 bridge stp talitos chainiv eseqiv psnap p8022 llc xts crypto_wq algif_skcipher algif_hash af_alg sha1_generic se
qiv krng rng cryptosoft cryptodev ocf crypto_null michael_mic md5 md4 hmac gf128mul ecb des_generic deflate zlib_deflate ctr ccm cbc authenc aead arc4 crypto_blkcipher cfg80211 compat i2c_mpc mmc_block gpio_keys_polled usb_sto
rage tun mmc_core input_polldev leds_gpio sd_mod scsi_mod button_hotplug crc32c_generic crypto_hash
[  543.509957] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.11 #59
[  543.516046] task: c049d340 ti: ee61a000 task.ti: c04ca000
[  543.521438] [stack] NIP: f36267d0 LR: f36267c4 CTR: c03558f4
[  543.527092] [stack] REGS: ee61bbe0 TRAP: 0300   Not tainted  (3.18.11)
[  543.533611] [stack] MSR: 00029000 <CE,EE,ME>  CR: 28828022  XER: 20000000
[stack] DEAR: 00100104 ESR: 00800000
[stack] GPR00: f36267c4 ee61bc90 c049d340 ef4fe960 00000000 00000000 00000000 00000001
[stack] GPR08: 00000000 00200200 00100100 00000000 48828022 10034aa0 c04a3bcc ef00b000
[stack] GPR16: 00000001 00000000 ef6c40c0 22828042 00000001 00000099 c04d0000 00000014
[stack] GPR24: ef77d080 ef4fe960 00000003 ef4fe580 ef3a2ce0 ef0248ce 00000001 ef3a2e40
[  543.574932] [stack] NIP [f36267d0] batadv_tt_local_remove+0xcc4/0xda0 [batman_adv]
[  543.582505] [stack] LR [f36267c4] batadv_tt_local_remove+0xcb8/0xda0 [batman_adv]
[  543.589978] [stack] Call Trace:
[  543.593125] [stack] [ee61bc90] [f36267c4] batadv_tt_local_remove+0xcb8/0xda0 [batman_adv] (unreliable)
[  543.602444] [stack] [ee61bcd0] [f361b4e4] batadv_tvlv_containers_process+0xb0/0x1e8 [batman_adv]
[  543.611232] [stack] [ee61bd10] [f3610c2c] 0xf3610c2c
[  543.616201] [stack] [ee61bda0] [f3611810] 0xf3611810
[  543.621172] [stack] [ee61be00] [f361af1c] batadv_batman_skb_recv+0x10c/0x14c [batman_adv]
[  543.629360] [stack] [ee61be20] [c02ad8a4] __netif_receive_skb_core+0x710/0x7c0
[  543.636584] [stack] [ee61be90] [c02b018c] netif_receive_skb_internal+0xb4/0x130
[  543.643893] [stack] [ee61bec0] [c02b0bf8] napi_gro_receive+0x50/0xb8
[  543.650256] [stack] [ee61bed0] [c0248120] gfar_clean_rx_ring+0x34c/0x420
[  543.656957] [stack] [ee61bf30] [c0248384] gfar_poll_rx_sq+0x3c/0xa4
[  543.663224] [stack] [ee61bf50] [c02b04f4] net_rx_action+0x94/0x1a4
[  543.669409] [stack] [ee61bf80] [c002b448] __do_softirq+0x100/0x244
[  543.675588] [stack] [ee61bfe0] [c002b874] irq_exit+0x68/0x90
[  543.681248] [stack] [ee61bff0] [c000caf8] call_do_irq+0x24/0x3c
[  543.687170] [stack] [c04cbe90] [c00046a0] do_IRQ+0xa8/0xf4
[  543.692658] [stack] [c04cbeb0] [c000e6c8] ret_from_except+0x0/0x18
[  543.698843] [stack] --- interrupt: 501 at arch_cpu_idle+0x24/0x60
[  543.698843] [stack]    LR = arch_cpu_idle+0x24/0x60
[  543.709801] [stack] [c04cbf70] [c006f18c] rcu_idle_enter+0x9c/0xf4 (unreliable)
[  543.717120] [stack] [c04cbf80] [c0058b64] cpu_startup_entry+0x118/0x18c
[  543.723743] [stack] [c04cbfb0] [c04699c8] start_kernel+0x38c/0x3a0
[  543.729922] [stack] [c04cbff0] [c00003d4] set_ivor+0x140/0x17c
[  543.735750] [stack] Instruction dump:
[stack] 3878003c 38848484 480030b9 2f9e0000 419e0054 2f9f0000 419e004c 7f23cb78
[stack] 48002c11 813f0010 815f000c 7f23cb78 <912a0004> 91490000 3d200010 61290100
[  543.753980] ---[ end trace 93721eec70ff979c ]---
[  543.758590]
[  544.760125] Kernel panic - not syncing: Fatal exception in interrupt
[  544.766484] Rebooting in 3 seconds..
batman_adv: B.A.T.M.A.N. advanced 2014.4.0 (compatibility version 15) loaded
batman bridge loop avoidance enabled
distributed arp table: enabled
kernel: Linux femap0106 3.18.11 #37 SMP Sat May 30 10:09:58 CEST 2015 ppc GNU/Linux (openwrt)

I see these device rebooting regularly (thought I do not know the backtrace of all of them), so this likely can be reproduced easily.


Files


Related issues 2 (0 open2 closed)

Related to batman-adv - Bug #223: Kernel Crash when using more than one interface in bat0Closed08/20/2015

Actions
Related to batman-adv - Bug #228: Workqueue: bat_events batadv_send_outstanding_bat_ogm_packetClosedPaulo da Silva01/09/2016

Actions
Actions #1

Updated by A Z almost 9 years ago

I'm wondering if this was related to eth2 flapping up/down, which was added to batman using batctl if add .

Actions #2

Updated by Marek Lindner almost 9 years ago

  • Status changed from New to In Progress

Thanks for reporting the issue. The fastest way to solve this is by making the crash reproducible for us (if we can).

The backtrace indicates a packet was just received. Maybe that coincides with another action. You said you have a lot of 'if add' and 'if down'. Why would that be ? Do you run a script on ifup/ifdown ?

Actions #3

Updated by Marek Lindner almost 9 years ago

Another idea: Could you enable full batman-adv logging and provide the last 10-20 lines ? Might give us a clue where to look.

Actions #4

Updated by A Z almost 9 years ago

There are two embedded devices connected using wire both running some linux firmware; the flashed firmware connects using wire or managed wifi and then fetches a stage2 firmware image to be booted using kexec in ram. The stage2 then runs batman-adv on wire and ibss mesh in parallel to hostapd. WiFi-Clients get bridged into a vpn, that runs over wire or mesh. During kexec, the wired link will go down and back up; additionally, the stage2 of one device rebootet every 20min or so using some script (device was not permitted to connect with vpn).

I'll try to get the logs.

Actions #5

Updated by Sven Eckelmann almost 9 years ago

I did not make a big review but it seems that the function is missing the locking. This must not be related with your problem but it not the worst place to report this problem.

    /* if this client has been added right now, it is possible to
     * immediately purge it
     */
    batadv_tt_local_event(bat_priv, tt_local_entry, BATADV_TT_CLIENT_DEL);
    hlist_del_rcu(&tt_local_entry->common.hash_entry);
    batadv_tt_local_entry_free_ref(tt_local_entry);

hlist_del_rcu is writing to the list but is not taking any lock... this is wrong. The function has to use the hash bucket spinlock of this element before it can delete it. But it first has to make sure that this is actually in the list (in the same correct locking context). RCU will not help here because it doesn't provide any consistency between these unrelated things. Only because it was found earlier, it is not required to be in the a valid object in the any list.

Same (hlist_del_rcu) for things like

static void
batadv_tt_global_del_orig_entry(struct batadv_tt_global_entry *tt_global_entry,
                struct batadv_tt_orig_list_entry *orig_entry)
{
    batadv_tt_global_size_dec(orig_entry->orig_node,
                  tt_global_entry->common.vid);
    atomic_dec(&tt_global_entry->orig_list_count);
    hlist_del_rcu(&orig_entry->list);
    batadv_tt_orig_list_entry_free_ref(orig_entry);
}

Multiple instances searching and deleting on the same list == bad idea. RCU doesn't help here. See https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/Documentation/RCU/whatisRCU.txt#l702

There are many more (mcast, tvlv_handler, ...) were only the delete (hlist_del_rcu) is locked but not the search for the item or the check if it still in the list. This can also lead to such poisoned list pointer page faults as seen in this ticket

Actions #6

Updated by Sven Eckelmann almost 9 years ago

Btw. if it is really needed to do the search + delete in two different locking contexts then you should at least read about the difference of list_del_rcu + hlist_del_init_rcu. But the lock for the list when doing the check if it is in the list (done implicit by hlist_del_init_rcu) in the lock for the hash bucket list where this object would have been expected.

https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/include/linux/rculist.h#l105


Interesting side note:

You are seeing list poison1 https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/include/linux/poison.h#l22 + offset of 4 (maybe to hlist_node::pprev). But I cannot see right now where POISON1 is used in the rcu hlist stuff. POISON2 seems to be used everywhere for pprev. Next is never touched to make forward iterating always possible using the hlist_foreach..._rcu variants.

list_del and hlist_del should only use LIST_POISON1 (when CONFIG_DEBUG_LIST is enabled) and not the _rcu variants. So, it might be something trying to access a pointer of list_head after the next/prev item was deleted via (h)list_del. But this is just wild guessing. The source code + binary objects (*.ko + *.o) of this build could be helpful to analyze this further.


And another hint for Antonio about possible problems in his TT code: batadv_tt_update_orig also never makes sure that tt_req_node is still in a list before calling list_del. It might already have been deleted by batadv_tt_local_event or some other call to list_del on that object.

Actions #7

Updated by Marek Lindner almost 9 years ago

@Sven: Thanks for all your hints. I am going to check them one-by-one.

@A Z: I attached a patch which should address your crash. Could you apply this patch and give it a try ?

Thanks!

Actions #8

Updated by Marek Lindner almost 9 years ago

Sven Eckelmann wrote:

Same (hlist_del_rcu) for things like

static void
batadv_tt_global_del_orig_entry(struct batadv_tt_global_entry *tt_global_entry,
                struct batadv_tt_orig_list_entry *orig_entry)
{
    batadv_tt_global_size_dec(orig_entry->orig_node,
                  tt_global_entry->common.vid);
    atomic_dec(&tt_global_entry->orig_list_count);
    hlist_del_rcu(&orig_entry->list);
    batadv_tt_orig_list_entry_free_ref(orig_entry);
}

@Sven: While following up on your comment I noticed that this function might look similar but does not fall into the same category. The 2 existing callers (batadv_tt_global_del_orig_list() and batadv_tt_global_del_orig_node()) both hold the appropriate locks (tt_global_entry->list_lock) and cycle through the entire list using the hlist_for_each_entry_safe() macro. Looks correct to me - do you agree ?

I could prepare a patch to make this a little more obvious by adding additional kernel doc to batadv_tt_global_del_orig_entry() and maybe adding an underscore to the batadv_tt_global_del_orig_entry() function name ?

Actions #9

Updated by Sven Eckelmann almost 9 years ago

Thanks for the more detail check of the code. You are right about batadv_tt_global_del_orig_entry

And a small correction: I didn't mean batadv_tt_update_orig. The function I wanted to point out with a similar problem is batadv_send_tt_request.

Another weird looking function is batadv_tt_global_size_mod

Actions #10

Updated by Marek Lindner almost 9 years ago

Sven Eckelmann wrote:

And a small correction: I didn't mean batadv_tt_update_orig. The function I wanted to point out with a similar problem is batadv_send_tt_request.

You were right about this section. I already sent a fix to the ml.

Another weird looking function is batadv_tt_global_size_mod

It looks weird but also seems ok. Here the snippet in question:

    if (atomic_add_return(v, &vlan->tt.num_entries) == 0) {
        spin_lock_bh(&orig_node->vlan_list_lock);
        list_del_rcu(&vlan->list);
        spin_unlock_bh(&orig_node->vlan_list_lock);
        batadv_orig_node_vlan_free_ref(vlan);
    }

The tt.num_entries is a counter for the list itself. Only if that counter reaches zero the corresponding list element is removed. I don't quite see how a double delete can be possible. What do you think ?

Actions #11

Updated by Sven Eckelmann almost 9 years ago

batadv_tt_local_size_mod is doing the add without any conditions:

static void batadv_tt_local_size_mod(struct batadv_priv *bat_priv,
                     unsigned short vid, int v)
{
    struct batadv_softif_vlan *vlan;

    vlan = batadv_softif_vlan_get(bat_priv, vid);
    if (!vlan)
        return;

    atomic_add(v, &vlan->tt.num_entries);

    batadv_softif_vlan_free_ref(vlan);
}
Actions #12

Updated by Sven Eckelmann almost 9 years ago

So it can happen in theory that it gets 0 and then it gets back to 1 -> the thing can be called again.

Actions #13

Updated by A Z over 8 years ago

I've just received new devices to test with and new firmware with patches applied is compiling. Looking forward into testing.

Actions #14

Updated by A Z over 8 years ago

I've applied

0001-batman-adv-avoid-DAT-to-mess-up-LAN-state.patch
0001-batman-adv-lock-crc-access-in-bridge-loop-avoidance.patch
0001-batman-adv-protect-tt_local_entry-from-concurrent-de.patch
0002-batman-adv-DEBUG-track-CRC-changes.patch

from issue #216 and #217 and am still seeing this kernel crash.

Actions #15

Updated by Antonio Quartulli about 8 years ago

Could you please try running the latest release (2016.0) as it contains several bugfixes which might be related to this bug.

Actions #16

Updated by A Z about 8 years ago

Thanks, I will try to. But it might take some time.

Actions #17

Updated by Sven Eckelmann almost 8 years ago

The newest release 2016.1 contains more fixes which could be related to this bug

Btw. if you are using OpenWrt with 050-backport_netfilter_rtcache.patch or 120-bridge_allow_receiption_on_disabled_port.patch then please delete these patches and rebuild your images.

Actions #18

Updated by Sven Eckelmann almost 8 years ago

  • Subject changed from Kernel OOPS to Oops: "Unable to handle kernel paging request" in batadv_tt_local_remove
  • Description updated (diff)
Actions #19

Updated by A Z almost 8 years ago

Thanks, I will try to. But it might take some time.

Actions #20

Updated by Sven Eckelmann almost 8 years ago

Maybe waiting for v2016.2 is also not a bad idea. At least I hope to get following patch (or a variant of it) merged for this release: https://patchwork.open-mesh.org/project/b.a.t.m.a.n./patch/1464588694-19855-1-git-send-email-sven@narfation.org/

It tackles a weird memory corruption problem. A memory corruption problem like the one you may have here.

Actions #21

Updated by A Z almost 8 years ago

Thanks for pointing this out.

Actions #22

Updated by Sven Eckelmann almost 8 years ago

  • Related to Bug #223: Kernel Crash when using more than one interface in bat0 added
Actions #23

Updated by Sven Eckelmann almost 8 years ago

  • Related to Bug #228: Workqueue: bat_events batadv_send_outstanding_bat_ogm_packet added
Actions #24

Updated by Sven Eckelmann almost 8 years ago

  • Status changed from In Progress to Feedback
  • Assignee set to A Z

batman-adv 2016.2 was released last week. I suspect that this release fixes this problem. At least I have reports from Freifunk Darmstadt and Freifunk Chemnitz that an included patch solved a similar problem for them.

This ticket doesn't seem to show a lot activity anymore and thus I would like to close it soon to avoid a dead but still open ticket without a chance to mark it as fixed. I will wait until mid of July for feedback but will close this ticket if nothing happens.

Actions #25

Updated by Sven Eckelmann almost 8 years ago

  • Status changed from Feedback to Closed

Closing due to inactivity (and success reports from #223)

Actions #26

Updated by Sven Eckelmann about 7 years ago

  • Target version set to 2016.2
Actions

Also available in: Atom PDF