Open Mesh batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5622015-06-04T11:13:53ZA ZAlfonsName@web.de
<ul></ul><p>I'm wondering if this was related to eth2 flapping up/down; eth2 was added to batman using batctl if add.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5702015-06-09T12:31:49ZMarek Lindnermareklindner@neomailbox.ch
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul><p>Thanks for reporting the issue. The fastest way to solve this is by making the crash reproducible for us (if we can).</p>
<p>The backtrace indicates a packet was just received. Maybe that coincides with another action. You said you have a lot of 'if add' and 'if down'. Why would that be ? Do you run a script on ifup/ifdown ?</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5732015-06-09T14:22:34ZMarek Lindnermareklindner@neomailbox.ch
<ul></ul><p>Another idea: Could you enable full batman-adv logging and provide the last 10-20 lines ? Might give us a clue where to look.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5742015-06-09T15:07:38ZA ZAlfonsName@web.de
<ul></ul><p>There are two embedded devices connected by wire, both running some Linux firmware; the flashed firmware connects via wire or managed WiFi and then fetches a stage2 firmware image, which is booted into RAM using kexec. The stage2 then runs batman-adv on wire and IBSS mesh in parallel to hostapd. WiFi clients get bridged into a VPN that runs over wire or mesh. During kexec, the wired link goes down and back up; additionally, the stage2 of one device rebooted every 20 min or so via a script (the device was not permitted to connect to the VPN).</p>
<p>I'll try to get the logs.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5752015-06-09T19:19:38ZSven Eckelmann
<ul></ul><p>I did not do a thorough review, but it seems that the function is missing locking. This need not be related to your problem, but this ticket is not the worst place to report it.</p>
<pre>
/* if this client has been added right now, it is possible to
* immediately purge it
*/
batadv_tt_local_event(bat_priv, tt_local_entry, BATADV_TT_CLIENT_DEL);
hlist_del_rcu(&tt_local_entry->common.hash_entry);
batadv_tt_local_entry_free_ref(tt_local_entry);
</pre>
<p>hlist_del_rcu is writing to the list but is not taking any lock... this is wrong. The function has to take the hash bucket spinlock of this element before it can delete it. But it first has to make sure that the element is actually still in the list (in the same <em>correct</em> locking context). RCU will not help here because it doesn't provide any consistency between these unrelated operations. Just because the entry was found earlier does not mean it is still a valid object in any list.</p>
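<p>A minimal userspace sketch of this rule, not the batman-adv code: the membership check and the unlink happen under one lock, so of two racing deleters only one actually removes the entry. The names bucket, bucket_add and bucket_remove, and the atomic_flag used as a spinlock, are hypothetical stand-ins for the kernel's hash bucket, its spinlock and the hlist helpers.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not kernel API. */
struct node {
    struct node *next;
    int key;
};

struct bucket {
    atomic_flag lock;   /* stand-in for the hash bucket spinlock */
    struct node *head;
};

static void bucket_lock(struct bucket *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void bucket_unlock(struct bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static void bucket_add(struct bucket *b, struct node *n)
{
    bucket_lock(b);
    n->next = b->head;
    b->head = n;
    bucket_unlock(b);
}

/*
 * Search for victim and unlink it inside the same critical section.
 * Returns true only for the caller that actually unlinked the node.
 */
static bool bucket_remove(struct bucket *b, struct node *victim)
{
    struct node **pp;
    bool removed = false;

    assert(b != NULL && victim != NULL);

    bucket_lock(b);
    for (pp = &b->head; *pp; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;   /* unlink while still holding the lock */
            victim->next = NULL;
            removed = true;
            break;
        }
    }
    bucket_unlock(b);

    return removed;
}
```

<p>In this sketch only the caller that gets true back would go on to drop the reference that the list held; a second caller finds nothing to unlink and leaves the object alone.</p>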
<p>Same (hlist_del_rcu) for things like</p>
<pre>
static void
batadv_tt_global_del_orig_entry(struct batadv_tt_global_entry *tt_global_entry,
struct batadv_tt_orig_list_entry *orig_entry)
{
batadv_tt_global_size_dec(orig_entry->orig_node,
tt_global_entry->common.vid);
atomic_dec(&tt_global_entry->orig_list_count);
hlist_del_rcu(&orig_entry->list);
batadv_tt_orig_list_entry_free_ref(orig_entry);
}
</pre>
<p>Multiple instances searching and deleting on the same list == bad idea. RCU doesn't help here. See <a class="external" href="https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/Documentation/RCU/whatisRCU.txt#l702">https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/Documentation/RCU/whatisRCU.txt#l702</a></p>
<p>There are many more (mcast, tvlv_handler, ...) where only the delete (hlist_del_rcu) is locked but not the search for the item or the check whether it is still in the list. This can also lead to poisoned list pointer page faults like the one seen in this ticket.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5762015-06-10T08:17:10ZSven Eckelmann
<ul></ul><p>Btw. if it is really necessary to do the search + delete in two different locking contexts, then you should at least read about the difference between list_del_rcu and hlist_del_init_rcu. But the lock to hold when checking whether the object is in the list (done implicitly by hlist_del_init_rcu) is the lock of the hash bucket list where this object would have been expected.</p>
<p><a class="external" href="https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/include/linux/rculist.h#l105">https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/include/linux/rculist.h#l105</a></p>
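<p>A simplified userspace sketch of the "delete and re-initialise" idea mentioned above, with no RCU and no memory barriers: the unlink first checks whether the node is still linked, which makes a repeated delete a harmless no-op. The names hnode, hhead, hnode_add_head and hnode_del_init are hypothetical; the caller is assumed to hold the lock of the list the node is expected to be on.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mirror of an hlist-style node. */
struct hnode {
    struct hnode *next;
    struct hnode **pprev;   /* NULL means "not on any list" */
};

struct hhead {
    struct hnode *first;
};

static bool hnode_unhashed(const struct hnode *n)
{
    return n->pprev == NULL;
}

static void hnode_add_head(struct hnode *n, struct hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/*
 * Unlink n only if it is still linked and mark it as unlinked.
 * n->next is left untouched so that a reader already standing on n
 * could still walk forward. Returns true if this call did the unlink.
 */
static bool hnode_del_init(struct hnode *n)
{
    assert(n != NULL);

    if (hnode_unhashed(n))
        return false;

    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->pprev = NULL;

    return true;
}
```

<p>The check is only meaningful if it runs under the same lock as every other writer to that list; without the lock, two callers could both see the node as linked.</p>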
<hr />
<p>Interesting side note:</p>
<p>You are seeing list poison1 <a class="external" href="https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/include/linux/poison.h#l22">https://git.open-mesh.org/linux-merge.git/blob/2be28ed88835ece650678742be1366fcd445bf3b:/include/linux/poison.h#l22</a> + offset of 4 (maybe to hlist_node::pprev). But I cannot see right now where POISON1 is used in the rcu hlist stuff. POISON2 seems to be used everywhere for pprev. Next is never touched to make forward iterating always possible using the hlist_foreach..._rcu variants.</p>
<p>list_del and hlist_del should only use LIST_POISON1 (when CONFIG_DEBUG_LIST is enabled) and not the _rcu variants. So, it might be something trying to access a pointer of list_head after the next/prev item was deleted via (h)list_del. But this is just wild guessing. The source code + binary objects (*.ko + *.o) of this build could be helpful to analyze this further.</p>
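<p>The "poison + offset" reading can be sketched as a small helper. This is an illustration only: the poison constants used in the usage note are assumptions (the values depend on the kernel version and on POISON_POINTER_DELTA), and fault_near_poison, pprev_offset and hlist_node_layout are hypothetical names. The struct merely mirrors the two-pointer layout of an hlist node, so the offset of pprev is one pointer, i.e. 4 on a 32-bit ABI.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the two-pointer layout of an hlist node (illustration only). */
struct hlist_node_layout {
    void *next;
    void *pprev;
};

/* Byte offset of pprev inside the node: the size of one pointer. */
static size_t pprev_offset(void)
{
    return offsetof(struct hlist_node_layout, pprev);
}

/*
 * If addr lies in [poison, poison + window), store the distance from
 * the poison value in *off and return true; otherwise return false
 * and leave *off unchanged.
 */
static bool fault_near_poison(uint64_t addr, uint64_t poison,
                              uint64_t window, uint64_t *off)
{
    assert(off != NULL);

    if (addr < poison || addr - poison >= window)
        return false;

    *off = addr - poison;
    return true;
}
```

<p>Assuming a 32-bit system and a poison value of 0x00100100 for LIST_POISON1, a fault address of 0x00100104 would be reported as poison1 with offset 4, which matches the offset of the second pointer in a two-pointer list node.</p>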
<hr />
<p>And another hint for Antonio about possible problems in his TT code: batadv_tt_update_orig also never makes sure that tt_req_node is still in a list before calling list_del. It might already have been deleted by batadv_tt_local_event or some other call to list_del on that object.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5822015-06-17T12:09:42ZMarek Lindnermareklindner@neomailbox.ch
<ul><li><strong>File</strong> <a href="/attachments/542">0001-batman-adv-protect-tt_local_entry-from-concurrent-de.patch</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/542/0001-batman-adv-protect-tt_local_entry-from-concurrent-de.patch">0001-batman-adv-protect-tt_local_entry-from-concurrent-de.patch</a> added</li></ul><p>@Sven: Thanks for all your hints. I am going to check them one-by-one.</p>
<p>@A Z: I attached a patch which should address your crash. Could you apply this patch and give it a try ?</p>
<p>Thanks!</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5832015-06-18T03:26:25ZMarek Lindnermareklindner@neomailbox.ch
<ul></ul><p>Sven Eckelmann wrote:</p>
<blockquote>
<p>Same (hlist_del_rcu) for things like</p>
</blockquote>
<pre>
static void
batadv_tt_global_del_orig_entry(struct batadv_tt_global_entry *tt_global_entry,
struct batadv_tt_orig_list_entry *orig_entry)
{
batadv_tt_global_size_dec(orig_entry->orig_node,
tt_global_entry->common.vid);
atomic_dec(&tt_global_entry->orig_list_count);
hlist_del_rcu(&orig_entry->list);
batadv_tt_orig_list_entry_free_ref(orig_entry);
}
</pre>
<p>@Sven: While following up on your comment I noticed that this function might look similar but does not fall into the same category. The 2 existing callers (batadv_tt_global_del_orig_list() and batadv_tt_global_del_orig_node()) both hold the appropriate locks (tt_global_entry->list_lock) and cycle through the entire list using the hlist_for_each_entry_safe() macro. Looks correct to me - do you agree ?</p>
<p>I could prepare a patch to make this a little more obvious by adding additional kernel doc to batadv_tt_global_del_orig_entry() and maybe adding an underscore to the batadv_tt_global_del_orig_entry() function name ?</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5862015-06-18T06:57:35ZSven Eckelmann
<ul></ul><p>Thanks for the more detailed check of the code. You are right about batadv_tt_global_del_orig_entry.</p>
<p>And a small correction: I didn't mean batadv_tt_update_orig. The function I wanted to point out with a similar problem is batadv_send_tt_request.</p>
<p>Another weird looking function is batadv_tt_global_size_mod</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5872015-06-19T14:36:47ZMarek Lindnermareklindner@neomailbox.ch
<ul></ul><p>Sven Eckelmann wrote:</p>
<blockquote>
<p>And a small correction: I didn't mean batadv_tt_update_orig. The function I wanted to point out with a similar problem is batadv_send_tt_request.</p>
</blockquote>
<p>You were right about this section. I already sent a fix to the ml.</p>
<blockquote>
<p>Another weird looking function is batadv_tt_global_size_mod</p>
</blockquote>
<p>It looks weird but also seems ok. Here is the snippet in question:<br /><pre>
if (atomic_add_return(v, &vlan->tt.num_entries) == 0) {
spin_lock_bh(&orig_node->vlan_list_lock);
list_del_rcu(&vlan->list);
spin_unlock_bh(&orig_node->vlan_list_lock);
batadv_orig_node_vlan_free_ref(vlan);
}
</pre></p>
<p>The tt.num_entries is a counter for the list itself. Only when that counter reaches zero is the corresponding list element removed. I don't quite see how a double delete can be possible. What do you think?</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5882015-06-22T08:56:21ZSven Eckelmann
<ul></ul><p>batadv_tt_local_size_mod is doing the add without any conditions:</p>
<pre>
static void batadv_tt_local_size_mod(struct batadv_priv *bat_priv,
unsigned short vid, int v)
{
struct batadv_softif_vlan *vlan;
vlan = batadv_softif_vlan_get(bat_priv, vid);
if (!vlan)
return;
atomic_add(v, &vlan->tt.num_entries);
batadv_softif_vlan_free_ref(vlan);
}
</pre> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=5892015-06-22T08:57:53ZSven Eckelmann
<ul></ul><p>So, in theory, the counter can reach 0 and then go back up to 1 -> the delete path can be triggered again.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=6092015-07-21T11:51:51ZA ZAlfonsName@web.de
<ul></ul><p>I've just received new devices to test with, and new firmware with the patches applied is compiling. Looking forward to testing.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=6542015-09-03T07:51:33ZA ZAlfonsName@web.de
<ul></ul><p>I've applied</p>
<p>0001-batman-adv-avoid-DAT-to-mess-up-LAN-state.patch<br />0001-batman-adv-lock-crc-access-in-bridge-loop-avoidance.patch<br />0001-batman-adv-protect-tt_local_entry-from-concurrent-de.patch<br />0002-batman-adv-DEBUG-track-CRC-changes.patch</p>
<p>from issue <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: received packet on bat0 with own address as source address (Closed)" href="https://www.open-mesh.org/issues/216">#216</a> and <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Oops: &quot;Unable to handle kernel paging request&quot; in batadv_tt_local_remove (Closed)" href="https://www.open-mesh.org/issues/217">#217</a> and am still seeing this kernel crash.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=7002016-01-29T09:46:47ZAntonio Quartulli
<ul></ul><p>Could you please try running the latest release (2016.0) as it contains several bugfixes which might be related to this bug.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=7072016-01-29T13:37:04ZA ZAlfonsName@web.de
<ul></ul><p>Thanks, I will try to. But it might take some time.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8162016-05-02T19:40:26ZSven Eckelmann
<ul></ul><p>The newest release 2016.1 contains more fixes which could be related to this bug</p>
<p>Btw. if you are using OpenWrt with 050-backport_netfilter_rtcache.patch or 120-bridge_allow_receiption_on_disabled_port.patch then please delete these patches and rebuild your images.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8172016-05-02T19:43:33ZSven Eckelmann
<ul><li><strong>Subject</strong> changed from <i>Kernel OOPS</i> to <i>Oops: "Unable to handle kernel paging request" in batadv_tt_local_remove</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/817/diff?detail_id=413">diff</a>)</li></ul> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8192016-05-02T19:57:39ZA ZAlfonsName@web.de
<ul></ul><p>Thanks, I will try to. But it might take some time.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8352016-05-29T20:45:44ZSven Eckelmann
<ul></ul><p>Maybe waiting for v2016.2 is also not a bad idea. At least I hope to get the following patch (or a variant of it) merged for this release: <a class="external" href="https://patchwork.open-mesh.org/project/b.a.t.m.a.n./patch/1464588694-19855-1-git-send-email-sven@narfation.org/">https://patchwork.open-mesh.org/project/b.a.t.m.a.n./patch/1464588694-19855-1-git-send-email-sven@narfation.org/</a></p>
<p>It tackles a weird memory corruption problem. A memory corruption problem like the one you may have here.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8362016-05-30T05:23:21ZA ZAlfonsName@web.de
<ul></ul><p>Thanks for pointing this out.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8392016-05-31T07:41:35ZSven Eckelmann
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-5 priority-4 priority-default closed" href="/issues/223">Bug #223</a>: Kernel Crash when using more than one interface in bat0</i> added</li></ul> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8422016-05-31T18:37:47ZSven Eckelmann
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-5 priority-4 priority-default closed" href="/issues/228">Bug #228</a>: Workqueue: bat_events batadv_send_outstanding_bat_ogm_packet</i> added</li></ul> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8582016-06-18T09:42:00ZSven Eckelmann
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li><li><strong>Assignee</strong> set to <i>A Z</i></li></ul><p>batman-adv 2016.2 was released last week. I suspect that this release fixes this problem. At least I have reports from Freifunk Darmstadt and Freifunk Chemnitz that an included patch solved a similar problem for them.</p>
<p>This ticket doesn't seem to show a lot of activity anymore, so I would like to close it soon to avoid a dead but still open ticket without a chance to mark it as fixed. I will wait until mid-July for feedback but will close this ticket if nothing happens.</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=8672016-07-16T21:13:36ZSven Eckelmann
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Closed</i></li></ul><p>Closing due to inactivity (and success reports from <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Kernel Crash when using more than one interface in bat0 (Closed)" href="https://www.open-mesh.org/issues/223">#223</a>)</p> batman-adv - Bug #217: Oops: "Unable to handle kernel paging request" in batadv_tt_local_removehttps://www.open-mesh.org/issues/217?journal_id=10782017-02-11T07:46:04ZSven Eckelmann
<ul><li><strong>Target version</strong> set to <i>2016.2</i></li></ul>