Bug #232: alfred server loose data - alfred - Open Mesh

Bug #232

closed

alfred server loose data

Added by fuzzle FFFR over 9 years ago. Updated over 8 years ago.

Status: Rejected
Priority: Low
Start date: 02/19/2016

Description

Very often alfred exits with return code 1, which makes a mess of our map and map data.

We thought that with alfred we have to use server mode to gather as much data as possible from our Freiburg Freifunk mesh network (https://openfreiburg.de/freifunk/meshviewer).

If we run alfred in master mode (on interface bat0), we get a very stable 60 nodes (see the grey line at the bottom).
If we run alfred in slave mode (just on interface bat0, without the -m option), we get the zigzag pattern that started some time ago (see the grey line in the image).

This means we get most of the nodes with their data (but not all: 260 out of 290), but with very regular alfred errors (return code 1).

I am happy to have this visualisation:
1 general graph, 1 zoomed-in graph, the original HTML, and here the original link (but we won't switch alfred's master mode regularly, so you will see the zigzag):
https://openfreiburg.de/freifunk/data.html
1 "original" ffmap-backend PNG generated with rrdtool, where switching alfred from slave mode to master mode makes the charts drop: https://openfreiburg.de/freifunk/nodes/globalGraph.png https://openfreiburg.de/freifunk/nodes/globalGraphW.png https://openfreiburg.de/freifunk/nodes/globalGraphM.png

The data for this graph is generated with batctl tg |grep \.W\.|wc -l for clients (background) and
alfred -r 159 |wc -l for nodes (we could also use batctl o, but that doesn't help us with the alfred errors),
and these errors are the reason for our unreliable map.
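
The pipeline above discards alfred's exit status, so a failed query ends up in the graph as a low node count. A minimal sketch of a counter that keeps failures apart from real counts follows; the function name `count_lines_or_fail` and the `NA` marker are made up for illustration, and only the `alfred -r 159` call comes from the report.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print its number of output lines,
# or print "NA" and return non-zero if the command itself failed.
count_lines_or_fail() {
    out=$("$@") || { echo "NA"; return 1; }
    # Empty output counts as 0 lines.
    if [ -z "$out" ]; then
        echo 0
    else
        printf '%s\n' "$out" | wc -l | tr -d ' '
    fi
}

# Usage on a node or gateway (commented out; needs a running alfred):
# count_lines_or_fail alfred -r 159
```

Recording `NA` instead of a number makes it possible to tell "alfred failed" from "few nodes answered" when the graph is drawn.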

Do you have an idea how we can make alfred run more stably, or ideas on how to find the error? Besides return code 1, there is no error in syslog or anywhere else.

batctl tells us that the layer 2 network itself behaves pretty normally in every respect.

[Two things to mention: we have batman-adv-legacy 2014 in the network, and I raised the timeout for forgetting alfred data from 600 s to 1200 s (which helps a little bit).]

Files

Bildschirmfoto vom 2016-02-19 23_24_59.png (46.8 KB), fuzzle FFFR, 02/19/2016 11:29 PM
Bildschirmfoto vom 2016-02-19 23_37_51.png (7.05 KB): switch from alfred master mode to slave, fuzzle FFFR, 02/19/2016 11:39 PM
data.html (97.7 KB): stable but far too few nodes, fuzzle FFFR, 02/19/2016 11:39 PM
globalGraph.png (33.6 KB): connection drop after alfred master mode is used, fuzzle FFFR, 02/19/2016 11:51 PM

Related issues 1 (1 open — 0 closed)

Updated by Sven Eckelmann over 9 years ago

Alfred in slave mode doesn't send stored data to the client but asks a master server. This can fail when you have too much packet loss between master and slave. The slave server will then return an error to the client.

You can find out more about that in ticket #202
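
Since a slave can fail a request simply because the exchange with the master lost packets, a client-side script can retry before treating the result as an error. A rough sketch of that idea; the `retry` helper, its attempt count and its delay are assumptions for illustration and not part of alfred.

```shell
#!/bin/sh
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage (commented out; needs a running alfred):
# retry 3 2 alfred -r 159
```

This only hides sporadic failures from the client; it does not change the packet loss between master and slave.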

Updated by Sven Eckelmann over 9 years ago

Related to Bug #202: Multiple Master Syncing Robustness added

Updated by Sven Eckelmann over 9 years ago

Description updated (diff)

Updated by fuzzle FFFR over 9 years ago

Some more notes: we recently changed the MTU on the fastd interface in our network from 1426 to 1280.
We learned that Unitymedia cable internet failed with the higher MTU,
and it seems that gluon patches batman to reduce the maximum fragment size to 1280 as well (./gluon/patches/packages/routing/0003-batman-adv-decrease-maximum-fragment-size.patch).

The first result is that alfred seems to be more reliable.

(So I guess it had something to do with double fragmentation: because of the high rate of UDP packets, many of them failed to arrive or arrived "broken/incomplete", which leads to errors.)

The problem is still not completely solved, because currently
batctl o | wc -l counts ~337 and
alfred -r 159 only 290.
... Even with all gateways and the like removed, a number of 320+ is much more likely, but that is not proven yet.
... Unfortunately, data.html now uses batctl data to count nodes.

I wonder if you have some ideas about what we can test, and how, to find a solution.
I can build firmware on the fly (gluon-based, currently still batman v14). I can roll it out on some of the nodes or, if needed, maybe on all nodes (which is not so easy because many people are involved, so this is not an option for "pure" testing).
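
One way to narrow down the gap between batctl o and alfred -r 159 is to list which originators have no alfred record. The following is a sketch under assumptions: it treats the first MAC address at the start of each line of either output as the node's identifier, which may not hold if nodes announce their alfred data under a different MAC than their originator address. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the MAC address that leads each input line
# (ignoring leading non-hex characters), lower-cased, sorted, de-duplicated.
first_macs() {
    sed -nE 's/^[^0-9A-Fa-f]*(([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}).*/\1/p' |
        LC_ALL=C tr 'A-F' 'a-f' | LC_ALL=C sort -u
}

# Print MACs that appear in the first file but not in the second.
# Writes <file>.macs next to each input file as a side effect.
missing_from_alfred() {
    first_macs < "$1" > "$1.macs"
    first_macs < "$2" > "$2.macs"
    LC_ALL=C comm -23 "$1.macs" "$2.macs"
}

# Usage (commented out; needs batctl and a running alfred):
# batctl o > /tmp/originators.txt
# alfred -r 159 > /tmp/alfred159.txt
# missing_from_alfred /tmp/originators.txt /tmp/alfred159.txt
```

If the missing nodes turn out to be the same ones on every run, that points at those nodes rather than at random packet loss.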

Updated by Sven Eckelmann over 9 years ago

Please check #202

Updated by fuzzle FFFR over 9 years ago

Sorry for my naive question: can you please guide me a bit towards successfully building with the patch from #202?

If I want to build a Freifunk firmware with the newest alfred, I go to the build folder of
https://github.com/freifunk-gluon/gluon
./gluon/openwrt/build_dir/target-mips_34kc_uClibc-0.9.33.2_gluon-ar71xx-generic/alfred-2015.2

Then I download the patch. How do I combine it with the latest alfred?
https://git.open-mesh.org/alfred.git
https://patchwork.open-mesh.org/project/b.a.t.m.a.n./patch/1459103215-22444-1-git-send-email-hwhilse@gmail.com/

Updated by Sven Eckelmann over 9 years ago

This is actually a question for the gluon developers. But I would guess that you should use their patch directory for that. Just download the mbox file from patchwork, use it to prepare a patch for openwrt-routing, and place that patch in patches/packages/routing/.

The patch for openwrt-routing can be generated by cloning the repository, adding the patch, committing it and running git format-patch on it:

git clone https://github.com/openwrt-routing/packages.git
cd packages
mkdir -p alfred/patches
curl https://patchwork.open-mesh.org/project/b.a.t.m.a.n./patch/1459103215-22444-1-git-send-email-hwhilse@gmail.com/mbox/ -o alfred/patches/0002-tcp-sync.patch
git add alfred/patches/0002-tcp-sync.patch
git commit -a -m 'alfred: implement TCP support for server-to-server communication'
git format-patch --abbrev=7 -U3 --diff-algorithm=histogram --no-signature --format=format:'From: %an <%ae>%nDate: %aD%nSubject: [PATCH] %B' -1 --start-number=4
mv 0004-alfred-implement-TCP-support-for-server-to-server-co.patch "${GLUON_PATH}"/patches/packages/routing/
cd "${GLUON_PATH}" 
make update
make GLUON_TARGET=ar71xx-generic

But this is pure theory because I have neither used gluon nor built gluon myself.

And please keep in mind that this may break in the future because gluon doesn't seem to be made to be reproducible in any way. It is (unless I missed something [1]) always using the currently most recent version of its dependencies. So an update to the openwrt-routing feed could break it.

Update:

[1] Yes, it seems I've missed something. Gluon isn't reproducible by default, but it allows setting PACKAGES_$feedname_COMMIT in the sites/modules configuration, which pins the feed to one commit. Most likely something similar is also possible for the openwrt repo.
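
As a sketch of that pinning, the entry below fixes the routing feed to a single commit. The variable name follows the PACKAGES_$feedname_COMMIT pattern mentioned above, with ROUTING assumed as the feed name (inferred from the patches/packages/routing/ directory). The value is a placeholder, not a real commit.

```shell
# Site modules file (sketch). ROUTING is an assumed feed name and the
# value is a placeholder; replace it with the commit you tested against.
PACKAGES_ROUTING_COMMIT='REPLACE_WITH_TESTED_COMMIT_HASH'
```

With the feed pinned, a later change in the openwrt-routing feed should no longer silently change what the patch is applied to.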

Updated by fuzzle FFFR over 9 years ago

Yes, I had trouble with this and circumvented it by tricking the alfred.tar.gz source into being what I want, because the source is not downloaded again.
It's kind of a dirty hack, but at least I get what I want ...

One last issue to get it really right: if I take the latest alfred, do I have to say somewhere which commit the patch is based on, or does the patch work right out of the box?

# patch < tcp.patch
patching file alfred.h
Hunk #1 succeeded at 90 (offset 1 line).
Hunk #2 succeeded at 103 (offset 1 line).
Hunk #3 succeeded at 131 (offset 1 line).
Hunk #4 succeeded at 147 (offset 1 line).
Hunk #5 succeeded at 170 (offset 1 line).
Hunk #6 succeeded at 187 (offset 1 line).
Hunk #7 succeeded at 200 (offset 1 line).
Hunk #8 succeeded at 227 (offset 1 line).
patching file main.c
patching file netsock.c
patching file recv.c
patching file send.c
patching file server.c
patching file unix_sock.c
Hunk #1 succeeded at 229 (offset 7 lines).
Hunk #2 succeeded at 259 (offset 7 lines).

Updated by Sven Eckelmann over 9 years ago

Is it not applied to openwrt-routing? It should be when you run make update. Maybe Matthias can correct me and tell you the correct way to let gluon reapply all patches to the package feeds.

And this patch should be for 2016.1. It looks like it applied fine (it could be better, but at least it didn't fail). If there are problems with it, then you should send a reply using this mbox to the original author (please use "reply to all" so the reply is also visible on the mailing list).

#10

Updated by fuzzle FFFR over 9 years ago

OK, it seems that I can build this now.
The gluon patch 0001-alfred-adjust-intervals.patch in
gluon/patches/packages/routing

and this one conflict with each other, but I think it's just because of the ordering.
(I made my alfred.tar.gz and then the patch failed, so the alfred build failed. I deleted the original patch, since it only contained old timeout optimizations.)

The current gluon alfred is v2015.2.

I will report back when I have this running on one or more Debian gateway servers and at least some nodes (I think mostly tplink841n routers).

#11

Updated by fuzzle FFFR over 9 years ago

As written on the mailing list:
[[ Re: [B.A.T.M.A.N.] [RFC v2] alfred: implement TCP support for server-to-server communication ]] b.a.t.m.a.n@lists.open-mesh.org
I succeeded in building alfred 2016.2 with the patch on OpenWrt/gluon, but on Debian I got errors. Here is the corresponding email; I am not proficient enough in C to fully understand what is happening.

####
I installed your patch. Here is some feedback on the problems I had.

With OpenWrt and gluon I succeeded. I tested it on a tplink841n-v9;
it runs with or without -m (master) for a while (30+ min):
/usr/sbin/alfred -i br-client -b bat0 -m
[My big question is why these routers always output only their own entry for alfred -r 158 while there are masters; on the Debian-based master this fills up to 290.]

On Debian (3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux)
I got some build errors, and then after a short while in alfred master mode a segmentation fault (after 1 or 2 minutes, sometimes only seconds).
This happens after the master announcements, when it has already collected some 90+ alfred records.

patch stuff

wget https://patchwork.open-mesh.org/project/b.a.t.m.a.n./patch/1459103215-22444-1-git-send-email-hwhilse@gmail.com/raw/ -O tcp.patch

patch < tcp.patch
patching file alfred.h
Hunk #1 succeeded at 90 (offset 1 line).
Hunk #2 succeeded at 103 (offset 1 line).
Hunk #3 succeeded at 131 (offset 1 line).
Hunk #4 succeeded at 147 (offset 1 line).
Hunk #5 succeeded at 170 (offset 1 line).
Hunk #6 succeeded at 187 (offset 1 line).
Hunk #7 succeeded at 200 (offset 1 line).
Hunk #8 succeeded at 227 (offset 1 line).
patching file main.c
patching file netsock.c
patching file recv.c
patching file send.c
patching file server.c
patching file unix_sock.c
Hunk #1 succeeded at 229 (offset 7 lines).
Hunk #2 succeeded at 259 (offset 7 lines).

build errors

make
    CC main.o
    CC server.o
    CC client.o
    CC netsock.o
    CC send.o
    CC recv.o
recv.c: In function 'recv_alfred_packet':
recv.c:436:48: warning: passing argument 5 of 'process_alfred_request' makes pointer from integer without a cast
            (struct alfred_request_v0 *)packet, -1);
                                                ^
recv.c:302:12: note: expected 'struct tcp_connection *' but argument is of type 'int'
 static int process_alfred_request(struct globals *globals,
            ^
    CC hash.o
    CC unix_sock.o
    CC util.o
    CC debugfs.o
    CC batadv_query.o
    LD alfred
make -C vis all
make[1]: Entering directory '/home/freifunk/alfred/vis'
    CC vis.o
    CC debugfs.o
    LD batadv-vis
make[1]: Leaving directory '/home/freifunk/alfred/vis'
make -C gpsd all
make[1]: Entering directory '/home/freifunk/alfred/gpsd'
    CC alfred-gpsd.o
    LD alfred-gpsd
make[1]: Leaving directory '/home/freifunk/alfred/gpsd'

/var/log/kern.log

Apr 22 20:01:06 fffr-spielwiese kernel: [4398895.515177] alfred[31733]: segfault at 2f ip 0000000000406034 sp 00007fff9dc39580 error 6 in alfred[400000+d000]
Apr 22 20:02:05 fffr-spielwiese kernel: [4398954.569665] alfred[32657]: segfault at 2f ip 0000000000406034 sp 00007ffc45b97d50 error 6 in alfred[400000+d000]

#12

Updated by fuzzle FFFR over 9 years ago

Additionally, I built alfred with -g, but there are no core dumps, and gcore only takes snapshots of the current moment.
Here is the valgrind memcheck output. I don't know if you're interested in this error; I can do more of "whatever".

==26158== Memcheck, a memory error detector
==26158== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==26158== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==26158== Command: ./alfred -i bat0 -m -d -t
==26158== 
read unix socket
read unix socket
read unix socket
announce master ...
==26158== Invalid write of size 4
==26158==    at 0x406034: push_data (send.c:190)
==26158==    by 0x4070EB: process_alfred_request (recv.c:318)
==26158==    by 0x407540: recv_alfred_packet (recv.c:435)
==26158==    by 0x40543F: netsock_receive_packet (netsock.c:540)
==26158==    by 0x403433: alfred_server (server.c:426)
==26158==    by 0x40239A: main (main.c:283)
==26158==  Address 0x2f is not stack'd, malloc'd or (recently) free'd
==26158== 
==26158== 
==26158== Process terminating with default action of signal 11 (SIGSEGV)
==26158==  Access not within mapped region at address 0x2F
==26158==    at 0x406034: push_data (send.c:190)
==26158==    by 0x4070EB: process_alfred_request (recv.c:318)
==26158==    by 0x407540: recv_alfred_packet (recv.c:435)
==26158==    by 0x40543F: netsock_receive_packet (netsock.c:540)
==26158==    by 0x403433: alfred_server (server.c:426)
==26158==    by 0x40239A: main (main.c:283)
==26158==  If you believe this happened as a result of a stack
==26158==  overflow in your program's main thread (unlikely but
==26158==  possible), you can try to increase the size of the
==26158==  main thread stack using the --main-stacksize= flag.
==26158==  The main thread stack size used in this run was 8388608.
==26158== 
==26158== HEAP SUMMARY:
==26158==     in use at exit: 147,746 bytes in 954 blocks
==26158==   total heap usage: 1,286 allocs, 332 frees, 247,223 bytes allocated
==26158== 
==26158== LEAK SUMMARY:
==26158==    definitely lost: 0 bytes in 0 blocks
==26158==    indirectly lost: 0 bytes in 0 blocks
==26158==      possibly lost: 6,783 bytes in 12 blocks
==26158==    still reachable: 140,963 bytes in 942 blocks
==26158==         suppressed: 0 bytes in 0 blocks
==26158== Rerun with --leak-check=full to see details of leaked memory
==26158== 
==26158== For counts of detected and suppressed errors, rerun with: -v
==26158== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
again 4
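
The faulting address fits the compiler warning from the build: recv.c passes -1 where process_alfred_request expects a struct tcp_connection *, and a pointer with value -1 plus a small field offset wraps around to a small address. The sketch below only checks that arithmetic. The offset 0x30 is inferred from the fault address 0x2f and is an assumption, not something read from the patched struct definition, so this is one plausible reading of the crash rather than a confirmed diagnosis.

```shell
#!/bin/sh
# Address of a field at a given offset behind a pointer whose value is -1.
# In unsigned 64-bit pointer arithmetic 0xffffffffffffffff + offset wraps
# to offset - 1, which is the same number as the signed sum computed here.
field_addr_from_minus_one() {
    printf '0x%x\n' $(( -1 + $1 ))
}

field_addr_from_minus_one 0x30   # prints 0x2f
```

If this reading is right, the UDP code path in recv_alfred_packet would need to pass a null pointer (and push_data would need to check for it) instead of -1.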

#13