Enabling Communications, Anywhere, Anytime

Assembling injection moulded Mesh Extenders

After various delays on various fronts, we now have in our possession enough components to assemble 40 Mesh Extenders, sufficient for the remaining activities of the Vanuatu Pilot.

Yesterday, the RFD900+ radios, antennae and Mesh Extender PCBs arrived:



 We already had the injection-moulded housings on hand (in the boxes behind the radios and PCBs):


First step of assembly is to fit the reverse SMA bulk-head connectors to the cases, and also install the o-rings.  While not particularly glamorous, this represents several hours of work.  Karthik, a work-placement student who will be with us over the next few months, gets to be the lucky one to do this task:


After the first afternoon's work, we have 16 units with seals and two of the three RSMA leads in place:


Once these have all been prepared, we will proceed with loading the firmware onto the PCBs and fitting the radios and bulk storage.

Our original plan was to use microSD cards, as they consume less power than USB memory sticks, and probably handle power loss better.  However, there is a problem with the kernel driver for the microSD card interface, which we have yet to resolve, so we will probably stick with USB memory sticks for now.

Fortunately we were able to get the USB port working in the Mesh Extenders, after a bunch of earlier problems with signal integrity of the USB data traces.

The only side effect is that we probably won't be able to reliably run these units on solar without a battery connected -- we'll find out for sure as we proceed with testing.

Power/radio cables with over-moulding

$
0
0

I thought some folks might be interested to know how we are getting our cables fabricated, so here are some photos of the tool used for the low-pressure injection over-moulding that adds the red rubbery plastic over the connector and circuitry in the cable head.

First, here you can see a finished cable and the tool, which was designed and machined for us.  Unlike high-pressure injection moulding, where you can turn out multiple parts per minute but the tool costs tens of thousands of dollars, the complete tool design and manufacture here cost only about AU$3,000 with our friends at Innovation Engineering.  The trade-off is that each part takes minutes to produce instead of seconds. For now at least, that isn't a big problem.


Looking closer at the tool, you can see the big fat guide pins that hold it in proper alignment when assembled:


Then on the other half you can see this funny bar with the two brass pins.  This is removable, and held in with a couple of magnets.  It ensures that there are holes through the moulded body for the thumb-screws.  It needs to be removable, as you need to get the cable in and out of the tool.


Here is that same tool without it in place:


And here is the removable part itself. You can see that it has a recess that fits snugly around the D-SUB25 connector, so that the connector doesn't get encased in plastic during moulding, but continues to stick out of the end, as it must for the cable to be useful:

Then finally, here it is all assembled (but without a cable inside):



As I have said before, it is so much easier to work with a company that is only a few minutes away: problems can be quickly inspected and dealt with, and communication flows freely, so that everything goes well.

Now we are just waiting on the delivery of our next 45 of these cables, ready for our third trip to Vanuatu.

Migrating between EFM32 parts in Simplicity Studio

One of our projects at the moment involves the EFM32 Pearl Gecko low-power microcontroller.  These are really nice, and capable of very low power consumption.

At the last minute, our supplier could only source us the Jade Gecko instead of Pearl Gecko chips. The only difference is the lack of FPU, so far as I know. Our project doesn't do any floating point, so there wasn't a problem there, or at least, so I thought.

However, when I tried to change the part in Simplicity Studio I hit endless problems, largely due to a lack of understanding about how Simplicity Studio works.

Because we were basing our project on a standard kit board, and had made our PCB pin-compatible, we were continuing to use the Board Support Package (BSP) for that kit. The new part, however, was not an option in that kit, so I had to tell Simplicity Studio that I was not using that kit anymore.  This was the source of all my pain, but it took several hours of frustration to figure that out, because the error messages being generated were not particularly helpful.

It would complain about strange missing include files, like redirectserialconfig.h, during the build process. This makes sense now that I understand that such files are provided from the BSP's kit-specific include directories -- although the particular magic by which it provides them I still don't fully understand.

This meant I was faced with either: (1) keeping the kit, and having the wrong MCU set, which might or might not work through binary compatibility, or (2) having to build a BSP myself, which I really wanted to avoid, as I still don't have enough low-level understanding of what is going on.

A bit of digging around revealed that the properties dialog of Simplicity Studio, which lets you choose among the supported parts on a given kit, in reality just sets some #defines and include directories.  Thus, all I needed to do was:

1. Go into Properties -> C/C++ Build -> Settings and, for each build target, under GNU ARM C Compiler -> Symbols delete the EFM32PG1B200F256GM48=1 entry and replace it with EFM32JG1B200F256GM48=1 (your part names may naturally be different).

2. Go into Properties -> C/C++ Build -> Settings and, for each build target, under GNU ARM C Compiler -> Includes change "${StudioSdkPath}/Device/SiliconLabs/EFM32PG1B/Include" to "${StudioSdkPath}/Device/SiliconLabs/EFM32JG1B/Include" (again, adjust the part families to suit your project).

Then, finally, I could get the build to succeed again.  This is probably all well known to just about everyone, but it tripped me up, so I figured I would document it in case it was useful for others (and also so that I can remember better how to do it in the future).

For those like myself who are more command-line oriented, here is the diff for my project (which, unfortunately, I cannot share in its entirety):

diff --git a/.cproject b/.cproject
index 75fdb6d..7ed84db 100644
--- a/.cproject
+++ b/.cproject
@@ -35,7 +35,7 @@
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols.56412088" name="Defined symbols (-D)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols" valueType="definedSymbols">
 <listOptionValue builtIn="false" value="DEBUG_EFM=1"/>
 <listOptionValue builtIn="false" value="DEBUG=1"/>
-<listOptionValue builtIn="false" value="EFM32PG1B200F256GM48=1"/>
+<listOptionValue builtIn="false" value="EFM32JG1B200F256GM48=1"/>
 </option>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin.59744490" name="Always branch to builtin functions (-fno-builtin)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin" value="true" valueType="boolean"/>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog.2095193964" name="Generate debugger-friendly prologs (-mno-sched-prolog)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog" value="true" valueType="boolean"/>
@@ -47,7 +47,7 @@
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/emlib/inc&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/kits/common/drivers&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/CMSIS/Include&quot;"/>
-<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32PG1B/Include&quot;"/>
+<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32JG1B/Include&quot;"/>
 </option>
 <inputType id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input.1259106684" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input"/>
 </tool>
@@ -121,7 +121,7 @@
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols.1906527140" name="Defined symbols (-D)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols" valueType="definedSymbols">
 <listOptionValue builtIn="false" value="DEBUG_EFM=1"/>
 <listOptionValue builtIn="false" value="NDEBUG=1"/>
-<listOptionValue builtIn="false" value="EFM32PG1B200F256GM48=1"/>
+<listOptionValue builtIn="false" value="EFM32JG1B200F256GM48=1"/>
 </option>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin.1881027154" name="Always branch to builtin functions (-fno-builtin)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin" value="false" valueType="boolean"/>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog.1759032162" name="Generate debugger-friendly prologs (-mno-sched-prolog)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog"/>
@@ -133,7 +133,7 @@
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/emlib/inc&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/kits/common/drivers&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/CMSIS/Include&quot;"/>
-<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32PG1B/Include&quot;"/>
+<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32JG1B/Include&quot;"/>
 </option>
 <inputType id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input.2098567682" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input"/>
 </tool>
@@ -206,7 +206,7 @@
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols.1801462641" name="Defined symbols (-D)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols" valueType="definedSymbols">
 <listOptionValue builtIn="false" value="DEBUG_EFM=1"/>
 <listOptionValue builtIn="false" value="DEBUG=1"/>
-<listOptionValue builtIn="false" value="EFM32PG1B200F256GM48=1"/>
+<listOptionValue builtIn="false" value="EFM32JG1B200F256GM48=1"/>
 </option>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin.711105496" name="Always branch to builtin functions (-fno-builtin)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin" value="true" valueType="boolean"/>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog.1497848589" name="Generate debugger-friendly prologs (-mno-sched-prolog)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog" value="true" valueType="boolean"/>
@@ -218,7 +218,7 @@
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/emlib/inc&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/kits/common/drivers&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/CMSIS/Include&quot;"/>
-<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32PG1B/Include&quot;"/>
+<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32JG1B/Include&quot;"/>
 </option>
 <inputType id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input.462471232" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input"/>
 </tool>
@@ -292,7 +292,7 @@
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols.227214347" name="Defined symbols (-D)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.def.symbols" valueType="definedSymbols">
 <listOptionValue builtIn="false" value="DEBUG_EFM=1"/>
 <listOptionValue builtIn="false" value="NDEBUG=1"/>
-<listOptionValue builtIn="false" value="EFM32PG1B200F256GM48=1"/>
+<listOptionValue builtIn="false" value="EFM32JG1B200F256GM48=1"/>
 </option>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin.856642787" name="Always branch to builtin functions (-fno-builtin)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.builtin" value="false" valueType="boolean"/>
 <option id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog.2116127698" name="Generate debugger-friendly prologs (-mno-sched-prolog)" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.debug.prolog"/>
@@ -304,7 +304,7 @@
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/emlib/inc&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/kits/common/drivers&quot;"/>
 <listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/CMSIS/Include&quot;"/>
-<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32PG1B/Include&quot;"/>
+<listOptionValue builtIn="false" value="&quot;${StudioSdkPath}/Device/SiliconLabs/EFM32JG1B/Include&quot;"/>
 </option>
 <inputType id="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input.686138013" superClass="com.silabs.ide.si32.gcc.cdt.managedbuild.tool.gnu.c.compiler.input"/>
 </tool>

Assembling 50 Mesh Extenders ready for Vanuatu

Tomorrow we fly out to Vanuatu for our third visit. 

This means we need to have several dozen Mesh Extenders ready to take with us, to hopefully begin deploying in Port Vila between various government and NGO sites, and, assuming the final approvals come through, to also connect a couple of outlying villages.

From a practical standpoint, this means flashing 50 Mesh Extender boards, 50 RFD900+ radios and 50 boot loaders, pre-formatting and putting the default files on 50 USB memory sticks, and then assembling the boards in their cases.

Flashing the radios is fairly simple, and can be done using an FTDI USB serial cable in about a minute.  I did those last night while sitting at the computer after the kids went to bed.

For the boot loaders and OpenWRT firmware, I wrote a simple program today that listens on a serial port for the U-Boot boot loader, and then directly issues the commands to flash first the boot loader, and then the firmware.  When it has finished, it runs vlc and plays an MP3 file to alert me that it is ready for the next board, thus saving lots of dead time between flashing boards.  It went over more smoothly with the team in the lab when I stopped playing the music from Wizball and switched to Greensleeves.
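For the curious, the helper is conceptually very simple: wait for the boot loader prompt on the serial port, type the flashing commands, then make a noise.  Here is a rough sketch in C of that loop.  The device path, prompt strings and flash commands below are placeholders for illustration only -- they are not the actual commands our auto-flash program issues:

/* Sketch of an auto-flash style helper.  Placeholder values throughout. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <termios.h>

static int open_serial(const char *dev)
{
  int fd = open(dev, O_RDWR | O_NOCTTY);
  if (fd < 0) { perror(dev); exit(1); }
  struct termios t;
  tcgetattr(fd, &t);
  cfmakeraw(&t);
  cfsetspeed(&t, B115200);
  tcsetattr(fd, TCSANOW, &t);
  return fd;
}

/* Read bytes until the given string (e.g. the boot loader prompt) appears. */
static void wait_for(int fd, const char *needle)
{
  char window[256] = {0};
  char c;
  while (read(fd, &c, 1) == 1) {
    memmove(window, window + 1, sizeof(window) - 2);
    window[sizeof(window) - 2] = c;
    if (strstr(window, needle)) return;
  }
}

static void send_line(int fd, const char *line)
{
  if (write(fd, line, strlen(line)) < 0 || write(fd, "\r\n", 2) < 0)
    perror("write");
}

int main(void)
{
  int fd = open_serial("/dev/ttyUSB0");          /* assumed adapter path */
  for (;;) {
    wait_for(fd, "U-Boot>");                     /* placeholder prompt */
    /* Placeholder commands: fetch images from the laptop's TFTP server
       and write them to flash, boot loader first, then firmware. */
    send_line(fd, "tftpboot 0x80060000 uboot.img; write-bootloader");
    wait_for(fd, "U-Boot>");
    send_line(fd, "tftpboot 0x80060000 openwrt.bin; write-firmware");
    wait_for(fd, "done");
    /* Alert the operator that the next board can be plugged in. */
    system("cvlc --play-and-exit greensleeves.mp3");
  }
}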

Apart from some niggly problems with the USB serial adapter drivers causing the port to stop reading data, the whole process takes only a minute or two.  This is allowing us to plough through flashing the boards much faster than expected, which is good (I have been interrupted by VLC playing Greensleeves at me several times while writing this post).

Here is my bench setup, as I flash a board.  Firmware and boot loader updates are loaded via ethernet from a TFTP server on my laptop. Commands go via serial through an adapter on the radio/power connector of the Mesh Extender, and power comes via the micro-USB connector on the board, which exists for that single purpose (it doesn't actually provide any USB data signals):


The harder part is the assembly.  Here you can see three of our great students helping with it:



Ryan is checking that the boards have flashed correctly, boot, and begin transmitting via the UHF packet radio.

Then we have to get the boards into their cases. This is the fun part, because this revision of the PCBs has the wrong D-SUB 25 connector compared to what we wanted, but due to time constraints we are stuck with them for this batch.  The problem is that the D-SUB connectors have holes for screws, but are not threaded.  This means we have to put washers and nuts on the back of the connectors, which sit in frustratingly awkward locations, marked here by the yellow arrows.


Those locations are right in the bottom of the shell, about 55mm down. For the one on the left, the super-capacitor is also in the way, making it rather difficult to get in there to install the washer and nut on each.  Very annoying. Our current least-bad solution is to use some seizers.

So that's our afternoon today, basically getting enough of these assembled, ready to go, as well as getting everything packed in the big case ready to fly out tomorrow.

Second visit to Vanuatu

We are getting to the end of our third trip to Vanuatu, so I figured I really ought to post about our second visit.

This visit was shorter than the first, as it was primarily focused around the UN World Food Programme's regional Emergency Telecommunications Cluster (ETC) meeting, which has representatives from Vanuatu, Samoa, Solomon Islands, Fiji and Tonga.  The purpose of this meeting is to help the relevant folks from each country to get to know one another, as well as to learn more about what is happening globally, and to find solutions to the problems that they are facing in their region and specific countries.  This is the second time I have attended this regional meeting.

Together with Andrew Bate we spoke about the Serval Mesh and our pilot here in Vanuatu.


There was strong interest with plenty of people handling the Mesh Extender prototypes:


And then giving the Serval Mesh a try.  In the Pacific, where telecommunications is almost always a challenge, there was no need to convince anyone of the value: rather, they wanted to know when the Mesh Extenders would be available for purchase, and at least one person asked why we weren't running the pilot in their country instead!


The interactive demo was lots of fun, with many people around the room trying it out.


But what I really love are situations like that captured in the following photo, where people start showing each other how to use it:


After the indoor demo, we later had a session with other presenters where we went outside to give people further hands-on time with the various technologies being discussed. So I soldered the correct connector onto a solar panel to power a Mesh Extender directly from it, without even a battery, using the built-in solar regulator in the Mesh Extender:


Then found a suitable coconut palm to attach it to (this is the tropics, after all):


Held firmly in place with an octopus-strap, i.e., a bungee cord for those of you not from Australia:





Then it was open-season for the Mesh Extender.  Again, it was great to see a wide variety of folks from different countries and organisations in the Pacific trying it out:




Again, I just love it when people start showing each other how to use it, without any involvement from me. At that point we have victory!



Also, we discovered that the Mesh Extender is approximately one hand in size:


Then later in the week we had an impromptu meeting with the Vanuatu Ministry of Health, where we repeated the hands-on test, but this time all I did was supply the equipment: they did the temporary installation themselves:









Then it was another encouraging session where everyone was trying it out and showing each other how it worked.  There were also some folks from the UN World Health Organisation there, who were very interested in how it could be used to help with disease and public health data collection.



One of the other great things is seeing not just men using Serval, but women as well, and women showing other women how to use it.  This is important because the more that a technology is used by all kinds of people in a community, the more likely it is to succeed.





Third Visit to Vanuatu

The past month has been flat out, and I have spent more of it in Vanuatu than at home in Australia.  I am now back in Vanuatu on our fourth visit, and figured I really had better write about our third visit, before I get even further behind, especially since my wife has already started blogging about this fourth visit on her blog.

One of the things that has delayed my blogging has been the highly variable mobile internet we are experiencing here in Vanuatu.  It is possible that part of the problem is that our phones don't support the local 4G frequency; however, that shouldn't stop the 3G from working. What we often see is that the data signal seems to disappear, or, even when it is there, the throughput drops to zero or very close to it.  Basically, between 6am and midnight it is pot luck whether the internet will work at our accommodation, despite full signal strength and being able to eyeball the local phone tower.  I might have to ask our in-country contacts in the telecommunications industry for clues at some point.  In the meantime, it means that our opportunities to upload photos and edit blog posts are severely impaired, as you can sense from my wife's frustrations. Indeed, here I am at 1:30am writing this post for exactly that reason.

[Update: For some reason our phones won't do 3.5G / HSPA+.  It isn't a FairPhone 2-specific problem, as Matthew's completely different phone does the same thing.  We have since bought a cheapish Samsung from a local store, which does 3.5G (and soon 4G) on the local network without problem, largely solving our internet woes.]

Of course, this is all further motivation for us to get Serval working, so that communities for whom even annoyingly variable internet is only a pipe-dream can still communicate. And let's not even mention the spate of cyclones, earthquakes and naughty volcanoes around the place that keep reminding us of why we are working on this problem.

But first, let's go over what we got up to on the previous visit here to Vanuatu.

Working with the Smart ICT Sistas

As part of our efforts to engage all aspects of the community in the pilot project, we ran a number of sessions with the Smart ICT Sistas, a school-age group for girls and young women to develop an interest and skills in ICT.  This was really interesting on a number of fronts.

This particular Smart ICT Sista was especially keen and active, asking lots of good questions, and giving all sorts of things a try.

It was great to see some young local people engaging with our technology, to help us work out what works, and what does not in the local context. A couple of key items emerged from this process: 

(1) Android phones in Vanuatu almost exclusively have "installation from untrusted sources" enabled by the time we get to see them, because sharing apps from phone to phone is very common in Vanuatu, as people save on internet costs.  As a result, it wasn't long before we were watching the girls transfer the Serval Mesh app from phone to phone:

The Smart ICT sistas getting started with the Serval Mesh
(2) We found that the listen-before-talk settings on the Mesh Extenders were too strict for effective use in urban areas of Vanuatu, where 900MHz point-to-point links are common.  We simply would not have found this out were we not working with the Sistas, and using space provided by the very kind and supportive folks from the Vanuatu national telecommunications regulator (the TRR).  Their office is on the 2nd floor in central Port Vila, and seems to be in the direct path of a number of 900MHz point-to-point beams.  They also have a very nice spectrum analyser with direction-finding antennae, so we were able to confirm that the problem was a very high noise floor (approximately -50dBm) due to directional 900MHz links.
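For readers unfamiliar with listen-before-talk (LBT): the radio samples the received signal strength on the channel and holds off transmitting while it is above some threshold.  The snippet below is a purely illustrative sketch, not the actual RFD900 firmware logic, showing why a threshold tuned for a quiet site fails when the ambient noise floor sits at around -50dBm:

/* Illustrative listen-before-talk check -- not the RFD900 firmware. */
#include <stdbool.h>

/* Assumed helper: returns the current channel RSSI reading in dBm. */
extern int current_rssi_dbm(void);

bool ok_to_transmit(void)
{
  /* A threshold that works fine at a quiet rural site... */
  const int lbt_threshold_dbm = -80;
  /* ...never (or rarely) passes in central Port Vila, where directional
     900MHz links push the measured noise floor up to roughly -50dBm,
     so the radio ends up refusing to transmit.  Relaxing the threshold
     (within what regulations allow) restores normal operation. */
  return current_rssi_dbm() < lbt_threshold_dbm;
}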


It goes without saying, but Vanuatu is a very pretty place, and we keep seeing many beautiful and interesting things as we go about our field work here. Here is the local tourist sailing boat that you can pay to join as crew for a few days:



Then it was off to inspect and pay for the 10m mast, which we are hoping to install in Pang Pang, or possibly on top of a hill between Pang Pang and Epau villages, so that we can link them via Mesh Extender:


I was ably assisted by not only Matthew Lloyd, but also by our two French exchange students, Robin and Raphaël, who also made friends with our neighbours in the SIL compound through their cooking skills.

Raphaël, myself and Robin, working on the Mesh Extender firmware.

We also started work on prototyping a satellite-broadcast bridge for Serval, as part of our HIF grant (more on that in upcoming posts).  We are building this into a proof-of-concept low-cost tsunami/cyclone early-warning system that would provide weather forecasts and other useful information via Serval year-round, so that the units won't become so easily neglected.  Our key piece of progress so far is working out that we can put the 12cm patch antenna inside a polycarbonate box and still receive the satellite signal.  This will make it much easier to cheaply make well-protected enclosures.  We will have to see how it behaves in wet weather, as well.

Outernet receiver, with antenna in the grey box. The angle just happens to be close enough with the antenna wedged in the box, and the box sitting flat, which is most convenient.

Then it was time to go around to Pang Pang village again ...

... and go through the University's human ethics approval requirements of obtaining informed consent from the community, before we collect any data.


Places like Pang Pang continue to remind me of the importance that people place on communications. Even though the village itself lacks cellular coverage, there are multiple places you can buy top-up credit for pre-paid phones in the village:


Once the formalities were over, it was time to install five Mesh Extenders throughout the village to provide adequate coverage.  The goal was to have one Mesh Extender in each of the four distinct parts of the village.  So while some people may need to walk a short way to get within wi-fi range of a Mesh Extender, they shouldn't have to go far.

Here Donald, the chief's son, is installing the solar panel for the Mesh Extender on the chief's house:


Again, the ubiquity of communications and mobile computing devices continues to be amazing.  Here the chief is using his Samsung tablet to take a picture of us, while we take a picture of him.  Again, remember that there is no cellular coverage in the village.


To test out where we needed to put the Mesh Extenders, we attached one to a long bamboo pole, which we then hoisted up attached to the chief's house:


However, we soon found by accident that the extra elevation was not necessary, as the Mesh Extenders were reasonably close to one another (not more than a few hundred metres apart), with no really dense vegetation in the way.  We found this out on the second day, because we forgot to bring the extension cable that would allow us to use the bamboo again, and had to resort to putting it onto the roof without any extra elevation:


Then we wandered around the village with a second battery-powered Mesh Extender and laptop to verify that the signal was getting through:


By the third day we knew where we wanted to install the Mesh Extenders, so, together with Donald, we did exactly that.  In keeping with our desire that the local communities drive the process as much as possible, we left the physical installation to Donald, who, I might add, is very adept with a chainsaw. Here he is making pieces of 2x2 from a piece of 2x4 very accurately, using only the chainsaw:


First up, one Mesh Extender on the chief's house. You can see the elbow-shaped attachment that Donald used in most cases. If you look very closely, you can see that the nails that attach it to the house beam are not driven all the way in: this is so it can be easily removed when a cyclone comes, avoiding damage, and be put back out again quickly afterwards.  A very simple and very Vanuatu solution to the problem of cyclones.

Here is the Mesh Extender itself cable-tied to the 2x2 post on the roof:


Roughly the same process was followed at the other locations, resulting in Mesh Extenders installed on the roofs of five houses over an area of about 1km long by a couple of hundred metres wide.


This was all finished at about 4:30pm the night before flying out at 7am, leaving no time to install the app on people's phones, or to show them how to use it.  That would have to wait for our fourth visit, which is now half complete.

Pandanus cable ties and a Mesh Extender tree

We are now in the process of installing Mesh Extenders into Epau, the second of the two villages we are targeting here on the island of Efaté in Vanuatu.  As with Pang Pang, the community have been very gracious and enthusiastic in working with us.

It is always interesting and educational to watch how the folks here go about installing Mesh Extenders, in ways that are appropriate for them, rather than what we might naturally think of in an infrastructure-rich first-world context.

This series of images follows the process of installing a Mesh Extender in Epau.  After talking with the community, they decided that this Mesh Extender would be better mounted in a tree next to the house, than on the house itself.

First step: Attach the Mesh Extender to a long bamboo pole, which is a convenient locally available material:

Second step: If you don't have enough cable ties, make a makeshift cable tie from a dried Pandanus leaf (the same leaves that the Vanuatuans use to weave mats, bags and other useful things):


Here you can see it closer up, with the light-brown Pandanus leaf around the lower part of the Mesh Extender:


Third step: Explore the tree to work out how to get the > 6 metre long bamboo pole up there, and firmly attached:


It didn't take very long for said person to disappear even higher up the tree:


Then the pole was passed up, and it wasn't long until the Mesh Extender tree flowered: 


The power lead was fed back down through the tree:


And the Mesh Extender lofted a bit higher, to ensure it was well clear of the foliage (and hopefully will remain so for a few months before needing adjustment, due to tree growth):


The tree itself is not that small, either: The crown would be at least six metres tall, on top of a few metres of relatively bare trunk below:


With about 1.5 metres of clearance, I'd estimate that this Mesh Extender is about 10 metres above the ground:


What I really love about this installation is that it was made using local skills, local materials and local knowledge: I have every confidence that the community can take it down when a cyclone comes and put it back up again afterwards, and that they won't be stymied by a lack of parts that have to be ordered from somewhere.

All week my children have been accusing me of making "Dad jokes", so I'll continue in that vein here, by calling this installation a blooming success.

Hopefully in the next few days we will be able to get this community installing and using the app on their phones, and get a few more Mesh Extenders installed around the village.

Setting up Mesh Extender capability within NZ Red Cross

I am briefly in Wellington, NZ, visiting NZ Red Cross on my way to the Global Humanitarian Technology Conference where we have a bunch of papers to present at the end of the week.

One purpose of the visit was to update the firmware on the Mesh Extenders we had previously provided NZ Red Cross with, and to transfer the knowledge of how to flash the Mesh Extenders to their IT & Telecomms Emergency Response Unit (IT&T ERU), so that they can do it themselves in the future.

As the ERU does not normally carry laptops running Linux, we found an old disused laptop, installed Ubuntu on it, and replicated the build and flashing environment from my laptop.

The important parts were to set up a TFTP server on the laptop, copy the firmware files into it, clone the Mesh Extender openwrt-packages repository from GitHub, check out the MeshExtender2.0 branch, and compile the auto-flash program.

After that, it is just a case of running the auto-flash program with a USB-to-serial adapter connected to a specially made adapter cable, connecting the Mesh Extenders, and watching the output of auto-flash to see when a unit has been flashed.

Natalie from the ERU was super-helpful, being our guinea pig and also documenting the process.  Hopefully we will get the documentation up on the wiki in the near future, at which point I will link to it from this post.

But in the meantime, the following photo shows the completed kit, with the USB serial adapter cable, ethernet cable for TFTP, and a USB memory stick with Ubuntu so that it can be cloned to other laptops in the future, all in a fashionable marigold laptop case.  The Cat may object to wearing gold and marigold at the same time, but we are quite happy with the result for now.





Android has been a bit naughty with its location tracking

I was pointed to this article today:

https://qz.com/1131515/google-collects-android-users-locations-even-when-location-services-are-disabled/

Basically, it points out that Android has been tracking the location of phones for the past year or so, even when location tracking is disabled.  More specifically, it tells Google whenever you come in range of a cell tower.  Doing this for each cell tower a phone can hear can provide a fairly good location, especially if you integrate it over time.
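To give a sense of why this matters: even a crude estimate, such as a signal-weighted average of the positions of the towers a phone can hear, narrows a person's location down considerably, and repeated observations over time tighten it further.  The following toy illustration (certainly not Google's actual algorithm, and with made-up coordinates) shows the idea:

/* Toy illustration: coarse localisation from cell tower sightings.
   Not Google's algorithm -- just a signal-weighted centroid. */
#include <stdio.h>
#include <math.h>

struct tower { double lat, lon, rssi_dbm; };

/* Stronger signal suggests a nearer tower, so give it more weight. */
static double weight(double rssi_dbm) { return pow(10.0, rssi_dbm / 20.0); }

int main(void)
{
  struct tower seen[] = {              /* made-up example sightings */
    { -17.733, 168.322, -60 },
    { -17.741, 168.310, -85 },
    { -17.728, 168.335, -90 },
  };
  double wsum = 0, lat = 0, lon = 0;
  for (int i = 0; i < 3; i++) {
    double w = weight(seen[i].rssi_dbm);
    lat += w * seen[i].lat;
    lon += w * seen[i].lon;
    wsum += w;
  }
  printf("estimated position: %.4f, %.4f\n", lat / wsum, lon / wsum);
  return 0;
}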

The use of spyware in mobile devices is a topic we have talked about previously, both for people living in dangerous places and for victims of domestic violence, as well as other contexts where being able to locate someone further compounds their vulnerability and tips the power imbalance in favour of an abusive person or organisation.

The really naughty part in this current situation is that this was happening even without a SIM card in the phone, and even when location services were disabled in Android: there was no way to know it was happening, and no way to disable it, even if you knew.  In fact, Google effectively acknowledged it was naughty by more or less immediately phasing it out as soon as they were called out on it.

This leads me to a topic that we have been quietly working on in the background for the past couple of years: how can we trust modern computers and communications devices, when they are so complex that it almost requires accidental discovery by dedicated researchers to find significant privacy- and safety-damaging functions, which have been silently introduced to our devices -- often through software updates long after the initial purchase?

Our response to this is to explore the creation of "simply secure" communications devices, i.e., communications devices so simple, that their security can be quickly and confidently audited by a reasonably determined user, rather than requiring a team of researchers to explore.  Such devices should also make it much easier to be assured that the device cannot communicate with the outside world -- including getting a location fix -- when you don't want it to. 

Such devices are easy to make. After all, a brick is a secure communications device, in that there isn't really any way to subvert the function of a lump of burnt clay.   But it isn't useful.  This is the opposite extreme from current devices, which are almost omnipotent, but so easy to subvert.

The challenge is to design and create devices that sit on some sweet spot in the middle, where they are still simple enough to be confident in their correct function, yet not so simple as to be practically useless.

This is exactly the kind of device that we are currently designing, in the form of a specialised smart-phone that will still be capable of secure email, telephone calls, SMS and so on, while being much more resistant to attack or subversion, due to its simplicity and transparent auditability.

For example, it will have physical switches to power off the cellular modem, and the cellular modem will be completely sandboxed from the rest of the phone -- including the GPS receiver, microphone and so on. Many of these modules will also be completely removable.

It will also allow full out-of-band memory inspection of the entire system, transparent to, and independent of the processor, and provide a secure compartmentalised architecture that allows a paranoid process, for example an email decryption program, to be sure that even the hypervisor cannot interrupt it to exfiltrate private information.

We know that there are some other folks active in similar spaces, including the excellent folks at Purism. We love what they are doing, and see our thinking in this space as complementary.  The Purism laptops (and soon phone) use all open hardware, so that if you need a full-function computer, it is as trustworthy as possible.  What we are looking to do is a little different: we want to see how simple we can go, while preserving enough function to be useful. We are expecting the core operating system to fit in kilobytes of memory, not megabytes, and applications to be tens to hundreds of kilobytes, not megabytes.

There are lots of unanswered questions, not least whether the thing will actually be useful enough for anyone, but we are exploring, and, all going well, hope to be able to produce a few prototype devices by the end of 2018.  We have also secured the necessary defence-related export clearance for such a device, precisely because its combined security measures risk tipping it into the category of dual-use equipment, so we have a green light there.

So my questions for all of you reading:


  1. Would any of you buy a "phone for the paranoid" along the lines of what I am describing?
  2. What are the absolute core functions that you would require, compared to the list below:
  • Make and receive telephone calls (en clair, and quite possibly end-to-end encrypted).
  • Send and receive SMS messages (en clair or encrypted).
  • Send and receive email, including GPG or similarly encrypted.
  • Very basic web browsing, using a purposely cut-down browser.
  • Ability to run 3rd-party apps in a sand-box environment.



Getting back up to speed for 2018

It's been a while since the last post, in part due to summer holidays here in Australia, and in part due to various other things taking my time and energy.  But we are now ramping back up for 2018.  There are a few interesting things going on that are worth commenting on:

1. Porting Serval to Ubuntu Touch

A group has formed to port Serval to Ubuntu Touch, in their own words:

"The Serval Mesh for UBports group is dedicated to the task of launching a fully functional Serval app for Ubuntu Touch and aiding in the expansion of community-built mesh networks."

and

"Hi! We are busy making Serval Mesh work on UT and it's going WELL (they say)"

In just a couple of days, they already have servald running on the phone (this is one of the nice things about Ubuntu Touch basically being a port of Ubuntu Linux, and thus bringing with it many benefits of convergence, i.e., where something made for one platform works on the other, possibly without any changes).  Here are some early screenshots showing their current progress (which, given they have been working on it for only two days, is very nice):



In that last one, you can also just see the cat's ears poking up in the bottom of the list of apps, as well :)

Currently the list of devices that support Ubuntu Touch is relatively limited. That said, my FairPhone 2 is on the supported list, but I am not sure if I will switch over just yet.  I might just get a second FP2 for testing for now.  The FP2 is of course the perfect phone for Ubuntu Touch in my view, as it is an open-hardware phone as far as is possible, and allows for end-user servicing -- very much in line with the Linux ethos.

2. Outernet integration

This is a project supported by the Humanitarian Innovation Fund (HIF) out of the UK.  While we have a couple of roadblocks to solve, as a result of Outernet moving back from L-band (~1.5GHz) to Ku-Band (~12GHz), the end result will be VERY exciting, in part because the bandwidth will be higher, and in part because it looks like we will be able to incorporate the entire receiver into the existing Mesh Extender cases, thus offering a single unit that does both functions.  The new Ku receiver antennae are also nicely shaped to allow for strapping to a tree pointing in the general direction of the satellite. 

More to come on this as it progresses.

3. We have our next cohort of excellent students from INSA Lyon, France, arriving as I type.  They will be working on a number of very useful sub-projects, including bedding down the HF radio support we first demonstrated a couple of years ago, fixing bugs we identified in Vanuatu last year and so on.  I am VERY excited about seeing progress on these fronts, so that Serval Mesh Extenders can be usefully deployed by our various partners in a variety of ways.

Paul.

Early access to Serval Mesh Extender hardware

A quick post in response to a few enquiries we have had recently, where folks have been asking about getting their hands on some prototype Serval Mesh Extender devices.

So far, we have only produced 50 of these units, which have our (hopefully) IP65/IP66 injection-moulded case, IP67 power/external radio cable, integrated solar panel controller and battery charger if you wish to operate them off-grid.   Most of those are currently deployed in Vanuatu, or are in use in the lab here or with a few other partner organisations.

Although we are still in the process of testing them, they have already been in some interesting places in Vanuatu and beyond, as the images at the end of the post show.

Thus, we would like to gauge the interest in us having another production run, so that folks who would like to work with these experimental units, including to help us identify, document and fix any problems with them, can do so.

A reminder and warning: The devices would be experimental, and have known software issues that need to be resolved before they can be sensibly deployed.  It is also possible that hardware issues will be found in the process, and that these units may well never be deployable by you in any useful way, i.e., cannot be considered as a finished product or fit for merchantability. Rather, you are being given the opportunity to purchase hardware that would allow you to participate in the ongoing development of these devices.

If you want to buy Mesh Extenders to use in a serious way, this is not the offer you are looking for. As soon as we are ready to make them commercially available, we will be sure to let everyone know, and will update this post as well.


Because of the small production size that we would anticipate for this, the per-unit price will likely be quite high -- potentially as high as AU$700 each (and you really need a pair for them to be useful).  If we were able to get 100 or more manufactured in a single run, then this cost would come down, quite likely by a lot.  (But please note, this is not at all representative of the final price of these units when we hopefully bring them to market -- we intend the end price to be a LOT lower than this.  Again, the purpose of this current offer is to provide early access that would support the finalisation of these devices, and allow us to get a few extra devices made for our internal use at the same time.)

So, given the above, is there any one else out there who would also like to get their hands on one or more Mesh Extenders, in your choice of Wholly White, Rather Red or Generally Grey?

(We realise the price at this stage is rather high, and that this might preclude interest from some folks from participating. During this pre-production stage, we have a couple of options to reduce the cost a little: First, if you already own an RFD900 or RFD868 radio, or want a Wi-Fi-only Mesh Extender, you can subtract AU$100 from the price.  Second, if you want to make an offer contingent on a production run of at least 100 units, which should get the price down by AU$100 - AU$200 per unit (subject to confirming with our suppliers), you are welcome to indicate this.  We will in any case come back to the community before we proceed with any production run.)







Fixing bugs and structure of LBARD

During our trips to Vanuatu last year we identified a number of bugs with LBARD, which have been sitting on the queue to be fixed for a while. 

At the same time, we have had a couple of folks who have tried to add support for additional radio types to LBARD, but have struggled due to the lack of structure and documentation in the LBARD source code.

Together, these have greatly increased the priority of fixing a variety of things in LBARD.  So over the past few days I have started pulling LBARD apart, taking a look at some preliminary refactoring by Lars from Germany, and tried to improve the understandability, maintainability and correctness of LBARD.  This has focussed on a few separate areas:

Generally improving the structure of the source code

The most obvious change here is that the source files now live in a set of reasonably appropriately named sub-directories, to make everything easier to find.  This is supported by the factoring out of a bunch of functionality into the radio drivers and message handlers described below.

Drivers for different packet radio types are now MUCH easier to write.  

Previously you had to hunt through the source code to find the various places where the radio-specific code existed, hoping you found them all and worked out how they hung together. This also made maintenance of radio drivers more fragile, because of the multiple files that had to be maintained and that could give rise to merge conflicts.

In contrast, the process now consists of creating a header and source file in src/drivers/, that implement just a few functions, and has a single special header line that is used by the Makefile to create the necessary structures to make the drivers usable. 

The header file is the simplest: It just has to have prototypes for all of the functions you create in the source file.

The source file is not much more complex, except that you have to implement the actual functionality.  There are a few assumptions, however, primarily that the radios will all be controlled via a UART.  Using the RFD900 driver as an example, let's quickly go through what is involved, beginning with the magic comment at the start:


/*
The following specially formatted comments tell the LBARD build environment about this radio.
See radio_type for the meaning of each field.
See radios.h target in Makefile to see how this comment is used to register support for the radio.

RADIO TYPE: RFD900,"rfd900","RFDesign RFD900, RFD868 or compatible",rfd900_radio_detect,rfd900_serviceloop,rfd900_receive_bytes,rfd900_send_packet,always_ready,0

*/


For the build to detect the driver and make it available, the file requires a line that begins with RADIO TYPE: followed by the elements of a struct radio_type record, the details of which are found in include/radio_type.h. You should provide here:

1. the unique radio type suffix (this will get RADIOTYPE_ prefixed to it, and #defined to a unique value by the build system).
2. The short name of the radio type to appear in various messages.
3. A longer description of the radio type.
4. An auto-detect routine for the radio.
5. The main service loop routine for the radio. LBARD will call this for you on a regular basis. You have to decide when the radio is ready to send a packet, as well as to manage any session control, e.g., link establishment for radio types that require it.
6. A routine that is called whenever bytes are ready to be received from the UART the radio is connected to.
7. A routine that when called transmits a given packet.
8. A routine that returns 1 when the radio is ready to transmit a packet.
9. The turn-around delay for HF and other similar radios that take a considerable time to switch which end is transmitting.

These drivers are actually quite simple to write in the grand scheme of things. Even the RFD900 driver, which implements a congestion control scheme and an over-complicated transmit power selection scheme, is still less than 500 lines of C.  The current proof-of-concept drivers for the Barrett and Codan HF radios are less than 300 lines each.
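To make this concrete, here is a skeleton showing the shape of a new driver source file. The RADIO TYPE line mirrors the RFD900 example above, but the radio name, function prototypes and bodies below are illustrative stubs only -- the exact prototypes should be copied from include/radio_type.h and the existing drivers rather than from this sketch:

/* src/drivers/myradio.c -- illustrative skeleton only, not a real driver.

RADIO TYPE: MYRADIO,"myradio","Example radio for illustration",myradio_radio_detect,myradio_serviceloop,myradio_receive_bytes,myradio_send_packet,always_ready,0

*/

#include "lbard.h"   /* assumed project header */

/* Probe the UART to see if this radio type is attached; return 1 if so. */
int myradio_radio_detect(int fd)
{
  (void)fd;
  return 0;          /* stub: never claims the port */
}

/* Called regularly by LBARD: manage link state and decide when to send. */
int myradio_serviceloop(int serialfd)
{
  (void)serialfd;
  return 0;
}

/* Called whenever bytes arrive from the radio's UART. */
int myradio_receive_bytes(unsigned char *bytes, int count)
{
  (void)bytes; (void)count;
  return 0;
}

/* Transmit a single LBARD packet via the radio. */
int myradio_send_packet(int serialfd, unsigned char *packet, int length)
{
  (void)serialfd; (void)packet; (void)length;
  return 0;
}

The matching header file, src/drivers/myradio.h, would then simply contain prototypes for these four functions.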

What hasn't been mentioned so far is that if you want your radio driver to get exercised by LBARD's test suite, you also need to add support for it to fakecsmaradio, LBARD's built-in radio simulator.  This simulator implements a number of useful features, such as the ability to implement unicast and broadcast transmission, the ability to detect packet collisions and cause loss of packets, adjustable random packet loss and a flexible firewall language that makes it fairly easy to define even quite complex network topologies.  The drivers for this are placed in src/drivers/fake_radiotype.h, and will be described in more detail in a future blog post.

LBARD Messages are now each defined in a separate source file

Whereas previously the code to produce messages for insertion into packets, and the code to parse and interpret them, were scattered all over the place, they now each live in their own source file in src/messages/. For further convenience, these files are automatically discovered, along with the message types they contain, and hooked into the packet parsing code.  This makes creation of message parsing functions quite trivial: one just needs to create a function called message_parser_XX, where XX is the upper-case hexadecimal value of the message type, i.e., the first byte of the message when encoded in the packet.  Where the same routine can be used to decode multiple message types, #define can be used to allow re-use of the function.  Apart from performing their internal functions, these routines need only return the number of bytes consumed during parsing. For example, here is the routine that handles four related message types that are used to acknowledge progress during a transfer:

#define message_parser_46 message_parser_41
#define message_parser_66 message_parser_41
#define message_parser_61 message_parser_41

int message_parser_41(struct peer_state *sender,char *prefix,
                      char *servald_server, char *credential,
                      unsigned char *message,int length)
{
   sync_parse_ack(sender,message,prefix,servald_server,credential);
  return 17;  // length of ACK message consumed from input
}

Fixing LBARD transfer bugs

In addition to the structural improvements described above (and in many regards facilitated by them), I have also found and fixed quite a few bugs in Rhizome bundle transfers.  The automated tests now all pass, with the exception of the auto-detection of the HF radios, which doesn't work yet because we are still writing the corresponding fakecsmaradio drivers. Here are two of the more important bugs fixed:

1. The code for tracking and acknowledging transfers using acknowledgement bitmaps now actually works (a simplified sketch of this block-tracking idea follows after the second item below).  This is important for real-world situations where packet loss can be very high.  In contrast to Wi-Fi, which tries to hide packet loss through repeated low-latency retransmissions, we have to conserve our bandwidth, and so we keep track of what a recipient says they have received, and only retransmit that content when required.

This also plays an important role when there are multiple senders, as they listen to each other's transmissions, and use that information to try very hard to make sure that they don't send any data that anyone else has recently sent.  This helps tremendously when there are multiple radios in range of one another.

This bug is likely responsible for some of the randomness in transfer time we were seeing in Vanuatu, where some bundles would transfer in a few seconds, while others would take minutes.

2. There were some edge-cases that could cause transmission to get stuck resending the last few bytes of a bundle over and over and over, even though they had already been received, and other parts of the bundle still needed sending.

This particular bug was tickled whenever the length of a bundle, rounded up to the nearest multiple of 64 bytes, was an odd multiple of 64 bytes. In that case, it would mistakenly think that there were 128 bytes there to send, and, in trying to be as efficient as possible, send that instead of any single outstanding 64-byte piece (since it results in reduced amortised header cost).  However, the mishandling of the bundle length meant it would keep sending the last 64 bytes forever, until a higher-priority bundle was encountered.

It is quite possible that this bug was responsible for some of the rest of the randomness in latencies we were seeing, and also the "every other MeshMS message never delivers" bug, where sending one MeshMS would work, sending a 2nd wouldn't, but sending a 3rd would cause both the 2nd and 3rd to be delivered -- precisely because each additional message had a high probability of flipping the parity of the length of the bundle in 64 byte blocks.
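As promised above, here is a simplified sketch of the block-tracking idea behind both of these fixes: keep one bit per 64-byte block of the bundle that the recipient has acknowledged, and always compute the block count by rounding the bundle length up to a whole block. This is an illustration of the technique only, not the actual LBARD data structures or code:

/* Simplified illustration of bitmap-based transfer tracking.
   Not the real LBARD code or data structures. */
#include <stdint.h>

#define BLOCK_SIZE 64
#define MAX_BLOCKS 1024                 /* enough for a 64KB bundle */

struct transfer {
  int bundle_length;                    /* bytes */
  uint8_t acked[MAX_BLOCKS / 8];        /* one bit per 64-byte block */
};

/* Number of blocks, rounding the length UP to a whole block.  Getting
   this count wrong (e.g. accidentally working in 128-byte pairs) is
   exactly the kind of error that leaves a sender resending the tail
   of a bundle forever. */
static int block_count(const struct transfer *t)
{
  return (t->bundle_length + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Record that a peer has acknowledged the block covering 'offset'.
   The same update can be applied when we overhear another sender's
   acknowledgements, so we avoid resending what they already have. */
void mark_acked(struct transfer *t, int offset)
{
  int b = offset / BLOCK_SIZE;
  t->acked[b / 8] |= (uint8_t)(1 << (b % 8));
}

/* Return the byte offset of the next unacknowledged block,
   or -1 if the whole bundle has been confirmed received. */
int next_block_to_send(const struct transfer *t)
{
  for (int b = 0; b < block_count(t); b++)
    if (!(t->acked[b / 8] & (1 << (b % 8))))
      return b * BLOCK_SIZE;
  return -1;
}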

So the net result is that the various tests we had implemented in the test suite -- including delivering bundles over multiple (simulated) UHF hops, holding a MeshMS conversation in the face of 75% packet loss, transferring bundles to many recipients, and receiving a bundle efficiently from many nearby senders -- all work just fine.

The proof of the pudding is in the real-world testing, so I am hoping that we will be able to get outside with a bunch of Mesh Extenders in the coming week, set up some nice multi-hop topologies, and confirm whether we can now reliably deliver MeshMS and MeshMB traffic over multiple UHF radio hops.

In parallel with this, we will work on the drivers for the HF radios, and also for the RFM96-based LoRa-compatible radios that are legal to use in Europe, and get that all merged in and tested. But meanwhile, the current work is all there to see at https://github.com/servalproject/lbard/.

Building an indoor test network and testing LBARD bug fixes

After the successful bug hunting last week, this week I have turned my attention to making it easy to verify whether the bugs have been fixed. This means setting up a multi-hop Mesh Extender network in a convenient location.

The building I am based in was purpose-built for the College of Science & Engineering, and we were able to have a number of specific provisions included during the design phase to make this kind of test network easier to deploy, and we are now finally able to start making fuller use of that.  More specifically, we have 13 points around the building where we have copper and fibre directly to our lab, and also have some funky mounting brackets, on which we can fit radio hardware, all nicely protected from the outside weather.  

So the obvious approach was to fit several Mesh Extenders to some of those points, and form a multi-hop UHF network.  Here are some photos from various angles showing the units:

First up, we have Mesh Extender serial #23 with SID BCA306B0*, which is at point NE1. Note the extra long 6dB antenna we fitted to this one to make sure it could talk to the one at N4 (1A15ED32*). Even with that antenna, the link is quite poor, often getting only one packet every 5 to 10 seconds.  While on the one hand annoying, this is actually quite helpful for testing the behaviour of Mesh Extenders in marginal link conditions.



Mesh Extender serial #34 is at point N4, and has SID 1A15ED32*:




Note that we have removed the Wi-Fi antenna from this one. This was necessary to make sure that it can't communicate via Wi-Fi with any of the other units, which it was otherwise able to do intermittently, due to being relatively near both the one in the lab, which is on the same floor, and the one on level 5, to which it almost has line of sight via the staircase.

Mesh Extender serial #50 is at NW5 and has SID 5F39F182*:





And finally, Mesh Extender serial #17 (6ABE0326*) is sitting on the rack in our lab:



Mesh Extender #34 (1A15ED32*) is the only node that can see all the others, which we confirmed by using the debug display on LBARD:

I was still fiddling around with things at this point, but you can see already that the RSSI for BCA306B* is quite a lot lower than for the others.  Although here it looks like the link is healthy, it really does come and go a lot, and goes seconds at a time without a single packet arriving.  

The last of the four rows, the one showing 39F1821A*, is actually a known bug manifesting. This is a ghost connection, caused by receiving a packet from 5F39F1821*, with the first byte having gone walkabout for some reason. We have yet to work out the cause of this problem. It's on our bug list to find and fix.


Now, to give a better idea of how the network is laid out, here are a couple of shots from the outside of the building.  The first is looking along the northern facade to give a sense of the length of the building.  It is from memory about 50m long.


And here is the whole facade:

And, here is the facade showing, in red, the locations of the three units that are located on the mounting points just behind the three layers of metal-film coated glass that acts as a partial Faraday Cage. The black spot shows the approximate position of the lab, as projected onto the northern facade, although it is in reality on the southern side of the building, approximately 25 metres south of the northern facade. Green lines indicate strong UHF links, and the yellow line the poor link between the first and fourth floors.


The glass treatment means that very little signal can actually get out of the building, bounce off something, and find its way back in. We estimate the attenuation to be about 15dB at 923MHz, so a total of 30dB loss to pass out of the windows, and back in again, not counting any free space losses etc.  Thus we suspect that the links between the units are due to internal reflections in the building, and perhaps a bit of common mode radiation along metal parts of the building interior.  It is almost certainly not due to transmission through the ~30 - 50 cm thick reinforced concrete floor decks of the building.

Given that we are in a building with such poor RF propagation properties, it is rather pleasing that we are able to get any link at all between the first and fourth floors, given the multiple concrete decks and other obstacles that the link has to face.

So, with the network in place, it was time to do a little initial testing with it.  For this, I had a phone running Serval Chat in the lab, and I went down to level 1 by BCA306B0 and sent text messages, using the automatic delivery confirmation mechanism to know that a message has got to the other end, and indeed that a bundle has made its way back containing the acknowledgement.

First, we tested using the old version of LBARD, and were able to see the failure of messages to deliver that we had seen in the field.  Then it was a visit to all the Mesh Extenders to update LBARD (we will in the near future supply them all with ethernet connections back to the lab, so that we can remotely upgrade them). The result after upgrading LBARD was that messages were reliably delivered, taking between about 1 and 4 minutes to get the delivery confirmation.  This was of course a very pleasing and comforting result, as the inability to reliably deliver messages was the most critical issue we were seeing in Vanuatu.

The large variation in time we expect to be simply due to the very high rates of packet loss that we see on the link between BCA306B0 and 1A15ED32 -- something that we will confirm as we continue testing. The main thing is that we now have a test environment that is both convenient, and realistic enough, for us to reproduce bugs, and confirm that they have been fixed.

Slashing the cost of tsunami (and bushfire) early warning systems

We have just about finished off a project that we have had funded by the Humanitarian Innovation Fund (HIF), which is funded by the UK's foreign aid program.  This has been a bit of a complementary variation on our existing work on the Serval Mesh.  Rather than on-ground peer-to-peer communications, the goal is to get communications out into very remote locations, in particular, to get early warning for tsunamis, cyclones, bush fires and other hazards.

The tragic events in Sulawesi last month are a stark reminder of the need for tsunami early warning systems.  Following the earthquake, a tsunami warning was generated, but there was no way to get the message out to the impacted communities, in part because the earthquake had knocked out the cellular network, and in part, because many of the smaller villages likely lacked cellular coverage to begin with, or in the absence of grid electricity, many phones would have been turned off to conserve power.

These circumstances are not by any means unique to Indonesia -- they occur all around the Indo-Pacific in various forms and permutations.  The question is how to solve the problem.

The traditional approach is big tsunami early warning towers.  However, those are really expensive, especially to install in remote areas, often costing $10,000 or more per tower.  For isolated villages of a hundred or so people, the per-capita cost is simply infeasible.  Maintenance is also problematic, both financially and logistically.

We have tackled this problem from the perspective of making a solution that works best in these resource-constrained environments.  In the end, our solution is rather simple:  Use off-the-shelf satellite TV receiver parts combined with some clever signal processing work by Othernet.is, so that no dish is required -- just point it to within about 10 degrees of where the geostationary satellite is.  This gives us a low-bit-rate digital broadcast signal that we can receive over a wide area.  Add to this direct leasing of satellite capacity, so that the operating cost of the system is low and fixed, i.e., there is no monthly charge per user, so that it can be scaled up region wide.  Then all that is missing is a cheap automotive air-horn, so that the alarm can be heard within a village.  

The result is receiver hardware that costs maybe $200 in modest quantities, i.e., around 100x cheaper than the old approach, and that can be installed by communities themselves.  That is, we don't have to rely on big programs to roll these things out. Once people know about them, they are cheap enough for a village to buy, and easy enough for them to install themselves.

This just leaves the problem of disaster resilience/warning technologies not getting maintained during the year, and thus often having failed before they were needed. Our solution here is to include a low-power FM radio transmitter, so that news, weather, climate change mitigation and other information can be received in each village.  This means when the unit stops working at some point (nothing lasts forever in the tropics), they know immediately (no more radio signal, and the services it was providing them), and they can organise themselves to get it replaced and reinstalled, again, without having to rely on an NGO or aid program to do it.

In short, it is not only way cheaper, it is also way more appropriate to the needs of the Pacific.  It turns out that these characteristics also make it a great potential solution for bush-fire early-warning in Australia, where cell networks are often taken out by large bush fires (= wild fires, for those on the other side of The Pond): it is cheap enough for households to buy for themselves, and simple enough for them to self-install.

We are at the proof-of-concept stage and are looking for ways to fund the next stage.  Hopefully we will find a funder who can help us bring it to operational readiness in the next year or so.


Serval Chat iOS Port finally emerges from the lab

Goodness me, this is probably the single piece of the Serval Project that has taken the longest compared to what was expected.  We have had a number of misadventures along the way with false starts to make a working port of Serval for iOS, but today, we finally have it.

This adventure began back in about 2014, when NLnet provided us with a grant to achieve this.  Needless to say, we didn't make our original planned timelines.  We had some turnover of engineers and students on the project, Apple changed from Objective C to Swift, and we hit some early dead ends and speed humps.  But we now have something that works quite nicely on iOS 12.

This port runs the core serval-dna source in a thread in the background, so that it is fully capable and fully compatible.  Getting serval-dna to compile for iOS, produce binaries that can run on all iPhone CPU types, be sensibly accessed from Swift, and behave properly was non-trivial to say the least.  We got to that point earlier in the year, and then the remaining time has been spent making a proof-of-concept app, and working through the remaining issues, like not having the thread die when the app loses focus, and working around a pile of funny little problems, including using the properly containerised paths in the strange places where we had overlooked them, and discovering that named sockets just don't work on the iPhone.

So today we reached the happy point of having something working, just hours before our French exchange student was due to finish up.  His last great act for us was to help prepare the following short video to celebrate this milestone.


There is still some missing functionality to implement, and then the real fun begins: Trying to get the app approved for release on the Apple App Store.  But that will be an adventure for another day.

It's 2020 and Australia is burning

It almost doesn't need saying, but Australia is burning.  Lives have been lost. Houses destroyed. Livestock, farms and livelihood all gone up in smoke, quite literally.  I can't even imagine the pain and distress that this is all causing for so many. But what I can do, is try to do something about the technology gaps that are making a bad situation even worse for those affected.

As readers of this blog will know, I have been developing resilient telecommunications solutions for the last decade or so.  The best known of those is the Serval Project, which is a combination of software for smart phones that can form mesh networks, and small low-cost communications repeaters that we call Serval Mesh Extenders. Here is a brief introduction to what, and why, we are making the Serval Mesh:


 And to go right back in history to almost the beginning, here is the original motivation of the project:


And for another blast from the past, here is the original field test call from back in 2010, at Arkaroola in the Outback, which was also covered by the ABC:



Here's another video about the Mesh Extenders that we made back in 2013, when we were first developing them:


Since then, we now have much more mature hardware, and have tested the hardware for long-term operation in the field in a coastal area of Vanuatu, in large part thanks to a grant from DFAT under the Pacific Humanitarian Challenge a few years ago.

While the Serval Mesh was originally developed for disaster situations abroad, it was always also made with more local situations in mind, for example, following cyclones or in very remote areas lacking phone coverage.  One variant of the technology that was worked on for a while was the concept of an emergency network that uses vehicles as the main component, because they are the most ubiquitous infrastructure once you get out of the big cities.  Here is a piece that we produced with Toyota on this concept:

That video is also a good general introduction to the potential for the technology.

Basically, in any situation where cellular or other normal communications infrastructure is missing, damaged or disabled for any reason, the Serval Mesh lets local communities form their own local-area digital communications network.

There are also some branches to this work, for example, creating mobile phones that include the Mesh Extender functionality internally, and are designed to be fully "self-sovereign," that is independent of all power, communications and other infrastructure -- and that can offer security and privacy at least as good as the present state of the art.

So where are we at now?

Basically we have proven all the various parts of the technology that we need to make this a reality. What we need to do now -- and this was our plan for 2020 even before the fires started -- is to focus time and energy on shaking down the last wrinkles in the system, so that it is ready for deployment by communities in the field. 

This includes revising the Mesh Extender circuit board, and fixing some known problems in the software, and then testing with larger networks of dozens to hundreds of units, that would more accurately reflect the real use of the Serval Mesh by communities in the field.

Flinders University has kindly granted me a "sabbatical" year to focus on this.  I'll be based at Arkaroola to do this, where we did the original field testing, so that we can use the vast rugged landscape there, including over 600 square kilometres of mountainous desert, to be able to deploy realistic test scenarios, and work on this scaling.

Our goal is that the Serval Mesh will be ready, by the end of this year, for individuals and communities who want to use it. 

Achieving this will depend on a lot of factors, including the ever present problem of having sufficient funds for the equipment that would allow us to work more quickly, and scale up our tests more meaningfully.  It also doesn't answer the question of how we support communities in their use of the technology once it is finished, or who will offer it for sale.  But those are issues that we can think further through during the coming year.

But one thing is crystal clear to me:  We each need to consider what we can do to mitigate the effects of the fires, to adapt to a future where such events are more and more likely, and to do what we can to reduce this threat. Whether we can see the whole solution or not is, to me, secondary: We must simply make sure that we do what we can now.  And that is what I am going to do with 2020.

LBARD over Codan HF modem and debugging LBARD sync problems

This week I am finally getting to working on the remaining work for supporting the Codan 3012 HF data modem in LBARD.  This was started a long time ago, but has stalled at various points along the way.

The purpose of using this data modem is that it comes as a software option in the Codan Envoy 2210 HF radios, and can in theory transfer close to 100x more data than the horrible ALE text message method that we had demonstrated a few years ago.  That said, we are still talking about a data link that in theory can do 2,400 bits per second. I say in theory, because that relies on good atmospheric conditions for the HF communications, and because the link is actually half-duplex with quite a noticeable turn-around time.  That is, the modem at one end sends a blob of data, stops transmitting, and begins listening. The modem at the other end receives the blob of data, switches to transmit, transmits its blob of data, stops transmitting and begins listening. This is a robust arrangement, but not so helpful for our data throughput: the turn-around takes perhaps a tenth of a second or more, because this is all happening with radios that can transmit at 125W, and so involves relays and various safe-guards in the TX to RX change-over.

The modems seem to send a data blob lasting 2 seconds. So in every ~(2 + 0.1 + 2 + 0.1) seconds we get to send one blob of data.  Thus while the channel can carry 2,400 bits per second, we are limited to something below 1,200 bits per second from each end. We'll estimate this to be about 1,000 bits per second for ease of calculation.  This should give us an effective data rate of about 125 bytes per second in each direction.

This is pretty modest by modern internet standards, but it does correlate to about one text message per second. Of course, by the time we have overhead for routing, authentication and other house-keeping elements, a text message might be closer to 1KB, which would require about 8 seconds to send under ideal conditions.  This means that a link that is active 24/7 could send and receive about 86,400 seconds/day / 8 seconds = ~10,000 messages per day.  Even with just an hour a day this means that we should be able to send and receive about 400 messages. Given our goal is to connect small isolated island communities in the Pacific, and to provide backup text messaging capability during disasters, this sounds like a pretty reasonable capacity -- and certainly better than the 10 to 15 minutes it was taking to send an authenticated text message via the ALE text messaging channel!

The radio setup I have here is a pair of Codan 2210s kindly loaned to us by Codan themselves.  I have these hooked up to a laptop here via USB serial adapters.

The first step is to enable the data modems on the Envoy radios. If you have obtained these radios without enabling this feature, you will need to give Codan a call to purchase an unlock code.  You shouldn't need any hardware changes, which is handy if you are already in the middle of nowhere, needing to set up communications.

Next, use the handset to select "Menu", "User data", "Peripherals" and "RFU GP port", and select "2.4kbps data modem". It is important to NOT select "3012 mode", unless you really are using an external modem, rather than the built-in one.  The menus are shown in the following figures:





The modems themselves have an interesting feature, where they report themselves when you connect to them via the USB serial port.  However, if you power the Envoy on with the USB already connected, this doesn't always occur, and sometimes I had problems getting anything out of the modem in that context. The solution is to disconnect the USB port from the computer and reconnect it. 

I'll have to think about a robust solution for that,  as it can cause LBARD's auto-detection of the modem to fail, which would be unfortunate for automatic operation, which is what we want to achieve. It might be possible to do a software USB disconnect and reconnect, for example. Or there might be some magic option I need to setup on the radio.  Anyway, we'll come back to that later.
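
One option I may explore, assuming the Mesh Extender end is running Linux, is to force a "replug" in software by unbinding and rebinding the device via sysfs, which has much the same effect as pulling the cable out and pushing it back in. The sketch below is only illustrative: the port name "1-1.2" is an assumption, and would need to be replaced with the port on which the Envoy's USB serial adapter actually enumerates.

// Hypothetical sketch: force a USB "replug" from software by unbinding and
// rebinding the device via sysfs (Linux only). The port name "1-1.2" is an
// assumption, and must be replaced with the real port of the radio's USB
// serial adapter.
#include <stdio.h>
#include <unistd.h>

static int write_sysfs(const char *path, const char *value)
{
  FILE *f = fopen(path, "w");
  if (!f) { perror(path); return -1; }
  fprintf(f, "%s\n", value);
  fclose(f);
  return 0;
}

int usb_replug(const char *port /* e.g. "1-1.2" */)
{
  // Unbind the device from the USB core...
  if (write_sysfs("/sys/bus/usb/drivers/usb/unbind", port)) return -1;
  sleep(1);   // give the modem a moment to notice the disconnect
  // ...then bind it again, which re-enumerates it.
  return write_sysfs("/sys/bus/usb/drivers/usb/bind", port);
}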

The modems need only a few simple commands to control them -- assuming the HF radios are sitting on the same channel. I haven't yet used the commands that the modems offer for selecting channels, although it should be quite possible.

The first thing to do is to give the modem a station ID. This is a bit like a phone number. It can have up to 6 digits, but for reasons I don't immediately recall, the last two digits cannot be 00.  This is set using the AT&I=number command, e.g.:

AT&I=2

would set the station ID to 2.  Once that has been set up, you can tell a modem to try to dial it using the good old ATD command, e.g.,

ATD2

would call a modem that had been configured as previously described. This will cause the standard RING, NO ANSWER, NO CARRIER, and CONNECTED messages to be displayed based on whether the remote end answers with ATA or not, or later disconnects with ATH0.  In short, Codan have done a great job of making this modem masquerade as an old-fashioned 2400 bps data modem.  This helps it be compatible with various existing software, and helps me, because I have previously had many adventures programming such modems.

The modems default to having compression enabled, although I don't know what kind of compression they have, nor how effective it is.  For now, I am just ignoring it, and leaving it in the default setting of enabled. 

Armed with the information on how to control the modem, I set about creating an LBARD driver for it. This consists mostly of an auto-detection routine, the state machine for managing the call states, and routines for sending and receiving packets.  This is all in the src/drivers/drv_codan3012.c file.

Initialisation just consists of sending a few AT commands to set the modem's HF station ID, and enable hardware flow-control:

int hfcodan3012_initialise(int serialfd)
{
  char cmd[1024];
  fprintf(stderr,"Initialising Codan HF 3012 modem with id '%s'...\n",hfselfid?hfselfid:"<not set>");
  snprintf(cmd,1024,"at&i=%s\r\n",hfselfid?hfselfid:"1");
  write_all(serialfd,cmd,strlen(cmd));
  fprintf(stderr,"Set HF station ID in modem to '%s'\n",hfselfid?hfselfid:"1");

  snprintf(cmd,1024,"at&k=3\r\n");
  write_all(serialfd,cmd,strlen(cmd));
  fprintf(stderr,"Enabling hardware flow control.\n");

  // Slow message rate, so that we don't have overruns all the time,
  // and so that we don't end up with lots of missed packets which messes with the
  // sync algorithm
  message_update_interval = 1000;

  return 0;
}


Flow control is a bit interesting to manage. The big problem is that we don't want the buffer of the modem to get too full, as this makes the round-trip time for control messages too long, which slows things down. For example, if the far end doesn't quickly confirm that a given bundle has been received, the sender will keep sending pieces of it. Thus we need to keep well away from hitting the flow control-imposed limit. The packet sending and receiving routines keep track of this by having sequence numbers that are then returned to the sender, so that the sender can get an idea of the number of packets in flight.

LBARD modulates packet flow, not by the number of packets in flight, but by the average interval between packets.  As a first-cut approach, I set the packet interval dynamically based on the number of outstanding packets in flight.  Basically I set the packet interval to some multiple of the number of outstanding packets.  250ms per outstanding packet seems to work ok, with reasonable throughput balanced against not having too many unacknowledged packets to waste outgoing bandwidth with needless retransmissions.  I can refine that a bit more later.
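
Something along these lines captures that throttling rule (the names are illustrative, not the actual LBARD symbols):

// Illustrative sketch of the throttling described above: the inter-packet
// interval grows linearly with the number of packets still "in flight",
// i.e. sent but not yet echoed back by the other side.
#include <stdio.h>

#define MS_PER_OUTSTANDING_PACKET 250  /* 250ms per unacknowledged packet */
#define MIN_PACKET_INTERVAL_MS    250  /* never send faster than one per 250ms */

int next_packet_interval_ms(int last_seq_sent, int last_seq_echoed_by_peer)
{
  int outstanding = last_seq_sent - last_seq_echoed_by_peer;
  if (outstanding < 0) outstanding = 0;   // sequence numbers reset or wrapped
  int interval = outstanding * MS_PER_OUTSTANDING_PACKET;
  if (interval < MIN_PACKET_INTERVAL_MS) interval = MIN_PACKET_INTERVAL_MS;
  return interval;
}

int main(void)
{
  // e.g. 6 packets sent, peer has echoed up to #2: 4 outstanding -> 1000ms
  printf("%dms\n", next_packet_interval_ms(6, 2));
  return 0;
}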

The state machine for the modem is also fairly simple:
 
  switch(hf_state) {
  case HF_DISCONNECTED:
    if (hfcallplan) {
      int n;
      char remoteid[1024];
      int f=sscanf(&hfcallplan[hfcallplan_pos],
                   "call %[^,]%n",
                   remoteid,&n);
      if (f==1) {
        hfcallplan_pos+=n;
        fprintf(stderr,"Calling station '%s'\n",remoteid);
        char cmd[1024];
        snprintf(cmd,1024,"atd%s\r\n",remoteid);
        write_all(serialfd,cmd,strlen(cmd));
        call_timeout=time(0)+300;
        hf_state=HF_CALLREQUESTED;
     } else {
        fprintf(stderr," remoteid='%s', n=%d, f=%d\n",remoteid,n,f);
      }
     
      while (hfcallplan[hfcallplan_pos]==',') hfcallplan_pos++;
      if (!hfcallplan[hfcallplan_pos]) hfcallplan_pos=0;
    }   
    break;
  case HF_CALLREQUESTED:
    if (time(0)>=call_timeout) hf_state=HF_DISCONNECTED;
    break;
  case HF_ANSWERCALL:
    write_all(serialfd,"ata\r\n",5);
    hf_state=HF_CONNECTING;
    call_timeout=300;
    break;
  case HF_CONNECTING:
    // wait for CONNECT or NO CARRIER message
    break;
  case HF_DATALINK:
    // Modem is connected
    write_all(serialfd,"ato\r\n",5);
    hf_state=HF_DATAONLINE;
    call_timeout=time(0)+120;
    break;
  case HF_DATAONLINE:
    // Mode is online.  Do nothing but indicate that the modem is
    // ready to process packets
    if (time(0)>call_timeout) {
      // Nothing for too long, so hang up
      sleep(2);
      write_all(serialfd,"+++",3);
      sleep(2);
      write_all(serialfd,"ath0\r\n",5);
      hf_state=HF_DISCONNECTED;
    }
    break;
  case HF_DISCONNECTING:
    break;
  default:
    break;
  }
 
In short, we make a call if we have been instructed to, and are not currently on a call.  Otherwise we stay on the call unless we get no data for too long, in which case we hang up, and repeat the process.

With the above and the necessary glue, I was able to get packets flowing between the two modems fairly easily.  But when I started trying to transfer bundles, I started seeing the protocol get confused with both sides trying to send each other the same bundle endlessly.  This should never happen, as when one end realises that the other has a given bundle, it should stop trying to send it, because it has pretty clear evidence that the other end has it already. 

I'd seen this problem before in the field with UHF radios in the Mesh Extenders, but it had stubbornly refused to be reproduced reliably in the lab.  So while it is on the one hand annoying, it's actually a great relief to have found the problem, and to be able to progressively work on solving it.

It turns out to be caused by a number of things:

First, the TreeSync algorithm seems to not realise when the other end receives a bundle, even though it should.  That module is quite fancy and efficient at doing its job, and I don't understand it well enough to figure out what subtle problem it is suffering from.  The mechanism by which this causes the lockup took a little while to realise, but it made a lot of sense once I realised it:  Once both sides have as their highest priority bundle for the other a bundle that the other side actually has, and that trips this TreeSync problem, then they will continue to try to send it forever.

It self-corrects eventually, but only briefly, because when it produces a new generation ID, the tree is re-built, which allows it to send the next few bundles before locking again.  To work around this, I now maintain a list of bundles that we have seen the other end try to send.  If they try to send a bundle, then they must by definition have it.  Therefore we should never try to send that bundle to them again.  This works to stop the TreeSync algorithm from re-scheduling the same bundle over and over.
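
The work-around amounts to something like the following minimal sketch (the structure and names are illustrative, not the actual LBARD data structures): any bundle we ever see a peer transmitting gets recorded against that peer, and the transmit scheduler checks that list before queueing anything to them.

// Sketch of the "never offer back what we have seen them sending" rule.
// Names and sizes are illustrative only.
#include <string.h>

#define MAX_SEEN_BUNDLES 64
#define BIDLEN 32                 /* bundle ID length in bytes */

struct peer_seen_list {
  unsigned char bid[MAX_SEEN_BUNDLES][BIDLEN];
  int count;
};

// Call whenever we see this peer transmit a piece of a bundle:
// they must already hold the bundle, so remember that fact.
void note_peer_has_bundle(struct peer_seen_list *p, const unsigned char *bid)
{
  for (int i = 0; i < p->count; i++)
    if (!memcmp(p->bid[i], bid, BIDLEN)) return;   // already recorded
  if (p->count < MAX_SEEN_BUNDLES)
    memcpy(p->bid[p->count++], bid, BIDLEN);
}

// Consulted by the TX scheduler before queueing a bundle to this peer.
int peer_already_has_bundle(const struct peer_seen_list *p,
                            const unsigned char *bid)
{
  for (int i = 0; i < p->count; i++)
    if (!memcmp(p->bid[i], bid, BIDLEN)) return 1;
  return 0;
}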

That fixed the protocol lock-up.  Then I was seeing that the scheduling of which data blocks to send was being completely random.  Now, it is supposed to be rather random, so that if multiple senders are available for a given bundle, a single receiver can receive pieces from all of them, and thus receive the bundle as a whole sooner. 

The problem was that the system for keeping track of which blocks we think the recipient might already have was not being activated, because it was getting confused about which bundle we thought we were sending them, or should be sending them.  As a result very small bundles consisting of only a few blocks would send in reasonable time, as the probability of all their blocks being sent would approach 1 very quickly. But larger bundles consisting of more than a dozen or so blocks would be extremely slow to send, because it was basically pot luck waiting for the last few holes to get filled in.

This required a bit of fiddling about in a few different places to pick up all the corner cases of when we should reset it, and making sure that we clear the bitmap of sent blocks and other minor details. I also fixed the situation where we see another party sending a bundle to someone, and we haven't yet worked out what to send to them.  In that case, we now join in sending the same bundle, so that the receiver can hopefully receive the whole bundle that bit quicker. 

These multi-way transfer optimisations are not really applicable to the HF radios, as the modems are designed for point-to-point, but it made sense to implement them, anyway. Also, down the track, it would be possible for us to look at ways to make a radio listen to multiple HF radio channels in parallel, and receive all the content being delivered on them.  This promiscuous mode of listening could greatly increase the effective capacity of the system, and basically is just a sensible use of the true broadcast nature of radio.

Back to the bitmap sending: even after I got it working, it was still not particularly efficient, as even when there were relatively few pieces left to send, it would still choose them at random, potentially resending the same block many times until the next bitmap update came through.  As the round-trip latency is typically 10 - 20 seconds by the time all the modem buffering is done, this can add significant delays through redundant sending. Also, it can get stuck sending manifest pieces for a long time, which again can be quite wasteful, as the manifest is typically less than 300 bytes, and will keep getting resent until the bitmap update comes through indicating that the manifest has been fully received. 

What would be better, would be to keep track of the set of pieces that are outstanding at any point in time, and just randomise their order, and then send them in that randomised order, and then start over.  That way we will avoid redundant sending.  The order can be randomised afresh each time we get a bitmap update.
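
A minimal sketch of that idea follows, assuming we can extract the list of outstanding block numbers from the latest bitmap update (the bitmap convention and all the names here are illustrative, not LBARD's actual ones):

// Sketch: send each outstanding block exactly once per round, in a random
// order (Fisher-Yates shuffle), re-randomising whenever a bitmap update
// arrives.
#include <stdlib.h>

#define MAX_BLOCKS 1024

static int tx_order[MAX_BLOCKS];   // randomised list of outstanding blocks
static int tx_count = 0;           // number of valid entries
static int tx_cursor = 0;          // next entry to send

// Rebuild the send order from the peer's latest bitmap: every block the
// bitmap marks as still missing goes into the list, the list is shuffled,
// and we start again from the beginning.
void rebuild_tx_order(const unsigned char *bitmap, int total_blocks)
{
  tx_count = 0;
  for (int b = 0; b < total_blocks && tx_count < MAX_BLOCKS; b++)
    if (!(bitmap[b / 8] & (1 << (b % 8))))   // bit clear = peer lacks block b
      tx_order[tx_count++] = b;
  for (int i = tx_count - 1; i > 0; i--) {   // Fisher-Yates shuffle
    int j = rand() % (i + 1);
    int t = tx_order[i]; tx_order[i] = tx_order[j]; tx_order[j] = t;
  }
  tx_cursor = 0;
}

// Returns the next block to send, cycling back to the start of the shuffled
// list once every outstanding block has been sent once.
int next_block_to_send(void)
{
  if (tx_count == 0) return -1;    // nothing outstanding
  int block = tx_order[tx_cursor++];
  if (tx_cursor >= tx_count) tx_cursor = 0;
  return block;
}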

First implementation of this seems to work fairly well: It sends the blocks in random order. It's still not as fast as I would like, averaging only about 20 bytes per second in each direction.  This will be partly because I am sending 64 byte blocks at random, but the packets can hold 128 or sometimes even 192 bytes. But if there are no contiguous outstanding blocks, then it will end up being very inefficient.  I'll likely fix that by allowing a message type that can contain more than one piece of the same bundle.  Actually, I figured out an easier quick fix: Consider all odd-numbered blocks to have been sent once already, so that we try to send all the even-numbered ones at least once.


The bigger problem, though, is that the indication of when the other end has received a bundle is not working properly, and so we are back to our protocol lock-up situation.

So looking at a particular run when this is occurring, we have LBARD2 and LBARD8 talking to each other.

They begin by exchanging two small bundles with each other, before moving on to transfer bundles of several KB. 

LBARD2 receives bundle 4B47* (4870 bytes long) from LBARD8. LBARD8 receives C884* (4850 bytes long) from LBARD2. 

So both now have C884*.  The problem is that both think that the other doesn't have it. As C884* is smaller than 4B47*, it has a higher TX priority, and thus both sides keep trying to send it to the other.

LBARD8 received C884* at 11:42:38.   By 11:42:47, LBARD8 has received confirmation that the bundle was inserted into the Rhizome database, and thus knows definitively that it has the bundle.

Even though it knows that it has this bundle, it continues to accept receiving the bundle afresh from LBARD2.

Part of the problem is that LBARD8 doesn't immediately note that LBARD2 has the bundle.  It should do this, whenever it sees LBARD2 sending a piece of the bundle. 

In fact, there is no indication whatever that LBARD8 is receiving the pieces of the bundle after it has received it the first time.  I'll need to re-run with more debugging enabled.  Perhaps part of LBARD is noticing that it already has the bundle, thus is ignoring the messages, but without scheduling the indication to the other end that it already has the bundle.  But even if that is not the case, the receiver shouldn't be willing to again start reception of the same bundle, after it knows that it has it itself. 

Each run has randomised bundle IDs, which is a bit annoying.  So I have to trawl through the logs again to see what is going on.

So this time, LBARD2 receives bundle D386* from LBARD8 at 12:18:26.

Interestingly, the "recent senders" information is correctly noting that LBARD8 has sent a piece of this bundle.  So I can make use of that to update the other structures, if necessary.

At the point where LBARD2 receives D386*, it is sending bundle 3A7C* to LBARD8.
Ah, I have just spotted that at the point where we realise that the other party has the bundle, instead of dequeuing the bundle to them, we are actually queueing it to them.  That should be fixed. This is also the point at which we should mark that we know that that peer has the bundle already.  So we will try that in the next run.  Simultaneously I have added some instrumentation for the "prioritise even over odd" optimisation, as it doesn't seem to be doing exactly what it should.

Fixed that, and some other related bugs. It now does a fairly reasonable job of scheduling, although there is still room for improvement. The big problem is that it is still sitting at around 20 bytes per second of effective throughput. 

Now, this isn't as bad as it sounds, as that includes the overhead of negotiating what needs to be sent, and packetising it all, so that if the modems do lose packets, we don't lose anything. It also reflects the reality that we are only sending one packet in each direction every 3 seconds or so.  With an effective MTU of 220 bytes, we only end up fitting one or two 64 byte blocks of data in each.  Thus our potential throughput is limited to about 64 or 128 bytes every 3 seconds = ~20 -- 40 bytes per second. 

So at the current state of affairs, we have something that is at least 10x faster than the old ALE messaging based transport -- and it is bidirectional throughput. We are seeing message delivery latencies for short messages of about 30 to 50 seconds.  This compares very favourably to the ~3 minutes that sending a bundle was taking with the ALE transport.  But it still isn't as good as we would like or, I think, we can achieve.  We know that we should have something like 125 bytes per second of bandwidth in each direction.  Even with the suboptimal scheduling of pieces to send, we should be able to achieve at least half of that figure.

The first port of call is to try to optimise the packet sending rate, as it seems that one packet of ~260 bytes in each direction every ~3 seconds or so is probably not quite filling the 2,400bps channel, even allowing for the change-over.  Well, that sounded like a great theory. Looking closer, the packet rate is closer to one every 2 seconds, not every 3 seconds.  2400bps / 2 = 1200bps = 150 bytes per second raw rate in each direction, before subtracting a bit for turn-around.  Thus sending a 255 byte packet every 2 seconds = ~128 bytes per second, which is pretty much spot on the channel capacity.

This means we will have to look at other options.  The most obvious is to send large slabs of bundles, without all the other protocol fluff.  This is only really possible in this context, because the modem link is point-to-point, so we don't have to worry about any trade-offs. In fact, the whole packet format and message types are really a bit of an over-kill for this configuration.  But short of re-writing the whole thing, we can instead get some substantial gain by just sending a slab of useful bundle data after every normal LBARD packet.  This will get us half of the channel use at close to 100% goodput, and thus bring the average up to >50%.   If we send large contiguous slabs, then that will also reduce the opportunity for the block scheduling to get too much wrong, so it may well be even more effective. 

The modem has no natural packet size limit, so we can even send more than 256 bytes at a time -- if we trust ourselves to the modem's error correction to get the data to the other end intact. If errors creep in, and we don't have a means of detecting them, it will be really bad. The current error correcting code works on blocks of 232 bytes. We don't want to send huge amounts, though, as we want to maintain the agility to respond to changing bundle mixes. 

The largest number of 64 byte blocks we can thus fit is three, i.e., 192 bytes.  256 would have been nicer, so that it would be an even number of blocks. That's a bit annoying, so I'll have to make a different type of packet that can carry up to 256 bytes of raw data. It will assume that the data is for the currently selected bundle, and thus only have a 4 byte offset and 2 byte length indicator, plus the escaping of the packet termination character. Thus it should have an average overhead of less than 10 bytes, and thus be better than 90% efficient.
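
On the wire, such a pure data message might look something like the sketch below. The framing and escape byte values are assumptions for illustration, not LBARD's actual framing, but it shows how small the per-packet overhead becomes: a 6 byte header plus whatever escaping happens to be needed.

// Sketch of a minimal "pure data" message: 4-byte offset + 2-byte length,
// then the payload bytes, with the packet-termination character escaped.
// The terminator (0x7e) and escape (0x7d) values are illustrative only.
#include <stdint.h>

#define FRAME_END  0x7e
#define FRAME_ESC  0x7d

// Append one byte to the output, escaping it if it collides with framing.
static int put_escaped(uint8_t *out, int o, uint8_t b)
{
  if (b == FRAME_END || b == FRAME_ESC) { out[o++] = FRAME_ESC; b ^= 0x20; }
  out[o++] = b;
  return o;
}

// Build a pure data message for the currently selected bundle.
// Returns the total encoded length, including the trailing terminator.
int encode_pure_data(uint8_t *out, uint32_t offset,
                     const uint8_t *data, uint16_t len)
{
  uint8_t hdr[6] = { offset & 0xff, (offset >> 8) & 0xff,
                     (offset >> 16) & 0xff, (offset >> 24) & 0xff,
                     len & 0xff, (len >> 8) & 0xff };
  int o = 0;
  for (int i = 0; i < 6; i++) o = put_escaped(out, o, hdr[i]);
  for (int i = 0; i < len; i++) o = put_escaped(out, o, data[i]);
  out[o++] = FRAME_END;            // terminate the packet
  return o;
}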

Well, the next challenge was that some bytes were getting lost, as though buffer overflow was occurring.  However, as I have hardware flow-control enabled, I'm not sure of the cause here.  My work around is to send only 192 bytes, after all, and to make sure I throttle back the packet sending rate,  to try to keep the data flowing at a reasonable rate, without overwhelming the buffers.  This is a bit worrying, as in real life it is quite likely that we will get data rates below 2,400bps on the HF links quite often. 

Reducing the data packet size to 192 bytes has seemed to fix the buffer overflow problem. Also, I am now seeing the larger bundles go through more quickly.  However, I am still seeing the problem where the same bundle is received more than once.  Presumably it starts being received a 2nd time before we get confirmation from the Rhizome data base that it has been inserted. 

The solution would be to filter the receive list whenever we notice a bundle has been added to the rhizome database.  That way such duplicate transmission of the same bundle should get suppressed. Examining the log file of when this happened, it turns out that the problem was actually that the bundle was corrupted during reception.  This is possible if the bundle being transferred changes part way through, as our pure data packets lack bundle identifying information to save space. However, it looks like we will need it after all, to avoid problems with inferring which bundle is being sent by the remote end.

There is also a pattern where some pieces of a bundle are failing to get sent in the data packets. I'm suspecting that this is due to escaping of characters causing the encoded size of the data packet to become too large.  So I might need to trim the packet size again.  (This is all part of why I don't like it when systems try to hide packetisation behind a pretend stream interface.  You end up having to put framing back in the stream, which adds overhead, and there is no way to make sure your packets actually end on the underlying packet boundaries which increases latency. But I understand why people implement them.)

So I guess I'll have to drop back to 128 bytes per data packet, as well as including the bundle information.

Ok, that has helped quite a lot.  It is now delivering the packets more reliably.  But the throughput is still pretty horrible, with good-put rates still somewhere around 25 bytes per second.  I suspect the modems are really not designed for this kind of two-way sustained communications, but rather are optimised for mostly uni-directional transfers. 

If I can get hardware flow-control working, I should be able to improve things by using larger packet sizes.  But I think its also about time I actually wrote a simple test programme that measures the actual bidirectional throughput that these modems can deliver.  That way I know whether it is my protocol side of things that is the problem, or whether I really am hitting the limit of what these modems can achieve. 

But that will all have to wait for the next blog post.

More work on LBARD over Codan HF Data Modem

In the last post, we got to the point where we had Rhizome bundles flowing over the modem, but not as fast as we would like.  Also, we were still seeing some problems where bundle sending would get stuck.  I've been away for a couple of weeks, but now am back at the bench with the HF radios, and can investigate further.

The first thing I realised that I should investigate, is what the actual throughput of the modems is, so that I can determine what fraction of the theoretical throughput we have available.

To do this, I instrumented LBARD to count the number of bytes, and divide that by the number of seconds since the first byte was received.  This yielded results of about 91 bytes per second in each direction.  Given the channel theoretical capacity of 2,400 bits / sec = 300 bytes / sec, this is noticeably lower than expected.  It means that we are only actually transmitting data (91+91)/300 = ~60% of the time. 
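
The instrumentation itself is trivial; something along these lines is all that is needed (illustrative names, not the actual LBARD symbols):

// Illustrative throughput counter: average bytes per second since the
// first byte was seen on the link.
#include <time.h>

static time_t first_rx_time = 0;
static long   rx_byte_count = 0;

void note_bytes_received(int n)
{
  if (!first_rx_time) first_rx_time = time(0);
  rx_byte_count += n;
}

double rx_bytes_per_second(void)
{
  time_t elapsed = time(0) - first_rx_time;
  return elapsed > 0 ? (double)rx_byte_count / elapsed : 0.0;
}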

Now, if we were just transferring in one direction, instead of switching all the time, we could surely improve that, but we know that there is less than a factor of 2 available through such optimisation.  So I'm not too worried about that right now.

What is more interesting is that we should be able to get near to 90 bytes per second, but are still languishing down around the 20 bytes per second mark due to protocol issues. That is, we have a factor of 4 that we can improve, without having to increase the channel capacity or change how we use the channel at all.

Having spoken with one of my students who is working on a debug framework for LBARD, it is sounding to me like the packet bitmap that is used for acknowledgement and scheduling of parts of a bundle that the receiver doesn't already have is not working properly.  If this is the case, it would be causing the sending side to send parts of the bundle that the receiver already has, effectively wasting bandwidth. 

Looking through logs of previous runs, I can see one problem already: We schedule sending a bundle at the end of the file, even if we have already sent to the end of the file.  That is, we send 0 bytes of data, positioned at the end of the bundle payload.  This is clearly silly, and should be fixed. What we need to watch out for, though, is that we use end-of-payload packets to implicitly indicate the payload length in LBARD transfers. 
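
The guard for this is simple enough; here is a sketch of the idea (the field names are illustrative, and the real scheduler state in LBARD is more involved):

// Sketch: only schedule a zero-byte "end of payload" piece once per bundle,
// since its only purpose is to implicitly signal the payload length.
struct bundle_tx_state {
  long payload_length;          // total payload length of the bundle
  long next_offset;             // offset the scheduler wants to send from
  int  end_of_payload_sent;     // have we already sent the terminal piece?
};

int should_send_piece(struct bundle_tx_state *b)
{
  if (b->next_offset >= b->payload_length) {
    // Zero bytes of data left at this offset: only worth sending once.
    if (b->end_of_payload_sent) return 0;
    b->end_of_payload_sent = 1;
  }
  return 1;
}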

We are also seeing the 4 data bytes being sent quite a number of times, before the missing parts of the manifest are sent through.  That resending needs to be fixed. But it also occurred to me that if we randomly select a point in the last 64 bytes of the manifest or payload, we will only send <64 bytes, when typically we have space in the packet for 128 or 192 bytes.  If we instead also include as many of the preceding 64-byte blocks as will fit, I suspect that has the potential to greatly improve latency for small bundles, by basically eliminating such very low efficiency packets.  Together with preventing the re-transmissions mentioned in the last paragraph, this should be able to get the latency of a sub-1KB bundle down to perhaps 15 seconds, which would be a great improvement.

I'll try these improvements, together with some tweaking of debug output, so that I can more easily follow when particular bundle pieces are sent, and what the list of candidate blocks at a given packet transmission are.

This should have been a simple re-run, but then I hit a segfault caused by something unrelated, and when running it in gdb all sorts of strange things were happening, causing one of the instances to not properly switch to the online data mode when the modems connect.  Well, I found what caused the whole crazy problem to start: When I copy-pasted the arguments into gdb, I left the "1" off the front of "127.0.0.1", thus causing long delays while waiting for "27.0.0.1" to resolve, which caused the protocol to time out in various exciting ways.  So let's try that all again...

Ok. Now it is running again, after fixing that.  I also found and fixed some other bugs.  It is now transferring small bundles in typically 20 - 50 seconds.  This could be quicker, but it is a start.  However, I am seeing instances where a bundle is received, but then fails to insert into the Rhizome database with an error 422. This means that either the manifest is too big, or that the payload doesn't match the hash.  The bundle is then received a second time, and gets inserted. This means that there is some problem with the integrity of the transfer.

My first suspicion was that the pure data packets I implemented recently might be the problem.  But in this case, no such packets had been sent at all for the bundle in question.  As the packets are in any case protected by Reed-Solomon error correction, this would seem to suggest that the contents of the packets were being incorrectly assembled.

Actually, it turns out that this happens fairly often.  This is good, in that it probably explains a lot of the poor transfer performance: Some bundles are having to be sent two or more times, before actually being accepted. So now to find out why.

It looks like decompression of the bundle manifests fails, resulting in the end of the manifest being messed up, and not passing its cryptographic self-consistency checks.  It is the end of the manifest that gets messed up, so I am guessing that somehow or other, either the length of the manifest, or the last partial 64-byte block of data for the manifest, doesn't get properly received.  It should hopefully not be too hard to find, as it is happening quite frequently, with probably 20% of bundle transfers being affected by it.

To track it down, I'll make both sender and recipient parties display the compressed manifest data, so that I can play spot-the-differences.

So looking at the compressed manifest on reception, it looks to be the right size, but has all zeros in the last data blocks. The problem was that when I made the optimisation to send not just the last fractional block, but as many 64-byte blocks that come before it that can fit, I hadn't adjusted the start point of the packet data being read, thus it was missing the extra bytes.  So time for another run, to see how it goes.

Right. So now we don't seem to have the problem with the code 422s, which is good.  However, we have the problem that sometimes we keep receiving a bundle repeatedly after receiving and registering it.   This seems to be because we don't get a confirmation message from servald that the bundle is now in our rhizome database.  As a result, we don't cull the duplicate reception of the bundle. It also means that we can't immediately start sending the bundle to other peers. 

This isn't such an issue for the HF modem driver, because we don't have other peers to talk to. However, for Mesh Extenders with the UHF packet radios, this is a big problem, as bundles won't get forwarded on to other peers until this occurs.  Thus we need to find and fix this problem.

LBARD communicates with servald via servald's HTTP RESTful interface.  In particular, it listens for new bundles via an asynchronous API, which means that we should get indication from servald essentially immediately on inserting the bundle.  So I presume something is fishy in that.  To investigate this, I have instrumented the code that communicates with servald, to find out where the problem is.

While looking at the logs for these delayed bundle notifications, I am also seeing that some bundle transfers are slow, as one end is not sending any data packets at all to the other end for 10 -- 20 seconds at a time. This should never happen, especially once the initial synchronisation has occurred, that lets each end know what the other needs.  In the current case, I am seeing a delay of 24 seconds between when successive pieces of a bundle are sent.  So we know that the remote end thinks it has a bundle that is worth sending.

I'll start by investigating one of these intervals, from 17:51:02 to 17:51:26.
During that time, we do receive several packets from the sender at 17:51:10, 17:51:12, 17:51:15, and 17:51:21. That is, we have received four consecutive packets that seem to contain no pieces of bundles.  Okay, so it seems that the synchronisation process is continuing to run, and is sending packets of ~200 bytes of synchronisation information.

Thus these delays are a false alarm, caused by the natural synchronisation process.  What is a bit worrying though, is that the synchronisation procedure is taking a long time. With 100 unique bundles on each side, synchronisation is taking at least tens of minutes.  In this example, I am seeing of the order of 300 sync messages of close to 200 bytes each -- and it shows no sign of settling down.
This amounts to somewhere around 68 KB of data out of a total of 84KB sent.

Well, that's very interesting: It explains why our goodput is so low, and generally, why transfers are going so slowly.

Thus together with the Rhizome bundle notification delay, we have two bugs that are able to greatly slow things down.  I now have something quite definite to concentrate on.

I'll still work on the Rhizome problem first, because I am much more familiar with that code. 

The synctree code is going to be trickier to fix, because I didn't write that. What I suspect might be the issue with it, though, is that the sync tree code doesn't expect to be used in an environment where multiple packets get queued up.  This means that when it sees a response to an old packet of its own, it might be getting confused, and starting the tree traversal from higher up the tree again.

Thus the potential fix for this is to find a way to make sure we have only one packet in flight in each direction at a time. This would be a good idea in any case, because it would help to minimise the latency when a bundle is received, so that the sending party doesn't keep needlessly sending parts of that bundle after it has been received.
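
A sketch of one way to enforce that follows (the sequence-number fields here are illustrative; LBARD's actual packet format tracks things a little differently):

// Sketch: only build and send a new sync packet once the peer has echoed
// the sequence number of the previous one, so that at most one packet is
// ever in flight in each direction.
#include <stdint.h>

struct link_state {
  uint8_t last_seq_sent;        // sequence number of our last packet
  uint8_t last_seq_echoed;      // highest of our sequence numbers that the
                                // peer has echoed back to us
};

int may_send_next_packet(const struct link_state *l)
{
  // A packet remains outstanding until its sequence number comes back.
  return l->last_seq_sent == l->last_seq_echoed;
}

void on_packet_sent(struct link_state *l)
{
  l->last_seq_sent++;           // wraps naturally at 8 bits
}

void on_peer_echoed_seq(struct link_state *l, uint8_t seq)
{
  l->last_seq_echoed = seq;
}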

Well, actually I am going to fix this one-packet-in-flight issue first.  But first, a brief explanation as to why we do this syncing this way in the first instance, rather than, say, just exchanging lists of bundles, and then sending the bundles that the other side needs to get -- because that would surely be more efficient.

The reason is that a simple "exchange lists" and then bundles method would be fine if no new bundles ever arrive during a communications session.  But we want to be able to deliver messages with reasonable latency, even if they arrive while a connection is active.  That is, we would like to have live HF links between Serval Mesh islands, and have those act as near-real-time data conduits.

Now, back to the need for the sync algorithm to not be several packets behind, I have implemented this, and it seems to be working somewhat: The sync traffic is flowing, and is no longer filling the packets.  However, the rate of packet exchange is VERY slow.  The average raw data rate in each direction is only about 8 bytes per second. It looks like each end thinks that it has 2 unacked packets nearly the whole time, which is causing it to throttle back the packet rate.  I am suspecting that the low data rate is interacting in a bad way with the modems trying to be efficient, and waiting until there is enough data to be bothered sending.

But back to thinking about having only one packet in flight in each direction at a time, this means we will always end up with some wasted time, because one end or the other will almost certainly be waiting most of the time.  Investigating further, the round trip time for a packet turns out to be around 15 seconds (!!).
Even allowing for two maximum size packets of 255 bytes, the propagation time would still only be (255x2 bytes)/(2400 bits per second) = 1.7 seconds.  This makes me all the more suspicious about the modems trying to batch data together for efficiency, which in other cases would work.

So we need a solution to this problem.  The previous work I did on inserting the occasional pure data packet might actually be a good solution here: Basically it means that we have a background level of sync exchange through the existing packets, and then we greatly accelerate the exchange of bundles via the pure data packets, with the added benefit that they should help to flush the modem buffers.
                    
I'm now reading through the manual for the modems, with a view to disabling compression, in case the compression is causing increased latency. In the process, on page 6-2 of the "HF Data Modem 3012 Reference Manual", I have found some helpful information: While the data channel can in theory carry up to 2,400 bits per second, the maximum effective throughput is only 1,475 bits per second, i.e., 184.375 bytes per second.  This explains why I was seeing data rates of about 92 bytes per second in each direction.  While the extra bandwidth would have been nice, it does at least give me a clearer idea of what the maximum realistic capacity is.

On page A-3, I also found *T, which allows setting the timeout before sending data packets.  This is important, because any non-zero value will increase the latency of packet transmission. This setting can be set from 0ms to 120000ms.  The default is apparently 0ms. I'll set it explicitly in any case, just to make sure... Except the modem complains about the command.  It is oddly listed as only *T, not AT*T, but *T also does not work. I tried *0, in case the T is supposed to be the value itself.

Also AT&M=5 is supposed to select the highest data rate and optimise for interactivity, i.e., reduced latency. That was already set.

Doing some more direct tests with the modems, I timed how long it takes after I paste a 62 byte string into one modem, before it appears at the other end.  This reliably takes 7 to 8 seconds, thus confirming this latency problem.

The issue we have, is that our synchronisation protocol currently does not support having multiple packets in flight, as it was designed with radios in mind that can reliably indicate when they are ready to send a packet, and then send the packet immediately, thus allowing each packet to be build just-in-time, to ensure it has the most appropriate content in it.

In fact, just pressing a letter and waiting for it to appear at the end has the same latency of 7 - 10 seconds, so it isn't a bandwidth thing.  Rather, it seems that the modems make at least one full turn-around each, before the data gets sent.  Is this normal?

I did also find the command to disable data compression, AT%C0, and I am now testing that to see if it makes any noticeable difference.
(Another useful command I have found is AT&C=n,n,n,n,n, which allows providing the modem with a list of channels on which to listen for incoming calls, which I'll use later, and record here so that I don't forget about it ;).

But back to AT%C: Disabling data compression DOES make a difference: It drops minimum latency from around 7 - 10 seconds, to only about 4 seconds.  That's a big help, and should greatly improve our throughput.  With 4 seconds in each direction, and sending packets of up to 232 bytes, this should get us a data rate of up to 58 bytes per second of the theoretical 92 bytes per second available.

Thus, while it is a good step forward, it isn't enough to completely solve the problem.  We still will need to send some pure data packets to make full use of the bandwidth.  Part of me also wonders if disabling data compression might help with the buffer overrun problem we were having previously, that was limiting the size of the data packets we could send between normal LBARD packets. 

But before we make further changes, we should time how long it is taking to send small bundles (without payloads in this case), so that we have a base line to compare against.  Having ~4 second latency in each direction instead of ~8 seconds, means that we will detect the end of bundle transmission on average 8 seconds sooner, which should provide a noticeable improvement.

Sending only the normal LBARD packets, without any extra data packets, and it fails, because of the previously reported problem with payload-less bundles.  I've now put a fix in for that, and that helps.

I'm also seeing another problem: when the receiver tries to send a Bundle Announce Response (BAR) to indicate to the sender that it has received the bundle, we hit a problem if the packet contains another piece of the same bundle.  That 2nd piece causes the receiver to emit an ack for that piece, which replaces the BAR in the transmission queue. As a result the sender never hears that the receiver has the bundle, and keeps sending pieces of it.  This is likely a key contributor to the highly variable receive times we have been seeing for bundles.

There are two fixes for this that should both be done (a sketch of the first is given below the list):
1. ACKs should never replace BARs in the progress report TX queue.
2. We should simply ignore pieces of bundles we have marked as recently received, and never generate the ACKs for them, but instead produce BARs for them, to indicate that we have already received that bundle.
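
The first fix boils down to a rule like the following when deciding whether a new report may overwrite a queued one (types and names are illustrative; the real progress-report TX queue in LBARD is structured differently):

// Sketch of fix 1: an ACK for a piece of a bundle must never displace a
// pending BAR for that bundle in the progress-report TX queue.
enum report_type { REPORT_NONE, REPORT_ACK, REPORT_BAR };

struct progress_slot {
  enum report_type type;
  unsigned char bid_prefix[8];  // which bundle this report refers to
};

// Returns 1 if a new report of the given type may overwrite this slot.
int may_replace_report(const struct progress_slot *slot,
                       enum report_type new_type)
{
  if (slot->type == REPORT_NONE) return 1;   // empty slot: always fine
  if (slot->type == REPORT_BAR && new_type == REPORT_ACK)
    return 0;   // never let an ACK push out a BAR; the sender needs the BAR
                // to learn that we already hold the whole bundle
  return 1;
}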

With that fixed, along with some other tidy-ups, we are now ready to do the comparison again.  We will count the number of zero-length payload bundles (which thus involve sending about 200 -- 220 bytes of compressed bundle manifest, and all the house-keeping) that we receive in 1,000 seconds, and the total bytes transferred in that time, so that we can calculate the average bundle transfer time, and the goodput. 

Again, we will do this with no extra data packets, then with a 128 byte data packet after each normal LBARD packet, and then again with sending a data packet every 2 seconds, independently from the LBARD packets. We will average for both directions over our test between the two radios.

With 0 byte payloads, i.e., just manifests:

No data packets:
A -> B : 16.66 bytes per second, 13 bundles transferred
B -> A : 16.93 bytes per second, 13 bundles transferred

Sending 1 x 128 byte data packet after each normal LBARD packet:
A -> B : 25.74 bytes per second, 19 bundles transferred
B -> A : 25.26 bytes per second, 21 bundles transferred

Sending 1 x 256 byte data packet after each normal LBARD packet:
A -> B : 33.22 bytes per second, 26 bundles transferred
B -> A : 33.06 bytes per second, 33 bundles transferred

Sending 1 x 256 byte (including headers) data packet every 2 seconds, as well as normal LBARD packets:
A -> B : Modems hang up after a while, possibly due to data overrun, as bytes per second climbs to ~90 bytes per second.
B -> A : Modems hang up after a while, possibly due to data overrun, as bytes per second climbs to ~90 bytes per second.
(This makes sense: 256 bytes every 2 seconds is already ~128 bytes per second on its own, which exceeds the ~92 bytes per second the channel can sustain.)

Sending 1 x 200 byte (including headers) data packet every 3 seconds, as well as normal LBARD packets:
A -> B : 48.13 bytes per second, 29 bundles transferred
B -> A : 48.18 bytes per second, 26 bundles transferred

Sending 1 x 256 byte (including headers) data packet every 3 seconds, as well as normal LBARD packets:
A -> B : 78.37 bytes per second, 23 bundles transferred
B -> A : 80.62 bytes per second, 21 bundles transferred

Sending 1 x 200 byte (including headers) data packet every 2 seconds, as well as normal LBARD packets:
A -> B : 68.59 bytes per second, 23 bundles transferred
B -> A : 68.86 bytes per second, 23 bundles transferred

Note that just because we allow a 256 byte data packet doesn't mean that one will get sent: the actual packet sent may be quite a bit smaller, if there isn't enough remaining data from the bundle currently being sent to fill it.

We also see that in some cases the bytes per second is higher, while the number of bundles transferred does not increase, or in fact goes down.  This is most likely due to re-transmission of the same data blocks over and over while waiting for acknowledgement of reception of the whole bundle.  Indeed, switching from 256 byte to 200 byte data packets with the same 3 second interval, we see no decrease in bundles transferred (if anything a small increase), even though the number of bytes sent per second drops.

Once we move to bundles with non-zero-length payloads, this effect should reverse. Another effect that is likely at play is that if we send lots of data blocks between the LBARD packets, then we will increase the round-trip time of the LBARD packets, which will slow acknowledgement of bundles, and thus reduce the number of bundles that can be received per unit time.

Another factor that must be taken into account here, is that the first bundle transfer does not typically take place until around 120 -- 180 seconds into the 1,000 second test. Thus once actively communicating, the actual time per bundle will be lower than suggested here.

So now to repeat those tests with non-zero payloads for the bundles... This showed up a bug in payload handling, which was quick to fix. More critically, it showed up an issue where receivers sometimes don't recognise packets as being addressed to them.  This means that when the other party sends them pieces of a bundle that they have already received, they don't tell the sender that they already have it.

The mechanism that is playing up stuffs each message containing a piece of a bundle with the first two bytes of the receiver's SID.  This works fine for some time, but after a while the receiver finds garbage in this field.  What is not clear is whether it is a data handling bug that corrupts it on the receive side or during assembly on the transmit side, or whether the field that contains a peer's SID on the sender side is getting corrupted.  This takes on the order of 15 minutes to occur.

LBARD maintains both ASCII-hex and binary representations of each peer's SID.  If these end up with different values, then we can assume that memory corruption has occurred. I'll add instrumentation that will help me to see if that is what is going on (a sketch of such a check is below).
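A minimal sketch of that consistency check follows, assuming a hypothetical peer structure that holds both forms; the field names are not the real LBARD ones.

/* Sketch: compare the ASCII-hex and binary forms of a peer's SID, and
   log if they disagree (i.e. something has scribbled over one of them). */
#include <stdio.h>

#define SID_BYTES 32

struct peer {
  char sid_hex[SID_BYTES * 2 + 1];   /* ASCII-hex form */
  unsigned char sid_bin[SID_BYTES];  /* binary form */
};

static int hex_nibble(char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

/* Returns 0 if the two representations agree, -1 if they have diverged. */
int peer_sid_consistent(const struct peer *p)
{
  for (int i = 0; i < SID_BYTES; i++) {
    int hi = hex_nibble(p->sid_hex[2 * i]);
    int lo = hex_nibble(p->sid_hex[2 * i + 1]);
    if (hi < 0 || lo < 0 || ((hi << 4) | lo) != p->sid_bin[i]) {
      fprintf(stderr,
              "SID corruption at byte %d: hex says %c%c, binary says %02x\n",
              i, p->sid_hex[2 * i], p->sid_hex[2 * i + 1], p->sid_bin[i]);
      return -1;
    }
  }
  return 0;
}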

Interestingly, the data that overwrites the binary representation seems to often begin with $47, suggesting that some piece of code might be putting stuff there by mistake.  The "Generation ID" message type does put a $47 followed by 4 random bytes.   This message type also contains no bounds checking -- it just assumes it has been given a buffer with enough space in it.  However, it is only called at the start of the packet assembly process, when it can be safely assumed that there is enough space in the buffer to fit the message.

A few more runs confirm that it is always $47 in the first byte, so almost certainly something is screwing up with the Generation ID messages.  I'll add some more instrumentation to see what the new generation ID is each time one is received; if it matches what I am seeing in the packet, then I will know that it's to blame.

Ok, while waiting for the run to hit the problem again, I've taken a closer look at the code that parses the generation ID messages, and discovered that it does indeed put the generation ID into the SID prefix field. I.e., a straightforward bug.
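To illustrate the shape of the bug (not the actual LBARD code; the structure and field names here are hypothetical), the parser was writing the received generation ID into the wrong field of the peer record:

/* Illustration only: generation ID parser writing into the SID-prefix
   field instead of the peer's generation ID field. */
#include <string.h>

struct peer_state {
  unsigned char sid_prefix[2];      /* first two bytes of peer's SID */
  unsigned char generation_id[4];   /* random ID chosen at peer start-up */
};

void parse_generation_id(struct peer_state *p, const unsigned char *msg)
{
  /* msg[0] is the $47 message type byte; msg[1..4] is the new ID. */

  /* Buggy version (the kind of thing the code was doing):
     memcpy(p->sid_prefix, &msg[1], 4);   // overruns and clobbers the prefix */

  /* Fixed version: store the ID where it belongs. */
  memcpy(p->generation_id, &msg[1], 4);
}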

This fixed the problem, and got everything going, but it still seems to be rather slower than it should be sometimes.  But investigating the remaining problems can go into the next blog post.

Thinking about different ways to do the tree-sync for LBARD


In the last few blog posts I've been fighting my way through fixing bugs with LBARD's synchronisation algorithms.  While both the tree-sync and the actual transfer of bundles have been problematic, the tree-sync side of things has been niggling at me for HF use, in particular. One of the reasons is that tree-sync has to start its synchronisation again from scratch every time a radio connection is made. This already takes several minutes, with only relatively small numbers of bundles.  Even if neither end has new bundles, the entire synchronisation process has to occur again.  Also the very high latency of the HF radios doesn't really agree with tree-sync, and can result in considerable delays in synchronisation.

Way back around 2014 we had started thinking about reworking the way that Rhizome stores data, by having it use a block-store to represent the bundles. That is, instead of keeping a manifest and payload, we would have a kind of virtual file-system in which the bundles would be represented.  Data blocks would be identified by the hash of their contents, thus providing automatic de-duplication, and making it much easier to cache data.

Now, reworking all of Rhizome to do that would require significant effort.  However, what *might* be doable in reasonable time, is to make the list of bundles an instance holds be represented in such a way.  This would allow each end of an HF link to cache this virtual file-system representation, and thus re-synchronise very rapidly if none or few new bundles have appeared at each end.

My current thinking is along the following lines:

1. Blocks should be small enough to fit in a single LBARD packet. This realistically means a limit of about 200 bytes.

2. We want to minimise the number of nodes in the tree, as each additional node requires an extra round-trip.

3. The small block size means that we need to use relatively small hashes for each block.  Even if they are too short for strong cryptographic security, they should still be strong enough to keep the chance of accidental collisions practically zero.

4. Each block can consist simply of leaf nodes or pointers to child nodes, or a mix of the two. The position of the node in the tree is implicit.

5. It would be nice to have one hex digit of identifier covered per block. This means 16 pointers to sub-blocks have to be able to fit in a block.  200 bytes / 16 = 12.5 bytes per entry.  This probably means that 12 byte hashes = 96 bit hashes will be the longest possible.  That's ample to keep us safe from Birthday Paradox accidental node collisions (should be safe for up to 2^48 blocks).  A sketch of this node layout follows the list.

6. If all the child nodes for a block can fit in a block, then that is where they are placed.  This avoids having unnecessary intermediate nodes in the tree that would otherwise have to be synchronised.

7. If all the child nodes can't fit in a single block, then we need to have a deterministic way to represent this, so that we don't end up with a proliferation of blocks as we fill nodes.

8. The simplest approach to (7) is to require that an overfull block point to child blocks for each of the 16 entries it can hold. This may mean that some child blocks point to the empty node (which, because of the hashing, will always be the same).

9. The above formulation means that each bundle (or more correctly, version of bundle), will have a deterministic node, which each side can enter into their block cache, thus avoiding the need to ever transfer a block that corresponds to a bundle that the other party already has.

10. Each side can request blocks from the other, to begin exploring the tree structure. Any common sub-trees between the two sides will be implicitly available.

11. The block cache can persist between executions of LBARD, thus allowing the complete state of the other party to be remembered between radio sessions, thus greatly speeding up the initial synchronisation when a radio session commences.

12. Multiple blocks can be requested at a time, thus helping to mask high latency.  In fact, simple optimisations like "please send me block X, plus any children (and optionally further descendents) of it that you don't think that I have" can be used to save further round-trips.

13. When a block is received, its position in the tree is known. This means that in some cases, it will be possible for the recipient to work out the content of the child nodes, if the sender holds a strict sub-set of the recipient's bundles in that part of the tree.  At its simplest, a leaf node can be constructed for every combination of the bundles that it contains.  If any of those match the block ID of the child node indicated in the received block, then there is no need to receive those child-blocks.  Instead, not only can we avoid the need to send those blocks, we can actually immediately infer that the sender is missing those bundles not in the combination that resulted in the matching child node.

14. The approach from (13) can be extended further up the tree if desired, to help save bandwidth, at the expense of storage space and computation.  At some point, however, the number of combinations that need to be tested becomes excessive, as each additional bundle in the portion of the tree being considered potentially doubles the number of combinations to be computed.  However, because each leaf node is restricted in its potential content, the growth in complexity is actually rather different (leaf nodes are only formed when the collective leaves don't fit in a single block).  The exact complexity of this will need to be modelled.

15. In theory, if the complexity of the above can be handled, then sending just the top of the tree will allow the recipient to infer the contents of any node that covers only a subset of the bundles the recipient possesses, without any further data exchange.  This is more of theoretical interest than practical effect, except for the leaf-node optimisation described in (13).

16. The bundles themselves can also be represented as tree structures, so that similar techniques can be used.  This also allows for implicit transfer of common data sections, including in journal bundles that have invariant starts.  Of course, this ability to transfer common data sections will only work for those that have such commonalities that are aligned with the block size, or better, aligned with a leaf-node of blocks.

17. This approach allows for implicit caching of all transfers, making resumption of communications with other nodes and of bundles much more efficient.

18. This approach also solves many of the authenticity issues with transfers in LBARD, where the transfer of a large bundle could be poisoned with a fake block, or even just by a bit error that escaped the error correcting code. This is a significant advantage of this approach.

19. This approach effectively trades off storage space (which is cheap and plentiful) against algorithm complexity and RAM requirements (which are in limited supply on the kinds of small embedded devices we want to target, such as low-power meshing phones).  This is thus a very nice side-effect of this approach.
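As promised in point (5), here is a sketch of the node layout implied by points (1), (5) and (8).  All names and sizes are illustrative only; this is a thought experiment, not an implementation.

/* Sketch of the proposed block layout: an interior node is 16 child
   hashes (one per hex digit of the identifier space it covers), and a
   leaf holds bundle entries directly. */
#include <stdint.h>

#define BLOCK_HASH_BYTES 12            /* 96-bit hash, per point (5) */
#define NODE_FANOUT      16            /* one child per hex digit    */

struct block_hash {
  uint8_t h[BLOCK_HASH_BYTES];
};

/* An interior node: 16 x 12 = 192 bytes of child hashes, which fits
   within the ~200 byte single-packet limit from point (1).  A child
   that covers no bundles points at the well-known empty-node hash. */
struct interior_node {
  struct block_hash child[NODE_FANOUT];
};

/* A leaf entry: enough of the bundle ID plus its version to identify
   the bundle version deterministically, per point (9). */
struct leaf_entry {
  uint8_t  bundle_id_prefix[BLOCK_HASH_BYTES];
  uint64_t version;
};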

Anyway, I'll have a think about whether I implement this now or not, depending on how feasible it looks to get acceptable performance with the existing algorithm.

Continuing working on the LBARD HF protocol efficiency problems


Ok, back from the last post where I was thinking about some longer-term future approaches, to trying to get the current code working as well as it can.

The problem I am seeing at the moment, is that the same blocks are being sent repeatedly, while other blocks are not being sent at all.  

One problem was that pieces of the manifest were being marked as sent when they hadn't been.  This was due to an out-by-one error in the comparison of the extent of a block being sent, compared with the actual data. The effect was that when a block was sent, both it and the following block were marked as having been sent.

This seems to be a common factor in many of the slow transfers.  It has only started showing up now, because we were previously testing with payload-less bundles, and so the manifest would keep being resent, which would then have a higher chance of re-sending the missing part through the normal cyclic re-transmission.

So let's set a test run going, and see if it solves the problem.

Well, it's helped, but close examination of the log files reveals that there are still pieces being counted against the wrong blocks, due to a duplicate code-path.  I've now de-duplicated that.  I also noticed that sometimes manifest or body pieces would be counted against the other as having been sent, which of course is also causing problems.  So I have fixed those as well.

On the plus side, it's pretty clear how these sorts of bugs can be causing these problems: if we are marking things that have not been sent as having been sent, then we are not going to send those any time soon.  Instead, we will keep sending the same pieces over and over, because we think we haven't sent them.

I'm also seeing the situation where the sending of a bundle can get reset part way through, which then makes it forget the transfer bitmap information for that bundle, resulting in it re-sending pieces it has already sent.  This might be because of an old feature that causes the tree-sync process to restart periodically, to avoid lock-ups in the protocol, which in turn can change the selection of the bundle currently being sent.  However, the root causes of those lock-ups have all been fixed now, so that can probably be removed, if it is still there.  I'll investigate that while the new run with the previous bug fixes in place runs.

Yes, it now looks like the resetting of sending bundles is the most significant remaining problem. As mentioned, this causes the knowledge of the state of a transfer to be lost.  There are two approaches to solving this: 

1. Don't switch which bundle we are transferring. 

2. Remember the state of other bundles, so that we can continue from where we left off, if we do switch which bundle we are transferring.

Option 1 sounds great in theory, but the synchronisation process may well interrupt the transfer of a lower priority bundle with a higher priority one, for example.  So this is probably not possible to achieve in practice. Although I will instrument the changing of the current bundle being sent, to find out why it is changing.

Thus it probably makes sense to implement something to remember the progress of a bundle when switching.  This can be kept in a nice little cache somewhere.  Of course, this is one of several complications that would go away if/when I get to implementing the new approach I wrote about in the previous blog post.  But for now, the cache will have to do.
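A minimal sketch of such a progress cache follows.  The slot count, bitmap granularity and function names are all assumptions for illustration, not the actual LBARD code.

/* Sketch: when we switch away from a bundle, remember which pieces we
   had already sent, so we can resume rather than restart. */
#include <string.h>

#define PROGRESS_CACHE_SLOTS 16
#define MAX_PIECES 1024

struct tx_progress {
  char bundle_id[65];                         /* hex bundle ID */
  unsigned char sent_bitmap[MAX_PIECES / 8];  /* 1 bit per piece already sent */
  int in_use;
};

static struct tx_progress progress_cache[PROGRESS_CACHE_SLOTS];
static int next_victim = 0;

void progress_save(const char *bundle_id, const unsigned char *bitmap)
{
  struct tx_progress *slot = NULL;
  for (int i = 0; i < PROGRESS_CACHE_SLOTS; i++)
    if (progress_cache[i].in_use &&
        strcmp(progress_cache[i].bundle_id, bundle_id) == 0)
      slot = &progress_cache[i];
  if (!slot) {                                /* simple round-robin eviction */
    slot = &progress_cache[next_victim];
    next_victim = (next_victim + 1) % PROGRESS_CACHE_SLOTS;
  }
  strncpy(slot->bundle_id, bundle_id, 64);
  slot->bundle_id[64] = 0;
  memcpy(slot->sent_bitmap, bitmap, sizeof(slot->sent_bitmap));
  slot->in_use = 1;
}

/* Returns 1 and fills bitmap if we have prior progress for this bundle. */
int progress_restore(const char *bundle_id, unsigned char *bitmap)
{
  for (int i = 0; i < PROGRESS_CACHE_SLOTS; i++)
    if (progress_cache[i].in_use &&
        strcmp(progress_cache[i].bundle_id, bundle_id) == 0) {
      memcpy(bitmap, progress_cache[i].sent_bitmap,
             sizeof(progress_cache[i].sent_bitmap));
      return 1;
    }
  return 0;
}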

I've now implemented this caching of transfer progress.  However, the first run after implementing it showed a weird situation where bundles transferred quite regularly in one direction, but none in the other.   In the direction where traffic was flowing, I saw 6 bundles received fairly quickly, with all of them transferring in 50 to 80 seconds.  That's the good news.  However, there would then be a gap ranging from a couple of minutes to half an hour with nothing being transferred.  Then the same six bundles would be transferred again.

I think what happened is that one of the servald instances crashed or didn't start properly.  As a result one party had no bundles to share, and could not store them into its Rhizome database, and so the same transfers happened repeatedly.  That might have just been some freak event that caused one of the servald instances to not have started properly.  To find out, I'll run the test again, and see what happens.

The second run worked properly in that regard.  Some bundles transferred at a reasonable speed, but others took a VERY long time, up to an hour or so.

One bug spotted: We don't mark the final body block as sent if it is a partial block. This looks to be able to cause the final block to be sent many times repeatedly, which is clearly not good.  I've tried to fix that now.

The other, bigger problem is that the bundle keeps getting de-queued and re-queued, which of course is causing a lot of delay in transfer.  It also seems that at least one code path can de-queue the current bundle being sent without stashing its progress for later retrieval.

Found at least one: The sending of a new generation ID by the sender, which causes us to restart the tree-sync process.  In that code, we stop sending the current bundle.  However, we should just keep on sending the same bundle, unless and until we find a better bundle to send.  This is likely the cause of considerable delays, because the synchronisation will run for potentially several minutes before it finds a new bundle to send. That's likely to be it.  We should probably reduce the generation ID rotation rate for HF, anyway, since it doesn't make sense to reset the tree-sync process so frequently on a high-latency link.

All these problems keep making me think about the new tree-based approach I wrote about recently, as there would be no need to do any of this, and resumption of transmission would happen in a much more natural and efficient manner -- even if the lbard process really did die and restart.

Anyway, back to the existing code, it now looks like transfers are happening reasonably sensibly.  In trying to make a script to show the sequence of transfers, I am however still seeing that the currently selected bundle for transmission tends to change more often than it seems it should.  This led me to look more deeply into the TX queue management, and I'm seeing some odd things there, where some bundles added to the queue seem to disappear, and others that were never reported as having been added to the queue turn up in there.  So I'll add some more debugging for that.

Yes: the TX queue is being completely purged from time to time. This is again the resetting of the generation ID, the unique identifier that is used to indicate when an LBARD instance restarts.  Apparently I was resetting the generation ID every four minutes.  Thus it is no wonder that the TX queues are getting purged and we are ending up stuck in a synchronisation hell.  Instead, the generation ID should only be changed when the TX queue at one end or the other has overflowed, and thus been emptied, so that the sending side doesn't know which bundles it should have sent when the TX queue overflowed.
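In sketch form, the intended policy looks something like the following; the names are hypothetical and the random-ID generation is just for illustration.

/* Sketch: rotate the generation ID only when the TX queue has actually
   overflowed (and so bundles have been silently dropped), never on a timer. */
#include <stdint.h>
#include <stdlib.h>

static uint32_t generation_id;
static int tx_queue_overflowed = 0;   /* set wherever an enqueue gets dropped */

void maybe_rotate_generation_id(void)
{
  if (!tx_queue_overflowed)
    return;                            /* nothing lost, keep the same ID */
  generation_id = ((uint32_t)random() << 16) ^ (uint32_t)random();
  tx_queue_overflowed = 0;
  /* Peers that see the new ID will restart tree-sync with us, which is
     exactly what we want after losing track of what we owed them. */
}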

Given that the transfers are now otherwise working fairly efficiently, I'm hopeful that this will get us to some kind of reasonably reliable transfer with steady transfer times for bundles of a given size.

Running more tests, I have found and fixed a few more problems:

1. If the carrier is lost in a HF modem call, we can end up in a situation where no more normal packets get sent, presumably because the number of unacknowledged packets is high, causing the packet rate to be practically zero.

2. Particularly with bundles that require many data packets, they end up slowing down in transfer over time, and eventually transfer halts.  The halting is probably due in part to (1), but the reducing transfer efficiency must have another cause.

3. Some received pure data packets are corrupted, which is a probable significant contributor to (2).

To fix (1), I have reset the TX sequence number and last acknowledged sequence number whenever a HF call starts.

For (2), I need to examine what is going on in the logs of runs, which will probably make more sense once I have fixed (3).

For (3), I suspect that the modem's TX buffer is being easily overwhelmed.  This is in spite of the fact that I have hardware flow control enabled.  It is possible the hardware flow control doesn't work properly with the USB serial adapters.  In any case, I have implemented a check before sending pure data packets, so that they will not be sent if no normal packet has been seen in the last 30 seconds.  This should stop the buffers getting too full.  But a fuller investigation will require that I log every byte sent to and received from each modem, and then compare the two streams to see exactly what kind of data errors we are experiencing. Is it lost bytes, as I am suspecting, or is it bit errors, for example?
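The guard itself is tiny; a sketch follows, with hypothetical variable and function names.

/* Sketch: only emit a pure data packet if a normal LBARD packet has been
   seen recently, as a crude way of keeping the modem's TX buffer from
   being swamped. */
#include <time.h>

static time_t last_normal_packet_seen;   /* updated on each normal packet RX */

int ok_to_send_pure_data_packet(void)
{
  time_t now = time(NULL);
  /* If we haven't seen a normal packet in 30 seconds, assume the link
     (or the modem's buffer) is struggling, and hold off. */
  return (now - last_normal_packet_seen) < 30;
}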

So adding the logging of bytes sent/received, I can see that indeed something is going wrong with the pure data packets almost immediately.  This makes it much easier to see what is going on, since I don't have to wait an hour or more each time. Now to look at the TX and RX logs to see what is happening...

We are indeed missing large slabs of data.  Taking a look at the FTDI USB serial adapter documentation, it looks like you cannot safely write more than 32 bytes at a time when using hardware flow control, as that is the size of the FTDI chip's internal buffer.  So I've modified the serial writing code to write 16 bytes at a time, to see if that fixes the missing bytes problem. However, the code to back off due to hardware flow control doesn't seem to get triggered at all, and the problem with corrupt data continues.  I found one problem with the TX and RX logs, where the pure data packets weren't being included.  Fixed that; now let's see how it looks...
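For reference, the chunked write is along these lines; this is a sketch rather than the actual LBARD serial layer, and the chunk size is the 16 bytes mentioned above.

/* Sketch: never hand the FTDI adapter more than 16 bytes per write() call,
   to stay well inside its 32 byte internal buffer. */
#include <unistd.h>

#define SERIAL_CHUNK 16

int serial_write_chunked(int fd, const unsigned char *buf, size_t len)
{
  size_t sent = 0;
  while (sent < len) {
    size_t n = len - sent;
    if (n > SERIAL_CHUNK) n = SERIAL_CHUNK;
    ssize_t w = write(fd, buf + sent, n);
    if (w < 0) return -1;          /* caller decides how to retry */
    sent += (size_t)w;
  }
  return 0;
}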

Ok, so now I am not seeing any more lost bytes at the start, but the pure data packets were still being munged.  That problem turned out to be caused by the truncation of the data packets on the receive side. I also found and fixed a couple of segmentation fault bugs in the process.  Now re-running...

Still seeing corrupted packets. It turns out the pure data packets were not being properly escaped before sending, so if they contained a "!" character in their data or header, it was messing things up.  Fixed that now, too.  This could well be causing the failure to transfer larger bundles efficiently, as it would effectively stuff up the transfer of larger pieces more often.  So it's satisfying to have found this problem.  Let's see what it does to the transfer rates...
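To illustrate the kind of escaping that was missing: if "!" has a special meaning in the radio packet framing, any "!" occurring in the packet header or data must be escaped before transmission.  The scheme shown below (doubling the character) is just an example; the real LBARD framing may differ.

/* Sketch: escape '!' bytes in an outgoing packet.
   Returns the escaped length, or -1 if the output buffer is too small. */
#include <stddef.h>

int escape_packet(const unsigned char *in, size_t in_len,
                  unsigned char *out, size_t out_max)
{
  size_t o = 0;
  for (size_t i = 0; i < in_len; i++) {
    if (in[i] == '!') {
      if (o + 2 > out_max) return -1;
      out[o++] = '!';
      out[o++] = '!';               /* doubled '!' stands for a literal '!' */
    } else {
      if (o + 1 > out_max) return -1;
      out[o++] = in[i];
    }
  }
  return (int)o;
}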

Yes, bundles now seem to transfer in between 30 and 200 seconds. However, now I am seeing that the same bundles get sent more than once.  I suspect that this is due to the synctree reset that happens when the TX queue is emptied, but overflow of the TX queue was previously recorded.  This tells LBARD that it needs to start the sync again. However, if the bundles were received, then they should not get transferred again.  Looking in the logs, I am seeing HTTP 400 errors when trying to insert the bundles into the Rhizome database. Thus we are still having a problem with corruption during transfer. I'll have to look at the reassembly of bundles. My suspicion is that the problem will be with the pure data packets somewhere.

The corruption problems have now been solved. Part of it was that writing in 16-byte pieces helps the serial over USB transfers, but messes up the HTTP transactions.  So I have split the code to only do that for the serial transfers.  That fixed the problem with bundles not making it into the Rhizome database, and transfers now proceed progressively through more than just the same 10 bundles repeating. 

However, I am still seeing the HF call dropping out sometimes. Those tend to recover fairly quickly, and might be legitimate loss of carrier in the radio setup.

More concerning though, is that there can be a period of half an hour or more where no transfers in either direction occur.  These are quite mysterious, as the HF modem doesn't seem to drop out, but the logs of both sender and receiver simply show nothing at all for a period of 46 minutes in this particular case.  I guess I will need to run it again, and try to catch when this happens.  

This time we hit a segfault after almost an hour and transferring 80 bundles (between both directions) in about 64 minutes. The problem is that sometimes rubbish ends up in the TX queue.  More specifically, the priority of a bundle ends up being stored in the bundle ID field of a slot in the TX queue. 

So I've put more debug instrumentation in to find out how the TX queue list is getting corrupted. In the process, I also fixed the case where if a bundle is currently being sent, it could still end up being queued for later transmission as well, e.g., if the tree-sync process is restarted at some point.  

I've also added code that detects when the top of the TX queue ends up containing an illegal bundle number, to reduce the impact. However, I would really like to fix the error at its cause.  What I believe is most likely the problem is that the shuffling of the TX queue has an out-by-one error somewhere, as the bundle numbers occur immediately before the list of priorities. Thus if one too many entries were copied down, it would copy a bundle priority.
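The following sketch shows the shape of the suspected bug.  The queue layout and names are illustrative only: the point is that with the bundle-number array sitting immediately before the priority array, shifting one entry too many pulls the first priority value in as a "bundle number".

/* Sketch: remove entry i from a TX queue held as two parallel arrays,
   shifting the tail down by exactly the right number of entries. */
#include <string.h>

#define TX_QUEUE_LEN 20

struct tx_queue {
  int bundle_numbers[TX_QUEUE_LEN];   /* sits immediately before...  */
  int priorities[TX_QUEUE_LEN];       /* ...the matching priorities  */
  int count;
};

void tx_queue_remove(struct tx_queue *q, int i)
{
  if (i < 0 || i >= q->count) return;
  int tail = q->count - i - 1;        /* entries after i: NOT count - i */
  memmove(&q->bundle_numbers[i], &q->bundle_numbers[i + 1],
          tail * sizeof(int));
  memmove(&q->priorities[i], &q->priorities[i + 1],
          tail * sizeof(int));
  q->count--;
}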

So time to start a new test run...

It's now running through a lot happier.  157 of the ~220 bundles have transferred, but it has taken about 4 hours to get that far. It's not as bad as it sounds, though, as it seems that my laptop somehow stops USB communications while the display is locked.  This means that if I don't periodically prod the computer, long periods of no transfers result.  I'm pretty sure that this is the cause of this problem that I noticed previously.

That said, I am also seeing it able to get into a hang-up and redial loop, where no packets get sent in the meantime.  I assume that the time to next packet TX is somehow being set wrongly. Or, more to the point, it can grow large, and wasn't being reset when a call was re-established.  I've examined all the code-paths that can get there, and added explicit resetting of the next packet TX time.

I tracked down the "nothing sent for an hour or more" problem: It's actually not with LBARD, but with the laptop I am using.  It automatically suspends after 2 hours of inaction, even when connected to mains power.  This doesn't really help when it happens mid-run.  So I've disabled that in the laptop, and should now be able to run transfers without this problem.

I've also looked through and found the problem that was causing the illegal bundle numbers ending up in the TX queue, and fixed that. The problem was indeed an out-by-one with the list shuffling code, as expected.  I'm expecting that this will likely fix the problem where only 199 or so of the 202 bundles were being transferred, as it could result in the situation where the TX queue loses an entry or two without actually overflowing to trigger the re-start of the sync process to re-discover the bundles that were lost from the queue.   Thus I'm hoping that runs now will correctly transfer all 202 bundles -- 101 in each direction.  It might also end up being marginally faster, because fewer sync restarts will be required.

In short, I'm now hoping that I am at the point where transfers run reliably and in reasonable time.  I'll do a few runs to get some data on the typical speed and variation, but then we should be good, I think.  What I would like to do next is to plan some actual long-distance tests using the HF radios, with one here at the house in Akaroola and another some distance away.  Perhaps in a neighbouring town.  However to do that, I will need to get some antennae and coordinate with folks in the amateur radio community to help with that.

But as this post has already gone on for ages, I'll stop here, and start a new post looking at the remaining TX issues.
