Zniffer - analysis of FGWP-102 "mesh update"



Introduction

This is part 2 in my series on "advanced Zniffer decodes and in-depth Z-Wave workings".

 

Part 1, called "Zniffer capture of a 4-hop route, working, but it takes 110ms" is here:

 

Please login or register to see this link.

 

I'll do my best to make this post accessible to a wide audience, but I doubt it will be easy to understand if you have never thought about how Z-Wave works (or could work).

 

As you may have read before, I am not impressed by all tools that give a graphical representation of the "neighbour table". The results are very nice to look at, please hang them on the wall, but people draw all sorts of wrong conclusions from this "mesh" or rather "mess" as I like to describe them. Don't get me wrong, there is some useful information in the data, but it is easy to "jump to conclusions". The intention of this post is to educate you and if I succeed, you will understand why Zniffer captures are vastly superior.

 

Here is a snippet of data, output by my script and based on API data of my Home Center:

Please login or register to see this code.

From this you would falsely conclude that device 91 is using a 4-hop route 1,933,1206,1018,960... I can assure you that device 91 is talking directly, without any repeaters. The LWR data is stale, and on top of that, the repeater selection process is different from the "neighbors" selection process. After reading this post you should be able to understand why I promote Zniffer, and only Zniffer, to diagnose your Z-Wave network issues.

 

To understand what is causing delays, and what routing is really doing, only Zniffer shows all the details. Well we could learn a bit more about actual routing decisions if HC3 exposed "extended status information" but right now it does not do that. And HCL/HC2 do not support this "extended status information".

 

Before we begin, I would like to repeat something: if you have not read about Z-Wave routing before, then you should probably do so, because very likely you are basing your ideas on "internet routing". But Z-Wave routing is not like "internet" routing at all. Z-Wave routing is based on "Source Routing" (and a modified AODV known as "Explorer Frames" on series 500/700 and some series 300). If you do not know what "Source Routing" means, then that is a clear sign you should read about the Z-Wave protocol... I highly recommend this series written by @robmac on the OpenHAB forum:

 

Please login or register to see this link.

 

Also, if you have any issues right now and want a practical guide, then please have a look at this more general guide to Z-Wave diagnosis written by @amilanov:

 

"HC2 Repair & Maintenance Guide"

Please login or register to see this link.

 

One last thing... You can use Zniffer to find out many more things, so do not be scared: lots of information can be gathered and understood simply by looking at the graphical decodes it does. This post may lead you to believe that "Zniffer is hard" but if you are technically minded then you'll be fine!

 

Most people report a big improvement after simply analysing "which node talks to which node, when and why" and you do not need much explanation.

 

This post is about Z-Wave Routing, it is a tough subject but I claim: "real issues with routing are actually quite rare". But when you have improved your network, by following the best practices and using Zniffer to find out more about your traffic, inevitably you will see some weird or unexplained things in Zniffer, and you might want answers. That is what this post is all about: understanding Z-Wave Routing.

What is "mesh update" and routing anyway?

Any controller will have a "mesh update" function, aka "neighbour update". If you have ever done this before, preferably on a single node only, you might have thought... what a strange command. All it does is print "start" and then (depending on generation of the devices and number of nodes) either "done" or "error". There is no feedback, and it appears to do nothing. Sometimes it magically makes nodes respond faster, often it does not.

 

On various forums you'll read different opinions ranging from "never do it" to "run full mesh update daily" and some claim it will greatly improve performance while in fact it can actually deteriorate your network. I'll add my own personal opinion to that... It usually does not matter much but it depends on your situation. And to understand your situation, you need Zniffer + understand how routing works. I'll tell you all about that in this post.

 

For those wanting to read the official specification, I recommend... register a (free) account on the Silabs web site and download the Z-Wave 500 SDK, when I write this the version is 6.80. Then read this document:

 

INS13954-11 "Z-Wave 500 Series Appl. Programmers Guide v6.8x.0x"

Chapter "3.4 Z-Wave Routing Principles"

 

You might be more familiar with a set of documents called the "public specification", those mainly define all possible Z-Wave commands (like sensors, blind control and so on). But they don't talk about routing.

 

To understand packets and frame types, you'll have to read the G.9959 specification but I do not recommend it, it took me an awful lot of time to understand it and people usually ask me if it is written in some foreign language instead of English. On top of that, Sigma left out important details about the network and protocol layers. This should change in 2020, when Silabs hands over all documents to the new "Z-Wave Standards Development Organization", the successor of the "Z-Wave Alliance".

 

There are also "reverse engineering" documents, usually posted by people "hacking" the protocol. Google is your friend.

 

Z-Wave is called a "mesh network" because all mains devices (with very rare exceptions, like the Fibaro Swipe) can act as repeaters.

 

Z-Wave is source routed. This means the originator selects a route and the packet contains the route. There is no such thing as an "IP routing table": the decision to use a certain repeater or hop was made right at the start and does not change while the packet travels through the network. Zniffer will print routing info like this:

 

(99)->57 - (1)

 

This means the data started at node 99 and should reach node 1. The -> between 99 and 57 means that this particular packet has been transmitted by node 99 and should be picked up by node 57. You can expect node 57 to receive it, and retransmit the packet with minimal modification. In fact, the repeater only changes a nibble, pointing to the "next hop in the route" and recalculates the checksum (1 or 2 bytes). Zniffer will decode this repeated message as:

 

(99) - 57->(1)

 

To reply to this message (to ACK the packet) the last node uses the same route, but in reverse order.
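
If you want to process these route strings in a script, a minimal Python sketch could look like the one below. The names RouteInfo and parse_route are made up for this illustration and are not part of Zniffer. The sketch assumes you pass only the route part (no trailing command name such as "Meter Report") and that the string contains exactly one arrow, either "->" or the "<-" that Zniffer shows when a frame travels back towards the source.

```python
import re
from dataclasses import dataclass


@dataclass
class RouteInfo:
    source: int        # originator of the frame (first node, in parentheses)
    destination: int   # final destination (last node, in parentheses)
    repeaters: list    # intermediate hops, in route order
    sender: int        # node transmitting this particular frame
    receiver: int      # node expected to pick this frame up
    backwards: bool    # True when the arrow is "<-"


def parse_route(text):
    """Parse strings such as '(99)->57 - (1)' or 'Routed:(99)<-57 - (1)'."""
    text = text.split(":", 1)[-1].strip()       # drop an optional 'Routed:' prefix
    # Split into node tokens (even positions) and separators (odd positions).
    parts = re.split(r"\s*(->|<-|-)\s*", text)
    nodes = [int(p.strip("()")) for p in parts[0::2]]
    seps = parts[1::2]
    arrows = [i for i, s in enumerate(seps) if s in ("->", "<-")]
    if len(arrows) != 1:
        raise ValueError("expected exactly one arrow in: " + text)
    i = arrows[0]
    backwards = seps[i] == "<-"
    left, right = nodes[i], nodes[i + 1]
    sender, receiver = (right, left) if backwards else (left, right)
    return RouteInfo(nodes[0], nodes[-1], nodes[1:-1], sender, receiver, backwards)


first = parse_route("(99)->57 - (1)")
print(first.sender, first.receiver, first.repeaters)   # 99 57 [57]
second = parse_route("(99) - 57->(1)")
print(second.sender, second.receiver)                  # 57 1
```

The same source, destination and repeater list come out of both strings; only the sender/receiver pair moves along the route.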

 

Source routing is efficient, and can run on processors with few CPU cycles and small amount of RAM. It does not scale well, but the engineers thought 232 nodes and 4 repeaters per network are OK and I think their choices made sense. I am not sure in 2020 what kind of routing is "future proof" but that is a completely different topic.

 

You would think, that is easy, and you are right, but I haven't explained yet why "57" is in that route, and that is the fun part...

I will give you a simplified overview of routing. I say, Z-Wave devices can support three types of routing:

 

  1. "Direct" or "No Routing". All Z-Wave devices support this. It is the fastest, and has the smallest data packets because there is no room for routing information. You want as many direct connections as possible. Series 700 and 500 (Z-Wave Plus devices) have better RX/TX so better range than series 300 and older devices.

  2. "Routed" or "Source Routed". Not sure when it was introduced, but series 300 -> 700 devices have it and it is closely related to the function "mesh update" or "neighbour update" on your gateway. Uses up to 4 repeaters and because of that uses bigger packets for data, ACK and Routed Error. I insist on calling the intermediate hops "repeaters", they are NOT called "routers" for a reason. This mechanism "kicks in" when "Direct" fails. The controller plays an important part in selecting the repeaters. Zniffer will mark packets as routed, like this: "Routed:(31)->181 - 11 - (1) Meter Report". Most devices can hold 4 routes per destination node and each of those routes can have a maximum of 4 hops. The device cannot discover these four alternative routes by itself, it needs a controller to calculate them. This is part of the "mesh update" process

  3. "Explorer Frames". This is a broadcast based discovery type of routing, using yet another packet type. It is the least well documented and I do not fully understand the details. Z-Wave Plus devices have it, but also some older series 300. It does not depend on a central authority to find routes. Contrary to what many people seem to think, the explorer mechanism does not learn new routes, it only establishes a new "last working route" in case everything else fails. Silabs says it finds a route, on average, in about 350 ms. If the destination node is unavailable it times out in about 3500 ms. Zniffer decodes these packets as "Explorer Normal" and "Explorer Search Result".

 

Mesh update only affects the second type of routing. This gives you a clue why I say "mesh update might not matter much". Firstly, because maybe nothing has changed and the routing is no different after the update. Secondly, because you likely have either "direct" or "Explorer", so there's always some way to get data from node A to node B.

 

I've left out a few details, most are not relevant for general understanding but I would like to point out that "controllers" and "devices" use different terminology and work in a different way. This means, sending data from a node to a controller is not identical to sending data from the controller to the device.

 

You can picture the routing information of the controller and the device, per destination ID, to be close to this:

Last Working Node | Next Last Working Route | Route 1 | Route 2 | Route 3 | Route 4

 

Because the controller knows all nodes and all neighbours, and thus "knows" all possible routes, it can calculate routes on the fly. But almost all other devices do not have the complete table. Instead, they only get a part of all possible routes. For the sake of completeness, there are devices that are based on the "controller library" and they replicate the full neighbour table, an example is the Aeotec Minimote.

 

Maybe some of you are thinking... If a device only has room for 4 routes, is Z-Wave really a "mesh" network? I would say "yes" but indeed with a twist... Unless it goes into "Explorer Frame" mode, a node cannot fully utilize the mesh.

 

How does the device get Route 1 to 4? The controller selects them and sends them to the device, as part of the mesh update. Keep this in mind... This only happens when you do a "mesh update". When you include a node, that process includes a "mesh update" so there is no need to "re-mesh" after inclusion if you freshly included a device in place (always try NWI so you do not have to move the device after inclusion, it saves time). The third routing mechanism, "explorer frames", always kicks in when all routes fail on a (non-controller, fairly recent) device.

 

You need a Zniffer to diagnose routing (although "extended status information" can help but it is not available on Home Center 2/HCL/3). You can find a nice example of a Zniffer capture in my post "Zniffer capture of a 4-hop route, working, but it takes 110ms". I mention this because although routing increases the reliability of the communication, it also comes with a price: increased latency...

 

If you check your network, and you live in a "compact house", you'll see most nodes work "direct" and some will use "source routed". If you see a (working) routed connection then you are looking at the "Last Working Route" aka LWR and it is impossible to tell if the route comes from the routing table or was obtained through "explorer frame(s)"

 

Where do routes come from? If you see this in Zniffer:

 

Routed:(31)->181 - 11 - (1) Meter Report

 

Who decided that device 31 should first send to 181, then to 11 to reach the controller (1)? The device picked that route from a set of 4 possible routes... And those 4 routes were selected by the controller and then sent to the device. This is what a "mesh update" does:

 

  1. The gateway application (your Home Center, for example) calls ZW_RequestNodeNeighborUpdate. In Zniffer you will see this decoded as "Find Nodes In Range" coming from the controller. This command lists all candidate neighbours (possible repeaters), so smaller networks scan fewer nodes and the time it takes to scan depends on the number of nodes.

  2. Node sends "NOP Power" to nodes at different speeds and registers response packets. If the response is OK, the scanned node is considered "a neighbour". NOP Power means No Operation = a kind of "ping". The packet is transmitted at reduced power (-6 dB, which is about 25% of normal power, or half the signal amplitude) to make sure the device is definitely in range.

  3. Node returns "Node Range Info" to controller. This is a list of "neighbours"

  4. Controller updates its routing matrix with this new data. This matrix has all NodeIDs in rows and in columns, and a 1 or 0 to indicate if node X and Y are neighbors.

  5. After "Find Nodes In Range" the gateway application (your Home Center, for example) calls ZW_DeleteReturnRoute and ZW_AssignReturnRoute multiple times, to calculate routes and send them to the device. These are decoded by Zniffer as "Assign Return Route". The device will store this information until you call "mesh update" again.

 

Let's look at a simplified example before we look at a real-world capture of a mesh update.

 

Assume a minimal network with 4 mains nodes and a controller:

 

2 <-------> 3 - 4 <------> Controller <------> 5

 

The arrows mean a direct connection is possible, but 2 is "too far away" from node 5 and the controller, so it needs to go through either node 3 or 4.

Please login or register to see this code.

 

Possible routes from 2 -> 1 are:

 

2 -> 3 -> 1
2 -> 4 -> 1
2 -> 3 -> 4 -> 1
2 -> 4 -> 3 -> 1
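
To make the idea of "calculating routes from a neighbour table" concrete, here is a small Python sketch that enumerates those routes. It is only an illustration, not the algorithm a real controller uses (that is undocumented). The NEIGHBOURS table is my reading of the diagram above: node 1 is the controller, nodes 3 and 4 can hear each other, and I assume node 5 only neighbours the controller.

```python
MAX_REPEATERS = 4  # a Z-Wave route holds at most 4 repeaters

# Hypothetical neighbour table for the example network (1 = controller).
NEIGHBOURS = {
    1: {3, 4, 5},
    2: {3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3},
    5: {1},
}


def find_routes(src, dst, neighbours, max_repeaters=MAX_REPEATERS):
    """Return all loop-free routes from src to dst, shortest first."""
    routes = []

    def walk(path):
        node = path[-1]
        if node == dst:
            routes.append(path)
            return
        repeaters_so_far = len(path) - 1   # everything after src is a repeater
        for nxt in sorted(neighbours[node]):
            if nxt in path:
                continue                   # never visit a node twice
            if nxt != dst and repeaters_so_far >= max_repeaters:
                continue                   # no room for another repeater
            walk(path + [nxt])

    walk([src])
    return sorted(routes, key=lambda r: (len(r), r))


for route in find_routes(2, 1, NEIGHBOURS):
    print(" -> ".join(str(n) for n in route))
```

With this table the sketch finds the same four routes listed above. On a real network the number of candidates explodes quickly, which is one reason the controller only hands a few of them to each device.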

Analysis: Fibaro Wall Plug Z-Wave Plus - FGWP-102 mesh update

I will post text snippets copy/pasted from the Zniffer app. If you want to open the capture, see file "05 - FGWP-102 Mesh Reconfigure 2020-06-14".

 

I have left out a few columns that are not too relevant to make everything a bit more compact and improve readability.

 

Node 99 is close enough to the HC3 to always support a direct connection. This serves as a good example of what "mesh update" does on a node in the centre of your house, near the controller, and with plenty of neighbours. I will analyse a capture of a "far away node" in a follow up post.

 

Firmware is 3.2, Z-Wave Version 4.(0)5 (SDK 6.51.06)

 

I've removed some columns to make the text more readable. I won't repeat the header row.

Please login or register to see this code.

 

The Zniffer Application 4.62 does not do a detailed decode of the "Find Nodes" command but it is not hard to do that manually:

 

F22FC1AB0141032963: Z-Wave header, fully decoded by Zniffer. It is a direct packet so it does not have a routing header before the payload.
01: Command Class "Protocol" (I invented that name, it is undocumented).
04: Find Nodes In Range (undocumented, but reverse engineered as ZW_RequestNodeNeighborUpdate).
1A: number of bytes in bitmask (number of bytes after this): 26.
D4: first bitmask means scan on/off for node 8 (msb) to 1 (lsb), 0b11010100 means node 8, 7, 5, 3 should be scanned.
05: second bitmask means scan on/off for node 16 (msb) to 9 (lsb), 0b101 means node 11, 9.
52: 0b1010010 -> node 23, 21, 18.
... and so on

 

Complete node list: 3, 5, 7, 8, 9, 11, 18, 21, 23, 30, 31, 34, 35, 56, 60, 75, 89, 90, 143, 153, 165, 175, 192, 193, 199, 204, 205, 207
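
If you do not feel like decoding bitmasks by hand, a few lines of Python will do it. This is a sketch based only on the layout described above (node 1 is the lsb of the first byte, node 8 its msb, node 9 the lsb of the second byte, and so on). The function name decode_node_mask is made up, and I only feed it the three mask bytes shown in this post, not the full 26-byte payload.

```python
def decode_node_mask(mask):
    """Turn a Z-Wave node bitmask into a sorted list of node IDs."""
    nodes = []
    for byte_index, value in enumerate(mask):
        for bit in range(8):
            if value & (1 << bit):
                # bit 0 of byte 0 is node 1, bit 0 of byte 1 is node 9, ...
                nodes.append(byte_index * 8 + bit + 1)
    return nodes


print(decode_node_mask(bytes.fromhex("D40552")))
# [3, 5, 7, 8, 9, 11, 18, 21, 23]
```

That matches the start of the complete node list. Paste the remaining bytes from your own capture to get the rest.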

 

Let's find out if my Wall Plug indeed starts to scan node 3, then 5, then 7, ...

Please login or register to see this code.

Yes it does! My Wall Plug sends a NOP (= "ping") to node 3 and my Zniffer sees node 3 ACK it. So they are neighbors! But wait, not too fast, the Zniffer antenna is not the Wall Plug antenna, and the devices are not 100% identical so how can we be certain that the Wall Plug and the Zniffer hear the same ACK? You cannot be 100% certain but this helps: if node 99 does not repeat its NOP, but instead moves on to the next node, then it is very likely to be a good one... And we will find out later, when the Wall Plug sends its neighbour list to the controller.

 

What's the next packet?

Please login or register to see this code.

Okay, the "mesh update" moves on to the next node, so node 3 and 99 are neighbours. We expect the scan to move on to the next candidate neighbour and indeed, that's the next destination node, but here we notice a difference:

Please login or register to see this code.

Ah, Node 7 does not send ACK. I expect the plug to try 3 times and I only see 2 tries, I don't know why that is happening (Zniffer might miss a beat occasionally, that is a possible explanation) but it is not too important, we will find out which nodes are neighbours at the end.

 

Next up: node 8:

Please login or register to see this code.

Yes, those are neighbours, the Plug moves on to node 9 after the ACK:

Please login or register to see this code.

Three attempts and no ACK... Node 9 is not a neighbour. Quick sanity check: I know node 9 is a "Dimmer 1" in another room. I know the Wall Plug "looks" at this dimmer at a shallow angle (maybe 5 degrees) through a wall, so the "apparent thickness" of the wall is very high (say 1 to 2 meters, that is a wild guess). So it does not surprise me that a NOP at reduced power does not reach this device. So it makes sense.

 

The next packet is a CRC error. This confuses "newbies" and they usually worry too much about it. Sometimes CRC packets can reveal interesting information but as a rule of thumb, CRC means the signal reaching the Zniffer was too weak, there was some sort of interference, or 2 devices were talking at the same time. Some users want to know "but which device MAKES these errors?" - that is not what is happening, devices do not "make" CRC errors. Imagine you are listening to a phone call, but the quality is very low. When too many words drop out, you are no longer able to make sense of what the other person says. That person did not "make" the CRC errors. Because the data is unintelligible, it is usually not possible to make much sense of the hexadecimal data.

Please login or register to see this code.

We can sometimes make an educated guess what kind of data it was "meant to be"... If we look at the first 4 bytes of that packet:

 

722FC1AB

 

That is almost exactly my HomeID:

 

F22FC1AB

 

A bit at the start got flipped and now the CRC no longer matches.
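
You can check this with a quick XOR. The sketch below simply compares the two 32-bit values shown above and lists the bit positions that differ (bit 31 being the most significant bit).

```python
expected = 0xF22FC1AB   # my HomeID
received = 0x722FC1AB   # first 4 bytes of the "CRC error" packet

diff = expected ^ received                       # 1 bits mark the differences
flipped = [bit for bit in range(32) if diff & (1 << bit)]

print(hex(diff))      # 0x80000000
print(flipped)        # [31]
```

Only the most significant bit differs, so the rest of the HomeID arrived intact.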

 

In this particular example, much of the data in this "CRC error" is still valid and I can see it is very likely node 9 sending an ACK to our wall plug! If I moved my laptop + Zniffer closer to node 9, I would "hear" a different thing and then it would not be a CRC error.

 

Let's not dwell on CRC errors, the main message is "do not worry too much about them". Did the wall plug "hear" the ACK? That is impossible to tell based on the information we have now.

 

Let's go through the rest of the data a bit faster:

Please login or register to see this code.

I would say 11 is a neighbour, and nodes 18 and 21 might be, but their ACK got registered as "CRC_ERROR" so not really sure. But I do not see retries of NOP to 18 and 21 which gives us a clue (the ACKS got accepted).

 

I'll skip a bit, until we see the highest NodeID in that list we got in Packet 1:

Please login or register to see this code.

Next we'll see our Wall Plug reporting "Command Complete" and then the controller asks "Get Nodes In Range" and we'll see the Wall Plug produce a list in command "Node Range Info":

Please login or register to see this code.

Zniffer does not decode the reply, but it is easy to decode because it uses the same format as "Find Nodes" (packet 1) at the start of this process. Let's find out if our "manual Zniffer packet analysis of packets 1 to 65" matches the Node 99 Report in packet 69!

 

F22FC1AB63410B2701: Z-Wave header, fully decoded by Zniffer.
01: Command Class "Protocol" (undocumented).
06: Node Range Info (undocumented, but reverse engineered).
1A: number of bytes in bitmask (number of bytes after this): 26 - same as in the request.
D4: first bitmask covers node 8 (msb) to 1 (lsb), 0b11010100 means node 8, 7, 5, 3 are neighbours.
04: second bitmask covers node 16 (msb) to 9 (lsb), 0b100 means node 11 is a neighbour.
52: 0b1010010 -> node 23, 21, 18 are neighbors.
... and so on

 

The complete neighbour list: 3, 5, 7, 8, 11, 18, 21, 23, 30, 31, 34, 56, 60, 75, 89, 90, 143, 153, 175, 192, 193, 204, 205, 207
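
Comparing the two lists by eye is error prone, so here is a small Python sketch that does it. Both lists are copied from this post. The helper encode_node_mask is a made-up name that packs node IDs back into the bitmask layout described earlier, which lets you cross-check the first bytes of both packets. It picks the shortest mask that fits the highest node ID, which is my choice for the sketch and not necessarily what the controller does.

```python
requested = [3, 5, 7, 8, 9, 11, 18, 21, 23, 30, 31, 34, 35, 56, 60, 75, 89, 90,
             143, 153, 165, 175, 192, 193, 199, 204, 205, 207]
reported = [3, 5, 7, 8, 11, 18, 21, 23, 30, 31, 34, 56, 60, 75, 89, 90,
            143, 153, 175, 192, 193, 204, 205, 207]


def encode_node_mask(nodes):
    """Pack node IDs into a bitmask; node 1 is the lsb of the first byte."""
    mask = bytearray((max(nodes) + 7) // 8)
    for node in nodes:
        mask[(node - 1) // 8] |= 1 << ((node - 1) % 8)
    return bytes(mask)


# Nodes that were scanned but did not make it into the neighbour report.
not_neighbours = sorted(set(requested) - set(reported))
print(not_neighbours)                          # [9, 35, 165, 199]

print(encode_node_mask(requested)[:3].hex())   # d40552
print(encode_node_mask(reported)[:3].hex())    # d40452
```

So 4 of the 28 scanned nodes dropped out, and the first three mask bytes agree with the manual decodes of the request and the report.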

 

Let's check some of our preliminary conclusions we've made based on the previous packets.

 

Packet 11-13 clearly suggested node 9 is not a neighbour but packet 14 might have been an ACK. The verdict is: 9 is absent from the report, it is too far away from node 99

 

Packet 18 and 21 might have been ACKs from node 18 and 21 respectively, and according to the Node Range Info Report 18 and 21 are indeed neighbours.

 

You would think, we're done... But have a look at the next packet and compare with packet 1 and see if you can spot the difference!

Please login or register to see this code.

This is a second request to scan for neighbours, but with a different list:

 

Find Nodes In Range: 1, 57, 62, 87, 112, 131, 181, 195, 197, 209

 

None of those nodes were scanned the first time... And when I look at what these nodes have in common: they all have Z-Wave protocol version >= 4, while the first set of nodes have protocol 3. They also have something else in common: those nodes are "series 500" while the first set was "series 300".

 

Let's have a look at what happens during scanning. Can you spot an important change?

Please login or register to see this code.

These nodes get scanned at 100 kBit/s, which means they are using the highest speed available on the ZW500 (and a different channel).

 

Skip to packet 97...

Please login or register to see this code.

Node Range Info: 1, 57, 87, 181, 195, 197, 209

 

That means, those nodes are able to forward packets at 100 kBit/s...

 

Remember, this is not documented by Silabs and is still the "secret sauce" of Z-Wave when I write this. On the other hand, what we have seen so far makes sense. The first round of "scans" found the 40k (also capable of doing 9.6k) nodes and now the second round has found the faster 100k nodes.

 

But then a 3rd scan started and that was something that I did not expect!

Please login or register to see this code.

This means:

 

Find Nodes In Range: 135, 160, 169, 214

 

For some crazy reason....... I immediately recognized this list...... I have been staring at captures way too long :)

 

Those 4 devices are all Fibaro FGT-001, they are FLiRS devices and they are protocol 4.61, this is SDK 6.71.01

 

That is very interesting, the controller (my HC3) for some reason wants to find out if my Wall Plug (node 99) can reach the FLiRS devices. FLiRS devices cannot act as repeaters, and I wonder what the controller tries to "learn" from the reachability of these devices. I can make an educated guess, it probably has to do with optimization of "BEAMING"... But I don't know.

 

I'll skip a few packets and show the wake up beam and response of the last node:

Please login or register to see this code.

Packets 125 and 126 mark the start and end of a BEAM, which is a special packet, sent in a continuous loop, that gets picked up by FLiRS nodes. It takes about 1100 ms and if the node wakes up it will respond, that is what you see happening in packets 127 and 128.

 

The result of this third scan is:

Please login or register to see this code.

Node Range Info: 135, 160, 169, 214

 

This means my 4 FGT-001 Fibaro Heat Controllers are reachable by this Wall Plug (node 99).

 

Finished? Done? Not yet! The device, node 99, has only reported neighbours, it has not yet received routes to use.

 

After about 600 ms the controller has finished processing the "neighbour matrix" and has distilled routes for the Wall Plug to use when it wants to reach the controller.

Please login or register to see this code.

Again, secret sauce of Z-Wave and undocumented. But thanks to a study floating on the internet plus some testing, I can decode a few things...

 

The first packet is probably an "erase" function, telling the node to try a "direct" connection (no routing):

cacheIndex: 0, SRlen: 0, hops:

 

Return Route packets 2 -> 4 decode as:

 

cacheIndex: 1, SRlen: 1, hops: 39h = Node 57
cacheIndex: 2, SRlen: 1, hops: 57h = Node 87
cacheIndex: 3, SRlen: 1, hops: B5h = Node 181

 

To summarize, the controller has sent 4 routing hints and node 99 can take one of these alternative routes:

 

99 - 1
99 - 57 - 1
99 - 87 - 1
99 - 181 - 1

 

...Or try "explorer frames" if all 4 have failed...

 

Does this sound reasonable? Let's go back to the "neighbour list" my Wall Plug learned by scanning neighbours at 100 kBit/s:

 

Node Range Info: 1, 57, 87, 181, 195, 197, 209

 

As you can see, no magic involved, the controller has selected the lowest possible node IDs and the shortest possible routes at the fastest speed!
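
To show what I mean by "lowest node IDs, shortest routes", here is a Python sketch of that guess. It is emphatically not the real algorithm, which is undocumented. The function guess_return_routes is a made-up name, and the set of neighbours of the controller is an assumption for the sake of the example: I pretend every 100k neighbour of node 99 can also reach node 1 directly.

```python
def guess_return_routes(node_neighbours, controller_neighbours,
                        controller=1, slots=4):
    """Guess the route slots: direct first, then 1-hop routes by ascending ID.

    Each route is a list of repeaters; an empty list means "direct".
    """
    routes = []
    if controller in node_neighbours:
        routes.append([])                      # slot 0: no repeaters
    for candidate in sorted(node_neighbours):
        if len(routes) == slots:
            break
        if candidate != controller and candidate in controller_neighbours:
            routes.append([candidate])         # one repeater between node and controller
    return routes


# Neighbours node 99 reported at 100 kBit/s (from the capture).
node_99 = [1, 57, 87, 181, 195, 197, 209]
# Assumption: all of them are also neighbours of the controller.
controller = {57, 87, 181, 195, 197, 209, 99}

print(guess_return_routes(node_99, controller))
# [[], [57], [87], [181]]
```

The output lines up with the four decoded "Assign Return Route" packets, but one capture of one node near the controller is thin evidence, so treat it as a hypothesis.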

 

Does this device actually use these routes? Am I sure of the decoded data of the "Assign Return Route" commands? As far as I can tell, there is no way to "read back" the 4 routes stored on the device.

 

But we can easily test the routing engine: power down the controller, then make the Wall Plug send some data, and you'll see it exhaust all possible ways to contact the controller...

 

See file "06 - FGWP-102 to dead controller 2020-06-17".

 

First we test if everything is working fine, before powering down:

Please login or register to see this code.

Direct connection, looks good to me! Now I turn off my HC3. Don't reboot, but turn it off completely, otherwise the controller stays powered on and we want it to be totally non-responsive.

 

First six attempts: try direct at 100 and 40 k speed:

Please login or register to see this code.

This device does not drop to 9.6 k, some older devices will do that.

 

As expected, "direct" has failed so now the Wall Plug tries its first route:

Please login or register to see this code.

As we have figured out, that is indeed the 2nd route slot:

 

99 - 1
99 - 57 - 1
99 - 87 - 1
99 - 181 - 1

 

What happens when node 57 sends to 1?

Please login or register to see this code.

As expected, 3 tries and no ACK - interestingly though in this case there is no "speed change" to 40k, we'll come to that later.

 

Now device 57 knows node 1 is dead, and it sends back a hint to the source:

Please login or register to see this code.

As you can see, there is special kind of packet called "Routed Error" and it tells node 99 exactly which part of the route "99 - 57 - 1" has failed... It was the last step so either (a) node 1 is dead (b) node 57 is no longer able to reach node 1 (for example, because you have moved node 1 or node 57). How can you be sure of this?

 

The Data column "Routed:(99)<-57 - (1)" gives all the details about the "travelling" of this packet. It means this packet originated from 99 and its destination was 1. The arrow <- pointing to the left means that this packet is traveling backwards: the (real) sender is node 57 and the intended (next) recipient is node 99. Zniffer also tells you which node failed, it is under "Properties3", lower left decoder pane of Zniffer. It says "Failed Hop: 0x01".

 

To summarize: Node 57 tries to reach node 1 but gets no ACK. It then "reverses" the route, to signal to the originator "In this route, I am unable to reach node 1".

 

This process is repeated, but at 40 k instead of 100 k. This makes sense because they use a different frequency and have different reach and possibly different noise levels:

Please login or register to see this code.

I am not going to copy/paste every packet after this because you can guess what happens next, the Wall Plug moves on to the next route...

Please login or register to see this code.

And then the last one...

Please login or register to see this code.

How long did it take this node to realize that none of the 4 routes is going to work?

 

Start: 11:34:17.031
ACK of last routed error: 11:34:18.266

 

So the network has been busy, doing nothing useful, for 1.2 seconds. That is not too bad, but routes can be longer and the process can be slower, and also, we're not done yet!
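
If you want to do this kind of timing on your own captures, the subtraction is easy to script. A minimal sketch, assuming the timestamps are copied as text in the HH:MM:SS.mmm form shown above and both fall on the same day (elapsed_seconds is a made-up helper name):

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two same-day Zniffer timestamps given as text."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


print(elapsed_seconds("11:34:17.031", "11:34:18.266"))
```

For the two timestamps above this gives 1.235 seconds.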

 

The Wall Plug tries "direct" one more time.

Please login or register to see this code.

Then it moves on to the 3rd type of routing: "Explorer Frames". That type is very much undocumented but you can make educated guesses and again some decoding happens (lower left pane) in Zniffer:

Please login or register to see this code.

An interesting choice of the Z-Wave designers: the broadcasted packets encapsulate both routing discovery data and the original payload, so when the packet reaches its destination, the data gets delivered at the same time. This reduces latency compared to systems that first have to learn a route, then send the data.

 

In total, 12 devices repeat this message (you can see the middle part of the packet changes and repeaters are under "Properties 5")

 

Start: 11:34:17.031
Last Explorer Frame: 11:34:18.893

 

It took almost 1.9 seconds in total to "give up talking to the controller". Depending on the type and firmware of the device, it can take much, much longer. I have seen captures of routed + explorer take more than 13 seconds. The Z-Wave specification mentions it can take even longer.
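
The order in which this Wall Plug worked through its options can be summarised in a short Python sketch. It only restates what the capture shows and is not a model of the firmware. The name fallback_plan is made up, and I assume every stored route is tried at 100k and then at 40k, which I only showed in detail for the first route.

```python
def fallback_plan(routes, controller=1, source=99):
    """List the attempts, in order, for a node that cannot reach the controller.

    routes: repeater lists for the stored (non-direct) routes.
    """
    plan = ["direct at 100k and 40k"]
    for repeaters in routes:
        hops = " - ".join(str(n) for n in [source] + repeaters + [controller])
        for speed in ("100k", "40k"):          # assumed for every route
            plan.append("routed " + hops + " at " + speed)
    plan.append("direct, one more time")
    plan.append("explorer frame")
    return plan


for step in fallback_plan([[57], [87], [181]]):
    print(step)
```

Every line in that list is airtime wasted on a destination that is not there, which is why a dead controller (or a dead mains node that others report to) is so noticeable on the network.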

 

The moral of this story: do not power off your Home Center, it will make all sending nodes (very) unhappy.

This concludes the analysis of the mesh update of a Z-Wave Plus device (series 500, not yet the latest-and-greatest series 700) near the controller. I decided I wanted a somewhat more exciting real world example, one showing a "far away node" on my network, one that is barely able to get a direct connection. I will write a follow up post with the FGK-101 + DS18B20 that is my main outdoor sensor in the garden.

 

Please login or register to see this attachment.

Please login or register to see this attachment.


7 hours ago, petergebruers said:

 

source routing is efficient, and can run on processors with few CPU cycles and small amount of RAM. It does not scale well, but the engineers thought 232 nodes and 4 repeaters per network are OK and I think their choices made sense. 

 

it was once designed for little AVR uC, with RF stage and already "fast" with ZW200 series. 

 

7 hours ago, petergebruers said:

"Explorer Frames". - Contrary to what many people seem to think, the explorer mechanism does not learn new routes, it only establishes a new "last working route" in case everything else fails. Silabs says it finds a route, on average, in about 350 ms. If the destination node is unavailable it times out in about 3500 ms.

 

For sure it is not a real routing table (with up to 232 nodes and up to 5 routes each). The explorer frame builds an LWR table with 232 (direct or routed) entries, which is a kind of routing table, and even Zensys spoke in early docs about "routing created with explorer", which is probably why people think so. It is maybe easier to understand when one thinks about the "I'm lost" command/situation.

 

In my trainings I always use the simplified statement "if everything fails, the explorer frame will build a working route". This is only a small part of the story, but simple enough to convey that something in the Z-Wave protocol will try to find a way to communicate. It is confusing to explain that NWI is explorer as well ^^.

 

7 hours ago, petergebruers said:

The third routing mechanism, "explorer frames" always kicks in when all routes fail on a (non-controller, fairly recent) device.

 

I would say half (the last/recent half - actually everything since the end of 2008, but let's say certified since the end of 2009) of the ZW300 based devices can use explorer frames, actually all with Z(S)DK 4.50 till 4.55, but not 5.x-5.02 (very few ZW400 based devices, I think I have Duwi or Merten here).

 

7 hours ago, petergebruers said:

F22FC1AB0141032963 Z-Wave header, fully decoded by Zniffer, it is a direct packet so does not have a routing header before the payload.
01 Command Class "Protocol" (I invented that name, it is undocumented).
04 Find Nodes In Range (undocumented, but reverse engineered as ZW_RequestNodeNeighborUpdate)
1A number of bytes in bitmask (number of bytes after this): 26
D4 First bitmask means scan on/off for node 8 (msb) to 1 (lsb), 0b11010100 means node 8, 7, 5, 3 should be scanned
05 Second bitmask means scan on/off for node 16 (msb) to 9 (lsb), = 0b101 means node 11, 9
52 = 0b1010010 -> node 23, 21, 18
... and so on
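
The bitmask part of that decode can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the bit order (bit 0 of the first byte is node 1) is taken from the decode quoted above, not from any official documentation.

```python
def nodes_from_bitmask(mask: bytes) -> list[int]:
    """Return the node IDs whose bit is set; bit 0 of byte 0 is node 1."""
    nodes = []
    for byte_index, value in enumerate(mask):
        for bit in range(8):
            if value & (1 << bit):
                nodes.append(byte_index * 8 + bit + 1)
    return nodes


# The first three bitmask bytes from the quoted packet: D4 05 52
first_three = bytes.fromhex("D40552")
print(nodes_from_bitmask(first_three))
```

Feeding it all 26 bitmask bytes of a real "Find Nodes In Range" packet should give the complete list of nodes the controller wants scanned.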

 

7 hours ago, petergebruers said:

Again, secret sauce of Z-Wave and undocumented. But thanks to a study floating on the internet plus some testing, I can decode a few things...

 

Yeah, that "Z-Wave protocol Command Class / ZWAVE_CMD_CLASS" has been documented since 2009 in SDS10264-2 (and yes, I know, nobody ever saw that doc).

 

 

7 hours ago, petergebruers said:

The moral of this story: do not power off your Home Center, it will make all sending nodes (very) unhappy.

 

 


 

nice writing!

 

 

 


Wow, excellent post, I've learned so much! Thank you @petergebruers. I have already read it twice and learned something each time; I'll probably need to read it again :)

On 6/19/2020 at 12:16 PM, petergebruers said:

When you include a node, that process includes a "mesh update" so no need to "re-mesh" after inclusion if you freshly included a device in place

This was really unclear to me, I now have a clear answer, thank you for that. I assume this is mainly true for powered devices and remeshing is still necessary when we talk about battery devices, to learn about new, potentially better candidates for routing or just to remove old neighbours because they have been moved or removed.

What still puzzles me is what happens when we remove a device. If we consider that there is no need to remesh after a fresh inclusion because the mesh update is part of the inclusion process, I was supposing the same would apply when we remove a device. I'm saying that because I have identified, sniffing my network, that some powered devices were still trying to hop via nodes that no longer existed. I can understand this happening with battery devices until the remesh, but not with powered devices. Maybe I'm mixing things up, as I spent so many hours sniffing my weird system, but I'm nearly sure about that one, as it was typically the way to retrieve a direct route for some devices. "Supposed" to be on a direct route as they are near the controller, they were using node A as a hop. After I excluded node A, the device was first trying to reach the controller using node A as a hop, but didn't get any ACK as the node was gone, then tried alternative routes, generally ending up taking the direct route. Might it be just a mystery? I believe that after an exclusion a remesh should be done and the device removed from the list of neighbours of the devices using it? Am I wrong, or should I read it more than four times, which I'm going to do in any case :)

Thank you also for clarifying explorer frames, they were also obscure to me! When a node cannot find any hop, the explorer frame is the last hope?

 

 

Edited by Tony270570

  • Topic Author
  • On 6/19/2020 at 10:57 PM, tinman said:

    it was once designed for little AVR uC, with RF stage and already "fast" with ZW200 series.

    Yes, backwards compatibility is "a blessing and a curse", design constraints carried over from 15-20 years ago (eg I would rather have 2 or 3 x 100 k channels instead of one still supporting 9.6 k + 40 k and one 100 k channel). Moore's law not only applies to CPUs used in PCs but also (more or less) to MCUs. I still have a few PIC micro-controllers with 64 bytes (not kilobytes or megabytes).

     

    Thank you for clarifying a few bits,  I'll update my post.

     

    1 hour ago, Tony270570 said:

    This was really unclear to me, I now have a clear answer, thank you for that.

    I am glad you like my post.

     

    1 hour ago, Tony270570 said:

    I assume this is mainly true for powered devices and remeshing is still necessary when we talk about battery devices

    Routing and remeshing apply to both mains and battery powered devices and work the same way. There is however a small caveat: long routes increase power consumption and so reduce battery life somewhat. I do not have exact data but if you read claims like "10 years of battery life" this always assumes a direct connection, no retries, no routing and no explorer frames!

     

    1 hour ago, Tony270570 said:

    What still puzzles me is what happens when we remove a device. If we consider that there is no need to remesh after a fresh inclusion because the mesh update is part of the inclusion process, I was supposing the same would apply when we remove a device.

    Ah, you are right about inclusion. Adding a device does a "neighbor update" and the controller will send "fresh" routes to that newly included node.

     

    Removing devices does not work the same way for a simple reason... When a "potential repeater" disappears from the routing matrix, the controller would have to (a) calculate new routes for every device (so in your case > 100 devices) and (b) would have to send routing hints to all "affected nodes" some of which are sleeping nodes.

     

    Without Zniffer, you cannot really know for sure if any device is using that route. And even with Zniffer you can only "see" the 4 potential routes by shutting down the controller, like I did when I wrote my post.

     

    I am being cautious, but I think for most users, generally speaking, removing a device causes fewer issues than you might think: the device probably still has 2 or 3 alternative routes or resorts to explorer frames.

     

    There might be one specific case that warrants a full re-mesh, without even looking at Zniffer... We had a private conversation about a specific problem and I asked you specifically to remove node 7 based on your captured data. So firstly I knew with 100% certainty that node 7 acted as a repeater for several nodes, and secondly, based on observation, I would say routing prefers lower node numbers. I wanted devices to use other repeaters: deleting the device was the first step and a re-mesh was the second step (calculate new routes and send them to devices).

     

    We're talking about getting the optimal routing experience, but that does not mean the gains will be significant, it depends on what you use the device for.

     

    1 hour ago, Tony270570 said:

    they were using node A as a hop. After I excluded node A, the device was first trying to reach the controller using node A as a hop, but didn't get any ACK as the node was gone, then tried alternative routes, generally ending up taking the direct route. Might it be just a mystery?

    If node A had a short route over node 7 then indeed it would try that, it would fail and if another route works the node should remember that, because it is in its LWR (or Return Route).

     

    Based on the documentation I would assume that the "last working route" is really what its name says. In that case you would expect that device A never tries repeater 7 again and I think that is 99% true. But what might happen is eg you have a power cut, or your controller is offline, or the alternative route fails as well... Then the node will try all possibilities again. There is imho also that 1% of unexplained cases ;)
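
To make that fallback behaviour concrete, here is a minimal Python sketch. It is not how any firmware really works (the real algorithm is proprietary): the function name, the callbacks `try_route` and `explore`, and the order "LWR first, then the other stored routes, then explorer frames" are all assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

Route = tuple[int, ...]  # repeater node IDs in order; () means a direct route


def send_with_fallback(
    lwr: Optional[Route],
    stored_routes: Sequence[Route],
    try_route: Callable[[Route], bool],
    explore: Callable[[], Optional[Route]],
) -> Optional[Route]:
    """Try the last working route, then the stored routes, then explorer frames.

    Returns the route that got an ACK (to be remembered as the new LWR),
    or None if everything failed.
    """
    candidates = []
    if lwr is not None:
        candidates.append(lwr)
    candidates.extend(r for r in stored_routes if r != lwr)
    for route in candidates:
        if try_route(route):
            return route
    # Last resort: hypothetical stand-in for the explorer frame mechanism.
    return explore()


# Example: repeater 7 has been excluded, so any route through it gets no ACK.
def try_route(route: Route) -> bool:
    return 7 not in route


new_lwr = send_with_fallback(
    lwr=(7,),
    stored_routes=[(7, 12), ()],
    try_route=try_route,
    explore=lambda: None,
)
print(new_lwr)  # the direct route, shown as an empty tuple
```

Once the direct route is stored as the new LWR, the stale route over node 7 would only be retried after a later failure, which matches the "99% true" expectation.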

     

    Based on what "LWR" means I would say you almost never see node 7 used in a route... If you can find an example of a node using a repeater that no longer exists, and does that "a lot" (like: almost every time it sends that) then I would like to investigate that particular case.

     

    1 hour ago, Tony270570 said:

    I believe that after an exclusion a remesh should be done and the device removed from the list of neighbours of the devices using it?

    Ideally, yes, but I would recommend a manual, partial or full refresh, doing the devices "one by one". But that gets tedious if you have > 20 nodes ;) I hate the "full remesh" that is in the menu of the HC because you cannot keep track of failed and sleeping nodes in an easy way. Maybe you can start Zniffer, launch a "full mesh update" and after 24 h (most battery devices will have passed their wake up interval) filter on "Assign Return Route" and check if every node gets a few hints. Sorry for thinking out loud, I don't do full mesh updates often because I don't have many changes, and I am not sure what the best way is...

     

    BTW I am preparing a post on dissecting the mesh update of a non-Z-Wave Plus FGK-101 D/W Sensor and that one will only start an update if it wakes up "by interval". No amount of clicking could trigger the update, I'll explain why in that post... Quite a few users have posted "I cannot start a mesh update on a device of type X" and this warrants more investigation :D

     


    Thank you @petergebruers for the additional information you provided to me/us ! 

    36 minutes ago, petergebruers said:

    No amount of clicking could trigger the update,

    OMG, what kind of secrets are you going to deliver!! We never discussed that one, I'm already expecting some surprising info!

    THX Peter


    • 2 years later...

    Thank you @petergebruers for the very detailed tutorial. It has helped me tremendously to better understand zWave network traffic and to improve my understanding of Zniffer captures. While making zWave captures to try and resolve a random 503 error on my HC2 I noticed a sudden burst of "WakeUp beam" packets. I did not initiate any mesh update when this process started and my questions are:

    1. What triggers or initiates these "WakeUp beam" packets?
    2. Why would these WakeUp packets send requests to non-existent nodeID's?

     

    See screen capture below of filtered Zniffer output. I also attached the actual Zniffer file and NodeList detail for reference.

    Please login or register to see this spoiler.

     

     

     


    Edited by nicopret

  • Topic Author
  • Sorry for the delay, I have not been available for a while. I will have a thorough look at it next week.


    13 hours ago, petergebruers said:

    Sorry for the delay, I have not been available for a while. I will have a thorough look at it next week.

    No problem @petergebruers, much appreciated. After reading more about WakeUp Beam on

    Please login or register to see this link.

    , I now understand it is related to FLiRS. I still don't understand why my zWave network went "crazy", but at least I know that the "WakeUp beam" traffic may have been related to the DanaLock that I have installed.

     

    Edited by nicopret

  • Topic Author
  • "drzwave" is an excellent resource and FLiRS is indeed a type of device that needs the BEAM to wake up from sleep. Simply put, a FLiRS battery device listens every 1 second for about 1-10 ms to "hear" a continuous repetition of a specific Z-Wave packet. This means its receiver is on for about 1% or less of the time compared to a "normal device", yet it can still be woken up, which makes it possible to build relatively responsive battery powered actuators like a lock. This is not the full story though.
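
The arithmetic behind that "1%" is simple enough to write down. A small Python sketch, where the 30 mA receive current is an assumed round figure for an always-listening device and the sleep current is ignored, so treat the result as a rough order of magnitude only:

```python
def duty_cycle(listen_ms: float, period_ms: float = 1000.0) -> float:
    """Fraction of time the receiver is switched on."""
    return listen_ms / period_ms


# Assumption for illustration: receive current of an always-listening device.
RX_CURRENT_MA = 30.0

for listen_ms in (1.0, 10.0):
    dc = duty_cycle(listen_ms)
    print(f"{listen_ms:4.0f} ms per second: {dc:.1%} duty cycle, "
          f"about {RX_CURRENT_MA * dc:.2f} mA average receive current")
```

So a listening window of 1 to 10 ms per second means the receiver is on between 0.1% and 1% of the time.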

     

    Because the BEAM is continuous (to be pedantic: there is a newer discontinuous beam for newer devices) and lasts up to 1.1 seconds, it halts your network communication while it is active. Zniffer often has trouble detecting the exact start and stop, and in case of a collapse you might also have broadcast explorer frames, which makes it hard to diagnose. If you build a small network with a FLiRS device and a few nodes, you will more likely understand what I am saying.

     

    I can confirm that BEAMs are also produced when certain older devices seem to think that the destination, probably your controller (with Node ID 1), needs a BEAM to wake up. The last device in a routed packet will then send a BEAM to node ID 1. Your controller is not battery powered, so that is an unnecessary thing to do. I cannot find the exact reason, but Z-Wave did support battery powered controllers so it might be in certain older firmware versions (pre-Z-Wave Plus). Or it is a bug. That is pure speculation though.

     

     


    • 2 weeks later...
  • Topic Author
  • I am going to address the question of @nicopret about "what is all the traffic " and "what about the BEAMing" in this network.

     

    First a warning for people stumbling on this post: I really have to repeat, if you have not studied the details of Z-Wave, your assumptions about Z-Wave routing are very likely wrong. I can point you to some excellent posts.

     

    In this case, you can really forget about that and focus on something you have heard a few times: "Z-Wave uses mesh routing" and you have to make sure "the mesh is up to date". 

     

    This "update" is what we are seeing here.

     

    The TL;DR is: a mesh update aka "neighbor update" is important and should be done if your topology changes, but keep in mind it greatly taxes your network.

     

    Also, I recommend doing devices one by one. If you select "all" then sure enough it will work but (a) you won't see progress and (b) your battery devices wake up randomly and start doing a neighbor update which might take hours...

     

    On 2/7/2023 at 2:39 PM, nicopret said:

    I did not initiate any mesh update when this process started and my questions are:

    Well, that is intriguing. As far as I can tell, node 69 is not a sleeping device and yet this is what I see in your capture:

     

    I have filtered the start of this process to show you what you will typically find in Zniffer:

     

    [Image: 928109277_NeighborScan9k6.png]

     

    NOP power is like a "ping" but with reduced power. The idea of a mesh update is to find strong neighbors.

    The next thing you will notice is the increasing node ID.

    Also worth noting: the first packets are really low speed, 9k6. For compatibility reasons, Z-Wave devices have to check up to 3 speeds:

     

    • 9k6
    • 40k
    • 100k on a different frequency

    There is a fourth type of Z-Wave communication but it does not use a mesh

    • 100k Long Range - US only on different frequency

    I expect a 40k scan to start after this.

     

    So let's prove that this is a "neighbor update" by filtering on the keyword "Range". We find packet 470

     

    Packet 407:

    Find Nodes In Range

     

    Please login or register to see this code.

     

    The sending node is 1 and it is sending a request to node 69

     

    Note that the controller is the "accountant" of all routing info, so it tells node 69 which nodes might be neighbors, and thus which nodes are not available and should not be scanned. This saves time.

     

    On 2/7/2023 at 2:39 PM, nicopret said:

    Why would these WakeUp packets send requests to non-existent nodeID's?

    We are not at the BEAM part yet but it is worth checking the NodeIDs listed above because that info comes straight from the controller and its non volatile memory.

     

    The algorithm will now start "pinging" the nodes, carefully noting the responses. Here is a screenshot of the start of the 40k scan

     

    [Image: 1082200406_NeighborScan40kstart.png]

     

    And this is the last part

     

    [Image: 2041138994_NeighborScan40kend.png]

     

    As you may notice, not all nodes in the controller's request are here. The algorithm has some knowledge about which nodes to scan at what speed (possibly also based on prior knowledge, but that is speculation - the algorithms are secret).

     

    Now comes the intriguing bit, which looks like your Z-Wave network is falling apart but it is actually quite OK. It does take bandwidth and can definitely be noticed while this NOP and BEAM thing is happening.

     

    [Image: 774191506_FLiRSscan.png]

     

    Again, you will see an increasing "Dst" number but you may be surprised to see no Src Node ID. It is shown as 0 and that is not a valid number. This is correct because a BEAM is a special packet, to wake up FLiRS devices.

     

    Z-Wave, at the moment, defines 3 types of devices

    • Sleeping devices. Probably battery devices, they wake up at the pre-set "wake up interval" and when they are sleeping, they cannot respond because their receiver has been turned off. Cannot act as a repeater.
    • Always on devices. Unlikely to be battery powered because they need 30 mA (probably a bit less if it is a ZW700 or ZW800). Can always respond and can act as a repeater.
    • Frequently Listening Routing Slaves (FLiRS). Probably battery devices, they wake up for a few milliseconds every second, and during that short interval they can be woken by a special BEAM packet. They have no "wake up interval" and they cannot act as a repeater. Yeah, I am not wrong about that, the R in FLiRS stands for "routing" but as I said, Z-Wave routing does not work like internet routing. Z-Wave is source routed.

    Although it is not specifically mentioned in the docs, it makes sense that these FLiRS are somewhat of an exception in a neighbor scan. You cannot send a NOP to them, without first sending a BEAM.

     

    A beam takes 1100 ms so this again adds time to the "neighbor update"

     

    Depending on the size of the network, this process can take tens of seconds
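
A back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes one beam per probed node ID, as the increasing "Dst" numbers in the capture suggest; the NOP duration is a made-up placeholder, and retries and other traffic are ignored, so this is a lower bound at best.

```python
BEAM_SECONDS = 1.1  # beam duration quoted above


def flirs_scan_seconds(flirs_targets: int, nop_seconds: float = 0.05) -> float:
    """Rough airtime for probing FLiRS targets: one beam plus one NOP each.

    nop_seconds is a hypothetical placeholder, not a measured value.
    """
    return flirs_targets * (BEAM_SECONDS + nop_seconds)


for n in (5, 20, 40):
    print(n, "targets:", round(flirs_scan_seconds(n), 1), "s")
```

With a few dozen candidate node IDs you are well into "tens of seconds" during which the network is mostly busy with beams.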

     

    At the end we get this:

     

    Please login or register to see this attachment.

     

    Please login or register to see this code.

     

     

    At last, we know that "22, 81, 87, 89, 133, 134" are all good neighbors of node 69
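
For the curious, this is what such a neighbor list looks like when packed into a bitmask. It is only a sketch: I am assuming the report uses the same layout as the "Find Nodes In Range" request (node 1 is the lowest bit of the first byte), and whether a real report trims trailing zero bytes or always sends a fixed length is not something this example tries to answer. The function name is made up.

```python
def bitmask_from_nodes(nodes: list[int]) -> bytes:
    """Pack node IDs into a bitmask; node 1 is bit 0 of byte 0."""
    if not nodes:
        return b""
    mask = bytearray((max(nodes) - 1) // 8 + 1)
    for node in nodes:
        mask[(node - 1) // 8] |= 1 << ((node - 1) % 8)
    return bytes(mask)


neighbors_of_69 = [22, 81, 87, 89, 133, 134]
mask = bitmask_from_nodes(neighbors_of_69)
print(len(mask), "bytes:", mask.hex(" "))
```

Only four of the bytes are non-zero here, which is why the hex dump of a range report for a node with few neighbors looks mostly empty.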

     

    The controller stores this information in non volatile memory and uses it to select "good" routes

     

    This is near the end of the capture and I do not see the controller sending routes to the device. The algorithm on the controller can definitely take more than a second to add and recalculate routes, that is probably what is happening here.

     

    Here is what it should look like

     

    Please login or register to see this code.

     

    On 2/7/2023 at 2:39 PM, nicopret said:

    Why would these WakeUp packets send requests to non-existent nodeID's?

    I am not sure, but I want to speculate. Z-Wave is an evolving standard and the protocol changes but has to work with older and future devices. For example, devices that only do 9k6 are really rare, I think they must be over 12 years old. Yet, some packets use 9k6 because, well, we cannot undo the past.

     

    I think likewise adding FLiRS that acted like half-mains-half-battery left some "quirks" in the possible combinations of "is it a repeater" and "is it in the XYZ category".

     

    It is very possible that different firmware versions have some variation of what you see (eg a HC2 cannot do 100k, only 9.6 and 40 so when people migrate to HC3 which does support 100k then this mesh update looks very different).

     

    What does strike me as odd is that BEAMs are sent to NodeIDs that do not appear on the controller's list. I am thinking out loud now, but there is an interesting twist to the BEAM story: iirc there are two variants and one of them can wake FLiRS nodes of foreign networks (which, apart from taxing batteries, is harmless because the device will go back to sleep when it does not get any data). It is about the Home ID Hash (which is visible in Zniffer).

     

    This is the secret sauce of Z-Wave, you won't get any good official explanation.



     


    Thank you @petergebruers for your very detailed and valuable dissection of the Zniffer capture. BTW, Node ID 69 was a device that I physically moved to a better position (onto a pole to give it a better elevation and clear communication path to the rest of the zWave network) earlier that day. 

     

    Below more detail about NodeID:69

     

    Please login or register to see this spoiler.

     

    The realtime view of the zWave network that Zniffer provides has been very valuable and essential in helping me to optimise my zWave network along with your many valuable tutorials and forum posts. It also made me realise that I should take more care when I decide to add new devices and not to do so too soon after a reboot or backup while the network is still settling.
