Behaviour variation

mtimc · May 22, 2018

Hullo

I'm using a small number (~200) of FGMS-001s in a number of homes to identify movement.

The devices are not behaving as I expect and I'd like to get some guidance on whether my expectations are wrong, the setups are wrong, or the devices are not appropriate for the scenarios that I'm considering. I'd like to scale this up to 40k PIRs, so I'd like to understand what I need to change before I go much further.

The devices are being driven through the OpenZWave stack (wrapped by python-openzwave), so I'm fairly sure that I'm seeing what is really going on. Each network has 3-7 PIRs, and they have been installed with their out-of-the-box configurations.

I'm assuming that motion detection is encoded in messages of type ValueChanged, and COMMAND_CLASS_ALARM with label "Burglar", with 0 meaning 'no motion' and anything else meaning 'motion detected'. I think that the behaviour is encapsulated in a finite state machine with two states (motion_detected [md], and no_motion_detected [nmd]), with the CC_ALARM/Burglar messages spat out when a state is entered. md is entered if the PIR is triggered by a motion (subsequent PIR identification of motion while in this state should not re-emit the message), entering md resets a timeout timer with a value of 30s. nmd is entered when the timer times out.
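The two-state model described above can be sketched as follows. This is only an illustration of the assumed behaviour, not the FGMS-001 firmware: the names `PirModel`, `motion` and `tick` are hypothetical, the value 8 is taken from the logs further down, and the sketch assumes that every detected motion restarts the 30 s timer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MOTION = 8      # assumed non-zero "Burglar" value, as seen in the logs
NO_MOTION = 0   # "Burglar" value meaning no motion


@dataclass
class PirModel:
    """Hypothetical two-state model: md (motion detected) / nmd (no motion)."""
    timeout_s: float = 30.0
    in_md: bool = False
    deadline: Optional[float] = None
    emitted: List[Tuple[float, int]] = field(default_factory=list)

    def motion(self, t: float) -> None:
        """The PIR element sees motion at time t (seconds)."""
        self.tick(t)                            # expire a pending timeout first
        if not self.in_md:
            self.in_md = True
            self.emitted.append((t, MOTION))    # message only on state entry
        self.deadline = t + self.timeout_s      # assumption: motion restarts the timer

    def tick(self, t: float) -> None:
        """Advance the clock to time t; enter nmd if the timer has expired."""
        if self.in_md and self.deadline is not None and t >= self.deadline:
            self.in_md = False
            self.emitted.append((self.deadline, NO_MOTION))
            self.deadline = None


pir = PirModel()
for t in (0.0, 5.0, 20.0, 70.0):    # made-up motion times, in seconds
    pir.motion(t)
pir.tick(200.0)
print(pir.emitted)
```

Under this model the message stream for one sensor strictly alternates between non-zero and 0, which is the expectation the observations below are compared against.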

The ZWave networks have been kept small to minimise network errors, although it is still possible that some messages get lost, as ZWave uses collision avoidance rather than collision detection with correction.

What I see on the ZWave network includes several anomalies:

- multiple messages of the same type/value within 1.5 seconds. Are these just multiple sends of the same message and a result of the ZWave collision avoidance protocol? Presumably they can be safely consolidated into a single event by the receiving node?

- multiple messages of the same type/value over longer periods of time: both 0 followed by another many minutes or hours later, and [not 0] also followed by another.

When I set up multiple PIRs next to each other, so that each should detect the same motion events, I can see the above behaviours, mixed in with missing events (1 or 2 PIRs spot movement, but the other(s) do not).

Reading the various home automation forums, it looks like these devices sometimes do not trigger events as expected. However, in those cases, they are being used to trigger actuators (eg a light switch) and a missed event is worked around by continuing to move in the field of view of the PIR. Clearly, that's not how the devices should work in the context of spotting a burglar.

Can anyone point me at any obvious blunders that I've made, and/or documentation on the expected behaviours of message emission, and false positive / false negative rates for these devices - I would like to understand whether I have a few rogue PIRs or a design issue.

tia

Tim

enbemokel · May 23, 2018

Oops, to my mind this is a bad start in this forum:

>Let's hope that someone with more experience of using the protocols does read and respond.

Peter wrote about the advantages of z-wave protocol and if you can get rid of your "single point of control" you may re-think your position.

I attach a screenshot from one wake-up of a device that cannot reach its controller. You can see the different routes that it tries to use.

Normally, when it is in the network, there are only 1-2 messages, depending on the information that is sent after wake-up.

Here you can see a burst within 2 seconds.

I assume that you have problems because you only use battery devices without meshing. Maybe you can give us a picture of your network.


Edited May 23, 2018 by enbemokel

petergebruers · May 22, 2018

I think you need a Z-Sniffer to understand what is going on.

As you have mentioned, repeated messages due to collisions, mesh network issues, interference are all possible causes of missed and duplicate events.

Is this in a "lab" environment, or a real environment?

mtimc · May 22, 2018

Thanks for the pointer. I'm in the process of trialling the Installer kit, which, I believe, is similar to Z-Sniffer - pls correct me if I'm wrong.

There is no mesh network involved here - mesh topologies are not a good design for long running, slowly changing environments as they break from time to time and breakages are hard to spot. They work well in battlefields.

How many/few such collisions are reasonable, and is there any information on the failure rate arising from such collisions? The numbers do not look to me as if they arise from random noise: for example, multiple 'on' messages with no intervening 'off' messages (or vice versa), caused by lost messages, would imply at least some level of activity on the network, but it looks quite calm.

Is there a canonical definition of how PIRs should behave?

This is a real environment.

petergebruers · May 22, 2018

1 minute ago, mtimc said:

which, I believe is similar to Z-Sniffer - pls correct me if I'm wrong.

If I understand the explanation from @tinman correctly, the installer kit is limited to "your own network" which is not relevant in your case. The "Zniffer" is truly "promiscuous". He also mentioned that this limitation will disappear when they release new software for the installer kit.

4 minutes ago, mtimc said:

There is no mesh network involved here - mesh topologies are not a good design for long running, slowly changing environments as they break from time to time and breakages are hard to spot.

I probably do not understand what you are saying. Mesh network is at the core of Z-Wave networking and it is one of their strong points. Z-Wave is source routed and a packet can take 4 hops. If everything is Z-Wave Plus, the mesh routing can repair itself (little white lie). Also, Z-Wave defines a "portable controller" as a device with complete knowledge of the routing table so it can be used anywhere in larger networks....


7 minutes ago, mtimc said:

How many/few such collisions are reasonable and is there any information on the failure rate arising from such collisions?

Hard to say... Until 2016 Z-Wave was so closed that it was not possible to find much relevant information. You had to pay several thousand dollars to buy an SDK, get the docs and sign an NDA, so nobody talked about such things. In 2018 Silabs released the SDK; it is free now.

So enthusiast end-users like me are still reading the docs

I think previous owners of an SDK are no longer bound by the NDA and several users on this forum have experience with development of Z-Wave devices. So I hope they read this...

With my limited experience, I do want to say something regarding performance: if your background is WiFi or ethernet and TCP/IP you can forget about most things you know... It is slower and much more limited on many levels. The protocol layer of Z-Wave is a published specification; if you like to read normative documents, I can dig up the correct reference.

mtimc · May 22, 2018

45 minutes ago, petergebruers said:

If I understand the explanation from @tinman correctly, the installer kit is limited to "your own network" which is not relevant in your case. The "Zniffer" is truly "promiscuous". He also mentioned that this limitation will disappear when they release new software for the installer kit.

That's good.

Mesh is ok if the network is observed and has an in-house IT department: if it breaks, I'm aware and can fix it. However, for the mass market, there are several usecases that require the whole system to work unobserved (eg identify an intruder). You can use the meshing, but it makes life harder as there is no single node that understands everything that is going on, so you lose a single point of control.

Let's hope that someone with more experience of using the protocols does read and respond.

I have read the specs, and a number of the other ~100 HAN/PAN protocols that are available. All have weaknesses, few seem to have learned the lessons from the development of tcp/ip or enterprise systems management.

mtimc · May 23, 2018

Hi Enbemokel

We can debate the system behaviours where there is/is not a single point of control separately. It's fine for observed systems, but it becomes problematic when you need to reason about the known state of a system.

My networks are topologically simple, eg:

[Image: image.png]

I don't have a tapping point at the ZWave network itself, but a simple example encoded as json would be something like this (format from my code, data from the python wrapper):

```

2018-05-21 23:55:33,213:

{ "notificationType": "ValueChanged", "nodeId": 6, "homeId": 4221335583, "valueId": { "nodeId": 6, "id": 72057594144637089, "homeId": 4221335583, "commandClass": "COMMAND_CLASS_ALARM", "index": 10, "instance": 1, "label": "Burglar", "genre": "User", "readOnly": true, "units": "", "type": "Byte", "value": 8 } }

2018-05-21 23:55:33,293:

{ "notificationType": "ValueChanged", "nodeId": 6, "homeId": 4221335583, "valueId": { "nodeId": 6, "id": 72057594144637089, "homeId": 4221335583, "commandClass": "COMMAND_CLASS_ALARM", "index": 10, "instance": 1, "label": "Burglar", "genre": "User", "readOnly": true, "units": "", "type": "Byte", "value": 8 } }

```

That example is two identical messages 0.08 seconds apart. Both of label Burglar, both with a value of 8. I have seen runs of up to 7 such messages, either all 8, or all 0 for the value, within 1.5 seconds. At the same time, there will be similar messages with a value of 0, and no corresponding message with a value of 8. And vice versa. Usually, there are no messages separated by >1.5 seconds and < 29.5 seconds, which, I think, is how the PIRs are supposed to behave. However, such spacings are not unknown.

I can guess that multiple messages in quick succession should be combined, but are there any limits? What about apparently missing messages?
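One way to combine such bursts on the receiving side is sketched below. It is a heuristic under stated assumptions, not something OpenZWave provides: `parse_ts` and `consolidate` are hypothetical helper names, the 1.5 s window is simply the burst length observed above, and the third sample event is made up to show a state change.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

Event = Tuple[datetime, int, int]   # (timestamp, nodeId, value)


def parse_ts(text: str) -> datetime:
    """Parse a log timestamp such as '2018-05-21 23:55:33,213:'."""
    return datetime.strptime(text.rstrip(":"), "%Y-%m-%d %H:%M:%S,%f")


def consolidate(events: Iterable[Event], window_s: float = 1.5) -> List[Event]:
    """Keep the first message of each burst; drop repeats of the same value
    from the same node that arrive within window_s of the burst start.
    Events must already be sorted by time."""
    kept: List[Event] = []
    burst: Dict[int, Tuple[datetime, int]] = {}   # nodeId -> (burst start, value)
    for ts, node, value in events:
        prev = burst.get(node)
        if prev is not None:
            start, prev_value = prev
            if value == prev_value and (ts - start).total_seconds() <= window_s:
                continue                          # duplicate within the burst
        burst[node] = (ts, value)
        kept.append((ts, node, value))
    return kept


raw = [
    (parse_ts("2018-05-21 23:55:33,213:"), 6, 8),   # from the log above
    (parse_ts("2018-05-21 23:55:33,293:"), 6, 8),   # from the log above
    (parse_ts("2018-05-21 23:56:03,300:"), 6, 0),   # made-up follow-up event
]
for event in consolidate(raw):
    print(event)
```

A value change inside the window is always kept, so a rapid 8/0/8 sequence is not collapsed. Whether 1.5 s is a safe upper bound for a burst is exactly the open question here.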

What does the data mean in your example where there are 12 messages to node 22 in quick succession? (I suspect that they are ZWave control packets and absorbed by that stack).

mtimc

enbemokel · May 23, 2018

Hi,

these 12 messages are always the same. It is a wake-up from source device 017 to controller 001.

As there is no answer from the controller (it is in another location, as I want to test something), 017 tries all the different routes it has learned. First directly, packets 1-3,

then 4-6 via node 19, then 7-9 via node 22. Then again 3 direct packets to 001. You can also see channels 1 and 0; this also depends on the kbit/s rate.

I just posted this screenshot to show the worst case, when the controller isn't reachable.

I assume that you may have wireless problems. I also had them in my network and had to install some repeaters or power plugs, for example; there is always another device to control.

By the way, I'm also using a Raspberry with a Z-Wave board and Z-Way software. In the expert menu there is a zniffer with history that may be helpful, as you can see the different channels and kbit/s rates.

Another question: if you build a test environment with all devices near the controller, do you also have these 8 packets?

And is it always the label Burglar? I can see that there is also a temperature and a light sensor. (I don't own this device.)

Edited May 23, 2018 by enbemokel

mtimc · May 23, 2018

Hi Enbemokel

Ok. those messages from 017 to 001 are network related and each is unique, I presume.

I have no Z-Way software.

Yes, there are similar duplicates/missing application level packets communicating state changes of the sensor. This is most obviously observed if 3 PIRs are set up next to each other, pointing in the same direction.

There are other sensors within the device, but these just report current levels. I have no idea over what period such measures are integrated (if any).

I cannot find a definitive description of what the devices should do. I can only observe what they do do, which is pretty frustrating.

tc

What caused your network comms troubles?

What are the limiting factors?

I cannot identify the expected and measured reliability of ZWave's CSMA/CA approach.

enbemokel · May 23, 2018

Hi, it's like a chat. Did you check the manual? There you can find all the information and the default configuration. The wake-up default is 7200 sec.

[link]

I'm sure you know it, but I think everything is there.

Maybe you could install openHAB to see a bit more. As you have only one controller and only battery-powered devices, there will be no mesh in your environment.

mtimc · May 23, 2018

Thanks. I've read that, but couldn't find any description of legitimate message sequences and timings. There is a description somewhere in the ZWave standards of a finite state machine for a motion sensor, but it does not identify the details of which messages can be emitted when, so it's not possible to say whether a sequence such as On/On/On or Off/Off/On/On is legitimate or not, nor what it means. If I get an Off with no corresponding prior On (ie the previous message was an Off > 30 seconds ago), does that mean that I can infer that there should have been a corresponding On 30 seconds before? How do I know that I've not missed such isolated Offs?
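If a strict On/Off alternation is assumed (which, as said, no document confirms), the sequences in question can at least be flagged mechanically. A minimal sketch, with a hypothetical helper `find_gaps` and made-up sample times:

```python
from typing import List, Tuple

Event = Tuple[float, int]   # (time in seconds, Burglar value), already consolidated


def find_gaps(events: List[Event]) -> List[Tuple[float, float, str]]:
    """Return (t_prev, t_curr, note) wherever two consecutive messages carry
    the same state, which under strict alternation implies a lost message."""
    gaps = []
    for (t0, v0), (t1, v1) in zip(events, events[1:]):
        if (v0 != 0) == (v1 != 0):
            missing = "off" if v0 != 0 else "on"
            gaps.append((t0, t1, f"expected an '{missing}' in between"))
    return gaps


# Off/Off and On/On pairs, as in the sequences asked about
seq = [(0.0, 8), (30.0, 0), (400.0, 0), (430.0, 8), (460.0, 8)]
for gap in find_gaps(seq):
    print(gap)
```

This only detects an odd number of lost messages between two received ones. An On that is lost together with its Off leaves a perfectly alternating stream, so that case cannot be detected from the message sequence alone.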

tc

petergebruers · May 23, 2018

4 hours ago, enbemokel said:

Peter wrote about the advantages of z-wave protocol and if you can get rid of your "single point of control" you may re-think your position.

Thank you @enbemokel for joining this topic.

I don't see how Z-Wave could function without mesh network, as the OP points out, the only way to have "no mesh" is by "having battery operated devices only" indeed. Please correct me if I am wrong.

1 hour ago, enbemokel said:

these 12 messages are always the same. It is a wake up from source device 017 to controller 001.

I did similar tests a while ago, but unfortunately at that time I did not have the official Z-Sniffer, so I was reluctant to post the results.

Thank you for posting that screenshot.

Like both of you, I am interested to understand protocols at a low level.

As far as I understand it as a mere amateur, and data confirms it, retransmissions can occur at three different levels...

1) At the MAC level, which is hidden by the Z-Wave stack but visible with a sniffer. G.9959 says this about the retransmission algorithm:

A.4.4.1.4.3 Retransmission

A node that sends a singlecast MPDU with its ACK request subfield set to 1 shall wait for a minimum
of aMacMinAckWaitDuration symbols for the corresponding ACK MPDU to be received. If an ACK
MPDU is received within aMacMinAckWaitDuration symbols and contains the correct HomeID and
Src NodeID, the transmission is considered successful, and no further action shall be taken by the
originator. If an ACK MPDU is not received within aMacMinAckWaitDuration symbols the
transmission attempt has failed. The originator shall repeat the process of transmitting the MPDU and
waiting for the ACK MPDU up to aMacMaxFrameRetries times. Before retransmitting the node shall
wait for a random backoff period (see clause A.4.4.1.4.4).
If an ACK MPDU is still not received after aMacMaxFrameRetries retransmissions, the MAC layer
shall assume the transmission has failed and notify the network layer of the failure. This shall be done
via the MD-DATA.confirm primitive with a status of NO_ACK (see clause A.4.1.1.2).

A.4.4.1.4.4 Random backoff

If a singlecast MPDU with its ACK request subfield set to 1 or the corresponding ACK MPDU is lost
or corrupted, the singlecast MPDU shall be retransmitted. The MAC layer collision avoidance
mechanism prevents nodes from retransmitting at the same time. The random delay shall be calculated
as a period in the interval aMacMinRetransmitDelay… aMacMaxRetransmitDelay; Refer to
Table A.48.

That table is rather long but the part that interests you is:

aMacMinAckWaitDuration = depends on speed

aMacMaxFrameRetries = 2

So that basically means, a packet is never sent more than three times...

aMacMinRetransmitDelay = 10 ms

aMacMaxRetransmitDelay = 40 ms

It is possible if A sends to B then B sends ACK to A but if that path is weak, A might try three times, and B might receive the message 3 times.

I kind of expect the Z-Wave stack not to propagate this, because all packets contain the same sequence number, but I cannot find any clear explanation.

If you look at the data of @enbemokel you see those groups of three...
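To make the "three copies" case concrete, here is a toy simulation of the rule quoted above. It is a sketch under assumptions, not a model of the real stack: `send_with_retries` is a hypothetical function, the data frame is assumed to always reach the receiver while only the ACK gets lost, and the backoff is drawn as a uniform integer number of milliseconds.

```python
import random

A_MAC_MAX_FRAME_RETRIES = 2            # values quoted above
A_MAC_MIN_RETRANSMIT_DELAY_MS = 10
A_MAC_MAX_RETRANSMIT_DELAY_MS = 40


def send_with_retries(ack_arrives, rng=random):
    """Simulate one singlecast frame.

    ack_arrives: one boolean per attempt, True if the ACK reaches the sender.
    Returns (copies_seen_by_receiver, delivered_ok, total_backoff_ms).
    """
    copies = 0
    backoff_ms = 0
    max_attempts = 1 + A_MAC_MAX_FRAME_RETRIES
    for attempt in range(max_attempts):
        copies += 1                    # receiver gets the frame every time
        if ack_arrives[attempt]:
            return copies, True, backoff_ms
        if attempt < max_attempts - 1:
            # random backoff before the next retransmission
            backoff_ms += rng.randint(A_MAC_MIN_RETRANSMIT_DELAY_MS,
                                      A_MAC_MAX_RETRANSMIT_DELAY_MS)
    return copies, False, backoff_ms


print(send_with_retries([True]))                  # good link: one copy
print(send_with_retries([False, False, True]))    # weak return path: three copies
print(send_with_retries([False, False, False]))   # sender reports NO_ACK
```

In the last case the sender considers the transmission failed even though, in this toy model, the receiver has seen the frame three times. Whether the receiving stack filters those copies by sequence number is the part I could not confirm.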

2) At the network level. This is described in "INS13954-5 Z-Wave 500 Series Appl. Programmers Guide v6.81.0x" especially "3.4 Z-Wave Routing Principles".

It is too long to summarize and the OP said he is not interested in mesh, so I'll over-simplify... If the previous level failed, the "routing slave" loops over known neighbors. In the case of @enbemokel I see that nodes 22 and 19 are attempted. It is much more complex; there are more details under "3.10 Z-Wave Nodes", several pages long in total.

3) At the "application" level (Z-Wave parlance for the firmware of the device, written by the developer of the device using the SDK).

In document INS12350-14 they say:

A transmitter may time out waiting for an ACK frame after transmitting a Data frame or it may receive a
NAK or a CAN frame. In either case, the transmitter SHOULD retransmit the Data frame. A waiting period
MUST be applied before the retransmission.

T waiting = 100ms + n*1000ms

A host or Z-Wave chip MUST NOT carry out more than 3 retransmissions.

From personal observations, I'd say devices comply with this, which leads to my simplified rule "if a message did not get ACK within about 5 seconds, the message is lost".
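As a quick check of that rule of thumb, the waiting periods from the quoted formula can be added up. The sketch assumes that n starts at 0 and increases by one per retransmission, which the quoted excerpt does not state; `waiting_ms` is a hypothetical helper name.

```python
MAX_RETRANSMISSIONS = 3   # "MUST NOT carry out more than 3 retransmissions"


def waiting_ms(n: int) -> int:
    """Waiting period before a retransmission: T = 100 ms + n * 1000 ms."""
    return 100 + n * 1000


# assumption: n = 0 for the first retransmission, then 1, then 2
waits = [waiting_ms(n) for n in range(MAX_RETRANSMISSIONS)]
print(waits, sum(waits))   # [100, 1100, 2100] 3300
```

That gives 3.3 s of pure waiting. The rest of the "about 5 seconds" would have to come from the time spent waiting for each ACK, which the quoted text does not specify.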

In document "INS13954-5 Z-Wave 500 Series Appl. Programmers Guide v6.81.0x" under "3.5 Z-Wave Application Layer" there is an important remark, which might explain the observations of the OP:

"No precautions can
unfortunately prevent that multiple copies of the same frame are passed to the application. Therefore is it
very important to implement a robust state machine on application level there can handle multiple copies
of the same frame."

[Image: Duplicates.png]

When you think about it... The backoff algorithm does not have many possible values (4 only)... if there is a lot of traffic this leads to problems. This is described in INS13954-5 Figure 4. Simultaneous communication to a number of nodes. Also, if you fire a number of devices at the same time (to within ms resolution I guess) then they are guaranteed to interfere.

I am sorry, I am not very good at deciphering openzwave logs, but I found a similar observation (with no response) here:

[link]

So in conclusion, I think I can add my point of view to the original questions:

On 5/22/2018 at 12:08 PM, mtimc said:

multiple messages of the same type/value within 1.5 seconds. Are these just multiple sends of the same message and a result of the ZWave collision avoidance protocol? Presumably they can be safely consolidated into a single event by the receiving node?

I'd say reception in one direction is not very good, or collisions occur. Your sniffer tool should be able to establish that: collisions and weak signal can lead to CRC errors, so if these show up, you can move your sniffer to see where/when it happens.

I'd say, yes, consolidating the messages should be done by either openzwave (which it clearly does not do, I have no opinion about that) or your application software.

On 5/22/2018 at 12:08 PM, mtimc said:

multiple messages of the same type/value over longer periods of time: both 0 followed by another many minutes or hours later, and [not 0] also followed by another.

It depends on the parameters, as pointed out by @enbemokel

On my HC it works as expected by me: movement triggers the device, as long as there is movement no data is sent. If there is no movement within the "p6 Motion detection - alarm cancellation delay" a "safe" is sent.

2 hours ago, mtimc said:

There are other sensors within the device, but these just report current levels. I have no idea over what period such measures are integrated (if any).

There is no clear specification and from experience I can tell you it is a bit complicated. For example, this is a gotcha with the FGMS-001 - if you put it outside and it's a cloudy day, it'll constantly (say two times per minute) report Lux, which is not very useful but to be expected. Unsolicited reports should have a 30 s minimum interval.

2 hours ago, mtimc said:

I cannot find a definitive description of what the devices should do.

I share your frustration occasionally.

I have never seen how an FGMS-001 should behave.

Also, Z-Wave docs leave much freedom to the designer of the product...

If you enjoy programming, you might like the Z-Uno, both for learning purposes and to install your own algorithm. I built my own light sensor, so I know when it sends updates

10 minutes ago, mtimc said:

How do I know that I've not missed such isolated Offs?

As far as I can tell... even a Z-Sniffer might not see all missed events, but it is a start. On occasion, I have resorted to SDR to "listen" to the Z-Wave frequency and check if there is no interference from rogue devices. You might get an idea by observing the background noise in your installer tool.

One last remark: secure mode increases the data size and number of collisions, so increases the chances of getting into trouble. I do not have firm data, most of my devices are in non-secure mode so I cannot measure...

I hope this gives some direction...

Welcome to Smart Home Forum by FIBARO
