What is “good” and what is “bad” in NiFi. Part 2

Monitoring

We continue the conversation about what can and should be done in NiFi, and what can be done but is better avoided. If you missed the first part of the conversation, it is here. That part was about improving the readability of flows and improving performance (well, almost). Here we will talk about how to monitor the business part of a flow so that everyone is fine (well, or at least not bad), and a little about processor portability. Let's go!

There is an opinion that the worst thing is not to monitor the business part of the flow at all, following the popular approach of "it'll do!". But when you think about it, there is one thing worse than no monitoring, and that is bad monitoring. A typical "temporary crutch": add a disabled/stopped group of processors, route the error stream there, debug… and forget. What happens next? Yes, yes: the queue overflows, backpressure stops the processor, and "Everything is gone, boss!!!". Do not do it this way. If we do it, we do it right. And what is right? Without pretending to a Nobel Prize in the field of monitoring, I will offer some unsolicited advice.

Figure 1. FlowFile Expiration definition

  1. We set the FlowFile Expiration for a “reasonably sufficient” time and get a simple and convenient online debugging mechanism without much risk of stopping everything for good. See Fig. 1.

  2. Speaking of queues: let's set up queue monitoring in the simplest way possible. We export the state of all queues and, if backpressure kicks in on any of them, we go figure out what exactly broke. To ship the data we use NiFi's built-in Site-to-Site reporting task (SiteToSiteStatusReportingTask), send its output back to ourselves through a Remote Process Group, and process it there – see Fig. 2.


Figure 2. Site-to-Site Reporting Task

Note that although :8080/nifi (the UI address) is specified in the Destination URL setting, it is only used to negotiate the connection; the transfer itself goes over port 10000.
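For orientation, a minimal configuration sketch of the reporting task (assuming SiteToSiteStatusReportingTask; the host and the input port name here are placeholders, not values taken from the figures):

    Destination URL:  http://nifi-host:8080/nifi    # UI address, used only for negotiation
    Input Port Name:  queue-monitoring              # a remotely accessible input port on the same instance
    Batch Size:       1000
    Run Schedule:     1 min                         # how often the snapshot of queue states is sent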


Figure 3. Filtering the report with QueryRecord

For parsing we use the QueryRecord processor (no split-evaluate-route!). The result, something like [{"sourceId":"cbffbfdc-017d-1000-9942-e797792a97a9","sourceName":"GenerateFlowFile"}], tells us what broke and where, and that is already something you can work with.
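As an illustration, the QueryRecord query can be of roughly this shape, pulling out only the connections that have hit backpressure (a sketch: the field names assume the connection-status records produced by the status reporting task, so check them against your actual report):

    SELECT sourceId, sourceName
    FROM FLOWFILE
    WHERE componentType = 'Connection'
      AND isBackPressureEnabled = true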

3. In the same way you can set up the SiteToSiteBulletinReportingTask – it will send us EVERY processor error message (remember those red squares in the corner that you never manage to screenshot in time?) together with the flow-file attributes. On loaded (or crooked) instances this generates a fair amount of traffic, but it also gives flow authors more reasons to fix things faster.

But all of this, of course, is "monitoring from a bird's-eye view", i.e. of the entire cluster. What I would also like is error handling at the level of the process group itself. After all, before reporting an error you should try to handle it on the spot. That is, silently dropping the failure branch is still not good, but sounding the alarm straight away is not quite right either. So what is right? See point 4.

4. The same InvokeHTTP, in addition to Failure, provides a separate Retry relationship. This is the so-called "recoverable" class of errors. Of course, if the server returned 400 – a "crooked request" – it will not straighten itself out, but if the answer is 50(0|2|3), then after N seconds (a service restart, a request to another instance, etc.) the result may be completely different. It is easier (but not more correct) to loop the Retry queue back onto the processor itself, but in the case of a failure under load, the result of an endless stream of calls to the service may be far from what you want. This is where the ControlRate processor comes to the rescue – see Fig. 4.


Figure 4. Basic error handling

We wait N seconds and re-send N flow files. To avoid endless looping of crooked requests, we set the FlowFile Expiration parameter on the queue. Minimal overhead + a conditionally decent result. Why "conditionally"? Because if 100,500 crooked requests have piled up, the whole error queue may not get processed in time, given the flow-file lifetime and the delays. That is, with Maximum Rate = 10, Schedule = 10 sec and FlowFile Expiration = 60 sec we will process 60 errors, and everything else will die silently. Of course, these parameters can and should be tuned, but with this approach the risk described above will always be there. Plus, in this scheme we will never find out that the action ultimately ended in an error. If you need stricter handling, see point 5.
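For reference, the settings behind the arithmetic above might look roughly like this (a sketch; the "Schedule = 10 sec" above is mapped here onto ControlRate's Time Duration, but depending on how Fig. 4 is built it could just as well be the processor's Run Schedule):

    Retry queue (connection configuration)
      FlowFile Expiration:   60 sec

    ControlRate (on the retry branch)
      Rate Control Criteria: flowfile count
      Maximum Rate:          10
      Time Duration:         10 sec

With these numbers, at most 10 flow files are released every 10 seconds, so over a 60-second lifetime roughly 60 errors get another attempt; the rest expire.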

Figure 5. More complex version

5. In UpdateAttribute we set a counter of the form ${counter:replaceNull(0):plus(1)}, and in RouteOnAttribute the condition ${counter:ge(5)}. At the output we get a fixed number of retries per flow file, with an alarm generated on final failure. The added complexity is noticeable (and less is better, right?), but given a (normally!) small number of errors it is justified. For a high-volume flow, to avoid backpressure coming from the error-handling branch, it may make sense to increase the size of the corresponding queues.
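Put together, point 5 boils down to roughly these settings (a sketch; the attribute name counter and the limit of 5 are the values used above, the route name give_up is hypothetical):

    UpdateAttribute (on the retry branch)
      counter:  ${counter:replaceNull(0):plus(1)}

    RouteOnAttribute
      Routing Strategy:  Route to Property name
      give_up:           ${counter:ge(5)}     # this route goes to the alarm (LogMessage/PutEmail)
      unmatched          -> back to InvokeHTTP for another attempt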

Ah, yes. Important: 100% error-handling coverage is generally not required anywhere. There is no point in doing 10 retries at 10-second intervals if the user is only waiting for a response to their request for the default timeout of 60 seconds 😊.

6. Another important mechanism is checking that the flow is working when there is no explicit error. Situations like "there are no problems, but there is no data either" are not rare. Sometimes an indirect sign of a hung process is a growing queue, but if, say, the ConsumeKafka at the entrance to a group hangs, things turn out not good at all. The MonitorActivity processor comes to the rescue here. It can generate a message if no data arrives for longer than the Threshold Duration. We attach MonitorActivity to the ConsumeKafka output, in parallel with the main branch, with a threshold of 15 minutes; on its output we put LogMessage/PutEmail, roughly as in Fig. 6, … and enjoy the result.
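As a sketch, the MonitorActivity part of this scheme comes down to something like the following (property and relationship names are those of the standard processor; the messages are arbitrary):

    MonitorActivity (fed from the ConsumeKafka output, parallel to the main branch)
      Threshold Duration:         15 min
      Continually Send Messages:  false

    Relationships
      inactive           -> LogMessage / PutEmail   ("no data for 15 minutes")
      activity.restored  -> LogMessage              ("data is flowing again")
      success            -> auto-terminate          (the data itself goes down the main branch)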


Figure 6. Checking the arrival of data

It is clear that the actual thresholds are dictated by your data flow, and false positives are inevitable here. But it is definitely better than nothing. As an alternative, you can implement heartbeat probing of the service of interest. At the same time, you need to understand that "the service is up" and "you have data" are not the same thing. Ideally, combine the approaches and, for key flows, configure both, roughly as in Fig. 7.


Figure 7. Heartbeat service check

We send test requests on a schedule. In case of isolated problems we write messages to the log; if the problem "did not resolve itself", we use the same MonitorActivity to send a message to the administrator. Strictly speaking, on the ZIIoT platform this should be done not by NiFi but by Platform Monitoring. But, for example, for customer systems, or when you need monitoring at a higher level (making a real request and checking not the presence of an answer but its correctness), it is quite an option. When implementing it, keep in mind that this method is somewhat "invasive", and with crooked hands (for example, forgetting to set a Run Schedule on GenerateFlowFile) you can break anything.
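A rough sketch of such a heartbeat chain, assuming it is built from standard processors (the schedule, URL and thresholds are placeholders, not values from Fig. 7):

    GenerateFlowFile
      Run Schedule:  5 min              # the "invasive" part – do not leave it at 0 sec
        -> InvokeHTTP (a hypothetical health endpoint of the service of interest)
             Response                    -> MonitorActivity
             Failure / Retry / No Retry  -> LogMessage    # isolated problems just get logged

    MonitorActivity
      Threshold Duration:  30 min       # no successful response for this long
      inactive -> PutEmail              # escalate to the administrator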

7. So, at this point we monitor the state of the entire instance and handle the errors that are worth handling. It remains to signal that processing has failed. Here we have two standard tools – LogMessage and LogAttribute. The first simply writes a message like "I fell!" to the log, with a configurable log level and log prefix, which makes the logs very convenient to grep. The second can additionally dump the contents of the flow file and a set of attributes selected by a regular expression into the log. Use it with caution – there is a risk of spamming the log to smithereens.
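Typical settings might look roughly like this (a sketch; the prefix, message and attribute pattern are arbitrary examples, and filename is a standard core attribute):

    LogMessage
      Log Level:    warn
      Log prefix:   MY_FLOW_ERROR                  # a marker to grep by
      Log message:  processing of ${filename} failed

    LogAttribute
      Log Level:                                warn
      Log Payload:                              false    # true dumps the whole FlowFile content into the log
      Attributes to Log by Regular Expression:  .*error.*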

In addition to these built-in means of writing to the log for later analysis, at key points it makes sense to explicitly signal errors to the flow administrator, for example with the PutEmail processor. From experience, nobody needs 100,500 emails about 100,500 identical errors, so ControlRate + FlowFile Expiration will help here too.

By the way, in addition to the NiFi tools, the ZIIoT platform has one more information channel – the built-in zif-events event mechanism. If anyone is interested, we will tell you more about it. For now, I am off to write the third part of the NiFi saga, in which I will share my thoughts on how to ensure processor portability and walk through several template solutions for popular tasks on the ZIIoT platform.
