Nodes retrying many times (infinite loop?)

Discussion on developing CloverETL engine, transformation components etc.

jfuentesve
Posts: 5
Joined: Sat Jan 26, 2013 6:00 am

Postby jfuentesve » Tue Jan 29, 2013 6:08 pm

Hello, we are implementing a set of validation tests for our Clover projects, and we ran into a problem: when an external service is down, the nodes that interact with it get stuck retrying. It makes sense to wait a little and retry until the service is available again, but stopping all dependent processes for more than an hour is not acceptable. We want the graph and the phase to fail and report the issue after a defined period of time or number of retries (e.g. no more than 5 minutes, or 5 retries with a timeout of 60 seconds each).


Is that possible? Is there a property for this in the Node XML or the abstract Node class, or can it be set somewhere else in the graph?


Thank you!

kubosj
Posts: 372
Joined: Thu Jan 12, 2012 9:10 am

Re: Nodes retrying many times (infinite loop?)

Postby kubosj » Wed Jan 30, 2013 11:00 am

Hi jfuentesve,

There is no such mechanism at the phase/graph level. This should be handled by the components themselves: a component should fail when it detects a service timeout.

What are the problematic components?
Jaroslav Kubos
CloverCARE Support
CloverETL | Rapid Data Integration

Visit us online at http://www.cloveretl.com

jfuentesve
Posts: 5
Joined: Sat Jan 26, 2013 6:00 am

Re: Nodes retrying many times (infinite loop?)

Postby jfuentesve » Wed Jan 30, 2013 5:00 pm

Hello kubosj, thanks for the response.


The component in this case is a custom DataService writer, governed by a class that we wrote and that is passed some arguments. This class extends Node, like this:


MetadataServiceWriter extends org.jetel.graph.Node


Looking at the attributes and setters inherited from Node, though, I can't find a way to tell it how to retry. I can see how to deal with errors (there is a threshold for the maximum number of errors), but this case is not an error: it is just a timeout, and a timeout is not treated as an error.

The request times out after a minute, which is fine. The problem is that the node then retries it. That is acceptable for a while, since sometimes the service is just too busy. But when the service is down or stuck (alive but a zombie, or paused by a debugger, for instance), a test keeps retrying forever and never fails. It blocks the whole code validation process, and by the time someone finds out in the morning it may have been running for hours, with everyone blocked because of it. Someone then kills the task (the graph) manually, and when we explore the logs we find that a DS Writer was stuck retrying after every timeout. It NEVER fails :S

We want it to fail, after a certain number of retries... is that possible?
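
A minimal sketch of the behaviour asked for here, in plain Java with no CloverETL classes involved. The names `BoundedRetry` and `RetriesExhaustedException` are hypothetical, and the sketch assumes the service call reports a timeout by throwing an exception. It gives up after a fixed number of attempts or once an overall time limit has passed, whichever comes first, and then throws so that the caller can fail.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of CloverETL): retries a call a bounded number of times. */
final class BoundedRetry {

    /** Thrown when all attempts failed or the overall time limit was reached. */
    static final class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final int maxAttempts;
    private final Duration maxTotalTime;
    private final Duration pauseBetweenAttempts;

    BoundedRetry(int maxAttempts, Duration maxTotalTime, Duration pauseBetweenAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.maxTotalTime = maxTotalTime;
        this.pauseBetweenAttempts = pauseBetweenAttempts;
    }

    /** Runs the call until it succeeds, the attempts run out, or the time limit passes. */
    <T> T call(Callable<T> serviceCall) throws RetriesExhaustedException, InterruptedException {
        long deadline = System.nanoTime() + maxTotalTime.toNanos();
        Exception lastFailure = null;
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                return serviceCall.call();
            } catch (InterruptedException e) {
                throw e; // do not swallow a request to stop
            } catch (Exception e) {
                lastFailure = e; // a timeout counts as one failed attempt
            }
            boolean timeLimitReached = System.nanoTime() - deadline >= 0;
            if (attempts == maxAttempts || timeLimitReached) {
                break;
            }
            Thread.sleep(pauseBetweenAttempts.toMillis());
        }
        throw new RetriesExhaustedException(
                "Service call failed after " + attempts + " attempt(s)", lastFailure);
    }
}
```

With `new BoundedRetry(5, Duration.ofMinutes(5), Duration.ofSeconds(1))`, a call that keeps timing out ends in a `RetriesExhaustedException` instead of looping forever. Whether letting that exception escape is enough to fail the graph depends on how the custom node handles exceptions, which this sketch does not cover.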

kubosj
Posts: 372
Joined: Thu Jan 12, 2012 9:10 am

Re: Nodes retrying many times (infinite loop?)

Postby kubosj » Thu Jan 31, 2013 9:58 am

Hi,

This is how I understand your description:
* you wrote your own component that calls an external service
** on the input port it receives many records containing the parameters for the service call
** on the output port it produces the call result
* the external service may be unresponsive
* your component retries the service call many times, which causes problems when the service is down and there are many input records

Here are some thoughts:

1] You can define custom properties on your custom component, for example:
* "service call timeout" - allows changing the default 1-minute timeout
* "retry count" - how many times the component should retry the service call for one input record
* "ignore after N fails" - the number of records with a failed call after which the component should ignore the rest

2] You can set component properties in the graph from outside:
* use ${PARAMETER} in the property value
* pass this PARAMETER to the graph from outside, depending on the production/test environment
* use this for the properties in 1]
* see http://doc.cloveretl.com/documentation/ ... ments.html and the "-P" parameter

3] React to the custom properties in your component:
* a component is done when it has read all its input and exits the execute() method
* inside execute() you can react to the properties from 1] and 2]
** e.g. when "ignore after N fails" is exceeded, just read the input records and send output records (without calling the service)
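
The "ignore after N fails" idea from 1] and 3] could be sketched roughly as follows, in plain Java without any CloverETL API. The class `FailureBudgetProcessor`, its constructor arguments, and the "skipped" marker value are hypothetical names for illustration. The sketch assumes the service call signals a timeout by throwing a RuntimeException, and that the threshold value arrives from a component property as described in 2].

```java
import java.util.function.Function;

/** Hypothetical sketch (not CloverETL API): stops calling the service after N failed records. */
final class FailureBudgetProcessor<I, O> {

    private final Function<I, O> serviceCall; // assumed to throw RuntimeException on timeout
    private final O skippedMarker;            // output written when no result is available
    private final int ignoreAfterFails;       // the "ignore after N fails" property
    private int failedRecords = 0;

    FailureBudgetProcessor(Function<I, O> serviceCall, O skippedMarker, int ignoreAfterFails) {
        this.serviceCall = serviceCall;
        this.skippedMarker = skippedMarker;
        this.ignoreAfterFails = ignoreAfterFails;
    }

    /** Handles one input record; once the budget is spent, the service is no longer called. */
    O process(I record) {
        if (failedRecords >= ignoreAfterFails) {
            return skippedMarker; // still consume input and produce output, but skip the call
        }
        try {
            return serviceCall.apply(record);
        } catch (RuntimeException e) {
            failedRecords++; // this record failed; count it against the budget
            return skippedMarker;
        }
    }

    int failedRecords() {
        return failedRecords;
    }
}
```

Inside execute(), each record read from the input would go through `process(...)` before being written to the output. To make the graph fail rather than pass records through, the first branch could throw an exception instead of returning the marker.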

In general, our components solve this internally, because a general concept would be confusing and not powerful enough. See e.g. http://doc.cloveretl.com/documentation/ ... table.html and its "Max error count" property.

I hope this helps.
Jaroslav Kubos

