Thank you for such a detailed report. I think max_graph_instance_age will not help you.max_graph_instance_age = Time interval in ms which specifies how long may transformation instance last in server's cache. 0 means that transformation is initialized and released for each execution. Transformation cannot be stored in the pool and reused in some cases (transformation uses placeholders using dynamically specified parameters)
We have seen similar issues a few times already. A graph usually got stuck after some issues with DB connections. Either the DB connection was lost or a timeout appeared or a firewall blocked the connection or the DB server could not handle another connection. CloverETL graphs open a network socket on the OS level when they want to communicate with a DB. And when some query takes too much time, it is sometimes hard to tell whether it is because the query is just too complex or there is an issue with the connection itself. OS therefore not always closes the connection but waits for any response instead. And when the connection remains open, we can not kill the graph.
Third party services like Amazon or NetSuite are quite often the source of these troubles. I do not know whether they have some artificial limit for overall number of connections you can use at once or they are just unreliable sometimes so it is hard to tell what could help you in this situation. I would definitely try at least the following two changes:
1) Change your standard JDBC connection to JNDI if you have not done it yet. In JNDI, you can set parameters like limit for active connections, validation query, validation interval etc. which could help you avoid the error in some situations. Here is a documentation page describing the configuration process for our most common application server - Tomcat
2) Change keepalive settings
on the machine running CloverETL Server. But be very careful. If you set it incorrectly, you can affect many other things starting with errors appearing during too complex queries and ending with unability to connect anywhere. The default value of tcp_keepalive_time is 7200 (=2 hours), I would try to set it to for example 600 (=10 minutes) and restart the machine. If 10 minutes are not suitable, try other values based on duration of your DB queries. This should ensure more frequent keepalive messages between OS and DB which should then help with quicker recognition of any network troubles.
Beside that, some databases like MySQL or PostgreSQL support autoReconnect=true in JDBC URL. I see you use Redshift, maybe it has some similar setting.
And by the way, when you mentioned the status of the graph never changes to red, did you try to click on the Refresh button above the execution hierarchy?
Let me know with the results, please.