OK, that's what I thought about the file getting written out and executed separately. Thanks.
What's happening is a little odd. At a high level, here is the process...
1) A jobflow is executed which reads a file containing about 500 input parameters. The jobflow calls a single graph for each parameter, which gets passed to the system execute component, which executes a py script.
2) When we run this job - roughly every hour - it works fine for a while. Eventually we get a "connection refused" error, which is raised inside the py script. (See the sketch after this list for the kind of connection handling I'd double-check there.)
3) We've tried running synchronously and asynchronously and it doesn't make a difference.
4) The architecture is Linux OS, CloverETL Server, and WebLogic.
5) Restarting the Clover service fixes everything, so my guess is stuck threads, which would be consistent with the "connection refused" error.
6) When I've looked through the WebLogic logs, the Clover logs, and the output of several Linux commands (used/available threads, open files, etc.), nothing really jumps out. There appear to be more than enough available threads, file handles, and so on. That said, it's definitely a networking issue, and a restart of the Clover service (Clover and WebLogic, I should say) fixes it.
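Since the error surfaces in the py script, one thing worth double-checking is whether the script closes its connection on every path, including error paths. A minimal sketch of what I mean - the host, port, and retry settings here are made up for illustration, not from your setup:

    import socket
    import time

    # Hypothetical endpoint and retry settings - substitute whatever
    # your script actually talks to.
    HOST, PORT = "example-host", 9000
    RETRIES = 3

    def call_service(payload: bytes) -> bytes:
        """Send a payload and guarantee the socket is closed on every path."""
        last_err = None
        for attempt in range(RETRIES):
            try:
                # The socket is a context manager, so it is closed even if
                # sendall/recv raises - nothing is left half-open for the
                # server side to accumulate across 500 parameter runs.
                with socket.create_connection((HOST, PORT), timeout=10) as sock:
                    sock.sendall(payload)
                    return sock.recv(4096)
            except ConnectionRefusedError as err:
                last_err = err
                time.sleep(2 ** attempt)  # back off before retrying
        raise last_err

If the real script opens connections without a with-block or try/finally, a failure partway through can leave them dangling, and at 500 calls per run that adds up quickly.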
So, for now we've stabilized it by writing a cron job that restarts the service every night. It's a sledgehammer approach, but it's working for us. My best guess is that the python scripts/Clover/WebLogic combination isn't shutting something down properly, perhaps on an error, and that's getting caught up somewhere - though I don't see it in the logs of any of the architecture components. We gave up hunting it down once the brute-force restart fixed the issue and things stabilized. It would be good to know if you have any other thoughts, though.
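If you ever pick the hunt back up, one cheap check to run between the nightly restarts is to tally TCP socket states over time: a steadily growing CLOSE-WAIT count is the classic signature of a process that never closes connections it has finished with, which would fit the "not shutting down properly" theory. A quick sketch, assuming Python 3.7+ and the iproute2 ss tool (standard on most Linux distros):

    import subprocess
    from collections import Counter

    def tcp_state_counts() -> Counter:
        """Tally TCP connection states from `ss -tan` output."""
        out = subprocess.run(["ss", "-tan"], capture_output=True,
                             text=True, check=True)
        # The first line is a header; the first column of each remaining
        # line is the state (ESTAB, CLOSE-WAIT, TIME-WAIT, ...).
        states = [line.split()[0]
                  for line in out.stdout.splitlines()[1:] if line.strip()]
        return Counter(states)

    if __name__ == "__main__":
        # Run this periodically (e.g. from cron) and compare counts over
        # the hours leading up to the "connection refused" errors.
        for state, count in tcp_state_counts().most_common():
            print(f"{state:12} {count}")

If the counts stay flat right up until the errors start, that would point away from a leak and toward something else (listener backlog, ephemeral port exhaustion, etc.).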