Scheduled Tasks Stop Running After a While

jsaab · October 27, 2020, 3:43pm

Hello,

From time to time the scheduled tasks are no longer running. We have 67 scheduled tasks, each runs according to its fixed delay or cron expression. Most of them are configured to run every 60 seconds, but some are scheduled to run every day or every hour. As I mentioned in the beginning all of these scheduled tasks suddenly stop working and don’t resume back. This issue is happening approximately every two days, and the only way to force the scheduled tasks to run again is by restarting the web server (which is obviously causing a lot of issues to the users).
We’ve searched for possible solutions and found a similar one that has been resolved (Possible “freezing” of fast “fixed delay” singleton scheduled task) in versions 6.10.13, 7.0.11, 7.1.1, and 7.2.0.

Can you please confirm that it’s the same issue and that if we upgrade to 7.2.0 this issue will stop from happening over again? Or is there at least a workaround that will allow us to temporarily fix this issue while we work on migrating to 7.2.0?

Platform Information:

platform-6.7.16
Web Server: Tomcat 8.5
Server: Linux CentOS 7 with 16 Gb RAM and 8 vCPUs

krivopustov · October 28, 2020, 10:04am

Hi,

Most probably it’s the same issue, but unfortunately I cannot guarantee that.
I would recommend upgrading to the latest 6.10.x where the issue is resolved.

jsaab · November 6, 2020, 1:05pm

Hello Konstantin,

Migrating to CUBA 6.10.19 didn’t solve the issue unfortunately. The scheduled tasks are still freezing/not running after a couple of days.

Here is some information that might help you in debugging the issue:

Our scheduled tasks are of three types, one that imports files from an FTP, one that sends files to an FTP, and another that sends data through REST API
All scheduled tasks are freezing, not just one, and they’re freezing at approximately the same time
Printing the active scheduled tasks using the JMX Console, shows that all scheduled tasks are active. But in fact, non of them is running, the files are still not imported from the FTP which indicates that the scheduled tasks are not running.
The LAST_START_TIME doesn’t reflect the last scheduled task execution history record when this issue happens.
Setting the LAST_START_TIME to NULL doesn’t restart the scheduled task since it’s marked as running in the Scheduling bean.
All Scheduled tasks have the scheduling type set to “Fixed Delay” with a period of 60 seconds.
We have exactly 66 scheduled tasks (for now at least) that must periodically run to send and import data.
Some scheduled tasks take more than 200 seconds to finish and others less than 20 seconds. It depends on the amount of data being transferred.
The only way to run the scheduled tasks again is by restarting the entire web server.

I’ve taken some screenshots to help you visualize the issue.

scheduled_task_1
Shows the scheduled task ID so we can search for it in the JMX console

scheduled_task_1_active_but_not_running
Searching for it shows that the scheduling task is active

scheduled_task_execution_history_stopped
Shows that the execution history hasn’t been updated since yesterday

scheduled_task_marked_as_active
In the database you can see that the scheduled task is marked as active and the last start time is not long ago

scheduled_task_1_not_running_proof
Shows that the scheduled task is not running because there are still a lot of files in the SFTP folder. I’ve checked all scheduled tasks, and they’re all stuck and not exchanging data. The file name shows the timestamp in which the file was added to the SFTP directory from a 3rd party system. NOTE: I blurred out sensitive information.

If you need more information about anything, please let me know.

albudarov · November 6, 2020, 2:31pm

Hi,
I would advise you to take the following measures to diagnose the cause of the problem:

First of all, I see that you scheduled task doesn’t have TIMEOUT set.
Please assign the timeout value to all scheduled tasks.
Maybe it’s not mentioned in docs but from my experience I highly recommend to do it. This value is used by the scheduling engine to force task launch in some situations.

Timeout should be several times larger than the biggest sensible execution time of the task. E.g. for tasks with typical period of 60 seconds you can assign TIMEOUT = 1800.

I even suppose this might be the solution for your problem.

The next measures you should undertake when/if you meet scheduled tasks stopped for another time. These measures will give us additional troubleshooting information and most likely will show the cause of the problem:

Take a snapshot of the Java threads execution state - it’s called a thread dump.
It’s easy to take thread dump by using JMX console: use
java.lang:type=Threading --> javax.management.openmbean.CompositeData[] dumpAllThreads()
, save the result and copy result to this thread (we will be interested in threads whose names start with cuba_core_scheduler-).
Thread dump will clearly show us if there is any deadly hung code related to the scheduling logic.
Take TRACE logs from this server.
You can enable and disable detailed logging in runtime by using Administration -> Server Log -> Options screen.

Set these settings (in runtime, after you’ve experienced scheduled task stop but before server restart):
Appender -> “File” -> TRACE -> Set
Logger -> com.haulmont.cuba.core.app.scheduling -> TRACE -> Set.

Then you should wait for 2-3 minutes and then change settings back to DEBUG, because this detailed logging consumes lots of disk space.
Then please extract trace and debug log messages from app.log and present them here.

One more piece of troubleshooting information may be information about locks in the database (for sys_scheduled_task table).
How to view table locks - it depends on your database type. E.g. for PostgreSQL it’s the pg_locks view and pg_stat_activity view (shows active transactions, can outline hung or locked connections).

jsaab · November 6, 2020, 2:38pm

Hello Alex,

Thank you for your suggestions, I’ll try them on our QA server and let you know what happens.
For now, I’ve changed most of the Scheduled Tasks to Cron on our Production server to see if the same behavior will occur.

jsaab · November 11, 2020, 8:04am

Hello,

Setting a timeout didn’t help, the tasks are still getting stuck almost everyday. I’ve changed the scheduling type to cron and that didn’t help either. The issue happened at 1 AM and it’s not until now that I’ve exported the logs, so they may be out of context a bit. I got the thread dump, but forgot to fetch the table locks; I’ll do it next time.
Here is all the information that I was able to gather:
Thread dump after restart (155.3 KB)
Thread dump before restart (206.4 KB)
ThreadSnapshot.xls (30.5 KB)
app.log-cuba.zip (8.7 KB)
app.log-core.zip (539 Bytes)
catalina.2020-11-11.log-cuba.zip (539 Bytes)
catalina.2020-11-11.log-core.zip (1.9 KB)

If you still need any information or there’s something that you want me to try out let me know. It’s really important to get this issue resolved, because it’s causing an outage every time it happens.

albudarov · November 11, 2020, 11:34am

Hi,

In the “Thread dump before restart” you can see these threads:

, "ScheduledRunnerThread-10" Id=215 TIMED_WAITING on com.jcraft.jsch.Channel$MyPipedInputStream@5db8d819
	at java.lang.Object.wait(Native Method)
	-  waiting on com.jcraft.jsch.Channel$MyPipedInputStream@5db8d819
	at java.io.PipedInputStream.read(PipedInputStream.java:326)
	at java.io.PipedInputStream.read(PipedInputStream.java:377)
	at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2909)
	at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
	at com.jcraft.jsch.ChannelSftp._stat(ChannelSftp.java:2216)
	at com.jcraft.jsch.ChannelSftp.cd(ChannelSftp.java:343)
	at net.activeidentity.rpac.service.FTPOutboundIntegrationServiceBean.writeOperationToFTPSecureAUX(FTPOutboundIntegrationServiceBean.java:337)
	...

, "ScheduledRunnerThread-9" Id=214 RUNNABLE (in native)
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at com.jcraft.jsch.IO.getByte(IO.java:82)
	at com.jcraft.jsch.Session.read(Session.java:926)
	at com.jcraft.jsch.UserAuthPassword.start(UserAuthPassword.java:91)
	at com.jcraft.jsch.Session.connect(Session.java:470)
	...

, "ScheduledRunnerThread-8" Id=213 RUNNABLE (in native)
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at com.jcraft.jsch.IO.getByte(IO.java:82)
	at com.jcraft.jsch.Session.read(Session.java:926)
	at com.jcraft.jsch.Session.connect(Session.java:358)
	at com.jcraft.jsch.Session.connect(Session.java:183)
	...

, "ScheduledRunnerThread-7" Id=212 RUNNABLE (in native)
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.net.SocketInputStream.read(SocketInputStream.java:224)
	at com.jcraft.jsch.IO.getByte(IO.java:73)
	at com.jcraft.jsch.Session.connect(Session.java:263)
	at com.jcraft.jsch.Session.connect(Session.java:183)
	...

, "ScheduledRunnerThread-6" Id=211 RUNNABLE (in native)
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at com.jcraft.jsch.IO.getByte(IO.java:82)
	at com.jcraft.jsch.Session.read(Session.java:926)
	at com.jcraft.jsch.UserAuthNone.start(UserAuthNone.java:56)
	at com.jcraft.jsch.Session.connect(Session.java:396)
	...

, "ScheduledRunnerThread-5" Id=210 TIMED_WAITING
	at java.lang.Thread.sleep(Native Method)
	at com.jcraft.jsch.Request.write(Request.java:54)
	at com.jcraft.jsch.RequestSftp.request(RequestSftp.java:47)
	at com.jcraft.jsch.ChannelSftp.start(ChannelSftp.java:237)
	at com.jcraft.jsch.Channel.connect(Channel.java:152)
	at com.jcraft.jsch.Channel.connect(Channel.java:145)
	at net.activeidentity.rpac.service.FTPOutboundIntegrationServiceBean.writeOperationToFTPSecureAUX(FTPOutboundIntegrationServiceBean.java:335)
	at net.activeidentity.rpac.service.FTPOutboundIntegrationServiceBean.writeOperationToFTP(FTPOutboundIntegrationServiceBean.java:242)
	...

, "ScheduledRunnerThread-4" Id=209 RUNNABLE (in native)
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at com.jcraft.jsch.IO.getByte(IO.java:82)
	at com.jcraft.jsch.Session.read(Session.java:926)
	at com.jcraft.jsch.UserAuthNone.start(UserAuthNone.java:56)
	at com.jcraft.jsch.Session.connect(Session.java:396)
	...

, "ScheduledRunnerThread-3" Id=208 RUNNABLE (in native)
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at com.jcraft.jsch.IO.getByte(IO.java:82)
	at com.jcraft.jsch.Session.read(Session.java:926)
	at com.jcraft.jsch.UserAuthNone.start(UserAuthNone.java:56)
	at com.jcraft.jsch.Session.connect(Session.java:396)
	...

, "ScheduledRunnerThread-2" Id=207 RUNNABLE (in native)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at java.net.Socket.<init>(Socket.java:434)
	...

, "ScheduledRunnerThread-1" Id=206 TIMED_WAITING on com.jcraft.jsch.Channel$MyPipedInputStream@225b9f03
	at java.lang.Object.wait(Native Method)
	-  waiting on com.jcraft.jsch.Channel$MyPipedInputStream@225b9f03
	at java.io.PipedInputStream.read(PipedInputStream.java:326)
	at java.io.PipedInputStream.read(PipedInputStream.java:377)
	at com.jcraft.jsch.ChannelSftp.fill(ChannelSftp.java:2909)
	at com.jcraft.jsch.ChannelSftp.header(ChannelSftp.java:2935)
	at com.jcraft.jsch.ChannelSftp._put(ChannelSftp.java:583)
	at com.jcraft.jsch.ChannelSftp.put(ChannelSftp.java:540)
	at com.jcraft.jsch.ChannelSftp.put(ChannelSftp.java:492)
	...

So all 10 schedule runner threads are stuck in the com.jcraft.jsch library code.
The most probable reason for that is temporary network problem which leads to the synchronous network I/O call to hang forever or for a long time.

When you are working with network libraries, it’s best practice to always set network timeouts to some finite value (connection timeout and read timeout).

jsaab · November 19, 2020, 11:24am

Hello Alex,

Thanks to your hints we were pointed to the right direction to solve the issue.
Here are our findings:

Due to the huge number of files being transferred, the scheduled task was never finishing. It was always running. The only way for it to end was to transfer all files which takes an enormous amount of time.
The timeout for the scheduled task was set to 3600 seconds, so after 3600 seconds the scheduled task executor was creating another instance of the same task without killing the one that is currently running. As a side note, the scheduled task is marked as singleton.
After ten hours (3600 * 10) the scheduled task queue was full because there were 10 tasks that are already running.

The solution was to limit the number of files to be transferred so that it ends before the scheduled task times out. Since applying this solution we haven’t faced any problems.

Questions:

Is there a way to increase the scheduled tasks queue size? and is this queue limit per server?
Does the scheduled task executor call the interrupt method for the scheduled task thread after timeout?

albudarov · November 19, 2020, 4:03pm

Hi,

It’s not queue. It’s thread pool whose size is 10.
I don’t see any reason to increase it. The problem isn’t in the pool size, it’s in the scheduler that never finishes.

The scheduled task invoked periodically shouldn’t try to do all work at once. If invocation #1 doesn’t finish all work, the next periodic invocation will continue the work. So your solution with limiting number of files to transfer is good.

No, and there is no way in Java to reliably interrupt the thread (especially the thread working with synchronous I/O).

Again, timeout should be much bigger than expected scheduler working time. If it’s OK for scheduler to process its data for 1 hour, then timeout should be 12 hours or more.

albudarov · October 8, 2024, 7:35am

A post was split to a new topic: None of the scheduled jobs executed