Pythian Blog: Technical Track

Killing an Oracle Job Dead

If you ever thought it was easy to kill an oracle job, let me show you something that gave me a headache today.

We’re testing an application that uses custom code to de-queue and propagate AQ messages between databases. There are nine processes in total, and all of them are submitted as jobs using DBMS_JOB. They are supposed to run all the time, waiting for messages to de-queue. Once in a while, I need to kill them in order to recompile the objects they are executing.

The steps are simple enough: remove/break all jobs in question, kill the jobs that are running, recompile the packages with new the version, and submit/un-break the jobs again. This time, however, things were not going as smoothly as I expected. Let’s have a look.

First I remove all the currently running jobs:

SYS@ORCL>select 'exec dbms_ijob.remove('||job||');' from dba_jobs_running;

'EXECDBMS_IJOB.REMOVE('||JOB||');'
----------------------------------------------------------------
exec dbms_ijob.remove(261);
exec dbms_ijob.remove(264);
exec dbms_ijob.remove(267);
exec dbms_ijob.remove(262);
exec dbms_ijob.remove(265);
exec dbms_ijob.remove(268);
exec dbms_ijob.remove(263);
exec dbms_ijob.remove(266);
exec dbms_ijob.remove(269);

9 rows selected.

SYS@ORCL>exec dbms_ijob.remove(261);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(264);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(267);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(262);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(265);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(268);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(263);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(266);
PL/SQL procedure successfully completed.

SYS@ORCL> exec dbms_ijob.remove(269);
PL/SQL procedure successfully completed.

As a next step, I’ll just kill the sessions of the running jobs:

SYS@ORCL>select 'alter system kill session '''||sid||','||serial#||''';' from v$session where sid in (select sid from dba_jobs_running);

'ALTERSYSTEMKILLSESSION'''||SID||','||SERIAL#||''';'
--------------------------------------------------------------------------------
alter system kill session '120,73';
alter system kill session '122,111';
alter system kill session '123,788';
alter system kill session '136,20';
alter system kill session '137,28';
alter system kill session '138,143';
alter system kill session '142,132';
alter system kill session '144,129';
alter system kill session '159,68';

9 rows selected.

SYS@ORCL>alter system kill session '120,73';
System altered.

SYS@ORCL> alter system kill session '122,111';
System altered.

SYS@ORCL> alter system kill session '123,788';
System altered.

SYS@ORCL> alter system kill session '136,20';
System altered.

SYS@ORCL> alter system kill session '137,28';
System altered.

SYS@ORCL> alter system kill session '138,143';
System altered.

SYS@ORCL> alter system kill session '142,132';
System altered.

SYS@ORCL> alter system kill session '144,129';
System altered.

SYS@ORCL> alter system kill session '159,68';
System altered.

And the running jobs are gone:

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         0

I’m thinking this is fine, but after few seconds the jobs are back!

SYS@ORCL> /

  COUNT(*)
----------
         9

I found this behavior quite intriguing, so I searched on the Net and found a really nice blog explaining in great detail how the DBMS_JOB jobs behaves.

That gave a hint that the OS process may actually keep the database processes alive. So I think, that sounds reasonable—let’s kill the OS processes instead.

SYS@ORCL>!ps -ef | grep ora_j
oracle   16234     1  0 12:58 ?        00:00:00 ora_j000_ORCL
oracle   16236     1  9 12:58 ?        00:00:07 ora_j001_ORCL
oracle   16238     1  9 12:58 ?        00:00:07 ora_j002_ORCL
oracle   16240     1  0 12:58 ?        00:00:00 ora_j003_ORCL
oracle   16242     1  0 12:58 ?        00:00:00 ora_j004_ORCL
oracle   16244     1  8 12:58 ?        00:00:06 ora_j005_ORCL
oracle   16246     1  9 12:58 ?        00:00:07 ora_j006_ORCL
oracle   16248     1  0 12:58 ?        00:00:00 ora_j007_ORCL
oracle   16250     1  9 12:58 ?        00:00:07 ora_j008_ORCL
oracle   16280 15459  0 12:59 pts/0    00:00:00 /bin/bash -c ps -ef | grep ora_j
oracle   16282 16280  0 12:59 pts/0    00:00:00 grep ora_j

SYS@ORCL>!kill -9 16234 16236 16238 16240 16242 16244 16246 16248 16250 16252

Done!

Hmm, but not really—after a while, the jobs are back from the dead. Under different Oracle PIDs and different OS PIDs.

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         9

Now this left me really puzzled. How do I get rid of them?

Of course! Why hadn’t I thought of this before? I’ll set job_queue_processes to 0. Kill the processes and they will not respawn.

SYS@ORCL>alter system set job_queue_processes=0;

System altered.

Killed all processes again.

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         0

SYS@ORCL>alter system set job_queue_processes=10;

System altered.

SYS@ORCL>select count(*) from dba_jobs_running;

  COUNT(*)
----------
         9

Oops, they’re back again! Notice that the jobs were gone from the DBA_JOBs table, and the running jobs were referencing non-existent job ID’s. How can they be respawning?

SYS@ORCL>select j.what, j.job, r.sid from dba_jobs j, dba_jobs_running r where j.job in (select job from dba_jobs_running) and j.job=r.job order by j.what;

no rows selected

I decided to expand my knowledge about jobs, and read the whole of James Koopmann’s blog. And then I saw the light! User JB (thank you JB, I too learned the hard way) mentioned in comments under the article that one has to run COMMIT after manipulating jobs.

So after all of the above, I just typed . . .

SYS@ORCL> commit;

Commit complete.

. . . then killed the jobs again, and bingo! Those pesky jobs were finally gone.

I’ve done the same procedure in the past using DBMS_JOB and never faced the issue, but I realize that I always switched to the SYS user in order to kill sessions, so it was autocommited upon reconnect.

It is important to mention—the respawned jobs were not doing what they were supposed to do, that means they were not de-queing messages. The database version I’m using is 10.2.0.4, and it’s true for both DBMS_JOB/IJOB, as well as REMOVE and BROKEN procedures, and it might apply to other procedures of this package as well. After all, things are as simple as commit, aren’t they?

No Comments Yet

Let us know what you think

Subscribe by email