Sunday, December 6, 2009

The Haunted DBA (1) -- non-empty tnsnames.ora ifile in iAS Oracle Home

Dr. Z once worked for a manufacturing company with a total of two DBAs and one Linux system administrator. The Oracle E-Business Suite 11.5.10.2 ran on six Oracle Enterprise Linux nodes: two nodes for web/forms/MWA, two nodes for the concurrent managers, report server, and admin server, and a two-node 10gR2 RAC database with ASM. Initially the database nodes were 32-bit and performance was poor, especially during monthly close.

The existing cloning method did not follow Oracle’s Rapid Clone approach. There was no adpreclone step on the source system and no adcfgclone step on the destination system. They simply copied the application code over, modified the context file, and ran adconfig.sh. After sourcing the environment, they ran adautocfg.sh again. This approach caused problems from time to time. For the migration from 32-bit to 64-bit database nodes, they had six new servers for PROD. One weekend, Dr. Z’s fellow DBA conducted the server migration while Dr. Z watched. They could not get it done because of configuration issues with some of the concurrent managers.

After some troubleshooting, they eventually completed the 64-bit migration, but without load balancing and other features. Dr. Z therefore proposed to fix the cloning method and to conduct an in-place clone of PROD to fix the context file and the load balancing for web, forms, MWA, and the concurrent managers.

About one week before Halloween, they went through the in-place clone procedure for PROD. They got a slow start in the morning because the night security guard forgot to leave the office building door keys for the day security guard. They had intended to start at 7:00am and had to wait about 40 minutes to get into the office. The in-place clone went smoothly and they left the office at about 10:30am. Normally they did not need to go to the office for scheduled maintenance, but Dr. Z’s fellow DBA wanted to watch what Dr. Z did.

Dr. Z checked the system and everything was fine. However, on the following Monday, some users had problems connecting to the Forms server. Initially, they thought it was due to IE 6, the Sun JRE, or something similar. As more users ran into problems, they began to check the back end. They found more than 200 forms connections on web-forms node 2 and only 4 connections on web-forms node 1. When they tried to connect to the forms server on web-forms node 1 directly, they got connection errors. In the Apache log error_log_pls they found ORA-12541 (TNS: no listener) errors, which started at 3:47pm on Sunday afternoon.

[Sun Oct 25 10:21:10 2009] [notice] FastCGI: process manager initialized (pid 912)
[Sun Oct 25 10:21:11 2009] [notice] Oracle HTTP Server Powered by Apache/1.3.19 configured -- resuming normal operations
[Sun Oct 25 15:47:03 2009] [error] mod_plsql: /pls/PROD/fnd_icx_launch.launch HTTP-503 ORA-12541
[Sun Oct 25 15:47:19 2009] [error] mod_plsql: /pls/PROD/fnd_icx_launch.launch HTTP-503 ORA-12541
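
The time of the first failure can be pulled out of a log like the one above with a few lines of code. The following is a minimal Python sketch, assuming the bracketed Apache 1.3 error-log layout shown above. The function name first_occurrence and the sample list are illustrative only and are not part of any Oracle tooling.

```python
import re
from datetime import datetime

# Matches lines in the layout shown above, for example:
# [Sun Oct 25 15:47:03 2009] [error] mod_plsql: ... ORA-12541
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+\[(?P<level>\w+)\]\s+(?P<msg>.*)$")


def first_occurrence(lines, needle="ORA-12541"):
    """Return the timestamp of the first log line whose message contains needle.

    Returns None if no line matches. Timestamps are parsed with English
    day and month abbreviations, which is an assumption about the locale.
    """
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and needle in match.group("msg"):
            return datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
    return None


# Sample lines copied from the log excerpt above.
sample = [
    "[Sun Oct 25 10:21:10 2009] [notice] FastCGI: process manager initialized (pid 912)",
    "[Sun Oct 25 10:21:11 2009] [notice] Oracle HTTP Server Powered by Apache/1.3.19 configured -- resuming normal operations",
    "[Sun Oct 25 15:47:03 2009] [error] mod_plsql: /pls/PROD/fnd_icx_launch.launch HTTP-503 ORA-12541",
    "[Sun Oct 25 15:47:19 2009] [error] mod_plsql: /pls/PROD/fnd_icx_launch.launch HTTP-503 ORA-12541",
]

if __name__ == "__main__":
    print(first_occurrence(sample))
```

Against a real log, the same function could be fed an open file object instead of the sample list, since it only iterates over lines.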

Dr. Z checked all entries in $IAS_ORACLE_HOME/network/admin/tnsnames.ora and none gave him trouble. Since PROD_pind23_ifile.ora is supposed to be empty, he ran ls -lart *.ora and found that it was not. He told his fellow DBA, who went and emptied it. As a result, Dr. Z lost both the contents and the timestamp of the file. After they bounced the web/forms tier, the system was back to normal.

Because the Oracle errors in error_log_pls started at 3:47pm on Sunday, the ifile must have been non-empty from that point on. Dr. Z wanted to investigate further to find out who might have put a non-empty PROD_pind23_ifile.ora there. Before he conducted the in-place clone, he had taken a backup. The following listing is from that backup, taken on Sunday morning. It clearly shows that PROD_pind23_ifile.ora was empty.

wfmprod@pind23(PROD_806_BALANCE/Web:Forms:MWA):$ ls -lR *
-rw-r--r-- 1 wfmprod oinstall 0 Mar 25 2009 PROD_pind23_ifile.ora
-rw-r--r-- 1 wfmprod oinstall 7935 Sep 20 07:50 tnsnames.ora

old:
total 24
-rw-r--r-- 1 wfmprod oinstall 0 Oct 23 08:56 PROD_pind23_ifile.ora
-rw-r--r-- 1 wfmprod oinstall 7731 Oct 23 08:56 PROD_pind23_ifile.ora.old
-rw-r--r-- 1 wfmprod oinstall 7935 Oct 23 08:56 tnsnames.ora
-rw-r--r-- 1 wfmprod oinstall 7889 Oct 23 08:56 tnsnames.ora.cya
[/share/oracle/pind23/product/iAS/network/admin/PROD_pind23]

He also asked his backup administrator to restore PROD_pind23_ifile.ora from tape. The tape backup had been taken on Friday. In it, PROD_pind23_ifile.ora is empty and has the same timestamp as above.

wfmprod@pind23(PROD_806_BALANCE/Web:Forms:MWA):$ ls -l
total 12
drwxr-xr-x 2 wfmprod oinstall 4096 Sep 8 09:39 old
-rw-r--r-- 1 wfmprod oinstall 0 Mar 25 2009 PROD_pind23_ifile.ora
-rw-r--r-- 1 wfmprod oinstall 0 Oct 26 15:43 PROD_pind23_ifile.ora.bak
-rw-r--r-- 1 wfmprod oinstall 7926 Oct 25 10:16 tnsnames.ora
[/data/wfmprod/product/iAS/network/admin/PROD_pind23]

Thus, when they began to work on the system, the ifile was empty. As the cloning procedure does not touch PROD_pind23_ifile.ora, he suspected that somebody had put a non-empty file there.

Here are the connections to the box as reported by the last command. On Sunday, only the two DBAs were on pind23.

wfmprod pts/4 172.18.33.73 Sun Oct 25 08:01 - 08:06 (1+00:04)
wfmprod pts/1 172.18.31.85 Sun Oct 25 07:40 - 16:16 (1+08:36)
wfmprod pts/3 172.18.31.85 Sat Oct 24 00:12 - 16:16 (2+16:04)
root pts/3 172.18.33.53 Fri Oct 23 15:24 - 15:25 (00:00)
root pts/3 172.18.33.53 Fri Oct 23 13:09 - 13:32 (00:23)

172.18.33.73 is Dr. Z’s desktop IP address, 172.18.31.85 is his fellow DBA’s desktop IP address, and 172.18.33.53 is the Linux system administrator’s desktop IP address. Dr. Z talked to his fellow DBA about this, and his fellow DBA denied putting the file there. The bash command history did not show any activity on PROD_pind23_ifile.ora.

As only three people had access to the system, his fellow DBA or the Linux system administrator might have intentionally put a non-empty ifile there, containing tnsnames.ora aliases for non-existent systems. Here are some suspicious circumstances:

1. Dr. Z asked the Linux system administrator for /var/log/secure and the messages files. The administrator refused to hand over the files or let him read them. The administrator checked the files himself and reported that only the two DBAs had been on the system.
2. Dr. Z’s fellow DBA emptied the file before Dr. Z could save it.
3. Dr. Z’s fellow DBA was on the system on Sunday.
4. Somebody could have used scp to put a copy of the file there and so avoided logging in to the system directly.
5. Somebody might have used a shell other than bash, so the bash command history would show no activity on the file.

Dr. Z talked to his manager about this issue and asked for file-level auditing. It appears that they are not going to get it. Dr. Z has created a shell script to check for file changes over the past 24 hours. Hopefully, he will not have this kind of surprise in the future.
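
Dr. Z's shell script is not shown in this post. As an illustration of the same idea, here is a minimal Python sketch that lists files changed within the past 24 hours under a directory. The function name recently_modified and its parameters are hypothetical, and the directory to scan is whatever is passed on the command line. It looks at modification time only, which anyone with write access can reset, so it is a tripwire rather than a replacement for file-level auditing.

```python
import os
import sys
import time


def recently_modified(root, window_seconds=24 * 60 * 60, now=None):
    """Return (path, mtime, size) for files under root modified within the window.

    Results are sorted by modification time, oldest first. The 24-hour
    default mirrors the check described above; it is an assumption, not
    a recommendation.
    """
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if info.st_mtime >= cutoff:
                results.append((path, info.st_mtime, info.st_size))
    return sorted(results, key=lambda item: item[1])


if __name__ == "__main__":
    # Usage: python check_changes.py <directory>
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, mtime, size in recently_modified(target):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(f"{stamp} {size:>10} {path}")
```

Run daily from cron against a directory such as the iAS network/admin directory, a report like this would have shown a zero-byte ifile growing to a non-zero size, along with the time of the change.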
