ROAD for Oracle VM Ready for Business Reporting for Oracle VM for x86 Release 3.x

ROAD for Oracle VM Ready for Business Reporting

ROAD™ for Oracle® VM ready for business reporting allows operations to proactively respond to issues before they impact the business. ROAD for Oracle VM takes a fundamentally different approach to Oracle VM reporting compared to Oracle VM Manager and Enterprise Manager. ROAD for Oracle VM ready for business reports focus on giving operations visibility into issues before they impact the security and availability of the Oracle VM servers, server pools, and their business critical Oracle workloads.
 
ROAD for Oracle VM ready for business reports filter the Oracle VM server log files, and then deliver reports in a dashboard format showing leading security and availability indicators. The dashboard has been designed to allow the recipients to quickly determine if there are any actionable indicators. The reports are delivered to email recipients on a flexible schedule, e.g. daily or weekly.
 
ROAD for Oracle VM ready for business reports contain the following details:
  • The Oracle VM server hostname and the date of the report
  • Uptime statistics
  • Local disk and storage repository statistics
  • Failed Login Attempts
  • D State Processes
  • Zombie Processes
  • Filter /var/log/ovsagent.log for refused entries
  • Filter /var/log/messages for warning, failed, recovery, error and evict messages
 
The next example shows a daily ROAD for Oracle VM ready for business email report.
Daily Ready for Business Status for ovs-prod437
2017-02-11

Uptime: 1186 days 43 minutes

Disk Space
Mount Point                                           Used %   Avail
/                                                     4%       54G
/boot                                                 30%      67M
/dev/shm                                              0%       931M
/var/lib/xenstored                                    1%       931M
/poolfsmnt/0004fb00000500005a9b745eb940c981           2%       14G
/OVS/Repositories/0004fb0000030000732541521e134825    75%      7.7T
/OVS/Repositories/0004fb00000300006de3dba49e87b8f7    75%      7.7T
/OVS/Repositories/0004fb00000300004af79ef1fffed38e    65%      3.5T

Failed Login Attempts: 20
D State Processes: 0
Zombies: 1

Log File Events
ovsagent.log "refused"    0
Messages "Warning!"       0
Messages "failed!"        0
Messages "recovery!"      0
Messages "error!"         0
Messages "evict!"         0
Call: 415-860-2851
 
The next tables describe each of the report items including the evaluation criteria, and troubleshooting next steps.
Dashboard Details
Description, Evaluation Criteria, & Troubleshooting Steps
The Oracle VM server hostname
Date of the report
Uptime
These details are informational.
Local disk and storage repository statistics
These details are a leading availability indicator that describes the used and available disk statistics for an Oracle VM server and the mounted storage repositories. Both the local disks/partitions and the mounted OCFS2 and NFS storage repositories are checked.
 
Local Disks:
If an Oracle VM server’s disks near full capacity or become full, the availability of the server is at risk. Oracle VM servers with full disks may become unstable, crash, or have their file systems go read only.
 
The resolution is to first track down the files that have filled up the disk or partition, review them, and if possible delete or remove them.
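
As a rough illustration of spotting a disk that is nearing capacity, the following sketch flags file systems at or above a usage threshold. It is not taken from rfbreport.sh: the function name flag_full_mounts and the default threshold of 90 are assumptions made for this example. It reads "df -P" output on standard input, so it can be tried against saved output as well as a live server.

```shell
# Hypothetical helper (not part of rfbreport.sh): print mount points whose
# usage is at or above a threshold. Expects `df -P` output on stdin.
flag_full_mounts() {
    local threshold=${1:-90}   # assumed default; choose what suits your site
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the df header line
            used = $5
            sub(/%/, "", used)        # "75%" -> "75"
            if (used + 0 >= limit + 0)
                print $6, $5          # mount point, then used %
        }'
}

# Example: df -P | flag_full_mounts 75
```

Mount points that contain spaces are not handled by this sketch, because it splits the df output on whitespace.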
 
The next example is from an Oracle VM server that had its disk filled with errant log files. The following steps demonstrate how to list the used space for all directories in / (root), showing the problematic var directory.
 
On the target Oracle VM server, as root type the following commands:
cd / (so you are in the root directory)
ls | xargs du -hs
6.1M bin
22M boot
380K dev
0 dlm
6.2M etc
4.0K home
146M lib
17M lib64
16K lost+found
4.0K media
4.0K mnt
4.0K nfsmnt
6.5M opt
746G OVS
44K poolfsmnt
0 proc
421M root
27M sbin
4.0K selinux
4.0K srv
0 sys
112K tmp
490M usr
52G var
 
As an example, note that the var directory is 52G. To investigate the /var directory, "cd /var" and run "ls | xargs du -hs" again to drill down into it.
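
The drill-down above can also be sketched as a single reusable function. This is an alternative to "ls | xargs du -hs", not a replacement prescribed by the product: the name dir_sizes is made up for this example, and it assumes GNU du (for --max-depth). Sizes are printed in kilobytes so that a plain numeric sort works on older systems.

```shell
# Hypothetical helper: list a directory and its immediate subdirectories,
# smallest first, with sizes in kilobytes. -x keeps du on one file system,
# so storage repositories mounted below the directory are not counted.
dir_sizes() {
    local dir=${1:-/}
    du -xk --max-depth=1 "$dir" 2>/dev/null | sort -n
}

# Example: dir_sizes /var | tail -5
```

The last line of the output is the total for the directory itself; the lines just above it are the largest subdirectories, which are the next candidates to drill into.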
 
OCFS2, NFS and Local Storage Repositories
The default settings in Oracle VM Manager disable automatic storage repository refreshes. As a result, disks can fill without warning, resulting in read-only file systems on the repositories and possible data corruption.

OCFS2 does not account for volume metadata when it reports disk space exhaustion. OCFS2 metadata can consume over 6% of disk space. Plan accordingly, or as soon as your Oracle VM repositories become ~94% full they will go read only!
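
To make the ~94% rule of thumb concrete, here is a small worked calculation. It is only a planning aid under stated assumptions: the function name headroom_gb and the READ_ONLY_PCT variable are invented for this example, and the 94 comes from the guidance above rather than from anything read out of OCFS2.

```shell
# Hypothetical helper: gigabytes left before a repository reaches the
# read-only threshold. READ_ONLY_PCT is the ~94% rule of thumb from the
# text above, not a value queried from OCFS2.
READ_ONLY_PCT=94

headroom_gb() {
    local total_gb=$1 used_pct=$2
    local left=$(( total_gb * (READ_ONLY_PCT - used_pct) / 100 ))
    (( left < 0 )) && left=0       # already past the threshold
    echo "$left"
}

# Example: headroom_gb 1000 75
```

For a 1000 GB repository that is 75% used, this gives 190 GB of headroom rather than the 250 GB that the raw free space suggests.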
Failed Login Attempts
These details are a leading security indicator that detects Oracle VM server login failures. Oracle VM server login failure auditing is enabled by default. Auditing the number of failed login attempts by a user helps you learn about brute force, dictionary, and other password attacks executed against the Oracle VM server.
 
 Indicated no failed logins.
 Indicated failed logins.
 
Searching and filtering failed login messages can be done centrally using an Enterprise Analytics solution, or locally using grep.

Failed logins can be searched locally by accessing the Oracle VM server, then searching the /var/log/secure file for login failures. The /var/log/secure files contain login failures with the service, username, and timestamp.

For example, to search /var/log/secure for current failed logins, as root type:
grep -i failed --color=always /var/log/secure
Or try the following command to search all of the secure files, including the archived secure files.
grep -i failed --color=always /var/log/secure*
 
Once you have reviewed the messages, start filtering the messages to only display specific users, and times.
 
To filter failed login messages for a specific user, for example root, try:
grep -i failed --color=always /var/log/secure | grep root --color=always
 
To filter failed login messages for a specific user during a specific month, for example the root user in February, try:
grep -i failed --color=always /var/log/secure | grep root --color=always | grep Feb
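
Building on the grep filters above, the following sketch summarizes failed logins per user instead of listing every line. It assumes the sshd message format "Failed password for <user> from ..." (and "Failed password for invalid user <user> from ..."); other services log failures differently and are not counted. The function name failed_logins_by_user is made up for this example.

```shell
# Hypothetical helper: count sshd "Failed password" lines per user name,
# most frequent first. Takes the path of a secure log as its argument.
failed_logins_by_user() {
    awk '/Failed password for/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "for") {
                # "for invalid user NAME" versus "for NAME"
                user = ($(i + 1) == "invalid") ? $(i + 3) : $(i + 1)
                count[user]++
                break
            }
        }
    }
    END { for (u in count) print count[u], u }' "$1" | sort -rn
}

# Example: failed_logins_by_user /var/log/secure
```

A single account with a high count points to a targeted password attack, while many different invalid user names usually indicate a dictionary scan.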
D State Processes
These details are a leading availability indicator that detects D State Processes. An Oracle VM server with D State Processes may have processes from critical services such as o2cb, ocfs2 or ovs-agent reaching a deep disk sleep state.
 
Oracle VM servers with many D State Processes may become unstable, unresponsive, reboot, or cause cluster wide reboots. If left unchecked, D State Processes may result in having to shutdown and reboot the affected node, or all the nodes in the pool to restore the system or pool to full operation.
 
 Indicated no D State Processes.
 Indicated D State Processes.
 
Once D State Processes are detected, the first step is to evaluate the Oracle VM server using the “ps -ef” command. D State Processes are difficult to kill, and each kill attempt often results in more hung processes. Before trying to kill the D State Processes, try to migrate all of the running virtual machines to a different Oracle VM server. Next, try to kill the D State Processes, or reboot the server.
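
Because "ps -ef" does not print the process state column, a state-aware listing can help when evaluating the server. The sketch below is one way to do that with the procps "ps -eo" output format; the function name procs_in_state is an assumption made for this example, and it is not how rfbreport.sh counts D State Processes.

```shell
# Hypothetical helper: keep only processes whose state starts with the given
# letter (D = uninterruptible sleep). Expects `ps -eo pid,ppid,stat,comm`
# output on stdin; the header line is skipped.
procs_in_state() {
    local state=$1
    awk -v s="$state" 'NR > 1 && substr($3, 1, 1) == s'
}

# Example: ps -eo pid,ppid,stat,comm | procs_in_state D
```

Running the example a few times, some seconds apart, helps separate processes that are briefly waiting on I/O from ones that are stuck in the D state.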
Zombie Processes
These details are a leading availability indicator that detects Zombie Processes. On Unix and Unix-like operating systems, a zombie process, or a defunct process is a process that has completed execution, but still has an entry in the process table in the "Terminated state". The term "zombie" comes from the status entry which should be a "Z." Since the process has exited, the program name is no longer available, so the name zombie is used indicating its state as “dead and gone”.
 
Oracle VM servers with many Zombie Processes may become unstable, unresponsive, or reboot. If left unchecked, Zombie Processes may result in having to shutdown and reboot the affected node.
 
 Indicated no Zombie Processes.
 Indicated Zombie Processes.
 
Zombie Processes are difficult to kill because the process is no longer running, so there is nothing left to receive or react to the kill signal.

If there are only a couple of Zombie Processes and over time no additional Zombie Processes appear, you're OK. But if there are lots of Zombie Processes, or the number of Zombie Processes starts increasing, try to migrate all of the running virtual machines to a different Oracle VM server. Next, try to kill the Zombie Processes, or just reboot the server.
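
Since a zombie is only cleared once its parent process collects its exit status, it is often more useful to know which parent is leaving zombies behind than to list the zombies themselves. The following sketch groups zombies by parent PID; the function name zombie_parents is invented for this example.

```shell
# Hypothetical helper: count zombie (state Z) processes per parent PID,
# busiest parent first. Expects `ps -eo pid,ppid,stat,comm` output on stdin.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { n[$2]++ }
         END { for (p in n) print n[p], p }' | sort -rn
}

# Example: ps -eo pid,ppid,stat,comm | zombie_parents
```

Each output line is a zombie count followed by the parent PID. A parent whose count keeps growing between runs is the process to investigate.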
/var/log/ovsagent.log "refused" entries
These details are a leading availability indicator that detects issues with the Oracle VM agent (ovs-agent), Oracle VM Manager, or iSCSI storage. The /var/log/ovsagent.log log file contains the Oracle VM agent activities (ovs-agent).
 
 Indicated no connection refused entries.
 Indicated connection refused entries.
 
Oracle VM Manager Agent (ovs-agent) Connection Refused Messages such as "ERROR (notification:44) Unable to send notification: (111, 'Connection refused')"
Oracle VM facilitates centralized management of servers, pools and their resources using an agent-based architecture. Oracle VM Manager dispatches administrative commands made in the Oracle VM Manager GUI to Oracle VM servers via XML RPC to the Oracle VM agent (ovs-agent).
 
Once an ovs-agent hangs or crashes, or ovs-agent files become corrupt, we may see connection refused entries.

The ovs-agent process can be checked by accessing the Oracle VM server as root and running the following command:
# service ovs-agent status
 
If the agent is ok, you may see something like:
log server (pid 7245) is running...
notification server (pid 7562) is running...
remaster server (pid 7578) is running...
monitor server (pid 7580) is running...
ha server (pid 7581) is running...
stats server (pid 7584) is running...
xmlrpc server (pid 7586) is running...
 
If the agent is stopped, you may see something like:
log server is stopped...
notification server is stopped...
remaster server is stopped...
monitor server is stopped...
ha server is stopped...
stats server is stopped...
xmlrpc server is stopped...
 
If there is a crashed service, or corrupt file you may see something like:
log server (pid 7245) is running...
notification server (pid 7562) is running...
remaster server (pid 7578) is running...
monitor server (pid 7580) is running...
ha server (pid 7581) is running...
stats server (pid 7584) is running...
xmlrpc server dead but pid file exists <<<<<<<<
 
Or something like this:
log server (pid 7245) is running...
notificationserver server dead but pid file exists <<<<<<<<
remaster server (pid 7578) is running...
monitor server (pid 7580) is running...
ha server (pid 7581) is running...
stats server (pid 7584) is running...
xmlrpc server (pid 7586) is running...
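
The status outputs above can be checked mechanically. The sketch below prints only the components that are stopped or dead, matching the exact phrases shown in the examples ("is stopped" and "dead but pid file exists"). The function name agent_problems is an assumption made for this example.

```shell
# Hypothetical helper: print ovs-agent components that are not running.
# Expects `service ovs-agent status` output on stdin. The exit status is 0
# when at least one problem line was found and non-zero when none was.
agent_problems() {
    grep -E 'is stopped|dead but pid file exists'
}

# Example: service ovs-agent status | agent_problems
```

No output means every component reported "is running", at least for the message formats shown above.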
 
If the connection refused message is ovs-agent related, restart the ovs-agent service by running the following command:
# service ovs-agent restart
Restarting the ovs-agent service should not cause a service disruption with running virtual machines.
 
Often restarting the ovs-agent will clean up a problematic ovs-agent. That being said, hung ovs-agent processes are difficult to kill and, if left unchecked, often result in the Oracle VM servers becoming unresponsive and rebooting. If the ovs-agent is hung, first try to migrate all of the running virtual machines to a different Oracle VM server. Next, try to kill the ovs-agent processes, or just reboot the server.
 
If the ovs-agent status shows:
“xmlrpc server dead but pid file exists <<<<<<<<”
it's possible that the files in the /etc/ovs-agent/cert/ directory were deleted. To confirm, as root type:
# ls -ltr /etc/ovs-agent/cert/
If the directory is empty, we can regenerate all the files with the following command:
# ovs-agent-keygen -f
Next, restart the ovs-agent:
# service ovs-agent restart
 
Oracle VM Manager Connection Refused Messages
It's also possible that the connection refused message is due to a firewall on the Oracle VM Manager host blocking XML RPC communication to the ovs-agent on TCP 8899.
 
Review the firewall configuration on the Oracle VM Manager host to confirm that it is not blocking TCP traffic on port 8899. If you're using iptables, as root type:
# iptables -L
If iptables is disabled, you’ll see something like:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
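
When iptables does have rules loaded, the listing can be long. As a rough aid, the following sketch pulls out the rules that name a given destination port. The function name rules_for_port is made up for this example, and it only matches single-port rules in the "dpt:PORT" form printed by "iptables -L -n"; port ranges and rules that do not mention a port are not considered.

```shell
# Hypothetical helper: show iptables rules that name a destination port.
# Expects `iptables -L -n` output on stdin.
rules_for_port() {
    local port=$1
    grep -w -- "dpt:${port}"
}

# Example: iptables -L -n | rules_for_port 8899
```

An empty result does not by itself prove that the port is open or blocked, because the chain policy and any catch-all rules still apply. Treat it as a pointer to where to look in the full listing.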
 
iSCSI Storage Connection Refused Messages
With iSCSI storage it's also possible to see connection refused messages that are caused by a faulty storage array, a network issue, or a UUID issue caused by replacing the motherboard or a NIC.
 
Storage array and networking failures, as well as maintenance or configuration changes may be the culprit. Engage your storage and network team to help troubleshoot.
 
Oracle VM servers are identified by Oracle VM Manager using a UUID. These UUIDs are derived either by interrogating the BIOS or by concatenating the NIC MAC addresses. If the motherboard or a NIC is changed, the UUID changes, and Oracle VM Manager can no longer communicate with the Oracle VM server.
/var/log/messages:
  • Error
  • Evict
  • Failed
  • Recovery
  • Warning
These details contain both leading and trailing availability indicators from the /var/log/messages log file. The error, evict, failed, recovery, and warning messages should be investigated to confirm if the messages are actionable.
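
As a quick way to see which of these keywords deserve attention first, the following sketch counts the matching lines for each keyword in a log file. It is an illustration only and is not taken from rfbreport.sh; the function name count_keywords is an assumption made for this example.

```shell
# Hypothetical helper: count the lines in a log file that contain each
# keyword, ignoring case. Usage: count_keywords FILE KEYWORD...
count_keywords() {
    local file=$1 word
    shift
    for word in "$@"; do
        printf '%s %s\n' "$word" "$(grep -ic -- "$word" "$file")"
    done
}

# Example: count_keywords /var/log/messages warning failed recovery error evict
```

Keywords with a non-zero count are the ones to drill into with the grep filters described in the sections that follow.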
 
/var/log/messages logs many system messages, including messages logged during system startup. The /var/log/messages settings are managed via /etc/syslog.conf.
 
Oracle VM server by default logs local and cluster-wide o2cb and ocfs2 messages. The o2cb and ocfs2 messages provide insight into cluster incidents on individual Oracle VM servers that may produce secondary effects on other Oracle VM servers in the cluster.
 
 No events.
 Events.
 
/var/log/messages Error Messages
The error messages logged in /var/log/messages are system wide, and should be investigated to confirm which error messages are actionable.

Searching and filtering /var/log/messages error messages can be done centrally using an Enterprise Analytics solution, or locally using grep.
 
For example, to search for the current error messages locally, as root type:
grep -i error --color=always /var/log/messages
Or try the following command to search all of the messages files including the archived messages files.
grep -i error --color=always /var/log/messages*
 
Once you have reviewed the errors, start filtering the messages to only display specific messages, and times.
 
To filter error messages from a specific month, for example January, try:
grep -i error --color=always /var/log/messages | grep -i Jan
 
To filter only ocfs2 error messages, try:
grep -i error --color=always /var/log/messages | grep -i ocfs2 --color=always
 
To filter only ocfs2 error messages in January, try:
grep -i error --color=always /var/log/messages | grep -i ocfs2 --color=always | grep -i Jan
 
Or to filter only ocfs2 error messages from a specific date and time, for example, February 2nd at 7 AM, try:
grep -i error --color=always /var/log/messages* | grep -i ocfs2 --color=always | grep -i "Feb  2 07"
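
The keyword, component, and timestamp filters above follow one pattern, which can be wrapped in a small function. This is a sketch under stated assumptions: the name filter_log and its argument order are invented for this example. It leaves out --color=always because that option inserts terminal escape sequences into the text that the next grep in the pipeline has to match against.

```shell
# Hypothetical helper: filter a log by keyword, then optionally by component
# and by a timestamp prefix. An empty pattern matches every line, so the
# optional arguments can simply be omitted.
# Usage: filter_log FILE KEYWORD [COMPONENT] [TIMESTAMP]
filter_log() {
    local file=$1 keyword=$2 component=${3:-} stamp=${4:-}
    grep -i -- "$keyword" "$file" |
        grep -i -- "$component" |
        grep -- "$stamp"
}

# Example: filter_log /var/log/messages error ocfs2 "Feb  2 07"
```

Note the two spaces in "Feb  2": syslog pads single-digit days, so the timestamp pattern must match that padding exactly.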
 
/var/log/messages Evict and Recovery Messages
Evict and Recovery messages in /var/log/messages indicate that an OCFS2 eviction event has occurred. Eviction and Recovery messages are actionable.

Searching and filtering /var/log/messages evict and recovery messages can be done centrally using an Enterprise Analytics solution, or locally using grep.
 
OCFS2 monitors the status of each Oracle VM server pool member using a network and storage heartbeat. If an Oracle VM server pool member fails to update or respond to network and/or storage heartbeats, the server pool member is evicted (also referred to as fenced) from the pool, then promptly reboots. OCFS2 evictions forcefully remove at-risk Oracle VM servers from a pool.
 
For example, to search for the current evict messages locally, as root type:
grep -i evict --color=always /var/log/messages
Or try the following command to search all of the messages files including the archived messages files.
grep -i evict --color=always /var/log/messages*
 
Once you have reviewed the messages, start filtering them to display only specific messages and times.

To filter evict messages from a specific month, for example January, try:
grep -i evict --color=always /var/log/messages | grep -i Jan
 
/var/log/messages Failed Messages
The failed messages logged in /var/log/messages are system wide, and should be investigated to confirm which failed messages are actionable.

Searching and filtering /var/log/messages failed messages can be done centrally using an Enterprise Analytics solution, or locally using grep.
 
For example, to search for the current failed messages locally, as root type:
grep -i failed --color=always /var/log/messages
Or try the following command to search all of the messages files including the archived messages files.
grep -i failed --color=always /var/log/messages*
 
Once you have reviewed the messages, start filtering them to display only specific messages and times.

To filter failed messages from a specific month, for example January, try:
grep -i failed --color=always /var/log/messages | grep -i Jan
 
To filter only multipath failed messages, try:
grep -i failed --color=always /var/log/messages | grep -i multipathd --color=always
 
To filter only multipath failed messages in January, try:
grep -i failed --color=always /var/log/messages | grep -i multipathd --color=always | grep -i Jan
 
Or to filter only multipath failed messages from a specific date and time, for example, January 28th at 9 AM, try:
grep -i failed --color=always /var/log/messages | grep -i multipathd --color=always | grep -i "Jan 28 09"
 
/var/log/messages Warning Messages
The warning messages logged in /var/log/messages are system wide, and should be investigated to confirm which warning messages are actionable.

Searching and filtering /var/log/messages warning messages can be done centrally using an Enterprise Analytics solution, or locally using grep.
 
For example, to search for the current warning messages locally, as root type:
grep -i warning --color=always /var/log/messages
Or try the following command to search all of the messages files including the archived messages files.
grep -i warning --color=always /var/log/messages*
 
Once you have reviewed the messages, start filtering them to display only specific messages and times.

To filter warning messages from a specific month, for example January, try:
grep -i warning --color=always /var/log/messages | grep -i Jan
 
To filter only bonding warning messages, try:
grep -i warning --color=always /var/log/messages | grep -i bond --color=always
 
To filter only bonding warning messages in January, try:
grep -i warning --color=always /var/log/messages | grep -i bond --color=always | grep -i Jan
 
Or to filter only bonding warning messages from a specific date and time, for example, January 28th at 9 AM, try:
grep -i warning --color=always /var/log/messages | grep -i bond --color=always | grep -i "Jan 28 09"
 

ROAD for Oracle VM Ready for Business Reporting Prerequisites and Configuration

The ROAD for Oracle VM ready for business reports are generated by a script named rfbreport.sh. On each Oracle VM server, copy the rfbreport.sh script into place, make it executable, edit the script's subject line and recipients list, and create a cron job to run the script. For example:
1) Copy the rfbreport.sh script to each Oracle VM server, in the /opt/mokum or /var/usr/local/sbin directory. Note: change the path to match your standard.
2) Make the rfbreport.sh script executable, i.e. "chmod +x rfbreport.sh" 
3) Edit the email description (i.e. enter something like Prod, Dev, Test, DR, SF, NYC, Amsterdam, etc..) and recipients list in the last line in the rfbreport.sh script. The next example shows the Description, and the recipients list. Both sections must be edited to be able to run the script. 
mail -v -s "$(echo -e "Description $hostname\nContent-Type: text/html")" email1@domain.com,email2@domain.com,email3@domain.com < $hostname.html
Replace the Description placeholder with a descriptive name for your environment (e.g. Prod, Test, QA, LA, NYC, ADAM, etc.), and replace the email recipient list with a single email address or a comma-separated list of email addresses. Please note that a space must be placed before the first email address, as shown in the example: text/html")" email1@domain.com, not text/html")"email1@domain.com.
4) Create a cron job on the Oracle VM server with the desired interval, user, and path to the rfbreport.sh script. If the report runs at any interval other than daily, enter the number of days to report on after the rfbreport.sh script entry. We recommend starting out with daily reports. Once your reports show all green lights, move to a weekly cron job. The entries below include a user field (root), which is the format used by /etc/crontab and by files in /etc/cron.d, so on each Oracle VM server, as root, add the entry you want to /etc/crontab. If you use crontab -e instead, omit the root field.
0 4 * * * root /opt/mokum/rfbreport.sh (this runs the Business Readiness Report as root daily from /opt/mokum/rfbreport.sh at 4 AM)
0 4 * * 0 root /opt/mokum/rfbreport.sh 7 (this runs the Business Readiness Report as root from /opt/mokum/rfbreport.sh each Sunday at 4 AM. The 7 at the end of the line tells the script to report on 7 days of data. All cron jobs other than daily require the number of days to be entered after the rfbreport.sh entry. Please change the cron job interval, the path to the rfbreport.sh script, the user that runs the rfbreport.sh script, and the number of days to report on to meet your requirements.)
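
Steps 1 and 2 above can be sketched as one small function, which is handy when the same setup has to be repeated on several servers. The function name install_rfbreport is made up for this example, and the destination directory is whatever your own standard is.

```shell
# Hypothetical helper covering steps 1 and 2: copy the script into place and
# make it executable. Usage: install_rfbreport SOURCE_FILE DEST_DIR
install_rfbreport() {
    local src=$1 dest_dir=$2
    mkdir -p "$dest_dir" &&
        cp "$src" "$dest_dir/rfbreport.sh" &&
        chmod +x "$dest_dir/rfbreport.sh"
}

# Example: install_rfbreport ./rfbreport.sh /opt/mokum
```

The subject line, the recipients list, and the cron entry still have to be edited by hand as described in steps 3 and 4.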
 
Note: Oracle VM Release 3.2 requires the sendmail RPM and its dependencies. Oracle VM Releases 3.3 and 3.4 ship with sendmail.
 

Appendix

When running rfbreport.sh I receive: /bin/bash^M: bad interpreter: No such file or directory
The ^M is a carriage return character. Linux uses the line feed character to mark the end of a line, whereas Windows uses the two-character sequence CR LF. The rfbreport.sh file has ended up with Windows line endings, which makes Bash generate the "/bin/bash^M: bad interpreter: No such file or directory" message.
Solution: Convert the rfbreport.sh file to Unix line endings. Using vim, open the rfbreport.sh script, enter :set ff=unix to set the 'fileformat' option, and save the file to remove the ^Ms. Alternatively, use the dos2unix command, i.e. dos2unix rfbreport.sh.
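
If neither vim nor dos2unix is at hand, the same fix can be sketched with standard tools. The function names has_crlf and strip_cr are invented for this example. The sketch removes every carriage return in the file, which is fine for a shell script but would be the wrong tool for binary files.

```shell
# Hypothetical helpers: detect and remove carriage returns in a script.
has_crlf() {
    grep -q $'\r' "$1"            # succeeds if any carriage return is present
}

strip_cr() {
    local tmp
    tmp=$(mktemp) || return 1
    # Write back through the original file so its permissions are kept.
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Example: has_crlf rfbreport.sh && strip_cr rfbreport.sh
```

Running has_crlf again afterwards confirms that the file is clean before the cron job picks it up.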