Troubleshooting
I will outline the steps I use to identify an issue and provide direct links to the documentation that describes the solution.
First, we need to check the GPFS state on the node.
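A minimal sketch of this first check, using the `mmgetstate` command. The helper name `gpfs_state` is hypothetical, and the sketch assumes the GPFS commands are installed under /usr/lpp/mmfs/bin, so verify the path on your system.

```shell
# Hypothetical helper: show the GPFS daemon state.
# Assumption: GPFS binaries live in /usr/lpp/mmfs/bin (verify on your system).
PATH="$PATH:/usr/lpp/mmfs/bin"

gpfs_state() {
    if ! command -v mmgetstate >/dev/null 2>&1; then
        echo "mmgetstate not found; is GPFS installed on this node?" >&2
        return 127
    fi
    # With no arguments this reports the local node; pass -a for all nodes.
    mmgetstate "$@"
}
```

Calling `gpfs_state` checks the local node and `gpfs_state -a` checks every node. A healthy node reports the state `active`.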
The second step is to run the mmhealth command to get an overview of the issue.
Cluster Level
Node Level
Event Analysis
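The three views above can be sketched as one small script. It assumes each label maps to the mmhealth subcommand named elsewhere on this page (`cluster show`, `node show`, `node eventlog`). The helper name `health_overview` is hypothetical.

```shell
# Hypothetical helper: run the three mmhealth views in order.
health_overview() {
    echo "== Cluster level =="
    mmhealth cluster show

    echo "== Node level =="
    mmhealth node show

    echo "== Event analysis =="
    mmhealth node eventlog
}
```

Going from the cluster view down to the node event log narrows the search: first find the unhealthy component, then the node that reports it, then the event that explains it.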
Once we have the event type, we can look into the details.
Logs
The GPFS log can be found in the /var/adm/ras directory on each node. The GPFS log file is named mmfs.log.date.nodeName, where date is the time stamp when the instance of GPFS started on the node and nodeName is the name of the node. The latest GPFS log file can be found by using the symbolic file name /var/adm/ras/mmfs.log.latest.
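As a quick way to read the newest entries, here is a small sketch built on the symbolic name above. The helper `gpfs_log_tail` and its default of 50 lines are my own choices, not something GPFS provides.

```shell
# Hypothetical helper: print the last lines of the newest GPFS log.
# Usage: gpfs_log_tail [lines] [logfile]
gpfs_log_tail() {
    lines="${1:-50}"
    log="${2:-/var/adm/ras/mmfs.log.latest}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    tail -n "$lines" "$log"
}
```

For example, `gpfs_log_tail 100` prints the last 100 lines of the latest log on the current node. The second argument lets you point at an older mmfs.log file from the same directory.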
Monitoring Events
The recorded events are stored in the local database on each node. The user can get a list of recorded events by using the mmhealth node eventlog command. Users can use the mmhealth node show or mmhealth cluster show commands to display the active events in the node and cluster respectively.
When you run mmhealth node eventlog, each entry is listed with its Event Name. For each event type, the RAS events list gives more details and recommended solutions, which can be very helpful for locating the issue.
disk_down Event
For example, if we have a disk_down event, we can go to the list of disk events, where we will find the cause and the recommended user action.
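The same details can also be queried on the node with `mmhealth event show`. This is a minimal sketch: the wrapper `explain_event` is a hypothetical name, and the exact fields printed may differ between releases.

```shell
# Hypothetical helper: look up the details of a single event by name.
explain_event() {
    if [ -z "$1" ]; then
        echo "usage: explain_event EVENT_NAME" >&2
        return 2
    fi
    mmhealth event show "$1"
}
```

For the example above, `explain_event disk_down` prints the event description together with its cause and user action.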
Log Dump
A log dump may be needed for an incident report or to send to IBM support for diagnostics.
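One way to collect such a dump is the `gpfs.snap` command. The sketch below is an assumption about how you might wrap it: the helper name `collect_snap` and the output directory argument are hypothetical, and the `-d` option should be checked against the gpfs.snap documentation for your release.

```shell
# Hypothetical helper: collect a GPFS snapshot into a chosen directory.
# Usage: collect_snap OUTPUT_DIR [extra gpfs.snap options]
collect_snap() {
    outdir="$1"
    if [ -z "$outdir" ]; then
        echo "usage: collect_snap OUTPUT_DIR [options]" >&2
        return 2
    fi
    shift
    mkdir -p "$outdir" || return 1
    # -d sets the output directory; remaining options are passed through.
    gpfs.snap -d "$outdir" "$@"
}
```

The snapshot can be large, so pick an output directory with enough free space before running it.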
You can also look into the trace facility.
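A rough sketch of tracing around a reproducer with `mmtracectl`. The helper `trace_while` is hypothetical. Check the mmtracectl documentation for node selection and trace settings before using it on a production cluster, since tracing can affect performance.

```shell
# Hypothetical helper: start tracing, run a command, then stop tracing.
# Usage: trace_while COMMAND [ARGS...]
trace_while() {
    mmtracectl --start || return 1

    # Run the reproducer and remember its exit status.
    status=0
    "$@" || status=$?

    # Always stop tracing, even if the reproducer failed.
    mmtracectl --stop
    return "$status"
}
```

Keeping the traced window as short as possible makes the resulting trace files smaller and easier to analyze.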
Extra Commands
CCR
References
RAS stands for Reliability, Availability, and Serviceability