Troubleshooting

This page outlines the steps I use to identify an issue and provides direct links to the relevant documentation where each solution is described.

First, check the GPFS state on the node.

GPFS State on the Node
#mmgetstate
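The state check can be scripted across the cluster; a minimal sketch, assuming the usual tabular mmgetstate -a output (header lines first, state in the third column), that flags any node whose daemon is not active:

```shell
# Flag nodes whose GPFS daemon is not "active".
# Column positions are an assumption -- verify against your mmgetstate output.
mmgetstate -a | awk 'NR > 3 && $3 != "active" { print $2 " is " $3 }'
```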

The second step is to run the mmhealth command to get an overview of the issue.

Cluster Level

Cluster Health Status
#mmhealth cluster show --verbose

Node Level

All Nodes Health Status
#mmhealth node show -N all
Node Health Status
#mmhealth node show --verbose

Event Analysis

Node Event Logs
#mmhealth node eventlog --verbose
Once we have the event name, we can look into its details.

Details of the quorum_down event
#mmhealth event show quorum_down
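When scanning a long event log, it can help to filter for non-informational entries first; a hedged sketch (the severity keywords WARNING and ERROR are assumptions about the eventlog output format on your version):

```shell
# Show only warning/error entries from the local event log.
mmhealth node eventlog --verbose | egrep 'WARNING|ERROR'
```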

Logs

The GPFS log can be found in the /var/adm/ras directory on each node. The GPFS log file is named mmfs.log.date.nodeName, where date is the time stamp when the instance of GPFS started on the node and nodeName is the name of the node. The latest GPFS log file can be found by using the symbolic file name /var/adm/ras/mmfs.log.latest.

mmfs log
#/var/adm/ras/mmfs.log.latest
System health monitor log
#/var/adm/ras/mmsysmonitor.localhost.log
Operating system error log (simply grep for mmfs entries)
#grep "mmfs:" /var/log/messages
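On systemd-based distributions the same messages usually land in the journal as well, so an equivalent check is possible there; a sketch:

```shell
# Same filter applied to the systemd journal instead of /var/log/messages.
journalctl | grep 'mmfs:'
```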
CCR logs
#/var/mmfs/ccr/*
winbind logs
#egrep '([0-9]{1,3}\.){3}[0-9]{1,3}$' /var/adm/ras/log.wb-<domain>
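The pattern above keeps only lines that end with an IPv4-looking dotted quad; a quick self-contained sanity check of the regex itself:

```shell
# Only the line ending in an address survives the filter.
printf 'bind to 192.168.1.10\nno address here\n' \
  | egrep '([0-9]{1,3}\.){3}[0-9]{1,3}$'
```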

Monitoring Events

The recorded events are stored in a local database on each node. You can get a list of recorded events with the mmhealth node eventlog command, and display the active events with mmhealth node show or mmhealth cluster show for the node and cluster, respectively.

When using mmhealth node eventlog, you will be presented with an Event Name for each event. For each event type, more details and solution recommendations are available in the RAS events list. This can be very helpful for locating the issue.

disk_down Event

For example, if we have a disk_down event, we can go to the list of disk events, where we will find the cause and recommended user action.
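After checking the recommended action, you can confirm which disks are actually affected; a minimal sketch, where fs1 is a placeholder file system name and the "down" availability column is an assumption based on typical mmlsdisk output:

```shell
# Print the names of disks whose availability column reads "down".
# "fs1" is a placeholder -- substitute your file system name.
mmlsdisk fs1 | awk '/ down / { print $1 }'
```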

Log Dump

Use these when debug data is needed for an incident report or for sending to IBM support for diagnostics.

Lenovo DSS-G specific debug data collection
#dssg.snap
Creating a master GPFS log file
#gpfs.snap --gather-logs -d /tmp/logs -N all

You can also look into the Trace facility.

Extra Commands

CCR

CCR Check
#mmccr check -e -Y
CCR Nodes
#mmccr lsnodes

References

RAS stands for Reliability, Availability, and Serviceability

Master Log

Troubleshooting Overview

Events

Event Types

CCR