Look for the problematic LID in the ibdiagnet report log file. A problematic LID usually shows up with “symbol” errors.
In this example, the problematic LID is LID 83, and its GID is fe80::ec0d:9a03:22:4850.
Next, to find what is connected to this switch, run the following command:
Within less, search for the last four characters of the GID, 4850.
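The same lookup can be scripted when the command output has been saved to a text file. The following is only a minimal sketch: the function names are hypothetical, the sample lines are placeholders rather than real tool output, and it assumes that the last GID group also appears verbatim in the line of interest, as the manual search relies on.

```python
def gid_search_key(gid: str) -> str:
    """Return the last colon-separated group of a GID, e.g. '4850'."""
    # rsplit from the right so a leading "GID:" label or "::" does not matter.
    return gid.strip().rsplit(":", 1)[-1]


def matching_lines(text: str, key: str) -> list:
    """Return every line of text that contains key (case-insensitive)."""
    key = key.lower()
    return [line for line in text.splitlines() if key in line.lower()]


if __name__ == "__main__":
    key = gid_search_key("GID:fe80::ec0d:9a03:22:4850")
    # Placeholder text standing in for the saved command output.
    saved_output = "placeholder entry ending in 4850\nplaceholder entry for another device\n"
    for line in matching_lines(saved_output, key):
        print(line)
```

Note that a GID group can be shorter than four characters when leading zeros are dropped (for example the group 22 in this GID), so the key is taken as the whole last group rather than a fixed four characters.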
The search brings up information about all active, down, and disabled ports on the switch with LID 83:
The first two columns on the left show the LID and the port on this switch. Columns 10 and 11 show the remote LID and port that each port on this switch is connected to.
For example, LID 83 port 34 is connected to LID 424 port 33. This is the connection between the FDR14 and EDR racks.
As another example, LID 83 port 6 is connected to LID 105 port 28, which is an unmanageable leaf switch.
If you see the name “SwitchIB Mellanox Technologies” per Redline, this means an unmanageable switch.
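The column layout described above can be read programmatically. This is a rough sketch under stated assumptions: the Link type and parse_link function are hypothetical, lines are assumed to be whitespace-separated with bare integers in columns 1, 2, 10, and 11, and the sample line uses placeholder tokens (c3 to c12) instead of real output. Real output may need extra cleanup before the fields can be converted.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    local_lid: int
    local_port: int
    remote_lid: int
    remote_port: int


def parse_link(line: str) -> Optional[Link]:
    """Parse one line; return None if it has no usable remote LID and port."""
    fields = line.split()
    try:
        # Columns are counted from 1 in the text, so column 10 is fields[9].
        return Link(int(fields[0]), int(fields[1]), int(fields[9]), int(fields[10]))
    except (IndexError, ValueError):
        # Too few columns or non-numeric values, e.g. a port with no peer.
        return None


if __name__ == "__main__":
    # Placeholder line: only columns 1, 2, 10, and 11 carry meaning here.
    sample = "83 34 c3 c4 c5 c6 c7 c8 c9 424 33 c12"
    print(parse_link(sample))
```

Returning None for short or non-numeric lines is a design choice so that down or disabled ports, which may not list a remote end, are skipped instead of raising an error.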
From John's email above, it is noticeable that most of the restarts involve LID 529. In the most recent screenshot, the line with LID 529 in column 10 shows that LID 83 port 27 is connected to LID 529 port 33.
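The reverse lookup, starting from a remote LID and finding the local port it hangs off, can be sketched the same way. The function name is hypothetical, the rows are placeholders built from the connections mentioned in this note, and the same assumption applies: whitespace-separated lines with the remote LID in column 10 and the remote port in column 11.

```python
def links_to_remote_lid(lines, remote_lid):
    """Return (local LID, local port, remote port) for rows whose column 10 matches."""
    wanted = str(remote_lid)
    hits = []
    for line in lines:
        fields = line.split()
        # Column 10 is fields[9]; skip rows that do not reach column 11.
        if len(fields) >= 11 and fields[9] == wanted:
            hits.append((fields[0], fields[1], fields[10]))
    return hits


if __name__ == "__main__":
    # Placeholder rows: tokens c3 to c9 stand in for columns not used here.
    rows = [
        "83 34 c3 c4 c5 c6 c7 c8 c9 424 33",
        "83 6 c3 c4 c5 c6 c7 c8 c9 105 28",
        "83 27 c3 c4 c5 c6 c7 c8 c9 529 33",
    ]
    for local_lid, local_port, remote_port in links_to_remote_lid(rows, 529):
        print(f"LID {local_lid} port {local_port} -> LID 529 port {remote_port}")
```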