Tag Archives: monitoring

How to monitor NVMe drives in the OSX

NVMe support in OSX

After upgrade to the latest Macbook Pro i found that smartctl is not able to find any smart capable drive. This is because Apple replaces SATA SSD with NVMe one and old SMART API is not working (it is very ATA specific). Smartmontools itself includes NVMe support for the Linux, Windows and FreeBSD, so i decided to try to add it to the Darwin as well. However it was not as easy as expected – Apple did not published any source code or documentation about NVMe device support or monitoring. Moreover – there is no any tool in OSX to show such statistic and old tools from SDK are useless because of API Change.

Starting to search for the API provider

After looking on the file tree i have found good candidate: /System/Library/Extensions/NVMeSMARTLib.plugin. Its done more or less similar to the /System/Library/Extensions/SMARTLib.plugin/ which provides SATA/ATA SMART support. As i mentioned – there is no documentation, so i had to use otool, nm and lldb to deal with it. As expected, it was found that API is similar to the ATA one. You can get list of the symbols and functions using this command: nm NVMeSMARTLib | c++filt -p -i. So i tried to connect to it using modified example from SDK for the SMART. Tricky part was to find kIONVMeSMARTUserClientTypeID and kIONVMeSMARTInterfaceID values which are using by CFPLUGIN infrastructure in the IOKit to initialize API interface. Fortunately library comparing this data during runtime, so with disasm i been able to find them. After successful connect to the interface i been able to reconstruct missing headers and use some of the functions (see below)

What is working and what is not.

The most important functions SMARTReadData and GetIdentifyData are working and result is provided in the structures matching with NVMe standard. I was not able to get GetLogPage function running, probably it expects pointer to some structure with defined data. If apple will release any consumer of it it would be easy to find this out.

Also there are some other, unknown functions in this API: GetFieldCounters (always returns error), ScheduleBGRefresh (no parameters, returns ok), GetSystemCounters and GetAlgorithmCounters (some driver info? or vendor-specific log pages?). Another interesting finding was string “Sandisk 401Z128G-4p-MLC” in the SMART log page, so possibly this NVMe is originally from this vendor.

SmartMontools support

I am working to add limited NVMe support to the smartctl and smartd for OSX. Now i already have working prototype, but need to cleanup and refactor some code. I am planning to add this before the next release. Below is an output from my disk:

smartctl 6.6 2017-09-14 r4434M [Darwin 16.7.0 x86_64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Model Number:                       APPLE SSD AP0512J
Serial Number:                      XXXXXX
Firmware Version:                   16.14.01
PCI Vendor/Subsystem ID:            0x106b
IEEE OUI Identifier:                0x000502
Controller ID:                      0
Number of Namespaces:               2
Local Time is:                      Wed Sep 20 08:56:36 2017 CEST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0004):   Frmw_DL
Optional NVM Commands (0x0004):     DS_Mngmt
Maximum Data Transfer Size:         256 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     0.00W       -        -    0  0  0  0        0       0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x0)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    90%
Available Spare Threshold:          2%
Percentage Used:                    0%
Data Units Read:                    19,311,330 [9.88 TB]
Data Units Written:                 11,653,167 [5.96 TB]
Host Read Commands:                 50,388,833
Host Write Commands:                37,404,327
Controller Busy Time:               0
Power Cycles:                       2,320
Power On Hours:                     23
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Reconstructed API

If you want to play with the API yourself – you can use this header. Please let me know if you found how to use GetLogPage or any other useful information:

// NVMe definitions, non documented, experimental

// Constant to init driver
#define kIONVMeSMARTUserClientTypeID       CFUUIDGetConstantUUIDWithBytes(NULL,      \
                                        0xAA, 0x0F, 0xA6, 0xF9, 0xC2, 0xD6, 0x45, 0x7F, 0xB1, 0x0B, \
                    0x59, 0xA1, 0x32, 0x53, 0x29, 0x2F)

// Constant to use plugin interface
#define kIONVMeSMARTInterfaceID        CFUUIDGetConstantUUIDWithBytes(NULL,                  \
                    0xcc, 0xd1, 0xdb, 0x19, 0xfd, 0x9a, 0x4d, 0xaf, 0xbf, 0x95, \
                    0x12, 0x45, 0x4b, 0x23, 0xa, 0xb6)

// interface structure, obtained using lldb, could be incomplete or wrong
typedef struct IONVMeSMARTInterface
{
        IUNKNOWN_C_GUTS;

        UInt16 version;
        UInt16 revision;

                // NVMe smart data, returns nvme_smart_log structure
        IOReturn ( *SMARTReadData )( void *  interface,
                                     struct nvme_smart_log * NVMeSMARTData );

                // NVMe IdentifyData, returns nvme_id_ctrl per namespace
        IOReturn ( *GetIdentifyData )( void *  interface,
                                      struct nvme_id_ctrl * NVMeIdentifyControllerStruct,
                                      unsigned int ns );

                // Always getting kIOReturnDeviceError
        IOReturn ( *GetFieldCounters )( void *   interface,
                                        char * FieldCounters );
                // Returns 0
        IOReturn ( *ScheduleBGRefresh )( void *   interface);

                // Always returns kIOReturnDeviceError, probably expects pointer to some
                // structure as an argument
        IOReturn ( *GetLogPage )( void *  interface, void * data, unsigned int, unsigned int);


                /* GetSystemCounters Looks like a table with an attributes. Sample result:

                0x101022200: 0x01 0x00 0x08 0x00 0x00 0x00 0x00 0x00
                0x101022208: 0x00 0x00 0x00 0x00 0x02 0x00 0x08 0x00
                0x101022210: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x101022218: 0x03 0x00 0x08 0x00 0xf1 0x74 0x26 0x01
                0x101022220: 0x00 0x00 0x00 0x00 0x04 0x00 0x08 0x00
                0x101022228: 0x0a 0x91 0xb1 0x00 0x00 0x00 0x00 0x00
                0x101022230: 0x05 0x00 0x08 0x00 0x24 0x9f 0xfe 0x02
                0x101022238: 0x00 0x00 0x00 0x00 0x06 0x00 0x08 0x00
                0x101022240: 0x9b 0x42 0x38 0x02 0x00 0x00 0x00 0x00
                0x101022248: 0x07 0x00 0x08 0x00 0xdd 0x08 0x00 0x00
                0x101022250: 0x00 0x00 0x00 0x00 0x08 0x00 0x08 0x00
                0x101022258: 0x07 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x101022260: 0x09 0x00 0x08 0x00 0x00 0x00 0x00 0x00
                0x101022268: 0x00 0x00 0x00 0x00 0x0a 0x00 0x04 0x00
                .........
                0x101022488: 0x74 0x00 0x08 0x00 0x00 0x00 0x00 0x00
                0x101022490: 0x00 0x00 0x00 0x00 0x75 0x00 0x40 0x02
                0x101022498: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                */
        IOReturn ( *GetSystemCounters )( void *  interface, char *, unsigned int *);


                /* GetAlgorithmCounters returns mostly 0
                0x102004000: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004008: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004010: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004018: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004020: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004028: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004038: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004040: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004048: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004050: 0x00 0x00 0x00 0x00 0x80 0x00 0x00 0x00
                0x102004058: 0x80 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004060: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004068: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004070: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004078: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004080: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004088: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004090: 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00
                0x102004098: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

                */
        IOReturn ( *GetAlgorithmCounters )( void *  interface, char *, unsigned int *);
} IONVMeSMARTInterface;


Advertisements
Tagged , , , ,

Smartmontools daily builds

Sometime i need to audit some servers and often smartmontools is very old, not installed at all (and repositories are broken) or not working for some reasons. Thats one of the reasons why http://builds.smartmontools.org was created. You can download latest SVN builds for the following systems:

  • Darwin (OSX) package, Mach-O universal binary with 2 architectures: i386+x86_64
  • Win32 installer (32 and 64 bit)
  • Linux: i686,x86_64,static and dynamic
  • Source code

Service is now in “experimental” status, please report any issues with it here or on https://smartmontools.org.

Tagged , , , ,

Monitoring WAN status on OpenWRT using Alarm Pinger

Idea

I am connected to the Internet using wireless link which is sometime not very stable. I decided to monitor status of the link to make sure that I am aware of the problem. Initially i tried to monitor link with Monit or Nagios + fping, but results were not very good, this software is not designed for continues monitoring with very small interval. So I decided to find some alternatives.

About Alarm Pinger

I was using Alarm Pinger (apinger) with pfSense distribution — it was used to monitor WAN links to switch between them if needed.

Alarm Pinger (apinger) is a little tool which monitors various IP devices by simple ICMP echo requests. There are various other tools, that can do this, but most of them are shell or perl scripts, spawning many processes, thus much CPU-expensive, especially when one wants continuous monitoring and fast response on target failure. Alarm Pinger is a single program written in C, so it doesn’t need much CPU power even when monitoring many targets with frequent probes. Alarm Pinger supports both IPv4 and IPv6.

This tool supports multiply monitoring targets, external scripts, email notification, daemon mode. Only problem was that tool was not available as OpenWRT package. So i decided to port it.

OpenWRT port

After few tests I found, that code can be compiled with only few minor patches (autoconf related). You can grab Makefile for package from this pull request. Hopefully it will be integrated in the official packages feed soon. Update: port merged.
Port provides init.d script and sample configuration. In the feature I am also planning to make Luci integration to show link status from the web interface.

To buid package on Turris I would recommend to use my turris buildroot docker image.

## Service configuration

I am using very simple configuration to monitor status of the Wireless link using pings to the ISP gateway:

# we need to use root because "rainbow" tool fails to work from other uid. 
user "root"
group "root"

# status file with link quality information
status {
    file "/tmp/apinger.status"
    interval 1s
}
# command to run, with alarm type and reason
# if used with multiply targets %t needs to be added
alarm default {
    command on "/root/gateway.sh %A %r"
    command off "/root/gateway.sh %A %r"
}
# This alarm will be fired when target doesn't respond for 30 seconds.
alarm down "down" {
    time 30s
}
# This alarm will be fired when responses are delayed more than 80ms
# it will be canceled, when the delay drops below 50ms
alarm delay "delay" {
    delay_low 50ms
    delay_high 80ms
}
# This alarm will be fired when packet loss goes over 5%
# it will be canceled, when the loss drops below 3%
alarm loss "loss" {
    percent_low 3
    percent_high 5
}
target default {
    interval 1s
    avg_delay_samples 10
    avg_loss_samples 50
    avg_loss_delay_samples 20
    alarms "down","delay","loss"
}
# ISP Gateway host to monitor. You can define many targets in case of MultiWAN. 
target "1.2.3.4" {
    description "ISP Gateway"
}

Also I am using simple script to change WAN LED color in case of problems:

#!/bin/sh

DEF_COLOR=33FF33 # see https://gitlab.labs.nic.cz/turris/rainbow/blob/master/turris.c
WARNING_COLOR=FFFF00 # yellow
DOWN_COLOR=red
RAINBOW=/usr/bin/rainbow

logger "event: $@"
# read data from status file
STATUS=`grep  "Active alarms:" /tmp/apinger.status`

case "$@" in
"delay ALARM")
  touch /tmp/apinger.delay.flag
  ;;
"delay alarm canceled")
  rm -f /tmp/apinger.delay.flag
  ;;
"down ALARM")
  touch /tmp/apinger.down.flag
  ;;
"down alarm canceled")
  rm -f /tmp/apinger.down.flag
  ;;
"loss ALARM")
  touch /tmp/apinger.loss.flag
  ;;
"loss alarm canceled")
  rm -f /tmp/apinger.loss.flag
  ;;
esac
# link is down
if [ -e /tmp/apinger.down.flag ]; then
  ${RAINBOW} wan ${DOWN_COLOR}
  exit
fi
# loss or delay
if [ -e /tmp/apinger.loss.flag -o -e /tmp/apinger.delay.flag ]; then
  ${RAINBOW} wan ${WARNING_COLOR}
  exit
fi
# no active alarms found
${RAINBOW} wan ${DEF_COLOR}

This works pretty good – if line is down – WAN color is red, if it is unstable or congested – yellow. We can also monitor link status manually:

root@turris:~# cat /tmp/apinger.status
Fri Apr 10 12:39:24 2015

Target: 1.2.3.4
Description: ISP Gateway
Last reply received: #2876 Fri Apr 10 12:39:23 2015
Average delay: 3.247ms
Average packet loss: 0.0%
Active alarms: None
Received packets buffer: ################################################## ###################.

Todo

I am planning to extend functionality of the script with some cool features:

  • Integrate with Luci to show status in the web interface.
  • Add support for the failover switch to the LTE channel if link is down (and LTE dongle connected).
  • Enable rrdtools support provided by apinger.
Tagged , , , ,

How to access Integrated Management Module on IBM System x3650 M3 server under FreeBSD

IBM System x3650 M3 server provides nice looking Integrated Management Module (IMM) GUI/CLI which can be accessed remotely (using dedicated network interface) or directly from host. In this short article I will describe how to do this from FreeBSD host machine.All tests were done with FreeBSD 10.1-RELEASE-p6 using GENERIC kernel.

  1. We will need to find virtual network card provided by IMM (RNDISCDC ETHER IBM):
    root@host /root]# usbconfig
    ugen0.1: <UHCI root HUB Intel> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=SAVE (0mA)
    ugen2.1: <EHCI root HUB Intel> at usbus2, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
    ugen1.1: <UHCI root HUB Intel> at usbus1, cfg=0 md=HOST spd=FULL (12Mbps) pwr=SAVE (0mA)
    ugen4.1: <UHCI root HUB Intel> at usbus4, cfg=0 md=HOST spd=FULL (12Mbps) pwr=SAVE (0mA)
    ugen3.1: <UHCI root HUB Intel> at usbus3, cfg=0 md=HOST spd=FULL (12Mbps) pwr=SAVE (0mA)
    ugen6.1: <EHCI root HUB Intel> at usbus6, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
    ugen5.1: <UHCI root HUB Intel> at usbus5, cfg=0 md=HOST spd=FULL (12Mbps) pwr=SAVE (0mA)
    ugen3.2: <RNDISCDC ETHER IBM> at usbus3, cfg=1 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)
    In our case it is ugen3.2.
  2. This USB device supports 2 USB configuration – default (and active on boot) – RNDIS or alternate – CDC. FreeBSD works fine with CDC, so we need to switch this USB device to it:
    root@host /root]# usbconfig -d ugen3.2 set_config 1
    After this device should be detected by FreeBSD and dmesg should contain something like this:
    umodem0: at uhub4, port 2, addr 2 (disconnected)
    cdce0: on usbus3
    ue0: on cdce0
    ue0: Ethernet address: e6:1f:13:5e:ab:cd
  3. Now only thing left is to run dhclient on the new network interface:
    [root@host /root]# dhclient ue0
    DHCPREQUEST on ue0 to 255.255.255.255 port 67
    DHCPACK from 169.254.95.118
    bound to 169.254.95.120 -- renewal in 300 seconds.

    Here we can see that address of the IMM is 169.254.95.118. We can use it to connect with telnet or https to get IMM interface.
  4. Username should be admin, and password could be changed using “ipmitool” utility:
    [root@host /root]# ipmitool user set password 2

Thats it 🙂 Using IMM you can manage your hardware, monitor server and do many other interesting things.

Tagged , , , ,