FAT47's Bottom-Tier Infrastructure Minutes

A notebook of things I've learned

A Winter Day, 2014: The Day the ioDrive Broke

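One day

A MySQL slave server fitted with an ioDrive suddenly died.

Or rather, replication had stopped.

The DB was not being referenced by the service,
so its being down caused no real problem.
The card in use this time was an ioDrive Duo.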

Checking /var/log/messages
First, I took a look at the system log:

Jan 27 05:39:36 hoge-dbs kernel: fioinf HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer read had error -1024
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer error -1024 during read on eb 3408
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:-  Due to simultaneous multiple device failures in EB 3408, the location of this
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:-  error in the filesystem can not be easily determined. It is suggested that
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:-  all data be checked to find the bad block and overwritten.
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:-  For best results do not reboot the device until this is done.
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: Groomer was unable to groom EB 3408 after 5 retries: 5 sectors were ungroomed

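The log was full of suspicious-looking entries, so I worked through the checks below using this article as a reference:

"Fusion-IO ioDriveの障害例と調査方法" (Fusion-IO ioDrive failure examples and how to investigate them)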
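To pick the driver errors out of a busy /var/log/messages, something like the following could help. This is only a minimal sketch: the `fioerr`/`fioinf` tags and the "eb 3408" / "EB 3408" wording are taken from the log excerpt above, while the function name `summarize_fio_errors` and the regular expressions are my own assumptions and may not match other driver versions.

```python
import re

# Tags as they appear in the log excerpt above; other tags may exist
# in other driver versions (not covered here).
FIO_LINE = re.compile(r"kernel: (fioerr|fioinf) ")
# Erase block references appear as "eb 3408" or "EB 3408" in the excerpt.
ERASE_BLOCK = re.compile(r"\beb (\d+)", re.IGNORECASE)


def summarize_fio_errors(lines):
    """Return (number of fioerr lines, sorted erase-block numbers they mention)."""
    error_count = 0
    erase_blocks = set()
    for line in lines:
        match = FIO_LINE.search(line)
        if not match or match.group(1) != "fioerr":
            continue  # skip informational and unrelated lines
        error_count += 1
        erase_blocks.update(int(n) for n in ERASE_BLOCK.findall(line))
    return error_count, sorted(erase_blocks)


sample = [
    "Jan 27 05:39:36 hoge-dbs kernel: fioinf HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer read had error -1024",
    "Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer error -1024 during read on eb 3408",
    "Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: Groomer was unable to groom EB 3408 after 5 retries: 5 sectors were ungroomed",
]
count, blocks = summarize_fio_errors(sample)
print(count, blocks)  # 2 [3408]
```

On a real host the lines would come from reading /var/log/messages instead of the hard-coded sample.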

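Checking whether the data area is writable
The device was mounted at /data, so I checked whether data could be written there. At that point, writes were still succeeding.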
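I did this check by hand, but it could be scripted roughly as follows. Treat it as a sketch under assumptions: the helper name `can_write` is hypothetical, and the example runs against the system temp directory as a stand-in for /data. Note also that the read-back is most likely served from the page cache, so a passing result does not prove the flash media is healthy (as this incident shows).

```python
import os
import tempfile


def can_write(directory, payload=b"write-test"):
    """Create a temp file in `directory`, fsync it, and read it back."""
    try:
        fd, path = tempfile.mkstemp(dir=directory, prefix="write-test-")
    except OSError:
        return False  # directory missing, read-only, or I/O error
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data down to the device
        with open(path, "rb") as handle:
            return handle.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.remove(path)  # clean up the probe file
        except OSError:
            pass


# Stand-in for "/data" so the example runs anywhere.
print(can_write(tempfile.gettempdir()))
```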

Checking fio-status

# fio-status
Found 2 ioDrives in this system with 1 ioDrive Duo
Fusion-io driver version: 2.3.10 build 110

Adapter: ioDrive Duo
        HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
        External Power: NOT connected
        PCIE Power limit threshold: 24.75W
        Sufficient power available: Unknown
        Connected ioDimm modules:
          fct0: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
          fct1: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx

fct0    Attached as 'fioa' (block device)
        HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
        Located in slot 0 Upper of ioDrive Duo SN:xxxxx
        PCI:09:00.0
        Firmware v5.0.7, rev 107053
        322.55 GBytes block device size, 396 GBytes physical device size
        Sufficient power available: Unknown
        Internal temperature: 51.7 degC, max 56.1 degC
        Media status: Healthy; Reserves: 100.00%, warn at 10.00%

fct1    Attached as 'fiob' (block device)
        HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
        Located in slot 1 Lower of ioDrive Duo SN:xxxxx
        PCI:0a:00.0
        Firmware v5.0.7, rev 107053
        322.55 GBytes block device size, 396 GBytes physical device size
        Sufficient power available: Unknown
        Internal temperature: 52.2 degC, max 62.0 degC
        Media status: Healthy; Reserves: 100.00%, warn at 10.00%

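Hmm, it all looks healthy.

* I checked some other logs as well, but I forgot to save them
before rebooting the OS, so I'll leave them out.

Trying an OS reboot
Since the server wasn't being used by the service, I decided to try rebooting the OS.
The OS came back up without trouble.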

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             547G  4.5G  515G   1% /
tmpfs                  16G     0   16G   0% /dev/shm
/dev/sda1             504M   63M  416M  14% /boot

Farewell, dear /data...
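Spotting a missing mount by eye is easy enough here, but the same check could be sketched in code. The function name `mount_points` is my own, and it assumes that each `df` row fits on one line and that mount points contain no spaces.

```python
def mount_points(df_output):
    """Return the last column (mount point) of each row of `df -h` output."""
    points = []
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            points.append(fields[-1])
    return points


# The df -h output from after the reboot (header shown in English).
sample = """\
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             547G  4.5G  515G   1% /
tmpfs                  16G     0   16G   0% /dev/shm
/dev/sda1             504M   63M  416M  14% /boot
"""
print("/data" in mount_points(sample))  # False
```

On the live host, `os.path.ismount("/data")` from the standard library answers the same question without parsing any text.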

One more fio-status

# fio-status

Found 2 ioDrives in this system with 1 ioDrive Duo
Fusion-io driver version: 2.3.10 build 110

Adapter: ioDrive Duo
        HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
        External Power: NOT connected
        PCIE Power limit threshold: 24.75W
        Sufficient power available: Unknown
        Connected ioDimm modules:
          fct0: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
          fct1: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx

fct0    Not attached
        HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
        Located in slot 0 Upper of ioDrive Duo SN:xxxxx
        PCI:09:00.0
        Firmware v5.0.7, rev 107053
        322.55 GBytes block device size, 396 GBytes physical device size
        Sufficient power available: Unknown
        Internal temperature: 51.7 degC, max 56.1 degC
        Media status: Healthy; Reserves: 100.00%, warn at 10.00%

fct1    Attached as 'fiob' (block device)
        HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
        Located in slot 1 Lower of ioDrive Duo SN:xxxxx
        PCI:0a:00.0
        Firmware v5.0.7, rev 107053
        322.55 GBytes block device size, 396 GBytes physical device size
        Sufficient power available: Unknown
        Internal temperature: 52.2 degC, max 62.0 degC
        Media status: Healthy; Reserves: 100.00%, warn at 10.00%

Farewell, dear fct0...
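The only difference between the two fio-status outputs is the "Not attached" line, while Media status still says Healthy. For monitoring, a rough parser like the one below could flag that. It is a sketch based purely on the output format shown in this post (driver 2.3.10); the function name `parse_fio_status` is hypothetical and other versions may print things differently.

```python
import re

# Device sections start at column 0, e.g. "fct0    Not attached".
DEVICE_HEADER = re.compile(r"^(fct\d+)\s+(.*)$")
# Indented detail line, e.g. "Media status: Healthy; Reserves: ...".
MEDIA_STATUS = re.compile(r"^\s+Media status: ([^;]+);")


def parse_fio_status(text):
    """Map each fctN device to its attach state and media status."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = DEVICE_HEADER.match(line)
        if header:
            current = header.group(1)
            state = header.group(2).strip()
            devices[current] = {
                "attached": state.startswith("Attached"),
                "state": state,
                "media": None,
            }
            continue
        media = MEDIA_STATUS.match(line)
        if media and current:
            devices[current]["media"] = media.group(1)
    return devices


# Trimmed version of the post-reboot output above.
sample = """\
fct0    Not attached
        Media status: Healthy; Reserves: 100.00%, warn at 10.00%

fct1    Attached as 'fiob' (block device)
        Media status: Healthy; Reserves: 100.00%, warn at 10.00%
"""
status = parse_fio_status(sample)
print([name for name, info in status.items() if not info["attached"]])  # ['fct0']
```

The takeaway from this incident is that checking "Media status" alone would have missed the failure, so the attach state is worth watching too.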

Trying to attach fct0

# fio-attach /dev/fct0
Attaching: [====================] (100%) \
Error: failed to attach /dev/fct0. (4)

No luck, as expected.


Oh, fct0... rest in peace, forever...

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::。:::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::。::::::...... ...   --─-  :::::::::::::::::::: ..::::: . ..::::::::
:::::::::::::::::...... ....:::::::゜::::::::::..   (___ )(___ ) ::::。::::::::::::::::: ゜.::::::::::::
:. .:::::。:::........ . .::::::::::::::::: _ i/ = =ヽi :::::::::::::。::::::::::: . . . ..::::
:::: :::::::::.....:☆彡::::   //[||    」  ||]  ::::::::::゜:::::::::: ...:: :::::
 :::::::::::::::::: . . . ..: :::: / ヘ | |  ____,ヽ | | :::::::::::.... .... .. .::::::::::::::
::::::...゜ . .:::::::::  /ヽ ノ    ヽ__/  ....... . .::::::::::::........ ..::::
:.... .... .. .     く  /     三三三∠⌒>:.... .... .. .:.... .... ..
:.... .... ..:.... .... ..... .... .. .:.... .... .. ..... .... .. ..... ............. .. . ........ ......
:.... . ∧∧   ∧∧  ∧∧   ∧∧ .... .... .. .:.... .... ..... .... .. .
... ..:(   )ゝ (   )ゝ(   )ゝ(   )ゝ You pushed yourself too hard... ..........
....  i⌒ /   i⌒ /  i⌒ /   i⌒ / .. ..... ................... .. . ...
..   三  |   三  |   三  |   三 |  ... ............. ........... . .....
...  ∪ ∪   ∪ ∪   ∪ ∪  ∪ ∪ ............. ............. .. ........ ...
  三三  三三  三三   三三
 三三  三三  三三   三三

I have been running quite a number and variety of Fusion-io ioDrives for over a year,
but this was the first time I had run into a failure.

I collected a fio-bugreport.