First, what is multipathing (multi-path), and why does it exist? Before multipathing, a host's disks were attached directly to a single bus (such as PCI), and the relationship between path and device was one-to-one: one path led to one disk or storage device. That one-to-one model is simple for the operating system to handle, but it provides no redundancy. With Fibre Channel networks (the classic SAN), or IP SANs built on iSCSI, the host and the storage are connected through Fibre Channel switches or through multiple NICs and IP addresses, so the I/O channels become a many-to-many mesh: a single host can reach a single storage device over several paths. When all of these paths are active at the same time, something has to decide how I/O is distributed and scheduled across them, how the traffic is load-balanced, and how active/standby roles are handled. That is the problem multipath software was created to solve.
Working together with the storage device, multipath software mainly provides the following functions:
1. Failover and failback
2. I/O load balancing
3. Disk virtualization
On Linux, both the Red Hat and SUSE 2.6 kernels ship with a free multipath software package, and ESX also includes multipathing for free, whereas on Windows you have to buy an MPIO software license to use it. The multipath features on Windows and ESX are driven by graphical interfaces and are fairly simple, so they are not covered here; this article describes how to configure multipath in a Linux environment.
I. Multipath tools and parameters on Linux:
1. device-mapper-multipath (multipath-tools): provides the multipathd and multipath utilities and the multipath.conf configuration file. These tools create and configure multipath maps through the device mapper's ioctl interface; the multipath device nodes they create appear under /dev/mapper.
2. device-mapper: consists of two major parts, a kernel part and a userspace part. The kernel part is made up of the device mapper core (dm.ko) and a set of target drivers (such as dm-multipath.ko). The core performs the device mapping, while each target handles the I/O coming down from the mapped device according to the mapping and its own characteristics. The kernel part also exposes an ioctl interface through which userspace can direct the kernel driver's behaviour, for example how to create mapped devices and what attributes they should have. The userspace part is mainly the device-mapper package itself, which contains the dmsetup tool and the libraries used to create and configure mapped devices; these libraries abstract and wrap the ioctl interface so that mapped devices can be created and configured conveniently, and the multipath-tools programs are built on top of them.
3. dm.ko and dm-multipath.ko: dm.ko is the device mapper driver itself and the foundation of multipathing; dm-multipath.ko is simply one of dm's target drivers.
4. scsi_id: shipped in the udev package. multipath.conf can be configured to call this program to obtain the identifier of a SCSI device; when several paths return the same identifier, they correspond to the same device, which is the key to how multipathing works. scsi_id queries a device's identifier by sending an EVPD page 0x80 or page 0x83 INQUIRY command through the sg driver. Some devices do not support these EVPD INQUIRY commands and therefore cannot be used to build multipath devices as-is; in that case scsi_id can be rewritten to synthesize an identifier for such devices and print it to standard output. When multipath creates a multipath device it calls scsi_id and reads the SCSI ID from its standard output, so a rewritten scsi_id must also return an exit status of 0, because multipath checks that value to decide whether the SCSI ID was obtained successfully.
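For example, you can query scsi_id directly on each path and compare the output. A small sketch, assuming the RHEL 6 udev build of scsi_id (the same options appear in the getuid_callout example later in this article); adjust the path and options for other distributions:
# /lib/udev/scsi_id --whitelisted --device=/dev/sdb
# /lib/udev/scsi_id --whitelisted --device=/dev/sde
Both paths to the same LUN should print the same identifier, which is the WWID that multipath -ll shows in parentheses later on.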
II. Basic multipath configuration on Red Hat 6.2:
1. Run lsmod | grep dm_multipath to check whether multipath is installed correctly. If there is no output, the packages are not installed; install them with yum: yum -y install device-mapper device-mapper-multipath
Then run multipath -ll to look at the multipath status and check whether the module has been loaded:
[root@liujing ~]# multipath -ll    # check the multipath status
Mar 10 19:18:28 | /etc/multipath.conf does not exist, blacklisting all devices.
Mar 10 19:18:28 | A sample multipath.conf file is located at
Mar 10 19:18:28 | /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf
Mar 10 19:18:28 | You can run /sbin/mpathconf to create or modify /etc/multipath.conf
Mar 10 19:18:28 | DM multipath kernel driver not loaded    <-- the DM module is not loaded
If the module is not loaded, initialize DM with the following commands, or reboot the system.
Use the following commands to initialize and start DM for the first time:
# modprobe dm-multipath
# modprobe dm-round-robin
# service multipathd start
# multipath -v2
After initializing, run multipath -ll again to check that the module is now loaded:
[root@liujing ~]# multipath -ll
Mar 10 19:21:14 | /etc/multipath.conf does not exist, blacklisting all devices.
Mar 10 19:21:14 | A sample multipath.conf file is located at
Mar 10 19:21:14 | /usr/share/doc/device-mapper-multipath-0.4.9/multipath.conf
Mar 10 19:21:14 | You can run /sbin/mpathconf to create or modify /etc/multipath.conf
(The "DM multipath kernel driver not loaded" message no longer appears, which means the DM module has been loaded successfully.)
As the messages above show, the DM module is now loaded, but there is no multipath.conf configuration file under /etc/; the next step is to create and configure multipath.conf.
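Incidentally, the log output above already points at /sbin/mpathconf. On RHEL 6 it can generate a default /etc/multipath.conf and start the multipathd service in one step; a hedged alternative to the hand-written file in the next step (flag names as documented for the RHEL 6 mpathconf):
# mpathconf --enable --user_friendly_names y --with_multipathd y    # write /etc/multipath.conf and start multipathd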
2. Configure multipath:
Use vi to create the configuration file at /etc/multipath.conf and add the minimal configuration multipath needs to work:
vi /etc/multipath.conf
blacklist {
devnode "^sda"
}
defaults {
user_friendly_names yes
path_grouping_policy multibus
failback immediate
no_path_retry fail
}
After editing and saving the configuration, start the service:
# start the multipathd service
# /etc/init.d/multipathd start
If the service fails to start and no [ OK ] prompt appears, like this:
[root@liujing mapper]# service multipathd start
Starting multipathd daemon:    (no [ OK ] prompt)
simply stop and start the service again to fix it:
[root@liujing mapper]# /etc/init.d/multipathd stop
Stopping multipathd daemon: [ OK ]
[root@localhost mapper]# /etc/init.d/multipathd start
Starting multipathd daemon: [ OK ]    <-- the [ OK ] prompt shows the service started normally
Check the result with the following command:
[root@liujing mapper]# multipath -ll
mpatha (360a9800064665072443469563477396c) dm-0 NETAPP,LUN    <-- one multipath device (LUN) has been created
size=3.5G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=4 status=active
|- 1:0:0:0 sdb 8:16 active ready running    <-- the two underlying path devices are sdb and sde
`- 2:0:0:0 sde 8:64 active ready running
Two new entries, mpatha and mpathap1, have appeared under /dev/mapper/:
[root@liujing mapper]# cd /dev/mapper/
[root@liujing mapper]# ls
control mpatha mpathap1
fdisk -l also shows two additional device entries.
Before multipath was configured:
[root@liujing~]# fdisk -l
Disk /dev/sda: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000a6cdd
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 287 2097152 82 Linux swap / Solaris
Partition 2 does not end on cylinder boundary.
/dev/sda3 287 17850 141071360 83 Linux
Disk /dev/sdb: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sdb1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/sde: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sde1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
The same LUN is reached through two SAN ports and therefore shows up as two device nodes:
/dev/sdb and /dev/sde.
After multipath is configured, /dev/mapper/mpatha and /dev/mapper/mpathap1 appear in addition:
[root@localhost mapper]# fdisk -l
Disk /dev/sda: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000a6cdd
Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 287 2097152 82 Linux swap / Solaris
Partition 2 does not end on cylinder boundary.
/dev/sda3 287 17850 141071360 83 Linux
Disk /dev/sdb: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sdb1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/sde: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/sde1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/mapper/mpatha: 3774 MB, 3774873600 bytes
117 heads, 62 sectors/track, 1016 cylinders
Units = cylinders of 7254 * 512 = 3714048 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0xac956c3a
Device Boot Start End Blocks Id System
/dev/mapper/mpathap1 1 1016 3685001 83 Linux
Partition 1 does not start on physical sector boundary.
Disk /dev/mapper/mpathap1: 3773 MB, 3773441024 bytes
255 heads, 63 sectors/track, 458 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Alignment offset: 1024 bytes
Disk identifier: 0x00000000
Disk /dev/mapper/mpathap1 doesn’t contain a valid partition table
# multipath -F     # flush the existing multipath maps; the two new device nodes are removed
# multipath -v2    # rescan and recreate the maps; the devices appear again
3. Basic operations on multipath disks
To operate on the disks created by the multipath software, simply work on the devices under /dev/mapper/.
Before partitioning a multipath disk it is best to run pvcreate once:
# pvcreate /dev/mapper/mpatha
# fdisk /dev/mapper/mpatha    # when partitioning, use the /dev/mapper/mpatha device
When you save the partition table with fdisk on a multipath disk it prints an error; this error can be ignored.
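If the partition device node does not show up under /dev/mapper after the table is written, the partition mappings can usually be refreshed with kpartx, which is shipped alongside device-mapper-multipath; a hedged sketch (mpatha is the map created above):
# kpartx -a /dev/mapper/mpatha    # (re)create the partition mappings; mpathap1 should then appear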
# ls -l /dev/mapper/
[root@liujing mnt]# ls -l /dev/mapper/
total 0
crw-rw—-. 1 root root 10, 58 Mar 10 19:10 control
lrwxrwxrwx. 1 root root 7 Mar 10 20:28 mpatha -> ../dm-0
lrwxrwxrwx. 1 root root 7 Mar 10 20:33 mpathap1 -> ../dm-1
The mpathap1 entry is the partition we just created on the multipath disk.
# mkfs.ext4 /dev/mapper/mpathap1    # format the mpathap1 partition as ext4
# mount /dev/mapper/mpathap1 /mnt/    # mount the mpathap1 partition
Use /dev/mapper/mpathap1 when formatting and mounting.
4. Partitioning the disk:
As mentioned above, use /dev/mapper/mpatha when partitioning:
[root@liujing~]# fdisk /dev/mapper/mpatha
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xac956c3a.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): n    <-- create a new partition
Command action
e extended
p primary partition (1-4)
p    <-- primary partition
Partition number (1-4): 1
First cylinder (1-1016, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-1016, default 1016):
Using default value 1016
Command (m for help): w    <-- write the partition table (i.e. save)
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
Note: if the same LUN is attached on another node of the same setup, on that node you only need to open fdisk and write (w) the table again to pick up the partition; there is no need to create it again with n.
5. Formatting:
[root@liujing ~]# mkfs.ext4 /dev/mapper/mpathap1
mke2fs 1.41.12 (17-May-2010)
/dev/sdd1 alignment is offset by 1024 bytes.
This may result in very poor performance, (re)-partitioning suggested.
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=16 blocks
230608 inodes, 921250 blocks
46062 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=943718400
29 block groups
32768 blocks per group, 32768 fragments per group
7952 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
6. Mount /dev/mapper/mpathap1 on /mnt:
[root@liujing ~]# mount /dev/mapper/mpathap1 /mnt
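To make the mount persistent across reboots, the multipath partition can also be referenced from /etc/fstab. A minimal sketch (the _netdev option is only needed for network-attached storage such as iSCSI, so drop it for plain FC; /mnt is simply the mount point used above):
/dev/mapper/mpathap1   /mnt   ext4   defaults,_netdev   0 0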
III. Advanced multipath configuration
All of the configuration so far relied on multipath's defaults: the name of the mapped device, the load-balancing policy and so on were all default settings. Can multipath be configured according to our own preferences instead? Yes, it can.
1. Configuring the multipath.conf file
The next step is to edit the /etc/multipath.conf configuration file.
multipath.conf mainly consists of three parts: blacklist, multipaths and devices.
The blacklist section:
blacklist {
devnode "^sda"
}
The multipaths and devices sections are shown next. The multipaths section:
multipaths {
multipath {
wwid **************** #this value can be seen with multipath -v3
alias iscsi-dm0 #alias of the mapped device; any name you like
path_grouping_policy multibus #path grouping policy
path_checker tur #method used to determine the path state
path_selector "round-robin 0" #method used to choose the path for the next I/O operation
}
}
The devices section:
devices {
device {
vendor "iSCSI-Enterprise" #vendor name
product "Virtual disk" #product model
path_grouping_policy multibus #default path grouping policy
getuid_callout "/sbin/scsi_id -g -u -s /block/%n" #default program used to obtain the unique device identifier
prio_callout "/sbin/acs_prio_alua %d" #default program used to obtain the path priority value
path_checker readsector0 #method used to determine the path state
path_selector "round-robin 0" #method used to choose the path for the next I/O operation
failback immediate #failback mode
no_path_retry queue #number of times to retry failed paths before queueing is disabled (queue = keep queueing until a path is restored)
rr_min_io 100 #number of I/O requests to send down a path before switching to another path in the current path group
}
}
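After changing multipath.conf you can ask the running daemon which settings it actually applied, and reload it without a restart. A hedged check, assuming the interactive shell that device-mapper-multipath 0.4.9 exposes through multipathd -k (the command can be passed inline as shown) and the reload action of the RHEL 6 init script:
# multipathd -k"show config"      # dump the merged built-in defaults plus /etc/multipath.conf
# service multipathd reload       # re-read the configuration file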
The standard documentation for the relevant attributes is reproduced below:
wwid
Specifies the WWID of the multipath device to which the multipath attributes apply. This parameter is mandatory for this section of the multipath.conf file.
alias
Specifies the symbolic name for the multipath device to which the multipath attributes apply. If you are using user_friendly_names, do not set this value to mpathn; this may conflict with an automatically assigned user friendly name and give you incorrect device node names.
path_grouping_policy
Specifies the default path grouping policy to apply to unspecified multipaths. Possible values include:
failover = 1 path per priority group
multibus = all valid paths in 1 priority group
group_by_serial = 1 priority group per detected serial number
group_by_prio = 1 priority group per path priority value
group_by_node_name = 1 priority group per target node name
path_selector
Specifies the default algorithm to use in determining what path to use for the next I/O operation. Possible values include:
round-robin 0: Loop through every path in the path group, sending the same amount of I/O to each.
queue-length 0: Send the next bunch of I/O down the path with the least number of outstanding I/O requests.
service-time 0: Send the next bunch of I/O down the path with the shortest estimated service time, which is determined by dividing the total size of the outstanding I/O to each path by its relative throughput.
failback
Manages path group failback.
A value of immediate specifies immediate failback to the highest priority path group that contains active paths.
A value of manual specifies that there should not be immediate failback but that failback can happen only with operator intervention.
A value of followover specifies that automatic failback should be performed when the first path of a path group becomes active. This keeps a node from automatically failing back when another node requested the failover.
A numeric value greater than zero specifies deferred failback, expressed in seconds.
prio
Specifies the default function to call to obtain a path priority value. For example, the ALUA bits in SPC-3 provide an exploitable prio value. Possible values include:
const: Set a priority of 1 to all paths.
emc: Generate the path priority for EMC arrays.
alua: Generate the path priority based on the SCSI-3 ALUA settings.
tpg_pref: Generate the path priority based on the SCSI-3 ALUA settings, using the preferred port bit.
ontap: Generate the path priority for NetApp arrays.
rdac: Generate the path priority for the LSI/Engenio RDAC controller.
hp_sw: Generate the path priority for Compaq/HP controllers in active/standby mode.
hds: Generate the path priority for Hitachi HDS Modular storage arrays.
no_path_retry
A numeric value for this attribute specifies the number of times the system should attempt to use a failed path before disabling queueing.
A value of fail indicates immediate failure, without queueing.
A value of queue indicates that queueing should not stop until the path is fixed.
rr_min_io
Specifies the number of I/O requests to route to a path before switching to the next path in the current path group. This setting is only for systems running kernels older than 2.6.31. Newer systems should use rr_min_io_rq. The default value is 1000.
rr_min_io_rq
Specifies the number of I/O requests to route to a path before switching to the next path in the current path group, using request-based device-mapper-multipath. This setting should be used on systems running current kernels. On systems running kernels older than 2.6.31, use rr_min_io. The default value is 1.
rr_weight
If set to priorities, then instead of sending rr_min_io requests to a path before calling path_selector to choose the next path, the number of requests to send is determined by rr_min_io times the path's priority, as determined by the prio function. If set to uniform, all path weights are equal.
flush_on_last_del
If set to yes, then multipath will disable queueing when the last path to a device has been deleted.
A complete advanced configuration on my own system looks like this:
[root@liujing ~]# vi /etc/multipath.conf
blacklist {
devnode "^sda"
}
multipaths {
multipath {
wwid 360a98000646650724434697454546156
alias mpathb_fcoe
path_grouping_policy multibus
#path_checker "directio"
prio "random"
path_selector "round-robin 0"
}
}
devices {
device {
vendor "NETAPP"
product "LUN"
getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
#path_checker "directio"
#path_selector "round-robin 0"
failback immediate
no_path_retry fail
}
}
The wwid, vendor, product and getuid_callout values can all be obtained from the output of the multipath -v3 command. If an alias is defined for a wwid in /etc/multipath.conf, that alias takes precedence over the automatically assigned name.
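A quick way to pull those values out of the verbose output is to capture it to a file and search it. The field names in the grep pattern (uid, vendor, product) match my reading of the 0.4.9 -v3 output and may differ between versions, so treat this as a sketch:
# multipath -v3 > /tmp/mp-v3.txt 2>&1
# grep -E 'uid|vendor|product' /tmp/mp-v3.txt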
IV. Load-balancing test:
Use dd to read and write the device while watching the I/O status with iostat in another terminal, to see which path the traffic goes out on.
dd command: dd if=/dev/zero of=/mnt/1Gfile bs=8k count=131072. Since the disk was mounted on /mnt above, reading and writing files under /mnt exercises the multipath device directly.
To write the disk repeatedly, a loop such as the following can be used:
[root@liujing ~]# for ((i=1;i<=5;i++));do dd if=/dev/zero of=/mnt/1Gfile bs=8k count=131072 2>&1|grep MB;done;    # repeat the write 5 times; adjust the count to suit your test
[Screenshot: throughput reported by the dd loop]
In another console, run iostat 2 10 to watch the read/write activity:
[Screenshot: iostat output during the dd test]
In the iostat output, sdc and sdd are the two multipath path devices, and the traffic is spread evenly across the two paths; load balancing is working as expected.
V. Path redundancy (failover) test
Take down the port of one of the paths and all traffic immediately switches over to the other path.
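If you cannot physically pull a cable or shut down a switch port, a software-only way to simulate the failure is to offline one of the underlying SCSI path devices through sysfs. This uses the standard Linux SCSI device state attribute rather than anything multipath-specific, so treat it as a sketch (sdb is one of the two paths shown earlier):
# echo offline > /sys/block/sdb/device/state    # take one path down
# multipath -ll                                 # the sdb path should now be shown as failed/faulty
# echo running > /sys/block/sdb/device/state    # bring the path back
# multipath -ll                                 # with failback immediate it returns to the active group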
[Screenshot: iostat output after one path is taken down]