Wednesday, 10 May 2017

What is the best bonding mode for TCP traffic such as NFS, ISCSI, CIFS, etc? (c) RedHat

Environment

  • Red Hat Enterprise Linux (all versions)
  • Bonding or Teaming
  • Large streaming TCP traffic such as NFS, Samba/CIFS, ISCSI, rsync over SSH/SCP, backups

Issue

  • What is the best bonding mode for TCP traffic such as NFS and Samba/CIFS?
  • NFS repeatedly logs nfs: server not responding, still trying when no network issue is present
  • A packet capture shows many TCP retransmission, TCP Out-of-Order, and RPC retransmission events when there should be no reason for them.

Resolution

Use a bonding mode which guarantees in-order delivery of TCP traffic such as:
  • Bonding Mode 1 (active-backup)
  • Bonding Mode 2 (balance-xor)
  • Bonding Mode 4 (802.3ad aka LACP)
Note that Bonding Mode 2 (balance-xor) requires an EtherChannel or similar configured on the switch, and Mode 4 (802.3ad) requires an EtherChannel with LACP on the switch. Bonding Mode 1 (active-backup) requires no switch configuration.
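For illustration, a minimal sketch of a Mode 4 (802.3ad) bond using ifcfg files on RHEL follows; the interface names (em1/em2), the IP address, and the miimon/xmit_hash_policy values are placeholders to adapt to your environment, and the connected switch ports must be configured as a matching LACP port-channel. NetworkManager (nmcli) or teamd can achieve the same result.
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
BOOTPROTO=none
IPADDR=192.0.2.10
PREFIX=24
ONBOOT=yes
# cat /etc/sysconfig/network-scripts/ifcfg-em1   (repeat for em2)
DEVICE=em1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes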

Root Cause

The following bonding modes:
  • Bonding Mode 0 (round-robin)
  • Bonding Mode 3 (broadcast)
  • Bonding Mode 5 (balance-tlb)
  • Bonding Mode 6 (balance-alb)
Do not guarantee in-order delivery of TCP streams, as each packet of a stream may be transmitted down a different slave, and no switch guarantees that packets received on different switchports will be delivered in order.
Given the following example configuration:
.---------------------------.
| bond0 mode 0 round-robin  |
'---------------------------'
| eth0 | eth1 | eth2 | eth3 |
'--=---'--=---'---=--'---=--'
   |      |       |      |
   |      |       |      |
.--=------=-------=------=--.
|          switch           |
'---------------------------'
The bond system may send traffic out each slave in a correct order, like ABCD ABCD ABCD, but the switch may forward this traffic in any random order, like CADB BDCA DACB.
As TCP on the receiver expects to be presented the stream in order, this causes the receiver to believe it has missed packets and request retransmissions, to spend a great deal of time reassembling out-of-order traffic into the correct order, and the sender to waste bandwidth sending retransmissions which are not really required.
The following bonding modes:
  • Bonding Mode 1 (active-backup)
  • Bonding Mode 2 (balance-xor)
  • Bonding Mode 4 (802.3ad aka LACP)
Avoid this issue by transmitting all traffic for a given destination down a single slave. The balancing algorithm of Mode 2 and Mode 4 can be altered with the xmit_hash_policy bonding option, but they will never balance a single TCP stream across different ports, and so avoid the problematic behaviour discussed above.
It is not possible to effectively balance a single TCP stream across multiple bonding or teaming devices. If higher speed is required for a single stream, then faster interfaces (and possibly faster network infrastructure) must be used.
This theory applies to all TCP streams. The most common occurrences of this issue are seen on high-speed long-lived TCP streams such as NFS, Samba/CIFS, ISCSI, rsync over SSH/SCP, and so on.

Diagnostic Steps

Inspect syslog for nfs: server X not responding, still trying and nfs: server X OK messages when there are no other network issues.
Inspect a packet capture for many occurrences of TCP retransmission, TCP Out-of-Order, RPC retransmission, or other similar messages.
Inspect bonding mode in /proc/net/bonding/bondX
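A hedged sketch of these checks; bond0, the log path, and the NFS server name are placeholders for your environment:
# grep -i "not responding" /var/log/messages
# grep "Bonding Mode" /proc/net/bonding/bond0
# tcpdump -i bond0 -s 0 -w /tmp/nfs.pcap host nfs-server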

NFS and Dirty Pages (c) RedHat

PROBLEM

Computers with lots of RAM and lots of processing power can quickly create many Dirty Pages (data to be written eventually to a filesystem) in RAM. When the time comes to flush these Dirty Pages to the respective filesystem, a process called Writeback, there can be a lot of congestion with NFS. The throughput of data travelling over the network is significantly slower than writing to RAM. Picture the impact on road traffic if a 10-lane road suddenly narrowed to 2 lanes.
One may expect this to only impact the NFS mount; however, the number of permitted Dirty Pages is a system-wide value. Once this threshold is reached, every process on the system becomes responsible for freeing up pages when it attempts to allocate memory. If there are only a few dirty pages this is fine, but if there are 40GiB of dirty pages, all processes can be blocked for a long time.

WORKAROUNDS

There are a number of ways to work around this issue. They range from solutions that only impact the process and the file being written to, all the way to impacting all processes and all filesystems.

File-level impact

Direct I/O

When opening the file for writing, use the O_DIRECT flag to completely bypass the Page Cache. This can also be achieved by using dd to copy a file to the NFS mount with the oflag=direct option.
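A hedged example of such a copy (source and destination paths are placeholders):
# dd if=/data/largefile of=/nfsmount/largefile bs=1M oflag=direct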

Throttle I/O

The next option is to throttle the rate of reading the data to match the NFS WRITE rate. e.g. Use rsync and option --bwlimit.
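For example, assuming a limit of roughly 50 MiB/s suits your environment (rsync's --bwlimit value is in KiB per second; paths are placeholders):
# rsync --bwlimit=51200 /data/largefile /nfsmount/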

Flush NFS Dirty Pages frequently

If you are able to modify and recompile the application, have it call fsync() periodically. If you cannot recompile it, run the following periodically:
ls -l /nfsmount/dir_containing_files
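One simple way to run that periodically, assuming a 30-second interval is acceptable for your workload:
# watch -n 30 ls -l /nfsmount/dir_containing_files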

Write smaller files

If possible, try breaking up single large files into smaller files. Dirty Pages associated with each file will be flushed when it is closed. This results in Dirty Pages being flushed more frequently.

NFS mount impact

Use only synchronous I/O

Normally, I/O is done asynchronously on the NFS Client, meaning the application writes to the Page Cache and the NFS Client sends the data to the NFS Server later.
I/O can be forced to be done synchronously, meaning the application does not consider a write complete until the NFS Client has sent the data to the NFS Server, and the NFS Server has acknowledged receiving the data.
Using the sync NFS Client mount option forces all writes to be synchronous. However, it will also severely degrade the NFS Client WRITE performance.
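For example (server, export, and mount point are placeholders):
# mount -t nfs -o sync server:/export /nfsmount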

rsize/wsize (NFS client mount options)

The rsize/wsize is the maximum number of bytes per network READ/WRITE request. Increasing these values has the potential to increase the throughput depending on the type of workload and the performance of the network.
The default rsize/wsize is negotiated with the NFS Server by the NFS Client. If your workload is a streaming READ/WRITE workload, increasing rsize/wsize to 1048576 (1MiB) could improve throughput performance.
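A hedged example requesting 1MiB READ/WRITE sizes at mount time and then checking what was actually negotiated (server, export, and mount point are placeholders):
# mount -t nfs -o rsize=1048576,wsize=1048576 server:/export /nfsmount
# nfsstat -m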

System-wide impact

Limit the number of system-wide Dirty Pages

From RHEL 5.6 (kernel 2.6.18-238) onwards (including RHEL 6.0) the tunables vm.dirty_background_bytes and vm.dirty_bytes are available. These tunables provide finer grain adjustments particularly if the system has a lot of RAM. Prior to RHEL 5.6, the tunables vm.dirty_background_ratio and vm.dirty_ratio can be used to achieve the same objective.
  • Set vm.dirty_expire_centisecs (/proc/sys/vm/dirty_expire_centisecs) to 500 from the 3000 default
  • Limit vm.dirty_background_bytes (/proc/sys/vm/dirty_background_bytes) to 500MiB
  • Limit vm.dirty_bytes (/proc/sys/vm/dirty_bytes) to not more than 1 GiB
Ensure that /proc/sys/vm/dirty_background_bytes is always a smaller, non-zero, value than /proc/sys/vm/dirty_bytes.
Changing these values can impact throughput negatively while improving latency. To shift the balance between throughput and latency, adjust these values slightly and measure the impact, in particular dirty_bytes.
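One way to make the limits suggested above persistent is via /etc/sysctl.conf; the byte values below follow the bullets above (500 MiB and 1 GiB) and should be tuned and measured for your own workload:
# cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_expire_centisecs = 500
vm.dirty_background_bytes = 524288000
vm.dirty_bytes = 1073741824
EOF
# sysctl -p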
The behaviour of Dirty Pages and Writeback can be observed by running the following command:
$ watch -d -n 1 cat /proc/meminfo
Documentation/sysctl/vm.txt:
dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in 100'ths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
dirty_ratio

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.
dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.
dirty_background_ratio

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.

Environment-wide impact

Improve the Network Performance (iperf benchmarking)

The performance of the network has a significant bearing on NFS. Check that the network is performing well by running iperf. It can be used to measure network throughput between the NFS client and another system, ideally the NFS server itself. For example:
Receiver:
$ iperf -s -f M
Transmitter:
$ iperf -c RECEIVER-IP -f M -t 60
Do a few iterations and try to make each test run for at least 60 seconds. You should be able to get an idea of baseline network throughput. NFS will not perform any faster than the baseline.

Thursday, 4 May 2017

Solaris Comstar client setup

iSCSI

  • Linux
Create a new file called 99-zfssa.rules inside /etc/udev/rules.d with the following contents:

ACTION=="add", SYSFS{vendor}=="SUN", SYSFS{model}=="*ZFS*", ENV{ID_FS_USAGE}!="filesystem", ENV{ID_PATH}=="*-iscsi-*", RUN+="/bin/sh -c 'echo 1024 > /sys$DEVPATH/queue/max_sectors_kb'"
  
Reboot (a verification sketch for this udev change follows after this list)
  • Windows   
Open up the Registry Editor (regedit)
Navigate to HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<Instance Number>\Parameters
Change the following parameters to 1048576 in Decimal form:
    FirstBurstLength
    MaxBurstLength
    MaxRecvDataSegmentLength
    MaxTransferLength
Reboot
  • Solaris  
# echo 'set maxphys=1048576' >> /etc/system
Reboot
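For the Linux client above, a quick hedged check after reboot that the udev rule took effect; sdX is a placeholder for one of the ZFS appliance's iSCSI block devices, and it should report 1024 if the rule matched:
# cat /sys/block/sdX/queue/max_sectors_kb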

Wednesday, 29 March 2017

rhel7, fastboot

How to boot kernel and bypass the BIOS

Kexec is a fastboot mechanism that allows booting Linux from the context of an already running kernel without going through the BIOS.

Check if kexec-tools is installed

rpm -q kexec-tools

To load a kernel, the syntax is as follows:

#    kexec -l kernel-image --append="any command-line-options" --initrd=initrd-image

An easy way to load the currently running kernel is:

# krnl=`uname -r` ; kexec -l /boot/vmlinuz-$krnl \
--initrd=/boot/initramfs-"$krnl".img --reuse-cmdline
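
Once loaded, the kernel can be executed. Note that kexec -e jumps into the new kernel immediately, without cleanly stopping services or unmounting filesystems; on systemd hosts such as RHEL 7, systemctl kexec performs an orderly reboot into the loaded kernel instead:

# kexec -e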

Sunday, 5 March 2017

luxadm cheat sheet

Display port state

# luxadm -e port
/devices/pci@700/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0:devctl CONNECTED
/devices/pci@400/pci@0/pci@d/SUNW,emlxs@0,1/fp@0,0:devctl CONNECTED
/devices/pci@600/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0:devctl NOT CONNECTED
/devices/pci@500/pci@0/pci@d/SUNW,emlxs@0/fp@0,0:devctl NOT CONNECTED

Dump devices connected to port

# luxadm -e dump_map /devices/pci@400/pci@1/pci@0/pci@0/SUNW,emlxs@0/fp@0,0:devctl
Pos Port_ID Hard_Addr Port WWN Node WWN Type
0 97f4c0 0 204300a0b848a702 200200a0b848a702 0x0 (Disk device)
1 97f680 0 10000000c9b1f51c 20000000c9b1f51c 0x1f (Unknown Type,Host Bus Adapter)
# luxadm -e dump_map /dev/cfg/c6
Pos Port_ID Hard_Addr Port WWN Node WWN Type
0 97f4c0 0 204300a0b848a702 200200a0b848a702 0x0 (Disk device)
1 97f680 0 10000000c9b1f51c 20000000c9b1f51c 0x1f (Unknown Type,Host Bus Adapter)

List disks

# luxadm probe -p
No Network Array enclosures found in /dev/es
Found Fibre Channel device(s):
Node WWN:206000c0ff0067d9 Device Type:Disk device
Logical Path:/dev/rdsk/c14t600C0FF0000000000067D96B373B6600d0s2
Physical Path:
/devices/scsi_vhci/ssd@g600c0ff0000000000067d96b373b6600:c,raw
Node WWN:206000c0ff0067d9 Device Type:Disk device
Logical Path:/dev/rdsk/c14t600C0FF0000000000067D96B373B6601d0s2
Physical Path:
/devices/scsi_vhci/ssd@g600c0ff0000000000067d96b373b6601:c,raw
Node WWN:206000c0ff0067d9 Device Type:Disk device
Logical Path:/dev/rdsk/c14t600C0FF0000000000067D96B373B6602d0s2
Physical Path:
/devices/scsi_vhci/ssd@g600c0ff0000000000067d96b373b6602:c,raw

Show disk info

# luxadm display /dev/rdsk/c14t60060E8004F236000000F23600000A00d0s2
DEVICE PROPERTIES for disk: /dev/rdsk/c14t60060E8004F236000000F23600000A00d0s2
Vendor: HITACHI
Product ID: OPEN-V -SUN
Revision: 5009
Serial Num: 50 0F2360A00
Unformatted capacity: 34091.250 MBytes
Write Cache: Enabled
Read Cache: Enabled
Minimum prefetch: 0x0
Maximum prefetch: 0x0
Device Type: Disk device
Path(s):
/dev/rdsk/c14t60060E8004F236000000F23600000A00d0s2
/devices/scsi_vhci/ssd@g60060e8004f236000000f23600000a00:c,raw
Controller /devices/pci@400/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0
Device Address 50060e8004f23674,4
Host controller port WWN 10000000c98b08d3
Class primary
State ONLINE
Controller /devices/pci@400/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0
Device Address 50060e8004f23676,4
Host controller port WWN 10000000c98b08d3
Class primary
State ONLINE
Controller /devices/pci@400/pci@0/pci@c/SUNW,emlxs@0/fp@0,0
Device Address 50060e8004f23664,4
Host controller port WWN 10000000c98b08d2
Class primary
State ONLINE
Controller /devices/pci@400/pci@0/pci@c/SUNW,emlxs@0/fp@0,0
Device Address 50060e8004f23666,4
Host controller port WWN 10000000c98b08d2
Class primary
State ONLINE

Inquire device

# luxadm inquiry /dev/rdsk/c14t600C0FF00000000009208424A938CA00d0s2
INQUIRY:
Physical Path:
/devices/scsi_vhci/ssd@g600c0ff00000000009208424a938ca00:c,raw
Vendor: SUN
Product: StorEdge 3511
Revision: 421F
Serial Number 09208424A938CA00
Device type: 0x0 (Disk device)
Removable media: no
Medium Changer Element: no
ISO version: 0
ECMA version: 0
ANSI version: 3 (Device complies to SCSI-3)
Terminate task: no
Response data format: 2
Additional length: 0xf7
Command queueing: no
VENDOR-SPECIFIC PARAMETERS
Byte# Hex Value ASCII
52 00 00 00 00 ....
96 43 6f 70 79 72 69 67 68 74 20 28 43 29 20 31 39 Copyright (C) 19
39 39 20 49 6e 66 6f 72 74 72 65 6e 64 2e 20 41 99 Infortrend. A
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 ............

Display tape information

# luxadm -v display /dev/rmt/3
Displaying information for: /dev/rmt/3
DEVICE PROPERTIES for tape: /dev/rmt/3
Vendor: ARCHIVE
Product ID: Python
Revision: V000
Serial Num: Unsupported
Device Type: Tape device
Path(s):
/dev/rmt/3n
/devices/pci@700/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0/st@w2101001b3232ef61,0:n
LUN path port WWN: 2101001b3232ef61
Host controller port WWN: 10000000c98b07d5
Path status: Not Ready

Display disk information

# luxadm display 226000c0ff992084
DEVICE PROPERTIES for disk: 226000c0ff992084
Vendor: SUN
Product ID: StorEdge 3511
Revision: 421F
Serial Num: 09208424A938CA00
Unformatted capacity: 956000.000 MBytes
Write Cache: Enabled
Read Cache: Enabled
Minimum prefetch: 0x0
Maximum prefetch: 0xffff
Device Type: Disk device
Path(s):
/dev/rdsk/c14t600C0FF00000000009208424A938CA00d0s2
/devices/scsi_vhci/ssd@g600c0ff00000000009208424a938ca00:c,raw
Controller /devices/pci@700/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0
Device Address 226000c0ff992084,0
Host controller port WWN 10000000c98b07d5
Class primary
State ONLINE
Controller /devices/pci@700/pci@0/pci@c/SUNW,emlxs@0/fp@0,0
Device Address 216000c0ff892084,0
Host controller port WWN 10000000c98b07d4
Class primary
State ONLINE
DEVICE PROPERTIES for disk: 226000c0ff992084
Vendor: SUN
Product ID: StorEdge 3511
Revision: 421F
Serial Num: 09208424A938CA01
Unformatted capacity: 956000.000 MBytes
Write Cache: Enabled
Read Cache: Enabled
Minimum prefetch: 0x0
Maximum prefetch: 0xffff
Device Type: Disk device
Path(s):

List hba && firmware version

# luxadm fcode_download -p
Found Path to 0 FC100/S Cards
Complete
Found Path to 0 FC100/P, ISP2200, ISP23xx Devices
Complete
Found Path to 0 JNI1560 Devices.
Complete
Found Path to 12 Emulex Devices.
Opening Device: /devices/pci@400/pci@0/pci@c/SUNW,emlxs@0/fp@0,0:devctl
Detected FCode Version: 3.01a1
Opening Device: /devices/pci@400/pci@0/pci@c/SUNW,emlxs@0,1/fp@0,0:devctl
Detected FCode Version: 3.01a1
Opening Device: /devices/pci@400/pci@0/pci@d/SUNW,emlxs@0/fp@0,0:devctl
Detected FCode Version: 3.01a1

FC Error stats

# luxadm -e rdls /dev/cfg/c3
Link Error Status information for loop:
al_pa lnk fail sync loss signal loss sequence err invalid word CRC
712100 1 0 0 0 293 0
713600 0 0 0 0 255 0
713700 1 0 0 0 255 0
713b00 0 0 0 0 255 0
8c0000 0 1 0 0 3 0
NOTE: These LESB counts are not cleared by a reset, only power cycles.
These counts must be compared to previously read counts.

FC device state

# luxadm -e bus_getstate /dev/rdsk/c14t60060E8004F236000000F23600000300d0s2
/dev/rdsk/c14t60060E8004F236000000F23600000300d0s2: Active
# for i in /dev/rdsk/c14*s2 ; do luxadm -e bus_getstate $i; done
/dev/rdsk/c14t60060E8004F236000000F23600000300d0s2: Active
Error: Invalid pathname (/devices/scsi_vhci/ssd@g60060e80153438000001343800000501:c,raw)
Error: Invalid pathname (/devices/scsi_vhci/ssd@g60060e80153438000001343800000502:c,raw)

Failover path

# luxadm failover secondary /dev/rdsk/c14t6d0s2
Error: Device does not support failover
# luxadm failover primary /dev/rdsk/c14t6d0s2
Error: Device does not support failover

LIP port

# luxadm -e forcelip /dev/cfg/c3
# tail -100 /var/adm/messages
Oct 26 17:32:22 pioneer emlxs: [ID 349649 kern.info] [ 5.05F1]emlxs1: NOTICE: 730: Link reset. (Resetting link...)
Oct 26 17:32:22 pioneer emlxs: [ID 349649 kern.info] [ 5.031F]emlxs1: NOTICE: 710: Link down.
Oct 26 17:32:22 pioneer emlxs: [ID 349649 kern.info] [ 5.0631]emlxs1: NOTICE: 730: Link reset.
Oct 26 17:32:24 pioneer emlxs: [ID 349649 kern.info] [ 5.0549]emlxs1: NOTICE: 720: Link up. (4Gb, fabric, initiator)

Offline/Online/Reset device

# luxadm -e offline /dev/rdsk/c14t6d0s2

# luxadm -e online /dev/rdsk/c14t6d0s2

# luxadm -e dev_reset /dev/rdsk/c14t6d0s2

Led ON/OFF

# luxadm led /dev/rdsk/c14t6d0s2
# luxadm led_on /dev/rdsk/c14t6d0s2
# luxadm led_off /dev/rdsk/c14t6d0s2
# luxadm led_blink /dev/rdsk/c14t6d0s2

Update firmware

# luxadm fcode_download -d /path_to_firmware

Sunday, 26 February 2017

Puppet cheat sheet

Agent

action                       command
Client bootstrap             puppet agent --server puppetmaster --waitforcert 60 --test
Validate manifest syntax     puppet parser validate init.pp
Puppet dry run               puppet agent --noop --verbose
Disable/Enable Agent         puppet agent --disable / --enable
Execute specific class       puppet agent --tags Some::Class

Cert management

puppet cert list
puppet cert list --all
puppet cert sign <name>
puppet cert clean <name>
puppet node clean <name>   # removes node + cert

Module management

puppet module list
puppet module install <name>
puppet module uninstall <name>
puppet module upgrade <name>
puppet module search <name>

Inspecting Resources/Types

puppet describe -l
puppet resource <type name>

# Querying Examples
puppet resource user john.smith
puppet resource service apache
puppet resource mount /data
puppet resource file /etc/motd
puppet resource package wget

# Trigger puppet run from master
puppet kick <name>
puppet kick -p 5 <names>      # 5 parallel

Hiera Queries

On Puppet master

hiera <key>     # to query common.yaml only
hiera <key> -m <FQDN>   # to query config of a given node (using mcollective)
hiera <key> -i <FQDN>   # to query config of a given node (using Puppet inventory)
hiera <key> environment=production fqdn=myhost1   # to pass values for hiera.yaml

# To dump complex data
hiera -a <array key>
hiera -h <hash key>

Hiera Debugging

puppet apply -e "notice(hiera_array('some key'))"


$ cat /etc/puppet/manifests/site.pp
notice("hostname: ${::hostname}")
notice("fqdn: ${::fqdn}")
notice("certname: ${::certname}")

$ puppet master \
  --hiera_config /etc/puppet/hiera.yaml \
  --manifest /etc/puppet/manifests/site.pp \
  --modulepath /etc/puppet/modules \
  --compile test01

Facter

command                    description
facter                     All system facts
facter -p                  All system facts and puppet facts
facter -y                  Facts in YAML format
facter -j                  Facts in JSON format
facter [-p] <fact_name>    A specific fact

Links