Wednesday 10 May 2017

NFS and Dirty Pages (c) RedHat

PROBLEM

Computers with lots of RAM and processing power can quickly create many Dirty Pages (data to be written eventually to a filesystem) in RAM. When the time comes to flush these Dirty Pages to their respective filesystem, an operation called Writeback, NFS can become heavily congested: throughput over the network is significantly slower than writing to RAM. Picture the impact on road traffic if a 10-lane road suddenly narrowed to 2 lanes.
One might expect this to impact only the NFS mount; however, the number of permitted Dirty Pages is a system-wide value. Once this threshold is reached, every process on the system that attempts to allocate memory becomes responsible for freeing up pages. If there are only a few Dirty Pages, this is fine, but if there are 40GiB of Dirty Pages, every process can be blocked for a long time.

WORKAROUNDS

There are a number of ways to work around this issue. They range from solutions that only impact the process and the file being written to, all the way to impacting all processes and all filesystems.

File-level impact

Direct I/O

When opening the file for writing, use the O_DIRECT flag to completely bypass the Page Cache. This can also be achieved by using dd to copy a file to the NFS mount with the oflag=direct option.
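As an illustrative sketch (the paths /local/bigfile and /nfsmount are placeholders for your own source file and NFS mount; here the copy is demonstrated with small temporary files):

```shell
# Copy a file with O_DIRECT, bypassing the Page Cache. On a real
# system the destination would live on the NFS mount, e.g.:
#   dd if=/local/bigfile of=/nfsmount/bigfile bs=1M oflag=direct
src=$(mktemp)
dst=$(mktemp)
head -c 4194304 /dev/zero > "$src"     # 4 MiB of sample data
# bs must be a multiple of the filesystem block size; 1MiB is a safe
# choice. Some filesystems (e.g. tmpfs) reject O_DIRECT, hence the
# fallback to a buffered copy.
dd if="$src" of="$dst" bs=1M oflag=direct status=none 2>/dev/null \
    || dd if="$src" of="$dst" bs=1M status=none
echo "copied $(stat -c %s "$dst") bytes"
```

Note that O_DIRECT imposes alignment requirements on buffer sizes and offsets, which is why bs is kept at a power-of-two multiple of the block size.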

Throttle I/O

Another option is to throttle the rate at which the data is read to match the NFS WRITE rate, e.g. by using rsync with its --bwlimit option.
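A sketch of the idea (the paths and the 20 MiB/s limit are illustrative; pick a limit slightly below your measured NFS WRITE throughput):

```shell
# Throttle the copy so Dirty Pages cannot accumulate faster than the
# NFS WRITE rate. On a real system the destination would be the NFS
# mount, e.g.:
#   rsync --bwlimit=20480 /local/bigfile /nfsmount/
# --bwlimit is in KiB/s, so 20480 caps the copy at 20 MiB/s.
src=$(mktemp)
head -c 1048576 /dev/zero > "$src"     # 1 MiB of sample data
if command -v rsync >/dev/null 2>&1; then
    rsync --bwlimit=20480 "$src" "$src.copy"
else
    cp "$src" "$src.copy"              # fallback if rsync is absent
fi
echo "copied $(stat -c %s "$src.copy") bytes"
```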

Flush NFS Dirty Pages frequently

If you can modify and recompile the source code, call fsync() periodically. If you cannot, run the following periodically:
ls -l /nfsmount/dir_containing_files

Write smaller files

If possible, try breaking up single large files into smaller files. Dirty Pages associated with each file will be flushed when it is closed. This results in Dirty Pages being flushed more frequently.
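One way to do this is with split; a sketch (the 1GiB chunk size and the paths are placeholders, demonstrated here with a 3MiB file split into 1MiB pieces):

```shell
# Break one large file into pieces; each piece's Dirty Pages are
# flushed when that piece is closed. On a real system:
#   split -b 1G /local/bigfile /nfsmount/bigfile.part.
src=$(mktemp)
head -c 3145728 /dev/zero > "$src"     # 3 MiB of sample data
split -b 1048576 "$src" "$src.part."   # 1 MiB pieces
ls "$src".part.* | wc -l               # 3 pieces
```

The original can later be reassembled with cat, since split preserves byte order across the lexically sorted piece names.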

NFS mount impact

Use only synchronous I/O

Normally, I/O is done asynchronously on the NFS Client, meaning the application writes to the Page Cache and the NFS Client sends the data to the NFS Server later.
I/O can be forced to be done synchronously, meaning the application does not consider a write complete until the NFS Client has sent the data to the NFS Server, and the NFS Server has acknowledged receiving the data.
Using the sync NFS Client mount option forces all writes to be synchronous. However, it will also severely degrade the NFS Client WRITE performance.
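For example (the server name nfsserver, export path, and mount point are placeholders):

```shell
# Force synchronous writes: write() does not return until the NFS
# Server has acknowledged the data. Expect a large drop in WRITE
# throughput.
mount -t nfs -o sync nfsserver:/export /nfsmount
```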

rsize/wsize (NFS client mount options)

The rsize/wsize is the maximum number of bytes per network READ/WRITE request. Increasing these values has the potential to increase the throughput depending on the type of workload and the performance of the network.
The default rsize/wsize is negotiated with the NFS Server by the NFS Client. If your workload is a streaming READ/WRITE workload, increasing rsize/wsize to 1048576 (1MiB) could improve throughput performance.
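For example (server name, export, and mount point are placeholders):

```shell
# Request 1MiB transfer sizes; the NFS Server may negotiate these
# down, so verify the result afterwards.
mount -t nfs -o rsize=1048576,wsize=1048576 nfsserver:/export /nfsmount

# Check what was actually negotiated:
grep /nfsmount /proc/mounts
```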

System-wide impact

Limit the number of system-wide Dirty Pages

From RHEL 5.6 (kernel 2.6.18-238) onwards (including RHEL 6.0), the tunables vm.dirty_background_bytes and vm.dirty_bytes are available. These provide finer-grained adjustment, particularly on systems with a lot of RAM. Prior to RHEL 5.6, the tunables vm.dirty_background_ratio and vm.dirty_ratio can be used to achieve the same objective.
  • Set vm.dirty_expire_centisecs (/proc/sys/vm/dirty_expire_centisecs) to 500 from the 3000 default
  • Limit vm.dirty_background_bytes (/proc/sys/vm/dirty_background_bytes) to 500MiB
  • Limit vm.dirty_bytes (/proc/sys/vm/dirty_bytes) to not more than 1 GiB
Ensure that /proc/sys/vm/dirty_background_bytes is always a smaller, non-zero, value than /proc/sys/vm/dirty_bytes.
Lowering these values can reduce throughput while improving latency. To find the right balance, adjust them in small steps, in particular dirty_bytes, and measure the impact.
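A sketch of applying the suggested limits at runtime (requires root; the byte values correspond to the 500MiB and 1GiB limits above):

```shell
sysctl -w vm.dirty_expire_centisecs=500
sysctl -w vm.dirty_background_bytes=524288000   # 500 MiB
sysctl -w vm.dirty_bytes=1073741824             # 1 GiB

# To persist across reboots, add the same keys to /etc/sysctl.conf:
#   vm.dirty_expire_centisecs = 500
#   vm.dirty_background_bytes = 524288000
#   vm.dirty_bytes = 1073741824
```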
The behaviour of Dirty Pages and Writeback can be observed by running the following command:
$ watch -d -n 1 cat /proc/meminfo
Documentation/sysctl/vm.txt:
dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in 100'ths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
dirty_ratio

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.
dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.
dirty_background_ratio

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.

Environment-wide impact

Improve the Network Performance (iperf benchmarking)

The performance of the network has a significant bearing on NFS. Check that the network is performing well by running iperf, which measures network throughput between the NFS client and another system, ideally the NFS server itself. For example:
Receiver:
$ iperf -s -f M
Transmitter:
$ iperf -c RECEIVER-IP -f M -t 60
Do a few iterations and try to make each test run for at least 60 seconds. You should be able to get an idea of baseline network throughput. NFS will not perform any faster than the baseline.
