Code Review for illumos-gate

Prepared by: Ilya Usvyatsky (ilya) on 2013-May-29 14:09 -0400 EDT
Workspace: /home/ilya/illumos-gate (at 447495b7c1cc)
Compare against: origin/master (git://github.com/illumos/illumos-gate.git at 3731b53766e2)
Summary of changes: 27 lines changed: 2 ins; 9 del; 16 mod; 4758 unchg
Patch of changes: illumos-gate.patch
Printable review: illumos-gate.pdf

usr/src/uts/common/fs/zfs/arc.c

re #13729 assign each ARC hash bucket its own mutex
In ARC, the number of buckets in the buffer header hash table is
proportional to the size of physical RAM.
The number of locks protecting headers in the buckets, however, is fixed at 256.
Hence, on systems with large memory (>= 128GB), too many unrelated buffer
headers are protected by the same mutex.
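For reference, the pre-patch scheme looks roughly like the sketch below; the
identifiers (BUF_LOCKS, buf_hash_table, BUF_HASH_LOCK) are the ones used in
arc.c, though the layout is abbreviated here:

	struct ht_lock {
		kmutex_t	ht_lock;
		unsigned char	pad[64 - sizeof (kmutex_t)];	/* cacheline pad */
	};

	#define	BUF_LOCKS	256
	typedef struct buf_hash_table {
		uint64_t	ht_mask;		/* n - 1; n scales with physmem */
		arc_buf_hdr_t	**ht_table;		/* n bucket heads */
		struct ht_lock	ht_locks[BUF_LOCKS];	/* fixed pool of 256 locks */
	} buf_hash_table_t;

	/* Buckets idx, idx + 256, idx + 512, ... all share one lock. */
	#define	BUF_HASH_LOCK_NTRY(idx)	(buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
	#define	BUF_HASH_LOCK(idx)	(&(BUF_HASH_LOCK_NTRY(idx).ht_lock))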
When memory in the system is fragmented this may cause a deadlock (illustrated
in the sketch after this list):
- An arc_read thread may be trying to allocate a 128k buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no
contiguous chunk of the requested size is available.
- The ARC eviction thread that is supposed to evict some buffers calls
an evict callback on one of the buffers.
- Before freeing the memory, the callback attempts to take the lock on the
buffer header.
- Incidentally, this buffer header is protected by the same lock as
the one held by the arc_read() thread.
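Schematically (hdr_a and hdr_b are hypothetical headers whose bucket indices
happen to map to the same one of the 256 locks):

	/*
	 * Thread A: arc_read()              Thread B: ARC eviction
	 * ---------------------             ----------------------
	 * mutex_enter(BUF_HASH_LOCK(a));
	 * kmem_alloc(128K, KM_PUSHPAGE);    picks hdr_b to evict;
	 *   no contiguous 128K chunk,       calls the evict callback, which
	 *   so A blocks until memory        must mutex_enter(BUF_HASH_LOCK(b))
	 *   is freed                        before it can free any memory
	 *
	 * If BUF_HASH_LOCK(a) == BUF_HASH_LOCK(b), A waits on B for memory
	 * while B waits on A for the lock: deadlock.
	 */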
The solution in this patch is not perfect - that is, it still protects all
headers in a hash bucket with the same lock.
However, the probability of collision is very low and does not depend on memory
size.
By the same argument, padding locks to a cacheline looks like a waste of memory
here, since the probability of contention on a cacheline is quite low, given
the number of buckets, the number of locks per cacheline (4) and the fact that
the hash function (crc64 % hash table size) is supposed to be a very good
randomizer.
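The shape of the fix is sketched below (the hb_* names are hypothetical and the
actual patch may differ in detail): each bucket carries its own unpadded mutex,
so a bucket entry is 16 bytes and four of them fit in a 64-byte cacheline,
which is where the figure of 4 locks per cacheline above comes from.

	typedef struct ht_bucket {
		kmutex_t	hb_lock;	/* one lock per bucket, unpadded */
		arc_buf_hdr_t	*hb_head;	/* bucket chain head */
	} ht_bucket_t;				/* 16 bytes: 4 per cacheline */

	typedef struct buf_hash_table {
		uint64_t	ht_mask;	/* n - 1 */
		ht_bucket_t	*ht_buckets;	/* n bucket/lock pairs */
	} buf_hash_table_t;

	/* The lock index is the bucket index itself; no cross-bucket sharing. */
	#define	BUF_HASH_LOCK(idx)	(&buf_hash_table.ht_buckets[idx].hb_lock)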
The effect on memory usage is as follows.
For a hash table of size n:
- The original code uses 16K + 16 + n * 8 bytes of memory.
- This fix uses 2 * n * 8 + 8 bytes of memory.
- The net memory overhead is therefore n * 8 - 16K - 8 bytes.
The value of n grows proportionally to physical memory size.
For 128GB of physical memory it is 2M, so the memory overhead is
16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller, and
for larger memory configurations it is proportionally bigger.
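As a worked check of the numbers above (assuming 8-byte pointers and an 8-byte
unpadded kmutex_t):

	/*
	 * original: 256 locks padded to 64 bytes + 16-byte header + n pointers
	 *           = 256 * 64 + 16 + n * 8 = 16K + 16 + n * 8
	 * this fix: n 16-byte bucket/lock pairs + 8-byte mask
	 *           = n * 16 + 8 = 2 * n * 8 + 8
	 * overhead: (2 * n * 8 + 8) - (16K + 16 + n * 8) = n * 8 - 16K - 8
	 *
	 * For n = 2M: n * 8 = 16M, giving the 16M - 16K - 8 bytes quoted above.
	 */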
The patch has been tested for 30+ hours using a vdbench script that reproduces
the hang with the original code 100% of the time within 20-30 minutes.
27 lines changed: 2 ins; 9 del; 16 mod; 4758 unchg

This code review page was prepared using /opt/onbld/bin/webrev. Webrev is maintained by the illumos project.