NEX-19742 A race between ARC and L2ARC causes system panic
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-16904 Need to port Illumos Bug #9433 to fix ARC hit rate
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15303 ARC-ABD logic works incorrectly when deduplication is enabled
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-15446 set zfs_ddt_limit_type to DDT_LIMIT_TO_ARC
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-14571 remove isal support remnants
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-8057 renaming of mount points should not be allowed (redo)
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5785 zdb: assertion failed for thread 0xf8a20240, thread-id 130: mp->initialized == B_TRUE, file ../common/kernel.c, line 162
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
NEX-4228 dedup arcstats are redundant
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
NEX-7317 Getting assert !refcount_is_zero(&scl->scl_count) when trying to import pool
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Revert "Merge pull request #520 in OS/nza-kernel from ~SASO.KISELKOV/nza-kernel:NEX-5671-pl2arc-le_psize to master"
This reverts commit b63e91b939886744224854ea365d70e05ddd6077, reversing
changes made to a6e3a0255c8b22f65343bf641ffefaf9ae948fd4.
NEX-5671 assertion: (ab->b_l2hdr.b_asize) >> (9) >= 1 (0x0 >= 0x1), file: ../../common/fs/zfs/arc.c, line: 8275
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
NEX-5058 WBC: Race between the purging of window and opening new one
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-2830 ZFS smart compression
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
6421 Add missing multilist_destroy calls to arc_fini
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
5219 l2arc_write_buffers() may write beyond target_sz
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Justin Gibbs <gibbs@FreeBSD.org>
Approved by: Matthew Ahrens <mahrens@delphix.com>
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R (fix studio build)
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
6220 memleak in l2arc on debug build
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
5847 libzfs_diff should check zfs_prop_get() return
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NEX-3879 L2ARC evict task allocates a useless struct
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption (fix pL2ARC integration)
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-4408 backport illumos #6214 to avoid corruption
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3979 fix arc_mru/mfu typo
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3961 arc_meta_max is not counted correctly
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3946 Port Illumos 5983 to release-5.0
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
NEX-3945 file-backed cache devices considered harmful
Reviewed by: Alek Pinchuk <alek@nexenta.com>
NEX-3541 Implement persistent L2ARC - fix build breakage in libzpool (v2).
NEX-3541 Implement persistent L2ARC
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Conflicts:
usr/src/uts/common/fs/zfs/sys/spa.h
NEX-3630 Backport illumos #5701 from master to 5.0
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3558 KRRP Integration
NEX-3387 ARC stats appear to be in wrong/weird order
Reviewed by: Kirill Davydychev <kirill.davydychev@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3296 turn on DDT limit by default
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3300 ddt byte count ceiling tunables should not depend on zfs_ddt_limit_type being set
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 need some dedup improvements
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
NEX-3165 segregate ddt in arc (other lint fix)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Rob Gittins <rob.gittins@nexenta.com>
NEX-3165 segregate ddt in arc
NEX-3079 port illumos ARC improvements
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0 (part 2)
NEX-2704 smbstat man page needs update
NEX-2301 zpool destroy assertion failed: vd->vdev_stat.vs_alloc == 0
3995 Memory leak of compressed buffers in l2arc_write_done
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Garrett D'Amore <garrett@damore.org>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
OS-80 support for vdev and CoS properties for the new I/O scheduler
OS-95 lint warning introduced by OS-61
NEX-463: bumped max queue size for L2ARC async evict
The maximum length of the taskq used for async ARC and L2ARC flush is
now a tuneable (zfs_flush_ntasks) that is initialized to 64.
That number is just as arbitrary as before, but higher than the original 4.
The real fix is to rework L2ARC eviction per OS-53; for now a longer
queue should suffice.
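As an illustrative sketch only (the taskq name, priority, and init function below are hypothetical, not taken from this change), a flush taskq whose queue depth is bounded by such a tunable could be created with the standard illumos taskq interface:

    #include <sys/taskq.h>
    #include <sys/disp.h>

    int zfs_flush_ntasks = 64;          /* tuneable: max queued flush tasks */

    static taskq_t *arc_flush_taskq;    /* hypothetical name */

    static void
    arc_flush_taskq_init(void)
    {
            /*
             * A single worker thread is enough for this sketch; the
             * zfs_flush_ntasks value only bounds how many flush requests
             * may wait on the queue at once.
             */
            arc_flush_taskq = taskq_create("arc_flush_tq", 1, minclsyspri,
                1, zfs_flush_ntasks, TASKQ_PREPOPULATE);
    }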
Support for secondarycache=data option
Align mutex tables in arc.c and dbuf.c to 64 bytes (one cache line), placing each kmutex_t on a cache line by itself to avoid false sharing
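A minimal sketch of that padding approach, assuming hypothetical names (this is not the code from the change itself): each kmutex_t is wrapped in a structure padded out to a full 64-byte cache line so adjacent locks never share one.

    #include <sys/mutex.h>

    #define ARC_CACHE_LINE_SIZE     64      /* illustrative constant */

    /* Each lock occupies its own cache line, so neighbors cannot false-share. */
    typedef struct arc_padded_mutex {
            kmutex_t        pm_mutex;
            char            pm_pad[ARC_CACHE_LINE_SIZE - sizeof (kmutex_t)];
    } arc_padded_mutex_t;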
re #14119 BAD-TRAP panic under load
re #13989 port of illumos-3805
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
re #13729 assign each ARC hash bucket its own mutex
In the ARC, the number of buckets in the buffer header hash table is
proportional to the size of physical RAM, but the number of locks protecting
the headers in those buckets is fixed at 256.
Hence, on systems with large memory (>= 128GB) too many unrelated buffer
headers are protected by the same mutex.
When system memory is fragmented this can cause a deadlock:
- An arc_read() thread may be trying to allocate a 128k buffer while holding
a header lock.
- The allocation uses the KM_PUSHPAGE option, which blocks the thread if no
contiguous chunk of the requested size is available.
- The ARC eviction thread that is supposed to evict some buffers calls an
evict callback on one of them.
- Before freeing the memory, the callback attempts to take the lock on a
buffer header.
- Incidentally, that buffer header is protected by the same lock as the one
held by the arc_read() thread.
The solution in this patch is not perfect, in that it still protects all
headers in a hash bucket with a single lock.
However, the probability of collision is very low and does not depend on
memory size.
By the same argument, padding the locks out to a cache line would be a waste
of memory here: the probability of contention on a cache line is quite low,
given the number of buckets, the number of locks per cache line (4), and the
fact that the hash function (crc64 % hash table size) is supposed to be a
very good randomizer.
The effect on memory usage, for a hash table of size n, is as follows:
- The original code uses 16K + 16 + n * 8 bytes of memory.
- This fix uses 2 * n * 8 + 8 bytes of memory.
- The net memory overhead is therefore n * 8 - 16K - 8 bytes.
The value of n grows proportionally to physical memory size.
For 128GB of physical memory n is 2M, so the memory overhead is
16M - 16K - 8 bytes.
For smaller memory configurations the overhead is proportionally smaller, and
for larger configurations it is proportionally bigger.
The patch has been tested for 30+ hours with a vdbench script that reproduces
the hang with the original code 100% of the time within 20-30 minutes.
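For illustration only (the identifiers below are hypothetical and not part of the patch), pairing each hash chain with its own mutex is what yields the 2 * n * 8 bytes for a table of n buckets: one 8-byte kmutex_t plus one 8-byte chain pointer per bucket.

    #include <sys/mutex.h>

    struct arc_buf_hdr;                     /* opaque here; defined in arc.c */

    typedef struct buf_hash_bucket {
            kmutex_t                hb_lock;   /* 8 bytes: lock for this bucket */
            struct arc_buf_hdr      *hb_chain; /* 8 bytes: head of header chain */
    } buf_hash_bucket_t;

    static buf_hash_bucket_t *buf_hash_buckets; /* n entries; n scales with physmem */

    #define BUF_HASH_LOCK(idx)      (&buf_hash_buckets[(idx)].hb_lock)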
re #10054 rb4467 Support for asynchronous ARC/L2ARC eviction
re #13165 rb4265 zfs-monitor should fallback to using DEV_BSIZE
re #10054 rb4249 Long export time causes failover to fail
--- old/usr/src/uts/common/fs/zfs/arc.c
+++ new/usr/src/uts/common/fs/zfs/arc.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21 /*
22 22 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
23 23 * Copyright (c) 2018, Joyent, Inc.
24 24 * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
25 25 * Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
26 - * Copyright 2017 Nexenta Systems, Inc. All rights reserved.
26 + * Copyright 2019 Nexenta Systems, Inc. All rights reserved.
27 27 */
28 28
29 29 /*
30 30 * DVA-based Adjustable Replacement Cache
31 31 *
32 32 * While much of the theory of operation used here is
33 33 * based on the self-tuning, low overhead replacement cache
34 34 * presented by Megiddo and Modha at FAST 2003, there are some
35 35 * significant differences:
36 36 *
37 37 * 1. The Megiddo and Modha model assumes any page is evictable.
38 38 * Pages in its cache cannot be "locked" into memory. This makes
39 39 * the eviction algorithm simple: evict the last page in the list.
40 40 * This also make the performance characteristics easy to reason
41 41 * about. Our cache is not so simple. At any given moment, some
42 42 * subset of the blocks in the cache are un-evictable because we
43 43 * have handed out a reference to them. Blocks are only evictable
44 44 * when there are no external references active. This makes
45 45 * eviction far more problematic: we choose to evict the evictable
46 46 * blocks that are the "lowest" in the list.
47 47 *
48 48 * There are times when it is not possible to evict the requested
49 49 * space. In these circumstances we are unable to adjust the cache
50 50 * size. To prevent the cache growing unbounded at these times we
51 51 * implement a "cache throttle" that slows the flow of new data
52 52 * into the cache until we can make space available.
53 53 *
54 54 * 2. The Megiddo and Modha model assumes a fixed cache size.
55 55 * Pages are evicted when the cache is full and there is a cache
56 56 * miss. Our model has a variable sized cache. It grows with
57 57 * high use, but also tries to react to memory pressure from the
58 58 * operating system: decreasing its size when system memory is
59 59 * tight.
60 60 *
61 61 * 3. The Megiddo and Modha model assumes a fixed page size. All
62 62 * elements of the cache are therefore exactly the same size. So
63 63 * when adjusting the cache size following a cache miss, it's simply
64 64 * a matter of choosing a single page to evict. In our model, we
65 65 * have variable sized cache blocks (ranging from 512 bytes to
66 66 * 128K bytes). We therefore choose a set of blocks to evict to make
67 67 * space for a cache miss that approximates as closely as possible
68 68 * the space used by the new block.
69 69 *
70 70 * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
71 71 * by N. Megiddo & D. Modha, FAST 2003
72 72 */
73 73
74 74 /*
75 75 * The locking model:
76 76 *
77 77 * A new reference to a cache buffer can be obtained in two
78 78 * ways: 1) via a hash table lookup using the DVA as a key,
79 79 * or 2) via one of the ARC lists. The arc_read() interface
80 80 * uses method 1, while the internal ARC algorithms for
81 81 * adjusting the cache use method 2. We therefore provide two
82 82 * types of locks: 1) the hash table lock array, and 2) the
83 83 * ARC list locks.
84 84 *
85 85 * Buffers do not have their own mutexes, rather they rely on the
86 86 * hash table mutexes for the bulk of their protection (i.e. most
87 87 * fields in the arc_buf_hdr_t are protected by these mutexes).
88 88 *
89 89 * buf_hash_find() returns the appropriate mutex (held) when it
90 90 * locates the requested buffer in the hash table. It returns
91 91 * NULL for the mutex if the buffer was not in the table.
92 92 *
93 93 * buf_hash_remove() expects the appropriate hash mutex to be
94 94 * already held before it is invoked.
95 95 *
96 96 * Each ARC state also has a mutex which is used to protect the
97 97 * buffer list associated with the state. When attempting to
98 98 * obtain a hash table lock while holding an ARC list lock you
99 99 * must use: mutex_tryenter() to avoid deadlock. Also note that
100 100 * the active state mutex must be held before the ghost state mutex.
101 101 *
102 102 * Note that the majority of the performance stats are manipulated
103 103 * with atomic operations.
104 104 *
105 105 * The L2ARC uses the l2ad_mtx on each vdev for the following:
106 106 *
107 107 * - L2ARC buflist creation
108 108 * - L2ARC buflist eviction
109 109 * - L2ARC write completion, which walks L2ARC buflists
110 110 * - ARC header destruction, as it removes from L2ARC buflists
111 111 * - ARC header release, as it removes from L2ARC buflists
112 112 */
113 113
114 114 /*
115 115 * ARC operation:
116 116 *
117 117 * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
118 118 * This structure can point either to a block that is still in the cache or to
119 119 * one that is only accessible in an L2 ARC device, or it can provide
120 120 * information about a block that was recently evicted. If a block is
121 121 * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
122 122 * information to retrieve it from the L2ARC device. This information is
123 123 * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block
124 124 * that is in this state cannot access the data directly.
125 125 *
126 126 * Blocks that are actively being referenced or have not been evicted
127 127 * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
128 128 * the arc_buf_hdr_t that will point to the data block in memory. A block can
129 129 * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
130 130 * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
131 131 * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).
132 132 *
133 133 * The L1ARC's data pointer may or may not be uncompressed. The ARC has the
134 134 * ability to store the physical data (b_pabd) associated with the DVA of the
135 135 * arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,
136 136 * it will match its on-disk compression characteristics. This behavior can be
137 137 * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
138 138 * compressed ARC functionality is disabled, the b_pabd will point to an
139 139 * uncompressed version of the on-disk data.
140 140 *
141 141 * Data in the L1ARC is not accessed by consumers of the ARC directly. Each
142 142 * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
143 143 * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
144 144 * consumer. The ARC will provide references to this data and will keep it
145 145 * cached until it is no longer in use. The ARC caches only the L1ARC's physical
146 146 * data block and will evict any arc_buf_t that is no longer referenced. The
147 147 * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
148 148 * "overhead_size" kstat.
149 149 *
150 150 * Depending on the consumer, an arc_buf_t can be requested in uncompressed or
151 151 * compressed form. The typical case is that consumers will want uncompressed
152 152 * data, and when that happens a new data buffer is allocated where the data is
153 153 * decompressed for them to use. Currently the only consumer who wants
154 154 * compressed arc_buf_t's is "zfs send", when it streams data exactly as it
155 155 * exists on disk. When this happens, the arc_buf_t's data buffer is shared
156 156 * with the arc_buf_hdr_t.
157 157 *
158 158 * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
159 159 * first one is owned by a compressed send consumer (and therefore references
160 160 * the same compressed data buffer as the arc_buf_hdr_t) and the second could be
161 161 * used by any other consumer (and has its own uncompressed copy of the data
162 162 * buffer).
163 163 *
164 164 * arc_buf_hdr_t
165 165 * +-----------+
166 166 * | fields |
167 167 * | common to |
168 168 * | L1- and |
169 169 * | L2ARC |
170 170 * +-----------+
171 171 * | l2arc_buf_hdr_t
172 172 * | |
173 173 * +-----------+
174 174 * | l1arc_buf_hdr_t
175 175 * | | arc_buf_t
176 176 * | b_buf +------------>+-----------+ arc_buf_t
177 177 * | b_pabd +-+ |b_next +---->+-----------+
178 178 * +-----------+ | |-----------| |b_next +-->NULL
179 179 * | |b_comp = T | +-----------+
180 180 * | |b_data +-+ |b_comp = F |
181 181 * | +-----------+ | |b_data +-+
182 182 * +->+------+ | +-----------+ |
183 183 * compressed | | | |
184 184 * data | |<--------------+ | uncompressed
185 185 * +------+ compressed, | data
186 186 * shared +-->+------+
187 187 * data | |
188 188 * | |
189 189 * +------+
190 190 *
191 191 * When a consumer reads a block, the ARC must first look to see if the
192 192 * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
193 193 * arc_buf_t and either copies uncompressed data into a new data buffer from an
194 194 * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a
195 195 * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the
196 196 * hdr is compressed and the desired compression characteristics of the
197 197 * arc_buf_t consumer. If the arc_buf_t ends up sharing data with the
198 198 * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
199 199 * the last buffer in the hdr's b_buf list, however a shared compressed buf can
200 200 * be anywhere in the hdr's list.
201 201 *
202 202 * The diagram below shows an example of an uncompressed ARC hdr that is
203 203 * sharing its data with an arc_buf_t (note that the shared uncompressed buf is
204 204 * the last element in the buf list):
205 205 *
206 206 * arc_buf_hdr_t
207 207 * +-----------+
208 208 * | |
209 209 * | |
210 210 * | |
211 211 * +-----------+
212 212 * l2arc_buf_hdr_t| |
213 213 * | |
214 214 * +-----------+
215 215 * l1arc_buf_hdr_t| |
216 216 * | | arc_buf_t (shared)
217 217 * | b_buf +------------>+---------+ arc_buf_t
218 218 * | | |b_next +---->+---------+
219 219 * | b_pabd +-+ |---------| |b_next +-->NULL
220 220 * +-----------+ | | | +---------+
221 221 * | |b_data +-+ | |
222 222 * | +---------+ | |b_data +-+
223 223 * +->+------+ | +---------+ |
224 224 * | | | |
225 225 * uncompressed | | | |
226 226 * data +------+ | |
227 227 * ^ +->+------+ |
228 228 * | uncompressed | | |
229 229 * | data | | |
230 230 * | +------+ |
231 231 * +---------------------------------+
232 232 *
233 233 * Writing to the ARC requires that the ARC first discard the hdr's b_pabd
234 234 * since the physical block is about to be rewritten. The new data contents
235 235 * will be contained in the arc_buf_t. As the I/O pipeline performs the write,
236 236 * it may compress the data before writing it to disk. The ARC will be called
237 237 * with the transformed data and will bcopy the transformed on-disk block into
238 238 * a newly allocated b_pabd. Writes are always done into buffers which have
239 239 * either been loaned (and hence are new and don't have other readers) or
240 240 * buffers which have been released (and hence have their own hdr, if there
241 241 * were originally other readers of the buf's original hdr). This ensures that
242 242 * the ARC only needs to update a single buf and its hdr after a write occurs.
243 243 *
244 244 * When the L2ARC is in use, it will also take advantage of the b_pabd. The
245 245 * L2ARC will always write the contents of b_pabd to the L2ARC. This means
246 246 * that when compressed ARC is enabled that the L2ARC blocks are identical
247 247 * to the on-disk block in the main data pool. This provides a significant
248 248 * advantage since the ARC can leverage the bp's checksum when reading from the
249 249 * L2ARC to determine if the contents are valid. However, if the compressed
250 250 * ARC is disabled, then the L2ARC's block must be transformed to look
251 251 * like the physical block in the main data pool before comparing the
252 252 * checksum and determining its validity.
253 253 */
254 254
255 255 #include <sys/spa.h>
256 +#include <sys/spa_impl.h>
256 257 #include <sys/zio.h>
257 258 #include <sys/spa_impl.h>
258 259 #include <sys/zio_compress.h>
259 260 #include <sys/zio_checksum.h>
260 261 #include <sys/zfs_context.h>
261 262 #include <sys/arc.h>
262 263 #include <sys/refcount.h>
263 264 #include <sys/vdev.h>
264 265 #include <sys/vdev_impl.h>
265 266 #include <sys/dsl_pool.h>
266 267 #include <sys/zio_checksum.h>
267 268 #include <sys/multilist.h>
268 269 #include <sys/abd.h>
269 270 #ifdef _KERNEL
270 271 #include <sys/vmsystm.h>
271 272 #include <vm/anon.h>
272 273 #include <sys/fs/swapnode.h>
273 274 #include <sys/dnlc.h>
274 275 #endif
275 276 #include <sys/callb.h>
276 277 #include <sys/kstat.h>
277 278 #include <zfs_fletcher.h>
278 -#include <sys/aggsum.h>
279 -#include <sys/cityhash.h>
279 +#include <sys/byteorder.h>
280 +#include <sys/spa_impl.h>
280 281
281 282 #ifndef _KERNEL
282 283 /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
283 284 boolean_t arc_watch = B_FALSE;
284 285 int arc_procfd;
285 286 #endif
286 287
287 288 static kmutex_t arc_reclaim_lock;
288 289 static kcondvar_t arc_reclaim_thread_cv;
289 290 static boolean_t arc_reclaim_thread_exit;
290 291 static kcondvar_t arc_reclaim_waiters_cv;
291 292
292 293 uint_t arc_reduce_dnlc_percent = 3;
293 294
294 295 /*
295 296 * The number of headers to evict in arc_evict_state_impl() before
296 297 * dropping the sublist lock and evicting from another sublist. A lower
297 298 * value means we're more likely to evict the "correct" header (i.e. the
298 299 * oldest header in the arc state), but comes with higher overhead
299 300 * (i.e. more invocations of arc_evict_state_impl()).
300 301 */
301 302 int zfs_arc_evict_batch_limit = 10;
302 303
303 304 /* number of seconds before growing cache again */
304 305 static int arc_grow_retry = 60;
305 306
306 307 /* number of milliseconds before attempting a kmem-cache-reap */
307 308 static int arc_kmem_cache_reap_retry_ms = 1000;
308 309
309 310 /* shift of arc_c for calculating overflow limit in arc_get_data_impl */
310 311 int zfs_arc_overflow_shift = 8;
311 312
312 313 /* shift of arc_c for calculating both min and max arc_p */
313 314 static int arc_p_min_shift = 4;
314 315
315 316 /* log2(fraction of arc to reclaim) */
316 317 static int arc_shrink_shift = 7;
317 318
318 319 /*
319 320 * log2(fraction of ARC which must be free to allow growing).
320 321 * I.e. If there is less than arc_c >> arc_no_grow_shift free memory,
321 322 * when reading a new block into the ARC, we will evict an equal-sized block
322 323 * from the ARC.
323 324 *
324 325 * This must be less than arc_shrink_shift, so that when we shrink the ARC,
325 326 * we will still not allow it to grow.
326 327 */
327 328 int arc_no_grow_shift = 5;
328 329
329 330
330 331 /*
331 332 * minimum lifespan of a prefetch block in clock ticks
332 333 * (initialized in arc_init())
333 334 */
334 335 static int arc_min_prefetch_lifespan;
335 336
336 337 /*
337 338 * If this percent of memory is free, don't throttle.
338 339 */
339 340 int arc_lotsfree_percent = 10;
340 341
341 342 static int arc_dead;
342 343
343 344 /*
344 345 * The arc has filled available memory and has now warmed up.
345 346 */
346 347 static boolean_t arc_warm;
347 348
348 349 /*
349 350 * log2 fraction of the zio arena to keep free.
350 351 */
351 352 int arc_zio_arena_free_shift = 2;
352 353
353 354 /*
354 355 * These tunables are for performance analysis.
355 356 */
356 357 uint64_t zfs_arc_max;
357 358 uint64_t zfs_arc_min;
358 359 uint64_t zfs_arc_meta_limit = 0;
359 360 uint64_t zfs_arc_meta_min = 0;
361 +uint64_t zfs_arc_ddt_limit = 0;
362 +/*
363 + * Tunable to control "dedup ceiling"
364 + * Possible values:
365 + * DDT_NO_LIMIT - default behaviour, i.e. no ceiling
366 + * DDT_LIMIT_TO_ARC - stop DDT growth if DDT is bigger than its "ARC space"
367 + * DDT_LIMIT_TO_L2ARC - stop DDT growth when DDT size is bigger than the
368 + * L2ARC DDT dev(s) for that pool
369 + */
370 +zfs_ddt_limit_t zfs_ddt_limit_type = DDT_LIMIT_TO_ARC;
371 +/*
372 + * Alternative to the above way of controlling "dedup ceiling":
373 + * Stop DDT growth when the in-core DDT size is above the tunable below.
374 + * This tunable overrides the zfs_ddt_limit_type tunable.
375 + */
376 +uint64_t zfs_ddt_byte_ceiling = 0;
377 +boolean_t zfs_arc_segregate_ddt = B_TRUE;
360 378 int zfs_arc_grow_retry = 0;
361 379 int zfs_arc_shrink_shift = 0;
362 380 int zfs_arc_p_min_shift = 0;
363 381 int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
364 382
383 +/* Tuneable, default is 64, which is essentially arbitrary */
384 +int zfs_flush_ntasks = 64;
385 +
365 386 boolean_t zfs_compressed_arc_enabled = B_TRUE;
366 387
367 388 /*
368 389 * Note that buffers can be in one of 6 states:
369 390 * ARC_anon - anonymous (discussed below)
370 391 * ARC_mru - recently used, currently cached
371 392 * ARC_mru_ghost - recentely used, no longer in cache
372 393 * ARC_mfu - frequently used, currently cached
373 394 * ARC_mfu_ghost - frequently used, no longer in cache
374 395 * ARC_l2c_only - exists in L2ARC but not other states
375 396 * When there are no active references to the buffer, they are
376 397 * linked onto a list in one of these arc states. These are
377 398 * the only buffers that can be evicted or deleted. Within each
378 399 * state there are multiple lists, one for meta-data and one for
379 400 * non-meta-data. Meta-data (indirect blocks, blocks of dnodes,
380 401 * etc.) is tracked separately so that it can be managed more
381 402 * explicitly: favored over data, limited explicitly.
382 403 *
383 404 * Anonymous buffers are buffers that are not associated with
384 405 * a DVA. These are buffers that hold dirty block copies
385 406 * before they are written to stable storage. By definition,
386 407 * they are "ref'd" and are considered part of arc_mru
387 408 * that cannot be freed. Generally, they will acquire a DVA
388 409 * as they are written and migrate onto the arc_mru list.
389 410 *
390 411 * The ARC_l2c_only state is for buffers that are in the second
391 412 * level ARC but no longer in any of the ARC_m* lists. The second
392 413 * level ARC itself may also contain buffers that are in any of
393 414 * the ARC_m* states - meaning that a buffer can exist in two
394 415 * places. The reason for the ARC_l2c_only state is to keep the
395 416 * buffer header in the hash table, so that reads that hit the
396 417 * second level ARC benefit from these fast lookups.
397 418 */
398 419
399 420 typedef struct arc_state {
400 421 /*
401 422 * list of evictable buffers
402 423 */
403 424 multilist_t *arcs_list[ARC_BUFC_NUMTYPES];
404 425 /*
405 426 * total amount of evictable data in this state
406 427 */
407 428 refcount_t arcs_esize[ARC_BUFC_NUMTYPES];
408 429 /*
409 430 * total amount of data in this state; this includes: evictable,
410 - * non-evictable, ARC_BUFC_DATA, and ARC_BUFC_METADATA.
431 + * non-evictable, ARC_BUFC_DATA, ARC_BUFC_METADATA and ARC_BUFC_DDT.
432 + * ARC_BUFC_DDT list is only populated when zfs_arc_segregate_ddt is
433 + * true.
411 434 */
412 435 refcount_t arcs_size;
413 436 } arc_state_t;
414 437
438 +/*
439 + * We loop through these in l2arc_write_buffers() starting from
440 + * PRIORITY_MFU_DDT until we reach PRIORITY_NUMTYPES or the buffer that
441 + * we are writing to the L2ARC device becomes full.
442 + */
443 +enum l2arc_priorities {
444 + PRIORITY_MFU_DDT,
445 + PRIORITY_MRU_DDT,
446 + PRIORITY_MFU_META,
447 + PRIORITY_MRU_META,
448 + PRIORITY_MFU_DATA,
449 + PRIORITY_MRU_DATA,
450 + PRIORITY_NUMTYPES,
451 +};
452 +
415 453 /* The 6 states: */
416 454 static arc_state_t ARC_anon;
417 455 static arc_state_t ARC_mru;
418 456 static arc_state_t ARC_mru_ghost;
419 457 static arc_state_t ARC_mfu;
420 458 static arc_state_t ARC_mfu_ghost;
421 459 static arc_state_t ARC_l2c_only;
422 460
423 461 typedef struct arc_stats {
424 462 kstat_named_t arcstat_hits;
463 + kstat_named_t arcstat_ddt_hits;
425 464 kstat_named_t arcstat_misses;
426 465 kstat_named_t arcstat_demand_data_hits;
427 466 kstat_named_t arcstat_demand_data_misses;
428 467 kstat_named_t arcstat_demand_metadata_hits;
429 468 kstat_named_t arcstat_demand_metadata_misses;
469 + kstat_named_t arcstat_demand_ddt_hits;
470 + kstat_named_t arcstat_demand_ddt_misses;
430 471 kstat_named_t arcstat_prefetch_data_hits;
431 472 kstat_named_t arcstat_prefetch_data_misses;
432 473 kstat_named_t arcstat_prefetch_metadata_hits;
433 474 kstat_named_t arcstat_prefetch_metadata_misses;
475 + kstat_named_t arcstat_prefetch_ddt_hits;
476 + kstat_named_t arcstat_prefetch_ddt_misses;
434 477 kstat_named_t arcstat_mru_hits;
435 478 kstat_named_t arcstat_mru_ghost_hits;
436 479 kstat_named_t arcstat_mfu_hits;
437 480 kstat_named_t arcstat_mfu_ghost_hits;
438 481 kstat_named_t arcstat_deleted;
439 482 /*
440 483 * Number of buffers that could not be evicted because the hash lock
441 484 * was held by another thread. The lock may not necessarily be held
442 485 * by something using the same buffer, since hash locks are shared
443 486 * by multiple buffers.
444 487 */
445 488 kstat_named_t arcstat_mutex_miss;
446 489 /*
490 + * Number of buffers skipped when updating the access state due to the
491 + * header having already been released after acquiring the hash lock.
492 + */
493 + kstat_named_t arcstat_access_skip;
494 + /*
447 495 * Number of buffers skipped because they have I/O in progress, are
448 - * indrect prefetch buffers that have not lived long enough, or are
496 + * indirect prefetch buffers that have not lived long enough, or are
449 497 * not from the spa we're trying to evict from.
450 498 */
451 499 kstat_named_t arcstat_evict_skip;
452 500 /*
453 501 * Number of times arc_evict_state() was unable to evict enough
454 502 * buffers to reach its target amount.
455 503 */
456 504 kstat_named_t arcstat_evict_not_enough;
457 505 kstat_named_t arcstat_evict_l2_cached;
458 506 kstat_named_t arcstat_evict_l2_eligible;
459 507 kstat_named_t arcstat_evict_l2_ineligible;
460 508 kstat_named_t arcstat_evict_l2_skip;
461 509 kstat_named_t arcstat_hash_elements;
462 510 kstat_named_t arcstat_hash_elements_max;
463 511 kstat_named_t arcstat_hash_collisions;
464 512 kstat_named_t arcstat_hash_chains;
465 513 kstat_named_t arcstat_hash_chain_max;
466 514 kstat_named_t arcstat_p;
467 515 kstat_named_t arcstat_c;
468 516 kstat_named_t arcstat_c_min;
469 517 kstat_named_t arcstat_c_max;
470 - /* Not updated directly; only synced in arc_kstat_update. */
471 518 kstat_named_t arcstat_size;
472 519 /*
473 520 * Number of compressed bytes stored in the arc_buf_hdr_t's b_pabd.
474 521 * Note that the compressed bytes may match the uncompressed bytes
475 522 * if the block is either not compressed or compressed arc is disabled.
476 523 */
477 524 kstat_named_t arcstat_compressed_size;
478 525 /*
479 526 * Uncompressed size of the data stored in b_pabd. If compressed
480 527 * arc is disabled then this value will be identical to the stat
481 528 * above.
482 529 */
483 530 kstat_named_t arcstat_uncompressed_size;
484 531 /*
485 532 * Number of bytes stored in all the arc_buf_t's. This is classified
486 533 * as "overhead" since this data is typically short-lived and will
487 534 * be evicted from the arc when it becomes unreferenced unless the
488 535 * zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level
489 536 * values have been set (see comment in dbuf.c for more information).
490 537 */
491 538 kstat_named_t arcstat_overhead_size;
492 539 /*
493 540 * Number of bytes consumed by internal ARC structures necessary
494 541 * for tracking purposes; these structures are not actually
495 542 * backed by ARC buffers. This includes arc_buf_hdr_t structures
496 543 * (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
497 544 * caches), and arc_buf_t structures (allocated via arc_buf_t
498 545 * cache).
499 - * Not updated directly; only synced in arc_kstat_update.
500 546 */
501 547 kstat_named_t arcstat_hdr_size;
502 548 /*
503 549 * Number of bytes consumed by ARC buffers of type equal to
504 550 * ARC_BUFC_DATA. This is generally consumed by buffers backing
505 551 * on disk user data (e.g. plain file contents).
506 - * Not updated directly; only synced in arc_kstat_update.
507 552 */
508 553 kstat_named_t arcstat_data_size;
509 554 /*
510 555 * Number of bytes consumed by ARC buffers of type equal to
511 556 * ARC_BUFC_METADATA. This is generally consumed by buffers
512 557 * backing on disk data that is used for internal ZFS
513 558 * structures (e.g. ZAP, dnode, indirect blocks, etc).
514 - * Not updated directly; only synced in arc_kstat_update.
515 559 */
516 560 kstat_named_t arcstat_metadata_size;
517 561 /*
562 + * Number of bytes consumed by ARC buffers of type equal to
563 + * ARC_BUFC_DDT. This is consumed by buffers backing on disk data
564 + * that is used to store DDT (ZAP, ddt stats).
565 + * Only used if zfs_arc_segregate_ddt is true.
566 + */
567 + kstat_named_t arcstat_ddt_size;
568 + /*
518 569 * Number of bytes consumed by various buffers and structures
519 570 * not actually backed with ARC buffers. This includes bonus
520 571 * buffers (allocated directly via zio_buf_* functions),
521 572 * dmu_buf_impl_t structures (allocated via dmu_buf_impl_t
522 573 * cache), and dnode_t structures (allocated via dnode_t cache).
523 - * Not updated directly; only synced in arc_kstat_update.
524 574 */
525 575 kstat_named_t arcstat_other_size;
526 576 /*
527 577 * Total number of bytes consumed by ARC buffers residing in the
528 578 * arc_anon state. This includes *all* buffers in the arc_anon
529 579 * state; e.g. data, metadata, evictable, and unevictable buffers
530 580 * are all included in this value.
531 - * Not updated directly; only synced in arc_kstat_update.
532 581 */
533 582 kstat_named_t arcstat_anon_size;
534 583 /*
535 584 * Number of bytes consumed by ARC buffers that meet the
536 585 * following criteria: backing buffers of type ARC_BUFC_DATA,
537 586 * residing in the arc_anon state, and are eligible for eviction
538 587 * (e.g. have no outstanding holds on the buffer).
539 - * Not updated directly; only synced in arc_kstat_update.
540 588 */
541 589 kstat_named_t arcstat_anon_evictable_data;
542 590 /*
543 591 * Number of bytes consumed by ARC buffers that meet the
544 592 * following criteria: backing buffers of type ARC_BUFC_METADATA,
545 593 * residing in the arc_anon state, and are eligible for eviction
546 594 * (e.g. have no outstanding holds on the buffer).
547 - * Not updated directly; only synced in arc_kstat_update.
548 595 */
549 596 kstat_named_t arcstat_anon_evictable_metadata;
550 597 /*
598 + * Number of bytes consumed by ARC buffers that meet the
599 + * following criteria: backing buffers of type ARC_BUFC_DDT,
600 + * residing in the arc_anon state, and are eligible for eviction.
601 + * Only used if zfs_arc_segregate_ddt is true.
602 + */
603 + kstat_named_t arcstat_anon_evictable_ddt;
604 + /*
551 605 * Total number of bytes consumed by ARC buffers residing in the
552 606 * arc_mru state. This includes *all* buffers in the arc_mru
553 607 * state; e.g. data, metadata, evictable, and unevictable buffers
554 608 * are all included in this value.
555 - * Not updated directly; only synced in arc_kstat_update.
556 609 */
557 610 kstat_named_t arcstat_mru_size;
558 611 /*
559 612 * Number of bytes consumed by ARC buffers that meet the
560 613 * following criteria: backing buffers of type ARC_BUFC_DATA,
561 614 * residing in the arc_mru state, and are eligible for eviction
562 615 * (e.g. have no outstanding holds on the buffer).
563 - * Not updated directly; only synced in arc_kstat_update.
564 616 */
565 617 kstat_named_t arcstat_mru_evictable_data;
566 618 /*
567 619 * Number of bytes consumed by ARC buffers that meet the
568 620 * following criteria: backing buffers of type ARC_BUFC_METADATA,
569 621 * residing in the arc_mru state, and are eligible for eviction
570 622 * (e.g. have no outstanding holds on the buffer).
571 - * Not updated directly; only synced in arc_kstat_update.
572 623 */
573 624 kstat_named_t arcstat_mru_evictable_metadata;
574 625 /*
626 + * Number of bytes consumed by ARC buffers that meet the
627 + * following criteria: backing buffers of type ARC_BUFC_DDT,
628 + * residing in the arc_mru state, and are eligible for eviction
629 + * (e.g. have no outstanding holds on the buffer).
630 + * Only used if zfs_arc_segregate_ddt is true.
631 + */
632 + kstat_named_t arcstat_mru_evictable_ddt;
633 + /*
575 634 * Total number of bytes that *would have been* consumed by ARC
576 635 * buffers in the arc_mru_ghost state. The key thing to note
577 636 * here, is the fact that this size doesn't actually indicate
578 637 * RAM consumption. The ghost lists only consist of headers and
579 638 * don't actually have ARC buffers linked off of these headers.
580 639 * Thus, *if* the headers had associated ARC buffers, these
581 640 * buffers *would have* consumed this number of bytes.
582 - * Not updated directly; only synced in arc_kstat_update.
583 641 */
584 642 kstat_named_t arcstat_mru_ghost_size;
585 643 /*
586 644 * Number of bytes that *would have been* consumed by ARC
587 645 * buffers that are eligible for eviction, of type
588 646 * ARC_BUFC_DATA, and linked off the arc_mru_ghost state.
589 - * Not updated directly; only synced in arc_kstat_update.
590 647 */
591 648 kstat_named_t arcstat_mru_ghost_evictable_data;
592 649 /*
593 650 * Number of bytes that *would have been* consumed by ARC
594 651 * buffers that are eligible for eviction, of type
595 652 * ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
596 - * Not updated directly; only synced in arc_kstat_update.
597 653 */
598 654 kstat_named_t arcstat_mru_ghost_evictable_metadata;
599 655 /*
656 + * Number of bytes that *would have been* consumed by ARC
657 + * buffers that are eligible for eviction, of type
658 + * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
659 + * Only used if zfs_arc_segregate_ddt is true.
660 + */
661 + kstat_named_t arcstat_mru_ghost_evictable_ddt;
662 + /*
600 663 * Total number of bytes consumed by ARC buffers residing in the
601 664 * arc_mfu state. This includes *all* buffers in the arc_mfu
602 665 * state; e.g. data, metadata, evictable, and unevictable buffers
603 666 * are all included in this value.
604 - * Not updated directly; only synced in arc_kstat_update.
605 667 */
606 668 kstat_named_t arcstat_mfu_size;
607 669 /*
608 670 * Number of bytes consumed by ARC buffers that are eligible for
609 671 * eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
610 672 * state.
611 - * Not updated directly; only synced in arc_kstat_update.
612 673 */
613 674 kstat_named_t arcstat_mfu_evictable_data;
614 675 /*
615 676 * Number of bytes consumed by ARC buffers that are eligible for
616 677 * eviction, of type ARC_BUFC_METADATA, and reside in the
617 678 * arc_mfu state.
618 - * Not updated directly; only synced in arc_kstat_update.
619 679 */
620 680 kstat_named_t arcstat_mfu_evictable_metadata;
621 681 /*
682 + * Number of bytes consumed by ARC buffers that are eligible for
683 + * eviction, of type ARC_BUFC_DDT, and reside in the
684 + * arc_mfu state.
685 + * Only used if zfs_arc_segregate_ddt is true.
686 + */
687 + kstat_named_t arcstat_mfu_evictable_ddt;
688 + /*
622 689 * Total number of bytes that *would have been* consumed by ARC
623 690 * buffers in the arc_mfu_ghost state. See the comment above
624 691 * arcstat_mru_ghost_size for more details.
625 - * Not updated directly; only synced in arc_kstat_update.
626 692 */
627 693 kstat_named_t arcstat_mfu_ghost_size;
628 694 /*
629 695 * Number of bytes that *would have been* consumed by ARC
630 696 * buffers that are eligible for eviction, of type
631 697 * ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.
632 - * Not updated directly; only synced in arc_kstat_update.
633 698 */
634 699 kstat_named_t arcstat_mfu_ghost_evictable_data;
635 700 /*
636 701 * Number of bytes that *would have been* consumed by ARC
637 702 * buffers that are eligible for eviction, of type
638 703 * ARC_BUFC_METADATA, and linked off the arc_mfu_ghost state.
639 - * Not updated directly; only synced in arc_kstat_update.
640 704 */
641 705 kstat_named_t arcstat_mfu_ghost_evictable_metadata;
706 + /*
707 + * Number of bytes that *would have been* consumed by ARC
708 + * buffers that are eligible for eviction, of type
709 + * ARC_BUFC_DDT, and linked off the arc_mru_ghost state.
710 + * Only used if zfs_arc_segregate_ddt is true.
711 + */
712 + kstat_named_t arcstat_mfu_ghost_evictable_ddt;
642 713 kstat_named_t arcstat_l2_hits;
714 + kstat_named_t arcstat_l2_ddt_hits;
643 715 kstat_named_t arcstat_l2_misses;
644 716 kstat_named_t arcstat_l2_feeds;
645 717 kstat_named_t arcstat_l2_rw_clash;
646 718 kstat_named_t arcstat_l2_read_bytes;
719 + kstat_named_t arcstat_l2_ddt_read_bytes;
647 720 kstat_named_t arcstat_l2_write_bytes;
721 + kstat_named_t arcstat_l2_ddt_write_bytes;
648 722 kstat_named_t arcstat_l2_writes_sent;
649 723 kstat_named_t arcstat_l2_writes_done;
650 724 kstat_named_t arcstat_l2_writes_error;
651 725 kstat_named_t arcstat_l2_writes_lock_retry;
652 726 kstat_named_t arcstat_l2_evict_lock_retry;
653 727 kstat_named_t arcstat_l2_evict_reading;
654 728 kstat_named_t arcstat_l2_evict_l1cached;
655 729 kstat_named_t arcstat_l2_free_on_write;
656 730 kstat_named_t arcstat_l2_abort_lowmem;
657 731 kstat_named_t arcstat_l2_cksum_bad;
658 732 kstat_named_t arcstat_l2_io_error;
659 733 kstat_named_t arcstat_l2_lsize;
660 734 kstat_named_t arcstat_l2_psize;
661 - /* Not updated directly; only synced in arc_kstat_update. */
662 735 kstat_named_t arcstat_l2_hdr_size;
736 + kstat_named_t arcstat_l2_log_blk_writes;
737 + kstat_named_t arcstat_l2_log_blk_avg_size;
738 + kstat_named_t arcstat_l2_data_to_meta_ratio;
739 + kstat_named_t arcstat_l2_rebuild_successes;
740 + kstat_named_t arcstat_l2_rebuild_abort_unsupported;
741 + kstat_named_t arcstat_l2_rebuild_abort_io_errors;
742 + kstat_named_t arcstat_l2_rebuild_abort_cksum_errors;
743 + kstat_named_t arcstat_l2_rebuild_abort_loop_errors;
744 + kstat_named_t arcstat_l2_rebuild_abort_lowmem;
745 + kstat_named_t arcstat_l2_rebuild_size;
746 + kstat_named_t arcstat_l2_rebuild_bufs;
747 + kstat_named_t arcstat_l2_rebuild_bufs_precached;
748 + kstat_named_t arcstat_l2_rebuild_psize;
749 + kstat_named_t arcstat_l2_rebuild_log_blks;
663 750 kstat_named_t arcstat_memory_throttle_count;
664 - /* Not updated directly; only synced in arc_kstat_update. */
665 751 kstat_named_t arcstat_meta_used;
666 752 kstat_named_t arcstat_meta_limit;
667 753 kstat_named_t arcstat_meta_max;
668 754 kstat_named_t arcstat_meta_min;
755 + kstat_named_t arcstat_ddt_limit;
669 756 kstat_named_t arcstat_sync_wait_for_async;
670 757 kstat_named_t arcstat_demand_hit_predictive_prefetch;
671 758 } arc_stats_t;
672 759
673 760 static arc_stats_t arc_stats = {
674 761 { "hits", KSTAT_DATA_UINT64 },
762 + { "ddt_hits", KSTAT_DATA_UINT64 },
675 763 { "misses", KSTAT_DATA_UINT64 },
676 764 { "demand_data_hits", KSTAT_DATA_UINT64 },
677 765 { "demand_data_misses", KSTAT_DATA_UINT64 },
678 766 { "demand_metadata_hits", KSTAT_DATA_UINT64 },
679 767 { "demand_metadata_misses", KSTAT_DATA_UINT64 },
768 + { "demand_ddt_hits", KSTAT_DATA_UINT64 },
769 + { "demand_ddt_misses", KSTAT_DATA_UINT64 },
680 770 { "prefetch_data_hits", KSTAT_DATA_UINT64 },
681 771 { "prefetch_data_misses", KSTAT_DATA_UINT64 },
682 772 { "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
683 773 { "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
774 + { "prefetch_ddt_hits", KSTAT_DATA_UINT64 },
775 + { "prefetch_ddt_misses", KSTAT_DATA_UINT64 },
684 776 { "mru_hits", KSTAT_DATA_UINT64 },
685 777 { "mru_ghost_hits", KSTAT_DATA_UINT64 },
686 778 { "mfu_hits", KSTAT_DATA_UINT64 },
687 779 { "mfu_ghost_hits", KSTAT_DATA_UINT64 },
688 780 { "deleted", KSTAT_DATA_UINT64 },
689 781 { "mutex_miss", KSTAT_DATA_UINT64 },
782 + { "access_skip", KSTAT_DATA_UINT64 },
690 783 { "evict_skip", KSTAT_DATA_UINT64 },
691 784 { "evict_not_enough", KSTAT_DATA_UINT64 },
692 785 { "evict_l2_cached", KSTAT_DATA_UINT64 },
693 786 { "evict_l2_eligible", KSTAT_DATA_UINT64 },
694 787 { "evict_l2_ineligible", KSTAT_DATA_UINT64 },
695 788 { "evict_l2_skip", KSTAT_DATA_UINT64 },
696 789 { "hash_elements", KSTAT_DATA_UINT64 },
697 790 { "hash_elements_max", KSTAT_DATA_UINT64 },
698 791 { "hash_collisions", KSTAT_DATA_UINT64 },
699 792 { "hash_chains", KSTAT_DATA_UINT64 },
700 793 { "hash_chain_max", KSTAT_DATA_UINT64 },
701 794 { "p", KSTAT_DATA_UINT64 },
702 795 { "c", KSTAT_DATA_UINT64 },
703 796 { "c_min", KSTAT_DATA_UINT64 },
704 797 { "c_max", KSTAT_DATA_UINT64 },
705 798 { "size", KSTAT_DATA_UINT64 },
706 799 { "compressed_size", KSTAT_DATA_UINT64 },
707 800 { "uncompressed_size", KSTAT_DATA_UINT64 },
708 801 { "overhead_size", KSTAT_DATA_UINT64 },
709 802 { "hdr_size", KSTAT_DATA_UINT64 },
710 803 { "data_size", KSTAT_DATA_UINT64 },
711 804 { "metadata_size", KSTAT_DATA_UINT64 },
805 + { "ddt_size", KSTAT_DATA_UINT64 },
712 806 { "other_size", KSTAT_DATA_UINT64 },
713 807 { "anon_size", KSTAT_DATA_UINT64 },
714 808 { "anon_evictable_data", KSTAT_DATA_UINT64 },
715 809 { "anon_evictable_metadata", KSTAT_DATA_UINT64 },
810 + { "anon_evictable_ddt", KSTAT_DATA_UINT64 },
716 811 { "mru_size", KSTAT_DATA_UINT64 },
717 812 { "mru_evictable_data", KSTAT_DATA_UINT64 },
718 813 { "mru_evictable_metadata", KSTAT_DATA_UINT64 },
814 + { "mru_evictable_ddt", KSTAT_DATA_UINT64 },
719 815 { "mru_ghost_size", KSTAT_DATA_UINT64 },
720 816 { "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },
721 817 { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
818 + { "mru_ghost_evictable_ddt", KSTAT_DATA_UINT64 },
722 819 { "mfu_size", KSTAT_DATA_UINT64 },
723 820 { "mfu_evictable_data", KSTAT_DATA_UINT64 },
724 821 { "mfu_evictable_metadata", KSTAT_DATA_UINT64 },
822 + { "mfu_evictable_ddt", KSTAT_DATA_UINT64 },
725 823 { "mfu_ghost_size", KSTAT_DATA_UINT64 },
726 824 { "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },
727 825 { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
826 + { "mfu_ghost_evictable_ddt", KSTAT_DATA_UINT64 },
728 827 { "l2_hits", KSTAT_DATA_UINT64 },
828 + { "l2_ddt_hits", KSTAT_DATA_UINT64 },
729 829 { "l2_misses", KSTAT_DATA_UINT64 },
730 830 { "l2_feeds", KSTAT_DATA_UINT64 },
731 831 { "l2_rw_clash", KSTAT_DATA_UINT64 },
732 832 { "l2_read_bytes", KSTAT_DATA_UINT64 },
833 + { "l2_ddt_read_bytes", KSTAT_DATA_UINT64 },
733 834 { "l2_write_bytes", KSTAT_DATA_UINT64 },
835 + { "l2_ddt_write_bytes", KSTAT_DATA_UINT64 },
734 836 { "l2_writes_sent", KSTAT_DATA_UINT64 },
735 837 { "l2_writes_done", KSTAT_DATA_UINT64 },
736 838 { "l2_writes_error", KSTAT_DATA_UINT64 },
737 839 { "l2_writes_lock_retry", KSTAT_DATA_UINT64 },
738 840 { "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
739 841 { "l2_evict_reading", KSTAT_DATA_UINT64 },
740 842 { "l2_evict_l1cached", KSTAT_DATA_UINT64 },
741 843 { "l2_free_on_write", KSTAT_DATA_UINT64 },
742 844 { "l2_abort_lowmem", KSTAT_DATA_UINT64 },
743 845 { "l2_cksum_bad", KSTAT_DATA_UINT64 },
744 846 { "l2_io_error", KSTAT_DATA_UINT64 },
745 847 { "l2_size", KSTAT_DATA_UINT64 },
746 848 { "l2_asize", KSTAT_DATA_UINT64 },
747 849 { "l2_hdr_size", KSTAT_DATA_UINT64 },
850 + { "l2_log_blk_writes", KSTAT_DATA_UINT64 },
851 + { "l2_log_blk_avg_size", KSTAT_DATA_UINT64 },
852 + { "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 },
853 + { "l2_rebuild_successes", KSTAT_DATA_UINT64 },
854 + { "l2_rebuild_unsupported", KSTAT_DATA_UINT64 },
855 + { "l2_rebuild_io_errors", KSTAT_DATA_UINT64 },
856 + { "l2_rebuild_cksum_errors", KSTAT_DATA_UINT64 },
857 + { "l2_rebuild_loop_errors", KSTAT_DATA_UINT64 },
858 + { "l2_rebuild_lowmem", KSTAT_DATA_UINT64 },
859 + { "l2_rebuild_size", KSTAT_DATA_UINT64 },
860 + { "l2_rebuild_bufs", KSTAT_DATA_UINT64 },
861 + { "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 },
862 + { "l2_rebuild_psize", KSTAT_DATA_UINT64 },
863 + { "l2_rebuild_log_blks", KSTAT_DATA_UINT64 },
748 864 { "memory_throttle_count", KSTAT_DATA_UINT64 },
749 865 { "arc_meta_used", KSTAT_DATA_UINT64 },
750 866 { "arc_meta_limit", KSTAT_DATA_UINT64 },
751 867 { "arc_meta_max", KSTAT_DATA_UINT64 },
752 868 { "arc_meta_min", KSTAT_DATA_UINT64 },
869 + { "arc_ddt_limit", KSTAT_DATA_UINT64 },
753 870 { "sync_wait_for_async", KSTAT_DATA_UINT64 },
754 871 { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
755 872 };
756 873
757 874 #define ARCSTAT(stat) (arc_stats.stat.value.ui64)
758 875
759 876 #define ARCSTAT_INCR(stat, val) \
760 877 atomic_add_64(&arc_stats.stat.value.ui64, (val))
761 878
762 879 #define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1)
763 880 #define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)
764 881
765 882 #define ARCSTAT_MAX(stat, val) { \
766 883 uint64_t m; \
767 884 while ((val) > (m = arc_stats.stat.value.ui64) && \
768 885 (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
769 886 continue; \
770 887 }
771 888
772 889 #define ARCSTAT_MAXSTAT(stat) \
773 890 ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
774 891
775 892 /*
776 893 * We define a macro to allow ARC hits/misses to be easily broken down by
777 894 * two separate conditions, giving a total of four different subtypes for
778 895 * each of hits and misses (so eight statistics total).
779 896 */
780 897 #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
781 898 if (cond1) { \
782 899 if (cond2) { \
783 - ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
900 + ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##stat2); \
784 901 } else { \
785 - ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
902 + ARCSTAT_BUMP(arcstat_##stat1##_##stat##_##notstat2); \
786 903 } \
787 904 } else { \
788 905 if (cond2) { \
789 - ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
906 + ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##stat2); \
790 907 } else { \
791 - ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
908 + ARCSTAT_BUMP(arcstat_##notstat1##_##stat##_##notstat2);\
792 909 } \
793 910 }
794 911
912 +/*
913 + * This macro allows us to use kstats as floating averages. Each time we
914 + * update this kstat, we first factor it and the update value by
915 + * ARCSTAT_F_AVG_FACTOR to shrink the new value's contribution to the overall
916 + * average. This macro assumes that integer loads and stores are atomic, but
917 + * is not safe for multiple writers updating the kstat in parallel (only the
918 + * last writer's update will remain).
919 + */
920 +#define ARCSTAT_F_AVG_FACTOR 3
921 +#define ARCSTAT_F_AVG(stat, value) \
922 + do { \
923 + uint64_t x = ARCSTAT(stat); \
924 + x = x - x / ARCSTAT_F_AVG_FACTOR + \
925 + (value) / ARCSTAT_F_AVG_FACTOR; \
926 + ARCSTAT(stat) = x; \
927 + _NOTE(CONSTCOND) \
928 + } while (0)
929 +
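As a quick worked example of the averaging behaviour (numbers are illustrative only): with ARCSTAT_F_AVG_FACTOR == 3, repeatedly feeding a constant sample of 900 into a kstat that starts at 0 gives

    x1 = 0   - 0/3   + 900/3 = 300
    x2 = 300 - 300/3 + 900/3 = 500
    x3 = 500 - 500/3 + 900/3 = 634
    ...

so the kstat behaves as an exponential moving average that weights each new sample by 1/3 and converges toward the sample value.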
795 930 kstat_t *arc_ksp;
796 931 static arc_state_t *arc_anon;
797 932 static arc_state_t *arc_mru;
798 933 static arc_state_t *arc_mru_ghost;
799 934 static arc_state_t *arc_mfu;
800 935 static arc_state_t *arc_mfu_ghost;
801 936 static arc_state_t *arc_l2c_only;
802 937
803 938 /*
804 939 * There are several ARC variables that are critical to export as kstats --
805 940 * but we don't want to have to grovel around in the kstat whenever we wish to
806 941 * manipulate them. For these variables, we therefore define them to be in
807 942 * terms of the statistic variable. This assures that we are not introducing
808 943 * the possibility of inconsistency by having shadow copies of the variables,
809 944 * while still allowing the code to be readable.
810 945 */
946 +#define arc_size ARCSTAT(arcstat_size) /* actual total arc size */
811 947 #define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
812 948 #define arc_c ARCSTAT(arcstat_c) /* target size of cache */
813 949 #define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
814 950 #define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */
815 951 #define arc_meta_limit ARCSTAT(arcstat_meta_limit) /* max size for metadata */
816 952 #define arc_meta_min ARCSTAT(arcstat_meta_min) /* min size for metadata */
953 +#define arc_meta_used ARCSTAT(arcstat_meta_used) /* size of metadata */
817 954 #define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */
955 +#define arc_ddt_size ARCSTAT(arcstat_ddt_size) /* ddt size in arc */
956 +#define arc_ddt_limit ARCSTAT(arcstat_ddt_limit) /* ddt in arc size limit */
818 957
958 +/*
959 + * Used in zio.c to optionally keep DDT cached in ARC
960 + */
961 +uint64_t const *arc_ddt_evict_threshold;
962 +
819 963 /* compressed size of entire arc */
820 964 #define arc_compressed_size ARCSTAT(arcstat_compressed_size)
821 965 /* uncompressed size of entire arc */
822 966 #define arc_uncompressed_size ARCSTAT(arcstat_uncompressed_size)
823 967 /* number of bytes in the arc from arc_buf_t's */
824 968 #define arc_overhead_size ARCSTAT(arcstat_overhead_size)
825 969
826 -/*
827 - * There are also some ARC variables that we want to export, but that are
828 - * updated so often that having the canonical representation be the statistic
829 - * variable causes a performance bottleneck. We want to use aggsum_t's for these
830 - * instead, but still be able to export the kstat in the same way as before.
831 - * The solution is to always use the aggsum version, except in the kstat update
832 - * callback.
833 - */
834 -aggsum_t arc_size;
835 -aggsum_t arc_meta_used;
836 -aggsum_t astat_data_size;
837 -aggsum_t astat_metadata_size;
838 -aggsum_t astat_hdr_size;
839 -aggsum_t astat_other_size;
840 -aggsum_t astat_l2_hdr_size;
841 970
842 971 static int arc_no_grow; /* Don't try to grow cache size */
843 972 static uint64_t arc_tempreserve;
844 973 static uint64_t arc_loaned_bytes;
845 974
846 975 typedef struct arc_callback arc_callback_t;
847 976
848 977 struct arc_callback {
849 978 void *acb_private;
850 979 arc_done_func_t *acb_done;
851 980 arc_buf_t *acb_buf;
852 981 boolean_t acb_compressed;
853 982 zio_t *acb_zio_dummy;
854 983 arc_callback_t *acb_next;
855 984 };
856 985
857 986 typedef struct arc_write_callback arc_write_callback_t;
858 987
859 988 struct arc_write_callback {
860 989 void *awcb_private;
861 990 arc_done_func_t *awcb_ready;
862 991 arc_done_func_t *awcb_children_ready;
863 992 arc_done_func_t *awcb_physdone;
864 993 arc_done_func_t *awcb_done;
865 994 arc_buf_t *awcb_buf;
866 995 };
867 996
868 997 /*
869 998 * ARC buffers are separated into multiple structs as a memory saving measure:
870 999 * - Common fields struct, always defined, and embedded within it:
871 1000 * - L2-only fields, always allocated but undefined when not in L2ARC
872 1001 * - L1-only fields, only allocated when in L1ARC
873 1002 *
874 1003 * Buffer in L1 Buffer only in L2
875 1004 * +------------------------+ +------------------------+
876 1005 * | arc_buf_hdr_t | | arc_buf_hdr_t |
877 1006 * | | | |
878 1007 * | | | |
879 1008 * | | | |
880 1009 * +------------------------+ +------------------------+
881 1010 * | l2arc_buf_hdr_t | | l2arc_buf_hdr_t |
882 1011 * | (undefined if L1-only) | | |
883 1012 * +------------------------+ +------------------------+
884 1013 * | l1arc_buf_hdr_t |
885 1014 * | |
886 1015 * | |
887 1016 * | |
888 1017 * | |
889 1018 * +------------------------+
890 1019 *
(40 lines elided)
891 1020 * Because it's possible for the L2ARC to become extremely large, we can wind
892 1021 * up eating a lot of memory in L2ARC buffer headers, so the size of a header
893 1022 * is minimized by only allocating the fields necessary for an L1-cached buffer
894 1023 * when a header is actually in the L1 cache. The sub-headers (l1arc_buf_hdr and
895 1024 * l2arc_buf_hdr) are embedded rather than allocated separately to save a couple
896 1025 * words in pointers. arc_hdr_realloc() is used to switch a header between
897 1026 * these two allocation states.
898 1027 */
899 1028 typedef struct l1arc_buf_hdr {
900 1029 kmutex_t b_freeze_lock;
901 - zio_cksum_t *b_freeze_cksum;
902 1030 #ifdef ZFS_DEBUG
903 1031 /*
904 1032 * Used for debugging with kmem_flags - by allocating and freeing
905 1033 * b_thawed when the buffer is thawed, we get a record of the stack
906 1034 * trace that thawed it.
907 1035 */
908 1036 void *b_thawed;
909 1037 #endif
910 1038
1039 + /* number of krrp tasks using this buffer */
1040 + uint64_t b_krrp;
1041 +
911 1042 arc_buf_t *b_buf;
912 1043 uint32_t b_bufcnt;
913 1044 /* for waiting on writes to complete */
914 1045 kcondvar_t b_cv;
915 1046 uint8_t b_byteswap;
916 1047
917 1048 /* protected by arc state mutex */
918 1049 arc_state_t *b_state;
919 1050 multilist_node_t b_arc_node;
920 1051
921 1052 /* updated atomically */
922 1053 clock_t b_arc_access;
923 1054
924 1055 /* self protecting */
925 1056 refcount_t b_refcnt;
926 1057
927 1058 arc_callback_t *b_acb;
928 1059 abd_t *b_pabd;
929 1060 } l1arc_buf_hdr_t;
930 1061
931 1062 typedef struct l2arc_dev l2arc_dev_t;
932 1063
933 1064 typedef struct l2arc_buf_hdr {
934 1065 /* protected by arc_buf_hdr mutex */
935 1066 l2arc_dev_t *b_dev; /* L2ARC device */
(15 lines elided)
936 1067 uint64_t b_daddr; /* disk address, offset byte */
937 1068
938 1069 list_node_t b_l2node;
939 1070 } l2arc_buf_hdr_t;
940 1071
941 1072 struct arc_buf_hdr {
942 1073 /* protected by hash lock */
943 1074 dva_t b_dva;
944 1075 uint64_t b_birth;
945 1076
1077 + /*
1078 + * Even though this checksum is only set/verified when a buffer is in
1079 + * the L1 cache, it needs to be in the set of common fields because it
1080 + * must be preserved from the time before a buffer is written out to
1081 + * L2ARC until after it is read back in.
1082 + */
1083 + zio_cksum_t *b_freeze_cksum;
1084 +
946 1085 arc_buf_contents_t b_type;
947 1086 arc_buf_hdr_t *b_hash_next;
948 1087 arc_flags_t b_flags;
949 1088
950 1089 /*
951 1090 * This field stores the size of the data buffer after
952 1091 * compression, and is set in the arc's zio completion handlers.
953 1092 * It is in units of SPA_MINBLOCKSIZE (e.g. 1 == 512 bytes).
954 1093 *
955 1094 * While the block pointers can store up to 32MB in their psize
956 1095 * field, we can only store up to 32MB minus 512B. This is due
957 1096 * to the bp using a bias of 1, whereas we use a bias of 0 (i.e.
958 1097 * a field of zeros represents 512B in the bp). We can't use a
959 1098 * bias of 1 since we need to reserve a psize of zero, here, to
960 1099 * represent holes and embedded blocks.
961 1100 *
962 1101 * This isn't a problem in practice, since the maximum size of a
963 1102 * buffer is limited to 16MB, so we never need to store 32MB in
964 1103 * this field. Even in the upstream illumos code base, the
965 1104 * maximum size of a buffer is limited to 16MB.
966 1105 */
967 1106 uint16_t b_psize;
968 1107
969 1108 /*
970 1109 * This field stores the size of the data buffer before
971 1110 * compression, and cannot change once set. It is in units
972 1111 * of SPA_MINBLOCKSIZE (e.g. 2 == 1024 bytes)
973 1112 */
974 1113 uint16_t b_lsize; /* immutable */
975 1114 uint64_t b_spa; /* immutable */
976 1115
977 1116 /* L2ARC fields. Undefined when not in L2ARC. */
978 1117 l2arc_buf_hdr_t b_l2hdr;
979 1118 /* L1ARC fields. Undefined when in l2arc_only state */
980 1119 l1arc_buf_hdr_t b_l1hdr;
981 1120 };
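A short worked example of the sector-unit encoding described for b_psize and b_lsize above (illustrative; the HDR_GET_PSIZE/HDR_SET_PSIZE accessors are defined elsewhere in this file):

    4 KB physical block  ->  b_psize = 4096 / SPA_MINBLOCKSIZE = 8
    b_psize = 0          ->  reserved for holes and embedded blocks
    largest encodable    ->  0xFFFF * 512 B = 32 MB - 512 B

which is where the "32MB minus 512B" limit mentioned in the comment comes from.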
982 1121
983 1122 #define GHOST_STATE(state) \
984 1123 ((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
985 1124 (state) == arc_l2c_only)
986 1125
987 1126 #define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
988 1127 #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
989 1128 #define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
990 1129 #define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH)
991 1130 #define HDR_COMPRESSION_ENABLED(hdr) \
992 1131 ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
(37 lines elided)
993 1132
994 1133 #define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE)
995 1134 #define HDR_L2_READING(hdr) \
996 1135 (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \
997 1136 ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
998 1137 #define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
999 1138 #define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
1000 1139 #define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
1001 1140 #define HDR_SHARED_DATA(hdr) ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
1002 1141
1142 +#define HDR_ISTYPE_DDT(hdr) \
1143 + ((hdr)->b_flags & ARC_FLAG_BUFC_DDT)
1003 1144 #define HDR_ISTYPE_METADATA(hdr) \
1004 1145 ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
1005 -#define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr))
1146 +#define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr) && \
1147 + !HDR_ISTYPE_DDT(hdr))
1006 1148
1007 1149 #define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
1008 1150 #define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
1009 1151
1010 1152 /* For storing compression mode in b_flags */
1011 1153 #define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1)
1012 1154
1013 1155 #define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \
1014 1156 HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
1015 1157 #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
1016 1158 HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));
1017 1159
1018 1160 #define ARC_BUF_LAST(buf) ((buf)->b_next == NULL)
1019 1161 #define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
1020 1162 #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
1021 1163
1022 1164 /*
(7 lines elided)
1023 1165 * Other sizes
1024 1166 */
1025 1167
1026 1168 #define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
1027 1169 #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
1028 1170
1029 1171 /*
1030 1172 * Hash table routines
1031 1173 */
1032 1174
1033 -#define HT_LOCK_PAD 64
1034 -
1035 -struct ht_lock {
1036 - kmutex_t ht_lock;
1037 -#ifdef _KERNEL
1038 - unsigned char pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
1039 -#endif
1175 +struct ht_table {
1176 + arc_buf_hdr_t *hdr;
1177 + kmutex_t lock;
1040 1178 };
1041 1179
1042 -#define BUF_LOCKS 256
1043 1180 typedef struct buf_hash_table {
1044 1181 uint64_t ht_mask;
1045 - arc_buf_hdr_t **ht_table;
1046 - struct ht_lock ht_locks[BUF_LOCKS];
1182 + struct ht_table *ht_table;
1047 1183 } buf_hash_table_t;
1048 1184
1185 +#pragma align 64(buf_hash_table)
1049 1186 static buf_hash_table_t buf_hash_table;
1050 1187
1051 1188 #define BUF_HASH_INDEX(spa, dva, birth) \
1052 1189 (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
1053 -#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
1054 -#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
1190 +#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_table[idx].lock)
1055 1191 #define HDR_LOCK(hdr) \
1056 1192 (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
1057 1193
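The change above replaces the fixed array of 256 cacheline-padded locks with a lock embedded in every hash bucket, so a lookup now serializes only on its own chain. A minimal sketch of the access pattern under the new scheme, given spa, dva and birth (the real code is buf_hash_find() below):

    uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
    kmutex_t *hash_lock = BUF_HASH_LOCK(idx);

    mutex_enter(hash_lock);
    for (arc_buf_hdr_t *hdr = buf_hash_table.ht_table[idx].hdr;
        hdr != NULL; hdr = hdr->b_hash_next) {
            /* inspect hdr while holding only this bucket's lock */
    }
    mutex_exit(hash_lock);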
1058 1194 uint64_t zfs_crc64_table[256];
1059 1195
1060 1196 /*
1061 1197 * Level 2 ARC
1062 1198 */
1063 1199
1064 1200 #define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
1065 1201 #define L2ARC_HEADROOM 2 /* num of writes */
1066 1202 /*
1067 1203 * If we discover during ARC scan any buffers to be compressed, we boost
1068 1204 * our headroom for the next scanning cycle by this percentage multiple.
1069 1205 */
1070 1206 #define L2ARC_HEADROOM_BOOST 200
1071 1207 #define L2ARC_FEED_SECS 1 /* caching interval secs */
1072 1208 #define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
1073 1209
1074 1210 #define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
1075 1211 #define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)
1076 1212
1077 1213 /* L2ARC Performance Tunables */
(13 lines elided)
1078 1214 uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* default max write size */
1079 1215 uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */
1080 1216 uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */
1081 1217 uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
1082 1218 uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
1083 1219 uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
1084 1220 boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
1085 1221 boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
1086 1222 boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
1087 1223
1088 -/*
1089 - * L2ARC Internals
1090 - */
1091 -struct l2arc_dev {
1092 - vdev_t *l2ad_vdev; /* vdev */
1093 - spa_t *l2ad_spa; /* spa */
1094 - uint64_t l2ad_hand; /* next write location */
1095 - uint64_t l2ad_start; /* first addr on device */
1096 - uint64_t l2ad_end; /* last addr on device */
1097 - boolean_t l2ad_first; /* first sweep through */
1098 - boolean_t l2ad_writing; /* currently writing */
1099 - kmutex_t l2ad_mtx; /* lock for buffer list */
1100 - list_t l2ad_buflist; /* buffer list */
1101 - list_node_t l2ad_node; /* device list node */
1102 - refcount_t l2ad_alloc; /* allocated bytes */
1103 -};
1104 -
1105 1224 static list_t L2ARC_dev_list; /* device list */
1106 1225 static list_t *l2arc_dev_list; /* device list pointer */
1107 1226 static kmutex_t l2arc_dev_mtx; /* device list mutex */
1108 1227 static l2arc_dev_t *l2arc_dev_last; /* last device used */
1228 +static l2arc_dev_t *l2arc_ddt_dev_last; /* last DDT device used */
1109 1229 static list_t L2ARC_free_on_write; /* free after write buf list */
1110 1230 static list_t *l2arc_free_on_write; /* free after write list ptr */
1111 1231 static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
1112 1232 static uint64_t l2arc_ndev; /* number of devices */
1113 1233
1114 1234 typedef struct l2arc_read_callback {
1115 1235 arc_buf_hdr_t *l2rcb_hdr; /* read header */
1116 1236 blkptr_t l2rcb_bp; /* original blkptr */
1117 1237 zbookmark_phys_t l2rcb_zb; /* original bookmark */
1118 1238 int l2rcb_flags; /* original flags */
1119 1239 abd_t *l2rcb_abd; /* temporary buffer */
1120 1240 } l2arc_read_callback_t;
1121 1241
1122 1242 typedef struct l2arc_write_callback {
1123 1243 l2arc_dev_t *l2wcb_dev; /* device info */
1124 1244 arc_buf_hdr_t *l2wcb_head; /* head of write buflist */
1245 + list_t l2wcb_log_blk_buflist; /* in-flight log blocks */
1125 1246 } l2arc_write_callback_t;
1126 1247
1127 1248 typedef struct l2arc_data_free {
1128 1249 /* protected by l2arc_free_on_write_mtx */
1129 1250 abd_t *l2df_abd;
1130 1251 size_t l2df_size;
1131 1252 arc_buf_contents_t l2df_type;
1132 1253 list_node_t l2df_list_node;
1133 1254 } l2arc_data_free_t;
1134 1255
1135 1256 static kmutex_t l2arc_feed_thr_lock;
1136 1257 static kcondvar_t l2arc_feed_thr_cv;
1137 1258 static uint8_t l2arc_thread_exit;
1138 1259
1139 1260 static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *);
(5 lines elided)
1140 1261 static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
1141 1262 static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *);
1142 1263 static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *);
1143 1264 static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
1144 1265 static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag);
1145 1266 static void arc_hdr_free_pabd(arc_buf_hdr_t *);
1146 1267 static void arc_hdr_alloc_pabd(arc_buf_hdr_t *);
1147 1268 static void arc_access(arc_buf_hdr_t *, kmutex_t *);
1148 1269 static boolean_t arc_is_overflowing();
1149 1270 static void arc_buf_watch(arc_buf_t *);
1271 +static l2arc_dev_t *l2arc_vdev_get(vdev_t *vd);
1150 1272
1151 1273 static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
1152 1274 static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
1275 +static arc_buf_contents_t arc_flags_to_bufc(uint32_t);
1153 1276 static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1154 1277 static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
1155 1278
1156 1279 static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
1157 1280 static void l2arc_read_done(zio_t *);
1158 1281
1282 +static void
1283 +arc_update_hit_stat(arc_buf_hdr_t *hdr, boolean_t hit)
1284 +{
1285 + boolean_t pf = !HDR_PREFETCH(hdr);
1286 + switch (arc_buf_type(hdr)) {
1287 + case ARC_BUFC_DATA:
1288 + ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, data);
1289 + break;
1290 + case ARC_BUFC_METADATA:
1291 + ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses,
1292 + metadata);
1293 + break;
1294 + case ARC_BUFC_DDT:
1295 + ARCSTAT_CONDSTAT(pf, demand, prefetch, hit, hits, misses, ddt);
1296 + break;
1297 + default:
1298 + break;
1299 + }
1300 +}
1159 1301
1302 +enum {
1303 + L2ARC_DEV_HDR_EVICT_FIRST = (1 << 0) /* mirror of l2ad_first */
1304 +};
1305 +
1160 1306 /*
1161 - * We use Cityhash for this. It's fast, and has good hash properties without
1162 - * requiring any large static buffers.
1307 + * Pointer used in persistent L2ARC (for pointing to log blocks & ARC buffers).
1163 1308 */
1164 -static uint64_t
1309 +typedef struct l2arc_log_blkptr {
1310 + uint64_t lbp_daddr; /* device address of log */
1311 + /*
1312 + * lbp_prop is the same format as the blk_prop in blkptr_t:
1313 + * * logical size (in sectors)
1314 + * * physical size (in sectors)
1315 + * * checksum algorithm (used for lbp_cksum)
1316 + * * object type & level (unused for now)
1317 + */
1318 + uint64_t lbp_prop;
1319 + zio_cksum_t lbp_cksum; /* fletcher4 of log */
1320 +} l2arc_log_blkptr_t;
1321 +
1322 +/*
1323 + * The persistent L2ARC device header.
1324 + * Byte order of magic determines whether 64-bit bswap of fields is necessary.
1325 + */
1326 +typedef struct l2arc_dev_hdr_phys {
1327 + uint64_t dh_magic; /* L2ARC_DEV_HDR_MAGIC_Vx */
1328 + zio_cksum_t dh_self_cksum; /* fletcher4 of fields below */
1329 +
1330 + /*
1331 + * Global L2ARC device state and metadata.
1332 + */
1333 + uint64_t dh_spa_guid;
1334 + uint64_t dh_alloc_space; /* vdev space alloc status */
1335 + uint64_t dh_flags; /* l2arc_dev_hdr_flags_t */
1336 +
1337 + /*
1338 + * Start of log block chain. [0] -> newest log, [1] -> one older (used
1339 + * for initiating prefetch).
1340 + */
1341 + l2arc_log_blkptr_t dh_start_lbps[2];
1342 +
1343 + const uint64_t dh_pad[44]; /* pad to 512 bytes */
1344 +} l2arc_dev_hdr_phys_t;
1345 +CTASSERT(sizeof (l2arc_dev_hdr_phys_t) == SPA_MINBLOCKSIZE);
1346 +
1347 +/*
1348 + * A single ARC buffer header entry in a l2arc_log_blk_phys_t.
1349 + */
1350 +typedef struct l2arc_log_ent_phys {
1351 + dva_t le_dva; /* dva of buffer */
1352 + uint64_t le_birth; /* birth txg of buffer */
1353 + zio_cksum_t le_freeze_cksum;
1354 + /*
1355 + * le_prop is the same format as the blk_prop in blkptr_t:
1356 + * * logical size (in sectors)
1357 + * * physical size (in sectors)
1358 + * * checksum algorithm (used for b_freeze_cksum)
1359 + * * object type & level (used to restore arc_buf_contents_t)
1360 + */
1361 + uint64_t le_prop;
1362 + uint64_t le_daddr; /* buf location on l2dev */
1363 + const uint64_t le_pad[7]; /* resv'd for future use */
1364 +} l2arc_log_ent_phys_t;
1365 +
1366 +/*
1367 + * These design limits give us the following metadata overhead (before
1368 + * compression):
1369 + * avg_blk_sz overhead
1370 + * 1k 12.51 %
1371 + * 2k 6.26 %
1372 + * 4k 3.13 %
1373 + * 8k 1.56 %
1374 + * 16k 0.78 %
1375 + * 32k 0.39 %
1376 + * 64k 0.20 %
1377 + * 128k 0.10 %
1378 + * Compression should be able to squeeze these down by about a factor of two.
1379 + */
1380 +#define L2ARC_LOG_BLK_SIZE (128 * 1024) /* 128k */
1381 +#define L2ARC_LOG_BLK_HEADER_LEN (128)
1382 +#define L2ARC_LOG_BLK_ENTRIES /* 1023 entries */ \
1383 + ((L2ARC_LOG_BLK_SIZE - L2ARC_LOG_BLK_HEADER_LEN) / \
1384 + sizeof (l2arc_log_ent_phys_t))
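A quick sanity check of the 1023-entry figure (assuming the natural LP64 layout of l2arc_log_ent_phys_t above, i.e. no implicit padding):

    sizeof (l2arc_log_ent_phys_t) = 16 (dva) + 8 (birth) + 32 (cksum)
                                  + 8 (prop) + 8 (daddr) + 56 (pad) = 128 bytes
    L2ARC_LOG_BLK_ENTRIES = (131072 - 128) / 128 = 1023

so the per-buffer metadata cost is roughly 128 bytes, which is where the overhead table above comes from (e.g. 128 / 1024 ≈ 12.5 % for 1 KB buffers).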
1385 +/*
1386 + * Maximum amount of data in an l2arc log block (used to terminate rebuilding
1387 + * before we hit the write head and restore potentially corrupted blocks).
1388 + */
1389 +#define L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE \
1390 + (SPA_MAXBLOCKSIZE * L2ARC_LOG_BLK_ENTRIES)
1391 +/*
1392 + * For the persistency and rebuild algorithms to operate reliably we need
1393 + * the L2ARC device to at least be able to hold 3 full log blocks (otherwise
1394 + * excessive log block looping might confuse the log chain end detection).
1396 + * Under normal circumstances this is not a problem, since this amounts to
1397 + * only about 400 MB.
1397 + */
1398 +#define L2ARC_PERSIST_MIN_SIZE (3 * L2ARC_LOG_BLK_MAX_PAYLOAD_SIZE)
1399 +
1400 +/*
1401 + * A log block of up to 1023 ARC buffer log entries, chained into the
1402 + * persistent L2ARC metadata linked list. Byte order of magic determines
1403 + * whether 64-bit bswap of fields is necessary.
1404 + */
1405 +typedef struct l2arc_log_blk_phys {
1406 + /* Header - see L2ARC_LOG_BLK_HEADER_LEN above */
1407 + uint64_t lb_magic; /* L2ARC_LOG_BLK_MAGIC */
1408 + l2arc_log_blkptr_t lb_back2_lbp; /* back 2 steps in chain */
1409 + uint64_t lb_pad[9]; /* resv'd for future use */
1410 + /* Payload */
1411 + l2arc_log_ent_phys_t lb_entries[L2ARC_LOG_BLK_ENTRIES];
1412 +} l2arc_log_blk_phys_t;
1413 +
1414 +CTASSERT(sizeof (l2arc_log_blk_phys_t) == L2ARC_LOG_BLK_SIZE);
1415 +CTASSERT(offsetof(l2arc_log_blk_phys_t, lb_entries) -
1416 + offsetof(l2arc_log_blk_phys_t, lb_magic) == L2ARC_LOG_BLK_HEADER_LEN);
1417 +
1418 +/*
1419 + * These structures hold in-flight l2arc_log_blk_phys_t's as they're being
1420 + * written to the L2ARC device. They may be compressed, hence the uint8_t[].
1421 + */
1422 +typedef struct l2arc_log_blk_buf {
1423 + uint8_t lbb_log_blk[sizeof (l2arc_log_blk_phys_t)];
1424 + list_node_t lbb_node;
1425 +} l2arc_log_blk_buf_t;
1426 +
1427 +/* Macros for manipulating fields in the blk_prop format of blkptr_t */
1428 +#define BLKPROP_GET_LSIZE(_obj, _field) \
1429 + BF64_GET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1)
1430 +#define BLKPROP_SET_LSIZE(_obj, _field, x) \
1431 + BF64_SET_SB((_obj)->_field, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
1432 +#define BLKPROP_GET_PSIZE(_obj, _field) \
1433 + BF64_GET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0)
1434 +#define BLKPROP_SET_PSIZE(_obj, _field, x) \
1435 + BF64_SET_SB((_obj)->_field, 16, 16, SPA_MINBLOCKSHIFT, 0, x)
1436 +#define BLKPROP_GET_COMPRESS(_obj, _field) \
1437 + BF64_GET((_obj)->_field, 32, 7)
1438 +#define BLKPROP_SET_COMPRESS(_obj, _field, x) \
1439 + BF64_SET((_obj)->_field, 32, 7, x)
1440 +#define BLKPROP_GET_ARC_COMPRESS(_obj, _field) \
1441 + BF64_GET((_obj)->_field, 39, 1)
1442 +#define BLKPROP_SET_ARC_COMPRESS(_obj, _field, x) \
1443 + BF64_SET((_obj)->_field, 39, 1, x)
1444 +#define BLKPROP_GET_CHECKSUM(_obj, _field) \
1445 + BF64_GET((_obj)->_field, 40, 8)
1446 +#define BLKPROP_SET_CHECKSUM(_obj, _field, x) \
1447 + BF64_SET((_obj)->_field, 40, 8, x)
1448 +#define BLKPROP_GET_TYPE(_obj, _field) \
1449 + BF64_GET((_obj)->_field, 48, 8)
1450 +#define BLKPROP_SET_TYPE(_obj, _field, x) \
1451 + BF64_SET((_obj)->_field, 48, 8, x)
1452 +
1453 +/* Macros for manipulating a l2arc_log_blkptr_t->lbp_prop field */
1454 +#define LBP_GET_LSIZE(_add) BLKPROP_GET_LSIZE(_add, lbp_prop)
1455 +#define LBP_SET_LSIZE(_add, x) BLKPROP_SET_LSIZE(_add, lbp_prop, x)
1456 +#define LBP_GET_PSIZE(_add) BLKPROP_GET_PSIZE(_add, lbp_prop)
1457 +#define LBP_SET_PSIZE(_add, x) BLKPROP_SET_PSIZE(_add, lbp_prop, x)
1458 +#define LBP_GET_COMPRESS(_add) BLKPROP_GET_COMPRESS(_add, lbp_prop)
1459 +#define LBP_SET_COMPRESS(_add, x) BLKPROP_SET_COMPRESS(_add, lbp_prop, x)
1460 +#define LBP_GET_CHECKSUM(_add) BLKPROP_GET_CHECKSUM(_add, lbp_prop)
1461 +#define LBP_SET_CHECKSUM(_add, x) BLKPROP_SET_CHECKSUM(_add, lbp_prop, x)
1462 +#define LBP_GET_TYPE(_add) BLKPROP_GET_TYPE(_add, lbp_prop)
1463 +#define LBP_SET_TYPE(_add, x) BLKPROP_SET_TYPE(_add, lbp_prop, x)
1464 +
1465 +/* Macros for manipulating a l2arc_log_ent_phys_t->le_prop field */
1466 +#define LE_GET_LSIZE(_le) BLKPROP_GET_LSIZE(_le, le_prop)
1467 +#define LE_SET_LSIZE(_le, x) BLKPROP_SET_LSIZE(_le, le_prop, x)
1468 +#define LE_GET_PSIZE(_le) BLKPROP_GET_PSIZE(_le, le_prop)
1469 +#define LE_SET_PSIZE(_le, x) BLKPROP_SET_PSIZE(_le, le_prop, x)
1470 +#define LE_GET_COMPRESS(_le) BLKPROP_GET_COMPRESS(_le, le_prop)
1471 +#define LE_SET_COMPRESS(_le, x) BLKPROP_SET_COMPRESS(_le, le_prop, x)
1472 +#define LE_GET_ARC_COMPRESS(_le) BLKPROP_GET_ARC_COMPRESS(_le, le_prop)
1473 +#define LE_SET_ARC_COMPRESS(_le, x) BLKPROP_SET_ARC_COMPRESS(_le, le_prop, x)
1474 +#define LE_GET_CHECKSUM(_le) BLKPROP_GET_CHECKSUM(_le, le_prop)
1475 +#define LE_SET_CHECKSUM(_le, x) BLKPROP_SET_CHECKSUM(_le, le_prop, x)
1476 +#define LE_GET_TYPE(_le) BLKPROP_GET_TYPE(_le, le_prop)
1477 +#define LE_SET_TYPE(_le, x) BLKPROP_SET_TYPE(_le, le_prop, x)
1478 +
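A minimal sketch of how these accessors are intended to be used when filling in a log block pointer (the values and the dev_write_hand name are made up for illustration; ZIO_COMPRESS_LZ4 and ZIO_CHECKSUM_FLETCHER_4 are the standard zio enums):

    l2arc_log_blkptr_t lbp = { 0 };

    lbp.lbp_daddr = dev_write_hand;             /* hypothetical device offset */
    LBP_SET_LSIZE(&lbp, sizeof (l2arc_log_blk_phys_t));
    LBP_SET_PSIZE(&lbp, 64 * 1024);             /* size after compression */
    LBP_SET_COMPRESS(&lbp, ZIO_COMPRESS_LZ4);
    LBP_SET_CHECKSUM(&lbp, ZIO_CHECKSUM_FLETCHER_4);

    ASSERT3U(LBP_GET_LSIZE(&lbp), ==, sizeof (l2arc_log_blk_phys_t));
    ASSERT3U(LBP_GET_PSIZE(&lbp), ==, 64 * 1024);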
1479 +#define PTR_SWAP(x, y) \
1480 + do { \
1481 + void *tmp = (x);\
1482 + x = y; \
1483 + y = tmp; \
1484 + _NOTE(CONSTCOND)\
1485 + } while (0)
1486 +
1487 +/*
1488 + * Sadly, after compressed ARC integration older kernels would panic
1489 + * when trying to rebuild persistent L2ARC created by the new code.
1490 + */
1491 +#define L2ARC_DEV_HDR_MAGIC_V1 0x4c32415243763031LLU /* ASCII: "L2ARCv01" */
1492 +#define L2ARC_LOG_BLK_MAGIC 0x4c4f47424c4b4844LLU /* ASCII: "LOGBLKHD" */
1493 +
1494 +/*
1495 + * Performance tuning of L2ARC persistency:
1496 + *
1497 + * l2arc_rebuild_enabled : Controls whether L2ARC device adds (either at
1498 + * pool import or when adding one manually later) will attempt
1499 + * to rebuild L2ARC buffer contents. In special circumstances,
1500 + * the administrator may want to set this to B_FALSE, if they
1501 + * are having trouble importing a pool or attaching an L2ARC
1502 + * device (e.g. the L2ARC device is slow to read in stored log
1503 + * metadata, or the metadata has become somehow
1504 + * fragmented/unusable).
1505 + */
1506 +boolean_t l2arc_rebuild_enabled = B_TRUE;
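On illumos-derived systems a global like this can usually be overridden from /etc/system before the zfs module loads; a hedged example, assuming the variable keeps this name in the zfs module:

    set zfs:l2arc_rebuild_enabled = 0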
1507 +
1508 +/* L2ARC persistency rebuild control routines. */
1509 +static void l2arc_dev_rebuild_start(l2arc_dev_t *dev);
1510 +static int l2arc_rebuild(l2arc_dev_t *dev);
1511 +
1512 +/* L2ARC persistency read I/O routines. */
1513 +static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
1514 +static int l2arc_log_blk_read(l2arc_dev_t *dev,
1515 + const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
1516 + l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
1517 + uint8_t *this_lb_buf, uint8_t *next_lb_buf,
1518 + zio_t *this_io, zio_t **next_io);
1519 +static zio_t *l2arc_log_blk_prefetch(vdev_t *vd,
1520 + const l2arc_log_blkptr_t *lp, uint8_t *lb_buf);
1521 +static void l2arc_log_blk_prefetch_abort(zio_t *zio);
1522 +
1523 +/* L2ARC persistency block restoration routines. */
1524 +static void l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
1525 + const l2arc_log_blk_phys_t *lb, uint64_t lb_psize);
1526 +static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
1527 + l2arc_dev_t *dev, uint64_t guid);
1528 +
1529 +/* L2ARC persistency write I/O routines. */
1530 +static void l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio);
1531 +static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
1532 + l2arc_write_callback_t *cb);
1533 +
1534 +/* L2ARC persistency auxiliary routines. */
1535 +static boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
1536 + const l2arc_log_blkptr_t *lp);
1537 +static void l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr,
1538 + zio_cksum_t *cksum);
1539 +static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
1540 + const arc_buf_hdr_t *ab);
1541 +static inline boolean_t l2arc_range_check_overlap(uint64_t bottom,
1542 + uint64_t top, uint64_t check);
1543 +
1544 +/*
1545 + * L2ARC Internals
1546 + */
1547 +struct l2arc_dev {
1548 + vdev_t *l2ad_vdev; /* vdev */
1549 + spa_t *l2ad_spa; /* spa */
1550 + uint64_t l2ad_hand; /* next write location */
1551 + uint64_t l2ad_start; /* first addr on device */
1552 + uint64_t l2ad_end; /* last addr on device */
1553 + boolean_t l2ad_first; /* first sweep through */
1554 + boolean_t l2ad_writing; /* currently writing */
1555 + kmutex_t l2ad_mtx; /* lock for buffer list */
1556 + list_t l2ad_buflist; /* buffer list */
1557 + list_node_t l2ad_node; /* device list node */
1558 + refcount_t l2ad_alloc; /* allocated bytes */
1559 + l2arc_dev_hdr_phys_t *l2ad_dev_hdr; /* persistent device header */
1560 + uint64_t l2ad_dev_hdr_asize; /* aligned hdr size */
1561 + l2arc_log_blk_phys_t l2ad_log_blk; /* currently open log block */
1562 + int l2ad_log_ent_idx; /* index into cur log blk */
1563 + /* number of bytes in current log block's payload */
1564 + uint64_t l2ad_log_blk_payload_asize;
1565 + /* flag indicating whether a rebuild is scheduled or is going on */
1566 + boolean_t l2ad_rebuild;
1567 + boolean_t l2ad_rebuild_cancel;
1568 + kt_did_t l2ad_rebuild_did;
1569 +};
1570 +
1571 +static inline uint64_t
1165 1572 buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
1166 1573 {
1167 - return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
1574 + uint8_t *vdva = (uint8_t *)dva;
1575 + uint64_t crc = -1ULL;
1576 + int i;
1577 +
1578 + ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
1579 +
1580 + for (i = 0; i < sizeof (dva_t); i++)
1581 + crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
1582 +
1583 + crc ^= (spa>>8) ^ birth;
1584 +
1585 + return (crc);
1168 1586 }
1169 1587
1170 1588 #define HDR_EMPTY(hdr) \
1171 1589 ((hdr)->b_dva.dva_word[0] == 0 && \
1172 1590 (hdr)->b_dva.dva_word[1] == 0)
1173 1591
1174 1592 #define HDR_EQUAL(spa, dva, birth, hdr) \
1175 1593 ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
1176 1594 ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
1177 1595 ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
1178 1596
1179 1597 static void
1180 1598 buf_discard_identity(arc_buf_hdr_t *hdr)
1181 1599 {
1182 1600 hdr->b_dva.dva_word[0] = 0;
1183 1601 hdr->b_dva.dva_word[1] = 0;
1184 1602 hdr->b_birth = 0;
1185 1603 }
1186 1604
(9 lines elided)
1187 1605 static arc_buf_hdr_t *
1188 1606 buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
1189 1607 {
1190 1608 const dva_t *dva = BP_IDENTITY(bp);
1191 1609 uint64_t birth = BP_PHYSICAL_BIRTH(bp);
1192 1610 uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
1193 1611 kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1194 1612 arc_buf_hdr_t *hdr;
1195 1613
1196 1614 mutex_enter(hash_lock);
1197 - for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
1615 + for (hdr = buf_hash_table.ht_table[idx].hdr; hdr != NULL;
1198 1616 hdr = hdr->b_hash_next) {
1199 1617 if (HDR_EQUAL(spa, dva, birth, hdr)) {
1200 1618 *lockp = hash_lock;
1201 1619 return (hdr);
1202 1620 }
1203 1621 }
1204 1622 mutex_exit(hash_lock);
1205 1623 *lockp = NULL;
1206 1624 return (NULL);
1207 1625 }
1208 1626
1209 1627 /*
1210 1628 * Insert an entry into the hash table. If there is already an element
1211 1629 * equal to elem in the hash table, then the already existing element
1212 1630 * will be returned and the new element will not be inserted.
1213 1631 * Otherwise returns NULL.
1214 1632 * If lockp == NULL, the caller is assumed to already hold the hash lock.
1215 1633 */
1216 1634 static arc_buf_hdr_t *
1217 1635 buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
1218 1636 {
1219 1637 uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1220 1638 kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
1221 1639 arc_buf_hdr_t *fhdr;
1222 1640 uint32_t i;
1223 1641
1224 1642 ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
(17 lines elided)
1225 1643 ASSERT(hdr->b_birth != 0);
1226 1644 ASSERT(!HDR_IN_HASH_TABLE(hdr));
1227 1645
1228 1646 if (lockp != NULL) {
1229 1647 *lockp = hash_lock;
1230 1648 mutex_enter(hash_lock);
1231 1649 } else {
1232 1650 ASSERT(MUTEX_HELD(hash_lock));
1233 1651 }
1234 1652
1235 - for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
1653 + for (fhdr = buf_hash_table.ht_table[idx].hdr, i = 0; fhdr != NULL;
1236 1654 fhdr = fhdr->b_hash_next, i++) {
1237 1655 if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
1238 1656 return (fhdr);
1239 1657 }
1240 1658
1241 - hdr->b_hash_next = buf_hash_table.ht_table[idx];
1242 - buf_hash_table.ht_table[idx] = hdr;
1659 + hdr->b_hash_next = buf_hash_table.ht_table[idx].hdr;
1660 + buf_hash_table.ht_table[idx].hdr = hdr;
1243 1661 arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1244 1662
1245 1663 /* collect some hash table performance data */
1246 1664 if (i > 0) {
1247 1665 ARCSTAT_BUMP(arcstat_hash_collisions);
1248 1666 if (i == 1)
1249 1667 ARCSTAT_BUMP(arcstat_hash_chains);
1250 1668
1251 1669 ARCSTAT_MAX(arcstat_hash_chain_max, i);
1252 1670 }
1253 1671
1254 1672 ARCSTAT_BUMP(arcstat_hash_elements);
1255 1673 ARCSTAT_MAXSTAT(arcstat_hash_elements);
1256 1674
1257 1675 return (NULL);
1258 1676 }
(6 lines elided)
1259 1677
1260 1678 static void
1261 1679 buf_hash_remove(arc_buf_hdr_t *hdr)
1262 1680 {
1263 1681 arc_buf_hdr_t *fhdr, **hdrp;
1264 1682 uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
1265 1683
1266 1684 ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
1267 1685 ASSERT(HDR_IN_HASH_TABLE(hdr));
1268 1686
1269 - hdrp = &buf_hash_table.ht_table[idx];
1687 + hdrp = &buf_hash_table.ht_table[idx].hdr;
1270 1688 while ((fhdr = *hdrp) != hdr) {
1271 1689 ASSERT3P(fhdr, !=, NULL);
1272 1690 hdrp = &fhdr->b_hash_next;
1273 1691 }
1274 1692 *hdrp = hdr->b_hash_next;
1275 1693 hdr->b_hash_next = NULL;
1276 1694 arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
1277 1695
1278 1696 /* collect some hash table performance data */
1279 1697 ARCSTAT_BUMPDOWN(arcstat_hash_elements);
1280 1698
1281 - if (buf_hash_table.ht_table[idx] &&
1282 - buf_hash_table.ht_table[idx]->b_hash_next == NULL)
1699 + if (buf_hash_table.ht_table[idx].hdr &&
1700 + buf_hash_table.ht_table[idx].hdr->b_hash_next == NULL)
1283 1701 ARCSTAT_BUMPDOWN(arcstat_hash_chains);
1284 1702 }
1285 1703
1286 1704 /*
1287 1705 * Global data structures and functions for the buf kmem cache.
1288 1706 */
1289 1707 static kmem_cache_t *hdr_full_cache;
1290 1708 static kmem_cache_t *hdr_l2only_cache;
1291 1709 static kmem_cache_t *buf_cache;
1292 1710
1293 1711 static void
1294 1712 buf_fini(void)
1295 1713 {
1296 1714 int i;
1297 1715
1716 + for (i = 0; i < buf_hash_table.ht_mask + 1; i++)
1717 + mutex_destroy(&buf_hash_table.ht_table[i].lock);
1298 1718 kmem_free(buf_hash_table.ht_table,
1299 - (buf_hash_table.ht_mask + 1) * sizeof (void *));
1300 - for (i = 0; i < BUF_LOCKS; i++)
1301 - mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
1719 + (buf_hash_table.ht_mask + 1) * sizeof (struct ht_table));
1302 1720 kmem_cache_destroy(hdr_full_cache);
1303 1721 kmem_cache_destroy(hdr_l2only_cache);
1304 1722 kmem_cache_destroy(buf_cache);
1305 1723 }
1306 1724
1307 1725 /*
1308 1726 * Constructor callback - called when the cache is empty
1309 1727 * and a new buf is requested.
1310 1728 */
1311 1729 /* ARGSUSED */
1312 1730 static int
1313 1731 hdr_full_cons(void *vbuf, void *unused, int kmflag)
1314 1732 {
1315 1733 arc_buf_hdr_t *hdr = vbuf;
1316 1734
1317 1735 bzero(hdr, HDR_FULL_SIZE);
1318 1736 cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
1319 1737 refcount_create(&hdr->b_l1hdr.b_refcnt);
1320 1738 mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
1321 1739 multilist_link_init(&hdr->b_l1hdr.b_arc_node);
1322 1740 arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);
1323 1741
1324 1742 return (0);
1325 1743 }
1326 1744
1327 1745 /* ARGSUSED */
1328 1746 static int
1329 1747 hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
1330 1748 {
1331 1749 arc_buf_hdr_t *hdr = vbuf;
1332 1750
1333 1751 bzero(hdr, HDR_L2ONLY_SIZE);
1334 1752 arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
1335 1753
1336 1754 return (0);
1337 1755 }
1338 1756
1339 1757 /* ARGSUSED */
1340 1758 static int
1341 1759 buf_cons(void *vbuf, void *unused, int kmflag)
1342 1760 {
1343 1761 arc_buf_t *buf = vbuf;
1344 1762
1345 1763 bzero(buf, sizeof (arc_buf_t));
1346 1764 mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
1347 1765 arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
1348 1766
1349 1767 return (0);
1350 1768 }
1351 1769
1352 1770 /*
1353 1771 * Destructor callback - called when a cached buf is
1354 1772 * no longer required.
1355 1773 */
1356 1774 /* ARGSUSED */
1357 1775 static void
1358 1776 hdr_full_dest(void *vbuf, void *unused)
1359 1777 {
1360 1778 arc_buf_hdr_t *hdr = vbuf;
1361 1779
1362 1780 ASSERT(HDR_EMPTY(hdr));
1363 1781 cv_destroy(&hdr->b_l1hdr.b_cv);
1364 1782 refcount_destroy(&hdr->b_l1hdr.b_refcnt);
1365 1783 mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
1366 1784 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
1367 1785 arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
1368 1786 }
1369 1787
1370 1788 /* ARGSUSED */
1371 1789 static void
1372 1790 hdr_l2only_dest(void *vbuf, void *unused)
1373 1791 {
1374 1792 arc_buf_hdr_t *hdr = vbuf;
1375 1793
1376 1794 ASSERT(HDR_EMPTY(hdr));
1377 1795 arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
1378 1796 }
1379 1797
1380 1798 /* ARGSUSED */
1381 1799 static void
1382 1800 buf_dest(void *vbuf, void *unused)
1383 1801 {
1384 1802 arc_buf_t *buf = vbuf;
1385 1803
1386 1804 mutex_destroy(&buf->b_evict_lock);
1387 1805 arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
1388 1806 }
1389 1807
1390 1808 /*
1391 1809 * Reclaim callback -- invoked when memory is low.
1392 1810 */
1393 1811 /* ARGSUSED */
1394 1812 static void
1395 1813 hdr_recl(void *unused)
1396 1814 {
1397 1815 dprintf("hdr_recl called\n");
1398 1816 /*
1399 1817 * umem calls the reclaim func when we destroy the buf cache,
1400 1818 * which is after we do arc_fini().
1401 1819 */
1402 1820 if (!arc_dead)
1403 1821 cv_signal(&arc_reclaim_thread_cv);
1404 1822 }
1405 1823
1406 1824 static void
1407 1825 buf_init(void)
1408 1826 {
1409 1827 uint64_t *ct;
1410 1828 uint64_t hsize = 1ULL << 12;
1411 1829 int i, j;
1412 1830
1413 1831 /*
(102 lines elided)
1414 1832 * The hash table is big enough to fill all of physical memory
1415 1833 * with an average block size of zfs_arc_average_blocksize (default 8K).
1416 1834 * By default, the table will take up
1417 1835 * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
1418 1836 */
1419 1837 while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
1420 1838 hsize <<= 1;
1421 1839 retry:
1422 1840 buf_hash_table.ht_mask = hsize - 1;
1423 1841 buf_hash_table.ht_table =
1424 - kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
1842 + kmem_zalloc(hsize * sizeof (struct ht_table), KM_NOSLEEP);
1425 1843 if (buf_hash_table.ht_table == NULL) {
1426 1844 ASSERT(hsize > (1ULL << 8));
1427 1845 hsize >>= 1;
1428 1846 goto retry;
1429 1847 }
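Note that each slot of the new table is a struct ht_table (bucket head pointer plus its kmutex_t) rather than a bare pointer, so the table is roughly twice the size implied by the "totalmem * sizeof(void*) / 8K" comment above. A rough sketch of the arithmetic, assuming an 8-byte kernel kmutex_t (16 bytes per bucket on LP64):

    64 GB of RAM / 8 KB average block  ->  hsize = 2^23 (8M) buckets
    2^23 buckets * 16 bytes            ->  128 MB of hash table
    (versus 64 MB when each bucket held only a pointer)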
1430 1848
1431 1849 hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
1432 1850 0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
1433 1851 hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
1434 1852 HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
1435 1853 NULL, NULL, 0);
1436 1854 buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
1437 1855 0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
1438 1856
1439 1857 for (i = 0; i < 256; i++)
1440 1858 for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
1441 1859 *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
1442 1860
1443 - for (i = 0; i < BUF_LOCKS; i++) {
1444 - mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
1861 + for (i = 0; i < hsize; i++) {
1862 + mutex_init(&buf_hash_table.ht_table[i].lock,
1445 1863 NULL, MUTEX_DEFAULT, NULL);
1446 1864 }
1447 1865 }
1448 1866
1867 +/* wait until krrp releases the buffer */
1868 +static inline void
1869 +arc_wait_for_krrp(arc_buf_hdr_t *hdr)
1870 +{
1871 + while (HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_krrp != 0)
1872 + cv_wait(&hdr->b_l1hdr.b_cv, HDR_LOCK(hdr));
1873 +}
1874 +
1449 1875 /*
1450 1876 * This is the size that the buf occupies in memory. If the buf is compressed,
1451 1877 * it will correspond to the compressed size. You should use this method of
1452 1878 * getting the buf size unless you explicitly need the logical size.
1453 1879 */
1454 1880 int32_t
1455 1881 arc_buf_size(arc_buf_t *buf)
1456 1882 {
1457 1883 return (ARC_BUF_COMPRESSED(buf) ?
1458 1884 HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
1459 1885 }
1460 1886
1461 1887 int32_t
1462 1888 arc_buf_lsize(arc_buf_t *buf)
1463 1889 {
1464 1890 return (HDR_GET_LSIZE(buf->b_hdr));
1465 1891 }
1466 1892
1467 1893 enum zio_compress
1468 1894 arc_get_compression(arc_buf_t *buf)
1469 1895 {
1470 1896 return (ARC_BUF_COMPRESSED(buf) ?
1471 1897 HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
1472 1898 }
1473 1899
1474 1900 #define ARC_MINTIME (hz>>4) /* 62 ms */
1475 1901
1476 1902 static inline boolean_t
1477 1903 arc_buf_is_shared(arc_buf_t *buf)
1478 1904 {
1479 1905 boolean_t shared = (buf->b_data != NULL &&
1480 1906 buf->b_hdr->b_l1hdr.b_pabd != NULL &&
1481 1907 abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&
1482 1908 buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));
1483 1909 IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
1484 1910 IMPLY(shared, ARC_BUF_SHARED(buf));
1485 1911 IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
1486 1912
1487 1913 /*
1488 1914 * It would be nice to assert arc_can_share() too, but the "hdr isn't
1489 1915 * already being shared" requirement prevents us from doing that.
1490 1916 */
1491 1917
1492 1918 return (shared);
1493 1919 }
(35 lines elided)
1494 1920
1495 1921 /*
1496 1922 * Free the checksum associated with this header. If there is no checksum, this
1497 1923 * is a no-op.
1498 1924 */
1499 1925 static inline void
1500 1926 arc_cksum_free(arc_buf_hdr_t *hdr)
1501 1927 {
1502 1928 ASSERT(HDR_HAS_L1HDR(hdr));
1503 1929 mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1504 - if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
1505 - kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
1506 - hdr->b_l1hdr.b_freeze_cksum = NULL;
1930 + if (hdr->b_freeze_cksum != NULL) {
1931 + kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
1932 + hdr->b_freeze_cksum = NULL;
1507 1933 }
1508 1934 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1509 1935 }
1510 1936
1511 1937 /*
1512 1938 * Return true iff at least one of the bufs on hdr is not compressed.
1513 1939 */
1514 1940 static boolean_t
1515 1941 arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
1516 1942 {
1517 1943 for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
1518 1944 if (!ARC_BUF_COMPRESSED(b)) {
1519 1945 return (B_TRUE);
1520 1946 }
1521 1947 }
1522 1948 return (B_FALSE);
1523 1949 }
1524 1950
1525 1951 /*
1526 1952 * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
1527 1953 * matches the checksum that is stored in the hdr. If there is no checksum,
1528 1954 * or if the buf is compressed, this is a no-op.
1529 1955 */
(13 lines elided)
1530 1956 static void
1531 1957 arc_cksum_verify(arc_buf_t *buf)
1532 1958 {
1533 1959 arc_buf_hdr_t *hdr = buf->b_hdr;
1534 1960 zio_cksum_t zc;
1535 1961
1536 1962 if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1537 1963 return;
1538 1964
1539 1965 if (ARC_BUF_COMPRESSED(buf)) {
1540 - ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
1966 + ASSERT(hdr->b_freeze_cksum == NULL ||
1541 1967 arc_hdr_has_uncompressed_buf(hdr));
1542 1968 return;
1543 1969 }
1544 1970
1545 1971 ASSERT(HDR_HAS_L1HDR(hdr));
1546 1972
1547 1973 mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1548 - if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1974 + if (hdr->b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
1549 1975 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1550 1976 return;
1551 1977 }
1552 1978
1553 1979 fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
1554 - if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
1980 + if (!ZIO_CHECKSUM_EQUAL(*hdr->b_freeze_cksum, zc))
1555 1981 panic("buffer modified while frozen!");
1556 1982 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1557 1983 }
1558 1984
1559 1985 static boolean_t
1560 1986 arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
1561 1987 {
1562 1988 enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
1563 1989 boolean_t valid_cksum;
1564 1990
1565 1991 ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
1566 1992 VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
1567 1993
1568 1994 /*
1569 1995 * We rely on the blkptr's checksum to determine if the block
1570 1996 * is valid or not. When compressed arc is enabled, the l2arc
1571 1997 * writes the block to the l2arc just as it appears in the pool.
1572 1998 * This allows us to use the blkptr's checksum to validate the
1573 1999 * data that we just read off of the l2arc without having to store
1574 2000 * a separate checksum in the arc_buf_hdr_t. However, if compressed
(10 lines elided)
1575 2001 * arc is disabled, then the data written to the l2arc is always
1576 2002 * uncompressed and won't match the block as it exists in the main
1577 2003 * pool. When this is the case, we must first compress it if it is
1578 2004 * compressed on the main pool before we can validate the checksum.
1579 2005 */
1580 2006 if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
1581 2007 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1582 2008 uint64_t lsize = HDR_GET_LSIZE(hdr);
1583 2009 uint64_t csize;
1584 2010
1585 - abd_t *cdata = abd_alloc_linear(HDR_GET_PSIZE(hdr), B_TRUE);
1586 - csize = zio_compress_data(compress, zio->io_abd,
1587 - abd_to_buf(cdata), lsize);
2011 + void *cbuf = zio_buf_alloc(HDR_GET_PSIZE(hdr));
2012 + csize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);
2013 + abd_t *cdata = abd_get_from_buf(cbuf, HDR_GET_PSIZE(hdr));
2014 + abd_take_ownership_of_buf(cdata, B_TRUE);
1588 2015
1589 2016 ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
1590 2017 if (csize < HDR_GET_PSIZE(hdr)) {
1591 2018 /*
1592 2019 * Compressed blocks are always a multiple of the
1593 2020 * smallest ashift in the pool. Ideally, we would
1594 2021 * like to round up the csize to the next
1595 2022 * spa_min_ashift but that value may have changed
1596 2023 * since the block was last written. Instead,
1597 2024 * we rely on the fact that the hdr's psize
1598 2025 * was set to the psize of the block when it was
1599 2026 * last written. We set the csize to that value
1600 2027 * and zero out any part that should not contain
1601 2028 * data.
1602 2029 */
1603 2030 abd_zero_off(cdata, csize, HDR_GET_PSIZE(hdr) - csize);
1604 2031 csize = HDR_GET_PSIZE(hdr);
1605 2032 }
1606 2033 zio_push_transform(zio, cdata, csize, HDR_GET_PSIZE(hdr), NULL);
1607 2034 }
1608 2035
1609 2036 /*
1610 2037 * Block pointers always store the checksum for the logical data.
1611 2038 * If the block pointer has the gang bit set, then the checksum
1612 2039 * it represents is for the reconstituted data and not for an
1613 2040 * individual gang member. The zio pipeline, however, must be able to
1614 2041 * determine the checksum of each of the gang constituents so it
1615 2042 * treats the checksum comparison differently than what we need
1616 2043 * for l2arc blocks. This prevents us from using the
1617 2044 * zio_checksum_error() interface directly. Instead we must call the
1618 2045 * zio_checksum_error_impl() so that we can ensure the checksum is
1619 2046 * generated using the correct checksum algorithm and accounts for the
1620 2047 * logical I/O size and not just a gang fragment.
1621 2048 */
1622 2049 valid_cksum = (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
1623 2050 BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,
1624 2051 zio->io_offset, NULL) == 0);
1625 2052 zio_pop_transforms(zio);
1626 2053 return (valid_cksum);
1627 2054 }
1628 2055
1629 2056 /*
1630 2057 * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
1631 2058 * checksum and attaches it to the buf's hdr so that we can ensure that the buf
1632 2059 * isn't modified later on. If buf is compressed or there is already a checksum
1633 2060 * on the hdr, this is a no-op (we only checksum uncompressed bufs).
1634 2061 */
1635 2062 static void
(38 lines elided)
1636 2063 arc_cksum_compute(arc_buf_t *buf)
1637 2064 {
1638 2065 arc_buf_hdr_t *hdr = buf->b_hdr;
1639 2066
1640 2067 if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1641 2068 return;
1642 2069
1643 2070 ASSERT(HDR_HAS_L1HDR(hdr));
1644 2071
1645 2072 mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
1646 - if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
2073 + if (hdr->b_freeze_cksum != NULL) {
1647 2074 ASSERT(arc_hdr_has_uncompressed_buf(hdr));
1648 2075 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1649 2076 return;
1650 2077 } else if (ARC_BUF_COMPRESSED(buf)) {
1651 2078 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1652 2079 return;
1653 2080 }
1654 2081
1655 2082 ASSERT(!ARC_BUF_COMPRESSED(buf));
1656 - hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
2083 + hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
1657 2084 KM_SLEEP);
1658 2085 fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
1659 - hdr->b_l1hdr.b_freeze_cksum);
2086 + hdr->b_freeze_cksum);
1660 2087 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1661 2088 arc_buf_watch(buf);
1662 2089 }
1663 2090
1664 2091 #ifndef _KERNEL
1665 2092 typedef struct procctl {
1666 2093 long cmd;
1667 2094 prwatch_t prwatch;
1668 2095 } procctl_t;
1669 2096 #endif
1670 2097
1671 2098 /* ARGSUSED */
1672 2099 static void
1673 2100 arc_buf_unwatch(arc_buf_t *buf)
1674 2101 {
1675 2102 #ifndef _KERNEL
1676 2103 if (arc_watch) {
1677 2104 int result;
1678 2105 procctl_t ctl;
1679 2106 ctl.cmd = PCWATCH;
1680 2107 ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1681 2108 ctl.prwatch.pr_size = 0;
1682 2109 ctl.prwatch.pr_wflags = 0;
1683 2110 result = write(arc_procfd, &ctl, sizeof (ctl));
1684 2111 ASSERT3U(result, ==, sizeof (ctl));
1685 2112 }
1686 2113 #endif
1687 2114 }
1688 2115
1689 2116 /* ARGSUSED */
1690 2117 static void
1691 2118 arc_buf_watch(arc_buf_t *buf)
1692 2119 {
1693 2120 #ifndef _KERNEL
1694 2121 if (arc_watch) {
1695 2122 int result;
1696 2123 procctl_t ctl;
1697 2124 ctl.cmd = PCWATCH;
1698 2125 ctl.prwatch.pr_vaddr = (uintptr_t)buf->b_data;
1699 2126 ctl.prwatch.pr_size = arc_buf_size(buf);
1700 2127 ctl.prwatch.pr_wflags = WA_WRITE;
(31 lines elided)
1701 2128 result = write(arc_procfd, &ctl, sizeof (ctl));
1702 2129 ASSERT3U(result, ==, sizeof (ctl));
1703 2130 }
1704 2131 #endif
1705 2132 }
1706 2133
1707 2134 static arc_buf_contents_t
1708 2135 arc_buf_type(arc_buf_hdr_t *hdr)
1709 2136 {
1710 2137 arc_buf_contents_t type;
2138 +
1711 2139 if (HDR_ISTYPE_METADATA(hdr)) {
1712 2140 type = ARC_BUFC_METADATA;
2141 + } else if (HDR_ISTYPE_DDT(hdr)) {
2142 + type = ARC_BUFC_DDT;
1713 2143 } else {
1714 2144 type = ARC_BUFC_DATA;
1715 2145 }
1716 2146 VERIFY3U(hdr->b_type, ==, type);
1717 2147 return (type);
1718 2148 }
1719 2149
1720 2150 boolean_t
1721 2151 arc_is_metadata(arc_buf_t *buf)
1722 2152 {
1723 2153 return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
1724 2154 }
|
↓ open down ↓ |
2 lines elided |
↑ open up ↑ |
1725 2155
1726 2156 static uint32_t
1727 2157 arc_bufc_to_flags(arc_buf_contents_t type)
1728 2158 {
1729 2159 switch (type) {
1730 2160 case ARC_BUFC_DATA:
1731 2161 /* metadata field is 0 if buffer contains normal data */
1732 2162 return (0);
1733 2163 case ARC_BUFC_METADATA:
1734 2164 return (ARC_FLAG_BUFC_METADATA);
2165 + case ARC_BUFC_DDT:
2166 + return (ARC_FLAG_BUFC_DDT);
1735 2167 default:
1736 2168 break;
1737 2169 }
1738 2170 panic("undefined ARC buffer type!");
1739 2171 return ((uint32_t)-1);
1740 2172 }
1741 2173
2174 +static arc_buf_contents_t
2175 +arc_flags_to_bufc(uint32_t flags)
2176 +{
2177 + if (flags & ARC_FLAG_BUFC_DDT)
2178 + return (ARC_BUFC_DDT);
2179 + if (flags & ARC_FLAG_BUFC_METADATA)
2180 + return (ARC_BUFC_METADATA);
2181 + return (ARC_BUFC_DATA);
2182 +}
2183 +
1742 2184 void
1743 2185 arc_buf_thaw(arc_buf_t *buf)
1744 2186 {
1745 2187 arc_buf_hdr_t *hdr = buf->b_hdr;
1746 2188
1747 2189 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
1748 2190 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
1749 2191
1750 2192 arc_cksum_verify(buf);
1751 2193
1752 2194 /*
1753 2195 * Compressed buffers do not manipulate the b_freeze_cksum or
1754 2196 * allocate b_thawed.
1755 2197 */
1756 2198 if (ARC_BUF_COMPRESSED(buf)) {
1757 - ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
2199 + ASSERT(hdr->b_freeze_cksum == NULL ||
1758 2200 arc_hdr_has_uncompressed_buf(hdr));
1759 2201 return;
1760 2202 }
1761 2203
1762 2204 ASSERT(HDR_HAS_L1HDR(hdr));
1763 2205 arc_cksum_free(hdr);
1764 2206
1765 2207 mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
1766 2208 #ifdef ZFS_DEBUG
1767 2209 if (zfs_flags & ZFS_DEBUG_MODIFY) {
1768 2210 if (hdr->b_l1hdr.b_thawed != NULL)
1769 2211 kmem_free(hdr->b_l1hdr.b_thawed, 1);
1770 2212 hdr->b_l1hdr.b_thawed = kmem_alloc(1, KM_SLEEP);
1771 2213 }
1772 2214 #endif
1773 2215
1774 2216 mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
1775 2217
1776 2218 arc_buf_unwatch(buf);
1777 2219 }
1778 2220
|
↓ open down ↓ |
11 lines elided |
↑ open up ↑ |
1779 2221 void
1780 2222 arc_buf_freeze(arc_buf_t *buf)
1781 2223 {
1782 2224 arc_buf_hdr_t *hdr = buf->b_hdr;
1783 2225 kmutex_t *hash_lock;
1784 2226
1785 2227 if (!(zfs_flags & ZFS_DEBUG_MODIFY))
1786 2228 return;
1787 2229
1788 2230 if (ARC_BUF_COMPRESSED(buf)) {
1789 - ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
2231 + ASSERT(hdr->b_freeze_cksum == NULL ||
1790 2232 arc_hdr_has_uncompressed_buf(hdr));
1791 2233 return;
1792 2234 }
1793 2235
1794 2236 hash_lock = HDR_LOCK(hdr);
1795 2237 mutex_enter(hash_lock);
1796 2238
1797 2239 ASSERT(HDR_HAS_L1HDR(hdr));
1798 - ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
2240 + ASSERT(hdr->b_freeze_cksum != NULL ||
1799 2241 hdr->b_l1hdr.b_state == arc_anon);
1800 2242 arc_cksum_compute(buf);
1801 2243 mutex_exit(hash_lock);
1802 2244 }
1803 2245
1804 2246 /*
1805 2247 * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
1806 2248 * the following functions should be used to ensure that the flags are
1807 2249 * updated in a thread-safe way. When manipulating the flags either
1808 2250 * the hash_lock must be held or the hdr must be undiscoverable. This
1809 2251 * ensures that we're not racing with any other threads when updating
1810 2252 * the flags.
1811 2253 */
1812 2254 static inline void
1813 2255 arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
1814 2256 {
1815 2257 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1816 2258 hdr->b_flags |= flags;
1817 2259 }
1818 2260
1819 2261 static inline void
1820 2262 arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
1821 2263 {
1822 2264 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1823 2265 hdr->b_flags &= ~flags;
1824 2266 }
1825 2267
1826 2268 /*
1827 2269 * Setting the compression bits in the arc_buf_hdr_t's b_flags is
1828 2270 * done in a special way since we have to clear and set bits
1829 2271 * at the same time. Consumers that wish to set the compression bits
1830 2272 * must use this function to ensure that the flags are updated in
1831 2273 * thread-safe manner.
1832 2274 */
1833 2275 static void
1834 2276 arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
1835 2277 {
1836 2278 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
1837 2279
1838 2280 /*
1839 2281 * Holes and embedded blocks will always have a psize = 0 so
1840 2282 * we ignore the compression of the blkptr and set the
1841 2283 * arc_buf_hdr_t's compression to ZIO_COMPRESS_OFF.
1842 2284 * Holes and embedded blocks remain anonymous so we don't
1843 2285 * want to uncompress them. Mark them as uncompressed.
1844 2286 */
1845 2287 if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
1846 2288 arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
1847 2289 HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
1848 2290 ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
1849 2291 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
1850 2292 } else {
1851 2293 arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
1852 2294 HDR_SET_COMPRESS(hdr, cmp);
1853 2295 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
1854 2296 ASSERT(HDR_COMPRESSION_ENABLED(hdr));
1855 2297 }
1856 2298 }
1857 2299
1858 2300 /*
1859 2301 * Looks for another buf on the same hdr which has the data decompressed, copies
1860 2302 * from it, and returns true. If no such buf exists, returns false.
1861 2303 */
1862 2304 static boolean_t
1863 2305 arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
1864 2306 {
1865 2307 arc_buf_hdr_t *hdr = buf->b_hdr;
1866 2308 boolean_t copied = B_FALSE;
1867 2309
1868 2310 ASSERT(HDR_HAS_L1HDR(hdr));
1869 2311 ASSERT3P(buf->b_data, !=, NULL);
1870 2312 ASSERT(!ARC_BUF_COMPRESSED(buf));
1871 2313
1872 2314 for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
1873 2315 from = from->b_next) {
1874 2316 /* can't use our own data buffer */
1875 2317 if (from == buf) {
1876 2318 continue;
1877 2319 }
1878 2320
1879 2321 if (!ARC_BUF_COMPRESSED(from)) {
1880 2322 bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
1881 2323 copied = B_TRUE;
1882 2324 break;
1883 2325 }
1884 2326 }
1885 2327
1886 2328 /*
1887 2329 * There were no decompressed bufs, so there should not be a
1888 2330 * checksum on the hdr either.
1889 2331 */
1890 - EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
2332 + EQUIV(!copied, hdr->b_freeze_cksum == NULL);
1891 2333
1892 2334 return (copied);
1893 2335 }
1894 2336
1895 2337 /*
1896 2338 * Given a buf that has a data buffer attached to it, this function will
1897 2339 * efficiently fill the buf with data of the specified compression setting from
1898 2340 * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
1899 2341 * are already sharing a data buf, no copy is performed.
1900 2342 *
1901 2343 * If the buf is marked as compressed but uncompressed data was requested, this
1902 2344 * will allocate a new data buffer for the buf, remove that flag, and fill the
1903 2345 * buf with uncompressed data. You can't request a compressed buf on a hdr with
1904 2346 * uncompressed data, and (since we haven't added support for it yet) if you
1905 2347 * want compressed data your buf must already be marked as compressed and have
1906 2348 * the correct-sized data buffer.
1907 2349 */
1908 2350 static int
1909 2351 arc_buf_fill(arc_buf_t *buf, boolean_t compressed)
1910 2352 {
1911 2353 arc_buf_hdr_t *hdr = buf->b_hdr;
1912 2354 boolean_t hdr_compressed = (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
1913 2355 dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;
1914 2356
1915 2357 ASSERT3P(buf->b_data, !=, NULL);
1916 2358 IMPLY(compressed, hdr_compressed);
1917 2359 IMPLY(compressed, ARC_BUF_COMPRESSED(buf));
1918 2360
1919 2361 if (hdr_compressed == compressed) {
1920 2362 if (!arc_buf_is_shared(buf)) {
1921 2363 abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,
1922 2364 arc_buf_size(buf));
1923 2365 }
1924 2366 } else {
1925 2367 ASSERT(hdr_compressed);
1926 2368 ASSERT(!compressed);
1927 2369 ASSERT3U(HDR_GET_LSIZE(hdr), !=, HDR_GET_PSIZE(hdr));
1928 2370
1929 2371 /*
1930 2372 * If the buf is sharing its data with the hdr, unlink it and
1931 2373 * allocate a new data buffer for the buf.
1932 2374 */
1933 2375 if (arc_buf_is_shared(buf)) {
1934 2376 ASSERT(ARC_BUF_COMPRESSED(buf));
1935 2377
1936 2378 			/* We need to give the buf its own b_data */
1937 2379 buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
1938 2380 buf->b_data =
1939 2381 arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
1940 2382 arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
1941 2383
1942 2384 /* Previously overhead was 0; just add new overhead */
1943 2385 ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
1944 2386 } else if (ARC_BUF_COMPRESSED(buf)) {
1945 2387 /* We need to reallocate the buf's b_data */
1946 2388 arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
1947 2389 buf);
1948 2390 buf->b_data =
1949 2391 arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
1950 2392
1951 2393 /* We increased the size of b_data; update overhead */
1952 2394 ARCSTAT_INCR(arcstat_overhead_size,
1953 2395 HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
1954 2396 }
1955 2397
1956 2398 /*
1957 2399 * Regardless of the buf's previous compression settings, it
1958 2400 * should not be compressed at the end of this function.
1959 2401 */
1960 2402 buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
1961 2403
1962 2404 /*
1963 2405 * Try copying the data from another buf which already has a
1964 2406 * decompressed version. If that's not possible, it's time to
1965 2407 * bite the bullet and decompress the data from the hdr.
1966 2408 */
1967 2409 if (arc_buf_try_copy_decompressed_data(buf)) {
1968 2410 /* Skip byteswapping and checksumming (already done) */
1969 - ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
2411 + ASSERT3P(hdr->b_freeze_cksum, !=, NULL);
1970 2412 return (0);
1971 2413 } else {
1972 2414 int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
1973 2415 hdr->b_l1hdr.b_pabd, buf->b_data,
1974 2416 HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1975 2417
1976 2418 /*
1977 2419 * Absent hardware errors or software bugs, this should
1978 2420 * be impossible, but log it anyway so we can debug it.
1979 2421 */
1980 2422 if (error != 0) {
1981 2423 zfs_dbgmsg(
1982 2424 "hdr %p, compress %d, psize %d, lsize %d",
1983 2425 hdr, HDR_GET_COMPRESS(hdr),
1984 2426 HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
1985 2427 return (SET_ERROR(EIO));
1986 2428 }
1987 2429 }
1988 2430 }
1989 2431
1990 2432 /* Byteswap the buf's data if necessary */
1991 2433 if (bswap != DMU_BSWAP_NUMFUNCS) {
1992 2434 ASSERT(!HDR_SHARED_DATA(hdr));
1993 2435 ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);
1994 2436 dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));
1995 2437 }
1996 2438
1997 2439 /* Compute the hdr's checksum if necessary */
1998 2440 arc_cksum_compute(buf);
1999 2441
2000 2442 return (0);
2001 2443 }
2002 2444
2003 2445 int
2004 2446 arc_decompress(arc_buf_t *buf)
2005 2447 {
2006 2448 return (arc_buf_fill(buf, B_FALSE));
2007 2449 }
2008 2450
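The fill logic above boils down to: if the buf's compression matches the hdr's, copy the bytes (or do nothing if they are shared); otherwise try to borrow from an already-decompressed sibling buf, and only then pay for a real decompression of b_pabd. A compact userland model of that decision tree is sketched below; the decompress step is elided and none of these names are the kernel ones.

#include <stddef.h>
#include <stdio.h>
#include <stdbool.h>

/* Stand-ins for the relevant pieces of hdr/buf state. */
typedef struct model_fill {
	bool	hdr_compressed;		/* HDR_GET_COMPRESS(hdr) != OFF */
	bool	want_compressed;	/* caller asked for compressed data */
	bool	have_decompressed_sibling;
} model_fill_t;

/*
 * Returns a short description of which path the fill would take for the
 * given combination, mirroring the branches in the function above.
 */
static const char *
model_fill_path(const model_fill_t *mf)
{
	if (mf->hdr_compressed == mf->want_compressed)
		return ("copy b_pabd as-is (or nothing if shared)");
	if (mf->have_decompressed_sibling)
		return ("copy from an uncompressed sibling buf");
	return ("decompress b_pabd into the buf");
}

int
main(void)
{
	model_fill_t cases[] = {
		{ true,  true,  false },
		{ true,  false, true  },
		{ true,  false, false },
		{ false, false, false },
	};

	for (size_t i = 0; i < sizeof (cases) / sizeof (cases[0]); i++)
		printf("case %zu: %s\n", i, model_fill_path(&cases[i]));
	return (0);
}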
2009 2451 /*
2010 2452 * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.
2011 2453 */
2012 2454 static uint64_t
2013 2455 arc_hdr_size(arc_buf_hdr_t *hdr)
2014 2456 {
2015 2457 uint64_t size;
2016 2458
2017 2459 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
2018 2460 HDR_GET_PSIZE(hdr) > 0) {
2019 2461 size = HDR_GET_PSIZE(hdr);
2020 2462 } else {
2021 2463 ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
2022 2464 size = HDR_GET_LSIZE(hdr);
2023 2465 }
2024 2466 return (size);
2025 2467 }
2026 2468
2027 2469 /*
2028 2470 * Increment the amount of evictable space in the arc_state_t's refcount.
2029 2471 * We account for the space used by the hdr and the arc buf individually
2030 2472 * so that we can add and remove them from the refcount individually.
2031 2473 */
2032 2474 static void
2033 2475 arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)
2034 2476 {
2035 2477 arc_buf_contents_t type = arc_buf_type(hdr);
2036 2478
2037 2479 ASSERT(HDR_HAS_L1HDR(hdr));
2038 2480
2039 2481 if (GHOST_STATE(state)) {
2040 2482 ASSERT0(hdr->b_l1hdr.b_bufcnt);
2041 2483 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2042 2484 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2043 2485 (void) refcount_add_many(&state->arcs_esize[type],
2044 2486 HDR_GET_LSIZE(hdr), hdr);
2045 2487 return;
2046 2488 }
2047 2489
2048 2490 ASSERT(!GHOST_STATE(state));
2049 2491 if (hdr->b_l1hdr.b_pabd != NULL) {
2050 2492 (void) refcount_add_many(&state->arcs_esize[type],
2051 2493 arc_hdr_size(hdr), hdr);
2052 2494 }
2053 2495 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2054 2496 buf = buf->b_next) {
2055 2497 if (arc_buf_is_shared(buf))
2056 2498 continue;
2057 2499 (void) refcount_add_many(&state->arcs_esize[type],
2058 2500 arc_buf_size(buf), buf);
2059 2501 }
2060 2502 }
2061 2503
2062 2504 /*
2063 2505 * Decrement the amount of evictable space in the arc_state_t's refcount.
2064 2506 * We account for the space used by the hdr and the arc buf individually
2065 2507 * so that we can add and remove them from the refcount individually.
2066 2508 */
2067 2509 static void
2068 2510 arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)
2069 2511 {
2070 2512 arc_buf_contents_t type = arc_buf_type(hdr);
2071 2513
2072 2514 ASSERT(HDR_HAS_L1HDR(hdr));
2073 2515
2074 2516 if (GHOST_STATE(state)) {
2075 2517 ASSERT0(hdr->b_l1hdr.b_bufcnt);
2076 2518 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2077 2519 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2078 2520 (void) refcount_remove_many(&state->arcs_esize[type],
2079 2521 HDR_GET_LSIZE(hdr), hdr);
2080 2522 return;
2081 2523 }
2082 2524
2083 2525 ASSERT(!GHOST_STATE(state));
2084 2526 if (hdr->b_l1hdr.b_pabd != NULL) {
2085 2527 (void) refcount_remove_many(&state->arcs_esize[type],
2086 2528 arc_hdr_size(hdr), hdr);
2087 2529 }
2088 2530 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2089 2531 buf = buf->b_next) {
2090 2532 if (arc_buf_is_shared(buf))
2091 2533 continue;
2092 2534 (void) refcount_remove_many(&state->arcs_esize[type],
2093 2535 arc_buf_size(buf), buf);
2094 2536 }
2095 2537 }
2096 2538
2097 2539 /*
2098 2540 * Add a reference to this hdr indicating that someone is actively
2099 2541 * referencing that memory. When the refcount transitions from 0 to 1,
2100 2542 * we remove it from the respective arc_state_t list to indicate that
2101 2543 * it is not evictable.
2102 2544 */
2103 2545 static void
2104 2546 add_reference(arc_buf_hdr_t *hdr, void *tag)
2105 2547 {
2106 2548 ASSERT(HDR_HAS_L1HDR(hdr));
2107 2549 if (!MUTEX_HELD(HDR_LOCK(hdr))) {
2108 2550 ASSERT(hdr->b_l1hdr.b_state == arc_anon);
2109 2551 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2110 2552 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2111 2553 }
2112 2554
2113 2555 arc_state_t *state = hdr->b_l1hdr.b_state;
2114 2556
2115 2557 if ((refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&
2116 2558 (state != arc_anon)) {
2117 2559 /* We don't use the L2-only state list. */
2118 2560 if (state != arc_l2c_only) {
2119 2561 multilist_remove(state->arcs_list[arc_buf_type(hdr)],
2120 2562 hdr);
2121 2563 arc_evictable_space_decrement(hdr, state);
2122 2564 }
2123 2565 /* remove the prefetch flag if we get a reference */
2124 2566 arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
2125 2567 }
2126 2568 }
2127 2569
2128 2570 /*
2129 2571 * Remove a reference from this hdr. When the reference transitions from
2130 2572 * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's
2131 2573 * list making it eligible for eviction.
2132 2574 */
2133 2575 static int
2134 2576 remove_reference(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, void *tag)
2135 2577 {
2136 2578 int cnt;
2137 2579 arc_state_t *state = hdr->b_l1hdr.b_state;
2138 2580
2139 2581 ASSERT(HDR_HAS_L1HDR(hdr));
2140 2582 ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
2141 2583 ASSERT(!GHOST_STATE(state));
2142 2584
2143 2585 /*
2144 2586 * arc_l2c_only counts as a ghost state so we don't need to explicitly
2145 2587 * check to prevent usage of the arc_l2c_only list.
2146 2588 */
2147 2589 if (((cnt = refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) == 0) &&
2148 2590 (state != arc_anon)) {
2149 2591 multilist_insert(state->arcs_list[arc_buf_type(hdr)], hdr);
2150 2592 ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
2151 2593 arc_evictable_space_increment(hdr, state);
2152 2594 }
2153 2595 return (cnt);
2154 2596 }
2155 2597
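Taken together, add_reference() and remove_reference() maintain a simple invariant: a header sits on its state's evictable list exactly while its refcount is zero (ignoring the anonymous and L2-only states that the real code special-cases). A toy model of that invariant, with hypothetical names:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy header: a refcount plus a flag standing in for list membership. */
typedef struct model_ref_hdr {
	uint64_t	mr_refcnt;
	bool		mr_on_evict_list;
} model_ref_hdr_t;

/* The 0 -> 1 transition pulls the header off the evictable list. */
static void
model_add_reference(model_ref_hdr_t *mr)
{
	if (++mr->mr_refcnt == 1)
		mr->mr_on_evict_list = false;
}

/* The 1 -> 0 transition puts it back, making it eligible for eviction. */
static void
model_remove_reference(model_ref_hdr_t *mr)
{
	assert(mr->mr_refcnt > 0);
	if (--mr->mr_refcnt == 0)
		mr->mr_on_evict_list = true;
}

int
main(void)
{
	model_ref_hdr_t mr = { 0, true };

	model_add_reference(&mr);	/* someone holds the buffer */
	assert(!mr.mr_on_evict_list);
	model_add_reference(&mr);
	model_remove_reference(&mr);
	assert(!mr.mr_on_evict_list);	/* still referenced once */
	model_remove_reference(&mr);
	assert(mr.mr_on_evict_list);	/* evictable again */
	return (0);
}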
2156 2598 /*
2157 2599 * Move the supplied buffer to the indicated state. The hash lock
2158 2600 * for the buffer must be held by the caller.
2159 2601 */
2160 2602 static void
2161 2603 arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr,
2162 2604 kmutex_t *hash_lock)
2163 2605 {
2164 2606 arc_state_t *old_state;
2165 2607 int64_t refcnt;
2166 2608 uint32_t bufcnt;
2167 2609 boolean_t update_old, update_new;
2168 2610 arc_buf_contents_t buftype = arc_buf_type(hdr);
2169 2611
2170 2612 /*
2171 2613 * We almost always have an L1 hdr here, since we call arc_hdr_realloc()
2172 2614 * in arc_read() when bringing a buffer out of the L2ARC. However, the
2173 2615 * L1 hdr doesn't always exist when we change state to arc_anon before
2174 2616 * destroying a header, in which case reallocating to add the L1 hdr is
2175 2617 * pointless.
2176 2618 */
2177 2619 if (HDR_HAS_L1HDR(hdr)) {
2178 2620 old_state = hdr->b_l1hdr.b_state;
2179 2621 refcnt = refcount_count(&hdr->b_l1hdr.b_refcnt);
2180 2622 bufcnt = hdr->b_l1hdr.b_bufcnt;
2181 2623 update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pabd != NULL);
2182 2624 } else {
2183 2625 old_state = arc_l2c_only;
2184 2626 refcnt = 0;
2185 2627 bufcnt = 0;
2186 2628 update_old = B_FALSE;
2187 2629 }
2188 2630 update_new = update_old;
2189 2631
2190 2632 ASSERT(MUTEX_HELD(hash_lock));
2191 2633 ASSERT3P(new_state, !=, old_state);
2192 2634 ASSERT(!GHOST_STATE(new_state) || bufcnt == 0);
2193 2635 ASSERT(old_state != arc_anon || bufcnt <= 1);
2194 2636
2195 2637 /*
2196 2638 * If this buffer is evictable, transfer it from the
2197 2639 * old state list to the new state list.
2198 2640 */
2199 2641 if (refcnt == 0) {
2200 2642 if (old_state != arc_anon && old_state != arc_l2c_only) {
2201 2643 ASSERT(HDR_HAS_L1HDR(hdr));
2202 2644 multilist_remove(old_state->arcs_list[buftype], hdr);
2203 2645
2204 2646 if (GHOST_STATE(old_state)) {
2205 2647 ASSERT0(bufcnt);
2206 2648 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2207 2649 update_old = B_TRUE;
2208 2650 }
2209 2651 arc_evictable_space_decrement(hdr, old_state);
2210 2652 }
2211 2653 if (new_state != arc_anon && new_state != arc_l2c_only) {
2212 2654
2213 2655 /*
2214 2656 * An L1 header always exists here, since if we're
2215 2657 * moving to some L1-cached state (i.e. not l2c_only or
2216 2658 * anonymous), we realloc the header to add an L1hdr
2217 2659 * beforehand.
2218 2660 */
2219 2661 ASSERT(HDR_HAS_L1HDR(hdr));
2220 2662 multilist_insert(new_state->arcs_list[buftype], hdr);
2221 2663
2222 2664 if (GHOST_STATE(new_state)) {
2223 2665 ASSERT0(bufcnt);
2224 2666 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2225 2667 update_new = B_TRUE;
2226 2668 }
2227 2669 arc_evictable_space_increment(hdr, new_state);
2228 2670 }
2229 2671 }
2230 2672
2231 2673 ASSERT(!HDR_EMPTY(hdr));
2232 - if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
2674 + if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) {
2675 + arc_wait_for_krrp(hdr);
2233 2676 buf_hash_remove(hdr);
2677 + }
2234 2678
2235 2679 /* adjust state sizes (ignore arc_l2c_only) */
2236 2680
2237 2681 if (update_new && new_state != arc_l2c_only) {
2238 2682 ASSERT(HDR_HAS_L1HDR(hdr));
2239 2683 if (GHOST_STATE(new_state)) {
2240 2684 ASSERT0(bufcnt);
2241 2685
2242 2686 /*
2243 2687 * When moving a header to a ghost state, we first
2244 2688 * remove all arc buffers. Thus, we'll have a
2245 2689 * bufcnt of zero, and no arc buffer to use for
2246 2690 * the reference. As a result, we use the arc
2247 2691 * header pointer for the reference.
2248 2692 */
2249 2693 (void) refcount_add_many(&new_state->arcs_size,
2250 2694 HDR_GET_LSIZE(hdr), hdr);
2251 2695 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2252 2696 } else {
2253 2697 uint32_t buffers = 0;
2254 2698
2255 2699 /*
2256 2700 * Each individual buffer holds a unique reference,
2257 2701 			 * thus we must add each of these references one
2258 2702 * at a time.
2259 2703 */
2260 2704 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2261 2705 buf = buf->b_next) {
2262 2706 ASSERT3U(bufcnt, !=, 0);
2263 2707 buffers++;
2264 2708
2265 2709 /*
2266 2710 * When the arc_buf_t is sharing the data
2267 2711 * block with the hdr, the owner of the
2268 2712 * reference belongs to the hdr. Only
2269 2713 * add to the refcount if the arc_buf_t is
2270 2714 * not shared.
2271 2715 */
2272 2716 if (arc_buf_is_shared(buf))
2273 2717 continue;
2274 2718
2275 2719 (void) refcount_add_many(&new_state->arcs_size,
2276 2720 arc_buf_size(buf), buf);
2277 2721 }
2278 2722 ASSERT3U(bufcnt, ==, buffers);
2279 2723
2280 2724 if (hdr->b_l1hdr.b_pabd != NULL) {
2281 2725 (void) refcount_add_many(&new_state->arcs_size,
2282 2726 arc_hdr_size(hdr), hdr);
2283 2727 } else {
2284 2728 ASSERT(GHOST_STATE(old_state));
2285 2729 }
2286 2730 }
2287 2731 }
2288 2732
2289 2733 if (update_old && old_state != arc_l2c_only) {
2290 2734 ASSERT(HDR_HAS_L1HDR(hdr));
2291 2735 if (GHOST_STATE(old_state)) {
2292 2736 ASSERT0(bufcnt);
2293 2737 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2294 2738
2295 2739 /*
2296 2740 * When moving a header off of a ghost state,
2297 2741 * the header will not contain any arc buffers.
2298 2742 * We use the arc header pointer for the reference
2299 2743 * which is exactly what we did when we put the
2300 2744 * header on the ghost state.
2301 2745 */
2302 2746
2303 2747 (void) refcount_remove_many(&old_state->arcs_size,
2304 2748 HDR_GET_LSIZE(hdr), hdr);
2305 2749 } else {
2306 2750 uint32_t buffers = 0;
2307 2751
2308 2752 /*
2309 2753 * Each individual buffer holds a unique reference,
2310 2754 * thus we must remove each of these references one
2311 2755 * at a time.
2312 2756 */
2313 2757 for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
2314 2758 buf = buf->b_next) {
2315 2759 ASSERT3U(bufcnt, !=, 0);
2316 2760 buffers++;
2317 2761
2318 2762 /*
2319 2763 * When the arc_buf_t is sharing the data
2320 2764 * block with the hdr, the owner of the
2321 2765 * reference belongs to the hdr. Only
2322 2766 * add to the refcount if the arc_buf_t is
2323 2767 * not shared.
2324 2768 */
2325 2769 if (arc_buf_is_shared(buf))
2326 2770 continue;
2327 2771
2328 2772 (void) refcount_remove_many(
2329 2773 &old_state->arcs_size, arc_buf_size(buf),
2330 2774 buf);
2331 2775 }
2332 2776 ASSERT3U(bufcnt, ==, buffers);
2333 2777 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2334 2778 (void) refcount_remove_many(
2335 2779 &old_state->arcs_size, arc_hdr_size(hdr), hdr);
2336 2780 }
2337 2781 }
2338 2782
2339 2783 if (HDR_HAS_L1HDR(hdr))
2340 2784 hdr->b_l1hdr.b_state = new_state;
2341 2785
2342 2786 /*
2343 2787 * L2 headers should never be on the L2 state list since they don't
2344 2788 * have L1 headers allocated.
2345 2789 */
2346 - ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
2347 - multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
2790 + ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]));
2791 + ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
2792 + ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DDT]));
2348 2793 }
2349 2794
2350 2795 void
2351 2796 arc_space_consume(uint64_t space, arc_space_type_t type)
2352 2797 {
2353 2798 ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2354 2799
2355 2800 switch (type) {
2356 2801 case ARC_SPACE_DATA:
2357 - aggsum_add(&astat_data_size, space);
2802 + ARCSTAT_INCR(arcstat_data_size, space);
2358 2803 break;
2359 2804 case ARC_SPACE_META:
2360 - aggsum_add(&astat_metadata_size, space);
2805 + ARCSTAT_INCR(arcstat_metadata_size, space);
2361 2806 break;
2807 + case ARC_SPACE_DDT:
2808 + ARCSTAT_INCR(arcstat_ddt_size, space);
2809 + break;
2362 2810 case ARC_SPACE_OTHER:
2363 - aggsum_add(&astat_other_size, space);
2811 + ARCSTAT_INCR(arcstat_other_size, space);
2364 2812 break;
2365 2813 case ARC_SPACE_HDRS:
2366 - aggsum_add(&astat_hdr_size, space);
2814 + ARCSTAT_INCR(arcstat_hdr_size, space);
2367 2815 break;
2368 2816 case ARC_SPACE_L2HDRS:
2369 - aggsum_add(&astat_l2_hdr_size, space);
2817 + ARCSTAT_INCR(arcstat_l2_hdr_size, space);
2370 2818 break;
2371 2819 }
2372 2820
2373 - if (type != ARC_SPACE_DATA)
2374 - aggsum_add(&arc_meta_used, space);
2821 + if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT)
2822 + ARCSTAT_INCR(arcstat_meta_used, space);
2375 2823
2376 - aggsum_add(&arc_size, space);
2824 + atomic_add_64(&arc_size, space);
2377 2825 }
2378 2826
2379 2827 void
2380 2828 arc_space_return(uint64_t space, arc_space_type_t type)
2381 2829 {
2382 2830 ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
2383 2831
2384 2832 switch (type) {
2385 2833 case ARC_SPACE_DATA:
2386 - aggsum_add(&astat_data_size, -space);
2834 + ARCSTAT_INCR(arcstat_data_size, -space);
2387 2835 break;
2388 2836 case ARC_SPACE_META:
2389 - aggsum_add(&astat_metadata_size, -space);
2837 + ARCSTAT_INCR(arcstat_metadata_size, -space);
2390 2838 break;
2839 + case ARC_SPACE_DDT:
2840 + ARCSTAT_INCR(arcstat_ddt_size, -space);
2841 + break;
2391 2842 case ARC_SPACE_OTHER:
2392 - aggsum_add(&astat_other_size, -space);
2843 + ARCSTAT_INCR(arcstat_other_size, -space);
2393 2844 break;
2394 2845 case ARC_SPACE_HDRS:
2395 - aggsum_add(&astat_hdr_size, -space);
2846 + ARCSTAT_INCR(arcstat_hdr_size, -space);
2396 2847 break;
2397 2848 case ARC_SPACE_L2HDRS:
2398 - aggsum_add(&astat_l2_hdr_size, -space);
2849 + ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
2399 2850 break;
2400 2851 }
2401 2852
2402 - if (type != ARC_SPACE_DATA) {
2403 - ASSERT(aggsum_compare(&arc_meta_used, space) >= 0);
2404 - /*
2405 - * We use the upper bound here rather than the precise value
2406 - * because the arc_meta_max value doesn't need to be
2407 - * precise. It's only consumed by humans via arcstats.
2408 - */
2409 - if (arc_meta_max < aggsum_upper_bound(&arc_meta_used))
2410 - arc_meta_max = aggsum_upper_bound(&arc_meta_used);
2411 - aggsum_add(&arc_meta_used, -space);
2853 + if (type != ARC_SPACE_DATA && type != ARC_SPACE_DDT) {
2854 + ASSERT(arc_meta_used >= space);
2855 + if (arc_meta_max < arc_meta_used)
2856 + arc_meta_max = arc_meta_used;
2857 + ARCSTAT_INCR(arcstat_meta_used, -space);
2412 2858 }
2413 2859
2414 - ASSERT(aggsum_compare(&arc_size, space) >= 0);
2415 - aggsum_add(&arc_size, -space);
2860 + ASSERT(arc_size >= space);
2861 + atomic_add_64(&arc_size, -space);
2416 2862 }
2417 2863
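Both functions keep a per-type counter plus two aggregates: arc_meta_used, which in this version excludes both DATA and DDT space, and arc_size, which counts everything. A standalone sketch of that bookkeeping, using plain variables instead of kstats and names invented here:

#include <assert.h>
#include <stdint.h>

enum model_space_type { M_DATA, M_META, M_DDT, M_OTHER, M_NUMTYPES };

static int64_t model_type_size[M_NUMTYPES];
static int64_t model_meta_used;	/* everything except DATA and DDT */
static int64_t model_arc_size;	/* everything */

static void
model_space_consume(int64_t space, enum model_space_type t)
{
	model_type_size[t] += space;
	if (t != M_DATA && t != M_DDT)
		model_meta_used += space;
	model_arc_size += space;
}

static void
model_space_return(int64_t space, enum model_space_type t)
{
	model_type_size[t] -= space;
	if (t != M_DATA && t != M_DDT) {
		assert(model_meta_used >= space);
		model_meta_used -= space;
	}
	assert(model_arc_size >= space);
	model_arc_size -= space;
}

int
main(void)
{
	model_space_consume(4096, M_DATA);
	model_space_consume(512, M_META);
	model_space_consume(512, M_DDT);
	assert(model_meta_used == 512);		/* DDT excluded from meta */
	assert(model_arc_size == 5120);

	model_space_return(512, M_DDT);
	model_space_return(512, M_META);
	model_space_return(4096, M_DATA);
	assert(model_arc_size == 0 && model_meta_used == 0);
	return (0);
}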
2418 2864 /*
2419 2865 * Given a hdr and a buf, returns whether that buf can share its b_data buffer
2420 2866 * with the hdr's b_pabd.
2421 2867 */
2422 2868 static boolean_t
2423 2869 arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2424 2870 {
2425 2871 /*
2426 2872 * The criteria for sharing a hdr's data are:
2427 2873 * 1. the hdr's compression matches the buf's compression
2428 2874 * 2. the hdr doesn't need to be byteswapped
2429 2875 * 3. the hdr isn't already being shared
2430 2876 * 4. the buf is either compressed or it is the last buf in the hdr list
2431 2877 *
2432 2878 * Criterion #4 maintains the invariant that shared uncompressed
2433 2879 * bufs must be the final buf in the hdr's b_buf list. Reading this, you
2434 2880 * might ask, "if a compressed buf is allocated first, won't that be the
2435 2881 * last thing in the list?", but in that case it's impossible to create
2436 2882 * a shared uncompressed buf anyway (because the hdr must be compressed
2437 2883 * to have the compressed buf). You might also think that #3 is
2438 2884 * sufficient to make this guarantee, however it's possible
2439 2885 * (specifically in the rare L2ARC write race mentioned in
2440 2886 * arc_buf_alloc_impl()) there will be an existing uncompressed buf that
2441 2887 * is sharable, but wasn't at the time of its allocation. Rather than
2442 2888 * allow a new shared uncompressed buf to be created and then shuffle
2443 2889 * the list around to make it the last element, this simply disallows
2444 2890 * sharing if the new buf isn't the first to be added.
2445 2891 */
2446 2892 ASSERT3P(buf->b_hdr, ==, hdr);
2447 2893 boolean_t hdr_compressed = HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF;
2448 2894 boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
2449 2895 return (buf_compressed == hdr_compressed &&
2450 2896 hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
2451 2897 !HDR_SHARED_DATA(hdr) &&
2452 2898 (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
2453 2899 }
2454 2900
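The four sharing criteria listed in the comment read as a single boolean predicate. The sketch below restates it over plain booleans; the field names are stand-ins, and the real caller additionally requires a linear ABD and no in-flight L2ARC write (see arc_buf_alloc_impl()).

#include <assert.h>
#include <stdbool.h>

typedef struct model_share {
	bool	hdr_compressed;		/* HDR_GET_COMPRESS(hdr) != OFF */
	bool	buf_compressed;		/* ARC_BUF_COMPRESSED(buf) */
	bool	hdr_needs_byteswap;	/* b_byteswap != DMU_BSWAP_NUMFUNCS */
	bool	hdr_already_shared;	/* HDR_SHARED_DATA(hdr) */
	bool	buf_is_last;		/* ARC_BUF_LAST(buf) */
} model_share_t;

/* Restatement of the sharing criteria over plain booleans. */
static bool
model_can_share(const model_share_t *ms)
{
	return (ms->buf_compressed == ms->hdr_compressed &&
	    !ms->hdr_needs_byteswap &&
	    !ms->hdr_already_shared &&
	    (ms->buf_is_last || ms->buf_compressed));
}

int
main(void)
{
	/* Uncompressed buf on an uncompressed hdr, last in the list: OK. */
	model_share_t a = { false, false, false, false, true };
	assert(model_can_share(&a));

	/* Compression mismatch: cannot share. */
	model_share_t b = { true, false, false, false, true };
	assert(!model_can_share(&b));

	/* Hdr already sharing with another buf: cannot share. */
	model_share_t c = { false, false, false, true, true };
	assert(!model_can_share(&c));
	return (0);
}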
2455 2901 /*
2456 2902 * Allocate a buf for this hdr. If you care about the data that's in the hdr,
2457 2903 * or if you want a compressed buffer, pass those flags in. Returns 0 if the
2458 2904 * copy was made successfully, or an error code otherwise.
2459 2905 */
2460 2906 static int
2461 2907 arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
2462 2908 boolean_t fill, arc_buf_t **ret)
2463 2909 {
2464 2910 arc_buf_t *buf;
2465 2911
2466 2912 ASSERT(HDR_HAS_L1HDR(hdr));
2467 2913 ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2468 2914 VERIFY(hdr->b_type == ARC_BUFC_DATA ||
2469 - hdr->b_type == ARC_BUFC_METADATA);
2915 + hdr->b_type == ARC_BUFC_METADATA ||
2916 + hdr->b_type == ARC_BUFC_DDT);
2470 2917 ASSERT3P(ret, !=, NULL);
2471 2918 ASSERT3P(*ret, ==, NULL);
2472 2919
2473 2920 buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
2474 2921 buf->b_hdr = hdr;
2475 2922 buf->b_data = NULL;
2476 2923 buf->b_next = hdr->b_l1hdr.b_buf;
2477 2924 buf->b_flags = 0;
2478 2925
2479 2926 add_reference(hdr, tag);
2480 2927
2481 2928 /*
2482 2929 * We're about to change the hdr's b_flags. We must either
2483 2930 * hold the hash_lock or be undiscoverable.
2484 2931 */
2485 2932 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2486 2933
2487 2934 /*
2488 2935 * Only honor requests for compressed bufs if the hdr is actually
2489 2936 * compressed.
2490 2937 */
2491 2938 if (compressed && HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF)
2492 2939 buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
2493 2940
2494 2941 /*
2495 2942 * If the hdr's data can be shared then we share the data buffer and
2496 2943 * set the appropriate bit in the hdr's b_flags to indicate the hdr is
2497 2944  * sharing its b_pabd with the arc_buf_t. Otherwise, we allocate a new
2498 2945 * buffer to store the buf's data.
2499 2946 *
2500 2947 * There are two additional restrictions here because we're sharing
2501 2948 * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be
2502 2949 * actively involved in an L2ARC write, because if this buf is used by
2503 2950 * an arc_write() then the hdr's data buffer will be released when the
2504 2951 * write completes, even though the L2ARC write might still be using it.
2505 2952 * Second, the hdr's ABD must be linear so that the buf's user doesn't
2506 2953 * need to be ABD-aware.
2507 2954 */
2508 2955 boolean_t can_share = arc_can_share(hdr, buf) && !HDR_L2_WRITING(hdr) &&
2509 2956 abd_is_linear(hdr->b_l1hdr.b_pabd);
2510 2957
2511 2958 /* Set up b_data and sharing */
2512 2959 if (can_share) {
2513 2960 buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
2514 2961 buf->b_flags |= ARC_BUF_FLAG_SHARED;
2515 2962 arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2516 2963 } else {
2517 2964 buf->b_data =
2518 2965 arc_get_data_buf(hdr, arc_buf_size(buf), buf);
2519 2966 ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
2520 2967 }
2521 2968 VERIFY3P(buf->b_data, !=, NULL);
2522 2969
2523 2970 hdr->b_l1hdr.b_buf = buf;
2524 2971 hdr->b_l1hdr.b_bufcnt += 1;
2525 2972
2526 2973 /*
2527 2974 * If the user wants the data from the hdr, we need to either copy or
2528 2975 * decompress the data.
2529 2976 */
2530 2977 if (fill) {
2531 2978 return (arc_buf_fill(buf, ARC_BUF_COMPRESSED(buf) != 0));
2532 2979 }
2533 2980
2534 2981 return (0);
2535 2982 }
2536 2983
2537 2984 static char *arc_onloan_tag = "onloan";
2538 2985
2539 2986 static inline void
2540 2987 arc_loaned_bytes_update(int64_t delta)
2541 2988 {
2542 2989 atomic_add_64(&arc_loaned_bytes, delta);
2543 2990
2544 2991 /* assert that it did not wrap around */
2545 2992 ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
2546 2993 }
2547 2994
2548 2995 /*
2996 + * Allocates an ARC buf header that's in an evicted & L2-cached state.
2997 + * This is used during l2arc reconstruction to make empty ARC buffers
2998 + * which circumvent the regular disk->arc->l2arc path and instead come
2999 + * into being in the reverse order, i.e. l2arc->arc.
3000 + */
3001 +static arc_buf_hdr_t *
3002 +arc_buf_alloc_l2only(uint64_t load_guid, arc_buf_contents_t type,
3003 + l2arc_dev_t *dev, dva_t dva, uint64_t daddr, uint64_t lsize,
3004 + uint64_t psize, uint64_t birth, zio_cksum_t cksum, int checksum_type,
3005 + enum zio_compress compress, boolean_t arc_compress)
3006 +{
3007 + arc_buf_hdr_t *hdr;
3008 +
3009 + if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
3010 + type = ARC_BUFC_METADATA;
3011 +
3012 + ASSERT(lsize != 0);
3013 + hdr = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
3014 + ASSERT(HDR_EMPTY(hdr));
3015 + ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3016 +
3017 + hdr->b_spa = load_guid;
3018 + hdr->b_type = type;
3019 + hdr->b_flags = 0;
3020 +
3021 + if (arc_compress)
3022 + arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
3023 + else
3024 + arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
3025 +
3026 + HDR_SET_COMPRESS(hdr, compress);
3027 +
3028 + arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
3029 + hdr->b_dva = dva;
3030 + hdr->b_birth = birth;
3031 + if (checksum_type != ZIO_CHECKSUM_OFF) {
3032 + hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
3033 + bcopy(&cksum, hdr->b_freeze_cksum, sizeof (cksum));
3034 + }
3035 +
3036 + HDR_SET_PSIZE(hdr, psize);
3037 + HDR_SET_LSIZE(hdr, lsize);
3038 +
3039 + hdr->b_l2hdr.b_dev = dev;
3040 + hdr->b_l2hdr.b_daddr = daddr;
3041 +
3042 + return (hdr);
3043 +}
3044 +
3045 +/*
2549 3046 * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
2550 3047 * flight data by arc_tempreserve_space() until they are "returned". Loaned
2551 3048 * buffers must be returned to the arc before they can be used by the DMU or
2552 3049 * freed.
2553 3050 */
2554 3051 arc_buf_t *
2555 3052 arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
2556 3053 {
2557 3054 arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
2558 3055 is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
2559 3056
2560 3057 arc_loaned_bytes_update(size);
2561 3058
2562 3059 return (buf);
2563 3060 }
2564 3061
2565 3062 arc_buf_t *
2566 3063 arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
2567 3064 enum zio_compress compression_type)
2568 3065 {
2569 3066 arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
2570 3067 psize, lsize, compression_type);
2571 3068
2572 3069 arc_loaned_bytes_update(psize);
2573 3070
2574 3071 return (buf);
2575 3072 }
2576 3073
2577 3074
2578 3075 /*
2579 3076 * Return a loaned arc buffer to the arc.
2580 3077 */
2581 3078 void
2582 3079 arc_return_buf(arc_buf_t *buf, void *tag)
2583 3080 {
2584 3081 arc_buf_hdr_t *hdr = buf->b_hdr;
2585 3082
2586 3083 ASSERT3P(buf->b_data, !=, NULL);
2587 3084 ASSERT(HDR_HAS_L1HDR(hdr));
2588 3085 (void) refcount_add(&hdr->b_l1hdr.b_refcnt, tag);
2589 3086 (void) refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
2590 3087
2591 3088 arc_loaned_bytes_update(-arc_buf_size(buf));
2592 3089 }
2593 3090
2594 3091 /* Detach an arc_buf from a dbuf (tag) */
2595 3092 void
2596 3093 arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
2597 3094 {
2598 3095 arc_buf_hdr_t *hdr = buf->b_hdr;
2599 3096
2600 3097 ASSERT3P(buf->b_data, !=, NULL);
2601 3098 ASSERT(HDR_HAS_L1HDR(hdr));
2602 3099 (void) refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
2603 3100 (void) refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);
2604 3101
2605 3102 arc_loaned_bytes_update(arc_buf_size(buf));
2606 3103 }
2607 3104
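Loaned buffers are accounted symmetrically: arc_loan_buf() and arc_loan_compressed_buf() add the buffer's size to arc_loaned_bytes, arc_return_buf() subtracts it, and arc_loan_inuse_buf() re-loans a buffer a dbuf had taken. A trivial model of that counter discipline, with a hypothetical name:

#include <assert.h>
#include <stdint.h>

static int64_t model_loaned_bytes;

static void
model_loaned_bytes_update(int64_t delta)
{
	model_loaned_bytes += delta;
	assert(model_loaned_bytes >= 0);	/* must never go negative */
}

int
main(void)
{
	model_loaned_bytes_update(131072);	/* loan a 128K buffer */
	model_loaned_bytes_update(8192);	/* loan an 8K buffer */
	model_loaned_bytes_update(-131072);	/* return the first one */
	model_loaned_bytes_update(-8192);	/* return the second one */
	assert(model_loaned_bytes == 0);
	return (0);
}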
2608 3105 static void
2609 3106 l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)
2610 3107 {
2611 3108 l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);
2612 3109
2613 3110 df->l2df_abd = abd;
2614 3111 df->l2df_size = size;
2615 3112 df->l2df_type = type;
2616 3113 mutex_enter(&l2arc_free_on_write_mtx);
2617 3114 list_insert_head(l2arc_free_on_write, df);
2618 3115 mutex_exit(&l2arc_free_on_write_mtx);
2619 3116 }
2620 3117
2621 3118 static void
2622 3119 arc_hdr_free_on_write(arc_buf_hdr_t *hdr)
2623 3120 {
2624 3121 arc_state_t *state = hdr->b_l1hdr.b_state;
2625 3122 arc_buf_contents_t type = arc_buf_type(hdr);
2626 3123 uint64_t size = arc_hdr_size(hdr);
2627 3124
2628 3125 /* protected by hash lock, if in the hash table */
2629 3126 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
2630 3127 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2631 3128 ASSERT(state != arc_anon && state != arc_l2c_only);
2632 3129
2633 3130 (void) refcount_remove_many(&state->arcs_esize[type],
2634 3131 size, hdr);
2635 3132 }
2636 3133 (void) refcount_remove_many(&state->arcs_size, size, hdr);
2637 - if (type == ARC_BUFC_METADATA) {
3134 + if (type == ARC_BUFC_DDT) {
3135 + arc_space_return(size, ARC_SPACE_DDT);
3136 + } else if (type == ARC_BUFC_METADATA) {
2638 3137 arc_space_return(size, ARC_SPACE_META);
2639 3138 } else {
2640 3139 ASSERT(type == ARC_BUFC_DATA);
2641 3140 arc_space_return(size, ARC_SPACE_DATA);
2642 3141 }
2643 3142
2644 3143 l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
2645 3144 }
2646 3145
2647 3146 /*
2648 3147 * Share the arc_buf_t's data with the hdr. Whenever we are sharing the
2649 3148 * data buffer, we transfer the refcount ownership to the hdr and update
2650 3149 * the appropriate kstats.
2651 3150 */
2652 3151 static void
2653 3152 arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2654 3153 {
2655 3154 arc_state_t *state = hdr->b_l1hdr.b_state;
2656 3155
2657 3156 ASSERT(arc_can_share(hdr, buf));
2658 3157 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2659 3158 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2660 3159
2661 3160 /*
2662 3161 * Start sharing the data buffer. We transfer the
2663 3162 * refcount ownership to the hdr since it always owns
2664 3163 * the refcount whenever an arc_buf_t is shared.
2665 3164 */
2666 3165 refcount_transfer_ownership(&state->arcs_size, buf, hdr);
2667 3166 hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
2668 3167 abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
2669 - HDR_ISTYPE_METADATA(hdr));
3168 + !HDR_ISTYPE_DATA(hdr));
2670 3169 arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
2671 3170 buf->b_flags |= ARC_BUF_FLAG_SHARED;
2672 3171
2673 3172 /*
2674 3173 * Since we've transferred ownership to the hdr we need
2675 3174 * to increment its compressed and uncompressed kstats and
2676 3175 * decrement the overhead size.
2677 3176 */
2678 3177 ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2679 3178 ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
2680 3179 ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
2681 3180 }
2682 3181
2683 3182 static void
2684 3183 arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2685 3184 {
2686 3185 arc_state_t *state = hdr->b_l1hdr.b_state;
2687 3186
2688 3187 ASSERT(arc_buf_is_shared(buf));
2689 3188 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2690 3189 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2691 3190
2692 3191 /*
2693 3192 * We are no longer sharing this buffer so we need
2694 3193 * to transfer its ownership to the rightful owner.
2695 3194 */
2696 3195 refcount_transfer_ownership(&state->arcs_size, hdr, buf);
2697 3196 arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
2698 3197 abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);
2699 3198 abd_put(hdr->b_l1hdr.b_pabd);
2700 3199 hdr->b_l1hdr.b_pabd = NULL;
2701 3200 buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
2702 3201
2703 3202 /*
2704 3203 * Since the buffer is no longer shared between
2705 3204 * the arc buf and the hdr, count it as overhead.
2706 3205 */
2707 3206 ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2708 3207 ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2709 3208 ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
2710 3209 }
2711 3210
2712 3211 /*
2713 3212 * Remove an arc_buf_t from the hdr's buf list and return the last
2714 3213 * arc_buf_t on the list. If no buffers remain on the list then return
2715 3214 * NULL.
2716 3215 */
2717 3216 static arc_buf_t *
2718 3217 arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
2719 3218 {
2720 3219 ASSERT(HDR_HAS_L1HDR(hdr));
2721 3220 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2722 3221
2723 3222 arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
2724 3223 arc_buf_t *lastbuf = NULL;
2725 3224
2726 3225 /*
2727 3226 * Remove the buf from the hdr list and locate the last
2728 3227 * remaining buffer on the list.
2729 3228 */
2730 3229 while (*bufp != NULL) {
2731 3230 if (*bufp == buf)
2732 3231 *bufp = buf->b_next;
2733 3232
2734 3233 /*
2735 3234 * If we've removed a buffer in the middle of
2736 3235 * the list then update the lastbuf and update
2737 3236 * bufp.
2738 3237 */
2739 3238 if (*bufp != NULL) {
2740 3239 lastbuf = *bufp;
2741 3240 bufp = &(*bufp)->b_next;
2742 3241 }
2743 3242 }
2744 3243 buf->b_next = NULL;
2745 3244 ASSERT3P(lastbuf, !=, buf);
2746 3245 IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL);
2747 3246 IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL);
2748 3247 IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));
2749 3248
2750 3249 return (lastbuf);
2751 3250 }
2752 3251
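The function that follows is singly-linked-list surgery: unlink the target buf and, while walking, remember the last remaining node. A standalone model of the same walk over a minimal node type (names invented here):

#include <assert.h>
#include <stddef.h>

typedef struct model_buf {
	struct model_buf *mb_next;
} model_buf_t;

/*
 * Remove 'buf' from the list headed by *headp and return the last
 * remaining node, or NULL if the list becomes empty.
 */
static model_buf_t *
model_buf_remove(model_buf_t **headp, model_buf_t *buf)
{
	model_buf_t **bufp = headp;
	model_buf_t *lastbuf = NULL;

	while (*bufp != NULL) {
		if (*bufp == buf)
			*bufp = buf->mb_next;
		if (*bufp != NULL) {
			lastbuf = *bufp;
			bufp = &(*bufp)->mb_next;
		}
	}
	buf->mb_next = NULL;
	return (lastbuf);
}

int
main(void)
{
	model_buf_t c = { NULL }, b = { &c }, a = { &b };
	model_buf_t *head = &a;

	/* Remove the middle element; the last remaining node is 'c'. */
	assert(model_buf_remove(&head, &b) == &c);
	assert(head == &a && a.mb_next == &c && c.mb_next == NULL);

	/* Remove the head; 'c' is still the last remaining node. */
	assert(model_buf_remove(&head, &a) == &c);
	assert(head == &c);
	return (0);
}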
2753 3252 /*
2754 3253 * Free up buf->b_data and pull the arc_buf_t off of the the arc_buf_hdr_t's
2755 3254 * list and free it.
2756 3255 */
2757 3256 static void
2758 3257 arc_buf_destroy_impl(arc_buf_t *buf)
2759 3258 {
2760 3259 arc_buf_hdr_t *hdr = buf->b_hdr;
2761 3260
2762 3261 /*
2763 3262 * Free up the data associated with the buf but only if we're not
2764 3263 * sharing this with the hdr. If we are sharing it with the hdr, the
2765 3264 * hdr is responsible for doing the free.
2766 3265 */
2767 3266 if (buf->b_data != NULL) {
2768 3267 /*
2769 3268 * We're about to change the hdr's b_flags. We must either
2770 3269 * hold the hash_lock or be undiscoverable.
2771 3270 */
2772 3271 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
2773 3272
2774 3273 arc_cksum_verify(buf);
2775 3274 arc_buf_unwatch(buf);
2776 3275
2777 3276 if (arc_buf_is_shared(buf)) {
2778 3277 arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
2779 3278 } else {
2780 3279 uint64_t size = arc_buf_size(buf);
2781 3280 arc_free_data_buf(hdr, buf->b_data, size, buf);
2782 3281 ARCSTAT_INCR(arcstat_overhead_size, -size);
2783 3282 }
2784 3283 buf->b_data = NULL;
2785 3284
2786 3285 ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
2787 3286 hdr->b_l1hdr.b_bufcnt -= 1;
2788 3287 }
2789 3288
2790 3289 arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
2791 3290
2792 3291 if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
2793 3292 /*
2794 3293 * If the current arc_buf_t is sharing its data buffer with the
2795 3294 * hdr, then reassign the hdr's b_pabd to share it with the new
2796 3295 * buffer at the end of the list. The shared buffer is always
2797 3296 * the last one on the hdr's buffer list.
2798 3297 *
2799 3298 * There is an equivalent case for compressed bufs, but since
2800 3299 * they aren't guaranteed to be the last buf in the list and
2801 3300 * that is an exceedingly rare case, we just allow that space be
2802 3301 * wasted temporarily.
2803 3302 */
2804 3303 if (lastbuf != NULL) {
2805 3304 /* Only one buf can be shared at once */
2806 3305 VERIFY(!arc_buf_is_shared(lastbuf));
2807 3306 /* hdr is uncompressed so can't have compressed buf */
2808 3307 VERIFY(!ARC_BUF_COMPRESSED(lastbuf));
2809 3308
2810 3309 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2811 3310 arc_hdr_free_pabd(hdr);
2812 3311
2813 3312 /*
2814 3313 * We must setup a new shared block between the
2815 3314 * last buffer and the hdr. The data would have
2816 3315 * been allocated by the arc buf so we need to transfer
2817 3316 * ownership to the hdr since it's now being shared.
2818 3317 */
2819 3318 arc_share_buf(hdr, lastbuf);
2820 3319 }
2821 3320 } else if (HDR_SHARED_DATA(hdr)) {
2822 3321 /*
2823 3322 * Uncompressed shared buffers are always at the end
2824 3323 * of the list. Compressed buffers don't have the
2825 3324 * same requirements. This makes it hard to
2826 3325 * simply assert that the lastbuf is shared so
2827 3326 * we rely on the hdr's compression flags to determine
2828 3327 * if we have a compressed, shared buffer.
2829 3328 */
2830 3329 ASSERT3P(lastbuf, !=, NULL);
2831 3330 ASSERT(arc_buf_is_shared(lastbuf) ||
2832 3331 HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
2833 3332 }
2834 3333
2835 3334 /*
2836 3335 * Free the checksum if we're removing the last uncompressed buf from
2837 3336 * this hdr.
2838 3337 */
2839 3338 if (!arc_hdr_has_uncompressed_buf(hdr)) {
2840 3339 arc_cksum_free(hdr);
2841 3340 }
2842 3341
2843 3342 /* clean up the buf */
2844 3343 buf->b_hdr = NULL;
2845 3344 kmem_cache_free(buf_cache, buf);
2846 3345 }
2847 3346
2848 3347 static void
2849 3348 arc_hdr_alloc_pabd(arc_buf_hdr_t *hdr)
2850 3349 {
2851 3350 ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
2852 3351 ASSERT(HDR_HAS_L1HDR(hdr));
2853 3352 ASSERT(!HDR_SHARED_DATA(hdr));
2854 3353
2855 3354 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2856 3355 hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr);
2857 3356 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2858 3357 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2859 3358
2860 3359 ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
2861 3360 ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
3361 + arc_update_hit_stat(hdr, B_TRUE);
2862 3362 }
2863 3363
2864 3364 static void
2865 3365 arc_hdr_free_pabd(arc_buf_hdr_t *hdr)
2866 3366 {
2867 3367 ASSERT(HDR_HAS_L1HDR(hdr));
2868 3368 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
2869 3369
2870 3370 /*
2871 3371 * If the hdr is currently being written to the l2arc then
2872 3372 * we defer freeing the data by adding it to the l2arc_free_on_write
2873 3373 * list. The l2arc will free the data once it's finished
2874 3374 * writing it to the l2arc device.
2875 3375 */
2876 3376 if (HDR_L2_WRITING(hdr)) {
2877 3377 arc_hdr_free_on_write(hdr);
2878 3378 ARCSTAT_BUMP(arcstat_l2_free_on_write);
2879 3379 } else {
2880 3380 arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
2881 3381 arc_hdr_size(hdr), hdr);
2882 3382 }
2883 3383 hdr->b_l1hdr.b_pabd = NULL;
2884 3384 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
2885 3385
2886 3386 ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
2887 3387 ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
2888 3388 }
2889 3389
2890 3390 static arc_buf_hdr_t *
2891 3391 arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
2892 3392 enum zio_compress compression_type, arc_buf_contents_t type)
2893 3393 {
2894 3394 arc_buf_hdr_t *hdr;
2895 3395
2896 - VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
3396 + ASSERT3U(lsize, >, 0);
2897 3397
3398 + if (type == ARC_BUFC_DDT && !zfs_arc_segregate_ddt)
3399 + type = ARC_BUFC_METADATA;
3400 + VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA ||
3401 + type == ARC_BUFC_DDT);
3402 +
2898 3403 hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
2899 3404 ASSERT(HDR_EMPTY(hdr));
2900 - ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
3405 + ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
2901 3406 ASSERT3P(hdr->b_l1hdr.b_thawed, ==, NULL);
2902 3407 HDR_SET_PSIZE(hdr, psize);
2903 3408 HDR_SET_LSIZE(hdr, lsize);
2904 3409 hdr->b_spa = spa;
2905 3410 hdr->b_type = type;
2906 3411 hdr->b_flags = 0;
2907 3412 arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
2908 3413 arc_hdr_set_compress(hdr, compression_type);
2909 3414
2910 3415 hdr->b_l1hdr.b_state = arc_anon;
2911 3416 hdr->b_l1hdr.b_arc_access = 0;
2912 3417 hdr->b_l1hdr.b_bufcnt = 0;
2913 3418 hdr->b_l1hdr.b_buf = NULL;
2914 3419
2915 3420 /*
2916 3421 * Allocate the hdr's buffer. This will contain either
2917 3422 * the compressed or uncompressed data depending on the block
2918 3423 * it references and compressed arc enablement.
2919 3424 */
2920 3425 arc_hdr_alloc_pabd(hdr);
2921 3426 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
2922 3427
2923 3428 return (hdr);
2924 3429 }
2925 3430
2926 3431 /*
2927 3432 * Transition between the two allocation states for the arc_buf_hdr struct.
2928 3433 * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without
2929 3434 * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller
2930 3435 * version is used when a cache buffer is only in the L2ARC in order to reduce
2931 3436 * memory usage.
2932 3437 */
2933 3438 static arc_buf_hdr_t *
2934 3439 arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
2935 3440 {
2936 3441 ASSERT(HDR_HAS_L2HDR(hdr));
2937 3442
2938 3443 arc_buf_hdr_t *nhdr;
2939 3444 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
2940 3445
2941 3446 ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
2942 3447 (old == hdr_l2only_cache && new == hdr_full_cache));
2943 3448
2944 3449 nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);
2945 3450
2946 3451 ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
2947 3452 buf_hash_remove(hdr);
2948 3453
2949 3454 bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);
2950 3455
2951 3456 if (new == hdr_full_cache) {
2952 3457 arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
2953 3458 /*
2954 3459 * arc_access and arc_change_state need to be aware that a
2955 3460 * header has just come out of L2ARC, so we set its state to
2956 3461 * l2c_only even though it's about to change.
2957 3462 */
2958 3463 nhdr->b_l1hdr.b_state = arc_l2c_only;
2959 3464
2960 3465 /* Verify previous threads set to NULL before freeing */
2961 3466 ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
2962 3467 } else {
2963 3468 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
2964 3469 ASSERT0(hdr->b_l1hdr.b_bufcnt);
2965 - ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
3470 + ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
2966 3471
2967 3472  * If we've reached here, we must have been called from
2968 3473 * If we've reached here, We must have been called from
2969 3474 * arc_evict_hdr(), as such we should have already been
2970 3475 * removed from any ghost list we were previously on
2971 3476 * (which protects us from racing with arc_evict_state),
2972 3477 * thus no locking is needed during this check.
2973 3478 */
2974 3479 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
2975 3480
2976 3481 /*
2977 3482 * A buffer must not be moved into the arc_l2c_only
2978 3483 * state if it's not finished being written out to the
2979 3484 * l2arc device. Otherwise, the b_l1hdr.b_pabd field
2980 3485 * might try to be accessed, even though it was removed.
2981 3486 */
2982 3487 VERIFY(!HDR_L2_WRITING(hdr));
2983 3488 VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
2984 3489
2985 3490 #ifdef ZFS_DEBUG
2986 3491 if (hdr->b_l1hdr.b_thawed != NULL) {
2987 3492 kmem_free(hdr->b_l1hdr.b_thawed, 1);
2988 3493 hdr->b_l1hdr.b_thawed = NULL;
2989 3494 }
2990 3495 #endif
2991 3496
2992 3497 arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
2993 3498 }
2994 3499 /*
2995 3500 * The header has been reallocated so we need to re-insert it into any
2996 3501 * lists it was on.
2997 3502 */
2998 3503 (void) buf_hash_insert(nhdr, NULL);
2999 3504
3000 3505 ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));
3001 3506
3002 3507 mutex_enter(&dev->l2ad_mtx);
3003 3508
3004 3509 /*
3005 3510 * We must place the realloc'ed header back into the list at
3006 3511 * the same spot. Otherwise, if it's placed earlier in the list,
3007 3512 * l2arc_write_buffers() could find it during the function's
3008 3513 * write phase, and try to write it out to the l2arc.
3009 3514 */
3010 3515 list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
3011 3516 list_remove(&dev->l2ad_buflist, hdr);
3012 3517
3013 3518 mutex_exit(&dev->l2ad_mtx);
3014 3519
3015 3520 /*
3016 3521 * Since we're using the pointer address as the tag when
3017 3522 * incrementing and decrementing the l2ad_alloc refcount, we
3018 3523 * must remove the old pointer (that we're about to destroy) and
3019 3524 * add the new pointer to the refcount. Otherwise we'd remove
3020 3525 * the wrong pointer address when calling arc_hdr_destroy() later.
3021 3526 */
3022 3527
3023 3528 (void) refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
3024 3529 (void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(nhdr), nhdr);
3025 3530
3026 3531 buf_discard_identity(hdr);
3027 3532 kmem_cache_free(old, hdr);
3028 3533
3029 3534 return (nhdr);
3030 3535 }
3031 3536
3032 3537 /*
3033 3538 * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
3034 3539 * The buf is returned thawed since we expect the consumer to modify it.
3035 3540 */
3036 3541 arc_buf_t *
3037 3542 arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size)
3038 3543 {
3039 3544 arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
3040 3545 ZIO_COMPRESS_OFF, type);
3041 3546 ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3042 3547
3043 3548 arc_buf_t *buf = NULL;
3044 3549 VERIFY0(arc_buf_alloc_impl(hdr, tag, B_FALSE, B_FALSE, &buf));
3045 3550 arc_buf_thaw(buf);
3046 3551
3047 3552 return (buf);
3048 3553 }
3049 3554
3050 3555 /*
3051 3556 * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
3052 3557 * for bufs containing metadata.
3053 3558 */
3054 3559 arc_buf_t *
3055 3560 arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize,
3056 3561 enum zio_compress compression_type)
3057 3562 {
3058 3563 ASSERT3U(lsize, >, 0);
3059 3564 ASSERT3U(lsize, >=, psize);
3060 3565 ASSERT(compression_type > ZIO_COMPRESS_OFF);
3061 3566 ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
3062 3567
3063 3568 arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
3064 3569 compression_type, ARC_BUFC_DATA);
3065 3570 ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
3066 3571
3067 3572 arc_buf_t *buf = NULL;
3068 3573 VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
3069 3574 arc_buf_thaw(buf);
3070 - ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
3575 + ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
3071 3576
3072 3577 if (!arc_buf_is_shared(buf)) {
3073 3578 /*
3074 3579 * To ensure that the hdr has the correct data in it if we call
3075 3580 * arc_decompress() on this buf before it's been written to
3076 3581 * disk, it's easiest if we just set up sharing between the
3077 3582 * buf and the hdr.
3078 3583 */
3079 3584 ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
3080 3585 arc_hdr_free_pabd(hdr);
3081 3586 arc_share_buf(hdr, buf);
3082 3587 }
3083 3588
3084 3589 return (buf);
3085 3590 }
3086 3591
3087 3592 static void
3088 3593 arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
3089 3594 {
3090 3595 l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
3091 3596 l2arc_dev_t *dev = l2hdr->b_dev;
3092 3597 uint64_t psize = arc_hdr_size(hdr);
3093 3598
3094 3599 ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
3095 3600 ASSERT(HDR_HAS_L2HDR(hdr));
3096 3601
3097 3602 list_remove(&dev->l2ad_buflist, hdr);
3098 3603
3099 3604 ARCSTAT_INCR(arcstat_l2_psize, -psize);
3100 3605 ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
3101 3606
3102 - vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3607 + /*
3608 +	 * l2ad_vdev can be NULL here if the device was evicted asynchronously.
3609 + */
3610 + if (dev->l2ad_vdev != NULL)
3611 + vdev_space_update(dev->l2ad_vdev, -psize, 0, 0);
3103 3612
3104 3613 (void) refcount_remove_many(&dev->l2ad_alloc, psize, hdr);
3105 3614 arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
3106 3615 }
3107 3616
3108 3617 static void
3109 3618 arc_hdr_destroy(arc_buf_hdr_t *hdr)
3110 3619 {
3111 3620 if (HDR_HAS_L1HDR(hdr)) {
3112 3621 ASSERT(hdr->b_l1hdr.b_buf == NULL ||
3113 3622 hdr->b_l1hdr.b_bufcnt > 0);
3114 3623 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
3115 3624 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
3116 3625 }
3117 3626 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3118 3627 ASSERT(!HDR_IN_HASH_TABLE(hdr));
3119 3628
3120 - if (!HDR_EMPTY(hdr))
3121 - buf_discard_identity(hdr);
3122 -
3123 3629 if (HDR_HAS_L2HDR(hdr)) {
3124 3630 l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
3125 3631 boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
3126 3632
3633 + /* To avoid racing with L2ARC the header needs to be locked */
3634 + ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
3635 +
3127 3636 if (!buflist_held)
3128 3637 mutex_enter(&dev->l2ad_mtx);
3129 3638
3130 3639 /*
3640 +		 * The L2ARC buflist is held, so we can safely discard the
3641 +		 * identity here; otherwise L2ARC could lock the wrong mutex
3642 +		 * for the hdr, which would cause a panic. That is possible
3643 +		 * because the mutex is selected according to the identity.
3644 + */
3645 + if (!HDR_EMPTY(hdr))
3646 + buf_discard_identity(hdr);
3647 +
3648 + /*
3131 3649 * Even though we checked this conditional above, we
3132 3650 * need to check this again now that we have the
3133 3651 * l2ad_mtx. This is because we could be racing with
3134 3652 * another thread calling l2arc_evict() which might have
3135 3653 * destroyed this header's L2 portion as we were waiting
3136 3654 * to acquire the l2ad_mtx. If that happens, we don't
3137 3655 * want to re-destroy the header's L2 portion.
3138 3656 */
3139 3657 if (HDR_HAS_L2HDR(hdr))
3140 3658 arc_hdr_l2hdr_destroy(hdr);
3141 3659
3142 3660 if (!buflist_held)
3143 3661 mutex_exit(&dev->l2ad_mtx);
3144 3662 }
3145 3663
3664 + if (!HDR_EMPTY(hdr))
3665 + buf_discard_identity(hdr);
3666 +
3146 3667 if (HDR_HAS_L1HDR(hdr)) {
3147 3668 arc_cksum_free(hdr);
3148 3669
3149 3670 while (hdr->b_l1hdr.b_buf != NULL)
3150 3671 arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
3151 3672
3152 3673 #ifdef ZFS_DEBUG
3153 3674 if (hdr->b_l1hdr.b_thawed != NULL) {
3154 3675 kmem_free(hdr->b_l1hdr.b_thawed, 1);
3155 3676 hdr->b_l1hdr.b_thawed = NULL;
3156 3677 }
3157 3678 #endif
3158 3679
3159 3680 if (hdr->b_l1hdr.b_pabd != NULL) {
3160 3681 arc_hdr_free_pabd(hdr);
3161 3682 }
3162 3683 }
3163 3684
3164 3685 ASSERT3P(hdr->b_hash_next, ==, NULL);
3165 3686 if (HDR_HAS_L1HDR(hdr)) {
3166 3687 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
3167 3688 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
3168 3689 kmem_cache_free(hdr_full_cache, hdr);
3169 3690 } else {
3170 3691 kmem_cache_free(hdr_l2only_cache, hdr);
3171 3692 }
3172 3693 }
3173 3694
3174 3695 void
3175 3696 arc_buf_destroy(arc_buf_t *buf, void* tag)
3176 3697 {
3177 3698 arc_buf_hdr_t *hdr = buf->b_hdr;
3178 3699 kmutex_t *hash_lock = HDR_LOCK(hdr);
3179 3700
3180 3701 if (hdr->b_l1hdr.b_state == arc_anon) {
3181 3702 ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
3182 3703 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3183 3704 VERIFY0(remove_reference(hdr, NULL, tag));
3184 3705 arc_hdr_destroy(hdr);
3185 3706 return;
3186 3707 }
3187 3708
3188 3709 mutex_enter(hash_lock);
3189 3710 ASSERT3P(hdr, ==, buf->b_hdr);
3190 3711 ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
3191 3712 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
3192 3713 ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
3193 3714 ASSERT3P(buf->b_data, !=, NULL);
3194 3715
3195 3716 (void) remove_reference(hdr, hash_lock, tag);
3196 3717 arc_buf_destroy_impl(buf);
3197 3718 mutex_exit(hash_lock);
3198 3719 }
3199 3720
3200 3721 /*
3201 3722 * Evict the arc_buf_hdr that is provided as a parameter. The resultant
3202 3723  * state of the header is dependent on its state prior to entering this
3203 3724 * function. The following transitions are possible:
3204 3725 *
3205 3726 * - arc_mru -> arc_mru_ghost
3206 3727 * - arc_mfu -> arc_mfu_ghost
3207 3728 * - arc_mru_ghost -> arc_l2c_only
3208 3729 * - arc_mru_ghost -> deleted
3209 3730 * - arc_mfu_ghost -> arc_l2c_only
3210 3731 * - arc_mfu_ghost -> deleted
3211 3732 */
3212 3733 static int64_t
3213 3734 arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
3214 3735 {
3215 3736 arc_state_t *evicted_state, *state;
3216 3737 int64_t bytes_evicted = 0;
3217 3738
3218 3739 ASSERT(MUTEX_HELD(hash_lock));
3219 3740 ASSERT(HDR_HAS_L1HDR(hdr));
3220 3741
3742 + arc_wait_for_krrp(hdr);
3743 +
3221 3744 state = hdr->b_l1hdr.b_state;
3222 3745 if (GHOST_STATE(state)) {
3223 3746 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
3224 3747 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
3225 3748
3226 3749 /*
3227 3750 * l2arc_write_buffers() relies on a header's L1 portion
 3228 3751 		 * (i.e. its b_pabd field) during its write phase.
 3229 3752 		 * Thus, we cannot push a header onto the arc_l2c_only
 3230 3753 		 * state (removing its L1 piece) until the header is
3231 3754 * done being written to the l2arc.
3232 3755 */
3233 3756 if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
3234 3757 ARCSTAT_BUMP(arcstat_evict_l2_skip);
3235 3758 return (bytes_evicted);
3236 3759 }
3237 3760
3238 3761 ARCSTAT_BUMP(arcstat_deleted);
3239 3762 bytes_evicted += HDR_GET_LSIZE(hdr);
3240 3763
3241 3764 DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);
3242 3765
3243 3766 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
3244 3767 if (HDR_HAS_L2HDR(hdr)) {
3245 3768 /*
3246 3769 * This buffer is cached on the 2nd Level ARC;
3247 3770 * don't destroy the header.
3248 3771 */
3249 3772 arc_change_state(arc_l2c_only, hdr, hash_lock);
3250 3773 /*
3251 3774 * dropping from L1+L2 cached to L2-only,
3252 3775 * realloc to remove the L1 header.
3253 3776 */
3254 3777 hdr = arc_hdr_realloc(hdr, hdr_full_cache,
3255 3778 hdr_l2only_cache);
3256 3779 } else {
3257 3780 arc_change_state(arc_anon, hdr, hash_lock);
3258 3781 arc_hdr_destroy(hdr);
3259 3782 }
3260 3783 return (bytes_evicted);
3261 3784 }
3262 3785
3263 3786 ASSERT(state == arc_mru || state == arc_mfu);
3264 3787 evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
3265 3788
3266 3789 /* prefetch buffers have a minimum lifespan */
3267 3790 if (HDR_IO_IN_PROGRESS(hdr) ||
3268 3791 ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
3269 3792 ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
3270 3793 arc_min_prefetch_lifespan)) {
3271 3794 ARCSTAT_BUMP(arcstat_evict_skip);
3272 3795 return (bytes_evicted);
3273 3796 }
3274 3797
3275 3798 ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt));
3276 3799 while (hdr->b_l1hdr.b_buf) {
3277 3800 arc_buf_t *buf = hdr->b_l1hdr.b_buf;
3278 3801 if (!mutex_tryenter(&buf->b_evict_lock)) {
3279 3802 ARCSTAT_BUMP(arcstat_mutex_miss);
3280 3803 break;
3281 3804 }
3282 3805 if (buf->b_data != NULL)
3283 3806 bytes_evicted += HDR_GET_LSIZE(hdr);
3284 3807 mutex_exit(&buf->b_evict_lock);
3285 3808 arc_buf_destroy_impl(buf);
3286 3809 }
3287 3810
3288 3811 if (HDR_HAS_L2HDR(hdr)) {
3289 3812 ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));
3290 3813 } else {
3291 3814 if (l2arc_write_eligible(hdr->b_spa, hdr)) {
3292 3815 ARCSTAT_INCR(arcstat_evict_l2_eligible,
3293 3816 HDR_GET_LSIZE(hdr));
3294 3817 } else {
3295 3818 ARCSTAT_INCR(arcstat_evict_l2_ineligible,
3296 3819 HDR_GET_LSIZE(hdr));
3297 3820 }
3298 3821 }
3299 3822
3300 3823 if (hdr->b_l1hdr.b_bufcnt == 0) {
3301 3824 arc_cksum_free(hdr);
3302 3825
3303 3826 bytes_evicted += arc_hdr_size(hdr);
3304 3827
3305 3828 /*
3306 3829 * If this hdr is being evicted and has a compressed
3307 3830 * buffer then we discard it here before we change states.
3308 3831 * This ensures that the accounting is updated correctly
3309 3832 * in arc_free_data_impl().
3310 3833 */
3311 3834 arc_hdr_free_pabd(hdr);
3312 3835
3313 3836 arc_change_state(evicted_state, hdr, hash_lock);
3314 3837 ASSERT(HDR_IN_HASH_TABLE(hdr));
3315 3838 arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
3316 3839 DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);
3317 3840 }
3318 3841
3319 3842 return (bytes_evicted);
3320 3843 }
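
The state transitions listed in the comment above arc_evict_hdr() reduce to a small decision table: mru/mfu headers drop to their ghost state, and ghost headers either drop to L2-only or are deleted depending on whether an L2 portion remains. A minimal user-space sketch of just that table (hypothetical names, not part of arc.c; the real function must also honor the L2-write and prefetch-lifespan checks shown above):

#include <stdio.h>

typedef enum { S_MRU, S_MFU, S_MRU_GHOST, S_MFU_GHOST } state_t;

static const char *
evict_outcome(state_t s, int has_l2hdr)
{
	switch (s) {
	case S_MRU:
		return ("arc_mru_ghost");
	case S_MFU:
		return ("arc_mfu_ghost");
	case S_MRU_GHOST:
	case S_MFU_GHOST:
		/* Ghost headers either drop to L2-only or are deleted. */
		return (has_l2hdr ? "arc_l2c_only" : "deleted");
	}
	return ("?");
}

int
main(void)
{
	printf("%s\n", evict_outcome(S_MRU_GHOST, 1));	/* arc_l2c_only */
	printf("%s\n", evict_outcome(S_MFU, 0));	/* arc_mfu_ghost */
	return (0);
}
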
3321 3844
3322 3845 static uint64_t
3323 3846 arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
3324 3847 uint64_t spa, int64_t bytes)
3325 3848 {
3326 3849 multilist_sublist_t *mls;
3327 3850 uint64_t bytes_evicted = 0;
3328 3851 arc_buf_hdr_t *hdr;
3329 3852 kmutex_t *hash_lock;
3330 3853 int evict_count = 0;
3331 3854
3332 3855 ASSERT3P(marker, !=, NULL);
3333 3856 IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);
3334 3857
3335 3858 mls = multilist_sublist_lock(ml, idx);
3336 3859
3337 3860 for (hdr = multilist_sublist_prev(mls, marker); hdr != NULL;
3338 3861 hdr = multilist_sublist_prev(mls, marker)) {
3339 3862 if ((bytes != ARC_EVICT_ALL && bytes_evicted >= bytes) ||
3340 3863 (evict_count >= zfs_arc_evict_batch_limit))
3341 3864 break;
3342 3865
3343 3866 /*
3344 3867 * To keep our iteration location, move the marker
3345 3868 * forward. Since we're not holding hdr's hash lock, we
3346 3869 * must be very careful and not remove 'hdr' from the
3347 3870 * sublist. Otherwise, other consumers might mistake the
3348 3871 * 'hdr' as not being on a sublist when they call the
3349 3872 * multilist_link_active() function (they all rely on
3350 3873 * the hash lock protecting concurrent insertions and
3351 3874 * removals). multilist_sublist_move_forward() was
3352 3875 * specifically implemented to ensure this is the case
3353 3876 * (only 'marker' will be removed and re-inserted).
3354 3877 */
3355 3878 multilist_sublist_move_forward(mls, marker);
3356 3879
3357 3880 /*
3358 3881 * The only case where the b_spa field should ever be
3359 3882 * zero, is the marker headers inserted by
3360 3883 * arc_evict_state(). It's possible for multiple threads
3361 3884 * to be calling arc_evict_state() concurrently (e.g.
3362 3885 * dsl_pool_close() and zio_inject_fault()), so we must
3363 3886 * skip any markers we see from these other threads.
3364 3887 */
3365 3888 if (hdr->b_spa == 0)
3366 3889 continue;
3367 3890
3368 3891 /* we're only interested in evicting buffers of a certain spa */
3369 3892 if (spa != 0 && hdr->b_spa != spa) {
3370 3893 ARCSTAT_BUMP(arcstat_evict_skip);
3371 3894 continue;
3372 3895 }
3373 3896
3374 3897 hash_lock = HDR_LOCK(hdr);
3375 3898
3376 3899 /*
3377 3900 * We aren't calling this function from any code path
3378 3901 * that would already be holding a hash lock, so we're
3379 3902 * asserting on this assumption to be defensive in case
3380 3903 * this ever changes. Without this check, it would be
3381 3904 * possible to incorrectly increment arcstat_mutex_miss
3382 3905 * below (e.g. if the code changed such that we called
3383 3906 * this function with a hash lock held).
3384 3907 */
3385 3908 ASSERT(!MUTEX_HELD(hash_lock));
3386 3909
3387 3910 if (mutex_tryenter(hash_lock)) {
3388 3911 uint64_t evicted = arc_evict_hdr(hdr, hash_lock);
3389 3912 mutex_exit(hash_lock);
3390 3913
3391 3914 bytes_evicted += evicted;
3392 3915
3393 3916 /*
3394 3917 * If evicted is zero, arc_evict_hdr() must have
3395 3918 * decided to skip this header, don't increment
3396 3919 * evict_count in this case.
3397 3920 */
3398 3921 if (evicted != 0)
3399 3922 evict_count++;
3400 3923
3401 3924 /*
3402 3925 * If arc_size isn't overflowing, signal any
3403 3926 * threads that might happen to be waiting.
3404 3927 *
3405 3928 * For each header evicted, we wake up a single
3406 3929 * thread. If we used cv_broadcast, we could
3407 3930 * wake up "too many" threads causing arc_size
3408 3931 * to significantly overflow arc_c; since
3409 3932 * arc_get_data_impl() doesn't check for overflow
3410 3933 * when it's woken up (it doesn't because it's
3411 3934 * possible for the ARC to be overflowing while
3412 3935 * full of un-evictable buffers, and the
3413 3936 * function should proceed in this case).
3414 3937 *
3415 3938 * If threads are left sleeping, due to not
3416 3939 * using cv_broadcast, they will be woken up
3417 3940 * just before arc_reclaim_thread() sleeps.
3418 3941 */
3419 3942 mutex_enter(&arc_reclaim_lock);
3420 3943 if (!arc_is_overflowing())
3421 3944 cv_signal(&arc_reclaim_waiters_cv);
3422 3945 mutex_exit(&arc_reclaim_lock);
3423 3946 } else {
3424 3947 ARCSTAT_BUMP(arcstat_mutex_miss);
3425 3948 }
3426 3949 }
3427 3950
3428 3951 multilist_sublist_unlock(mls);
3429 3952
3430 3953 return (bytes_evicted);
3431 3954 }
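
The marker technique described in the comments above can be illustrated with a plain doubly-linked list: a marker node (tagged here with val == -1, analogous to b_spa == 0) stays on the list and is moved past each element as it is visited, so the walk can be dropped and resumed without pinning the real elements. A user-space sketch with hypothetical names, not the multilist implementation itself:

#include <stdio.h>

typedef struct node {
	struct node *prev, *next;
	int val;			/* -1 marks a marker node */
} node_t;

static void
insert_before(node_t *pos, node_t *n)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

static void
remove_node(node_t *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int
main(void)
{
	node_t head = { &head, &head, 0 };	/* circular sentinel */
	node_t elems[4], marker = { NULL, NULL, -1 };

	for (int i = 0; i < 4; i++) {
		elems[i].val = i + 1;
		insert_before(&head, &elems[i]);	/* append at tail */
	}
	insert_before(&head, &marker);			/* marker starts at the tail */

	/* Walk tail-to-head from the marker, moving the marker past each node. */
	for (node_t *n = marker.prev; n != &head; n = marker.prev) {
		printf("visit %d\n", n->val);
		remove_node(&marker);
		insert_before(n, &marker);	/* "move forward" past n */
	}
	return (0);
}

In the kernel the markers are real arc_buf_hdr_t structures allocated from hdr_full_cache and recognized by b_spa == 0, so every sublist walker can treat them like any other node and simply skip them.
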
3432 3955
3433 3956 /*
3434 3957 * Evict buffers from the given arc state, until we've removed the
3435 3958 * specified number of bytes. Move the removed buffers to the
3436 3959 * appropriate evict state.
3437 3960 *
3438 3961 * This function makes a "best effort". It skips over any buffers
3439 3962 * it can't get a hash_lock on, and so, may not catch all candidates.
3440 3963 * It may also return without evicting as much space as requested.
3441 3964 *
3442 3965 * If bytes is specified using the special value ARC_EVICT_ALL, this
3443 3966 * will evict all available (i.e. unlocked and evictable) buffers from
3444 3967 * the given arc state; which is used by arc_flush().
3445 3968 */
3446 3969 static uint64_t
3447 3970 arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
3448 3971 arc_buf_contents_t type)
3449 3972 {
3450 3973 uint64_t total_evicted = 0;
3451 3974 multilist_t *ml = state->arcs_list[type];
3452 3975 int num_sublists;
3453 3976 arc_buf_hdr_t **markers;
3454 3977
3455 3978 IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);
3456 3979
3457 3980 num_sublists = multilist_get_num_sublists(ml);
3458 3981
3459 3982 /*
3460 3983 * If we've tried to evict from each sublist, made some
3461 3984 * progress, but still have not hit the target number of bytes
3462 3985 * to evict, we want to keep trying. The markers allow us to
3463 3986 * pick up where we left off for each individual sublist, rather
3464 3987 * than starting from the tail each time.
3465 3988 */
3466 3989 markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
3467 3990 for (int i = 0; i < num_sublists; i++) {
3468 3991 markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
3469 3992
3470 3993 /*
3471 3994 * A b_spa of 0 is used to indicate that this header is
3472 3995 * a marker. This fact is used in arc_adjust_type() and
3473 3996 * arc_evict_state_impl().
3474 3997 */
3475 3998 markers[i]->b_spa = 0;
3476 3999
3477 4000 multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
3478 4001 multilist_sublist_insert_tail(mls, markers[i]);
3479 4002 multilist_sublist_unlock(mls);
3480 4003 }
3481 4004
3482 4005 /*
3483 4006 * While we haven't hit our target number of bytes to evict, or
3484 4007 * we're evicting all available buffers.
3485 4008 */
3486 4009 while (total_evicted < bytes || bytes == ARC_EVICT_ALL) {
3487 4010 /*
3488 4011 * Start eviction using a randomly selected sublist,
3489 4012 * this is to try and evenly balance eviction across all
3490 4013 * sublists. Always starting at the same sublist
3491 4014 * (e.g. index 0) would cause evictions to favor certain
3492 4015 * sublists over others.
3493 4016 */
3494 4017 int sublist_idx = multilist_get_random_index(ml);
3495 4018 uint64_t scan_evicted = 0;
3496 4019
3497 4020 for (int i = 0; i < num_sublists; i++) {
3498 4021 uint64_t bytes_remaining;
3499 4022 uint64_t bytes_evicted;
3500 4023
3501 4024 if (bytes == ARC_EVICT_ALL)
3502 4025 bytes_remaining = ARC_EVICT_ALL;
3503 4026 else if (total_evicted < bytes)
3504 4027 bytes_remaining = bytes - total_evicted;
3505 4028 else
3506 4029 break;
3507 4030
3508 4031 bytes_evicted = arc_evict_state_impl(ml, sublist_idx,
3509 4032 markers[sublist_idx], spa, bytes_remaining);
3510 4033
3511 4034 scan_evicted += bytes_evicted;
3512 4035 total_evicted += bytes_evicted;
3513 4036
3514 4037 /* we've reached the end, wrap to the beginning */
3515 4038 if (++sublist_idx >= num_sublists)
3516 4039 sublist_idx = 0;
3517 4040 }
3518 4041
3519 4042 /*
3520 4043 * If we didn't evict anything during this scan, we have
3521 4044 * no reason to believe we'll evict more during another
3522 4045 * scan, so break the loop.
3523 4046 */
3524 4047 if (scan_evicted == 0) {
3525 4048 /* This isn't possible, let's make that obvious */
3526 4049 ASSERT3S(bytes, !=, 0);
3527 4050
3528 4051 /*
3529 4052 * When bytes is ARC_EVICT_ALL, the only way to
3530 4053 * break the loop is when scan_evicted is zero.
3531 4054 * In that case, we actually have evicted enough,
3532 4055 * so we don't want to increment the kstat.
3533 4056 */
3534 4057 if (bytes != ARC_EVICT_ALL) {
3535 4058 ASSERT3S(total_evicted, <, bytes);
3536 4059 ARCSTAT_BUMP(arcstat_evict_not_enough);
3537 4060 }
3538 4061
3539 4062 break;
3540 4063 }
3541 4064 }
3542 4065
3543 4066 for (int i = 0; i < num_sublists; i++) {
3544 4067 multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
3545 4068 multilist_sublist_remove(mls, markers[i]);
3546 4069 multilist_sublist_unlock(mls);
3547 4070
3548 4071 kmem_cache_free(hdr_full_cache, markers[i]);
3549 4072 }
3550 4073 kmem_free(markers, sizeof (*markers) * num_sublists);
3551 4074
3552 4075 return (total_evicted);
3553 4076 }
3554 4077
3555 4078 /*
3556 4079 * Flush all "evictable" data of the given type from the arc state
3557 4080 * specified. This will not evict any "active" buffers (i.e. referenced).
3558 4081 *
3559 4082 * When 'retry' is set to B_FALSE, the function will make a single pass
3560 4083 * over the state and evict any buffers that it can. Since it doesn't
3561 4084 * continually retry the eviction, it might end up leaving some buffers
3562 4085 * in the ARC due to lock misses.
3563 4086 *
3564 4087 * When 'retry' is set to B_TRUE, the function will continually retry the
3565 4088 * eviction until *all* evictable buffers have been removed from the
3566 4089 * state. As a result, if concurrent insertions into the state are
3567 4090 * allowed (e.g. if the ARC isn't shutting down), this function might
3568 4091 * wind up in an infinite loop, continually trying to evict buffers.
3569 4092 */
3570 4093 static uint64_t
3571 4094 arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,
3572 4095 boolean_t retry)
3573 4096 {
3574 4097 uint64_t evicted = 0;
3575 4098
3576 4099 while (refcount_count(&state->arcs_esize[type]) != 0) {
3577 4100 evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type);
3578 4101
3579 4102 if (!retry)
3580 4103 break;
3581 4104 }
3582 4105
3583 4106 return (evicted);
3584 4107 }
3585 4108
3586 4109 /*
3587 4110 * Evict the specified number of bytes from the state specified,
3588 4111 * restricting eviction to the spa and type given. This function
3589 4112 * prevents us from trying to evict more from a state's list than
3590 4113 * is "evictable", and to skip evicting altogether when passed a
3591 4114 * negative value for "bytes". In contrast, arc_evict_state() will
3592 4115 * evict everything it can, when passed a negative value for "bytes".
3593 4116 */
3594 4117 static uint64_t
3595 4118 arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
3596 4119 arc_buf_contents_t type)
3597 4120 {
3598 4121 int64_t delta;
3599 4122
3600 4123 if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
3601 4124 delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
3602 4125 return (arc_evict_state(state, spa, delta, type));
3603 4126 }
3604 4127
3605 4128 return (0);
3606 4129 }
3607 4130
3608 4131 /*
3609 - * Evict metadata buffers from the cache, such that arc_meta_used is
3610 - * capped by the arc_meta_limit tunable.
 4132 + * Depending on the value of the adjust_ddt argument, evict either DDT
 4133 + * (B_TRUE) or metadata (B_FALSE) buffers from the cache, such that
 4134 + * arc_ddt_size is capped by the arc_ddt_limit tunable or arc_meta_used is
 4135 + * capped by the arc_meta_limit tunable, respectively.
3611 4136 */
3612 4137 static uint64_t
3613 -arc_adjust_meta(uint64_t meta_used)
4138 +arc_adjust_meta_or_ddt(boolean_t adjust_ddt)
3614 4139 {
3615 4140 uint64_t total_evicted = 0;
3616 - int64_t target;
4141 + int64_t target, over_limit;
4142 + arc_buf_contents_t type;
3617 4143
4144 + if (adjust_ddt) {
4145 + over_limit = arc_ddt_size - arc_ddt_limit;
4146 + type = ARC_BUFC_DDT;
4147 + } else {
4148 + over_limit = arc_meta_used - arc_meta_limit;
4149 + type = ARC_BUFC_METADATA;
4150 + }
4151 +
3618 4152 /*
3619 - * If we're over the meta limit, we want to evict enough
3620 - * metadata to get back under the meta limit. We don't want to
4153 + * If we're over the limit, we want to evict enough
4154 + * to get back under the limit. We don't want to
3621 4155 * evict so much that we drop the MRU below arc_p, though. If
 3622 4156 	 * we're over the limit more than we're over arc_p, we
3623 4157 * evict some from the MRU here, and some from the MFU below.
3624 4158 */
3625 - target = MIN((int64_t)(meta_used - arc_meta_limit),
4159 + target = MIN(over_limit,
3626 4160 (int64_t)(refcount_count(&arc_anon->arcs_size) +
3627 4161 refcount_count(&arc_mru->arcs_size) - arc_p));
3628 4162
3629 - total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
4163 + total_evicted += arc_adjust_impl(arc_mru, 0, target, type);
3630 4164
4165 + over_limit = adjust_ddt ? arc_ddt_size - arc_ddt_limit :
4166 + arc_meta_used - arc_meta_limit;
4167 +
3631 4168 /*
3632 4169 * Similar to the above, we want to evict enough bytes to get us
3633 4170 * below the meta limit, but not so much as to drop us below the
3634 4171 * space allotted to the MFU (which is defined as arc_c - arc_p).
3635 4172 */
3636 - target = MIN((int64_t)(meta_used - arc_meta_limit),
3637 - (int64_t)(refcount_count(&arc_mfu->arcs_size) -
3638 - (arc_c - arc_p)));
4173 + target = MIN(over_limit,
4174 + (int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p)));
3639 4175
3640 - total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
4176 + total_evicted += arc_adjust_impl(arc_mfu, 0, target, type);
3641 4177
3642 4178 return (total_evicted);
3643 4179 }
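
A worked example of the two eviction targets computed above, with hypothetical sizes (in the function the overage is recomputed between the MRU and MFU passes; this sketch reuses a single value for brevity):

#include <stdio.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

int
main(void)
{
	int64_t over_limit = 300;		/* e.g. arc_meta_used - arc_meta_limit */
	int64_t anon_plus_mru = 1000, arc_p = 800;
	int64_t mfu = 900, arc_c = 1600;

	/* MRU pass: don't drop the anon+MRU total below arc_p. */
	int64_t mru_target = MIN(over_limit, anon_plus_mru - arc_p);
	/* MFU pass: don't drop the MFU below its share, arc_c - arc_p. */
	int64_t mfu_target = MIN(over_limit, mfu - (arc_c - arc_p));

	printf("MRU target %lld, MFU target %lld\n",
	    (long long)mru_target, (long long)mfu_target);	/* 200, 100 */
	return (0);
}
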
3644 4180
3645 4181 /*
3646 4182 * Return the type of the oldest buffer in the given arc state
3647 4183 *
3648 - * This function will select a random sublist of type ARC_BUFC_DATA and
3649 - * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
 4184 + * This function will select a random sublist of each of the ARC_BUFC_DATA,
 4185 + * ARC_BUFC_METADATA, and ARC_BUFC_DDT types. The tail of each sublist
3650 4186 * is compared, and the type which contains the "older" buffer will be
3651 4187 * returned.
3652 4188 */
3653 4189 static arc_buf_contents_t
3654 4190 arc_adjust_type(arc_state_t *state)
3655 4191 {
3656 4192 multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA];
3657 4193 multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA];
4194 + multilist_t *ddt_ml = state->arcs_list[ARC_BUFC_DDT];
3658 4195 int data_idx = multilist_get_random_index(data_ml);
3659 4196 int meta_idx = multilist_get_random_index(meta_ml);
4197 + int ddt_idx = multilist_get_random_index(ddt_ml);
3660 4198 multilist_sublist_t *data_mls;
3661 4199 multilist_sublist_t *meta_mls;
3662 - arc_buf_contents_t type;
4200 + multilist_sublist_t *ddt_mls;
4201 + arc_buf_contents_t type = ARC_BUFC_DATA; /* silence compiler warning */
3663 4202 arc_buf_hdr_t *data_hdr;
3664 4203 arc_buf_hdr_t *meta_hdr;
4204 + arc_buf_hdr_t *ddt_hdr;
4205 + clock_t oldest;
3665 4206
3666 4207 /*
3667 4208 * We keep the sublist lock until we're finished, to prevent
3668 4209 * the headers from being destroyed via arc_evict_state().
3669 4210 */
3670 4211 data_mls = multilist_sublist_lock(data_ml, data_idx);
3671 4212 meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
4213 + ddt_mls = multilist_sublist_lock(ddt_ml, ddt_idx);
3672 4214
3673 4215 /*
3674 4216 * These two loops are to ensure we skip any markers that
3675 4217 * might be at the tail of the lists due to arc_evict_state().
3676 4218 */
3677 4219
3678 4220 for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
3679 4221 data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
3680 4222 if (data_hdr->b_spa != 0)
3681 4223 break;
3682 4224 }
3683 4225
3684 4226 for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
3685 4227 meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
3686 4228 if (meta_hdr->b_spa != 0)
3687 4229 break;
3688 4230 }
3689 4231
3690 - if (data_hdr == NULL && meta_hdr == NULL) {
4232 + for (ddt_hdr = multilist_sublist_tail(ddt_mls); ddt_hdr != NULL;
4233 + ddt_hdr = multilist_sublist_prev(ddt_mls, ddt_hdr)) {
4234 + if (ddt_hdr->b_spa != 0)
4235 + break;
4236 + }
4237 +
4238 + if (data_hdr == NULL && meta_hdr == NULL && ddt_hdr == NULL) {
3691 4239 type = ARC_BUFC_DATA;
3692 - } else if (data_hdr == NULL) {
4240 + } else if (data_hdr != NULL && meta_hdr != NULL && ddt_hdr != NULL) {
4241 + /* The headers can't be on the sublist without an L1 header */
4242 + ASSERT(HDR_HAS_L1HDR(data_hdr));
4243 + ASSERT(HDR_HAS_L1HDR(meta_hdr));
4244 + ASSERT(HDR_HAS_L1HDR(ddt_hdr));
4245 +
4246 + oldest = data_hdr->b_l1hdr.b_arc_access;
4247 + type = ARC_BUFC_DATA;
4248 + if (oldest > meta_hdr->b_l1hdr.b_arc_access) {
4249 + oldest = meta_hdr->b_l1hdr.b_arc_access;
4250 + type = ARC_BUFC_METADATA;
4251 + }
4252 + if (oldest > ddt_hdr->b_l1hdr.b_arc_access) {
4253 + type = ARC_BUFC_DDT;
4254 + }
4255 + } else if (data_hdr == NULL && ddt_hdr == NULL) {
3693 4256 ASSERT3P(meta_hdr, !=, NULL);
3694 4257 type = ARC_BUFC_METADATA;
3695 - } else if (meta_hdr == NULL) {
4258 + } else if (meta_hdr == NULL && ddt_hdr == NULL) {
3696 4259 ASSERT3P(data_hdr, !=, NULL);
3697 4260 type = ARC_BUFC_DATA;
3698 - } else {
3699 - ASSERT3P(data_hdr, !=, NULL);
3700 - ASSERT3P(meta_hdr, !=, NULL);
4261 + } else if (meta_hdr == NULL && data_hdr == NULL) {
4262 + ASSERT3P(ddt_hdr, !=, NULL);
4263 + type = ARC_BUFC_DDT;
4264 + } else if (data_hdr != NULL && ddt_hdr != NULL) {
4265 + ASSERT3P(meta_hdr, ==, NULL);
3701 4266
3702 4267 /* The headers can't be on the sublist without an L1 header */
3703 4268 ASSERT(HDR_HAS_L1HDR(data_hdr));
4269 + ASSERT(HDR_HAS_L1HDR(ddt_hdr));
4270 +
4271 + if (data_hdr->b_l1hdr.b_arc_access <
4272 + ddt_hdr->b_l1hdr.b_arc_access) {
4273 + type = ARC_BUFC_DATA;
4274 + } else {
4275 + type = ARC_BUFC_DDT;
4276 + }
4277 + } else if (meta_hdr != NULL && ddt_hdr != NULL) {
4278 + ASSERT3P(data_hdr, ==, NULL);
4279 +
4280 + /* The headers can't be on the sublist without an L1 header */
3704 4281 ASSERT(HDR_HAS_L1HDR(meta_hdr));
4282 + ASSERT(HDR_HAS_L1HDR(ddt_hdr));
3705 4283
4284 + if (meta_hdr->b_l1hdr.b_arc_access <
4285 + ddt_hdr->b_l1hdr.b_arc_access) {
4286 + type = ARC_BUFC_METADATA;
4287 + } else {
4288 + type = ARC_BUFC_DDT;
4289 + }
4290 + } else if (meta_hdr != NULL && data_hdr != NULL) {
4291 + ASSERT3P(ddt_hdr, ==, NULL);
4292 +
4293 + /* The headers can't be on the sublist without an L1 header */
4294 + ASSERT(HDR_HAS_L1HDR(data_hdr));
4295 + ASSERT(HDR_HAS_L1HDR(meta_hdr));
4296 +
3706 4297 if (data_hdr->b_l1hdr.b_arc_access <
3707 4298 meta_hdr->b_l1hdr.b_arc_access) {
3708 4299 type = ARC_BUFC_DATA;
3709 4300 } else {
3710 4301 type = ARC_BUFC_METADATA;
3711 4302 }
4303 + } else {
4304 + /* should never get here */
4305 + ASSERT(0);
3712 4306 }
3713 4307
4308 + multilist_sublist_unlock(ddt_mls);
3714 4309 multilist_sublist_unlock(meta_mls);
3715 4310 multilist_sublist_unlock(data_mls);
3716 4311
3717 4312 return (type);
3718 4313 }
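
The branch ladder above enumerates every combination of present and absent tails, because the kernel code must also assert L1 headers and keep all three sublist locks held; the underlying rule is simply "the oldest existing tail wins, defaulting to data". A compact user-space sketch of that rule (hypothetical names, not the arc.c implementation):

#include <stdio.h>

typedef enum { T_DATA, T_META, T_DDT } type_t;

static type_t
oldest_type(const long *access, int n)	/* access[i] < 0 means "no tail" */
{
	type_t best = T_DATA;
	long best_access = -1;

	for (int i = 0; i < n; i++) {
		if (access[i] < 0)
			continue;
		if (best_access < 0 || access[i] < best_access) {
			best_access = access[i];
			best = (type_t)i;
		}
	}
	return (best);	/* defaults to T_DATA when no tail is present */
}

int
main(void)
{
	long access[3] = { 500, 120, -1 };	/* data, metadata, no ddt tail */
	printf("%d\n", (int)oldest_type(access, 3));	/* 1 == T_META */
	return (0);
}
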
3719 4314
3720 4315 /*
3721 4316 * Evict buffers from the cache, such that arc_size is capped by arc_c.
3722 4317 */
3723 4318 static uint64_t
3724 4319 arc_adjust(void)
3725 4320 {
3726 4321 uint64_t total_evicted = 0;
3727 4322 uint64_t bytes;
3728 4323 int64_t target;
3729 - uint64_t asize = aggsum_value(&arc_size);
3730 - uint64_t ameta = aggsum_value(&arc_meta_used);
3731 4324
3732 4325 /*
3733 4326 * If we're over arc_meta_limit, we want to correct that before
3734 4327 * potentially evicting data buffers below.
3735 4328 */
3736 - total_evicted += arc_adjust_meta(ameta);
4329 + total_evicted += arc_adjust_meta_or_ddt(B_FALSE);
3737 4330
3738 4331 /*
4332 + * If we're over arc_ddt_limit, we want to correct that before
4333 + * potentially evicting data buffers below.
4334 + */
4335 + total_evicted += arc_adjust_meta_or_ddt(B_TRUE);
4336 +
4337 + /*
3739 4338 * Adjust MRU size
3740 4339 *
3741 4340 * If we're over the target cache size, we want to evict enough
3742 4341 * from the list to get back to our target size. We don't want
3743 4342 * to evict too much from the MRU, such that it drops below
3744 4343 * arc_p. So, if we're over our target cache size more than
3745 4344 * the MRU is over arc_p, we'll evict enough to get back to
3746 4345 * arc_p here, and then evict more from the MFU below.
3747 4346 */
3748 - target = MIN((int64_t)(asize - arc_c),
4347 + target = MIN((int64_t)(arc_size - arc_c),
3749 4348 (int64_t)(refcount_count(&arc_anon->arcs_size) +
3750 - refcount_count(&arc_mru->arcs_size) + ameta - arc_p));
4349 + refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p));
3751 4350
3752 4351 /*
3753 4352 * If we're below arc_meta_min, always prefer to evict data.
3754 4353 * Otherwise, try to satisfy the requested number of bytes to
3755 4354 * evict from the type which contains older buffers; in an
3756 4355 * effort to keep newer buffers in the cache regardless of their
3757 4356 * type. If we cannot satisfy the number of bytes from this
3758 4357 * type, spill over into the next type.
3759 4358 */
3760 4359 if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
3761 - ameta > arc_meta_min) {
4360 + arc_meta_used > arc_meta_min) {
3762 4361 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
3763 4362 total_evicted += bytes;
3764 4363
3765 4364 /*
3766 4365 * If we couldn't evict our target number of bytes from
3767 4366 * metadata, we try to get the rest from data.
3768 4367 */
3769 4368 target -= bytes;
3770 4369
3771 - total_evicted +=
3772 - arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
 4370 +		bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
4371 + total_evicted += bytes;
3773 4372 } else {
3774 4373 bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
3775 4374 total_evicted += bytes;
3776 4375
3777 4376 /*
3778 4377 * If we couldn't evict our target number of bytes from
3779 4378 * data, we try to get the rest from metadata.
3780 4379 */
3781 4380 target -= bytes;
3782 4381
3783 - total_evicted +=
3784 - arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
 4382 +		bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
4383 + total_evicted += bytes;
3785 4384 }
3786 4385
3787 4386 /*
4387 + * If we couldn't evict our target number of bytes from
4388 + * data and metadata, we try to get the rest from ddt.
4389 + */
4390 + target -= bytes;
4391 + total_evicted +=
4392 + arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DDT);
4393 +
4394 + /*
3788 4395 * Adjust MFU size
3789 4396 *
3790 4397 * Now that we've tried to evict enough from the MRU to get its
3791 4398 * size back to arc_p, if we're still above the target cache
3792 4399 * size, we evict the rest from the MFU.
3793 4400 */
3794 - target = asize - arc_c;
4401 + target = arc_size - arc_c;
3795 4402
3796 4403 if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
3797 - ameta > arc_meta_min) {
4404 + arc_meta_used > arc_meta_min) {
3798 4405 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
3799 4406 total_evicted += bytes;
3800 4407
3801 4408 /*
3802 4409 * If we couldn't evict our target number of bytes from
3803 4410 * metadata, we try to get the rest from data.
3804 4411 */
3805 4412 target -= bytes;
3806 4413
3807 - total_evicted +=
3808 - arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
 4414 +		bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
4415 + total_evicted += bytes;
3809 4416 } else {
3810 4417 bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
3811 4418 total_evicted += bytes;
3812 4419
3813 4420 /*
3814 4421 * If we couldn't evict our target number of bytes from
 3815 4422 		 * data, we try to get the rest from metadata.
3816 4423 */
3817 4424 target -= bytes;
3818 4425
3819 - total_evicted +=
3820 - arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
 4426 +		bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
4427 + total_evicted += bytes;
3821 4428 }
3822 4429
3823 4430 /*
4431 + * If we couldn't evict our target number of bytes from
4432 + * data and metadata, we try to get the rest from ddt.
4433 + */
4434 + target -= bytes;
4435 + total_evicted +=
4436 + arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DDT);
4437 +
4438 + /*
3824 4439 * Adjust ghost lists
3825 4440 *
3826 4441 * In addition to the above, the ARC also defines target values
3827 4442 * for the ghost lists. The sum of the mru list and mru ghost
3828 4443 * list should never exceed the target size of the cache, and
3829 4444 * the sum of the mru list, mfu list, mru ghost list, and mfu
3830 4445 * ghost list should never exceed twice the target size of the
3831 4446 * cache. The following logic enforces these limits on the ghost
3832 4447 * caches, and evicts from them as needed.
3833 4448 */
3834 4449 target = refcount_count(&arc_mru->arcs_size) +
3835 4450 refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
3836 4451
3837 4452 bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
3838 4453 total_evicted += bytes;
3839 4454
3840 4455 target -= bytes;
3841 4456
 4457 +	bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
4458 + total_evicted += bytes;
4459 +
4460 + target -= bytes;
4461 +
3842 4462 total_evicted +=
3843 - arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
4463 + arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DDT);
3844 4464
3845 4465 /*
3846 4466 * We assume the sum of the mru list and mfu list is less than
3847 4467 * or equal to arc_c (we enforced this above), which means we
3848 4468 * can use the simpler of the two equations below:
3849 4469 *
3850 4470 * mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
3851 4471 * mru ghost + mfu ghost <= arc_c
3852 4472 */
3853 4473 target = refcount_count(&arc_mru_ghost->arcs_size) +
3854 4474 refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
3855 4475
3856 4476 bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
3857 4477 total_evicted += bytes;
3858 4478
3859 4479 target -= bytes;
3860 4480
 4481 +	bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
4482 + total_evicted += bytes;
4483 +
4484 + target -= bytes;
4485 +
3861 4486 total_evicted +=
3862 - arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
4487 + arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DDT);
3863 4488
3864 4489 return (total_evicted);
3865 4490 }
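
A small arithmetic sketch (hypothetical sizes) of the two ghost-list constraints enforced at the end of arc_adjust(): mru + mru_ghost is trimmed back to arc_c, then mru_ghost + mfu_ghost is trimmed back to arc_c, which together keep the whole cache within 2 * arc_c:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int64_t arc_c = 1000;
	int64_t mru = 600, mru_ghost = 700, mfu_ghost = 500;

	int64_t t1 = mru + mru_ghost - arc_c;		/* evict from mru ghost */
	if (t1 > 0)
		mru_ghost -= t1;			/* 700 -> 400 */

	int64_t t2 = mru_ghost + mfu_ghost - arc_c;	/* evict from mfu ghost */
	if (t2 > 0)
		mfu_ghost -= t2;			/* 400 + 500 <= 1000: no-op */

	printf("mru_ghost=%lld mfu_ghost=%lld\n",
	    (long long)mru_ghost, (long long)mfu_ghost);
	return (0);
}
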
3866 4491
4492 +typedef struct arc_async_flush_data {
4493 + uint64_t aaf_guid;
4494 + boolean_t aaf_retry;
4495 +} arc_async_flush_data_t;
4496 +
4497 +static taskq_t *arc_flush_taskq;
4498 +
4499 +static void
4500 +arc_flush_impl(uint64_t guid, boolean_t retry)
4501 +{
4502 + arc_buf_contents_t arcs;
4503 +
4504 + for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
4505 + (void) arc_flush_state(arc_mru, guid, arcs, retry);
4506 + (void) arc_flush_state(arc_mfu, guid, arcs, retry);
4507 + (void) arc_flush_state(arc_mru_ghost, guid, arcs, retry);
4508 + (void) arc_flush_state(arc_mfu_ghost, guid, arcs, retry);
4509 + }
4510 +}
4511 +
4512 +static void
4513 +arc_flush_task(void *arg)
4514 +{
4515 + arc_async_flush_data_t *aaf = (arc_async_flush_data_t *)arg;
4516 + arc_flush_impl(aaf->aaf_guid, aaf->aaf_retry);
4517 + kmem_free(aaf, sizeof (arc_async_flush_data_t));
4518 +}
4519 +
4520 +boolean_t zfs_fastflush = B_TRUE;
4521 +
3867 4522 void
3868 4523 arc_flush(spa_t *spa, boolean_t retry)
3869 4524 {
3870 4525 uint64_t guid = 0;
 4526 +	boolean_t async_flush = (spa != NULL ? zfs_fastflush : B_FALSE);
4527 + arc_async_flush_data_t *aaf = NULL;
3871 4528
3872 4529 /*
3873 4530 * If retry is B_TRUE, a spa must not be specified since we have
3874 4531 * no good way to determine if all of a spa's buffers have been
3875 4532 * evicted from an arc state.
3876 4533 */
3877 - ASSERT(!retry || spa == 0);
4534 + ASSERT(!retry || spa == NULL);
3878 4535
3879 - if (spa != NULL)
4536 + if (spa != NULL) {
3880 4537 guid = spa_load_guid(spa);
4538 + if (async_flush) {
4539 + aaf = kmem_alloc(sizeof (arc_async_flush_data_t),
4540 + KM_SLEEP);
4541 + aaf->aaf_guid = guid;
4542 + aaf->aaf_retry = retry;
4543 + }
4544 + }
3881 4545
3882 - (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
3883 - (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
3884 -
3885 - (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
3886 - (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
3887 -
3888 - (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
3889 - (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
3890 -
3891 - (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
3892 - (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
4546 + /*
4547 + * Try to flush per-spa remaining ARC ghost buffers asynchronously
4548 + * while a pool is being closed.
 4549 +	 * An ARC buffer is bound to its spa only by guid, so a buffer can
 4550 +	 * exist even after the pool is gone. If asynchronous flushing
 4551 +	 * fails, we fall back to the regular (synchronous) flush.
 4552 +	 * NOTE: It is not a problem if asynchronous flushing has not yet
 4553 +	 * finished when the pool is imported again, even when the guids
 4554 +	 * before and after export/import are the same, because only
 4555 +	 * unreferenced buffers can be evicted; the others are skipped.
4556 + */
4557 + if (!async_flush || (taskq_dispatch(arc_flush_taskq, arc_flush_task,
4558 + aaf, TQ_NOSLEEP) == NULL)) {
4559 + arc_flush_impl(guid, retry);
4560 + if (async_flush)
4561 + kmem_free(aaf, sizeof (arc_async_flush_data_t));
4562 + }
3893 4563 }
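
The control flow added to arc_flush() is a dispatch-or-fall-back pattern: build a request, try to hand it to the flush taskq, and if dispatch fails run the flush synchronously and free the request in the caller. A user-space sketch of just that pattern (all names hypothetical; try_dispatch() stands in for taskq_dispatch() and always fails here):

#include <stdio.h>
#include <stdlib.h>

typedef struct flush_req {
	unsigned long long guid;
	int retry;
} flush_req_t;

static void
flush_sync(unsigned long long guid, int retry)
{
	printf("flushing guid %llu (retry=%d)\n", guid, retry);
}

static void
flush_task(void *arg)
{
	flush_req_t *req = arg;
	flush_sync(req->guid, req->retry);
	free(req);		/* the worker owns the request on success */
}

/* Stand-in for taskq_dispatch(); returns 0 on failure. */
static int
try_dispatch(void (*fn)(void *), void *arg)
{
	(void) fn;
	(void) arg;
	return (0);		/* pretend the task queue is full */
}

int
main(void)
{
	flush_req_t *req = malloc(sizeof (*req));

	if (req == NULL)
		return (1);
	req->guid = 42;
	req->retry = 0;

	if (try_dispatch(flush_task, req) == 0) {
		/* Dispatch failed: fall back to a synchronous flush. */
		flush_sync(req->guid, req->retry);
		free(req);
	}
	return (0);
}
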
3894 4564
3895 4565 void
3896 4566 arc_shrink(int64_t to_free)
3897 4567 {
3898 - uint64_t asize = aggsum_value(&arc_size);
3899 4568 if (arc_c > arc_c_min) {
3900 4569
3901 4570 if (arc_c > arc_c_min + to_free)
3902 4571 atomic_add_64(&arc_c, -to_free);
3903 4572 else
3904 4573 arc_c = arc_c_min;
3905 4574
3906 4575 atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
3907 - if (asize < arc_c)
3908 - arc_c = MAX(asize, arc_c_min);
4576 + if (arc_c > arc_size)
4577 + arc_c = MAX(arc_size, arc_c_min);
3909 4578 if (arc_p > arc_c)
3910 4579 arc_p = (arc_c >> 1);
3911 4580 ASSERT(arc_c >= arc_c_min);
3912 4581 ASSERT((int64_t)arc_p >= 0);
3913 4582 }
3914 4583
3915 - if (asize > arc_c)
4584 + if (arc_size > arc_c)
3916 4585 (void) arc_adjust();
3917 4586 }
3918 4587
3919 4588 typedef enum free_memory_reason_t {
3920 4589 FMR_UNKNOWN,
3921 4590 FMR_NEEDFREE,
3922 4591 FMR_LOTSFREE,
3923 4592 FMR_SWAPFS_MINFREE,
3924 4593 FMR_PAGES_PP_MAXIMUM,
3925 4594 FMR_HEAP_ARENA,
3926 4595 FMR_ZIO_ARENA,
3927 4596 } free_memory_reason_t;
3928 4597
3929 4598 int64_t last_free_memory;
3930 4599 free_memory_reason_t last_free_reason;
3931 4600
3932 4601 /*
3933 4602 * Additional reserve of pages for pp_reserve.
3934 4603 */
3935 4604 int64_t arc_pages_pp_reserve = 64;
3936 4605
3937 4606 /*
3938 4607 * Additional reserve of pages for swapfs.
3939 4608 */
3940 4609 int64_t arc_swapfs_reserve = 64;
3941 4610
3942 4611 /*
3943 4612 * Return the amount of memory that can be consumed before reclaim will be
3944 4613 * needed. Positive if there is sufficient free memory, negative indicates
3945 4614 * the amount of memory that needs to be freed up.
3946 4615 */
3947 4616 static int64_t
3948 4617 arc_available_memory(void)
3949 4618 {
3950 4619 int64_t lowest = INT64_MAX;
3951 4620 int64_t n;
3952 4621 free_memory_reason_t r = FMR_UNKNOWN;
3953 4622
3954 4623 #ifdef _KERNEL
3955 4624 if (needfree > 0) {
3956 4625 n = PAGESIZE * (-needfree);
3957 4626 if (n < lowest) {
3958 4627 lowest = n;
3959 4628 r = FMR_NEEDFREE;
3960 4629 }
3961 4630 }
3962 4631
3963 4632 /*
3964 4633 * check that we're out of range of the pageout scanner. It starts to
3965 4634 * schedule paging if freemem is less than lotsfree and needfree.
3966 4635 * lotsfree is the high-water mark for pageout, and needfree is the
3967 4636 * number of needed free pages. We add extra pages here to make sure
3968 4637 * the scanner doesn't start up while we're freeing memory.
3969 4638 */
3970 4639 n = PAGESIZE * (freemem - lotsfree - needfree - desfree);
3971 4640 if (n < lowest) {
3972 4641 lowest = n;
3973 4642 r = FMR_LOTSFREE;
3974 4643 }
3975 4644
3976 4645 /*
3977 4646 * check to make sure that swapfs has enough space so that anon
3978 4647 * reservations can still succeed. anon_resvmem() checks that the
3979 4648 * availrmem is greater than swapfs_minfree, and the number of reserved
3980 4649 * swap pages. We also add a bit of extra here just to prevent
3981 4650 * circumstances from getting really dire.
3982 4651 */
3983 4652 n = PAGESIZE * (availrmem - swapfs_minfree - swapfs_reserve -
3984 4653 desfree - arc_swapfs_reserve);
3985 4654 if (n < lowest) {
3986 4655 lowest = n;
3987 4656 r = FMR_SWAPFS_MINFREE;
3988 4657 }
3989 4658
3990 4659
3991 4660 /*
3992 4661 * Check that we have enough availrmem that memory locking (e.g., via
3993 4662 * mlock(3C) or memcntl(2)) can still succeed. (pages_pp_maximum
3994 4663 * stores the number of pages that cannot be locked; when availrmem
3995 4664 * drops below pages_pp_maximum, page locking mechanisms such as
3996 4665 * page_pp_lock() will fail.)
3997 4666 */
3998 4667 n = PAGESIZE * (availrmem - pages_pp_maximum -
3999 4668 arc_pages_pp_reserve);
4000 4669 if (n < lowest) {
4001 4670 lowest = n;
4002 4671 r = FMR_PAGES_PP_MAXIMUM;
4003 4672 }
4004 4673
4005 4674 #if defined(__i386)
4006 4675 /*
4007 4676 * If we're on an i386 platform, it's possible that we'll exhaust the
4008 4677 * kernel heap space before we ever run out of available physical
4009 4678 * memory. Most checks of the size of the heap_area compare against
4010 4679 * tune.t_minarmem, which is the minimum available real memory that we
4011 4680 * can have in the system. However, this is generally fixed at 25 pages
4012 4681 * which is so low that it's useless. In this comparison, we seek to
4013 4682 * calculate the total heap-size, and reclaim if more than 3/4ths of the
4014 4683 * heap is allocated. (Or, in the calculation, if less than 1/4th is
4015 4684 * free)
4016 4685 */
4017 4686 n = (int64_t)vmem_size(heap_arena, VMEM_FREE) -
4018 4687 (vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC) >> 2);
4019 4688 if (n < lowest) {
4020 4689 lowest = n;
4021 4690 r = FMR_HEAP_ARENA;
4022 4691 }
4023 4692 #endif
4024 4693
4025 4694 /*
4026 4695 * If zio data pages are being allocated out of a separate heap segment,
4027 4696 * then enforce that the size of available vmem for this arena remains
4028 4697 * above about 1/4th (1/(2^arc_zio_arena_free_shift)) free.
4029 4698 *
4030 4699 * Note that reducing the arc_zio_arena_free_shift keeps more virtual
4031 4700 * memory (in the zio_arena) free, which can avoid memory
4032 4701 * fragmentation issues.
4033 4702 */
4034 4703 if (zio_arena != NULL) {
4035 4704 n = (int64_t)vmem_size(zio_arena, VMEM_FREE) -
4036 4705 (vmem_size(zio_arena, VMEM_ALLOC) >>
4037 4706 arc_zio_arena_free_shift);
4038 4707 if (n < lowest) {
4039 4708 lowest = n;
4040 4709 r = FMR_ZIO_ARENA;
4041 4710 }
4042 4711 }
4043 4712 #else
4044 4713 /* Every 100 calls, free a small amount */
4045 4714 if (spa_get_random(100) == 0)
4046 4715 lowest = -1024;
4047 4716 #endif
4048 4717
4049 4718 last_free_memory = lowest;
4050 4719 last_free_reason = r;
4051 4720
4052 4721 return (lowest);
4053 4722 }
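
Each check in arc_available_memory() produces a signed margin (negative means a deficit), and the smallest margin, together with its reason, is what gets reported. A sketch of that "lowest margin wins" accounting with hypothetical numbers:

#include <stdio.h>
#include <stdint.h>

typedef enum { R_UNKNOWN, R_LOTSFREE, R_SWAPFS, R_ZIO_ARENA } reason_t;

int
main(void)
{
	int64_t margins[] = { 4096, -8192, 1024 };	/* bytes of headroom */
	reason_t reasons[] = { R_LOTSFREE, R_SWAPFS, R_ZIO_ARENA };
	int64_t lowest = INT64_MAX;
	reason_t why = R_UNKNOWN;

	for (int i = 0; i < 3; i++) {
		if (margins[i] < lowest) {
			lowest = margins[i];
			why = reasons[i];
		}
	}
	/* Prints lowest = -8192, reason = 2 (R_SWAPFS): reclaim is needed. */
	printf("lowest = %lld, reason = %d\n", (long long)lowest, (int)why);
	return (0);
}
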
4054 4723
4055 4724
4056 4725 /*
4057 4726 * Determine if the system is under memory pressure and is asking
4058 4727 * to reclaim memory. A return value of B_TRUE indicates that the system
4059 4728 * is under memory pressure and that the arc should adjust accordingly.
4060 4729 */
4061 4730 static boolean_t
4062 4731 arc_reclaim_needed(void)
4063 4732 {
4064 4733 return (arc_available_memory() < 0);
4065 4734 }
4066 4735
4067 4736 static void
4068 4737 arc_kmem_reap_now(void)
4069 4738 {
4070 4739 size_t i;
4071 4740 kmem_cache_t *prev_cache = NULL;
4072 4741 kmem_cache_t *prev_data_cache = NULL;
4073 4742 extern kmem_cache_t *zio_buf_cache[];
4074 4743 extern kmem_cache_t *zio_data_buf_cache[];
4075 4744 extern kmem_cache_t *range_seg_cache;
4076 4745 extern kmem_cache_t *abd_chunk_cache;
4077 4746
4078 4747 #ifdef _KERNEL
4079 - if (aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) {
4748 + if (arc_meta_used >= arc_meta_limit || arc_ddt_size >= arc_ddt_limit) {
4080 4749 /*
4081 - * We are exceeding our meta-data cache limit.
4082 - * Purge some DNLC entries to release holds on meta-data.
4750 + * We are exceeding our meta-data or DDT cache limit.
4751 + * Purge some DNLC entries to release holds on meta-data/DDT.
4083 4752 */
4084 4753 dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
4085 4754 }
4086 4755 #if defined(__i386)
4087 4756 /*
4088 4757 * Reclaim unused memory from all kmem caches.
4089 4758 */
4090 4759 kmem_reap();
4091 4760 #endif
4092 4761 #endif
4093 4762
4094 4763 /*
4095 4764 * If a kmem reap is already active, don't schedule more. We must
4096 4765 * check for this because kmem_cache_reap_soon() won't actually
4097 4766 * block on the cache being reaped (this is to prevent callers from
4098 4767 * becoming implicitly blocked by a system-wide kmem reap -- which,
4099 4768 * on a system with many, many full magazines, can take minutes).
4100 4769 */
4101 4770 if (kmem_cache_reap_active())
4102 4771 return;
4103 4772
4104 4773 for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
4105 4774 if (zio_buf_cache[i] != prev_cache) {
4106 4775 prev_cache = zio_buf_cache[i];
4107 4776 kmem_cache_reap_soon(zio_buf_cache[i]);
4108 4777 }
4109 4778 if (zio_data_buf_cache[i] != prev_data_cache) {
4110 4779 prev_data_cache = zio_data_buf_cache[i];
4111 4780 kmem_cache_reap_soon(zio_data_buf_cache[i]);
4112 4781 }
4113 4782 }
4114 4783 kmem_cache_reap_soon(abd_chunk_cache);
4115 4784 kmem_cache_reap_soon(buf_cache);
4116 4785 kmem_cache_reap_soon(hdr_full_cache);
4117 4786 kmem_cache_reap_soon(hdr_l2only_cache);
4118 4787 kmem_cache_reap_soon(range_seg_cache);
4119 4788
4120 4789 if (zio_arena != NULL) {
4121 4790 /*
4122 4791 * Ask the vmem arena to reclaim unused memory from its
4123 4792 * quantum caches.
4124 4793 */
4125 4794 vmem_qcache_reap(zio_arena);
4126 4795 }
4127 4796 }
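
The zio cache loop in arc_kmem_reap_now() remembers the previously reaped cache pointer because adjacent zio_buf_cache[] slots can point at the same kmem cache; repeated entries are skipped with a simple pointer comparison. A user-space sketch of that dedup loop (hypothetical names):

#include <stdio.h>

int
main(void)
{
	/* Two "slots" sharing one cache, as with rounded-up buffer sizes. */
	const char *c512 = "c512", *c1024 = "c1024", *c2048 = "c2048";
	const char *caches[] = { c512, c1024, c1024, c2048 };
	const char *prev = NULL;

	for (int i = 0; i < 4; i++) {
		if (caches[i] != prev) {	/* pointer compare, as in the kernel */
			prev = caches[i];
			printf("reap %s\n", caches[i]);
		}
	}
	return (0);
}
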
4128 4797
4129 4798 /*
4130 4799 * Threads can block in arc_get_data_impl() waiting for this thread to evict
4131 4800 * enough data and signal them to proceed. When this happens, the threads in
4132 4801 * arc_get_data_impl() are sleeping while holding the hash lock for their
4133 4802 * particular arc header. Thus, we must be careful to never sleep on a
4134 4803 * hash lock in this thread. This is to prevent the following deadlock:
4135 4804 *
4136 4805 * - Thread A sleeps on CV in arc_get_data_impl() holding hash lock "L",
4137 4806 * waiting for the reclaim thread to signal it.
4138 4807 *
4139 4808 * - arc_reclaim_thread() tries to acquire hash lock "L" using mutex_enter,
4140 4809 * fails, and goes to sleep forever.
4141 4810 *
4142 4811 * This possible deadlock is avoided by always acquiring a hash lock
4143 4812 * using mutex_tryenter() from arc_reclaim_thread().
4144 4813 */
4145 4814 /* ARGSUSED */
4146 4815 static void
4147 4816 arc_reclaim_thread(void *unused)
4148 4817 {
4149 4818 hrtime_t growtime = 0;
4150 4819 hrtime_t kmem_reap_time = 0;
4151 4820 callb_cpr_t cpr;
4152 4821
4153 4822 CALLB_CPR_INIT(&cpr, &arc_reclaim_lock, callb_generic_cpr, FTAG);
4154 4823
4155 4824 mutex_enter(&arc_reclaim_lock);
4156 4825 while (!arc_reclaim_thread_exit) {
4157 4826 uint64_t evicted = 0;
4158 4827
4159 4828 /*
4160 4829 * This is necessary in order for the mdb ::arc dcmd to
4161 4830 * show up to date information. Since the ::arc command
4162 4831 * does not call the kstat's update function, without
4163 4832 * this call, the command may show stale stats for the
4164 4833 * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even
4165 4834 * with this change, the data might be up to 1 second
4166 4835 * out of date; but that should suffice. The arc_state_t
4167 4836 * structures can be queried directly if more accurate
4168 4837 * information is needed.
4169 4838 */
4170 4839 if (arc_ksp != NULL)
4171 4840 arc_ksp->ks_update(arc_ksp, KSTAT_READ);
4172 4841
4173 4842 mutex_exit(&arc_reclaim_lock);
4174 4843
4175 4844 /*
4176 4845 * We call arc_adjust() before (possibly) calling
4177 4846 * arc_kmem_reap_now(), so that we can wake up
4178 4847 * arc_get_data_impl() sooner.
4179 4848 */
4180 4849 evicted = arc_adjust();
4181 4850
4182 4851 int64_t free_memory = arc_available_memory();
4183 4852 if (free_memory < 0) {
4184 4853 hrtime_t curtime = gethrtime();
4185 4854 arc_no_grow = B_TRUE;
4186 4855 arc_warm = B_TRUE;
4187 4856
4188 4857 /*
4189 4858 * Wait at least zfs_grow_retry (default 60) seconds
4190 4859 * before considering growing.
4191 4860 */
4192 4861 growtime = curtime + SEC2NSEC(arc_grow_retry);
4193 4862
4194 4863 /*
4195 4864 * Wait at least arc_kmem_cache_reap_retry_ms
4196 4865 * between arc_kmem_reap_now() calls. Without
4197 4866 * this check it is possible to end up in a
4198 4867 * situation where we spend lots of time
4199 4868 * reaping caches, while we're near arc_c_min.
4200 4869 */
4201 4870 if (curtime >= kmem_reap_time) {
4202 4871 arc_kmem_reap_now();
4203 4872 kmem_reap_time = gethrtime() +
4204 4873 MSEC2NSEC(arc_kmem_cache_reap_retry_ms);
4205 4874 }
4206 4875
4207 4876 /*
4208 4877 * If we are still low on memory, shrink the ARC
4209 4878 * so that we have arc_shrink_min free space.
4210 4879 */
4211 4880 free_memory = arc_available_memory();
4212 4881
4213 4882 int64_t to_free =
4214 4883 (arc_c >> arc_shrink_shift) - free_memory;
4215 4884 if (to_free > 0) {
4216 4885 #ifdef _KERNEL
4217 4886 to_free = MAX(to_free, ptob(needfree));
4218 4887 #endif
4219 4888 arc_shrink(to_free);
4220 4889 }
4221 4890 } else if (free_memory < arc_c >> arc_no_grow_shift) {
4222 4891 arc_no_grow = B_TRUE;
4223 4892 } else if (gethrtime() >= growtime) {
4224 4893 arc_no_grow = B_FALSE;
4225 4894 }
4226 4895
4227 4896 mutex_enter(&arc_reclaim_lock);
4228 4897
4229 4898 /*
4230 4899 * If evicted is zero, we couldn't evict anything via
4231 4900 * arc_adjust(). This could be due to hash lock
4232 4901 * collisions, but more likely due to the majority of
4233 4902 * arc buffers being unevictable. Therefore, even if
4234 4903 * arc_size is above arc_c, another pass is unlikely to
4235 4904 * be helpful and could potentially cause us to enter an
4236 4905 * infinite loop.
4237 4906 */
4238 - if (aggsum_compare(&arc_size, arc_c) <= 0|| evicted == 0) {
4907 + if (arc_size <= arc_c || evicted == 0) {
4239 4908 /*
4240 4909 * We're either no longer overflowing, or we
4241 4910 * can't evict anything more, so we should wake
4242 4911 * up any threads before we go to sleep.
4243 4912 */
4244 4913 cv_broadcast(&arc_reclaim_waiters_cv);
4245 4914
4246 4915 /*
4247 4916 * Block until signaled, or after one second (we
4248 4917 * might need to perform arc_kmem_reap_now()
4249 4918 * even if we aren't being signalled)
4250 4919 */
4251 4920 CALLB_CPR_SAFE_BEGIN(&cpr);
4252 4921 (void) cv_timedwait_hires(&arc_reclaim_thread_cv,
4253 4922 &arc_reclaim_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
4254 4923 CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_lock);
4255 4924 }
4256 4925 }
4257 4926
4258 4927 arc_reclaim_thread_exit = B_FALSE;
4259 4928 cv_broadcast(&arc_reclaim_thread_cv);
4260 4929 CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_lock */
4261 4930 thread_exit();
4262 4931 }
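
The deadlock-avoidance rule described in the block comment above arc_reclaim_thread() boils down to this: the reclaim path only ever try-locks a hash lock, so it can never sleep on a lock held by a thread that is itself waiting for the reclaim thread. A minimal pthread analogue (hypothetical, user-space; build with -lpthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

static void
reclaim_one(void)
{
	if (pthread_mutex_trylock(&hash_lock) == 0) {
		printf("got hash lock, evicting\n");
		pthread_mutex_unlock(&hash_lock);
	} else {
		/* Lock busy: count a "mutex miss" and move on, never block. */
		printf("mutex miss, skipping\n");
	}
}

int
main(void)
{
	reclaim_one();			/* lock free: evicts */
	pthread_mutex_lock(&hash_lock);	/* simulate another holder */
	reclaim_one();			/* lock busy: skips */
	pthread_mutex_unlock(&hash_lock);
	return (0);
}
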
4263 4932
4264 4933 /*
4265 4934 * Adapt arc info given the number of bytes we are trying to add and
4266 4935 * the state that we are comming from. This function is only called
4267 4936 * when we are adding new content to the cache.
4268 4937 */
4269 4938 static void
4270 4939 arc_adapt(int bytes, arc_state_t *state)
4271 4940 {
4272 4941 int mult;
4273 4942 uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
4274 4943 int64_t mrug_size = refcount_count(&arc_mru_ghost->arcs_size);
4275 4944 int64_t mfug_size = refcount_count(&arc_mfu_ghost->arcs_size);
4276 4945
4277 4946 if (state == arc_l2c_only)
4278 4947 return;
4279 4948
4280 4949 ASSERT(bytes > 0);
4281 4950 /*
4282 4951 * Adapt the target size of the MRU list:
4283 4952 * - if we just hit in the MRU ghost list, then increase
4284 4953 * the target size of the MRU list.
4285 4954 * - if we just hit in the MFU ghost list, then increase
4286 4955 * the target size of the MFU list by decreasing the
4287 4956 * target size of the MRU list.
4288 4957 */
4289 4958 if (state == arc_mru_ghost) {
4290 4959 mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size);
4291 4960 mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
4292 4961
4293 4962 arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
4294 4963 } else if (state == arc_mfu_ghost) {
4295 4964 uint64_t delta;
4296 4965
4297 4966 mult = (mfug_size >= mrug_size) ? 1 : (mrug_size / mfug_size);
4298 4967 mult = MIN(mult, 10);
4299 4968
4300 4969 delta = MIN(bytes * mult, arc_p);
4301 4970 arc_p = MAX(arc_p_min, arc_p - delta);
4302 4971 }
4303 4972 ASSERT((int64_t)arc_p >= 0);
4304 4973
4305 4974 if (arc_reclaim_needed()) {
4306 4975 cv_signal(&arc_reclaim_thread_cv);
4307 4976 return;
4308 4977 }
4309 4978
4310 4979 if (arc_no_grow)
4311 4980 return;
4312 4981
4313 4982 if (arc_c >= arc_c_max)
4314 4983 return;
4315 4984
4316 4985 /*
4317 4986 * If we're within (2 * maxblocksize) bytes of the target
4318 4987 * cache size, increment the target cache size
4319 4988 */
4320 - if (aggsum_compare(&arc_size, arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) >
4321 - 0) {
4989 + if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
4322 4990 atomic_add_64(&arc_c, (int64_t)bytes);
4323 4991 if (arc_c > arc_c_max)
4324 4992 arc_c = arc_c_max;
4325 4993 else if (state == arc_anon)
4326 4994 atomic_add_64(&arc_p, (int64_t)bytes);
4327 4995 if (arc_p > arc_c)
4328 4996 arc_p = arc_c;
4329 4997 }
4330 4998 ASSERT((int64_t)arc_p >= 0);
4331 4999 }
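
A worked example of the arc_p adaptation above for an MFU-ghost hit, with hypothetical sizes: the multiplier is the ratio of the ghost list sizes capped at 10, and arc_p shrinks by bytes * mult but never below arc_p_min:

#include <stdio.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	MAX(a, b)	((a) > (b) ? (a) : (b))

int
main(void)
{
	uint64_t arc_c = 1 << 20, arc_p = 600 << 10;	/* 1 MB cache, 600 KB MRU target */
	uint64_t arc_p_min = arc_c >> 4;		/* assumes arc_p_min_shift of 4 */
	int64_t mrug = 300 << 10, mfug = 100 << 10;	/* ghost list sizes */
	int bytes = 8 << 10;				/* 8 KB buffer being added */

	/* MFU ghost hit: favor the MFU by shrinking arc_p. */
	int mult = (mfug >= mrug) ? 1 : (int)(mrug / mfug);	/* 3 */
	mult = MIN(mult, 10);

	uint64_t delta = MIN((uint64_t)(bytes * mult), arc_p);
	arc_p = MAX(arc_p_min, arc_p - delta);

	printf("arc_p shrinks by %llu to %llu\n",
	    (unsigned long long)delta, (unsigned long long)arc_p);
	return (0);
}
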
4332 5000
4333 5001 /*
4334 5002 * Check if arc_size has grown past our upper threshold, determined by
4335 5003 * zfs_arc_overflow_shift.
4336 5004 */
4337 5005 static boolean_t
4338 5006 arc_is_overflowing(void)
4339 5007 {
4340 5008 /* Always allow at least one block of overflow */
4341 5009 uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
4342 5010 arc_c >> zfs_arc_overflow_shift);
4343 5011
4344 - /*
4345 - * We just compare the lower bound here for performance reasons. Our
4346 - * primary goals are to make sure that the arc never grows without
4347 - * bound, and that it can reach its maximum size. This check
4348 - * accomplishes both goals. The maximum amount we could run over by is
4349 - * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
4350 - * in the ARC. In practice, that's in the tens of MB, which is low
4351 - * enough to be safe.
4352 - */
4353 - return (aggsum_lower_bound(&arc_size) >= arc_c + overflow);
5012 + return (arc_size >= arc_c + overflow);
4354 5013 }
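
A worked example of the simplified overflow test above, assuming a 16 MB SPA_MAXBLOCKSIZE, a 4 GB arc_c, and an overflow shift of 8 (all hypothetical values):

#include <stdio.h>
#include <stdint.h>

#define	MAX(a, b)	((a) > (b) ? (a) : (b))

int
main(void)
{
	uint64_t spa_maxblocksize = 16ULL << 20;	/* 16 MB */
	uint64_t arc_c = 4ULL << 30;			/* 4 GB target */
	int overflow_shift = 8;				/* assumed shift value */

	uint64_t overflow = MAX(spa_maxblocksize, arc_c >> overflow_shift);
	uint64_t arc_size = arc_c + (32ULL << 20);	/* 32 MB over target */

	/* 4 GB >> 8 is 16 MB, so the threshold here is arc_c + 16 MB. */
	printf("overflowing: %d\n", arc_size >= arc_c + overflow);	/* 1 */
	return (0);
}
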
4355 5014
4356 5015 static abd_t *
4357 5016 arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4358 5017 {
4359 5018 arc_buf_contents_t type = arc_buf_type(hdr);
4360 5019
4361 5020 arc_get_data_impl(hdr, size, tag);
4362 - if (type == ARC_BUFC_METADATA) {
5021 + if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4363 5022 return (abd_alloc(size, B_TRUE));
4364 5023 } else {
4365 5024 ASSERT(type == ARC_BUFC_DATA);
4366 5025 return (abd_alloc(size, B_FALSE));
4367 5026 }
4368 5027 }
4369 5028
4370 5029 static void *
4371 5030 arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4372 5031 {
4373 5032 arc_buf_contents_t type = arc_buf_type(hdr);
4374 5033
4375 5034 arc_get_data_impl(hdr, size, tag);
4376 - if (type == ARC_BUFC_METADATA) {
5035 + if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4377 5036 return (zio_buf_alloc(size));
4378 5037 } else {
4379 5038 ASSERT(type == ARC_BUFC_DATA);
4380 5039 return (zio_data_buf_alloc(size));
4381 5040 }
4382 5041 }
4383 5042
4384 5043 /*
4385 5044 * Allocate a block and return it to the caller. If we are hitting the
4386 5045 * hard limit for the cache size, we must sleep, waiting for the eviction
4387 5046 * thread to catch up. If we're past the target size but below the hard
4388 5047 * limit, we'll only signal the reclaim thread and continue on.
4389 5048 */
4390 5049 static void
4391 5050 arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4392 5051 {
4393 5052 arc_state_t *state = hdr->b_l1hdr.b_state;
4394 5053 arc_buf_contents_t type = arc_buf_type(hdr);
4395 5054
4396 5055 arc_adapt(size, state);
4397 5056
4398 5057 /*
4399 5058 * If arc_size is currently overflowing, and has grown past our
4400 5059 * upper limit, we must be adding data faster than the evict
4401 5060 * thread can evict. Thus, to ensure we don't compound the
4402 5061 * problem by adding more data and forcing arc_size to grow even
 4403 5062 	 * further past its target size, we halt and wait for the
4404 5063 * eviction thread to catch up.
4405 5064 *
4406 5065 * It's also possible that the reclaim thread is unable to evict
4407 5066 * enough buffers to get arc_size below the overflow limit (e.g.
4408 5067 * due to buffers being un-evictable, or hash lock collisions).
4409 5068 * In this case, we want to proceed regardless if we're
4410 5069 * overflowing; thus we don't use a while loop here.
4411 5070 */
4412 5071 if (arc_is_overflowing()) {
4413 5072 mutex_enter(&arc_reclaim_lock);
4414 5073
4415 5074 /*
4416 5075 * Now that we've acquired the lock, we may no longer be
4417 5076 * over the overflow limit, lets check.
4418 5077 *
4419 5078 * We're ignoring the case of spurious wake ups. If that
4420 5079 * were to happen, it'd let this thread consume an ARC
4421 5080 * buffer before it should have (i.e. before we're under
4422 5081 * the overflow limit and were signalled by the reclaim
4423 5082 * thread). As long as that is a rare occurrence, it
4424 5083 * shouldn't cause any harm.
4425 5084 */
4426 5085 if (arc_is_overflowing()) {
4427 5086 cv_signal(&arc_reclaim_thread_cv);
4428 5087 cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
4429 5088 }
4430 5089
4431 5090 mutex_exit(&arc_reclaim_lock);
4432 5091 }
4433 5092
4434 5093 VERIFY3U(hdr->b_type, ==, type);
4435 - if (type == ARC_BUFC_METADATA) {
5094 + if (type == ARC_BUFC_DDT) {
5095 + arc_space_consume(size, ARC_SPACE_DDT);
5096 + } else if (type == ARC_BUFC_METADATA) {
4436 5097 arc_space_consume(size, ARC_SPACE_META);
4437 5098 } else {
4438 5099 arc_space_consume(size, ARC_SPACE_DATA);
4439 5100 }
4440 5101
4441 5102 /*
4442 5103 * Update the state size. Note that ghost states have a
4443 5104 * "ghost size" and so don't need to be updated.
4444 5105 */
4445 5106 if (!GHOST_STATE(state)) {
4446 5107
4447 5108 (void) refcount_add_many(&state->arcs_size, size, tag);
4448 5109
4449 5110 /*
4450 5111 * If this is reached via arc_read, the link is
4451 5112 * protected by the hash lock. If reached via
4452 5113 * arc_buf_alloc, the header should not be accessed by
4453 5114 * any other thread. And, if reached via arc_read_done,
4454 5115 * the hash lock will protect it if it's found in the
4455 5116 * hash table; otherwise no other thread should be
4456 5117 * trying to [add|remove]_reference it.
4457 5118 */
4458 5119 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4459 5120 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4460 5121 (void) refcount_add_many(&state->arcs_esize[type],
4461 5122 size, tag);
4462 5123 }
4463 5124
4464 5125 /*
4465 5126 * If we are growing the cache, and we are adding anonymous
4466 5127 * data, and we have outgrown arc_p, update arc_p
4467 5128 */
4468 - if (aggsum_compare(&arc_size, arc_c) < 0 &&
4469 - hdr->b_l1hdr.b_state == arc_anon &&
5129 + if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon &&
4470 5130 (refcount_count(&arc_anon->arcs_size) +
4471 5131 refcount_count(&arc_mru->arcs_size) > arc_p))
4472 5132 arc_p = MIN(arc_c, arc_p + size);
4473 5133 }
4474 5134 }
4475 5135
4476 5136 static void
4477 5137 arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag)
4478 5138 {
4479 5139 arc_free_data_impl(hdr, size, tag);
4480 5140 abd_free(abd);
4481 5141 }
4482 5142
4483 5143 static void
4484 5144 arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag)
4485 5145 {
4486 5146 arc_buf_contents_t type = arc_buf_type(hdr);
4487 5147
4488 5148 arc_free_data_impl(hdr, size, tag);
4489 - if (type == ARC_BUFC_METADATA) {
5149 + if (type == ARC_BUFC_METADATA || type == ARC_BUFC_DDT) {
4490 5150 zio_buf_free(buf, size);
4491 5151 } else {
4492 5152 ASSERT(type == ARC_BUFC_DATA);
4493 5153 zio_data_buf_free(buf, size);
4494 5154 }
4495 5155 }
4496 5156
4497 5157 /*
4498 5158 * Free the arc data buffer.
4499 5159 */
4500 5160 static void
4501 5161 arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
4502 5162 {
4503 5163 arc_state_t *state = hdr->b_l1hdr.b_state;
4504 5164 arc_buf_contents_t type = arc_buf_type(hdr);
4505 5165
4506 5166 /* protected by hash lock, if in the hash table */
4507 5167 if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
4508 5168 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4509 5169 ASSERT(state != arc_anon && state != arc_l2c_only);
4510 5170
4511 5171 (void) refcount_remove_many(&state->arcs_esize[type],
4512 5172 size, tag);
4513 5173 }
4514 5174 (void) refcount_remove_many(&state->arcs_size, size, tag);
4515 5175
4516 5176 VERIFY3U(hdr->b_type, ==, type);
4517 - if (type == ARC_BUFC_METADATA) {
5177 + if (type == ARC_BUFC_DDT) {
5178 + arc_space_return(size, ARC_SPACE_DDT);
5179 + } else if (type == ARC_BUFC_METADATA) {
4518 5180 arc_space_return(size, ARC_SPACE_META);
4519 5181 } else {
4520 5182 ASSERT(type == ARC_BUFC_DATA);
4521 5183 arc_space_return(size, ARC_SPACE_DATA);
4522 5184 }
4523 5185 }
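Both the allocation path above and this free path now distinguish ARC_BUFC_DDT from ARC_BUFC_METADATA when charging space. A hedged sketch of that mapping as a standalone helper (the helper name is hypothetical; the enum values are the ones used in the diff):

    /*
     * Hypothetical helper: map an arc_buf_contents_t to the space bucket
     * charged by arc_space_consume()/arc_space_return() above.
     */
    static arc_space_type_t
    example_space_type(arc_buf_contents_t type)
    {
            if (type == ARC_BUFC_DDT)
                    return (ARC_SPACE_DDT);
            if (type == ARC_BUFC_METADATA)
                    return (ARC_SPACE_META);
            ASSERT(type == ARC_BUFC_DATA);
            return (ARC_SPACE_DATA);
    }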
4524 5186
4525 5187 /*
4526 5188 * This routine is called whenever a buffer is accessed.
4527 5189 * NOTE: the hash lock is dropped in this function.
4528 5190 */
4529 5191 static void
4530 5192 arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
4531 5193 {
4532 5194 clock_t now;
4533 5195
4534 5196 ASSERT(MUTEX_HELD(hash_lock));
4535 5197 ASSERT(HDR_HAS_L1HDR(hdr));
4536 5198
4537 5199 if (hdr->b_l1hdr.b_state == arc_anon) {
4538 5200 /*
4539 5201 * This buffer is not in the cache, and does not
4540 5202 * appear in our "ghost" list. Add the new buffer
4541 5203 * to the MRU state.
4542 5204 */
4543 5205
4544 5206 ASSERT0(hdr->b_l1hdr.b_arc_access);
4545 5207 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4546 5208 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
4547 5209 arc_change_state(arc_mru, hdr, hash_lock);
4548 5210
4549 5211 } else if (hdr->b_l1hdr.b_state == arc_mru) {
4550 5212 now = ddi_get_lbolt();
4551 5213
4552 5214 /*
4553 5215 * If this buffer is here because of a prefetch, then either:
4554 5216 * - clear the flag if this is a "referencing" read
4555 5217 * (any subsequent access will bump this into the MFU state).
4556 5218 * or
4557 5219 * - move the buffer to the head of the list if this is
4558 5220 * another prefetch (to make it less likely to be evicted).
4559 5221 */
4560 5222 if (HDR_PREFETCH(hdr)) {
4561 5223 if (refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4562 5224 /* link protected by hash lock */
4563 5225 ASSERT(multilist_link_active(
4564 5226 &hdr->b_l1hdr.b_arc_node));
4565 5227 } else {
4566 5228 arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
4567 5229 ARCSTAT_BUMP(arcstat_mru_hits);
4568 5230 }
4569 5231 hdr->b_l1hdr.b_arc_access = now;
4570 5232 return;
4571 5233 }
4572 5234
4573 5235 /*
4574 5236 * This buffer has been "accessed" only once so far,
4575 5237 * but it is still in the cache. Move it to the MFU
4576 5238 * state.
4577 5239 */
4578 5240 if (now > hdr->b_l1hdr.b_arc_access + ARC_MINTIME) {
4579 5241 /*
4580 5242 * More than 125ms have passed since we
4581 5243 * instantiated this buffer. Move it to the
4582 5244 * most frequently used state.
4583 5245 */
4584 5246 hdr->b_l1hdr.b_arc_access = now;
4585 5247 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4586 5248 arc_change_state(arc_mfu, hdr, hash_lock);
4587 5249 }
4588 5250 ARCSTAT_BUMP(arcstat_mru_hits);
4589 5251 } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
4590 5252 arc_state_t *new_state;
4591 5253 /*
4592 5254 * This buffer has been "accessed" recently, but
4593 5255 * was evicted from the cache. Move it to the
4594 5256 * MFU state.
4595 5257 */
4596 5258
4597 5259 if (HDR_PREFETCH(hdr)) {
4598 5260 new_state = arc_mru;
4599 5261 if (refcount_count(&hdr->b_l1hdr.b_refcnt) > 0)
4600 5262 arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
4601 5263 DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
4602 5264 } else {
4603 5265 new_state = arc_mfu;
4604 5266 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4605 5267 }
4606 5268
4607 5269 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4608 5270 arc_change_state(new_state, hdr, hash_lock);
4609 5271
4610 5272 ARCSTAT_BUMP(arcstat_mru_ghost_hits);
4611 5273 } else if (hdr->b_l1hdr.b_state == arc_mfu) {
4612 5274 /*
4613 5275 * This buffer has been accessed more than once and is
4614 5276 * still in the cache. Keep it in the MFU state.
4615 5277 *
4616 5278 * NOTE: an add_reference() that occurred when we did
4617 5279 * the arc_read() will have kicked this off the list.
4618 5280 * If it was a prefetch, we will explicitly move it to
4619 5281 * the head of the list now.
4620 5282 */
4621 5283 if ((HDR_PREFETCH(hdr)) != 0) {
4622 5284 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
4623 5285 /* link protected by hash_lock */
4624 5286 ASSERT(multilist_link_active(&hdr->b_l1hdr.b_arc_node));
4625 5287 }
4626 5288 ARCSTAT_BUMP(arcstat_mfu_hits);
4627 5289 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4628 5290 } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
4629 5291 arc_state_t *new_state = arc_mfu;
4630 5292 /*
4631 5293 * This buffer has been accessed more than once but has
4632 5294 * been evicted from the cache. Move it back to the
4633 5295 * MFU state.
4634 5296 */
4635 5297
4636 5298 if (HDR_PREFETCH(hdr)) {
4637 5299 /*
4638 5300 * This is a prefetch access...
4639 5301 * move this block back to the MRU state.
4640 5302 */
4641 5303 ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt));
4642 5304 new_state = arc_mru;
4643 5305 }
4644 5306
4645 5307 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4646 5308 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4647 5309 arc_change_state(new_state, hdr, hash_lock);
4648 5310
4649 5311 ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
4650 5312 } else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
4651 5313 /*
4652 5314 * This buffer is on the 2nd Level ARC.
4653 5315 */
4654 5316
4655 5317 hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
4656 5318 DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
4657 5319 arc_change_state(arc_mfu, hdr, hash_lock);
4658 5320 } else {
4659 5321 ASSERT(!"invalid arc state");
4660 5322 }
4661 5323 }
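arc_access() above encodes the whole per-hit state machine. A compact summary of the transitions it performs, written as a C comment (locking, stats, and DTrace probes omitted):

    /*
     * Transition summary for arc_access() (sketch only, details omitted):
     *
     *   arc_anon                  -> arc_mru   (first insertion)
     *   arc_mru, prefetch hit     -> arc_mru   (flag cleared on demand read)
     *   arc_mru, > ARC_MINTIME    -> arc_mfu
     *   arc_mru_ghost             -> arc_mfu   (arc_mru if still a prefetch)
     *   arc_mfu                   -> arc_mfu   (access timestamp refreshed)
     *   arc_mfu_ghost             -> arc_mfu   (arc_mru if still a prefetch)
     *   arc_l2c_only              -> arc_mfu
     */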
4662 5324
5325 +/*
5326 + * This routine is called by dbuf_hold() to update the arc_access() state
5327 + * which otherwise would be skipped for entries in the dbuf cache.
5328 + */
5329 +void
5330 +arc_buf_access(arc_buf_t *buf)
5331 +{
5332 + mutex_enter(&buf->b_evict_lock);
5333 + arc_buf_hdr_t *hdr = buf->b_hdr;
5334 +
5335 + /*
5336 + * Avoid taking the hash_lock when possible as an optimization.
5337 + * The header must be checked again under the hash_lock in order
5338 + * to handle the case where it is concurrently being released.
5339 + */
5340 + if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
5341 + mutex_exit(&buf->b_evict_lock);
5342 + return;
5343 + }
5344 +
5345 + kmutex_t *hash_lock = HDR_LOCK(hdr);
5346 + mutex_enter(hash_lock);
5347 +
5348 + if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
5349 + mutex_exit(hash_lock);
5350 + mutex_exit(&buf->b_evict_lock);
5351 + ARCSTAT_BUMP(arcstat_access_skip);
5352 + return;
5353 + }
5354 +
5355 + mutex_exit(&buf->b_evict_lock);
5356 +
5357 + ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
5358 + hdr->b_l1hdr.b_state == arc_mfu);
5359 +
5360 + DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
5361 + arc_access(hdr, hash_lock);
5362 + mutex_exit(hash_lock);
5363 +
5364 + ARCSTAT_BUMP(arcstat_hits);
5365 + /*
5366 + * Upstream used the ARCSTAT_CONDSTAT macro here, but they changed
5367 + * the argument format for that macro, which would require that we
5368 + * go and modify all other uses of it. So it's easier to just expand
5369 + * this one invocation of the macro to do the right thing.
5370 + */
5371 + if (!HDR_PREFETCH(hdr)) {
5372 + if (!HDR_ISTYPE_METADATA(hdr))
5373 + ARCSTAT_BUMP(arcstat_demand_data_hits);
5374 + else
5375 + ARCSTAT_BUMP(arcstat_demand_metadata_hits);
5376 + } else {
5377 + if (!HDR_ISTYPE_METADATA(hdr))
5378 + ARCSTAT_BUMP(arcstat_prefetch_data_hits);
5379 + else
5380 + ARCSTAT_BUMP(arcstat_prefetch_metadata_hits);
5381 + }
5382 +}
5383 +
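arc_update_hit_stat(), called from arc_read() further down, is defined outside this hunk; judging from the expanded ARCSTAT_CONDSTAT logic above, it presumably bumps one of the demand/prefetch, data/metadata counters. A hedged sketch of that logic (the name and body here are an assumption, not the actual definition):

    static void
    example_update_hit_stat(arc_buf_hdr_t *hdr, boolean_t hit)
    {
            boolean_t demand = !HDR_PREFETCH(hdr);
            boolean_t metadata = HDR_ISTYPE_METADATA(hdr);

            if (hit) {
                    if (demand && metadata)
                            ARCSTAT_BUMP(arcstat_demand_metadata_hits);
                    else if (demand)
                            ARCSTAT_BUMP(arcstat_demand_data_hits);
                    else if (metadata)
                            ARCSTAT_BUMP(arcstat_prefetch_metadata_hits);
                    else
                            ARCSTAT_BUMP(arcstat_prefetch_data_hits);
            } else {
                    /* the miss path mirrors this with the *_misses counters */
            }
    }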
4663 5384 /* a generic arc_done_func_t which you can use */
4664 5385 /* ARGSUSED */
4665 5386 void
4666 5387 arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
4667 5388 {
4668 5389 if (zio == NULL || zio->io_error == 0)
4669 5390 bcopy(buf->b_data, arg, arc_buf_size(buf));
4670 5391 arc_buf_destroy(buf, arg);
4671 5392 }
4672 5393
4673 5394 /* a generic arc_done_func_t */
4674 5395 void
4675 5396 arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
4676 5397 {
4677 5398 arc_buf_t **bufp = arg;
4678 5399 if (zio && zio->io_error) {
4679 5400 arc_buf_destroy(buf, arg);
4680 5401 *bufp = NULL;
4681 5402 } else {
4682 5403 *bufp = buf;
4683 5404 ASSERT(buf->b_data);
4684 5405 }
4685 5406 }
4686 5407
4687 5408 static void
4688 5409 arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)
4689 5410 {
4690 5411 if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
4691 5412 ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0);
4692 5413 ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
4693 5414 } else {
4694 5415 if (HDR_COMPRESSION_ENABLED(hdr)) {
4695 5416 ASSERT3U(HDR_GET_COMPRESS(hdr), ==,
4696 5417 BP_GET_COMPRESS(bp));
4697 5418 }
4698 5419 ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
4699 5420 ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));
4700 5421 }
4701 5422 }
4702 5423
4703 5424 static void
4704 5425 arc_read_done(zio_t *zio)
4705 5426 {
4706 5427 arc_buf_hdr_t *hdr = zio->io_private;
4707 5428 kmutex_t *hash_lock = NULL;
4708 5429 arc_callback_t *callback_list;
4709 5430 arc_callback_t *acb;
4710 5431 boolean_t freeable = B_FALSE;
4711 5432 boolean_t no_zio_error = (zio->io_error == 0);
4712 5433
4713 5434 /*
4714 5435 * The hdr was inserted into hash-table and removed from lists
4715 5436 * prior to starting I/O. We should find this header, since
4716 5437 * it's in the hash table, and it should be legit since it's
4717 5438 * not possible to evict it during the I/O. The only possible
4718 5439 * reason for it not to be found is if we were freed during the
4719 5440 * read.
4720 5441 */
4721 5442 if (HDR_IN_HASH_TABLE(hdr)) {
4722 5443 ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
4723 5444 ASSERT3U(hdr->b_dva.dva_word[0], ==,
4724 5445 BP_IDENTITY(zio->io_bp)->dva_word[0]);
4725 5446 ASSERT3U(hdr->b_dva.dva_word[1], ==,
4726 5447 BP_IDENTITY(zio->io_bp)->dva_word[1]);
4727 5448
4728 5449 arc_buf_hdr_t *found = buf_hash_find(hdr->b_spa, zio->io_bp,
4729 5450 &hash_lock);
4730 5451
4731 5452 ASSERT((found == hdr &&
4732 5453 DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
4733 5454 (found == hdr && HDR_L2_READING(hdr)));
4734 5455 ASSERT3P(hash_lock, !=, NULL);
4735 5456 }
4736 5457
4737 5458 if (no_zio_error) {
4738 5459 /* byteswap if necessary */
4739 5460 if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
4740 5461 if (BP_GET_LEVEL(zio->io_bp) > 0) {
4741 5462 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
4742 5463 } else {
4743 5464 hdr->b_l1hdr.b_byteswap =
4744 5465 DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
4745 5466 }
4746 5467 } else {
4747 5468 hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
4748 5469 }
4749 5470 }
4750 5471
4751 5472 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);
4752 5473 if (l2arc_noprefetch && HDR_PREFETCH(hdr))
4753 5474 arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);
4754 5475
4755 5476 callback_list = hdr->b_l1hdr.b_acb;
4756 5477 ASSERT3P(callback_list, !=, NULL);
4757 5478
4758 5479 if (hash_lock && no_zio_error && hdr->b_l1hdr.b_state == arc_anon) {
4759 5480 /*
4760 5481 * Only call arc_access on anonymous buffers. This is because
4761 5482 * if we've issued an I/O for an evicted buffer, we've already
4762 5483 * called arc_access (to prevent any simultaneous readers from
4763 5484 * getting confused).
4764 5485 */
4765 5486 arc_access(hdr, hash_lock);
4766 5487 }
4767 5488
4768 5489 /*
4769 5490 * If a read request has a callback (i.e. acb_done is not NULL), then we
4770 5491 * make a buf containing the data according to the parameters which were
4771 5492 * passed in. The implementation of arc_buf_alloc_impl() ensures that we
4772 5493 * aren't needlessly decompressing the data multiple times.
4773 5494 */
4774 5495 int callback_cnt = 0;
4775 5496 for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
4776 5497 if (!acb->acb_done)
4777 5498 continue;
4778 5499
4779 5500 /* This is a demand read since prefetches don't use callbacks */
4780 5501 callback_cnt++;
4781 5502
4782 5503 int error = arc_buf_alloc_impl(hdr, acb->acb_private,
4783 5504 acb->acb_compressed, no_zio_error, &acb->acb_buf);
4784 5505 if (no_zio_error) {
4785 5506 zio->io_error = error;
4786 5507 }
4787 5508 }
4788 5509 hdr->b_l1hdr.b_acb = NULL;
4789 5510 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
4790 5511 if (callback_cnt == 0) {
4791 5512 ASSERT(HDR_PREFETCH(hdr));
4792 5513 ASSERT0(hdr->b_l1hdr.b_bufcnt);
4793 5514 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
4794 5515 }
4795 5516
4796 5517 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
4797 5518 callback_list != NULL);
4798 5519
4799 5520 if (no_zio_error) {
4800 5521 arc_hdr_verify(hdr, zio->io_bp);
4801 5522 } else {
4802 5523 arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
4803 5524 if (hdr->b_l1hdr.b_state != arc_anon)
4804 5525 arc_change_state(arc_anon, hdr, hash_lock);
4805 - if (HDR_IN_HASH_TABLE(hdr))
5526 + if (HDR_IN_HASH_TABLE(hdr)) {
5527 + if (hash_lock)
5528 + arc_wait_for_krrp(hdr);
4806 5529 buf_hash_remove(hdr);
5530 + }
4807 5531 freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4808 5532 }
4809 5533
4810 5534 /*
4811 5535 * Broadcast before we drop the hash_lock to avoid the possibility
4812 5536 * that the hdr (and hence the cv) might be freed before we get to
4813 5537 * the cv_broadcast().
4814 5538 */
4815 5539 cv_broadcast(&hdr->b_l1hdr.b_cv);
4816 5540
4817 5541 if (hash_lock != NULL) {
4818 5542 mutex_exit(hash_lock);
4819 5543 } else {
4820 5544 /*
4821 5545 * This block was freed while we waited for the read to
4822 5546 * complete. It has been removed from the hash table and
4823 5547 * moved to the anonymous state (so that it won't show up
4824 5548 * in the cache).
4825 5549 */
4826 5550 ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
4827 5551 freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
4828 5552 }
4829 5553
4830 5554 /* execute each callback and free its structure */
4831 5555 while ((acb = callback_list) != NULL) {
4832 5556 if (acb->acb_done)
4833 5557 acb->acb_done(zio, acb->acb_buf, acb->acb_private);
4834 5558
4835 5559 if (acb->acb_zio_dummy != NULL) {
4836 5560 acb->acb_zio_dummy->io_error = zio->io_error;
4837 5561 zio_nowait(acb->acb_zio_dummy);
4838 5562 }
4839 5563
4840 5564 callback_list = acb->acb_next;
4841 5565 kmem_free(acb, sizeof (arc_callback_t));
4842 5566 }
4843 5567
4844 5568 if (freeable)
4845 5569 arc_hdr_destroy(hdr);
4846 5570 }
4847 5571
4848 5572 /*
5573 + * Process data from the ARC via a caller-supplied callback.
5574 + * The main purpose is to copy data directly from the ARC into a target buffer.
5575 + */
5576 +int
5577 +arc_io_bypass(spa_t *spa, const blkptr_t *bp,
5578 + arc_bypass_io_func func, void *arg)
5579 +{
5580 + arc_buf_hdr_t *hdr;
5581 + kmutex_t *hash_lock = NULL;
5582 + int error = 0;
5583 + uint64_t guid = spa_load_guid(spa);
5584 +
5585 +top:
5586 + hdr = buf_hash_find(guid, bp, &hash_lock);
5587 + if (hdr && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_bufcnt > 0 &&
5588 + hdr->b_l1hdr.b_buf->b_data) {
5589 + if (HDR_IO_IN_PROGRESS(hdr)) {
5590 + cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
5591 + mutex_exit(hash_lock);
5592 + DTRACE_PROBE(arc_bypass_wait);
5593 + goto top;
5594 + }
5595 +
5596 + /*
5597 + * Since func is an arbitrary callback that may block, the hash
5598 + * lock must be dropped so that other threads are not blocked.
5599 + * A counter is used to hold a reference to blocks that are
5600 + * held by krrp.
5601 + */
5602 +
5603 + hdr->b_l1hdr.b_krrp++;
5604 + mutex_exit(hash_lock);
5605 +
5606 + error = func(hdr->b_l1hdr.b_buf->b_data, hdr->b_lsize, arg);
5607 +
5608 + mutex_enter(hash_lock);
5609 + hdr->b_l1hdr.b_krrp--;
5610 + cv_broadcast(&hdr->b_l1hdr.b_cv);
5611 + mutex_exit(hash_lock);
5612 +
5613 + return (error);
5614 + } else {
5615 + if (hash_lock)
5616 + mutex_exit(hash_lock);
5617 + return (ENODATA);
5618 + }
5619 +}
5620 +
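arc_io_bypass() lets a caller consume a cached block in place while the b_krrp counter pins the header. A hedged usage sketch; the callback and caller below are hypothetical, with the callback signature inferred from the func() invocation above:

    /* Hypothetical callback: copy the cached block into a caller buffer. */
    static int
    example_copy_cb(void *data, uint64_t size, void *arg)
    {
            bcopy(data, arg, size);
            return (0);
    }

    /*
     * Hypothetical caller: try the ARC first and fall back to a normal
     * read when arc_io_bypass() returns ENODATA (block not cached).
     */
    static int
    example_read_cached(spa_t *spa, const blkptr_t *bp, void *dst)
    {
            int error = arc_io_bypass(spa, bp, example_copy_cb, dst);

            if (error == ENODATA) {
                    /* not cached; issue a regular arc_read() instead */
            }
            return (error);
    }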
5621 +/*
4849 5622 * "Read" the block at the specified DVA (in bp) via the
4850 5623 * cache. If the block is found in the cache, invoke the provided
4851 5624 * callback immediately and return. Note that the `zio' parameter
4852 5625 * in the callback will be NULL in this case, since no IO was
4853 5626 * required. If the block is not in the cache pass the read request
4854 5627 * on to the spa with a substitute callback function, so that the
4855 5628 * requested block will be added to the cache.
4856 5629 *
4857 5630 * If a read request arrives for a block that has a read in-progress,
4858 5631 * either wait for the in-progress read to complete (and return the
4859 5632 * results); or, if this is a read with a "done" func, add a record
4860 5633 * to the read to invoke the "done" func when the read completes,
4861 5634 * and return; or just return.
4862 5635 *
4863 5636 * arc_read_done() will invoke all the requested "done" functions
4864 5637 * for readers of this block.
4865 5638 */
4866 5639 int
4867 5640 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
4868 5641 void *private, zio_priority_t priority, int zio_flags,
4869 5642 arc_flags_t *arc_flags, const zbookmark_phys_t *zb)
4870 5643 {
4871 5644 arc_buf_hdr_t *hdr = NULL;
4872 5645 kmutex_t *hash_lock = NULL;
4873 5646 zio_t *rzio;
4874 5647 uint64_t guid = spa_load_guid(spa);
4875 5648 boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW) != 0;
4876 5649
4877 5650 ASSERT(!BP_IS_EMBEDDED(bp) ||
4878 5651 BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
4879 5652
4880 5653 top:
4881 5654 if (!BP_IS_EMBEDDED(bp)) {
4882 5655 /*
4883 5656 * Embedded BP's have no DVA and require no I/O to "read".
4884 5657 * Create an anonymous arc buf to back it.
4885 5658 */
4886 5659 hdr = buf_hash_find(guid, bp, &hash_lock);
4887 5660 }
4888 5661
4889 5662 if (hdr != NULL && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_pabd != NULL) {
4890 5663 arc_buf_t *buf = NULL;
4891 5664 *arc_flags |= ARC_FLAG_CACHED;
4892 5665
4893 5666 if (HDR_IO_IN_PROGRESS(hdr)) {
4894 5667
4895 5668 if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&
4896 5669 priority == ZIO_PRIORITY_SYNC_READ) {
4897 5670 /*
4898 5671 * This sync read must wait for an
4899 5672 * in-progress async read (e.g. a predictive
4900 5673 * prefetch). Async reads are queued
4901 5674 * separately at the vdev_queue layer, so
4902 5675 * this is a form of priority inversion.
4903 5676 * Ideally, we would "inherit" the demand
4904 5677 * i/o's priority by moving the i/o from
4905 5678 * the async queue to the synchronous queue,
4906 5679 * but there is currently no mechanism to do
4907 5680 * so. Track this so that we can evaluate
4908 5681 * the magnitude of this potential performance
4909 5682 * problem.
4910 5683 *
4911 5684 * Note that if the prefetch i/o is already
4912 5685 * active (has been issued to the device),
4913 5686 * the prefetch improved performance, because
4914 5687 * we issued it sooner than we would have
4915 5688 * without the prefetch.
4916 5689 */
4917 5690 DTRACE_PROBE1(arc__sync__wait__for__async,
4918 5691 arc_buf_hdr_t *, hdr);
4919 5692 ARCSTAT_BUMP(arcstat_sync_wait_for_async);
4920 5693 }
4921 5694 if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
4922 5695 arc_hdr_clear_flags(hdr,
4923 5696 ARC_FLAG_PREDICTIVE_PREFETCH);
4924 5697 }
4925 5698
4926 5699 if (*arc_flags & ARC_FLAG_WAIT) {
4927 5700 cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
4928 5701 mutex_exit(hash_lock);
4929 5702 goto top;
4930 5703 }
4931 5704 ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
4932 5705
4933 5706 if (done) {
4934 5707 arc_callback_t *acb = NULL;
4935 5708
4936 5709 acb = kmem_zalloc(sizeof (arc_callback_t),
4937 5710 KM_SLEEP);
4938 5711 acb->acb_done = done;
4939 5712 acb->acb_private = private;
4940 5713 acb->acb_compressed = compressed_read;
4941 5714 if (pio != NULL)
4942 5715 acb->acb_zio_dummy = zio_null(pio,
4943 5716 spa, NULL, NULL, NULL, zio_flags);
4944 5717
4945 5718 ASSERT3P(acb->acb_done, !=, NULL);
4946 5719 acb->acb_next = hdr->b_l1hdr.b_acb;
4947 5720 hdr->b_l1hdr.b_acb = acb;
4948 5721 mutex_exit(hash_lock);
4949 5722 return (0);
4950 5723 }
4951 5724 mutex_exit(hash_lock);
4952 5725 return (0);
4953 5726 }
4954 5727
4955 5728 ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
4956 5729 hdr->b_l1hdr.b_state == arc_mfu);
4957 5730
4958 5731 if (done) {
4959 5732 if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
4960 5733 /*
4961 5734 * This is a demand read which does not have to
4962 5735 * wait for i/o because we did a predictive
4963 5736 * prefetch i/o for it, which has completed.
4964 5737 */
4965 5738 DTRACE_PROBE1(
4966 5739 arc__demand__hit__predictive__prefetch,
4967 5740 arc_buf_hdr_t *, hdr);
4968 5741 ARCSTAT_BUMP(
4969 5742 arcstat_demand_hit_predictive_prefetch);
4970 5743 arc_hdr_clear_flags(hdr,
4971 5744 ARC_FLAG_PREDICTIVE_PREFETCH);
4972 5745 }
4973 5746 ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp));
4974 5747
4975 5748 /* Get a buf with the desired data in it. */
4976 5749 VERIFY0(arc_buf_alloc_impl(hdr, private,
4977 5750 compressed_read, B_TRUE, &buf));
4978 5751 } else if (*arc_flags & ARC_FLAG_PREFETCH &&
4979 5752 refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
4980 5753 arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
4981 5754 }
4982 5755 DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
4983 5756 arc_access(hdr, hash_lock);
4984 5757 if (*arc_flags & ARC_FLAG_L2CACHE)
4985 5758 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
4986 5759 mutex_exit(hash_lock);
4987 5760 ARCSTAT_BUMP(arcstat_hits);
4988 - ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
4989 - demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
4990 - data, metadata, hits);
5761 + if (HDR_ISTYPE_DDT(hdr))
5762 + ARCSTAT_BUMP(arcstat_ddt_hits);
5763 + arc_update_hit_stat(hdr, B_TRUE);
4991 5764
4992 5765 if (done)
4993 5766 done(NULL, buf, private);
4994 5767 } else {
4995 5768 uint64_t lsize = BP_GET_LSIZE(bp);
4996 5769 uint64_t psize = BP_GET_PSIZE(bp);
4997 5770 arc_callback_t *acb;
4998 5771 vdev_t *vd = NULL;
4999 5772 uint64_t addr = 0;
5000 5773 boolean_t devw = B_FALSE;
5001 5774 uint64_t size;
5002 5775
5003 5776 if (hdr == NULL) {
5004 5777 /* this block is not in the cache */
5005 5778 arc_buf_hdr_t *exists = NULL;
5006 5779 arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
5007 5780 hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
5008 5781 BP_GET_COMPRESS(bp), type);
5009 5782
5010 5783 if (!BP_IS_EMBEDDED(bp)) {
5011 5784 hdr->b_dva = *BP_IDENTITY(bp);
5012 5785 hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
5013 5786 exists = buf_hash_insert(hdr, &hash_lock);
5014 5787 }
5015 5788 if (exists != NULL) {
5016 5789 /* somebody beat us to the hash insert */
5017 - mutex_exit(hash_lock);
5018 - buf_discard_identity(hdr);
5019 5790 arc_hdr_destroy(hdr);
5791 + mutex_exit(hash_lock);
5020 5792 goto top; /* restart the IO request */
5021 5793 }
5022 5794 } else {
5023 5795 /*
5024 5796 * This block is in the ghost cache. If it was L2-only
5025 5797 * (and thus didn't have an L1 hdr), we realloc the
5026 5798 * header to add an L1 hdr.
5027 5799 */
5028 5800 if (!HDR_HAS_L1HDR(hdr)) {
5029 5801 hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
5030 5802 hdr_full_cache);
5031 5803 }
5032 5804 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5033 5805 ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
5034 5806 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5035 5807 ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5036 5808 ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
5037 - ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
5809 + ASSERT3P(hdr->b_freeze_cksum, ==, NULL);
5038 5810
5039 5811 /*
5040 5812 * This is a delicate dance that we play here.
5041 5813 * This hdr is in the ghost list so we access it
5042 5814 * to move it out of the ghost list before we
5043 5815 * initiate the read. If it's a prefetch then
5044 5816 * it won't have a callback so we'll remove the
5045 5817 * reference that arc_buf_alloc_impl() created. We
5046 5818 * do this after we've called arc_access() to
5047 5819 * avoid hitting an assert in remove_reference().
5048 5820 */
5049 5821 arc_access(hdr, hash_lock);
5050 5822 arc_hdr_alloc_pabd(hdr);
5051 5823 }
5052 5824 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
5053 5825 size = arc_hdr_size(hdr);
5054 5826
5055 5827 /*
5056 5828 * If compression is enabled on the hdr, then we will do
5057 5829 * RAW I/O and will store the compressed data in the hdr's
5058 5830 * data block. Otherwise, the hdr's data block will contain
5059 5831 * the uncompressed data.
5060 5832 */
5061 5833 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
5062 5834 zio_flags |= ZIO_FLAG_RAW;
5063 5835 }
5064 5836
5065 5837 if (*arc_flags & ARC_FLAG_PREFETCH)
5066 5838 arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
5067 5839 if (*arc_flags & ARC_FLAG_L2CACHE)
5068 5840 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
5069 5841 if (BP_GET_LEVEL(bp) > 0)
5070 5842 arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
5071 5843 if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH)
5072 5844 arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH);
5073 5845 ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
5074 5846
5075 5847 acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
5076 5848 acb->acb_done = done;
5077 5849 acb->acb_private = private;
5078 5850 acb->acb_compressed = compressed_read;
5079 5851
5080 5852 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5081 5853 hdr->b_l1hdr.b_acb = acb;
5082 5854 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5083 5855
5084 5856 if (HDR_HAS_L2HDR(hdr) &&
5085 5857 (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
5086 5858 devw = hdr->b_l2hdr.b_dev->l2ad_writing;
5087 5859 addr = hdr->b_l2hdr.b_daddr;
5088 5860 /*
5089 - * Lock out L2ARC device removal.
5861 + * Lock out device removal.
5090 5862 */
5091 5863 if (vdev_is_dead(vd) ||
5092 5864 !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
5093 5865 vd = NULL;
5094 5866 }
5095 5867
5096 5868 if (priority == ZIO_PRIORITY_ASYNC_READ)
5097 5869 arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5098 5870 else
5099 5871 arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
5100 5872
5101 5873 if (hash_lock != NULL)
5102 5874 mutex_exit(hash_lock);
5103 5875
5104 5876 /*
5105 5877 * At this point, we have a level 1 cache miss. Try again in
5106 5878 * L2ARC if possible.
5107 5879 */
5108 5880 ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
5109 5881
5110 5882 DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
5111 5883 uint64_t, lsize, zbookmark_phys_t *, zb);
5112 5884 ARCSTAT_BUMP(arcstat_misses);
5113 - ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
5114 - demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
5115 - data, metadata, misses);
5885 + arc_update_hit_stat(hdr, B_FALSE);
5116 5886
5117 5887 if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
5118 5888 /*
5119 5889 * Read from the L2ARC if the following are true:
5120 5890 * 1. The L2ARC vdev was previously cached.
5121 5891 * 2. This buffer still has L2ARC metadata.
5122 5892 * 3. This buffer isn't currently writing to the L2ARC.
5123 5893 * 4. The L2ARC entry wasn't evicted, which may
5124 5894 * also have invalidated the vdev.
5125 5895 * 5. This isn't prefetch and l2arc_noprefetch is set.
5126 5896 */
5127 5897 if (HDR_HAS_L2HDR(hdr) &&
5128 5898 !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
5129 5899 !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
5130 5900 l2arc_read_callback_t *cb;
5131 5901 abd_t *abd;
5132 5902 uint64_t asize;
5133 5903
5134 5904 DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
5135 5905 ARCSTAT_BUMP(arcstat_l2_hits);
5906 + if (vdev_type_is_ddt(vd))
5907 + ARCSTAT_BUMP(arcstat_l2_ddt_hits);
5136 5908
5137 5909 cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
5138 5910 KM_SLEEP);
5139 5911 cb->l2rcb_hdr = hdr;
5140 5912 cb->l2rcb_bp = *bp;
5141 5913 cb->l2rcb_zb = *zb;
5142 5914 cb->l2rcb_flags = zio_flags;
5143 5915
5144 5916 asize = vdev_psize_to_asize(vd, size);
5145 5917 if (asize != size) {
5146 5918 abd = abd_alloc_for_io(asize,
5147 - HDR_ISTYPE_METADATA(hdr));
5919 + !HDR_ISTYPE_DATA(hdr));
5148 5920 cb->l2rcb_abd = abd;
5149 5921 } else {
5150 5922 abd = hdr->b_l1hdr.b_pabd;
5151 5923 }
5152 5924
5153 5925 ASSERT(addr >= VDEV_LABEL_START_SIZE &&
5154 5926 addr + asize <= vd->vdev_psize -
5155 5927 VDEV_LABEL_END_SIZE);
5156 5928
5157 5929 /*
5158 5930 * l2arc read. The SCL_L2ARC lock will be
5159 5931 * released by l2arc_read_done().
5160 5932 * Issue a null zio if the underlying buffer
5161 5933 * was squashed to zero size by compression.
5162 5934 */
5163 5935 ASSERT3U(HDR_GET_COMPRESS(hdr), !=,
5164 5936 ZIO_COMPRESS_EMPTY);
5165 5937 rzio = zio_read_phys(pio, vd, addr,
5166 5938 asize, abd,
5167 5939 ZIO_CHECKSUM_OFF,
5168 5940 l2arc_read_done, cb, priority,
5169 5941 zio_flags | ZIO_FLAG_DONT_CACHE |
5170 5942 ZIO_FLAG_CANFAIL |
5171 5943 ZIO_FLAG_DONT_PROPAGATE |
5172 5944 ZIO_FLAG_DONT_RETRY, B_FALSE);
5173 5945 DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
5174 5946 zio_t *, rzio);
5947 +
5175 5948 ARCSTAT_INCR(arcstat_l2_read_bytes, size);
5949 + if (vdev_type_is_ddt(vd))
5950 + ARCSTAT_INCR(arcstat_l2_ddt_read_bytes,
5951 + size);
5176 5952
5177 5953 if (*arc_flags & ARC_FLAG_NOWAIT) {
5178 5954 zio_nowait(rzio);
5179 5955 return (0);
5180 5956 }
5181 5957
5182 5958 ASSERT(*arc_flags & ARC_FLAG_WAIT);
5183 5959 if (zio_wait(rzio) == 0)
5184 5960 return (0);
5185 5961
5186 5962 /* l2arc read error; goto zio_read() */
5187 5963 } else {
5188 5964 DTRACE_PROBE1(l2arc__miss,
5189 5965 arc_buf_hdr_t *, hdr);
5190 5966 ARCSTAT_BUMP(arcstat_l2_misses);
5191 5967 if (HDR_L2_WRITING(hdr))
5192 5968 ARCSTAT_BUMP(arcstat_l2_rw_clash);
5193 5969 spa_config_exit(spa, SCL_L2ARC, vd);
5194 5970 }
5195 5971 } else {
5196 5972 if (vd != NULL)
5197 5973 spa_config_exit(spa, SCL_L2ARC, vd);
5198 5974 if (l2arc_ndev != 0) {
5199 5975 DTRACE_PROBE1(l2arc__miss,
5200 5976 arc_buf_hdr_t *, hdr);
5201 5977 ARCSTAT_BUMP(arcstat_l2_misses);
5202 5978 }
5203 5979 }
5204 5980
5205 5981 rzio = zio_read(pio, spa, bp, hdr->b_l1hdr.b_pabd, size,
5206 5982 arc_read_done, hdr, priority, zio_flags, zb);
5207 5983
5208 5984 if (*arc_flags & ARC_FLAG_WAIT)
5209 5985 return (zio_wait(rzio));
5210 5986
5211 5987 ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
5212 5988 zio_nowait(rzio);
5213 5989 }
5214 5990 return (0);
5215 5991 }
5216 5992
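The block comment above describes the two completion modes of arc_read(). A hedged example of the synchronous mode, using the generic arc_getbuf_func() callback defined earlier (the wrapper itself is hypothetical):

    static int
    example_sync_read(spa_t *spa, const blkptr_t *bp, arc_buf_t **bufp,
        const zbookmark_phys_t *zb)
    {
            arc_flags_t aflags = ARC_FLAG_WAIT;

            /*
             * With ARC_FLAG_WAIT, arc_read() blocks until the data is
             * available; arc_getbuf_func() stores the buf in *bufp.
             */
            return (arc_read(NULL, spa, bp, arc_getbuf_func, bufp,
                ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &aflags, zb));
    }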
5217 5993 /*
5218 5994 * Notify the arc that a block was freed, and thus will never be used again.
5219 5995 */
5220 5996 void
5221 5997 arc_freed(spa_t *spa, const blkptr_t *bp)
5222 5998 {
5223 5999 arc_buf_hdr_t *hdr;
5224 6000 kmutex_t *hash_lock;
5225 6001 uint64_t guid = spa_load_guid(spa);
5226 6002
5227 6003 ASSERT(!BP_IS_EMBEDDED(bp));
5228 6004
5229 6005 hdr = buf_hash_find(guid, bp, &hash_lock);
5230 6006 if (hdr == NULL)
5231 6007 return;
5232 6008
5233 6009 /*
5234 6010 * We might be trying to free a block that is still doing I/O
5235 6011 * (i.e. prefetch) or has a reference (i.e. a dedup-ed,
5236 6012 * dmu_sync-ed block). If this block is being prefetched, then it
5237 6013 * would still have the ARC_FLAG_IO_IN_PROGRESS flag set on the hdr
5238 6014 * until the I/O completes. A block may also have a reference if it is
5239 6015 * part of a dedup-ed, dmu_synced write. The dmu_sync() function would
5240 6016 * have written the new block to its final resting place on disk but
5241 6017 * without the dedup flag set. This would have left the hdr in the MRU
5242 6018 * state and discoverable. When the txg finally syncs it detects that
5243 6019 * the block was overridden in open context and issues an override I/O.
5244 6020 * Since this is a dedup block, the override I/O will determine if the
5245 6021 * block is already in the DDT. If so, then it will replace the io_bp
5246 6022 * with the bp from the DDT and allow the I/O to finish. When the I/O
5247 6023 * reaches the done callback, dbuf_write_override_done, it will
5248 6024 * check to see if the io_bp and io_bp_override are identical.
5249 6025 * If they are not, then it indicates that the bp was replaced with
5250 6026 * the bp in the DDT and the override bp is freed. This allows
5251 6027 * us to arrive here with a reference on a block that is being
5252 6028 * freed. So if we have an I/O in progress, or a reference to
5253 6029 * this hdr, then we don't destroy the hdr.
5254 6030 */
5255 6031 if (!HDR_HAS_L1HDR(hdr) || (!HDR_IO_IN_PROGRESS(hdr) &&
5256 6032 refcount_is_zero(&hdr->b_l1hdr.b_refcnt))) {
5257 6033 arc_change_state(arc_anon, hdr, hash_lock);
5258 6034 arc_hdr_destroy(hdr);
5259 6035 mutex_exit(hash_lock);
5260 6036 } else {
5261 6037 mutex_exit(hash_lock);
5262 6038 }
5263 6039
5264 6040 }
5265 6041
5266 6042 /*
5267 6043 * Release this buffer from the cache, making it an anonymous buffer. This
5268 6044 * must be done after a read and prior to modifying the buffer contents.
5269 6045 * If the buffer has more than one reference, we must make
5270 6046 * a new hdr for the buffer.
5271 6047 */
5272 6048 void
5273 6049 arc_release(arc_buf_t *buf, void *tag)
5274 6050 {
5275 6051 arc_buf_hdr_t *hdr = buf->b_hdr;
5276 6052
5277 6053 /*
5278 6054 * It would be nice to assert that if it's DMU metadata (level >
5279 6055 * 0 || it's the dnode file), then it must be syncing context.
5280 6056 * But we don't know that information at this level.
5281 6057 */
5282 6058
5283 6059 mutex_enter(&buf->b_evict_lock);
5284 6060
5285 6061 ASSERT(HDR_HAS_L1HDR(hdr));
5286 6062
5287 6063 /*
5288 6064 * We don't grab the hash lock prior to this check, because if
5289 6065 * the buffer's header is in the arc_anon state, it won't be
5290 6066 * linked into the hash table.
5291 6067 */
5292 6068 if (hdr->b_l1hdr.b_state == arc_anon) {
5293 6069 mutex_exit(&buf->b_evict_lock);
5294 6070 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5295 6071 ASSERT(!HDR_IN_HASH_TABLE(hdr));
5296 6072 ASSERT(!HDR_HAS_L2HDR(hdr));
5297 6073 ASSERT(HDR_EMPTY(hdr));
5298 6074
5299 6075 ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
5300 6076 ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);
5301 6077 ASSERT(!list_link_active(&hdr->b_l1hdr.b_arc_node));
5302 6078
5303 6079 hdr->b_l1hdr.b_arc_access = 0;
5304 6080
5305 6081 /*
5306 6082 * If the buf is being overridden then it may already
5307 6083 * have a hdr that is not empty.
5308 6084 */
5309 6085 buf_discard_identity(hdr);
5310 6086 arc_buf_thaw(buf);
5311 6087
5312 6088 return;
5313 6089 }
5314 6090
5315 6091 kmutex_t *hash_lock = HDR_LOCK(hdr);
5316 6092 mutex_enter(hash_lock);
5317 6093
5318 6094 /*
5319 6095 * This assignment is only valid as long as the hash_lock is
5320 6096 * held, we must be careful not to reference state or the
5321 6097 * b_state field after dropping the lock.
5322 6098 */
5323 6099 arc_state_t *state = hdr->b_l1hdr.b_state;
5324 6100 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
5325 6101 ASSERT3P(state, !=, arc_anon);
5326 6102
5327 6103 /* this buffer is not on any list */
5328 6104 ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);
5329 6105
5330 6106 if (HDR_HAS_L2HDR(hdr)) {
5331 6107 mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);
5332 6108
5333 6109 /*
5334 6110 * We have to recheck this conditional again now that
5335 6111 * we're holding the l2ad_mtx to prevent a race with
5336 6112 * another thread which might be concurrently calling
5337 6113 * l2arc_evict(). In that case, l2arc_evict() might have
5338 6114 * destroyed the header's L2 portion as we were waiting
5339 6115 * to acquire the l2ad_mtx.
5340 6116 */
5341 6117 if (HDR_HAS_L2HDR(hdr))
5342 6118 arc_hdr_l2hdr_destroy(hdr);
5343 6119
5344 6120 mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);
5345 6121 }
5346 6122
5347 6123 /*
5348 6124 * Do we have more than one buf?
5349 6125 */
5350 6126 if (hdr->b_l1hdr.b_bufcnt > 1) {
5351 6127 arc_buf_hdr_t *nhdr;
5352 6128 uint64_t spa = hdr->b_spa;
5353 6129 uint64_t psize = HDR_GET_PSIZE(hdr);
5354 6130 uint64_t lsize = HDR_GET_LSIZE(hdr);
5355 6131 enum zio_compress compress = HDR_GET_COMPRESS(hdr);
5356 6132 arc_buf_contents_t type = arc_buf_type(hdr);
5357 6133 VERIFY3U(hdr->b_type, ==, type);
5358 6134
5359 6135 ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL);
5360 6136 (void) remove_reference(hdr, hash_lock, tag);
5361 6137
5362 6138 if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) {
5363 6139 ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
5364 6140 ASSERT(ARC_BUF_LAST(buf));
5365 6141 }
5366 6142
5367 6143 /*
5368 6144 * Pull the data off of this hdr and attach it to
5369 6145 * a new anonymous hdr. Also find the last buffer
5370 6146 * in the hdr's buffer list.
5371 6147 */
5372 6148 arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
5373 6149 ASSERT3P(lastbuf, !=, NULL);
5374 6150
5375 6151 /*
5376 6152 * If the current arc_buf_t and the hdr are sharing their data
5377 6153 * buffer, then we must stop sharing that block.
5378 6154 */
5379 6155 if (arc_buf_is_shared(buf)) {
5380 6156 VERIFY(!arc_buf_is_shared(lastbuf));
5381 6157
5382 6158 /*
5383 6159 * First, sever the block sharing relationship between
5384 6160 * buf and the arc_buf_hdr_t.
5385 6161 */
5386 6162 arc_unshare_buf(hdr, buf);
5387 6163
5388 6164 /*
5389 6165 * Now we need to recreate the hdr's b_pabd. Since we
5390 6166 * have lastbuf handy, we try to share with it, but if
5391 6167 * we can't then we allocate a new b_pabd and copy the
5392 6168 * data from buf into it.
5393 6169 */
5394 6170 if (arc_can_share(hdr, lastbuf)) {
5395 6171 arc_share_buf(hdr, lastbuf);
5396 6172 } else {
5397 6173 arc_hdr_alloc_pabd(hdr);
5398 6174 abd_copy_from_buf(hdr->b_l1hdr.b_pabd,
5399 6175 buf->b_data, psize);
5400 6176 }
5401 6177 VERIFY3P(lastbuf->b_data, !=, NULL);
5402 6178 } else if (HDR_SHARED_DATA(hdr)) {
5403 6179 /*
5404 6180 * Uncompressed shared buffers are always at the end
5405 6181 * of the list. Compressed buffers don't have the
5406 6182 * same requirements. This makes it hard to
5407 6183 * simply assert that the lastbuf is shared so
5408 6184 * we rely on the hdr's compression flags to determine
5409 6185 * if we have a compressed, shared buffer.
5410 6186 */
5411 6187 ASSERT(arc_buf_is_shared(lastbuf) ||
5412 6188 HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
5413 6189 ASSERT(!ARC_BUF_SHARED(buf));
5414 6190 }
5415 6191 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
5416 6192 ASSERT3P(state, !=, arc_l2c_only);
5417 6193
5418 6194 (void) refcount_remove_many(&state->arcs_size,
5419 6195 arc_buf_size(buf), buf);
5420 6196
5421 6197 if (refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
5422 6198 ASSERT3P(state, !=, arc_l2c_only);
5423 6199 (void) refcount_remove_many(&state->arcs_esize[type],
5424 6200 arc_buf_size(buf), buf);
5425 6201 }
5426 6202
5427 6203 hdr->b_l1hdr.b_bufcnt -= 1;
5428 6204 arc_cksum_verify(buf);
5429 6205 arc_buf_unwatch(buf);
5430 6206
5431 6207 mutex_exit(hash_lock);
5432 6208
5433 6209 /*
5434 6210 * Allocate a new hdr. The new hdr will contain a b_pabd
5435 6211 * buffer which will be freed in arc_write().
5436 6212 */
5437 6213 nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
5438 6214 ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
5439 6215 ASSERT0(nhdr->b_l1hdr.b_bufcnt);
5440 6216 ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
5441 6217 VERIFY3U(nhdr->b_type, ==, type);
5442 6218 ASSERT(!HDR_SHARED_DATA(nhdr));
5443 6219
5444 6220 nhdr->b_l1hdr.b_buf = buf;
5445 6221 nhdr->b_l1hdr.b_bufcnt = 1;
5446 6222 (void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
6223 + nhdr->b_l1hdr.b_krrp = 0;
6224 +
5447 6225 buf->b_hdr = nhdr;
5448 6226
5449 6227 mutex_exit(&buf->b_evict_lock);
5450 6228 (void) refcount_add_many(&arc_anon->arcs_size,
5451 6229 arc_buf_size(buf), buf);
5452 6230 } else {
5453 6231 mutex_exit(&buf->b_evict_lock);
5454 6232 ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
5455 6233 /* protected by hash lock, or hdr is on arc_anon */
5456 6234 ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
5457 6235 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5458 6236 arc_change_state(arc_anon, hdr, hash_lock);
5459 6237 hdr->b_l1hdr.b_arc_access = 0;
5460 6238 mutex_exit(hash_lock);
5461 6239
5462 6240 buf_discard_identity(hdr);
5463 6241 arc_buf_thaw(buf);
5464 6242 }
5465 6243 }
5466 6244
5467 6245 int
5468 6246 arc_released(arc_buf_t *buf)
5469 6247 {
5470 6248 int released;
5471 6249
5472 6250 mutex_enter(&buf->b_evict_lock);
5473 6251 released = (buf->b_data != NULL &&
5474 6252 buf->b_hdr->b_l1hdr.b_state == arc_anon);
5475 6253 mutex_exit(&buf->b_evict_lock);
5476 6254 return (released);
5477 6255 }
5478 6256
5479 6257 #ifdef ZFS_DEBUG
5480 6258 int
5481 6259 arc_referenced(arc_buf_t *buf)
5482 6260 {
5483 6261 int referenced;
5484 6262
5485 6263 mutex_enter(&buf->b_evict_lock);
5486 6264 referenced = (refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));
5487 6265 mutex_exit(&buf->b_evict_lock);
5488 6266 return (referenced);
5489 6267 }
5490 6268 #endif
5491 6269
5492 6270 static void
5493 6271 arc_write_ready(zio_t *zio)
5494 6272 {
5495 6273 arc_write_callback_t *callback = zio->io_private;
5496 6274 arc_buf_t *buf = callback->awcb_buf;
5497 6275 arc_buf_hdr_t *hdr = buf->b_hdr;
5498 6276 uint64_t psize = BP_IS_HOLE(zio->io_bp) ? 0 : BP_GET_PSIZE(zio->io_bp);
5499 6277
5500 6278 ASSERT(HDR_HAS_L1HDR(hdr));
5501 6279 ASSERT(!refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));
5502 6280 ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
5503 6281
5504 6282 /*
5505 6283 * If we're reexecuting this zio because the pool suspended, then
5506 6284 * cleanup any state that was previously set the first time the
5507 6285 * callback was invoked.
5508 6286 */
5509 6287 if (zio->io_flags & ZIO_FLAG_REEXECUTED) {
5510 6288 arc_cksum_free(hdr);
5511 6289 arc_buf_unwatch(buf);
5512 6290 if (hdr->b_l1hdr.b_pabd != NULL) {
5513 6291 if (arc_buf_is_shared(buf)) {
5514 6292 arc_unshare_buf(hdr, buf);
5515 6293 } else {
5516 6294 arc_hdr_free_pabd(hdr);
5517 6295 }
5518 6296 }
5519 6297 }
5520 6298 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5521 6299 ASSERT(!HDR_SHARED_DATA(hdr));
5522 6300 ASSERT(!arc_buf_is_shared(buf));
5523 6301
5524 6302 callback->awcb_ready(zio, buf, callback->awcb_private);
5525 6303
5526 6304 if (HDR_IO_IN_PROGRESS(hdr))
5527 6305 ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);
5528 6306
5529 6307 arc_cksum_compute(buf);
5530 6308 arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5531 6309
5532 6310 enum zio_compress compress;
5533 6311 if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
5534 6312 compress = ZIO_COMPRESS_OFF;
5535 6313 } else {
5536 6314 ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(zio->io_bp));
5537 6315 compress = BP_GET_COMPRESS(zio->io_bp);
5538 6316 }
5539 6317 HDR_SET_PSIZE(hdr, psize);
5540 6318 arc_hdr_set_compress(hdr, compress);
5541 6319
5542 6320
5543 6321 /*
5544 6322 * Fill the hdr with data. If the hdr is compressed, the data we want
5545 6323 * is available from the zio, otherwise we can take it from the buf.
5546 6324 *
5547 6325 * We might be able to share the buf's data with the hdr here. However,
5548 6326 * doing so would cause the ARC to be full of linear ABDs if we write a
5549 6327 * lot of shareable data. As a compromise, we check whether scattered
5550 6328 * ABDs are allowed, and assume that if they are then the user wants
5551 6329 * the ARC to be primarily filled with them regardless of the data being
5552 6330 * written. Therefore, if they're allowed then we allocate one and copy
5553 6331 * the data into it; otherwise, we share the data directly if we can.
5554 6332 */
5555 6333 if (zfs_abd_scatter_enabled || !arc_can_share(hdr, buf)) {
5556 6334 arc_hdr_alloc_pabd(hdr);
5557 6335
5558 6336 /*
5559 6337 * Ideally, we would always copy the io_abd into b_pabd, but the
5560 6338 * user may have disabled compressed ARC, thus we must check the
5561 6339 * hdr's compression setting rather than the io_bp's.
5562 6340 */
5563 6341 if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
5564 6342 ASSERT3U(BP_GET_COMPRESS(zio->io_bp), !=,
5565 6343 ZIO_COMPRESS_OFF);
5566 6344 ASSERT3U(psize, >, 0);
5567 6345
5568 6346 abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);
5569 6347 } else {
5570 6348 ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));
5571 6349
5572 6350 abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,
5573 6351 arc_buf_size(buf));
5574 6352 }
5575 6353 } else {
5576 6354 ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));
5577 6355 ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));
5578 6356 ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
5579 6357
5580 6358 arc_share_buf(hdr, buf);
5581 6359 }
5582 6360
5583 6361 arc_hdr_verify(hdr, zio->io_bp);
5584 6362 }
5585 6363
5586 6364 static void
5587 6365 arc_write_children_ready(zio_t *zio)
5588 6366 {
5589 6367 arc_write_callback_t *callback = zio->io_private;
5590 6368 arc_buf_t *buf = callback->awcb_buf;
5591 6369
5592 6370 callback->awcb_children_ready(zio, buf, callback->awcb_private);
5593 6371 }
5594 6372
5595 6373 /*
5596 6374 * The SPA calls this callback for each physical write that happens on behalf
5597 6375 * of a logical write. See the comment in dbuf_write_physdone() for details.
5598 6376 */
5599 6377 static void
5600 6378 arc_write_physdone(zio_t *zio)
5601 6379 {
5602 6380 arc_write_callback_t *cb = zio->io_private;
5603 6381 if (cb->awcb_physdone != NULL)
5604 6382 cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private);
5605 6383 }
5606 6384
5607 6385 static void
5608 6386 arc_write_done(zio_t *zio)
5609 6387 {
5610 6388 arc_write_callback_t *callback = zio->io_private;
5611 6389 arc_buf_t *buf = callback->awcb_buf;
5612 6390 arc_buf_hdr_t *hdr = buf->b_hdr;
5613 6391
5614 6392 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5615 6393
5616 6394 if (zio->io_error == 0) {
5617 6395 arc_hdr_verify(hdr, zio->io_bp);
5618 6396
5619 6397 if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
5620 6398 buf_discard_identity(hdr);
5621 6399 } else {
5622 6400 hdr->b_dva = *BP_IDENTITY(zio->io_bp);
5623 6401 hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
5624 6402 }
5625 6403 } else {
5626 6404 ASSERT(HDR_EMPTY(hdr));
5627 6405 }
5628 6406
5629 6407 /*
5630 6408 * If the block to be written was all-zero or compressed enough to be
5631 6409 * embedded in the BP, no write was performed so there will be no
5632 6410 * dva/birth/checksum. The buffer must therefore remain anonymous
5633 6411 * (and uncached).
5634 6412 */
5635 6413 if (!HDR_EMPTY(hdr)) {
5636 6414 arc_buf_hdr_t *exists;
5637 6415 kmutex_t *hash_lock;
5638 6416
5639 6417 ASSERT3U(zio->io_error, ==, 0);
5640 6418
5641 6419 arc_cksum_verify(buf);
5642 6420
5643 6421 exists = buf_hash_insert(hdr, &hash_lock);
5644 6422 if (exists != NULL) {
5645 6423 /*
5646 6424 * This can only happen if we overwrite for
5647 6425 * sync-to-convergence, because we remove
5648 6426 * buffers from the hash table when we arc_free().
5649 6427 */
5650 6428 if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
5651 6429 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5652 6430 panic("bad overwrite, hdr=%p exists=%p",
5653 6431 (void *)hdr, (void *)exists);
5654 6432 ASSERT(refcount_is_zero(
5655 6433 &exists->b_l1hdr.b_refcnt));
5656 6434 arc_change_state(arc_anon, exists, hash_lock);
5657 - mutex_exit(hash_lock);
6435 + arc_wait_for_krrp(exists);
5658 6436 arc_hdr_destroy(exists);
6437 + mutex_exit(hash_lock);
5659 6438 exists = buf_hash_insert(hdr, &hash_lock);
5660 6439 ASSERT3P(exists, ==, NULL);
5661 6440 } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
5662 6441 /* nopwrite */
5663 6442 ASSERT(zio->io_prop.zp_nopwrite);
5664 6443 if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
5665 6444 panic("bad nopwrite, hdr=%p exists=%p",
5666 6445 (void *)hdr, (void *)exists);
5667 6446 } else {
5668 6447 /* Dedup */
5669 6448 ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
5670 6449 ASSERT(hdr->b_l1hdr.b_state == arc_anon);
5671 6450 ASSERT(BP_GET_DEDUP(zio->io_bp));
5672 6451 ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
5673 6452 }
5674 6453 }
5675 6454 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5676 6455 /* if it's not anon, we are doing a scrub */
5677 6456 if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
5678 6457 arc_access(hdr, hash_lock);
5679 6458 mutex_exit(hash_lock);
5680 6459 } else {
5681 6460 arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
5682 6461 }
5683 6462
5684 6463 ASSERT(!refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
5685 6464 callback->awcb_done(zio, buf, callback->awcb_private);
5686 6465
5687 6466 abd_put(zio->io_abd);
5688 6467 kmem_free(callback, sizeof (arc_write_callback_t));
5689 6468 }
5690 6469
5691 6470 zio_t *
5692 6471 arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
5693 6472 boolean_t l2arc, const zio_prop_t *zp, arc_done_func_t *ready,
5694 6473 arc_done_func_t *children_ready, arc_done_func_t *physdone,
5695 6474 arc_done_func_t *done, void *private, zio_priority_t priority,
5696 - int zio_flags, const zbookmark_phys_t *zb)
6475 + int zio_flags, const zbookmark_phys_t *zb,
6476 + const zio_smartcomp_info_t *smartcomp)
5697 6477 {
5698 6478 arc_buf_hdr_t *hdr = buf->b_hdr;
5699 6479 arc_write_callback_t *callback;
5700 6480 zio_t *zio;
5701 6481 zio_prop_t localprop = *zp;
5702 6482
5703 6483 ASSERT3P(ready, !=, NULL);
5704 6484 ASSERT3P(done, !=, NULL);
5705 6485 ASSERT(!HDR_IO_ERROR(hdr));
5706 6486 ASSERT(!HDR_IO_IN_PROGRESS(hdr));
5707 6487 ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
5708 6488 ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
5709 6489 if (l2arc)
5710 6490 arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
5711 6491 if (ARC_BUF_COMPRESSED(buf)) {
5712 6492 /*
5713 6493 * We're writing a pre-compressed buffer. Make the
5714 6494 * compression algorithm requested by the zio_prop_t match
5715 6495 * the pre-compressed buffer's compression algorithm.
5716 6496 */
5717 6497 localprop.zp_compress = HDR_GET_COMPRESS(hdr);
5718 6498
5719 6499 ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
5720 6500 zio_flags |= ZIO_FLAG_RAW;
5721 6501 }
5722 6502 callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
5723 6503 callback->awcb_ready = ready;
5724 6504 callback->awcb_children_ready = children_ready;
5725 6505 callback->awcb_physdone = physdone;
5726 6506 callback->awcb_done = done;
5727 6507 callback->awcb_private = private;
5728 6508 callback->awcb_buf = buf;
5729 6509
5730 6510 /*
5731 6511 * The hdr's b_pabd is now stale, free it now. A new data block
5732 6512 * will be allocated when the zio pipeline calls arc_write_ready().
5733 6513 */
5734 6514 if (hdr->b_l1hdr.b_pabd != NULL) {
5735 6515 /*
5736 6516 * If the buf is currently sharing the data block with
5737 6517 * the hdr then we need to break that relationship here.
5738 6518 * The hdr will remain with a NULL data pointer and the
5739 6519 * buf will take sole ownership of the block.
5740 6520 */
5741 6521 if (arc_buf_is_shared(buf)) {
5742 6522 arc_unshare_buf(hdr, buf);
5743 6523 } else {
5744 6524 arc_hdr_free_pabd(hdr);
5745 6525 }
5746 6526 VERIFY3P(buf->b_data, !=, NULL);
5747 6527 arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
5748 6528 }
5749 6529 ASSERT(!arc_buf_is_shared(buf));
5750 6530 ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
5751 6531
5752 6532 zio = zio_write(pio, spa, txg, bp,
5753 6533 abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
5754 6534 HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
5755 6535 (children_ready != NULL) ? arc_write_children_ready : NULL,
5756 6536 arc_write_physdone, arc_write_done, callback,
5757 - priority, zio_flags, zb);
6537 + priority, zio_flags, zb, smartcomp);
5758 6538
5759 6539 return (zio);
5760 6540 }
5761 6541
5762 6542 static int
5763 6543 arc_memory_throttle(uint64_t reserve, uint64_t txg)
5764 6544 {
5765 6545 #ifdef _KERNEL
5766 6546 uint64_t available_memory = ptob(freemem);
5767 6547 static uint64_t page_load = 0;
5768 6548 static uint64_t last_txg = 0;
5769 6549
5770 6550 #if defined(__i386)
5771 6551 available_memory =
5772 6552 MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
5773 6553 #endif
5774 6554
5775 6555 if (freemem > physmem * arc_lotsfree_percent / 100)
5776 6556 return (0);
5777 6557
5778 6558 if (txg > last_txg) {
5779 6559 last_txg = txg;
5780 6560 page_load = 0;
5781 6561 }
5782 6562 /*
5783 6563 * If we are in pageout, we know that memory is already tight,
5784 6564 * the arc is already going to be evicting, so we just want to
5785 6565 * continue to let page writes occur as quickly as possible.
5786 6566 */
5787 6567 if (curproc == proc_pageout) {
5788 6568 if (page_load > MAX(ptob(minfree), available_memory) / 4)
5789 6569 return (SET_ERROR(ERESTART));
5790 6570 /* Note: reserve is inflated, so we deflate */
5791 6571 page_load += reserve / 8;
5792 6572 return (0);
5793 6573 } else if (page_load > 0 && arc_reclaim_needed()) {
5794 6574 /* memory is low, delay before restarting */
5795 6575 ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
5796 6576 return (SET_ERROR(EAGAIN));
5797 6577 }
5798 6578 page_load = 0;
5799 6579 #endif
5800 6580 return (0);
5801 6581 }
5802 6582
5803 6583 void
5804 6584 arc_tempreserve_clear(uint64_t reserve)
5805 6585 {
5806 6586 atomic_add_64(&arc_tempreserve, -reserve);
5807 6587 ASSERT((int64_t)arc_tempreserve >= 0);
5808 6588 }
5809 6589
5810 6590 int
5811 6591 arc_tempreserve_space(uint64_t reserve, uint64_t txg)
5812 6592 {
5813 6593 int error;
5814 6594 uint64_t anon_size;
5815 6595
5816 6596 if (reserve > arc_c/4 && !arc_no_grow)
5817 6597 arc_c = MIN(arc_c_max, reserve * 4);
5818 6598 if (reserve > arc_c)
5819 6599 return (SET_ERROR(ENOMEM));
5820 6600
5821 6601 /*
5822 6602 * Don't count loaned bufs as in flight dirty data to prevent long
5823 6603 * network delays from blocking transactions that are ready to be
5824 6604 * assigned to a txg.
5825 6605 */
5826 6606
5827 6607 /* assert that it has not wrapped around */
5828 6608 ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
5829 6609
5830 6610 anon_size = MAX((int64_t)(refcount_count(&arc_anon->arcs_size) -
5831 6611 arc_loaned_bytes), 0);
5832 6612
5833 6613 /*
5834 6614 * Writes will, almost always, require additional memory allocations
5835 6615 * in order to compress/encrypt/etc the data. We therefore need to
5836 6616 * make sure that there is sufficient available memory for this.
5837 6617 */
5838 6618 error = arc_memory_throttle(reserve, txg);
5839 6619 if (error != 0)
5840 6620 return (error);
5841 6621
5842 6622 /*
5843 6623 * Throttle writes when the amount of dirty data in the cache
5844 6624 * gets too large. We try to keep the cache less than half full
5845 6625 * of dirty blocks so that our sync times don't grow too large.
5846 6626 * Note: if two requests come in concurrently, we might let them
5847 6627 * both succeed, when one of them should fail. Not a huge deal.
5848 6628 */
5849 -
5850 6629 if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
5851 6630 anon_size > arc_c / 4) {
6631 + DTRACE_PROBE4(arc__tempreserve__space__throttle, uint64_t,
6632 + arc_tempreserve, arc_state_t *, arc_anon, uint64_t,
6633 + reserve, uint64_t, arc_c);
6634 +
5852 6635 uint64_t meta_esize =
5853 6636 refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
5854 6637 uint64_t data_esize =
5855 6638 refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
5856 6639 dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
5857 6640 "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
5858 6641 arc_tempreserve >> 10, meta_esize >> 10,
5859 6642 data_esize >> 10, reserve >> 10, arc_c >> 10);
5860 6643 return (SET_ERROR(ERESTART));
5861 6644 }
5862 6645 atomic_add_64(&arc_tempreserve, reserve);
5863 6646 return (0);
5864 6647 }
5865 6648
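The dirty-data throttle above only fires when both clauses hold. A worked example with assumed numbers, written as a C comment:

    /*
     * Worked example of the throttle in arc_tempreserve_space() (assumed
     * values, for illustration only):
     *
     *   arc_c           = 8 GiB
     *   anon_size       = 2.5 GiB     (> arc_c / 4 = 2 GiB)
     *   arc_tempreserve = 1 GiB
     *   reserve         = 0.7 GiB
     *
     *   reserve + arc_tempreserve + anon_size = 4.2 GiB > arc_c / 2 = 4 GiB
     *
     * Both conditions hold, so the caller receives ERESTART and retries
     * once some of the anonymous (dirty) data has been written out.
     */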
5866 6649 static void
5867 6650 arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
5868 - kstat_named_t *evict_data, kstat_named_t *evict_metadata)
6651 + kstat_named_t *evict_data, kstat_named_t *evict_metadata,
6652 + kstat_named_t *evict_ddt)
5869 6653 {
5870 6654 size->value.ui64 = refcount_count(&state->arcs_size);
5871 6655 evict_data->value.ui64 =
5872 6656 refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
5873 6657 evict_metadata->value.ui64 =
5874 6658 refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
6659 + evict_ddt->value.ui64 =
6660 + refcount_count(&state->arcs_esize[ARC_BUFC_DDT]);
5875 6661 }
5876 6662
5877 6663 static int
5878 6664 arc_kstat_update(kstat_t *ksp, int rw)
5879 6665 {
5880 6666 arc_stats_t *as = ksp->ks_data;
5881 6667
5882 6668 if (rw == KSTAT_WRITE) {
5883 6669 return (EACCES);
5884 6670 } else {
5885 6671 arc_kstat_update_state(arc_anon,
5886 6672 &as->arcstat_anon_size,
5887 6673 &as->arcstat_anon_evictable_data,
5888 - &as->arcstat_anon_evictable_metadata);
6674 + &as->arcstat_anon_evictable_metadata,
6675 + &as->arcstat_anon_evictable_ddt);
5889 6676 arc_kstat_update_state(arc_mru,
5890 6677 &as->arcstat_mru_size,
5891 6678 &as->arcstat_mru_evictable_data,
5892 - &as->arcstat_mru_evictable_metadata);
6679 + &as->arcstat_mru_evictable_metadata,
6680 + &as->arcstat_mru_evictable_ddt);
5893 6681 arc_kstat_update_state(arc_mru_ghost,
5894 6682 &as->arcstat_mru_ghost_size,
5895 6683 &as->arcstat_mru_ghost_evictable_data,
5896 - &as->arcstat_mru_ghost_evictable_metadata);
6684 + &as->arcstat_mru_ghost_evictable_metadata,
6685 + &as->arcstat_mru_ghost_evictable_ddt);
5897 6686 arc_kstat_update_state(arc_mfu,
5898 6687 &as->arcstat_mfu_size,
5899 6688 &as->arcstat_mfu_evictable_data,
5900 - &as->arcstat_mfu_evictable_metadata);
6689 + &as->arcstat_mfu_evictable_metadata,
6690 + &as->arcstat_mfu_evictable_ddt);
5901 6691 arc_kstat_update_state(arc_mfu_ghost,
5902 6692 &as->arcstat_mfu_ghost_size,
5903 6693 &as->arcstat_mfu_ghost_evictable_data,
5904 - &as->arcstat_mfu_ghost_evictable_metadata);
5905 -
5906 - ARCSTAT(arcstat_size) = aggsum_value(&arc_size);
5907 - ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used);
5908 - ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size);
5909 - ARCSTAT(arcstat_metadata_size) =
5910 - aggsum_value(&astat_metadata_size);
5911 - ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size);
5912 - ARCSTAT(arcstat_other_size) = aggsum_value(&astat_other_size);
5913 - ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size);
6694 + &as->arcstat_mfu_ghost_evictable_metadata,
6695 + &as->arcstat_mfu_ghost_evictable_ddt);
5914 6696 }
5915 6697
5916 6698 return (0);
5917 6699 }
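/*
 * The new *_evictable_ddt counters filled in above are exported through
 * the same "arcstats" kstat as the existing evictable data/metadata
 * counters, so a plain "kstat -m zfs -n arcstats" from userland should
 * show them alongside the rest of the ARC statistics.
 */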
5918 6700
5919 6701 /*
5920 6702 * This function *must* return indices evenly distributed between all
5921 6703 * sublists of the multilist. This is needed due to how the ARC eviction
5922 6704 * code is laid out; arc_evict_state() assumes ARC buffers are evenly
5923 6705 * distributed between all sublists and uses this assumption when
5924 6706 * deciding which sublist to evict from and how much to evict from it.
5925 6707 */
5926 6708 unsigned int
5927 6709 arc_state_multilist_index_func(multilist_t *ml, void *obj)
5928 6710 {
5929 6711 arc_buf_hdr_t *hdr = obj;
5930 6712
5931 6713 /*
5932 6714 * We rely on b_dva to generate evenly distributed index
5933 6715 * numbers using buf_hash below. So, as an added precaution,
5934 6716 * let's make sure we never add empty buffers to the arc lists.
5935 6717 */
5936 6718 ASSERT(!HDR_EMPTY(hdr));
5937 6719
5938 6720 /*
5939 6721 	 * The assumption here is that the hash value for a given
5940 6722 	 * arc_buf_hdr_t will remain constant throughout its lifetime
5941 6723 	 * (i.e. its b_spa, b_dva, and b_birth fields don't change).
5942 6724 * Thus, we don't need to store the header's sublist index
5943 6725 * on insertion, as this index can be recalculated on removal.
5944 6726 *
5945 6727 * Also, the low order bits of the hash value are thought to be
5946 6728 * distributed evenly. Otherwise, in the case that the multilist
5947 6729 	 * has a power of two number of sublists, each sublist's usage
5948 6730 * would not be evenly distributed.
5949 6731 */
5950 6732 return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
5951 6733 multilist_get_num_sublists(ml));
5952 6734 }
5953 6735
5954 6736 static void
5955 6737 arc_state_init(void)
5956 6738 {
5957 6739 arc_anon = &ARC_anon;
5958 6740 arc_mru = &ARC_mru;
5959 6741 arc_mru_ghost = &ARC_mru_ghost;
5960 6742 arc_mfu = &ARC_mfu;
5961 6743 arc_mfu_ghost = &ARC_mfu_ghost;
5962 6744 arc_l2c_only = &ARC_l2c_only;
6745 + arc_buf_contents_t arcs;
5963 6746
5964 - arc_mru->arcs_list[ARC_BUFC_METADATA] =
5965 - multilist_create(sizeof (arc_buf_hdr_t),
5966 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5967 - arc_state_multilist_index_func);
5968 - arc_mru->arcs_list[ARC_BUFC_DATA] =
5969 - multilist_create(sizeof (arc_buf_hdr_t),
5970 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5971 - arc_state_multilist_index_func);
5972 - arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] =
5973 - multilist_create(sizeof (arc_buf_hdr_t),
5974 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5975 - arc_state_multilist_index_func);
5976 - arc_mru_ghost->arcs_list[ARC_BUFC_DATA] =
5977 - multilist_create(sizeof (arc_buf_hdr_t),
5978 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5979 - arc_state_multilist_index_func);
5980 - arc_mfu->arcs_list[ARC_BUFC_METADATA] =
5981 - multilist_create(sizeof (arc_buf_hdr_t),
5982 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5983 - arc_state_multilist_index_func);
5984 - arc_mfu->arcs_list[ARC_BUFC_DATA] =
5985 - multilist_create(sizeof (arc_buf_hdr_t),
5986 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5987 - arc_state_multilist_index_func);
5988 - arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] =
5989 - multilist_create(sizeof (arc_buf_hdr_t),
5990 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5991 - arc_state_multilist_index_func);
5992 - arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] =
5993 - multilist_create(sizeof (arc_buf_hdr_t),
5994 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5995 - arc_state_multilist_index_func);
5996 - arc_l2c_only->arcs_list[ARC_BUFC_METADATA] =
5997 - multilist_create(sizeof (arc_buf_hdr_t),
5998 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
5999 - arc_state_multilist_index_func);
6000 - arc_l2c_only->arcs_list[ARC_BUFC_DATA] =
6001 - multilist_create(sizeof (arc_buf_hdr_t),
6002 - offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6003 - arc_state_multilist_index_func);
6747 + for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
6748 + arc_mru->arcs_list[arcs] =
6749 + multilist_create(sizeof (arc_buf_hdr_t),
6750 + offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6751 + arc_state_multilist_index_func);
6752 + arc_mru_ghost->arcs_list[arcs] =
6753 + multilist_create(sizeof (arc_buf_hdr_t),
6754 + offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6755 + arc_state_multilist_index_func);
6756 + arc_mfu->arcs_list[arcs] =
6757 + multilist_create(sizeof (arc_buf_hdr_t),
6758 + offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6759 + arc_state_multilist_index_func);
6760 + arc_mfu_ghost->arcs_list[arcs] =
6761 + multilist_create(sizeof (arc_buf_hdr_t),
6762 + offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6763 + arc_state_multilist_index_func);
6764 + arc_l2c_only->arcs_list[arcs] =
6765 + multilist_create(sizeof (arc_buf_hdr_t),
6766 + offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
6767 + arc_state_multilist_index_func);
6004 6768
6005 - refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6006 - refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6007 - refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6008 - refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6009 - refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6010 - refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6011 - refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6012 - refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6013 - refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6014 - refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6015 - refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6016 - refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
6769 + refcount_create(&arc_anon->arcs_esize[arcs]);
6770 + refcount_create(&arc_mru->arcs_esize[arcs]);
6771 + refcount_create(&arc_mru_ghost->arcs_esize[arcs]);
6772 + refcount_create(&arc_mfu->arcs_esize[arcs]);
6773 + refcount_create(&arc_mfu_ghost->arcs_esize[arcs]);
6774 + refcount_create(&arc_l2c_only->arcs_esize[arcs]);
6775 + }
6017 6776
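	/*
	 * Taskq used for asynchronous flush work, e.g. evicting all of an
	 * L2ARC device's buffers at pool export time (see l2arc_evict()).
	 * TASKQ_DYNAMIC lets it grow on demand up to max_ncpus worker
	 * threads, with at most zfs_flush_ntasks task entries.
	 */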
6777 + arc_flush_taskq = taskq_create("arc_flush_tq",
6778 + max_ncpus, minclsyspri, 1, zfs_flush_ntasks, TASKQ_DYNAMIC);
6779 +
6018 6780 refcount_create(&arc_anon->arcs_size);
6019 6781 refcount_create(&arc_mru->arcs_size);
6020 6782 refcount_create(&arc_mru_ghost->arcs_size);
6021 6783 refcount_create(&arc_mfu->arcs_size);
6022 6784 refcount_create(&arc_mfu_ghost->arcs_size);
6023 6785 refcount_create(&arc_l2c_only->arcs_size);
6024 -
6025 - aggsum_init(&arc_meta_used, 0);
6026 - aggsum_init(&arc_size, 0);
6027 - aggsum_init(&astat_data_size, 0);
6028 - aggsum_init(&astat_metadata_size, 0);
6029 - aggsum_init(&astat_hdr_size, 0);
6030 - aggsum_init(&astat_other_size, 0);
6031 - aggsum_init(&astat_l2_hdr_size, 0);
6032 6786 }
6033 6787
6034 6788 static void
6035 6789 arc_state_fini(void)
6036 6790 {
6037 - refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
6038 - refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
6039 - refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
6040 - refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
6041 - refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
6042 - refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
6043 - refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
6044 - refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
6045 - refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
6046 - refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
6047 - refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
6048 - refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
6791 + arc_buf_contents_t arcs;
6049 6792
6050 6793 refcount_destroy(&arc_anon->arcs_size);
6051 6794 refcount_destroy(&arc_mru->arcs_size);
6052 6795 refcount_destroy(&arc_mru_ghost->arcs_size);
6053 6796 refcount_destroy(&arc_mfu->arcs_size);
6054 6797 refcount_destroy(&arc_mfu_ghost->arcs_size);
6055 6798 refcount_destroy(&arc_l2c_only->arcs_size);
6056 6799
6057 - multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]);
6058 - multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
6059 - multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]);
6060 - multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
6061 - multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]);
6062 - multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
6063 - multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]);
6064 - multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
6800 + for (arcs = ARC_BUFC_DATA; arcs < ARC_BUFC_NUMTYPES; ++arcs) {
6801 + multilist_destroy(arc_mru->arcs_list[arcs]);
6802 + multilist_destroy(arc_mru_ghost->arcs_list[arcs]);
6803 + multilist_destroy(arc_mfu->arcs_list[arcs]);
6804 + multilist_destroy(arc_mfu_ghost->arcs_list[arcs]);
6805 + multilist_destroy(arc_l2c_only->arcs_list[arcs]);
6806 +
6807 + refcount_destroy(&arc_anon->arcs_esize[arcs]);
6808 + refcount_destroy(&arc_mru->arcs_esize[arcs]);
6809 + refcount_destroy(&arc_mru_ghost->arcs_esize[arcs]);
6810 + refcount_destroy(&arc_mfu->arcs_esize[arcs]);
6811 + refcount_destroy(&arc_mfu_ghost->arcs_esize[arcs]);
6812 + refcount_destroy(&arc_l2c_only->arcs_esize[arcs]);
6813 + }
6065 6814 }
6066 6815
6067 6816 uint64_t
6068 6817 arc_max_bytes(void)
6069 6818 {
6070 6819 return (arc_c_max);
6071 6820 }
6072 6821
6073 6822 void
6074 6823 arc_init(void)
6075 6824 {
6076 6825 /*
6077 6826 * allmem is "all memory that we could possibly use".
6078 6827 */
6079 6828 #ifdef _KERNEL
6080 6829 uint64_t allmem = ptob(physmem - swapfs_minfree);
6081 6830 #else
6082 6831 uint64_t allmem = (physmem * PAGESIZE) / 2;
6083 6832 #endif
6084 6833
6085 6834 mutex_init(&arc_reclaim_lock, NULL, MUTEX_DEFAULT, NULL);
6086 6835 cv_init(&arc_reclaim_thread_cv, NULL, CV_DEFAULT, NULL);
6087 6836 cv_init(&arc_reclaim_waiters_cv, NULL, CV_DEFAULT, NULL);
6088 6837
6089 6838 /* Convert seconds to clock ticks */
6090 6839 arc_min_prefetch_lifespan = 1 * hz;
6091 6840
6092 6841 /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
6093 6842 arc_c_min = MAX(allmem / 32, 64 << 20);
6094 6843 /* set max to 3/4 of all memory, or all but 1GB, whichever is more */
6095 6844 if (allmem >= 1 << 30)
6096 6845 arc_c_max = allmem - (1 << 30);
6097 6846 else
6098 6847 arc_c_max = arc_c_min;
6099 6848 arc_c_max = MAX(allmem * 3 / 4, arc_c_max);
6100 6849
6101 6850 /*
6102 6851 * In userland, there's only the memory pressure that we artificially
6103 6852 * create (see arc_available_memory()). Don't let arc_c get too
6104 6853 * small, because it can cause transactions to be larger than
6105 6854 * arc_c, causing arc_tempreserve_space() to fail.
6106 6855 */
6107 6856 #ifndef _KERNEL
6108 6857 arc_c_min = arc_c_max / 2;
6109 6858 #endif
6110 6859
6111 6860 /*
6112 6861 * Allow the tunables to override our calculations if they are
6113 6862 	 * reasonable (i.e. over 64MB)
6114 6863 */
6115 6864 if (zfs_arc_max > 64 << 20 && zfs_arc_max < allmem) {
6116 6865 arc_c_max = zfs_arc_max;
6117 6866 arc_c_min = MIN(arc_c_min, arc_c_max);
6118 6867 }
6119 6868 if (zfs_arc_min > 64 << 20 && zfs_arc_min <= arc_c_max)
6120 6869 arc_c_min = zfs_arc_min;
6121 6870
6122 6871 arc_c = arc_c_max;
6123 6872 arc_p = (arc_c >> 1);
6873 + arc_size = 0;
6124 6874
6875 + /* limit ddt meta-data to 1/4 of the arc capacity */
6876 + arc_ddt_limit = arc_c_max / 4;
6125 6877 /* limit meta-data to 1/4 of the arc capacity */
6126 6878 arc_meta_limit = arc_c_max / 4;
6127 6879
6128 6880 #ifdef _KERNEL
6129 6881 /*
6130 6882 * Metadata is stored in the kernel's heap. Don't let us
6131 6883 * use more than half the heap for the ARC.
6132 6884 */
6133 6885 arc_meta_limit = MIN(arc_meta_limit,
6134 6886 vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 2);
6135 6887 #endif
6136 6888
6137 6889 /* Allow the tunable to override if it is reasonable */
6890 + if (zfs_arc_ddt_limit > 0 && zfs_arc_ddt_limit <= arc_c_max)
6891 + arc_ddt_limit = zfs_arc_ddt_limit;
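	/*
	 * arc_ddt_evict_threshold selects which limit governs DDT buffer
	 * eviction: the dedicated arc_ddt_limit when DDT segregation is
	 * enabled via zfs_arc_segregate_ddt, otherwise the general
	 * arc_meta_limit.
	 */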
6892 + arc_ddt_evict_threshold =
6893 + zfs_arc_segregate_ddt ? &arc_ddt_limit : &arc_meta_limit;
6894 +
6895 + /* Allow the tunable to override if it is reasonable */
6138 6896 if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
6139 6897 arc_meta_limit = zfs_arc_meta_limit;
6140 6898
6141 6899 if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
6142 6900 arc_c_min = arc_meta_limit / 2;
6143 6901
6144 6902 if (zfs_arc_meta_min > 0) {
6145 6903 arc_meta_min = zfs_arc_meta_min;
6146 6904 } else {
6147 6905 arc_meta_min = arc_c_min / 2;
6148 6906 }
6149 6907
6150 6908 if (zfs_arc_grow_retry > 0)
6151 6909 arc_grow_retry = zfs_arc_grow_retry;
6152 6910
6153 6911 if (zfs_arc_shrink_shift > 0)
6154 6912 arc_shrink_shift = zfs_arc_shrink_shift;
6155 6913
6156 6914 /*
6157 6915 * Ensure that arc_no_grow_shift is less than arc_shrink_shift.
6158 6916 */
6159 6917 if (arc_no_grow_shift >= arc_shrink_shift)
6160 6918 arc_no_grow_shift = arc_shrink_shift - 1;
6161 6919
6162 6920 if (zfs_arc_p_min_shift > 0)
6163 6921 arc_p_min_shift = zfs_arc_p_min_shift;
6164 6922
6165 6923 	/* if kmem_flags are set, let's try to use less memory */
6166 6924 if (kmem_debugging())
6167 6925 arc_c = arc_c / 2;
6168 6926 if (arc_c < arc_c_min)
6169 6927 arc_c = arc_c_min;
6170 6928
6171 6929 arc_state_init();
6172 6930 buf_init();
6173 6931
6174 6932 arc_reclaim_thread_exit = B_FALSE;
6175 6933
6176 6934 arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
6177 6935 sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
6178 6936
6179 6937 if (arc_ksp != NULL) {
6180 6938 arc_ksp->ks_data = &arc_stats;
6181 6939 arc_ksp->ks_update = arc_kstat_update;
6182 6940 kstat_install(arc_ksp);
6183 6941 }
6184 6942
6185 6943 (void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
6186 6944 TS_RUN, minclsyspri);
6187 6945
6188 6946 arc_dead = B_FALSE;
6189 6947 arc_warm = B_FALSE;
6190 6948
6191 6949 /*
6192 6950 * Calculate maximum amount of dirty data per pool.
6193 6951 *
6194 6952 * If it has been set by /etc/system, take that.
6195 6953 * Otherwise, use a percentage of physical memory defined by
6196 6954 * zfs_dirty_data_max_percent (default 10%) with a cap at
6197 6955 * zfs_dirty_data_max_max (default 4GB).
6198 6956 */
6199 6957 if (zfs_dirty_data_max == 0) {
6200 6958 zfs_dirty_data_max = physmem * PAGESIZE *
6201 6959 zfs_dirty_data_max_percent / 100;
6202 6960 zfs_dirty_data_max = MIN(zfs_dirty_data_max,
6203 6961 zfs_dirty_data_max_max);
6204 6962 }
6205 6963 }
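/*
 * For reference, the DDT-related tunables consumed above can be overridden
 * the same way as the existing ARC tunables, i.e. from /etc/system. A
 * purely illustrative example (the values are hypothetical):
 *
 *	set zfs:zfs_arc_ddt_limit = 0x40000000
 *	set zfs:zfs_arc_segregate_ddt = 1
 *
 * As with zfs_arc_meta_limit, an out-of-range zfs_arc_ddt_limit is ignored
 * and the default of one quarter of arc_c_max is used instead.
 */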
6206 6964
6207 6965 void
6208 6966 arc_fini(void)
6209 6967 {
6210 6968 mutex_enter(&arc_reclaim_lock);
6211 6969 arc_reclaim_thread_exit = B_TRUE;
6212 6970 /*
6213 6971 * The reclaim thread will set arc_reclaim_thread_exit back to
6214 6972 * B_FALSE when it is finished exiting; we're waiting for that.
6215 6973 */
6216 6974 while (arc_reclaim_thread_exit) {
6217 6975 cv_signal(&arc_reclaim_thread_cv);
6218 6976 cv_wait(&arc_reclaim_thread_cv, &arc_reclaim_lock);
6219 6977 }
6220 6978 mutex_exit(&arc_reclaim_lock);
6221 6979
6222 6980 /* Use B_TRUE to ensure *all* buffers are evicted */
6223 6981 arc_flush(NULL, B_TRUE);
6224 6982
6225 6983 arc_dead = B_TRUE;
6226 6984
6227 6985 if (arc_ksp != NULL) {
6228 6986 kstat_delete(arc_ksp);
6229 6987 arc_ksp = NULL;
6230 6988 }
6231 6989
6990 + taskq_destroy(arc_flush_taskq);
6991 +
6232 6992 mutex_destroy(&arc_reclaim_lock);
6233 6993 cv_destroy(&arc_reclaim_thread_cv);
6234 6994 cv_destroy(&arc_reclaim_waiters_cv);
6235 6995
6236 6996 arc_state_fini();
6237 6997 buf_fini();
6238 6998
6239 6999 ASSERT0(arc_loaned_bytes);
6240 7000 }
6241 7001
6242 7002 /*
6243 7003 * Level 2 ARC
6244 7004 *
6245 7005 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
6246 7006 * It uses dedicated storage devices to hold cached data, which are populated
6247 7007 * using large infrequent writes. The main role of this cache is to boost
6248 7008 * the performance of random read workloads. The intended L2ARC devices
6249 7009 * include short-stroked disks, solid state disks, and other media with
6250 7010 * substantially faster read latency than disk.
6251 7011 *
6252 7012 * +-----------------------+
6253 7013 * | ARC |
6254 7014 * +-----------------------+
6255 7015 * | ^ ^
6256 7016 * | | |
6257 7017 * l2arc_feed_thread() arc_read()
6258 7018 * | | |
6259 7019 * | l2arc read |
6260 7020 * V | |
6261 7021 * +---------------+ |
6262 7022 * | L2ARC | |
6263 7023 * +---------------+ |
6264 7024 * | ^ |
6265 7025 * l2arc_write() | |
6266 7026 * | | |
6267 7027 * V | |
6268 7028 * +-------+ +-------+
6269 7029 * | vdev | | vdev |
6270 7030 * | cache | | cache |
6271 7031 * +-------+ +-------+
6272 7032 * +=========+ .-----.
6273 7033 * : L2ARC : |-_____-|
6274 7034 * : devices : | Disks |
6275 7035 * +=========+ `-_____-'
6276 7036 *
6277 7037 * Read requests are satisfied from the following sources, in order:
6278 7038 *
6279 7039 * 1) ARC
6280 7040 * 2) vdev cache of L2ARC devices
6281 7041 * 3) L2ARC devices
6282 7042 * 4) vdev cache of disks
6283 7043 * 5) disks
6284 7044 *
6285 7045 * Some L2ARC device types exhibit extremely slow write performance.
6286 7046  * To accommodate this, there are some significant differences between
6287 7047 * the L2ARC and traditional cache design:
6288 7048 *
6289 7049 * 1. There is no eviction path from the ARC to the L2ARC. Evictions from
6290 7050 * the ARC behave as usual, freeing buffers and placing headers on ghost
6291 7051 * lists. The ARC does not send buffers to the L2ARC during eviction as
6292 7052 * this would add inflated write latencies for all ARC memory pressure.
6293 7053 *
6294 7054 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
6295 7055 * It does this by periodically scanning buffers from the eviction-end of
6296 7056 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
6297 7057 * not already there. It scans until a headroom of buffers is satisfied,
6298 7058 * which itself is a buffer for ARC eviction. If a compressible buffer is
6299 7059 * found during scanning and selected for writing to an L2ARC device, we
6300 7060 * temporarily boost scanning headroom during the next scan cycle to make
6301 7061 * sure we adapt to compression effects (which might significantly reduce
6302 7062 * the data volume we write to L2ARC). The thread that does this is
6303 7063 * l2arc_feed_thread(), illustrated below; example sizes are included to
6304 7064 * provide a better sense of ratio than this diagram:
6305 7065 *
6306 7066 * head --> tail
6307 7067 * +---------------------+----------+
6308 7068 * ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC
6309 7069 * +---------------------+----------+ | o L2ARC eligible
6310 7070 * ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer
6311 7071 * +---------------------+----------+ |
6312 7072 * 15.9 Gbytes ^ 32 Mbytes |
6313 7073 * headroom |
6314 7074 * l2arc_feed_thread()
6315 7075 * |
6316 7076 * l2arc write hand <--[oooo]--'
6317 7077 * | 8 Mbyte
6318 7078 * | write max
6319 7079 * V
6320 7080 * +==============================+
6321 7081 * L2ARC dev |####|#|###|###| |####| ... |
6322 7082 * +==============================+
6323 7083 * 32 Gbytes
6324 7084 *
6325 7085 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
6326 7086 * evicted, then the L2ARC has cached a buffer much sooner than it probably
6327 7087 * needed to, potentially wasting L2ARC device bandwidth and storage. It is
6328 7088 * safe to say that this is an uncommon case, since buffers at the end of
6329 7089 * the ARC lists have moved there due to inactivity.
6330 7090 *
6331 7091 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
6332 7092 * then the L2ARC simply misses copying some buffers. This serves as a
6333 7093 * pressure valve to prevent heavy read workloads from both stalling the ARC
6334 7094 * with waits and clogging the L2ARC with writes. This also helps prevent
6335 7095 * the potential for the L2ARC to churn if it attempts to cache content too
6336 7096 * quickly, such as during backups of the entire pool.
6337 7097 *
6338 7098 * 5. After system boot and before the ARC has filled main memory, there are
6339 7099 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
6340 7100 * lists can remain mostly static. Instead of searching from tail of these
6341 7101 * lists as pictured, the l2arc_feed_thread() will search from the list heads
6342 7102 * for eligible buffers, greatly increasing its chance of finding them.
6343 7103 *
6344 7104 * The L2ARC device write speed is also boosted during this time so that
6345 7105 * the L2ARC warms up faster. Since there have been no ARC evictions yet,
6346 7106 * there are no L2ARC reads, and no fear of degrading read performance
6347 7107 * through increased writes.
6348 7108 *
6349 7109 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
6350 7110 * the vdev queue can aggregate them into larger and fewer writes. Each
6351 7111 * device is written to in a rotor fashion, sweeping writes through
6352 7112 * available space then repeating.
6353 7113 *
6354 7114 * 7. The L2ARC does not store dirty content. It never needs to flush
6355 7115 * write buffers back to disk based storage.
6356 7116 *
6357 7117 * 8. If an ARC buffer is written (and dirtied) which also exists in the
6358 7118 * L2ARC, the now stale L2ARC buffer is immediately dropped.
6359 7119 *
6360 7120 * The performance of the L2ARC can be tweaked by a number of tunables, which
6361 7121 * may be necessary for different workloads:
6362 7122 *
6363 7123 * l2arc_write_max max write bytes per interval
6364 7124 * l2arc_write_boost extra write bytes during device warmup
6365 7125 * l2arc_noprefetch skip caching prefetched buffers
6366 7126 * l2arc_headroom number of max device writes to precache
6367 7127 * l2arc_headroom_boost when we find compressed buffers during ARC
6368 7128 * scanning, we multiply headroom by this
6369 7129 * percentage factor for the next scan cycle,
6370 7130 * since more compressed buffers are likely to
6371 7131 * be present
6372 7132 * l2arc_feed_secs seconds between L2ARC writing
6373 7133 *
6374 7134 * Tunables may be removed or added as future performance improvements are
6375 7135 * integrated, and also may become zpool properties.
6376 7136 *
6377 7137 * There are three key functions that control how the L2ARC warms up:
6378 7138 *
6379 7139 * l2arc_write_eligible() check if a buffer is eligible to cache
6380 7140 * l2arc_write_size() calculate how much to write
6381 7141 * l2arc_write_interval() calculate sleep delay between writes
6382 7142 *
6383 7143 * These three functions determine what to write, how much, and how quickly
6384 7144 * to send writes.
7145 + *
7146 + * L2ARC persistency:
7147 + *
7148 + * When writing buffers to L2ARC, we periodically add some metadata to
7149 + * make sure we can pick them up after reboot, thus dramatically reducing
7150 + * the impact that any downtime has on the performance of storage systems
7151 + * with large caches.
7152 + *
7153 + * The implementation works fairly simply by integrating the following two
7154 + * modifications:
7155 + *
7156 + * *) Every now and then we mix in a piece of metadata (called a log block)
7157 + * into the L2ARC write. This allows us to understand what's been written,
7158 + * so that we can rebuild the arc_buf_hdr_t structures of the main ARC
7159 + * buffers. The log block also includes a "2-back-reference" pointer to
7160 + * the second-to-previous block, forming a back-linked list of blocks on
7161 + * the L2ARC device.
7162 + *
7163 + * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
7164 + * for our header bookkeeping purposes. This contains a device header,
7165 + * which contains our top-level reference structures. We update it each
7166 + * time we write a new log block, so that we're able to locate it in the
7167 + * L2ARC device. If this write results in an inconsistent device header
7168 + * (e.g. due to power failure), we detect this by verifying the header's
7169 + * checksum and simply drop the entries from L2ARC.
7170 + *
7171 + * Implementation diagram:
7172 + *
7173 + * +=== L2ARC device (not to scale) ======================================+
7174 + * | ___two newest log block pointers__.__________ |
7175 + * | / \1 back \latest |
7176 + * |.____/_. V V |
7177 + * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
7178 + * || hdr| ^ /^ /^ / / |
7179 + * |+------+ ...--\-------/ \-----/--\------/ / |
7180 + * | \--------------/ \--------------/ |
7181 + * +======================================================================+
7182 + *
7183 + * As can be seen in the diagram, rather than using a simple linked
7184 + * list, we use a pair of linked lists with alternating elements. This
7185 + * is a performance enhancement over a single list: with only one list
7186 + * we would find out the address of the next log block only once the
7187 + * current block had been completely read in, which would keep the
7188 + * device's I/O queue just one operation deep and incur a large amount
7189 + * of I/O round-trip latency. Having two lists allows us to "prefetch"
7190 + * two log blocks ahead of where we are currently rebuilding L2ARC
7191 + * buffers.
7192 + *
7193 + * On-device data structures:
7194 + *
7195 + * L2ARC device header: l2arc_dev_hdr_phys_t
7196 + * L2ARC log block: l2arc_log_blk_phys_t
7197 + *
7198 + * L2ARC reconstruction:
7199 + *
7200 + * When writing data, we simply write in the standard rotary fashion,
7201 + * evicting buffers as we go and simply writing new data over them (writing
7202 + * a new log block every now and then). This obviously means that once we
7203 + * loop around the end of the device, we will start cutting into an already
7204 + * committed log block (and its referenced data buffers), like so:
7205 + *
7206 + * current write head__ __old tail
7207 + * \ /
7208 + * V V
7209 + * <--|bufs |lb |bufs |lb | |bufs |lb |bufs |lb |-->
7210 + * ^ ^^^^^^^^^___________________________________
7211 + * | \
7212 + * <<nextwrite>> may overwrite this blk and/or its bufs --'
7213 + *
7214 + * When importing the pool, we detect this situation and use it to stop
7215 + * our scanning process (see l2arc_rebuild).
7216 + *
7217 + * There is one significant caveat to consider when rebuilding ARC contents
7218 + * from an L2ARC device: what about invalidated buffers? Given the above
7219 + * construction, we cannot update blocks which we've already written to amend
7220 + * them to remove buffers which were invalidated. Thus, during reconstruction,
7221 + * we might be populating the cache with buffers for data that's not on the
7222 + * main pool anymore, or may have been overwritten!
7223 + *
7224 + * As it turns out, this isn't a problem. Every arc_read request includes
7225 + * both the DVA and, crucially, the birth TXG of the BP the caller is
7226 + * looking for. So even if the cache were populated by completely rotten
7227 + * blocks for data that had been long deleted and/or overwritten, we'll
7228 + * never actually return bad data from the cache, since the DVA together
7229 + * with the birth TXG uniquely identifies a block in space and time; once
7230 + * created, a block is immutable on disk. The worst we will have done is
7231 + * waste some time and memory at l2arc rebuild reconstructing outdated
7232 + * ARC entries that will get dropped from the l2arc as it is being
7233 + * updated with new blocks.
6385 7234 */
6386 7235
6387 7236 static boolean_t
6388 7237 l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
6389 7238 {
6390 7239 /*
6391 7240 * A buffer is *not* eligible for the L2ARC if it:
6392 7241 * 1. belongs to a different spa.
6393 7242 * 2. is already cached on the L2ARC.
6394 7243 * 3. has an I/O in progress (it may be an incomplete read).
6395 7244 * 4. is flagged not eligible (zfs property).
6396 7245 */
6397 7246 if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
6398 7247 HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
6399 7248 return (B_FALSE);
6400 7249
6401 7250 return (B_TRUE);
6402 7251 }
6403 7252
6404 7253 static uint64_t
6405 7254 l2arc_write_size(void)
6406 7255 {
6407 7256 uint64_t size;
6408 7257
6409 7258 /*
6410 7259 * Make sure our globals have meaningful values in case the user
6411 7260 * altered them.
6412 7261 */
6413 7262 size = l2arc_write_max;
6414 7263 if (size == 0) {
6415 7264 cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must "
6416 7265 "be greater than zero, resetting it to the default (%d)",
6417 7266 L2ARC_WRITE_SIZE);
6418 7267 size = l2arc_write_max = L2ARC_WRITE_SIZE;
6419 7268 }
6420 7269
6421 7270 if (arc_warm == B_FALSE)
6422 7271 size += l2arc_write_boost;
6423 7272
6424 7273 return (size);
6425 7274
6426 7275 }
6427 7276
6428 7277 static clock_t
6429 7278 l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
6430 7279 {
6431 7280 clock_t interval, next, now;
6432 7281
6433 7282 /*
6434 7283 * If the ARC lists are busy, increase our write rate; if the
6435 7284 * lists are stale, idle back. This is achieved by checking
6436 7285 * how much we previously wrote - if it was more than half of
6437 7286 * what we wanted, schedule the next write much sooner.
6438 7287 */
6439 7288 if (l2arc_feed_again && wrote > (wanted / 2))
6440 7289 interval = (hz * l2arc_feed_min_ms) / 1000;
6441 7290 else
6442 7291 interval = hz * l2arc_feed_secs;
6443 7292
6444 7293 now = ddi_get_lbolt();
6445 7294 next = MAX(now, MIN(now + interval, began + interval));
6446 7295
6447 7296 return (next);
6448 7297 }
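/*
 * To give a concrete sense of the cadence computed above: assuming the
 * usual defaults of l2arc_feed_secs = 1 and l2arc_feed_min_ms = 200, a
 * feed cycle that wrote more than half of what it wanted (with
 * l2arc_feed_again enabled) schedules the next write roughly 200ms later,
 * while a stale or idle cycle backs off to roughly one second.
 */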
6449 7298
7299 +typedef enum l2ad_feed {
7300 + L2ARC_FEED_ALL = 1,
7301 + L2ARC_FEED_DDT_DEV,
7302 + L2ARC_FEED_NON_DDT_DEV,
7303 +} l2ad_feed_t;
7304 +
6450 7305 /*
6451 7306 * Cycle through L2ARC devices. This is how L2ARC load balances.
6452 7307 * If a device is returned, this also returns holding the spa config lock.
6453 7308 */
6454 7309 static l2arc_dev_t *
6455 -l2arc_dev_get_next(void)
7310 +l2arc_dev_get_next(l2ad_feed_t feed_type)
6456 7311 {
6457 - l2arc_dev_t *first, *next = NULL;
7312 + l2arc_dev_t *start = NULL, *next = NULL;
6458 7313
6459 7314 /*
6460 7315 * Lock out the removal of spas (spa_namespace_lock), then removal
6461 7316 * of cache devices (l2arc_dev_mtx). Once a device has been selected,
6462 7317 * both locks will be dropped and a spa config lock held instead.
6463 7318 */
6464 7319 mutex_enter(&spa_namespace_lock);
6465 7320 mutex_enter(&l2arc_dev_mtx);
6466 7321
6467 7322 /* if there are no vdevs, there is nothing to do */
6468 7323 if (l2arc_ndev == 0)
6469 7324 goto out;
6470 7325
6471 - first = NULL;
6472 - next = l2arc_dev_last;
6473 - do {
6474 - /* loop around the list looking for a non-faulted vdev */
6475 - if (next == NULL) {
6476 - next = list_head(l2arc_dev_list);
6477 - } else {
6478 - next = list_next(l2arc_dev_list, next);
6479 - if (next == NULL)
6480 - next = list_head(l2arc_dev_list);
6481 - }
7326 + if (feed_type == L2ARC_FEED_DDT_DEV)
7327 + next = l2arc_ddt_dev_last;
7328 + else
7329 + next = l2arc_dev_last;
6482 7330
6483 - /* if we have come back to the start, bail out */
6484 - if (first == NULL)
6485 - first = next;
6486 - else if (next == first)
6487 - break;
7331 + /* figure out what the next device we look at should be */
7332 + if (next == NULL)
7333 + next = list_head(l2arc_dev_list);
7334 + else if (list_next(l2arc_dev_list, next) == NULL)
7335 + next = list_head(l2arc_dev_list);
7336 + else
7337 + next = list_next(l2arc_dev_list, next);
7338 + ASSERT(next);
6488 7339
6489 - } while (vdev_is_dead(next->l2ad_vdev));
7340 + /* loop through L2ARC devs looking for the one we need */
7341 + /* LINTED(E_CONSTANT_CONDITION) */
7342 + while (1) {
7343 + if (next == NULL) /* reached list end, start from beginning */
7344 + next = list_head(l2arc_dev_list);
6490 7345
6491 - /* if we were unable to find any usable vdevs, return NULL */
6492 - if (vdev_is_dead(next->l2ad_vdev))
6493 - next = NULL;
7346 + if (start == NULL) { /* save starting dev */
7347 + start = next;
7348 + } else if (start == next) { /* full loop completed - stop now */
7349 + next = NULL;
7350 + if (feed_type == L2ARC_FEED_DDT_DEV) {
7351 + l2arc_ddt_dev_last = NULL;
7352 + goto out;
7353 + } else {
7354 + break;
7355 + }
7356 + }
6494 7357
7358 + if (!vdev_is_dead(next->l2ad_vdev) && !next->l2ad_rebuild) {
7359 + if (feed_type == L2ARC_FEED_DDT_DEV) {
7360 + if (vdev_type_is_ddt(next->l2ad_vdev)) {
7361 + l2arc_ddt_dev_last = next;
7362 + goto out;
7363 + }
7364 + } else if (feed_type == L2ARC_FEED_NON_DDT_DEV) {
7365 + if (!vdev_type_is_ddt(next->l2ad_vdev)) {
7366 + break;
7367 + }
7368 + } else {
7369 + ASSERT(feed_type == L2ARC_FEED_ALL);
7370 + break;
7371 + }
7372 + }
7373 + next = list_next(l2arc_dev_list, next);
7374 + }
6495 7375 l2arc_dev_last = next;
6496 7376
6497 7377 out:
6498 7378 mutex_exit(&l2arc_dev_mtx);
6499 7379
6500 7380 /*
6501 7381 * Grab the config lock to prevent the 'next' device from being
6502 7382 * removed while we are writing to it.
6503 7383 */
6504 7384 if (next != NULL)
6505 7385 spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
6506 7386 mutex_exit(&spa_namespace_lock);
6507 7387
6508 7388 return (next);
6509 7389 }
6510 7390
6511 7391 /*
6512 7392 * Free buffers that were tagged for destruction.
6513 7393 */
6514 7394 static void
6515 7395 l2arc_do_free_on_write()
6516 7396 {
6517 7397 list_t *buflist;
6518 7398 l2arc_data_free_t *df, *df_prev;
6519 7399
6520 7400 mutex_enter(&l2arc_free_on_write_mtx);
6521 7401 buflist = l2arc_free_on_write;
6522 7402
6523 7403 for (df = list_tail(buflist); df; df = df_prev) {
6524 7404 df_prev = list_prev(buflist, df);
6525 7405 ASSERT3P(df->l2df_abd, !=, NULL);
6526 7406 abd_free(df->l2df_abd);
6527 7407 list_remove(buflist, df);
6528 7408 kmem_free(df, sizeof (l2arc_data_free_t));
6529 7409 }
6530 7410
6531 7411 mutex_exit(&l2arc_free_on_write_mtx);
6532 7412 }
6533 7413
6534 7414 /*
6535 7415 * A write to a cache device has completed. Update all headers to allow
6536 7416 * reads from these buffers to begin.
6537 7417 */
6538 7418 static void
6539 7419 l2arc_write_done(zio_t *zio)
6540 7420 {
6541 7421 l2arc_write_callback_t *cb;
6542 7422 l2arc_dev_t *dev;
6543 7423 list_t *buflist;
6544 7424 arc_buf_hdr_t *head, *hdr, *hdr_prev;
6545 7425 kmutex_t *hash_lock;
6546 7426 int64_t bytes_dropped = 0;
7427 + l2arc_log_blk_buf_t *lb_buf;
6547 7428
6548 7429 cb = zio->io_private;
6549 7430 ASSERT3P(cb, !=, NULL);
6550 7431 dev = cb->l2wcb_dev;
6551 7432 ASSERT3P(dev, !=, NULL);
6552 7433 head = cb->l2wcb_head;
6553 7434 ASSERT3P(head, !=, NULL);
6554 7435 buflist = &dev->l2ad_buflist;
6555 7436 ASSERT3P(buflist, !=, NULL);
6556 7437 DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
6557 7438 l2arc_write_callback_t *, cb);
6558 7439
6559 7440 if (zio->io_error != 0)
6560 7441 ARCSTAT_BUMP(arcstat_l2_writes_error);
6561 7442
6562 7443 /*
6563 7444 * All writes completed, or an error was hit.
6564 7445 */
6565 7446 top:
6566 7447 mutex_enter(&dev->l2ad_mtx);
6567 7448 for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {
6568 7449 hdr_prev = list_prev(buflist, hdr);
6569 7450
6570 7451 hash_lock = HDR_LOCK(hdr);
6571 7452
6572 7453 /*
6573 7454 * We cannot use mutex_enter or else we can deadlock
6574 7455 * with l2arc_write_buffers (due to swapping the order
6575 7456 * the hash lock and l2ad_mtx are taken).
6576 7457 */
6577 7458 if (!mutex_tryenter(hash_lock)) {
6578 7459 /*
6579 7460 * Missed the hash lock. We must retry so we
6580 7461 * don't leave the ARC_FLAG_L2_WRITING bit set.
6581 7462 */
6582 7463 ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);
6583 7464
6584 7465 /*
6585 7466 * We don't want to rescan the headers we've
6586 7467 * already marked as having been written out, so
6587 7468 * we reinsert the head node so we can pick up
6588 7469 * where we left off.
6589 7470 */
6590 7471 list_remove(buflist, head);
6591 7472 list_insert_after(buflist, hdr, head);
6592 7473
6593 7474 mutex_exit(&dev->l2ad_mtx);
6594 7475
6595 7476 /*
6596 7477 * We wait for the hash lock to become available
6597 7478 * to try and prevent busy waiting, and increase
6598 7479 * the chance we'll be able to acquire the lock
6599 7480 * the next time around.
6600 7481 */
6601 7482 mutex_enter(hash_lock);
6602 7483 mutex_exit(hash_lock);
6603 7484 goto top;
6604 7485 }
6605 7486
6606 7487 /*
6607 7488 * We could not have been moved into the arc_l2c_only
6608 7489 * state while in-flight due to our ARC_FLAG_L2_WRITING
6609 7490 * bit being set. Let's just ensure that's being enforced.
6610 7491 */
6611 7492 ASSERT(HDR_HAS_L1HDR(hdr));
6612 7493
6613 7494 if (zio->io_error != 0) {
6614 7495 /*
6615 7496 * Error - drop L2ARC entry.
6616 7497 */
6617 7498 list_remove(buflist, hdr);
6618 7499 arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
6619 7500
6620 7501 ARCSTAT_INCR(arcstat_l2_psize, -arc_hdr_size(hdr));
6621 7502 ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr));
6622 7503
6623 7504 bytes_dropped += arc_hdr_size(hdr);
6624 7505 (void) refcount_remove_many(&dev->l2ad_alloc,
6625 7506 arc_hdr_size(hdr), hdr);
6626 7507 }
6627 7508
6628 7509 /*
6629 7510 * Allow ARC to begin reads and ghost list evictions to
6630 7511 * this L2ARC entry.
6631 7512 */
6632 7513 arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
6633 7514
6634 7515 mutex_exit(hash_lock);
6635 7516 }
6636 7517
6637 7518 atomic_inc_64(&l2arc_writes_done);
6638 7519 list_remove(buflist, head);
6639 7520 ASSERT(!HDR_HAS_L1HDR(head));
6640 7521 kmem_cache_free(hdr_l2only_cache, head);
6641 7522 mutex_exit(&dev->l2ad_mtx);
6642 7523
7524 + ASSERT(dev->l2ad_vdev != NULL);
6643 7525 vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
6644 7526
6645 7527 l2arc_do_free_on_write();
6646 7528
7529 + while ((lb_buf = list_remove_tail(&cb->l2wcb_log_blk_buflist)) != NULL)
7530 + kmem_free(lb_buf, sizeof (*lb_buf));
7531 + list_destroy(&cb->l2wcb_log_blk_buflist);
6647 7532 kmem_free(cb, sizeof (l2arc_write_callback_t));
6648 7533 }
6649 7534
6650 7535 /*
6651 7536 * A read to a cache device completed. Validate buffer contents before
6652 7537 * handing over to the regular ARC routines.
6653 7538 */
6654 7539 static void
6655 7540 l2arc_read_done(zio_t *zio)
6656 7541 {
6657 7542 l2arc_read_callback_t *cb;
6658 7543 arc_buf_hdr_t *hdr;
6659 7544 kmutex_t *hash_lock;
6660 7545 boolean_t valid_cksum;
6661 7546
6662 7547 ASSERT3P(zio->io_vd, !=, NULL);
6663 7548 ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
6664 7549
6665 7550 spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
6666 7551
6667 7552 cb = zio->io_private;
6668 7553 ASSERT3P(cb, !=, NULL);
6669 7554 hdr = cb->l2rcb_hdr;
6670 7555 ASSERT3P(hdr, !=, NULL);
6671 7556
6672 7557 hash_lock = HDR_LOCK(hdr);
6673 7558 mutex_enter(hash_lock);
6674 7559 ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
6675 7560
6676 7561 /*
6677 7562 * If the data was read into a temporary buffer,
6678 7563 * move it and free the buffer.
6679 7564 */
6680 7565 if (cb->l2rcb_abd != NULL) {
6681 7566 ASSERT3U(arc_hdr_size(hdr), <, zio->io_size);
6682 7567 if (zio->io_error == 0) {
6683 7568 abd_copy(hdr->b_l1hdr.b_pabd, cb->l2rcb_abd,
6684 7569 arc_hdr_size(hdr));
6685 7570 }
6686 7571
6687 7572 /*
6688 7573 * The following must be done regardless of whether
6689 7574 * there was an error:
6690 7575 * - free the temporary buffer
6691 7576 * - point zio to the real ARC buffer
6692 7577 * - set zio size accordingly
6693 7578 * These are required because zio is either re-used for
6694 7579 * an I/O of the block in the case of the error
6695 7580 * or the zio is passed to arc_read_done() and it
6696 7581 * needs real data.
6697 7582 */
6698 7583 abd_free(cb->l2rcb_abd);
6699 7584 zio->io_size = zio->io_orig_size = arc_hdr_size(hdr);
6700 7585 zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd;
6701 7586 }
6702 7587
6703 7588 ASSERT3P(zio->io_abd, !=, NULL);
6704 7589
6705 7590 /*
6706 7591 * Check this survived the L2ARC journey.
6707 7592 */
6708 7593 ASSERT3P(zio->io_abd, ==, hdr->b_l1hdr.b_pabd);
6709 7594 zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
6710 7595 zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
6711 7596
6712 7597 valid_cksum = arc_cksum_is_equal(hdr, zio);
6713 7598 if (valid_cksum && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
6714 7599 mutex_exit(hash_lock);
6715 7600 zio->io_private = hdr;
6716 7601 arc_read_done(zio);
6717 7602 } else {
6718 7603 mutex_exit(hash_lock);
6719 7604 /*
6720 7605 * Buffer didn't survive caching. Increment stats and
6721 7606 * reissue to the original storage device.
6722 7607 */
6723 7608 if (zio->io_error != 0) {
6724 7609 ARCSTAT_BUMP(arcstat_l2_io_error);
6725 7610 } else {
6726 7611 zio->io_error = SET_ERROR(EIO);
6727 7612 }
6728 7613 if (!valid_cksum)
6729 7614 ARCSTAT_BUMP(arcstat_l2_cksum_bad);
6730 7615
6731 7616 /*
6732 7617 * If there's no waiter, issue an async i/o to the primary
6733 7618 * storage now. If there *is* a waiter, the caller must
6734 7619 * issue the i/o in a context where it's OK to block.
6735 7620 */
6736 7621 if (zio->io_waiter == NULL) {
6737 7622 zio_t *pio = zio_unique_parent(zio);
6738 7623
6739 7624 ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
6740 7625
6741 7626 zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp,
6742 7627 hdr->b_l1hdr.b_pabd, zio->io_size, arc_read_done,
6743 7628 hdr, zio->io_priority, cb->l2rcb_flags,
6744 7629 &cb->l2rcb_zb));
6745 7630 }
6746 7631 }
6747 7632
6748 7633 kmem_free(cb, sizeof (l2arc_read_callback_t));
6749 7634 }
6750 7635
6751 7636 /*
6752 7637 * This is the list priority from which the L2ARC will search for pages to
6753 - * cache. This is used within loops (0..3) to cycle through lists in the
7638 + * cache. This is used within loops to cycle through lists in the
6754 7639 * desired order. This order can have a significant effect on cache
6755 7640 * performance.
6756 7641 *
6757 - * Currently the metadata lists are hit first, MFU then MRU, followed by
6758 - * the data lists. This function returns a locked list, and also returns
6759 - * the lock pointer.
7642 + * Currently the ddt lists are hit first (MFU then MRU),
7643 + * followed by the metadata lists and then by the data lists.
7644 + * This function returns a locked list, and also returns the lock pointer.
6760 7645 */
6761 7646 static multilist_sublist_t *
6762 -l2arc_sublist_lock(int list_num)
7647 +l2arc_sublist_lock(enum l2arc_priorities prio)
6763 7648 {
6764 7649 multilist_t *ml = NULL;
6765 7650 unsigned int idx;
6766 7651
6767 - ASSERT(list_num >= 0 && list_num <= 3);
7652 + ASSERT(prio >= PRIORITY_MFU_DDT);
7653 + ASSERT(prio < PRIORITY_NUMTYPES);
6768 7654
6769 - switch (list_num) {
6770 - case 0:
7655 + switch (prio) {
7656 + case PRIORITY_MFU_DDT:
7657 + ml = arc_mfu->arcs_list[ARC_BUFC_DDT];
7658 + break;
7659 + case PRIORITY_MRU_DDT:
7660 + ml = arc_mru->arcs_list[ARC_BUFC_DDT];
7661 + break;
7662 + case PRIORITY_MFU_META:
6771 7663 ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
6772 7664 break;
6773 - case 1:
7665 + case PRIORITY_MRU_META:
6774 7666 ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
6775 7667 break;
6776 - case 2:
7668 + case PRIORITY_MFU_DATA:
6777 7669 ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
6778 7670 break;
6779 - case 3:
7671 + case PRIORITY_MRU_DATA:
6780 7672 ml = arc_mru->arcs_list[ARC_BUFC_DATA];
6781 7673 break;
6782 7674 }
6783 7675
6784 7676 /*
6785 7677 * Return a randomly-selected sublist. This is acceptable
6786 7678 * because the caller feeds only a little bit of data for each
6787 7679 * call (8MB). Subsequent calls will result in different
6788 7680 * sublists being selected.
6789 7681 */
6790 7682 idx = multilist_get_random_index(ml);
6791 7683 return (multilist_sublist_lock(ml, idx));
6792 7684 }
6793 7685
6794 7686 /*
7687 + * Calculates the maximum overhead of L2ARC metadata log blocks for a given
7688 + * L2ARC write size. l2arc_evict and l2arc_write_buffers need to include this
7689 + * overhead in processing to make sure there is enough headroom available
7690 + * when writing buffers.
7691 + */
7692 +static inline uint64_t
7693 +l2arc_log_blk_overhead(uint64_t write_sz)
7694 +{
7695 + return ((write_sz / SPA_MINBLOCKSIZE / L2ARC_LOG_BLK_ENTRIES) + 1) *
7696 + L2ARC_LOG_BLK_SIZE;
7697 +}
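/*
 * A quick worked example of the bound above: with SPA_MINBLOCKSIZE of 512
 * bytes and, purely hypothetically, L2ARC_LOG_BLK_ENTRIES of 1022 and
 * L2ARC_LOG_BLK_SIZE of 128K, an 8MB write yields
 * ((8M / 512 / 1022) + 1) * 128K = 17 * 128K, i.e. a bit over 2MB reserved
 * for log blocks. The "+ 1" guarantees at least one log block of headroom
 * even for very small writes.
 */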
7698 +
7699 +/*
6795 7700 * Evict buffers from the device write hand to the distance specified in
6796 7701 * bytes. This distance may span populated buffers, it may span nothing.
6797 7702 * This is clearing a region on the L2ARC device ready for writing.
6798 7703 * If the 'all' boolean is set, every buffer is evicted.
6799 7704 */
6800 7705 static void
6801 -l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
7706 +l2arc_evict_impl(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
6802 7707 {
6803 7708 list_t *buflist;
6804 7709 arc_buf_hdr_t *hdr, *hdr_prev;
6805 7710 kmutex_t *hash_lock;
6806 7711 uint64_t taddr;
6807 7712
6808 7713 buflist = &dev->l2ad_buflist;
6809 7714
6810 7715 if (!all && dev->l2ad_first) {
6811 7716 /*
6812 7717 * This is the first sweep through the device. There is
6813 7718 * nothing to evict.
6814 7719 */
6815 7720 return;
6816 7721 }
6817 7722
7723 + /*
7724 + * We need to add in the worst case scenario of log block overhead.
7725 + */
7726 + distance += l2arc_log_blk_overhead(distance);
6818 7727 if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
6819 7728 /*
6820 7729 * When nearing the end of the device, evict to the end
6821 7730 * before the device write hand jumps to the start.
6822 7731 */
6823 7732 taddr = dev->l2ad_end;
6824 7733 } else {
6825 7734 taddr = dev->l2ad_hand + distance;
6826 7735 }
6827 7736 DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
6828 7737 uint64_t, taddr, boolean_t, all);
6829 7738
6830 7739 top:
6831 7740 mutex_enter(&dev->l2ad_mtx);
6832 7741 for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
6833 7742 hdr_prev = list_prev(buflist, hdr);
6834 7743
6835 7744 hash_lock = HDR_LOCK(hdr);
6836 7745
6837 7746 /*
6838 7747 * We cannot use mutex_enter or else we can deadlock
6839 7748 * with l2arc_write_buffers (due to swapping the order
6840 7749 * the hash lock and l2ad_mtx are taken).
6841 7750 */
6842 7751 if (!mutex_tryenter(hash_lock)) {
6843 7752 /*
6844 7753 * Missed the hash lock. Retry.
6845 7754 */
6846 7755 ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
6847 7756 mutex_exit(&dev->l2ad_mtx);
6848 7757 mutex_enter(hash_lock);
6849 7758 mutex_exit(hash_lock);
6850 7759 goto top;
6851 7760 }
6852 7761
6853 7762 /*
6854 7763 * A header can't be on this list if it doesn't have L2 header.
6855 7764 */
6856 7765 ASSERT(HDR_HAS_L2HDR(hdr));
6857 7766
6858 7767 /* Ensure this header has finished being written. */
6859 7768 ASSERT(!HDR_L2_WRITING(hdr));
6860 7769 ASSERT(!HDR_L2_WRITE_HEAD(hdr));
6861 7770
6862 7771 if (!all && (hdr->b_l2hdr.b_daddr >= taddr ||
6863 7772 hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {
6864 7773 /*
6865 7774 * We've evicted to the target address,
6866 7775 * or the end of the device.
6867 7776 */
6868 7777 mutex_exit(hash_lock);
6869 7778 break;
6870 7779 }
6871 7780
6872 7781 if (!HDR_HAS_L1HDR(hdr)) {
6873 7782 ASSERT(!HDR_L2_READING(hdr));
6874 7783 /*
6875 7784 * This doesn't exist in the ARC. Destroy.
6876 7785 * arc_hdr_destroy() will call list_remove()
6877 7786 * and decrement arcstat_l2_lsize.
6878 7787 */
6879 7788 arc_change_state(arc_anon, hdr, hash_lock);
6880 7789 arc_hdr_destroy(hdr);
6881 7790 } else {
6882 7791 ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
6883 7792 ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
6884 7793 /*
6885 7794 * Invalidate issued or about to be issued
6886 7795 * reads, since we may be about to write
6887 7796 * over this location.
6888 7797 */
6889 7798 if (HDR_L2_READING(hdr)) {
6890 7799 ARCSTAT_BUMP(arcstat_l2_evict_reading);
6891 7800 arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
6892 7801 }
6893 7802
6894 7803 arc_hdr_l2hdr_destroy(hdr);
6895 7804 }
6896 7805 mutex_exit(hash_lock);
6897 7806 }
6898 7807 mutex_exit(&dev->l2ad_mtx);
6899 7808 }
6900 7809
7810 +static void
7811 +l2arc_evict_task(void *arg)
7812 +{
7813 + l2arc_dev_t *dev = arg;
7814 + ASSERT(dev);
7815 +
7816 + /*
7817 + * Evict l2arc buffers asynchronously; we need to keep the device
7818 + * around until we are sure there aren't any buffers referencing it.
7819 + * We do not need to hold any config locks, etc. because at this point,
7820 + * we are the only ones who know about this device (the in-core
7821 + * structure), so no new buffers can be created (e.g. if the pool is
7822 + * re-imported while the asynchronous eviction is in progress) that
7823 + * reference this same in-core structure. Also remove the vdev link,
7824 + * since further use of it as an l2arc device is prohibited.
7825 + */
7826 + dev->l2ad_vdev = NULL;
7827 + l2arc_evict_impl(dev, 0LL, B_TRUE);
7828 +
7829 + /* Same cleanup as in the synchronous path */
7830 + list_destroy(&dev->l2ad_buflist);
7831 + mutex_destroy(&dev->l2ad_mtx);
7832 + refcount_destroy(&dev->l2ad_alloc);
7833 + kmem_free(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
7834 + kmem_free(dev, sizeof (l2arc_dev_t));
7835 +}
7836 +
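/*
 * Tunable: when B_TRUE (the default), evicting *all* of an L2ARC device's
 * buffers (as done at pool export time) is handed off to arc_flush_taskq
 * via l2arc_evict_task() rather than performed synchronously in the
 * caller's context.
 */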
7837 +boolean_t zfs_l2arc_async_evict = B_TRUE;
7838 +
6901 7839 /*
7840 + * Perform l2arc eviction for buffers associated with this device.
7841 + * If evicting all buffers (done at pool export time), try to evict
7842 + * asynchronously and fall back to synchronous eviction on error.
7843 + * Tell the caller whether it should clean up the device:
7844 + * - B_TRUE means "asynchronous eviction dispatched, do not clean up"
7845 + * - B_FALSE means "synchronous eviction completed, please clean up"
7846 + */
7847 +static boolean_t
7848 +l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
7849 +{
7850 + /*
7851 + * If we are evicting all the buffers for this device, which happens
7852 + * at pool export time, schedule an asynchronous task
7853 + */
7854 + if (all && zfs_l2arc_async_evict) {
7855 + if ((taskq_dispatch(arc_flush_taskq, l2arc_evict_task,
7856 + dev, TQ_NOSLEEP) == NULL)) {
7857 + /*
7858 + * Failed to dispatch the asynchronous task;
7859 + * fall back to synchronous eviction
7860 + */
7861 + l2arc_evict_impl(dev, distance, all);
7862 + } else {
7863 + /*
7864 + * Successfully dispatched; the eviction task will do the cleanup
7865 + */
7866 + return (B_TRUE);
7867 + }
7868 + } else {
7869 + /* Evict synchronously */
7870 + l2arc_evict_impl(dev, distance, all);
7871 + }
7872 +
7873 + return (B_FALSE);
7874 +}
7875 +
7876 +/*
6902 7877 * Find and write ARC buffers to the L2ARC device.
6903 7878 *
6904 7879 * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
6905 7880 * for reading until they have completed writing.
6906 7881 * The headroom_boost is an in-out parameter used to maintain headroom boost
6907 7882 * state between calls to this function.
6908 7883 *
6909 7884 * Returns the number of bytes actually written (which may be smaller than
6910 7885 * the delta by which the device hand has changed due to alignment).
6911 7886 */
6912 7887 static uint64_t
6913 -l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
7888 +l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
7889 + l2ad_feed_t feed_type)
6914 7890 {
6915 7891 arc_buf_hdr_t *hdr, *hdr_prev, *head;
7892 + /*
7893 + * We must carefully track the space we deal with here:
7894 + * - write_lsize: sum of the size of all buffers to be written
7895 + * without compression or inter-buffer alignment applied.
7896 + * This size is added to arcstat_l2_size, because subsequent
7897 + * eviction of buffers decrements this kstat by only the
7898 + * buffer's b_lsize (which doesn't take alignment into account).
7899 + * - write_asize: sum of the size of all buffers to be written
7900 + * with inter-buffer alignment applied.
7901 + * This size is used to estimate the maximum number of bytes
7902 + * we could take up on the device and is thus used to gauge how
7903 + * close we are to hitting target_sz.
7904 + */
6916 7905 uint64_t write_asize, write_psize, write_lsize, headroom;
6917 7906 boolean_t full;
6918 7907 l2arc_write_callback_t *cb;
6919 7908 zio_t *pio, *wzio;
7909 + enum l2arc_priorities try;
6920 7910 uint64_t guid = spa_load_guid(spa);
7911 + boolean_t dev_hdr_update = B_FALSE;
6921 7912
6922 7913 ASSERT3P(dev->l2ad_vdev, !=, NULL);
6923 7914
6924 7915 pio = NULL;
7916 + cb = NULL;
6925 7917 write_lsize = write_asize = write_psize = 0;
6926 7918 full = B_FALSE;
6927 7919 head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
6928 7920 arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
6929 7921
6930 7922 /*
6931 7923 * Copy buffers for L2ARC writing.
6932 7924 */
6933 - for (int try = 0; try <= 3; try++) {
7925 + for (try = PRIORITY_MFU_DDT; try < PRIORITY_NUMTYPES; try++) {
6934 7926 multilist_sublist_t *mls = l2arc_sublist_lock(try);
6935 7927 uint64_t passed_sz = 0;
6936 7928
6937 7929 /*
6938 7930 * L2ARC fast warmup.
6939 7931 *
6940 7932 * Until the ARC is warm and starts to evict, read from the
6941 7933 * head of the ARC lists rather than the tail.
6942 7934 */
6943 7935 if (arc_warm == B_FALSE)
6944 7936 hdr = multilist_sublist_head(mls);
6945 7937 else
6946 7938 hdr = multilist_sublist_tail(mls);
6947 7939
6948 7940 headroom = target_sz * l2arc_headroom;
6949 7941 if (zfs_compressed_arc_enabled)
6950 7942 headroom = (headroom * l2arc_headroom_boost) / 100;
6951 7943
6952 7944 for (; hdr; hdr = hdr_prev) {
6953 7945 kmutex_t *hash_lock;
6954 7946
6955 7947 if (arc_warm == B_FALSE)
6956 7948 hdr_prev = multilist_sublist_next(mls, hdr);
6957 7949 else
6958 7950 hdr_prev = multilist_sublist_prev(mls, hdr);
6959 7951
6960 7952 hash_lock = HDR_LOCK(hdr);
6961 7953 if (!mutex_tryenter(hash_lock)) {
6962 7954 /*
6963 7955 * Skip this buffer rather than waiting.
6964 7956 */
6965 7957 continue;
6966 7958 }
6967 7959
6968 7960 passed_sz += HDR_GET_LSIZE(hdr);
6969 7961 if (passed_sz > headroom) {
6970 7962 /*
6971 7963 * Searched too far.
6972 7964 */
6973 7965 mutex_exit(hash_lock);
6974 7966 break;
6975 7967 }
6976 7968
6977 7969 if (!l2arc_write_eligible(guid, hdr)) {
6978 7970 mutex_exit(hash_lock);
6979 7971 continue;
6980 7972 }
6981 7973
6982 7974 /*
6983 7975 * We rely on the L1 portion of the header below, so
6984 7976 * it's invalid for this header to have been evicted out
6985 7977 * of the ghost cache, prior to being written out. The
6986 7978 * ARC_FLAG_L2_WRITING bit ensures this won't happen.
6987 7979 */
6988 7980 ASSERT(HDR_HAS_L1HDR(hdr));
6989 7981
6990 7982 ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
6991 7983 ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
6992 7984 ASSERT3U(arc_hdr_size(hdr), >, 0);
6993 7985 uint64_t psize = arc_hdr_size(hdr);
6994 7986 uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
6995 7987 psize);
6996 7988
6997 7989 if ((write_asize + asize) > target_sz) {
6998 7990 full = B_TRUE;
6999 7991 mutex_exit(hash_lock);
7000 7992 break;
7001 7993 }
7002 7994
7995 + /* make sure buf we select corresponds to feed_type */
7996 + if ((feed_type == L2ARC_FEED_DDT_DEV &&
7997 + arc_buf_type(hdr) != ARC_BUFC_DDT) ||
7998 + (feed_type == L2ARC_FEED_NON_DDT_DEV &&
7999 + arc_buf_type(hdr) == ARC_BUFC_DDT)) {
8000 + mutex_exit(hash_lock);
8001 + continue;
8002 + }
8003 +
7003 8004 if (pio == NULL) {
7004 8005 /*
7005 8006 * Insert a dummy header on the buflist so
7006 8007 * l2arc_write_done() can find where the
7007 8008 * write buffers begin without searching.
7008 8009 */
7009 8010 mutex_enter(&dev->l2ad_mtx);
7010 8011 list_insert_head(&dev->l2ad_buflist, head);
7011 8012 mutex_exit(&dev->l2ad_mtx);
7012 8013
7013 - cb = kmem_alloc(
8014 + cb = kmem_zalloc(
7014 8015 sizeof (l2arc_write_callback_t), KM_SLEEP);
7015 8016 cb->l2wcb_dev = dev;
7016 8017 cb->l2wcb_head = head;
8018 + list_create(&cb->l2wcb_log_blk_buflist,
8019 + sizeof (l2arc_log_blk_buf_t),
8020 + offsetof(l2arc_log_blk_buf_t, lbb_node));
7017 8021 pio = zio_root(spa, l2arc_write_done, cb,
7018 8022 ZIO_FLAG_CANFAIL);
7019 8023 }
7020 8024
7021 8025 hdr->b_l2hdr.b_dev = dev;
7022 8026 hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
7023 8027 arc_hdr_set_flags(hdr,
7024 8028 ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
7025 8029
7026 8030 mutex_enter(&dev->l2ad_mtx);
7027 8031 list_insert_head(&dev->l2ad_buflist, hdr);
7028 8032 mutex_exit(&dev->l2ad_mtx);
7029 8033
7030 8034 (void) refcount_add_many(&dev->l2ad_alloc, psize, hdr);
7031 8035
7032 8036 /*
7033 8037 * Normally the L2ARC can use the hdr's data, but if
7034 8038 * we're sharing data between the hdr and one of its
7035 8039 * bufs, L2ARC needs its own copy of the data so that
7036 8040 * the ZIO below can't race with the buf consumer.
7037 8041 * Another case where we need to create a copy of the
7038 8042 * data is when the buffer size is not device-aligned
7039 8043 * and we need to pad the block to make it such.
7040 8044 * That also keeps the clock hand suitably aligned.
7041 8045 *
7042 8046 * To ensure that the copy will be available for the
7043 8047 * lifetime of the ZIO and be cleaned up afterwards, we
7044 8048 * add it to the l2arc_free_on_write queue.
7045 8049 */
7046 8050 abd_t *to_write;
7047 8051 if (!HDR_SHARED_DATA(hdr) && psize == asize) {
7048 8052 to_write = hdr->b_l1hdr.b_pabd;
7049 8053 } else {
7050 8054 to_write = abd_alloc_for_io(asize,
7051 - HDR_ISTYPE_METADATA(hdr));
8055 + !HDR_ISTYPE_DATA(hdr));
7052 8056 abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
7053 8057 if (asize != psize) {
7054 8058 abd_zero_off(to_write, psize,
7055 8059 asize - psize);
7056 8060 }
7057 8061 l2arc_free_abd_on_write(to_write, asize,
7058 8062 arc_buf_type(hdr));
7059 8063 }
7060 8064 wzio = zio_write_phys(pio, dev->l2ad_vdev,
7061 8065 hdr->b_l2hdr.b_daddr, asize, to_write,
7062 8066 ZIO_CHECKSUM_OFF, NULL, hdr,
7063 8067 ZIO_PRIORITY_ASYNC_WRITE,
7064 8068 ZIO_FLAG_CANFAIL, B_FALSE);
7065 8069
7066 8070 write_lsize += HDR_GET_LSIZE(hdr);
7067 8071 DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
7068 8072 zio_t *, wzio);
7069 8073
7070 8074 write_psize += psize;
7071 8075 write_asize += asize;
7072 8076 dev->l2ad_hand += asize;
7073 8077
7074 8078 mutex_exit(hash_lock);
7075 8079
7076 8080 (void) zio_nowait(wzio);
8081 +
8082 + /*
8083 + * Append buf info to current log and commit if full.
8084 + * arcstat_l2_{size,asize} kstats are updated internally.
8085 + */
8086 + if (l2arc_log_blk_insert(dev, hdr)) {
8087 + l2arc_log_blk_commit(dev, pio, cb);
8088 + dev_hdr_update = B_TRUE;
8089 + }
7077 8090 }
7078 8091
7079 8092 multilist_sublist_unlock(mls);
7080 8093
7081 8094 if (full == B_TRUE)
7082 8095 break;
7083 8096 }
7084 8097
7085 8098 /* No buffers selected for writing? */
7086 8099 if (pio == NULL) {
7087 8100 ASSERT0(write_lsize);
7088 8101 ASSERT(!HDR_HAS_L1HDR(head));
7089 8102 kmem_cache_free(hdr_l2only_cache, head);
7090 8103 return (0);
7091 8104 }
7092 8105
8106 + /*
8107 + * If we wrote any logs as part of this write, update dev hdr
8108 + * to point to it.
8109 + */
8110 + if (dev_hdr_update)
8111 + l2arc_dev_hdr_update(dev, pio);
8112 +
7093 8113 ASSERT3U(write_asize, <=, target_sz);
7094 8114 ARCSTAT_BUMP(arcstat_l2_writes_sent);
7095 8115 ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
8116 + if (feed_type == L2ARC_FEED_DDT_DEV)
8117 + ARCSTAT_INCR(arcstat_l2_ddt_write_bytes, write_psize);
7096 8118 ARCSTAT_INCR(arcstat_l2_lsize, write_lsize);
7097 8119 ARCSTAT_INCR(arcstat_l2_psize, write_psize);
7098 8120 vdev_space_update(dev->l2ad_vdev, write_psize, 0, 0);
7099 8121
7100 8122 /*
7101 8123 * Bump device hand to the device start if it is approaching the end.
7102 8124 * l2arc_evict() will already have evicted ahead for this case.
7103 8125 */
7104 - if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
8126 + if (dev->l2ad_hand + target_sz + l2arc_log_blk_overhead(target_sz) >=
8127 + dev->l2ad_end) {
7105 8128 dev->l2ad_hand = dev->l2ad_start;
7106 8129 dev->l2ad_first = B_FALSE;
7107 8130 }
7108 8131
7109 8132 dev->l2ad_writing = B_TRUE;
7110 8133 (void) zio_wait(pio);
7111 8134 dev->l2ad_writing = B_FALSE;
7112 8135
7113 8136 return (write_asize);
7114 8137 }
7115 8138
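The padding step above (abd_alloc_for_io(), abd_copy(), abd_zero_off()) and the write_asize accounting described at the top of l2arc_write_buffers() boil down to rounding each buffer's physical size up to the device allocation size and zero-filling the tail. A minimal user-space sketch of that arithmetic, assuming a leaf cache device where vdev_psize_to_asize() reduces to rounding up to 1 << ashift; psize_to_asize() below is an illustrative stand-in, not the kernel routine:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for vdev_psize_to_asize() on a plain leaf vdev. */
static uint64_t
psize_to_asize(uint64_t psize, int ashift)
{
        uint64_t align = 1ULL << ashift;

        return ((psize + align - 1) & ~(align - 1));
}

int
main(void)
{
        const int ashift = 12;          /* assume 4K device blocks */
        uint64_t psize = 5000;          /* physical size of the buffer */
        uint64_t asize = psize_to_asize(psize, ashift);
        uint8_t *to_write;

        /*
         * Allocate an aligned staging copy, fill in the payload and leave
         * the tail [psize, asize) zeroed, mirroring the abd_copy() plus
         * abd_zero_off() sequence above.
         */
        to_write = calloc(1, asize);
        memset(to_write, 0xab, psize);  /* stand-in for the real payload */

        printf("psize=%llu asize=%llu pad=%llu\n",
            (unsigned long long)psize, (unsigned long long)asize,
            (unsigned long long)(asize - psize));
        free(to_write);
        return (0);
}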
8139 +static boolean_t
8140 +l2arc_feed_dev(l2ad_feed_t feed_type, uint64_t *wrote)
8141 +{
8142 + spa_t *spa;
8143 + l2arc_dev_t *dev;
8144 + uint64_t size;
8145 +
8146 + /*
8147 + * This selects the next l2arc device to write to, and in
8148 + * doing so the next spa to feed from: dev->l2ad_spa. This
8149 + * will return NULL if there are now no l2arc devices or if
8150 + * they are all faulted.
8151 + *
8152 + * If a device is returned, its spa's config lock is also
8153 + * held to prevent device removal. l2arc_dev_get_next()
8154 + * will grab and release l2arc_dev_mtx.
8155 + */
8156 + if ((dev = l2arc_dev_get_next(feed_type)) == NULL)
8157 + return (B_FALSE);
8158 +
8159 + spa = dev->l2ad_spa;
8160 + ASSERT(spa != NULL);
8161 +
8162 + /*
8163 + * If the pool is read-only, skip it.
8164 + */
8165 + if (!spa_writeable(spa)) {
8166 + spa_config_exit(spa, SCL_L2ARC, dev);
8167 + return (B_FALSE);
8168 + }
8169 +
8170 + ARCSTAT_BUMP(arcstat_l2_feeds);
8171 + size = l2arc_write_size();
8172 +
8173 + /*
8174 + * Evict L2ARC buffers that will be overwritten.
8175 + * B_FALSE guarantees synchronous eviction.
8176 + */
8177 + (void) l2arc_evict(dev, size, B_FALSE);
8178 +
8179 + /*
8180 + * Write ARC buffers.
8181 + */
8182 + *wrote = l2arc_write_buffers(spa, dev, size, feed_type);
8183 +
8184 + spa_config_exit(spa, SCL_L2ARC, dev);
8185 +
8186 + return (B_TRUE);
8187 +}
8188 +
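The feed_type filter in l2arc_write_buffers() above is what keeps DDT buffers on dedicated DDT cache devices and everything else on regular ones. A compileable sketch of just that predicate; the enum names and values here are stand-ins, not the real l2ad_feed_t and arc_buf_contents_t definitions:

#include <stdio.h>

typedef enum { FEED_ALL, FEED_DDT_DEV, FEED_NON_DDT_DEV } feed_type_t;
typedef enum { BUFC_DATA, BUFC_METADATA, BUFC_DDT } buf_type_t;

/*
 * Mirrors the skip test: a DDT-dedicated device accepts only DDT buffers,
 * a regular device accepts only non-DDT buffers, FEED_ALL accepts both.
 */
static int
buf_matches_feed(feed_type_t feed, buf_type_t type)
{
        if (feed == FEED_DDT_DEV && type != BUFC_DDT)
                return (0);
        if (feed == FEED_NON_DDT_DEV && type == BUFC_DDT)
                return (0);
        return (1);
}

int
main(void)
{
        printf("%d %d %d\n",
            buf_matches_feed(FEED_DDT_DEV, BUFC_DDT),           /* 1 */
            buf_matches_feed(FEED_NON_DDT_DEV, BUFC_DDT),       /* 0 */
            buf_matches_feed(FEED_ALL, BUFC_METADATA));         /* 1 */
        return (0);
}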
7116 8189 /*
7117 8190 * This thread feeds the L2ARC at regular intervals. This is the beating
7118 8191 * heart of the L2ARC.
7119 8192 */
7120 8193 /* ARGSUSED */
7121 8194 static void
7122 8195 l2arc_feed_thread(void *unused)
7123 8196 {
7124 8197 callb_cpr_t cpr;
7125 - l2arc_dev_t *dev;
7126 - spa_t *spa;
7127 - uint64_t size, wrote;
8198 + uint64_t size, total_written = 0;
7128 8199 clock_t begin, next = ddi_get_lbolt();
8200 + l2ad_feed_t feed_type = L2ARC_FEED_ALL;
7129 8201
7130 8202 CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
7131 8203
7132 8204 mutex_enter(&l2arc_feed_thr_lock);
7133 8205
7134 8206 while (l2arc_thread_exit == 0) {
7135 8207 CALLB_CPR_SAFE_BEGIN(&cpr);
7136 8208 (void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
7137 8209 next);
7138 8210 CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
7139 8211 next = ddi_get_lbolt() + hz;
7140 8212
7141 8213 /*
7142 8214 * Quick check for L2ARC devices.
7143 8215 */
7144 8216 mutex_enter(&l2arc_dev_mtx);
7145 8217 if (l2arc_ndev == 0) {
7146 8218 mutex_exit(&l2arc_dev_mtx);
7147 8219 continue;
7148 8220 }
7149 8221 mutex_exit(&l2arc_dev_mtx);
7150 8222 begin = ddi_get_lbolt();
7151 8223
7152 8224 /*
7153 - * This selects the next l2arc device to write to, and in
7154 - * doing so the next spa to feed from: dev->l2ad_spa. This
7155 - * will return NULL if there are now no l2arc devices or if
7156 - * they are all faulted.
7157 - *
7158 - * If a device is returned, its spa's config lock is also
7159 - * held to prevent device removal. l2arc_dev_get_next()
7160 - * will grab and release l2arc_dev_mtx.
7161 - */
7162 - if ((dev = l2arc_dev_get_next()) == NULL)
7163 - continue;
7164 -
7165 - spa = dev->l2ad_spa;
7166 - ASSERT3P(spa, !=, NULL);
7167 -
7168 - /*
7169 - * If the pool is read-only then force the feed thread to
7170 - * sleep a little longer.
7171 - */
7172 - if (!spa_writeable(spa)) {
7173 - next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
7174 - spa_config_exit(spa, SCL_L2ARC, dev);
7175 - continue;
7176 - }
7177 -
7178 - /*
7179 8225 * Avoid contributing to memory pressure.
7180 8226 */
7181 8227 if (arc_reclaim_needed()) {
7182 8228 ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
7183 - spa_config_exit(spa, SCL_L2ARC, dev);
7184 8229 continue;
7185 8230 }
7186 8231
7187 - ARCSTAT_BUMP(arcstat_l2_feeds);
8232 + /* try to write to DDT L2ARC device if any */
8233 + if (l2arc_feed_dev(L2ARC_FEED_DDT_DEV, &size)) {
8234 + total_written += size;
8235 + feed_type = L2ARC_FEED_NON_DDT_DEV;
8236 + }
7188 8237
7189 - size = l2arc_write_size();
8238 + /* try to write to the regular L2ARC device if any */
8239 + if (l2arc_feed_dev(feed_type, &size)) {
8240 + total_written += size;
8241 + if (feed_type == L2ARC_FEED_NON_DDT_DEV)
8242 + total_written /= 2; /* avg written per device */
8243 + }
7190 8244
7191 8245 /*
7192 - * Evict L2ARC buffers that will be overwritten.
7193 - */
7194 - l2arc_evict(dev, size, B_FALSE);
7195 -
7196 - /*
7197 - * Write ARC buffers.
7198 - */
7199 - wrote = l2arc_write_buffers(spa, dev, size);
7200 -
7201 - /*
7202 8246 * Calculate interval between writes.
7203 8247 */
7204 - next = l2arc_write_interval(begin, size, wrote);
7205 - spa_config_exit(spa, SCL_L2ARC, dev);
8248 + next = l2arc_write_interval(begin, l2arc_write_size(),
8249 + total_written);
8250 +
8251 + total_written = 0;
7206 8252 }
7207 8253
7208 8254 l2arc_thread_exit = 0;
7209 8255 cv_broadcast(&l2arc_feed_thr_cv);
7210 8256 CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
7211 8257 thread_exit();
7212 8258 }
7213 8259
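Each feed cycle above tries a DDT cache device first and then a regular one, and when both were fed it halves total_written so l2arc_write_interval() paces itself on an average per device. A self-contained sketch of that control flow; feed_dev() is a hypothetical stand-in for l2arc_feed_dev() and the byte counts are made up:

#include <stdint.h>
#include <stdio.h>

typedef enum { FEED_ALL, FEED_DDT_DEV, FEED_NON_DDT_DEV } feed_type_t;

/* Pretend a DDT device wrote 64 MB and a regular device wrote 32 MB. */
static int
feed_dev(feed_type_t feed, uint64_t *wrote)
{
        *wrote = (feed == FEED_DDT_DEV) ? (64ULL << 20) : (32ULL << 20);
        return (1);
}

int
main(void)
{
        uint64_t wrote, total = 0;
        feed_type_t feed = FEED_ALL;

        if (feed_dev(FEED_DDT_DEV, &wrote)) {   /* DDT cache device first */
                total += wrote;
                feed = FEED_NON_DDT_DEV;        /* DDT buffers handled */
        }
        if (feed_dev(feed, &wrote)) {           /* then a regular device */
                total += wrote;
                if (feed == FEED_NON_DDT_DEV)
                        total /= 2;             /* average per device fed */
        }
        printf("avg written per device: %llu MB\n",
            (unsigned long long)(total >> 20));
        return (0);
}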
7214 8260 boolean_t
7215 8261 l2arc_vdev_present(vdev_t *vd)
7216 8262 {
7217 - l2arc_dev_t *dev;
8263 + return (l2arc_vdev_get(vd) != NULL);
8264 +}
7218 8265
7219 - mutex_enter(&l2arc_dev_mtx);
8266 +/*
8267 + * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if
8268 + * the vdev_t isn't an L2ARC device.
8269 + */
8270 +static l2arc_dev_t *
8271 +l2arc_vdev_get(vdev_t *vd)
8272 +{
8273 + l2arc_dev_t *dev;
8274 + boolean_t held = MUTEX_HELD(&l2arc_dev_mtx);
8275 +
8276 + if (!held)
8277 + mutex_enter(&l2arc_dev_mtx);
7220 8278 for (dev = list_head(l2arc_dev_list); dev != NULL;
7221 8279 dev = list_next(l2arc_dev_list, dev)) {
7222 8280 if (dev->l2ad_vdev == vd)
7223 8281 break;
7224 8282 }
7225 - mutex_exit(&l2arc_dev_mtx);
8283 + if (!held)
8284 + mutex_exit(&l2arc_dev_mtx);
7226 8285
7227 - return (dev != NULL);
8286 + return (dev);
7228 8287 }
7229 8288
7230 8289 /*
7231 8290 * Add a vdev for use by the L2ARC. By this point the spa has already
7232 - * validated the vdev and opened it.
8291 + * validated the vdev and opened it. The `rebuild' flag indicates whether
8292 + * we should attempt an L2ARC persistency rebuild.
7233 8293 */
7234 8294 void
7235 -l2arc_add_vdev(spa_t *spa, vdev_t *vd)
8295 +l2arc_add_vdev(spa_t *spa, vdev_t *vd, boolean_t rebuild)
7236 8296 {
7237 8297 l2arc_dev_t *adddev;
7238 8298
7239 8299 ASSERT(!l2arc_vdev_present(vd));
7240 8300
7241 8301 /*
7242 8302 * Create a new l2arc device entry.
7243 8303 */
7244 8304 adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
7245 8305 adddev->l2ad_spa = spa;
7246 8306 adddev->l2ad_vdev = vd;
7247 - adddev->l2ad_start = VDEV_LABEL_START_SIZE;
8307 + /* leave extra size for an l2arc device header */
8308 + adddev->l2ad_dev_hdr_asize = MAX(sizeof (*adddev->l2ad_dev_hdr),
8309 + 1 << vd->vdev_ashift);
8310 + adddev->l2ad_start = VDEV_LABEL_START_SIZE + adddev->l2ad_dev_hdr_asize;
7248 8311 adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
8312 + ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);
7249 8313 adddev->l2ad_hand = adddev->l2ad_start;
7250 8314 adddev->l2ad_first = B_TRUE;
7251 8315 adddev->l2ad_writing = B_FALSE;
8316 + adddev->l2ad_dev_hdr = kmem_zalloc(adddev->l2ad_dev_hdr_asize,
8317 + KM_SLEEP);
7252 8318
7253 8319 mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
7254 8320 /*
7255 8321 * This is a list of all ARC buffers that are still valid on the
7256 8322 * device.
7257 8323 */
7258 8324 list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
7259 8325 offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
7260 8326
7261 8327 vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
7262 8328 refcount_create(&adddev->l2ad_alloc);
7263 8329
7264 8330 /*
7265 8331 * Add device to global list
7266 8332 */
7267 8333 mutex_enter(&l2arc_dev_mtx);
7268 8334 list_insert_head(l2arc_dev_list, adddev);
7269 8335 atomic_inc_64(&l2arc_ndev);
8336 + if (rebuild && l2arc_rebuild_enabled &&
8337 + adddev->l2ad_end - adddev->l2ad_start > L2ARC_PERSIST_MIN_SIZE) {
8338 + /*
8339 + * Just mark the device as pending for a rebuild. We won't
8340 + * be starting a rebuild in line here as it would block pool
8341 + * import. Instead spa_load_impl will hand that off to an
8342 + * async task which will call l2arc_spa_rebuild_start.
8343 + */
8344 + adddev->l2ad_rebuild = B_TRUE;
8345 + }
7270 8346 mutex_exit(&l2arc_dev_mtx);
7271 8347 }
7272 8348
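The arithmetic in l2arc_add_vdev() reserves a header area of at least one device block right after the front vdev labels and starts the write hand past it. A quick sketch of those offsets; the header size, device size and the VDEV_LABEL_START_SIZE value below are illustrative only (the real constant lives in the vdev headers):

#include <stdint.h>
#include <stdio.h>

#define TOY_MAX(a, b)           ((a) > (b) ? (a) : (b))
#define VDEV_LABEL_START_SIZE   (4ULL << 20)    /* illustrative value */

int
main(void)
{
        uint64_t ashift = 12;                   /* assume 4K blocks */
        uint64_t dev_hdr_size = 512;            /* pretend header struct size */
        uint64_t min_asize = 100ULL << 30;      /* pretend 100 GiB cache dev */

        /* The header area is at least one device block. */
        uint64_t dev_hdr_asize = TOY_MAX(dev_hdr_size, 1ULL << ashift);
        uint64_t l2ad_start = VDEV_LABEL_START_SIZE + dev_hdr_asize;
        uint64_t l2ad_end = VDEV_LABEL_START_SIZE + min_asize;

        printf("hdr area %llu bytes, usable range [%llu, %llu)\n",
            (unsigned long long)dev_hdr_asize,
            (unsigned long long)l2ad_start,
            (unsigned long long)l2ad_end);
        return (0);
}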
7273 8349 /*
7274 8350 * Remove a vdev from the L2ARC.
7275 8351 */
7276 8352 void
7277 8353 l2arc_remove_vdev(vdev_t *vd)
7278 8354 {
7279 8355 l2arc_dev_t *dev, *nextdev, *remdev = NULL;
7280 8356
7281 8357 /*
7282 8358 * Find the device by vdev
7283 8359 */
7284 8360 mutex_enter(&l2arc_dev_mtx);
7285 8361 for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
7286 8362 nextdev = list_next(l2arc_dev_list, dev);
7287 8363 if (vd == dev->l2ad_vdev) {
7288 8364 remdev = dev;
7289 8365 break;
7290 8366 }
7291 8367 }
7292 8368 ASSERT3P(remdev, !=, NULL);
7293 8369
7294 8370 /*
8371 + * Cancel any ongoing or scheduled rebuild (race protection with
8372 + * l2arc_spa_rebuild_start provided via l2arc_dev_mtx).
8373 + */
8374 + remdev->l2ad_rebuild_cancel = B_TRUE;
8375 + if (remdev->l2ad_rebuild_did != 0) {
8376 + /*
8377 + * N.B. it should be safe to thread_join with the rebuild
8378 + * thread while holding l2arc_dev_mtx because it is not
8379 + * accessed from anywhere in the l2arc rebuild code below
8380 + * (except for l2arc_spa_rebuild_start, which is ok).
8381 + */
8382 + thread_join(remdev->l2ad_rebuild_did);
8383 + }
8384 +
8385 + /*
7295 8386 * Remove device from global list
7296 8387 */
7297 8388 list_remove(l2arc_dev_list, remdev);
7298 8389 l2arc_dev_last = NULL; /* may have been invalidated */
8390 + l2arc_ddt_dev_last = NULL; /* may have been invalidated */
7299 8391 atomic_dec_64(&l2arc_ndev);
7300 8392 mutex_exit(&l2arc_dev_mtx);
7301 8393
8394 + if (vdev_type_is_ddt(remdev->l2ad_vdev))
8395 + atomic_add_64(&remdev->l2ad_spa->spa_l2arc_ddt_devs_size,
8396 + -(vdev_get_min_asize(remdev->l2ad_vdev)));
8397 +
7302 8398 /*
7303 8399 * Clear all buflists and ARC references. L2ARC device flush.
7304 8400 */
7305 - l2arc_evict(remdev, 0, B_TRUE);
7306 - list_destroy(&remdev->l2ad_buflist);
7307 - mutex_destroy(&remdev->l2ad_mtx);
7308 - refcount_destroy(&remdev->l2ad_alloc);
7309 - kmem_free(remdev, sizeof (l2arc_dev_t));
8401 + if (l2arc_evict(remdev, 0, B_TRUE) == B_FALSE) {
8402 + /*
8403 + * The eviction was done synchronously; clean up here.
8404 + * Otherwise, the asynchronous task will do the cleanup.
8405 + */
8406 + list_destroy(&remdev->l2ad_buflist);
8407 + mutex_destroy(&remdev->l2ad_mtx);
8408 + kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
8409 + kmem_free(remdev, sizeof (l2arc_dev_t));
8410 + }
7310 8411 }
7311 8412
7312 8413 void
7313 8414 l2arc_init(void)
7314 8415 {
7315 8416 l2arc_thread_exit = 0;
7316 8417 l2arc_ndev = 0;
7317 8418 l2arc_writes_sent = 0;
7318 8419 l2arc_writes_done = 0;
7319 8420
7320 8421 mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
7321 8422 cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
7322 8423 mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
7323 8424 mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
7324 8425
7325 8426 l2arc_dev_list = &L2ARC_dev_list;
7326 8427 l2arc_free_on_write = &L2ARC_free_on_write;
7327 8428 list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
7328 8429 offsetof(l2arc_dev_t, l2ad_node));
7329 8430 list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
7330 8431 offsetof(l2arc_data_free_t, l2df_list_node));
7331 8432 }
7332 8433
7333 8434 void
7334 8435 l2arc_fini(void)
7335 8436 {
7336 8437 /*
7337 8438 * This is called from dmu_fini(), which is called from spa_fini();
7338 8439 * Because of this, we can assume that all l2arc devices have
7339 8440 * already been removed when the pools themselves were removed.
7340 8441 */
7341 8442
7342 8443 l2arc_do_free_on_write();
7343 8444
7344 8445 mutex_destroy(&l2arc_feed_thr_lock);
7345 8446 cv_destroy(&l2arc_feed_thr_cv);
7346 8447 mutex_destroy(&l2arc_dev_mtx);
7347 8448 mutex_destroy(&l2arc_free_on_write_mtx);
7348 8449
7349 8450 list_destroy(l2arc_dev_list);
7350 8451 list_destroy(l2arc_free_on_write);
7351 8452 }
7352 8453
7353 8454 void
7354 8455 l2arc_start(void)
7355 8456 {
7356 8457 if (!(spa_mode_global & FWRITE))
7357 8458 return;
7358 8459
7359 8460 (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
7360 8461 TS_RUN, minclsyspri);
7361 8462 }
7362 8463
7363 8464 void
7364 8465 l2arc_stop(void)
7365 8466 {
7366 8467 if (!(spa_mode_global & FWRITE))
7367 8468 return;
7368 8469
7369 8470 mutex_enter(&l2arc_feed_thr_lock);
7370 8471 cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
7371 8472 l2arc_thread_exit = 1;
7372 8473 while (l2arc_thread_exit != 0)
7373 8474 cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
7374 8475 mutex_exit(&l2arc_feed_thr_lock);
8476 +}
8477 +
8478 +/*
8479 + * Punches out rebuild threads for the L2ARC devices in a spa. This should
8480 + * be called after pool import from the spa async thread, since starting
8481 + * these threads directly from spa_import() will make them part of the
8482 + * "zpool import" context and delay process exit (and thus pool import).
8483 + */
8484 +void
8485 +l2arc_spa_rebuild_start(spa_t *spa)
8486 +{
8487 + /*
8488 + * Locate the spa's l2arc devices and kick off rebuild threads.
8489 + */
8490 + mutex_enter(&l2arc_dev_mtx);
8491 + for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
8492 + l2arc_dev_t *dev =
8493 + l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
8494 + if (dev == NULL) {
8495 + /* Don't attempt a rebuild if the vdev is UNAVAIL */
8496 + continue;
8497 + }
8498 + if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
8499 + VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
8500 +#ifdef _KERNEL
8501 + dev->l2ad_rebuild_did = thread_create(NULL, 0,
8502 + l2arc_dev_rebuild_start, dev, 0, &p0, TS_RUN,
8503 + minclsyspri)->t_did;
8504 +#endif
8505 + }
8506 + }
8507 + mutex_exit(&l2arc_dev_mtx);
8508 +}
8509 +
8510 +/*
8511 + * Main entry point for L2ARC rebuilding.
8512 + */
8513 +static void
8514 +l2arc_dev_rebuild_start(l2arc_dev_t *dev)
8515 +{
8516 + if (!dev->l2ad_rebuild_cancel) {
8517 + VERIFY(dev->l2ad_rebuild);
8518 + (void) l2arc_rebuild(dev);
8519 + dev->l2ad_rebuild = B_FALSE;
8520 + }
8521 +}
8522 +
8523 +/*
8524 + * This function implements the actual L2ARC metadata rebuild. It:
8525 + *
8526 + * 1) reads the device's header
8527 + * 2) if a good device header is found, starts reading the log block chain
8528 + * 3) restores each block's contents to memory (reconstructing arc_buf_hdr_t's)
8529 + *
8530 + * Operation stops under any of the following conditions:
8531 + *
8532 + * 1) We reach the end of the log blk chain (the back-reference in the blk is
8533 + * invalid or loops over our starting point).
8534 + * 2) We encounter *any* error condition (cksum errors, io errors, looped
8535 + * blocks, etc.).
8536 + */
8537 +static int
8538 +l2arc_rebuild(l2arc_dev_t *dev)
8539 +{
8540 + vdev_t *vd = dev->l2ad_vdev;
8541 + spa_t *spa = vd->vdev_spa;
8542 + int err;
8543 + l2arc_log_blk_phys_t *this_lb, *next_lb;
8544 + uint8_t *this_lb_buf, *next_lb_buf;
8545 + zio_t *this_io = NULL, *next_io = NULL;
8546 + l2arc_log_blkptr_t lb_ptrs[2];
8547 + boolean_t first_pass, lock_held;
8548 + uint64_t load_guid;
8549 +
8550 + this_lb = kmem_zalloc(sizeof (*this_lb), KM_SLEEP);
8551 + next_lb = kmem_zalloc(sizeof (*next_lb), KM_SLEEP);
8552 + this_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
8553 + next_lb_buf = kmem_zalloc(sizeof (l2arc_log_blk_phys_t), KM_SLEEP);
8554 +
8555 + /*
8556 + * We prevent device removal while issuing reads to the device,
8557 + * then during the rebuilding phases we drop this lock again so
8558 + * that a spa_unload or device remove can be initiated - this is
8559 + * safe, because the spa will signal us to stop before removing
8560 + * our device and wait for us to stop.
8561 + */
8562 + spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
8563 + lock_held = B_TRUE;
8564 +
8565 + load_guid = spa_load_guid(dev->l2ad_vdev->vdev_spa);
8566 + /*
8567 + * Device header processing phase.
8568 + */
8569 + if ((err = l2arc_dev_hdr_read(dev)) != 0) {
8570 + /* device header corrupted, start a new one */
8571 + bzero(dev->l2ad_dev_hdr, dev->l2ad_dev_hdr_asize);
8572 + goto out;
8573 + }
8574 +
8575 + /* Retrieve the persistent L2ARC device state */
8576 + dev->l2ad_hand = vdev_psize_to_asize(dev->l2ad_vdev,
8577 + dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr +
8578 + LBP_GET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0]));
8579 + dev->l2ad_first = !!(dev->l2ad_dev_hdr->dh_flags &
8580 + L2ARC_DEV_HDR_EVICT_FIRST);
8581 +
8582 + /* Prepare the rebuild processing state */
8583 + bcopy(dev->l2ad_dev_hdr->dh_start_lbps, lb_ptrs, sizeof (lb_ptrs));
8584 + first_pass = B_TRUE;
8585 +
8586 + /* Start the rebuild process */
8587 + for (;;) {
8588 + if (!l2arc_log_blkptr_valid(dev, &lb_ptrs[0]))
8589 + /* We hit an invalid block address, end the rebuild. */
8590 + break;
8591 +
8592 + if ((err = l2arc_log_blk_read(dev, &lb_ptrs[0], &lb_ptrs[1],
8593 + this_lb, next_lb, this_lb_buf, next_lb_buf,
8594 + this_io, &next_io)) != 0)
8595 + break;
8596 +
8597 + spa_config_exit(spa, SCL_L2ARC, vd);
8598 + lock_held = B_FALSE;
8599 +
8600 + /* Protection against infinite loops of log blocks. */
8601 + if (l2arc_range_check_overlap(lb_ptrs[1].lbp_daddr,
8602 + lb_ptrs[0].lbp_daddr,
8603 + dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
8604 + !first_pass) {
8605 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_loop_errors);
8606 + err = SET_ERROR(ELOOP);
8607 + break;
8608 + }
8609 +
8610 + /*
8611 + * Our memory pressure valve. If the system is running low
8612 + * on memory, rather than swamping memory with new ARC buf
8613 + * hdrs, we opt not to rebuild the L2ARC. At this point,
8614 + * however, we have already set up our L2ARC dev to chain in
8615 + * new metadata log blk, so the user may choose to re-add the
8616 + * L2ARC dev at a later time to reconstruct it (when there's
8617 + * less memory pressure).
8618 + */
8619 + if (arc_reclaim_needed()) {
8620 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
8621 + cmn_err(CE_NOTE, "System running low on memory, "
8622 + "aborting L2ARC rebuild.");
8623 + err = SET_ERROR(ENOMEM);
8624 + break;
8625 + }
8626 +
8627 + /*
8628 + * Now that we know that the next_lb checks out alright, we
8629 + * can start reconstruction from this lb - we can be sure
8630 + * that the L2ARC write hand has not yet reached any of our
8631 + * buffers.
8632 + */
8633 + l2arc_log_blk_restore(dev, load_guid, this_lb,
8634 + LBP_GET_PSIZE(&lb_ptrs[0]));
8635 +
8636 + /*
8637 + * End of list detection. We can look ahead two steps in the
8638 + * blk chain and if the 2nd blk from this_lb dips below the
8639 + * initial chain starting point, then we know two things:
8640 + * 1) it can't be valid, and
8641 + * 2) the next_lb's ARC entries might have already been
8642 + * partially overwritten and so we should stop before
8643 + * we restore it
8644 + */
8645 + if (l2arc_range_check_overlap(
8646 + this_lb->lb_back2_lbp.lbp_daddr, lb_ptrs[0].lbp_daddr,
8647 + dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr) &&
8648 + !first_pass)
8649 + break;
8650 +
8651 + /* log blk restored, continue with next one in the list */
8652 + lb_ptrs[0] = lb_ptrs[1];
8653 + lb_ptrs[1] = this_lb->lb_back2_lbp;
8654 + PTR_SWAP(this_lb, next_lb);
8655 + PTR_SWAP(this_lb_buf, next_lb_buf);
8656 + this_io = next_io;
8657 + next_io = NULL;
8658 + first_pass = B_FALSE;
8659 +
8660 + for (;;) {
8661 + if (dev->l2ad_rebuild_cancel) {
8662 + err = SET_ERROR(ECANCELED);
8663 + goto out;
8664 + }
8665 + if (spa_config_tryenter(spa, SCL_L2ARC, vd,
8666 + RW_READER)) {
8667 + lock_held = B_TRUE;
8668 + break;
8669 + }
8670 + /*
8671 + * The L2ARC config lock is held by somebody as writer,
8672 + * possibly because they are trying to remove us. They'll
8673 + * likely want us to shut down, so after a little
8674 + * delay, we check l2ad_rebuild_cancel and retry
8675 + * the lock again.
8676 + */
8677 + delay(1);
8678 + }
8679 + }
8680 +out:
8681 + if (next_io != NULL)
8682 + l2arc_log_blk_prefetch_abort(next_io);
8683 + kmem_free(this_lb, sizeof (*this_lb));
8684 + kmem_free(next_lb, sizeof (*next_lb));
8685 + kmem_free(this_lb_buf, sizeof (l2arc_log_blk_phys_t));
8686 + kmem_free(next_lb_buf, sizeof (l2arc_log_blk_phys_t));
8687 + if (err == 0)
8688 + ARCSTAT_BUMP(arcstat_l2_rebuild_successes);
8689 +
8690 + if (lock_held)
8691 + spa_config_exit(spa, SCL_L2ARC, vd);
8692 +
8693 + return (err);
8694 +}
8695 +
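The rebuild loop above is essentially a backwards walk over a chain of log blocks, stopping at the first invalid or overlapping back-pointer. A much-simplified synchronous sketch of that walk over an in-memory chain; the types are toys, there is no I/O or prefetch, and the loop guard here merely caps the number of steps where the kernel code instead uses l2arc_range_check_overlap() against the chain's starting point:

#include <stdint.h>
#include <stdio.h>

typedef struct toy_lb {
        uint64_t        lb_daddr;       /* device address of this block */
        uint64_t        lb_back_daddr;  /* address of the previous block */
} toy_lb_t;

/* Walk a backward-linked chain, stopping on address 0 or a bad pointer. */
static int
walk_chain(const toy_lb_t *blocks, int nblocks, uint64_t start_daddr)
{
        uint64_t daddr = start_daddr;
        int restored = 0;

        while (daddr != 0 && restored <= nblocks) {
                const toy_lb_t *lb = NULL;

                for (int i = 0; i < nblocks; i++) {
                        if (blocks[i].lb_daddr == daddr)
                                lb = &blocks[i];
                }
                if (lb == NULL)
                        break;                  /* invalid pointer: stop */
                printf("restoring log blk at %llu\n",
                    (unsigned long long)daddr);
                restored++;
                daddr = lb->lb_back_daddr;      /* step back in time */
        }
        return (restored);
}

int
main(void)
{
        toy_lb_t chain[] = {
                { .lb_daddr = 300, .lb_back_daddr = 200 },
                { .lb_daddr = 200, .lb_back_daddr = 100 },
                { .lb_daddr = 100, .lb_back_daddr = 0 },  /* end of chain */
        };

        (void) walk_chain(chain, 3, 300);       /* newest block first */
        return (0);
}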
8696 +/*
8697 + * Attempts to read the device header on the provided L2ARC device and
8698 + * stores it in dev->l2ad_dev_hdr. On success, this function returns 0,
8699 + * otherwise the appropriate error code is returned.
8700 + */
8701 +static int
8702 +l2arc_dev_hdr_read(l2arc_dev_t *dev)
8703 +{
8704 + int err;
8705 + uint64_t guid;
8706 + zio_cksum_t cksum;
8707 + l2arc_dev_hdr_phys_t *hdr = dev->l2ad_dev_hdr;
8708 + const uint64_t hdr_asize = dev->l2ad_dev_hdr_asize;
8709 + abd_t *abd;
8710 +
8711 + guid = spa_guid(dev->l2ad_vdev->vdev_spa);
8712 +
8713 + abd = abd_get_from_buf(hdr, hdr_asize);
8714 + err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
8715 + VDEV_LABEL_START_SIZE, hdr_asize, abd,
8716 + ZIO_CHECKSUM_OFF, NULL, NULL, ZIO_PRIORITY_ASYNC_READ,
8717 + ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
8718 + ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
8719 + abd_put(abd);
8720 + if (err != 0) {
8721 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
8722 + return (err);
8723 + }
8724 +
8725 + if (hdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC_V1))
8726 + byteswap_uint64_array(hdr, sizeof (*hdr));
8727 +
8728 + if (hdr->dh_magic != L2ARC_DEV_HDR_MAGIC_V1 ||
8729 + hdr->dh_spa_guid != guid) {
8730 + /*
8731 + * Attempt to rebuild a device containing no actual dev hdr
8732 + * or containing a header from some other pool.
8733 + */
8734 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
8735 + return (SET_ERROR(ENOTSUP));
8736 + }
8737 +
8738 + l2arc_dev_hdr_checksum(hdr, &cksum);
8739 + if (!ZIO_CHECKSUM_EQUAL(hdr->dh_self_cksum, cksum)) {
8740 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
8741 + return (SET_ERROR(EINVAL));
8742 + }
8743 +
8744 + return (0);
8745 +}
8746 +
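The dh_magic comparison above doubles as endianness detection: if the stored magic equals the byte-swapped constant, the header was written by an opposite-endian host and the whole structure is swapped in place before any field is trusted. A self-contained sketch of that idiom; the magic constant and two-word "header" are made up:

#include <stdint.h>
#include <stdio.h>

#define TOY_MAGIC       0x12bab10c2097711eULL   /* made-up constant */

static uint64_t
toy_bswap64(uint64_t v)
{
        return (((v & 0x00000000000000ffULL) << 56) |
            ((v & 0x000000000000ff00ULL) << 40) |
            ((v & 0x0000000000ff0000ULL) << 24) |
            ((v & 0x00000000ff000000ULL) << 8) |
            ((v & 0x000000ff00000000ULL) >> 8) |
            ((v & 0x0000ff0000000000ULL) >> 24) |
            ((v & 0x00ff000000000000ULL) >> 40) |
            ((v & 0xff00000000000000ULL) >> 56));
}

int
main(void)
{
        /* Pretend the header was written by an opposite-endian host. */
        uint64_t hdr[2] = { toy_bswap64(TOY_MAGIC), toy_bswap64(42) };

        if (hdr[0] == toy_bswap64(TOY_MAGIC)) {
                for (int i = 0; i < 2; i++)     /* swap every word */
                        hdr[i] = toy_bswap64(hdr[i]);
        }
        if (hdr[0] != TOY_MAGIC) {
                printf("no valid header, starting a fresh one\n");
                return (1);
        }
        printf("header ok, payload=%llu\n", (unsigned long long)hdr[1]);
        return (0);
}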
8747 +/*
8748 + * Reads L2ARC log blocks from storage and validates their contents.
8749 + *
8750 + * This function implements a simple prefetcher to make sure that while
8751 + * we're processing one buffer the L2ARC is already prefetching the next
8752 + * one in the chain.
8753 + *
8754 + * The arguments this_lbp and next_lbp point to the current and next log blk
8755 + * address in the block chain. Similarly, this_lb and next_lb hold the
8756 + * l2arc_log_blk_phys_t's of the current and next L2ARC blk. The this_lb_buf
8757 + * and next_lb_buf must be buffers of appropriate size to hold a raw
8758 + * l2arc_log_blk_phys_t (they are used as catch buffers for read ops prior
8759 + * to buffer decompression).
8760 + *
8761 + * The `this_io' and `next_io' arguments are used for block prefetching.
8762 + * When issuing the first blk IO during rebuild, you should pass NULL for
8763 + * `this_io'. This function will then issue a sync IO to read the block and
8764 + * also issue an async IO to fetch the next block in the block chain. The
8765 + * prefetch IO is returned in `next_io'. On subsequent calls to this
8766 + * function, pass the value returned in `next_io' from the previous call
8767 + * as `this_io' and a fresh `next_io' pointer to hold the next prefetch IO.
8768 + * Prior to the call, you should initialize your `next_io' pointer to be
8769 + * NULL. If no prefetch IO was issued, the pointer is left set at NULL.
8770 + *
8771 + * On success, this function returns 0, otherwise it returns an appropriate
8772 + * error code. On error the prefetching IO is aborted and cleared before
8773 + * returning from this function. Therefore, if we return `success', the
8774 + * caller can assume that we have taken care of cleanup of prefetch IOs.
8775 + */
8776 +static int
8777 +l2arc_log_blk_read(l2arc_dev_t *dev,
8778 + const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
8779 + l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
8780 + uint8_t *this_lb_buf, uint8_t *next_lb_buf,
8781 + zio_t *this_io, zio_t **next_io)
8782 +{
8783 + int err = 0;
8784 + zio_cksum_t cksum;
8785 +
8786 + ASSERT(this_lbp != NULL && next_lbp != NULL);
8787 + ASSERT(this_lb != NULL && next_lb != NULL);
8788 + ASSERT(this_lb_buf != NULL && next_lb_buf != NULL);
8789 + ASSERT(next_io != NULL && *next_io == NULL);
8790 + ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));
8791 +
8792 + /*
8793 + * Check to see if we have issued the IO for this log blk in a
8794 + * previous run. If not, this is the first call, so issue it now.
8795 + */
8796 + if (this_io == NULL) {
8797 + this_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, this_lbp,
8798 + this_lb_buf);
8799 + }
8800 +
8801 + /*
8802 + * Peek to see if we can start issuing the next IO immediately.
8803 + */
8804 + if (l2arc_log_blkptr_valid(dev, next_lbp)) {
8805 + /*
8806 + * Start issuing IO for the next log blk early - this
8807 + * should help keep the L2ARC device busy while we
8808 + * decompress and restore this log blk.
8809 + */
8810 + *next_io = l2arc_log_blk_prefetch(dev->l2ad_vdev, next_lbp,
8811 + next_lb_buf);
8812 + }
8813 +
8814 + /* Wait for the IO to read this log block to complete */
8815 + if ((err = zio_wait(this_io)) != 0) {
8816 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
8817 + goto cleanup;
8818 + }
8819 +
8820 + /* Make sure the buffer checks out */
8821 + fletcher_4_native(this_lb_buf, LBP_GET_PSIZE(this_lbp), NULL, &cksum);
8822 + if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
8823 + ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_errors);
8824 + err = SET_ERROR(EINVAL);
8825 + goto cleanup;
8826 + }
8827 +
8828 + /* Now we can take our time decoding this buffer */
8829 + switch (LBP_GET_COMPRESS(this_lbp)) {
8830 + case ZIO_COMPRESS_OFF:
8831 + bcopy(this_lb_buf, this_lb, sizeof (*this_lb));
8832 + break;
8833 + case ZIO_COMPRESS_LZ4:
8834 + err = zio_decompress_data_buf(LBP_GET_COMPRESS(this_lbp),
8835 + this_lb_buf, this_lb, LBP_GET_PSIZE(this_lbp),
8836 + sizeof (*this_lb));
8837 + if (err != 0) {
8838 + err = SET_ERROR(EINVAL);
8839 + goto cleanup;
8840 + }
8841 +
8842 + break;
8843 + default:
8844 + err = SET_ERROR(EINVAL);
8845 + break;
8846 + }
8847 +
8848 + if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
8849 + byteswap_uint64_array(this_lb, sizeof (*this_lb));
8850 +
8851 + if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
8852 + err = SET_ERROR(EINVAL);
8853 + goto cleanup;
8854 + }
8855 +
8856 +cleanup:
8857 + /* Abort an in-flight prefetch I/O in case of error */
8858 + if (err != 0 && *next_io != NULL) {
8859 + l2arc_log_blk_prefetch_abort(*next_io);
8860 + *next_io = NULL;
8861 + }
8862 + return (err);
8863 +}
8864 +
8865 +/*
8866 + * Restores the payload of a log blk to ARC. This creates empty ARC hdr
8867 + * entries which only contain an l2arc hdr, essentially restoring the
8868 + * buffers to their L2ARC evicted state. This function also updates space
8869 + * usage on the L2ARC vdev to make sure it tracks restored buffers.
8870 + */
8871 +static void
8872 +l2arc_log_blk_restore(l2arc_dev_t *dev, uint64_t load_guid,
8873 + const l2arc_log_blk_phys_t *lb, uint64_t lb_psize)
8874 +{
8875 + uint64_t size = 0, psize = 0;
8876 +
8877 + for (int i = L2ARC_LOG_BLK_ENTRIES - 1; i >= 0; i--) {
8878 + /*
8879 + * Restore goes in the reverse temporal direction to preserve
8880 + * correct temporal ordering of buffers in the l2ad_buflist.
8881 + * l2arc_hdr_restore also does a list_insert_tail instead of
8882 + * list_insert_head on the l2ad_buflist:
8883 + *
8884 + * LIST l2ad_buflist LIST
8885 + * HEAD <------ (time) ------ TAIL
8886 + * direction +-----+-----+-----+-----+-----+ direction
8887 + * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
8888 + * fill +-----+-----+-----+-----+-----+
8889 + * ^ ^
8890 + * | |
8891 + * | |
8892 + * l2arc_fill_thread l2arc_rebuild
8893 + * places new bufs here restores bufs here
8894 + *
8895 + * This also works when the restored bufs get evicted at any
8896 + * point during the rebuild.
8897 + */
8898 + l2arc_hdr_restore(&lb->lb_entries[i], dev, load_guid);
8899 + size += LE_GET_LSIZE(&lb->lb_entries[i]);
8900 + psize += LE_GET_PSIZE(&lb->lb_entries[i]);
8901 + }
8902 +
8903 + /*
8904 + * Record rebuild stats:
8905 + * size In-memory size of restored buffer data in ARC
8906 + * psize Physical size of restored buffers in the L2ARC
8907 + * bufs # of ARC buffer headers restored
8908 + * log_blks # of L2ARC log entries processed during restore
8909 + */
8910 + ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
8911 + ARCSTAT_INCR(arcstat_l2_rebuild_psize, psize);
8912 + ARCSTAT_INCR(arcstat_l2_rebuild_bufs, L2ARC_LOG_BLK_ENTRIES);
8913 + ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
8914 + ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, lb_psize);
8915 + ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, psize / lb_psize);
8916 + vdev_space_update(dev->l2ad_vdev, psize, 0, 0);
8917 +}
8918 +
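The reverse loop in l2arc_log_blk_restore() exists purely to keep temporal ordering: entries are replayed newest-to-oldest and appended at the list tail, so the buflist ends up looking as if the feed thread had inserted them at the head in their original order. A tiny sketch of that traversal over a placeholder entry array:

#include <stdio.h>

#define TOY_ENTRIES     5

int
main(void)
{
        /* Pretend log entries, oldest first in the order they were logged. */
        const char *entries[TOY_ENTRIES] = { "A", "B", "C", "D", "E" };

        /*
         * Replay newest-to-oldest, appending each one at the list tail.
         * The appended run then reads newest (near the head) to oldest
         * (at the tail), the same order normal L2ARC fill produces by
         * inserting new buffers at the head.
         */
        printf("buflist head -> tail: ");
        for (int i = TOY_ENTRIES - 1; i >= 0; i--)
                printf("%s ", entries[i]);      /* E D C B A */
        printf("\n");
        return (0);
}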
8919 +/*
8920 + * Restores a single ARC buf hdr from a log block. The ARC buffer is put
8921 + * into a state indicating that it has been evicted to L2ARC.
8922 + */
8923 +static void
8924 +l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev,
8925 + uint64_t load_guid)
8926 +{
8927 + arc_buf_hdr_t *hdr, *exists;
8928 + kmutex_t *hash_lock;
8929 + arc_buf_contents_t type = LE_GET_TYPE(le);
8930 +
8931 + /*
8932 + * Do all the allocation before grabbing any locks, this lets us
8933 + * sleep if memory is full and we don't have to deal with failed
8934 + * allocations.
8935 + */
8936 + hdr = arc_buf_alloc_l2only(load_guid, type, dev, le->le_dva,
8937 + le->le_daddr, LE_GET_LSIZE(le), LE_GET_PSIZE(le),
8938 + le->le_birth, le->le_freeze_cksum, LE_GET_CHECKSUM(le),
8939 + LE_GET_COMPRESS(le), LE_GET_ARC_COMPRESS(le));
8940 +
8941 + ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(hdr));
8942 + ARCSTAT_INCR(arcstat_l2_psize, arc_hdr_size(hdr));
8943 +
8944 + mutex_enter(&dev->l2ad_mtx);
8945 + /*
8946 + * We connect the l2hdr to the hdr only after the hdr is in the hash
8947 + * table, otherwise the rest of the arc hdr manipulation machinery
8948 + * might get confused.
8949 + */
8950 + list_insert_tail(&dev->l2ad_buflist, hdr);
8951 + (void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
8952 + mutex_exit(&dev->l2ad_mtx);
8953 +
8954 + exists = buf_hash_insert(hdr, &hash_lock);
8955 + if (exists) {
8956 + /* Buffer was already cached, no need to restore it. */
8957 + arc_hdr_destroy(hdr);
8958 + mutex_exit(hash_lock);
8959 + ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
8960 + return;
8961 + }
8962 +
8963 + mutex_exit(hash_lock);
8964 +}
8965 +
8966 +/*
8967 + * Completion callback used by PL2ARC-related functions that do
8968 + * async reads/writes; it releases the abd passed in via io_private.
8969 + */
8970 +static void
8971 +pl2arc_io_done(zio_t *zio)
8972 +{
8973 + abd_put(zio->io_private);
8974 + zio->io_private = NULL;
8975 +}
8976 +
8977 +/*
8978 + * Starts an asynchronous read IO to read a log block. This is used in log
8979 + * block reconstruction to start reading the next block before we are done
8980 + * decoding and reconstructing the current block, to keep the l2arc device
8981 + * nice and hot with read IO to process.
8982 + * The returned zio will contain newly allocated memory buffers for the IO
8983 + * data, which should then be freed by the caller once the zio is no longer
8984 + * needed (i.e. once it has completed). If you wish to abort this
8985 + * zio, you should do so using l2arc_log_blk_prefetch_abort, which takes
8986 + * care of disposing of the allocated buffers correctly.
8987 + */
8988 +static zio_t *
8989 +l2arc_log_blk_prefetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
8990 + uint8_t *lb_buf)
8991 +{
8992 + uint32_t psize;
8993 + zio_t *pio;
8994 + abd_t *abd;
8995 +
8996 + psize = LBP_GET_PSIZE(lbp);
8997 + ASSERT(psize <= sizeof (l2arc_log_blk_phys_t));
8998 + pio = zio_root(vd->vdev_spa, NULL, NULL, ZIO_FLAG_DONT_CACHE |
8999 + ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
9000 + ZIO_FLAG_DONT_RETRY);
9001 + abd = abd_get_from_buf(lb_buf, psize);
9002 + (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, psize,
9003 + abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
9004 + ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
9005 + ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
9006 +
9007 + return (pio);
9008 +}
9009 +
9010 +/*
9011 + * Aborts a zio returned from l2arc_log_blk_prefetch and frees the data
9012 + * buffers allocated for it.
9013 + */
9014 +static void
9015 +l2arc_log_blk_prefetch_abort(zio_t *zio)
9016 +{
9017 + (void) zio_wait(zio);
9018 +}
9019 +
9020 +/*
9021 + * Creates a zio to update the device header on an l2arc device. The zio is
9022 + * initiated as a child of `pio'.
9023 + */
9024 +static void
9025 +l2arc_dev_hdr_update(l2arc_dev_t *dev, zio_t *pio)
9026 +{
9027 + zio_t *wzio;
9028 + abd_t *abd;
9029 + l2arc_dev_hdr_phys_t *hdr = dev->l2ad_dev_hdr;
9030 + const uint64_t hdr_asize = dev->l2ad_dev_hdr_asize;
9031 +
9032 + hdr->dh_magic = L2ARC_DEV_HDR_MAGIC_V1;
9033 + hdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
9034 + hdr->dh_alloc_space = refcount_count(&dev->l2ad_alloc);
9035 + hdr->dh_flags = 0;
9036 + if (dev->l2ad_first)
9037 + hdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;
9038 +
9039 + /* checksum operation goes last */
9040 + l2arc_dev_hdr_checksum(hdr, &hdr->dh_self_cksum);
9041 +
9042 + abd = abd_get_from_buf(hdr, hdr_asize);
9043 + wzio = zio_write_phys(pio, dev->l2ad_vdev, VDEV_LABEL_START_SIZE,
9044 + hdr_asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
9045 + ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
9046 + DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
9047 + (void) zio_nowait(wzio);
9048 +}
9049 +
9050 +/*
9051 + * Commits a log block to the L2ARC device. This routine is invoked from
9052 + * l2arc_write_buffers when the log block fills up.
9053 + * This function allocates some memory to temporarily hold the serialized
9054 + * buffer to be written. This is then released in l2arc_write_done.
9055 + */
9056 +static void
9057 +l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
9058 + l2arc_write_callback_t *cb)
9059 +{
9060 + l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
9061 + uint64_t psize, asize;
9062 + l2arc_log_blk_buf_t *lb_buf;
9063 + abd_t *abd;
9064 + zio_t *wzio;
9065 +
9066 + VERIFY(dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
9067 +
9068 + /* link the buffer into the block chain */
9069 + lb->lb_back2_lbp = dev->l2ad_dev_hdr->dh_start_lbps[1];
9070 + lb->lb_magic = L2ARC_LOG_BLK_MAGIC;
9071 +
9072 + /* try to compress the buffer */
9073 + lb_buf = kmem_zalloc(sizeof (*lb_buf), KM_SLEEP);
9074 + list_insert_tail(&cb->l2wcb_log_blk_buflist, lb_buf);
9075 + abd = abd_get_from_buf(lb, sizeof (*lb));
9076 + psize = zio_compress_data(ZIO_COMPRESS_LZ4, abd, lb_buf->lbb_log_blk,
9077 + sizeof (*lb));
9078 + abd_put(abd);
9079 + /* a log block is never entirely zero */
9080 + ASSERT(psize != 0);
9081 + asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
9082 + ASSERT(asize <= sizeof (lb_buf->lbb_log_blk));
9083 +
9084 + /*
9085 + * Update the start log blk pointer in the device header to point
9086 + * to the log block we're about to write.
9087 + */
9088 + dev->l2ad_dev_hdr->dh_start_lbps[1] =
9089 + dev->l2ad_dev_hdr->dh_start_lbps[0];
9090 + dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
9091 + _NOTE(CONSTCOND)
9092 + LBP_SET_LSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], sizeof (*lb));
9093 + LBP_SET_PSIZE(&dev->l2ad_dev_hdr->dh_start_lbps[0], asize);
9094 + LBP_SET_CHECKSUM(&dev->l2ad_dev_hdr->dh_start_lbps[0],
9095 + ZIO_CHECKSUM_FLETCHER_4);
9096 + LBP_SET_TYPE(&dev->l2ad_dev_hdr->dh_start_lbps[0], 0);
9097 +
9098 + if (asize < sizeof (*lb)) {
9099 + /* compression succeeded */
9100 + bzero(lb_buf->lbb_log_blk + psize, asize - psize);
9101 + LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
9102 + ZIO_COMPRESS_LZ4);
9103 + } else {
9104 + /* compression failed */
9105 + bcopy(lb, lb_buf->lbb_log_blk, sizeof (*lb));
9106 + LBP_SET_COMPRESS(&dev->l2ad_dev_hdr->dh_start_lbps[0],
9107 + ZIO_COMPRESS_OFF);
9108 + }
9109 +
9110 + /* checksum what we're about to write */
9111 + fletcher_4_native(lb_buf->lbb_log_blk, asize,
9112 + NULL, &dev->l2ad_dev_hdr->dh_start_lbps[0].lbp_cksum);
9113 +
9114 + /* perform the write itself */
9115 + CTASSERT(L2ARC_LOG_BLK_SIZE >= SPA_MINBLOCKSIZE &&
9116 + L2ARC_LOG_BLK_SIZE <= SPA_MAXBLOCKSIZE);
9117 + abd = abd_get_from_buf(lb_buf->lbb_log_blk, asize);
9118 + wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
9119 + asize, abd, ZIO_CHECKSUM_OFF, pl2arc_io_done, abd,
9120 + ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
9121 + DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
9122 + (void) zio_nowait(wzio);
9123 +
9124 + dev->l2ad_hand += asize;
9125 + vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
9126 +
9127 + /* bump the kstats */
9128 + ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
9129 + ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
9130 + ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_size, asize);
9131 + ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
9132 + dev->l2ad_log_blk_payload_asize / asize);
9133 +
9134 + /* start a new log block */
9135 + dev->l2ad_log_ent_idx = 0;
9136 + dev->l2ad_log_blk_payload_asize = 0;
9137 +}
9138 +
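The commit path above compresses the in-core log block into a staging buffer and only keeps the compressed form when its aligned size is still smaller than the raw block, zeroing the slack up to asize; otherwise it falls back to a raw copy. A sketch of just that decision; toy_compress() is not a real compressor (the kernel path uses zio_compress_data() with LZ4) and the 4K alignment is an assumed ashift:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RAW_SIZE        (128 * 1024)    /* pretend log block size */

/* Stand-in: pretend the compressor shrank the block to 40000 bytes. */
static size_t
toy_compress(const uint8_t *src, uint8_t *dst, size_t len)
{
        size_t psize = 40000;

        if (psize > len)
                psize = len;
        memcpy(dst, src, psize);        /* not a real compressor */
        return (psize);
}

int
main(void)
{
        static uint8_t raw[RAW_SIZE], staged[RAW_SIZE];
        size_t psize = toy_compress(raw, staged, sizeof (raw));
        size_t asize = (psize + 4095) & ~(size_t)4095;  /* assume 4K ashift */

        if (asize < sizeof (raw)) {
                /* Compression won: zero the slack up to the aligned size. */
                memset(staged + psize, 0, asize - psize);
                printf("write %zu compressed bytes\n", asize);
        } else {
                /* Compression lost: write the raw block instead. */
                memcpy(staged, raw, sizeof (raw));
                printf("write %zu raw bytes\n", sizeof (raw));
        }
        return (0);
}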
9139 +/*
9140 + * Validates an L2ARC log blk address to make sure that it can be read
9141 + * from the provided L2ARC device. Returns B_TRUE if the address is
9142 + * within the device's bounds, or B_FALSE if not.
9143 + */
9144 +static boolean_t
9145 +l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
9146 +{
9147 + uint64_t psize = LBP_GET_PSIZE(lbp);
9148 + uint64_t end = lbp->lbp_daddr + psize;
9149 +
9150 + /*
9151 + * A log block is valid if all of the following conditions are true:
9152 + * - it fits entirely between l2ad_start and l2ad_end
9153 + * - it has a valid size
9154 + */
9155 + return (lbp->lbp_daddr >= dev->l2ad_start && end <= dev->l2ad_end &&
9156 + psize > 0 && psize <= sizeof (l2arc_log_blk_phys_t));
9157 +}
9158 +
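A log block pointer is only trusted when the whole block fits inside the device's usable window and its physical size is sane. A couple of concrete probes of that rule; the offsets are arbitrary and TOY_MAX_PSIZE stands in for sizeof (l2arc_log_blk_phys_t):

#include <stdint.h>
#include <stdio.h>

#define TOY_MAX_PSIZE   (128 * 1024)    /* stand-in for the log blk size */

/* Same shape as l2arc_log_blkptr_valid(): fits in [start, end], sane size. */
static int
blkptr_valid(uint64_t start, uint64_t end, uint64_t daddr, uint64_t psize)
{
        return (daddr >= start && daddr + psize <= end &&
            psize > 0 && psize <= TOY_MAX_PSIZE);
}

int
main(void)
{
        uint64_t start = 4ULL << 20, end = 100ULL << 30;

        printf("%d\n", blkptr_valid(start, end, 8ULL << 20, 64 * 1024));  /* 1 */
        printf("%d\n", blkptr_valid(start, end, end - 4096, 64 * 1024));  /* 0 */
        printf("%d\n", blkptr_valid(start, end, 8ULL << 20, 0));          /* 0 */
        return (0);
}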
9159 +/*
9160 + * Computes the checksum of `hdr' and stores it in `cksum'.
9161 + */
9162 +static void
9163 +l2arc_dev_hdr_checksum(const l2arc_dev_hdr_phys_t *hdr, zio_cksum_t *cksum)
9164 +{
9165 + fletcher_4_native((uint8_t *)hdr +
9166 + offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
9167 + sizeof (*hdr) - offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid),
9168 + NULL, cksum);
9169 +}
9170 +
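The device-header checksum above starts at offsetof(l2arc_dev_hdr_phys_t, dh_spa_guid) precisely so the stored dh_self_cksum never covers itself; otherwise writing the checksum back would invalidate it. A toy version of that pattern with a made-up struct layout and a trivial additive checksum in place of fletcher_4_native():

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct toy_hdr {
        uint64_t        th_magic;
        uint64_t        th_self_cksum;  /* must not checksum itself */
        uint64_t        th_spa_guid;    /* checksummed region starts here */
        uint64_t        th_flags;
} toy_hdr_t;

/* Trivial stand-in checksum over [th_spa_guid, end of struct). */
static uint64_t
toy_cksum(const toy_hdr_t *hdr)
{
        const uint8_t *p = (const uint8_t *)hdr +
            offsetof(toy_hdr_t, th_spa_guid);
        size_t len = sizeof (*hdr) - offsetof(toy_hdr_t, th_spa_guid);
        uint64_t sum = 0;

        for (size_t i = 0; i < len; i++)
                sum += p[i];
        return (sum);
}

int
main(void)
{
        toy_hdr_t hdr = { .th_magic = 1, .th_spa_guid = 77, .th_flags = 3 };

        hdr.th_self_cksum = toy_cksum(&hdr);    /* compute, then store */
        printf("stored checksum still valid: %d\n",
            hdr.th_self_cksum == toy_cksum(&hdr));
        return (0);
}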
9171 +/*
9172 + * Inserts ARC buffer `ab' into the current L2ARC log blk on the device.
9173 + * The buffer being inserted must be present in L2ARC.
9174 + * Returns B_TRUE if the L2ARC log blk is full and needs to be committed
9175 + * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
9176 + */
9177 +static boolean_t
9178 +l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *ab)
9179 +{
9180 + l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
9181 + l2arc_log_ent_phys_t *le;
9182 + int index = dev->l2ad_log_ent_idx++;
9183 +
9184 + ASSERT(index < L2ARC_LOG_BLK_ENTRIES);
9185 +
9186 + le = &lb->lb_entries[index];
9187 + bzero(le, sizeof (*le));
9188 + le->le_dva = ab->b_dva;
9189 + le->le_birth = ab->b_birth;
9190 + le->le_daddr = ab->b_l2hdr.b_daddr;
9191 + LE_SET_LSIZE(le, HDR_GET_LSIZE(ab));
9192 + LE_SET_PSIZE(le, HDR_GET_PSIZE(ab));
9193 +
9194 + if ((ab->b_flags & ARC_FLAG_COMPRESSED_ARC) != 0) {
9195 + LE_SET_ARC_COMPRESS(le, 1);
9196 + LE_SET_COMPRESS(le, HDR_GET_COMPRESS(ab));
9197 + } else {
9198 + ASSERT3U(HDR_GET_COMPRESS(ab), ==, ZIO_COMPRESS_OFF);
9199 + LE_SET_ARC_COMPRESS(le, 0);
9200 + LE_SET_COMPRESS(le, ZIO_COMPRESS_OFF);
9201 + }
9202 +
9203 + if (ab->b_freeze_cksum != NULL) {
9204 + le->le_freeze_cksum = *ab->b_freeze_cksum;
9205 + LE_SET_CHECKSUM(le, ZIO_CHECKSUM_FLETCHER_2);
9206 + } else {
9207 + LE_SET_CHECKSUM(le, ZIO_CHECKSUM_OFF);
9208 + }
9209 +
9210 + LE_SET_TYPE(le, arc_flags_to_bufc(ab->b_flags));
9211 + dev->l2ad_log_blk_payload_asize += arc_hdr_size((arc_buf_hdr_t *)ab);
9212 +
9213 + return (dev->l2ad_log_ent_idx == L2ARC_LOG_BLK_ENTRIES);
9214 +}
9215 +
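l2arc_log_blk_insert() and l2arc_log_blk_commit() form a simple fill-and-flush pair: every buffer written to L2ARC appends one entry to the in-core log block, and when the entry index reaches capacity the caller commits the block and the index resets. A compact sketch of that handshake; TOY_ENTRIES is a placeholder for L2ARC_LOG_BLK_ENTRIES:

#include <stdint.h>
#include <stdio.h>

#define TOY_ENTRIES     4               /* placeholder capacity */

typedef struct toy_log {
        uint64_t        entries[TOY_ENTRIES];
        int             next_idx;
} toy_log_t;

/* Returns 1 when the block is full and must be committed. */
static int
log_insert(toy_log_t *lb, uint64_t daddr)
{
        lb->entries[lb->next_idx++] = daddr;
        return (lb->next_idx == TOY_ENTRIES);
}

static void
log_commit(toy_log_t *lb)
{
        printf("committing %d entries\n", lb->next_idx);
        lb->next_idx = 0;               /* start a new log block */
}

int
main(void)
{
        toy_log_t lb = { .next_idx = 0 };

        for (uint64_t daddr = 0; daddr < 10; daddr++) {
                if (log_insert(&lb, daddr))
                        log_commit(&lb);
        }
        /* Entries 8 and 9 stay pending in the partially filled block. */
        return (0);
}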
9216 +/*
9217 + * Checks whether a given L2ARC device address sits in a time-sequential
9218 + * range. The trick here is that the L2ARC is a rotary buffer, so we can't
9219 + * just do a range comparison, we need to handle the situation in which the
9220 + * range wraps around the end of the L2ARC device. Arguments:
9221 + * bottom Lower end of the range to check (written to earlier).
9222 + * top Upper end of the range to check (written to later).
9223 + * check The address for which we want to determine if it sits in
9224 + * between the top and bottom.
9225 + *
9226 + * The 3-way conditional below represents the following cases:
9227 + *
9228 + * bottom < top : Sequentially ordered case:
9229 + * <check>--------+-------------------+
9230 + * | (overlap here?) |
9231 + * L2ARC dev V V
9232 + * |---------------<bottom>============<top>--------------|
9233 + *
9234 + * bottom > top: Looped-around case:
9235 + * <check>--------+------------------+
9236 + * | (overlap here?) |
9237 + * L2ARC dev V V
9238 + * |===============<top>---------------<bottom>===========|
9239 + * ^ ^
9240 + * | (or here?) |
9241 + * +---------------+---------<check>
9242 + *
9243 + * top == bottom : Just a single address comparison.
9244 + */
9245 +static inline boolean_t
9246 +l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
9247 +{
9248 + if (bottom < top)
9249 + return (bottom <= check && check <= top);
9250 + else if (bottom > top)
9251 + return (check <= top || bottom <= check);
9252 + else
9253 + return (check == top);
7375 9254 }
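Because the L2ARC device is filled as a ring, the overlap test above has to accept ranges that wrap past the end of the device. A few concrete probes against the same logic, with arbitrary device offsets:

#include <stdint.h>
#include <stdio.h>

/* Same logic as l2arc_range_check_overlap() above. */
static int
range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
{
        if (bottom < top)
                return (bottom <= check && check <= top);
        else if (bottom > top)
                return (check <= top || bottom <= check);
        else
                return (check == top);
}

int
main(void)
{
        /* Sequential case: [100, 200] contains 150 but not 250. */
        printf("%d %d\n",
            range_check_overlap(100, 200, 150),         /* 1 */
            range_check_overlap(100, 200, 250));        /* 0 */

        /* Wrapped case: the range runs 900 -> end of device -> 50. */
        printf("%d %d %d\n",
            range_check_overlap(900, 50, 950),          /* 1 */
            range_check_overlap(900, 50, 20),           /* 1 */
            range_check_overlap(900, 50, 400));         /* 0 */
        return (0);
}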