NEX-9752 backport illumos 6950 ARC should cache compressed data
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
6950 ARC should cache compressed data
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
NEX-7603 Back port OpenZFS #188 Create tunable to ignore hole_birth feature
Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
NEX-4582 update WRC test cases to allow use of the write-back cache per tree of datasets
Reviewed by: Steve Peng <steve.peng@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
NEX-4207 WRC and dedup on the same pool cause system-panic
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-4193 WRC does not migrate data that belong to intermediate snapshots
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Alex Aizman <alex.aizman@nexenta.com>
NEX-3710 WRC improvements and bug-fixes
 * refactored WRC move logic to use zio kmem_caches
 * replaced the size and compression fields with a blk_prop field
   (the same as in blkptr_t) to slightly reduce the size of wrc_block_t,
   and used macros similar to those for blkptr_t to get PSIZE, LSIZE
   and COMPRESSION
 * reduced CPU overhead by reducing atomic calls
 * removed unused code
 * fixed naming of variables
 * fixed a possible system panic after restarting the system
   with WRC enabled
 * fixed a race that caused a system panic
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
NEX-3558 KRRP Integration
4459 Typo in itadm(1m) usage message: delete-inititator
Reviewed by: Milan Jurik <milan.jurik@xylab.cz>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>
4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
4391 panic system rather than corrupting pool if we hit bug 4390
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>
re #12619 rb4429 More dp->dp_config_rwlock holds
re #12585 rb4049 ZFS++ work port - refactoring to improve separation of open/closed code, bug fixes, performance improvements - open code
Bug 11205: add missing libzfs_closed_stubs.c to fix opensource-only build.
ZFS plus work: special vdevs, cos, cos/vdev properties


   3  *
   4  * The contents of this file are subject to the terms of the
   5  * Common Development and Distribution License (the "License").
   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 /*
  22  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.

  23  * Copyright (c) 2012, 2016 by Delphix. All rights reserved.
  24  */
  25 
  26 #include <sys/zfs_context.h>
  27 #include <sys/dmu_objset.h>
  28 #include <sys/dmu_traverse.h>
  29 #include <sys/dsl_dataset.h>
  30 #include <sys/dsl_dir.h>
  31 #include <sys/dsl_pool.h>
  32 #include <sys/dnode.h>
  33 #include <sys/spa.h>
  34 #include <sys/zio.h>
  35 #include <sys/dmu_impl.h>
  36 #include <sys/sa.h>
  37 #include <sys/sa_impl.h>
  38 #include <sys/callb.h>
  39 #include <sys/zfeature.h>
  40 
  41 int32_t zfs_pd_bytes_max = 50 * 1024 * 1024;    /* 50MB */
  42 boolean_t send_holes_without_birth_time = B_TRUE;
  43 
  44 typedef struct prefetch_data {
  45         kmutex_t pd_mtx;
  46         kcondvar_t pd_cv;
  47         int32_t pd_bytes_fetched;
  48         int pd_flags;
  49         boolean_t pd_cancel;
  50         boolean_t pd_exited;
  51         zbookmark_phys_t pd_resume;
  52 } prefetch_data_t;
  53 
  54 typedef struct traverse_data {
  55         spa_t *td_spa;
  56         uint64_t td_objset;
  57         blkptr_t *td_rootbp;
  58         uint64_t td_min_txg;

  59         zbookmark_phys_t *td_resume;
  60         int td_flags;
  61         prefetch_data_t *td_pfd;
  62         boolean_t td_paused;
  63         uint64_t td_hole_birth_enabled_txg;
  64         blkptr_cb_t *td_func;
  65         void *td_arg;
  66         boolean_t td_realloc_possible;
  67 } traverse_data_t;
  68 
  69 static int traverse_dnode(traverse_data_t *td, const dnode_phys_t *dnp,
  70     uint64_t objset, uint64_t object);
  71 static void prefetch_dnode_metadata(traverse_data_t *td, const dnode_phys_t *,
  72     uint64_t objset, uint64_t object);
  73 
  74 static int
  75 traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
  76 {
  77         traverse_data_t *td = arg;
  78         zbookmark_phys_t zb;


 174                 }
 175         }
 176         return (RESUME_SKIP_NONE);
 177 }
 178 
 179 static void
 180 traverse_prefetch_metadata(traverse_data_t *td,
 181     const blkptr_t *bp, const zbookmark_phys_t *zb)
 182 {
 183         arc_flags_t flags = ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH;
 184 
 185         if (!(td->td_flags & TRAVERSE_PREFETCH_METADATA))
 186                 return;
 187         /*
 188          * If we are in the process of resuming, don't prefetch, because
 189          * some children will not be needed (and in fact may have already
 190          * been freed).
 191          */
 192         if (td->td_resume != NULL && !ZB_IS_ZERO(td->td_resume))
 193                 return;
 194         if (BP_IS_HOLE(bp) || bp->blk_birth <= td->td_min_txg)

 195                 return;
 196         if (BP_GET_LEVEL(bp) == 0 && BP_GET_TYPE(bp) != DMU_OT_DNODE)
 197                 return;
 198 
 199         (void) arc_read(NULL, td->td_spa, bp, NULL, NULL,
 200             ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
 201 }
 202 
 203 static boolean_t
 204 prefetch_needed(prefetch_data_t *pfd, const blkptr_t *bp)
 205 {
 206         ASSERT(pfd->pd_flags & TRAVERSE_PREFETCH_DATA);
 207         if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp) ||
 208             BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG)
 209                 return (B_FALSE);
 210         return (B_TRUE);
 211 }
 212 
 213 static int
 214 traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,


 240                  *
 241                  * If a file is written sparsely, then the unwritten parts of
 242                  * the file were "always holes" -- that is, they have been
 243                  * holes since this object was allocated.  However, we (and
 244                  * our callers) can not necessarily tell when an object was
 245                  * allocated.  Therefore, if it's possible that this object
 246                  * was freed and then its object number reused, we need to
 247                  * visit all the holes with birth==0.
 248                  *
 249                  * If it isn't possible that the object number was reused,
 250                  * then if SPA_FEATURE_HOLE_BIRTH was enabled before we wrote
 251                  * all the blocks we will visit as part of this traversal,
 252                  * then this hole must have always existed, so we can skip
 253                  * it.  We visit blocks born after (exclusive) td_min_txg.
 254                  *
 255                  * Note that the meta-dnode cannot be reallocated.
 256                  */
 257                 if (!send_holes_without_birth_time &&
 258                     (!td->td_realloc_possible ||
 259                     zb->zb_object == DMU_META_DNODE_OBJECT) &&
 260                     td->td_hole_birth_enabled_txg <= td->td_min_txg)

 261                         return (0);
 262         } else if (bp->blk_birth <= td->td_min_txg) {

 263                 return (0);
 264         }
 265 
 266         if (pd != NULL && !pd->pd_exited && prefetch_needed(pd, bp)) {
 267                 uint64_t size = BP_GET_LSIZE(bp);
 268                 mutex_enter(&pd->pd_mtx);
 269                 ASSERT(pd->pd_bytes_fetched >= 0);
 270                 while (pd->pd_bytes_fetched < size && !pd->pd_exited)
 271                         cv_wait(&pd->pd_cv, &pd->pd_mtx);
 272                 pd->pd_bytes_fetched -= size;
 273                 cv_broadcast(&pd->pd_cv);
 274                 mutex_exit(&pd->pd_mtx);
 275         }
 276 
 277         if (BP_IS_HOLE(bp)) {
 278                 err = td->td_func(td->td_spa, NULL, bp, zb, dnp, td->td_arg);
 279                 if (err != 0)
 280                         goto post;
 281                 return (0);
 282         }
 283 
 284         if (td->td_flags & TRAVERSE_PRE) {
 285                 err = td->td_func(td->td_spa, NULL, bp, zb, dnp,
 286                     td->td_arg);
 287                 if (err == TRAVERSE_VISIT_NO_CHILDREN)
 288                         return (0);



 289                 if (err != 0)
 290                         goto post;
 291         }
 292 
 293         if (BP_GET_LEVEL(bp) > 0) {
 294                 arc_flags_t flags = ARC_FLAG_WAIT;
 295                 int i;
 296                 blkptr_t *cbp;
 297                 int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
 298 
 299                 err = arc_read(NULL, td->td_spa, bp, arc_getbuf_func, &buf,
 300                     ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
 301                 if (err != 0)
 302                         goto post;
 303                 cbp = buf->b_data;
 304 
 305                 for (i = 0; i < epb; i++) {
 306                         SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
 307                             zb->zb_level - 1,
 308                             zb->zb_blkid * epb + i);


 403                 td->td_resume->zb_object = zb->zb_object;
 404                 td->td_resume->zb_level = 0;
 405                 /*
 406                  * If we have stopped on an indirect block (e.g. due to
 407                  * i/o error), we have not visited anything below it.
 408                  * Set the bookmark to the first level-0 block that we need
 409                  * to visit.  This way, the resuming code does not need to
 410                  * deal with resuming from indirect blocks.
 411                  *
 412                  * Note, if zb_level <= 0, dnp may be NULL, so we don't want
 413                  * to dereference it.
 414                  */
 415                 td->td_resume->zb_blkid = zb->zb_blkid;
 416                 if (zb->zb_level > 0) {
 417                         td->td_resume->zb_blkid <<= zb->zb_level *
 418                             (dnp->dn_indblkshift - SPA_BLKPTRSHIFT);
 419                 }
 420                 td->td_paused = B_TRUE;
 421         }
 422 






 423         return (err);
 424 }
 425 
 426 static void
 427 prefetch_dnode_metadata(traverse_data_t *td, const dnode_phys_t *dnp,
 428     uint64_t objset, uint64_t object)
 429 {
 430         int j;
 431         zbookmark_phys_t czb;
 432 
 433         for (j = 0; j < dnp->dn_nblkptr; j++) {
 434                 SET_BOOKMARK(&czb, objset, object, dnp->dn_nlevels - 1, j);
 435                 traverse_prefetch_metadata(td, &dnp->dn_blkptr[j], &czb);
 436         }
 437 
 438         if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
 439                 SET_BOOKMARK(&czb, objset, object, 0, DMU_SPILL_BLKID);
 440                 traverse_prefetch_metadata(td, &dnp->dn_spill, &czb);
 441         }
 442 }


 529         td.td_arg = td_main->td_pfd;
 530         td.td_pfd = NULL;
 531         td.td_resume = &td_main->td_pfd->pd_resume;
 532 
 533         SET_BOOKMARK(&czb, td.td_objset,
 534             ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
 535         (void) traverse_visitbp(&td, NULL, td.td_rootbp, &czb);
 536 
 537         mutex_enter(&td_main->td_pfd->pd_mtx);
 538         td_main->td_pfd->pd_exited = B_TRUE;
 539         cv_broadcast(&td_main->td_pfd->pd_cv);
 540         mutex_exit(&td_main->td_pfd->pd_mtx);
 541 }
 542 
 543 /*
 544  * NB: dataset must not be changing on-disk (eg, is a snapshot or we are
 545  * in syncing context).
 546  */
 547 static int
 548 traverse_impl(spa_t *spa, dsl_dataset_t *ds, uint64_t objset, blkptr_t *rootbp,
 549     uint64_t txg_start, zbookmark_phys_t *resume, int flags,
 550     blkptr_cb_t func, void *arg)
 551 {
 552         traverse_data_t td;
 553         prefetch_data_t pd = { 0 };
 554         zbookmark_phys_t czb;
 555         int err;
 556 
 557         ASSERT(ds == NULL || objset == ds->ds_object);
 558         ASSERT(!(flags & TRAVERSE_PRE) || !(flags & TRAVERSE_POST));
 559 
 560         td.td_spa = spa;
 561         td.td_objset = objset;
 562         td.td_rootbp = rootbp;
 563         td.td_min_txg = txg_start;

 564         td.td_resume = resume;
 565         td.td_func = func;
 566         td.td_arg = arg;
 567         td.td_pfd = &pd;
 568         td.td_flags = flags;
 569         td.td_paused = B_FALSE;
 570         td.td_realloc_possible = (txg_start == 0 ? B_FALSE : B_TRUE);
 571 
 572         if (spa_feature_is_active(spa, SPA_FEATURE_HOLE_BIRTH)) {
 573                 VERIFY(spa_feature_enabled_txg(spa,
 574                     SPA_FEATURE_HOLE_BIRTH, &td.td_hole_birth_enabled_txg));
 575         } else {
 576                 td.td_hole_birth_enabled_txg = UINT64_MAX;
 577         }
 578 
 579         pd.pd_flags = flags;
 580         if (resume != NULL)
 581                 pd.pd_resume = *resume;
 582         mutex_init(&pd.pd_mtx, NULL, MUTEX_DEFAULT, NULL);
 583         cv_init(&pd.pd_cv, NULL, CV_DEFAULT, NULL);


 614         while (!pd.pd_exited)
 615                 cv_wait(&pd.pd_cv, &pd.pd_mtx);
 616         mutex_exit(&pd.pd_mtx);
 617 
 618         mutex_destroy(&pd.pd_mtx);
 619         cv_destroy(&pd.pd_cv);
 620 
 621         return (err);
 622 }
 623 
 624 /*
 625  * NB: dataset must not be changing on-disk (eg, is a snapshot or we are
 626  * in syncing context).
 627  */
 628 int
 629 traverse_dataset_resume(dsl_dataset_t *ds, uint64_t txg_start,
 630     zbookmark_phys_t *resume,
 631     int flags, blkptr_cb_t func, void *arg)
 632 {
 633         return (traverse_impl(ds->ds_dir->dd_pool->dp_spa, ds, ds->ds_object,
 634             &dsl_dataset_phys(ds)->ds_bp, txg_start, resume, flags, func, arg));

 635 }
 636 
 637 int
 638 traverse_dataset(dsl_dataset_t *ds, uint64_t txg_start,
 639     int flags, blkptr_cb_t func, void *arg)
 640 {
 641         return (traverse_dataset_resume(ds, txg_start, NULL, flags, func, arg));
 642 }
 643 
 644 int
 645 traverse_dataset_destroyed(spa_t *spa, blkptr_t *blkptr,
 646     uint64_t txg_start, zbookmark_phys_t *resume, int flags,
 647     blkptr_cb_t func, void *arg)
 648 {
 649         return (traverse_impl(spa, NULL, ZB_DESTROYED_OBJSET,
 650             blkptr, txg_start, resume, flags, func, arg));
 651 }
 652 
 653 /*
 654  * NB: pool must not be changing on-disk (eg, from zdb or sync context).
 655  */
 656 int
 657 traverse_pool(spa_t *spa, uint64_t txg_start, int flags,
 658     blkptr_cb_t func, void *arg)
 659 {
 660         int err;
 661         dsl_pool_t *dp = spa_get_dsl(spa);
 662         objset_t *mos = dp->dp_meta_objset;
 663         boolean_t hard = (flags & TRAVERSE_HARD);
 664 
 665         /* visit the MOS */

 666         err = traverse_impl(spa, NULL, 0, spa_get_rootblkptr(spa),
 667             txg_start, NULL, flags, func, arg);
 668         if (err != 0)
 669                 return (err);

 670 
 671         /* visit each dataset */
 672         for (uint64_t obj = 1; err == 0;

 673             err = dmu_object_next(mos, &obj, B_FALSE, txg_start)) {
 674                 dmu_object_info_t doi;
 675 
 676                 err = dmu_object_info(mos, obj, &doi);
 677                 if (err != 0) {
 678                         if (hard)
 679                                 continue;
 680                         break;
 681                 }
 682 
 683                 if (doi.doi_bonus_type == DMU_OT_DSL_DATASET) {
 684                         dsl_dataset_t *ds;


 685                         uint64_t txg = txg_start;


 686 
 687                         dsl_pool_config_enter(dp, FTAG);
 688                         err = dsl_dataset_hold_obj(dp, obj, FTAG, &ds);
 689                         dsl_pool_config_exit(dp, FTAG);
 690                         if (err != 0) {
 691                                 if (hard)
 692                                         continue;
 693                                 break;
 694                         }
 695                         if (dsl_dataset_phys(ds)->ds_prev_snap_txg > txg)
 696                                 txg = dsl_dataset_phys(ds)->ds_prev_snap_txg;
 697                         err = traverse_dataset(ds, txg, flags, func, arg);




 698                         dsl_dataset_rele(ds, FTAG);
 699                         if (err != 0)





 700                                 break;
 701                 }
 702         }
 703         if (err == ESRCH)



 704                 err = 0;
 705         return (err);

 706 }


   3  *
   4  * The contents of this file are subject to the terms of the
   5  * Common Development and Distribution License (the "License").
   6  * You may not use this file except in compliance with the License.
   7  *
   8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9  * or http://www.opensolaris.org/os/licensing.
  10  * See the License for the specific language governing permissions
  11  * and limitations under the License.
  12  *
  13  * When distributing Covered Code, include this CDDL HEADER in each
  14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15  * If applicable, add the following below this CDDL HEADER, with the
  16  * fields enclosed by brackets "[]" replaced with your own identifying
  17  * information: Portions Copyright [yyyy] [name of copyright owner]
  18  *
  19  * CDDL HEADER END
  20  */
  21 /*
  22  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  23  * Copyright 2015 Nexenta Systems, Inc. All rights reserved.
  24  * Copyright (c) 2012, 2016 by Delphix. All rights reserved.
  25  */
  26 
  27 #include <sys/zfs_context.h>
  28 #include <sys/dmu_objset.h>
  29 #include <sys/dmu_traverse.h>
  30 #include <sys/dsl_dataset.h>
  31 #include <sys/dsl_dir.h>
  32 #include <sys/dsl_pool.h>
  33 #include <sys/dnode.h>
  34 #include <sys/spa.h>
  35 #include <sys/zio.h>
  36 #include <sys/dmu_impl.h>
  37 #include <sys/sa.h>
  38 #include <sys/sa_impl.h>
  39 #include <sys/callb.h>
  40 #include <sys/zfeature.h>
  41 
  42 int32_t zfs_pd_bytes_max = 50 * 1024 * 1024;    /* 50MB */
  43 boolean_t send_holes_without_birth_time = B_TRUE;
  44 
  45 typedef struct prefetch_data {
  46         kmutex_t pd_mtx;
  47         kcondvar_t pd_cv;
  48         int32_t pd_bytes_fetched;
  49         int pd_flags;
  50         boolean_t pd_cancel;
  51         boolean_t pd_exited;
  52         zbookmark_phys_t pd_resume;
  53 } prefetch_data_t;
  54 
  55 typedef struct traverse_data {
  56         spa_t *td_spa;
  57         uint64_t td_objset;
  58         blkptr_t *td_rootbp;
  59         uint64_t td_min_txg;
  60         uint64_t td_max_txg;
  61         zbookmark_phys_t *td_resume;
  62         int td_flags;
  63         prefetch_data_t *td_pfd;
  64         boolean_t td_paused;
  65         uint64_t td_hole_birth_enabled_txg;
  66         blkptr_cb_t *td_func;
  67         void *td_arg;
  68         boolean_t td_realloc_possible;
  69 } traverse_data_t;
  70 
  71 static int traverse_dnode(traverse_data_t *td, const dnode_phys_t *dnp,
  72     uint64_t objset, uint64_t object);
  73 static void prefetch_dnode_metadata(traverse_data_t *td, const dnode_phys_t *,
  74     uint64_t objset, uint64_t object);
  75 
  76 static int
  77 traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
  78 {
  79         traverse_data_t *td = arg;
  80         zbookmark_phys_t zb;


 176                 }
 177         }
 178         return (RESUME_SKIP_NONE);
 179 }
 180 
 181 static void
 182 traverse_prefetch_metadata(traverse_data_t *td,
 183     const blkptr_t *bp, const zbookmark_phys_t *zb)
 184 {
 185         arc_flags_t flags = ARC_FLAG_NOWAIT | ARC_FLAG_PREFETCH;
 186 
 187         if (!(td->td_flags & TRAVERSE_PREFETCH_METADATA))
 188                 return;
 189         /*
 190          * If we are in the process of resuming, don't prefetch, because
 191          * some children will not be needed (and in fact may have already
 192          * been freed).
 193          */
 194         if (td->td_resume != NULL && !ZB_IS_ZERO(td->td_resume))
 195                 return;
 196         if (BP_IS_HOLE(bp) || bp->blk_birth <= td->td_min_txg ||
 197             bp->blk_birth >= td->td_max_txg)
 198                 return;
 199         if (BP_GET_LEVEL(bp) == 0 && BP_GET_TYPE(bp) != DMU_OT_DNODE)
 200                 return;
 201 
 202         (void) arc_read(NULL, td->td_spa, bp, NULL, NULL,
 203             ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
 204 }
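
The new version bounds the traversal on both sides: a block pointer is skipped when blk_birth <= td_min_txg or blk_birth >= td_max_txg, so both limits are exclusive. A minimal standalone sketch of that predicate follows. The names birth_in_window and birth_window_examples are hypothetical and exist only for illustration; they are not part of dmu_traverse.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the blk_birth test; not a ZFS function.
 * A block is visited only if min_txg < birth < max_txg.
 */
int
birth_in_window(uint64_t birth, uint64_t min_txg, uint64_t max_txg)
{
	return (birth > min_txg && birth < max_txg);
}

/* Usage: both bounds are exclusive. */
void
birth_window_examples(void)
{
	assert(!birth_in_window(5, 5, 10));	/* born at min_txg: skipped */
	assert(birth_in_window(6, 5, 10));	/* strictly inside: visited */
	assert(!birth_in_window(10, 5, 10));	/* born at max_txg: skipped */
	/* Callers that pass UINT64_MAX effectively have no upper bound. */
	assert(birth_in_window(1000, 0, UINT64_MAX));
}
```

Passing UINT64_MAX as the upper limit, as traverse_dataset_resume() and traverse_dataset_destroyed() do in this change, appears intended to keep their previous behaviour of filtering on td_min_txg only.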
 205 
 206 static boolean_t
 207 prefetch_needed(prefetch_data_t *pfd, const blkptr_t *bp)
 208 {
 209         ASSERT(pfd->pd_flags & TRAVERSE_PREFETCH_DATA);
 210         if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp) ||
 211             BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG)
 212                 return (B_FALSE);
 213         return (B_TRUE);
 214 }
 215 
 216 static int
 217 traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,


 243                  *
 244                  * If a file is written sparsely, then the unwritten parts of
 245                  * the file were "always holes" -- that is, they have been
 246                  * holes since this object was allocated.  However, we (and
 247                  * our callers) can not necessarily tell when an object was
 248                  * allocated.  Therefore, if it's possible that this object
 249                  * was freed and then its object number reused, we need to
 250                  * visit all the holes with birth==0.
 251                  *
 252                  * If it isn't possible that the object number was reused,
 253                  * then if SPA_FEATURE_HOLE_BIRTH was enabled before we wrote
 254                  * all the blocks we will visit as part of this traversal,
 255                  * then this hole must have always existed, so we can skip
 256                  * it.  We visit blocks born after (exclusive) td_min_txg.
 257                  *
 258                  * Note that the meta-dnode cannot be reallocated.
 259                  */
 260                 if (!send_holes_without_birth_time &&
 261                     (!td->td_realloc_possible ||
 262                     zb->zb_object == DMU_META_DNODE_OBJECT) &&
 263                     (td->td_hole_birth_enabled_txg <= td->td_min_txg ||
 264                     td->td_hole_birth_enabled_txg > td->td_max_txg))
 265                         return (0);
 266         } else if (bp->blk_birth <= td->td_min_txg ||
 267             bp->blk_birth >= td->td_max_txg) {
 268                 return (0);
 269         }
 270 
 271         if (pd != NULL && !pd->pd_exited && prefetch_needed(pd, bp)) {
 272                 uint64_t size = BP_GET_LSIZE(bp);
 273                 mutex_enter(&pd->pd_mtx);
 274                 ASSERT(pd->pd_bytes_fetched >= 0);
 275                 while (pd->pd_bytes_fetched < size && !pd->pd_exited)
 276                         cv_wait(&pd->pd_cv, &pd->pd_mtx);
 277                 pd->pd_bytes_fetched -= size;
 278                 cv_broadcast(&pd->pd_cv);
 279                 mutex_exit(&pd->pd_mtx);
 280         }
 281 
 282         if (BP_IS_HOLE(bp)) {
 283                 err = td->td_func(td->td_spa, NULL, bp, zb, dnp, td->td_arg);
 284                 if (err != 0)
 285                         goto post;
 286                 return (0);
 287         }
 288 
 289         if (td->td_flags & TRAVERSE_PRE) {
 290                 err = td->td_func(td->td_spa, NULL, bp, zb, dnp,
 291                     td->td_arg);
 292                 if (err == TRAVERSE_VISIT_NO_CHILDREN)
 293                         return (0);
 294                 /* handle pausing at a common point */
 295                 if (err == ERESTART)
 296                         td->td_paused = B_TRUE;
 297                 if (err != 0)
 298                         goto post;
 299         }
 300 
 301         if (BP_GET_LEVEL(bp) > 0) {
 302                 arc_flags_t flags = ARC_FLAG_WAIT;
 303                 int i;
 304                 blkptr_t *cbp;
 305                 int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
 306 
 307                 err = arc_read(NULL, td->td_spa, bp, arc_getbuf_func, &buf,
 308                     ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
 309                 if (err != 0)
 310                         goto post;
 311                 cbp = buf->b_data;
 312 
 313                 for (i = 0; i < epb; i++) {
 314                         SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
 315                             zb->zb_level - 1,
 316                             zb->zb_blkid * epb + i);


 411                 td->td_resume->zb_object = zb->zb_object;
 412                 td->td_resume->zb_level = 0;
 413                 /*
 414                  * If we have stopped on an indirect block (e.g. due to
 415                  * i/o error), we have not visited anything below it.
 416                  * Set the bookmark to the first level-0 block that we need
 417                  * to visit.  This way, the resuming code does not need to
 418                  * deal with resuming from indirect blocks.
 419                  *
 420                  * Note, if zb_level <= 0, dnp may be NULL, so we don't want
 421                  * to dereference it.
 422                  */
 423                 td->td_resume->zb_blkid = zb->zb_blkid;
 424                 if (zb->zb_level > 0) {
 425                         td->td_resume->zb_blkid <<= zb->zb_level *
 426                             (dnp->dn_indblkshift - SPA_BLKPTRSHIFT);
 427                 }
 428                 td->td_paused = B_TRUE;
 429         }
 430 
 431         /* if we have walked every bp, the bookmark must be cleared */
 432         if (!err && !td->td_paused && td->td_resume != NULL &&
 433             bp == td->td_rootbp && td->td_pfd != NULL) {
 434                 bzero(td->td_resume, sizeof (*td->td_resume));
 435         }
 436 
 437         return (err);
 438 }
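
The resume logic above converts the block id of an indirect block into the id of the first level-0 block beneath it by shifting left by zb_level * (dn_indblkshift - SPA_BLKPTRSHIFT). The following standalone sketch works through that arithmetic. It assumes SPA_BLKPTRSHIFT is 7 (128-byte block pointers) and mirrors it as EX_BLKPTRSHIFT; the helper first_l0_blkid is hypothetical and not a ZFS function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match SPA_BLKPTRSHIFT: a block pointer is 2^7 = 128 bytes. */
#define	EX_BLKPTRSHIFT	7

/*
 * Hypothetical helper: return the id of the first level-0 block covered
 * by the block at (level, blkid).  Each indirect level multiplies the
 * id by the number of block pointers per indirect block, which is
 * 2^(indblkshift - EX_BLKPTRSHIFT).
 */
uint64_t
first_l0_blkid(uint64_t blkid, int level, int indblkshift)
{
	assert(indblkshift >= EX_BLKPTRSHIFT);
	if (level > 0)
		blkid <<= level * (indblkshift - EX_BLKPTRSHIFT);
	return (blkid);
}
```

With 16K indirect blocks (indblkshift of 14) each indirect block holds 128 pointers, so level-1 block 3 maps to level-0 block 3 * 128 = 384, and level-2 block 1 maps to level-0 block 128 * 128 = 16384. A level-0 block id is returned unchanged, which is why the original code only dereferences dnp when zb_level > 0.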
 439 
 440 static void
 441 prefetch_dnode_metadata(traverse_data_t *td, const dnode_phys_t *dnp,
 442     uint64_t objset, uint64_t object)
 443 {
 444         int j;
 445         zbookmark_phys_t czb;
 446 
 447         for (j = 0; j < dnp->dn_nblkptr; j++) {
 448                 SET_BOOKMARK(&czb, objset, object, dnp->dn_nlevels - 1, j);
 449                 traverse_prefetch_metadata(td, &dnp->dn_blkptr[j], &czb);
 450         }
 451 
 452         if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
 453                 SET_BOOKMARK(&czb, objset, object, 0, DMU_SPILL_BLKID);
 454                 traverse_prefetch_metadata(td, &dnp->dn_spill, &czb);
 455         }
 456 }


 543         td.td_arg = td_main->td_pfd;
 544         td.td_pfd = NULL;
 545         td.td_resume = &td_main->td_pfd->pd_resume;
 546 
 547         SET_BOOKMARK(&czb, td.td_objset,
 548             ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
 549         (void) traverse_visitbp(&td, NULL, td.td_rootbp, &czb);
 550 
 551         mutex_enter(&td_main->td_pfd->pd_mtx);
 552         td_main->td_pfd->pd_exited = B_TRUE;
 553         cv_broadcast(&td_main->td_pfd->pd_cv);
 554         mutex_exit(&td_main->td_pfd->pd_mtx);
 555 }
 556 
 557 /*
 558  * NB: dataset must not be changing on-disk (eg, is a snapshot or we are
 559  * in syncing context).
 560  */
 561 static int
 562 traverse_impl(spa_t *spa, dsl_dataset_t *ds, uint64_t objset, blkptr_t *rootbp,
 563     uint64_t txg_start, uint64_t txg_finish, zbookmark_phys_t *resume,
 564     int flags, blkptr_cb_t func, void *arg)
 565 {
 566         traverse_data_t td;
 567         prefetch_data_t pd = { 0 };
 568         zbookmark_phys_t czb;
 569         int err;
 570 
 571         ASSERT(ds == NULL || objset == ds->ds_object);
 572         ASSERT(!(flags & TRAVERSE_PRE) || !(flags & TRAVERSE_POST));
 573 
 574         td.td_spa = spa;
 575         td.td_objset = objset;
 576         td.td_rootbp = rootbp;
 577         td.td_min_txg = txg_start;
 578         td.td_max_txg = txg_finish;
 579         td.td_resume = resume;
 580         td.td_func = func;
 581         td.td_arg = arg;
 582         td.td_pfd = &pd;
 583         td.td_flags = flags;
 584         td.td_paused = B_FALSE;
 585         td.td_realloc_possible = (txg_start == 0 ? B_FALSE : B_TRUE);
 586 
 587         if (spa_feature_is_active(spa, SPA_FEATURE_HOLE_BIRTH)) {
 588                 VERIFY(spa_feature_enabled_txg(spa,
 589                     SPA_FEATURE_HOLE_BIRTH, &td.td_hole_birth_enabled_txg));
 590         } else {
 591                 td.td_hole_birth_enabled_txg = UINT64_MAX;
 592         }
 593 
 594         pd.pd_flags = flags;
 595         if (resume != NULL)
 596                 pd.pd_resume = *resume;
 597         mutex_init(&pd.pd_mtx, NULL, MUTEX_DEFAULT, NULL);
 598         cv_init(&pd.pd_cv, NULL, CV_DEFAULT, NULL);


 629         while (!pd.pd_exited)
 630                 cv_wait(&pd.pd_cv, &pd.pd_mtx);
 631         mutex_exit(&pd.pd_mtx);
 632 
 633         mutex_destroy(&pd.pd_mtx);
 634         cv_destroy(&pd.pd_cv);
 635 
 636         return (err);
 637 }
 638 
 639 /*
 640  * NB: dataset must not be changing on-disk (eg, is a snapshot or we are
 641  * in syncing context).
 642  */
 643 int
 644 traverse_dataset_resume(dsl_dataset_t *ds, uint64_t txg_start,
 645     zbookmark_phys_t *resume,
 646     int flags, blkptr_cb_t func, void *arg)
 647 {
 648         return (traverse_impl(ds->ds_dir->dd_pool->dp_spa, ds, ds->ds_object,
 649             &dsl_dataset_phys(ds)->ds_bp, txg_start, UINT64_MAX, resume, flags,
 650             func, arg));
 651 }
 652 
 653 int
 654 traverse_dataset(dsl_dataset_t *ds, uint64_t txg_start,
 655     int flags, blkptr_cb_t func, void *arg)
 656 {
 657         return (traverse_dataset_resume(ds, txg_start, NULL, flags, func, arg));
 658 }
 659 
 660 int
 661 traverse_dataset_destroyed(spa_t *spa, blkptr_t *blkptr,
 662     uint64_t txg_start, zbookmark_phys_t *resume, int flags,
 663     blkptr_cb_t func, void *arg)
 664 {
 665         return (traverse_impl(spa, NULL, ZB_DESTROYED_OBJSET,
 666             blkptr, txg_start, UINT64_MAX, resume, flags, func, arg));
 667 }
 668 
 669 /*
 670  * NB: pool must not be changing on-disk (e.g., from zdb or sync context).
 671  */
 672 int
 673 traverse_pool(spa_t *spa, uint64_t txg_start, uint64_t txg_finish, int flags,
 674     blkptr_cb_t func, void *arg, zbookmark_phys_t *zb)
 675 {
 676         int err = 0, lasterr = 0;
 677         dsl_pool_t *dp = spa_get_dsl(spa);
 678         objset_t *mos = dp->dp_meta_objset;
 679         boolean_t hard = (flags & TRAVERSE_HARD);
 680 
 681         /* visit the MOS */
 682         if (!zb || (zb->zb_objset == 0 && zb->zb_object == 0)) {
 683                 err = traverse_impl(spa, NULL, 0, spa_get_rootblkptr(spa),
 684                     txg_start, txg_finish, NULL, flags, func, arg);
 685                 if (err != 0)
 686                         return (err);
 687         }
 688 
 689         /* visit each dataset */
 690         for (uint64_t obj = (zb && !ZB_IS_ZERO(zb)) ? zb->zb_objset : 1;
 691             err == 0 || (err != ESRCH && hard);
 692             err = dmu_object_next(mos, &obj, B_FALSE, txg_start)) {
 693                 dmu_object_info_t doi;
 694 
 695                 err = dmu_object_info(mos, obj, &doi);
 696                 if (err != 0) {
 697                         if (hard)
 698                                 continue;
 699                         break;
 700                 }
 701 
 702                 if (doi.doi_bonus_type == DMU_OT_DSL_DATASET) {
 703                         dsl_dataset_t *ds;
 704                         objset_t *os;
 705                         boolean_t os_is_snapshot = B_FALSE;
 706                         uint64_t txg = txg_start;
 707                         uint64_t ctxg;
 708                         uint64_t max_txg = txg_finish;
 709 
 710                         dsl_pool_config_enter(dp, FTAG);
 711                         err = dsl_dataset_hold_obj(dp, obj, FTAG, &ds);
 712                         dsl_pool_config_exit(dp, FTAG);
 713                         if (err != 0) {
 714                                 if (hard)
 715                                         continue;
 716                                 break;
 717                         }
 718 
 719                         dsl_pool_config_enter(dp, FTAG);
 720                         err = dmu_objset_from_ds(ds, &os);
 721                         if (err == 0)
 722                                 os_is_snapshot = dmu_objset_is_snapshot(os);
 723 
 724                         dsl_pool_config_exit(dp, FTAG);
 725                         if (err != 0) {
 726                                 dsl_dataset_rele(ds, FTAG);
 727                                 if (hard)
 728                                         continue;
 729                                 break;
 730                         }
 731                         ctxg = dsl_dataset_phys(ds)->ds_creation_txg;
 732 
 733                         /* uplimited traverse walks over snapshots only */
 734                         if (max_txg != UINT64_MAX && !os_is_snapshot) {
 735                                 dsl_dataset_rele(ds, FTAG);
 736                                 continue;
 737                         }
 738                         if (max_txg != UINT64_MAX && ctxg >= max_txg) {
 739                                 dsl_dataset_rele(ds, FTAG);
 740                                 continue;
 741                         }
 742                         if (os_is_snapshot && ctxg <= txg_start) {
 743                                 dsl_dataset_rele(ds, FTAG);
 744                                 continue;
 745                         }
 746                         if (max_txg == UINT64_MAX &&
 747                             dsl_dataset_phys(ds)->ds_prev_snap_txg > txg)
 748                                 txg = dsl_dataset_phys(ds)->ds_prev_snap_txg;
 749                         if (txg > max_txg)
 750                                 max_txg = txg;
 751                         err = traverse_impl(spa, ds, ds->ds_object,
 752                             &dsl_dataset_phys(ds)->ds_bp,
 753                             txg, max_txg, zb, flags, func, arg);
 754                         dsl_dataset_rele(ds, FTAG);
 755                         if (err != 0) {
 756                                 if (!hard)
 757                                         return (err);
 758                                 lasterr = err;
 759                         }
 760                         if (zb && !ZB_IS_ZERO(zb))
 761                                 break;
 762                 }
 763         }
 764         if (err == ESRCH) {
 765                 /* zero bookmark means we are done */
 766                 if (zb)
 767                         bzero(zb, sizeof (*zb));
 768                 err = 0;
 769         }
 770         return (err != 0 ? err : lasterr);
 771 }