    
OS-6363 system went to dark side of moon for ~467 seconds
OS-6404 ARC reclaim should throttle its calls to arc_kmem_reap_now()
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
    
      
    
    
          --- old/usr/src/uts/common/os/vmem.c
          +++ new/usr/src/uts/common/os/vmem.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  
  
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  /*
  22   22   * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
  23   23   * Use is subject to license terms.
  24   24   */
  25   25  
  26   26  /*
  27   27   * Copyright (c) 2012, 2015 by Delphix. All rights reserved.
  28      - * Copyright (c) 2012, Joyent, Inc. All rights reserved.
       28 + * Copyright (c) 2017, Joyent, Inc.
  29   29   */
  30   30  
  31   31  /*
  32   32   * Big Theory Statement for the virtual memory allocator.
  33   33   *
  34   34   * For a more complete description of the main ideas, see:
  35   35   *
  36   36   *      Jeff Bonwick and Jonathan Adams,
  37   37   *
  38   38   *      Magazines and vmem: Extending the Slab Allocator to Many CPUs and
  39   39   *      Arbitrary Resources.
  40   40   *
  41   41   *      Proceedings of the 2001 Usenix Conference.
  42   42   *      Available as http://www.usenix.org/event/usenix01/bonwick.html
  43   43   *
  44   44   * Section 1, below, is also the primary contents of vmem(9).  If for some
  45   45   * reason you are updating this comment, you will also wish to update the
  46   46   * manual.
  47   47   *
  48   48   * 1. General Concepts
  49   49   * -------------------
  50   50   *
  51   51   * 1.1 Overview
  52   52   * ------------
  53   53   * We divide the kernel address space into a number of logically distinct
  54   54   * pieces, or *arenas*: text, data, heap, stack, and so on.  Within these
  55   55   * arenas we often subdivide further; for example, we use heap addresses
  56   56   * not only for the kernel heap (kmem_alloc() space), but also for DVMA,
  57   57   * bp_mapin(), /dev/kmem, and even some device mappings like the TOD chip.
  58   58   * The kernel address space, therefore, is most accurately described as
  59   59   * a tree of arenas in which each node of the tree *imports* some subset
  60   60   * of its parent.  The virtual memory allocator manages these arenas and
  61   61   * supports their natural hierarchical structure.
  62   62   *
  63   63   * 1.2 Arenas
  64   64   * ----------
  65   65   * An arena is nothing more than a set of integers.  These integers most
  66   66   * commonly represent virtual addresses, but in fact they can represent
  67   67   * anything at all.  For example, we could use an arena containing the
  68   68   * integers minpid through maxpid to allocate process IDs.  vmem_create()
  69   69   * and vmem_destroy() create and destroy vmem arenas.  In order to
   70   70   * differentiate between arenas used for addresses and arenas used for
  71   71   * identifiers, the VMC_IDENTIFIER flag is passed to vmem_create().  This
  72   72   * prevents identifier exhaustion from being diagnosed as general memory
  73   73   * failure.
  74   74   *
  75   75   * 1.3 Spans
  76   76   * ---------
  77   77   * We represent the integers in an arena as a collection of *spans*, or
  78   78   * contiguous ranges of integers.  For example, the kernel heap consists
  79   79   * of just one span: [kernelheap, ekernelheap).  Spans can be added to an
  80   80   * arena in two ways: explicitly, by vmem_add(), or implicitly, by
  81   81   * importing, as described in Section 1.5 below.
  82   82   *
  83   83   * 1.4 Segments
  84   84   * ------------
  85   85   * Spans are subdivided into *segments*, each of which is either allocated
  86   86   * or free.  A segment, like a span, is a contiguous range of integers.
  87   87   * Each allocated segment [addr, addr + size) represents exactly one
  88   88   * vmem_alloc(size) that returned addr.  Free segments represent the space
  89   89   * between allocated segments.  If two free segments are adjacent, we
  90   90   * coalesce them into one larger segment; that is, if segments [a, b) and
  91   91   * [b, c) are both free, we merge them into a single segment [a, c).
  92   92   * The segments within a span are linked together in increasing-address order
  93   93   * so we can easily determine whether coalescing is possible.
  94   94   *
  95   95   * Segments never cross span boundaries.  When all segments within
  96   96   * an imported span become free, we return the span to its source.
  97   97   *
  98   98   * 1.5 Imported Memory
  99   99   * -------------------
 100  100   * As mentioned in the overview, some arenas are logical subsets of
 101  101   * other arenas.  For example, kmem_va_arena (a virtual address cache
 102  102   * that satisfies most kmem_slab_create() requests) is just a subset
 103  103   * of heap_arena (the kernel heap) that provides caching for the most
 104  104   * common slab sizes.  When kmem_va_arena runs out of virtual memory,
 105  105   * it *imports* more from the heap; we say that heap_arena is the
 106  106   * *vmem source* for kmem_va_arena.  vmem_create() allows you to
 107  107   * specify any existing vmem arena as the source for your new arena.
 108  108   * Topologically, since every arena is a child of at most one source,
 109  109   * the set of all arenas forms a collection of trees.
 110  110   *
 111  111   * 1.6 Constrained Allocations
 112  112   * ---------------------------
 113  113   * Some vmem clients are quite picky about the kind of address they want.
 114  114   * For example, the DVMA code may need an address that is at a particular
 115  115   * phase with respect to some alignment (to get good cache coloring), or
 116  116   * that lies within certain limits (the addressable range of a device),
 117  117   * or that doesn't cross some boundary (a DMA counter restriction) --
 118  118   * or all of the above.  vmem_xalloc() allows the client to specify any
 119  119   * or all of these constraints.
 120  120   *
 121  121   * 1.7 The Vmem Quantum
 122  122   * --------------------
 123  123   * Every arena has a notion of 'quantum', specified at vmem_create() time,
 124  124   * that defines the arena's minimum unit of currency.  Most commonly the
 125  125   * quantum is either 1 or PAGESIZE, but any power of 2 is legal.
 126  126   * All vmem allocations are guaranteed to be quantum-aligned.
 127  127   *
 128  128   * 1.8 Quantum Caching
 129  129   * -------------------
 130  130   * A vmem arena may be so hot (frequently used) that the scalability of vmem
 131  131   * allocation is a significant concern.  We address this by allowing the most
 132  132   * common allocation sizes to be serviced by the kernel memory allocator,
 133  133   * which provides low-latency per-cpu caching.  The qcache_max argument to
 134  134   * vmem_create() specifies the largest allocation size to cache.
 135  135   *
 136  136   * 1.9 Relationship to Kernel Memory Allocator
 137  137   * -------------------------------------------
 138  138   * Every kmem cache has a vmem arena as its slab supplier.  The kernel memory
 139  139   * allocator uses vmem_alloc() and vmem_free() to create and destroy slabs.
 140  140   *
 141  141   *
 142  142   * 2. Implementation
 143  143   * -----------------
 144  144   *
 145  145   * 2.1 Segment lists and markers
 146  146   * -----------------------------
 147  147   * The segment structure (vmem_seg_t) contains two doubly-linked lists.
 148  148   *
 149  149   * The arena list (vs_anext/vs_aprev) links all segments in the arena.
 150  150   * In addition to the allocated and free segments, the arena contains
 151  151   * special marker segments at span boundaries.  Span markers simplify
 152  152   * coalescing and importing logic by making it easy to tell both when
 153  153   * we're at a span boundary (so we don't coalesce across it), and when
 154  154   * a span is completely free (its neighbors will both be span markers).
 155  155   *
 156  156   * Imported spans will have vs_import set.
 157  157   *
 158  158   * The next-of-kin list (vs_knext/vs_kprev) links segments of the same type:
 159  159   * (1) for allocated segments, vs_knext is the hash chain linkage;
 160  160   * (2) for free segments, vs_knext is the freelist linkage;
 161  161   * (3) for span marker segments, vs_knext is the next span marker.
 162  162   *
 163  163   * 2.2 Allocation hashing
 164  164   * ----------------------
 165  165   * We maintain a hash table of all allocated segments, hashed by address.
 166  166   * This allows vmem_free() to discover the target segment in constant time.
 167  167   * vmem_update() periodically resizes hash tables to keep hash chains short.
 168  168   *
 169  169   * 2.3 Freelist management
 170  170   * -----------------------
 171  171   * We maintain power-of-2 freelists for free segments, i.e. free segments
 172  172   * of size >= 2^n reside in vmp->vm_freelist[n].  To ensure constant-time
 173  173   * allocation, vmem_xalloc() looks not in the first freelist that *might*
 174  174   * satisfy the allocation, but in the first freelist that *definitely*
 175  175   * satisfies the allocation (unless VM_BESTFIT is specified, or all larger
 176  176   * freelists are empty).  For example, a 1000-byte allocation will be
 177  177   * satisfied not from the 512..1023-byte freelist, whose members *might*
  178  178   * contain a 1000-byte segment, but from a 1024-byte or larger freelist,
 179  179   * the first member of which will *definitely* satisfy the allocation.
 180  180   * This ensures that vmem_xalloc() works in constant time.
 181  181   *
 182  182   * We maintain a bit map to determine quickly which freelists are non-empty.
 183  183   * vmp->vm_freemap & (1 << n) is non-zero iff vmp->vm_freelist[n] is non-empty.
 184  184   *
 185  185   * The different freelists are linked together into one large freelist,
 186  186   * with the freelist heads serving as markers.  Freelist markers simplify
 187  187   * the maintenance of vm_freemap by making it easy to tell when we're taking
 188  188   * the last member of a freelist (both of its neighbors will be markers).
 189  189   *
 190  190   * 2.4 Vmem Locking
 191  191   * ----------------
 192  192   * For simplicity, all arena state is protected by a per-arena lock.
 193  193   * For very hot arenas, use quantum caching for scalability.
 194  194   *
 195  195   * 2.5 Vmem Population
 196  196   * -------------------
 197  197   * Any internal vmem routine that might need to allocate new segment
 198  198   * structures must prepare in advance by calling vmem_populate(), which
  199  199   * will preallocate enough vmem_seg_t's to get it through the entire
 200  200   * operation without dropping the arena lock.
 201  201   *
 202  202   * 2.6 Auditing
 203  203   * ------------
 204  204   * If KMF_AUDIT is set in kmem_flags, we audit vmem allocations as well.
 205  205   * Since virtual addresses cannot be scribbled on, there is no equivalent
 206  206   * in vmem to redzone checking, deadbeef, or other kmem debugging features.
 207  207   * Moreover, we do not audit frees because segment coalescing destroys the
 208  208   * association between an address and its segment structure.  Auditing is
 209  209   * thus intended primarily to keep track of who's consuming the arena.
 210  210   * Debugging support could certainly be extended in the future if it proves
 211  211   * necessary, but we do so much live checking via the allocation hash table
 212  212   * that even non-DEBUG systems get quite a bit of sanity checking already.
 213  213   */
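
The arena model described in Section 1 above is easiest to see with a small client.  The sketch below is illustrative only and is not part of this change; the arena name and ID range are hypothetical.  It creates an identifier arena (Section 1.2) and allocates IDs with VM_NEXTFIT so that values cycle through the whole range before being reused, which is the policy vmem_nextfit_alloc() implements further down in this file.

#include <sys/types.h>
#include <sys/vmem.h>

static vmem_t *example_id_arena;        /* hypothetical client arena */

static void
example_id_init(void)
{
        /* Integers [1, 100000]; quantum 1; no source arena, no qcache. */
        example_id_arena = vmem_create("example_id", (void *)1, 100000, 1,
            NULL, NULL, NULL, 0, VM_SLEEP | VMC_IDENTIFIER);
}

static uint_t
example_id_alloc(void)
{
        /* VM_NEXTFIT cycles through the whole range before reusing an ID. */
        return ((uint_t)(uintptr_t)vmem_alloc(example_id_arena, 1,
            VM_SLEEP | VM_NEXTFIT));
}

static void
example_id_free(uint_t id)
{
        vmem_free(example_id_arena, (void *)(uintptr_t)id, 1);
}
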
 214  214  
 215  215  #include <sys/vmem_impl.h>
 216  216  #include <sys/kmem.h>
 217  217  #include <sys/kstat.h>
 218  218  #include <sys/param.h>
 219  219  #include <sys/systm.h>
 220  220  #include <sys/atomic.h>
 221  221  #include <sys/bitmap.h>
 222  222  #include <sys/sysmacros.h>
 223  223  #include <sys/cmn_err.h>
 224  224  #include <sys/debug.h>
 225  225  #include <sys/panic.h>
 226  226  
 227  227  #define VMEM_INITIAL            10      /* early vmem arenas */
 228  228  #define VMEM_SEG_INITIAL        200     /* early segments */
 229  229  
 230  230  /*
 231  231   * Adding a new span to an arena requires two segment structures: one to
 232  232   * represent the span, and one to represent the free segment it contains.
 233  233   */
 234  234  #define VMEM_SEGS_PER_SPAN_CREATE       2
 235  235  
 236  236  /*
 237  237   * Allocating a piece of an existing segment requires 0-2 segment structures
 238  238   * depending on how much of the segment we're allocating.
 239  239   *
 240  240   * To allocate the entire segment, no new segment structures are needed; we
 241  241   * simply move the existing segment structure from the freelist to the
 242  242   * allocation hash table.
 243  243   *
 244  244   * To allocate a piece from the left or right end of the segment, we must
 245  245   * split the segment into two pieces (allocated part and remainder), so we
 246  246   * need one new segment structure to represent the remainder.
 247  247   *
  248  248   * To allocate from the middle of a segment, we need two new segment structures
 249  249   * to represent the remainders on either side of the allocated part.
 250  250   */
 251  251  #define VMEM_SEGS_PER_EXACT_ALLOC       0
 252  252  #define VMEM_SEGS_PER_LEFT_ALLOC        1
 253  253  #define VMEM_SEGS_PER_RIGHT_ALLOC       1
 254  254  #define VMEM_SEGS_PER_MIDDLE_ALLOC      2
 255  255  
 256  256  /*
 257  257   * vmem_populate() preallocates segment structures for vmem to do its work.
 258  258   * It must preallocate enough for the worst case, which is when we must import
 259  259   * a new span and then allocate from the middle of it.
 260  260   */
 261  261  #define VMEM_SEGS_PER_ALLOC_MAX         \
 262  262          (VMEM_SEGS_PER_SPAN_CREATE + VMEM_SEGS_PER_MIDDLE_ALLOC)
 263  263  
 264  264  /*
 265  265   * The segment structures themselves are allocated from vmem_seg_arena, so
 266  266   * we have a recursion problem when vmem_seg_arena needs to populate itself.
 267  267   * We address this by working out the maximum number of segment structures
 268  268   * this act will require, and multiplying by the maximum number of threads
 269  269   * that we'll allow to do it simultaneously.
 270  270   *
 271  271   * The worst-case segment consumption to populate vmem_seg_arena is as
 272  272   * follows (depicted as a stack trace to indicate why events are occurring):
 273  273   *
 274  274   * (In order to lower the fragmentation in the heap_arena, we specify a
 275  275   * minimum import size for the vmem_metadata_arena which is the same size
 276  276   * as the kmem_va quantum cache allocations.  This causes the worst-case
 277  277   * allocation from the vmem_metadata_arena to be 3 segments.)
 278  278   *
 279  279   * vmem_alloc(vmem_seg_arena)           -> 2 segs (span create + exact alloc)
 280  280   *  segkmem_alloc(vmem_metadata_arena)
 281  281   *   vmem_alloc(vmem_metadata_arena)    -> 3 segs (span create + left alloc)
 282  282   *    vmem_alloc(heap_arena)            -> 1 seg (left alloc)
 283  283   *   page_create()
 284  284   *   hat_memload()
 285  285   *    kmem_cache_alloc()
 286  286   *     kmem_slab_create()
 287  287   *      vmem_alloc(hat_memload_arena)   -> 2 segs (span create + exact alloc)
 288  288   *       segkmem_alloc(heap_arena)
 289  289   *        vmem_alloc(heap_arena)        -> 1 seg (left alloc)
 290  290   *        page_create()
 291  291   *        hat_memload()         -> (hat layer won't recurse further)
 292  292   *
 293  293   * The worst-case consumption for each arena is 3 segment structures.
 294  294   * Of course, a 3-seg reserve could easily be blown by multiple threads.
 295  295   * Therefore, we serialize all allocations from vmem_seg_arena (which is OK
 296  296   * because they're rare).  We cannot allow a non-blocking allocation to get
 297  297   * tied up behind a blocking allocation, however, so we use separate locks
 298  298   * for VM_SLEEP and VM_NOSLEEP allocations.  Similarly, VM_PUSHPAGE allocations
 299  299   * must not block behind ordinary VM_SLEEPs.  In addition, if the system is
 300  300   * panicking then we must keep enough resources for panic_thread to do its
 301  301   * work.  Thus we have at most four threads trying to allocate from
 302  302   * vmem_seg_arena, and each thread consumes at most three segment structures,
 303  303   * so we must maintain a 12-seg reserve.
 304  304   */
 305  305  #define VMEM_POPULATE_RESERVE   12
 306  306  
 307  307  /*
 308  308   * vmem_populate() ensures that each arena has VMEM_MINFREE seg structures
 309  309   * so that it can satisfy the worst-case allocation *and* participate in
 310  310   * worst-case allocation from vmem_seg_arena.
 311  311   */
 312  312  #define VMEM_MINFREE    (VMEM_POPULATE_RESERVE + VMEM_SEGS_PER_ALLOC_MAX)
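
For concreteness: with the definitions above, VMEM_SEGS_PER_ALLOC_MAX is 2 + 2 = 4, so VMEM_MINFREE works out to 12 + 4 = 16 preallocated segment structures per arena.
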
 313  313  
 314  314  static vmem_t vmem0[VMEM_INITIAL];
 315  315  static vmem_t *vmem_populator[VMEM_INITIAL];
 316  316  static uint32_t vmem_id;
 317  317  static uint32_t vmem_populators;
 318  318  static vmem_seg_t vmem_seg0[VMEM_SEG_INITIAL];
 319  319  static vmem_seg_t *vmem_segfree;
 320  320  static kmutex_t vmem_list_lock;
 321  321  static kmutex_t vmem_segfree_lock;
 322  322  static kmutex_t vmem_sleep_lock;
 323  323  static kmutex_t vmem_nosleep_lock;
 324  324  static kmutex_t vmem_pushpage_lock;
 325  325  static kmutex_t vmem_panic_lock;
 326  326  static vmem_t *vmem_list;
 327  327  static vmem_t *vmem_metadata_arena;
 328  328  static vmem_t *vmem_seg_arena;
 329  329  static vmem_t *vmem_hash_arena;
 330  330  static vmem_t *vmem_vmem_arena;
 331  331  static long vmem_update_interval = 15;  /* vmem_update() every 15 seconds */
 332  332  uint32_t vmem_mtbf;             /* mean time between failures [default: off] */
 333  333  size_t vmem_seg_size = sizeof (vmem_seg_t);
 334  334  
 335  335  static vmem_kstat_t vmem_kstat_template = {
 336  336          { "mem_inuse",          KSTAT_DATA_UINT64 },
 337  337          { "mem_import",         KSTAT_DATA_UINT64 },
 338  338          { "mem_total",          KSTAT_DATA_UINT64 },
 339  339          { "vmem_source",        KSTAT_DATA_UINT32 },
 340  340          { "alloc",              KSTAT_DATA_UINT64 },
 341  341          { "free",               KSTAT_DATA_UINT64 },
 342  342          { "wait",               KSTAT_DATA_UINT64 },
 343  343          { "fail",               KSTAT_DATA_UINT64 },
 344  344          { "lookup",             KSTAT_DATA_UINT64 },
 345  345          { "search",             KSTAT_DATA_UINT64 },
 346  346          { "populate_wait",      KSTAT_DATA_UINT64 },
 347  347          { "populate_fail",      KSTAT_DATA_UINT64 },
 348  348          { "contains",           KSTAT_DATA_UINT64 },
 349  349          { "contains_search",    KSTAT_DATA_UINT64 },
 350  350  };
 351  351  
 352  352  /*
 353  353   * Insert/delete from arena list (type 'a') or next-of-kin list (type 'k').
 354  354   */
 355  355  #define VMEM_INSERT(vprev, vsp, type)                                   \
 356  356  {                                                                       \
 357  357          vmem_seg_t *vnext = (vprev)->vs_##type##next;                   \
 358  358          (vsp)->vs_##type##next = (vnext);                               \
 359  359          (vsp)->vs_##type##prev = (vprev);                               \
 360  360          (vprev)->vs_##type##next = (vsp);                               \
 361  361          (vnext)->vs_##type##prev = (vsp);                               \
 362  362  }
 363  363  
 364  364  #define VMEM_DELETE(vsp, type)                                          \
 365  365  {                                                                       \
 366  366          vmem_seg_t *vprev = (vsp)->vs_##type##prev;                     \
 367  367          vmem_seg_t *vnext = (vsp)->vs_##type##next;                     \
 368  368          (vprev)->vs_##type##next = (vnext);                             \
 369  369          (vnext)->vs_##type##prev = (vprev);                             \
 370  370  }
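
For reference, after token pasting with type 'k', VMEM_INSERT(vprev, vsp, k) expands to the following splice of vsp into the next-of-kin list immediately after vprev (shown here only to make the ## pasting concrete; it is not new code):

        vmem_seg_t *vnext = (vprev)->vs_knext;
        (vsp)->vs_knext = (vnext);
        (vsp)->vs_kprev = (vprev);
        (vprev)->vs_knext = (vsp);
        (vnext)->vs_kprev = (vsp);
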
 371  371  
 372  372  /*
 373  373   * Get a vmem_seg_t from the global segfree list.
 374  374   */
 375  375  static vmem_seg_t *
 376  376  vmem_getseg_global(void)
 377  377  {
 378  378          vmem_seg_t *vsp;
 379  379  
 380  380          mutex_enter(&vmem_segfree_lock);
 381  381          if ((vsp = vmem_segfree) != NULL)
 382  382                  vmem_segfree = vsp->vs_knext;
 383  383          mutex_exit(&vmem_segfree_lock);
 384  384  
 385  385          return (vsp);
 386  386  }
 387  387  
 388  388  /*
 389  389   * Put a vmem_seg_t on the global segfree list.
 390  390   */
 391  391  static void
 392  392  vmem_putseg_global(vmem_seg_t *vsp)
 393  393  {
 394  394          mutex_enter(&vmem_segfree_lock);
 395  395          vsp->vs_knext = vmem_segfree;
 396  396          vmem_segfree = vsp;
 397  397          mutex_exit(&vmem_segfree_lock);
 398  398  }
 399  399  
 400  400  /*
 401  401   * Get a vmem_seg_t from vmp's segfree list.
 402  402   */
 403  403  static vmem_seg_t *
 404  404  vmem_getseg(vmem_t *vmp)
 405  405  {
 406  406          vmem_seg_t *vsp;
 407  407  
 408  408          ASSERT(vmp->vm_nsegfree > 0);
 409  409  
 410  410          vsp = vmp->vm_segfree;
 411  411          vmp->vm_segfree = vsp->vs_knext;
 412  412          vmp->vm_nsegfree--;
 413  413  
 414  414          return (vsp);
 415  415  }
 416  416  
 417  417  /*
 418  418   * Put a vmem_seg_t on vmp's segfree list.
 419  419   */
 420  420  static void
 421  421  vmem_putseg(vmem_t *vmp, vmem_seg_t *vsp)
 422  422  {
 423  423          vsp->vs_knext = vmp->vm_segfree;
 424  424          vmp->vm_segfree = vsp;
 425  425          vmp->vm_nsegfree++;
 426  426  }
 427  427  
 428  428  /*
 429  429   * Add vsp to the appropriate freelist.
 430  430   */
 431  431  static void
 432  432  vmem_freelist_insert(vmem_t *vmp, vmem_seg_t *vsp)
 433  433  {
 434  434          vmem_seg_t *vprev;
 435  435  
 436  436          ASSERT(*VMEM_HASH(vmp, vsp->vs_start) != vsp);
 437  437  
 438  438          vprev = (vmem_seg_t *)&vmp->vm_freelist[highbit(VS_SIZE(vsp)) - 1];
 439  439          vsp->vs_type = VMEM_FREE;
 440  440          vmp->vm_freemap |= VS_SIZE(vprev);
 441  441          VMEM_INSERT(vprev, vsp, k);
 442  442  
 443  443          cv_broadcast(&vmp->vm_cv);
 444  444  }
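
A worked example with made-up numbers: for a 1000-byte free segment, highbit(1000) is 10, so vprev is &vm_freelist[9], the head of the 512..1023-byte list, and the |= above sets bit 9 of vm_freemap to mark that list non-empty.  (The freelist heads are initialized at arena creation time, outside this excerpt, so that VS_SIZE() of the head for list n works out to 1 << n.)
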
 445  445  
 446  446  /*
 447  447   * Take vsp from the freelist.
 448  448   */
 449  449  static void
 450  450  vmem_freelist_delete(vmem_t *vmp, vmem_seg_t *vsp)
 451  451  {
 452  452          ASSERT(*VMEM_HASH(vmp, vsp->vs_start) != vsp);
 453  453          ASSERT(vsp->vs_type == VMEM_FREE);
 454  454  
 455  455          if (vsp->vs_knext->vs_start == 0 && vsp->vs_kprev->vs_start == 0) {
 456  456                  /*
 457  457                   * The segments on both sides of 'vsp' are freelist heads,
 458  458                   * so taking vsp leaves the freelist at vsp->vs_kprev empty.
 459  459                   */
 460  460                  ASSERT(vmp->vm_freemap & VS_SIZE(vsp->vs_kprev));
 461  461                  vmp->vm_freemap ^= VS_SIZE(vsp->vs_kprev);
 462  462          }
 463  463          VMEM_DELETE(vsp, k);
 464  464  }
 465  465  
 466  466  /*
 467  467   * Add vsp to the allocated-segment hash table and update kstats.
 468  468   */
 469  469  static void
 470  470  vmem_hash_insert(vmem_t *vmp, vmem_seg_t *vsp)
 471  471  {
 472  472          vmem_seg_t **bucket;
 473  473  
 474  474          vsp->vs_type = VMEM_ALLOC;
 475  475          bucket = VMEM_HASH(vmp, vsp->vs_start);
 476  476          vsp->vs_knext = *bucket;
 477  477          *bucket = vsp;
 478  478  
 479  479          if (vmem_seg_size == sizeof (vmem_seg_t)) {
 480  480                  vsp->vs_depth = (uint8_t)getpcstack(vsp->vs_stack,
 481  481                      VMEM_STACK_DEPTH);
 482  482                  vsp->vs_thread = curthread;
 483  483                  vsp->vs_timestamp = gethrtime();
 484  484          } else {
 485  485                  vsp->vs_depth = 0;
 486  486          }
 487  487  
 488  488          vmp->vm_kstat.vk_alloc.value.ui64++;
 489  489          vmp->vm_kstat.vk_mem_inuse.value.ui64 += VS_SIZE(vsp);
 490  490  }
 491  491  
 492  492  /*
 493  493   * Remove vsp from the allocated-segment hash table and update kstats.
 494  494   */
 495  495  static vmem_seg_t *
 496  496  vmem_hash_delete(vmem_t *vmp, uintptr_t addr, size_t size)
 497  497  {
 498  498          vmem_seg_t *vsp, **prev_vspp;
 499  499  
 500  500          prev_vspp = VMEM_HASH(vmp, addr);
 501  501          while ((vsp = *prev_vspp) != NULL) {
 502  502                  if (vsp->vs_start == addr) {
 503  503                          *prev_vspp = vsp->vs_knext;
 504  504                          break;
 505  505                  }
 506  506                  vmp->vm_kstat.vk_lookup.value.ui64++;
 507  507                  prev_vspp = &vsp->vs_knext;
 508  508          }
 509  509  
 510  510          if (vsp == NULL)
 511  511                  panic("vmem_hash_delete(%p, %lx, %lu): bad free",
 512  512                      (void *)vmp, addr, size);
 513  513          if (VS_SIZE(vsp) != size)
 514  514                  panic("vmem_hash_delete(%p, %lx, %lu): wrong size (expect %lu)",
 515  515                      (void *)vmp, addr, size, VS_SIZE(vsp));
 516  516  
 517  517          vmp->vm_kstat.vk_free.value.ui64++;
 518  518          vmp->vm_kstat.vk_mem_inuse.value.ui64 -= size;
 519  519  
 520  520          return (vsp);
 521  521  }
 522  522  
 523  523  /*
 524  524   * Create a segment spanning the range [start, end) and add it to the arena.
 525  525   */
 526  526  static vmem_seg_t *
 527  527  vmem_seg_create(vmem_t *vmp, vmem_seg_t *vprev, uintptr_t start, uintptr_t end)
 528  528  {
 529  529          vmem_seg_t *newseg = vmem_getseg(vmp);
 530  530  
 531  531          newseg->vs_start = start;
 532  532          newseg->vs_end = end;
 533  533          newseg->vs_type = 0;
 534  534          newseg->vs_import = 0;
 535  535  
 536  536          VMEM_INSERT(vprev, newseg, a);
 537  537  
 538  538          return (newseg);
 539  539  }
 540  540  
 541  541  /*
 542  542   * Remove segment vsp from the arena.
 543  543   */
 544  544  static void
 545  545  vmem_seg_destroy(vmem_t *vmp, vmem_seg_t *vsp)
 546  546  {
 547  547          ASSERT(vsp->vs_type != VMEM_ROTOR);
 548  548          VMEM_DELETE(vsp, a);
 549  549  
 550  550          vmem_putseg(vmp, vsp);
 551  551  }
 552  552  
 553  553  /*
 554  554   * Add the span [vaddr, vaddr + size) to vmp and update kstats.
 555  555   */
 556  556  static vmem_seg_t *
 557  557  vmem_span_create(vmem_t *vmp, void *vaddr, size_t size, uint8_t import)
 558  558  {
 559  559          vmem_seg_t *newseg, *span;
 560  560          uintptr_t start = (uintptr_t)vaddr;
 561  561          uintptr_t end = start + size;
 562  562  
 563  563          ASSERT(MUTEX_HELD(&vmp->vm_lock));
 564  564  
 565  565          if ((start | end) & (vmp->vm_quantum - 1))
 566  566                  panic("vmem_span_create(%p, %p, %lu): misaligned",
 567  567                      (void *)vmp, vaddr, size);
 568  568  
 569  569          span = vmem_seg_create(vmp, vmp->vm_seg0.vs_aprev, start, end);
 570  570          span->vs_type = VMEM_SPAN;
 571  571          span->vs_import = import;
 572  572          VMEM_INSERT(vmp->vm_seg0.vs_kprev, span, k);
 573  573  
 574  574          newseg = vmem_seg_create(vmp, span, start, end);
 575  575          vmem_freelist_insert(vmp, newseg);
 576  576  
 577  577          if (import)
 578  578                  vmp->vm_kstat.vk_mem_import.value.ui64 += size;
 579  579          vmp->vm_kstat.vk_mem_total.value.ui64 += size;
 580  580  
 581  581          return (newseg);
 582  582  }
 583  583  
 584  584  /*
 585  585   * Remove span vsp from vmp and update kstats.
 586  586   */
 587  587  static void
 588  588  vmem_span_destroy(vmem_t *vmp, vmem_seg_t *vsp)
 589  589  {
 590  590          vmem_seg_t *span = vsp->vs_aprev;
 591  591          size_t size = VS_SIZE(vsp);
 592  592  
 593  593          ASSERT(MUTEX_HELD(&vmp->vm_lock));
 594  594          ASSERT(span->vs_type == VMEM_SPAN);
 595  595  
 596  596          if (span->vs_import)
 597  597                  vmp->vm_kstat.vk_mem_import.value.ui64 -= size;
 598  598          vmp->vm_kstat.vk_mem_total.value.ui64 -= size;
 599  599  
 600  600          VMEM_DELETE(span, k);
 601  601  
 602  602          vmem_seg_destroy(vmp, vsp);
 603  603          vmem_seg_destroy(vmp, span);
 604  604  }
 605  605  
 606  606  /*
 607  607   * Allocate the subrange [addr, addr + size) from segment vsp.
 608  608   * If there are leftovers on either side, place them on the freelist.
 609  609   * Returns a pointer to the segment representing [addr, addr + size).
 610  610   */
 611  611  static vmem_seg_t *
 612  612  vmem_seg_alloc(vmem_t *vmp, vmem_seg_t *vsp, uintptr_t addr, size_t size)
 613  613  {
 614  614          uintptr_t vs_start = vsp->vs_start;
 615  615          uintptr_t vs_end = vsp->vs_end;
 616  616          size_t vs_size = vs_end - vs_start;
 617  617          size_t realsize = P2ROUNDUP(size, vmp->vm_quantum);
 618  618          uintptr_t addr_end = addr + realsize;
 619  619  
 620  620          ASSERT(P2PHASE(vs_start, vmp->vm_quantum) == 0);
 621  621          ASSERT(P2PHASE(addr, vmp->vm_quantum) == 0);
 622  622          ASSERT(vsp->vs_type == VMEM_FREE);
 623  623          ASSERT(addr >= vs_start && addr_end - 1 <= vs_end - 1);
 624  624          ASSERT(addr - 1 <= addr_end - 1);
 625  625  
 626  626          /*
 627  627           * If we're allocating from the start of the segment, and the
 628  628           * remainder will be on the same freelist, we can save quite
 629  629           * a bit of work.
 630  630           */
 631  631          if (P2SAMEHIGHBIT(vs_size, vs_size - realsize) && addr == vs_start) {
 632  632                  ASSERT(highbit(vs_size) == highbit(vs_size - realsize));
 633  633                  vsp->vs_start = addr_end;
 634  634                  vsp = vmem_seg_create(vmp, vsp->vs_aprev, addr, addr + size);
 635  635                  vmem_hash_insert(vmp, vsp);
 636  636                  return (vsp);
 637  637          }
 638  638  
 639  639          vmem_freelist_delete(vmp, vsp);
 640  640  
 641  641          if (vs_end != addr_end)
 642  642                  vmem_freelist_insert(vmp,
 643  643                      vmem_seg_create(vmp, vsp, addr_end, vs_end));
 644  644  
 645  645          if (vs_start != addr)
 646  646                  vmem_freelist_insert(vmp,
 647  647                      vmem_seg_create(vmp, vsp->vs_aprev, vs_start, addr));
 648  648  
 649  649          vsp->vs_start = addr;
 650  650          vsp->vs_end = addr + size;
 651  651  
 652  652          vmem_hash_insert(vmp, vsp);
 653  653          return (vsp);
 654  654  }
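
To make the VMEM_SEGS_PER_*_ALLOC accounting from the header concrete, here is how a hypothetical free segment splits in an arena with a quantum of 0x1000 (addresses are made up):

        /*
         * Free segment [0x1000, 0x3000):
         *   exact alloc  [0x1000, 0x3000) -> 0 new segs (vsp is reused)
         *   left alloc   [0x1000, 0x2000) -> 1 new seg for [0x2000, 0x3000)
         *   right alloc  [0x2000, 0x3000) -> 1 new seg for [0x1000, 0x2000)
         * Free segment [0x1000, 0x4000):
         *   middle alloc [0x2000, 0x3000) -> 2 new segs for the remainders
         *                                    [0x1000, 0x2000) and [0x3000, 0x4000)
         */
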
 655  655  
 656  656  /*
 657  657   * Returns 1 if we are populating, 0 otherwise.
 658  658   * Call it if we want to prevent recursion from HAT.
 659  659   */
 660  660  int
 661  661  vmem_is_populator()
 662  662  {
 663  663          return (mutex_owner(&vmem_sleep_lock) == curthread ||
 664  664              mutex_owner(&vmem_nosleep_lock) == curthread ||
 665  665              mutex_owner(&vmem_pushpage_lock) == curthread ||
 666  666              mutex_owner(&vmem_panic_lock) == curthread);
 667  667  }
 668  668  
 669  669  /*
 670  670   * Populate vmp's segfree list with VMEM_MINFREE vmem_seg_t structures.
 671  671   */
 672  672  static int
 673  673  vmem_populate(vmem_t *vmp, int vmflag)
 674  674  {
 675  675          char *p;
 676  676          vmem_seg_t *vsp;
 677  677          ssize_t nseg;
 678  678          size_t size;
 679  679          kmutex_t *lp;
 680  680          int i;
 681  681  
 682  682          while (vmp->vm_nsegfree < VMEM_MINFREE &&
 683  683              (vsp = vmem_getseg_global()) != NULL)
 684  684                  vmem_putseg(vmp, vsp);
 685  685  
 686  686          if (vmp->vm_nsegfree >= VMEM_MINFREE)
 687  687                  return (1);
 688  688  
 689  689          /*
 690  690           * If we're already populating, tap the reserve.
 691  691           */
 692  692          if (vmem_is_populator()) {
 693  693                  ASSERT(vmp->vm_cflags & VMC_POPULATOR);
 694  694                  return (1);
 695  695          }
 696  696  
 697  697          mutex_exit(&vmp->vm_lock);
 698  698  
 699  699          if (panic_thread == curthread)
 700  700                  lp = &vmem_panic_lock;
 701  701          else if (vmflag & VM_NOSLEEP)
 702  702                  lp = &vmem_nosleep_lock;
 703  703          else if (vmflag & VM_PUSHPAGE)
 704  704                  lp = &vmem_pushpage_lock;
 705  705          else
 706  706                  lp = &vmem_sleep_lock;
 707  707  
 708  708          mutex_enter(lp);
 709  709  
 710  710          nseg = VMEM_MINFREE + vmem_populators * VMEM_POPULATE_RESERVE;
 711  711          size = P2ROUNDUP(nseg * vmem_seg_size, vmem_seg_arena->vm_quantum);
 712  712          nseg = size / vmem_seg_size;
 713  713  
 714  714          /*
 715  715           * The following vmem_alloc() may need to populate vmem_seg_arena
 716  716           * and all the things it imports from.  When doing so, it will tap
 717  717           * each arena's reserve to prevent recursion (see the block comment
 718  718           * above the definition of VMEM_POPULATE_RESERVE).
 719  719           */
 720  720          p = vmem_alloc(vmem_seg_arena, size, vmflag & VM_KMFLAGS);
 721  721          if (p == NULL) {
 722  722                  mutex_exit(lp);
 723  723                  mutex_enter(&vmp->vm_lock);
 724  724                  vmp->vm_kstat.vk_populate_fail.value.ui64++;
 725  725                  return (0);
 726  726          }
 727  727  
 728  728          /*
 729  729           * Restock the arenas that may have been depleted during population.
 730  730           */
 731  731          for (i = 0; i < vmem_populators; i++) {
 732  732                  mutex_enter(&vmem_populator[i]->vm_lock);
 733  733                  while (vmem_populator[i]->vm_nsegfree < VMEM_POPULATE_RESERVE)
 734  734                          vmem_putseg(vmem_populator[i],
 735  735                              (vmem_seg_t *)(p + --nseg * vmem_seg_size));
 736  736                  mutex_exit(&vmem_populator[i]->vm_lock);
 737  737          }
 738  738  
 739  739          mutex_exit(lp);
 740  740          mutex_enter(&vmp->vm_lock);
 741  741  
 742  742          /*
 743  743           * Now take our own segments.
 744  744           */
 745  745          ASSERT(nseg >= VMEM_MINFREE);
 746  746          while (vmp->vm_nsegfree < VMEM_MINFREE)
 747  747                  vmem_putseg(vmp, (vmem_seg_t *)(p + --nseg * vmem_seg_size));
 748  748  
 749  749          /*
 750  750           * Give the remainder to charity.
 751  751           */
 752  752          while (nseg > 0)
 753  753                  vmem_putseg_global((vmem_seg_t *)(p + --nseg * vmem_seg_size));
 754  754  
 755  755          return (1);
 756  756  }
 757  757  
 758  758  /*
 759  759   * Advance a walker from its previous position to 'afterme'.
 760  760   * Note: may drop and reacquire vmp->vm_lock.
 761  761   */
 762  762  static void
 763  763  vmem_advance(vmem_t *vmp, vmem_seg_t *walker, vmem_seg_t *afterme)
 764  764  {
 765  765          vmem_seg_t *vprev = walker->vs_aprev;
 766  766          vmem_seg_t *vnext = walker->vs_anext;
 767  767          vmem_seg_t *vsp = NULL;
 768  768  
 769  769          VMEM_DELETE(walker, a);
 770  770  
 771  771          if (afterme != NULL)
 772  772                  VMEM_INSERT(afterme, walker, a);
 773  773  
 774  774          /*
 775  775           * The walker segment's presence may have prevented its neighbors
 776  776           * from coalescing.  If so, coalesce them now.
 777  777           */
 778  778          if (vprev->vs_type == VMEM_FREE) {
 779  779                  if (vnext->vs_type == VMEM_FREE) {
 780  780                          ASSERT(vprev->vs_end == vnext->vs_start);
 781  781                          vmem_freelist_delete(vmp, vnext);
 782  782                          vmem_freelist_delete(vmp, vprev);
 783  783                          vprev->vs_end = vnext->vs_end;
 784  784                          vmem_freelist_insert(vmp, vprev);
 785  785                          vmem_seg_destroy(vmp, vnext);
 786  786                  }
 787  787                  vsp = vprev;
 788  788          } else if (vnext->vs_type == VMEM_FREE) {
 789  789                  vsp = vnext;
 790  790          }
 791  791  
 792  792          /*
 793  793           * vsp could represent a complete imported span,
 794  794           * in which case we must return it to the source.
 795  795           */
 796  796          if (vsp != NULL && vsp->vs_aprev->vs_import &&
 797  797              vmp->vm_source_free != NULL &&
 798  798              vsp->vs_aprev->vs_type == VMEM_SPAN &&
 799  799              vsp->vs_anext->vs_type == VMEM_SPAN) {
 800  800                  void *vaddr = (void *)vsp->vs_start;
 801  801                  size_t size = VS_SIZE(vsp);
 802  802                  ASSERT(size == VS_SIZE(vsp->vs_aprev));
 803  803                  vmem_freelist_delete(vmp, vsp);
 804  804                  vmem_span_destroy(vmp, vsp);
 805  805                  mutex_exit(&vmp->vm_lock);
 806  806                  vmp->vm_source_free(vmp->vm_source, vaddr, size);
 807  807                  mutex_enter(&vmp->vm_lock);
 808  808          }
 809  809  }
 810  810  
 811  811  /*
 812  812   * VM_NEXTFIT allocations deliberately cycle through all virtual addresses
 813  813   * in an arena, so that we avoid reusing addresses for as long as possible.
  814  814   * This helps to catch use-after-free bugs.  It's also the perfect policy
 815  815   * for allocating things like process IDs, where we want to cycle through
 816  816   * all values in order.
 817  817   */
 818  818  static void *
 819  819  vmem_nextfit_alloc(vmem_t *vmp, size_t size, int vmflag)
 820  820  {
 821  821          vmem_seg_t *vsp, *rotor;
 822  822          uintptr_t addr;
 823  823          size_t realsize = P2ROUNDUP(size, vmp->vm_quantum);
 824  824          size_t vs_size;
 825  825  
 826  826          mutex_enter(&vmp->vm_lock);
 827  827  
 828  828          if (vmp->vm_nsegfree < VMEM_MINFREE && !vmem_populate(vmp, vmflag)) {
 829  829                  mutex_exit(&vmp->vm_lock);
 830  830                  return (NULL);
 831  831          }
 832  832  
 833  833          /*
 834  834           * The common case is that the segment right after the rotor is free,
 835  835           * and large enough that extracting 'size' bytes won't change which
 836  836           * freelist it's on.  In this case we can avoid a *lot* of work.
 837  837           * Instead of the normal vmem_seg_alloc(), we just advance the start
 838  838           * address of the victim segment.  Instead of moving the rotor, we
 839  839           * create the new segment structure *behind the rotor*, which has
 840  840           * the same effect.  And finally, we know we don't have to coalesce
 841  841           * the rotor's neighbors because the new segment lies between them.
 842  842           */
 843  843          rotor = &vmp->vm_rotor;
 844  844          vsp = rotor->vs_anext;
 845  845          if (vsp->vs_type == VMEM_FREE && (vs_size = VS_SIZE(vsp)) > realsize &&
 846  846              P2SAMEHIGHBIT(vs_size, vs_size - realsize)) {
 847  847                  ASSERT(highbit(vs_size) == highbit(vs_size - realsize));
 848  848                  addr = vsp->vs_start;
 849  849                  vsp->vs_start = addr + realsize;
 850  850                  vmem_hash_insert(vmp,
 851  851                      vmem_seg_create(vmp, rotor->vs_aprev, addr, addr + size));
 852  852                  mutex_exit(&vmp->vm_lock);
 853  853                  return ((void *)addr);
 854  854          }
 855  855  
 856  856          /*
 857  857           * Starting at the rotor, look for a segment large enough to
 858  858           * satisfy the allocation.
 859  859           */
 860  860          for (;;) {
 861  861                  vmp->vm_kstat.vk_search.value.ui64++;
 862  862                  if (vsp->vs_type == VMEM_FREE && VS_SIZE(vsp) >= size)
 863  863                          break;
 864  864                  vsp = vsp->vs_anext;
 865  865                  if (vsp == rotor) {
 866  866                          /*
  867  867                           * We've come full circle.  One possibility is that
 868  868                           * there's actually enough space, but the rotor itself
 869  869                           * is preventing the allocation from succeeding because
 870  870                           * it's sitting between two free segments.  Therefore,
 871  871                           * we advance the rotor and see if that liberates a
 872  872                           * suitable segment.
 873  873                           */
 874  874                          vmem_advance(vmp, rotor, rotor->vs_anext);
 875  875                          vsp = rotor->vs_aprev;
 876  876                          if (vsp->vs_type == VMEM_FREE && VS_SIZE(vsp) >= size)
 877  877                                  break;
 878  878                          /*
 879  879                           * If there's a lower arena we can import from, or it's
 880  880                           * a VM_NOSLEEP allocation, let vmem_xalloc() handle it.
 881  881                           * Otherwise, wait until another thread frees something.
 882  882                           */
 883  883                          if (vmp->vm_source_alloc != NULL ||
 884  884                              (vmflag & VM_NOSLEEP)) {
 885  885                                  mutex_exit(&vmp->vm_lock);
 886  886                                  return (vmem_xalloc(vmp, size, vmp->vm_quantum,
 887  887                                      0, 0, NULL, NULL, vmflag & VM_KMFLAGS));
 888  888                          }
 889  889                          vmp->vm_kstat.vk_wait.value.ui64++;
 890  890                          cv_wait(&vmp->vm_cv, &vmp->vm_lock);
 891  891                          vsp = rotor->vs_anext;
 892  892                  }
 893  893          }
 894  894  
 895  895          /*
 896  896           * We found a segment.  Extract enough space to satisfy the allocation.
 897  897           */
 898  898          addr = vsp->vs_start;
 899  899          vsp = vmem_seg_alloc(vmp, vsp, addr, size);
 900  900          ASSERT(vsp->vs_type == VMEM_ALLOC &&
 901  901              vsp->vs_start == addr && vsp->vs_end == addr + size);
 902  902  
 903  903          /*
 904  904           * Advance the rotor to right after the newly-allocated segment.
 905  905           * That's where the next VM_NEXTFIT allocation will begin searching.
 906  906           */
 907  907          vmem_advance(vmp, rotor, vsp);
 908  908          mutex_exit(&vmp->vm_lock);
 909  909          return ((void *)addr);
 910  910  }
 911  911  
 912  912  /*
 913  913   * Checks if vmp is guaranteed to have a size-byte buffer somewhere on its
 914  914   * freelist.  If size is not a power-of-2, it can return a false-negative.
 915  915   *
 916  916   * Used to decide if a newly imported span is superfluous after re-acquiring
 917  917   * the arena lock.
 918  918   */
 919  919  static int
 920  920  vmem_canalloc(vmem_t *vmp, size_t size)
 921  921  {
 922  922          int hb;
 923  923          int flist = 0;
 924  924          ASSERT(MUTEX_HELD(&vmp->vm_lock));
 925  925  
 926  926          if (ISP2(size))
 927  927                  flist = lowbit(P2ALIGN(vmp->vm_freemap, size));
 928  928          else if ((hb = highbit(size)) < VMEM_FREELISTS)
 929  929                  flist = lowbit(P2ALIGN(vmp->vm_freemap, 1UL << hb));
 930  930  
 931  931          return (flist);
 932  932  }
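
A worked example of the bitmap test above, with a made-up freemap:

        /*
         * Suppose vm_freemap == (1UL << 9): only the 512..1023-byte
         * freelist is non-empty, and it happens to hold a 1000-byte
         * free segment.
         *
         * size == 512 (a power of two):
         *     flist = lowbit(P2ALIGN(vm_freemap, 512)) == 10, non-zero,
         *     so vmem_canalloc() correctly reports success.
         *
         * size == 1000: hb == highbit(1000) == 10, so the mask is
         *     1UL << 10; P2ALIGN(vm_freemap, 1024) == 0 and the function
         *     returns 0, a false negative, since the 1000-byte segment
         *     on the smaller list would in fact fit.
         */
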
 933  933  
 934  934  /*
 935  935   * Allocate size bytes at offset phase from an align boundary such that the
 936  936   * resulting segment [addr, addr + size) is a subset of [minaddr, maxaddr)
 937  937   * that does not straddle a nocross-aligned boundary.
 938  938   */
 939  939  void *
 940  940  vmem_xalloc(vmem_t *vmp, size_t size, size_t align_arg, size_t phase,
 941  941      size_t nocross, void *minaddr, void *maxaddr, int vmflag)
 942  942  {
 943  943          vmem_seg_t *vsp;
 944  944          vmem_seg_t *vbest = NULL;
 945  945          uintptr_t addr, taddr, start, end;
 946  946          uintptr_t align = (align_arg != 0) ? align_arg : vmp->vm_quantum;
 947  947          void *vaddr, *xvaddr = NULL;
 948  948          size_t xsize;
 949  949          int hb, flist, resv;
 950  950          uint32_t mtbf;
 951  951  
 952  952          if ((align | phase | nocross) & (vmp->vm_quantum - 1))
 953  953                  panic("vmem_xalloc(%p, %lu, %lu, %lu, %lu, %p, %p, %x): "
 954  954                      "parameters not vm_quantum aligned",
 955  955                      (void *)vmp, size, align_arg, phase, nocross,
 956  956                      minaddr, maxaddr, vmflag);
 957  957  
 958  958          if (nocross != 0 &&
 959  959              (align > nocross || P2ROUNDUP(phase + size, align) > nocross))
 960  960                  panic("vmem_xalloc(%p, %lu, %lu, %lu, %lu, %p, %p, %x): "
 961  961                      "overconstrained allocation",
 962  962                      (void *)vmp, size, align_arg, phase, nocross,
 963  963                      minaddr, maxaddr, vmflag);
 964  964  
 965  965          if (phase >= align || !ISP2(align) || !ISP2(nocross))
 966  966                  panic("vmem_xalloc(%p, %lu, %lu, %lu, %lu, %p, %p, %x): "
 967  967                      "parameters inconsistent or invalid",
 968  968                      (void *)vmp, size, align_arg, phase, nocross,
 969  969                      minaddr, maxaddr, vmflag);
 970  970  
 971  971          if ((mtbf = vmem_mtbf | vmp->vm_mtbf) != 0 && gethrtime() % mtbf == 0 &&
 972  972              (vmflag & (VM_NOSLEEP | VM_PANIC)) == VM_NOSLEEP)
 973  973                  return (NULL);
 974  974  
 975  975          mutex_enter(&vmp->vm_lock);
 976  976          for (;;) {
 977  977                  if (vmp->vm_nsegfree < VMEM_MINFREE &&
 978  978                      !vmem_populate(vmp, vmflag))
 979  979                          break;
 980  980  do_alloc:
 981  981                  /*
 982  982                   * highbit() returns the highest bit + 1, which is exactly
 983  983                   * what we want: we want to search the first freelist whose
 984  984                   * members are *definitely* large enough to satisfy our
 985  985                   * allocation.  However, there are certain cases in which we
 986  986                   * want to look at the next-smallest freelist (which *might*
 987  987                   * be able to satisfy the allocation):
 988  988                   *
 989  989                   * (1)  The size is exactly a power of 2, in which case
 990  990                   *      the smaller freelist is always big enough;
 991  991                   *
 992  992                   * (2)  All other freelists are empty;
 993  993                   *
 994  994                   * (3)  We're in the highest possible freelist, which is
 995  995                   *      always empty (e.g. the 4GB freelist on 32-bit systems);
 996  996                   *
 997  997                   * (4)  We're doing a best-fit or first-fit allocation.
 998  998                   */
 999  999                  if (ISP2(size)) {
1000 1000                          flist = lowbit(P2ALIGN(vmp->vm_freemap, size));
1001 1001                  } else {
1002 1002                          hb = highbit(size);
1003 1003                          if ((vmp->vm_freemap >> hb) == 0 ||
1004 1004                              hb == VMEM_FREELISTS ||
1005 1005                              (vmflag & (VM_BESTFIT | VM_FIRSTFIT)))
1006 1006                                  hb--;
1007 1007                          flist = lowbit(P2ALIGN(vmp->vm_freemap, 1UL << hb));
1008 1008                  }
1009 1009  
1010 1010                  for (vbest = NULL, vsp = (flist == 0) ? NULL :
1011 1011                      vmp->vm_freelist[flist - 1].vs_knext;
1012 1012                      vsp != NULL; vsp = vsp->vs_knext) {
1013 1013                          vmp->vm_kstat.vk_search.value.ui64++;
1014 1014                          if (vsp->vs_start == 0) {
1015 1015                                  /*
1016 1016                                   * We're moving up to a larger freelist,
1017 1017                                   * so if we've already found a candidate,
1018 1018                                   * the fit can't possibly get any better.
1019 1019                                   */
1020 1020                                  if (vbest != NULL)
1021 1021                                          break;
1022 1022                                  /*
1023 1023                                   * Find the next non-empty freelist.
1024 1024                                   */
1025 1025                                  flist = lowbit(P2ALIGN(vmp->vm_freemap,
1026 1026                                      VS_SIZE(vsp)));
1027 1027                                  if (flist-- == 0)
1028 1028                                          break;
1029 1029                                  vsp = (vmem_seg_t *)&vmp->vm_freelist[flist];
1030 1030                                  ASSERT(vsp->vs_knext->vs_type == VMEM_FREE);
1031 1031                                  continue;
1032 1032                          }
1033 1033                          if (vsp->vs_end - 1 < (uintptr_t)minaddr)
1034 1034                                  continue;
1035 1035                          if (vsp->vs_start > (uintptr_t)maxaddr - 1)
1036 1036                                  continue;
1037 1037                          start = MAX(vsp->vs_start, (uintptr_t)minaddr);
1038 1038                          end = MIN(vsp->vs_end - 1, (uintptr_t)maxaddr - 1) + 1;
1039 1039                          taddr = P2PHASEUP(start, align, phase);
1040 1040                          if (P2BOUNDARY(taddr, size, nocross))
1041 1041                                  taddr +=
1042 1042                                      P2ROUNDUP(P2NPHASE(taddr, nocross), align);
1043 1043                          if ((taddr - start) + size > end - start ||
1044 1044                              (vbest != NULL && VS_SIZE(vsp) >= VS_SIZE(vbest)))
1045 1045                                  continue;
1046 1046                          vbest = vsp;
1047 1047                          addr = taddr;
1048 1048                          if (!(vmflag & VM_BESTFIT) || VS_SIZE(vbest) == size)
1049 1049                                  break;
1050 1050                  }
1051 1051                  if (vbest != NULL)
1052 1052                          break;
1053 1053                  ASSERT(xvaddr == NULL);
1054 1054                  if (size == 0)
1055 1055                          panic("vmem_xalloc(): size == 0");
1056 1056                  if (vmp->vm_source_alloc != NULL && nocross == 0 &&
1057 1057                      minaddr == NULL && maxaddr == NULL) {
1058 1058                          size_t aneeded, asize;
1059 1059                          size_t aquantum = MAX(vmp->vm_quantum,
1060 1060                              vmp->vm_source->vm_quantum);
1061 1061                          size_t aphase = phase;
1062 1062                          if ((align > aquantum) &&
1063 1063                              !(vmp->vm_cflags & VMC_XALIGN)) {
1064 1064                                  aphase = (P2PHASE(phase, aquantum) != 0) ?
1065 1065                                      align - vmp->vm_quantum : align - aquantum;
1066 1066                                  ASSERT(aphase >= phase);
1067 1067                          }
1068 1068                          aneeded = MAX(size + aphase, vmp->vm_min_import);
1069 1069                          asize = P2ROUNDUP(aneeded, aquantum);
1070 1070  
1071 1071                          if (asize < size) {
1072 1072                                  /*
1073 1073                                   * The rounding induced overflow; return NULL
1074 1074                                   * if we are permitted to fail the allocation
1075 1075                                   * (and explicitly panic if we aren't).
1076 1076                                   */
1077 1077                                  if ((vmflag & VM_NOSLEEP) &&
1078 1078                                      !(vmflag & VM_PANIC)) {
1079 1079                                          mutex_exit(&vmp->vm_lock);
1080 1080                                          return (NULL);
1081 1081                                  }
1082 1082  
1083 1083                                  panic("vmem_xalloc(): size overflow");
1084 1084                          }
1085 1085  
1086 1086                          /*
1087 1087                           * Determine how many segment structures we'll consume.
1088 1088                           * The calculation must be precise because if we're
1089 1089                           * here on behalf of vmem_populate(), we are taking
1090 1090                           * segments from a very limited reserve.
1091 1091                           */
1092 1092                          if (size == asize && !(vmp->vm_cflags & VMC_XALLOC))
1093 1093                                  resv = VMEM_SEGS_PER_SPAN_CREATE +
1094 1094                                      VMEM_SEGS_PER_EXACT_ALLOC;
1095 1095                          else if (phase == 0 &&
1096 1096                              align <= vmp->vm_source->vm_quantum)
1097 1097                                  resv = VMEM_SEGS_PER_SPAN_CREATE +
1098 1098                                      VMEM_SEGS_PER_LEFT_ALLOC;
1099 1099                          else
1100 1100                                  resv = VMEM_SEGS_PER_ALLOC_MAX;
1101 1101  
1102 1102                          ASSERT(vmp->vm_nsegfree >= resv);
1103 1103                          vmp->vm_nsegfree -= resv;       /* reserve our segs */
1104 1104                          mutex_exit(&vmp->vm_lock);
1105 1105                          if (vmp->vm_cflags & VMC_XALLOC) {
1106 1106                                  size_t oasize = asize;
1107 1107                                  vaddr = ((vmem_ximport_t *)
1108 1108                                      vmp->vm_source_alloc)(vmp->vm_source,
1109 1109                                      &asize, align, vmflag & VM_KMFLAGS);
1110 1110                                  ASSERT(asize >= oasize);
1111 1111                                  ASSERT(P2PHASE(asize,
1112 1112                                      vmp->vm_source->vm_quantum) == 0);
1113 1113                                  ASSERT(!(vmp->vm_cflags & VMC_XALIGN) ||
1114 1114                                      IS_P2ALIGNED(vaddr, align));
1115 1115                          } else {
1116 1116                                  vaddr = vmp->vm_source_alloc(vmp->vm_source,
1117 1117                                      asize, vmflag & VM_KMFLAGS);
1118 1118                          }
1119 1119                          mutex_enter(&vmp->vm_lock);
1120 1120                          vmp->vm_nsegfree += resv;       /* claim reservation */
1121 1121                          aneeded = size + align - vmp->vm_quantum;
1122 1122                          aneeded = P2ROUNDUP(aneeded, vmp->vm_quantum);
1123 1123                          if (vaddr != NULL) {
1124 1124                                  /*
1125 1125                                   * Since we dropped the vmem lock while
1126 1126                                   * calling the import function, other
1127 1127                                   * threads could have imported space
1128 1128                                   * and made our import unnecessary.  In
1129 1129                                   * order to save space, we return
1130 1130                                   * excess imports immediately.
1131 1131                                   */
1132 1132                                  if (asize > aneeded &&
1133 1133                                      vmp->vm_source_free != NULL &&
1134 1134                                      vmem_canalloc(vmp, aneeded)) {
1135 1135                                          ASSERT(resv >=
1136 1136                                              VMEM_SEGS_PER_MIDDLE_ALLOC);
1137 1137                                          xvaddr = vaddr;
1138 1138                                          xsize = asize;
1139 1139                                          goto do_alloc;
1140 1140                                  }
1141 1141                                  vbest = vmem_span_create(vmp, vaddr, asize, 1);
1142 1142                                  addr = P2PHASEUP(vbest->vs_start, align, phase);
1143 1143                                  break;
1144 1144                          } else if (vmem_canalloc(vmp, aneeded)) {
1145 1145                                  /*
1146 1146                                   * Our import failed, but another thread
1147 1147                                   * added sufficient free memory to the arena
1148 1148                                   * to satisfy our request.  Go back and
1149 1149                                   * grab it.
1150 1150                                   */
1151 1151                                  ASSERT(resv >= VMEM_SEGS_PER_MIDDLE_ALLOC);
1152 1152                                  goto do_alloc;
1153 1153                          }
1154 1154                  }
1155 1155  
1156 1156                  /*
1157 1157                   * If the requestor chooses to fail the allocation attempt
1158 1158                   * rather than reap, wait, and retry, get out of the loop.
1159 1159                   */
1160 1160                  if (vmflag & VM_ABORT)
1161 1161                          break;
1162 1162                  mutex_exit(&vmp->vm_lock);
1163 1163                  if (vmp->vm_cflags & VMC_IDENTIFIER)
1164 1164                          kmem_reap_idspace();
1165 1165                  else
1166 1166                          kmem_reap();
1167 1167                  mutex_enter(&vmp->vm_lock);
1168 1168                  if (vmflag & VM_NOSLEEP)
1169 1169                          break;
1170 1170                  vmp->vm_kstat.vk_wait.value.ui64++;
1171 1171                  cv_wait(&vmp->vm_cv, &vmp->vm_lock);
1172 1172          }
1173 1173          if (vbest != NULL) {
1174 1174                  ASSERT(vbest->vs_type == VMEM_FREE);
1175 1175                  ASSERT(vbest->vs_knext != vbest);
1176 1176                  /* re-position to end of buffer */
1177 1177                  if (vmflag & VM_ENDALLOC) {
1178 1178                          addr += ((vbest->vs_end - (addr + size)) / align) *
1179 1179                              align;
1180 1180                  }
1181 1181                  (void) vmem_seg_alloc(vmp, vbest, addr, size);
1182 1182                  mutex_exit(&vmp->vm_lock);
1183 1183                  if (xvaddr)
1184 1184                          vmp->vm_source_free(vmp->vm_source, xvaddr, xsize);
1185 1185                  ASSERT(P2PHASE(addr, align) == phase);
1186 1186                  ASSERT(!P2BOUNDARY(addr, size, nocross));
1187 1187                  ASSERT(addr >= (uintptr_t)minaddr);
1188 1188                  ASSERT(addr + size - 1 <= (uintptr_t)maxaddr - 1);
1189 1189                  return ((void *)addr);
1190 1190          }
1191 1191          vmp->vm_kstat.vk_fail.value.ui64++;
1192 1192          mutex_exit(&vmp->vm_lock);
1193 1193          if (vmflag & VM_PANIC)
1194 1194                  panic("vmem_xalloc(%p, %lu, %lu, %lu, %lu, %p, %p, %x): "
1195 1195                      "cannot satisfy mandatory allocation",
1196 1196                      (void *)vmp, size, align_arg, phase, nocross,
1197 1197                      minaddr, maxaddr, vmflag);
1198 1198          ASSERT(xvaddr == NULL);
1199 1199          return (NULL);
1200 1200  }
1201 1201  
1202 1202  /*
1203 1203   * Free the segment [vaddr, vaddr + size), where vaddr was a constrained
1204 1204   * allocation.  vmem_xalloc() and vmem_xfree() must always be paired because
1205 1205   * both routines bypass the quantum caches.
1206 1206   */
1207 1207  void
1208 1208  vmem_xfree(vmem_t *vmp, void *vaddr, size_t size)
1209 1209  {
1210 1210          vmem_seg_t *vsp, *vnext, *vprev;
1211 1211  
1212 1212          mutex_enter(&vmp->vm_lock);
1213 1213  
1214 1214          vsp = vmem_hash_delete(vmp, (uintptr_t)vaddr, size);
1215 1215          vsp->vs_end = P2ROUNDUP(vsp->vs_end, vmp->vm_quantum);
1216 1216  
1217 1217          /*
1218 1218           * Attempt to coalesce with the next segment.
1219 1219           */
1220 1220          vnext = vsp->vs_anext;
1221 1221          if (vnext->vs_type == VMEM_FREE) {
1222 1222                  ASSERT(vsp->vs_end == vnext->vs_start);
1223 1223                  vmem_freelist_delete(vmp, vnext);
1224 1224                  vsp->vs_end = vnext->vs_end;
1225 1225                  vmem_seg_destroy(vmp, vnext);
1226 1226          }
1227 1227  
1228 1228          /*
1229 1229           * Attempt to coalesce with the previous segment.
1230 1230           */
1231 1231          vprev = vsp->vs_aprev;
1232 1232          if (vprev->vs_type == VMEM_FREE) {
1233 1233                  ASSERT(vprev->vs_end == vsp->vs_start);
1234 1234                  vmem_freelist_delete(vmp, vprev);
1235 1235                  vprev->vs_end = vsp->vs_end;
1236 1236                  vmem_seg_destroy(vmp, vsp);
1237 1237                  vsp = vprev;
1238 1238          }
1239 1239  
1240 1240          /*
1241 1241           * If the entire span is free, return it to the source.
1242 1242           */
1243 1243          if (vsp->vs_aprev->vs_import && vmp->vm_source_free != NULL &&
1244 1244              vsp->vs_aprev->vs_type == VMEM_SPAN &&
1245 1245              vsp->vs_anext->vs_type == VMEM_SPAN) {
1246 1246                  vaddr = (void *)vsp->vs_start;
1247 1247                  size = VS_SIZE(vsp);
1248 1248                  ASSERT(size == VS_SIZE(vsp->vs_aprev));
1249 1249                  vmem_span_destroy(vmp, vsp);
1250 1250                  mutex_exit(&vmp->vm_lock);
1251 1251                  vmp->vm_source_free(vmp->vm_source, vaddr, size);
1252 1252          } else {
1253 1253                  vmem_freelist_insert(vmp, vsp);
1254 1254                  mutex_exit(&vmp->vm_lock);
1255 1255          }
1256 1256  }
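
Because vmem_xalloc() and vmem_xfree() both bypass the quantum caches, a constrained allocation must always be released with vmem_xfree(), never vmem_free(). A minimal usage sketch, not part of this change; the arena pointer, the 8K size, and the 64K alignment are placeholder assumptions:

    #include <sys/vmem.h>

    static void
    example_constrained_alloc(vmem_t *arena)
    {
            /* 8K allocation aligned to 64K; no phase, nocross, or range limits */
            void *buf = vmem_xalloc(arena, 8192, 65536, 0, 0,
                NULL, NULL, VM_NOSLEEP | VM_BESTFIT);

            if (buf != NULL) {
                    /* ... use buf ... */
                    vmem_xfree(arena, buf, 8192);   /* must pair with xalloc */
            }
    }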
1257 1257  
1258 1258  /*
1259 1259   * Allocate size bytes from arena vmp.  Returns the allocated address
1260 1260   * on success, NULL on failure.  vmflag specifies VM_SLEEP or VM_NOSLEEP,
1261 1261   * and may also specify best-fit, first-fit, or next-fit allocation policy
1262 1262   * instead of the default instant-fit policy.  VM_SLEEP allocations are
1263 1263   * guaranteed to succeed.
1264 1264   */
1265 1265  void *
1266 1266  vmem_alloc(vmem_t *vmp, size_t size, int vmflag)
1267 1267  {
1268 1268          vmem_seg_t *vsp;
1269 1269          uintptr_t addr;
1270 1270          int hb;
1271 1271          int flist = 0;
1272 1272          uint32_t mtbf;
1273 1273  
1274 1274          if (size - 1 < vmp->vm_qcache_max)
1275 1275                  return (kmem_cache_alloc(vmp->vm_qcache[(size - 1) >>
1276 1276                      vmp->vm_qshift], vmflag & VM_KMFLAGS));
1277 1277  
1278 1278          if ((mtbf = vmem_mtbf | vmp->vm_mtbf) != 0 && gethrtime() % mtbf == 0 &&
1279 1279              (vmflag & (VM_NOSLEEP | VM_PANIC)) == VM_NOSLEEP)
1280 1280                  return (NULL);
1281 1281  
1282 1282          if (vmflag & VM_NEXTFIT)
1283 1283                  return (vmem_nextfit_alloc(vmp, size, vmflag));
1284 1284  
1285 1285          if (vmflag & (VM_BESTFIT | VM_FIRSTFIT))
1286 1286                  return (vmem_xalloc(vmp, size, vmp->vm_quantum, 0, 0,
1287 1287                      NULL, NULL, vmflag));
1288 1288  
1289 1289          /*
1290 1290           * Unconstrained instant-fit allocation from the segment list.
1291 1291           */
1292 1292          mutex_enter(&vmp->vm_lock);
1293 1293  
1294 1294          if (vmp->vm_nsegfree >= VMEM_MINFREE || vmem_populate(vmp, vmflag)) {
1295 1295                  if (ISP2(size))
1296 1296                          flist = lowbit(P2ALIGN(vmp->vm_freemap, size));
1297 1297                  else if ((hb = highbit(size)) < VMEM_FREELISTS)
1298 1298                          flist = lowbit(P2ALIGN(vmp->vm_freemap, 1UL << hb));
1299 1299          }
1300 1300  
1301 1301          if (flist-- == 0) {
1302 1302                  mutex_exit(&vmp->vm_lock);
1303 1303                  return (vmem_xalloc(vmp, size, vmp->vm_quantum,
1304 1304                      0, 0, NULL, NULL, vmflag));
1305 1305          }
1306 1306  
1307 1307          ASSERT(size <= (1UL << flist));
1308 1308          vsp = vmp->vm_freelist[flist].vs_knext;
1309 1309          addr = vsp->vs_start;
1310 1310          if (vmflag & VM_ENDALLOC) {
1311 1311                  addr += vsp->vs_end - (addr + size);
1312 1312          }
1313 1313          (void) vmem_seg_alloc(vmp, vsp, addr, size);
1314 1314          mutex_exit(&vmp->vm_lock);
1315 1315          return ((void *)addr);
1316 1316  }
1317 1317  
1318 1318  /*
1319 1319   * Free the segment [vaddr, vaddr + size).
1320 1320   */
1321 1321  void
1322 1322  vmem_free(vmem_t *vmp, void *vaddr, size_t size)
1323 1323  {
1324 1324          if (size - 1 < vmp->vm_qcache_max)
1325 1325                  kmem_cache_free(vmp->vm_qcache[(size - 1) >> vmp->vm_qshift],
1326 1326                      vaddr);
1327 1327          else
1328 1328                  vmem_xfree(vmp, vaddr, size);
1329 1329  }
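
A hedged sketch of the basic vmem_alloc()/vmem_free() pairing described above; the arena pointer and sizes are placeholders, and the size passed to vmem_free() must match the size that was allocated:

    #include <sys/vmem.h>

    static void
    example_alloc_free(vmem_t *arena)
    {
            /* VM_SLEEP allocations block until they can succeed */
            void *p = vmem_alloc(arena, 4096, VM_SLEEP);

            /* VM_NOSLEEP allocations may return NULL under memory pressure */
            void *q = vmem_alloc(arena, 4096, VM_NOSLEEP);

            vmem_free(arena, p, 4096);      /* size must match the allocation */
            if (q != NULL)
                    vmem_free(arena, q, 4096);
    }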
1330 1330  
1331 1331  /*
1332 1332   * Determine whether arena vmp contains the segment [vaddr, vaddr + size).
1333 1333   */
1334 1334  int
1335 1335  vmem_contains(vmem_t *vmp, void *vaddr, size_t size)
1336 1336  {
1337 1337          uintptr_t start = (uintptr_t)vaddr;
1338 1338          uintptr_t end = start + size;
1339 1339          vmem_seg_t *vsp;
1340 1340          vmem_seg_t *seg0 = &vmp->vm_seg0;
1341 1341  
1342 1342          mutex_enter(&vmp->vm_lock);
1343 1343          vmp->vm_kstat.vk_contains.value.ui64++;
1344 1344          for (vsp = seg0->vs_knext; vsp != seg0; vsp = vsp->vs_knext) {
1345 1345                  vmp->vm_kstat.vk_contains_search.value.ui64++;
1346 1346                  ASSERT(vsp->vs_type == VMEM_SPAN);
1347 1347                  if (start >= vsp->vs_start && end - 1 <= vsp->vs_end - 1)
1348 1348                          break;
1349 1349          }
1350 1350          mutex_exit(&vmp->vm_lock);
1351 1351          return (vsp != seg0);
1352 1352  }
1353 1353  
1354 1354  /*
1355 1355   * Add the span [vaddr, vaddr + size) to arena vmp.
1356 1356   */
1357 1357  void *
1358 1358  vmem_add(vmem_t *vmp, void *vaddr, size_t size, int vmflag)
1359 1359  {
1360 1360          if (vaddr == NULL || size == 0)
1361 1361                  panic("vmem_add(%p, %p, %lu): bad arguments",
1362 1362                      (void *)vmp, vaddr, size);
1363 1363  
1364 1364          ASSERT(!vmem_contains(vmp, vaddr, size));
1365 1365  
1366 1366          mutex_enter(&vmp->vm_lock);
1367 1367          if (vmem_populate(vmp, vmflag))
1368 1368                  (void) vmem_span_create(vmp, vaddr, size, 0);
1369 1369          else
1370 1370                  vaddr = NULL;
1371 1371          mutex_exit(&vmp->vm_lock);
1372 1372          return (vaddr);
1373 1373  }
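
A brief sketch of donating a span to an arena with vmem_add() (illustrative only; the region base and length are arbitrary placeholders). A NULL return means the arena could not populate its segment metadata, so the span was not added:

    #include <sys/vmem.h>

    static int
    example_add_span(vmem_t *arena, void *region, size_t len)
    {
            if (vmem_add(arena, region, len, VM_NOSLEEP) == NULL)
                    return (-1);    /* segment metadata allocation failed */
            return (0);
    }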
1374 1374  
1375 1375  /*
1376 1376   * Walk the vmp arena, applying func to each segment matching typemask.
1377 1377   * If VMEM_REENTRANT is specified, the arena lock is dropped across each
1378 1378   * call to func(); otherwise, it is held for the duration of vmem_walk()
1379 1379   * to ensure a consistent snapshot.  Note that VMEM_REENTRANT callbacks
1380 1380   * are *not* necessarily consistent, so they may only be used when a hint
1381 1381   * is adequate.
1382 1382   */
1383 1383  void
1384 1384  vmem_walk(vmem_t *vmp, int typemask,
1385 1385      void (*func)(void *, void *, size_t), void *arg)
1386 1386  {
1387 1387          vmem_seg_t *vsp;
1388 1388          vmem_seg_t *seg0 = &vmp->vm_seg0;
1389 1389          vmem_seg_t walker;
1390 1390  
1391 1391          if (typemask & VMEM_WALKER)
1392 1392                  return;
1393 1393  
1394 1394          bzero(&walker, sizeof (walker));
1395 1395          walker.vs_type = VMEM_WALKER;
1396 1396  
1397 1397          mutex_enter(&vmp->vm_lock);
1398 1398          VMEM_INSERT(seg0, &walker, a);
1399 1399          for (vsp = seg0->vs_anext; vsp != seg0; vsp = vsp->vs_anext) {
1400 1400                  if (vsp->vs_type & typemask) {
1401 1401                          void *start = (void *)vsp->vs_start;
1402 1402                          size_t size = VS_SIZE(vsp);
1403 1403                          if (typemask & VMEM_REENTRANT) {
1404 1404                                  vmem_advance(vmp, &walker, vsp);
1405 1405                                  mutex_exit(&vmp->vm_lock);
1406 1406                                  func(arg, start, size);
1407 1407                                  mutex_enter(&vmp->vm_lock);
1408 1408                                  vsp = &walker;
1409 1409                          } else {
1410 1410                                  func(arg, start, size);
1411 1411                          }
1412 1412                  }
1413 1413          }
1414 1414          vmem_advance(vmp, &walker, NULL);
1415 1415          mutex_exit(&vmp->vm_lock);
1416 1416  }
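
As a sketch of the walker interface (illustrative, not part of this change): without VMEM_REENTRANT the arena lock is held across each callback, so the callback should be quick and must not call back into the same arena.

    #include <sys/vmem.h>

    /* Accumulate the size of every matching segment into *arg. */
    static void
    count_allocated(void *arg, void *start, size_t size)
    {
            *(size_t *)arg += size;
    }

    static size_t
    example_walk_allocated(vmem_t *arena)
    {
            size_t total = 0;

            vmem_walk(arena, VMEM_ALLOC, count_allocated, &total);
            return (total);
    }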
1417 1417  
1418 1418  /*
1419 1419   * Return the total amount of memory whose type matches typemask.  Thus:
1420 1420   *
1421 1421   *      typemask VMEM_ALLOC yields total memory allocated (in use).
1422 1422   *      typemask VMEM_FREE yields total memory free (available).
1423 1423   *      typemask (VMEM_ALLOC | VMEM_FREE) yields total arena size.
1424 1424   */
1425 1425  size_t
1426 1426  vmem_size(vmem_t *vmp, int typemask)
1427 1427  {
1428 1428          uint64_t size = 0;
1429 1429  
1430 1430          if (typemask & VMEM_ALLOC)
1431 1431                  size += vmp->vm_kstat.vk_mem_inuse.value.ui64;
1432 1432          if (typemask & VMEM_FREE)
1433 1433                  size += vmp->vm_kstat.vk_mem_total.value.ui64 -
1434 1434                      vmp->vm_kstat.vk_mem_inuse.value.ui64;
1435 1435          return ((size_t)size);
1436 1436  }
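
The typemask semantics make simple utilization metrics a one-liner; a hedged sketch:

    #include <sys/vmem.h>

    /* Percentage of the arena currently in use (0 for an empty arena). */
    static uint64_t
    example_utilization_pct(vmem_t *arena)
    {
            size_t inuse = vmem_size(arena, VMEM_ALLOC);
            size_t total = vmem_size(arena, VMEM_ALLOC | VMEM_FREE);

            return (total == 0 ? 0 : (uint64_t)inuse * 100 / total);
    }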
1437 1437  
1438 1438  /*
1439 1439   * Create an arena called name whose initial span is [base, base + size).
1440 1440   * The arena's natural unit of currency is quantum, so vmem_alloc()
1441 1441   * guarantees quantum-aligned results.  The arena may import new spans
1442 1442   * by invoking afunc() on source, and may return those spans by invoking
1443 1443   * ffunc() on source.  To make small allocations fast and scalable,
1444 1444   * the arena offers high-performance caching for each integer multiple
1445 1445   * of quantum up to qcache_max.
1446 1446   */
1447 1447  static vmem_t *
1448 1448  vmem_create_common(const char *name, void *base, size_t size, size_t quantum,
1449 1449      void *(*afunc)(vmem_t *, size_t, int),
1450 1450      void (*ffunc)(vmem_t *, void *, size_t),
1451 1451      vmem_t *source, size_t qcache_max, int vmflag)
1452 1452  {
1453 1453          int i;
1454 1454          size_t nqcache;
1455 1455          vmem_t *vmp, *cur, **vmpp;
1456 1456          vmem_seg_t *vsp;
1457 1457          vmem_freelist_t *vfp;
1458 1458          uint32_t id = atomic_inc_32_nv(&vmem_id);
1459 1459  
1460 1460          if (vmem_vmem_arena != NULL) {
1461 1461                  vmp = vmem_alloc(vmem_vmem_arena, sizeof (vmem_t),
1462 1462                      vmflag & VM_KMFLAGS);
1463 1463          } else {
1464 1464                  ASSERT(id <= VMEM_INITIAL);
1465 1465                  vmp = &vmem0[id - 1];
1466 1466          }
1467 1467  
1468 1468          /* An identifier arena must inherit from another identifier arena */
1469 1469          ASSERT(source == NULL || ((source->vm_cflags & VMC_IDENTIFIER) ==
1470 1470              (vmflag & VMC_IDENTIFIER)));
1471 1471  
1472 1472          if (vmp == NULL)
1473 1473                  return (NULL);
1474 1474          bzero(vmp, sizeof (vmem_t));
1475 1475  
1476 1476          (void) snprintf(vmp->vm_name, VMEM_NAMELEN, "%s", name);
1477 1477          mutex_init(&vmp->vm_lock, NULL, MUTEX_DEFAULT, NULL);
1478 1478          cv_init(&vmp->vm_cv, NULL, CV_DEFAULT, NULL);
1479 1479          vmp->vm_cflags = vmflag;
1480 1480          vmflag &= VM_KMFLAGS;
1481 1481  
1482 1482          vmp->vm_quantum = quantum;
1483 1483          vmp->vm_qshift = highbit(quantum) - 1;
1484 1484          nqcache = MIN(qcache_max >> vmp->vm_qshift, VMEM_NQCACHE_MAX);
1485 1485  
1486 1486          for (i = 0; i <= VMEM_FREELISTS; i++) {
1487 1487                  vfp = &vmp->vm_freelist[i];
1488 1488                  vfp->vs_end = 1UL << i;
1489 1489                  vfp->vs_knext = (vmem_seg_t *)(vfp + 1);
1490 1490                  vfp->vs_kprev = (vmem_seg_t *)(vfp - 1);
1491 1491          }
1492 1492  
1493 1493          vmp->vm_freelist[0].vs_kprev = NULL;
1494 1494          vmp->vm_freelist[VMEM_FREELISTS].vs_knext = NULL;
1495 1495          vmp->vm_freelist[VMEM_FREELISTS].vs_end = 0;
1496 1496          vmp->vm_hash_table = vmp->vm_hash0;
1497 1497          vmp->vm_hash_mask = VMEM_HASH_INITIAL - 1;
1498 1498          vmp->vm_hash_shift = highbit(vmp->vm_hash_mask);
1499 1499  
1500 1500          vsp = &vmp->vm_seg0;
1501 1501          vsp->vs_anext = vsp;
1502 1502          vsp->vs_aprev = vsp;
1503 1503          vsp->vs_knext = vsp;
1504 1504          vsp->vs_kprev = vsp;
1505 1505          vsp->vs_type = VMEM_SPAN;
1506 1506  
1507 1507          vsp = &vmp->vm_rotor;
1508 1508          vsp->vs_type = VMEM_ROTOR;
1509 1509          VMEM_INSERT(&vmp->vm_seg0, vsp, a);
1510 1510  
1511 1511          bcopy(&vmem_kstat_template, &vmp->vm_kstat, sizeof (vmem_kstat_t));
1512 1512  
1513 1513          vmp->vm_id = id;
1514 1514          if (source != NULL)
1515 1515                  vmp->vm_kstat.vk_source_id.value.ui32 = source->vm_id;
1516 1516          vmp->vm_source = source;
1517 1517          vmp->vm_source_alloc = afunc;
1518 1518          vmp->vm_source_free = ffunc;
1519 1519  
1520 1520          /*
1521 1521           * Some arenas (like vmem_metadata and kmem_metadata) cannot
1522 1522           * use quantum caching to lower fragmentation.  Instead, we
1523 1523           * increase their imports, giving a similar effect.
1524 1524           */
1525 1525          if (vmp->vm_cflags & VMC_NO_QCACHE) {
1526 1526                  vmp->vm_min_import =
1527 1527                      VMEM_QCACHE_SLABSIZE(nqcache << vmp->vm_qshift);
1528 1528                  nqcache = 0;
1529 1529          }
1530 1530  
1531 1531          if (nqcache != 0) {
1532 1532                  ASSERT(!(vmflag & VM_NOSLEEP));
1533 1533                  vmp->vm_qcache_max = nqcache << vmp->vm_qshift;
1534 1534                  for (i = 0; i < nqcache; i++) {
1535 1535                          char buf[VMEM_NAMELEN + 21];
1536 1536                          (void) sprintf(buf, "%s_%lu", vmp->vm_name,
1537 1537                              (i + 1) * quantum);
1538 1538                          vmp->vm_qcache[i] = kmem_cache_create(buf,
1539 1539                              (i + 1) * quantum, quantum, NULL, NULL, NULL,
1540 1540                              NULL, vmp, KMC_QCACHE | KMC_NOTOUCH);
1541 1541                  }
1542 1542          }
1543 1543  
1544 1544          if ((vmp->vm_ksp = kstat_create("vmem", vmp->vm_id, vmp->vm_name,
1545 1545              "vmem", KSTAT_TYPE_NAMED, sizeof (vmem_kstat_t) /
1546 1546              sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL)) != NULL) {
1547 1547                  vmp->vm_ksp->ks_data = &vmp->vm_kstat;
1548 1548                  kstat_install(vmp->vm_ksp);
1549 1549          }
1550 1550  
1551 1551          mutex_enter(&vmem_list_lock);
1552 1552          vmpp = &vmem_list;
1553 1553          while ((cur = *vmpp) != NULL)
1554 1554                  vmpp = &cur->vm_next;
1555 1555          *vmpp = vmp;
1556 1556          mutex_exit(&vmem_list_lock);
1557 1557  
1558 1558          if (vmp->vm_cflags & VMC_POPULATOR) {
1559 1559                  ASSERT(vmem_populators < VMEM_INITIAL);
1560 1560                  vmem_populator[atomic_inc_32_nv(&vmem_populators) - 1] = vmp;
1561 1561                  mutex_enter(&vmp->vm_lock);
1562 1562                  (void) vmem_populate(vmp, vmflag | VM_PANIC);
1563 1563                  mutex_exit(&vmp->vm_lock);
1564 1564          }
1565 1565  
1566 1566          if ((base || size) && vmem_add(vmp, base, size, vmflag) == NULL) {
1567 1567                  vmem_destroy(vmp);
1568 1568                  return (NULL);
1569 1569          }
1570 1570  
1571 1571          return (vmp);
1572 1572  }
1573 1573  
1574 1574  vmem_t *
1575 1575  vmem_xcreate(const char *name, void *base, size_t size, size_t quantum,
1576 1576      vmem_ximport_t *afunc, vmem_free_t *ffunc, vmem_t *source,
1577 1577      size_t qcache_max, int vmflag)
1578 1578  {
1579 1579          ASSERT(!(vmflag & (VMC_POPULATOR | VMC_XALLOC)));
1580 1580          vmflag &= ~(VMC_POPULATOR | VMC_XALLOC);
1581 1581  
1582 1582          return (vmem_create_common(name, base, size, quantum,
1583 1583              (vmem_alloc_t *)afunc, ffunc, source, qcache_max,
1584 1584              vmflag | VMC_XALLOC));
1585 1585  }
1586 1586  
1587 1587  vmem_t *
1588 1588  vmem_create(const char *name, void *base, size_t size, size_t quantum,
1589 1589      vmem_alloc_t *afunc, vmem_free_t *ffunc, vmem_t *source,
1590 1590      size_t qcache_max, int vmflag)
1591 1591  {
1592 1592          ASSERT(!(vmflag & (VMC_XALLOC | VMC_XALIGN)));
1593 1593          vmflag &= ~(VMC_XALLOC | VMC_XALIGN);
1594 1594  
1595 1595          return (vmem_create_common(name, base, size, quantum,
1596 1596              afunc, ffunc, source, qcache_max, vmflag));
1597 1597  }
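
A sketch of a typical vmem_create() call, modeled on the vmem_metadata arena created in vmem_init() below; the arena name and source arena are placeholders. The arena starts with no span, imports from its source on demand, and enables quantum caching for allocations up to eight quanta:

    #include <sys/vmem.h>

    static vmem_t *
    example_create_subarena(vmem_t *source, size_t quantum)
    {
            return (vmem_create("example_arena",
                NULL, 0,                    /* no initial span */
                quantum,                    /* allocation unit and alignment */
                vmem_alloc, vmem_free,      /* import from / return to source */
                source,                     /* source arena */
                8 * quantum,                /* qcache_max */
                VM_SLEEP));
    }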
1598 1598  
1599 1599  /*
1600 1600   * Destroy arena vmp.
1601 1601   */
1602 1602  void
1603 1603  vmem_destroy(vmem_t *vmp)
1604 1604  {
1605 1605          vmem_t *cur, **vmpp;
1606 1606          vmem_seg_t *seg0 = &vmp->vm_seg0;
1607 1607          vmem_seg_t *vsp, *anext;
1608 1608          size_t leaked;
1609 1609          int i;
1610 1610  
1611 1611          mutex_enter(&vmem_list_lock);
1612 1612          vmpp = &vmem_list;
1613 1613          while ((cur = *vmpp) != vmp)
1614 1614                  vmpp = &cur->vm_next;
1615 1615          *vmpp = vmp->vm_next;
1616 1616          mutex_exit(&vmem_list_lock);
1617 1617  
1618 1618          for (i = 0; i < VMEM_NQCACHE_MAX; i++)
1619 1619                  if (vmp->vm_qcache[i])
1620 1620                          kmem_cache_destroy(vmp->vm_qcache[i]);
1621 1621  
1622 1622          leaked = vmem_size(vmp, VMEM_ALLOC);
1623 1623          if (leaked != 0)
1624 1624                  cmn_err(CE_WARN, "vmem_destroy('%s'): leaked %lu %s",
1625 1625                      vmp->vm_name, leaked, (vmp->vm_cflags & VMC_IDENTIFIER) ?
1626 1626                      "identifiers" : "bytes");
1627 1627  
1628 1628          if (vmp->vm_hash_table != vmp->vm_hash0)
1629 1629                  vmem_free(vmem_hash_arena, vmp->vm_hash_table,
1630 1630                      (vmp->vm_hash_mask + 1) * sizeof (void *));
1631 1631  
1632 1632          /*
1633 1633           * Give back the segment structures for anything that's left in the
1634 1634           * arena, e.g. the primary spans and their free segments.
1635 1635           */
1636 1636          VMEM_DELETE(&vmp->vm_rotor, a);
1637 1637          for (vsp = seg0->vs_anext; vsp != seg0; vsp = anext) {
1638 1638                  anext = vsp->vs_anext;
1639 1639                  vmem_putseg_global(vsp);
1640 1640          }
1641 1641  
1642 1642          while (vmp->vm_nsegfree > 0)
1643 1643                  vmem_putseg_global(vmem_getseg(vmp));
1644 1644  
1645 1645          kstat_delete(vmp->vm_ksp);
1646 1646  
1647 1647          mutex_destroy(&vmp->vm_lock);
1648 1648          cv_destroy(&vmp->vm_cv);
1649 1649          vmem_free(vmem_vmem_arena, vmp, sizeof (vmem_t));
1650 1650  }
1651 1651  
1652 1652  /*
1653 1653   * Only shrink vmem hashtable if it is 1<<vmem_rescale_minshift times (8x)
1654 1654   * larger than necessary.
1655 1655   */
1656 1656  int vmem_rescale_minshift = 3;
1657 1657  
1658 1658  /*
1659 1659   * Resize vmp's hash table to keep the average lookup depth near 1.0.
1660 1660   */
1661 1661  static void
1662 1662  vmem_hash_rescale(vmem_t *vmp)
1663 1663  {
1664 1664          vmem_seg_t **old_table, **new_table, *vsp;
1665 1665          size_t old_size, new_size, h, nseg;
1666 1666  
1667 1667          nseg = (size_t)(vmp->vm_kstat.vk_alloc.value.ui64 -
1668 1668              vmp->vm_kstat.vk_free.value.ui64);
1669 1669  
1670 1670          new_size = MAX(VMEM_HASH_INITIAL, 1 << (highbit(3 * nseg + 4) - 2));
1671 1671          old_size = vmp->vm_hash_mask + 1;
1672 1672  
1673 1673          if ((old_size >> vmem_rescale_minshift) <= new_size &&
1674 1674              new_size <= (old_size << 1))
1675 1675                  return;
1676 1676  
1677 1677          new_table = vmem_alloc(vmem_hash_arena, new_size * sizeof (void *),
1678 1678              VM_NOSLEEP);
1679 1679          if (new_table == NULL)
1680 1680                  return;
1681 1681          bzero(new_table, new_size * sizeof (void *));
1682 1682  
1683 1683          mutex_enter(&vmp->vm_lock);
1684 1684  
1685 1685          old_size = vmp->vm_hash_mask + 1;
1686 1686          old_table = vmp->vm_hash_table;
1687 1687  
1688 1688          vmp->vm_hash_mask = new_size - 1;
1689 1689          vmp->vm_hash_table = new_table;
1690 1690          vmp->vm_hash_shift = highbit(vmp->vm_hash_mask);
1691 1691  
1692 1692          for (h = 0; h < old_size; h++) {
1693 1693                  vsp = old_table[h];
1694 1694                  while (vsp != NULL) {
1695 1695                          uintptr_t addr = vsp->vs_start;
1696 1696                          vmem_seg_t *next_vsp = vsp->vs_knext;
1697 1697                          vmem_seg_t **hash_bucket = VMEM_HASH(vmp, addr);
1698 1698                          vsp->vs_knext = *hash_bucket;
1699 1699                          *hash_bucket = vsp;
1700 1700                          vsp = next_vsp;
1701 1701                  }
1702 1702          }
1703 1703  
1704 1704          mutex_exit(&vmp->vm_lock);
1705 1705  
1706 1706          if (old_table != vmp->vm_hash0)
1707 1707                  vmem_free(vmem_hash_arena, old_table,
1708 1708                      old_size * sizeof (void *));
1709 1709  }
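
As a worked example of the sizing rule above: with roughly 10000 live segments, highbit(3 * 10000 + 4) is 15, so the target is 1 << 13 = 8192 buckets, an average chain length of about 1.2. A 1024-bucket table would be grown, because 8192 exceeds twice its size; a 4096-bucket table would be left alone, because 8192 neither exceeds twice its size nor falls below one eighth of it.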
1710 1710  
1711 1711  /*
1712 1712   * Perform periodic maintenance on all vmem arenas.
1713 1713   */
1714 1714  void
1715 1715  vmem_update(void *dummy)
1716 1716  {
1717 1717          vmem_t *vmp;
1718 1718  
1719 1719          mutex_enter(&vmem_list_lock);
1720 1720          for (vmp = vmem_list; vmp != NULL; vmp = vmp->vm_next) {
1721 1721                  /*
1722 1722                   * If threads are waiting for resources, wake them up
1723 1723                   * periodically so they can issue another kmem_reap()
1724 1724                   * to reclaim resources cached by the slab allocator.
1725 1725                   */
1726 1726                  cv_broadcast(&vmp->vm_cv);
1727 1727  
1728 1728                  /*
1729 1729                   * Rescale the hash table to keep the hash chains short.
1730 1730                   */
1731 1731                  vmem_hash_rescale(vmp);
1732 1732          }
1733 1733          mutex_exit(&vmem_list_lock);
1734 1734  
1735 1735          (void) timeout(vmem_update, dummy, vmem_update_interval * hz);
1736 1736  }
1737 1737  
  
1738 1738  void
1739 1739  vmem_qcache_reap(vmem_t *vmp)
1740 1740  {
1741 1741          int i;
1742 1742  
1743 1743          /*
1744 1744           * Reap any quantum caches that may be part of this vmem.
1745 1745           */
1746 1746          for (i = 0; i < VMEM_NQCACHE_MAX; i++)
1747 1747                  if (vmp->vm_qcache[i])
1748      -                        kmem_cache_reap_now(vmp->vm_qcache[i]);
     1748 +                        kmem_cache_reap_soon(vmp->vm_qcache[i]);
1749 1749  }
1750 1750  
1751 1751  /*
1752 1752   * Prepare vmem for use.
1753 1753   */
1754 1754  vmem_t *
1755 1755  vmem_init(const char *heap_name,
1756 1756      void *heap_start, size_t heap_size, size_t heap_quantum,
1757 1757      void *(*heap_alloc)(vmem_t *, size_t, int),
1758 1758      void (*heap_free)(vmem_t *, void *, size_t))
1759 1759  {
1760 1760          uint32_t id;
1761 1761          int nseg = VMEM_SEG_INITIAL;
1762 1762          vmem_t *heap;
1763 1763  
1764 1764          while (--nseg >= 0)
1765 1765                  vmem_putseg_global(&vmem_seg0[nseg]);
1766 1766  
1767 1767          heap = vmem_create(heap_name,
1768 1768              heap_start, heap_size, heap_quantum,
1769 1769              NULL, NULL, NULL, 0,
1770 1770              VM_SLEEP | VMC_POPULATOR);
1771 1771  
1772 1772          vmem_metadata_arena = vmem_create("vmem_metadata",
1773 1773              NULL, 0, heap_quantum,
1774 1774              vmem_alloc, vmem_free, heap, 8 * heap_quantum,
1775 1775              VM_SLEEP | VMC_POPULATOR | VMC_NO_QCACHE);
1776 1776  
1777 1777          vmem_seg_arena = vmem_create("vmem_seg",
1778 1778              NULL, 0, heap_quantum,
1779 1779              heap_alloc, heap_free, vmem_metadata_arena, 0,
1780 1780              VM_SLEEP | VMC_POPULATOR);
1781 1781  
1782 1782          vmem_hash_arena = vmem_create("vmem_hash",
1783 1783              NULL, 0, 8,
1784 1784              heap_alloc, heap_free, vmem_metadata_arena, 0,
1785 1785              VM_SLEEP);
1786 1786  
1787 1787          vmem_vmem_arena = vmem_create("vmem_vmem",
1788 1788              vmem0, sizeof (vmem0), 1,
1789 1789              heap_alloc, heap_free, vmem_metadata_arena, 0,
1790 1790              VM_SLEEP);
1791 1791  
1792 1792          for (id = 0; id < vmem_id; id++)
1793 1793                  (void) vmem_xalloc(vmem_vmem_arena, sizeof (vmem_t),
1794 1794                      1, 0, 0, &vmem0[id], &vmem0[id + 1],
1795 1795                      VM_NOSLEEP | VM_BESTFIT | VM_PANIC);
1796 1796  
1797 1797          return (heap);
1798 1798  }
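
For reference, the arena hierarchy that vmem_init() wires up above, derived from the vmem_create() calls (indentation shows each arena's import source; a sketch only):

    heap
        vmem_metadata   (imports from heap via vmem_alloc/vmem_free)
            vmem_seg    (segment structures)
            vmem_hash   (allocation hash tables)
            vmem_vmem   (the vmem_t structures themselves)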
  