OS-5078 illumos#6514 broke vm_usage and lx proc
OS-2969 vm_getusage syscall accurate zone RSS is overcounting
OS-3088 need a lighterweight page invalidation mechanism for zone memcap
OS-881 To workaround OS-580 add support to only invalidate mappings from a single process
OS-750 improve RUSAGESYS_GETVMUSAGE for zoneadmd
OS-399 zone phys. mem. cap should be a rctl and have associated kstat
    
      
    
    
          --- old/usr/src/uts/common/vm/vm_usage.c
          +++ new/usr/src/uts/common/vm/vm_usage.c
   1    1  /*
   2    2   * CDDL HEADER START
   3    3   *
   4    4   * The contents of this file are subject to the terms of the
   5    5   * Common Development and Distribution License (the "License").
   6    6   * You may not use this file except in compliance with the License.
   7    7   *
   8    8   * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
   9    9   * or http://www.opensolaris.org/os/licensing.
  10   10   * See the License for the specific language governing permissions
  11   11   * and limitations under the License.
  12   12   *
  13   13   * When distributing Covered Code, include this CDDL HEADER in each
  14   14   * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
  15   15   * If applicable, add the following below this CDDL HEADER, with the
  16   16   * fields enclosed by brackets "[]" replaced with your own identifying
  17   17   * information: Portions Copyright [yyyy] [name of copyright owner]
  
  
  18   18   *
  19   19   * CDDL HEADER END
  20   20   */
  21   21  
  22   22  /*
  23   23   * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  24   24   * Use is subject to license terms.
  25   25   */
  26   26  
  27   27  /*
       28 + * Copyright 2016, Joyent, Inc.
       29 + */
       30 +
       31 +/*
  28   32   * vm_usage
  29   33   *
  30   34   * This file implements the getvmusage() private system call.
  31   35   * getvmusage() counts the amount of resident memory pages and swap
  32   36   * reserved by the specified process collective. A "process collective" is
   33   37   * the set of processes owned by a particular zone, project, task, or user.
  34   38   *
  35   39   * rss and swap are counted so that for a given process collective, a page is
  36   40   * only counted once.  For example, this means that if multiple processes in
  37   41   * the same project map the same page, then the project will only be charged
  38   42   * once for that page.  On the other hand, if two processes in different
  39   43   * projects map the same page, then both projects will be charged
  40   44   * for the page.
  41   45   *
  42   46   * The vm_getusage() calculation is implemented so that the first thread
  43   47   * performs the rss/swap counting. Other callers will wait for that thread to
  44   48   * finish, copying the results.  This enables multiple rcapds and prstats to
  45   49   * consume data from the same calculation.  The results are also cached so that
  46   50   * a caller interested in recent results can just copy them instead of starting
   47   51   * a new calculation. The caller passes the maximum age (in seconds) of the
  48   52   * data.  If the cached data is young enough, the cache is copied, otherwise,
  49   53   * a new calculation is executed and the cache is replaced with the new
  50   54   * data.
  51   55   *
  52   56   * The rss calculation for each process collective is as follows:
  53   57   *
  54   58   *   - Inspect flags, determine if counting rss for zones, projects, tasks,
  55   59   *     and/or users.
  56   60   *   - For each proc:
  57   61   *      - Figure out proc's collectives (zone, project, task, and/or user).
  58   62   *      - For each seg in proc's address space:
  59   63   *              - If seg is private:
  60   64   *                      - Lookup anons in the amp.
   61   65   *                      - For incore pages not previously visited for each
   62   66   *                        of the proc's collectives, add the incore pagesize
   63   67   *                        to each collective.
   64   68   *                        Anons with a refcnt of 1 can be assumed to be not
   65   69   *                        previously visited.
  66   70   *                      - For address ranges without anons in the amp:
  67   71   *                              - Lookup pages in underlying vnode.
   68   72   *                              - For incore pages not previously visited for
  69   73   *                                each of the proc's collectives, add incore
  70   74   *                                pagesize to each collective.
  71   75   *              - If seg is shared:
  72   76   *                      - Lookup pages in the shared amp or vnode.
  73   77   *                      - For incore pages not previously visited for each of
  74   78   *                        the proc's collectives, add incore pagesize to each
  75   79   *                        collective.
  76   80   *
  77   81   * Swap is reserved by private segments, and shared anonymous segments.
  78   82   * The only shared anon segments which do not reserve swap are ISM segments
  79   83   * and schedctl segments, both of which can be identified by having
  80   84   * amp->swresv == 0.
  81   85   *
  82   86   * The swap calculation for each collective is as follows:
  83   87   *
  84   88   *   - Inspect flags, determine if counting rss for zones, projects, tasks,
  85   89   *     and/or users.
  86   90   *   - For each proc:
  87   91   *      - Figure out proc's collectives (zone, project, task, and/or user).
  88   92   *      - For each seg in proc's address space:
  89   93   *              - If seg is private:
  90   94   *                      - Add svd->swresv pages to swap count for each of the
  91   95   *                        proc's collectives.
  92   96   *              - If seg is anon, shared, and amp->swresv != 0
  93   97   *                      - For address ranges in amp not previously visited for
  94   98   *                        each of the proc's collectives, add size of address
  95   99   *                        range to the swap count for each collective.
  96  100   *
  97  101   * These two calculations are done simultaneously, with most of the work
  98  102   * being done in vmu_calculate_seg().  The results of the calculation are
  99  103   * copied into "vmu_data.vmu_cache_results".
 100  104   *
 101  105   * To perform the calculation, various things are tracked and cached:
 102  106   *
 103  107   *    - incore/not-incore page ranges for all vnodes.
 104  108   *      (vmu_data.vmu_all_vnodes_hash)
 105  109   *      This eliminates looking up the same page more than once.
 106  110   *
 107  111   *    - incore/not-incore page ranges for all shared amps.
 108  112   *      (vmu_data.vmu_all_amps_hash)
 109  113   *      This eliminates looking up the same page more than once.
 110  114   *
 111  115   *    - visited page ranges for each collective.
 112  116   *         - per vnode (entity->vme_vnode_hash)
 113  117   *         - per shared amp (entity->vme_amp_hash)
 114  118   *      For accurate counting of map-shared and COW-shared pages.
 115  119   *
 116  120   *    - visited private anons (refcnt > 1) for each collective.
 117  121   *      (entity->vme_anon_hash)
 118  122   *      For accurate counting of COW-shared pages.
 119  123   *
 120  124   * The common accounting structure is the vmu_entity_t, which represents
 121  125   * collectives:
 122  126   *
 123  127   *    - A zone.
 124  128   *    - A project, task, or user within a zone.
 125  129   *    - The entire system (vmu_data.vmu_system).
 126  130   *    - Each collapsed (col) project and user.  This means a given projid or
 127  131   *      uid, regardless of which zone the process is in.  For instance,
  128  132   *      project 0 in the global zone and project 0 in a non-global zone are
 129  133   *      the same collapsed project.
 130  134   *
 131  135   *  Each entity structure tracks which pages have been already visited for
 132  136   *  that entity (via previously inspected processes) so that these pages are
 133  137   *  not double counted.
 134  138   */
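
For reference (not part of this diff): the calculation described above is
exposed to userland as the getvmusage(2) system call.  A minimal consumer
sketch, assuming the privileges described in the man page:

    #include <sys/vm_usage.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
            size_t nres = 16;       /* guessed capacity; grow on EOVERFLOW */
            vmusage_t *buf = calloc(nres, sizeof (vmusage_t));

            if (buf == NULL)
                    return (1);
            /*
             * Accept cached results up to 5 seconds old; anything older
             * forces a fresh calculation, per the caching scheme above.
             */
            if (getvmusage(VMUSAGE_ALL_ZONES, 5, buf, &nres) != 0) {
                    perror("getvmusage");
                    return (1);
            }
            for (size_t i = 0; i < nres; i++)
                    (void) printf("zone %d rss %llu swap %llu\n",
                        (int)buf[i].vmu_id,
                        (unsigned long long)buf[i].vmu_rss_all,
                        (unsigned long long)buf[i].vmu_swap_all);
            free(buf);
            return (0);
    }
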
 135  139  
 136  140  #include <sys/errno.h>
 137  141  #include <sys/types.h>
 138  142  #include <sys/zone.h>
 139  143  #include <sys/proc.h>
 140  144  #include <sys/project.h>
 141  145  #include <sys/task.h>
 142  146  #include <sys/thread.h>
 143  147  #include <sys/time.h>
 144  148  #include <sys/mman.h>
 145  149  #include <sys/modhash.h>
 146  150  #include <sys/modhash_impl.h>
 147  151  #include <sys/shm.h>
 148  152  #include <sys/swap.h>
 149  153  #include <sys/synch.h>
 150  154  #include <sys/systm.h>
 151  155  #include <sys/var.h>
 152  156  #include <sys/vm_usage.h>
 153  157  #include <sys/zone.h>
 154  158  #include <sys/sunddi.h>
 155  159  #include <sys/avl.h>
 156  160  #include <vm/anon.h>
 157  161  #include <vm/as.h>
 158  162  #include <vm/seg_vn.h>
 159  163  #include <vm/seg_spt.h>
 160  164  
 161  165  #define VMUSAGE_HASH_SIZE               512
 162  166  
 163  167  #define VMUSAGE_TYPE_VNODE              1
 164  168  #define VMUSAGE_TYPE_AMP                2
 165  169  #define VMUSAGE_TYPE_ANON               3
 166  170  
 167  171  #define VMUSAGE_BOUND_UNKNOWN           0
 168  172  #define VMUSAGE_BOUND_INCORE            1
 169  173  #define VMUSAGE_BOUND_NOT_INCORE        2
 170  174  
 171  175  #define ISWITHIN(node, addr)    ((node)->vmb_start <= addr && \
 172  176                                      (node)->vmb_end >= addr ? 1 : 0)
 173  177  
 174  178  /*
 175  179   * bounds for vnodes and shared amps
 176  180   * Each bound is either entirely incore, entirely not in core, or
  177  181   * entirely unknown.  Bounds are stored in an AVL tree sorted by the start
  178  182   * member when in use; otherwise (free or temporary lists) they're strung
 179  183   * together off of vmb_next.
 180  184   */
 181  185  typedef struct vmu_bound {
 182  186          avl_node_t vmb_node;
 183  187          struct vmu_bound *vmb_next; /* NULL in tree else on free or temp list */
 184  188          pgcnt_t vmb_start;  /* page offset in vnode/amp on which bound starts */
 185  189          pgcnt_t vmb_end;    /* page offset in vnode/amp on which bound ends */
 186  190          char    vmb_type;   /* One of VMUSAGE_BOUND_* */
 187  191  } vmu_bound_t;
 188  192  
 189  193  /*
 190  194   * hash of visited objects (vnodes or shared amps)
 191  195   * key is address of vnode or amp.  Bounds lists known incore/non-incore
 192  196   * bounds for vnode/amp.
 193  197   */
 194  198  typedef struct vmu_object {
 195  199          struct vmu_object       *vmo_next;      /* free list */
 196  200          caddr_t         vmo_key;
 197  201          short           vmo_type;
 198  202          avl_tree_t      vmo_bounds;
 199  203  } vmu_object_t;
 200  204  
 201  205  /*
 202  206   * Entity by which to count results.
 203  207   *
 204  208   * The entity structure keeps the current rss/swap counts for each entity
 205  209   * (zone, project, etc), and hashes of vm structures that have already
 206  210   * been visited for the entity.
 207  211   *
 208  212   * vme_next:    links the list of all entities currently being counted by
 209  213   *              vmu_calculate().
 210  214   *
 211  215   * vme_next_calc: links the list of entities related to the current process
 212  216   *               being counted by vmu_calculate_proc().
 213  217   *
 214  218   * vmu_calculate_proc() walks all processes.  For each process, it makes a
 215  219   * list of the entities related to that process using vme_next_calc.  This
 216  220   * list changes each time vmu_calculate_proc() is called.
 217  221   *
 218  222   */
 219  223  typedef struct vmu_entity {
 220  224          struct vmu_entity *vme_next;
 221  225          struct vmu_entity *vme_next_calc;
 222  226          mod_hash_t      *vme_vnode_hash; /* vnodes visited for entity */
 223  227          mod_hash_t      *vme_amp_hash;   /* shared amps visited for entity */
 224  228          mod_hash_t      *vme_anon_hash;  /* COW anons visited for entity */
 225  229          vmusage_t       vme_result;      /* identifies entity and results */
 226  230  } vmu_entity_t;
 227  231  
 228  232  /*
 229  233   * Hash of entities visited within a zone, and an entity for the zone
 230  234   * itself.
 231  235   */
 232  236  typedef struct vmu_zone {
 233  237          struct vmu_zone *vmz_next;      /* free list */
 234  238          id_t            vmz_id;
 235  239          vmu_entity_t    *vmz_zone;
 236  240          mod_hash_t      *vmz_projects_hash;
 237  241          mod_hash_t      *vmz_tasks_hash;
 238  242          mod_hash_t      *vmz_rusers_hash;
 239  243          mod_hash_t      *vmz_eusers_hash;
 240  244  } vmu_zone_t;
 241  245  
 242  246  /*
 243  247   * Cache of results from last calculation
 244  248   */
 245  249  typedef struct vmu_cache {
 246  250          vmusage_t       *vmc_results;   /* Results from last call to */
 247  251                                          /* vm_getusage(). */
 248  252          uint64_t        vmc_nresults;   /* Count of cached results */
 249  253          uint64_t        vmc_refcnt;     /* refcnt for free */
 250  254          uint_t          vmc_flags;      /* Flags for vm_getusage() */
 251  255          hrtime_t        vmc_timestamp;  /* when cache was created */
 252  256  } vmu_cache_t;
 253  257  
 254  258  /*
 255  259   * top level rss info for the system
 256  260   */
 257  261  typedef struct vmu_data {
 258  262          kmutex_t        vmu_lock;               /* Protects vmu_data */
 259  263          kcondvar_t      vmu_cv;                 /* Used to signal threads */
  260  264                                                  /* waiting for the calc */
  261  265                                                  /* thread to finish */
 262  266          vmu_entity_t    *vmu_system;            /* Entity for tracking */
 263  267                                                  /* rss/swap for all processes */
 264  268                                                  /* in all zones */
 265  269          mod_hash_t      *vmu_zones_hash;        /* Zones visited */
 266  270          mod_hash_t      *vmu_projects_col_hash; /* These *_col_hash hashes */
 267  271          mod_hash_t      *vmu_rusers_col_hash;   /* keep track of entities, */
 268  272          mod_hash_t      *vmu_eusers_col_hash;   /* ignoring zoneid, in order */
 269  273                                                  /* to implement VMUSAGE_COL_* */
 270  274                                                  /* flags, which aggregate by */
 271  275                                                  /* project or user regardless */
 272  276                                                  /* of zoneid. */
 273  277          mod_hash_t      *vmu_all_vnodes_hash;   /* System wide visited vnodes */
 274  278                                                  /* to track incore/not-incore */
 275  279          mod_hash_t      *vmu_all_amps_hash;     /* System wide visited shared */
 276  280                                                  /* amps to track incore/not- */
 277  281                                                  /* incore */
 278  282          vmu_entity_t    *vmu_entities;          /* Linked list of entities */
 279  283          size_t          vmu_nentities;          /* Count of entities in list */
 280  284          vmu_cache_t     *vmu_cache;             /* Cached results */
 281  285          kthread_t       *vmu_calc_thread;       /* NULL, or thread running */
 282  286                                                  /* vmu_calculate() */
  283  287          uint_t          vmu_calc_flags;         /* Flags being used by */
 284  288                                                  /* currently running calc */
 285  289                                                  /* thread */
 286  290          uint_t          vmu_pending_flags;      /* Flags of vm_getusage() */
 287  291                                                  /* threads waiting for */
 288  292                                                  /* calc thread to finish */
 289  293          uint_t          vmu_pending_waiters;    /* Number of threads waiting */
 290  294                                                  /* for calc thread */
 291  295          vmu_bound_t     *vmu_free_bounds;
 292  296          vmu_object_t    *vmu_free_objects;
 293  297          vmu_entity_t    *vmu_free_entities;
 294  298          vmu_zone_t      *vmu_free_zones;
 295  299  } vmu_data_t;
 296  300  
 297  301  extern struct as kas;
 298  302  extern proc_t *practive;
 299  303  extern zone_t *global_zone;
 300  304  extern struct seg_ops segvn_ops;
 301  305  extern struct seg_ops segspt_shmops;
 302  306  
 303  307  static vmu_data_t vmu_data;
 304  308  static kmem_cache_t *vmu_bound_cache;
 305  309  static kmem_cache_t *vmu_object_cache;
 306  310  
 307  311  /*
 308  312   * Comparison routine for AVL tree. We base our comparison on vmb_start.
 309  313   */
 310  314  static int
 311  315  bounds_cmp(const void *bnd1, const void *bnd2)
 312  316  {
 313  317          const vmu_bound_t *bound1 = bnd1;
 314  318          const vmu_bound_t *bound2 = bnd2;
 315  319  
 316  320          if (bound1->vmb_start == bound2->vmb_start) {
 317  321                  return (0);
 318  322          }
 319  323          if (bound1->vmb_start < bound2->vmb_start) {
 320  324                  return (-1);
 321  325          }
 322  326  
 323  327          return (1);
 324  328  }
 325  329  
 326  330  /*
 327  331   * Save a bound on the free list.
 328  332   */
 329  333  static void
 330  334  vmu_free_bound(vmu_bound_t *bound)
 331  335  {
 332  336          bound->vmb_next = vmu_data.vmu_free_bounds;
 333  337          bound->vmb_start = 0;
 334  338          bound->vmb_end = 0;
 335  339          bound->vmb_type = 0;
 336  340          vmu_data.vmu_free_bounds = bound;
 337  341  }
 338  342  
 339  343  /*
 340  344   * Free an object, and all visited bound info.
 341  345   */
 342  346  static void
 343  347  vmu_free_object(mod_hash_val_t val)
 344  348  {
 345  349          vmu_object_t *obj = (vmu_object_t *)val;
 346  350          avl_tree_t *tree = &(obj->vmo_bounds);
 347  351          vmu_bound_t *bound;
 348  352          void *cookie = NULL;
 349  353  
 350  354          while ((bound = avl_destroy_nodes(tree, &cookie)) != NULL)
 351  355                  vmu_free_bound(bound);
 352  356          avl_destroy(tree);
 353  357  
 354  358          obj->vmo_type = 0;
 355  359          obj->vmo_next = vmu_data.vmu_free_objects;
 356  360          vmu_data.vmu_free_objects = obj;
 357  361  }
 358  362  
 359  363  /*
 360  364   * Free an entity, and hashes of visited objects for that entity.
 361  365   */
 362  366  static void
 363  367  vmu_free_entity(mod_hash_val_t val)
 364  368  {
 365  369          vmu_entity_t *entity = (vmu_entity_t *)val;
 366  370  
 367  371          if (entity->vme_vnode_hash != NULL)
 368  372                  i_mod_hash_clear_nosync(entity->vme_vnode_hash);
 369  373          if (entity->vme_amp_hash != NULL)
 370  374                  i_mod_hash_clear_nosync(entity->vme_amp_hash);
 371  375          if (entity->vme_anon_hash != NULL)
 372  376                  i_mod_hash_clear_nosync(entity->vme_anon_hash);
 373  377  
 374  378          entity->vme_next = vmu_data.vmu_free_entities;
 375  379          vmu_data.vmu_free_entities = entity;
 376  380  }
 377  381  
 378  382  /*
 379  383   * Free zone entity, and all hashes of entities inside that zone,
 380  384   * which are projects, tasks, and users.
 381  385   */
 382  386  static void
 383  387  vmu_free_zone(mod_hash_val_t val)
 384  388  {
 385  389          vmu_zone_t *zone = (vmu_zone_t *)val;
 386  390  
 387  391          if (zone->vmz_zone != NULL) {
 388  392                  vmu_free_entity((mod_hash_val_t)zone->vmz_zone);
 389  393                  zone->vmz_zone = NULL;
 390  394          }
 391  395          if (zone->vmz_projects_hash != NULL)
 392  396                  i_mod_hash_clear_nosync(zone->vmz_projects_hash);
 393  397          if (zone->vmz_tasks_hash != NULL)
 394  398                  i_mod_hash_clear_nosync(zone->vmz_tasks_hash);
 395  399          if (zone->vmz_rusers_hash != NULL)
 396  400                  i_mod_hash_clear_nosync(zone->vmz_rusers_hash);
 397  401          if (zone->vmz_eusers_hash != NULL)
 398  402                  i_mod_hash_clear_nosync(zone->vmz_eusers_hash);
 399  403          zone->vmz_next = vmu_data.vmu_free_zones;
 400  404          vmu_data.vmu_free_zones = zone;
 401  405  }
 402  406  
 403  407  /*
 404  408   * Initialize synchronization primitives and hashes for system-wide tracking
 405  409   * of visited vnodes and shared amps.  Initialize results cache.
 406  410   */
 407  411  void
 408  412  vm_usage_init()
 409  413  {
 410  414          mutex_init(&vmu_data.vmu_lock, NULL, MUTEX_DEFAULT, NULL);
 411  415          cv_init(&vmu_data.vmu_cv, NULL, CV_DEFAULT, NULL);
 412  416  
 413  417          vmu_data.vmu_system = NULL;
 414  418          vmu_data.vmu_zones_hash = NULL;
 415  419          vmu_data.vmu_projects_col_hash = NULL;
 416  420          vmu_data.vmu_rusers_col_hash = NULL;
 417  421          vmu_data.vmu_eusers_col_hash = NULL;
 418  422  
 419  423          vmu_data.vmu_free_bounds = NULL;
 420  424          vmu_data.vmu_free_objects = NULL;
 421  425          vmu_data.vmu_free_entities = NULL;
 422  426          vmu_data.vmu_free_zones = NULL;
 423  427  
 424  428          vmu_data.vmu_all_vnodes_hash = mod_hash_create_ptrhash(
 425  429              "vmusage vnode hash", VMUSAGE_HASH_SIZE, vmu_free_object,
 426  430              sizeof (vnode_t));
 427  431          vmu_data.vmu_all_amps_hash = mod_hash_create_ptrhash(
 428  432              "vmusage amp hash", VMUSAGE_HASH_SIZE, vmu_free_object,
 429  433              sizeof (struct anon_map));
 430  434          vmu_data.vmu_projects_col_hash = mod_hash_create_idhash(
 431  435              "vmusage collapsed project hash", VMUSAGE_HASH_SIZE,
 432  436              vmu_free_entity);
 433  437          vmu_data.vmu_rusers_col_hash = mod_hash_create_idhash(
 434  438              "vmusage collapsed ruser hash", VMUSAGE_HASH_SIZE,
 435  439              vmu_free_entity);
 436  440          vmu_data.vmu_eusers_col_hash = mod_hash_create_idhash(
  437  441              "vmusage collapsed euser hash", VMUSAGE_HASH_SIZE,
 438  442              vmu_free_entity);
 439  443          vmu_data.vmu_zones_hash = mod_hash_create_idhash(
 440  444              "vmusage zone hash", VMUSAGE_HASH_SIZE, vmu_free_zone);
 441  445  
 442  446          vmu_bound_cache = kmem_cache_create("vmu_bound_cache",
 443  447              sizeof (vmu_bound_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
 444  448          vmu_object_cache = kmem_cache_create("vmu_object_cache",
 445  449              sizeof (vmu_object_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
 446  450  
 447  451          vmu_data.vmu_entities = NULL;
 448  452          vmu_data.vmu_nentities = 0;
 449  453  
 450  454          vmu_data.vmu_cache = NULL;
 451  455          vmu_data.vmu_calc_thread = NULL;
 452  456          vmu_data.vmu_calc_flags = 0;
 453  457          vmu_data.vmu_pending_flags = 0;
 454  458          vmu_data.vmu_pending_waiters = 0;
 455  459  }
 456  460  
 457  461  /*
 458  462   * Allocate hashes for tracking vm objects visited for an entity.
 459  463   * Update list of entities.
 460  464   */
 461  465  static vmu_entity_t *
 462  466  vmu_alloc_entity(id_t id, int type, id_t zoneid)
 463  467  {
 464  468          vmu_entity_t *entity;
 465  469  
 466  470          if (vmu_data.vmu_free_entities != NULL) {
 467  471                  entity = vmu_data.vmu_free_entities;
 468  472                  vmu_data.vmu_free_entities =
 469  473                      vmu_data.vmu_free_entities->vme_next;
 470  474                  bzero(&entity->vme_result, sizeof (vmusage_t));
 471  475          } else {
 472  476                  entity = kmem_zalloc(sizeof (vmu_entity_t), KM_SLEEP);
 473  477          }
 474  478          entity->vme_result.vmu_id = id;
 475  479          entity->vme_result.vmu_zoneid = zoneid;
 476  480          entity->vme_result.vmu_type = type;
 477  481  
 478  482          if (entity->vme_vnode_hash == NULL)
 479  483                  entity->vme_vnode_hash = mod_hash_create_ptrhash(
 480  484                      "vmusage vnode hash", VMUSAGE_HASH_SIZE, vmu_free_object,
 481  485                      sizeof (vnode_t));
 482  486  
 483  487          if (entity->vme_amp_hash == NULL)
 484  488                  entity->vme_amp_hash = mod_hash_create_ptrhash(
 485  489                      "vmusage amp hash", VMUSAGE_HASH_SIZE, vmu_free_object,
 486  490                      sizeof (struct anon_map));
 487  491  
 488  492          if (entity->vme_anon_hash == NULL)
 489  493                  entity->vme_anon_hash = mod_hash_create_ptrhash(
 490  494                      "vmusage anon hash", VMUSAGE_HASH_SIZE,
 491  495                      mod_hash_null_valdtor, sizeof (struct anon));
 492  496  
 493  497          entity->vme_next = vmu_data.vmu_entities;
 494  498          vmu_data.vmu_entities = entity;
 495  499          vmu_data.vmu_nentities++;
 496  500  
 497  501          return (entity);
 498  502  }
 499  503  
 500  504  /*
 501  505   * Allocate a zone entity, and hashes for tracking visited vm objects
 502  506   * for projects, tasks, and users within that zone.
 503  507   */
 504  508  static vmu_zone_t *
 505  509  vmu_alloc_zone(id_t id)
 506  510  {
 507  511          vmu_zone_t *zone;
 508  512  
 509  513          if (vmu_data.vmu_free_zones != NULL) {
 510  514                  zone = vmu_data.vmu_free_zones;
  
  
 511  515                  vmu_data.vmu_free_zones =
 512  516                      vmu_data.vmu_free_zones->vmz_next;
 513  517                  zone->vmz_next = NULL;
 514  518                  zone->vmz_zone = NULL;
 515  519          } else {
 516  520                  zone = kmem_zalloc(sizeof (vmu_zone_t), KM_SLEEP);
 517  521          }
 518  522  
 519  523          zone->vmz_id = id;
 520  524  
 521      -        if ((vmu_data.vmu_calc_flags & (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES)) != 0)
      525 +        if ((vmu_data.vmu_calc_flags &
      526 +            (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_A_ZONE)) != 0)
 522  527                  zone->vmz_zone = vmu_alloc_entity(id, VMUSAGE_ZONE, id);
 523  528  
 524  529          if ((vmu_data.vmu_calc_flags & (VMUSAGE_PROJECTS |
 525  530              VMUSAGE_ALL_PROJECTS)) != 0 && zone->vmz_projects_hash == NULL)
 526  531                  zone->vmz_projects_hash = mod_hash_create_idhash(
 527  532                      "vmusage project hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
 528  533  
 529  534          if ((vmu_data.vmu_calc_flags & (VMUSAGE_TASKS | VMUSAGE_ALL_TASKS))
 530  535              != 0 && zone->vmz_tasks_hash == NULL)
 531  536                  zone->vmz_tasks_hash = mod_hash_create_idhash(
 532  537                      "vmusage task hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
 533  538  
 534  539          if ((vmu_data.vmu_calc_flags & (VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS))
 535  540              != 0 && zone->vmz_rusers_hash == NULL)
 536  541                  zone->vmz_rusers_hash = mod_hash_create_idhash(
 537  542                      "vmusage ruser hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
 538  543  
 539  544          if ((vmu_data.vmu_calc_flags & (VMUSAGE_EUSERS | VMUSAGE_ALL_EUSERS))
 540  545              != 0 && zone->vmz_eusers_hash == NULL)
 541  546                  zone->vmz_eusers_hash = mod_hash_create_idhash(
 542  547                      "vmusage euser hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
 543  548  
 544  549          return (zone);
 545  550  }
 546  551  
 547  552  /*
 548  553   * Allocate a structure for tracking visited bounds for a vm object.
 549  554   */
 550  555  static vmu_object_t *
 551  556  vmu_alloc_object(caddr_t key, int type)
 552  557  {
 553  558          vmu_object_t *object;
 554  559  
 555  560          if (vmu_data.vmu_free_objects != NULL) {
 556  561                  object = vmu_data.vmu_free_objects;
 557  562                  vmu_data.vmu_free_objects =
 558  563                      vmu_data.vmu_free_objects->vmo_next;
 559  564          } else {
 560  565                  object = kmem_cache_alloc(vmu_object_cache, KM_SLEEP);
 561  566          }
 562  567  
 563  568          object->vmo_next = NULL;
 564  569          object->vmo_key = key;
 565  570          object->vmo_type = type;
 566  571          avl_create(&(object->vmo_bounds), bounds_cmp, sizeof (vmu_bound_t), 0);
 567  572  
 568  573          return (object);
 569  574  }
 570  575  
 571  576  /*
 572  577   * Allocate and return a bound structure.
 573  578   */
 574  579  static vmu_bound_t *
 575  580  vmu_alloc_bound()
 576  581  {
 577  582          vmu_bound_t *bound;
 578  583  
 579  584          if (vmu_data.vmu_free_bounds != NULL) {
 580  585                  bound = vmu_data.vmu_free_bounds;
 581  586                  vmu_data.vmu_free_bounds =
 582  587                      vmu_data.vmu_free_bounds->vmb_next;
 583  588          } else {
 584  589                  bound = kmem_cache_alloc(vmu_bound_cache, KM_SLEEP);
 585  590          }
 586  591  
 587  592          bound->vmb_next = NULL;
 588  593          bound->vmb_start = 0;
 589  594          bound->vmb_end = 0;
 590  595          bound->vmb_type = 0;
 591  596          return (bound);
 592  597  }
 593  598  
 594  599  /*
 595  600   * vmu_find_insert_* functions implement hash lookup or allocate and
 596  601   * insert operations.
 597  602   */
 598  603  static vmu_object_t *
 599  604  vmu_find_insert_object(mod_hash_t *hash, caddr_t key, uint_t type)
 600  605  {
 601  606          int ret;
 602  607          vmu_object_t *object;
 603  608  
 604  609          ret = i_mod_hash_find_nosync(hash, (mod_hash_key_t)key,
 605  610              (mod_hash_val_t *)&object);
 606  611          if (ret != 0) {
 607  612                  object = vmu_alloc_object(key, type);
 608  613                  ret = i_mod_hash_insert_nosync(hash, (mod_hash_key_t)key,
 609  614                      (mod_hash_val_t)object, (mod_hash_hndl_t)0);
 610  615                  ASSERT(ret == 0);
 611  616          }
 612  617          return (object);
 613  618  }
 614  619  
 615  620  static int
 616  621  vmu_find_insert_anon(mod_hash_t *hash, caddr_t key)
 617  622  {
 618  623          int ret;
 619  624          caddr_t val;
 620  625  
 621  626          ret = i_mod_hash_find_nosync(hash, (mod_hash_key_t)key,
 622  627              (mod_hash_val_t *)&val);
 623  628  
 624  629          if (ret == 0)
 625  630                  return (0);
 626  631  
 627  632          ret = i_mod_hash_insert_nosync(hash, (mod_hash_key_t)key,
 628  633              (mod_hash_val_t)key, (mod_hash_hndl_t)0);
 629  634  
 630  635          ASSERT(ret == 0);
 631  636  
 632  637          return (1);
 633  638  }
 634  639  
 635  640  static vmu_entity_t *
 636  641  vmu_find_insert_entity(mod_hash_t *hash, id_t id, uint_t type, id_t zoneid)
 637  642  {
 638  643          int ret;
 639  644          vmu_entity_t *entity;
 640  645  
 641  646          ret = i_mod_hash_find_nosync(hash, (mod_hash_key_t)(uintptr_t)id,
 642  647              (mod_hash_val_t *)&entity);
 643  648          if (ret != 0) {
 644  649                  entity = vmu_alloc_entity(id, type, zoneid);
 645  650                  ret = i_mod_hash_insert_nosync(hash,
 646  651                      (mod_hash_key_t)(uintptr_t)id, (mod_hash_val_t)entity,
 647  652                      (mod_hash_hndl_t)0);
 648  653                  ASSERT(ret == 0);
 649  654          }
 650  655          return (entity);
 651  656  }
 652  657  
 653  658  
 654  659  
 655  660  
 656  661  /*
 657  662   * Returns list of object bounds between start and end.  New bounds inserted
 658  663   * by this call are given type.
 659  664   *
 660  665   * Returns the number of pages covered if new bounds are created.  Returns 0
 661  666   * if region between start/end consists of all existing bounds.
 662  667   */
 663  668  static pgcnt_t
 664  669  vmu_insert_lookup_object_bounds(vmu_object_t *ro, pgcnt_t start, pgcnt_t
 665  670      end, char type, vmu_bound_t **first, vmu_bound_t **last)
 666  671  {
 667  672          avl_tree_t      *tree = &(ro->vmo_bounds);
 668  673          avl_index_t     where;
 669  674          vmu_bound_t     *walker, *tmp;
 670  675          pgcnt_t         ret = 0;
 671  676  
 672  677          ASSERT(start <= end);
 673  678  
 674  679          *first = *last = NULL;
 675  680  
 676  681          tmp = vmu_alloc_bound();
 677  682          tmp->vmb_start = start;
 678  683          tmp->vmb_type = type;
 679  684  
 680  685          /* Hopelessly optimistic case. */
 681  686          if (walker = avl_find(tree, tmp, &where)) {
 682  687                  /* We got lucky. */
 683  688                  vmu_free_bound(tmp);
 684  689                  *first = walker;
 685  690          }
 686  691  
 687  692          if (walker == NULL) {
 688  693                  /* Is start in the previous node? */
 689  694                  walker = avl_nearest(tree, where, AVL_BEFORE);
 690  695                  if (walker != NULL) {
 691  696                          if (ISWITHIN(walker, start)) {
 692  697                                  /* We found start. */
 693  698                                  vmu_free_bound(tmp);
 694  699                                  *first = walker;
 695  700                          }
 696  701                  }
 697  702          }
 698  703  
 699  704          /*
 700  705           * At this point, if *first is still NULL, then we
 701  706           * didn't get a direct hit and start isn't covered
 702  707           * by the previous node. We know that the next node
 703  708           * must have a greater start value than we require
 704  709           * because avl_find tells us where the AVL routines would
 705  710           * insert our new node. We have some gap between the
 706  711           * start we want and the next node.
 707  712           */
 708  713          if (*first == NULL) {
 709  714                  walker = avl_nearest(tree, where, AVL_AFTER);
 710  715                  if (walker != NULL && walker->vmb_start <= end) {
 711  716                          /* Fill the gap. */
 712  717                          tmp->vmb_end = walker->vmb_start - 1;
 713  718                          *first = tmp;
 714  719                  } else {
 715  720                          /* We have a gap over [start, end]. */
 716  721                          tmp->vmb_end = end;
 717  722                          *first = *last = tmp;
 718  723                  }
 719  724                  ret += tmp->vmb_end - tmp->vmb_start + 1;
 720  725                  avl_insert(tree, tmp, where);
 721  726          }
 722  727  
 723  728          ASSERT(*first != NULL);
 724  729  
 725  730          if (*last != NULL) {
 726  731                  /* We're done. */
 727  732                  return (ret);
 728  733          }
 729  734  
 730  735          /*
 731  736           * If we are here we still need to set *last and
 732  737           * that may involve filling in some gaps.
 733  738           */
 734  739          *last = *first;
 735  740          for (;;) {
 736  741                  if (ISWITHIN(*last, end)) {
 737  742                          /* We're done. */
 738  743                          break;
 739  744                  }
 740  745                  walker = AVL_NEXT(tree, *last);
 741  746                  if (walker == NULL || walker->vmb_start > end) {
 742  747                          /* Bottom or mid tree with gap. */
 743  748                          tmp = vmu_alloc_bound();
 744  749                          tmp->vmb_start = (*last)->vmb_end + 1;
 745  750                          tmp->vmb_end = end;
 746  751                          tmp->vmb_type = type;
 747  752                          ret += tmp->vmb_end - tmp->vmb_start + 1;
 748  753                          avl_insert_here(tree, tmp, *last, AVL_AFTER);
 749  754                          *last = tmp;
 750  755                          break;
 751  756                  } else {
 752  757                          if ((*last)->vmb_end + 1 != walker->vmb_start) {
 753  758                                  /* Non-contiguous. */
 754  759                                  tmp = vmu_alloc_bound();
 755  760                                  tmp->vmb_start = (*last)->vmb_end + 1;
 756  761                                  tmp->vmb_end = walker->vmb_start - 1;
 757  762                                  tmp->vmb_type = type;
 758  763                                  ret += tmp->vmb_end - tmp->vmb_start + 1;
 759  764                                  avl_insert_here(tree, tmp, *last, AVL_AFTER);
 760  765                                  *last = tmp;
 761  766                          } else {
 762  767                                  *last = walker;
 763  768                          }
 764  769                  }
 765  770          }
 766  771  
 767  772          return (ret);
 768  773  }
 769  774  
 770  775  /*
 771  776   * vmu_update_bounds()
 772  777   *
 773  778   * tree: avl_tree in which first and last hang.
 774  779   *
  775  780   * first, last: list of contiguous bounds, of which zero or more are of
 776  781   *              type VMUSAGE_BOUND_UNKNOWN.
 777  782   *
 778  783   * new_tree: avl_tree in which new_first and new_last hang.
 779  784   *
  780  785   * new_first, new_last: list of contiguous bounds, of which none are of
 781  786   *                      type VMUSAGE_BOUND_UNKNOWN.  These bounds are used to
 782  787   *                      update the types of bounds in (first,last) with
 783  788   *                      type VMUSAGE_BOUND_UNKNOWN.
 784  789   *
 785  790   * For the list of bounds (first,last), this function updates any bounds
 786  791   * with type VMUSAGE_BOUND_UNKNOWN using the type of the corresponding bound in
 787  792   * the list (new_first, new_last).
 788  793   *
 789  794   * If a bound of type VMUSAGE_BOUND_UNKNOWN spans multiple bounds in the list
 790  795   * (new_first, new_last), it will be split into multiple bounds.
 791  796   *
 792  797   * Return value:
 793  798   *      The number of pages in the list of bounds (first,last) that were of
 794  799   *      type VMUSAGE_BOUND_UNKNOWN, which have been updated to be of type
 795  800   *      VMUSAGE_BOUND_INCORE.
 796  801   *
 797  802   */
 798  803  static pgcnt_t
 799  804  vmu_update_bounds(avl_tree_t *tree, vmu_bound_t **first, vmu_bound_t **last,
 800  805      avl_tree_t *new_tree, vmu_bound_t *new_first, vmu_bound_t *new_last)
 801  806  {
 802  807          vmu_bound_t *next, *new_next, *tmp;
 803  808          pgcnt_t rss = 0;
 804  809  
 805  810          next = *first;
 806  811          new_next = new_first;
 807  812  
 808  813          /*
 809  814           * Verify first and last bound are covered by new bounds if they
 810  815           * have unknown type.
 811  816           */
 812  817          ASSERT((*first)->vmb_type != VMUSAGE_BOUND_UNKNOWN ||
 813  818              (*first)->vmb_start >= new_first->vmb_start);
 814  819          ASSERT((*last)->vmb_type != VMUSAGE_BOUND_UNKNOWN ||
 815  820              (*last)->vmb_end <= new_last->vmb_end);
 816  821          for (;;) {
 817  822                  /* If bound already has type, proceed to next bound. */
 818  823                  if (next->vmb_type != VMUSAGE_BOUND_UNKNOWN) {
 819  824                          if (next == *last)
 820  825                                  break;
 821  826                          next = AVL_NEXT(tree, next);
 822  827                          continue;
 823  828                  }
 824  829                  while (new_next->vmb_end < next->vmb_start)
 825  830                          new_next = AVL_NEXT(new_tree, new_next);
 826  831                  ASSERT(new_next->vmb_type != VMUSAGE_BOUND_UNKNOWN);
 827  832                  next->vmb_type = new_next->vmb_type;
 828  833                  if (new_next->vmb_end < next->vmb_end) {
 829  834                          /* need to split bound */
 830  835                          tmp = vmu_alloc_bound();
 831  836                          tmp->vmb_type = VMUSAGE_BOUND_UNKNOWN;
 832  837                          tmp->vmb_start = new_next->vmb_end + 1;
 833  838                          tmp->vmb_end = next->vmb_end;
 834  839                          avl_insert_here(tree, tmp, next, AVL_AFTER);
 835  840                          next->vmb_end = new_next->vmb_end;
 836  841                          if (*last == next)
 837  842                                  *last = tmp;
 838  843                          if (next->vmb_type == VMUSAGE_BOUND_INCORE)
 839  844                                  rss += next->vmb_end - next->vmb_start + 1;
 840  845                          next = tmp;
 841  846                  } else {
 842  847                          if (next->vmb_type == VMUSAGE_BOUND_INCORE)
 843  848                                  rss += next->vmb_end - next->vmb_start + 1;
 844  849                          if (next == *last)
 845  850                                  break;
 846  851                          next = AVL_NEXT(tree, next);
 847  852                  }
 848  853          }
 849  854          return (rss);
 850  855  }
 851  856  
 852  857  /*
 853  858   * Merges adjacent bounds with same type between first and last bound.
 854  859   * After merge, last pointer may point to a different bound, as (incoming)
 855  860   * last bound may have been merged away.
 856  861   */
 857  862  static void
 858  863  vmu_merge_bounds(avl_tree_t *tree, vmu_bound_t **first, vmu_bound_t **last)
 859  864  {
 860  865          vmu_bound_t *current;
 861  866          vmu_bound_t *next;
 862  867  
 863  868          ASSERT(tree != NULL);
 864  869          ASSERT(*first != NULL);
 865  870          ASSERT(*last != NULL);
 866  871  
 867  872          current = *first;
 868  873          while (current != *last) {
 869  874                  next = AVL_NEXT(tree, current);
 870  875                  if ((current->vmb_end + 1) == next->vmb_start &&
 871  876                      current->vmb_type == next->vmb_type) {
 872  877                          current->vmb_end = next->vmb_end;
 873  878                          avl_remove(tree, next);
 874  879                          vmu_free_bound(next);
 875  880                          if (next == *last) {
 876  881                                  *last = current;
 877  882                          }
 878  883                  } else {
 879  884                          current = AVL_NEXT(tree, current);
 880  885                  }
 881  886          }
 882  887  }
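
Together, vmu_insert_lookup_object_bounds(), vmu_update_bounds() and
vmu_merge_bounds() maintain a set of disjoint page ranges, each entirely of
one type.  A toy userland sketch of just the coalescing step (array-based
for brevity; the kernel code above uses an AVL tree):

    #include <stdio.h>

    #define MAXB 64

    typedef struct { unsigned start, end; char type; } bound_t;

    static bound_t b[MAXB];
    static int nb;

    /* Merge adjacent bounds of equal type, as vmu_merge_bounds() does. */
    static void
    merge(void)
    {
            int i = 0;

            while (i + 1 < nb) {
                    if (b[i].end + 1 == b[i + 1].start &&
                        b[i].type == b[i + 1].type) {
                            b[i].end = b[i + 1].end;
                            for (int j = i + 1; j + 1 < nb; j++)
                                    b[j] = b[j + 1];
                            nb--;
                    } else {
                            i++;
                    }
            }
    }

    int
    main(void)
    {
            /* [0,4] incore, [5,9] incore, [10,12] not incore */
            b[0] = (bound_t){ 0, 4, 'I' };
            b[1] = (bound_t){ 5, 9, 'I' };
            b[2] = (bound_t){ 10, 12, 'N' };
            nb = 3;

            merge();        /* leaves [0,9] 'I' and [10,12] 'N' */
            for (int i = 0; i < nb; i++)
                    (void) printf("[%u, %u] %c\n", b[i].start, b[i].end,
                        b[i].type);
            return (0);
    }
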
 883  888  
 884  889  /*
 885  890   * Given an amp and a list of bounds, updates each bound's type with
 886  891   * VMUSAGE_BOUND_INCORE or VMUSAGE_BOUND_NOT_INCORE.
 887  892   *
 888  893   * If a bound is partially incore, it will be split into two bounds.
 889  894   * first and last may be modified, as bounds may be split into multiple
 890  895   * bounds if they are partially incore/not-incore.
 891  896   *
 892  897   * Set incore to non-zero if bounds are already known to be incore.
 893  898   *
 894  899   */
 895  900  static void
 896  901  vmu_amp_update_incore_bounds(avl_tree_t *tree, struct anon_map *amp,
 897  902      vmu_bound_t **first, vmu_bound_t **last, boolean_t incore)
 898  903  {
 899  904          vmu_bound_t *next;
 900  905          vmu_bound_t *tmp;
 901  906          pgcnt_t index;
 902  907          short bound_type;
 903  908          short page_type;
 904  909          vnode_t *vn;
 905  910          anoff_t off;
 906  911          struct anon *ap;
 907  912  
 908  913          next = *first;
 909  914          /* Shared anon slots don't change once set. */
 910  915          ANON_LOCK_ENTER(&->a_rwlock, RW_READER);
  
  
 911  916          for (;;) {
 912  917                  if (incore == B_TRUE)
 913  918                          next->vmb_type = VMUSAGE_BOUND_INCORE;
 914  919  
 915  920                  if (next->vmb_type != VMUSAGE_BOUND_UNKNOWN) {
 916  921                          if (next == *last)
 917  922                                  break;
 918  923                          next = AVL_NEXT(tree, next);
 919  924                          continue;
 920  925                  }
      926 +
      927 +                ASSERT(next->vmb_type == VMUSAGE_BOUND_UNKNOWN);
 921  928                  bound_type = next->vmb_type;
 922  929                  index = next->vmb_start;
 923  930                  while (index <= next->vmb_end) {
 924  931  
 925  932                          /*
 926  933                           * These are used to determine how much to increment
 927  934                           * index when a large page is found.
 928  935                           */
 929  936                          page_t *page;
 930  937                          pgcnt_t pgcnt = 1;
 931  938                          uint_t pgshft;
 932  939                          pgcnt_t pgmsk;
 933  940  
 934  941                          ap = anon_get_ptr(amp->ahp, index);
 935  942                          if (ap != NULL)
 936  943                                  swap_xlate(ap, &vn, &off);
 937  944  
 938  945                          if (ap != NULL && vn != NULL && vn->v_pages != NULL &&
 939  946                              (page = page_exists(vn, off)) != NULL) {
 940      -                                page_type = VMUSAGE_BOUND_INCORE;
      947 +                                if (PP_ISFREE(page))
      948 +                                        page_type = VMUSAGE_BOUND_NOT_INCORE;
      949 +                                else
      950 +                                        page_type = VMUSAGE_BOUND_INCORE;
 941  951                                  if (page->p_szc > 0) {
 942  952                                          pgcnt = page_get_pagecnt(page->p_szc);
 943  953                                          pgshft = page_get_shift(page->p_szc);
 944  954                                          pgmsk = (0x1 << (pgshft - PAGESHIFT))
 945  955                                              - 1;
 946  956                                  }
 947  957                          } else {
 948  958                                  page_type = VMUSAGE_BOUND_NOT_INCORE;
 949  959                          }
      960 +
 950  961                          if (bound_type == VMUSAGE_BOUND_UNKNOWN) {
 951  962                                  next->vmb_type = page_type;
      963 +                                bound_type = page_type;
 952  964                          } else if (next->vmb_type != page_type) {
 953  965                                  /*
 954  966                                   * If current bound type does not match page
 955  967                                   * type, need to split off new bound.
 956  968                                   */
 957  969                                  tmp = vmu_alloc_bound();
 958  970                                  tmp->vmb_type = page_type;
 959  971                                  tmp->vmb_start = index;
 960  972                                  tmp->vmb_end = next->vmb_end;
 961  973                                  avl_insert_here(tree, tmp, next, AVL_AFTER);
 962  974                                  next->vmb_end = index - 1;
 963  975                                  if (*last == next)
 964  976                                          *last = tmp;
 965  977                                  next = tmp;
 966  978                          }
 967  979                          if (pgcnt > 1) {
 968  980                                  /*
 969  981                                   * If inside large page, jump to next large
 970  982                                   * page
 971  983                                   */
 972  984                                  index = (index & ~pgmsk) + pgcnt;
 973  985                          } else {
 974  986                                  index++;
 975  987                          }
 976  988                  }
 977  989                  if (next == *last) {
 978  990                          ASSERT(next->vmb_type != VMUSAGE_BOUND_UNKNOWN);
 979  991                          break;
 980  992                  } else
 981  993                          next = AVL_NEXT(tree, next);
 982  994          }
 983  995          ANON_LOCK_EXIT(&->a_rwlock);
 984  996  }
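
The large-page skip above rounds the anon index down to its large-page
boundary and advances by that page's constituent page count, so each large
page is inspected only once.  A standalone arithmetic sketch (hypothetical
values: 4K base pages, 512-page large pages):

    #include <stdio.h>

    int
    main(void)
    {
            unsigned long pgcnt = 512;          /* pages per large page */
            unsigned long pgmsk = pgcnt - 1;
            unsigned long index = 1000;         /* within large page [512, 1023] */

            index = (index & ~pgmsk) + pgcnt;   /* jump to next large page */
            (void) printf("%lu\n", index);      /* prints 1024 */
            return (0);
    }
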
 985  997  
 986  998  /*
 987  999   * Same as vmu_amp_update_incore_bounds(), except for tracking
 988 1000   * incore-/not-incore for vnodes.
 989 1001   */
 990 1002  static void
 991 1003  vmu_vnode_update_incore_bounds(avl_tree_t *tree, vnode_t *vnode,
 992 1004      vmu_bound_t **first, vmu_bound_t **last)
 993 1005  {
 994 1006          vmu_bound_t *next;
 995 1007          vmu_bound_t *tmp;
 996 1008          pgcnt_t index;
 997 1009          short bound_type;
 998 1010          short page_type;
 999 1011  
1000 1012          next = *first;
1001 1013          for (;;) {
  
  
1002 1014                  if (vnode->v_pages == NULL)
1003 1015                          next->vmb_type = VMUSAGE_BOUND_NOT_INCORE;
1004 1016  
1005 1017                  if (next->vmb_type != VMUSAGE_BOUND_UNKNOWN) {
1006 1018                          if (next == *last)
1007 1019                                  break;
1008 1020                          next = AVL_NEXT(tree, next);
1009 1021                          continue;
1010 1022                  }
1011 1023  
     1024 +                ASSERT(next->vmb_type == VMUSAGE_BOUND_UNKNOWN);
1012 1025                  bound_type = next->vmb_type;
1013 1026                  index = next->vmb_start;
1014 1027                  while (index <= next->vmb_end) {
1015 1028  
1016 1029                          /*
1017 1030                           * These are used to determine how much to increment
1018 1031                           * index when a large page is found.
1019 1032                           */
1020 1033                          page_t *page;
1021 1034                          pgcnt_t pgcnt = 1;
1022 1035                          uint_t pgshft;
1023 1036                          pgcnt_t pgmsk;
1024 1037  
1025 1038                          if (vnode->v_pages != NULL &&
1026 1039                              (page = page_exists(vnode, ptob(index))) != NULL) {
1027      -                                page_type = VMUSAGE_BOUND_INCORE;
     1040 +                                if (PP_ISFREE(page))
     1041 +                                        page_type = VMUSAGE_BOUND_NOT_INCORE;
     1042 +                                else
     1043 +                                        page_type = VMUSAGE_BOUND_INCORE;
1028 1044                                  if (page->p_szc > 0) {
1029 1045                                          pgcnt = page_get_pagecnt(page->p_szc);
1030 1046                                          pgshft = page_get_shift(page->p_szc);
1031 1047                                          pgmsk = (0x1 << (pgshft - PAGESHIFT))
1032 1048                                              - 1;
1033 1049                                  }
1034 1050                          } else {
1035 1051                                  page_type = VMUSAGE_BOUND_NOT_INCORE;
1036 1052                          }
     1053 +
1037 1054                          if (bound_type == VMUSAGE_BOUND_UNKNOWN) {
1038 1055                                  next->vmb_type = page_type;
     1056 +                                bound_type = page_type;
1039 1057                          } else if (next->vmb_type != page_type) {
1040 1058                                  /*
1041 1059                                   * If current bound type does not match page
1042 1060                                   * type, need to split off new bound.
1043 1061                                   */
1044 1062                                  tmp = vmu_alloc_bound();
1045 1063                                  tmp->vmb_type = page_type;
1046 1064                                  tmp->vmb_start = index;
1047 1065                                  tmp->vmb_end = next->vmb_end;
1048 1066                                  avl_insert_here(tree, tmp, next, AVL_AFTER);
1049 1067                                  next->vmb_end = index - 1;
1050 1068                                  if (*last == next)
1051 1069                                          *last = tmp;
1052 1070                                  next = tmp;
1053 1071                          }
1054 1072                          if (pgcnt > 1) {
1055 1073                                  /*
1056 1074                                   * If inside large page, jump to next large
1057 1075                                   * page
1058 1076                                   */
1059 1077                                  index = (index & ~pgmsk) + pgcnt;
1060 1078                          } else {
1061 1079                                  index++;
1062 1080                          }
1063 1081                  }
1064 1082                  if (next == *last) {
1065 1083                          ASSERT(next->vmb_type != VMUSAGE_BOUND_UNKNOWN);
1066 1084                          break;
1067 1085                  } else
1068 1086                          next = AVL_NEXT(tree, next);
1069 1087          }
1070 1088  }
1071 1089  
1072 1090  /*
1073 1091   * Calculate the rss and swap consumed by a segment.  vmu_entities is the
1074 1092   * list of entities to visit.  For shared segments, the vnode or amp
1075 1093   * is looked up in each entity to see if it has been already counted.  Private
1076 1094   * anon pages are checked per entity to ensure that COW pages are not
1077 1095   * double counted.
1078 1096   *
1079 1097   * For private mapped files, first the amp is checked for private pages.
1080 1098   * Bounds not backed by the amp are looked up in the vnode for each entity
1081 1099   * to avoid double counting of private COW vnode pages.
1082 1100   */
1083 1101  static void
1084 1102  vmu_calculate_seg(vmu_entity_t *vmu_entities, struct seg *seg)
1085 1103  {
1086 1104          struct segvn_data *svd;
1087 1105          struct shm_data *shmd;
1088 1106          struct spt_data *sptd;
1089 1107          vmu_object_t *shared_object = NULL;
1090 1108          vmu_object_t *entity_object = NULL;
1091 1109          vmu_entity_t *entity;
1092 1110          vmusage_t *result;
1093 1111          vmu_bound_t *first = NULL;
1094 1112          vmu_bound_t *last = NULL;
1095 1113          vmu_bound_t *cur = NULL;
1096 1114          vmu_bound_t *e_first = NULL;
1097 1115          vmu_bound_t *e_last = NULL;
1098 1116          vmu_bound_t *tmp;
1099 1117          pgcnt_t p_index, s_index, p_start, p_end, s_start, s_end, rss, virt;
1100 1118          struct anon_map *private_amp = NULL;
1101 1119          boolean_t incore = B_FALSE;
1102 1120          boolean_t shared = B_FALSE;
1103 1121          int file = 0;
1104 1122          pgcnt_t swresv = 0;
1105 1123          pgcnt_t panon = 0;
1106 1124  
1107 1125          /* Can zero-length segments exist?  Not sure, so paranoia. */
1108 1126          if (seg->s_size <= 0)
1109 1127                  return;
1110 1128  
1111 1129          /*
 1112 1130           * Figure out if there is a shared object (such as a named vnode or
 1113 1131           * a shared amp), then figure out if there is a private amp, which
1114 1132           * identifies private pages.
1115 1133           */
1116 1134          if (seg->s_ops == &segvn_ops) {
1117 1135                  svd = (struct segvn_data *)seg->s_data;
1118 1136                  if (svd->type == MAP_SHARED) {
1119 1137                          shared = B_TRUE;
1120 1138                  } else {
1121 1139                          swresv = svd->swresv;
1122 1140  
1123 1141                          if (SEGVN_LOCK_TRYENTER(seg->s_as, &svd->lock,
1124 1142                              RW_READER) != 0) {
1125 1143                                  /*
1126 1144                                   * Text replication anon maps can be shared
1127 1145                                   * across all zones. Space used for text
1128 1146                                   * replication is typically capped as a small %
1129 1147                                   * of memory.  To keep it simple for now we
1130 1148                                   * don't account for swap and memory space used
1131 1149                                   * for text replication.
1132 1150                                   */
1133 1151                                  if (svd->tr_state == SEGVN_TR_OFF &&
1134 1152                                      svd->amp != NULL) {
1135 1153                                          private_amp = svd->amp;
1136 1154                                          p_start = svd->anon_index;
1137 1155                                          p_end = svd->anon_index +
1138 1156                                              btop(seg->s_size) - 1;
1139 1157                                  }
1140 1158                                  SEGVN_LOCK_EXIT(seg->s_as, &svd->lock);
1141 1159                          }
1142 1160                  }
1143 1161                  if (svd->vp != NULL) {
1144 1162                          file = 1;
1145 1163                          shared_object = vmu_find_insert_object(
1146 1164                              vmu_data.vmu_all_vnodes_hash, (caddr_t)svd->vp,
1147 1165                              VMUSAGE_TYPE_VNODE);
1148 1166                          s_start = btop(svd->offset);
1149 1167                          s_end = btop(svd->offset + seg->s_size) - 1;
1150 1168                  }
1151 1169                  if (svd->amp != NULL && svd->type == MAP_SHARED) {
1152 1170                          ASSERT(shared_object == NULL);
1153 1171                          shared_object = vmu_find_insert_object(
1154 1172                              vmu_data.vmu_all_amps_hash, (caddr_t)svd->amp,
1155 1173                              VMUSAGE_TYPE_AMP);
1156 1174                          s_start = svd->anon_index;
1157 1175                          s_end = svd->anon_index + btop(seg->s_size) - 1;
1158 1176                          /* schedctl mappings are always in core */
1159 1177                          if (svd->amp->swresv == 0)
1160 1178                                  incore = B_TRUE;
1161 1179                  }
1162 1180          } else if (seg->s_ops == &segspt_shmops) {
1163 1181                  shared = B_TRUE;
1164 1182                  shmd = (struct shm_data *)seg->s_data;
1165 1183                  shared_object = vmu_find_insert_object(
1166 1184                      vmu_data.vmu_all_amps_hash, (caddr_t)shmd->shm_amp,
1167 1185                      VMUSAGE_TYPE_AMP);
1168 1186                  s_start = 0;
1169 1187                  s_end = btop(seg->s_size) - 1;
1170 1188                  sptd = shmd->shm_sptseg->s_data;
1171 1189  
1172 1190                  /* ism segments are always incore and do not reserve swap */
1173 1191                  if (sptd->spt_flags & SHM_SHARE_MMU)
1174 1192                          incore = B_TRUE;
1175 1193  
1176 1194          } else {
1177 1195                  return;
1178 1196          }
1179 1197  
1180 1198          /*
1181 1199           * If there is a private amp, count anon pages that exist.  If an
1182 1200           * anon has a refcnt > 1 (COW sharing), then save the anon in a
1183 1201           * hash so that it is not double counted.
1184 1202           *
1185 1203           * If there is also a shared object, then figure out the bounds
1186 1204           * which are not mapped by the private amp.
1187 1205           */
1188 1206          if (private_amp != NULL) {
1189 1207  
1190 1208                  /* Enter as writer to prevent COW anons from being freed */
1191 1209                  ANON_LOCK_ENTER(&private_amp->a_rwlock, RW_WRITER);
1192 1210  
1193 1211                  p_index = p_start;
1194 1212                  s_index = s_start;
1195 1213  
1196 1214                  while (p_index <= p_end) {
1197 1215  
1198 1216                          pgcnt_t p_index_next;
1199 1217                          pgcnt_t p_bound_size;
1200 1218                          int cnt;
1201 1219                          anoff_t off;
1202 1220                          struct vnode *vn;
1203 1221                          struct anon *ap;
1204 1222                          page_t *page;           /* For handling of large */
1205 1223                          pgcnt_t pgcnt = 1;      /* pages */
1206 1224                          pgcnt_t pgstart;
1207 1225                          pgcnt_t pgend;
1208 1226                          uint_t pgshft;
1209 1227                          pgcnt_t pgmsk;
1210 1228  
1211 1229                          p_index_next = p_index;
1212 1230                          ap = anon_get_next_ptr(private_amp->ahp,
1213 1231                              &p_index_next);
1214 1232  
1215 1233                          /*
1216 1234                           * If next anon is past end of mapping, simulate
1217 1235                           * end of anon so loop terminates.
1218 1236                           */
1219 1237                          if (p_index_next > p_end) {
1220 1238                                  p_index_next = p_end + 1;
1221 1239                                  ap = NULL;
1222 1240                          }
1223 1241                          /*
1224 1242                           * For COW segments, keep track of bounds not
1225 1243                           * backed by private amp so they can be looked
1226 1244                           * up in the backing vnode
1227 1245                           */
1228 1246                          if (p_index_next != p_index) {
1229 1247  
1230 1248                                  /*
1231 1249                                   * Compute index difference between anon and
1232 1250                                   * previous anon.
1233 1251                                   */
1234 1252                                  p_bound_size = p_index_next - p_index - 1;
1235 1253  
1236 1254                                  if (shared_object != NULL) {
1237 1255                                          cur = vmu_alloc_bound();
1238 1256                                          cur->vmb_start = s_index;
1239 1257                                          cur->vmb_end = s_index + p_bound_size;
1240 1258                                          cur->vmb_type = VMUSAGE_BOUND_UNKNOWN;
1241 1259                                          if (first == NULL) {
1242 1260                                                  first = cur;
1243 1261                                                  last = cur;
1244 1262                                          } else {
1245 1263                                                  last->vmb_next = cur;
1246 1264                                                  last = cur;
1247 1265                                          }
1248 1266                                  }
1249 1267                                  p_index = p_index + p_bound_size + 1;
1250 1268                                  s_index = s_index + p_bound_size + 1;
1251 1269                          }
1252 1270  
1253 1271                          /* Detect end of anons in amp */
1254 1272                          if (ap == NULL)
1255 1273                                  break;
1256 1274  
1257 1275                          cnt = ap->an_refcnt;
1258 1276                          swap_xlate(ap, &vn, &off);
1259 1277  
1260 1278                          if (vn == NULL || vn->v_pages == NULL ||
1261 1279                              (page = page_exists(vn, off)) == NULL) {
1262 1280                                  p_index++;
1263 1281                                  s_index++;
1264 1282                                  continue;
1265 1283                          }
1266 1284  
1267 1285                          /*
1268 1286                           * If large page is found, compute portion of large
1269 1287                           * page in mapping, and increment indicies to the next
1270 1288                           * large page.
1271 1289                           */
1272 1290                          if (page->p_szc > 0) {
1273 1291  
1274 1292                                  pgcnt = page_get_pagecnt(page->p_szc);
1275 1293                                  pgshft = page_get_shift(page->p_szc);
1276 1294                                  pgmsk = (0x1 << (pgshft - PAGESHIFT)) - 1;
1277 1295  
1278 1296                                  /* First page in large page */
1279 1297                                  pgstart = p_index & ~pgmsk;
1280 1298                                  /* Last page in large page */
1281 1299                                  pgend = pgstart + pgcnt - 1;
1282 1300                                  /*
1283 1301                                   * Artificially end page if page extends past
1284 1302                                   * end of mapping.
1285 1303                                   */
1286 1304                                  if (pgend > p_end)
1287 1305                                          pgend = p_end;
1288 1306  
1289 1307                                  /*
1290 1308                                   * Compute number of pages from large page
1291 1309                                   * which are mapped.
1292 1310                                   */
1293 1311                                  pgcnt = pgend - p_index + 1;
1294 1312  
1295 1313                                  /*
1296 1314                                   * Point indices at page after large page,
1297 1315                                   * or at page after end of mapping.
1298 1316                                   */
1299 1317                                  p_index += pgcnt;
1300 1318                                  s_index += pgcnt;
1301 1319                          } else {
1302 1320                                  p_index++;
1303 1321                                  s_index++;
1304 1322                          }
1305 1323  
1306 1324                          /*
     1325 +                         * Pages on the free list aren't counted for the rss.
     1326 +                         */
     1327 +                        if (PP_ISFREE(page))
     1328 +                                continue;
     1329 +
     1330 +                        /*
1307 1331                           * Assume anon structs with a refcnt
1308 1332                           * of 1 are not COW shared, so there
1309 1333                           * is no reason to track them per entity.
1310 1334                           */
1311 1335                          if (cnt == 1) {
1312 1336                                  panon += pgcnt;
1313 1337                                  continue;
1314 1338                          }
1315 1339                          for (entity = vmu_entities; entity != NULL;
1316 1340                              entity = entity->vme_next_calc) {
1317 1341  
1318 1342                                  result = &entity->vme_result;
1319 1343                                  /*
1320 1344                                   * Track COW anons per entity so
1321 1345                                   * they are not double counted.
1322 1346                                   */
1323 1347                                  if (vmu_find_insert_anon(entity->vme_anon_hash,
1324 1348                                      (caddr_t)ap) == 0)
1325 1349                                          continue;
1326 1350  
1327 1351                                  result->vmu_rss_all += (pgcnt << PAGESHIFT);
1328 1352                                  result->vmu_rss_private +=
1329 1353                                      (pgcnt << PAGESHIFT);
1330 1354                          }
1331 1355                  }
1332 1356                  ANON_LOCK_EXIT(&private_amp->a_rwlock);
1333 1357          }
1334 1358  
1335 1359          /* Add up resident anon and swap reserved for private mappings */
1336 1360          if (swresv > 0 || panon > 0) {
1337 1361                  for (entity = vmu_entities; entity != NULL;
1338 1362                      entity = entity->vme_next_calc) {
1339 1363                          result = &entity->vme_result;
1340 1364                          result->vmu_swap_all += swresv;
1341 1365                          result->vmu_swap_private += swresv;
1342 1366                          result->vmu_rss_all += (panon << PAGESHIFT);
1343 1367                          result->vmu_rss_private += (panon << PAGESHIFT);
1344 1368                  }
1345 1369          }
1346 1370  
1347 1371          /* Compute resident pages backing shared amp or named vnode */
1348 1372          if (shared_object != NULL) {
1349 1373                  avl_tree_t *tree = &(shared_object->vmo_bounds);
1350 1374  
1351 1375                  if (first == NULL) {
1352 1376                          /*
1353 1377                           * No private amp, or private amp has no anon
1354 1378                           * structs.  This means entire segment is backed by
1355 1379                           * the shared object.
1356 1380                           */
1357 1381                          first = vmu_alloc_bound();
1358 1382                          first->vmb_start = s_start;
1359 1383                          first->vmb_end = s_end;
1360 1384                          first->vmb_type = VMUSAGE_BOUND_UNKNOWN;
1361 1385                  }
1362 1386                  /*
1363 1387                   * Iterate bounds not backed by private amp, and compute
1364 1388                   * resident pages.
1365 1389                   */
1366 1390                  cur = first;
1367 1391                  while (cur != NULL) {
1368 1392  
1369 1393                          if (vmu_insert_lookup_object_bounds(shared_object,
1370 1394                              cur->vmb_start, cur->vmb_end, VMUSAGE_BOUND_UNKNOWN,
1371 1395                              &first, &last) > 0) {
1372 1396                                  /* new bounds, find incore/not-incore */
1373 1397                                  if (shared_object->vmo_type ==
1374 1398                                      VMUSAGE_TYPE_VNODE) {
1375 1399                                          vmu_vnode_update_incore_bounds(
1376 1400                                              tree,
1377 1401                                              (vnode_t *)
1378 1402                                              shared_object->vmo_key, &first,
1379 1403                                              &last);
1380 1404                                  } else {
1381 1405                                          vmu_amp_update_incore_bounds(
1382 1406                                              tree,
1383 1407                                              (struct anon_map *)
1384 1408                                              shared_object->vmo_key, &first,
1385 1409                                              &last, incore);
1386 1410                                  }
1387 1411                                  vmu_merge_bounds(tree, &first, &last);
1388 1412                          }
1389 1413                          for (entity = vmu_entities; entity != NULL;
1390 1414                              entity = entity->vme_next_calc) {
1391 1415                                  avl_tree_t *e_tree;
1392 1416  
1393 1417                                  result = &entity->vme_result;
1394 1418  
1395 1419                                  entity_object = vmu_find_insert_object(
1396 1420                                      shared_object->vmo_type ==
1397 1421                                      VMUSAGE_TYPE_VNODE ? entity->vme_vnode_hash:
1398 1422                                      entity->vme_amp_hash,
1399 1423                                      shared_object->vmo_key,
1400 1424                                      shared_object->vmo_type);
1401 1425  
1402 1426                                  virt = vmu_insert_lookup_object_bounds(
1403 1427                                      entity_object, cur->vmb_start, cur->vmb_end,
1404 1428                                      VMUSAGE_BOUND_UNKNOWN, &e_first, &e_last);
1405 1429  
1406 1430                                  if (virt == 0)
1407 1431                                          continue;
1408 1432                                  /*
1409 1433                                   * Range visited for this entity
1410 1434                                   */
1411 1435                                  e_tree = &(entity_object->vmo_bounds);
1412 1436                                  rss = vmu_update_bounds(e_tree, &e_first,
1413 1437                                      &e_last, tree, first, last);
1414 1438                                  result->vmu_rss_all += (rss << PAGESHIFT);
1415 1439                                  if (shared == B_TRUE && file == B_FALSE) {
1416 1440                                          /* shared anon mapping */
1417 1441                                          result->vmu_swap_all +=
1418 1442                                              (virt << PAGESHIFT);
1419 1443                                          result->vmu_swap_shared +=
1420 1444                                              (virt << PAGESHIFT);
1421 1445                                          result->vmu_rss_shared +=
1422 1446                                              (rss << PAGESHIFT);
1423 1447                                  } else if (shared == B_TRUE && file == B_TRUE) {
1424 1448                                          /* shared file mapping */
1425 1449                                          result->vmu_rss_shared +=
1426 1450                                              (rss << PAGESHIFT);
1427 1451                                  } else if (shared == B_FALSE &&
1428 1452                                      file == B_TRUE) {
1429 1453                                          /* private file mapping */
1430 1454                                          result->vmu_rss_private +=
1431 1455                                              (rss << PAGESHIFT);
1432 1456                                  }
1433 1457                                  vmu_merge_bounds(e_tree, &e_first, &e_last);
1434 1458                          }
1435 1459                          tmp = cur;
1436 1460                          cur = cur->vmb_next;
1437 1461                          vmu_free_bound(tmp);
1438 1462                  }
1439 1463          }
1440 1464  }
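
    The large-page bookkeeping in the loop above is plain mask arithmetic.
    A minimal sketch, assuming 4K base pages (PAGESHIFT of 12), a 2M large
    page, and hypothetical index values, shows how the mapped-page count
    and the jump to the next large page are derived:

        #include <stdio.h>

        #define PAGESHIFT       12      /* assume 4K base pages */

        int
        main(void)
        {
                unsigned int pgshft = 21;       /* assume a 2M large page */
                unsigned long pgcnt = 1UL << (pgshft - PAGESHIFT);  /* 512 */
                unsigned long pgmsk = pgcnt - 1;                    /* 511 */
                unsigned long p_index = 1000;   /* hypothetical anon index */
                unsigned long p_end = 1500;     /* hypothetical mapping end */

                unsigned long pgstart = p_index & ~pgmsk;       /* 512 */
                unsigned long pgend = pgstart + pgcnt - 1;      /* 1023 */

                if (pgend > p_end)
                        pgend = p_end;  /* clip to the end of the mapping */

                /* Pages of this large page covered from p_index on: 24. */
                unsigned long mapped = pgend - p_index + 1;
                /* The loop then jumps to the next large page: 1024. */
                unsigned long next = (p_index & ~pgmsk) + pgcnt;

                printf("start=%lu end=%lu mapped=%lu next=%lu\n",
                    pgstart, pgend, mapped, next);
                return (0);
        }
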
1441 1465  
1442 1466  /*
1443 1467   * Based on the current calculation flags, find the entities that are
1444 1468   * relevant to the process.  Then calculate each segment
1445 1469   * in the process's address space for each relevant entity.
1446 1470   */
1447 1471  static void
1448 1472  vmu_calculate_proc(proc_t *p)
1449 1473  {
1450 1474          vmu_entity_t *entities = NULL;
1451 1475          vmu_zone_t *zone;
1452 1476          vmu_entity_t *tmp;
1453 1477          struct as *as;
1454 1478          struct seg *seg;
1455 1479          int ret;
1456 1480  
1457 1481          /* Figure out which entities are being computed */
1458 1482          if ((vmu_data.vmu_system) != NULL) {
1459 1483                  tmp = vmu_data.vmu_system;
1460 1484                  tmp->vme_next_calc = entities;
1461 1485                  entities = tmp;
1462 1486          }
1463 1487          if (vmu_data.vmu_calc_flags &
1464      -            (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_PROJECTS |
1465      -            VMUSAGE_ALL_PROJECTS | VMUSAGE_TASKS | VMUSAGE_ALL_TASKS |
     1488 +            (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_A_ZONE |
     1489 +            VMUSAGE_PROJECTS | VMUSAGE_ALL_PROJECTS |
     1490 +            VMUSAGE_TASKS | VMUSAGE_ALL_TASKS |
1466 1491              VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS | VMUSAGE_EUSERS |
1467 1492              VMUSAGE_ALL_EUSERS)) {
1468 1493                  ret = i_mod_hash_find_nosync(vmu_data.vmu_zones_hash,
1469 1494                      (mod_hash_key_t)(uintptr_t)p->p_zone->zone_id,
1470 1495                      (mod_hash_val_t *)&zone);
1471 1496                  if (ret != 0) {
1472 1497                          zone = vmu_alloc_zone(p->p_zone->zone_id);
1473 1498                          ret = i_mod_hash_insert_nosync(vmu_data.vmu_zones_hash,
1474 1499                              (mod_hash_key_t)(uintptr_t)p->p_zone->zone_id,
1475 1500                              (mod_hash_val_t)zone, (mod_hash_hndl_t)0);
1476 1501                          ASSERT(ret == 0);
1477 1502                  }
1478 1503                  if (zone->vmz_zone != NULL) {
1479 1504                          tmp = zone->vmz_zone;
1480 1505                          tmp->vme_next_calc = entities;
1481 1506                          entities = tmp;
1482 1507                  }
1483 1508                  if (vmu_data.vmu_calc_flags &
1484 1509                      (VMUSAGE_PROJECTS | VMUSAGE_ALL_PROJECTS)) {
1485 1510                          tmp = vmu_find_insert_entity(zone->vmz_projects_hash,
1486 1511                              p->p_task->tk_proj->kpj_id, VMUSAGE_PROJECTS,
1487 1512                              zone->vmz_id);
1488 1513                          tmp->vme_next_calc = entities;
1489 1514                          entities = tmp;
1490 1515                  }
1491 1516                  if (vmu_data.vmu_calc_flags &
1492 1517                      (VMUSAGE_TASKS | VMUSAGE_ALL_TASKS)) {
1493 1518                          tmp = vmu_find_insert_entity(zone->vmz_tasks_hash,
1494 1519                              p->p_task->tk_tkid, VMUSAGE_TASKS, zone->vmz_id);
1495 1520                          tmp->vme_next_calc = entities;
1496 1521                          entities = tmp;
1497 1522                  }
1498 1523                  if (vmu_data.vmu_calc_flags &
1499 1524                      (VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS)) {
1500 1525                          tmp = vmu_find_insert_entity(zone->vmz_rusers_hash,
1501 1526                              crgetruid(p->p_cred), VMUSAGE_RUSERS, zone->vmz_id);
1502 1527                          tmp->vme_next_calc = entities;
1503 1528                          entities = tmp;
1504 1529                  }
1505 1530                  if (vmu_data.vmu_calc_flags &
1506 1531                      (VMUSAGE_EUSERS | VMUSAGE_ALL_EUSERS)) {
1507 1532                          tmp = vmu_find_insert_entity(zone->vmz_eusers_hash,
1508 1533                              crgetuid(p->p_cred), VMUSAGE_EUSERS, zone->vmz_id);
1509 1534                          tmp->vme_next_calc = entities;
1510 1535                          entities = tmp;
1511 1536                  }
1512 1537          }
1513 1538          /* Entities which collapse projects and users for all zones */
1514 1539          if (vmu_data.vmu_calc_flags & VMUSAGE_COL_PROJECTS) {
1515 1540                  tmp = vmu_find_insert_entity(vmu_data.vmu_projects_col_hash,
1516 1541                      p->p_task->tk_proj->kpj_id, VMUSAGE_PROJECTS, ALL_ZONES);
1517 1542                  tmp->vme_next_calc = entities;
1518 1543                  entities = tmp;
1519 1544          }
1520 1545          if (vmu_data.vmu_calc_flags & VMUSAGE_COL_RUSERS) {
1521 1546                  tmp = vmu_find_insert_entity(vmu_data.vmu_rusers_col_hash,
1522 1547                      crgetruid(p->p_cred), VMUSAGE_RUSERS, ALL_ZONES);
1523 1548                  tmp->vme_next_calc = entities;
1524 1549                  entities = tmp;
1525 1550          }
1526 1551          if (vmu_data.vmu_calc_flags & VMUSAGE_COL_EUSERS) {
1527 1552                  tmp = vmu_find_insert_entity(vmu_data.vmu_eusers_col_hash,
1528 1553                      crgetuid(p->p_cred), VMUSAGE_EUSERS, ALL_ZONES);
1529 1554                  tmp->vme_next_calc = entities;
1530 1555                  entities = tmp;
1531 1556          }
1532 1557  
1533 1558          ASSERT(entities != NULL);
1534 1559          /* process all segs in process's address space */
1535 1560          as = p->p_as;
1536 1561          AS_LOCK_ENTER(as, RW_READER);
1537 1562          for (seg = AS_SEGFIRST(as); seg != NULL;
1538 1563              seg = AS_SEGNEXT(as, seg)) {
1539 1564                  vmu_calculate_seg(entities, seg);
1540 1565          }
1541 1566          AS_LOCK_EXIT(as);
1542 1567  }
1543 1568  
1544 1569  /*
1545 1570   * Free data created by previous call to vmu_calculate().
1546 1571   */
1547 1572  static void
1548 1573  vmu_clear_calc()
1549 1574  {
1550 1575          if (vmu_data.vmu_system != NULL)
1551 1576                  vmu_free_entity(vmu_data.vmu_system);
1552 1577          vmu_data.vmu_system = NULL;
1553 1578          if (vmu_data.vmu_zones_hash != NULL)
1554 1579                  i_mod_hash_clear_nosync(vmu_data.vmu_zones_hash);
1555 1580          if (vmu_data.vmu_projects_col_hash != NULL)
1556 1581                  i_mod_hash_clear_nosync(vmu_data.vmu_projects_col_hash);
1557 1582          if (vmu_data.vmu_rusers_col_hash != NULL)
1558 1583                  i_mod_hash_clear_nosync(vmu_data.vmu_rusers_col_hash);
1559 1584          if (vmu_data.vmu_eusers_col_hash != NULL)
1560 1585                  i_mod_hash_clear_nosync(vmu_data.vmu_eusers_col_hash);
1561 1586  
1562 1587          i_mod_hash_clear_nosync(vmu_data.vmu_all_vnodes_hash);
1563 1588          i_mod_hash_clear_nosync(vmu_data.vmu_all_amps_hash);
1564 1589  }
1565 1590  
1566 1591  /*
1567 1592   * Free unused data structures.  These can result if the system workload
1568 1593   * decreases between calculations.
1569 1594   */
1570 1595  static void
1571 1596  vmu_free_extra()
1572 1597  {
1573 1598          vmu_bound_t *tb;
1574 1599          vmu_object_t *to;
1575 1600          vmu_entity_t *te;
1576 1601          vmu_zone_t *tz;
1577 1602  
1578 1603          while (vmu_data.vmu_free_bounds != NULL) {
1579 1604                  tb = vmu_data.vmu_free_bounds;
1580 1605                  vmu_data.vmu_free_bounds = vmu_data.vmu_free_bounds->vmb_next;
1581 1606                  kmem_cache_free(vmu_bound_cache, tb);
1582 1607          }
1583 1608          while (vmu_data.vmu_free_objects != NULL) {
1584 1609                  to = vmu_data.vmu_free_objects;
1585 1610                  vmu_data.vmu_free_objects =
1586 1611                      vmu_data.vmu_free_objects->vmo_next;
1587 1612                  kmem_cache_free(vmu_object_cache, to);
1588 1613          }
1589 1614          while (vmu_data.vmu_free_entities != NULL) {
1590 1615                  te = vmu_data.vmu_free_entities;
1591 1616                  vmu_data.vmu_free_entities =
1592 1617                      vmu_data.vmu_free_entities->vme_next;
1593 1618                  if (te->vme_vnode_hash != NULL)
1594 1619                          mod_hash_destroy_hash(te->vme_vnode_hash);
1595 1620                  if (te->vme_amp_hash != NULL)
1596 1621                          mod_hash_destroy_hash(te->vme_amp_hash);
1597 1622                  if (te->vme_anon_hash != NULL)
1598 1623                          mod_hash_destroy_hash(te->vme_anon_hash);
1599 1624                  kmem_free(te, sizeof (vmu_entity_t));
1600 1625          }
1601 1626          while (vmu_data.vmu_free_zones != NULL) {
1602 1627                  tz = vmu_data.vmu_free_zones;
1603 1628                  vmu_data.vmu_free_zones =
1604 1629                      vmu_data.vmu_free_zones->vmz_next;
1605 1630                  if (tz->vmz_projects_hash != NULL)
1606 1631                          mod_hash_destroy_hash(tz->vmz_projects_hash);
1607 1632                  if (tz->vmz_tasks_hash != NULL)
1608 1633                          mod_hash_destroy_hash(tz->vmz_tasks_hash);
1609 1634                  if (tz->vmz_rusers_hash != NULL)
1610 1635                          mod_hash_destroy_hash(tz->vmz_rusers_hash);
1611 1636                  if (tz->vmz_eusers_hash != NULL)
1612 1637                          mod_hash_destroy_hash(tz->vmz_eusers_hash);
1613 1638                  kmem_free(tz, sizeof (vmu_zone_t));
1614 1639          }
1615 1640  }
1616 1641  
1617 1642  extern kcondvar_t *pr_pid_cv;
1618 1643  
1619 1644  /*
1620 1645   * Determine which entity types are relevant and allocate the hashes to
1621 1646   * track them.  Then walk the process table and count rss and swap
1622 1647   * for each process's address space.  Address space objects such as
1623 1648   * vnodes, amps and anons are tracked per entity, so that they are
1624 1649   * not double counted in the results.
1625 1650   *
1626 1651   */
1627 1652  static void
1628 1653  vmu_calculate()
1629 1654  {
1630 1655          int i = 0;
1631 1656          int ret;
1632 1657          proc_t *p;
1633 1658  
1634 1659          vmu_clear_calc();
1635 1660  
1636 1661          if (vmu_data.vmu_calc_flags & VMUSAGE_SYSTEM)
1637 1662                  vmu_data.vmu_system = vmu_alloc_entity(0, VMUSAGE_SYSTEM,
1638 1663                      ALL_ZONES);
1639 1664  
1640 1665          /*
1641 1666           * Walk process table and calculate rss of each proc.
1642 1667           *
1643 1668           * Pidlock and p_lock cannot be held while doing the rss calculation.
1644 1669           * This is because:
1645 1670           *      1.  The calculation allocates using KM_SLEEP.
1646 1671           *      2.  The calculation grabs a_lock, which cannot be grabbed
1647 1672           *          after p_lock.
1648 1673           *
1649 1674           * Since pidlock must be dropped, we cannot simply walk the
1650 1675           * practive list.  Instead, we walk the process table, and sprlock
1651 1676           * each process to ensure that it does not exit during the
1652 1677           * calculation.
1653 1678           */
1654 1679  
1655 1680          mutex_enter(&pidlock);
1656 1681          for (i = 0; i < v.v_proc; i++) {
1657 1682  again:
1658 1683                  p = pid_entry(i);
1659 1684                  if (p == NULL)
1660 1685                          continue;
1661 1686  
1662 1687                  mutex_enter(&p->p_lock);
1663 1688                  mutex_exit(&pidlock);
1664 1689  
1665 1690                  if (panicstr) {
1666 1691                          mutex_exit(&p->p_lock);
1667 1692                          return;
1668 1693                  }
1669 1694  
1670 1695                  /* Try to set P_PR_LOCK */
1671 1696                  ret = sprtrylock_proc(p);
1672 1697                  if (ret == -1) {
1673 1698                          /* Process in invalid state */
1674 1699                          mutex_exit(&p->p_lock);
1675 1700                          mutex_enter(&pidlock);
1676 1701                          continue;
1677 1702                  } else if (ret == 1) {
1678 1703                          /*
1679 1704                           * P_PR_LOCK is already set.  Wait and try again.
1680 1705                           * This also drops p_lock.
1681 1706                           */
1682 1707                          sprwaitlock_proc(p);
1683 1708                          mutex_enter(&pidlock);
1684 1709                          goto again;
1685 1710                  }
1686 1711                  mutex_exit(&p->p_lock);
1687 1712  
1688 1713                  vmu_calculate_proc(p);
1689 1714  
1690 1715                  mutex_enter(&p->p_lock);
1691 1716                  sprunlock(p);
1692 1717                  mutex_enter(&pidlock);
1693 1718          }
1694 1719          mutex_exit(&pidlock);
1695 1720  
1696 1721          vmu_free_extra();
1697 1722  }
1698 1723  
1699 1724  /*
1700 1725   * allocate a new cache for N results satisfying flags
1701 1726   */
1702 1727  vmu_cache_t *
1703 1728  vmu_cache_alloc(size_t nres, uint_t flags)
1704 1729  {
1705 1730          vmu_cache_t *cache;
1706 1731  
1707 1732          cache = kmem_zalloc(sizeof (vmu_cache_t), KM_SLEEP);
1708 1733          cache->vmc_results = kmem_zalloc(sizeof (vmusage_t) * nres, KM_SLEEP);
1709 1734          cache->vmc_nresults = nres;
1710 1735          cache->vmc_flags = flags;
1711 1736          cache->vmc_refcnt = 1;
1712 1737          return (cache);
1713 1738  }
1714 1739  
1715 1740  /*
1716 1741   * Make sure cached results are not freed
1717 1742   */
1718 1743  static void
1719 1744  vmu_cache_hold(vmu_cache_t *cache)
1720 1745  {
1721 1746          ASSERT(MUTEX_HELD(&vmu_data.vmu_lock));
1722 1747          cache->vmc_refcnt++;
1723 1748  }
1724 1749  
1725 1750  /*
1726 1751   * free cache data
1727 1752   */
1728 1753  static void
1729 1754  vmu_cache_rele(vmu_cache_t *cache)
1730 1755  {
1731 1756          ASSERT(MUTEX_HELD(&vmu_data.vmu_lock));
1732 1757          ASSERT(cache->vmc_refcnt > 0);
1733 1758          cache->vmc_refcnt--;
1734 1759          if (cache->vmc_refcnt == 0) {
1735 1760                  kmem_free(cache->vmc_results, sizeof (vmusage_t) *
1736 1761                      cache->vmc_nresults);
1737 1762                  kmem_free(cache, sizeof (vmu_cache_t));
1738 1763          }
1739 1764  }
1740 1765  
1741 1766  /*
     1767 + * When new data is calculated, update the phys_mem rctl usage value in the
     1768 + * zones.
     1769 + */
     1770 +static void
     1771 +vmu_update_zone_rctls(vmu_cache_t *cache)
     1772 +{
     1773 +        vmusage_t       *rp;
     1774 +        size_t          i = 0;
     1775 +        zone_t          *zp;
     1776 +
     1777 +        for (rp = cache->vmc_results; i < cache->vmc_nresults; rp++, i++) {
     1778 +                if (rp->vmu_type == VMUSAGE_ZONE &&
     1779 +                    rp->vmu_zoneid != ALL_ZONES) {
     1780 +                        if ((zp = zone_find_by_id(rp->vmu_zoneid)) != NULL) {
     1781 +                                zp->zone_phys_mem = rp->vmu_rss_all;
     1782 +                                zone_rele(zp);
     1783 +                        }
     1784 +                }
     1785 +        }
     1786 +}
     1787 +
     1788 +/*
1742 1789   * Copy out the cached results to a caller.  Inspect the caller's flags
1743 1790   * and zone to determine which cached results should be copied.
1744 1791   */
1745 1792  static int
1746 1793  vmu_copyout_results(vmu_cache_t *cache, vmusage_t *buf, size_t *nres,
1747      -    uint_t flags, int cpflg)
     1794 +    uint_t flags, id_t req_zone_id, int cpflg)
1748 1795  {
1749 1796          vmusage_t *result, *out_result;
1750 1797          vmusage_t dummy;
1751 1798          size_t i, count = 0;
1752 1799          size_t bufsize;
1753 1800          int ret = 0;
1754 1801          uint_t types = 0;
1755 1802  
1756 1803          if (nres != NULL) {
1757 1804                  if (ddi_copyin((caddr_t)nres, &bufsize, sizeof (size_t), cpflg))
1758 1805                          return (set_errno(EFAULT));
1759 1806          } else {
1760 1807                  bufsize = 0;
1761 1808          }
1762 1809  
1763 1810          /* figure out what results the caller is interested in. */
1764 1811          if ((flags & VMUSAGE_SYSTEM) && curproc->p_zone == global_zone)
1765 1812                  types |= VMUSAGE_SYSTEM;
1766      -        if (flags & (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES))
     1813 +        if (flags & (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_A_ZONE))
1767 1814                  types |= VMUSAGE_ZONE;
1768 1815          if (flags & (VMUSAGE_PROJECTS | VMUSAGE_ALL_PROJECTS |
1769 1816              VMUSAGE_COL_PROJECTS))
1770 1817                  types |= VMUSAGE_PROJECTS;
1771 1818          if (flags & (VMUSAGE_TASKS | VMUSAGE_ALL_TASKS))
1772 1819                  types |= VMUSAGE_TASKS;
1773 1820          if (flags & (VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS | VMUSAGE_COL_RUSERS))
1774 1821                  types |= VMUSAGE_RUSERS;
1775 1822          if (flags & (VMUSAGE_EUSERS | VMUSAGE_ALL_EUSERS | VMUSAGE_COL_EUSERS))
1776 1823                  types |= VMUSAGE_EUSERS;
1777 1824  
1778 1825          /* count results for current zone */
1779 1826          out_result = buf;
1780 1827          for (result = cache->vmc_results, i = 0;
1781 1828              i < cache->vmc_nresults; result++, i++) {
1782 1829  
1783 1830                  /* Do not return "other-zone" results to non-global zones */
1784 1831                  if (curproc->p_zone != global_zone &&
1785 1832                      curproc->p_zone->zone_id != result->vmu_zoneid)
1786 1833                          continue;
1787 1834  
1788 1835                  /*
1789 1836                   * If non-global zone requests VMUSAGE_SYSTEM, fake
1790 1837                   * up VMUSAGE_ZONE result as VMUSAGE_SYSTEM result.
1791 1838                   */
1792 1839                  if (curproc->p_zone != global_zone &&
1793 1840                      (flags & VMUSAGE_SYSTEM) != 0 &&
1794 1841                      result->vmu_type == VMUSAGE_ZONE) {
1795 1842                          count++;
1796 1843                          if (out_result != NULL) {
1797 1844                                  if (bufsize < count) {
1798 1845                                          ret = set_errno(EOVERFLOW);
1799 1846                                  } else {
1800 1847                                          dummy = *result;
1801 1848                                          dummy.vmu_zoneid = ALL_ZONES;
1802 1849                                          dummy.vmu_id = 0;
1803 1850                                          dummy.vmu_type = VMUSAGE_SYSTEM;
1804 1851                                          if (ddi_copyout(&dummy, out_result,
1805 1852                                              sizeof (vmusage_t), cpflg))
1806 1853                                                  return (set_errno(EFAULT));
1807 1854                                          out_result++;
1808 1855                                  }
1809 1856                          }
1810 1857                  }
1811 1858  
1812 1859                  /* Skip results that do not match requested type */
1813 1860                  if ((result->vmu_type & types) == 0)
1814 1861                          continue;
1815 1862  
1816 1863                  /* Skip collated results if not requested */
1817 1864                  if (result->vmu_zoneid == ALL_ZONES) {
1818 1865                          if (result->vmu_type == VMUSAGE_PROJECTS &&
1819 1866                              (flags & VMUSAGE_COL_PROJECTS) == 0)
1820 1867                                  continue;
1821 1868                          if (result->vmu_type == VMUSAGE_EUSERS &&
1822 1869                              (flags & VMUSAGE_COL_EUSERS) == 0)
1823 1870                                  continue;
1824 1871                          if (result->vmu_type == VMUSAGE_RUSERS &&
1825 1872                              (flags & VMUSAGE_COL_RUSERS) == 0)
1826 1873                                  continue;
1827 1874                  }
1828 1875  
1829      -                /* Skip "other zone" results if not requested */
1830      -                if (result->vmu_zoneid != curproc->p_zone->zone_id) {
1831      -                        if (result->vmu_type == VMUSAGE_ZONE &&
1832      -                            (flags & VMUSAGE_ALL_ZONES) == 0)
     1876 +                if (result->vmu_type == VMUSAGE_ZONE &&
     1877 +                    flags & VMUSAGE_A_ZONE) {
     1878 +                        /* Skip non-requested zone results */
     1879 +                        if (result->vmu_zoneid != req_zone_id)
1833 1880                                  continue;
1834      -                        if (result->vmu_type == VMUSAGE_PROJECTS &&
1835      -                            (flags & (VMUSAGE_ALL_PROJECTS |
1836      -                            VMUSAGE_COL_PROJECTS)) == 0)
1837      -                                continue;
1838      -                        if (result->vmu_type == VMUSAGE_TASKS &&
1839      -                            (flags & VMUSAGE_ALL_TASKS) == 0)
1840      -                                continue;
1841      -                        if (result->vmu_type == VMUSAGE_RUSERS &&
1842      -                            (flags & (VMUSAGE_ALL_RUSERS |
1843      -                            VMUSAGE_COL_RUSERS)) == 0)
1844      -                                continue;
1845      -                        if (result->vmu_type == VMUSAGE_EUSERS &&
1846      -                            (flags & (VMUSAGE_ALL_EUSERS |
1847      -                            VMUSAGE_COL_EUSERS)) == 0)
1848      -                                continue;
     1881 +                } else {
     1882 +                        /* Skip "other zone" results if not requested */
     1883 +                        if (result->vmu_zoneid != curproc->p_zone->zone_id) {
     1884 +                                if (result->vmu_type == VMUSAGE_ZONE &&
     1885 +                                    (flags & VMUSAGE_ALL_ZONES) == 0)
     1886 +                                        continue;
     1887 +                                if (result->vmu_type == VMUSAGE_PROJECTS &&
     1888 +                                    (flags & (VMUSAGE_ALL_PROJECTS |
     1889 +                                    VMUSAGE_COL_PROJECTS)) == 0)
     1890 +                                        continue;
     1891 +                                if (result->vmu_type == VMUSAGE_TASKS &&
     1892 +                                    (flags & VMUSAGE_ALL_TASKS) == 0)
     1893 +                                        continue;
     1894 +                                if (result->vmu_type == VMUSAGE_RUSERS &&
     1895 +                                    (flags & (VMUSAGE_ALL_RUSERS |
     1896 +                                    VMUSAGE_COL_RUSERS)) == 0)
     1897 +                                        continue;
     1898 +                                if (result->vmu_type == VMUSAGE_EUSERS &&
     1899 +                                    (flags & (VMUSAGE_ALL_EUSERS |
     1900 +                                    VMUSAGE_COL_EUSERS)) == 0)
     1901 +                                        continue;
     1902 +                        }
1849 1903                  }
1850 1904                  count++;
1851 1905                  if (out_result != NULL) {
1852 1906                          if (bufsize < count) {
1853 1907                                  ret = set_errno(EOVERFLOW);
1854 1908                          } else {
1855 1909                                  if (ddi_copyout(result, out_result,
1856 1910                                      sizeof (vmusage_t), cpflg))
1857 1911                                          return (set_errno(EFAULT));
1858 1912                                  out_result++;
1859 1913                          }
1860 1914                  }
1861 1915          }
1862 1916          if (nres != NULL)
1863 1917                  if (ddi_copyout(&count, (void *)nres, sizeof (size_t), cpflg))
1864 1918                          return (set_errno(EFAULT));
1865 1919  
1866 1920          return (ret);
1867 1921  }
1868 1922  
1869 1923  /*
1870 1924   * vm_getusage()
1871 1925   *
1872 1926   * Counts rss and swap by zone, project, task, and/or user.  The flags argument
1873 1927   * determines the type of results structures returned.  Flags requesting
1874 1928   * results from more than one zone are "flattened" to the local zone if the
1875 1929   * caller is not the global zone.
1876 1930   *
1877 1931   * args:
1878 1932   *      flags:  bitmap consisting of one or more of VMUSAGE_*.
1879 1933   *      age:    maximum allowable age (time since counting was done) in
1880 1934   *              seconds of the results.  Results from previous callers are
1881 1935   *              cached in kernel.
1882 1936   *      buf:    pointer to buffer array of vmusage_t.  If NULL, then only nres
1883 1937   *              set on success.
1884 1938   *      nres:   Set to number of vmusage_t structures pointed to by buf
1885 1939   *              before calling vm_getusage().
1886 1940   *              On return 0 (success) or EOVERFLOW, nres is set to the number
1887 1941   *              of result structures returned or attempted to return.
1888 1942   *
1889 1943   * returns 0 on success, -1 on failure:
1890 1944   *      EINTR (interrupted)
1891 1945   *      EOVERFLOW (nres too small for results; nres set to needed value)
1892 1946   *      EINVAL (flags invalid)
1893 1947   *      EFAULT (bad address for buf or nres)
1894 1948   */
1895 1949  int
1896 1950  vm_getusage(uint_t flags, time_t age, vmusage_t *buf, size_t *nres, int cpflg)
1897 1951  {
1898 1952          vmu_entity_t *entity;
1899 1953          vmusage_t *result;
1900 1954          int ret = 0;
1901 1955          int cacherecent = 0;
1902 1956          hrtime_t now;
1903 1957          uint_t flags_orig;
     1958 +        id_t req_zone_id;
1904 1959  
1905 1960          /*
1906 1961           * Non-global zones cannot request system wide and/or collated
1907      -         * results, or the system result, so munge the flags accordingly.
     1962 +         * results, or the system result, or usage of another zone, so munge
     1963 +         * the flags accordingly.
1908 1964           */
1909 1965          flags_orig = flags;
1910 1966          if (curproc->p_zone != global_zone) {
1911 1967                  if (flags & (VMUSAGE_ALL_PROJECTS | VMUSAGE_COL_PROJECTS)) {
1912 1968                          flags &= ~(VMUSAGE_ALL_PROJECTS | VMUSAGE_COL_PROJECTS);
1913 1969                          flags |= VMUSAGE_PROJECTS;
1914 1970                  }
1915 1971                  if (flags & (VMUSAGE_ALL_RUSERS | VMUSAGE_COL_RUSERS)) {
1916 1972                          flags &= ~(VMUSAGE_ALL_RUSERS | VMUSAGE_COL_RUSERS);
1917 1973                          flags |= VMUSAGE_RUSERS;
1918 1974                  }
1919 1975                  if (flags & (VMUSAGE_ALL_EUSERS | VMUSAGE_COL_EUSERS)) {
1920 1976                          flags &= ~(VMUSAGE_ALL_EUSERS | VMUSAGE_COL_EUSERS);
1921 1977                          flags |= VMUSAGE_EUSERS;
1922 1978                  }
1923 1979                  if (flags & VMUSAGE_SYSTEM) {
1924 1980                          flags &= ~VMUSAGE_SYSTEM;
1925 1981                          flags |= VMUSAGE_ZONE;
1926 1982                  }
     1983 +                if (flags & VMUSAGE_A_ZONE) {
     1984 +                        flags &= ~VMUSAGE_A_ZONE;
     1985 +                        flags |= VMUSAGE_ZONE;
     1986 +                }
1927 1987          }
1928 1988  
1929 1989          /* Check for unknown flags */
1930 1990          if ((flags & (~VMUSAGE_MASK)) != 0)
1931 1991                  return (set_errno(EINVAL));
1932 1992  
1933 1993          /* Check for no flags */
1934 1994          if ((flags & VMUSAGE_MASK) == 0)
1935 1995                  return (set_errno(EINVAL));
1936 1996  
     1997 +        /* If requesting results for a specific zone, get the zone ID */
     1998 +        if (flags & VMUSAGE_A_ZONE) {
     1999 +                size_t bufsize;
     2000 +                vmusage_t zreq;
     2001 +
     2002 +                if (ddi_copyin((caddr_t)nres, &bufsize, sizeof (size_t), cpflg))
     2003 +                        return (set_errno(EFAULT));
     2004 +                /* Requested zone ID is passed in buf, so 0 len not allowed */
     2005 +                if (bufsize == 0)
     2006 +                        return (set_errno(EINVAL));
     2007 +                if (ddi_copyin((caddr_t)buf, &zreq, sizeof (vmusage_t), cpflg))
     2008 +                        return (set_errno(EFAULT));
     2009 +                req_zone_id = zreq.vmu_id;
     2010 +        }
     2011 +
1937 2012          mutex_enter(&vmu_data.vmu_lock);
1938 2013          now = gethrtime();
1939 2014  
1940 2015  start:
1941 2016          if (vmu_data.vmu_cache != NULL) {
1942 2017  
1943 2018                  vmu_cache_t *cache;
1944 2019  
1945 2020                  if ((vmu_data.vmu_cache->vmc_timestamp +
1946 2021                      ((hrtime_t)age * NANOSEC)) > now)
1947 2022                          cacherecent = 1;
1948 2023  
1949 2024                  if ((vmu_data.vmu_cache->vmc_flags & flags) == flags &&
1950 2025                      cacherecent == 1) {
1951 2026                          cache = vmu_data.vmu_cache;
1952 2027                          vmu_cache_hold(cache);
1953 2028                          mutex_exit(&vmu_data.vmu_lock);
1954 2029  
1955 2030                          ret = vmu_copyout_results(cache, buf, nres, flags_orig,
1956      -                            cpflg);
     2031 +                            req_zone_id, cpflg);
1957 2032                          mutex_enter(&vmu_data.vmu_lock);
1958 2033                          vmu_cache_rele(cache);
1959 2034                          if (vmu_data.vmu_pending_waiters > 0)
1960 2035                                  cv_broadcast(&vmu_data.vmu_cv);
1961 2036                          mutex_exit(&vmu_data.vmu_lock);
1962 2037                          return (ret);
1963 2038                  }
1964 2039                  /*
1965 2040                   * If the cache is recent, it is likely that there are other
1966 2041                   * consumers of vm_getusage running, so add their flags to the
1967 2042                   * desired flags for the calculation.
1968 2043                   */
1969 2044                  if (cacherecent == 1)
1970 2045                          flags = vmu_data.vmu_cache->vmc_flags | flags;
1971 2046          }
1972 2047          if (vmu_data.vmu_calc_thread == NULL) {
1973 2048  
1974 2049                  vmu_cache_t *cache;
1975 2050  
1976 2051                  vmu_data.vmu_calc_thread = curthread;
1977 2052                  vmu_data.vmu_calc_flags = flags;
1978 2053                  vmu_data.vmu_entities = NULL;
1979 2054                  vmu_data.vmu_nentities = 0;
1980 2055                  if (vmu_data.vmu_pending_waiters > 0)
1981 2056                          vmu_data.vmu_calc_flags |=
1982 2057                              vmu_data.vmu_pending_flags;
1983 2058  
1984 2059                  vmu_data.vmu_pending_flags = 0;
1985 2060                  mutex_exit(&vmu_data.vmu_lock);
1986 2061                  vmu_calculate();
1987 2062                  mutex_enter(&vmu_data.vmu_lock);
1988 2063                  /* copy results to cache */
1989 2064                  if (vmu_data.vmu_cache != NULL)
1990 2065                          vmu_cache_rele(vmu_data.vmu_cache);
1991 2066                  cache = vmu_data.vmu_cache =
1992 2067                      vmu_cache_alloc(vmu_data.vmu_nentities,
1993 2068                      vmu_data.vmu_calc_flags);
1994 2069  
1995 2070                  result = cache->vmc_results;
1996 2071                  for (entity = vmu_data.vmu_entities; entity != NULL;
1997 2072                      entity = entity->vme_next) {
1998 2073                          *result = entity->vme_result;
1999 2074                          result++;
2000 2075                  }
2001 2076                  cache->vmc_timestamp = gethrtime();
2002 2077                  vmu_cache_hold(cache);
2003 2078  
2004 2079                  vmu_data.vmu_calc_flags = 0;
2005 2080                  vmu_data.vmu_calc_thread = NULL;
2006 2081  
2007 2082                  if (vmu_data.vmu_pending_waiters > 0)
2008 2083                          cv_broadcast(&vmu_data.vmu_cv);
2009 2084  
2010 2085                  mutex_exit(&vmu_data.vmu_lock);
2011 2086  
     2087 +                /* update zone's phys. mem. rctl usage */
     2088 +                vmu_update_zone_rctls(cache);
2012 2089                  /* copy cache */
2013      -                ret = vmu_copyout_results(cache, buf, nres, flags_orig, cpflg);
     2090 +                ret = vmu_copyout_results(cache, buf, nres, flags_orig,
     2091 +                    req_zone_id, cpflg);
2014 2092                  mutex_enter(&vmu_data.vmu_lock);
2015 2093                  vmu_cache_rele(cache);
2016 2094                  mutex_exit(&vmu_data.vmu_lock);
2017 2095  
2018 2096                  return (ret);
2019 2097          }
2020 2098          vmu_data.vmu_pending_flags |= flags;
2021 2099          vmu_data.vmu_pending_waiters++;
2022 2100          while (vmu_data.vmu_calc_thread != NULL) {
2023 2101                  if (cv_wait_sig(&vmu_data.vmu_cv,
2024 2102                      &vmu_data.vmu_lock) == 0) {
2025 2103                          vmu_data.vmu_pending_waiters--;
2026 2104                          mutex_exit(&vmu_data.vmu_lock);
2027 2105                          return (set_errno(EINTR));
2028 2106                  }
2029 2107          }
2030 2108          vmu_data.vmu_pending_waiters--;
2031 2109          goto start;
2032 2110  }
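
    A minimal sketch of a userland caller, assuming the getvmusage(2) libc
    wrapper declared in <sys/vm_usage.h>; the zone ID queried here is a
    hypothetical example.  Per the copyin logic above, a VMUSAGE_A_ZONE
    caller passes the requested zone ID in via buf[0].vmu_id, so nres must
    be non-zero on entry:

        #include <sys/vm_usage.h>
        #include <stdio.h>
        #include <errno.h>

        int
        main(void)
        {
                vmusage_t buf[4];
                size_t nres = 4;

                buf[0].vmu_id = 3;      /* hypothetical zone ID to query */

                /* Accept cached results up to 5 seconds old. */
                if (getvmusage(VMUSAGE_A_ZONE, 5, buf, &nres) != 0) {
                        /* On EOVERFLOW, nres holds the needed entry count. */
                        perror("getvmusage");
                        return (1);
                }

                for (size_t i = 0; i < nres; i++) {
                        printf("zone %d rss %llu swap %llu\n",
                            (int)buf[i].vmu_zoneid,
                            (unsigned long long)buf[i].vmu_rss_all,
                            (unsigned long long)buf[i].vmu_swap_all);
                }
                return (0);
        }

    Note that for non-global-zone callers the flag munging above turns
    VMUSAGE_A_ZONE into VMUSAGE_ZONE, so only the global zone can query
    another zone's usage this way.
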
     2111 +
     2112 +#if defined(__x86)
     2113 +/*
     2114 + * Attempt to invalidate all of the pages in the mapping for the given process.
     2115 + */
     2116 +static void
     2117 +map_inval(proc_t *p, struct seg *seg, caddr_t addr, size_t size)
     2118 +{
     2119 +        page_t          *pp;
     2120 +        size_t          psize;
     2121 +        u_offset_t      off;
     2122 +        caddr_t         eaddr;
     2123 +        struct vnode    *vp;
     2124 +        struct segvn_data *svd;
     2125 +        struct hat      *victim_hat;
     2126 +
     2127 +        ASSERT((addr + size) <= (seg->s_base + seg->s_size));
     2128 +
     2129 +        victim_hat = p->p_as->a_hat;
     2130 +        svd = (struct segvn_data *)seg->s_data;
     2131 +        vp = svd->vp;
     2132 +        psize = page_get_pagesize(seg->s_szc);
     2133 +
     2134 +        off = svd->offset + (uintptr_t)(addr - seg->s_base);
     2135 +
     2136 +        for (eaddr = addr + size; addr < eaddr; addr += psize, off += psize) {
     2137 +                pp = page_lookup_nowait(vp, off, SE_SHARED);
     2138 +
     2139 +                if (pp != NULL) {
     2140 +                        /* following logic based on pvn_getdirty() */
     2141 +
     2142 +                        if (pp->p_lckcnt != 0 || pp->p_cowcnt != 0) {
     2143 +                                page_unlock(pp);
     2144 +                                continue;
     2145 +                        }
     2146 +
     2147 +                        page_io_lock(pp);
     2148 +                        hat_page_inval(pp, 0, victim_hat);
     2149 +                        page_io_unlock(pp);
     2150 +
     2151 +                        /*
     2152 +                         * For B_INVALCURONLY-style handling we let
     2153 +                         * page_release call VN_DISPOSE if no one else is using
     2154 +                         * the page.
     2155 +                         *
     2156 +                         * A hat_ismod() check would be useless because:
     2157 +                         * (1) we are not be holding SE_EXCL lock
     2158 +                         * (2) we've not unloaded _all_ translations
     2159 +                         *
     2160 +                         * Let page_release() do the heavy-lifting.
     2161 +                         */
     2162 +                        (void) page_release(pp, 1);
     2163 +                }
     2164 +        }
     2165 +}
     2166 +
     2167 +/*
     2168 + * vm_map_inval()
     2169 + *
     2170 + * Invalidate as many pages as possible within the given mapping for the given
     2171 + * process. addr is expected to be the base address of the mapping and size is
     2172 + * the length of the mapping. In some cases a mapping will encompass an
     2173 + * entire segment, but at least for anon or stack mappings, these will be
     2174 + * regions within a single large segment. Thus, the invalidation is oriented
     2175 + * around a single mapping and not an entire segment.
     2176 + *
     2177 + * SPARC sfmmu hat does not support HAT_CURPROC_PGUNLOAD-style handling, so
     2178 + * this code is only applicable to x86.
     2179 + */
     2180 +int
     2181 +vm_map_inval(pid_t pid, caddr_t addr, size_t size)
     2182 +{
     2183 +        int ret;
     2184 +        int error = 0;
     2185 +        proc_t *p;              /* target proc */
     2186 +        struct as *as;          /* target proc's address space */
     2187 +        struct seg *seg;        /* working segment */
     2188 +
     2189 +        if (curproc->p_zone != global_zone || crgetruid(curproc->p_cred) != 0)
     2190 +                return (set_errno(EPERM));
     2191 +
     2192 +        /* If not a valid mapping address, return an error */
     2193 +        if ((caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK) != addr)
     2194 +                return (set_errno(EINVAL));
     2195 +
     2196 +again:
     2197 +        mutex_enter(&pidlock);
     2198 +        p = prfind(pid);
     2199 +        if (p == NULL) {
     2200 +                mutex_exit(&pidlock);
     2201 +                return (set_errno(ESRCH));
     2202 +        }
     2203 +
     2204 +        mutex_enter(&p->p_lock);
     2205 +        mutex_exit(&pidlock);
     2206 +
     2207 +        if (panicstr != NULL) {
     2208 +                mutex_exit(&p->p_lock);
     2209 +                return (0);
     2210 +        }
     2211 +
     2212 +        as = p->p_as;
     2213 +
     2214 +        /*
     2215 +         * Try to set P_PR_LOCK, which prevents the process "changing shape":
     2216 +         * - blocks fork
     2217 +         * - blocks SIGKILL
     2218 +         * - cannot be a system proc
     2219 +         * - must be a fully created proc
     2220 +         */
     2221 +        ret = sprtrylock_proc(p);
     2222 +        if (ret == -1) {
     2223 +                /* Process in invalid state */
     2224 +                mutex_exit(&p->p_lock);
     2225 +                return (set_errno(ESRCH));
     2226 +        }
     2227 +
     2228 +        if (ret == 1) {
     2229 +                /*
     2230 +                 * P_PR_LOCK is already set. Wait and try again. Because the wait
     2231 +                 * drops p_lock, p may no longer be valid afterward, since the
     2232 +                 * proc may have exited.
     2233 +                 */
     2234 +                sprwaitlock_proc(p);
     2235 +                goto again;
     2236 +        }
     2237 +
     2238 +        /* P_PR_LOCK is now set */
     2239 +        mutex_exit(&p->p_lock);
     2240 +
     2241 +        AS_LOCK_ENTER(as, RW_READER);
     2242 +        if ((seg = as_segat(as, addr)) == NULL) {
     2243 +                AS_LOCK_EXIT(as);
     2244 +                mutex_enter(&p->p_lock);
     2245 +                sprunlock(p);
     2246 +                return (set_errno(ENOMEM));
     2247 +        }
     2248 +
     2249 +        /*
     2250 +         * The invalidation behavior only makes sense for vnode-backed segments.
     2251 +         */
     2252 +        if (seg->s_ops != &segvn_ops) {
     2253 +                AS_LOCK_EXIT(as);
     2254 +                mutex_enter(&p->p_lock);
     2255 +                sprunlock(p);
     2256 +                return (0);
     2257 +        }
     2258 +
     2259 +        /*
     2260 +         * If the mapping is out of bounds of the segment, return an error.
     2261 +         */
     2262 +        if ((addr + size) > (seg->s_base + seg->s_size)) {
     2263 +                AS_LOCK_EXIT(as);
     2264 +                mutex_enter(&p->p_lock);
     2265 +                sprunlock(p);
     2266 +                return (set_errno(EINVAL));
     2267 +        }
     2268 +
     2269 +        /*
     2270 +         * Don't use the MS_INVALCURPROC flag here, since that would
     2271 +         * eventually initiate hat invalidation based on curthread. Because
     2272 +         * we're acting on behalf of a different process, that would
     2273 +         * erroneously invalidate our own process's mappings.
     2274 +         */
     2275 +        error = SEGOP_SYNC(seg, addr, size, 0, (uint_t)MS_ASYNC);
     2276 +        if (error == 0) {
     2277 +                /*
     2278 +                 * Since we didn't invalidate during the sync above, we now
     2279 +                 * try to invalidate all of the pages in the mapping.
     2280 +                 */
     2281 +                map_inval(p, seg, addr, size);
     2282 +        }
     2283 +        AS_LOCK_EXIT(as);
     2284 +
     2285 +        mutex_enter(&p->p_lock);
     2286 +        sprunlock(p);
     2287 +
     2288 +        if (error)
     2289 +                (void) set_errno(error);
     2290 +        return (error);
     2291 +}
     2292 +#endif
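
Beyond the diff itself, here is a minimal userland sketch of how a privileged
consumer (such as a zone memory-cap enforcer, per OS-3088) might reach
vm_map_inval(). It assumes the function is plumbed through the private
rusagesys syscall entry point; the command value and the RUSAGESYS_INVALMAP
name are illustrative assumptions, not taken from this changeset.

    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <errno.h>

    /*
     * Hypothetical wrapper around the private syscall.  Per the kernel checks
     * above, the caller must be root in the global zone, addr must be
     * page-aligned, and [addr, addr + size) must fall within one segment.
     */
    static int
    inval_mapping(pid_t pid, uintptr_t addr, size_t size)
    {
            /* The command code 5 (RUSAGESYS_INVALMAP) is assumed, not from this diff. */
            if (syscall(SYS_rusagesys, 5, pid, addr, size) != 0)
                    return (errno); /* EPERM, ESRCH, EINVAL, or ENOMEM */
            return (0);
    }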
    
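A caller would typically discover candidate mappings from the victim's /proc
map file, which exposes an array of prmap_t records; here is a sketch using
the hypothetical inval_mapping() wrapper above.

    #include <sys/types.h>
    #include <procfs.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Walk /proc/<pid>/map and try to invalidate each mapping.  Failures are
     * ignored here; the kernel simply skips pages it cannot safely drop
     * (e.g. locked or COW pages, as seen in map_inval() above).
     */
    static void
    inval_all_mappings(pid_t pid)
    {
            char path[64];
            prmap_t map;
            int fd;

            (void) snprintf(path, sizeof (path), "/proc/%d/map", (int)pid);
            if ((fd = open(path, O_RDONLY)) < 0)
                    return;

            while (read(fd, &map, sizeof (map)) == sizeof (map)) {
                    /* pr_vaddr is the mapping's base address, pr_size its length. */
                    (void) inval_mapping(pid, map.pr_vaddr, map.pr_size);
            }
            (void) close(fd);
    }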
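
Separately, the waiter logic in vm_getusage() above returns EINTR when
cv_wait_sig() is interrupted, so a getvmusage(2) consumer is expected to
retry. A minimal sketch, where the 5-second maximum age is an arbitrary
choice that lets the call reuse a recent cached calculation:

    #include <sys/vm_usage.h>
    #include <errno.h>

    /*
     * Fetch per-zone usage, retrying if the wait for the calculating thread
     * is interrupted by a signal.  *nres is an in/out count of vmusage_t
     * entries: the capacity of buf on input, the number of results on output.
     */
    static int
    zone_usage(vmusage_t *buf, size_t *nres)
    {
            int ret;

            do {
                    ret = getvmusage(VMUSAGE_ZONE, 5, buf, nres);
            } while (ret != 0 && errno == EINTR);

            return (ret);
    }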