OS-5078 illumos#6514 broke vm_usage and lx proc
OS-2969 vm_getusage syscall accurate zone RSS is overcounting
OS-3088 need a lighterweight page invalidation mechanism for zone memcap
OS-881 To workaround OS-580 add support to only invalidate mappings from a single process
OS-750 improve RUSAGESYS_GETVMUSAGE for zoneadmd
OS-399 zone phys. mem. cap should be a rctl and have associated kstat
--- old/usr/src/uts/common/vm/vm_usage.c
+++ new/usr/src/uts/common/vm/vm_usage.c
1 1 /*
2 2 * CDDL HEADER START
3 3 *
4 4 * The contents of this file are subject to the terms of the
5 5 * Common Development and Distribution License (the "License").
6 6 * You may not use this file except in compliance with the License.
7 7 *
8 8 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
9 9 * or http://www.opensolaris.org/os/licensing.
10 10 * See the License for the specific language governing permissions
11 11 * and limitations under the License.
12 12 *
13 13 * When distributing Covered Code, include this CDDL HEADER in each
14 14 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
15 15 * If applicable, add the following below this CDDL HEADER, with the
16 16 * fields enclosed by brackets "[]" replaced with your own identifying
17 17 * information: Portions Copyright [yyyy] [name of copyright owner]
18 18 *
19 19 * CDDL HEADER END
20 20 */
21 21
22 22 /*
23 23 * Copyright 2009 Sun Microsystems, Inc. All rights reserved.
24 24 * Use is subject to license terms.
25 25 */
26 26
27 27 /*
28 + * Copyright 2016, Joyent, Inc.
29 + */
30 +
31 +/*
28 32 * vm_usage
29 33 *
30 34 * This file implements the getvmusage() private system call.
31 35 * getvmusage() counts the amount of resident memory pages and swap
32 36 * reserved by the specified process collective. A "process collective" is
33 37 * the set of processes owned by a particular zone, project, task, or user.
34 38 *
35 39 * rss and swap are counted so that for a given process collective, a page is
36 40 * only counted once. For example, this means that if multiple processes in
37 41 * the same project map the same page, then the project will only be charged
38 42 * once for that page. On the other hand, if two processes in different
39 43 * projects map the same page, then both projects will be charged
40 44 * for the page.
41 45 *
42 46 * The vm_getusage() calculation is implemented so that the first thread
43 47 * performs the rss/swap counting. Other callers will wait for that thread to
44 48 * finish, copying the results. This enables multiple rcapds and prstats to
45 49 * consume data from the same calculation. The results are also cached so that
46 50 * a caller interested in recent results can just copy them instead of starting
47 51 * a new calculation. The caller passes the maximum age (in seconds) of the
48 52 * data. If the cached data is young enough, the cache is copied, otherwise,
49 53 * a new calculation is executed and the cache is replaced with the new
50 54 * data.
51 55 *
52 56 * The rss calculation for each process collective is as follows:
53 57 *
54 58 * - Inspect flags, determine if counting rss for zones, projects, tasks,
55 59 * and/or users.
56 60 * - For each proc:
57 61 * - Figure out proc's collectives (zone, project, task, and/or user).
58 62 * - For each seg in proc's address space:
59 63 * - If seg is private:
60 64 * - Lookup anons in the amp.
61 65 * - For incore pages not previously visited for each of the
62 66 * proc's collectives, add incore pagesize to each
63 67 * collective.
64 68 * Anons with a refcnt of 1 can be assumed to be not
65 69 * previously visited.
66 70 * - For address ranges without anons in the amp:
67 71 * - Lookup pages in underlying vnode.
68 72 * - For incore pages not previously visited for
69 73 * each of the proc's collectives, add incore
70 74 * pagesize to each collective.
71 75 * - If seg is shared:
72 76 * - Lookup pages in the shared amp or vnode.
73 77 * - For incore pages not previously visited for each of
74 78 * the proc's collectives, add incore pagesize to each
75 79 * collective.
76 80 *
77 81 * Swap is reserved by private segments, and shared anonymous segments.
78 82 * The only shared anon segments which do not reserve swap are ISM segments
79 83 * and schedctl segments, both of which can be identified by having
80 84 * amp->swresv == 0.
81 85 *
82 86 * The swap calculation for each collective is as follows:
83 87 *
84 88 * - Inspect flags, determine if counting rss for zones, projects, tasks,
85 89 * and/or users.
86 90 * - For each proc:
87 91 * - Figure out proc's collectives (zone, project, task, and/or user).
88 92 * - For each seg in proc's address space:
89 93 * - If seg is private:
90 94 * - Add svd->swresv pages to swap count for each of the
91 95 * proc's collectives.
92 96 * - If seg is anon, shared, and amp->swresv != 0
93 97 * - For address ranges in amp not previously visited for
94 98 * each of the proc's collectives, add size of address
95 99 * range to the swap count for each collective.
96 100 *
97 101 * These two calculations are done simultaneously, with most of the work
98 102 * being done in vmu_calculate_seg(). The results of the calculation are
99 103 * copied into "vmu_data.vmu_cache_results".
100 104 *
101 105 * To perform the calculation, various things are tracked and cached:
102 106 *
103 107 * - incore/not-incore page ranges for all vnodes.
104 108 * (vmu_data.vmu_all_vnodes_hash)
105 109 * This eliminates looking up the same page more than once.
106 110 *
107 111 * - incore/not-incore page ranges for all shared amps.
108 112 * (vmu_data.vmu_all_amps_hash)
109 113 * This eliminates looking up the same page more than once.
110 114 *
111 115 * - visited page ranges for each collective.
112 116 * - per vnode (entity->vme_vnode_hash)
113 117 * - per shared amp (entity->vme_amp_hash)
114 118 * For accurate counting of map-shared and COW-shared pages.
115 119 *
116 120 * - visited private anons (refcnt > 1) for each collective.
117 121 * (entity->vme_anon_hash)
118 122 * For accurate counting of COW-shared pages.
119 123 *
120 124 * The common accounting structure is the vmu_entity_t, which represents
121 125 * collectives:
122 126 *
123 127 * - A zone.
124 128 * - A project, task, or user within a zone.
125 129 * - The entire system (vmu_data.vmu_system).
126 130 * - Each collapsed (col) project and user. This means a given projid or
127 131 * uid, regardless of which zone the process is in. For instance,
128 132 * project 0 in the global zone and project 0 in a non global zone are
129 133 * the same collapsed project.
130 134 *
131 135 * Each entity structure tracks which pages have been already visited for
132 136 * that entity (via previously inspected processes) so that these pages are
133 137 * not double counted.
134 138 */
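
For reference, a minimal userland sketch of the consumer side of this interface follows. It assumes the getvmusage() wrapper and the vmusage_t layout declared in <sys/vm_usage.h> (flags, a maximum acceptable result age in seconds, a result buffer, and a result count passed by reference); buffer sizing and EOVERFLOW retry handling are deliberately simplified here.

#include <sys/vm_usage.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	size_t nres = 1024;	/* capacity in; number of results out */
	vmusage_t *buf = calloc(nres, sizeof (vmusage_t));
	size_t i;

	if (buf == NULL)
		return (1);

	/*
	 * Per-zone usage; an age of 5 seconds lets this call reuse a
	 * recent cached calculation instead of forcing a new scan.
	 */
	if (getvmusage(VMUSAGE_ALL_ZONES, 5, buf, &nres) != 0) {
		(void) fprintf(stderr, "getvmusage: %s\n", strerror(errno));
		return (1);
	}

	for (i = 0; i < nres; i++) {
		(void) printf("zone %d: rss %llu bytes, swap %llu bytes\n",
		    (int)buf[i].vmu_id,
		    (unsigned long long)buf[i].vmu_rss_all,
		    (unsigned long long)buf[i].vmu_swap_all);
	}
	free(buf);
	return (0);
}

As described in the comment above, several monitors (e.g. rcapd and prstat) that call with a nonzero age can share the results of a single calculation.
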
135 139
136 140 #include <sys/errno.h>
137 141 #include <sys/types.h>
138 142 #include <sys/zone.h>
139 143 #include <sys/proc.h>
140 144 #include <sys/project.h>
141 145 #include <sys/task.h>
142 146 #include <sys/thread.h>
143 147 #include <sys/time.h>
144 148 #include <sys/mman.h>
145 149 #include <sys/modhash.h>
146 150 #include <sys/modhash_impl.h>
147 151 #include <sys/shm.h>
148 152 #include <sys/swap.h>
149 153 #include <sys/synch.h>
150 154 #include <sys/systm.h>
151 155 #include <sys/var.h>
152 156 #include <sys/vm_usage.h>
153 157 #include <sys/zone.h>
154 158 #include <sys/sunddi.h>
155 159 #include <sys/avl.h>
156 160 #include <vm/anon.h>
157 161 #include <vm/as.h>
158 162 #include <vm/seg_vn.h>
159 163 #include <vm/seg_spt.h>
160 164
161 165 #define VMUSAGE_HASH_SIZE 512
162 166
163 167 #define VMUSAGE_TYPE_VNODE 1
164 168 #define VMUSAGE_TYPE_AMP 2
165 169 #define VMUSAGE_TYPE_ANON 3
166 170
167 171 #define VMUSAGE_BOUND_UNKNOWN 0
168 172 #define VMUSAGE_BOUND_INCORE 1
169 173 #define VMUSAGE_BOUND_NOT_INCORE 2
170 174
171 175 #define ISWITHIN(node, addr) ((node)->vmb_start <= addr && \
172 176 (node)->vmb_end >= addr ? 1 : 0)
173 177
174 178 /*
175 179 * bounds for vnodes and shared amps
176 180 * Each bound is either entirely incore, entirely not in core, or
177 181 * entirely unknown. bounds are stored in an avl tree sorted by start member
178 182 * when in use, otherwise (free or temporary lists) they're strung
179 183 * together off of vmb_next.
180 184 */
181 185 typedef struct vmu_bound {
182 186 avl_node_t vmb_node;
183 187 struct vmu_bound *vmb_next; /* NULL in tree else on free or temp list */
184 188 pgcnt_t vmb_start; /* page offset in vnode/amp on which bound starts */
185 189 pgcnt_t vmb_end; /* page offset in vnode/amp on which bound ends */
186 190 char vmb_type; /* One of VMUSAGE_BOUND_* */
187 191 } vmu_bound_t;
188 192
189 193 /*
190 194 * hash of visited objects (vnodes or shared amps)
191 195 * key is address of vnode or amp. Bounds lists known incore/non-incore
192 196 * bounds for vnode/amp.
193 197 */
194 198 typedef struct vmu_object {
195 199 struct vmu_object *vmo_next; /* free list */
196 200 caddr_t vmo_key;
197 201 short vmo_type;
198 202 avl_tree_t vmo_bounds;
199 203 } vmu_object_t;
200 204
201 205 /*
202 206 * Entity by which to count results.
203 207 *
204 208 * The entity structure keeps the current rss/swap counts for each entity
205 209 * (zone, project, etc), and hashes of vm structures that have already
206 210 * been visited for the entity.
207 211 *
208 212 * vme_next: links the list of all entities currently being counted by
209 213 * vmu_calculate().
210 214 *
211 215 * vme_next_calc: links the list of entities related to the current process
212 216 * being counted by vmu_calculate_proc().
213 217 *
214 218 * vmu_calculate_proc() walks all processes. For each process, it makes a
215 219 * list of the entities related to that process using vme_next_calc. This
216 220 * list changes each time vmu_calculate_proc() is called.
217 221 *
218 222 */
219 223 typedef struct vmu_entity {
220 224 struct vmu_entity *vme_next;
221 225 struct vmu_entity *vme_next_calc;
222 226 mod_hash_t *vme_vnode_hash; /* vnodes visited for entity */
223 227 mod_hash_t *vme_amp_hash; /* shared amps visited for entity */
224 228 mod_hash_t *vme_anon_hash; /* COW anons visited for entity */
225 229 vmusage_t vme_result; /* identifies entity and results */
226 230 } vmu_entity_t;
227 231
228 232 /*
229 233 * Hash of entities visited within a zone, and an entity for the zone
230 234 * itself.
231 235 */
232 236 typedef struct vmu_zone {
233 237 struct vmu_zone *vmz_next; /* free list */
234 238 id_t vmz_id;
235 239 vmu_entity_t *vmz_zone;
236 240 mod_hash_t *vmz_projects_hash;
237 241 mod_hash_t *vmz_tasks_hash;
238 242 mod_hash_t *vmz_rusers_hash;
239 243 mod_hash_t *vmz_eusers_hash;
240 244 } vmu_zone_t;
241 245
242 246 /*
243 247 * Cache of results from last calculation
244 248 */
245 249 typedef struct vmu_cache {
246 250 vmusage_t *vmc_results; /* Results from last call to */
247 251 /* vm_getusage(). */
248 252 uint64_t vmc_nresults; /* Count of cached results */
249 253 uint64_t vmc_refcnt; /* refcnt for free */
250 254 uint_t vmc_flags; /* Flags for vm_getusage() */
251 255 hrtime_t vmc_timestamp; /* when cache was created */
252 256 } vmu_cache_t;
253 257
254 258 /*
255 259 * top level rss info for the system
256 260 */
257 261 typedef struct vmu_data {
258 262 kmutex_t vmu_lock; /* Protects vmu_data */
259 263 kcondvar_t vmu_cv; /* Used to signal threads */
260 264 /* Waiting for */
261 265 /* Rss_calc_thread to finish */
262 266 vmu_entity_t *vmu_system; /* Entity for tracking */
263 267 /* rss/swap for all processes */
264 268 /* in all zones */
265 269 mod_hash_t *vmu_zones_hash; /* Zones visited */
266 270 mod_hash_t *vmu_projects_col_hash; /* These *_col_hash hashes */
267 271 mod_hash_t *vmu_rusers_col_hash; /* keep track of entities, */
268 272 mod_hash_t *vmu_eusers_col_hash; /* ignoring zoneid, in order */
269 273 /* to implement VMUSAGE_COL_* */
270 274 /* flags, which aggregate by */
271 275 /* project or user regardless */
272 276 /* of zoneid. */
273 277 mod_hash_t *vmu_all_vnodes_hash; /* System wide visited vnodes */
274 278 /* to track incore/not-incore */
275 279 mod_hash_t *vmu_all_amps_hash; /* System wide visited shared */
276 280 /* amps to track incore/not- */
277 281 /* incore */
278 282 vmu_entity_t *vmu_entities; /* Linked list of entities */
279 283 size_t vmu_nentities; /* Count of entities in list */
280 284 vmu_cache_t *vmu_cache; /* Cached results */
281 285 kthread_t *vmu_calc_thread; /* NULL, or thread running */
282 286 /* vmu_calculate() */
283 287 uint_t vmu_calc_flags; /* Flags being used by */
284 288 /* currently running calc */
285 289 /* thread */
286 290 uint_t vmu_pending_flags; /* Flags of vm_getusage() */
287 291 /* threads waiting for */
288 292 /* calc thread to finish */
289 293 uint_t vmu_pending_waiters; /* Number of threads waiting */
290 294 /* for calc thread */
291 295 vmu_bound_t *vmu_free_bounds;
292 296 vmu_object_t *vmu_free_objects;
293 297 vmu_entity_t *vmu_free_entities;
294 298 vmu_zone_t *vmu_free_zones;
295 299 } vmu_data_t;
296 300
297 301 extern struct as kas;
298 302 extern proc_t *practive;
299 303 extern zone_t *global_zone;
300 304 extern struct seg_ops segvn_ops;
301 305 extern struct seg_ops segspt_shmops;
302 306
303 307 static vmu_data_t vmu_data;
304 308 static kmem_cache_t *vmu_bound_cache;
305 309 static kmem_cache_t *vmu_object_cache;
306 310
307 311 /*
308 312 * Comparison routine for AVL tree. We base our comparison on vmb_start.
309 313 */
310 314 static int
311 315 bounds_cmp(const void *bnd1, const void *bnd2)
312 316 {
313 317 const vmu_bound_t *bound1 = bnd1;
314 318 const vmu_bound_t *bound2 = bnd2;
315 319
316 320 if (bound1->vmb_start == bound2->vmb_start) {
317 321 return (0);
318 322 }
319 323 if (bound1->vmb_start < bound2->vmb_start) {
320 324 return (-1);
321 325 }
322 326
323 327 return (1);
324 328 }
325 329
326 330 /*
327 331 * Save a bound on the free list.
328 332 */
329 333 static void
330 334 vmu_free_bound(vmu_bound_t *bound)
331 335 {
332 336 bound->vmb_next = vmu_data.vmu_free_bounds;
333 337 bound->vmb_start = 0;
334 338 bound->vmb_end = 0;
335 339 bound->vmb_type = 0;
336 340 vmu_data.vmu_free_bounds = bound;
337 341 }
338 342
339 343 /*
340 344 * Free an object, and all visited bound info.
341 345 */
342 346 static void
343 347 vmu_free_object(mod_hash_val_t val)
344 348 {
345 349 vmu_object_t *obj = (vmu_object_t *)val;
346 350 avl_tree_t *tree = &(obj->vmo_bounds);
347 351 vmu_bound_t *bound;
348 352 void *cookie = NULL;
349 353
350 354 while ((bound = avl_destroy_nodes(tree, &cookie)) != NULL)
351 355 vmu_free_bound(bound);
352 356 avl_destroy(tree);
353 357
354 358 obj->vmo_type = 0;
355 359 obj->vmo_next = vmu_data.vmu_free_objects;
356 360 vmu_data.vmu_free_objects = obj;
357 361 }
358 362
359 363 /*
360 364 * Free an entity, and hashes of visited objects for that entity.
361 365 */
362 366 static void
363 367 vmu_free_entity(mod_hash_val_t val)
364 368 {
365 369 vmu_entity_t *entity = (vmu_entity_t *)val;
366 370
367 371 if (entity->vme_vnode_hash != NULL)
368 372 i_mod_hash_clear_nosync(entity->vme_vnode_hash);
369 373 if (entity->vme_amp_hash != NULL)
370 374 i_mod_hash_clear_nosync(entity->vme_amp_hash);
371 375 if (entity->vme_anon_hash != NULL)
372 376 i_mod_hash_clear_nosync(entity->vme_anon_hash);
373 377
374 378 entity->vme_next = vmu_data.vmu_free_entities;
375 379 vmu_data.vmu_free_entities = entity;
376 380 }
377 381
378 382 /*
379 383 * Free zone entity, and all hashes of entities inside that zone,
380 384 * which are projects, tasks, and users.
381 385 */
382 386 static void
383 387 vmu_free_zone(mod_hash_val_t val)
384 388 {
385 389 vmu_zone_t *zone = (vmu_zone_t *)val;
386 390
387 391 if (zone->vmz_zone != NULL) {
388 392 vmu_free_entity((mod_hash_val_t)zone->vmz_zone);
389 393 zone->vmz_zone = NULL;
390 394 }
391 395 if (zone->vmz_projects_hash != NULL)
392 396 i_mod_hash_clear_nosync(zone->vmz_projects_hash);
393 397 if (zone->vmz_tasks_hash != NULL)
394 398 i_mod_hash_clear_nosync(zone->vmz_tasks_hash);
395 399 if (zone->vmz_rusers_hash != NULL)
396 400 i_mod_hash_clear_nosync(zone->vmz_rusers_hash);
397 401 if (zone->vmz_eusers_hash != NULL)
398 402 i_mod_hash_clear_nosync(zone->vmz_eusers_hash);
399 403 zone->vmz_next = vmu_data.vmu_free_zones;
400 404 vmu_data.vmu_free_zones = zone;
401 405 }
402 406
403 407 /*
404 408 * Initialize synchronization primitives and hashes for system-wide tracking
405 409 * of visited vnodes and shared amps. Initialize results cache.
406 410 */
407 411 void
408 412 vm_usage_init()
409 413 {
410 414 mutex_init(&vmu_data.vmu_lock, NULL, MUTEX_DEFAULT, NULL);
411 415 cv_init(&vmu_data.vmu_cv, NULL, CV_DEFAULT, NULL);
412 416
413 417 vmu_data.vmu_system = NULL;
414 418 vmu_data.vmu_zones_hash = NULL;
415 419 vmu_data.vmu_projects_col_hash = NULL;
416 420 vmu_data.vmu_rusers_col_hash = NULL;
417 421 vmu_data.vmu_eusers_col_hash = NULL;
418 422
419 423 vmu_data.vmu_free_bounds = NULL;
420 424 vmu_data.vmu_free_objects = NULL;
421 425 vmu_data.vmu_free_entities = NULL;
422 426 vmu_data.vmu_free_zones = NULL;
423 427
424 428 vmu_data.vmu_all_vnodes_hash = mod_hash_create_ptrhash(
425 429 "vmusage vnode hash", VMUSAGE_HASH_SIZE, vmu_free_object,
426 430 sizeof (vnode_t));
427 431 vmu_data.vmu_all_amps_hash = mod_hash_create_ptrhash(
428 432 "vmusage amp hash", VMUSAGE_HASH_SIZE, vmu_free_object,
429 433 sizeof (struct anon_map));
430 434 vmu_data.vmu_projects_col_hash = mod_hash_create_idhash(
431 435 "vmusage collapsed project hash", VMUSAGE_HASH_SIZE,
432 436 vmu_free_entity);
433 437 vmu_data.vmu_rusers_col_hash = mod_hash_create_idhash(
434 438 "vmusage collapsed ruser hash", VMUSAGE_HASH_SIZE,
435 439 vmu_free_entity);
436 440 vmu_data.vmu_eusers_col_hash = mod_hash_create_idhash(
437 441 "vmusage collpased euser hash", VMUSAGE_HASH_SIZE,
438 442 vmu_free_entity);
439 443 vmu_data.vmu_zones_hash = mod_hash_create_idhash(
440 444 "vmusage zone hash", VMUSAGE_HASH_SIZE, vmu_free_zone);
441 445
442 446 vmu_bound_cache = kmem_cache_create("vmu_bound_cache",
443 447 sizeof (vmu_bound_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
444 448 vmu_object_cache = kmem_cache_create("vmu_object_cache",
445 449 sizeof (vmu_object_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
446 450
447 451 vmu_data.vmu_entities = NULL;
448 452 vmu_data.vmu_nentities = 0;
449 453
450 454 vmu_data.vmu_cache = NULL;
451 455 vmu_data.vmu_calc_thread = NULL;
452 456 vmu_data.vmu_calc_flags = 0;
453 457 vmu_data.vmu_pending_flags = 0;
454 458 vmu_data.vmu_pending_waiters = 0;
455 459 }
456 460
457 461 /*
458 462 * Allocate hashes for tracking vm objects visited for an entity.
459 463 * Update list of entities.
460 464 */
461 465 static vmu_entity_t *
462 466 vmu_alloc_entity(id_t id, int type, id_t zoneid)
463 467 {
464 468 vmu_entity_t *entity;
465 469
466 470 if (vmu_data.vmu_free_entities != NULL) {
467 471 entity = vmu_data.vmu_free_entities;
468 472 vmu_data.vmu_free_entities =
469 473 vmu_data.vmu_free_entities->vme_next;
470 474 bzero(&entity->vme_result, sizeof (vmusage_t));
471 475 } else {
472 476 entity = kmem_zalloc(sizeof (vmu_entity_t), KM_SLEEP);
473 477 }
474 478 entity->vme_result.vmu_id = id;
475 479 entity->vme_result.vmu_zoneid = zoneid;
476 480 entity->vme_result.vmu_type = type;
477 481
478 482 if (entity->vme_vnode_hash == NULL)
479 483 entity->vme_vnode_hash = mod_hash_create_ptrhash(
480 484 "vmusage vnode hash", VMUSAGE_HASH_SIZE, vmu_free_object,
481 485 sizeof (vnode_t));
482 486
483 487 if (entity->vme_amp_hash == NULL)
484 488 entity->vme_amp_hash = mod_hash_create_ptrhash(
485 489 "vmusage amp hash", VMUSAGE_HASH_SIZE, vmu_free_object,
486 490 sizeof (struct anon_map));
487 491
488 492 if (entity->vme_anon_hash == NULL)
489 493 entity->vme_anon_hash = mod_hash_create_ptrhash(
490 494 "vmusage anon hash", VMUSAGE_HASH_SIZE,
491 495 mod_hash_null_valdtor, sizeof (struct anon));
492 496
493 497 entity->vme_next = vmu_data.vmu_entities;
494 498 vmu_data.vmu_entities = entity;
495 499 vmu_data.vmu_nentities++;
496 500
497 501 return (entity);
498 502 }
499 503
500 504 /*
501 505 * Allocate a zone entity, and hashes for tracking visited vm objects
502 506 * for projects, tasks, and users within that zone.
503 507 */
504 508 static vmu_zone_t *
505 509 vmu_alloc_zone(id_t id)
506 510 {
507 511 vmu_zone_t *zone;
508 512
509 513 if (vmu_data.vmu_free_zones != NULL) {
510 514 zone = vmu_data.vmu_free_zones;
511 515 vmu_data.vmu_free_zones =
512 516 vmu_data.vmu_free_zones->vmz_next;
513 517 zone->vmz_next = NULL;
514 518 zone->vmz_zone = NULL;
515 519 } else {
516 520 zone = kmem_zalloc(sizeof (vmu_zone_t), KM_SLEEP);
517 521 }
518 522
519 523 zone->vmz_id = id;
520 524
521 - if ((vmu_data.vmu_calc_flags & (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES)) != 0)
525 + if ((vmu_data.vmu_calc_flags &
526 + (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_A_ZONE)) != 0)
522 527 zone->vmz_zone = vmu_alloc_entity(id, VMUSAGE_ZONE, id);
523 528
524 529 if ((vmu_data.vmu_calc_flags & (VMUSAGE_PROJECTS |
525 530 VMUSAGE_ALL_PROJECTS)) != 0 && zone->vmz_projects_hash == NULL)
526 531 zone->vmz_projects_hash = mod_hash_create_idhash(
527 532 "vmusage project hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
528 533
529 534 if ((vmu_data.vmu_calc_flags & (VMUSAGE_TASKS | VMUSAGE_ALL_TASKS))
530 535 != 0 && zone->vmz_tasks_hash == NULL)
531 536 zone->vmz_tasks_hash = mod_hash_create_idhash(
532 537 "vmusage task hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
533 538
534 539 if ((vmu_data.vmu_calc_flags & (VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS))
535 540 != 0 && zone->vmz_rusers_hash == NULL)
536 541 zone->vmz_rusers_hash = mod_hash_create_idhash(
537 542 "vmusage ruser hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
538 543
539 544 if ((vmu_data.vmu_calc_flags & (VMUSAGE_EUSERS | VMUSAGE_ALL_EUSERS))
540 545 != 0 && zone->vmz_eusers_hash == NULL)
541 546 zone->vmz_eusers_hash = mod_hash_create_idhash(
542 547 "vmusage euser hash", VMUSAGE_HASH_SIZE, vmu_free_entity);
543 548
544 549 return (zone);
545 550 }
546 551
547 552 /*
548 553 * Allocate a structure for tracking visited bounds for a vm object.
549 554 */
550 555 static vmu_object_t *
551 556 vmu_alloc_object(caddr_t key, int type)
552 557 {
553 558 vmu_object_t *object;
554 559
555 560 if (vmu_data.vmu_free_objects != NULL) {
556 561 object = vmu_data.vmu_free_objects;
557 562 vmu_data.vmu_free_objects =
558 563 vmu_data.vmu_free_objects->vmo_next;
559 564 } else {
560 565 object = kmem_cache_alloc(vmu_object_cache, KM_SLEEP);
561 566 }
562 567
563 568 object->vmo_next = NULL;
564 569 object->vmo_key = key;
565 570 object->vmo_type = type;
566 571 avl_create(&(object->vmo_bounds), bounds_cmp, sizeof (vmu_bound_t), 0);
567 572
568 573 return (object);
569 574 }
570 575
571 576 /*
572 577 * Allocate and return a bound structure.
573 578 */
574 579 static vmu_bound_t *
575 580 vmu_alloc_bound()
576 581 {
577 582 vmu_bound_t *bound;
578 583
579 584 if (vmu_data.vmu_free_bounds != NULL) {
580 585 bound = vmu_data.vmu_free_bounds;
581 586 vmu_data.vmu_free_bounds =
582 587 vmu_data.vmu_free_bounds->vmb_next;
583 588 } else {
584 589 bound = kmem_cache_alloc(vmu_bound_cache, KM_SLEEP);
585 590 }
586 591
587 592 bound->vmb_next = NULL;
588 593 bound->vmb_start = 0;
589 594 bound->vmb_end = 0;
590 595 bound->vmb_type = 0;
591 596 return (bound);
592 597 }
593 598
594 599 /*
595 600 * vmu_find_insert_* functions implement hash lookup or allocate and
596 601 * insert operations.
597 602 */
598 603 static vmu_object_t *
599 604 vmu_find_insert_object(mod_hash_t *hash, caddr_t key, uint_t type)
600 605 {
601 606 int ret;
602 607 vmu_object_t *object;
603 608
604 609 ret = i_mod_hash_find_nosync(hash, (mod_hash_key_t)key,
605 610 (mod_hash_val_t *)&object);
606 611 if (ret != 0) {
607 612 object = vmu_alloc_object(key, type);
608 613 ret = i_mod_hash_insert_nosync(hash, (mod_hash_key_t)key,
609 614 (mod_hash_val_t)object, (mod_hash_hndl_t)0);
610 615 ASSERT(ret == 0);
611 616 }
612 617 return (object);
613 618 }
614 619
615 620 static int
616 621 vmu_find_insert_anon(mod_hash_t *hash, caddr_t key)
617 622 {
618 623 int ret;
619 624 caddr_t val;
620 625
621 626 ret = i_mod_hash_find_nosync(hash, (mod_hash_key_t)key,
622 627 (mod_hash_val_t *)&val);
623 628
624 629 if (ret == 0)
625 630 return (0);
626 631
627 632 ret = i_mod_hash_insert_nosync(hash, (mod_hash_key_t)key,
628 633 (mod_hash_val_t)key, (mod_hash_hndl_t)0);
629 634
630 635 ASSERT(ret == 0);
631 636
632 637 return (1);
633 638 }
634 639
635 640 static vmu_entity_t *
636 641 vmu_find_insert_entity(mod_hash_t *hash, id_t id, uint_t type, id_t zoneid)
637 642 {
638 643 int ret;
639 644 vmu_entity_t *entity;
640 645
641 646 ret = i_mod_hash_find_nosync(hash, (mod_hash_key_t)(uintptr_t)id,
642 647 (mod_hash_val_t *)&entity);
643 648 if (ret != 0) {
644 649 entity = vmu_alloc_entity(id, type, zoneid);
645 650 ret = i_mod_hash_insert_nosync(hash,
646 651 (mod_hash_key_t)(uintptr_t)id, (mod_hash_val_t)entity,
647 652 (mod_hash_hndl_t)0);
648 653 ASSERT(ret == 0);
649 654 }
650 655 return (entity);
651 656 }
652 657
653 658
654 659
655 660
656 661 /*
657 662 * Returns list of object bounds between start and end. New bounds inserted
658 663 * by this call are given type.
659 664 *
660 665 * Returns the number of pages covered if new bounds are created. Returns 0
661 666 * if region between start/end consists of all existing bounds.
662 667 */
663 668 static pgcnt_t
664 669 vmu_insert_lookup_object_bounds(vmu_object_t *ro, pgcnt_t start, pgcnt_t
665 670 end, char type, vmu_bound_t **first, vmu_bound_t **last)
666 671 {
667 672 avl_tree_t *tree = &(ro->vmo_bounds);
668 673 avl_index_t where;
669 674 vmu_bound_t *walker, *tmp;
670 675 pgcnt_t ret = 0;
671 676
672 677 ASSERT(start <= end);
673 678
674 679 *first = *last = NULL;
675 680
676 681 tmp = vmu_alloc_bound();
677 682 tmp->vmb_start = start;
678 683 tmp->vmb_type = type;
679 684
680 685 /* Hopelessly optimistic case. */
681 686 if (walker = avl_find(tree, tmp, &where)) {
682 687 /* We got lucky. */
683 688 vmu_free_bound(tmp);
684 689 *first = walker;
685 690 }
686 691
687 692 if (walker == NULL) {
688 693 /* Is start in the previous node? */
689 694 walker = avl_nearest(tree, where, AVL_BEFORE);
690 695 if (walker != NULL) {
691 696 if (ISWITHIN(walker, start)) {
692 697 /* We found start. */
693 698 vmu_free_bound(tmp);
694 699 *first = walker;
695 700 }
696 701 }
697 702 }
698 703
699 704 /*
700 705 * At this point, if *first is still NULL, then we
701 706 * didn't get a direct hit and start isn't covered
702 707 * by the previous node. We know that the next node
703 708 * must have a greater start value than we require
704 709 * because avl_find tells us where the AVL routines would
705 710 * insert our new node. We have some gap between the
706 711 * start we want and the next node.
707 712 */
708 713 if (*first == NULL) {
709 714 walker = avl_nearest(tree, where, AVL_AFTER);
710 715 if (walker != NULL && walker->vmb_start <= end) {
711 716 /* Fill the gap. */
712 717 tmp->vmb_end = walker->vmb_start - 1;
713 718 *first = tmp;
714 719 } else {
715 720 /* We have a gap over [start, end]. */
716 721 tmp->vmb_end = end;
717 722 *first = *last = tmp;
718 723 }
719 724 ret += tmp->vmb_end - tmp->vmb_start + 1;
720 725 avl_insert(tree, tmp, where);
721 726 }
722 727
723 728 ASSERT(*first != NULL);
724 729
725 730 if (*last != NULL) {
726 731 /* We're done. */
727 732 return (ret);
728 733 }
729 734
730 735 /*
731 736 * If we are here we still need to set *last and
732 737 * that may involve filling in some gaps.
733 738 */
734 739 *last = *first;
735 740 for (;;) {
736 741 if (ISWITHIN(*last, end)) {
737 742 /* We're done. */
738 743 break;
739 744 }
740 745 walker = AVL_NEXT(tree, *last);
741 746 if (walker == NULL || walker->vmb_start > end) {
742 747 /* Bottom or mid tree with gap. */
743 748 tmp = vmu_alloc_bound();
744 749 tmp->vmb_start = (*last)->vmb_end + 1;
745 750 tmp->vmb_end = end;
746 751 tmp->vmb_type = type;
747 752 ret += tmp->vmb_end - tmp->vmb_start + 1;
748 753 avl_insert_here(tree, tmp, *last, AVL_AFTER);
749 754 *last = tmp;
750 755 break;
751 756 } else {
752 757 if ((*last)->vmb_end + 1 != walker->vmb_start) {
753 758 /* Non-contiguous. */
754 759 tmp = vmu_alloc_bound();
755 760 tmp->vmb_start = (*last)->vmb_end + 1;
756 761 tmp->vmb_end = walker->vmb_start - 1;
757 762 tmp->vmb_type = type;
758 763 ret += tmp->vmb_end - tmp->vmb_start + 1;
759 764 avl_insert_here(tree, tmp, *last, AVL_AFTER);
760 765 *last = tmp;
761 766 } else {
762 767 *last = walker;
763 768 }
764 769 }
765 770 }
766 771
767 772 return (ret);
768 773 }
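
To illustrate the return-value semantics documented above (only pages not already covered by existing bounds are counted, and a fully covered range yields 0), here is a small stand-alone model. It is only a sketch under simplifying assumptions: it keeps bounds in a plain array instead of the kernel's AVL tree, appends new bounds rather than keeping them sorted, and the names used here (bound_t, fill_gaps) are illustrative rather than taken from this file.

#include <stdio.h>

#define	B_UNKNOWN	0
#define	B_INCORE	1
#define	B_NOT_INCORE	2

typedef struct {
	long start, end;	/* inclusive page range */
	int type;
} bound_t;

/*
 * Cover every page of [qs, qe] that no existing bound in b[0..*n-1] covers
 * with a new bound of 'type'.  Returns the number of newly covered pages;
 * 0 means the range was already fully covered.  New bounds are appended,
 * which is sufficient for this one-shot illustration.
 */
static long
fill_gaps(bound_t *b, int *n, int max, long qs, long qe, int type)
{
	long covered = 0, pos = qs;
	int i, orig = *n;

	for (i = 0; i < orig && pos <= qe; i++) {
		if (b[i].end < pos)
			continue;		/* bound ends before pos */
		if (b[i].start > pos) {		/* gap in front of this bound */
			long gend = (b[i].start - 1 < qe) ? b[i].start - 1 : qe;
			if (*n < max)
				b[(*n)++] = (bound_t){ pos, gend, type };
			covered += gend - pos + 1;
		}
		pos = b[i].end + 1;		/* skip the existing bound */
	}
	if (pos <= qe) {			/* trailing gap */
		if (*n < max)
			b[(*n)++] = (bound_t){ pos, qe, type };
		covered += qe - pos + 1;
	}
	return (covered);
}

int
main(void)
{
	/* Known state of one object: pages 0-9 incore, pages 20-29 not. */
	bound_t b[8] = { { 0, 9, B_INCORE }, { 20, 29, B_NOT_INCORE } };
	int n = 2;

	/* Pages 0-9 are already covered, so nothing new is created. */
	(void) printf("%ld\n", fill_gaps(b, &n, 8, 0, 9, B_UNKNOWN));	/* 0 */
	/* Over 0-29 only the gap 10-19 becomes a new UNKNOWN bound. */
	(void) printf("%ld\n", fill_gaps(b, &n, 8, 0, 29, B_UNKNOWN));	/* 10 */
	return (0);
}
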
769 774
770 775 /*
771 776 * vmu_update_bounds()
772 777 *
773 778 * tree: avl_tree in which first and last hang.
774 779 *
775 780 * first, last: list of continuous bounds, of which zero or more are of
776 781 * type VMUSAGE_BOUND_UNKNOWN.
777 782 *
778 783 * new_tree: avl_tree in which new_first and new_last hang.
779 784 *
780 785 * new_first, new_last: list of continuous bounds, of which none are of
781 786 * type VMUSAGE_BOUND_UNKNOWN. These bounds are used to
782 787 * update the types of bounds in (first,last) with
783 788 * type VMUSAGE_BOUND_UNKNOWN.
784 789 *
785 790 * For the list of bounds (first,last), this function updates any bounds
786 791 * with type VMUSAGE_BOUND_UNKNOWN using the type of the corresponding bound in
787 792 * the list (new_first, new_last).
788 793 *
789 794 * If a bound of type VMUSAGE_BOUND_UNKNOWN spans multiple bounds in the list
790 795 * (new_first, new_last), it will be split into multiple bounds.
791 796 *
792 797 * Return value:
793 798 * The number of pages in the list of bounds (first,last) that were of
794 799 * type VMUSAGE_BOUND_UNKNOWN, which have been updated to be of type
795 800 * VMUSAGE_BOUND_INCORE.
796 801 *
797 802 */
798 803 static pgcnt_t
799 804 vmu_update_bounds(avl_tree_t *tree, vmu_bound_t **first, vmu_bound_t **last,
800 805 avl_tree_t *new_tree, vmu_bound_t *new_first, vmu_bound_t *new_last)
801 806 {
802 807 vmu_bound_t *next, *new_next, *tmp;
803 808 pgcnt_t rss = 0;
804 809
805 810 next = *first;
806 811 new_next = new_first;
807 812
808 813 /*
809 814 * Verify first and last bound are covered by new bounds if they
810 815 * have unknown type.
811 816 */
812 817 ASSERT((*first)->vmb_type != VMUSAGE_BOUND_UNKNOWN ||
813 818 (*first)->vmb_start >= new_first->vmb_start);
814 819 ASSERT((*last)->vmb_type != VMUSAGE_BOUND_UNKNOWN ||
815 820 (*last)->vmb_end <= new_last->vmb_end);
816 821 for (;;) {
817 822 /* If bound already has type, proceed to next bound. */
818 823 if (next->vmb_type != VMUSAGE_BOUND_UNKNOWN) {
819 824 if (next == *last)
820 825 break;
821 826 next = AVL_NEXT(tree, next);
822 827 continue;
823 828 }
824 829 while (new_next->vmb_end < next->vmb_start)
825 830 new_next = AVL_NEXT(new_tree, new_next);
826 831 ASSERT(new_next->vmb_type != VMUSAGE_BOUND_UNKNOWN);
827 832 next->vmb_type = new_next->vmb_type;
828 833 if (new_next->vmb_end < next->vmb_end) {
829 834 /* need to split bound */
830 835 tmp = vmu_alloc_bound();
831 836 tmp->vmb_type = VMUSAGE_BOUND_UNKNOWN;
832 837 tmp->vmb_start = new_next->vmb_end + 1;
833 838 tmp->vmb_end = next->vmb_end;
834 839 avl_insert_here(tree, tmp, next, AVL_AFTER);
835 840 next->vmb_end = new_next->vmb_end;
836 841 if (*last == next)
837 842 *last = tmp;
838 843 if (next->vmb_type == VMUSAGE_BOUND_INCORE)
839 844 rss += next->vmb_end - next->vmb_start + 1;
840 845 next = tmp;
841 846 } else {
842 847 if (next->vmb_type == VMUSAGE_BOUND_INCORE)
843 848 rss += next->vmb_end - next->vmb_start + 1;
844 849 if (next == *last)
845 850 break;
846 851 next = AVL_NEXT(tree, next);
847 852 }
848 853 }
849 854 return (rss);
850 855 }
851 856
852 857 /*
853 858 * Merges adjacent bounds with same type between first and last bound.
854 859 * After merge, last pointer may point to a different bound, as (incoming)
855 860 * last bound may have been merged away.
856 861 */
857 862 static void
858 863 vmu_merge_bounds(avl_tree_t *tree, vmu_bound_t **first, vmu_bound_t **last)
859 864 {
860 865 vmu_bound_t *current;
861 866 vmu_bound_t *next;
862 867
863 868 ASSERT(tree != NULL);
864 869 ASSERT(*first != NULL);
865 870 ASSERT(*last != NULL);
866 871
867 872 current = *first;
868 873 while (current != *last) {
869 874 next = AVL_NEXT(tree, current);
870 875 if ((current->vmb_end + 1) == next->vmb_start &&
871 876 current->vmb_type == next->vmb_type) {
872 877 current->vmb_end = next->vmb_end;
873 878 avl_remove(tree, next);
874 879 vmu_free_bound(next);
875 880 if (next == *last) {
876 881 *last = current;
877 882 }
878 883 } else {
879 884 current = AVL_NEXT(tree, current);
880 885 }
881 886 }
882 887 }
883 888
884 889 /*
885 890 * Given an amp and a list of bounds, updates each bound's type with
886 891 * VMUSAGE_BOUND_INCORE or VMUSAGE_BOUND_NOT_INCORE.
887 892 *
888 893 * If a bound is partially incore, it will be split into two bounds.
889 894 * first and last may be modified, as bounds may be split into multiple
890 895 * bounds if they are partially incore/not-incore.
891 896 *
892 897 * Set incore to non-zero if bounds are already known to be incore.
893 898 *
894 899 */
895 900 static void
896 901 vmu_amp_update_incore_bounds(avl_tree_t *tree, struct anon_map *amp,
897 902 vmu_bound_t **first, vmu_bound_t **last, boolean_t incore)
898 903 {
899 904 vmu_bound_t *next;
900 905 vmu_bound_t *tmp;
901 906 pgcnt_t index;
902 907 short bound_type;
903 908 short page_type;
904 909 vnode_t *vn;
905 910 anoff_t off;
906 911 struct anon *ap;
907 912
908 913 next = *first;
909 914 /* Shared anon slots don't change once set. */
910 915 ANON_LOCK_ENTER(&->a_rwlock, RW_READER);
911 916 for (;;) {
912 917 if (incore == B_TRUE)
913 918 next->vmb_type = VMUSAGE_BOUND_INCORE;
914 919
915 920 if (next->vmb_type != VMUSAGE_BOUND_UNKNOWN) {
916 921 if (next == *last)
917 922 break;
918 923 next = AVL_NEXT(tree, next);
919 924 continue;
920 925 }
926 +
927 + ASSERT(next->vmb_type == VMUSAGE_BOUND_UNKNOWN);
921 928 bound_type = next->vmb_type;
922 929 index = next->vmb_start;
923 930 while (index <= next->vmb_end) {
924 931
925 932 /*
926 933 * These are used to determine how much to increment
927 934 * index when a large page is found.
928 935 */
929 936 page_t *page;
930 937 pgcnt_t pgcnt = 1;
931 938 uint_t pgshft;
932 939 pgcnt_t pgmsk;
933 940
934 941 ap = anon_get_ptr(amp->ahp, index);
935 942 if (ap != NULL)
936 943 swap_xlate(ap, &vn, &off);
937 944
938 945 if (ap != NULL && vn != NULL && vn->v_pages != NULL &&
939 946 (page = page_exists(vn, off)) != NULL) {
940 - page_type = VMUSAGE_BOUND_INCORE;
947 + if (PP_ISFREE(page))
948 + page_type = VMUSAGE_BOUND_NOT_INCORE;
949 + else
950 + page_type = VMUSAGE_BOUND_INCORE;
941 951 if (page->p_szc > 0) {
942 952 pgcnt = page_get_pagecnt(page->p_szc);
943 953 pgshft = page_get_shift(page->p_szc);
944 954 pgmsk = (0x1 << (pgshft - PAGESHIFT))
945 955 - 1;
946 956 }
947 957 } else {
948 958 page_type = VMUSAGE_BOUND_NOT_INCORE;
949 959 }
960 +
950 961 if (bound_type == VMUSAGE_BOUND_UNKNOWN) {
951 962 next->vmb_type = page_type;
963 + bound_type = page_type;
952 964 } else if (next->vmb_type != page_type) {
953 965 /*
954 966 * If current bound type does not match page
955 967 * type, need to split off new bound.
956 968 */
957 969 tmp = vmu_alloc_bound();
958 970 tmp->vmb_type = page_type;
959 971 tmp->vmb_start = index;
960 972 tmp->vmb_end = next->vmb_end;
961 973 avl_insert_here(tree, tmp, next, AVL_AFTER);
962 974 next->vmb_end = index - 1;
963 975 if (*last == next)
964 976 *last = tmp;
965 977 next = tmp;
966 978 }
967 979 if (pgcnt > 1) {
968 980 /*
969 981 * If inside large page, jump to next large
970 982 * page
971 983 */
972 984 index = (index & ~pgmsk) + pgcnt;
973 985 } else {
974 986 index++;
975 987 }
976 988 }
977 989 if (next == *last) {
978 990 ASSERT(next->vmb_type != VMUSAGE_BOUND_UNKNOWN);
979 991 break;
980 992 } else
981 993 next = AVL_NEXT(tree, next);
982 994 }
983 995 ANON_LOCK_EXIT(&->a_rwlock);
984 996 }
985 997
986 998 /*
987 999 * Same as vmu_amp_update_incore_bounds(), except for tracking
988 1000 * incore-/not-incore for vnodes.
989 1001 */
990 1002 static void
991 1003 vmu_vnode_update_incore_bounds(avl_tree_t *tree, vnode_t *vnode,
992 1004 vmu_bound_t **first, vmu_bound_t **last)
993 1005 {
994 1006 vmu_bound_t *next;
995 1007 vmu_bound_t *tmp;
996 1008 pgcnt_t index;
997 1009 short bound_type;
998 1010 short page_type;
999 1011
1000 1012 next = *first;
1001 1013 for (;;) {
1002 1014 if (vnode->v_pages == NULL)
1003 1015 next->vmb_type = VMUSAGE_BOUND_NOT_INCORE;
1004 1016
1005 1017 if (next->vmb_type != VMUSAGE_BOUND_UNKNOWN) {
1006 1018 if (next == *last)
1007 1019 break;
1008 1020 next = AVL_NEXT(tree, next);
1009 1021 continue;
1010 1022 }
1011 1023
1024 + ASSERT(next->vmb_type == VMUSAGE_BOUND_UNKNOWN);
1012 1025 bound_type = next->vmb_type;
1013 1026 index = next->vmb_start;
1014 1027 while (index <= next->vmb_end) {
1015 1028
1016 1029 /*
1017 1030 * These are used to determine how much to increment
1018 1031 * index when a large page is found.
1019 1032 */
1020 1033 page_t *page;
1021 1034 pgcnt_t pgcnt = 1;
1022 1035 uint_t pgshft;
1023 1036 pgcnt_t pgmsk;
1024 1037
1025 1038 if (vnode->v_pages != NULL &&
1026 1039 (page = page_exists(vnode, ptob(index))) != NULL) {
1027 - page_type = VMUSAGE_BOUND_INCORE;
1040 + if (PP_ISFREE(page))
1041 + page_type = VMUSAGE_BOUND_NOT_INCORE;
1042 + else
1043 + page_type = VMUSAGE_BOUND_INCORE;
1028 1044 if (page->p_szc > 0) {
1029 1045 pgcnt = page_get_pagecnt(page->p_szc);
1030 1046 pgshft = page_get_shift(page->p_szc);
1031 1047 pgmsk = (0x1 << (pgshft - PAGESHIFT))
1032 1048 - 1;
1033 1049 }
1034 1050 } else {
1035 1051 page_type = VMUSAGE_BOUND_NOT_INCORE;
1036 1052 }
1053 +
1037 1054 if (bound_type == VMUSAGE_BOUND_UNKNOWN) {
1038 1055 next->vmb_type = page_type;
1056 + bound_type = page_type;
1039 1057 } else if (next->vmb_type != page_type) {
1040 1058 /*
1041 1059 * If current bound type does not match page
1042 1060 * type, need to split off new bound.
1043 1061 */
1044 1062 tmp = vmu_alloc_bound();
1045 1063 tmp->vmb_type = page_type;
1046 1064 tmp->vmb_start = index;
1047 1065 tmp->vmb_end = next->vmb_end;
1048 1066 avl_insert_here(tree, tmp, next, AVL_AFTER);
1049 1067 next->vmb_end = index - 1;
1050 1068 if (*last == next)
1051 1069 *last = tmp;
1052 1070 next = tmp;
1053 1071 }
1054 1072 if (pgcnt > 1) {
1055 1073 /*
1056 1074 * If inside large page, jump to next large
1057 1075 * page
1058 1076 */
1059 1077 index = (index & ~pgmsk) + pgcnt;
1060 1078 } else {
1061 1079 index++;
1062 1080 }
1063 1081 }
1064 1082 if (next == *last) {
1065 1083 ASSERT(next->vmb_type != VMUSAGE_BOUND_UNKNOWN);
1066 1084 break;
1067 1085 } else
1068 1086 next = AVL_NEXT(tree, next);
1069 1087 }
1070 1088 }
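
Both of the *_update_incore_bounds() routines above step over large pages with the same mask arithmetic: pgmsk has the low bits of the large-page span set, so (index & ~pgmsk) rounds the index down to the first constituent page and adding pgcnt lands on the first page of the following large page. A tiny stand-alone check of that arithmetic, using a hypothetical 512-page (2 MB of 4 KB pages) large page:

#include <stdio.h>

int
main(void)
{
	unsigned long pgcnt = 512;		/* pages per large page (assumed) */
	unsigned long pgmsk = pgcnt - 1;	/* same value as (1 << (pgshft - PAGESHIFT)) - 1 */
	unsigned long index = 1000;		/* somewhere inside a large page */

	/* Round down to the large page start, then jump past it. */
	unsigned long next = (index & ~pgmsk) + pgcnt;

	(void) printf("%lu -> %lu\n", index, next);	/* prints 1000 -> 1024 */
	return (0);
}
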
1071 1089
1072 1090 /*
1073 1091 * Calculate the rss and swap consumed by a segment. vmu_entities is the
1074 1092 * list of entities to visit. For shared segments, the vnode or amp
1075 1093 * is looked up in each entity to see if it has been already counted. Private
1076 1094 * anon pages are checked per entity to ensure that COW pages are not
1077 1095 * double counted.
1078 1096 *
1079 1097 * For private mapped files, first the amp is checked for private pages.
1080 1098 * Bounds not backed by the amp are looked up in the vnode for each entity
1081 1099 * to avoid double counting of private COW vnode pages.
1082 1100 */
1083 1101 static void
1084 1102 vmu_calculate_seg(vmu_entity_t *vmu_entities, struct seg *seg)
1085 1103 {
1086 1104 struct segvn_data *svd;
1087 1105 struct shm_data *shmd;
1088 1106 struct spt_data *sptd;
1089 1107 vmu_object_t *shared_object = NULL;
1090 1108 vmu_object_t *entity_object = NULL;
1091 1109 vmu_entity_t *entity;
1092 1110 vmusage_t *result;
1093 1111 vmu_bound_t *first = NULL;
1094 1112 vmu_bound_t *last = NULL;
1095 1113 vmu_bound_t *cur = NULL;
1096 1114 vmu_bound_t *e_first = NULL;
1097 1115 vmu_bound_t *e_last = NULL;
1098 1116 vmu_bound_t *tmp;
1099 1117 pgcnt_t p_index, s_index, p_start, p_end, s_start, s_end, rss, virt;
1100 1118 struct anon_map *private_amp = NULL;
1101 1119 boolean_t incore = B_FALSE;
1102 1120 boolean_t shared = B_FALSE;
1103 1121 int file = 0;
1104 1122 pgcnt_t swresv = 0;
1105 1123 pgcnt_t panon = 0;
1106 1124
1107 1125 /* Can zero-length segments exist? Not sure, so paranoia. */
1108 1126 if (seg->s_size <= 0)
1109 1127 return;
1110 1128
1111 1129 /*
1112 1130 * Figure out if there is a shared object (such as a named vnode or
1113 1131 * a shared amp), then figure out if there is a private amp, which
1114 1132 * identifies private pages.
1115 1133 */
1116 1134 if (seg->s_ops == &segvn_ops) {
1117 1135 svd = (struct segvn_data *)seg->s_data;
1118 1136 if (svd->type == MAP_SHARED) {
1119 1137 shared = B_TRUE;
1120 1138 } else {
1121 1139 swresv = svd->swresv;
1122 1140
1123 1141 if (SEGVN_LOCK_TRYENTER(seg->s_as, &svd->lock,
1124 1142 RW_READER) != 0) {
1125 1143 /*
1126 1144 * Text replication anon maps can be shared
1127 1145 * across all zones. Space used for text
1128 1146 * replication is typically capped as a small %
1129 1147 * of memory. To keep it simple for now we
1130 1148 * don't account for swap and memory space used
1131 1149 * for text replication.
1132 1150 */
1133 1151 if (svd->tr_state == SEGVN_TR_OFF &&
1134 1152 svd->amp != NULL) {
1135 1153 private_amp = svd->amp;
1136 1154 p_start = svd->anon_index;
1137 1155 p_end = svd->anon_index +
1138 1156 btop(seg->s_size) - 1;
1139 1157 }
1140 1158 SEGVN_LOCK_EXIT(seg->s_as, &svd->lock);
1141 1159 }
1142 1160 }
1143 1161 if (svd->vp != NULL) {
1144 1162 file = 1;
1145 1163 shared_object = vmu_find_insert_object(
1146 1164 vmu_data.vmu_all_vnodes_hash, (caddr_t)svd->vp,
1147 1165 VMUSAGE_TYPE_VNODE);
1148 1166 s_start = btop(svd->offset);
1149 1167 s_end = btop(svd->offset + seg->s_size) - 1;
1150 1168 }
1151 1169 if (svd->amp != NULL && svd->type == MAP_SHARED) {
1152 1170 ASSERT(shared_object == NULL);
1153 1171 shared_object = vmu_find_insert_object(
1154 1172 vmu_data.vmu_all_amps_hash, (caddr_t)svd->amp,
1155 1173 VMUSAGE_TYPE_AMP);
1156 1174 s_start = svd->anon_index;
1157 1175 s_end = svd->anon_index + btop(seg->s_size) - 1;
1158 1176 /* schedctl mappings are always in core */
1159 1177 if (svd->amp->swresv == 0)
1160 1178 incore = B_TRUE;
1161 1179 }
1162 1180 } else if (seg->s_ops == &segspt_shmops) {
1163 1181 shared = B_TRUE;
1164 1182 shmd = (struct shm_data *)seg->s_data;
1165 1183 shared_object = vmu_find_insert_object(
1166 1184 vmu_data.vmu_all_amps_hash, (caddr_t)shmd->shm_amp,
1167 1185 VMUSAGE_TYPE_AMP);
1168 1186 s_start = 0;
1169 1187 s_end = btop(seg->s_size) - 1;
1170 1188 sptd = shmd->shm_sptseg->s_data;
1171 1189
1172 1190 /* ism segments are always incore and do not reserve swap */
1173 1191 if (sptd->spt_flags & SHM_SHARE_MMU)
1174 1192 incore = B_TRUE;
1175 1193
1176 1194 } else {
1177 1195 return;
1178 1196 }
1179 1197
1180 1198 /*
1181 1199 * If there is a private amp, count anon pages that exist. If an
1182 1200 * anon has a refcnt > 1 (COW sharing), then save the anon in a
1183 1201 * hash so that it is not double counted.
1184 1202 *
1185 1203 * If there is also a shared object, then figure out the bounds
1186 1204 * which are not mapped by the private amp.
1187 1205 */
1188 1206 if (private_amp != NULL) {
1189 1207
1190 1208 /* Enter as writer to prevent COW anons from being freed */
1191 1209 ANON_LOCK_ENTER(&private_amp->a_rwlock, RW_WRITER);
1192 1210
1193 1211 p_index = p_start;
1194 1212 s_index = s_start;
1195 1213
1196 1214 while (p_index <= p_end) {
1197 1215
1198 1216 pgcnt_t p_index_next;
1199 1217 pgcnt_t p_bound_size;
1200 1218 int cnt;
1201 1219 anoff_t off;
1202 1220 struct vnode *vn;
1203 1221 struct anon *ap;
1204 1222 page_t *page; /* For handling of large */
1205 1223 pgcnt_t pgcnt = 1; /* pages */
1206 1224 pgcnt_t pgstart;
1207 1225 pgcnt_t pgend;
1208 1226 uint_t pgshft;
1209 1227 pgcnt_t pgmsk;
1210 1228
1211 1229 p_index_next = p_index;
1212 1230 ap = anon_get_next_ptr(private_amp->ahp,
1213 1231 &p_index_next);
1214 1232
1215 1233 /*
1216 1234 * If next anon is past end of mapping, simulate
1217 1235 * end of anon so loop terminates.
1218 1236 */
1219 1237 if (p_index_next > p_end) {
1220 1238 p_index_next = p_end + 1;
1221 1239 ap = NULL;
1222 1240 }
1223 1241 /*
1224 1242 * For COW segments, keep track of bounds not
1225 1243 * backed by private amp so they can be looked
1226 1244 * up in the backing vnode
1227 1245 */
1228 1246 if (p_index_next != p_index) {
1229 1247
1230 1248 /*
1231 1249 * Compute index difference between anon and
1232 1250 * previous anon.
1233 1251 */
1234 1252 p_bound_size = p_index_next - p_index - 1;
1235 1253
1236 1254 if (shared_object != NULL) {
1237 1255 cur = vmu_alloc_bound();
1238 1256 cur->vmb_start = s_index;
1239 1257 cur->vmb_end = s_index + p_bound_size;
1240 1258 cur->vmb_type = VMUSAGE_BOUND_UNKNOWN;
1241 1259 if (first == NULL) {
1242 1260 first = cur;
1243 1261 last = cur;
1244 1262 } else {
1245 1263 last->vmb_next = cur;
1246 1264 last = cur;
1247 1265 }
1248 1266 }
1249 1267 p_index = p_index + p_bound_size + 1;
1250 1268 s_index = s_index + p_bound_size + 1;
1251 1269 }
1252 1270
1253 1271 /* Detect end of anons in amp */
1254 1272 if (ap == NULL)
1255 1273 break;
1256 1274
1257 1275 cnt = ap->an_refcnt;
1258 1276 swap_xlate(ap, &vn, &off);
1259 1277
1260 1278 if (vn == NULL || vn->v_pages == NULL ||
1261 1279 (page = page_exists(vn, off)) == NULL) {
1262 1280 p_index++;
1263 1281 s_index++;
1264 1282 continue;
1265 1283 }
1266 1284
1267 1285 /*
1268 1286 * If large page is found, compute portion of large
1269 1287 * page in mapping, and increment indices to the next
1270 1288 * large page.
1271 1289 */
1272 1290 if (page->p_szc > 0) {
1273 1291
1274 1292 pgcnt = page_get_pagecnt(page->p_szc);
1275 1293 pgshft = page_get_shift(page->p_szc);
1276 1294 pgmsk = (0x1 << (pgshft - PAGESHIFT)) - 1;
1277 1295
1278 1296 /* First page in large page */
1279 1297 pgstart = p_index & ~pgmsk;
1280 1298 /* Last page in large page */
1281 1299 pgend = pgstart + pgcnt - 1;
1282 1300 /*
1283 1301 * Artificially end page if page extends past
1284 1302 * end of mapping.
1285 1303 */
1286 1304 if (pgend > p_end)
1287 1305 pgend = p_end;
1288 1306
1289 1307 /*
1290 1308 * Compute number of pages from large page
1291 1309 * which are mapped.
1292 1310 */
1293 1311 pgcnt = pgend - p_index + 1;
1294 1312
1295 1313 /*
1296 1314 * Point indices at page after large page,
1297 1315 * or at page after end of mapping.
1298 1316 */
1299 1317 p_index += pgcnt;
1300 1318 s_index += pgcnt;
1301 1319 } else {
1302 1320 p_index++;
1303 1321 s_index++;
1304 1322 }
1305 1323
1306 1324 /*
1325 + * Pages on the free list aren't counted for the rss.
1326 + */
1327 + if (PP_ISFREE(page))
1328 + continue;
1329 +
1330 + /*
1307 1331 * Assume anon structs with a refcnt
1308 1332 * of 1 are not COW shared, so there
1309 1333 * is no reason to track them per entity.
1310 1334 */
1311 1335 if (cnt == 1) {
1312 1336 panon += pgcnt;
1313 1337 continue;
1314 1338 }
1315 1339 for (entity = vmu_entities; entity != NULL;
1316 1340 entity = entity->vme_next_calc) {
1317 1341
1318 1342 result = &entity->vme_result;
1319 1343 /*
1320 1344 * Track COW anons per entity so
1321 1345 * they are not double counted.
1322 1346 */
1323 1347 if (vmu_find_insert_anon(entity->vme_anon_hash,
1324 1348 (caddr_t)ap) == 0)
1325 1349 continue;
1326 1350
1327 1351 result->vmu_rss_all += (pgcnt << PAGESHIFT);
1328 1352 result->vmu_rss_private +=
1329 1353 (pgcnt << PAGESHIFT);
1330 1354 }
1331 1355 }
1332 1356 ANON_LOCK_EXIT(&private_amp->a_rwlock);
1333 1357 }
1334 1358
1335 1359 /* Add up resident anon and swap reserved for private mappings */
1336 1360 if (swresv > 0 || panon > 0) {
1337 1361 for (entity = vmu_entities; entity != NULL;
1338 1362 entity = entity->vme_next_calc) {
1339 1363 result = &entity->vme_result;
1340 1364 result->vmu_swap_all += swresv;
1341 1365 result->vmu_swap_private += swresv;
1342 1366 result->vmu_rss_all += (panon << PAGESHIFT);
1343 1367 result->vmu_rss_private += (panon << PAGESHIFT);
1344 1368 }
1345 1369 }
1346 1370
1347 1371 /* Compute resident pages backing shared amp or named vnode */
1348 1372 if (shared_object != NULL) {
1349 1373 avl_tree_t *tree = &(shared_object->vmo_bounds);
1350 1374
1351 1375 if (first == NULL) {
1352 1376 /*
1353 1377 * No private amp, or private amp has no anon
1354 1378 * structs. This means entire segment is backed by
1355 1379 * the shared object.
1356 1380 */
1357 1381 first = vmu_alloc_bound();
1358 1382 first->vmb_start = s_start;
1359 1383 first->vmb_end = s_end;
1360 1384 first->vmb_type = VMUSAGE_BOUND_UNKNOWN;
1361 1385 }
1362 1386 /*
1363 1387 * Iterate bounds not backed by private amp, and compute
1364 1388 * resident pages.
1365 1389 */
1366 1390 cur = first;
1367 1391 while (cur != NULL) {
1368 1392
1369 1393 if (vmu_insert_lookup_object_bounds(shared_object,
1370 1394 cur->vmb_start, cur->vmb_end, VMUSAGE_BOUND_UNKNOWN,
1371 1395 &first, &last) > 0) {
1372 1396 /* new bounds, find incore/not-incore */
1373 1397 if (shared_object->vmo_type ==
1374 1398 VMUSAGE_TYPE_VNODE) {
1375 1399 vmu_vnode_update_incore_bounds(
1376 1400 tree,
1377 1401 (vnode_t *)
1378 1402 shared_object->vmo_key, &first,
1379 1403 &last);
1380 1404 } else {
1381 1405 vmu_amp_update_incore_bounds(
1382 1406 tree,
1383 1407 (struct anon_map *)
1384 1408 shared_object->vmo_key, &first,
1385 1409 &last, incore);
1386 1410 }
1387 1411 vmu_merge_bounds(tree, &first, &last);
1388 1412 }
1389 1413 for (entity = vmu_entities; entity != NULL;
1390 1414 entity = entity->vme_next_calc) {
1391 1415 avl_tree_t *e_tree;
1392 1416
1393 1417 result = &entity->vme_result;
1394 1418
1395 1419 entity_object = vmu_find_insert_object(
1396 1420 shared_object->vmo_type ==
1397 1421 VMUSAGE_TYPE_VNODE ? entity->vme_vnode_hash:
1398 1422 entity->vme_amp_hash,
1399 1423 shared_object->vmo_key,
1400 1424 shared_object->vmo_type);
1401 1425
1402 1426 virt = vmu_insert_lookup_object_bounds(
1403 1427 entity_object, cur->vmb_start, cur->vmb_end,
1404 1428 VMUSAGE_BOUND_UNKNOWN, &e_first, &e_last);
1405 1429
1406 1430 if (virt == 0)
1407 1431 continue;
1408 1432 /*
1409 1433 * Range visited for this entity
1410 1434 */
1411 1435 e_tree = &(entity_object->vmo_bounds);
1412 1436 rss = vmu_update_bounds(e_tree, &e_first,
1413 1437 &e_last, tree, first, last);
1414 1438 result->vmu_rss_all += (rss << PAGESHIFT);
1415 1439 if (shared == B_TRUE && file == B_FALSE) {
1416 1440 /* shared anon mapping */
1417 1441 result->vmu_swap_all +=
1418 1442 (virt << PAGESHIFT);
1419 1443 result->vmu_swap_shared +=
1420 1444 (virt << PAGESHIFT);
1421 1445 result->vmu_rss_shared +=
1422 1446 (rss << PAGESHIFT);
1423 1447 } else if (shared == B_TRUE && file == B_TRUE) {
1424 1448 /* shared file mapping */
1425 1449 result->vmu_rss_shared +=
1426 1450 (rss << PAGESHIFT);
1427 1451 } else if (shared == B_FALSE &&
1428 1452 file == B_TRUE) {
1429 1453 /* private file mapping */
1430 1454 result->vmu_rss_private +=
1431 1455 (rss << PAGESHIFT);
1432 1456 }
1433 1457 vmu_merge_bounds(e_tree, &e_first, &e_last);
1434 1458 }
1435 1459 tmp = cur;
1436 1460 cur = cur->vmb_next;
1437 1461 vmu_free_bound(tmp);
1438 1462 }
1439 1463 }
1440 1464 }
1441 1465
1442 1466 /*
1443 1467 * Based on the current calculation flags, find the entities that are
1444 1468 * relevant to the process. Then calculate each segment in the
1445 1469 * process's address space for each relevant entity.
1446 1470 */
1447 1471 static void
1448 1472 vmu_calculate_proc(proc_t *p)
1449 1473 {
1450 1474 vmu_entity_t *entities = NULL;
1451 1475 vmu_zone_t *zone;
1452 1476 vmu_entity_t *tmp;
1453 1477 struct as *as;
1454 1478 struct seg *seg;
1455 1479 int ret;
1456 1480
1457 1481 /* Figure out which entities are being computed */
1458 1482 if ((vmu_data.vmu_system) != NULL) {
1459 1483 tmp = vmu_data.vmu_system;
1460 1484 tmp->vme_next_calc = entities;
1461 1485 entities = tmp;
1462 1486 }
1463 1487 if (vmu_data.vmu_calc_flags &
1464 - (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_PROJECTS |
1465 - VMUSAGE_ALL_PROJECTS | VMUSAGE_TASKS | VMUSAGE_ALL_TASKS |
1488 + (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_A_ZONE |
1489 + VMUSAGE_PROJECTS | VMUSAGE_ALL_PROJECTS |
1490 + VMUSAGE_TASKS | VMUSAGE_ALL_TASKS |
1466 1491 VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS | VMUSAGE_EUSERS |
1467 1492 VMUSAGE_ALL_EUSERS)) {
1468 1493 ret = i_mod_hash_find_nosync(vmu_data.vmu_zones_hash,
1469 1494 (mod_hash_key_t)(uintptr_t)p->p_zone->zone_id,
1470 1495 (mod_hash_val_t *)&zone);
1471 1496 if (ret != 0) {
1472 1497 zone = vmu_alloc_zone(p->p_zone->zone_id);
1473 1498 ret = i_mod_hash_insert_nosync(vmu_data.vmu_zones_hash,
1474 1499 (mod_hash_key_t)(uintptr_t)p->p_zone->zone_id,
1475 1500 (mod_hash_val_t)zone, (mod_hash_hndl_t)0);
1476 1501 ASSERT(ret == 0);
1477 1502 }
1478 1503 if (zone->vmz_zone != NULL) {
1479 1504 tmp = zone->vmz_zone;
1480 1505 tmp->vme_next_calc = entities;
1481 1506 entities = tmp;
1482 1507 }
1483 1508 if (vmu_data.vmu_calc_flags &
1484 1509 (VMUSAGE_PROJECTS | VMUSAGE_ALL_PROJECTS)) {
1485 1510 tmp = vmu_find_insert_entity(zone->vmz_projects_hash,
1486 1511 p->p_task->tk_proj->kpj_id, VMUSAGE_PROJECTS,
1487 1512 zone->vmz_id);
1488 1513 tmp->vme_next_calc = entities;
1489 1514 entities = tmp;
1490 1515 }
1491 1516 if (vmu_data.vmu_calc_flags &
1492 1517 (VMUSAGE_TASKS | VMUSAGE_ALL_TASKS)) {
1493 1518 tmp = vmu_find_insert_entity(zone->vmz_tasks_hash,
1494 1519 p->p_task->tk_tkid, VMUSAGE_TASKS, zone->vmz_id);
1495 1520 tmp->vme_next_calc = entities;
1496 1521 entities = tmp;
1497 1522 }
1498 1523 if (vmu_data.vmu_calc_flags &
1499 1524 (VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS)) {
1500 1525 tmp = vmu_find_insert_entity(zone->vmz_rusers_hash,
1501 1526 crgetruid(p->p_cred), VMUSAGE_RUSERS, zone->vmz_id);
1502 1527 tmp->vme_next_calc = entities;
1503 1528 entities = tmp;
1504 1529 }
1505 1530 if (vmu_data.vmu_calc_flags &
1506 1531 (VMUSAGE_EUSERS | VMUSAGE_ALL_EUSERS)) {
1507 1532 tmp = vmu_find_insert_entity(zone->vmz_eusers_hash,
1508 1533 crgetuid(p->p_cred), VMUSAGE_EUSERS, zone->vmz_id);
1509 1534 tmp->vme_next_calc = entities;
1510 1535 entities = tmp;
1511 1536 }
1512 1537 }
1513 1538 /* Entities which collapse projects and users for all zones */
1514 1539 if (vmu_data.vmu_calc_flags & VMUSAGE_COL_PROJECTS) {
1515 1540 tmp = vmu_find_insert_entity(vmu_data.vmu_projects_col_hash,
1516 1541 p->p_task->tk_proj->kpj_id, VMUSAGE_PROJECTS, ALL_ZONES);
1517 1542 tmp->vme_next_calc = entities;
1518 1543 entities = tmp;
1519 1544 }
1520 1545 if (vmu_data.vmu_calc_flags & VMUSAGE_COL_RUSERS) {
1521 1546 tmp = vmu_find_insert_entity(vmu_data.vmu_rusers_col_hash,
1522 1547 crgetruid(p->p_cred), VMUSAGE_RUSERS, ALL_ZONES);
1523 1548 tmp->vme_next_calc = entities;
1524 1549 entities = tmp;
1525 1550 }
1526 1551 if (vmu_data.vmu_calc_flags & VMUSAGE_COL_EUSERS) {
1527 1552 tmp = vmu_find_insert_entity(vmu_data.vmu_eusers_col_hash,
1528 1553 crgetuid(p->p_cred), VMUSAGE_EUSERS, ALL_ZONES);
1529 1554 tmp->vme_next_calc = entities;
1530 1555 entities = tmp;
1531 1556 }
1532 1557
1533 1558 ASSERT(entities != NULL);
1534 1559 /* process all segs in process's address space */
1535 1560 as = p->p_as;
1536 1561 AS_LOCK_ENTER(as, RW_READER);
1537 1562 for (seg = AS_SEGFIRST(as); seg != NULL;
1538 1563 seg = AS_SEGNEXT(as, seg)) {
1539 1564 vmu_calculate_seg(entities, seg);
1540 1565 }
1541 1566 AS_LOCK_EXIT(as);
1542 1567 }
1543 1568
1544 1569 /*
1545 1570 * Free data created by previous call to vmu_calculate().
1546 1571 */
1547 1572 static void
1548 1573 vmu_clear_calc()
1549 1574 {
1550 1575 if (vmu_data.vmu_system != NULL)
1551 1576 vmu_free_entity(vmu_data.vmu_system);
1552 1577 vmu_data.vmu_system = NULL;
1553 1578 if (vmu_data.vmu_zones_hash != NULL)
1554 1579 i_mod_hash_clear_nosync(vmu_data.vmu_zones_hash);
1555 1580 if (vmu_data.vmu_projects_col_hash != NULL)
1556 1581 i_mod_hash_clear_nosync(vmu_data.vmu_projects_col_hash);
1557 1582 if (vmu_data.vmu_rusers_col_hash != NULL)
1558 1583 i_mod_hash_clear_nosync(vmu_data.vmu_rusers_col_hash);
1559 1584 if (vmu_data.vmu_eusers_col_hash != NULL)
1560 1585 i_mod_hash_clear_nosync(vmu_data.vmu_eusers_col_hash);
1561 1586
1562 1587 i_mod_hash_clear_nosync(vmu_data.vmu_all_vnodes_hash);
1563 1588 i_mod_hash_clear_nosync(vmu_data.vmu_all_amps_hash);
1564 1589 }
1565 1590
1566 1591 /*
1567 1592 * Free unused data structures. These can result if the system workload
1568 1593 * decreases between calculations.
1569 1594 */
1570 1595 static void
1571 1596 vmu_free_extra()
1572 1597 {
1573 1598 vmu_bound_t *tb;
1574 1599 vmu_object_t *to;
1575 1600 vmu_entity_t *te;
1576 1601 vmu_zone_t *tz;
1577 1602
1578 1603 while (vmu_data.vmu_free_bounds != NULL) {
1579 1604 tb = vmu_data.vmu_free_bounds;
1580 1605 vmu_data.vmu_free_bounds = vmu_data.vmu_free_bounds->vmb_next;
1581 1606 kmem_cache_free(vmu_bound_cache, tb);
1582 1607 }
1583 1608 while (vmu_data.vmu_free_objects != NULL) {
1584 1609 to = vmu_data.vmu_free_objects;
1585 1610 vmu_data.vmu_free_objects =
1586 1611 vmu_data.vmu_free_objects->vmo_next;
1587 1612 kmem_cache_free(vmu_object_cache, to);
1588 1613 }
1589 1614 while (vmu_data.vmu_free_entities != NULL) {
1590 1615 te = vmu_data.vmu_free_entities;
1591 1616 vmu_data.vmu_free_entities =
1592 1617 vmu_data.vmu_free_entities->vme_next;
1593 1618 if (te->vme_vnode_hash != NULL)
1594 1619 mod_hash_destroy_hash(te->vme_vnode_hash);
1595 1620 if (te->vme_amp_hash != NULL)
1596 1621 mod_hash_destroy_hash(te->vme_amp_hash);
1597 1622 if (te->vme_anon_hash != NULL)
1598 1623 mod_hash_destroy_hash(te->vme_anon_hash);
1599 1624 kmem_free(te, sizeof (vmu_entity_t));
1600 1625 }
1601 1626 while (vmu_data.vmu_free_zones != NULL) {
1602 1627 tz = vmu_data.vmu_free_zones;
1603 1628 vmu_data.vmu_free_zones =
1604 1629 vmu_data.vmu_free_zones->vmz_next;
1605 1630 if (tz->vmz_projects_hash != NULL)
1606 1631 mod_hash_destroy_hash(tz->vmz_projects_hash);
1607 1632 if (tz->vmz_tasks_hash != NULL)
1608 1633 mod_hash_destroy_hash(tz->vmz_tasks_hash);
1609 1634 if (tz->vmz_rusers_hash != NULL)
1610 1635 mod_hash_destroy_hash(tz->vmz_rusers_hash);
1611 1636 if (tz->vmz_eusers_hash != NULL)
1612 1637 mod_hash_destroy_hash(tz->vmz_eusers_hash);
1613 1638 kmem_free(tz, sizeof (vmu_zone_t));
1614 1639 }
1615 1640 }
1616 1641
1617 1642 extern kcondvar_t *pr_pid_cv;
1618 1643
1619 1644 /*
1620 1645 * Determine which entity types are relevant and allocate the hashes to
1621 1646 * track them. Then walk the process table and count rss and swap
1622 1647 * for each process's address space. Address space objects such as
1623 1648 * vnodes, amps and anons are tracked per entity, so that they are
1624 1649 * not double counted in the results.
1625 1650 *
1626 1651 */
1627 1652 static void
1628 1653 vmu_calculate()
1629 1654 {
1630 1655 int i = 0;
1631 1656 int ret;
1632 1657 proc_t *p;
1633 1658
1634 1659 vmu_clear_calc();
1635 1660
1636 1661 if (vmu_data.vmu_calc_flags & VMUSAGE_SYSTEM)
1637 1662 vmu_data.vmu_system = vmu_alloc_entity(0, VMUSAGE_SYSTEM,
1638 1663 ALL_ZONES);
1639 1664
1640 1665 /*
1641 1666 * Walk process table and calculate rss of each proc.
1642 1667 *
1643 1668 * Pidlock and p_lock cannot be held while doing the rss calculation.
1644 1669 * This is because:
1645 1670 * 1. The calculation allocates using KM_SLEEP.
1646 1671 * 2. The calculation grabs a_lock, which cannot be grabbed
1647 1672 * after p_lock.
1648 1673 *
1649 1674 * Since pidlock must be dropped, we cannot simply just walk the
1650 1675 * practive list. Instead, we walk the process table, and sprlock
1651 1676 * each process to ensure that it does not exit during the
1652 1677 * calculation.
1653 1678 */
1654 1679
1655 1680 mutex_enter(&pidlock);
1656 1681 for (i = 0; i < v.v_proc; i++) {
1657 1682 again:
1658 1683 p = pid_entry(i);
1659 1684 if (p == NULL)
1660 1685 continue;
1661 1686
1662 1687 mutex_enter(&p->p_lock);
1663 1688 mutex_exit(&pidlock);
1664 1689
1665 1690 if (panicstr) {
1666 1691 mutex_exit(&p->p_lock);
1667 1692 return;
1668 1693 }
1669 1694
1670 1695 /* Try to set P_PR_LOCK */
1671 1696 ret = sprtrylock_proc(p);
1672 1697 if (ret == -1) {
1673 1698 /* Process in invalid state */
1674 1699 mutex_exit(&p->p_lock);
1675 1700 mutex_enter(&pidlock);
1676 1701 continue;
1677 1702 } else if (ret == 1) {
1678 1703 /*
1679 1704 * P_PR_LOCK is already set. Wait and try again.
1680 1705 * This also drops p_lock.
1681 1706 */
1682 1707 sprwaitlock_proc(p);
1683 1708 mutex_enter(&pidlock);
1684 1709 goto again;
1685 1710 }
1686 1711 mutex_exit(&p->p_lock);
1687 1712
1688 1713 vmu_calculate_proc(p);
1689 1714
1690 1715 mutex_enter(&p->p_lock);
1691 1716 sprunlock(p);
1692 1717 mutex_enter(&pidlock);
1693 1718 }
1694 1719 mutex_exit(&pidlock);
1695 1720
1696 1721 vmu_free_extra();
1697 1722 }
1698 1723
1699 1724 /*
1700 1725 * allocate a new cache for N results satisfying flags
1701 1726 */
1702 1727 vmu_cache_t *
1703 1728 vmu_cache_alloc(size_t nres, uint_t flags)
1704 1729 {
1705 1730 vmu_cache_t *cache;
1706 1731
1707 1732 cache = kmem_zalloc(sizeof (vmu_cache_t), KM_SLEEP);
1708 1733 cache->vmc_results = kmem_zalloc(sizeof (vmusage_t) * nres, KM_SLEEP);
1709 1734 cache->vmc_nresults = nres;
1710 1735 cache->vmc_flags = flags;
1711 1736 cache->vmc_refcnt = 1;
1712 1737 return (cache);
1713 1738 }
1714 1739
1715 1740 /*
1716 1741 * Make sure cached results are not freed
1717 1742 */
1718 1743 static void
1719 1744 vmu_cache_hold(vmu_cache_t *cache)
1720 1745 {
1721 1746 ASSERT(MUTEX_HELD(&vmu_data.vmu_lock));
1722 1747 cache->vmc_refcnt++;
1723 1748 }
1724 1749
1725 1750 /*
1726 1751 * free cache data
1727 1752 */
1728 1753 static void
1729 1754 vmu_cache_rele(vmu_cache_t *cache)
1730 1755 {
1731 1756 ASSERT(MUTEX_HELD(&vmu_data.vmu_lock));
1732 1757 ASSERT(cache->vmc_refcnt > 0);
1733 1758 cache->vmc_refcnt--;
1734 1759 if (cache->vmc_refcnt == 0) {
1735 1760 kmem_free(cache->vmc_results, sizeof (vmusage_t) *
1736 1761 cache->vmc_nresults);
1737 1762 kmem_free(cache, sizeof (vmu_cache_t));
1738 1763 }
1739 1764 }
1740 1765
1741 1766 /*
1767 + * When new data is calculated, update the phys_mem rctl usage value in the
1768 + * zones.
1769 + */
1770 +static void
1771 +vmu_update_zone_rctls(vmu_cache_t *cache)
1772 +{
1773 + vmusage_t *rp;
1774 + size_t i = 0;
1775 + zone_t *zp;
1776 +
1777 + for (rp = cache->vmc_results; i < cache->vmc_nresults; rp++, i++) {
1778 + if (rp->vmu_type == VMUSAGE_ZONE &&
1779 + rp->vmu_zoneid != ALL_ZONES) {
1780 + if ((zp = zone_find_by_id(rp->vmu_zoneid)) != NULL) {
1781 + zp->zone_phys_mem = rp->vmu_rss_all;
1782 + zone_rele(zp);
1783 + }
1784 + }
1785 + }
1786 +}
1787 +
1788 +/*
1742 1789 * Copy out the cached results to a caller. Inspect the caller's flags
1743 1790 * and zone to determine which cached results should be copied.
1744 1791 */
1745 1792 static int
1746 1793 vmu_copyout_results(vmu_cache_t *cache, vmusage_t *buf, size_t *nres,
1747 - uint_t flags, int cpflg)
1794 + uint_t flags, id_t req_zone_id, int cpflg)
1748 1795 {
1749 1796 vmusage_t *result, *out_result;
1750 1797 vmusage_t dummy;
1751 1798 size_t i, count = 0;
1752 1799 size_t bufsize;
1753 1800 int ret = 0;
1754 1801 uint_t types = 0;
1755 1802
1756 1803 if (nres != NULL) {
1757 1804 if (ddi_copyin((caddr_t)nres, &bufsize, sizeof (size_t), cpflg))
1758 1805 return (set_errno(EFAULT));
1759 1806 } else {
1760 1807 bufsize = 0;
1761 1808 }
1762 1809
1763 1810 /* figure out what results the caller is interested in. */
1764 1811 if ((flags & VMUSAGE_SYSTEM) && curproc->p_zone == global_zone)
1765 1812 types |= VMUSAGE_SYSTEM;
1766 - if (flags & (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES))
1813 + if (flags & (VMUSAGE_ZONE | VMUSAGE_ALL_ZONES | VMUSAGE_A_ZONE))
1767 1814 types |= VMUSAGE_ZONE;
1768 1815 if (flags & (VMUSAGE_PROJECTS | VMUSAGE_ALL_PROJECTS |
1769 1816 VMUSAGE_COL_PROJECTS))
1770 1817 types |= VMUSAGE_PROJECTS;
1771 1818 if (flags & (VMUSAGE_TASKS | VMUSAGE_ALL_TASKS))
1772 1819 types |= VMUSAGE_TASKS;
1773 1820 if (flags & (VMUSAGE_RUSERS | VMUSAGE_ALL_RUSERS | VMUSAGE_COL_RUSERS))
1774 1821 types |= VMUSAGE_RUSERS;
1775 1822 if (flags & (VMUSAGE_EUSERS | VMUSAGE_ALL_EUSERS | VMUSAGE_COL_EUSERS))
1776 1823 types |= VMUSAGE_EUSERS;
1777 1824
1778 1825 /* count results for current zone */
1779 1826 out_result = buf;
1780 1827 for (result = cache->vmc_results, i = 0;
1781 1828 i < cache->vmc_nresults; result++, i++) {
1782 1829
1783 1830 /* Do not return "other-zone" results to non-global zones */
1784 1831 if (curproc->p_zone != global_zone &&
1785 1832 curproc->p_zone->zone_id != result->vmu_zoneid)
1786 1833 continue;
1787 1834
1788 1835 /*
1789 1836 * If non-global zone requests VMUSAGE_SYSTEM, fake
1790 1837 * up VMUSAGE_ZONE result as VMUSAGE_SYSTEM result.
1791 1838 */
1792 1839 if (curproc->p_zone != global_zone &&
1793 1840 (flags & VMUSAGE_SYSTEM) != 0 &&
1794 1841 result->vmu_type == VMUSAGE_ZONE) {
1795 1842 count++;
1796 1843 if (out_result != NULL) {
1797 1844 if (bufsize < count) {
1798 1845 ret = set_errno(EOVERFLOW);
1799 1846 } else {
1800 1847 dummy = *result;
1801 1848 dummy.vmu_zoneid = ALL_ZONES;
1802 1849 dummy.vmu_id = 0;
1803 1850 dummy.vmu_type = VMUSAGE_SYSTEM;
1804 1851 if (ddi_copyout(&dummy, out_result,
1805 1852 sizeof (vmusage_t), cpflg))
1806 1853 return (set_errno(EFAULT));
1807 1854 out_result++;
1808 1855 }
1809 1856 }
1810 1857 }
1811 1858
1812 1859 /* Skip results that do not match requested type */
1813 1860 if ((result->vmu_type & types) == 0)
1814 1861 continue;
1815 1862
1816 1863 /* Skip collated results if not requested */
1817 1864 if (result->vmu_zoneid == ALL_ZONES) {
1818 1865 if (result->vmu_type == VMUSAGE_PROJECTS &&
1819 1866 (flags & VMUSAGE_COL_PROJECTS) == 0)
1820 1867 continue;
1821 1868 if (result->vmu_type == VMUSAGE_EUSERS &&
1822 1869 (flags & VMUSAGE_COL_EUSERS) == 0)
1823 1870 continue;
1824 1871 if (result->vmu_type == VMUSAGE_RUSERS &&
1825 1872 (flags & VMUSAGE_COL_RUSERS) == 0)
1826 1873 continue;
1827 1874 }
1828 1875
1829 - /* Skip "other zone" results if not requested */
1830 - if (result->vmu_zoneid != curproc->p_zone->zone_id) {
1831 - if (result->vmu_type == VMUSAGE_ZONE &&
1832 - (flags & VMUSAGE_ALL_ZONES) == 0)
1876 + if (result->vmu_type == VMUSAGE_ZONE &&
1877 + flags & VMUSAGE_A_ZONE) {
1878 + /* Skip non-requested zone results */
1879 + if (result->vmu_zoneid != req_zone_id)
1833 1880 continue;
1834 - if (result->vmu_type == VMUSAGE_PROJECTS &&
1835 - (flags & (VMUSAGE_ALL_PROJECTS |
1836 - VMUSAGE_COL_PROJECTS)) == 0)
1837 - continue;
1838 - if (result->vmu_type == VMUSAGE_TASKS &&
1839 - (flags & VMUSAGE_ALL_TASKS) == 0)
1840 - continue;
1841 - if (result->vmu_type == VMUSAGE_RUSERS &&
1842 - (flags & (VMUSAGE_ALL_RUSERS |
1843 - VMUSAGE_COL_RUSERS)) == 0)
1844 - continue;
1845 - if (result->vmu_type == VMUSAGE_EUSERS &&
1846 - (flags & (VMUSAGE_ALL_EUSERS |
1847 - VMUSAGE_COL_EUSERS)) == 0)
1848 - continue;
1881 + } else {
1882 + /* Skip "other zone" results if not requested */
1883 + if (result->vmu_zoneid != curproc->p_zone->zone_id) {
1884 + if (result->vmu_type == VMUSAGE_ZONE &&
1885 + (flags & VMUSAGE_ALL_ZONES) == 0)
1886 + continue;
1887 + if (result->vmu_type == VMUSAGE_PROJECTS &&
1888 + (flags & (VMUSAGE_ALL_PROJECTS |
1889 + VMUSAGE_COL_PROJECTS)) == 0)
1890 + continue;
1891 + if (result->vmu_type == VMUSAGE_TASKS &&
1892 + (flags & VMUSAGE_ALL_TASKS) == 0)
1893 + continue;
1894 + if (result->vmu_type == VMUSAGE_RUSERS &&
1895 + (flags & (VMUSAGE_ALL_RUSERS |
1896 + VMUSAGE_COL_RUSERS)) == 0)
1897 + continue;
1898 + if (result->vmu_type == VMUSAGE_EUSERS &&
1899 + (flags & (VMUSAGE_ALL_EUSERS |
1900 + VMUSAGE_COL_EUSERS)) == 0)
1901 + continue;
1902 + }
1849 1903 }
1850 1904 count++;
1851 1905 if (out_result != NULL) {
1852 1906 if (bufsize < count) {
1853 1907 ret = set_errno(EOVERFLOW);
1854 1908 } else {
1855 1909 if (ddi_copyout(result, out_result,
1856 1910 sizeof (vmusage_t), cpflg))
1857 1911 return (set_errno(EFAULT));
1858 1912 out_result++;
1859 1913 }
1860 1914 }
1861 1915 }
1862 1916 if (nres != NULL)
1863 1917 if (ddi_copyout(&count, (void *)nres, sizeof (size_t), cpflg))
1864 1918 return (set_errno(EFAULT));
1865 1919
1866 1920 return (ret);
1867 1921 }
1868 1922
1869 1923 /*
1870 1924 * vm_getusage()
1871 1925 *
1872 1926 * Counts rss and swap by zone, project, task, and/or user. The flags argument
1873 1927 * determines the type of results structures returned. Flags requesting
1874 1928 * results from more than one zone are "flattened" to the local zone if the
1875 1929 * caller is not the global zone.
1876 1930 *
1877 1931 * args:
1878 1932 * flags: bitmap consisting of one or more of VMUSAGE_*.
1879 1933 * age: maximum allowable age (time since counting was done) in
1880 1934 * seconds of the results. Results from previous callers are
1881 1935 * cached in kernel.
1882 1936 * buf: pointer to buffer array of vmusage_t. If NULL, then only nres
1883 1937 * set on success.
1884 1938 * nres: Set to number of vmusage_t structures pointed to by buf
1885 1939 * before calling vm_getusage().
1886 1940 * On return of 0 (success) or ENOSPC, it is set to the number of
1887 1941 * result structures returned or attempted to return.
1888 1942 *
1889 1943 * returns 0 on success, -1 on failure:
1890 1944 * EINTR (interrupted)
1891 1945 * ENOSPC (nres too small for results; nres set to needed value for success)
1892 1946 * EINVAL (flags invalid)
1893 1947 * EFAULT (bad address for buf or nres)
1894 1948 */
1895 1949 int
1896 1950 vm_getusage(uint_t flags, time_t age, vmusage_t *buf, size_t *nres, int cpflg)
1897 1951 {
1898 1952 vmu_entity_t *entity;
1899 1953 vmusage_t *result;
1900 1954 int ret = 0;
1901 1955 int cacherecent = 0;
1902 1956 hrtime_t now;
1903 1957 uint_t flags_orig;
1958 + id_t req_zone_id;
1904 1959
1905 1960 /*
1906 1961 * Non-global zones cannot request system wide and/or collated
1907 - * results, or the system result, so munge the flags accordingly.
1962 + * results, or the system result, or usage of another zone, so munge
1963 + * the flags accordingly.
1908 1964 */
1909 1965 flags_orig = flags;
1910 1966 if (curproc->p_zone != global_zone) {
1911 1967 if (flags & (VMUSAGE_ALL_PROJECTS | VMUSAGE_COL_PROJECTS)) {
1912 1968 flags &= ~(VMUSAGE_ALL_PROJECTS | VMUSAGE_COL_PROJECTS);
1913 1969 flags |= VMUSAGE_PROJECTS;
1914 1970 }
1915 1971 if (flags & (VMUSAGE_ALL_RUSERS | VMUSAGE_COL_RUSERS)) {
1916 1972 flags &= ~(VMUSAGE_ALL_RUSERS | VMUSAGE_COL_RUSERS);
1917 1973 flags |= VMUSAGE_RUSERS;
1918 1974 }
1919 1975 if (flags & (VMUSAGE_ALL_EUSERS | VMUSAGE_COL_EUSERS)) {
1920 1976 flags &= ~(VMUSAGE_ALL_EUSERS | VMUSAGE_COL_EUSERS);
1921 1977 flags |= VMUSAGE_EUSERS;
1922 1978 }
1923 1979 if (flags & VMUSAGE_SYSTEM) {
1924 1980 flags &= ~VMUSAGE_SYSTEM;
1925 1981 flags |= VMUSAGE_ZONE;
1926 1982 }
1983 + if (flags & VMUSAGE_A_ZONE) {
1984 + flags &= ~VMUSAGE_A_ZONE;
1985 + flags |= VMUSAGE_ZONE;
1986 + }
1927 1987 }
1928 1988
1929 1989 /* Check for unknown flags */
1930 1990 if ((flags & (~VMUSAGE_MASK)) != 0)
1931 1991 return (set_errno(EINVAL));
1932 1992
1933 1993 /* Check for no flags */
1934 1994 if ((flags & VMUSAGE_MASK) == 0)
1935 1995 return (set_errno(EINVAL));
1936 1996
1997 + /* If requesting results for a specific zone, get the zone ID */
1998 + if (flags & VMUSAGE_A_ZONE) {
1999 + size_t bufsize;
2000 + vmusage_t zreq;
2001 +
2002 + if (ddi_copyin((caddr_t)nres, &bufsize, sizeof (size_t), cpflg))
2003 + return (set_errno(EFAULT));
2004 + /* Requested zone ID is passed in buf, so 0 len not allowed */
2005 + if (bufsize == 0)
2006 + return (set_errno(EINVAL));
2007 + if (ddi_copyin((caddr_t)buf, &zreq, sizeof (vmusage_t), cpflg))
2008 + return (set_errno(EFAULT));
2009 + req_zone_id = zreq.vmu_id;
2010 + }
2011 +
1937 2012 mutex_enter(&vmu_data.vmu_lock);
1938 2013 now = gethrtime();
1939 2014
1940 2015 start:
1941 2016 if (vmu_data.vmu_cache != NULL) {
1942 2017
1943 2018 vmu_cache_t *cache;
1944 2019
1945 2020 if ((vmu_data.vmu_cache->vmc_timestamp +
1946 2021 ((hrtime_t)age * NANOSEC)) > now)
1947 2022 cacherecent = 1;
1948 2023
1949 2024 if ((vmu_data.vmu_cache->vmc_flags & flags) == flags &&
1950 2025 cacherecent == 1) {
1951 2026 cache = vmu_data.vmu_cache;
1952 2027 vmu_cache_hold(cache);
1953 2028 mutex_exit(&vmu_data.vmu_lock);
1954 2029
1955 2030 ret = vmu_copyout_results(cache, buf, nres, flags_orig,
1956 - cpflg);
2031 + req_zone_id, cpflg);
1957 2032 mutex_enter(&vmu_data.vmu_lock);
1958 2033 vmu_cache_rele(cache);
1959 2034 if (vmu_data.vmu_pending_waiters > 0)
1960 2035 cv_broadcast(&vmu_data.vmu_cv);
1961 2036 mutex_exit(&vmu_data.vmu_lock);
1962 2037 return (ret);
1963 2038 }
1964 2039 /*
1965 2040 * If the cache is recent, it is likely that there are other
1966 2041 * consumers of vm_getusage running, so add their flags to the
1967 2042 * desired flags for the calculation.
1968 2043 */
1969 2044 if (cacherecent == 1)
1970 2045 flags = vmu_data.vmu_cache->vmc_flags | flags;
1971 2046 }
1972 2047 if (vmu_data.vmu_calc_thread == NULL) {
1973 2048
1974 2049 vmu_cache_t *cache;
1975 2050
1976 2051 vmu_data.vmu_calc_thread = curthread;
1977 2052 vmu_data.vmu_calc_flags = flags;
1978 2053 vmu_data.vmu_entities = NULL;
1979 2054 vmu_data.vmu_nentities = 0;
1980 2055 if (vmu_data.vmu_pending_waiters > 0)
1981 2056 vmu_data.vmu_calc_flags |=
1982 2057 vmu_data.vmu_pending_flags;
1983 2058
1984 2059 vmu_data.vmu_pending_flags = 0;
1985 2060 mutex_exit(&vmu_data.vmu_lock);
1986 2061 vmu_calculate();
1987 2062 mutex_enter(&vmu_data.vmu_lock);
1988 2063 /* copy results to cache */
1989 2064 if (vmu_data.vmu_cache != NULL)
1990 2065 vmu_cache_rele(vmu_data.vmu_cache);
1991 2066 cache = vmu_data.vmu_cache =
1992 2067 vmu_cache_alloc(vmu_data.vmu_nentities,
1993 2068 vmu_data.vmu_calc_flags);
1994 2069
1995 2070 result = cache->vmc_results;
1996 2071 for (entity = vmu_data.vmu_entities; entity != NULL;
1997 2072 entity = entity->vme_next) {
1998 2073 *result = entity->vme_result;
1999 2074 result++;
2000 2075 }
2001 2076 cache->vmc_timestamp = gethrtime();
2002 2077 vmu_cache_hold(cache);
2003 2078
2004 2079 vmu_data.vmu_calc_flags = 0;
2005 2080 vmu_data.vmu_calc_thread = NULL;
2006 2081
2007 2082 if (vmu_data.vmu_pending_waiters > 0)
2008 2083 cv_broadcast(&vmu_data.vmu_cv);
2009 2084
2010 2085 mutex_exit(&vmu_data.vmu_lock);
2011 2086
2087 + /* update zone's phys. mem. rctl usage */
2088 + vmu_update_zone_rctls(cache);
2012 2089 /* copy cache */
2013 - ret = vmu_copyout_results(cache, buf, nres, flags_orig, cpflg);
2090 + ret = vmu_copyout_results(cache, buf, nres, flags_orig,
2091 + req_zone_id, cpflg);
2014 2092 mutex_enter(&vmu_data.vmu_lock);
2015 2093 vmu_cache_rele(cache);
2016 2094 mutex_exit(&vmu_data.vmu_lock);
2017 2095
2018 2096 return (ret);
2019 2097 }
2020 2098 vmu_data.vmu_pending_flags |= flags;
2021 2099 vmu_data.vmu_pending_waiters++;
2022 2100 while (vmu_data.vmu_calc_thread != NULL) {
2023 2101 if (cv_wait_sig(&vmu_data.vmu_cv,
2024 2102 &vmu_data.vmu_lock) == 0) {
2025 2103 vmu_data.vmu_pending_waiters--;
2026 2104 mutex_exit(&vmu_data.vmu_lock);
2027 2105 return (set_errno(EINTR));
2028 2106 }
2029 2107 }
2030 2108 vmu_data.vmu_pending_waiters--;
2031 2109 goto start;
2032 2110 }
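
For a global-zone consumer such as zoneadmd, the new VMUSAGE_A_ZONE flag avoids
copying out every zone's results: as the copyin logic above shows, the requested
zone ID is passed in through the vmu_id field of the first vmusage_t in buf.
A minimal userland sketch of that single-zone query, assuming the getvmusage()
libc wrapper; the helper name and the 10-second age are illustrative only:

	#include <sys/types.h>
	#include <sys/vm_usage.h>
	#include <stdio.h>

	static void
	print_zone_rss(zoneid_t zid)
	{
		vmusage_t result;
		size_t nres = 1;

		/* The requested zone ID rides in the result buffer. */
		result.vmu_id = zid;
		if (getvmusage(VMUSAGE_A_ZONE, 10, &result, &nres) != 0) {
			perror("getvmusage");
			return;
		}
		(void) printf("zone %d rss %llu swap %llu\n", (int)zid,
		    (u_longlong_t)result.vmu_rss_all,
		    (u_longlong_t)result.vmu_swap_all);
	}
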
2111 +
2112 +#if defined(__x86)
2113 +/*
2114 + * Attempt to invalidate all of the pages in the mapping for the given process.
2115 + */
2116 +static void
2117 +map_inval(proc_t *p, struct seg *seg, caddr_t addr, size_t size)
2118 +{
2119 + page_t *pp;
2120 + size_t psize;
2121 + u_offset_t off;
2122 + caddr_t eaddr;
2123 + struct vnode *vp;
2124 + struct segvn_data *svd;
2125 + struct hat *victim_hat;
2126 +
2127 + ASSERT((addr + size) <= (seg->s_base + seg->s_size));
2128 +
2129 + victim_hat = p->p_as->a_hat;
2130 + svd = (struct segvn_data *)seg->s_data;
2131 + vp = svd->vp;
2132 + psize = page_get_pagesize(seg->s_szc);
2133 +
2134 + off = svd->offset + (uintptr_t)(addr - seg->s_base);
2135 +
2136 + for (eaddr = addr + size; addr < eaddr; addr += psize, off += psize) {
2137 + pp = page_lookup_nowait(vp, off, SE_SHARED);
2138 +
2139 + if (pp != NULL) {
2140 + /* following logic based on pvn_getdirty() */
2141 +
2142 + if (pp->p_lckcnt != 0 || pp->p_cowcnt != 0) {
2143 + page_unlock(pp);
2144 + continue;
2145 + }
2146 +
2147 + page_io_lock(pp);
2148 + hat_page_inval(pp, 0, victim_hat);
2149 + page_io_unlock(pp);
2150 +
2151 + /*
2152 + * For B_INVALCURONLY-style handling we let
2153 + * page_release call VN_DISPOSE if no one else is using
2154 + * the page.
2155 + *
2156 + * A hat_ismod() check would be useless because:
2157 + * (1) we are not holding the SE_EXCL lock
2158 + * (2) we've not unloaded _all_ translations
2159 + *
2160 + * Let page_release() do the heavy-lifting.
2161 + */
2162 + (void) page_release(pp, 1);
2163 + }
2164 + }
2165 +}
2166 +
2167 +/*
2168 + * vm_map_inval()
2169 + *
2170 + * Invalidate as many pages as possible within the given mapping for the given
2171 + * process. addr is expected to be the base address of the mapping and size is
2172 + * the length of the mapping. In some cases a mapping will encompass an
2173 + * entire segment, but at least for anon or stack mappings, these will be
2174 + * regions within a single large segment. Thus, the invalidation is oriented
2175 + * around a single mapping and not an entire segment.
2176 + *
2177 + * The SPARC sfmmu hat does not support HAT_CURPROC_PGUNLOAD-style handling,
2178 + * so this code is only applicable to x86.
2179 + */
2180 +int
2181 +vm_map_inval(pid_t pid, caddr_t addr, size_t size)
2182 +{
2183 + int ret;
2184 + int error = 0;
2185 + proc_t *p; /* target proc */
2186 + struct as *as; /* target proc's address space */
2187 + struct seg *seg; /* working segment */
2188 +
2189 + if (curproc->p_zone != global_zone || crgetruid(curproc->p_cred) != 0)
2190 + return (set_errno(EPERM));
2191 +
2192 + /* If not a valid mapping address, return an error */
2193 + if ((caddr_t)((uintptr_t)addr & (uintptr_t)PAGEMASK) != addr)
2194 + return (set_errno(EINVAL));
2195 +
2196 +again:
2197 + mutex_enter(&pidlock);
2198 + p = prfind(pid);
2199 + if (p == NULL) {
2200 + mutex_exit(&pidlock);
2201 + return (set_errno(ESRCH));
2202 + }
2203 +
2204 + mutex_enter(&p->p_lock);
2205 + mutex_exit(&pidlock);
2206 +
2207 + if (panicstr != NULL) {
2208 + mutex_exit(&p->p_lock);
2209 + return (0);
2210 + }
2211 +
2212 + as = p->p_as;
2213 +
2214 + /*
2215 + * Try to set P_PR_LOCK - prevents process "changing shape"
2216 + * - blocks fork
2217 + * - blocks sigkill
2218 + * - cannot be a system proc
2219 + * - must be fully created proc
2220 + */
2221 + ret = sprtrylock_proc(p);
2222 + if (ret == -1) {
2223 + /* Process in invalid state */
2224 + mutex_exit(&p->p_lock);
2225 + return (set_errno(ESRCH));
2226 + }
2227 +
2228 + if (ret == 1) {
2229 + /*
2230 + * P_PR_LOCK is already set. Wait and try again. This also
2231 + * drops p_lock so p may no longer be valid since the proc may
2232 + * have exited.
2233 + */
2234 + sprwaitlock_proc(p);
2235 + goto again;
2236 + }
2237 +
2238 + /* P_PR_LOCK is now set */
2239 + mutex_exit(&p->p_lock);
2240 +
2241 + AS_LOCK_ENTER(as, RW_READER);
2242 + if ((seg = as_segat(as, addr)) == NULL) {
2243 + AS_LOCK_EXIT(as);
2244 + mutex_enter(&p->p_lock);
2245 + sprunlock(p);
2246 + return (set_errno(ENOMEM));
2247 + }
2248 +
2249 + /*
2250 + * The invalidation behavior only makes sense for vnode-backed segments.
2251 + */
2252 + if (seg->s_ops != &segvn_ops) {
2253 + AS_LOCK_EXIT(as);
2254 + mutex_enter(&p->p_lock);
2255 + sprunlock(p);
2256 + return (0);
2257 + }
2258 +
2259 + /*
2260 + * If the mapping is out of bounds of the segment, return an error.
2261 + */
2262 + if ((addr + size) > (seg->s_base + seg->s_size)) {
2263 + AS_LOCK_EXIT(as);
2264 + mutex_enter(&p->p_lock);
2265 + sprunlock(p);
2266 + return (set_errno(EINVAL));
2267 + }
2268 +
2269 + /*
2270 + * Don't use MS_INVALCURPROC flag here since that would eventually
2271 + * initiate hat invalidation based on curthread. Since we're doing this
2272 + * on behalf of a different process, that would erroneously invalidate
2273 + * our own process mappings.
2274 + */
2275 + error = SEGOP_SYNC(seg, addr, size, 0, (uint_t)MS_ASYNC);
2276 + if (error == 0) {
2277 + /*
2278 + * Since we didn't invalidate during the sync above, we now
2279 + * try to invalidate all of the pages in the mapping.
2280 + */
2281 + map_inval(p, seg, addr, size);
2282 + }
2283 + AS_LOCK_EXIT(as);
2284 +
2285 + mutex_enter(&p->p_lock);
2286 + sprunlock(p);
2287 +
2288 + if (error)
2289 + (void) set_errno(error);
2290 + return (error);
2291 +}
2292 +#endif
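
A hypothetical in-kernel caller of the x86-only entry point above, for example
the zone memory-capping path contemplated by OS-3088, might look like the
sketch below once it has discovered a page-aligned mapping [addr, addr + len)
in process pid. The function name memcap_shed_mapping and the surrounding
discovery of the mapping are assumptions, not part of this change;
vm_map_inval() does its own prfind() and sprtrylock_proc() handling, so the
caller holds no process locks.

	#if defined(__x86)
	static void
	memcap_shed_mapping(pid_t pid, caddr_t addr, size_t len)
	{
		/*
		 * Best effort: ESRCH, EINVAL or ENOMEM simply mean the
		 * process exited or the mapping moved or shrank since it
		 * was discovered, so there is nothing left to invalidate.
		 */
		(void) vm_map_inval(pid, addr, len);
	}
	#endif
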