/*
 * zsmalloc memory allocator
 *
 * Copyright (C) 2011 Nitin Gupta
 * Copyright (C) 2012, 2013 Minchan Kim
 *
 * This code is released using a dual license strategy: BSD/GPL
 * You can choose the license that better fits your requirements.
 *
 * Released under the terms of 3-clause BSD License
 * Released under the terms of GNU General Public License Version 2.0
 */

#ifndef _ZS_MALLOC_H_
#define _ZS_MALLOC_H_

#include <linux/types.h>

struct zs_pool_stats {
	/* How many pages were migrated (freed) */
	atomic_long_t pages_compacted;
};

struct zs_pool;

struct zs_pool *zs_create_pool(const char *name);
void zs_destroy_pool(struct zs_pool *pool);
zsmalloc: prefer the original page's node for compressed data

Currently, zsmalloc, zswap's and zram's backend memory allocator, does not
enforce any policy for the allocation of memory for the compressed data,
instead just adopting the memory policy of the task entering reclaim, or
the default policy (prefer local node) if no such policy is specified.
This can lead to several pathological behaviors in multi-node NUMA
systems:

1. Systems with CXL-based memory tiering can encounter the following
   inversion with zswap/zram: the coldest pages demoted to the CXL tier
   can return to the high tier when they are reclaimed to compressed swap,
   creating memory pressure on the high tier.

2. Consider a direct reclaimer scanning nodes in order of allocation
   preference. If it ventures into remote nodes, the memory it compresses
   there should stay there. Trying to shift those contents over to the
   reclaiming thread's preferred node further *increases* its local
   pressure, provoking more spills. The remote node is also the most
   likely to refault this data again. This undesirable behavior was
   pointed out by Johannes Weiner in [1].

3. For zswap writeback, the zswap entries are organized in
   node-specific LRUs, based on the node placement of the original pages,
   allowing for targeted zswap writeback for specific nodes.
   However, the compressed data of a zswap entry can be placed on a
   different node from the LRU it is placed on. This means that reclaim
   targeted at one node might not free up memory used for zswap entries in
   that node, but instead free memory in a different node.

All of these issues will be resolved if the compressed data go to the same
node as the original page. This patch encourages this behavior by having
zswap and zram pass the node of the original page to zsmalloc, and having
zsmalloc prefer the specified node if we need to allocate new (zs)pages
for the compressed data.

Note that we are not strictly binding the allocation to the preferred
node. We still allow the allocation to fall back to other nodes when the
preferred node is full, or if we have zspages with slots available on a
different node. This is OK, and still a strict improvement over the
status quo:

1. On a system with demotion enabled, we will generally prefer
   demotions over compressed swapping, and only swap when pages have
   already gone to the lowest tier. This patch should achieve the desired
   effect for the most part.

2. If the preferred node is out of memory, letting the compressed data
   go to other nodes can be better than the alternative (OOMs, keeping
   cold memory unreclaimed, disk swapping, etc.).

3. If the allocation goes to a separate node because we have a zspage
   with slots available, at least we're not creating extra immediate
   memory pressure (since the space is already allocated).

4. While there can be mixing, we generally reclaim pages in same-node
   batches, which encourages zspage grouping that is more likely to go to
   the right node.

5. A strict binding would require partitioning zsmalloc by node, which
   is more complicated and more prone to regression, since it reduces the
   storage density of zsmalloc. We need to evaluate the tradeoff and
   benchmark carefully before adopting such an involved solution.

[1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/

[senozhatsky@chromium.org: coding-style fixes]
Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge
Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram, zsmalloc]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> [zswap/zsmalloc]
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-02 20:44:16 +00:00
unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
			const int nid);
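Illustrative sketch (not part of this header): one way a zram- or zswap-style
caller might pass the original page's node down to zs_malloc() so the
compressed copy stays local. The helper name alloc_on_page_node() is
hypothetical, and the GFP flags and the IS_ERR_VALUE() error convention are
assumptions modelled on typical callers.

#include <linux/err.h>		/* IS_ERR_VALUE() */
#include <linux/gfp.h>
#include <linux/mm.h>		/* page_to_nid() */
#include <linux/zsmalloc.h>

/* Hypothetical helper: back @comp_len bytes of compressed data for @page. */
static unsigned long alloc_on_page_node(struct zs_pool *pool,
					struct page *page, size_t comp_len)
{
	/* Prefer the node of the original page for the compressed copy. */
	unsigned long handle = zs_malloc(pool, comp_len,
					 GFP_NOIO | __GFP_NOWARN |
					 __GFP_HIGHMEM | __GFP_MOVABLE,
					 page_to_nid(page));

	if (IS_ERR_VALUE(handle))
		return 0;	/* allocation failed; caller falls back */

	return handle;
}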
void zs_free(struct zs_pool *pool, unsigned long obj);
zsmalloc: introduce zs_huge_class_size()

Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.

ZRAM's max_zpage_size is a bad thing. It forces zsmalloc to store
normal objects as huge ones, which results in bigger zsmalloc memory
usage. Drop it and use the actual zsmalloc huge-class value when deciding
whether an object is huge.

This patch (of 2):

Not every object can share its zspage with other objects, e.g. when
the object is as big as a zspage or nearly as big as a zspage. For such
objects zsmalloc has a so-called huge class: every object which belongs
to the huge class consumes the entire zspage (which consists of a physical
page). On an x86_64 box with PAGE_SHIFT 12, the first non-huge class size
is 3264, so from size 3264 downwards objects can share page(s) and
thus minimize memory wastage.

ZRAM, however, has its own statically defined watermark for huge
objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
object larger than this watermark (3072) as a PAGE_SIZE object, in other
words, in a huge class, while zsmalloc can keep some of those objects in
non-huge classes. This results in increased memory consumption.

zsmalloc knows better whether an object is huge or not. Introduce a
zs_huge_class_size() function which tells whether the given object can be
stored in one of the non-huge classes. This will let us drop
ZRAM's huge-object watermark and fully rely on zsmalloc when we decide
whether an object is huge.

[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 23:24:43 +00:00
size_t zs_huge_class_size(struct zs_pool *pool);
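Illustrative sketch (hypothetical names, not part of this header): a caller
can cache the pool's huge-class threshold once and test compressed sizes
against it, instead of hard-coding a watermark such as 3 * PAGE_SIZE / 4.

#include <linux/types.h>
#include <linux/zsmalloc.h>

/* Cached once; objects of at least this size cannot share a zspage. */
static size_t huge_class_size;

static void cache_huge_class_size(struct zs_pool *pool)
{
	if (!huge_class_size)
		huge_class_size = zs_huge_class_size(pool);
}

/*
 * True if an object of @comp_len bytes falls into a huge class, i.e. it
 * would consume an entire zspage on its own.
 */
static bool obj_is_huge(size_t comp_len)
{
	return comp_len >= huge_class_size;
}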
unsigned long zs_get_total_pages(struct zs_pool *pool);
unsigned long zs_compact(struct zs_pool *pool);

unsigned int zs_lookup_class_index(struct zs_pool *pool, unsigned int size);

void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);
zsmalloc: introduce new object mapping API

The current object mapping API is a little cumbersome. First, it's
inconsistent: sometimes it returns with page-faults disabled and sometimes
with page-faults enabled. Second, and most importantly, it enforces
atomicity restrictions on its users. zs_map_object() has to return a
linear object address, which is not always possible because some objects
span multiple physical (non-contiguous) pages. For such objects zsmalloc
uses a per-CPU buffer to which the object's data is copied before a pointer
to that per-CPU buffer is returned to the caller. This leads to
another, final issue: an extra memcpy(). Since the caller gets a pointer
to the per-CPU buffer, it can memcpy() data only to that buffer, and during
zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer to the
physical pages that the object in question spans.

The new API splits functions by access mode:

- zs_obj_read_begin(handle, local_copy)
  Returns a pointer to handle memory. For objects that span two
  physical pages, a local_copy buffer is used to store the object's
  data before the address is returned to the caller. Otherwise
  the object's page is kmap_local mapped directly.

- zs_obj_read_end(handle, buf)
  Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().

- zs_obj_write(handle, buf, len)
  Copies len bytes from the compression buffer to handle memory
  (takes care of objects that span two pages). This does not
  need any additional (e.g. per-CPU) buffers and writes the data
  directly to zsmalloc pool pages.

In terms of performance, on a synthetic and completely reproducible
test that allocates a fixed number of objects of fixed sizes and
iterates over those objects, first mapping in RO then in RW mode:

OLD API
=======
First 3 results out of 10:
369,205,778 instructions # 0.80 insn per cycle
 40,467,926 branches     # 113.732 M/sec
369,002,122 instructions # 0.62 insn per cycle
 40,426,145 branches     # 189.361 M/sec
369,036,706 instructions # 0.63 insn per cycle
 40,430,860 branches     # 204.105 M/sec
[..]

NEW API
=======
First 3 results out of 10:
265,799,293 instructions # 0.51 insn per cycle
 29,834,567 branches     # 170.281 M/sec
265,765,970 instructions # 0.55 insn per cycle
 29,829,019 branches     # 161.602 M/sec
265,764,702 instructions # 0.51 insn per cycle
 29,828,015 branches     # 189.677 M/sec
[..]

T-test on all 10 runs
=====================
Difference at 95.0% confidence
	-1.03219e+08 +/- 55308.7
	-27.9705% +/- 0.0149878%
	(Student's t, pooled s = 58864.4)

The old API will stay around until the remaining users switch to the new
one. After that we'll also remove the zsmalloc per-CPU buffer and CPU
hotplug handling.

The split of map(RO) and map(WO) into read_{begin/end}/write was suggested
by Yosry Ahmed.

Link: https://lkml.kernel.org/r/20250303022425.285971-15-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-03 02:03:23 +00:00
void *zs_obj_read_begin(struct zs_pool *pool, unsigned long handle,
			void *local_copy);
void zs_obj_read_end(struct zs_pool *pool, unsigned long handle,
		     void *handle_mem);
void zs_obj_write(struct zs_pool *pool, unsigned long handle,
		  void *handle_mem, size_t mem_len);
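Illustrative sketch (the wrapper names are hypothetical, not part of this
header): a read goes through zs_obj_read_begin()/zs_obj_read_end(), which may
hand back either a direct kmap_local mapping or the caller-provided scratch
buffer, while a write goes through zs_obj_write(), which copies straight into
the pool pages. The scratch buffer is assumed to be at least as large as the
object.

#include <linux/string.h>	/* memcpy() */
#include <linux/zsmalloc.h>

/*
 * Copy @obj_len bytes of an object's data into @dst. @local_copy must be
 * large enough to hold the object in case it spans two physical pages.
 */
static void obj_read(struct zs_pool *pool, unsigned long handle,
		     size_t obj_len, void *local_copy, void *dst)
{
	void *src = zs_obj_read_begin(pool, handle, local_copy);

	memcpy(dst, src, obj_len);
	/* Drops the kmap_local mapping if one was taken. */
	zs_obj_read_end(pool, handle, src);
}

/* Store @len bytes of already-compressed data at @handle. */
static void obj_write(struct zs_pool *pool, unsigned long handle,
		      void *src, size_t len)
{
	/* Writes directly into the pool pages; no per-CPU bounce buffer. */
	zs_obj_write(pool, handle, src, len);
}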
#endif