The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608 ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
the goal of accounting objects that can be reclaimed, but cannot be
allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
is converted.
The counter is however still useful for accounting direct page allocations
(i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
and:
- change granularity to pages to be more like other counters; sub-page
allocations should be able to use kmalloc
- rename the counter to NR_KERNEL_MISC_RECLAIMABLE
- expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
again remove the check for not printing "hidden" counters
Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cherry picked from commit b29940c1abd7a4c3abeb926df0a5ec84d6902d47)
Bug: 138148041
Test: verify KReclaimable accounting after ION allocation+deallocation
Change-Id: I6694939516e5dfec388e038131021e3885c2640b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl3zQ1wACgkQONu9yGCS
aT6ZDA/+JQyM+mgrU2t5mkq9lXCwL87Jiooy0kKT9b/2EWmW5Gdxp/On9PXfqtfs
uZ+v0A1g1H+582uwuqG1wB2jr3I2AhNnRNbvSypGtk1Kitx9HqVJD/wWRRVCULww
cr3uA/ZOX+deRjOVYP3dhFp7ycn6u5+GxgmFQTLmKAYN8uUqq4/dpWy01iB0nr2A
GcoLm9P96o8P/wIWaykqOvshDrocbFcBL4VuxLeZCbFsAMTiX+jJnyIL8W7gfBJl
M2626S/hESk5DvGcMN3zwOw/nTJlvySUtfqXSvPk0sT90UMx/YZ9QdpS9GkvRb9t
OA1G+iHguEU+Fq/DawUyxwk/kt3nA6cg0q7RSxHo7QP6SGo7OaHHS1myzGDhL8oc
LDKXO2iSSzvXJDlqrU45N+1YhpeiIHCxmDctbUIM9dP4u6wWmQIyYXLrcpupTsm9
StiDBguXFHWSBFhG0+MlTUU5cypVNoN+56wBAUTR6+qoDASTzGvjNbrBsQihODV0
RMFJF17Zvn+UoEohe860EMswUBsJ+F+VSZO5yGuZgsaC/2Ih6M1dxsiNU7RF02gX
fRis6huj1+642ZsEbd2tueYGUaDN1HpMsVkN3AAkD3pJF5lX7AJRwhvRyC8N1jhc
G90KMSk2pR/ItjmUpkKaAhAKhN+oKSzuCPpHj2iGotfWdd4slXQ=
=Ekyt
-----END PGP SIGNATURE-----
Merge 4.19.89 into android-4.19
Changes in 4.19.89
rsi: release skb if rsi_prepare_beacon fails
arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
sparc64: implement ioremap_uc
lp: fix sparc64 LPSETTIMEOUT ioctl
usb: gadget: u_serial: add missing port entry locking
tty: serial: fsl_lpuart: use the sg count from dma_map_sg
tty: serial: msm_serial: Fix flow control
serial: pl011: Fix DMA ->flush_buffer()
serial: serial_core: Perform NULL checks for break_ctl ops
serial: ifx6x60: add missed pm_runtime_disable
autofs: fix a leak in autofs_expire_indirect()
RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
iwlwifi: pcie: don't consider IV len in A-MSDU
exportfs_decode_fh(): negative pinned may become positive without the parent locked
audit_get_nd(): don't unlock parent too early
NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
xfrm: release device reference for invalid state
Input: cyttsp4_core - fix use after free bug
sched/core: Avoid spurious lock dependencies
perf/core: Consistently fail fork on allocation failures
ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
drm/sun4i: tcon: Set min division of TCON0_DCLK to 1.
selftests: kvm: fix build with glibc >= 2.30
rsxx: add missed destroy_workqueue calls in remove
net: ep93xx_eth: fix mismatch of request_mem_region in remove
i2c: core: fix use after free in of_i2c_notify
serial: core: Allow processing sysrq at port unlock time
cxgb4vf: fix memleak in mac_hlist initialization
iwlwifi: mvm: synchronize TID queue removal
iwlwifi: trans: Clear persistence bit when starting the FW
iwlwifi: mvm: Send non offchannel traffic via AP sta
ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
audit: Embed key into chunk
netfilter: nf_tables: don't use position attribute on rule replacement
ARC: IOC: panic if kernel was started with previously enabled IOC
net/mlx5: Release resource on error flow
clk: sunxi-ng: a64: Fix gate bit of DSI DPHY
ice: Fix NVM mask defines
dlm: fix possible call to kfree() for non-initialized pointer
ARM: dts: exynos: Fix LDO13 min values on Odroid XU3/XU4/HC1
extcon: max8997: Fix lack of path setting in USB device mode
net: ethernet: ti: cpts: correct debug for expired txq skb
rtc: s3c-rtc: Avoid using broken ALMYEAR register
rtc: max77686: Fix the returned value in case of error in 'max77686_rtc_read_time()'
i40e: don't restart nway if autoneg not supported
virtchnl: Fix off by one error
clk: rockchip: fix rk3188 sclk_smc gate data
clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
dlm: fix missing idr_destroy for recover_idr
MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
net: dsa: mv88e6xxx: Work around mv886e6161 SERDES missing MII_PHYSID2
scsi: zfcp: update kernel message for invalid FCP_CMND length, it's not the CDB
scsi: zfcp: drop default switch case which might paper over missing case
drivers: soc: Allow building the amlogic drivers without ARCH_MESON
bus: ti-sysc: Fix getting optional clocks in clock_roles
ARM: dts: imx6: RDU2: fix eGalax touchscreen node
crypto: ecc - check for invalid values in the key verification test
crypto: bcm - fix normal/non key hash algorithm failure
arm64: dts: zynqmp: Fix node names which contain "_"
pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
Staging: iio: adt7316: Fix i2c data reading, set the data field
firmware: raspberrypi: Fix firmware calls with large buffers
mm/vmstat.c: fix NUMA statistics updates
clk: rockchip: fix I2S1 clock gate register for rk3328
clk: rockchip: fix ID of 8ch clock of I2S1 for rk3328
sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit
regulator: Fix return value of _set_load() stub
USB: serial: f81534: fix reading old/new IC config
xfs: extent shifting doesn't fully invalidate page cache
net-next/hinic:fix a bug in set mac address
net-next/hinic: fix a bug in rx data flow
ice: Fix return value from NAPI poll
ice: Fix possible NULL pointer de-reference
iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents
iomap: sub-block dio needs to zeroout beyond EOF
iomap: dio data corruption and spurious errors when pipes fill
iomap: readpages doesn't zero page tail beyond EOF
iw_cxgb4: only reconnect with MPAv1 if the peer aborts
MIPS: OCTEON: octeon-platform: fix typing
net/smc: use after free fix in smc_wr_tx_put_slot()
math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
nds32: Fix the items of hwcap_str ordering issue.
rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
rtc: dt-binding: abx80x: fix resistance scale
ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
media: coda: fix memory corruption in case more than 32 instances are opened
media: pulse8-cec: return 0 when invalidating the logical address
media: cec: report Vendor ID after initialization
iwlwifi: fix cfg structs for 22000 with different RF modules
ravb: Clean up duplex handling
net/ipv6: re-do dad when interface has IFF_NOARP flag change
dmaengine: coh901318: Fix a double-lock bug
dmaengine: coh901318: Remove unused variable
dmaengine: dw-dmac: implement dma protection control setting
net: qualcomm: rmnet: move null check on dev before dereferecing it
selftests/powerpc: Allocate base registers
selftests/powerpc: Skip test instead of failing
usb: dwc3: debugfs: Properly print/set link state for HS
usb: dwc3: don't log probe deferrals; but do log other error codes
ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
f2fs: fix to account preflush command for noflush_merge mode
f2fs: fix count of seg_freed to make sec_freed correct
f2fs: change segment to section in f2fs_ioc_gc_range
ARM: dts: rockchip: Fix the PMU interrupt number for rv1108
ARM: dts: rockchip: Assign the proper GPIO clocks for rv1108
f2fs: fix to allow node segment for GC by ioctl path
sparc: Fix JIT fused branch convergance.
sparc: Correct ctx->saw_frame_pointer logic.
nvme: Free ctrl device name on init failure
dma-mapping: fix return type of dma_set_max_seg_size()
slimbus: ngd: Fix build error on x86
altera-stapl: check for a null key before strcasecmp'ing it
serial: imx: fix error handling in console_setup
i2c: imx: don't print error message on probe defer
clk: meson: Fix GXL HDMI PLL fractional bits width
gpu: host1x: Fix syncpoint ID field size on Tegra186
lockd: fix decoding of TEST results
sctp: increase sk_wmem_alloc when head->truesize is increased
iommu/amd: Fix line-break in error log reporting
ASoC: rsnd: tidyup registering method for rsnd_kctrl_new()
ARM: dts: sun4i: Fix gpio-keys warning
ARM: dts: sun4i: Fix HDMI output DTC warning
ARM: dts: sun5i: a10s: Fix HDMI output DTC warning
ARM: dts: r8a779[01]: Disable unconnected LVDS encoders
ARM: dts: sun7i: Fix HDMI output DTC warning
ARM: dts: sun8i: a23/a33: Fix OPP DTC warnings
ARM: dts: sun8i: v3s: Change pinctrl nodes to avoid warning
dlm: NULL check before kmem_cache_destroy is not needed
ARM: debug: enable UART1 for socfpga Cyclone5
can: xilinx: fix return type of ndo_start_xmit function
nfsd: fix a warning in __cld_pipe_upcall()
bpf: btf: implement btf_name_valid_identifier()
bpf: btf: check name validity for various types
tools: bpftool: fix a bitfield pretty print issue
ASoC: au8540: use 64-bit arithmetic instead of 32-bit
ARM: OMAP1/2: fix SoC name printing
arm64: dts: meson-gxl-libretech-cc: fix GPIO lines names
arm64: dts: meson-gxbb-nanopi-k2: fix GPIO lines names
arm64: dts: meson-gxbb-odroidc2: fix GPIO lines names
arm64: dts: meson-gxl-khadas-vim: fix GPIO lines names
net/x25: fix called/calling length calculation in x25_parse_address_block
net/x25: fix null_x25_address handling
tools/bpf: make libbpf _GNU_SOURCE friendly
clk: mediatek: Drop __init from mtk_clk_register_cpumuxes()
clk: mediatek: Drop more __init markings for driver probe
soc: renesas: r8a77970-sysc: Correct names of A2DP/A2CN power domains
soc: renesas: r8a77980-sysc: Correct names of A2DP[01] power domains
soc: renesas: r8a77980-sysc: Correct A3VIP[012] power domain hierarchy
kbuild: disable dtc simple_bus_reg warnings by default
tcp: make tcp_space() aware of socket backlog
ARM: dts: mmp2: fix the gpio interrupt cell number
ARM: dts: realview-pbx: Fix duplicate regulator nodes
tcp: fix off-by-one bug on aborting window-probing socket
tcp: fix SNMP under-estimation on failed retransmission
tcp: fix SNMP TCP timeout under-estimation
modpost: skip ELF local symbols during section mismatch check
kbuild: fix single target build for external module
mtd: fix mtd_oobavail() incoherent returned value
ARM: dts: pxa: clean up USB controller nodes
clk: meson: meson8b: fix the offset of vid_pll_dco's N value
clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
clk: qcom: Fix MSM8998 resets
media: cxd2880-spi: fix probe when dvb_attach fails
ARM: dts: realview: Fix some more duplicate regulator nodes
dlm: fix invalid cluster name warning
net/mlx4_core: Fix return codes of unsupported operations
pstore/ram: Avoid NULL deref in ftrace merging failure path
powerpc/math-emu: Update macros from GCC
clk: renesas: r8a77990: Correct parent clock of DU
clk: renesas: r8a77995: Correct parent clock of DU
MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
nfsd: Return EPERM, not EACCES, in some SETATTR cases
media: uvcvideo: Abstract streaming object lifetime
tty: serial: qcom_geni_serial: Fix softlock
ARM: dts: sun8i: h3: Fix the system-control register range
tty: Don't block on IO when ldisc change is pending
media: stkwebcam: Bugfix for wrong return values
firmware: qcom: scm: fix compilation error when disabled
clk: qcom: gcc-msm8998: Disable halt check of UFS clocks
sctp: frag_point sanity check
soc: renesas: r8a77990-sysc: Fix initialization order of 3DG-{A,B}
mlxsw: spectrum_router: Relax GRE decap matching check
IB/hfi1: Ignore LNI errors before DC8051 transitions to Polling state
IB/hfi1: Close VNIC sdma_progress sleep window
mlx4: Use snprintf instead of complicated strcpy
usb: mtu3: fix dbginfo in qmu_tx_zlp_error_handler
clk: renesas: rcar-gen3: Set state when registering SD clocks
ASoC: max9867: Fix power management
ARM: dts: sunxi: Fix PMU compatible strings
ARM: dts: am335x-pdu001: Fix polarity of card detection input
media: vimc: fix start stream when link is disabled
net: aquantia: fix RSS table and key sizes
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
fuse: verify nlink
fuse: verify attributes
ALSA: hda/realtek - Enable internal speaker of ASUS UX431FLC
ALSA: hda/realtek - Enable the headset-mic on a Xiaomi's laptop
ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236
ALSA: pcm: oss: Avoid potential buffer overflows
ALSA: hda - Add mute led support for HP ProBook 645 G4
Input: synaptics - switch another X1 Carbon 6 to RMI/SMbus
Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash
Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers
Input: goodix - add upside-down quirk for Teclast X89 tablet
coresight: etm4x: Fix input validation for sysfs.
Input: Fix memory leak in psxpad_spi_probe
x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
xfrm interface: fix memory leak on creation
xfrm interface: avoid corruption on changelink
xfrm interface: fix list corruption for x-netns
xfrm interface: fix management of phydev
CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
CIFS: Fix SMB2 oplock break processing
tty: vt: keyboard: reject invalid keycodes
can: slcan: Fix use-after-free Read in slcan_open
kernfs: fix ino wrap-around detection
jbd2: Fix possible overflow in jbd2_log_space_left()
drm/msm: fix memleak on release
drm/i810: Prevent underflow in ioctl
arm64: dts: exynos: Revert "Remove unneeded address space mapping for soc node"
KVM: arm/arm64: vgic: Don't rely on the wrong pending table
KVM: x86: do not modify masked bits of shared MSRs
KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
KVM: x86: Grab KVM's srcu lock when setting nested state
crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
crypto: atmel-aes - Fix IV handling when req->nbytes < ivsize
crypto: af_alg - cast ki_complete ternary op to int
crypto: ccp - fix uninitialized list head
crypto: ecdh - fix big endian bug in ECC library
crypto: user - fix memory leak in crypto_report
spi: atmel: Fix CS high support
mwifiex: update set_mac_address logic
can: ucan: fix non-atomic allocation in completion handler
RDMA/qib: Validate ->show()/store() callbacks before calling them
iomap: Fix pipe page leakage during splicing
thermal: Fix deadlock in thermal thermal_zone_device_check
vcs: prevent write access to vcsu devices
binder: Fix race between mmap() and binder_alloc_print_pages()
binder: Handle start==NULL in binder_update_page_range()
ALSA: hda - Fix pending unsol events at shutdown
md/raid0: Fix an error message in raid0_make_request()
watchdog: aspeed: Fix clock behaviour for ast2600
perf script: Fix invalid LBR/binary mismatch error
splice: don't read more than available pipe space
iomap: partially revert 4721a601099 (simulated directio short read on EFAULT)
xfs: add missing error check in xfs_prepare_shift()
ASoC: rsnd: fixup MIX kctrl registration
KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
net: qrtr: fix memort leak in qrtr_tun_write_iter
appletalk: Fix potential NULL pointer dereference in unregister_snap_client
appletalk: Set error code if register_snap_client failed
Linux 4.19.89
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fa59adde9a7e9a6d4684de0e95de14a8b83d0b
[ Upstream commit 13c9aaf7fa01cc7600c61981609feadeef3354ec ]
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.
The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.
[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes: 1d90ca897c ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This change adds accounting for the memory allocated for shadow stacks.
Bug: 145210207
Change-Id: I51157fe0b23b4cb28bb33c86a5dfe3ac911296a4
(am from https://lore.kernel.org/patchwork/patch/1149055/)
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl3K+EEACgkQONu9yGCS
aT5ReA/+IrdgTosfTtLizJhq+rR7UClBC4D7rOEBv6BVpKMNRdldcaVsFy6/MO6I
eq3keoicBpGxY8j6mp559GT7ACvXC3L6SALlwvfmGX6BHGVmwTE1Q6UmYbMfCoVY
kkZ6YUGc9T20hfvi1aG/bXerv90phIj3mSOttpLQsiNmlqkP4o2ayA9c/ohhINZ9
N3tEqh81vuZ9DQKdR6hY3iiqq8j3x5if5PppfdtQJMq8U7EUePHeiOrl6+j/PWXV
UHF+BGXVJBXF7Qn3NGzLAZsI5HI+PT9ec9R5/6kFwbKnD88CR80AJpwzmPXmLjTu
584oSQHzSQzSl5+r53SrK3EgOUdsduicGHJfO9RU1ttMAkkz+t+7LSSMDMX0qMFX
OGJdaiQc5arKNEcgckpnNDScRAlxK7IwUWfgcyFsjgMhLKxEfAuz8+JGJEp3QF6T
bNXupekzCnhE70j1/1MObLrTshW+BwIfYgSWwjH01Fq5h6/7IcuGWtfpuanhFj0k
sa4J02kdCvUPNAAlXAt07E/QS/4pLEV/Lh4K86MGR2ltYQjPEyYoY9LwIJAFEyQE
/DYUTS9GvlOitpbdUT48v/msZpOHWvhza8He/ZChN17RqFrIY9ykkPkA22moCqWB
w7d4HTP3dkYlt00e2DAclbCMsek9rNn+oTnY1jxGKa3pXxZAKyo=
=T5EF
-----END PGP SIGNATURE-----
Merge 4.19.84 into android-4.19
Changes in 4.19.84
bonding: fix state transition issue in link monitoring
CDC-NCM: handle incomplete transfer of MTU
ipv4: Fix table id reference in fib_sync_down_addr
net: ethernet: octeon_mgmt: Account for second possible VLAN header
net: fix data-race in neigh_event_send()
net: qualcomm: rmnet: Fix potential UAF when unregistering
net: usb: qmi_wwan: add support for DW5821e with eSIM support
NFC: fdp: fix incorrect free object
nfc: netlink: fix double device reference drop
NFC: st21nfca: fix double free
qede: fix NULL pointer deref in __qede_remove()
net: mscc: ocelot: don't handle netdev events for other netdevs
net: mscc: ocelot: fix NULL pointer on LAG slave removal
ipv6: fixes rt6_probe() and fib6_nh->last_probe init
net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
ALSA: timer: Fix incorrectly assigned timer instance
ALSA: bebob: fix to detect configured source of sampling clock for Focusrite Saffire Pro i/o series
ALSA: hda/ca0132 - Fix possible workqueue stall
mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
mm, meminit: recalculate pcpu batch and high limits after init completes
mm: thp: handle page cache THP correctly in PageTransCompoundMap
mm, vmstat: hide /proc/pagetypeinfo from normal users
dump_stack: avoid the livelock of the dump_lock
tools: gpio: Use !building_out_of_srctree to determine srctree
perf tools: Fix time sorting
drm/radeon: fix si_enable_smc_cac() failed issue
HID: wacom: generic: Treat serial number and related fields as unsigned
soundwire: depend on ACPI
soundwire: bus: set initial value to port_status
arm64: Do not mask out PTE_RDONLY in pte_same()
ceph: fix use-after-free in __ceph_remove_cap()
ceph: add missing check in d_revalidate snapdir handling
iio: adc: stm32-adc: fix stopping dma
iio: imu: adis16480: make sure provided frequency is positive
iio: srf04: fix wrong limitation in distance measuring
ARM: sunxi: Fix CPU powerdown on A83T
netfilter: nf_tables: Align nft_expr private data to 64-bit
netfilter: ipset: Fix an error code in ip_set_sockfn_get()
intel_th: pci: Add Comet Lake PCH support
intel_th: pci: Add Jasper Lake PCH support
x86/apic/32: Avoid bogus LDR warnings
SMB3: Fix persistent handles reconnect
can: usb_8dev: fix use-after-free on disconnect
can: flexcan: disable completely the ECC mechanism
can: c_can: c_can_poll(): only read status register after status IRQ
can: peak_usb: fix a potential out-of-sync while decoding packets
can: rx-offload: can_rx_offload_queue_sorted(): fix error handling, avoid skb mem leak
can: gs_usb: gs_can_open(): prevent memory leak
can: dev: add missing of_node_put() after calling of_get_child_by_name()
can: mcba_usb: fix use-after-free on disconnect
can: peak_usb: fix slab info leak
configfs: stash the data we need into configfs_buffer at open time
configfs_register_group() shouldn't be (and isn't) called in rmdirable parts
configfs: new object reprsenting tree fragments
configfs: provide exclusion between IO and removals
configfs: fix a deadlock in configfs_symlink()
ALSA: usb-audio: More validations of descriptor units
ALSA: usb-audio: Simplify parse_audio_unit()
ALSA: usb-audio: Unify the release of usb_mixer_elem_info objects
ALSA: usb-audio: Remove superfluous bLength checks
ALSA: usb-audio: Clean up check_input_term()
ALSA: usb-audio: Fix possible NULL dereference at create_yamaha_midi_quirk()
ALSA: usb-audio: remove some dead code
ALSA: usb-audio: Fix copy&paste error in the validator
sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
sched/fair: Fix -Wunused-but-set-variable warnings
usbip: Fix vhci_urb_enqueue() URB null transfer buffer error path
usbip: Implement SG support to vhci-hcd and stub driver
PCI: tegra: Enable Relaxed Ordering only for Tegra20 & Tegra30
HID: google: add magnemite/masterball USB ids
dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_config
dmaengine: sprd: Fix the possible memory leak issue
HID: intel-ish-hid: fix wrong error handling in ishtp_cl_alloc_tx_ring()
RDMA/mlx5: Clear old rate limit when closing QP
iw_cxgb4: fix ECN check on the passive accept
RDMA/qedr: Fix reported firmware version
net/mlx5e: TX, Fix consumer index of error cqe dump
net/mlx5: prevent memory leak in mlx5_fpga_conn_create_cq
scsi: qla2xxx: fixup incorrect usage of host_byte
RDMA/uverbs: Prevent potential underflow
net: openvswitch: free vport unless register_netdevice() succeeds
scsi: lpfc: Honor module parameter lpfc_use_adisc
scsi: qla2xxx: Initialized mailbox to prevent driver load failure
netfilter: nf_flow_table: set timeout before insertion into hashes
ipvs: don't ignore errors in case refcounting ip_vs module fails
ipvs: move old_secure_tcp into struct netns_ipvs
bonding: fix unexpected IFF_BONDING bit unset
macsec: fix refcnt leak in module exit routine
usb: fsl: Check memory resource before releasing it
usb: gadget: udc: atmel: Fix interrupt storm in FIFO mode.
usb: gadget: composite: Fix possible double free memory bug
usb: dwc3: pci: prevent memory leak in dwc3_pci_probe
usb: gadget: configfs: fix concurrent issue between composite APIs
usb: dwc3: remove the call trace of USBx_GFLADJ
perf/x86/amd/ibs: Fix reading of the IBS OpData register and thus precise RIP validity
perf/x86/amd/ibs: Handle erratum #420 only on the affected CPU family (10h)
perf/x86/uncore: Fix event group support
USB: Skip endpoints with 0 maxpacket length
USB: ldusb: use unsigned size format specifiers
usbip: tools: Fix read_usb_vudc_device() error path handling
RDMA/iw_cxgb4: Avoid freeing skb twice in arp failure case
RDMA/hns: Prevent memory leaks of eq->buf_list
scsi: qla2xxx: stop timer in shutdown path
nvme-multipath: fix possible io hang after ctrl reconnect
fjes: Handle workqueue allocation failure
net: hisilicon: Fix "Trying to free already-free IRQ"
net: mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is up
net: mscc: ocelot: refuse to overwrite the port's native vlan
iommu/amd: Apply the same IVRS IOAPIC workaround to Acer Aspire A315-41
drm/amdgpu: If amdgpu_ib_schedule fails return back the error.
drm/amd/display: Passive DP->HDMI dongle detection fix
hv_netvsc: Fix error handling in netvsc_attach()
usb: dwc3: gadget: fix race when disabling ep with cancelled xfers
NFSv4: Don't allow a cached open with a revoked delegation
net: ethernet: arc: add the missed clk_disable_unprepare
igb: Fix constant media auto sense switching when no cable is connected
e1000: fix memory leaks
pinctrl: intel: Avoid potential glitches if pin is in GPIO mode
ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
pinctrl: cherryview: Fix irq_valid_mask calculation
blkcg: make blkcg_print_stat() print stats only for online blkgs
iio: imu: mpu6050: Add support for the ICM 20602 IMU
iio: imu: inv_mpu6050: fix no data on MPU6050
mm/filemap.c: don't initiate writeback if mapping has no dirty pages
cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
usbip: Fix free of unallocated memory in vhci tx
netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 sets
net: prevent load/store tearing on sk->sk_stamp
iio: imu: mpu6050: Fix FIFO layout for ICM20602
vsock/virtio: fix sock refcnt holding during the shutdown
drm/i915: Rename gen7 cmdparser tables
drm/i915: Disable Secure Batches for gen6+
drm/i915: Remove Master tables from cmdparser
drm/i915: Add support for mandatory cmdparsing
drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
drm/i915: Allow parsing of unsized batches
drm/i915: Add gen9 BCS cmdparsing
drm/i915/cmdparser: Use explicit goto for error paths
drm/i915/cmdparser: Add support for backward jumps
drm/i915/cmdparser: Ignore Length operands during command matching
drm/i915: Lower RM timeout to avoid DSI hard hangs
drm/i915/gen8+: Add RC6 CTX corruption WA
drm/i915/cmdparser: Fix jump whitelist clearing
KVM: x86: use Intel speculation bugs and features as derived in generic x86 code
x86/msr: Add the IA32_TSX_CTRL MSR
x86/cpu: Add a helper function x86_read_arch_cap_msr()
x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
x86/speculation/taa: Add mitigation for TSX Async Abort
x86/speculation/taa: Add sysfs reporting for TSX Async Abort
kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
x86/tsx: Add "auto" option to the tsx= cmdline parameter
x86/speculation/taa: Add documentation for TSX Async Abort
x86/tsx: Add config options to set tsx=on|off|auto
x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
x86/bugs: Add ITLB_MULTIHIT bug infrastructure
x86/cpu: Add Tremont to the cpu vulnerability whitelist
cpu/speculation: Uninline and export CPU mitigations helpers
Documentation: Add ITLB_MULTIHIT documentation
kvm: x86, powerpc: do not allow clearing largepages debugfs entry
kvm: Convert kvm_lock to a mutex
kvm: mmu: Do not release the page inside mmu_set_spte()
KVM: x86: make FNAME(fetch) and __direct_map more similar
KVM: x86: remove now unneeded hugepage gfn adjustment
KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON
KVM: x86: add tracepoints around __direct_map and FNAME(fetch)
KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active
kvm: mmu: ITLB_MULTIHIT mitigation
kvm: Add helper function for creating VM worker threads
kvm: x86: mmu: Recovery of shattered NX large pages
Linux 4.19.84
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7a820f00c4b868ed677bb49613f835b7e67a3a06
commit abaed0112c1db08be15a784a2c5c8a8b3063cdd3 upstream.
/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation. It is not very useful for any
other use so normal users really do not need to read this file.
Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that a)
interferes with the page allocator and its users and b) can lead to hard
lockups on large machines which have very long free_list.
Reduce both issues by simply not exporting the file to regular users.
Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Waiman Long <longman@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jann Horn <jannh@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlzEBokACgkQONu9yGCS
aT7G7w/8C93URGM67H7ynkCHTo8y3hkRE2rUJPckJNdS+IJKuecmOphak4tF0h07
qPWDPya70Q1S0cNu661TuVAGrhmE5jBx8/xfZaAOeaaU0xtZive+TfSHdAQQaHct
tDk32O85N1aZ49rDEz9ibr7CGLVFDZtyhxV5gFMYQpjbqA7MzJC61zQg1jHyPSCz
sKjQzW+uXMuSLru8jXHMvp41K5sFFp5gYdQbAVKlWtt79qPxWdxZPJbLbM0LBbtz
XHt9E45Ink3ALF9P6tZ4e6gi4zzlNbh9yR92+X5NK5/8AP57yWba4W9JHWIfMBpC
yyDYTOEAzdxqa2Jrgwr4WTdKH6U7FbQZFmWfTBB4VotbHLBWkVXj0OnF10qxP9eQ
p5wGDTJAlWezhX1BTCfYroglDsvqhj+gHfwHzDRF1Del1dRgydRMQc0qLD1d9tul
ovzwOkx1xyJrM2wq05I5gc0FoVyOL6/KCwqMrpVfKa3WKY7Uttjgf56bMqdIIkns
i/6opzF+wtvwlLlCoXgYPXdm6kbWdgvS+skVHfWcHmZFMuGrFGGzJNwzXb7qnVjK
T0hD1OestsfTyD/amnDNYkNeCkoOZqtHAi+xYOQR4kGY5cxP1lQJf85MgAy6RZSY
h+rjys76Qf6+hTCtrowLr8SgksX4ACWxm+UarfAiiNnnDXwGfu8=
=SrFV
-----END PGP SIGNATURE-----
Merge 4.19.37 into android-4.19
Changes in 4.19.37
bonding: fix event handling for stacked bonds
failover: allow name change on IFF_UP slave interfaces
net: atm: Fix potential Spectre v1 vulnerabilities
net: bridge: fix per-port af_packet sockets
net: bridge: multicast: use rcu to access port list from br_multicast_start_querier
net: Fix missing meta data in skb with vlan packet
net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recv
tcp: tcp_grow_window() needs to respect tcp_space()
team: set slave to promisc if team is already in promisc mode
tipc: missing entries in name table of publications
vhost: reject zero size iova range
ipv4: recompile ip options in ipv4_link_failure
ipv4: ensure rcu_read_lock() in ipv4_link_failure()
net: thunderx: raise XDP MTU to 1508
net: thunderx: don't allow jumbo frames with XDP
net/mlx5: FPGA, tls, hold rcu read lock a bit longer
net/tls: prevent bad memory access in tls_is_sk_tx_device_offloaded()
net/mlx5: FPGA, tls, idr remove on flow delete
route: Avoid crash from dereferencing NULL rt->from
sch_cake: Use tc_skb_protocol() helper for getting packet protocol
sch_cake: Make sure we can write the IP header before changing DSCP bits
nfp: flower: replace CFI with vlan present
nfp: flower: remove vlan CFI bit from push vlan action
sch_cake: Simplify logic in cake_select_tin()
net: IP defrag: encapsulate rbtree defrag code into callable functions
net: IP6 defrag: use rbtrees for IPv6 defrag
net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c
CIFS: keep FileInfo handle live during oplock break
cifs: Fix use-after-free in SMB2_write
cifs: Fix use-after-free in SMB2_read
cifs: fix handle leak in smb2_query_symlink()
KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU
KVM: x86: svm: make sure NMI is injected after nmi_singlestep
Staging: iio: meter: fixed typo
staging: iio: ad7192: Fix ad7193 channel address
iio: gyro: mpu3050: fix chip ID reading
iio/gyro/bmg160: Use millidegrees for temperature scale
iio:chemical:bme680: Fix, report temperature in millidegrees
iio:chemical:bme680: Fix SPI read interface
iio: cros_ec: Fix the maths for gyro scale calculation
iio: ad_sigma_delta: select channel when reading register
iio: dac: mcp4725: add missing powerdown bits in store eeprom
iio: Fix scan mask selection
iio: adc: at91: disable adc channel interrupt in timeout case
iio: core: fix a possible circular locking dependency
io: accel: kxcjk1013: restore the range after resume.
staging: most: core: use device description as name
staging: comedi: vmk80xx: Fix use of uninitialized semaphore
staging: comedi: vmk80xx: Fix possible double-free of ->usb_rx_buf
staging: comedi: ni_usb6501: Fix use of uninitialized mutex
staging: comedi: ni_usb6501: Fix possible double-free of ->usb_rx_buf
ALSA: hda/realtek - add two more pin configuration sets to quirk table
ALSA: core: Fix card races between register and disconnect
Input: elan_i2c - add hardware ID for multiple Lenovo laptops
serial: sh-sci: Fix HSCIF RX sampling point adjustment
serial: sh-sci: Fix HSCIF RX sampling point calculation
vt: fix cursor when clearing the screen
scsi: core: set result when the command cannot be dispatched
Revert "scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO"
Revert "svm: Fix AVIC incomplete IPI emulation"
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier
crypto: x86/poly1305 - fix overflow during partial reduction
drm/ttm: fix out-of-bounds read in ttm_put_pages() v2
arm64: futex: Restore oldval initialization to work around buggy compilers
x86/kprobes: Verify stack frame on kretprobe
kprobes: Mark ftrace mcount handler functions nokprobe
kprobes: Fix error check when reusing optimized probes
rt2x00: do not increment sequence number while re-transmitting
mac80211: do not call driver wake_tx_queue op during reconfig
drm/amdgpu/gmc9: fix VM_L2_CNTL3 programming
perf/x86/amd: Add event map for AMD Family 17h
x86/cpu/bugs: Use __initconst for 'const' init data
perf/x86: Fix incorrect PEBS_REGS
x86/speculation: Prevent deadlock on ssb_state::lock
timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()
nfit/ars: Remove ars_start_flags
nfit/ars: Introduce scrub_flags
nfit/ars: Allow root to busy-poll the ARS state machine
nfit/ars: Avoid stale ARS results
mmc: sdhci: Fix data command CRC error handling
mmc: sdhci: Rename SDHCI_ACMD12_ERR and SDHCI_INT_ACMD12ERR
mmc: sdhci: Handle auto-command errors
modpost: file2alias: go back to simple devtable lookup
modpost: file2alias: check prototype of handler
tpm/tpm_i2c_atmel: Return -E2BIG when the transfer is incomplete
tpm: Fix the type of the return value in calc_tpm2_event_size()
Revert "kbuild: use -Oz instead of -Os when using clang"
sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
device_cgroup: fix RCU imbalance in error case
mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
ALSA: info: Fix racy addition/deletion of nodes
percpu: stop printing kernel addresses
tools include: Adopt linux/bits.h
ASoC: rockchip: add missing INTERLEAVED PCM attribute
i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array
Revert "locking/lockdep: Add debug_locks check in __lock_downgrade()"
kernel/sysctl.c: fix out-of-bounds access when setting file-max
Linux 4.19.37
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e8277b3b52240ec1caad8e6df278863e4bf42eac upstream.
Commit 58bc4c34d2 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
depends on skipping vmstat entries with empty name introduced in
7aaf772723 ("mm: don't show nr_indirectly_reclaimable in
/proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
semantics of nr_indirectly_reclaimable_bytes").
So skipping no longer works and /proc/vmstat has misformatted lines " 0".
This patch simply shows debug counters "nr_tlb_remote_*" for UP.
Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
Fixes: 58bc4c34d2 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <guro@fb.com>
Cc: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAly6xzUACgkQONu9yGCS
aT5sIA//b7nAk2zuhmbkonsBfzFq5uBJmqXcCrOgy3XHMs4fE+Q11kLd1wMAV7dx
U7FNHe4PIJ8Rczxgqr2VP3VmFbV6UuTK+UTclJKfbV3ouIAQiQBuutABBmbDUj2p
FInc/yAYyhVc9n7gX78czTiUxKnKi4+sisUYDCZPr3hr6jDPcLvm/WVWdyrcXJje
rYFNmE/2MBH1NofG+MOpq+ILhKHXlf2APN2/spl+I42a8bwodiSl9g+dhuWr7wgT
Ln2Ocf7BZ6BPCQKoveZdD1Gd56NNR/lJh4ulqpuhaZw4Yp+B/C7GmrBtdPzVSGka
IwPWoSc9/9VSUl+ooSZHms78VLbqq0rNNclskL2bN6m962u04Eu7sB2Tg/bwUs52
Wkcw0DY4J/oMJtj/CMHcQOUPsk6vwHxqnjsj+LYJ1ZjHO68tUshnENxXrbAoDc45
2fuY3TCA+XqFvqNt5HbkLPtFR78u8QmZ1lP/Pkri6xoG/GA6O0EAxhS0Z9hncGK7
8wJNuxLMd2UX94wlajQ+DF7yyCU4HOFEdeSEOwlHHBid/fckXsGzL2tKJUAbbUPP
ux3An8kJHni8nQrmUkyy1Nx29ROyAFxBLOQshWGpXgJrV3qRMYLyB2Icv0WYCGFk
zZCTupPgvb46u81VzqxrLH4RZdy4Ar4uB3BQGPKs596rlYmvnSo=
=CArs
-----END PGP SIGNATURE-----
Merge 4.19.36 into android-4.19
Changes in 4.19.36
ARC: u-boot args: check that magic number is correct
arc: hsdk_defconfig: Enable CONFIG_BLK_DEV_RAM
inotify: Fix fsnotify_mark refcount leak in inotify_update_existing_watch()
perf/core: Restore mmap record type correctly
ext4: avoid panic during forced reboot
ext4: add missing brelse() in add_new_gdb_meta_bg()
ext4: report real fs size after failed resize
ALSA: echoaudio: add a check for ioremap_nocache
ALSA: sb8: add a check for request_region
auxdisplay: hd44780: Fix memory leak on ->remove()
drm/udl: use drm_gem_object_put_unlocked.
IB/mlx4: Fix race condition between catas error reset and aliasguid flows
i40iw: Avoid panic when handling the inetdev event
mmc: davinci: remove extraneous __init annotation
ALSA: opl3: fix mismatch between snd_opl3_drum_switch definition and declaration
thermal/intel_powerclamp: fix __percpu declaration of worker_data
thermal: samsung: Fix incorrect check after code merge
thermal: bcm2835: Fix crash in bcm2835_thermal_debugfs
thermal/int340x_thermal: Add additional UUIDs
thermal/int340x_thermal: fix mode setting
thermal/intel_powerclamp: fix truncated kthread name
scsi: iscsi: flush running unbind operations when removing a session
sched/cpufreq: Fix 32-bit math overflow
sched/core: Fix buffer overflow in cgroup2 property cpu.max
x86/mm: Don't leak kernel addresses
tools/power turbostat: return the exit status of a command
perf list: Don't forget to drop the reference to the allocated thread_map
perf config: Fix an error in the config template documentation
perf config: Fix a memory leak in collect_config()
perf build-id: Fix memory leak in print_sdt_events()
perf top: Fix error handling in cmd_top()
perf hist: Add missing map__put() in error case
perf evsel: Free evsel->counts in perf_evsel__exit()
perf tests: Fix a memory leak of cpu_map object in the openat_syscall_event_on_all_cpus test
perf tests: Fix memory leak by expr__find_other() in test__expr()
perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
ACPI / utils: Drop reference in test for device presence
PM / Domains: Avoid a potential deadlock
blk-iolatency: #include "blk.h"
drm/exynos/mixer: fix MIXER shadow registry synchronisation code
irqchip/stm32: Don't clear rising/falling config registers at init
irqchip/mbigen: Don't clear eventid when freeing an MSI
x86/hpet: Prevent potential NULL pointer dereference
x86/hyperv: Prevent potential NULL pointer dereference
x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
drm/nouveau/debugfs: Fix check of pm_runtime_get_sync failure
iommu/vt-d: Check capability before disabling protected memory
x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
fix incorrect error code mapping for OBJECTID_NOT_FOUND
x86/gart: Exclude GART aperture from kcore
ext4: prohibit fstrim in norecovery mode
drm/cirrus: Use drm_framebuffer_put to avoid kernel oops in clean-up
gpio: pxa: handle corner case of unprobed device
rsi: improve kernel thread handling to fix kernel panic
f2fs: fix to avoid NULL pointer dereference on se->discard_map
9p: do not trust pdu content for stat item size
9p locks: add mount option for lock retry interval
ASoC: Fix UBSAN warning at snd_soc_get/put_volsw_sx()
f2fs: fix to do sanity check with current segment number
netfilter: xt_cgroup: shrink size of v2 path
serial: uartps: console_setup() can't be placed to init section
powerpc/pseries: Remove prrn_work workqueue
media: au0828: cannot kfree dev before usb disconnect
Bluetooth: Fix debugfs NULL pointer dereference
HID: i2c-hid: override HID descriptors for certain devices
pinctrl: core: make sure strcmp() doesn't get a null parameter
ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms
usbip: fix vhci_hcd controller counting
ACPI / SBS: Fix GPE storm on recent MacBookPro's
HID: usbhid: Add quirk for Redragon/Dragonrise Seymur 2
KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
compiler.h: update definition of unreachable()
netfilter: nf_flow_table: remove flowtable hook flush routine in netns exit routine
f2fs: cleanup dirty pages if recover failed
net: stmmac: Set OWN bit for jumbo frames
cifs: fallback to older infolevels on findfirst queryinfo retry
kernel: hung_task.c: disable on suspend
platform/x86: Add Intel AtomISP2 dummy / power-management driver
drm/ttm: Fix bo_global and mem_global kfree error
ALSA: hda: fix front speakers on Huawei MBXP
ACPI: EC / PM: Disable non-wakeup GPEs for suspend-to-idle
net/rds: fix warn in rds_message_alloc_sgs
xfrm: destroy xfrm_state synchronously on net exit path
crypto: sha256/arm - fix crash bug in Thumb2 build
crypto: sha512/arm - fix crash bug in Thumb2 build
net: ip6_gre: fix possible NULL pointer dereference in ip6erspan_set_version
iommu/dmar: Fix buffer overflow during PCI bus notification
scsi: core: Avoid that system resume triggers a kernel warning
soc/tegra: pmc: Drop locking from tegra_powergate_is_powered()
lkdtm: Print real addresses
lkdtm: Add tests for NULL pointer dereference
drm/panel: panel-innolux: set display off in innolux_panel_unprepare
crypto: axis - fix for recursive locking from bottom half
Revert "ACPI / EC: Remove old CLEAR_ON_RESUME quirk"
coresight: cpu-debug: Support for CA73 CPUs
PCI: Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports
drm/nouveau/volt/gf117: fix speedo readout register
ARM: 8839/1: kprobe: make patch_lock a raw_spinlock_t
drm/amdkfd: use init_mqd function to allocate object for hid_mqd (CI)
appletalk: Fix use-after-free in atalk_proc_exit
lib/div64.c: off by one in shift
rxrpc: Fix client call connect/disconnect race
f2fs: fix to dirty inode for i_mode recovery
include/linux/swap.h: use offsetof() instead of custom __swapoffset macro
bpf: fix use after free in bpf_evict_inode
IB/hfi1: Failed to drain send queue when QP is put into error state
mm: hide incomplete nr_indirectly_reclaimable in /proc/zoneinfo
mm: hide incomplete nr_indirectly_reclaimable in sysfs
appletalk: Fix compile regression
Linux 4.19.36
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[fixed differently upstream, this is a work-around to resolve it for 4.19.y]
Yongqin reported that /proc/zoneinfo format is broken in 4.14
due to commit 7aaf772723 ("mm: don't show nr_indirectly_reclaimable
in /proc/vmstat")
Node 0, zone DMA
per-node stats
nr_inactive_anon 403
nr_active_anon 89123
nr_inactive_file 128887
nr_active_file 47377
nr_unevictable 2053
nr_slab_reclaimable 7510
nr_slab_unreclaimable 10775
nr_isolated_anon 0
nr_isolated_file 0
<...>
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 6022
nr_written 5985
74240
^^^^^^^^^^
pages free 131656
The problem is caused by the nr_indirectly_reclaimable counter,
which is hidden from the /proc/vmstat, but not from the
/proc/zoneinfo. Let's fix this inconsistency and hide the
counter from /proc/zoneinfo exactly as from /proc/vmstat.
BTW, in 4.19+ the counter has been renamed and exported by
the commit b29940c1abd7 ("mm: rename and change semantics of
nr_indirectly_reclaimable_bytes"), so there is no such a problem
anymore.
Cc: <stable@vger.kernel.org> # 4.14.x-4.18.x
Fixes: 7aaf772723 ("mm: don't show nr_indirectly_reclaimable in /proc/vmstat")
Reported-by: Yongqin Liu <yongqin.liu@linaro.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Refaults happen during transitions between workingsets as well as in-place
thrashing. Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out
established active cache. When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.
Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime. This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8508cf3ffad4defa202b303e5b6379efc4cd9054)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I71df060dce5590a3c654f9a0e8e54deeb74b64c2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
5dd0b16cda ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
the kernel unconditional to reduce #ifdef soup, but (either to avoid
showing dummy zero counters to userspace, or because that code was missed)
didn't update the vmstat_array, meaning that all following counters would
be shown with incorrect values.
This only affects kernel builds with
CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.
Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
Fixes: 5dd0b16cda ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7a9cdebdcc ("mm: get rid of vmacache_flush_all() entirely") removed the
VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
entry in vmstat_text. This causes an out-of-bounds access in
vmstat_show().
Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
is probably very rare.
Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
Fixes: 7a9cdebdcc ("mm: get rid of vmacache_flush_all() entirely")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Revert commit c7f26ccfb2 ("mm/vmstat.c: fix vmstat_update() preemption
BUG"). Steven saw a "using smp_processor_id() in preemptible" message
and added a preempt_disable() section around it to keep it quiet. This
is not the right thing to do it does not fix the real problem.
vmstat_update() is invoked by a kworker on a specific CPU. This worker
it bound to this CPU. The name of the worker was "kworker/1:1" so it
should have been a worker which was bound to CPU1. A worker which can
run on any CPU would have a `u' before the first digit.
smp_processor_id() can be used in a preempt-enabled region as long as
the task is bound to a single CPU which is the case here. If it could
run on an arbitrary CPU then this is the problem we have an should seek
to resolve.
Not only this smp_processor_id() must not be migrated to another CPU but
also refresh_cpu_vm_stats() which might access wrong per-CPU variables.
Not to mention that other code relies on the fact that such a worker
runs on one specific CPU only.
Therefore revert that commit and we should look instead what broke the
affinity mask of the kworker.
Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven J. Hill <steven.hill@cavium.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
no need to export this vm counter to userspace, and some changes are
expected in reclaimable object accounting, which can alter this counter.
Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "indirectly reclaimable memory", v2.
This patchset introduces the concept of indirectly reclaimable memory
and applies it to fix the issue of when a big number of dentries with
external names can significantly affect the MemAvailable value.
This patch (of 3):
Introduce a concept of indirectly reclaimable memory and adds the
corresponding memory counter and /proc/vmstat item.
Indirectly reclaimable memory is any sort of memory, used by the kernel
(except of reclaimable slabs), which is actually reclaimable, i.e. will
be released under memory pressure.
The counter is in bytes, as it's not always possible to count such
objects in pages. The name contains BYTES by analogy to
NR_KERNEL_STACK_KB.
Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
cause vmstat_update() to report a BUG due to preemption not being
disabled around smp_processor_id().
Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.
BUG: using smp_processor_id() in preemptible [00000000] code:
kworker/1:1/269
caller is vmstat_update+0x50/0xa0
CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
show_stack+0x94/0x128
dump_stack+0xa4/0xe0
check_preemption_disabled+0x118/0x120
vmstat_update+0x50/0xa0
process_one_work+0x144/0x348
worker_thread+0x150/0x4b8
kthread+0x110/0x140
ret_from_kernel_thread+0x14/0x1c
Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
Signed-off-by: Steven J. Hill <steven.hill@cavium.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the second step which introduces a tunable interface that allow
numa stats configurable for optimizing zone_statistics(), as suggested
by Dave Hansen and Ying Huang.
=========================================================================
When page allocation performance becomes a bottleneck and you can
tolerate some possible tool breakage and decreased numa counter
precision, you can do:
echo 0 > /proc/sys/vm/numa_stat
In this case, numa counter update is ignored. We can see about
*4.8%*(185->176) drop of cpu cycles per single page allocation and
reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315)
drop of cpu cycles per single page allocation and reclaim on Jesper's
page_bench03 (88 threads) running on a 2-Socket Broadwell-based server
(88 threads, 126G memory).
Benchmark link provided by Jesper D Brouer (increase loop times to
10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
=========================================================================
When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:
echo 1 > /proc/sys/vm/numa_stat
This is system default setting.
Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
for comments to help improve the original patch.
[keescook@chromium.org: make sure mutex is a global static]
Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 59dc76b0d4 ("mm: vmscan: reduce size of inactive file
list") 'pgdat->inactive_ratio' is not used, except for printing
"node_inactive_ratio: 0" in /proc/zoneinfo output.
Remove it.
Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To avoid deviation, the per cpu number of NUMA stats in
vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.
Since NUMA stats does not be read by users frequently, and kernel does not
need it to make a decision, it will not be a problem to make the readers
more expensive.
Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Ying Huang <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is significant overhead in cache bouncing caused by zone counters
(NUMA associated counters) update in parallel in multi-threaded page
allocation (suggested by Dave Hansen).
This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2,
as a small threshold greatly increases the update frequency of the global
counter from local per cpu counter(suggested by Ying Huang).
The rationality is that these statistics counters don't affect the
kernel's decision, unlike other VM counters, so it's not a problem to use
a large threshold.
With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per
single page allocation and reclaim on Jesper's page_bench03 benchmark.
Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
bench
Threshold CPU cycles Throughput(88 threads)
32 799 241760478
64 640 301628829
125 537 358906028 <==> system by default (base)
256 468 412397590
512 428 450550704
4096 399 482520943
20000 394 489009617
30000 395 488017817
65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset
N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Separate NUMA statistics from zone statistics", v2.
Each page allocation updates a set of per-zone statistics with a call to
zone_statistics(). As discussed in 2017 MM summit, these are a
substantial source of overhead in the page allocator and are very rarely
consumed. This significant overhead in cache bouncing caused by zone
counters (NUMA associated counters) update in parallel in multi-threaded
page allocation (pointed out by Dave Hansen).
A link to the MM summit slides:
http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
To mitigate this overhead, this patchset separates NUMA statistics from
zone statistics framework, and update NUMA counter threshold to a fixed
size of MAX_U16 - 2, as a small threshold greatly increases the update
frequency of the global counter from local per cpu counter (suggested by
Ying Huang). The rationality is that these statistics counters don't
need to be read often, unlike other VM counters, so it's not a problem
to use a large threshold and make readers more expensive.
With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
below) for per single page allocation and reclaim on Jesper's
page_bench03 benchmark. Meanwhile, this patchset keeps the same style
of virtual memory statistics with little end-user-visible effects (only
move the numa stats to show behind zone page stats, see the first patch
for details).
I did an experiment of single page allocation and reclaim concurrently
using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
server (88 processors with 126G memory) with different size of threshold
of pcp counter.
Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
Threshold CPU cycles Throughput(88 threads)
32 799 241760478
64 640 301628829
125 537 358906028 <==> system by default
256 468 412397590
512 428 450550704
4096 399 482520943
20000 394 489009617
30000 395 488017817
65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset
N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
This patch (of 3):
In this patch, NUMA statistics is separated from zone statistics
framework, all the call sites of NUMA stats are changed to use
numa-stats-specific functions, it does not have any functionality change
except that the number of NUMA stats is shown behind zone page stats
when users *read* the zone info.
E.g. cat /proc/zoneinfo
***Base*** ***With this patch***
nr_free_pages 3976 nr_free_pages 3976
nr_zone_inactive_anon 0 nr_zone_inactive_anon 0
nr_zone_active_anon 0 nr_zone_active_anon 0
nr_zone_inactive_file 0 nr_zone_inactive_file 0
nr_zone_active_file 0 nr_zone_active_file 0
nr_zone_unevictable 0 nr_zone_unevictable 0
nr_zone_write_pending 0 nr_zone_write_pending 0
nr_mlock 0 nr_mlock 0
nr_page_table_pages 0 nr_page_table_pages 0
nr_kernel_stack 0 nr_kernel_stack 0
nr_bounce 0 nr_bounce 0
nr_zspages 0 nr_zspages 0
numa_hit 0 *nr_free_cma 0*
numa_miss 0 numa_hit 0
numa_foreign 0 numa_miss 0
numa_interleave 0 numa_foreign 0
numa_local 0 numa_interleave 0
numa_other 0 numa_local 0
*nr_free_cma 0* numa_other 0
... ...
vm stats threshold: 10 vm stats threshold: 10
... ...
The next patch updates the numa stats counter size and threshold.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Ying Huang <ying.huang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, swap: VMA based swap readahead", v4.
The swap readahead is an important mechanism to reduce the swap in
latency. Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.
In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory space. And the different tasks in the system may have different
access patterns, which makes the global space locality estimation
incorrect.
In this patchset, when page fault occurs, the virtual pages near the
fault address will be readahead instead of the swap slots near the fault
swap slot in swap device. This avoid to readahead the unrelated swap
slots. At the same time, the swap readahead is changed to work on
per-VMA from globally. So that the different access patterns of the
different VMAs could be distinguished, and the different readahead
policy could be applied accordingly. The original core readahead
detection and scaling algorithm is reused, because it is an effect
algorithm to detect the space locality.
In addition to the swap readahead changes, some new sysfs interface is
added to show the efficiency of the readahead algorithm and some other
swap statistics.
This new implementation will incur more small random read, on SSD, the
improved correctness of estimation and readahead target should beat the
potential increased overhead, this is also illustrated in the test
results below. But on HDD, the overhead may beat the benefit, so the
original implementation will be used by default.
The test and result is as follow,
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds. The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.
At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds. This will trigger random swap-in
in the background.
This is a combined workload with sequential and random memory accessing
at the same time. The result (for sequential workload) is as follow,
Base Optimized
---- ---------
throughput 345413 KB/s 414029 KB/s (+19.9%)
latency.average 97.14 us 61.06 us (-37.1%)
latency.50th 2 us 1 us
latency.60th 2 us 1 us
latency.70th 98 us 2 us
latency.80th 160 us 2 us
latency.90th 260 us 217 us
latency.95th 346 us 369 us
latency.99th 1.34 ms 1.09 ms
ra_hit% 52.69% 99.98%
The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower. The VMA-base
readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
Base Optimized
---- ---------
elapsed_time 393.49 s 329.88 s (-16.2%)
ra_hit% 86.21% 98.82%
The score of base and optimized kernel hasn't visible changes. But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages. And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.
This patch (of 5):
The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interface.
/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total
With them, the efficiency of the swap readahead could be measured, so
that the swap readahead algorithm and parameters could be tuned
accordingly.
[akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
pagetypeinfo_show_free()'s comment. This commit fixes it.
Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
Fixes: 467c996c1e ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When order is -1 or too big, *1UL << order* will be 0, which will cause
a divide error. Although it seems that all callers of
__fragmentation_index() will only do so with a valid order, the patch
can make it more robust.
Should prevent reoccurrences of
https://bugzilla.kernel.org/show_bug.cgi?id=196555
Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests. We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:
$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
2 NR_BOUNCE
2 NR_FREE_CMA_PAGES
11 NR_FREE_PAGES
1 NR_KERNEL_STACK_KB
1 NR_MLOCK
2 NR_PAGETABLE
This patch shouldn't introduce any functional change.
[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When swapping out THP (Transparent Huge Page), instead of swapping out
the THP as a whole, sometimes we have to fallback to split the THP into
normal pages before swapping, because no free swap clusters are
available, or cgroup limit is exceeded, etc. To count the number of the
fallback, a new VM event THP_SWPOUT_FALLBACK is added, and counted when
we fallback to split the THP.
Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support delay splitting THP (Transparent Huge Page) after swapped
out, we need to enhance swap writing code to support to write a THP as a
whole. This will improve swap write IO performance.
As Ming Lei <ming.lei@redhat.com> pointed out, this should be based on
multipage bvec support, which hasn't been merged yet. So this patch is
only for testing the functionality of the other patches in the series.
And will be reimplemented after multipage bvec support is merged.
Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Shaohua Li <shli@kernel.org>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pagetypeinfo_showmixedcount_print is found to take a lot of time to
complete and it does this holding the zone lock and disabling
interrupts. In some cases it is found to take more than a second (On a
2.4GHz,8Gb RAM,arm64 cpu).
Avoid taking the zone lock similar to what is done by read_page_owner,
which means possibility of inaccurate results.
Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: zhongjiang <zhongjiang@huawei.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: per-lruvec slab stats"
Josef is working on a new approach to balancing slab caches and the page
cache. For this to work, he needs slab cache statistics on the lruvec
level. These patches implement that by adding infrastructure that
allows updating and reading generic VM stat items per lruvec, then
switches some existing VM accounting sites, including the slab
accounting ones, to this new cgroup-aware API.
I'll follow up with more patches on this, because there is actually
substantial simplification that can be done to the memory controller
when we replace private memcg accounting with making the existing VM
accounting sites cgroup-aware. But this is enough for Josef to base his
slab reclaim work on, so here goes.
This patch (of 5):
To re-implement slab cache vs. page cache balancing, we'll need the
slab counters at the lruvec level, which, ever since lru reclaim was
moved from the zone to the node, is the intersection of the node, not
the zone, and the memcg.
We could retain the per-zone counters for when the page allocator dumps
its memory information on failures, and have counters on both levels -
which on all but NUMA node 0 is usually redundant. But let's keep it
simple for now and just move them. If anybody complains we can restore
the per-zone counters.
[hannes@cmpxchg.org: fix oops]
Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Show count of oom killer invocations in /proc/vmstat and count of
processes killed in memory cgroup in knob "memory.events" (in
memory.oom_control for v1 cgroup).
Also describe difference between "oom" and "oom_kill" in memory cgroup
documentation. Currently oom in memory cgroup kills tasks iff shortage
has happened inside page fault.
These counters helps in monitoring oom kills - for now the only way is
grepping for magic words in kernel log.
[akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
[akpm@linux-foundation.org: fix comment, per Konstantin]
Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Guschin <guroan@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pagetypeinfo_showblockcount_print skips over invalid pfns but it would
report pages which are offline because those have a valid pfn. Their
migrate type is misleading at best.
Now that we have pfn_to_online_page() we can use it instead of
pfn_valid() and fix this.
[mhocko@suse.com: fix build]
Link: http://lkml.kernel.org/r/20170519072225.GA13041@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170515085827.16474-11-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Standardize the file operation variable names related to all four memory
management /proc interface files. Also change all the symbol
permissions (S_IRUGO) into octal permissions (0444) as it got complaints
from checkpatch.pl. This does not create any functional change to the
interface.
Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit e2ecc8a79e ("mm, vmstat: print non-populated zones in
zoneinfo"), /proc/zoneinfo will show unpopulated zones.
A memoryless node, having no populated zones at all, was previously
ignored, but will now trigger the WARN() in is_zone_first_populated().
Remove this warning, as its only purpose was to warn of a situation that
has since been enabled.
Aside: The "per-node stats" are still printed under the first populated
zone, but that's not necessarily the first stanza any more. I'm not
sure which criteria is more important with regard to not breaking
parsers, but it looks a little weird to the eye.
Fixes: e2ecc8a79e ("mm, vmstat: print node-based stats in zoneinfo file")
Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After "mm, vmstat: print non-populated zones in zoneinfo",
/proc/zoneinfo will show unpopulated zones.
The per-cpu pageset statistics are not relevant for unpopulated zones
and can be potentially lengthy, so supress them when they are not
interesting.
Also moves lowmem reserve protection information above pcp stats since
it is relevant for all zones per vm.lowmem_reserve_ratio.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Initscripts can use the information (protection levels) from
/proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.
vm.lowmem_reserve_ratio is an array of ratios for each configured zone
on the system. If a zone is not populated on an arch, /proc/zoneinfo
suppresses its output.
This results in there not being a 1:1 mapping between the set of zones
emitted by /proc/zoneinfo and the zones configured by
vm.lowmem_reserve_ratio.
This patch shows statistics for non-populated zones in /proc/zoneinfo.
The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
Without this patch, it is not possible to determine which index in the
array controls which zone if one or more zones on the system are not
populated.
Remaining users of walk_zones_in_node() are unchanged. Files such as
/proc/pagetypeinfo require certain zone data to be initialized properly
for display, which is not done for unpopulated zones.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
madv()'s MADV_FREE indicate pages are 'lazyfree'. They are still
anonymous pages, but they can be freed without pageout. To distinguish
these from normal anonymous pages, we clear their SwapBacked flag.
MADV_FREE pages could be freed without pageout, so they pretty much like
used once file pages. For such pages, we'd like to reclaim them once
there is memory pressure. Also it might be unfair reclaiming MADV_FREE
pages always before used once file pages and we definitively want to
reclaim the pages before other anonymous and file pages.
To speed up MADV_FREE pages reclaim, we put the pages into
LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
nowadays and should be full of used once file pages. Reclaiming
MADV_FREE pages will not have much interfere of anonymous and active
file pages. And the inactive file pages and MADV_FREE pages will be
reclaimed according to their age, so we don't reclaim too many MADV_FREE
pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
means we can reclaim the pages without swap support. This idea is
suggested by Johannes.
This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
avoid bisect failure, next patch will do it.
The patch is based on Minchan's original patch.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_PAGES_SCANNED counts number of pages scanned since the last page free
event in the allocator. This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them. In that role, it has been replaced in the preceding
patches by a different mechanism.
Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well. It's however unlikely that anyone
outside the kernel is using this counter in any meaningful way.
Remove the counter and the unused pgdat_reclaimable().
Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".
Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.
The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.
This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.
Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.
Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.
If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.
This patch (of 9):
Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:
$ echo 4000 >/proc/sys/vm/nr_hugepages
top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3
At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.
Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de618d ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.
However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.
Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geert has reported a freeze during PM resume and some additional
debugging has shown that the device_resume worker cannot make a forward
progress because it waits for an event which is stuck waiting in
drain_all_pages:
INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:0 D 0 5 2 0x00000000
Workqueue: events_unbound async_run_entry_fn
__schedule
schedule
schedule_timeout
wait_for_common
dpm_wait_for_superior
device_resume
async_resume
async_run_entry_fn
process_one_work
worker_thread
kthread
[...]
bash D 0 1703 1694 0x00000000
__schedule
schedule
schedule_timeout
wait_for_common
flush_work
drain_all_pages
start_isolate_page_range
alloc_contig_range
cma_alloc
__alloc_from_contiguous
cma_allocator_alloc
__dma_alloc
arm_dma_alloc
sh_eth_ring_init
sh_eth_open
sh_eth_resume
dpm_run_callback
device_resume
dpm_resume
dpm_resume_end
suspend_devices_and_enter
pm_suspend
state_store
kernfs_fop_write
__vfs_write
vfs_write
SyS_write
[...]
Showing busy workqueues and worker pools:
[...]
workqueue mm_percpu_wq: flags=0xc
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
delayed: drain_local_pages_wq, vmstat_update
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
delayed: drain_local_pages_wq BAR(1703), vmstat_update
Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE
so it is frozen this early during resume so we are effectively
deadlocked. Fix this by dropping WQ_FREEZABLE when creating
mm_percpu_wq. We really want to have it operational all the time.
Fixes: ce612879dd ("mm: move pcp and lru-pcp draining into single wq")
Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
per cpu lru caches. This seems more than necessary because both can run
on a single WQ. Both do not block on locks requiring a memory
allocation nor perform any allocations themselves. We will save one
rescuer thread this way.
On the other hand drain_all_pages() queues work on the system wq which
doesn't have rescuer and so this depend on memory allocation (when all
workers are stuck allocating and new ones cannot be created).
Initially we thought this would be more of a theoretical problem but
Hugh Dickins has reported:
: 4.11-rc has been giving me hangs after hours of swapping load. At
: first they looked like memory leaks ("fork: Cannot allocate memory");
: but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
: before looking at /proc/meminfo one time, and the stat_refresh stuck
: in D state, waiting for completion of flush_work like many kworkers.
: kthreadd waiting for completion of flush_work in drain_all_pages().
This worker should be using WQ_RECLAIM as well in order to guarantee a
forward progress. We can reuse the same one as for lru draining and
vmstat.
Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Yang Li <pku.leo@gmail.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Li has reported that drain_all_pages triggers a WARN_ON which means
that this function is called earlier than the mm_percpu_wq is
initialized on arm64 with CMA configured:
WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
Modules linked in:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
Hardware name: Freescale Layerscape 2088A RDB Board (DT)
task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
PC is at drain_all_pages+0x244/0x25c
LR is at start_isolate_page_range+0x14c/0x1f0
[...]
drain_all_pages+0x244/0x25c
start_isolate_page_range+0x14c/0x1f0
alloc_contig_range+0xec/0x354
cma_alloc+0x100/0x1fc
dma_alloc_from_contiguous+0x3c/0x44
atomic_pool_init+0x7c/0x208
arm64_dma_init+0x44/0x4c
do_one_initcall+0x38/0x128
kernel_init_freeable+0x1a0/0x240
kernel_init+0x10/0xfc
ret_from_fork+0x10/0x20
Fix this by moving the whole setup_vmstat which is an initcall right now
to init_mm_internals which will be called right after the WQ subsystem
is initialized.
Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Yang Li <pku.leo@gmail.com>
Tested-by: Yang Li <pku.leo@gmail.com>
Tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We added support for PUD-sized transparent hugepages, however we count
the event "thp split pud" into thp_split_pmd event.
To separate the event count of thp split pud from pmd, add a new event
named thp_split_pud.
Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A "compact_daemon_wake" vmstat exists that represents the number of
times kcompactd has woken up. This doesn't represent how much work it
actually did, though.
It's useful to understand how much compaction work is being done by
kcompactd versus other methods such as direct compaction and explicitly
triggered per-node (or system) compaction.
This adds two new vmstats: "compact_daemon_migrate_scanned" and
"compact_daemon_free_scanned" to represent the number of pages kcompactd
has scanned as part of its migration scanner and freeing scanner,
respectively.
These values are still accounted for in the general
"compact_migrate_scanned" and "compact_free_scanned" for compatibility.
It could be argued that explicitly triggered compaction could also be
tracked separately, and that could be added if others find it useful.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Install the callbacks via the state machine, but do not invoke them as we
can initialize the node state without calling the callbacks on all online
CPUs.
start_shepherd_timer() is now called outside the get_online_cpus() block
which is safe as it only operates on cpu possible mask.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161129145221.ffc3kg3hd7lxiwj6@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Both iterations over online cpus can be replaced by the proper node
specific functions.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161129145113.fn3lw5aazjjvdrr3@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Both functions are called with protection against cpu hotplug already so
*_online_cpus() could be dropped.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161126231350.10321-8-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Allow some seq_puts removals by taking a string instead of a single
char.
[akpm@linux-foundation.org: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Joe Perches <joe@perches.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>