Waiman Long a479960018 locking/rwsem: Optimize down_read_trylock() under highly contended case
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2141431

commit 14c24048841151548a3f4d9e218510c844c1b737
Author: Muchun Song <songmuchun@bytedance.com>
Date:   Thu, 18 Nov 2021 17:44:55 +0800

    locking/rwsem: Optimize down_read_trylock() under highly contended case

    We found that a process with 10 thousand threads regressed when moving
    from Linux-v4.14 to Linux-v5.4. It is a kind of workload in which
    different threads sometimes allocate lots of memory concurrently. In
    this case, down_read_trylock() shows up as a major hotspot. Therefore,
    we suspect that rwsem has had a regression since at least Linux-v5.4.
    To make the problem easy to debug, we wrote a simple benchmark that
    creates a similar situation, as follows.

      ```c++
      #include <sys/mman.h>
      #include <sys/time.h>
      #include <sys/resource.h>
      #include <sched.h>
      #include <pthread.h>

      #include <cstdio>
      #include <cstdlib>
      #include <cassert>
      #include <thread>
      #include <vector>
      #include <chrono>

      volatile int mutex;

      void trigger(int cpu, char* ptr, std::size_t sz)
      {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

            while (mutex);

            for (std::size_t i = 0; i < sz; i += 4096) {
                    *ptr = '\0';
                    ptr += 4096;
            }
      }

      int main(int argc, char* argv[])
      {
            std::size_t sz = 100;

            if (argc > 1)
                    sz = atoi(argv[1]);

            auto nproc = std::thread::hardware_concurrency();
            std::vector<std::thread> thr;
            sz <<= 30;
            auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
                             MAP_PRIVATE, -1, 0);
            assert(ptr != MAP_FAILED);
            char* cptr = static_cast<char*>(ptr);
            auto run = sz / nproc;
            run = (run >> 12) << 12;

            mutex = 1;

            for (auto i = 0U; i < nproc; ++i) {
                    thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
                    cptr += run;
            }

            rusage usage_start;
            getrusage(RUSAGE_SELF, &usage_start);
            auto start = std::chrono::system_clock::now();

            mutex = 0;

            for (auto& t : thr)
                    t.join();

            rusage usage_end;
            getrusage(RUSAGE_SELF, &usage_end);
            auto end = std::chrono::system_clock::now();
            timeval utime;
            timeval stime;
            timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
            timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
            printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
            printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
            printf("real: %lu\n",
                   std::chrono::duration_cast<std::chrono::milliseconds>(end -
                   start).count());

            return 0;
      }
      ```

    The above program simply creates `nproc` threads, each of which
    touches memory (triggering page faults) on a different CPU. `perf top`
    then shows a profile similar to the following.

      25.55%  [kernel]                  [k] down_read_trylock
      14.78%  [kernel]                  [k] handle_mm_fault
      13.45%  [kernel]                  [k] up_read
       8.61%  [kernel]                  [k] clear_page_erms
       3.89%  [kernel]                  [k] __do_page_fault

    The hottest instruction in down_read_trylock(), accounting for about
    92%, is the following cmpxchg.

      91.89 │      lock   cmpxchg %rdx,(%rdi)

    Since the problem was found by migrating from Linux-v4.14 to
    Linux-v5.4, we easily found that commit ddb20d1d3a ("locking/rwsem:
    Optimize down_read_trylock()") caused the regression. The reason is
    that the commit assumes the rwsem is not contended at all. But that is
    not always true for the mmap lock, which can be contended by thousands
    of threads. As a result, most threads need to run "cmpxchg" at least
    twice to acquire the lock. Atomic operations cost more than non-atomic
    instructions, which caused the regression.
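
    The difference between the two strategies can be sketched in user
    space. The following is a simplified model built on std::atomic, not
    the kernel implementation: the names (kWriterBit, kReaderBias, Result,
    trylock_assume_unlocked, trylock_read_first) and the count layout are
    assumptions made for illustration. It only counts how many
    compare-exchange attempts each variant needs.

      ```cpp
      #include <atomic>
      #include <cassert>

      // Simplified stand-ins for the rwsem count layout (assumptions for
      // illustration only; the real kernel layout has more flag bits).
      constexpr long kUnlocked   = 0;
      constexpr long kWriterBit  = 1L << 0;  // set while a writer holds the lock
      constexpr long kReaderBias = 1L << 8;  // added once per active reader

      struct Result {
            bool acquired;
            int  cmpxchg_calls;  // number of compare-exchange attempts made
      };

      // Optimistic variant: guess that the lock is free and go straight to
      // compare-exchange. If readers already hold the lock, the first
      // attempt fails and only serves to fetch the current count.
      Result trylock_assume_unlocked(std::atomic<long>& count)
      {
            int calls = 0;
            long expected = kUnlocked;

            do {
                    ++calls;
                    if (count.compare_exchange_strong(expected,
                                                      expected + kReaderBias,
                                                      std::memory_order_acquire,
                                                      std::memory_order_relaxed))
                            return {true, calls};
                    // On failure, `expected` now holds the observed count.
            } while (!(expected & kWriterBit));

            return {false, calls};
      }

      // Read-first variant: a plain load supplies the expected value, so
      // the first compare-exchange can succeed even with readers present.
      Result trylock_read_first(std::atomic<long>& count)
      {
            int calls = 0;
            long expected = count.load(std::memory_order_relaxed);

            while (!(expected & kWriterBit)) {
                    ++calls;
                    if (count.compare_exchange_strong(expected,
                                                      expected + kReaderBias,
                                                      std::memory_order_acquire,
                                                      std::memory_order_relaxed))
                            return {true, calls};
            }

            return {false, calls};
      }
      ```

    In this model, when one reader already holds the lock and nothing else
    changes the count, trylock_assume_unlocked() needs two compare-exchange
    attempts while trylock_read_first() needs one. On a free lock both
    need exactly one, and when a writer holds the lock the read-first
    variant gives up without issuing any atomic read-modify-write.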

    Using the above benchmark, the real execution time on an x86-64 system
    before and after the patch was:

                      Before Patch  After Patch
       # of Threads    real (ms)     real (ms)  reduced by
       ------------    ---------     ---------  ----------
             1          65,373        65,206       ~0.0%
             4          15,467        15,378       ~0.5%
            40           6,214         5,528      ~11.0%

    For the uncontended case, the new down_read_trylock() performs the same
    as before. For the contended cases, the new down_read_trylock() is
    faster than before. The more contended the lock, the greater the gain.

    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Waiman Long <longman@redhat.com>
    Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com

Signed-off-by: Waiman Long <longman@redhat.com>
2022-11-10 11:38:04 -05:00
Documentation Merge: Update drivers/rtc for known edge platforms 2022-11-10 02:23:22 -05:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
arch Merge: powerpc/pseries: Use lparcfg to reconfig VAS windows for DLPAR CPU 2022-11-10 02:23:24 -05:00
block Merge: block: update with v6.1-rc2 2022-11-03 13:30:02 -04:00
certs certs: Move load_certificate_list() to be with the asymmetric keys code 2022-06-23 11:32:02 +01:00
crypto fs: get rid of the res2 iocb->ki_complete argument 2022-10-27 12:59:04 -04:00
drivers Merge: Update drivers/rtc for known edge platforms 2022-11-10 02:23:22 -05:00
fs Merge: io_uring: update to v5.18 2022-11-10 02:23:20 -05:00
include Merge: Update drivers/rtc for known edge platforms 2022-11-10 02:23:22 -05:00
init mm: Kconfig: reorganize misplaced mm options 2022-10-31 19:33:25 -04:00
ipc mm,hugetlb: remove mlock ulimit for SHM_HUGETLB 2022-10-12 07:27:31 -04:00
kernel locking/rwsem: Optimize down_read_trylock() under highly contended case 2022-11-10 11:38:04 -05:00
lib Merge: memcg: Backport some useful upstream patches 2022-11-09 08:55:31 -05:00
mm Merge: memcg: Backport some useful upstream patches 2022-11-09 08:55:31 -05:00
net Merge: netfilter: 9.2 phase 1 backports 2022-11-09 08:55:32 -05:00
redhat [redhat] kernel-5.14.0-192.el9 2022-11-10 02:24:12 -05:00
samples Merge: VFIO 9.2 backports 2022-10-21 09:47:28 -04:00
scripts scripts: Create objdump-func helper script 2022-10-27 15:26:52 -04:00
security Merge: CNB: net: HW counters for soft devices 2022-11-08 09:08:22 -05:00
sound Merge: Update kernel's PCI subsystem to v6.0 2022-11-02 03:26:51 -04:00
tools Merge: KVM: selftests: replace assertion with warning in access_tracking_perf_test 2022-11-09 08:55:30 -05:00
usr .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
virt KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device() 2022-10-25 13:53:38 +02:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.conf get_maintainer.conf: Update with new location of RHMAINTAINERS 2022-01-19 14:26:16 -05:00
.get_maintainer.ignore
.gitattributes gitattributes: Remove unnecesary export restrictions 2021-08-30 10:50:35 -04:00
.gitignore gitlab: Add CI job for packaging scripts 2021-08-30 10:49:13 -04:00
.gitlab-ci.yml CI: Add automotive-check for rt branches 2022-09-22 17:14:55 +02:00
.mailmap mailmap: remove my redhat.com address from RHEL9's .mailmap file 2022-09-26 09:34:38 -04:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: move Murali Karicheri to credits 2021-04-29 15:47:30 -07:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig Introduce CONFIG_RH_DISABLE_DEPRECATED 2021-08-30 10:50:55 -04:00
Kconfig.redhat Rename RH_DISABLE_DEPRECATED to RHEL_DIFFERENCES 2021-08-30 14:29:36 -04:00
MAINTAINERS Merge: Update drivers/rtc for known edge platforms 2022-11-10 02:23:22 -05:00
Makefile objtool: Add CONFIG_OBJTOOL 2022-10-27 15:26:51 -04:00
Makefile.rhelver [redhat] kernel-5.14.0-192.el9 2022-11-10 02:24:12 -05:00
README
makefile redhat: Change Makefile target names to dist- 2021-08-30 10:50:11 -04:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.