History log of /linux-master/arch/x86/include/asm/tlbflush.h
Revision Date Author Comments
# fca4d413 12-Jun-2023 Rick Edgecombe <rick.p.edgecombe@intel.com>

x86/mm: Introduce _PAGE_SAVED_DIRTY

Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To avoid
inadvertently created shadow stack memory, in places where Linux normally
creates Write=0,Dirty=1, it can use the software-defined _PAGE_SAVED_DIRTY
in place of the hardware _PAGE_DIRTY. In other words, whenever Linux needs
to create Write=0,Dirty=1, it instead creates Write=0,SavedDirty=1 except
for shadow stack, which is Write=0,Dirty=1.

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_SAVED_DIRTY. For 32 bit, the same bit as
_PAGE_BIT_UFFD_WP is used, since user fault fd is not supported on 32
bit. This leaves one unused software bit on 32 bit (_PAGE_BIT_SOFT_DIRTY,
as this is also not supported on 32 bit).

Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to
actually begin creating _PAGE_SAVED_DIRTY PTEs will follow once other
pieces are in place.

Since this SavedDirty shifting is done for all x86 CPUs, this leaves
the possibility for the hardware oddity to still create Write=0,Dirty=1
PTEs in rare cases. Since these CPUs also don't support shadow stack, this
will be harmless as it was before the introduction of SavedDirty.

Implement the shifting logic to be branchless. Embed the logic of whether
to do the shifting (including checking the Write bits) so that it can be
called by future callers that would otherwise need additional branching
logic. This efficiency allows the logic of when to do the shifting to be
centralized, making the code easier to reason about.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-11-rick.p.edgecombe%40intel.com


# 1af5a810 25-Jul-2023 Alistair Popple <apopple@nvidia.com>

mmu_notifiers: rename invalidate_range notifier

There are two main use cases for mmu notifiers. One is by KVM which uses
mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries at
any time. Hence using start/end() can lead to memory corruption as these
callbacks happen too soon/late during page unmap.

mmu notifier users should therefore either use the start()/end() callbacks
or the invalidate_range() callbacks. To make this usage clearer rename
the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
update documention.

Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6bbd42e2 25-Jul-2023 Alistair Popple <apopple@nvidia.com>

mmu_notifiers: call invalidate_range() when invalidating TLBs

The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync
with the CPU page tables. Currently it is called from separate code paths
to the main CPU TLB invalidations. This can lead to a secondary TLB not
getting invalidated when required and makes it hard to reason about when
exactly the secondary TLB is invalidated.

To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs requiring
explicit software invalidations.

This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades require
a TLB invalidation. This invalidation is done by the architecture
specific ptep_set_access_flags() which calls flush_tlb_page() if required.
However this doesn't call the notifier resulting in infinite faults being
generated by devices using the SMMU if it has previously cached a
read-only PTE in it's TLB.

Moving the invalidations into the TLB invalidation functions ensures all
invalidations happen at the same time as the CPU invalidation. The
architecture specific flush_tlb_all() routines do not call the notifier as
none of the IOMMUs require this.

Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Tested-by: SeongJae Park <sj@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# db6c1f6f 17-Jul-2023 Yicong Yang <yangyicong@hisilicon.com>

mm/tlbbatch: introduce arch_flush_tlb_batched_pending()

Currently we'll flush the mm in flush_tlb_batched_pending() to avoid race
between reclaim unmaps pages by batched TLB flush and mprotect/munmap/etc.
Other architectures like arm64 may only need a synchronization
barrier(dsb) here rather than a full mm flush. So add
arch_flush_tlb_batched_pending() to allow an arch-specific implementation
here. This intends no functional changes on x86 since still a full mm
flush for x86.

Link: https://lkml.kernel.org/r/20230717131004.12662-4-yangyicong@huawei.com
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f73419bb 17-Jul-2023 Barry Song <v-songbaohua@oppo.com>

mm/tlbbatch: rename and extend some functions

This patch does some preparation works to extend batched TLB flush to
arm64. Including:
- Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm()
to accept an additional argument for address, architectures
like arm64 may need this for tlbi.
- Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending()
to match its current function since we don't need to handle
mm on architectures like arm64 and add_mm is not proper,
add_pending will make sense to both as on x86 we're pending the
TLB flush operations while on arm64 we're pending the synchronize
operations.

This intends no functional changes on x86.

Link: https://lkml.kernel.org/r/20230717131004.12662-3-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 65c8d30e 17-Jul-2023 Anshuman Khandual <khandual@linux.vnet.ibm.com>

mm/tlbbatch: introduce arch_tlbbatch_should_defer()

Patch series "arm64: support batched/deferred tlb shootdown during page
reclamation/migration", v11.

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free. A simplest micro benchmark shows even on
snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is
huge even for paging out one page mapped by only one process: 5.36% a.out
[kernel.kallsyms] [k] ptep_clear_flush

While pages are mapped by multiple processes or HW has more CPUs, the cost
should become even higher due to the bad scalability of tlb shootdown.
The same benchmark can result in 16.99% CPU consumption on ARM64 server
with around 100 cores according to the test on patch 4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only send tlbi instructions in the first stage -
arch_tlbbatch_add_mm()
2. wait for the completion of tlbi by dsb while doing tlbbatch
sync in arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
the patchset. The micro benchmark becomes 5% faster even for one page
mapped by single process on snapdragon 888.

Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset
does some renaming/extension for the current implementation first (Patch
1-3), then add the support on arm64 (Patch 4).


This patch (of 4):

The entire scheme of deferred TLB flush in reclaim path rests on the fact
that the cost to refill TLB entries is less than flushing out individual
entries by sending IPI to remote CPUs. But architecture can have
different ways to evaluate that. Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.

[yangyicong@hisilicon.com: rebase and fix incorrect return value type]
Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com
Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 013fdeb0 14-Jun-2023 Borislav Petkov (AMD) <bp@alien8.de>

x86/mm: Remove unused current_untag_mask()

e0bddc19ba95 ("x86/mm: Reduce untagged_addr() overhead for systems without LAM")

removed its only usage site so drop it.

Move the tlbstate_untag_mask up in the header and drop the ugly
ifdeffery as the unused declaration should be properly discarded.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20230614174148.5439-1-bp@alien8.de


# 74c228d2 12-Mar-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/uaccess: Provide untagged_addr() and remove tags before address check

untagged_addr() is a helper used by the core-mm to strip tag bits and
get the address to the canonical shape based on rules of the current
thread. It only handles userspace addresses.

The untagging mask is stored in per-CPU variable and set on context
switching to the task.

The tags must not be included into check whether it's okay to access the
userspace address. Strip tags in access_ok().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-7-kirill.shutemov%40linux.intel.com


# 82721d8b 12-Mar-2023 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

x86/mm: Handle LAM on context switch

Linear Address Masking mode for userspace pointers encoded in CR3 bits.
The mode is selected per-process and stored in mm_context_t.

switch_mm_irqs_off() now respects selected LAM mode and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20230312112612.31869-5-kirill.shutemov%40linux.intel.com


# 8f1d56f6 10-Jul-2022 Nadav Amit <namit@vmware.com>

x86/mm/tlb: Ignore f->new_tlb_gen when zero

Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
possible") introduced an optimization to skip superfluous TLB
flushes based on the generation provided in flush_tlb_info.

However, arch_tlbbatch_flush() does not provide any generation in
flush_tlb_info and populates the flush_tlb_info generation with
0. This 0 is causes the flush_tlb_info to be interpreted as a
superfluous, old flush. As a result, try_to_unmap_one() would
not perform any TLB flushes.

Fix it by checking whether f->new_tlb_gen is nonzero. Zero value
is anyhow is an invalid generation value. To avoid future
confusion, introduce TLB_GENERATION_INVALID constant and use it
properly. Add warnings to ensure no partial flushes are done with
TLB_GENERATION_INVALID or when f->mm is NULL, since this does not
make any sense.

In addition, add the missing unlikely().

[ dhansen: change VM_BUG_ON() -> VM_WARN_ON(), clarify changelog ]

Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/20220710232837.3618-1-namit@vmware.com


# c9fe6656 09-May-2022 Nadav Amit <namit@vmware.com>

mm/mprotect: do not flush when not required architecturally

Currently, using mprotect() to unprotect a memory region or uffd to
unprotect a memory region causes a TLB flush. However, in such cases the
PTE is often not modified (i.e., remain RO) and therefore not TLB flush is
needed.

Add an arch-specific pte_needs_flush() which tells whether a TLB flush is
needed based on the old PTE and the new one. Implement an x86
pte_needs_flush().

Always flush the TLB when it is architecturally needed even when skipping
a TLB flush might only result in a spurious page-faults by skipping the
flush.

Even with such conservative manner, we can in the future further refine
the checks to test whether a PTE is present by only considering the
architectural _PAGE_PRESENT flag instead of {pte|pmd}_preesnt(). For not
be careful and use the latter.

Link: https://lkml.kernel.org/r/20220401180821.1986781-3-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# f154f290 02-Dec-2021 Joerg Roedel <jroedel@suse.de>

x86/mm/64: Flush global TLB on boot and AP bringup

The AP bringup code uses the trampoline_pgd page-table which
establishes global mappings in the user range of the address space.
Flush the global TLB entries after the indentity mappings are removed so
no stale entries remain in the TLB.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-3-joro@8bytes.org


# 371b09c6 08-Jan-2021 Balbir Singh <sblbir@amazon.com>

x86/mm: Refactor cond_ibpb() to support other use cases

cond_ibpb() has the necessary bits required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases like L1D
flushing on context switch.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-3-sblbir@amazon.com


# 2f4305b1 20-Feb-2021 Nadav Amit <namit@vmware.com>

x86/mm/tlb: Privatize cpu_tlbstate

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-6-namit@vmware.com


# 4ce94eab 20-Feb-2021 Nadav Amit <namit@vmware.com>

x86/mm/tlb: Flush remote and local TLBs concurrently

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts
Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-5-namit@vmware.com


# 4c1ba392 20-Feb-2021 Nadav Amit <namit@vmware.com>

x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

The unification of these two functions allows to use them in the updated
SMP infrastrucutre.

To do so, remove the reason argument from flush_tlb_func_local(), add
a member to struct tlb_flush_info that says which CPU initiated the
flush and act accordingly. Optimize the size of flush_tlb_info while we
are at it.

Unfortunately, this prevents us from using a constant tlb_flush_info for
arch_tlbbatch_flush(), but in a later stage we may be able to inline
tlb_flush_info into the IPI data, so it should not have an impact
eventually.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20210220231712.2475218-3-namit@vmware.com


# bfe3d8f6 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Restrict access to tlbstate

Hide tlbstate, flush_tlb_info and related helpers when tlbflush.h is
included from a module. Modules have absolutely no business with these
internals.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092600.328438734@linutronix.de


# 6c9b7d79 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move PCID helpers where they are used

Aside of the fact that they are used only in the TLB code, especially
having the comment close to the actual implementation makes a lot of
sense.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092600.145772183@linutronix.de


# af5c40c6 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Uninline nmi_uaccess_okay()

cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.

nmi_access_ok() is the last inline function which requires access to
cpu_tlbstate. Move it into the TLB code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092600.052543007@linutronix.de


# 96f59fe2 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move cr4_set_bits_and_update_boot() to the usage site

No point in having this exposed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.940978251@linutronix.de


# 69de6c1a 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move paravirt_tlb_remove_table() to the usage site

Move it where the only user is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.849801011@linutronix.de


# 4b04e6c2 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move __flush_tlb_all() out of line

Reduce the number of required exports to one and make flush_tlb_global()
static to the TLB code.

flush_tlb_local() cannot be confined to the TLB code as the MTRR
handling requires a PGE-less flush.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200421092559.740388137@linutronix.de


# 29def599 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move flush_tlb_others() out of line

cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.

As a last step, move __flush_tlb_others() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.641957686@linutronix.de


# 58430c5d 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move __flush_tlb_one_kernel() out of line

cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.

As a fourth step, move __flush_tlb_one_kernel() out of line and hide
the native function. The latter can be static when CONFIG_PARAVIRT is
disabled.

Consolidate the name space while at it and remove the pointless extra
wrapper in the paravirt code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.535159540@linutronix.de


# 127ac915 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move __flush_tlb_one_user() out of line

cpu_tlbstate is exported because various TLB-related functions need access
to it, but cpu_tlbstate is sensitive information which should only be
accessed by well-contained kernel functions and not be directly exposed to
modules.

As a third step, move _flush_tlb_one_user() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.

Consolidate the name space while at it and remove the pointless extra
wrapper in the paravirt code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.428213098@linutronix.de


# cd30d26c 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move __flush_tlb_global() out of line

cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.

As a second step, move __flush_tlb_global() out of line and hide the
native function. The latter can be static when CONFIG_PARAVIRT is
disabled.

Consolidate the namespace while at it and remove the pointless extra
wrapper in the paravirt code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.336916818@linutronix.de


# 2faf153b 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/tlb: Move __flush_tlb() out of line

cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.

As a first step, move __flush_tlb() out of line and hide the native
function. The latter can be static when CONFIG_PARAVIRT is disabled.

Consolidate the namespace while at it and remove the pointless extra
wrapper in the paravirt code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092559.246130908@linutronix.de


# d8f0b353 21-Apr-2020 Thomas Gleixner <tglx@linutronix.de>

x86/cpu: Uninline CR4 accessors

cpu_tlbstate is exported because various TLB-related functions need
access to it, but cpu_tlbstate is sensitive information which should
only be accessed by well-contained kernel functions and not be directly
exposed to modules.

The various CR4 accessors require cpu_tlbstate as the CR4 shadow cache
is located there.

In preparation for unexporting cpu_tlbstate, create a builtin function
for manipulating CR4 and rework the various helpers to use it.

No functional change.

[ bp: push the export of native_write_cr4() only when CONFIG_LKTDM=m to
the last patch in the series. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421092558.939985695@linutronix.de


# 21e450d2 18-Jun-2019 Jan Kiszka <jan.kiszka@siemens.com>

x86/mm: Avoid redundant interrupt disable in load_mm_cr4()

load_mm_cr4() is always called with interrupts disabled from:

- switch_mm_irqs_off()
- refresh_pce(), which is a on_each_cpu() callback

Thus, disabling interrupts in cr4_set/clear_bits() is redundant.

Implement cr4_set/clear_bits_irqsoff() helpers, rename load_mm_cr4() to
load_mm_cr4_irqsoff() and use the new helpers. The new helpers do not need
a lockdep assert as __cr4_set() has one already.

The renaming in combination with the checks in __cr4_set() ensure that any
changes in the boundary conditions at the call sites will be detected.

[ tglx: Massaged change log ]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/0fbbcb64-5f26-4ffb-1bb9-4f5f48426893@siemens.com


# 5932c9fd 25-Apr-2019 Nadav Amit <namit@vmware.com>

mm/tlb: Provide default nmi_uaccess_okay()

x86 has an nmi_uaccess_okay(), but other architectures do not.
Arch-independent code might need to know whether access to user
addresses is ok in an NMI context or in other code whose execution
context is unknown. Specifically, this function is needed for
bpf_probe_write_user().

Add a default implementation of nmi_uaccess_okay() for architectures
that do not have such a function.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-23-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a72a1932 28-Mar-2019 Jann Horn <jannh@google.com>

x86/mm/tlb: Define LOADED_MM_SWITCHING with pointer-sized number

sparse complains that LOADED_MM_SWITCHING's definition casts an int to a
pointer:

arch/x86/mm/tlb.c:409:17: warning: non size-preserving integer to pointer cast

Use a pointer-sized integer constant instead.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sai Praneeth <sai.praneeth.prakhya@intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190328230939.15711-1-jannh@google.com


# 4c71a2b6 25-Nov-2018 Thomas Gleixner <tglx@linutronix.de>

x86/speculation: Prepare for conditional IBPB in switch_mm()

The IBPB speculation barrier is issued from switch_mm() when the kernel
switches to a user space task with a different mm than the user space task
which ran last on the same CPU.

An additional optimization is to avoid IBPB when the incoming task can be
ptraced by the outgoing task. This optimization only works when switching
directly between two user space tasks. When switching from a kernel task to
a user space task the optimization fails because the previous task cannot
be accessed anymore. So for quite some scenarios the optimization is just
adding overhead.

The upcoming conditional IBPB support will issue IBPB only for user space
tasks which have the TIF_SPEC_IB bit set. This requires to handle the
following cases:

1) Switch from a user space task (potential attacker) which has
TIF_SPEC_IB set to a user space task (potential victim) which has
TIF_SPEC_IB not set.

2) Switch from a user space task (potential attacker) which has
TIF_SPEC_IB not set to a user space task (potential victim) which has
TIF_SPEC_IB set.

This needs to be optimized for the case where the IBPB can be avoided when
only kernel threads ran in between user space tasks which belong to the
same process.

The current check whether two tasks belong to the same context is using the
tasks context id. While correct, it's simpler to use the mm pointer because
it allows to mangle the TIF_SPEC_IB bit into it. The context id based
mechanism requires extra storage, which creates worse code.

When a task is scheduled out its TIF_SPEC_IB bit is mangled as bit 0 into
the per CPU storage which is used to track the last user space mm which was
running on a CPU. This bit can be used together with the TIF_SPEC_IB bit of
the incoming task to make the decision whether IBPB needs to be issued or
not to cover the two cases above.

As conditional IBPB is going to be the default, remove the dubious ptrace
check for the IBPB always case and simply issue IBPB always when the
process changes.

Move the storage to a different place in the struct as the original one
created a hole.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Casey Schaufler <casey.schaufler@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Waiman Long <longman9394@gmail.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Stewart <david.c.stewart@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181125185005.466447057@linutronix.de


# f77084d9 16-Oct-2018 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

x86/mm/pat: Disable preemption around __flush_tlb_all()

The WARN_ON_ONCE(__read_cr3() != build_cr3()) in switch_mm_irqs_off()
triggers every once in a while during a snapshotted system upgrade.

The warning triggers since commit decab0888e6e ("x86/mm: Remove
preempt_disable/enable() from __native_flush_tlb()"). The callchain is:

get_page_from_freelist() -> post_alloc_hook() -> __kernel_map_pages()

with CONFIG_DEBUG_PAGEALLOC enabled.

Disable preemption during CR3 reset / __flush_tlb_all() and add a comment
why preemption has to be disabled so it won't be removed accidentaly.

Add another preemptible() check in __flush_tlb_all() to catch callers with
enabled preemption when PGE is enabled, because PGE enabled does not
trigger the warning in __native_flush_tlb(). Suggested by Andy Lutomirski.

Fixes: decab0888e6e ("x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20181017103432.zgv46nlu3hc7k4rq@linutronix.de


# 97807813 25-Sep-2018 Rik van Riel <riel@surriel.com>

x86/mm/tlb: Add freed_tables element to flush_tlb_info

Pass the information on to native_flush_tlb_others.

No functional changes.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Cc: luto@kernel.org
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-7-riel@surriel.com


# 016c4d92 25-Sep-2018 Rik van Riel <riel@surriel.com>

x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range

Add an argument to flush_tlb_mm_range to indicate whether page tables
are about to be freed after this TLB flush. This allows for an
optimization of flush_tlb_mm_range to skip CPUs in lazy TLB mode.

No functional changes.

Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Cc: hpa@zytor.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-6-riel@surriel.com


# 5462bc3a 25-Sep-2018 Rik van Riel <riel@surriel.com>

x86/mm/tlb: Always use lazy TLB mode

On most workloads, the number of context switches far exceeds the
number of TLB flushes sent. Optimizing the context switches, by always
using lazy TLB mode, speeds up those workloads.

This patch results in about a 1% reduction in CPU use on a two socket
Broadwell system running a memcache like workload.

Cc: npiggin@gmail.com
Cc: efault@gmx.de
Cc: will.deacon@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Cc: luto@kernel.org
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
(cherry picked from commit 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e)
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com


# a31acd3e 25-Aug-2018 Peter Zijlstra <peterz@infradead.org>

x86/mm: Page size aware flush_tlb_mm_range()

Use the new tlb_get_unmap_shift() to determine the stride of the
INVLPG loop.

Cc: Nick Piggin <npiggin@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>


# 4012e77a 29-Aug-2018 Andy Lutomirski <luto@kernel.org>

x86/nmi: Fix NMI uaccess race against CR3 switching


A NMI can hit in the middle of context switching or in the middle of
switch_mm_irqs_off(). In either case, CR3 might not match current->mm,
which could cause copy_from_user_nmi() and friends to read the wrong
memory.

Fix it by adding a new nmi_uaccess_okay() helper and checking it in
copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/dd956eba16646fd0b15c3c0741269dfd84452dac.1535557289.git.luto@kernel.org


# 48a8b97c 22-Aug-2018 Peter Zijlstra <peterz@infradead.org>

x86/mm: Only use tlb_remove_table() for paravirt

If we don't use paravirt; don't play unnecessary and complicated games
to free page-tables.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 52a288c7 22-Aug-2018 Peter Zijlstra <peterz@infradead.org>

x86/mm/tlb: Revert the recent lazy TLB patches

Revert commits:

95b0e6357d3e x86/mm/tlb: Always use lazy TLB mode
64482aafe55f x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
ac0315896970 x86/mm/tlb: Make lazy TLB mode lazier
61d0beb5796a x86/mm/tlb: Restructure switch_mm_irqs_off()
2ff6ddf19c0e x86/mm/tlb: Leave lazy TLB mode at page table free time

In order to simplify the TLB invalidate fixes for x86 and unify the
parts that need backporting. We'll try again later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 95b0e635 16-Jul-2018 Rik van Riel <riel@surriel.com>

x86/mm/tlb: Always use lazy TLB mode

Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2ff6ddf1 16-Jul-2018 Rik van Riel <riel@surriel.com>

x86/mm/tlb: Leave lazy TLB mode at page table free time

Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.

The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operationgs like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 162ee5a8 04-Apr-2018 Sai Praneeth <sai.praneeth.prakhya@intel.com>

x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of this_cpu_has() in build_cr3_noflush()

Linus reported the following boot warning:

WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/tlbflush.h:134 load_new_mm_cr3+0x114/0x170
[...]
Call Trace:
switch_mm_irqs_off+0x267/0x590
switch_mm+0xe/0x20
efi_switch_mm+0x3e/0x50
efi_enter_virtual_mode+0x43f/0x4da
start_kernel+0x3bf/0x458
secondary_startup_64+0xa5/0xb0

... after merging:

03781e40890c: x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3

When the platform supports PCID and if CONFIG_DEBUG_VM=y is enabled,
build_cr3_noflush() (called via switch_mm()) does a sanity check to see
if X86_FEATURE_PCID is set.

Presently, build_cr3_noflush() uses "this_cpu_has(X86_FEATURE_PCID)" to
perform the check but this_cpu_has() works only after SMP is initialized
(i.e. per cpu cpu_info's should be populated) and this happens to be very
late in the boot process (during rest_init()).

As efi_runtime_services() are called during (early) kernel boot time
and run time, modify build_cr3_noflush() to use boot_cpu_has() all the
time. As suggested by Dave Hansen, this should be OK because all CPU's have
same capabilities on x86.

With this change the warning is fixed.

( Dave also suggested that we put a warning in this_cpu_has() if it's used
early in the boot process. This is still work in progress as it affects
MCE. )

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Lee Chun-Yi <jlee@suse.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/1522870459-7432-1-git-send-email-sai.praneeth.prakhya@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1299ef1d 31-Jan-2018 Andy Lutomirski <luto@kernel.org>

x86/mm: Rename flush_tlb_single() and flush_tlb_one() to __flush_tlb_one_[user|kernel]()

flush_tlb_single() and flush_tlb_one() sound almost identical, but
they really mean "flush one user translation" and "flush one kernel
translation". Rename them to flush_tlb_one_user() and
flush_tlb_one_kernel() to make the semantics more obvious.

[ I was looking at some PTI-related code, and the flush-one-address code
is unnecessarily hard to understand because the names of the helpers are
uninformative. This came up during PTI review, but no one got around to
doing it. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/3303b02e3c3d049dc5235d5651e0ae6d29a34354.1517414378.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 18bf3c3e 29-Jan-2018 Tim Chen <tim.c.chen@linux.intel.com>

x86/speculation: Use Indirect Branch Prediction Barrier in context switch

Flush indirect branches when switching into a process that marked itself
non dumpable. This protects high value processes like gpg better,
without having too high performance overhead.

If done naïvely, we could switch to a kernel idle thread and then back
to the original process, such as:

process A -> idle -> process A

In such scenario, we do not have to do IBPB here even though the process
is non-dumpable, as we are switching back to the same process after a
hiatus.

To avoid the redundant IBPB, which is expensive, we track the last mm
user context ID. The cost is to have an extra u64 mm context id to track
the last mm we were using before switching to the init_mm used by idle.
Avoiding the extra IBPB is probably worth the extra memory for this
common scenario.

For those cases where tlb_defer_switch_to_init_mm() returns true (non
PCID), lazy tlb will defer switch to init_mm, so we will not be changing
the mm for the process A -> idle -> process A switch. So IBPB will be
skipped for this case.

Thanks to the reviewers and Andy Lutomirski for the suggestion of
using ctx_id which got rid of the problem of mm pointer recycling.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ak@linux.intel.com
Cc: karahmed@amazon.de
Cc: arjan@linux.intel.com
Cc: torvalds@linux-foundation.org
Cc: linux@dominikbrodowski.net
Cc: peterz@infradead.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: pbonzini@redhat.com
Cc: gregkh@linux-foundation.org
Link: https://lkml.kernel.org/r/1517263487-3708-1-git-send-email-dwmw@amazon.co.uk


# f10ee3dc 13-Jan-2018 Thomas Gleixner <tglx@linutronix.de>

x86/pti: Fix !PCID and sanitize defines

The switch to the user space page tables in the low level ASM code sets
unconditionally bit 12 and bit 11 of CR3. Bit 12 is switching the base
address of the page directory to the user part, bit 11 is switching the
PCID to the PCID associated with the user page tables.

This fails on a machine which lacks PCID support because bit 11 is set in
CR3. Bit 11 is reserved when PCID is inactive.

While the Intel SDM claims that the reserved bits are ignored when PCID is
disabled, the AMD APM states that they should be cleared.

This went unnoticed as the AMD APM was not checked when the code was
developed and reviewed and test systems with Intel CPUs never failed to
boot. The report is against a Centos 6 host where the guest fails to boot,
so it's not yet clear whether this is a virt issue or can happen on real
hardware too, but thats irrelevant as the AMD APM clearly ask for clearing
the reserved bits.

Make sure that on non PCID machines bit 11 is not set by the page table
switching code.

Andy suggested to rename the related bits and masks so they are clearly
describing what they should be used for, which is done as well for clarity.

That split could have been done with alternatives but the macro hell is
horrible and ugly. This can be done on top if someone cares to remove the
extra orq. For now it's a straight forward fix.

Fixes: 6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801140009150.2371@nanos


# decab088 30-Dec-2017 Thomas Gleixner <tglx@linutronix.de>

x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()

The preempt_disable/enable() pair in __native_flush_tlb() was added in
commit:

5cf0791da5c1 ("x86/mm: Disable preemption during CR3 read+write")

... to protect the UP variant of flush_tlb_mm_range().

That preempt_disable/enable() pair should have been added to the UP variant
of flush_tlb_mm_range() instead.

The UP variant was removed with commit:

ce4a4e565f52 ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code")

... but the preempt_disable/enable() pair stayed around.

The latest change to __native_flush_tlb() in commit:

6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel switches")

... added an access to a per CPU variable outside the preempt disabled
regions, which makes no sense at all. __native_flush_tlb() must always
be called with at least preemption disabled.

Remove the preempt_disable/enable() pair and add a WARN_ON_ONCE() to catch
bad callers independent of the smp_processor_id() debugging.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171230211829.679325424@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 0a126abd 05-Dec-2017 Peter Zijlstra <peterz@infradead.org>

x86/mm: Clarify the whole ASID/kernel PCID/user PCID naming

Ideally we'd also use sparse to enforce this separation so it becomes much
more difficult to mess up.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6cff64b8 04-Dec-2017 Dave Hansen <dave.hansen@linux.intel.com>

x86/mm: Use INVPCID for __native_flush_tlb_single()

This uses INVPCID to shoot down individual lines of the user mapping
instead of marking the entire user map as invalid. This
could/might/possibly be faster.

This for sure needs tlb_single_page_flush_ceiling to be redetermined;
esp. since INVPCID is _slow_.

A detailed performance analysis is available here:

https://lkml.kernel.org/r/3062e486-3539-8a1f-5724-16199420be71@intel.com

[ Peterz: Split out from big combo patch ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6fd166aa 04-Dec-2017 Peter Zijlstra <peterz@infradead.org>

x86/mm: Use/Fix PCID to optimize user/kernel switches

We can use PCID to retain the TLBs across CR3 switches; including those now
part of the user/kernel switch. This increases performance of kernel
entry/exit at the cost of more expensive/complicated TLB flushing.

Now that we have two address spaces, one for kernel and one for user space,
we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
(just like we use the PFN LSB for the PGD). Since we do TLB invalidation
from kernel space, the existing code will only invalidate the kernel PCID,
we augment that by marking the corresponding user PCID invalid, and upon
switching back to userspace, use a flushing CR3 write for the switch.

In order to access the user_pcid_flush_mask we use PER_CPU storage, which
means the previously established SWAPGS vs CR3 ordering is now mandatory
and required.

Having to do this memory access does require additional registers, most
sites have a functioning stack and we can spill one (RAX), sites without
functional stack need to otherwise provide the second scratch register.

Note: PCID is generally available on Intel Sandybridge and later CPUs.
Note: Up until this point TLB flushing was broken in this series.

Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2ea907c4 04-Dec-2017 Dave Hansen <dave.hansen@linux.intel.com>

x86/mm: Allow flushing for future ASID switches

If changing the page tables in such a way that an invalidation of all
contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
by:

1. INVPCID for each PCID (works for single pages too).

2. Load CR3 with each PCID without the NOFLUSH bit set

3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.

But, none of these are really feasible since there are ~6 ASIDs (12 with
PAGE_TABLE_ISOLATION) at the time that invalidation is required.
Instead of actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation to be
required.

At the next context-switch, look for this indicator
('invalidate_other' being set) invalidate all of the
cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.

[ tglx: Folded more fixups from Peter ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 1a3b0cae 05-Dec-2017 Peter Zijlstra <peterz@infradead.org>

x86/mm: Create asm/invpcid.h

Unclutter tlbflush.h a little.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# dd95f1a4 04-Dec-2017 Dave Hansen <dave.hansen@linux.intel.com>

x86/mm: Put MMU to hardware ASID translation in one place

There are effectively two ASID types:

1. The one stored in the mmu_context that goes from 0..5
2. The one programmed into the hardware that goes from 1..6

This consolidates the locations where converting between the two (by doing
a +1) to a single place which gives us a nice place to comment.
PAGE_TABLE_ISOLATION will also need to, given an ASID, know which hardware
ASID to flush for the userspace mapping.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# cb0a9144 04-Dec-2017 Dave Hansen <dave.hansen@linux.intel.com>

x86/mm: Remove hard-coded ASID limit checks

First, it's nice to remove the magic numbers.

Second, PAGE_TABLE_ISOLATION is going to consume half of the available ASID
space. The space is currently unused, but add a comment to spell out this
new restriction.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 50fb83a6 04-Dec-2017 Dave Hansen <dave.hansen@linux.intel.com>

x86/mm: Move the CR3 construction functions to tlbflush.h

For flushing the TLB, the ASID which has been programmed into the hardware
must be known. That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into to the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move the
CR3 building over to tlbflush.h.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3f67af51 05-Dec-2017 Peter Zijlstra <peterz@infradead.org>

x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what

Per popular request..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b5fc6d94 05-Dec-2017 Peter Zijlstra <peterz@infradead.org>

x86/mm: Remove superfluous barriers

atomic64_inc_return() already implies smp_mb() before and after.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 23cb7d46 05-Dec-2017 Peter Zijlstra <peterz@infradead.org>

x86/microcode: Dont abuse the TLB-flush interface

Commit:

ec400ddeff20 ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU")

... grubbed into tlbflush internals without coherent explanation.

Since it says its a precaution and the SDM doesn't mention anything like
this, take it out back.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: fenghua.yu@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 9d0b6232 24-Nov-2017 Nadav Amit <namit@vmware.com>

x86/tlb: Disable interrupts when changing CR4

CR4 modifications are implemented as RMW operations which update a shadow
variable and write the result to CR4. The RMW operation is protected by
preemption disable, but there is no enforcement or debugging mechanism.

CR4 modifications happen also in interrupt context via
__native_flush_tlb_global(). This implementation does not affect a
interrupted thread context CR4 operation, because the CR4 toggle restores
the original content and does not modify the shadow variable.

So the current situation seems to be safe, but a recent patch tried to add
an actual RMW operation in interrupt context, which will cause subtle
corruptions.

To prevent that and make the CR4 handling future proof:

- Add a lockdep assertion to __cr4_set() which will catch interrupt
enabled invocations

- Disable interrupts in the cr4 manipulator inlines

- Rename cr4_toggle_bits() to cr4_toggle_bits_irqsoff(). This is called
from __switch_to_xtra() where interrupts are already disabled and
performance matters.

All other call sites are not performance critical, so the extra overhead of
an additional local_irq_save/restore() pair is not a problem. If new call
sites care about performance then the necessary _irqsoff() variants can be
added.

[ tglx: Condensed the patch by moving the irq protection inside the
manipulator functions. Updated changelog ]

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Luck <tony.luck@intel.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: nadav.amit@gmail.com
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20171125032907.2241-3-namit@vmware.com


# 0c3292ca 24-Nov-2017 Nadav Amit <namit@vmware.com>

x86/tlb: Refactor CR4 setting and shadow write

Refactor the write to CR4 and its shadow value. This is done in
preparation for the addition of an assertion to check that IRQs are
disabled during CR4 update.

No functional change.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: nadav.amit@gmail.com
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20171125032907.2241-2-namit@vmware.com


# b2441318 01-Nov-2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 7ac7f2c3 14-Oct-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Remove debug/x86/tlb_defer_switch_to_init_mm

Borislav thinks that we don't need this knob in a released kernel.
Get rid of it.

Requested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
Link: http://lkml.kernel.org/r/1fa72431924e81e86c164ff7881bf9240d1f1a6c.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 4e57b946 14-Oct-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Tidy up "x86/mm: Flush more aggressively in lazy TLB mode"

Due to timezones, commit:

b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")

was an outdated patch that well tested and fixed the bug but didn't
address Borislav's review comments.

Tidy it up:

- The name "tlb_use_lazy_mode()" was highly confusing. Change it to
"tlb_defer_switch_to_init_mm()", which describes what it actually
means.

- Move the static_branch crap into a helper.

- Improve comments.

Actually removing the debugfs option is in the next patch.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
Link: http://lkml.kernel.org/r/154ef95428d4592596b6e98b0af1d2747d6cfbf8.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b956575b 09-Oct-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Flush more aggressively in lazy TLB mode

Since commit:

94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent. This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

- A speculative page walk starting from a bogus page table could read
IO addresses. I haven't seen any reports of this causing problems.

- A speculative page walk that involves a bogus page table can install
garbage in the TLB. Such garbage would always be at a user VA, but
some AMD CPUs have logic that triggers a machine check when it notices
these bogus entries. I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode. With this patch
applied, we do it in one of two ways:

- If we have PCID, we simply switch back to init_mm's page tables
when we enter a kernel thread -- this seems to be quite cheap
except for the cost of serializing the CPU.

- If we don't have PCID, then we set a flag and switch to init_mm
the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed. Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 72c0098d 06-Sep-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Reinitialize TLB state on hotplug and resume

When Linux brings a CPU down and back up, it switches to init_mm and then
loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
of masking off the ASID bits in CR3.

This can result in some confusion in the TLB handling code. If we
bring a CPU down and back up with any ASID other than 0, we end up
with the wrong ASID active on the CPU after resume. This could
cause our internal state to become corrupt, although major
corruption is unlikely because init_mm doesn't have any user pages.
More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
in the next context switch. The result of *that* is a failure to
resume from suspend with probability 1 - 1/6^(cpus-1).

Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Jiri Kosina <jikos@kernel.org>
Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 10af6235 24-Jul-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID

PCID is a "process context ID" -- it's what other architectures call
an address space ID. Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB. x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID. x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs. To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently. We use a PCID to identify a
recently-used mm on a per-cpu basis. An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs). I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers. Unpatched means commit a4eb8b993554, so all the
bookkeeping overhead is gone.

ping-pong between two mms on the same CPU using eventfd:

patched: 1.22µs
patched, nopcid: 1.33µs
unpatched: 1.34µs

Same ping-pong, but now touch 512 pages (all zero-page to minimize
cache misses) each iteration. dTLB misses are measured by
dtlb_load_misses.miss_causes_a_walk:

patched: 1.8µs 11M dTLB misses
patched, nopcid: 6.2µs, 207M dTLB misses
unpatched: 6.1µs, 190M dTLB misses

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/9ee75f17a81770feed616358e6860d98a2a5b1e7.1500957502.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 660da7c9 29-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Enable CR4.PCIDE on supported systems

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect. A followup patch will start using PCID.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/6327ecd907b32f79d5aa0d466f04503bbec5df88.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 94b1b03b 29-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Rework lazy TLB mode and TLB freshness tracking

x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB. This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely. When we enter lazy mode, we simply remove the
CPU from mm_cpumask. This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.

Note to reviewers: this patch, my itself, looks a bit odd. I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it at array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1. Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works. This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
(turbo off, intel_pstate requesting max performance) under KVM with
the guest using idle=poll (to avoid artifacts when bouncing between
CPUs). I haven't done any real statistics here -- I just ran them
in a loop and picked the fastest results that didn't look like
outliers. Unpatched means commit a4eb8b993554, so all the
bookkeeping overhead is gone.

MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In
an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
This is intended to be a nearly worst-case test.

patched: 13.4µs
unpatched: 21.6µs

Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
nrounds = 100, 256M data

patched: 1.1 seconds or so
unpatched: 1.9 seconds or so

The sleepup on Vitaly's test appearss to be because it spends a lot
of time blocked on mmap_sem, and this patch avoids sending IPIs to
blocked CPUs.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Travis <travis@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/ddf2c92962339f4ba39d8fc41b853936ec0b44f1.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# b0579ade 29-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

- Lazy mm switching currently works by switching to init_mm when
it would otherwise flush. This is wasteful: there isn't fundamentally
any need to update CR3 at all when going lazy or when returning from
lazy mode, nor is there any need to receive flush IPIs at all. Instead,
we should just stop trying to keep the TLB coherent when we go lazy and,
when unlazying, check whether we missed any flushes.

- PCID will let us keep recent user contexts alive in the TLB. If we
start doing this, we need a way to decide whether those contexts are
up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on. We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targetting the same CPU
race. The complexity in this patch would not be worth it on its own,
but it will enable improved lazy TLB tracking and PCID.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1210fb244bc9cbe7677f7f0b72db4d359675f24b.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# f39681ed 29-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Give each mm TLB flush generation a unique ID

This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.

This patch is split out for ease of review. By itself, it has no
real effect other than creating and updating the new variables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/413a91c24dab3ed0caa5f4e4d017d87b0857f920.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d5436812 20-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Remove reset_lazy_tlbstate()

The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/3acc7ad02a2ec060d2321a1e0f6de1cb90069517.1498022414.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6c690ee1 12-Jun-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()

The kernel has several code paths that read CR3. Most of them assume that
CR3 contains the PGD's physical address, whereas some of them awkwardly
use PHYSICAL_PAGE_MASK to mask off low bits.

Add explicit mask macros for CR3 and convert all of the CR3 readers.
This will keep them from breaking when PCID is enabled.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: xen-devel <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/883f8fb121f4616c1c1427ad87350bb2f5ffeca1.1497288170.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 3d28ebce 28-May-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Rework lazy TLB to track the actual loaded mm

Lazy TLB state is currently managed in a rather baroque manner.
AFAICT, there are three possible states:

- Non-lazy. This means that we're running a user thread or a
kernel thread that has called use_mm(). current->mm ==
current->active_mm == cpu_tlbstate.active_mm and
cpu_tlbstate.state == TLBSTATE_OK.

- Lazy with user mm. We're running a kernel thread without an mm
and we're borrowing an mm_struct. We have current->mm == NULL,
current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
!= TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0). The current cpu is set
in mm_cpumask(current->active_mm). CR3 points to
current->active_mm->pgd. The TLB is up to date.

- Lazy with init_mm. This happens when we call leave_mm(). We
have current->mm == NULL, current->active_mm ==
cpu_tlbstate.active_mm, but that mm is only relelvant insofar as
the scheduler is tracking it for refcounting. cpu_tlbstate.state
!= TLBSTATE_OK. The current cpu is clear in
mm_cpumask(current->active_mm). CR3 points to swapper_pg_dir,
i.e. init_mm->pgd.

This patch simplifies the situation. Other than perf, x86 stops
caring about current->active_mm at all. We have
cpu_tlbstate.loaded_mm pointing to the mm that CR3 references. The
TLB is always up to date for that mm. leave_mm() just switches us
to init_mm. There are no longer any special cases for mm_cpumask,
and switch_mm() switches mms without worrying about laziness.

After this patch, cpu_tlbstate.state serves only to tell the TLB
flush code whether it may switch to init_mm instead of doing a
normal flush.

This makes fairly extensive changes to xen_exit_mmap(), which used
to look a bit like black magic.

Perf is unchanged. With or without this change, perf may behave a bit
erratically if it tries to read user memory in kernel thread context.
We should build on this patch to teach perf to never look at user
memory when cpu_tlbstate.loaded_mm != current->mm.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ce4a4e56 28-May-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code

The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
Aside from that, it's fallen quite a bit behind the SMP code:

- flush_tlb_mm_range() didn't flush individual pages if the range
was small.

- The lazy TLB code was much weaker. This usually wouldn't matter,
but, if a kernel thread flushed its lazy "active_mm" more than
once (due to reclaim or similar), it wouldn't be unlazied and
would instead pointlessly flush repeatedly.

- Tracepoints were missing.

Aside from that, simply having the UP code around was a maintanence
burden, since it means that any change to the TLB flush code had to
make sure not to break it.

Simplify everything by deleting the UP code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 454bbad9 28-May-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases

The local flush path is very similar to the remote flush path.
Merge them.

This is intended to make no difference to behavior whatsoever. It
removes some code and will make future changes to the flushing
mechanics simpler.

This patch does remove one small optimization: flush_tlb_mm_range()
now has an unconditional smp_mb() instead of using MOV to CR3 or
INVLPG as a full barrier when applicable. I think this is okay for
a few reasons. First, smp_mb() is quite cheap compared to the cost
of a TLB flush. Second, this rearrangement makes a bigger
optimization available: with some work on the SMP function call
code, we could do the local and remote flushes in parallel. Third,
I'm planning a rework of the TLB flush algorithm that will require
an atomic operation at the beginning of each flush, and that
operation will replace the smp_mb().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# a2055abe 28-May-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Pass flush_tlb_info to flush_tlb_others() etc

Rather than passing all the contents of flush_tlb_info to
flush_tlb_others(), pass a pointer to the structure directly. For
consistency, this also removes the unnecessary cpu parameter from
uv_flush_tlb_others() to make its signature match the other
*flush_tlb_others() functions.

This serves two purposes:

- It will dramatically simplify future patches that change struct
flush_tlb_info, which I'm planning to do.

- struct flush_tlb_info is an adequate description of what to do
for a local flush, too, so by reusing it we can remove duplicated
code between local and remove flushes in a future patch.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
[ Fix build warning. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e73ad5ff 22-May-2017 Andy Lutomirski <luto@kernel.org>

mm, x86/mm: Make the batched unmap TLB flush API more generic

try_to_unmap_flush() used to open-code a rather x86-centric flush
sequence: local_flush_tlb() + flush_tlb_others(). Rearrange the
code so that the arch (only x86 for now) provides
arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code
calls those functions instead.

I'll want this for x86 because, to enable address space ids, I can't
support the flush_tlb_others() mode used by exising
try_to_unmap_flush() implementation with good performance. I can
support the new API fairly easily, though.

I imagine that other architectures may be in a similar position.
Architectures with strong remote flush primitives (arm64?) may have
even worse performance problems with flush_tlb_others() the way that
try_to_unmap_flush() uses it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ca6c99c0 22-May-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()

flush_tlb_page() was very similar to flush_tlb_mm_range() except that
it had a couple of issues:

- It was missing an smp_mb() in the case where
current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit)

- It was missing tracepoints and vm counter updates.

The only reason that I can see for keeping it at as a separate
function is that it could avoid a few branches that
flush_tlb_mm_range() needs to decide to flush just one page. This
hardly seems worthwhile. If we decide we want to get rid of those
branches again, a better way would be to introduce an
__flush_tlb_mm_range() helper and make both flush_tlb_page() and
flush_tlb_mm_range() use it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 29961b59 22-Apr-2017 Andy Lutomirski <luto@kernel.org>

x86/mm: Remove flush_tlb() and flush_tlb_current_task()

I was trying to figure out what how flush_tlb_current_task() would
possibly work correctly if current->mm != current->active_mm, but I
realized I could spare myself the effort: it has no callers except
the unused flush_tlb() macro.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 2c4ea6e2 10-Mar-2017 Daniel Borkmann <daniel@iogearbox.net>

x86/tlb: Fix tlb flushing when lguest clears PGE

Fengguang reported random corruptions from various locations on x86-32
after commits d2852a224050 ("arch: add ARCH_HAS_SET_MEMORY config") and
9d876e79df6a ("bpf: fix unlocking of jited image when module ronx not set")
that uses the former. While x86-32 doesn't have a JIT like x86_64, the
bpf_prog_lock_ro() and bpf_prog_unlock_ro() got enabled due to
ARCH_HAS_SET_MEMORY, whereas Fengguang's test kernel doesn't have module
support built in and therefore never had the DEBUG_SET_MODULE_RONX setting
enabled.

After investigating the crashes further, it turned out that using
set_memory_ro() and set_memory_rw() didn't have the desired effect, for
example, setting the pages as read-only on x86-32 would still let
probe_kernel_write() succeed without error. This behavior would manifest
itself in situations where the vmalloc'ed buffer was accessed prior to
set_memory_*() such as in case of bpf_prog_alloc(). In cases where it
wasn't, the page attribute changes seemed to have taken effect, leading to
the conclusion that a TLB invalidate didn't happen. Moreover, it turned out
that this issue reproduced with qemu in "-cpu kvm64" mode, but not for
"-cpu host". When the issue occurs, change_page_attr_set_clr() did trigger
a TLB flush as expected via __flush_tlb_all() through cpa_flush_range(),
though.

There are 3 variants for issuing a TLB flush: invpcid_flush_all() (depends
on CPU feature bits X86_FEATURE_INVPCID, X86_FEATURE_PGE), cr4 based flush
(depends on X86_FEATURE_PGE), and cr3 based flush. For "-cpu host" case in
my setup, the flush used invpcid_flush_all() variant, whereas for "-cpu
kvm64", the flush was cr4 based. Switching the kvm64 case to cr3 manually
worked fine, and further investigating the cr4 one turned out that
X86_CR4_PGE bit was not set in cr4 register, meaning the
__native_flush_tlb_global_irq_disabled() wrote cr4 twice with the same
value instead of clearing X86_CR4_PGE in the first write to trigger the
flush.

It turned out that X86_CR4_PGE was cleared from cr4 during init from
lguest_arch_host_init() via adjust_pge(). The X86_FEATURE_PGE bit is also
cleared from there due to concerns of using PGE in guest kernel that can
lead to hard to trace bugs (see bff672e630a0 ("lguest: documentation V:
Host") in init()). The CPU feature bits are cleared in dynamic
boot_cpu_data, but they never propagated to __flush_tlb_all() as it uses
static_cpu_has() instead of boot_cpu_has() for testing which variant of TLB
flushing to use, meaning they still used the old setting of the host
kernel.

Clearing via setup_clear_cpu_cap(X86_FEATURE_PGE) so this would propagate
to static_cpu_has() checks is too late at this point as sections have been
patched already, so for now, it seems reasonable to switch back to
boot_cpu_has(X86_FEATURE_PGE) as it was prior to commit c109bf95992b
("x86/cpufeature: Remove cpu_has_pge"). This lets the TLB flush trigger via
cr3 as originally intended, properly makes the new page attributes visible
and thus fixes the crashes seen by Fengguang.

Fixes: c109bf95992b ("x86/cpufeature: Remove cpu_has_pge")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: bp@suse.de
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: lkp@01.org
Cc: Laura Abbott <labbott@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernrl.org/r/20170301125426.l4nf65rx4wahohyl@wfg-t540p.sh.intel.com
Link: http://lkml.kernel.org/r/25c41ad9eca164be4db9ad84f768965b7eb19d9e.1489191673.git.daniel@iogearbox.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 5a920155 14-Feb-2017 Thomas Gleixner <tglx@linutronix.de>

x86/process: Optimize TIF_NOTSC switch

Provide and use a toggle helper instead of doing it with a branch.

x86_64: arch/x86/kernel/process.o
text data bss dec hex
3008 8577 16 11601 2d51 Before
2976 8577 16 11569 2d31 After

i386: arch/x86/kernel/process.o
text data bss dec hex
2925 8673 8 11606 2d56 Before
2893 8673 8 11574 2d36 After

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20170214081104.9244-4-khuey@kylehuey.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# 1ef55be1 29-Sep-2016 Andy Lutomirski <luto@kernel.org>

x86/asm: Get rid of __read_cr4_safe()

We use __read_cr4() vs __read_cr4_safe() inconsistently. On
CR4-less CPUs, all CR4 bits are effectively clear, so we can make
the code simpler and more robust by making __read_cr4() always fix
up faults on 32-bit kernels.

This may fix some bugs on old 486-like CPUs, but I don't have any
easy way to test that.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: david@saggiorato.net
Link: http://lkml.kernel.org/r/ea647033d357d9ce2ad2bbde5a631045f5052fb6.1475178370.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


# e1bfc11c 28-Sep-2016 Andy Lutomirski <luto@kernel.org>

x86/init: Fix cr4_init_shadow() on CR4-less machines

cr4_init_shadow() will panic on 486-like machines without CR4. Fix
it using __read_cr4_safe().

Reported-by: david@saggiorato.net
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4")
Link: http://lkml.kernel.org/r/43a20f81fb504013bf613913dc25574b45336a61.1475091074.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 5cf0791d 05-Aug-2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

x86/mm: Disable preemption during CR3 read+write

There's a subtle preemption race on UP kernels:

Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.

But then, there is this scenario on x86-UP:

TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

-> mmput()
-> exit_mmap()
-> tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_tlbonly()
-> tlb_flush()
-> flush_tlb_mm_range()
-> __flush_tlb_up()
-> __flush_tlb()
-> __native_flush_tlb()

At this point current->mm is NULL but current->active_mm still points to
the "old" mm.

Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.

Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.

Now the fun part:

Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.

The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more time…

This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.

Fix this by disabling preemption across the critical code section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 858eaaa7 01-Apr-2016 Nadav Amit <namit@vmware.com>

mm/rmap: batched invalidations should use existing api

The recently introduced batched invalidations mechanism uses its own
mechanism for shootdown. However, it does wrong accounting of
interrupts (e.g., inc_irq_stat is called for local invalidations),
trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and
may break some platforms as it bypasses the invalidation mechanisms of
Xen and SGI UV.

This patch reuses the existing TLB flushing mechnaisms instead. We use
NULL as mm to indicate a global invalidation is required.

Fixes 72b252aed506b8 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c109bf95 29-Mar-2016 Borislav Petkov <bp@suse.de>

x86/cpufeature: Remove cpu_has_pge

Use static_cpu_has() in __flush_tlb_all() due to the time-sensitivity of
this one.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459266123-21878-10-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# e2c7698c 10-Feb-2016 Borislav Petkov <bp@suse.de>

x86/mm: Fix INVPCID asm constraint

So we want to specify the dependency on both @pcid and @addr so that the
compiler doesn't reorder accesses to them *before* the TLB flush. But
for that to work, we need to express this properly in the inline asm and
deref the whole desc array, not the pointer to it. See clwb() for an
example.

This fixes the build error on 32-bit:

arch/x86/include/asm/tlbflush.h: In function ‘__invpcid’:
arch/x86/include/asm/tlbflush.h:26:18: error: memory input 0 is not directly addressable

which gcc4.7 caught but 5.x didn't. Which is strange. :-\

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Michael Matz <matz@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# d8bced79 29-Jan-2016 Andy Lutomirski <luto@kernel.org>

x86/mm: If INVPCID is available, use it to flush global mappings

On my Skylake laptop, INVPCID function 2 (flush absolutely
everything) takes about 376ns, whereas saving flags, twiddling
CR4.PGE to flush global mappings, and restoring flags takes about
539ns.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/ed0ef62581c0ea9c99b9bf6df726015e96d44743.1454096309.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 060a402a 29-Jan-2016 Andy Lutomirski <luto@kernel.org>

x86/mm: Add INVPCID helpers

This adds helpers for each of the four currently-specified INVPCID
modes.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/8a62b23ad686888cee01da134c91409e22064db9.1454096309.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# cd4d09ec 26-Jan-2016 Borislav Petkov <bp@suse.de>

x86/cpufeature: Carve out X86_FEATURE_*

Move them to a separate header and have the following
dependency:

x86/cpufeatures.h <- x86/processor.h <- x86/cpufeature.h

This makes it easier to use the header in asm code and not
include the whole cpufeature.h and add guards for asm.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 72b252ae 04-Sep-2015 Mel Gorman <mgorman@suse.de>

mm: send one IPI per CPU to TLB flush all entries after unmapping pages

An IPI is sent to flush remote TLBs when a page is unmapped that was
potentially accesssed by other CPUs. There are many circumstances where
this happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate
CPUs.

On small machines, this is not a significant problem but as machine gets
larger with more cores and more memory, the cost of these IPIs can be
high. This patch uses a simple structure that tracks CPUs that
potentially have TLB entries for pages being unmapped. When the unmapping
is complete, the full TLB is flushed on the assumption that a refill cost
is lower than flushing individual entries.

Architectures wishing to do this must give the following guarantee.

If a clean page is unmapped and not immediately flushed, the
architecture must guarantee that a write to that linear address
from a CPU with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting. The
architecture should consider whether the cost of the full TLB flush is
higher than sending an IPI to flush each individual entry. An additional
architecture helper called flush_tlb_local is required. It's a trivial
wrapper with some accounting in the x86 case.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU
readers of mapped files that consume 10*RAM.

Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
Ops lru-file-mmap-read-elapsed 159.62 ( 0.00%) 120.68 ( 24.40%)
Ops lru-file-mmap-read-time_range 30.59 ( 0.00%) 2.80 ( 90.85%)
Ops lru-file-mmap-read-time_stddv 6.70 ( 0.00%) 0.64 ( 90.38%)

4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
User 581.00 611.43
System 5804.93 4111.76
Elapsed 161.03 122.12

This is showing that the readers completed 24.40% faster with 29% less
system CPU time. From vmstats, it is known that the vanilla kernel was
interrupted roughly 900K times per second during the steady phase of the
test and the patched kernel was interrupts 180K times per second.

The impact is lower on a single socket machine.

4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
Ops lru-file-mmap-read-elapsed 25.33 ( 0.00%) 20.38 ( 19.54%)
Ops lru-file-mmap-read-time_range 0.91 ( 0.00%) 1.44 (-58.24%)
Ops lru-file-mmap-read-time_stddv 0.28 ( 0.00%) 0.47 (-65.34%)

4.2.0-rc1 4.2.0-rc1
vanilla flushfull-v7
User 58.09 57.64
System 111.82 76.56
Elapsed 27.29 22.55

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or have
relatively few mapped pages. It will have an unpredictable impact on the
workload running on the CPU being flushed as it'll depend on how many TLB
entries need to be refilled and how long that takes. Worst case, the TLB
will be completely cleared of active entries when the target PFNs were not
resident at all.

[sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1e02ce4c 24-Oct-2014 Andy Lutomirski <luto@amacapital.net>

x86: Store a per-cpu shadow copy of CR4

Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.

To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm. The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 375074cc 24-Oct-2014 Andy Lutomirski <luto@amacapital.net>

x86: Clean up cr4 manipulation

CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4). Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.

This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ec659934 21-Jan-2014 Mel Gorman <mgorman@suse.de>

mm, x86: Account for TLB flushes only when debugging

Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
vmstats: tlb flush counters") to cause overhead problems.

The counters are undeniably useful but how often do we really
need to debug TLB flush related issues? It does not justify
taking the penalty everywhere so make it a debugging option.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 6df46865 11-Sep-2013 Dave Hansen <dave@sr71.net>

mm: vmstats: track TLB flush stats on UP too

The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
for SMP.

UP systems do not do remote TLB flushes, so compile those counters out on
UP.

arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
probably an optimization since both the mtrr code and __flush_tlb() write
cr4. It would probably be safe to make that a flush_tlb_all() (and then
get these statistics), but the mtrr code is ancient and I'm hesitant to
touch it other than to just stick in the counters.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# e8747f10 04-Jun-2013 Michael Wang <wangyun@linux.vnet.ibm.com>

x86, cleanups: Remove extra tab in __flush_tlb_one()

Remove the extra tab in __flush_tlb_one().

CC: Alex Shi <alex.shi@intel.com>
CC: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/51AD8902.60603@linux.vnet.ibm.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 086fc8f8 21-Dec-2012 Fenghua Yu <fenghua.yu@intel.com>

x86/tlbflush.h: Define __native_flush_tlb_global_irq_disabled()

This function is called in __native_flush_tlb_global() and after
apply_microcode_early().

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1356075872-3054-8-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 094ab1db 28-Nov-2012 H. Peter Anvin <hpa@linux.intel.com>

x86, 386 removal: Remove CONFIG_INVLPG

All 486+ CPUs support INVLPG, so remove the fallback 386 support
code.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1354132230-21854-6-git-send-email-hpa@linux.intel.com


# 7efa1c87 19-Jul-2012 Alex Shi <alex.shi@linux.alibaba.com>

x86/tlb: Fix build warning and crash when building for !SMP

The incompatible parameter of flush_tlb_mm_range cause build warning.
Fix it by correct parameter.

Ingo Molnar found that this could also cause a user space crash.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1342747103-19765-1-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# effee4b9 27-Jun-2012 Alex Shi <alex.shi@linux.alibaba.com>

x86/tlb: do flush_tlb_kernel_range by 'invlpg'

This patch do flush_tlb_kernel_range by 'invlpg'. The performance pay
and gain was analyzed in previous patch
(x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).

In the testing: http://lkml.org/lkml/2012/6/21/10

The pay is mostly covered by long kernel path, but the gain is still
quite clear, memory access in user APP can increase 30+% when kernel
execute this funtion.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 611ae8e3 27-Jun-2012 Alex Shi <alex.shi@linux.alibaba.com>

x86/tlb: enable tlb flush range support for x86

Not every tlb_flush execution moment is really need to evacuate all
TLB entries, like in munmap, just few 'invlpg' is better for whole
process performance, since it leaves most of TLB entries for later
accessing.

This patch also rewrite flush_tlb_range for 2 purposes:
1, split it out to get flush_blt_mm_range function.
2, clean up to reduce line breaking, thanks for Borislav's input.

My micro benchmark 'mummap' http://lkml.org/lkml/2012/5/17/59
show that the random memory access on other CPU has 0~50% speed up
on a 2P * 4cores * HT NHM EP while do 'munmap'.

Thanks Yongjie's testing on this patch:
-------------
I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
kernel.
After running two benchmarks in Xen HVM guest, I found his patches
brought about 1%~3% performance gain in 'kernel build' and 'netperf'
testing, though the performance gain was not very stable in 'kernel
build' testing.

Some detailed testing results are below.

Testing Environment:
Hardware: Romley-EP platform
Xen version: latest upstream
Linux kernel: 3.4-RC6
Guest vCPU number: 8
NIC: Intel 82599 (10GB bandwidth)

In 'kernel build' testing in guest:
Command line | performance gain
make -j 4 | 3.81%
make -j 8 | 0.37%
make -j 16 | -0.52%

In 'netperf' testing, we tested TCP_STREAM with default socket size
16384 byte as large packet and 64 byte as small packet.
I used several clients to add networking pressure, then 'netperf' server
automatically generated several threads to response them.
I also used large-size packet and small-size packet in the testing.
Packet size | Thread number | performance gain
16384 bytes | 4 | 0.02%
16384 bytes | 8 | 2.21%
16384 bytes | 16 | 2.04%
64 bytes | 4 | 1.07%
64 bytes | 8 | 3.31%
64 bytes | 16 | 0.71%

Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
Tested-by: Ren, Yongjie <yongjie.ren@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# e7b52ffd 27-Jun-2012 Alex Shi <alex.shi@linux.alibaba.com>

x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range

x86 has no flush_tlb_range support in instruction level. Currently the
flush_tlb_range just implemented by flushing all page table. That is not
the best solution for all scenarios. In fact, if we just use 'invlpg' to
flush few lines from TLB, we can get the performance gain from later
remain TLB lines accessing.

But the 'invlpg' instruction costs much of time. Its execution time can
compete with cr3 rewriting, and even a bit more on SNB CPU.

So, on a 512 4KB TLB entries CPU, the balance points is at:
(512 - X) * 100ns(assumed TLB refill cost) =
X(TLB flush entries) * 100ns(assumed invlpg cost)

Here, X is 256, that is 1/2 of 512 entries.

But with the mysterious CPU pre-fetcher and page miss handler Unit, the
assumed TLB refill cost is far lower then 100ns in sequential access. And
2 HT siblings in one core makes the memory access more faster if they are
accessing the same memory. So, in the patch, I just do the change when
the target entries is less than 1/16 of whole active tlb entries.
Actually, I have no data support for the percentage '1/16', so any
suggestions are welcomed.

As to hugetlb, guess due to smaller page table, and smaller active TLB
entries, I didn't see benefit via my benchmark, so no optimizing now.

My micro benchmark show in ideal scenarios, the performance improves 70
percent in reading. And in worst scenario, the reading/writing
performance is similar with unpatched 3.4-rc4 kernel.

Here is the reading data on my 2P * 4cores *HT NHM EP machine, with THP
'always':

multi thread testing, '-t' paramter is thread number:
with patch unpatched 3.4-rc4
./mprotect -t 1 14ns 24ns
./mprotect -t 2 13ns 22ns
./mprotect -t 4 12ns 19ns
./mprotect -t 8 14ns 16ns
./mprotect -t 16 28ns 26ns
./mprotect -t 32 54ns 51ns
./mprotect -t 128 200ns 199ns

Single process with sequencial flushing and memory accessing:

with patch unpatched 3.4-rc4
./mprotect 7ns 11ns
./mprotect -p 4096 -l 8 -n 10240
21ns 21ns

[ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
has additional performance numbers. ]

Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# 3e7f3db0 10-May-2012 Alex Shi <alex.shi@linux.alibaba.com>

x86/tlb: Clean up and unify TLB_FLUSH_ALL definition

Since sizeof(long) is 4 in x86_32 mode, and it's 8 in x86_64
mode, sizeof(long long) is also 8 byte in x86_64 mode.
use long mode can fit TLB_FLUSH_ALL defination here both in 32
or 64 bits mode.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/n/tip-evv5bekiipi2pmyzdsy8lkkw@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# c6ae41e7 11-May-2012 Alex Shi <alex.shi@linux.alibaba.com>

x86: replace percpu_xxx funcs with this_cpu_xxx

Since percpu_xxx() serial functions are duplicated with this_cpu_xxx().
Removing percpu_xxx() definition and replacing them by this_cpu_xxx()
in code. There is no function change in this patch, just preparation for
later percpu_xxx serial function removing.

On x86 machine the this_cpu_xxx() serial functions are same as
__this_cpu_xxx() without no unnecessary premmpt enable/disable.

Thanks for Stephen Rothwell, he found and fixed a i386 build error in
the patch.

Also thanks for Andrew Morton, he kept updating the patchset in Linus'
tree.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>


# f05e798a 28-Mar-2012 David Howells <dhowells@redhat.com>

Disintegrate asm/system.h for X86

Disintegrate asm/system.h for X86.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
cc: x86@kernel.org


# b40827fa 28-Aug-2010 Borislav Petkov <bp@alien8.de>

x86-32, mm: Add an initial page table for core bootstrapping

This patch adds an initial page table with low mappings used exclusively
for booting APs/resuming after ACPI suspend/machine restart. After this,
there's no need to add low mappings to swapper_pg_dir and zap them later
or create own swsusp PGD page solely for ACPI sleep needs - we have
initial_page_table for that.

Signed-off-by: Borislav Petkov <bp@alien8.de>
LKML-Reference: <20101020070526.GA9588@liondog.tnic>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>


# 55cd6367 12-Jun-2009 Yinghai Lu <yinghai@kernel.org>

x86: make zap_low_mapping could be used early

Only one cpu is there, just call __flush_tlb for it. Fixes the following boot
warning on x86:

[ 0.000000] Memory: 885032k/915540k available (5993k kernel code, 29844k reserved, 3842k data, 428k init, 0k highmem)
[ 0.000000] virtual kernel memory layout:
[ 0.000000] fixmap : 0xffe17000 - 0xfffff000 (1952 kB)
[ 0.000000] vmalloc : 0xf8615000 - 0xffe15000 ( 120 MB)
[ 0.000000] lowmem : 0xc0000000 - 0xf7e15000 ( 894 MB)
[ 0.000000] .init : 0xc19a5000 - 0xc1a10000 ( 428 kB)
[ 0.000000] .data : 0xc15da4bb - 0xc199af6c (3842 kB)
[ 0.000000] .text : 0xc1000000 - 0xc15da4bb (5993 kB)
[ 0.000000] Checking if this processor honours the WP bit even in supervisor mode...Ok.
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x50/0x1b0()
[ 0.000000] Hardware name: System Product Name
[ 0.000000] Modules linked in:
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip #52504
[ 0.000000] Call Trace:
[ 0.000000] [<c104aa16>] warn_slowpath_common+0x65/0x95
[ 0.000000] [<c104aa58>] warn_slowpath_null+0x12/0x15
[ 0.000000] [<c1073bbe>] smp_call_function_many+0x50/0x1b0
[ 0.000000] [<c1037615>] ? do_flush_tlb_all+0x0/0x41
[ 0.000000] [<c1037615>] ? do_flush_tlb_all+0x0/0x41
[ 0.000000] [<c1073d4f>] smp_call_function+0x31/0x58
[ 0.000000] [<c1037615>] ? do_flush_tlb_all+0x0/0x41
[ 0.000000] [<c104f635>] on_each_cpu+0x26/0x65
[ 0.000000] [<c10374b5>] flush_tlb_all+0x19/0x1b
[ 0.000000] [<c1032ab3>] zap_low_mappings+0x4d/0x56
[ 0.000000] [<c15d64b5>] ? printk+0x14/0x17
[ 0.000000] [<c19b42a8>] mem_init+0x23d/0x245
[ 0.000000] [<c19a56a1>] start_kernel+0x17a/0x2d5
[ 0.000000] [<c19a5347>] ? unknown_bootoption+0x0/0x19a
[ 0.000000] [<c19a5039>] __init_begin+0x39/0x41
[ 0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>


# d7285c6b 23-Apr-2009 Chris Wright <chrisw@sous-sol.org>

x86: use native register access for native tlb flushing

currently these are paravirtulaized, doesn't appear any callers rely on
this (no pv_ops backends are using native_tlb and overriding cr3/4
access).

[ Impact: fix lockdep warning with paravirt and function tracer ]

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
LKML-Reference: <20090423172138.GR3036@sequoia.sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


# 9b8de747 21-Apr-2009 David Howells <dhowells@redhat.com>

FRV: Fix the section attribute on UP DECLARE_PER_CPU()

In non-SMP mode, the variable section attribute specified by DECLARE_PER_CPU()
does not agree with that specified by DEFINE_PER_CPU(). This means that
architectures that have a small data section references relative to a base
register may throw up linkage errors due to too great a displacement between
where the base register points and the per-CPU variable.

On FRV, the .h declaration says that the variable is in the .sdata section, but
the .c definition says it's actually in the .data section. The linker throws
up the following errors:

kernel/built-in.o: In function `release_task':
kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o
kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o

To fix this, DECLARE_PER_CPU() should simply apply the same section attribute
as does DEFINE_PER_CPU(). However, this is made slightly more complex by
virtue of the fact that there are several variants on DEFINE, so these need to
be matched by variants on DECLARE.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9eb912d1 18-Jan-2009 Brian Gerst <brgerst@gmail.com>

x86-64: Move TLB state from PDA to per-cpu and consolidate with 32-bit.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>


# 4595f962 10-Jan-2009 Rusty Russell <rusty@rustcorp.com.au>

x86: change flush_tlb_others to take a const struct cpumask

Impact: reduce stack usage, use new cpumask API.

This is made a little more tricky by uv_flush_tlb_others which
actually alters its argument, for an IPI to be sent to the remaining
cpus in the mask.

I solve this by allocating a cpumask_var_t for this case and falling back
to IPI should this fail.

To eliminate temporaries in the caller, all flush_tlb_others implementations
now do the this-cpu-elimination step themselves.

Note also the curious "cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask)"
which has been there since pre-git and yet f->flush_cpumask is always zero
at this point.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>


# dacf7333 07-Jan-2009 Jaswinder Singh Rajput <jaswinder@infradead.org>

x86: smp.h move zap_low_mappings declartion to tlbflush.h

Impact: cleanup, moving NON-SMP stuff from smp.h

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# 1965aae3 22-Oct-2008 H. Peter Anvin <hpa@zytor.com>

x86: Fix ASM_X86__ header guards

Change header guards named "ASM_X86__*" to "_ASM_X86_*" since:

a. the double underscore is ugly and pointless.
b. no leading underscore violates namespace constraints.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>


# bb898558 17-Aug-2008 Al Viro <viro@zeniv.linux.org.uk>

x86, um: ... and asm-x86 move

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>