272461 |
03-Oct-2014 |
gjb |
Copy stable/10@r272459 to releng/10.1 as part of the 10.1-RELEASE process.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
269592 |
05-Aug-2014 |
marius |
MFC: r260457
The changes in r233781 attempted to make logging during a machine check exception more readable. In practice they prevented all logging during a machine check exception on at least some systems. Specifically, when an uncorrected ECC error is detected in a DIMM on a Nehalem/Westmere class machine, all CPUs receive a machine check exception, but only CPUs on the same package as the memory controller for the erroring DIMM log an error. The CPUs on the other package would complete the scan of their machine check banks and panic before the first set of CPUs could log an error. The end result was a clearer display during the panic (no interleaved messages), but a crashdump without any useful info about the error that occurred.
To handle this case, make all CPUs spin in the machine check handler once they have completed their scan of their machine check banks until at least one machine check error is logged. I tried using a DELAY() instead so that the CPUs would not potentially hang forever, but that was not reliable in testing.
While here, don't clear MCIP from MSR_MCG_STATUS before invoking panic. Only clear it if the machine check handler does not panic and returns to the interrupted thread.
MFC: r263113
Correct type for malloc().
Submitted by: "Conrad Meyer" <conrad.meyer@isilon.com>
MFC: r269052, r269239, r269242
Intel desktop Haswell CPUs may report benign corrected parity errors (see HSD131 erratum in [1]) at a considerable rate. So filter these (default), unless logging is enabled. Unfortunately, there really is no better way to reasonably implement suppressing these errors than to just skipping them in mca_log(). Given that they are reported for bank 0, they'd need to be masked in MSR_MC0_CTL. However, P6 family processors require that register to be set to either all 0s or all 1s, disabling way more than the one error in question when using all 0s there. Alternatively, it could be masked for the corresponding CMCI, but that still wouldn't keep the periodic scanner from detecting these spurious errors. Apart from that, register contents of MSR_MC0_CTL{,2} don't seem to be publicly documented, neither in the Intel Architectures Developer's Manual nor in the Haswell datasheets.
Note that while HSD131 actually is only about C0-stepping as of revision 014 of the Intel desktop 4th generation processor family specification update, these corrected errors also have been observed with D0-stepping aka "Haswell Refresh".
1: http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf
Reviewed by: jhb Sponsored by: Bally Wulff Games & Entertainment GmbH
|
268076 |
01-Jul-2014 |
scottl |
Merge r266746, 266775:
Now that there are separate back-end implementations of busdma, the bounce implementation shouldn't steal flags from the common front-end. Move those flags to the back-end.
Eliminate the fake contig_dmamap and replace it with a new flag, BUS_DMA_KMEM_ALLOC. They serve the same purpose, but using the flag means that the map can be NULL again, which in turn enables significant optimizations for the common case of no bouncing.
Obtained from: Netflix, Inc.
|
262192 |
18-Feb-2014 |
jhb |
MFC 261517,261520: Convert the license on files where I am the sole copyright holder to 2 clause BSD licenses.
|
262141 |
18-Feb-2014 |
jhb |
MFC 259140: Move constants for indices in the local APIC's local vector table from apicvar.h to apicreg.h.
|
259511 |
17-Dec-2013 |
kib |
MFC r257230: Add a virtual table for the busdma methods on x86, to allow different busdma implementations to coexist.
|
259510 |
17-Dec-2013 |
kib |
MFC r257228: Add bus_dmamap_load_ma() function to load map with the array of vm_pages.
|
257492 |
01-Nov-2013 |
kib |
MFC r257069: Add ddb 'show ioapic' and 'show all ioapics' commands.
Approved by: re (glebius)
|
256281 |
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
255726 |
20-Sep-2013 |
gibbs |
Add support for suspend/resume/migration operations when running as a Xen PVHVM guest.
Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D Reviewed by: gibbs Approved by: re (blanket Xen) MFC after: 2 weeks
sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: - Make sure that are no MMU related IPIs pending on migration. - Reset pending IPI_BITMAP on resume. - Init vcpu_info on resume.
sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: sys/x86/acpica/acpi_wakeup.c: sys/x86/x86/intr_machdep.c: sys/x86/isa/atpic.c: sys/x86/x86/io_apic.c: sys/x86/x86/local_apic.c: - Add a "suspend_cancelled" parameter to pic_resume(). For the Xen PIC, restoration of interrupt services differs between the aborted suspend and normal resume cases, so we must provide this information.
sys/dev/acpica/acpi_timer.c: sys/dev/xen/timer/timer.c: sys/timetc.h: - Don't swap out "suspend safe" timers across a suspend/resume cycle. This includes the Xen PV and ACPI timers.
sys/dev/xen/control/control.c: - Perform proper suspend/resume process for PVHVM: - Suspend all APs before going into suspension, this allows us to reset the vcpu_info on resume for each AP. - Reset shared info page and callback on resume.
sys/dev/xen/timer/timer.c: - Implement suspend/resume support for the PV timer. Since FreeBSD doesn't perform a per-cpu resume of the timer, we need to call smp_rendezvous in order to correctly resume the timer on each CPU.
sys/dev/xen/xenpci/xenpci.c: - Don't reset the PCI interrupt on each suspend/resume.
sys/kern/subr_smp.c: - When suspending a PVHVM domain make sure there are no MMU IPIs in-flight, or we will get a lockup on resume due to the fact that pending event channels are not carried over on migration. - Implement a generic version of restart_cpus that can be used by suspended and stopped cpus.
sys/x86/xen/hvm.c: - Implement resume support for the hypercall page and shared info. - Clear vcpu_info so it can be reset by APs when resuming from suspension.
sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/x86/xen/xen_intr.c: - Support UP kernel configurations.
sys/x86/xen/xen_intr.c: - Properly rebind per-cpus VIRQs and IPIs on resume.
|
255040 |
29-Aug-2013 |
gibbs |
Implement vector callback for PVHVM and unify event channel implementations
Re-structure Xen HVM support so that: - Xen is detected and hypercalls can be performed very early in system startup. - Xen interrupt services are implemented using FreeBSD's native interrupt delivery infrastructure. - the Xen interrupt service implementation is shared between PV and HVM guests. - Xen interrupt handlers can optionally use a filter handler in order to avoid the overhead of dispatch to an interrupt thread. - interrupt load can be distributed among all available CPUs. - the overhead of accessing the emulated local and I/O apics on HVM is removed for event channel port events. - a similar optimization can eventually, and fairly easily, be used to optimize MSI.
Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure, and misc Xen cleanups:
Sponsored by: Spectra Logic Corporation
Unification of PV & HVM interrupt infrastructure, bug fixes, and misc Xen cleanups:
Submitted by: Roger Pau Monné Sponsored by: Citrix Systems R&D
sys/x86/x86/local_apic.c: sys/amd64/include/apicvar.h: sys/i386/include/apicvar.h: sys/amd64/amd64/apic_vector.S: sys/i386/i386/apic_vector.s: sys/amd64/amd64/machdep.c: sys/i386/i386/machdep.c: sys/i386/xen/exception.s: sys/x86/include/segments.h: Reserve IDT vector 0x93 for the Xen event channel upcall interrupt handler. On Hypervisors that support the direct vector callback feature, we can request that this vector be called directly by an injected HVM interrupt event, instead of a simulated PCI interrupt on the Xen platform PCI device. This avoids all of the overhead of dealing with the emulated I/O APIC and local APIC. It also means that the Hypervisor can inject these events on any CPU, allowing upcalls for different ports to be handled in parallel.
sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: Map Xen per-vcpu area during AP startup.
sys/amd64/include/intr_machdep.h: sys/i386/include/intr_machdep.h: Increase the FreeBSD IRQ vector table to include space for event channel interrupt sources.
sys/amd64/include/pcpu.h: sys/i386/include/pcpu.h: Remove Xen HVM per-cpu variable data. These fields are now allocated via the dynamic per-cpu scheme. See xen_intr.c for details.
sys/amd64/include/xen/hypercall.h: sys/dev/xen/blkback/blkback.c: sys/i386/include/xen/xenvar.h: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/xen/gnttab.c: Prefer FreeBSD primatives to Linux ones in Xen support code.
sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: sys/dev/xen/balloon/balloon.c: sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/console/xencons_ring.c: sys/dev/xen/control/control.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/dev/xen/xenpci/xenpci.c: sys/i386/i386/machdep.c: sys/i386/include/pmap.h: sys/i386/include/xen/xenfunc.h: sys/i386/isa/npx.c: sys/i386/xen/clock.c: sys/i386/xen/mp_machdep.c: sys/i386/xen/mptable.c: sys/i386/xen/xen_clock_util.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/xen_rtc.c: sys/xen/evtchn/evtchn_dev.c: sys/xen/features.c: sys/xen/gnttab.c: sys/xen/gnttab.h: sys/xen/hvm.h: sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbus_if.m: sys/xen/xenbus/xenbusb_front.c: sys/xen/xenbus/xenbusvar.h: sys/xen/xenstore/xenstore.c: sys/xen/xenstore/xenstore_dev.c: sys/xen/xenstore/xenstorevar.h: Pull common Xen OS support functions/settings into xen/xen-os.h.
sys/amd64/include/xen/xen-os.h: sys/i386/include/xen/xen-os.h: sys/xen/xen-os.h: Remove constants, macros, and functions unused in FreeBSD's Xen support.
sys/xen/xen-os.h: sys/i386/xen/xen_machdep.c: sys/x86/xen/hvm.c: Introduce new functions xen_domain(), xen_pv_domain(), and xen_hvm_domain(). These are used in favor of #ifdefs so that FreeBSD can dynamically detect and adapt to the presence of a hypervisor. The goal is to have an HVM optimized GENERIC, but more is necessary before this is possible.
sys/amd64/amd64/machdep.c: sys/dev/xen/xenpci/xenpcivar.h: sys/dev/xen/xenpci/xenpci.c: sys/x86/xen/hvm.c: sys/sys/kernel.h: Refactor magic ioport, Hypercall table and Hypervisor shared information page setup, and move it to a dedicated HVM support module.
HVM mode initialization is now triggered during the SI_SUB_HYPERVISOR phase of system startup. This currently occurs just after the kernel VM is fully setup which is just enough infrastructure to allow the hypercall table and shared info page to be properly mapped.
sys/xen/hvm.h: sys/x86/xen/hvm.c: Add definitions and a method for configuring Hypervisor event delievery via a direct vector callback.
sys/amd64/include/xen/xen-os.h: sys/x86/xen/hvm.c:
sys/conf/files: sys/conf/files.amd64: sys/conf/files.i386: Adjust kernel build to reflect the refactoring of early Xen startup code and Xen interrupt services.
sys/dev/xen/blkback/blkback.c: sys/dev/xen/blkfront/blkfront.c: sys/dev/xen/blkfront/block.h: sys/dev/xen/control/control.c: sys/dev/xen/evtchn/evtchn_dev.c: sys/dev/xen/netback/netback.c: sys/dev/xen/netfront/netfront.c: sys/xen/xenstore/xenstore.c: sys/xen/evtchn/evtchn_dev.c: sys/dev/xen/console/console.c: sys/dev/xen/console/xencons_ring.c Adjust drivers to use new xen_intr_*() API.
sys/dev/xen/blkback/blkback.c: Since blkback defers all event handling to a taskqueue, convert this task queue to a "fast" taskqueue, and schedule it via an interrupt filter. This avoids an unnecessary ithread context switch.
sys/xen/xenstore/xenstore.c: The xenstore driver is MPSAFE. Indicate as much when registering its interrupt handler.
sys/xen/xenbus/xenbus.c: sys/xen/xenbus/xenbusvar.h: Remove unused event channel APIs.
sys/xen/evtchn.h: Remove all kernel Xen interrupt service API definitions from this file. It is now only used for structure and ioctl definitions related to the event channel userland device driver.
Update the definitions in this file to match those from NetBSD. Implementing this interface will be necessary for Dom0 support.
sys/xen/evtchn/evtchnvar.h: Add a header file for implemenation internal APIs related to managing event channels event delivery. This is used to allow, for example, the event channel userland device driver to access low-level routines that typical kernel consumers of event channel services should never access.
sys/xen/interface/event_channel.h: sys/xen/xen_intr.h: Standardize on the evtchn_port_t type for referring to an event channel port id. In order to prevent low-level event channel APIs from leaking to kernel consumers who should not have access to this data, the type is defined twice: Once in the Xen provided event_channel.h, and again in xen/xen_intr.h. The double declaration is protected by __XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared twice within a given compilation unit.
sys/xen/xen_intr.h: sys/xen/evtchn/evtchn.c: sys/x86/xen/xen_intr.c: sys/dev/xen/xenpci/evtchn.c: sys/dev/xen/xenpci/xenpcivar.h: New implementation of Xen interrupt services. This is similar in many respects to the i386 PV implementation with the exception that events for bound to event channel ports (i.e. not IPI, virtual IRQ, or physical IRQ) are further optimized to avoid mask/unmask operations that aren't necessary for these edge triggered events.
Stubs exist for supporting physical IRQ binding, but will need additional work before this implementation can be fully shared between PV and HVM.
sys/amd64/amd64/mp_machdep.c: sys/i386/i386/mp_machdep.c: sys/i386/xen/mp_machdep.c sys/x86/xen/hvm.c: Add support for placing vcpu_info into an arbritary memory page instead of using HYPERVISOR_shared_info->vcpu_info. This allows the creation of domains with more than 32 vcpus.
sys/i386/i386/machdep.c: sys/i386/xen/clock.c: sys/i386/xen/xen_machdep.c: sys/i386/xen/exception.s: Add support for new event channle implementation.
|
254025 |
07-Aug-2013 |
jeff |
Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation.
- Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem.
Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division
|
251900 |
18-Jun-2013 |
rpaulo |
Fix a KTR_BUSDMA format string.
|
250840 |
21-May-2013 |
marcel |
Add basic support for FDT to i386 & amd64. This change includes: 1. Common headers for fdt.h and ofw_machdep.h under x86/include with indirections under i386/include and amd64/include. 2. New modinfo for loader provided FDT blob. 3. Common x86_init_fdt() called from hammer_time() on amd64 and init386() on i386. 4. Split-off FDT specific low-level console functions from FDT bus methods for the uart(4) driver. The low-level console logic has been moved to uart_cpu_fdt.c and is used for arm, mips & powerpc only. The FDT bus methods are shared across all architectures. 5. Add dev/fdt/fdt_x86.c to hold the fdt_fixup_table[] and the fdt_pic_table[] arrays. Both are empty right now.
FDT addresses are I/O ports on x86. Since the core FDT code does not handle different address spaces, adding support for both I/O ports and memory addresses requires some thought and discussion. It may be better to use a compile-time option that controls this.
Obtained from: Juniper Networks, Inc.
|
250576 |
12-May-2013 |
eadler |
Fix several typos
PR: kern/176054 Submitted by: Christoph Mallon <christoph.mallon@gmx.de> MFC after: 3 days
|
249625 |
18-Apr-2013 |
mav |
Introduce kern.timecounter.smp_tsc_adjust tunable (disabled by default) and respective functionality, allowing to synchronize TSC on APs to match BSP's during boot. It may be unsafe in general case due to theoretical chance of later drift if CPUs are using different clock rate or source, but it allows to use TSC in some cases when difference caused by some initialization bug, while TSCs are known to increment synchronously.
Reviewed by: jimharris, kib MFC after: 1 month
|
249324 |
10-Apr-2013 |
neel |
Unsynchronized TSCs on the host require special handling in bhyve:
- use clock_gettime(2) as the time base for the emulated ACPI timer instead of directly using rdtsc().
- don't advertise the invariant TSC capability to the guest to discourage it from using the TSC as its time base.
Discussed with: jhb@ (about making 'smp_tsc' a global) Reported by: Dan Mack on freebsd-virtualization@ Obtained from: NetApp
|
248968 |
01-Apr-2013 |
kib |
Record the correct error in the trace.
Sponsored by: The FreeBSD Foundation MFC after: 3 days
|
247463 |
28-Feb-2013 |
mav |
MFcalloutng: Switch eventtimers(9) from using struct bintime to sbintime_t. Even before this not a single driver really supported full dynamic range of struct bintime even in theory, not speaking about practical inexpediency. This change legitimates the status quo and cleans up the code.
|
246713 |
12-Feb-2013 |
kib |
Reform the busdma API so that new types may be added without modifying every architecture's busdma_machdep.c. It is done by unifying the bus_dmamap_load_buffer() routines so that they may be called from MI code. The MD busdma is then given a chance to do any final processing in the complete() callback.
The cam changes unify the bus_dmamap_load* handling in cam drivers.
The arm and mips implementations are updated to track virtual addresses for sync(). Previously this was done in a type specific way. Now it is done in a generic way by recording the list of virtuals in the map.
Submitted by: jeff (sponsored by EMC/Isilon) Reviewed by: kan (previous version), scottl, mjacob (isp(4), no objections for target mode changes) Discussed with: ian (arm changes) Tested by: marius (sparc64), mips (jmallet), isci(4) on x86 (jharris), amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
|
246247 |
02-Feb-2013 |
avg |
x86 suspend/resume: suspend pics and pseudo-pics in reverse order
- change 'pics' from STAILQ to TAILQ - ensure that Local APIC is always first in 'pics'
Reviewed by: jhb Tested by: Sergey V. Dyatko <sergey.dyatko@gmail.com>, KAHO Toshikazu <kaho@elam.kais.kyoto-u.ac.jp> MFC after: 12 days
|
246212 |
01-Feb-2013 |
kib |
The change to reduce default smp_tsc_shift caused tsc shift to become zero on slower machines, which make the fenced get_timecount methods not used despite needed. Remove the (shift > 0) condition when selecting the get_timecount() implementation.
Rename smp_tsc_shift to tsc_shift, and apply it for the UP case too. Allow shift to reach value of 31 instead of 30, as it was previously (should be nop).
Reorganize the tc quality calculation to remove the conditionally compiled block. Rename test_smp_tsc() to test_tsc() and provide separate versions for SMP and UP builds. The check for virtialized hardware is more natural to perform in the smp version of the test_tsc(), since it is only done for smp case.
Noted and reviewed by: bde (previous version) MFC after: 12 days
|
246116 |
30-Jan-2013 |
kib |
Reduce default shift used to calculate the max frequency for the TSC timecounter to 1, and correspondingly increase the precision of the gettimeofday(2) and related functions in the default configuration.
The motivation for the TSC-low timecounter, as described in the r222866, seems to provide a workaround for the non-serializing behaviour of the RDTSC on some Intel hardware. Tests demonstrate that even with the pre-shift of 8, the cross-core non-monotonicity of the RDTSC is still observed reliably, e.g. on the Nehalems. The r238755 and r238973 implemented the proper fix for the issue.
The pre-shift of 1 is applied to keep TSC not overflowing for the frequency of hardclock down to 2 sec/intr. The pre-shift is made a tunable to allow the easy debugging of the issues users could see with the shift being too low.
Reviewed by: bde MFC after: 2 weeks
|
245577 |
17-Jan-2013 |
jhb |
Don't attempt to use clflush on the local APIC register window. Various CPUs exhibit bad behavior if this is done (Intel Errata AAJ3, hangs on Pentium-M, and trashing of the local APIC registers on a VIA C7). The local APIC is implicitly mapped UC already via MTRRs, so the clflush isn't necessary anyway.
MFC after: 2 weeks
|
243764 |
01-Dec-2012 |
avg |
ioapic_program_intpin: program high bits before low bits
Programming the low bits has a side-effect if unmasking the pin if it is not disabled. So if an interrupt was pending then it would be delivered with the correct new vector but to the incorrect old LAPIC.
This fix could be made clearer by preserving the mask bit while programming the low bits and then explicitly resetting the mask bit after all the programming is done.
Probability to trip over the fixed bug could be increased by bootverbose because printing of the interrupt information in ioapic_assign_cpu lengthened the time window during which an interrupt could arrive while a pin is masked.
Reported by: Andreas Longwitz <longwitz@incore.de> Tested by: Andreas Longwitz <longwitz@incore.de> MFC after: 12 days
|
241371 |
09-Oct-2012 |
attilio |
Reverts r234074,234105,234564,234723,234989,235231-235232 and part of r234247. Use, instead, the static intializer introduced in r239923 for x86 and sparc64 intr_cpus, unwinding the code to the initial version.
Reviewed by: marius
|
239354 |
17-Aug-2012 |
jhb |
Allow static DMA allocations that allow for enough segments to do page-sized segments for the entire allocation to use kmem_alloc_attr() to allocate KVM rather than using kmem_alloc_contig(). This avoids requiring a single physically contiguous chunk in this case.
Submitted by: Peter Jeremy (original version) MFC after: 1 month
|
239133 |
07-Aug-2012 |
jimharris |
During TSC synchronization test, use rdtsc() rather than rdtsc32(), to protect against 32-bit TSC overflow while the sync test is running.
On dual-socket Xeon E5-2600 (SNB) systems with up to 32 threads, there is non-trivial chance (2-3%) that TSC synchronization test fails due to 32-bit TSC overflow while the synchronization test is running.
Sponsored by: Intel Reviewed by: jkim Discussed with: jkim, kib
|
239020 |
03-Aug-2012 |
jhb |
Correct function name in comment.
Submitted by: alc
|
239013 |
03-Aug-2012 |
mav |
Microoptimize LAPIC timer routines to avoid reading from hardware during programming using earlier cached values. This makes respective routines to disappear from PMC top and reduces total number of active CPU cycles on idle 24-core system by 10%.
|
239008 |
03-Aug-2012 |
jhb |
Improve the handling of static DMA buffers that use non-default memory attributes (currently just BUS_DMA_NOCACHE): - Don't call pmap_change_attr() on the returned address, instead use kmem_alloc_contig() to ask the VM system for memory with the requested attribute. - As a result, always use kmem_alloc_contig() for non-default memory attributes, even for sub-page allocations. This requires adjusting bus_dmamem_free()'s logic for determining which free routine to use. - For x86, add a new dummy bus_dmamap that is used for static DMA buffers allocated via kmem_alloc_contig(). bus_dmamem_free() can then use the map pointer to determine which free routine to use. - For powerpc, add a new flag to the allocated map (bus_dmamem_alloc() always creates a real map on powerpc) to indicate which free routine should be used.
Note that the BUS_DMA_NOCACHE handling in powerpc is currently #ifdef'd out. I have left it disabled but updated it to match x86.
Reviewed by: scottl MFC after: 1 month
|
238975 |
01-Aug-2012 |
kib |
Do a trivial reformatting of the comment, to record the proper commit message for r238973:
Rdtsc instruction is not synchronized, it seems on some Intel cores it can bypass even the locked instructions. As a result, rdtsc executed on different cores may return unordered TSC values even when the rdtsc appearance in the instruction sequences is provably ordered.
Similarly to what has been done in r238755 for TSC synchronization test, add explicit fences right before rdtsc in the timecounters 'get' functions. Intel recommends to use LFENCE, while AMD refers to MFENCE. For VIA follow what Linux does and use LFENCE. With this change, I see no reordered reads of TSC on Nehalem.
Change the rmb() to inlined CPUID in the SMP TSC synchronization test. On i386, locked instruction is used for rmb(), and as noted earlier, it is not enough. Since i386 machine may not support SSE2, do simplest possible synchronization with CPUID.
MFC after: 1 week Discussed with: avg, bde, jkim
|
238973 |
01-Aug-2012 |
kib |
diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c index c253a96..3d8bd30 100644 --- a/sys/x86/x86/tsc.c +++ b/sys/x86/x86/tsc.c @@ -82,7 +82,11 @@ static void tsc_freq_changed(void *arg, const struct cf_level *level, static void tsc_freq_changing(void *arg, const struct cf_level *level, int *status); static unsigned tsc_get_timecount(struct timecounter *tc); -static unsigned tsc_get_timecount_low(struct timecounter *tc); +static inline unsigned tsc_get_timecount_low(struct timecounter *tc); +static unsigned tsc_get_timecount_lfence(struct timecounter *tc); +static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc); +static unsigned tsc_get_timecount_mfence(struct timecounter *tc); +static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc); static void tsc_levels_changed(void *arg, int unit);
static struct timecounter tsc_timecounter = { @@ -262,6 +266,10 @@ probe_tsc_freq(void) (vm_guest == VM_GUEST_NO && CPUID_TO_FAMILY(cpu_id) >= 0x10)) tsc_is_invariant = 1; + if (cpu_feature & CPUID_SSE2) { + tsc_timecounter.tc_get_timecount = + tsc_get_timecount_mfence; + } break; case CPU_VENDOR_INTEL: if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 || @@ -271,6 +279,10 @@ probe_tsc_freq(void) (CPUID_TO_FAMILY(cpu_id) == 0xf && CPUID_TO_MODEL(cpu_id) >= 0x3)))) tsc_is_invariant = 1; + if (cpu_feature & CPUID_SSE2) { + tsc_timecounter.tc_get_timecount = + tsc_get_timecount_lfence; + } break; case CPU_VENDOR_CENTAUR: if (vm_guest == VM_GUEST_NO && @@ -278,6 +290,10 @@ probe_tsc_freq(void) CPUID_TO_MODEL(cpu_id) >= 0xf && (rdmsr(0x1203) & 0x100000000ULL) == 0) tsc_is_invariant = 1; + if (cpu_feature & CPUID_SSE2) { + tsc_timecounter.tc_get_timecount = + tsc_get_timecount_lfence; + } break; }
@@ -328,16 +344,31 @@ init_TSC(void)
#ifdef SMP
-/* rmb is required here because rdtsc is not a serializing instruction. */ -#define TSC_READ(x) \ -static void \ -tsc_read_##x(void *arg) \ -{ \ - uint32_t *tsc = arg; \ - u_int cpu = PCPU_GET(cpuid); \ - \ - rmb(); \ - tsc[cpu * 3 + x] = rdtsc32(); \ +/* + * RDTSC is not a serializing instruction, and does not drain + * instruction stream, so we need to drain the stream before executing + * it. It could be fixed by use of RDTSCP, except the instruction is + * not available everywhere. + * + * Use CPUID for draining in the boot-time SMP constistency test. The + * timecounters use MFENCE for AMD CPUs, and LFENCE for others (Intel + * and VIA) when SSE2 is present, and nothing on older machines which + * also do not issue RDTSC prematurely. There, testing for SSE2 and + * vendor is too cumbersome, and we learn about TSC presence from + * CPUID. + * + * Do not use do_cpuid(), since we do not need CPUID results, which + * have to be written into memory with do_cpuid(). + */ +#define TSC_READ(x) \ +static void \ +tsc_read_##x(void *arg) \ +{ \ + uint32_t *tsc = arg; \ + u_int cpu = PCPU_GET(cpuid); \ + \ + __asm __volatile("cpuid" : : : "eax", "ebx", "ecx", "edx"); \ + tsc[cpu * 3 + x] = rdtsc32(); \ } TSC_READ(0) TSC_READ(1) @@ -487,7 +518,16 @@ init: for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++) ; if (shift > 0) { - tsc_timecounter.tc_get_timecount = tsc_get_timecount_low; + if (cpu_feature & CPUID_SSE2) { + if (cpu_vendor_id == CPU_VENDOR_AMD) { + tsc_timecounter.tc_get_timecount = + tsc_get_timecount_low_mfence; + } else { + tsc_timecounter.tc_get_timecount = + tsc_get_timecount_low_lfence; + } + } else + tsc_timecounter.tc_get_timecount = tsc_get_timecount_low; tsc_timecounter.tc_name = "TSC-low"; if (bootverbose) printf("TSC timecounter discards lower %d bit(s)\n", @@ -599,16 +639,48 @@ tsc_get_timecount(struct timecounter *tc __unused) return (rdtsc32()); }
-static u_int +static inline u_int tsc_get_timecount_low(struct timecounter *tc) { uint32_t rv;
__asm __volatile("rdtsc; shrd %%cl, %%edx, %0" - : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx"); + : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx"); return (rv); }
+static u_int +tsc_get_timecount_lfence(struct timecounter *tc __unused) +{ + + lfence(); + return (rdtsc32()); +} + +static u_int +tsc_get_timecount_low_lfence(struct timecounter *tc) +{ + + lfence(); + return (tsc_get_timecount_low(tc)); +} + +static u_int +tsc_get_timecount_mfence(struct timecounter *tc __unused) +{ + + mfence(); + return (rdtsc32()); +} + +static u_int +tsc_get_timecount_low_mfence(struct timecounter *tc) +{ + + mfence(); + return (tsc_get_timecount_low(tc)); +} + uint32_t cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th) {
|
238755 |
24-Jul-2012 |
jimharris |
Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
Intel Architecture Manual specifies that rdtsc instruction is not serialized, so without this change, TSC synchronization test would periodically fail, resulting in use of HPET timecounter instead of TSC-low. This caused severe performance degradation (40-50%) when running high IO/s workloads due to HPET MMIO reads and GEOM stat collection.
Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization fail approximately 20% of the time.
Sponsored by: Intel Reviewed by: kib MFC after: 3 days
|
237433 |
22-Jun-2012 |
kib |
Implement mechanism to export some kernel timekeeping data to usermode, using shared page. The structures and functions have vdso prefix, to indicate the intended location of the code in some future.
The versioned per-algorithm data is exported in the format of struct vdso_timehands, which mostly repeats the content of in-kernel struct timehands. Usermode reading of the structure can be lockless. Compatibility export for 32bit processes on 64bit host is also provided. Kernel also provides usermode with indication about currently used timecounter, so that libc can fall back to syscall if configured timecounter is unknown to usermode code.
The shared data updates are initiated both from the tc_windup(), where a fast task is queued to do the update, and from sysctl handlers which change timecounter. A manual override switch kern.timecounter.fast_gettime allows to turn off the mechanism.
Only x86 architectures export the real algorithm data, and there, only for tsc timecounter. HPET counters page could be exported as well, but I prefer to not further glue the kernel and libc ABI there until proper vdso-based solution is developed.
Minimal stubs neccessary for non-x86 architectures to still compile are provided.
Discussed with: bde Reviewed by: jhb Tested by: flo MFC after: 1 month
|
236503 |
03-Jun-2012 |
avg |
free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG
Those calls are useful with hardware watchdog drivers too.
MFC after: 3 weeks
|
234989 |
03-May-2012 |
attilio |
Revert part of r234723 by re-enabling the SMP protection for intr_bind() on x86. This has been requested by jhb and I strongly disagree with this, but as long as he is the x86 and interrupt subsystem maintainer I will follow his directives.
The disagreement cames from what we should really consider as a public KPI. IMHO, if we really need a selection between the kernel functions, we may need an explicit protection like _KERNEL_KPI, which defines which subset of the kernel function might really be considered as part of the KPI (for thirdy part modules) and which not. As long as we don't have this mechanism I just consider any possible function as usable by thirdy part code, thus intr_bind() included.
MFC after: 1 week
|
234723 |
26-Apr-2012 |
attilio |
Clean up the intr* MD KPI from the SMP dependency, removing a cause of discrepancy between modules and kernel, but deal with SMP differences within the functions themselves.
As an added bonus this also helps in terms of code readability.
Requested by: gibbs Reviewed by: jhb, marius MFC after: 1 week
|
233961 |
06-Apr-2012 |
gibbs |
Fix interrupt load balancing regression, introduced in revision 222813, that left all un-pinned interrupts assigned to CPU 0.
sys/x86/x86/intr_machdep.c: In intr_shuffle_irqs(), remove CPU_SETOF() call that initialized the "intr_cpus" cpuset to only contain CPU0.
This initialization is too late and nullifies the results of calls the intr_add_cpu() that occur much earlier in the boot process. Since "intr_cpus" is statically initialized to the empty set, and all processors, including the BSP, already add themselves to "intr_cpus" no special initialization for the BSP is necessary.
MFC after: 3 days
|
233793 |
02-Apr-2012 |
jhb |
Further tweak the changes made in r233709. The kernel doesn't permit sleeping from a swi handler (even though in this case it would be ok), so switch the refill and scanning SWI handlers to being tasks on a fast taskqueue. Also, only schedule the refill task for a CMCI as an MC# can fire at any time, so it should do the minimal amount of work needed and avoid opportunities to deadlock before it panics (such as scheduling a task it won't ever need in practice). To handle the case of an MC# only finding recoverable errors (which should never happen), always try to refill the event free list when the periodic scan executes.
MFC after: 2 weeks
|
233781 |
02-Apr-2012 |
jhb |
Make machine check exception logging more readable. On newer Intel systems, an uncorrected ECC error tends to fire on all CPUs in a package simultaneously and the current printf hacks are not sufficient to make the messages legible. Instead, use the existing mca_lock spinlock to serialize calls to mca_log() and change the machine check code to panic directly when an unrecoverable error is encoutered rather than falling back to a trap_fatal() call in trap() (which adds nearly a screen-full of logging messages that aren't useful for machine checks).
MFC after: 2 weeks
|
233709 |
30-Mar-2012 |
jhb |
Attempt to make machine check handling a bit more robust: - Don't malloc() new MCA records for machine checks logged due to a CMCI or MC# exception. Instead, use a pre-allocated pool of records. When a CMCI or MC# exception fires, schedule a swi to refill the pool. The pool is sized to hold at least one record per available machine bank, and one record per CPU. This should handle the case of all CPUs triggering a single bank at once as well as the case a single CPU triggering all of its banks. The periodic scans still use malloc() since they are run from a safe context. - Since we have to create an swi to handle refills, make the periodic scan a second swi for the same thread instead of having a separate taskqueue thread for the scans.
Suggested by: mdf (avoiding malloc()) MFC after: 2 weeks
|
233707 |
30-Mar-2012 |
jhb |
Move the legacy(4) driver to x86.
|
233676 |
29-Mar-2012 |
jhb |
Use a more proper fix for enabling HT MSI mapping windows on Host-PCI bridges. Rather than blindly enabling the windows on all of them, only enable the window when an MSI interrupt is enabled for a device behind the bridge, similar to what already happens for HT PCI-PCI bridges.
To implement this, each x86 Host-PCI bridge driver has to be able to locate it's actual backing device on bus 0. For ACPI, use the _ADR method to find the slot and function of the device. For the non-ACPI case, the legacy(4) driver already scans bus 0 looking for Host-PCI bridge devices. Now it saves the slot and function of each bridge that it finds as ivars that the Host-PCI bridge driver can then use in its pcib_map_msi() method.
This fixes machines where non-MSI interrupts were broken by the previous round of HT MSI changes.
Tested by: bapt MFC after: 1 week
|
233036 |
16-Mar-2012 |
jhb |
Revert the PCIe 4GB boundary issue workaround now that the proper fix is in HEAD.
Ok'd by: scottl
|
233031 |
16-Mar-2012 |
nyan |
- Fix to build a native i386 kernel without the SMP and atpic. - Merge r232744 changes to pc98. (Allow a kernel to be built with 'nodevice atpic'.) - Move ICU related defines from x86/isa/atpic.c to x86/isa/icu.h and use them in x86/x86/intr_machdep.c.
Reviewed by: jhb
|
232747 |
09-Mar-2012 |
jhb |
Move i386's intr_machdep.c to the x86 tree and share it with amd64.
|
232356 |
01-Mar-2012 |
jhb |
- Change contigmalloc() to use the vm_paddr_t type instead of an unsigned long for specifying a boundary constraint. - Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary constraints.
These allow boundary constraints to be fully expressed for cases where sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a driver to properly specify a 4GB boundary in a PAE kernel.
Note that this cannot be safely MFC'd without a lot of compat shims due to KBI changes, so I do not intend to merge it.
Reviewed by: scottl
|
232267 |
28-Feb-2012 |
emaste |
Workaround for PCIe 4GB boundary issue
Enforce a boundary of no more than 4GB - transfers crossing a 4GB boundary can lead to data corruption due to PCIe limitations. This change is a less-intrusive workaround that can be quickly merged back to older branches; a cleaner implementation will arrive in HEAD later but may require KPI changes.
This change is based on a suggestion by jhb@.
Reviewed by: scottl, jhb Sponsored by: Sandvine Incorporated MFC after: 3 days
|
232232 |
27-Feb-2012 |
jhb |
- Panic up front if a kernel does not include 'device atpic' and an APIC is not found. - Don't panic if lapic_enable_cmc() is called and the APIC is not enabled. This can happen due to booting a kernel with APIC disabled on a CPU that supports CMCI. - Wrap a long line.
|
227843 |
22-Nov-2011 |
marius |
- There's no need to overwrite the default device method with the default one. Interestingly, these are actually the default for quite some time (bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9) since r52045) but even recently added device drivers do this unnecessarily. Discussed with: jhb, marcel - While at it, use DEVMETHOD_END. Discussed with: jhb - Also while at it, use __FBSDID.
|
227309 |
07-Nov-2011 |
ed |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
225069 |
22-Aug-2011 |
silby |
Disable TSC usage inside SMP VM environments. On my VMware ESXi 4.1 environment with a core i5-2500K, operation in this mode causes timeouts from the mpt driver. Switching to the ACPI-fast timer resolves this issue. Switching the VM back to single CPU mode also works, which is why I have not disabled the TSC in that mode.
I did not test with KVM or other VM environments, but I am being cautious and assuming that the TSC is not reliable in SMP mode there as well.
Reviewed by: kib Approved by: re (kib) MFC after: Not applicable, the timecounter code is new for 9.x
|
224096 |
16-Jul-2011 |
jhb |
Fix build when NEW_PCIB is not defined.
Submitted by: gcooper (partially) Pointy hat to: jhb
|
224069 |
15-Jul-2011 |
jhb |
Respect the BIOS/firmware's notion of acceptable address ranges for PCI resource allocation on x86 platforms: - Add a new helper API that Host-PCI bridge drivers can use to restrict resource allocation requests to a set of address ranges for different resource types. - For the ACPI Host-PCI bridge driver, use Producer address range resources in _CRS to enumerate valid address ranges for a given Host-PCI bridge. This can be disabled by including "hostres" in the debug.acpi.disabled tunable. - For the MPTable Host-PCI bridge driver, use entries in the extended MPTable to determine the valid address ranges for a given Host-PCI bridge. This required adding code to parse extended table entries.
Similar to the new PCI-PCI bridge driver, these changes are only enabled if the NEW_PCIB kernel option is enabled (which is enabled by default on amd64 and i386).
Approved by: re (kib)
|
224042 |
14-Jul-2011 |
jkim |
If TSC stops ticking in C3, disable deep sleep when the user forcefully select TSC as timecounter hardware.
Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de)
|
223426 |
22-Jun-2011 |
jkim |
Set negative quality to TSC timecounter when C3 state is enabled for Intel processors unless the invariant TSC bit of CPUID is set. Intel processors may stop incrementing TSC when DPSLP# pin is asserted, according to Intel processor manuals, i. e., TSC timecounter is useless if the processor can enter deep sleep state (C3/C4). This problem was accidentally uncovered by r222869, which increased timecounter quality of P-state invariant TSC, e.g., for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).
Reported by: Fabian Keil (freebsd-listen at fabiankeil dot de) Ian FREISLICH (ianf at clue dot co dot za) Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de) - Core2 Duo T5870 (C3 state available/enabled) jkim - Xeon X5150 (C3 state unavailable)
|
223211 |
17-Jun-2011 |
jkim |
Teach the compiler how to shift TSC value efficiently. As noted in r220631, some times compiler inserts redundant instructions to preserve unused upper 32 bits even when it is casted to a 32-bit value. Unfortunately, it seems the problem becomes more serious when it is shifted, especially on amd64.
|
222884 |
08-Jun-2011 |
jkim |
Tidy up r222866.
- Re-add accidentally removed atomic op. for sysctl(9) handler. - Remove a period(`.') at the end of a debugging message. - Consistently spell "low" for "TSC-low" timecounter throughout.
Pointed out by: bde
|
222869 |
08-Jun-2011 |
jkim |
Increase quality of TSC (or TSC-low) timecounter to 1000 if it is P-state invariant. For SMP case (TSC-low), it also has to pass SMP synchronization test and the CPU vendor/model has to be white-listed explicitly. Currently, all Intel CPUs and single-socket AMD Family 15h processors are listed here.
Discussed with: hackers
|
222866 |
08-Jun-2011 |
jkim |
Introduce low-resolution TSC timecounter "TSC-low". It replaces the normal TSC timecounter if TSC frequency is higher than ~4.29 MHz (or 2^32-1 Hz) or multiple CPUs are present. The "TSC-low" frequency is always lower than a preset maximum value and derived from TSC frequency (by being halved until it becomes lower than the maximum). Note the maximum value for SMP case is significantly lower than UP case because we want to reduce (rare but known) "temporal anomalies" caused by non-serialized RDTSC instruction. Normally, it is still higher than "ACPI-fast" timecounter frequency (which was default timecounter hardware for long time until r222222) to be useful.
|
222864 |
08-Jun-2011 |
jkim |
Remove a redundant assignment since r221703.
|
222813 |
07-Jun-2011 |
attilio |
etire the cpumask_t type and replace it with cpuset_t usage.
This is intended to fix the bug where cpu mask objects are capped to 32. MAXCPU, then, can now arbitrarely bumped to whatever value. Anyway, as long as several structures in the kernel are statically allocated and sized as MAXCPU, it is suggested to keep it as low as possible for the time being.
Technical notes on this commit itself: - More functions to handle with cpuset_t objects are introduced. The most notable are cpusetobj_ffs() (which calculates a ffs(3) for a cpuset_t object), cpusetobj_strprint() (which prepares a string representing a cpuset_t object) and cpusetobj_strscan() (which creates a valid cpuset_t starting from a string representation). - pc_cpumask and pc_other_cpus are target to be removed soon. With the moving from cpumask_t to cpuset_t they are now inefficient and not really useful. Anyway, for the time being, please note that access to pcpu datas is protected by sched_pin() in order to avoid migrating the CPU while reading more than one (possible) word - Please note that size of cpuset_t objects may differ between kernel and userland. While this is not directly related to the patch itself, it is good to understand that concept and possibly use the patch as a reference on how to deal with cpuset_t objects in userland, when accessing kernland members. - KTR_CPUMASK is changed and now is represented through a string, to be set as the example reported in NOTES.
Please additively note that no MAXCPU is bumped in this patch, but private testing has been done until to MAXCPU=128 on a real 8x8x2(htt) machine (amd64).
Please note that the FreeBSD version is not yet bumped because of the upcoming pcpu changes. However, note that this patch is not targeted for MFC.
People to thank for the time spent on this patch: - sbruno, pluknet and Nicholas Esborn (nick AT desert DOT net) tested several revision of the patches and really helped in improving stability of this work. - marius fixed several bugs in the sparc64 implementation and reviewed patches related to ktr. - jeff and jhb discussed the basic approach followed. - kib and marcel made targeted review on some specific part of the patch. - marius, art, nwhitehorn and andreast reviewed MD specific part of the patch. - marius, andreast, gonzo, nwhitehorn and jceel tested MD specific implementations of the patch. - Other people have made contributions on other patches that have been already committed and have been listed separately.
Companies that should be mentioned for having participated at several degrees: - Yahoo! for having offered the machines used for testing on big count of CPUs. - The FreeBSD Foundation for having sponsored my devsummit attendance, which has been instrumental. - Sandvine for having offered offices and infrastructure during development.
(I really hope I didn't forget anyone, if it happened I apologize in advance).
|
221703 |
09-May-2011 |
jkim |
Implement boot-time TSC synchronization test for SMP. This test is executed when the user has indicated that the system has synchronized TSCs or it has P-state invariant TSCs. For the former case, we may clear the tunable if it fails the test to prevent accidental foot-shooting. For the latter case, we may set it if it passes the test to notify the user that it may be usable.
|
221508 |
05-May-2011 |
mav |
Some changes around LAPIC timer programming.
This fixes heavy interrupt storm and resulting system freeze when using LAPIC timer in one-shot mode under Xen HVM. There, unlike real hardware, programming timer with zero period almost immediately causes interrupt.
|
221393 |
03-May-2011 |
jhb |
Reimplement how PCI-PCI bridges manage their I/O windows. Previously the driver would verify that requests for child devices were confined to any existing I/O windows, but the driver relied on the firmware to initialize the windows and would never grow the windows for new requests. Now the driver actively manages the I/O windows.
This is implemented by allocating a bus resource for each I/O window from the parent PCI bus and suballocating that resource to child devices. The suballocations are managed by creating an rman for each I/O window. The suballocated resources are mapped by passing the bus_activate_resource() call up to the parent PCI bus. Windows are grown when needed by using bus_adjust_resource() to adjust the resource allocated from the parent PCI bus. If the adjust request succeeds, the window is adjusted and the suballocation request for the child device is retried.
When growing a window, the rman_first_free_region() and rman_last_free_region() routines are used to determine if the front or end of the existing I/O window is free. From using that, the smallest ranges that need to be added to either the front or back of the window are computed. The driver will first try to grow the window in whichever direction requires the smallest growth first followed by the other direction if that fails.
Subtractive bridges will first attempt to satisfy requests for child resources from I/O windows (including attempts to grow the windows). If that fails, the request is passed up to the parent PCI bus directly however.
The PCI-PCI bridge driver will try to use firmware-assigned ranges for child BARs first and only allocate a "fresh" range if that specific range cannot be accommodated in the I/O window. This allows systems where the firmware assigns resources during boot but later wipes the I/O windows (some ACPI BIOSen are known to do this) to "rediscover" the original I/O window ranges.
The ACPI Host-PCI bridge driver has been adjusted to correctly honor hw.acpi.host_mem_start and the I/O port equivalent when a PCI-PCI bridge makes a wildcard request for an I/O window range.
The new PCI-PCI bridge driver is only enabled if the NEW_PCIB kernel option is enabled. This is a transition aide to allow platforms that do not yet support bus_activate_resource() and bus_adjust_resource() in their Host-PCI bridge drivers (and possibly other drivers as needed) to use the old driver for now. Once all platforms support the new driver, the kernel option and old driver will be removed.
PR: kern/143874 kern/149306 Tested by: mav
|
221331 |
02-May-2011 |
jkim |
Fix build with clang. Please note there is an LLVM/Clang PR:
http://llvm.org/bugs/show_bug.cgi?id=9379
Reported by: rpaulo, dim
|
221324 |
02-May-2011 |
jhb |
Add implementations of BUS_ADJUST_RESOURCE() to the PCI bus driver, generic PCI-PCI bridge driver, x86 nexus driver, and x86 Host to PCI bridge drivers.
|
221218 |
29-Apr-2011 |
jhb |
Change rman_manage_region() to actually honor the rm_start and rm_end constraints on the rman and reject attempts to manage a region that is out of range. - Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of ~0ul). - To preserve existing behavior, change rman_init() to set rm_start and rm_end to allow managing the full range (0 to ~0ul) if they are not set by the caller when rman_init() is called.
|
221214 |
29-Apr-2011 |
jkim |
Detect VMware guest and set the TSC frequency as reported by the hypervisor. VMware products virtualize TSC and it run at fixed frequency in so-called "apparent time". Although virtualized i8254 also runs in apparent time, TSC calibration always gives slightly off frequency because of the complicated timer emulation and lost-tick correction mechanism.
|
221178 |
28-Apr-2011 |
jkim |
Turn off periodic recalibration of CPU ticker frequency if it is invariant.
|
221173 |
28-Apr-2011 |
attilio |
Add the watchdogs patting during the (shutdown time) disk syncing and disk dumping. With the option SW_WATCHDOG on, these operations are doomed to let watchdog fire, fi they take too long.
I implemented the stubs this way because I really want wdog_kern_* KPI to not be dependant by SW_WATCHDOG being on (and really, the option only enables watchdog activation in hardclock) and also avoid to call them when not necessary (avoiding not-volountary watchdog activations).
Sponsored by: Sandvine Incorporated Discussed with: emaste, des MFC after: 2 weeks
|
220637 |
14-Apr-2011 |
jkim |
Work around an emulator problem where virtual CPU advertises TSC is P-state invariant and APERF/MPERF MSRs exist but these MSRs never tick. When we calculate effective frequency from cpu_est_clockrate(), it caused panic of division-by-zero. Now we test whether these MSRs actually increase to avoid such foot-shooting.
Reported by: dim Tested by: dim
|
220632 |
14-Apr-2011 |
jkim |
Use newly added rdtsc32() for the timecounter_get_t method.
|
220613 |
14-Apr-2011 |
jkim |
Add some tunable descriptions about x86 timers.
Requested by: arundel
|
220579 |
12-Apr-2011 |
jkim |
Probe capability to find effective frequency. When the TSC is P-state invariant, APERF/MPERF ratio can be used to find effective frequency.
|
220577 |
12-Apr-2011 |
jkim |
Add a new tunable 'machdep.disable_tsc_calibration' to allow skipping TSC frequency calibration. For Intel processors, if brand string from CPUID contains its nominal frequency, this frequency is used instead.
|
220433 |
07-Apr-2011 |
jkim |
Use atomic load & store for TSC frequency. It may be overkill for amd64 but safer for i386 because it can be easily over 4 GHz now. More worse, it can be easily changed by user with 'machdep.tsc_freq' tunable (directly) or cpufreq(4) (indirectly). Note it is intentionally not used in performance critical paths to avoid performance regression (but we should, in theory). Alternatively, we may add "virtual TSC" with lower frequency if maximum frequency overflows 32 bits (and ignore possible incoherency as we do now).
|
219700 |
16-Mar-2011 |
jkim |
Revert r219676.
Requested by: jhb, bde
|
219676 |
15-Mar-2011 |
jkim |
Do not let machdep.tsc_freq modify tsc_freq itself. It is bad for i386 as it does not operate atomically. Actually, it serves no purpose.
Noticed by: bde
|
219673 |
15-Mar-2011 |
jkim |
Deprecate tsc_present as the last of its real consumers finally disappeared.
|
219473 |
11-Mar-2011 |
jkim |
Add a tunable "machdep.disable_tsc" to turn off TSC. Specifically, it turns off boot-time CPU frequency calibration, DELAY(9) with TSC, and using TSC as a CPU ticker. Note tsc_present does not change by this tunable.
|
219469 |
10-Mar-2011 |
jkim |
Turn off pointless P-state invariant TSC detection based on CPU model on a virtual machine.
|
219461 |
10-Mar-2011 |
jkim |
Deprecate rarely used tsc_is_broken. Instead, we zero out tsc_freq because it is almost always used with tsc_freq any way.
|
218221 |
03-Feb-2011 |
jhb |
Use a dedicated taskqueue with a thread that runs at a software-interrupt priority for the periodic polling of the machine check registers.
|
217616 |
19-Jan-2011 |
mdf |
Introduce signed and unsigned version of CTLTYPE_QUAD, renaming existing uses. Rename sysctl_handle_quad() to sysctl_handle_64().
|
217360 |
13-Jan-2011 |
jhb |
If an interrupt on an I/O APIC is moved to a different CPU after it has started to execute, it seems that the corresponding ISR bit in the "old" local APIC can be cleared. This causes the local APIC interrupt routine to fail to find an interrupt to service. Rather than panic'ing in this case, simply return from the interrupt without sending an EOI to the local APIC. If there are any other pending interrupts in other ISR registers, the local APIC will assert a new interrupt.
Tested by: steve
|
217337 |
13-Jan-2011 |
mdf |
Revert to using bus_size_t for the bounce_zone's alignment member.
Reuqested by: jhb
|
217330 |
12-Jan-2011 |
mdf |
Fix a brain fart. Since this file is shared between i386 and amd64, a bus_size_t may be 32 or 64 bits. Change the bounce_zone alignment field to explicitly be 32 bits, as I can't really imagine a DMA device that needs anything close to 2GB alignment of data.
|
217326 |
12-Jan-2011 |
mdf |
sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the kernel changes.
|
216679 |
23-Dec-2010 |
jhb |
Drop the icu_lock spinlock while pausing briefly after masking the interrupt in the I/O APIC before moving it to a different CPU. If the interrupt had been triggered by the I/O APIC after locking icu_lock but before we masked the pin in the I/O APIC, then this could cause the interrupt to be pending on the "old" CPU and it would finally trigger after we had moved the interrupt to the new CPU. This could cause us to panic as there was no interrupt source associated with the old IDT vector on the old CPU. Dropping the lock after the interrupt is masked but before it is moved allows the interrupt to fire and be handled in this case before it is moved.
Tested by: Daniel Braniss danny of cs huji ac il MFC after: 1 week
|
216592 |
20-Dec-2010 |
tijl |
Merge amd64 and i386 bus.h and move the resulting header to x86. Replace the original amd64 and i386 headers with stubs.
Rename (AMD64|I386)_BUS_SPACE_* to X86_BUS_SPACE_* everywhere.
Reviewed by: imp (previous version), jhb Approved by: kib (mentor)
|
216337 |
09-Dec-2010 |
jkim |
Remove AMD Family 0Fh, Model 6Bh, Stepping 2 from the list of P-state invariant CPUs. I do not believe this model is P-state invariant any more. Maybe cpufreq(4) was broken at the time of commit. :-(
|
216316 |
09-Dec-2010 |
cperciva |
Replace i386/i386/busdma_machdep.c and amd64/amd64/busdma_machdep.c (which are identical) with a single x86/x86/busdma_machdep.c.
|
216283 |
08-Dec-2010 |
jkim |
Merge sys/amd64/amd64/tsc.c and sys/i386/i386/tsc.c and move to sys/x86/x86.
Discussed with: avg
|
215751 |
23-Nov-2010 |
avg |
x86/local_apic: use newly added ARAT bit definition
ARAT: APIC-Timer-always-running feature.
Suggested by: mav MFC after: 12 days
|
215051 |
09-Nov-2010 |
attilio |
Move the mptable.h under x86/include/.
Sponsored by: Sandvine Incorporated MFC after: 14 days
|
215009 |
08-Nov-2010 |
jhb |
Sync the APIC startup sequence with amd64: - Register APIC enumerators at SI_SUB_TUNABLES - 1 instead of SI_SUB_CPU - 1. - Probe CPUs at SI_SUB_TUNABLES - 1. This allows i386 to set a truly accurate mp_maxid value rather than always setting it to MAXCPU - 1.
|
215001 |
08-Nov-2010 |
jhb |
Only dump the values of the PMC and CMCI local vector table entries on a local APIC if those LVT entries are valid. This quiets spurious illegal register local APIC errors during boot on a CPU that doesn't support those vectors.
MFC after: 1 week
|
214686 |
02-Nov-2010 |
jhb |
Cosmetic change to revert one of my earlier ones. #if __i386__ && PAE is identical to just #if PAE since PAE is only a valid option for i386.
Submitted by: attilio
|
214681 |
02-Nov-2010 |
jhb |
Further tweaks to the ram_attach() routine: - Use > 2^32 - 1 instead of >= when checking for memory regions above 4G. - Skip SMAP entries > 4G on i386 rather than breaking out of the loop since SMAP entries are not guaranteed to be in order. - Remove 'i' and loop over 'rid' directly in the dump_avail[] case. - Only check for 4G regions in the dump_avail[] case on i386 if PAE is enabled since vm_paddr_t is 32-bit in the !PAE case.
Submitted by: alc
|
214676 |
02-Nov-2010 |
jhb |
Skip SMAP regions above 4GB on i386 since they will not fit into a long. While here, update some comments to better explain the new code flow.
Tested by: dhw
|
214631 |
01-Nov-2010 |
jhb |
Move <machine/apicreg.h> to <x86/apicreg.h>.
|
214630 |
01-Nov-2010 |
jhb |
Move the <machine/mca.h> header to <x86/mca.h>.
|
214515 |
29-Oct-2010 |
attilio |
- Merge ram_attach() implementation for i386 and amd64 - Rename RES_BUS_SPACE_* into BUS_SPACE_* for consistency - Trim out an unnecessary checking condition
Sponsored by: Sandvine Incorporated Requested and reviewed by: jhb
|
214457 |
28-Oct-2010 |
attilio |
Merge nexus.c from amd64 and i386 to x86 subtree.
Sponsored by: Sandvine Incorporated Tested by: gianni
|
214446 |
28-Oct-2010 |
attilio |
Merge the mptable support from MD bits to x86 subtree.
Sponsored by: Sandvine Incorporated Discussed with: jhb
|
214386 |
26-Oct-2010 |
attilio |
Style fix.
Reported by: bde, dim
|
214380 |
26-Oct-2010 |
attilio |
Remove usage of PRI* macro for style compliancy.
Requested by: bde, jhb Sponsored by: Sandvine Incorporated
|
214373 |
26-Oct-2010 |
attilio |
Merge dump_machdep.c i386/amd64 under the x86 subtree.
Sponsored by: Sandvine Incorporated Tested by: gianni
|
214347 |
25-Oct-2010 |
jhb |
Use 'saveintr' instead of 'savecrit' or 'eflags' to hold the state returned by intr_disable().
Requested by: bde
|
212541 |
13-Sep-2010 |
mav |
Refactor timer management code with priority to one-shot operation mode. The main goal of this is to generate timer interrupts only when there is some work to do. When CPU is busy interrupts are generating at full rate of hz + stathz to fullfill scheduler and timekeeping requirements. But when CPU is idle, only minimum set of interrupts (down to 8 interrupts per second per CPU now), needed to handle scheduled callouts is executed. This allows significantly increase idle CPU sleep time, increasing effect of static power-saving technologies. Also it should reduce host CPU load on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to control wanted event timer subsystem behavior: kern.eventtimer.timer - allows to choose event timer hardware to use. On x86 there is up to 4 different kinds of timers. Depending on whether chosen timer is per-CPU, behavior of other options slightly differs. kern.eventtimer.periodic - allows to choose periodic and one-shot operation mode. In periodic mode, current timer hardware taken as the only source of time for time events. This mode is quite alike to previous kernel behavior. One-shot mode instead uses currently selected time counter hardware to schedule all needed events one by one and program timer to generate interrupt exactly in specified time. Default value depends of chosen timer capabilities, but one-shot mode is preferred, until other is forced by user or hardware. kern.eventtimer.singlemul - in periodic mode specifies how much times higher timer frequency should be, to not strictly alias hardclock() and statclock() events. Default values are 2 and 4, but could be reduced to 1 if extra interrupts are unwanted. kern.eventtimer.idletick - makes each CPU to receive every timer interrupt independently of whether they busy or not. By default this options is disabled. If chosen timer is per-CPU and runs in periodic mode, this option has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions (if supported) under high sleep/wakeup rate, as fast alternative to other methods. It allows SMP scheduler to wake up sleeping CPUs much faster without using IPI, significantly increasing performance on some highly task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc) H/W donated by: Gheorghe Ardelean Sponsored by: iXsystems, Inc.
|
212004 |
30-Aug-2010 |
rpaulo |
When DTrace is enabled, make sure we don't overwrite the IDT_DTRACE_RET entry with an IRQ for some hardware component.
Reviewed by: jhb Sponsored by: The FreeBSD Foundation
|
211756 |
24-Aug-2010 |
mav |
Enable timer interrupt before starting timer. This allows to handle very short periods without interrupt loss.
|
210577 |
28-Jul-2010 |
jhb |
The corrected error count field is dependent on CMCI, not TES.
MFC after: 1 week
|
210444 |
24-Jul-2010 |
mav |
Increment td->td_intr_nesting_level for LAPIC timer interrupts. Among other things it hints SCHED_ULE to run clock swi handlers on their native CPUs, avoiding many unneeded IPI_PREEMPT calls.
|
210298 |
20-Jul-2010 |
mav |
Fix several un-/signedness bugs of r210290 and r210293. Add one more check.
|
210290 |
20-Jul-2010 |
mav |
Extend timer driver API to report also minimal and maximal supported period lengths. Make MI wrapper code to validate periods in request. Make kernel clock management code to honor these hardware limitations while choosing hz, stathz and profhz values.
|
210054 |
14-Jul-2010 |
mav |
Move timeevents.c to MI code, as it is not x86-specific. I already have it working on Marvell ARM SoCs, and it would be nice to unify timer code between more platforms.
|
210052 |
14-Jul-2010 |
mav |
Remove some unneeded includes. Code now can be built on ARM.
|
209990 |
13-Jul-2010 |
mav |
Rise knowledge about curthread->td_intr_frame by one step. Make timer callback argument really opaque. Not repeat interrupt handler's problem in case somebody will ever need to have both argument and frame.
|
209901 |
11-Jul-2010 |
mav |
Make kernel panic with reasonable message if no usable event timer found.
|
209371 |
20-Jun-2010 |
mav |
Implement new event timers infrastructure. It provides unified APIs for writing event timer drivers, for choosing best possible drivers by machine independent code and for operating them to supply kernel with hardclock(), statclock() and profclock() events in unified fashion on various hardware.
Infrastructure provides support for both per-CPU (independent for every CPU core) and global timers in periodic and one-shot modes. MI management code at this moment uses only periodic mode, but one-shot mode use planned for later, as part of tickless kernel project.
For this moment infrastructure used on i386 and amd64 architectures. Other archs are welcome to follow, while their current operation should not be affected.
This patch updates existing drivers (i8254, RTC and LAPIC) for the new order, and adds event timers support into the HPET driver. These drivers have different capabilities: LAPIC - per-CPU timer, supports periodic and one-shot operation, may freeze in C3 state, calibrated on first use, so may be not exactly precise. HPET - depending on hardware can work as per-CPU or global, supports periodic and one-shot operation, usually provides several event timers. i8254 - global, limited to periodic mode, because same hardware used also as time counter. RTC - global, supports only periodic mode, set of frequencies in Hz limited by powers of 2.
Depending on hardware capabilities, drivers preferred in following orders, either LAPIC, HPETs, i8254, RTC or HPETs, LAPIC, i8254, RTC. User may explicitly specify wanted timers via loader tunables or sysctls: kern.eventtimer.timer1 and kern.eventtimer.timer2. If requested driver is unavailable or unoperational, system will try to replace it. If no more timers available or "NONE" specified for second, system will operate using only one timer, multiplying it's frequency by few times and uing respective dividers to honor hz, stathz and profhz values, set during initial setup.
|
209212 |
15-Jun-2010 |
jhb |
Restore the machine check register banks on resume. For banks being monitored via CMCI, reset the interrupt threshold to 1 on resume.
Reviewed by: jkim MFC after: 2 weeks
|
209154 |
14-Jun-2010 |
mav |
Virtualize pci_remap_msi_irq() call from general MSI code. It allows MSI (FSB interrupts) to be used by non-PCI devices, such as HPET.
|
209059 |
11-Jun-2010 |
jhb |
Update several places that iterate over CPUs to use CPU_FOREACH().
|
208991 |
10-Jun-2010 |
mav |
Do not disable edge-triggered interrupts before migration. DELAY() with interrupt disabled highly probable causes interrupt loss.
|
208922 |
08-Jun-2010 |
jhb |
Move the MD support for PCI message signalled interrupts to the x86 tree as it is identical for i386 and amd64.
|
208921 |
08-Jun-2010 |
jhb |
Move the machine check support code to the x86 tree since it is identical on i386 and amd64.
Requested by: alc
|
208919 |
08-Jun-2010 |
jhb |
Move the I/O APIC code to the x86 tree since it is identical on i386 and amd64.
|
208507 |
24-May-2010 |
jhb |
Add support for corrected machine check interrupts. CMCI is a new local APIC interrupt that fires when a threshold of corrected machine check events is reached. CMCI also includes a count of events when reporting corrected errors in the bank's status register. Note that individual banks may or may not support CMCI. If they do, each bank includes its own threshold register that determines when the interrupt fires. Currently the code uses a very simple strategy where it doubles the threshold on each interrupt until it succeeds in throttling the interrupt to occur only once a minute (this interval can be tuned via sysctl). The threshold is also adjusted on each hourly poll which will lower the threshold once events stop occurring.
Tested by: Sailaja Bangaru sbappana at yahoo com MFC after: 1 month
|
208494 |
24-May-2010 |
mav |
- Implement MI helper functions, dividing one or two timer interrupts with arbitrary frequencies into hardclock(), statclock() and profclock() calls. Same code with minor variations duplicated several times over the tree for different timer drivers and architectures. - Switch all x86 archs to new functions, simplifying the code and removing extra logic from timer drivers. Other archs are also welcome.
|
208479 |
24-May-2010 |
mav |
Restore different APIC init orders for i386 and amd64 unified in r208452. Seems noone of them contents both arch for different reasons.
Submitted by: kib@
|
208452 |
23-May-2010 |
mav |
Unify local_apic.c for x86 archs,
|