From: Len Brown on
> The following experimental patch changes the kernel mapping for ACPI tables
> to CACHED. This eliminates the page attribute conflict & allows users to map
> the tables CACHEABLE. This significantly speeds up boot:
>
> 38 minutes without the patch
> 27 minutes with the patch
> ~30% improvement
>
> Time to run ACPIDUMP on a large system:
> 527 seconds without the patch
> 8 seconds with the patch

Interesting.

Can you detect a performance difference on a 1-node machine
that doesn't magnify the penalty of the remote uncached access?

thanks,
-Len Brown, Intel Open Source Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: ykzhao on
On Thu, 2010-07-22 at 23:22 +0800, Jack Steiner wrote:
> I'd like feedback on the following performance problem &
> suggestions for a proper solution.
>
>
> Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
> part of the boot time is scanning ACPI tables. ACPI tables on UV systems
> are located in RAM memory that is physically attached to node 0.
>
> User programs (ex., acpidump) read the ACPI tables by mapping them thru
> /dev/mem. Although mmap tries to map the tables as CACHED, there are
> existing kernel UNCACHED mappings that conflict, and the tables end up
> being mapped UNCACHED. (See the call to track_pfn_vma_new() in
> remap_pfn_range()).
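The attribute-conflict resolution described above can be sketched as follows. This is an illustrative model only, not the kernel's actual PAT code: the enum ordering and function name are assumptions for this demo.

```c
#include <assert.h>

/* Illustrative sketch, NOT the kernel's PAT implementation: models how
 * an existing kernel UNCACHED mapping downgrades a new CACHED request
 * for the same physical range, which is roughly what
 * track_pfn_vma_new()/reserve_memtype() enforce for /dev/mem mmaps.
 * The ordering (weakest caching first) is an assumption of this demo. */
enum mem_type { MT_UC = 0, MT_WC = 1, MT_WB = 2 };

static enum mem_type effective_type(enum mem_type existing,
                                    enum mem_type requested)
{
    /* The stricter (less cached) of the two attributes wins. */
    return existing < requested ? existing : requested;
}
```

So a userspace request for a cacheable (MT_WB) mapping over a range the kernel already maps uncached comes back MT_UC, which is why acpidump ends up reading the tables uncached.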
>
> Much of the access is to small fields (bytes (checksums), shorts, etc).
> Late in boot, there is significant scanning of the ACPI tables that takes
> place from nodes other than zero. Since the tables are not cached, each
> reference accesses physical memory that is attached to remote nodes. These
> memory requests must cross the numalink interconnect which adds several
> hundred nsec to each access. This slows the boot process. Access from
> node 0, although faster, is still very slow.
>
>
>
> The following experimental patch changes the kernel mapping for ACPI tables
> to CACHED. This eliminates the page attribute conflict & allows users to map
> the tables CACHEABLE. This significantly speeds up boot:
>
> 38 minutes without the patch
> 27 minutes with the patch
> ~30% improvement
>
> Time to run ACPIDUMP on a large system:
> 527 seconds without the patch
> 8 seconds with the patch
>
>
Hi, Jack
From the above data it seems that the performance is improved
significantly after using the cached type for the ACPI region.

But some ACPI regions are used for communication between the OS and
BIOS. It may be inappropriate to map those as cached.

The ACPI spec defines the following two types of ACPI address range:

a. AddressRangeACPI(E820_ACPI): This belongs to the ACPI Reclaim
Memory. This range is available RAM usable by the OS after it reads the
ACPI tables.

b. AddressRangeNVS(E820_NVS): This belongs to the ACPI NVS Memory.
This range of addresses is in use or reserved by the system and must not
be used by the operating system. This region is also used for
communication between the OS and BIOS.

From the above description maybe the E820_ACPI region can be mapped as
cached. But this still depends on the BIOS. If some shared data
resides in the E820_ACPI region on some BIOS, maybe we can't map the
E820_ACPI region as cached after all.
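The coverage check the patch relies on can be sketched like this. This is a simplified, hypothetical model of the kernel's e820_all_mapped() (entry layout and demo data are illustrative; it assumes entries sorted by address), using the ACPI data/NVS ranges Jack reports later in the thread:

```c
#include <stdint.h>
#include <assert.h>

/* Simplified model of e820_all_mapped(): returns 1 if [start, end)
 * is entirely covered by e820 entries of the given type. Assumes the
 * map is sorted by ascending address, as the kernel's is. */
enum { E820_RAM = 1, E820_ACPI = 3, E820_NVS = 4 };

struct e820_entry { uint64_t addr, size; int type; };

static int e820_all_mapped(const struct e820_entry *map, int n,
                           uint64_t start, uint64_t end, int type)
{
    for (int i = 0; i < n; i++) {
        const struct e820_entry *e = &map[i];
        if (e->type != type)
            continue;
        if (e->addr >= end || e->addr + e->size <= start)
            continue;           /* no overlap with what remains */
        if (e->addr <= start)
            start = e->addr + e->size;  /* clip off covered prefix */
        if (start >= end)
            return 1;
    }
    return start >= end;
}

/* Demo map mirroring the e820 output quoted later in this thread. */
static const struct e820_entry demo_map[] = {
    { 0x78cd6000ULL, 0x69000ULL, E820_ACPI },  /* ACPI data */
    { 0x78d3f000ULL, 0x22000ULL, E820_NVS  },  /* ACPI NVS  */
    { 0x78d61000ULL, 0x20000ULL, E820_ACPI },  /* ACPI data */
};
```

With this map, the DMAR table range 0x78d6f000-0x78d70000 checks out as fully inside an E820_ACPI entry, while a range straddling into NVS does not.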

Thanks.
Yakui
> I don't know if the patch in its current form is the correct solution. I'm
> interested in feedback on how this should be solved. I expect there
> are issues on other platforms so for now, the patch uses x86_platform_ops
> to change mappings only on UV platforms (I'm paranoid :-).
>
> I also need to experiment with early_ioremap'ing of the ACPI tables. I suspect
> this is also mapped UNCACHED. There may be additional improvements if this
> could be mapped CACHED. However, the potential performance gain is much
> less since these references all occur from node 0.
>
>
>
> Signed-off-by: Jack Steiner <steiner(a)sgi.com>
>
>
> ---
> arch/x86/include/asm/x86_init.h | 2 ++
> arch/x86/kernel/apic/x2apic_uv_x.c | 6 ++++++
> arch/x86/kernel/x86_init.c | 3 +++
> drivers/acpi/osl.c | 12 +++++++++---
> 4 files changed, 20 insertions(+), 3 deletions(-)
>
> Index: linux/arch/x86/include/asm/x86_init.h
> ===================================================================
> --- linux.orig/arch/x86/include/asm/x86_init.h 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/include/asm/x86_init.h 2010-07-21 16:57:46.614872338 -0500
> @@ -113,6 +113,7 @@ struct x86_cpuinit_ops {
>
> /**
> * struct x86_platform_ops - platform specific runtime functions
> + * @is_wb_acpi_tables E820 ACPI table are in WB memory
> * @is_untracked_pat_range exclude from PAT logic
> * @calibrate_tsc: calibrate TSC
> * @get_wallclock: get time from HW clock like RTC etc.
> @@ -120,6 +121,7 @@ struct x86_cpuinit_ops {
> * @nmi_init enable NMI on cpus
> */
> struct x86_platform_ops {
> + int (*is_wb_acpi_tables)(void);
> int (*is_untracked_pat_range)(u64 start, u64 end);
> unsigned long (*calibrate_tsc)(void);
> unsigned long (*get_wallclock)(void);
> Index: linux/arch/x86/kernel/apic/x2apic_uv_x.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:54:46.358866486 -0500
> @@ -58,6 +58,11 @@ static int uv_is_untracked_pat_range(u64
> return is_ISA_range(start, end) || is_GRU_range(start, end);
> }
>
> +static int uv_is_wb_acpi_tables(void)
> +{
> + return 1;
> +}
> +
> static int early_get_nodeid(void)
> {
> union uvh_node_id_u node_id;
> @@ -81,6 +86,7 @@ static int __init uv_acpi_madt_oem_check
> nodeid = early_get_nodeid();
> x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
> x86_platform.nmi_init = uv_nmi_init;
> + x86_platform.is_wb_acpi_tables = uv_is_wb_acpi_tables;
> if (!strcmp(oem_table_id, "UVL"))
> uv_system_type = UV_LEGACY_APIC;
> else if (!strcmp(oem_table_id, "UVX"))
> Index: linux/arch/x86/kernel/x86_init.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/x86_init.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/kernel/x86_init.c 2010-07-21 16:58:17.106240870 -0500
> @@ -71,7 +71,10 @@ struct x86_cpuinit_ops x86_cpuinit __cpu
>
> static void default_nmi_init(void) { };
>
> +static int default_wb_acpi_tables(void) {return 0;}
> +
> struct x86_platform_ops x86_platform = {
> + .is_wb_acpi_tables = default_wb_acpi_tables,
> .is_untracked_pat_range = default_is_untracked_pat_range,
> .calibrate_tsc = native_calibrate_tsc,
> .get_wallclock = mach_get_cmos_time,
> Index: linux/drivers/acpi/osl.c
> ===================================================================
> --- linux.orig/drivers/acpi/osl.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/drivers/acpi/osl.c 2010-07-21 17:58:20.370414172 -0500
> @@ -293,12 +293,18 @@ acpi_os_map_memory(acpi_physical_address
> printk(KERN_ERR PREFIX "Cannot map memory that high\n");
> return NULL;
> }
> - if (acpi_gbl_permanent_mmap)
> + if (acpi_gbl_permanent_mmap) {
> /*
> * ioremap checks to ensure this is in reserved space
> */
> - return ioremap((unsigned long)phys, size);
> - else
> + if (x86_platform.is_wb_acpi_tables() &&
> + (e820_all_mapped(phys, phys + size, E820_RAM) ||
> + e820_all_mapped(phys, phys + size, E820_ACPI) ||
> + e820_all_mapped(phys, phys + size, E820_NVS)))
> + return ioremap_cache((unsigned long)phys, size);
> + else
> + return ioremap((unsigned long)phys, size);
> + } else
> return __acpi_map_table((unsigned long)phys, size);
> }
> EXPORT_SYMBOL_GPL(acpi_os_map_memory);

From: Ingo Molnar on

* ykzhao <yakui.zhao(a)intel.com> wrote:

> From the above description maybe the E820_ACPI region can be mapped as
> cached. But this still depends on the BIOS. If some shared data resides
> in the E820_ACPI region on some BIOS, maybe we can't map the E820_ACPI
> region as cached after all.

I don't think we can do this safely unless some other OS (Windows) does it as
well. (The reason is that if some BIOS messes this up then it will cause nasty
bugs/problems only on Linux.)

But the benefits of caching are very clear and well measured by Jack, so we
want the feature. What we can do is to add an exception for 'known good' hw
vendors - i.e. something quite close to Jack's RFC patch, but implemented a
bit more cleanly:

Exposing x86_platform and e820 details to generic ACPI code isn't particularly
clean - there should be an ACPI accessor function for that: a new
acpi_table_can_be_cached(table) function or so.

In fact since __acpi_map_table(addr,size) is defined by architectures already,
this could be done purely within x86 code.
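The accessor Ingo suggests might look something like the following. All names here are hypothetical (the stubs stand in for x86_platform and e820 internals); this is a sketch of the proposed shape, not a real kernel API:

```c
#include <stdint.h>
#include <assert.h>

/* Stubs standing in for arch facilities; in the kernel these would be
 * x86_platform.is_wb_acpi_tables() and e820_all_mapped() checks. */
static int platform_has_wb_acpi_tables(void)
{
    return 1;   /* e.g. set only on known-good platforms such as UV */
}

static int range_is_wb_ram(uint64_t phys, uint64_t size)
{
    (void)phys; (void)size;
    return 1;   /* placeholder for the e820 coverage checks */
}

/* The proposed arch-provided predicate: generic ACPI code calls this
 * instead of touching x86_platform or e820 directly, so other arches
 * (e.g. IA64) can supply their own answer. */
static int acpi_table_can_be_cached(uint64_t phys, uint64_t size)
{
    return platform_has_wb_acpi_tables() && range_is_wb_ram(phys, size);
}
```

acpi_os_map_memory() would then pick ioremap_cache() vs ioremap() based on this one call, keeping the e820 details out of drivers/acpi/.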

Thanks,

Ingo
From: ykzhao on
On Fri, 2010-07-23 at 15:23 +0800, Ingo Molnar wrote:
> * ykzhao <yakui.zhao(a)intel.com> wrote:
>
> > From the above description maybe the E820_ACPI region can be mapped as
> > cached. But this still depends on the BIOS. If some shared data resides
> > in the E820_ACPI region on some BIOS, maybe we can't map the E820_ACPI
> > region as cached after all.
>
> I don't think we can do this safely unless some other OS (Windows) does it as
> well. (The reason is that if some BIOS messes this up then it will cause nasty
> bugs/problems only on Linux.)
>

Yes. We can't map the corresponding ACPI region as cached in the
following case: no E820_ACPI region is reported by the BIOS. In that
case the ACPI tables reside in the NVS region.

But if the BIOS follows the spec and reports separate
E820_ACPI/E820_NVS regions, maybe we can try mapping the E820_ACPI
region as the cached type - for example, on server machines. (The spec
describes the E820_ACPI region as reclaimable memory, which means that
it can be managed by the OS after the ACPI tables are loaded.)

Can we add a boot option to control whether the E820_ACPI region can be
mapped as the cached type?
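Such a boot option might be parsed roughly as below. The option name and values are hypothetical; in the kernel this would be an early_param() handler, modeled here as a plain function for illustration:

```c
#include <string.h>
#include <assert.h>

/* Hypothetical "acpi_wb_tables=on|off" boot option (name invented for
 * this sketch). Default off: do not map E820_ACPI regions cached. */
static int acpi_wb_tables;

static int parse_acpi_wb_tables(const char *arg)
{
    if (arg && strcmp(arg, "on") == 0)
        acpi_wb_tables = 1;
    else if (arg && strcmp(arg, "off") == 0)
        acpi_wb_tables = 0;
    else
        return -1;  /* unrecognized value: leave the default alone */
    return 0;
}
```

An opt-in default keeps the behavior safe on BIOSes that share data in E820_ACPI, while letting large-system admins turn caching on.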

> But the benefits of caching are very clear and well measured by Jack, so we
> want the feature. What we can do is to add an exception for 'known good' hw
> vendors - i.e. something quite close to Jack's RFC patch, but implemented a
> bit more cleanly:
>
> Exposing x86_platform and e820 details to generic ACPI code isnt particularly
> clean - there should be an ACPI accessor function for that or so: a new
> acpi_table_can_be_cached(table) function or so.

Agree. The acpi_os_map_memory() function is also used on the IA64
platform. It seems more reasonable to use a wrapper function to check
whether the corresponding region can be mapped as the cached type.
>
> In fact since __acpi_map_table(addr,size) is defined by architectures already,
> this could be done purely within x86 code.
>
> Thanks,
>
> Ingo

From: Jack Steiner on
On Thu, Jul 22, 2010 at 11:52:02AM -0400, Len Brown wrote:
> > The following experimental patch changes the kernel mapping for ACPI tables
> > to CACHED. This eliminates the page attribute conflict & allows users to map
> > the tables CACHEABLE. This significantly speeds up boot:
> >
> > 38 minutes without the patch
> > 27 minutes with the patch
> > ~30% improvement
> >
> > Time to run ACPIDUMP on a large system:
> > 527 seconds without the patch
> > 8 seconds with the patch
>
> Interesting.
>
> Can you detect a performance difference on a 1-node machine
> that doesn't magnify the penalty of the remote uncached access?


I timed acpidump on a smaller system running from both node 0 & a higher
node.
Serial number: UV-00000041
Partition number: 0
4 Blades
8 Nodes (Nehalem-EX sockets)
64 CPUs
60.87 Gb Memory Total


Times to run acpidump (average of 100 runs) show that cached runs 4X to 14X
faster than uncached, depending on the node it is running from. Since the
system is small, the total runtime is small.

baseline
.143 sec node 0
.479 sec node 7

ACPI tables mapped cached
.034 sec node 0
.037 sec node 7


I added trace code to remap_pfn_range() to see what ranges are mmapped.
The ranges are (first number is the number of times the range was mapped):

2 : paddr 0x78d1c000 - 0x79d1c000 DSDT @ 0x78d1c000
2 : paddr 0x78d1c000 - 0x9bd1c000 DSDT @ 0x78d1c000 << ???
4 : paddr 0x78d3f000 - 0x79d3f000 FACS @ 0x78d3f000
4 : paddr 0x78d6f000 - 0x79d6f000 DMAR @ 0x78d6f000
4 : paddr 0x78d70000 - 0x79d70000 SPCR @ 0x78d70000
4 : paddr 0x78d71000 - 0x79d71000 MCFG @ 0x78d71000
4 : paddr 0x78d72000 - 0x79d72000 SRAT @ 0x78d72000
4 : paddr 0x78d74000 - 0x79d74000 APIC @ 0x78d74000
4 : paddr 0x78d76000 - 0x79d76000 SLIT @ 0x78d76000
4 : paddr 0x78d78000 - 0x79d78000 HPET @ 0x78d78000
2 : paddr 0x78d79000 - 0x79d79000 SSDT @ 0x78d79000
2 : paddr 0x78d79000 - 0x7ed79000 SSDT @ 0x78d79000
4 : paddr 0x78d7f000 - 0x79d7f000 FACP @ 0x78d7f000
5 : paddr 0x78d80000 - 0x79d80000 ???



These ranges correspond to the following E820 entries

[ 0.000000] BIOS-e820: 000000000008f000 - 0000000000090000 (ACPI NVS)
[ 0.000000] BIOS-e820: 0000000078c61000 - 0000000078cd6000 (ACPI NVS)
[ 0.000000] BIOS-e820: 0000000078cd6000 - 0000000078d3f000 (ACPI data)
[ 0.000000] BIOS-e820: 0000000078d3f000 - 0000000078d61000 (ACPI NVS)
[ 0.000000] BIOS-e820: 0000000078d61000 - 0000000078d81000 (ACPI data)
[ 0.000000] BIOS-e820: 0000000078d81000 - 000000007cde1000 (usable)

and EFI entries:
[ 0.000000] EFI: mem136: type=9, attr=0xf, range=[0x0000000078cd6000-0x0000000078d3f000) (0MB)
[ 0.000000] EFI: mem137: type=10, attr=0xf, range=[0x0000000078d3f000-0x0000000078d61000) (0MB)
[ 0.000000] EFI: mem138: type=9, attr=0xf, range=[0x0000000078d61000-0x0000000078d6f000) (0MB)
[ 0.000000] EFI: mem139: type=9, attr=0xf, range=[0x0000000078d6f000-0x0000000078d81000) (0MB)

attr = 0xf ==> WB memory (UC WC WT also supported)
type 9 ==> EFI_ACPI_RECLAIM_MEMORY
type 10 ==> EFI_ACPI_MEMORY_NVS
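The attr=0xf decoding above can be written out explicitly. The bit values below are from the UEFI specification (EFI memory descriptor attributes); the helper name is just for this sketch:

```c
#include <stdint.h>
#include <assert.h>

/* EFI memory descriptor attribute bits, per the UEFI specification. */
#define EFI_MEMORY_UC 0x1ULL    /* uncached */
#define EFI_MEMORY_WC 0x2ULL    /* write-combining */
#define EFI_MEMORY_WT 0x4ULL    /* write-through */
#define EFI_MEMORY_WB 0x8ULL    /* write-back */

/* attr = 0xf means all four caching types are supported, so a WB
 * (cached) mapping of the range is permitted by the firmware. */
static int efi_range_supports_wb(uint64_t attr)
{
    return (attr & EFI_MEMORY_WB) != 0;
}
```

So the mem136-mem139 ranges quoted above (attr=0xf) all advertise WB support, which is what makes mapping them cached plausible in the first place.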
