From: Borislav Petkov on
From: "H. Peter Anvin" <hpa(a)zytor.com>
Date: Wed, Aug 04, 2010 at 07:05:47PM -0400

> On 08/04/2010 09:45 AM, Borislav Petkov wrote:
> >
> > 2. Do not use swapper_pg_dir to boot secondary CPUs like 64-bit
> > does.
> >
> > This patch implements solution 2. It introduces a trampoline_pg_dir
> > which has the same layout as swapper_pg_dir with low_mappings. This page
> > table is used as the initial page table of the booting CPU. Later in the
> > bringup process, it switches to swapper_pg_dir and does a global TLB
> > flush. This fixes the crashes in our test cases.
> >
>
> I would like to keep around a page directory with the low mappings
> around -- and not use it for kernel threads -- at all times *anyway*.
> This means we can remove any current hacks that we have to do around S3
> entry and exit, for example.
>
> --- a/arch/x86/kernel/head_32.S
> +++ b/arch/x86/kernel/head_32.S
> @@ -328,7 +328,7 @@ ENTRY(startup_32_smp)
> /*
> * Enable paging
> */
> - movl $pa(swapper_pg_dir),%eax
> + movl pa(initial_page_table), %eax
> movl %eax,%cr3 /* set the page table pointer.. */
> movl %cr0,%eax
> orl $X86_CR0_PG,%eax
> @@ -608,6 +608,8 @@ ignore_int:
> .align 4
> ENTRY(initial_code)
> .long i386_start_kernel
> +ENTRY(initial_page_table)
> + .long pa(swapper_pg_dir)
>
> /*
> * BSS section
> @@ -623,6 +625,10 @@ ENTRY(swapper_pg_dir)
> #endif
> swapper_pg_fixmap:
> .fill 1024,4,0
> +#ifdef CONFIG_X86_TRAMPOLINE
> +ENTRY(trampoline_pg_dir)
> + .fill 1024,4,0
> +#endif
>
> I don't really see why this makes sense, though. It would make more
> sense that the initial page table we set up becomes trampoline_pg_dir;
> we can then set up and change to swapper_pg_dir almost immediately.

Yeah, now we use swapper_pg_dir at all times and zap the low mappings.
However, this is not perfectly clean, as this case in point shows how
unrelated CPUs might establish TLB entries speculatively. Now imagine
if they don't mcheck about it but silently and merrily continue on.
In this particular case, there were no improper accesses due to the
user/kernel permissions mismatch but imagine if suddenly kernel code
started accessing userspace and this not through copy_to_user() et al.

So it really does make sense to have an initial page table and copy
swapper_pg_dir from it. Which would be a perfect exercise for someone
who would like to play with the boot code a bit more, ^hint hint^, if
Joerg doesn't beat me to it.

But I'd suggest we get those fixes in now if there are no objections and
later adjustments should come ontop after excessive testing. And what
about backporting those fixes to .32 and .34, would you be ok with that?
Greg, what about you?

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Roedel, Joerg on
On Wed, Aug 04, 2010 at 07:05:47PM -0400, H. Peter Anvin wrote:
> On 08/04/2010 09:45 AM, Borislav Petkov wrote:
> >
> > 2. Do not use swapper_pg_dir to boot secondary CPUs like 64-bit
> > does.
> >
> > This patch implements solution 2. It introduces a trampoline_pg_dir
> > which has the same layout as swapper_pg_dir with low_mappings. This page
> > table is used as the initial page table of the booting CPU. Later in the
> > bringup process, it switches to swapper_pg_dir and does a global TLB
> > flush. This fixes the crashes in our test cases.
> >
>
> I would like to keep around a page directory with the low mappings
> around -- and not use it for kernel threads -- at all times *anyway*.
> This means we can remove any current hacks that we have to do around S3
> entry and exit, for example.

Yeah, the page table with the low mappings is trampoline_pg_dir
introduced in this patch :-)

> --- a/arch/x86/kernel/head_32.S
> +++ b/arch/x86/kernel/head_32.S
> @@ -328,7 +328,7 @@ ENTRY(startup_32_smp)
> /*
> * Enable paging
> */
> - movl $pa(swapper_pg_dir),%eax
> + movl pa(initial_page_table), %eax
> movl %eax,%cr3 /* set the page table pointer.. */
> movl %cr0,%eax
> orl $X86_CR0_PG,%eax
> @@ -608,6 +608,8 @@ ignore_int:
> .align 4
> ENTRY(initial_code)
> .long i386_start_kernel
> +ENTRY(initial_page_table)
> + .long pa(swapper_pg_dir)
>
> /*
> * BSS section
> @@ -623,6 +625,10 @@ ENTRY(swapper_pg_dir)
> #endif
> swapper_pg_fixmap:
> .fill 1024,4,0
> +#ifdef CONFIG_X86_TRAMPOLINE
> +ENTRY(trampoline_pg_dir)
> + .fill 1024,4,0
> +#endif
>
> I don't really see why this makes sense, though. It would make more
> sense that the initial page table we set up becomes trampoline_pg_dir;
> we can then set up and change to swapper_pg_dir almost immediately.

To make sure I understand correctly, you suggest to initialize
tramponline_pg_dir in the boot sequence of the first cpu and fork
swapper_pg_dir from it later on?

> I realize this isn't how the 64-bit code works at the moment, but in a
> lot of ways I think it would be better if it did.

Yeah, may make sense. This patch already brings the 32 bit
implementation closer to the 64 bit one. On 64 bit things are somewhat
simpler because the tramponline page table can be defined at
compile-time there (contains only 2 pgd_t entries) while on 32 bit we
have to initialize it at runtime.

Joerg

--
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
From: "H. Peter Anvin" <hpa(a)zytor.com>
Date: Thu, Aug 12, 2010 at 11:34:37AM -0400

> On 08/12/2010 07:41 AM, Joerg Roedel wrote:
> > On Thu, Aug 05, 2010 at 07:14:14AM -0700, H. Peter Anvin wrote:
> >> On 08/05/2010 12:45 AM, Roedel, Joerg wrote:
> >> Correct, again. It's unclear to me if we can get away with the very
> >> simple 64-bit approach -- in particular, not including all the 1:1
> >> mappings in the kernel -- for all future users, though.
> >
> > Yeah, We could probably use the same symbol names for the trampoline and
> > the swapper page-table on 32 and 64 bit and merge the code that handles
> > it between the architectures. Shouldn't be too difficult. We cook
> > something up :-)
> > But do you mind to take this patch first? It fixes the occurence of an
> > erratum on AMD hardware, it is a quite simple fix compared to the rework
> > suggested. In fact, I would like to have a simple patch to fix the
> > problem first (and backport it as necessary) and then go the way to
> > rework the current code differences between 32 and 64 bit.
> >
>
> Agreed. I do have a concern about the kernel page tables not being
> synchronized into trampoline_pg_dir (it's only an issue for 32-bit
> !PAE), so at the very least we need to keep an eye out for it...

I've been working on what you suggested and the patch boots on a 32-bit
PAE kernel but keeps rebooting on non-PAE after setting the stack_start
on the AP. I could send it out later for you to look at, if you like -
you should be able to spot what I'm missing :)...

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
From: "H. Peter Anvin" <hpa(a)zytor.com>
Date: Thu, Aug 12, 2010 at 11:47:55AM -0400

> On 08/12/2010 08:47 AM, Borislav Petkov wrote:
> >>
> >> Agreed. I do have a concern about the kernel page tables not being
> >> synchronized into trampoline_pg_dir (it's only an issue for 32-bit
> >> !PAE), so at the very least we need to keep an eye out for it...
> >
> > I've been working on what you suggested and the patch boots on a 32-bit
> > PAE kernel but keeps rebooting on non-PAE after setting the stack_start
> > on the AP. I could send it out later for you to look at, if you like -
> > you should be able to spot what I'm missing :)...
> >
>
> Sounds good. This is obviously .37-or-higher material.

Definitely, this needs a lot of hammering before going upstream.

So here's what I got so far, I went and changed everything called
swapper_* when used to prep the initial page table to initial_*. Then
in setup_arch I switch to the swapper_pg_dir after copying the kernel
address mappings into it so all other places touching swapper_pg_dir can
remain unchanged. Then, initial_page_table is still used for booting the
APs.

The nice effect is that we drop all that swsusp_pg_dir for ACPI_SLEEP
and unifying the do_boot_cpu() part of start_secondary() even more.

So this boots fine on 32-bit PAE but reboots the machine on !PAE when
the AP does

/* Set up the stack pointer */
lss stack_start,%esp

pushl $0
popfl

in head_32.S. I've debugged this by adding endless loops in the manner of

@@@ -274,6 -274,6 +274,7 @@@ ENTRY(startup_32_smp
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
+ movl $0xdeadbeef, %ebx
#endif /* CONFIG_SMP */

so that this happens only on the AP and then did

+667:
+ cmpl $0xdeadbeef, %ebx
+ je 667b
+

So when the loop is before the "pushl.. popfl" sequence above, the
BSP would print "CPU stuck??" for the AP and continue booting by not
bringing the AP online. If I put the endless loop after that sequence,
the machine reboots. This is also nicely reproducible in kvm for even
faster and safer experimenting with the code.


Thanks.

--
diff --git a/arch/x86/include/asm/pgtable_32.h b/arch/x86/include/asm/pgtable_32.h
index 2984a25..8abde9e 100644
--- a/arch/x86/include/asm/pgtable_32.h
+++ b/arch/x86/include/asm/pgtable_32.h
@@ -26,6 +26,7 @@ struct mm_struct;
struct vm_area_struct;

extern pgd_t swapper_pg_dir[1024];
+extern pgd_t initial_page_table[1024];

static inline void pgtable_cache_init(void) { }
static inline void check_pgt_cache(void) { }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 7f3eba0..169be89 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -172,6 +172,4 @@ static inline void flush_tlb_kernel_range(unsigned long start,
flush_tlb_all();
}

-extern void zap_low_mappings(bool early);
-
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index fcc3c61..48d89d6 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -12,6 +12,11 @@
#include <asm/segment.h>
#include <asm/desc.h>

+#ifdef CONFIG_X86_32
+#include <asm/pgtable.h>
+#include <asm/pgtable_32.h>
+#endif
+
#include "realmode/wakeup.h"
#include "sleep.h"

@@ -90,7 +95,7 @@ int acpi_save_state_mem(void)

#ifndef CONFIG_64BIT
header->pmode_entry = (u32)&wakeup_pmode_return;
- header->pmode_cr3 = (u32)(swsusp_pg_dir - __PAGE_OFFSET);
+ header->pmode_cr3 = (u32)(initial_page_table - __PAGE_OFFSET);
saved_magic = 0x12345678;
#else /* CONFIG_64BIT */
header->trampoline_segment = setup_trampoline() >> 4;
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 37c3d4b..80b0961 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -177,7 +177,7 @@ default_entry:
#ifdef CONFIG_X86_PAE

/*
- * In PAE mode swapper_pg_dir is statically defined to contain enough
+ * In PAE mode initial_page_table is statically defined to contain enough
* entries to cover the VMSPLIT option (that is the top 1, 2 or 3
* entries). The identity mapping is handled by pointing two PGD
* entries to the first kernel PMD.
@@ -191,7 +191,7 @@ default_entry:
xorl %ebx,%ebx /* %ebx is kept at zero */

movl $pa(__brk_base), %edi
- movl $pa(swapper_pg_pmd), %edx
+ movl $pa(initial_pg_pmd), %edx
movl $PTE_IDENT_ATTR, %eax
10:
leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
@@ -220,14 +220,14 @@ default_entry:
movl %eax, pa(max_pfn_mapped)

/* Do early initialization of the fixmap area */
- movl $pa(swapper_pg_fixmap)+PDE_IDENT_ATTR,%eax
- movl %eax,pa(swapper_pg_pmd+0x1000*KPMDS-8)
+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
+ movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
#else /* Not PAE */

page_pde_offset = (__PAGE_OFFSET >> 20);

movl $pa(__brk_base), %edi
- movl $pa(swapper_pg_dir), %edx
+ movl $pa(initial_page_table), %edx
movl $PTE_IDENT_ATTR, %eax
10:
leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
@@ -251,8 +251,8 @@ page_pde_offset = (__PAGE_OFFSET >> 20);
movl %eax, pa(max_pfn_mapped)

/* Do early initialization of the fixmap area */
- movl $pa(swapper_pg_fixmap)+PDE_IDENT_ATTR,%eax
- movl %eax,pa(swapper_pg_dir+0xffc)
+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
+ movl %eax,pa(initial_page_table+0xffc)
#endif
jmp 3f
/*
@@ -328,7 +328,7 @@ ENTRY(startup_32_smp)
/*
* Enable paging
*/
- movl $pa(swapper_pg_dir),%eax
+ movl $pa(initial_page_table),%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $X86_CR0_PG,%eax
@@ -615,16 +615,18 @@ ENTRY(initial_code)
__PAGE_ALIGNED_BSS
.align PAGE_SIZE_asm
#ifdef CONFIG_X86_PAE
-swapper_pg_pmd:
+initial_pg_pmd:
.fill 1024*KPMDS,4,0
#else
-ENTRY(swapper_pg_dir)
+ENTRY(initial_page_table)
.fill 1024,4,0
#endif
-swapper_pg_fixmap:
+initial_pg_fixmap:
.fill 1024,4,0
ENTRY(empty_zero_page)
.fill 4096,1,0
+ENTRY(swapper_pg_dir)
+ .fill 1024,4,0

/*
* This starts the data section.
@@ -633,20 +635,20 @@ ENTRY(empty_zero_page)
__PAGE_ALIGNED_DATA
/* Page-aligned for the benefit of paravirt? */
.align PAGE_SIZE_asm
-ENTRY(swapper_pg_dir)
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
+ENTRY(initial_page_table)
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
# if KPMDS == 3
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR+0x1000),0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR+0x2000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
# elif KPMDS == 2
.long 0,0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR+0x1000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
# elif KPMDS == 1
.long 0,0
.long 0,0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
# else
# error "Kernel PMDs should be 1, 2 or 3"
# endif
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b4ae4ac..ba19c9e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -727,6 +727,15 @@ void __init setup_arch(char **cmdline_p)
int k8 = 0;

#ifdef CONFIG_X86_32
+ /* Copy kernel address range */
+ clone_pgd_range(swapper_pg_dir + KERNEL_PGD_BOUNDARY,
+ initial_page_table + KERNEL_PGD_BOUNDARY,
+ min_t(unsigned long, KERNEL_PGD_PTRS,
+ KERNEL_PGD_BOUNDARY));
+
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+
memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
visws_early_detect();
#else
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c4f33b2..d1bd555 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -73,7 +73,6 @@

#ifdef CONFIG_X86_32
u8 apicid_2_node[MAX_APICID];
-static int low_mappings;
#endif

/* State of each CPU */
@@ -300,8 +299,7 @@ notrace static void __cpuinit start_secondary(void *unused)
}

#ifdef CONFIG_X86_32
- while (low_mappings)
- cpu_relax();
+ load_cr3(swapper_pg_dir);
__flush_tlb_all();
#endif

@@ -894,20 +892,7 @@ int __cpuinit native_cpu_up(unsigned int cpu)

per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;

-#ifdef CONFIG_X86_32
- /* init low mem mapping */
- clone_pgd_range(swapper_pg_dir, swapper_pg_dir + KERNEL_PGD_BOUNDARY,
- min_t(unsigned long, KERNEL_PGD_PTRS, KERNEL_PGD_BOUNDARY));
- flush_tlb_all();
- low_mappings = 1;
-
err = do_boot_cpu(apicid, cpu);
-
- zap_low_mappings(false);
- low_mappings = 0;
-#else
- err = do_boot_cpu(apicid, cpu);
-#endif
if (err) {
pr_debug("do_boot_cpu failed %d\n", err);
return -EIO;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index bca7909..adad48b 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -548,48 +548,6 @@ static void __init pagetable_init(void)
permanent_kmaps_init(pgd_base);
}

-#ifdef CONFIG_ACPI_SLEEP
-/*
- * ACPI suspend needs this for resume, because things like the intel-agp
- * driver might have split up a kernel 4MB mapping.
- */
-char swsusp_pg_dir[PAGE_SIZE]
- __attribute__ ((aligned(PAGE_SIZE)));
-
-static inline void save_pg_dir(void)
-{
- memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
-}
-#else /* !CONFIG_ACPI_SLEEP */
-static inline void save_pg_dir(void)
-{
-}
-#endif /* !CONFIG_ACPI_SLEEP */
-
-void zap_low_mappings(bool early)
-{
- int i;
-
- /*
- * Zap initial low-memory mappings.
- *
- * Note that "pgd_clear()" doesn't do it for
- * us, because pgd_clear() is a no-op on i386.
- */
- for (i = 0; i < KERNEL_PGD_BOUNDARY; i++) {
-#ifdef CONFIG_X86_PAE
- set_pgd(swapper_pg_dir+i, __pgd(1 + __pa(empty_zero_page)));
-#else
- set_pgd(swapper_pg_dir+i, __pgd(0));
-#endif
- }
-
- if (early)
- __flush_tlb();
- else
- flush_tlb_all();
-}
-
pteval_t __supported_pte_mask __read_mostly = ~(_PAGE_NX | _PAGE_GLOBAL | _PAGE_IOMAP);
EXPORT_SYMBOL_GPL(__supported_pte_mask);

@@ -958,9 +916,6 @@ void __init mem_init(void)

if (boot_cpu_data.wp_works_ok < 0)
test_wp_bit();
-
- save_pg_dir();
- zap_low_mappings(true);
}

#ifdef CONFIG_MEMORY_HOTPLUG



--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
From: Joerg Roedel <joro(a)8bytes.org>
Date: Thu, Aug 12, 2010 at 03:04:14PM -0400

> On Thu, Aug 12, 2010 at 12:01:27PM -0700, H. Peter Anvin wrote:
> > On 08/12/2010 10:06 AM, Joerg Roedel wrote:
>
> > > The synchronization problem might be real. The cpu_init function uses
> > > per-cpu variables which might not be present in the early synced
> > > tramponline page table (iirc). The change to swapper_pg_dir and the
> > > global tlb flush should probably be moved to the very beginning of the
> > > start_secondary function.
> > >
> >
> > Are you going to submit an updated patch, then?
>
> Yes, I will send it next week when I am back in office and gave it some
> testing.

currently testing...

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/