From: randyhyde on 8 Mar 2007 20:12

On Mar 8, 4:42 pm, "Guga" <Guga...(a)gmail.com> wrote:
> On Mar 8, 4:06 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
> wrote:
>
> > On Mar 8, 5:35 pm, "Guga" <Guga...(a)gmail.com> wrote:
> >
> > > Hi guys,
> > >
> > > I'm looking for a table or any documentation that contains the
> > > clock cycles (and instruction length) of the mnemonics related to
> > > packed data, like ADDPD, ADDPS, CVTTPD2DQ, etc.
> > >
> > > Does anyone have a link to that kind of information?
> >
> > The "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
> > has some of that. Appendix C includes a lot of latency and throughput
> > information.
> >
> > http://www.intel.com/design/processor/manuals/248966.pdf
>
> Thanks Robert, this seems to be what I was looking for. It has the
> latency for different processors (Core Duo, Pentium M, etc.).
>
> One question: the document contains a table of "THROUGHPUT", but
> I'm unfamiliar with this word. What does "THROUGHPUT" mean in
> English?
>
> Best Regards,
>
> Guga

Okay, now I've read the document. I was pretty much correct. Here's
how Intel defines throughput:

  Throughput - The number of clock cycles required to wait before the
  issue ports are free to accept the same instruction again.

IOW, it tells you how many instances of the same instruction can
execute in a given time. Intel states it as a period (clocks per
instruction) rather than a rate (instructions per clock), but it's
the same information.

Cheers,
Randy Hyde
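
To make the period-versus-rate point concrete, here is a minimal Python sketch. The function name `sustained_rate` is made up for illustration, and it assumes the table's throughput figure is a whole number of clocks between issues of the same instruction, as in the definition quoted above.

```python
from fractions import Fraction


def sustained_rate(throughput_clocks: int) -> Fraction:
    """Convert a throughput figure (clocks between issues of the same
    instruction) into a rate in instructions per clock.

    Hypothetical helper: assumes a steady stream of independent
    instructions with nothing else stalling the pipeline.
    """
    if throughput_clocks < 1:
        raise ValueError("throughput must be at least 1 clock")
    return Fraction(1, throughput_clocks)


# A throughput figure of 2 clocks means at most one instruction
# every two clocks, i.e. half an instruction per clock.
print(sustained_rate(2))
```

A smaller throughput number in the table therefore means a higher sustained rate.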
From: Guga on 8 Mar 2007 20:25

On Mar 8, 5:12 pm, "randyh...(a)earthlink.net" <randyh...(a)earthlink.net>
wrote:
> Okay, now I've read the document. I was pretty much correct. Here's
> how Intel defined throughput:
>
> Throughput - The number of clock cycles required to wait before the
> issue ports are free to accept the same instruction again.
>
> IOW, it's basically the number of instructions (of the same
> instruction) that can execute per given time (though they've described
> this as a period, it's still the same thing).
> Cheers,
> Randy Hyde

OK, so can we say that latency + throughput is the number of clock
cycles those instructions take to execute? For example, the document
says this (pg 443):

CVTTPD2DQ xmm, xmm    latency = 10, throughput = 2 (for 0F3n CPUIDs)

So the total number of cycles this mnemonic takes is 12?

Best Regards,

Guga
From: Guga on 8 Mar 2007 20:41

On Mar 8, 5:25 pm, "Guga" <Guga...(a)gmail.com> wrote:
> OK, so can we say that latency + throughput is the number of clock
> cycles those instructions take to execute? For example, the document
> says this (pg 443):
>
> CVTTPD2DQ xmm, xmm    latency = 10, throughput = 2 (for 0F3n CPUIDs)
>
> So the total number of cycles this mnemonic takes is 12?
>
> Best Regards,
>
> Guga

"Throughput - The number of clock cycles required to wait before the
issue ports are free to accept the same instruction again."

Reading your post again, I'm a bit confused. So it is not 12, it is
20 (10 cycles of latency * 2 throughput)?
From: robertwessel2 on 8 Mar 2007 20:53

On Mar 8, 7:25 pm, "Guga" <Guga...(a)gmail.com> wrote:
> OK, so can we say that latency + throughput is the number of clock
> cycles those instructions take to execute? For example, the document
> says this (pg 443):
>
> CVTTPD2DQ xmm, xmm    latency = 10, throughput = 2 (for 0F3n CPUIDs)
>
> So the total number of cycles this mnemonic takes is 12?

No. It means that if you issue a CVTTPD2DQ with the appropriate
functional unit ready for more work, it'll finish in 10 clocks. You
can issue additional CVTTPD2DQs every two clocks (the throughput)
without stalling things, and the results will pop out the other end
every two clocks, but delayed (the latency) 10 clocks from when they
entered the functional unit. Obviously being able to pipeline five
CVTTPD2DQs requires that they have no dependencies which will cause
them to stall, and that nothing else in the instruction stream causes
any stalls or prevents the appropriate reorderings and whatnot.

So you could issue one CVTTPD2DQ and it'll finish in 10 clocks. Or
you could issue 10 (assuming nothing stalls), and they'll finish in 28
clocks. Or if you stream 1000, they'll finish in 2008 clocks (closely
approaching the maximum possible throughput).
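
The arithmetic behind those figures can be sketched in a few lines of Python. This is only a model of the ideal case described above: every instruction is independent, a new one issues every `throughput` clocks starting at clock 0, and each completes `latency` clocks after it issues. The function name `total_clocks` is made up for illustration.

```python
def total_clocks(count: int, latency: int, throughput: int) -> int:
    """Clocks until the last of `count` independent instructions completes.

    Idealized model (an assumption, not a measurement): instruction i
    (0-based) issues at clock i * throughput and finishes `latency`
    clocks later, with no stalls anywhere else in the pipeline.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    last_issue = (count - 1) * throughput  # issue clock of the final instruction
    return last_issue + latency


# Figures quoted in the thread for CVTTPD2DQ xmm, xmm:
# latency = 10, throughput = 2.
for n in (1, 10, 1000):
    print(n, total_clocks(n, latency=10, throughput=2))
```

With 10 instructions the last one issues at clock 18 (issues happen at 0, 2, ..., 18) and finishes 10 clocks later, at 28. With 1000 the last one issues at clock 1998 and finishes at 2008, so the latency is paid only once and the per-instruction cost approaches the 2-clock throughput figure.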
From: Guga on 8 Mar 2007 21:14

On Mar 8, 5:53 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
wrote:
> So you could issue one CVTTPD2DQ and it'll finish in 10 clocks. Or
> you could issue 10 (assuming nothing stalls), and they'll finish in 28
> clocks. Or if you stream 1000, they'll finish in 2008 clocks (closely
> approaching the maximum possible throughput).

Thanks, but I lost the logic of the math. Why 28? How did you
conclude that issuing 10 CVTTPD2DQs will finish in 28 clocks?