From: randyhyde on
On Mar 8, 4:42 pm, "Guga" <Guga...(a)gmail.com> wrote:
> On Mar 8, 4:06 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
> wrote:
>
> > On Mar 8, 5:35 pm, "Guga" <Guga...(a)gmail.com> wrote:
>
> > > Hi guys
>
> > > I'm looking for a table or any documentation that contains the
> > > clock cycles (and instruction length) of the mnemonics related to
> > > Packed Data, like: ADDPD, ADDPS, CVTTPD2DQ etc.
>
> > > Does someone have a link containing that kind of information?
>
> > The "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
> > has some of that. Appendix C includes a lot of latency and throughput
> > information.
>
> > http://www.intel.com/design/processor/manuals/248966.pdf
>
> Thanks Robert, this seems to be what I was looking for. It has the
> latency for different processors (Core Duo, Pentium M, etc.)
>
> One question: this document contains a table of "THROUGHPUT",
> but I'm unfamiliar with this word. What does "THROUGHPUT" mean in
> English?
>
> Best Regards,
>
> Guga


Okay, now I've read the document. I was pretty much correct. Here's
how Intel defined throughput:

Throughput - The number of clock cycles required to wait before the
issue ports are free to accept the same instruction again.

IOW, it tells you how many instructions of the same kind can execute
per given time. Intel states it as a period (clocks per instruction),
which is the reciprocal of a rate, but it carries the same information.
Cheers,
Randy Hyde

From: Guga on
On Mar 8, 5:12 pm, "randyh...(a)earthlink.net" <randyh...(a)earthlink.net>
wrote:
> Okay, now I've read the document. I was pretty much correct. Here's
> how Intel defined throughput:
>
> Throughput - The number of clock cycles required to wait before the
> issue ports are free to accept the same instruction again.
>
> IOW, it tells you how many instructions of the same kind can execute
> per given time. Intel states it as a period (clocks per instruction),
> which is the reciprocal of a rate, but it carries the same information.
> Cheers,
> Randy Hyde


Ok, so can we say that latency + throughput is the number of clock
cycles those instructions take to execute? For example, the document
says this (pg 443):

CVTTPD2DQ xmm, xmm latency = 10, throughput = 2 (for 0F3n CPUIds)

So, the total number of cycles this mnemonic takes is 12?

Best Regards,

Guga

From: Guga on
On Mar 8, 5:25 pm, "Guga" <Guga...(a)gmail.com> wrote:
> Ok, so can we say that latency + throughput is the number of clock
> cycles those instructions take to execute? For example, the document
> says this (pg 443):
>
> CVTTPD2DQ xmm, xmm latency = 10, throughput = 2 (for 0F3n CPUIds)
>
> So, the total number of cycles this mnemonic takes is 12?
>
> Best Regards,
>
> Guga


"Throughput - The number of clock cycles required to wait before the
issue ports are free to accept the same instruction again. "

Reading your post again, I'm a bit confused.

So it is not 12, it is 20 (10 cycles of latency * 2 throughput)?

From: robertwessel2 on
On Mar 8, 7:25 pm, "Guga" <Guga...(a)gmail.com> wrote:
> Ok, so can we say that latency + throughput is the number of clock
> cycles those instructions take to execute? For example, the document
> says this (pg 443):
>
> CVTTPD2DQ xmm, xmm latency = 10, throughput = 2 (for 0F3n CPUIds)
>
> So, the total number of cycles this mnemonic takes is 12?


No. It means that if you issue a CVTTPD2DQ with the appropriate
functional unit ready for more work, it'll finish in 10 clocks. You
can issue additional CVTTPD2DQs every two clocks (the throughput)
without stalling things, and the results will pop out the other end
every two clocks, but delayed (the latency) 10 clocks from when they
entered the functional unit. Obviously, being able to pipeline five
CVTTPD2DQs requires that they have no dependencies that would cause
them to stall, and that nothing else in the instruction stream causes
any stalls or prevents the appropriate reorderings and whatnot.
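
To make the issue/finish timing above concrete, here is a rough Python sketch. It assumes an idealized model that is not taken from the Intel manual: a single fully pipelined functional unit, independent instructions, and no stalls. The name `timeline` is made up for illustration; the latency of 10 and throughput of 2 are the CVTTPD2DQ figures quoted in this thread.

```python
# Idealized model only: one pipelined unit, independent instructions, no stalls.
LATENCY = 10     # clocks from issue until the result is available
THROUGHPUT = 2   # clocks to wait before the same instruction can issue again


def timeline(count, latency=LATENCY, throughput=THROUGHPUT):
    """Return (issue_clock, finish_clock) pairs for back-to-back instructions."""
    return [(i * throughput, i * throughput + latency) for i in range(count)]


# Five instructions fit in flight at once, since latency / throughput = 5.
for n, (issued, done) in enumerate(timeline(5), start=1):
    print(f"CVTTPD2DQ #{n}: issued at clock {issued}, result at clock {done}")
```

Under those assumptions the issues land on clocks 0, 2, 4, 6, 8 and the results on clocks 10, 12, 14, 16, 18: one result every two clocks, each delayed 10 clocks from its own issue.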

So you could issue one CVTTPD2DQ and it'll finish in 10 clocks. Or
you could issue 10 (assuming nothing stalls), and they'll finish in 28
clocks. Or if you stream 1000, they'll finish in 2008 clocks (closely
approaching the maximum possible throughput).
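
The arithmetic behind those three figures can be sketched as follows, under the same idealized assumptions (no dependencies, no stalls): the first instruction finishes after the full latency, and each later one finishes one throughput interval after the previous one. The helper name `total_clocks` is hypothetical.

```python
def total_clocks(count, latency=10, throughput=2):
    """Clocks until the last of `count` independent instructions finishes.

    Assumes ideal pipelining: the first result arrives after `latency`
    clocks, and every further result follows `throughput` clocks later.
    """
    if count <= 0:
        return 0
    return latency + (count - 1) * throughput


for count in (1, 10, 1000):
    clocks = total_clocks(count)
    print(f"{count} instructions: {clocks} clocks, "
          f"{clocks / count:.3f} clocks per instruction on average")
```

So 10 instructions take 10 + 9 * 2 = 28 clocks and 1000 take 10 + 999 * 2 = 2008 clocks. The latency is paid only once, which is why the average cost per instruction approaches the throughput figure of 2 for long streams.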

From: Guga on
On Mar 8, 5:53 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
wrote:
> No. It means that if you issue a CVTTPD2DQ with the appropriate
> functional unit ready for more work, it'll finish in 10 clocks. You
> can issue additional CVTTPD2DQs every two clocks (the throughput)
> without stalling things, and the results will pop out the other end
> every two clocks, but delayed (the latency) 10 clocks from when they
> entered the functional unit. Obviously, being able to pipeline five
> CVTTPD2DQs requires that they have no dependencies that would cause
> them to stall, and that nothing else in the instruction stream causes
> any stalls or prevents the appropriate reorderings and whatnot.
>
> So you could issue one CVTTPD2DQ and it'll finish in 10 clocks. Or
> you could issue 10 (assuming nothing stalls), and they'll finish in 28
> clocks. Or if you stream 1000, they'll finish in 2008 clocks (closely
> approaching the maximum possible throughput).


Thanks, but I lost the math logic. Why 28? How did you reach the
conclusion that issuing 10 CVTTPD2DQs will finish in 28 clocks?