From: robertwessel2 on
On Mar 8, 8:14 pm, "Guga" <Guga...(a)gmail.com> wrote:
> On Mar 8, 5:53 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
> wrote:
>
> > On Mar 8, 7:25 pm, "Guga" <Guga...(a)gmail.com> wrote:
>
> > > On Mar 8, 5:12 pm, "randyh...(a)earthlink.net" <randyh...(a)earthlink.net>
> > > wrote:
>
> > > > On Mar 8, 4:42 pm, "Guga" <Guga...(a)gmail.com> wrote:
>
> > > > > > On Mar 8, 4:06 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
> > > > > wrote:
>
> > > > > > On Mar 8, 5:35 pm, "Guga" <Guga...(a)gmail.com> wrote:
>
> > > > > > > Hi guys,
>
> > > > > > > I'm looking for a table or any documentation that contains the
> > > > > > > clock cycles (and instruction length) of the mnemonics related to
> > > > > > > packed data, like ADDPD, ADDPS, CVTTPD2DQ, etc.
>
> > > > > > > Does anyone have a link to that kind of information?
>
> > > > > > The "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
> > > > > > has some of that. Appendix C includes a lot of latency and throughput
> > > > > > information.
>
> > > > > >http://www.intel.com/design/processor/manuals/248966.pdf
>
> > > > > > Thanks, Robert, this seems to be what I was looking for. It has the
> > > > > > latency for different processors (Core Duo, Pentium M, etc.).
>
> > > > > > One question: the document contains a table of "THROUGHPUT",
> > > > > > but I'm unfamiliar with this word. What does "THROUGHPUT" mean in
> > > > > > English?
>
> > > > > Best Regards,
>
> > > > > Guga
>
> > > > Okay, now I've read the document. I was pretty much correct. Here's
> > > > how Intel defined throughput:
>
> > > > > Throughput - The number of clock cycles required to wait before the
> > > > > issue ports are free to accept the same instruction again.
>
> > > > > IOW, it tells you how many instructions of the same kind can
> > > > > execute in a given time (they've described it as a period, in
> > > > > clocks per instruction, rather than as a rate, but it carries the
> > > > > same information).
> > > > Cheers,
> > > > Randy Hyde
>
> > > > Ok, so can we say that latency + throughput is the number of clock
> > > > cycles those instructions take to work? For example, the document
> > > > says this (pg 443):
>
> > > > CVTTPD2DQ xmm, xmm latency = 10, throughput = 2 (for 0F3n CPUIDs)
>
> > > > So the total number of cycles this mnemonic takes is 12?
>
> > No. It means that if you issue a CVTTPD2DQ with the appropriate
> > functional unit ready for more work, it'll finish in 10 clocks. You
> > can issue additional CVTTPD2DQs every two clocks (the throughput)
> > without stalling things, and the results will pop out the other end
> > every two clocks, but delayed (the latency) 10 clocks from when they
> > entered the functional unit. Obviously being able to pipeline five
> > CVTTPD2DQs requires that they have no dependencies which will cause
> > them to stall, and that nothing else in the instruction stream causes
> > > any stalls or prevents the appropriate reorderings and whatnot.
>
> > So you could issue one CVTTPD2DQ and it'll finish in 10 clocks. Or
> > you could issue 10 (assuming nothing stalls), and they'll finish in 28
> > clocks. Or if you stream 1000, they'll finish in 2008 clocks (closely
> > approaching the maximum possible throughput).
>
> Thanks, but I lost the math logic. Why 28? How did you reach the
> conclusion that issuing 10 CVTTPD2DQs will finish in 28 clocks?


Assuming you're issuing them continuously (IOW, every two clocks), the
first CVTTPD2DQ will finish after the tenth clock (having been issued
on the first clock), the second after the 12th (issued on the third),
the third after the 14th clock (issued on the fifth), and (skipping
numbers four through nine) the tenth will finish after the 28th clock
(having been issued on the 19th clock).
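
The schedule described above can be tabulated with a few lines of Python. This is only a sketch of the simplified model in this post (one CVTTPD2DQ issued every two clocks, each finishing 10 clocks after it entered the unit, nothing ever stalling); the function name `schedule` and its parameter names are made up for illustration and are not from the Intel manual.

```python
def schedule(n, latency=10, interval=2):
    """Return (issue_clock, finish_clock) pairs, counting clocks from 1.

    Simplified model: one instruction issues every `interval` clocks and
    each one finishes `latency` clocks after it entered the unit.
    No stalls, dependencies, or other instructions are modelled.
    """
    rows = []
    for i in range(n):
        issue = 1 + i * interval      # issued on clocks 1, 3, 5, ...
        finish = issue + latency - 1  # issued on clock 1 -> done after clock 10
        rows.append((issue, finish))
    return rows


if __name__ == "__main__":
    for k, (issue, finish) in enumerate(schedule(10), start=1):
        print(f"#{k:2d}: issued on clock {issue:2d}, finishes after clock {finish:2d}")
```

Under these assumptions the tenth instruction is issued on clock 19 and finishes after clock 28, matching the numbers above. A real CPU will rarely be this tidy.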

From: Guga on
On Mar 8, 6:32 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
wrote:
> Assuming you're issuing them continuously (IOW, every two clocks), the
> first CVTTPD2DQ will finish after the tenth clock (having been issued
> on the first clock), the second after the 12th (issued on the third),
> the third after the 14th clock (issued on the fifth), and (skipping
> numbers four through nine) the tenth will finish after the 28th clock
> (having been issued on the 19th clock).

Thanks, Robert. I think I got it.

Assuming I'm issuing them continuously, I made a simple formula that
gives the number of clock cycles those instructions take when used that
way:

Clocks = Latency+(Throughput*N-1)

N = number of instructions issued (all of the same type), like the 1000
in the example you gave.
Latency = latency of the mnemonic.
Throughput = throughput of the mnemonic.
Clocks = total number of clocks for the sequence of mnemonics issued
continuously.

Is that it?

Best Regards,

Guga

From: Guga on
On Mar 8, 6:32 pm, "robertwess...(a)yahoo.com" <robertwess...(a)yahoo.com>
wrote:
> Assuming you're issuing them continuously (IOW, every two clocks), the
> first CVTTPD2DQ will finish after the tenth clock (having been issued
> on the first clock), the second after the 12th (issued on the third),
> the third after the 14th clock (issued on the fifth), and (skipping
> numbers four through nine) the tenth will finish after the 28th clock
> (having been issued on the 19th clock).


Thanks, Robert. I think I got it.

Assuming I'm issuing them continuously, I made a simple formula that
gives the number of clock cycles those instructions take when used that
way:

Clocks = Latency+Throughput*(N-1)

N = number of instructions issued (all of the same type), like the 1000
in the example you gave.
Latency = latency of the mnemonic.
Throughput = throughput of the mnemonic.
Clocks = total number of clocks for the sequence of mnemonics issued
continuously.

Is that it?

Best Regards,

Guga
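
The formula in this post can be checked against the figures given earlier in the thread (10 instructions in 28 clocks, 1000 in 2008). The snippet below is a minimal sketch; the function names are invented for illustration, and it assumes independent instructions of a single type issued back to back with no stalls. It also shows why the parentheses around N-1 matter: with the subtraction placed outside the product, as in the earlier version of the formula, the result is too high by Throughput-1.

```python
def total_clocks(n, latency, throughput):
    """Clocks = Latency + Throughput*(N-1), for n >= 1 independent instructions."""
    if n <= 0:
        return 0
    return latency + throughput * (n - 1)


def total_clocks_misgrouped(n, latency, throughput):
    """Same terms grouped as Latency + (Throughput*N - 1), for comparison only."""
    return latency + (throughput * n - 1)


if __name__ == "__main__":
    # CVTTPD2DQ figures quoted in the thread: latency = 10, throughput = 2
    for n in (1, 10, 1000):
        print(n, total_clocks(n, 10, 2), total_clocks_misgrouped(n, 10, 2))
```

With latency 10 and throughput 2, `total_clocks` gives 10, 28 and 2008 for N = 1, 10 and 1000, while the other grouping gives 11, 29 and 2009.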

From: Guga on
On Mar 8, 4:18 pm, //\\\\o//\\\\annabee <Wanna...(a)wannabee.org> wrote:
> On Fri, 09 Mar 2007 00:35:47 +0100, Guga <Guga...(a)gmail.com> wrote:
>
> > Hi guys
>
> > I'm looking for a table or any documentation that contains the
> > clock cycles (and instruction length) of the mnemonics related to
> > packed data, like ADDPD, ADDPS, CVTTPD2DQ, etc.
>
> > Does anyone have a link to that kind of information?
>
> > Best Regards,
>
> > guga
>
> Hi Guga.
>
> Why isn't this useful?
>
> < http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_do....
>
>


Thanks, wannabee. I downloaded it. It doesn't contain the latency or
the instruction length in bytes, but it is useful. I also found this:
http://cdrom.amd.com/21860/updates/Optimization_Guide_Help/wwhelp/wwhimpl/common/html/wwhelp.htm?context=WebWorksHelpOptGuide&file=WebWorksHelpOptGuide-15-09.html

Best Regards,

Guga

From: robertwessel2 on
On Mar 8, 9:19 pm, "Guga" <Guga...(a)gmail.com> wrote:
> Thanks, Robert. I think I got it.
>
> Assuming I'm issuing them continuously, I made a simple formula that
> gives the number of clock cycles those instructions take when used that
> way:
>
> Clocks = Latency+Throughput*(N-1)
>
> N = number of instructions issued (all of the same type), like the 1000
> in the example you gave.
> Latency = latency of the mnemonic.
> Throughput = throughput of the mnemonic.
> Clocks = total number of clocks for the sequence of mnemonics issued
> continuously.
>
> Is that it?


Basically yes. One complication is that the published latencies are
often (usually) for a functional unit, and not a particular
instruction. For example, two instructions might have the same
latency and throughput but both execute only on the same functional
unit, in which case the two have to share the available throughput.
Also, dependencies are an issue: for example, the sequence "add
eax,ebx / add edx,eax" can't execute in a single cycle (even though
there is more than one integer FU) because the second instruction
cannot execute until the result of the first one is available. Then
there are other dependencies and resource limitations; for example, a
CPU might not be able to issue more than three instructions at once,
which may limit the total possible throughput. Memory accesses are
another complication. Obviously that's all quite implementation
dependent.

It's a rather complex field, and the entire Intel manual I referenced
is a good resource and fairly interesting reading (if you like
that sort of thing).
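
To put rough numbers on two of these complications, here is a toy Python model. It is only a sketch under strong assumptions: every function name is hypothetical, a dependent chain is assumed to cost one full latency per instruction, and two instruction types that share a functional unit are assumed to queue as if they were one stream. Issue-width limits, memory accesses and everything else implementation dependent are ignored.

```python
def independent_clocks(n, latency, throughput):
    """n independent instructions streamed through one pipelined unit."""
    return latency + throughput * (n - 1) if n > 0 else 0


def dependent_chain_clocks(n, latency):
    """Each instruction needs the previous result, so nothing overlaps."""
    return n * latency


def shared_unit_clocks(n_a, n_b, latency, throughput):
    """Two instruction types with the same latency/throughput on ONE unit.

    In this toy model they share the available throughput, so they
    behave like a single stream of n_a + n_b instructions.
    """
    return independent_clocks(n_a + n_b, latency, throughput)


if __name__ == "__main__":
    print(independent_clocks(10, 10, 2))     # ten independent CVTTPD2DQs
    print(dependent_chain_clocks(10, 10))    # ten that each feed the next
    print(dependent_chain_clocks(2, 1))      # "add eax,ebx / add edx,eax"
    print(shared_unit_clocks(5, 5, 10, 2))   # 5 + 5 sharing one unit
```

In this model ten independent instructions take 28 clocks, a fully dependent chain of ten takes 100, and the two dependent adds (assuming a one-clock add) take two clocks rather than one. Real hardware will differ, so treat the numbers as an illustration of the reasoning, not a prediction.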