From: Charles Gardiner on
Hi Frank,

> The way it works is as follows:
> - the application allocates the memory (malloc).
> - a pointer to this memory is passed to the driver (custom made
> driver).
> - the driver creates a scatter-gather list by using the
> GetScatterGatherList method from the DMA_ADAPTER object.

You are aware of the following text from the Microsoft WDK documentation? Particularly the
first line.

>>
GetScatterGatherList is not a system routine that can be called directly by name.
This routine is callable only by pointer from the address returned in a
DMA_OPERATIONS structure. Drivers obtain the address of this routine by calling
IoGetDmaAdapter.

As soon as the appropriate DMA channel and any necessary map registers are
available, GetScatterGatherList creates a scatter/gather list, initializes the map
registers, and then calls the driver-supplied AdapterListControl routine to carry
out the I/O operation.

GetScatterGatherList combines the actions of the AllocateAdapterChannel and
MapTransfer routines for drivers that perform scatter/gather DMA.
GetScatterGatherList determines how many map registers are required for the
transfer, allocates the map registers, maps the buffers for DMA, and fills in the
scatter/gather list. It then calls the supplied AdapterListControl routine,
passing a pointer to the scatter/gather list in ScatterGather. The driver should
retain this pointer for use when calling PutScatterGatherList. Note that
GetScatterGatherList does not have the queuing restrictions that apply to
AllocateAdapterChannel.

In its AdapterListControl routine, the driver should perform the I/O. On return
from the driver-supplied routine, GetScatterGatherList keeps the map registers but
frees the DMA adapter structure. The driver must call PutScatterGatherList (which
flushes the buffers) before it can access the data in the buffer.
>>
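
To make the documentation text above concrete, here is a rough user-space sketch of what the body of an AdapterListControl routine typically boils down to: walking the scatter/gather elements and handing each address/length pair to the device. The types `sg_element`, `sg_list` and `fpga_descriptor` and the function `program_descriptors` are stand-ins invented for this sketch; a real driver would use SCATTER_GATHER_LIST from wdm.h and write to mapped device registers rather than to an ordinary array, and the descriptor layout of the FPGA is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the WDK scatter/gather types (assumption:
 * only the address and length fields matter for this sketch). */
typedef struct {
    uint64_t address;   /* bus/physical address of one fragment */
    uint32_t length;    /* fragment length in bytes */
} sg_element;

typedef struct {
    uint32_t number_of_elements;
    const sg_element *elements;
} sg_list;

/* Hypothetical FPGA descriptor slot: the address split into two 32-bit
 * registers plus a length register. */
typedef struct {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t length;
} fpga_descriptor;

/* Copy each scatter/gather element into the device's descriptor table.
 * Returns the number of descriptors written, or 0 if the table is too
 * small to hold the whole list (nothing is written in that case). */
size_t program_descriptors(const sg_list *list,
                           fpga_descriptor *table, size_t table_slots)
{
    assert(list != NULL && table != NULL);

    if (list->number_of_elements > table_slots)
        return 0;

    for (uint32_t i = 0; i < list->number_of_elements; i++) {
        const sg_element *e = &list->elements[i];
        table[i].addr_lo = (uint32_t)(e->address & 0xFFFFFFFFu);
        table[i].addr_hi = (uint32_t)(e->address >> 32);
        table[i].length  = e->length;
    }
    return list->number_of_elements;
}
```

Keeping the upper 32 address bits in their own register lets the FPGA decide per descriptor whether a 64-bit address is involved.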

> - the driver writes each entry of the scatter-gather list (which
> contains a physical address and length) to the FPGA.
> - the FPGA receives data (through another interface) and writes this
> data to the memory of the PC by use of DMA (it just generates write
> requests).
> - after writing the data the FPGA generates an interrupt over PCIe (not
> working yet, but we know when the FPGA finished a transaction).
>
> I now understand I have to verify at runtime whether the physical address is
> below or above 4 GB and use a 3 DW or 4 DW TLP header respectively. I
> will change that in the FPGA and give it a try.
>
> About the addresses, these are correct. We did the following test:
> write the virtual memory from the application and read the memory by
> using the physical addresses in the driver. In the driver we read what
> the application has written.
>
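
A minimal sketch of the 3 DW / 4 DW rule mentioned above, written in plain C so it can be tried off-target. The function names are made up for illustration, and the real decision would of course live in the FPGA's HDL; the Fmt encodings for memory writes (0b010 with a 3 DW header, 0b011 with a 4 DW header) are the ones from the PCIe base specification.

```c
#include <assert.h>
#include <stdint.h>

/* Fmt field values for Memory Write requests (requests with data). */
enum { FMT_MWR_3DW = 0x2, FMT_MWR_4DW = 0x3 };

/* An address needs the 64-bit (4 DW) format only if any of the upper
 * 32 bits are set; otherwise the 32-bit (3 DW) format must be used. */
int needs_4dw_header(uint64_t bus_addr)
{
    return (bus_addr >> 32) != 0;
}

/* Number of header DWORDs for a memory write to bus_addr. */
unsigned mwr_header_dwords(uint64_t bus_addr)
{
    return needs_4dw_header(bus_addr) ? 4u : 3u;
}

/* Fmt field to place in DW0 of the memory write header. */
unsigned mwr_fmt(uint64_t bus_addr)
{
    return needs_4dw_header(bus_addr) ? FMT_MWR_4DW : FMT_MWR_3DW;
}
```

The check has to be made per scatter/gather element, since one buffer can have pages on both sides of the 4 GB line.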

> Any other suggestions?

If you are convinced the addresses are correct, I would look at two other things.

1) Is your driver completing the request properly with IoCompleteRequest()?

2) Is the data being cached somewhere? Here, I would try a zero-length read (from
the driver; a PCIe TLP with length 1 and all BEs zero) on the last address
transferred to memory. Just discard the resulting completion. The PCIe spec says
the system must interpret this as a flush.

By the way, which buffering method is your driver using for the DMA transfer
(Buffered, Direct, Neither)?

>
> Frank
From: Frank van Eijkelenburg on
On Jul 2, 9:41 am, Charles Gardiner <charles.gardi...(a)invalid.invalid>
wrote:
> [...]

Hi Charles,

I am not sure if we understand each other. What do you mean by
completing the request with IoCompleteRequest? There is no request
from software point of view. The FPGA will do a DMA write (data from
FPGA to PC memory) at its own initiative. The allocated memory is used
as long as the software is running. I do not allocate new memory for
each new DMA transfer, but at startup a large piece of memory is
allocated and the physical addresses are written to the FPGA by the
driver software.

And yes, we use a DMA adapter in combination with the
GetScatterGatherList method. We already used this in another project
but that was PCI and DMA read (data from PC memory to FPGA).

By the way, where can I set the type of DMA?

best regards,

Frank
From: Charles Gardiner on
Hi Frank,

>
> I am not sure if we understand each other.

Yes, it certainly sounds like that.

> What do you mean by
> completing the request with IoCompleteRequest? There is no request
> from software point of view.

I think this might clear up the reason why your data is missing. (See also below
about the type of DMA). I don't think the S/G list you are getting is describing
your application buffer. This is best done by specifying DO_DIRECT_IO as the DMA
method for your device. If you specify DO_BUFFERED_IO you will get an S/G List
describing an intermediate buffer in kernel space and this probably never gets
copied over to your application space buffer unless you terminate the request.
I've never done the 'neither' method myself and from what I hear, it's a
complicated beast.

> The FPGA will do a DMA write (data from
> FPGA to PC memory) at its own initiative. The allocated memory is used
> as long as the software is running. I do not allocate new memory for
> each new DMA transfer, but at startup a large piece of memory is
> allocated and the physical addresses are written to the FPGA by the
> driver software.

Sounds like you are doing something like a circular buffer in memory which stays
alive as long as your device does?

>
> And yes, we use a DMA adapter in combination with the
> GetScatterGatherList method. We already used this in another project
> but that was PCI and DMA read (data from PC memory to FPGA).
>
> By the way, where can I set the type of DMA?

Typically, you set the DMA buffering method in your AddDevice function after you
create your device object. Quoting from Oney's book,

NTSTATUS AddDevice(..) {
    PDEVICE_OBJECT fdo;

    IoCreateDevice(....., &fdo);
    fdo->Flags |= DO_BUFFERED_IO;
    <or>
    fdo->Flags |= DO_DIRECT_IO;
    <or>
    fdo->Flags |= 0; // i.e. neither Direct nor Buffered
    ...
}

And, you can't change your mind afterwards.
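
As a small illustration of how those flags select the buffering method, here is a user-space model (not driver code). The two flag values are defined locally and are meant to mirror the ones in wdm.h as far as I remember them; treat them as an assumption and use the real header in a driver. The enum and function name are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies for this model only (assumed to match wdm.h). */
#define DO_BUFFERED_IO 0x00000004u
#define DO_DIRECT_IO   0x00000010u

typedef enum { IO_NEITHER, IO_BUFFERED, IO_DIRECT } io_method;

/* Decode the buffering method from a device object's Flags value.
 * "Neither" is simply the absence of both flags. */
io_method buffering_method(uint32_t device_flags)
{
    const uint32_t both = DO_BUFFERED_IO | DO_DIRECT_IO;

    /* The two flags are mutually exclusive. */
    assert((device_flags & both) != both);

    if (device_flags & DO_DIRECT_IO)
        return IO_DIRECT;
    if (device_flags & DO_BUFFERED_IO)
        return IO_BUFFERED;
    return IO_NEITHER;
}
```

Note that `Flags |= 0` changes nothing, so "neither" is what you get when no buffering flag is set at all.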


By the way, if my assumption about the circular buffer in your design is correct,
there is a slightly more standard solution (standard in the sense that everybody
on the Microsoft drivers newsgroup seems to do it). It does, however, require two
threads in your application. The first one requests a buffer (using new or malloc)
and sets up an I/O request (ReadFile, WriteFile or DeviceIoControl) referencing
this buffer. This is performed as an asynchronous request.

The driver recognises this request and pends it indefinitely (typically you
terminate it when your driver is shutting down, otherwise Windows will probably
hang). Pending the request has the nice side effect that the buffer now becomes
locked down permanently.

Assuming you have set up your driver to use DO_DIRECT_IO DMA, you can get the S/G
list describing the application space buffer as you are currently doing and feed
this to your FPGA.

Using the second thread in your application, you can constantly read data from the
locked-down pages (your application-space buffer) that are being written by your FPGA.
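
A rough sketch of what that second thread's read loop could look like if the buffer is treated as a ring. It assumes the application can learn the device's current write position somehow (for example via a register or an IOCTL); that mechanism, the `ring_reader` type and the `ring_read` function are all hypothetical and only serve to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reader-side view of the locked-down buffer the FPGA writes into. */
typedef struct {
    const volatile uint8_t *base;  /* start of the application buffer */
    size_t size;                   /* buffer size in bytes */
    size_t read_pos;               /* next byte the application will read */
} ring_reader;

/* Copy out the bytes between read_pos and the device's write position,
 * wrapping at the end of the buffer. Stops after 'max' bytes.
 * Returns the number of bytes copied. */
size_t ring_read(ring_reader *r, size_t write_pos, uint8_t *out, size_t max)
{
    assert(r != NULL && out != NULL);
    assert(r->size > 0 && write_pos < r->size);

    size_t n = 0;
    while (r->read_pos != write_pos && n < max) {
        out[n++] = r->base[r->read_pos];
        r->read_pos = (r->read_pos + 1) % r->size;
    }
    return n;
}
```

The `volatile` qualifier is there because the memory changes behind the compiler's back; it does not by itself solve ordering between the data and the write position.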


Assuming DO_DIRECT_IO solves your problem (I think there is a good chance), I
would still consider migrating to a KMDF-based driver, particularly if
you are writing a new one. It's much easier to maintain and is probably more
portable to future MS versions.

>
> best regards,
>
> Frank

best regards,
Charles
From: Nico Coesel on
Frank van Eijkelenburg <fei.technolution(a)gmail.com> wrote:

>On Jul 2, 2:19 am, Charles Gardiner <charles.gardi...(a)invalid.invalid>
>wrote:
>> Frank van Eijkelenburg wrote:
>>
>> > Hi,
>>
>> > I have a custom made PCIe board with a Virtex 5 FPGA on which I
>> > implemented a DMA unit which uses the PCIe endpoint block plus v1.14.
>> > I also implemented simple read/write operations from the PC to the
>> > board (the board responds with completion TLPs). The read/write
>> > operations are working, but DMA is not working.
>>
>> > The board is inserted in a PC running 64-bit Windows 7. An
>> > application allocates virtual memory and passes the memory block to
>> > the driver. The driver locks the memory and converts the virtual
>> > addresses into physical addresses. These physical addresses are
>> > written to the FPGA.
>>
>> How are you doing this? Normally, an application requests a buffer using
>> malloc() or new() and gets a handle to the driver using CreateFile(). You
>> then use WriteFile(hDevice, Buffer,...), ReadFile(hDevice, Buffer,....) or
>> DeviceIoControl() to initiate a transfer to/from the device. That's the
>> application side.
>>
>> On the driver (kernel) side, I would strongly recommend that you write a
>> KMDF-based driver. Download the Windows WDK; all it costs is your email.
>> (You have to log in over Microsoft Connect, last time I looked.) There are
>> lots of examples there, including for PCI(e)-based DMA. To (very quickly)
>> summarise, your driver requests the scatter/gather list describing the
>> buffers above (see WdfDmaTransactionInitializeUsingRequest() in the WDK API
>> docs as a starting point) and passes these to your hardware one by one,
>> which then does DMA in or out. With a call to WdfRequestComplete the buffers
>> are released by the kernel and your application can reuse them or free them
>> up as required. (This is of course all considerably more than a day's work,
>> by the way.)
>>
>> You do not have to explicitly lock down the buffer yourself. Windows does
>> this for you while the I/O request is active (from Read/WriteFile in your
>> app up to WdfRequestComplete in the driver).
>>
>>
>>
>> > When I start a DMA operation, I can see in ChipScope the correct
>> > physical addresses in the TLP header. However, I do not see the
>> > correct values in the allocated memory. What can I do to check where
>> > it is going wrong?
>>
>> In this case, I would first doubt whether the addresses are correct.
>>
>> > Another question is about the memory request TLPs. What should I use,
>> > 32- or 64-bit write requests? Or do I have to check at runtime whether the
>> > physical memory address is below or above 4 GB (and use
>> > 32- and 64-bit requests respectively)?
>>
>> The PCIe spec says: a transfer below 4 GB must use a 3 DWord header, a
>> transfer above 4 GB must use a 4 DWord header, i.e. a four-DWord header with
>> address[63:32] set to zero is invalid.
>>
>>
>>
>> > Thanks in advance,
>>
>> > Frank
>
>The way it works is as follows:
>- the application allocates the memory (malloc).
>- a pointer to this memory is passed to the driver (custom made
>driver).

I strongly doubt you can pass a malloc pointer to a driver. Actually
I'm quite sure this doesn't work. While the driver is active, the
application memory may be swapped out to the hard drive. And the pointer
must be translated to a physical address.

I'd go the other way around: have the driver allocate the memory and
pass a pointer to this memory to the application (this will require
some messing around with translation and access rights).

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico(a)nctdevpuntnl (punt=.)
--------------------------------------------------------------
From: Michael S on
On Jul 2, 11:35 pm, n...(a)puntnl.niks (Nico Coesel) wrote:
> [...]
>
> I strongly doubt you can use a malloc pointer to a driver. Actually
> I'm quite sure this doesn't work. When the driver is active, the
> application memory may be swapped to the hard-drive. And the pointer
> must be translated to a physical address.
>


Nah, malloc() at application level is OK.
If the I/O operation is specified as DIRECT_IO, then the I/O manager takes care
of locking the pages.
If the operation is specified as NEITHER, then the driver itself should call
MmProbeAndLockPages in the user's context (in this case you should never
install filter drivers between your driver and the app).
In both cases it is very important not to complete the IRP associated
with the user buffer until all DMA activity has finished.

If the I/O operation is specified as BUFFERED_IO, then the I/O manager
allocates a kernel buffer, passes it to the driver, and copies the
results from the kernel buffer to the user buffer after the driver has
completed the IRP. Obviously, BUFFERED_IO is not suitable for the OP's case,
since he wants the result back without completing the original I/O request.

> I'd go the other way around: have the driver allocate the memory and
> pass a pointer to this memory to the application (this will require
> some messing around with translation and access rights).

On a general-purpose system, allocation of big buffers by a driver is
rarely a good idea. On a system dedicated to just one task it could
be pragmatically OK, but I still don't like it from a purely theoretical
point of view.

Anyway, the discussion doesn't belong here. I recommend
http://groups.google.com/group/microsoft.public.development.device.drivers