7:47 pm
April 17, 2016
I am using a Due, which has good processing power and lots of workspace. I have a buffer array of about 1000 elements in U16 representation, stored in a global variable that is compiled into the program. A callback VI slaved to the Due's timer takes this array and performs a small number (let's call it N) of simple calculations (a multiply and an add) on a few of its elements, replacing those elements with new values. It does this within a FOR loop that runs N times, pulling the indices of the affected buffer elements from another array of size N (and the multiplier values from a mask array the same size as the buffer).
Summarizing:
o Buffer and mask arrays of size ~1000
o Much smaller array, size N, of pointer indexes
o N calculations (a multiply and an add) per loop, replacing N elements in the buffer array.
I'm finding that the loop time depends significantly on the size of the array, which it shouldn't, and in any case it takes far more time than I'd expect. Doing the calculations on about N = 30 elements takes several dozen msec. (I'm measuring the timing with an oscilloscope by toggling a digital line each time N loop iterations complete.)
From these observations I suspect that the doughty little Due must be reconstituting that big buffer array in each iteration, maybe multiple times. The Compiler has no documented capability to perform calculations in-place, so I guess this is not entirely unexpected.
A snapshot of the FOR loop is attached. My question: Is there an alternate construction that would be faster and prevent the cycle-time from depending on the size of the buffer?
Many thanks in advance for any suggestions. I've considered replacing the buffer global variable with a shift register, but I'm stuck on how to initialize it, since the First Call? subVI is not supported by the Compiler.
4:52 am
March 12, 2015
Hi Scott,
Yes, what you see is expected; timing and optimization get very tricky depending on how you access data. Much of how the Compiler handles array and memory optimization and in-placeness is discussed in the Memory Optimization section of Important Considerations in the user manual. If you haven't read that yet, I suggest you take a look.
But there are still many intricacies worth looking at to improve performance. If there is a wire branch on an array, for example, then in-placeness cannot be achieved and every iteration of a loop causes a costly memcpy. That doesn't appear to be your problem here, though. One thing you may want to try is moving the global variables inside the For loop. It sounds counterintuitive, but I think the way you have it now, the compiler has to copy that big buffer where the wire crosses into the loop and again where it leaves. Placing the global read and write inside the loop should remove these memory copies and update the buffer global array in place on each iteration. If that doesn't seem to improve things, you could upload or send your VI and I can decompile it to see where there could be extra performance hits.
Also, placing an Initialize Array at the beginning of your main VI (not in the interrupt) pre-allocates the array so you can operate on it in place (it sounds like you have done that already). Definitely do not use Build Array or auto-indexing on loops if you are going for performance. In general, as explained in the manual, there is a lot of overhead in the way arrays must be handled to support LabVIEW's dynamic memory allocation, so there can be a performance hit when using the Compiler with LabVIEW compared to writing something purely in C/C++.
Steffan
5:37 am
April 17, 2016
Thanks, Steffan,
That helped, but just a little. I've spent the day playing around to see what impacts the loop time. Among the unintuitive results of my benchmarking: using floating point representation for the buffer and mask arrays is somewhat faster than the integer representation shown in my screen-snap! I don't understand that at all.
In any case my program is still an order of magnitude slower than I need it to be. I'm currently getting about 30 usec per iteration of the FOR loop. That's 30 usec for a multiply and an add... plus all the behind-the-scenes stuff.
I need more speed but unfortunately the Due appears to be the fastest board your compiler currently supports (and it's now been retired!). The Edison would seem promising. It shows up in your Compiler's menu when installed in the Arduino IDE, so I'll be trying that next!
3:25 pm
April 17, 2016
Thanks, Steffan. I'll get the VI to you.
Meanwhile I've ordered an Edison and have played around in the Compiler with that selected as the target. Unfortunately my timing-critical application relies on attaching a callback VI to a timer. (That callback VI is what contains the loop I'm beating my head against.) And while the Due has its own timer-attached interrupt subVI and there's the Attach Timer1 Interrupt subVI for AVR targets, neither will compile for the Edison target. In the help it's noted for the latter, "This interrupt will only compile on AVR platforms" and that's the case!
So a timer-attached interrupt seems like something the Edison needs. It looks like a really impressive platform, by the way.
Meanwhile I'll hold out hope for the Due and will post the VI to you. Thanks again!