Working with large volumes of data in Igor

From thomas.braun at virtuell-zuhause.de Thu May 23 02:51:23 2019
From: thomas.braun at virtuell-zuhause.de (Thomas Braun)
Date: Thu, 23 May 2019 11:51:23 +0200

Subject: Working with large volumes of data in Igor
In-Reply-To: <CAGjoBTUaSjy7z6OaAv4AXXt=+Jaj9bQ_uff63GMFN84nG0dm8w@mail.gmail.com>
References: <CAGjoBTUaSjy7z6OaAv4AXXt=+Jaj9bQ_uff63GMFN84nG0dm8w@mail.gmail.com>
Message-ID: <33f5b1fe-fbac-bd9f-4be7-2d72b4e78e4a@virtuell-zuhause.de>

On 22.05.2019 at 00:22, Albert Aumentado wrote:

Hi Albert,

> This is a somewhat broad question but I wonder if anyone in the Igor
> community has been routinely working with large volumes of data in Igor. I
> am mostly concerned with speeding up processes in situations involving
> large numbers of waves or traces.
>
> The type of scenario I am speaking about is working with 50000 waves or
> having 10000 traces in a graph. My previous experience was working with
> maybe 100x fewer objects. As I scaled up the amount of data to analyze, I
> realized a lot of my routines and the visualization tools were starting to
> lag. I have gone through some O-notation analysis as well as the function
> profiler but have reached the end of my current programming knowledge.

We are working with lots of traces (~5k) in IP8 on Windows.

Getting this to perform well is tricky. A couple of things to consider
(thanks to WM support for insight as well):
- Don't access a single trace when building up the graph. This is
currently quite expensive to do. Just use AppendToGraph.
- Make the trace names maximally different. I'm using a counter to name
them like "T1...", "T2...". This speeds up accessing them (if you have to).
- Only plot a subset of the data. "Display wave[0,*;16]" would only
plot every 16th point. Depending on your data this might give you a
false impression of the data though.
- Check if you really need to use double as data type for the waves
shown. Maybe float is enough as well.
- If you are comfortable with just lines style in the graph and no
opacity, look into the live mode flag of Display
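
A minimal sketch of the first three points, assuming the data waves are
collected in a wave reference wave. The function name PlotManyWaves and
the graph name ManyTraces are made up for illustration, and the
decimation factor of 16 is just the value from the example above:

```igor
// Sketch only: PlotManyWaves and ManyTraces are hypothetical names.
Function PlotManyWaves(waves)
	WAVE/WAVE waves   // wave reference wave holding the 1D data waves

	variable i
	variable numWaves = DimSize(waves, 0)
	string traceName

	Display/N=ManyTraces   // empty graph, traces are appended below
	for(i = 0; i < numWaves; i += 1)
		WAVE w = waves[i]
		traceName = "T" + num2istr(i)   // counter-based trace names
		// Append only every 16th point; no per-trace access afterwards
		AppendToGraph/TN=$traceName w[0,*;16]
	endfor
End
```

Any per-trace styling is deliberately left out here, as touching
individual traces while the graph is being built is the expensive part.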

We currently don't have to deal with a huge number of waves, but some
general things to consider:
- The problems with accessing waves from datafolders with lots of waves
are gone since IP7. At [1] I've posted a code snippet and a graph which
shows the access times for 10k waves with 100k waves in one datafolder.
These are usually around 7e-6s.
- The same double vs. float advice as above applies.
- Does your analysis code have more than linear complexity with respect
to the number of waves?
- Use separate threads for crunching the data. We are using a wrapper
[2] around the Igor threading tools which lets you skip the low
level fiddling. I can share the code if you are interested.
- Use free waves for analysis code if possible. Although adding a global
wave is quite cheap nowadays it still has some cost.
- Depending on the expression either MultiThread statements or MatrixOP
is faster. This also depends on the number of cores your CPU has.
- Regarding speeding up calculations the index waves introduced in IP8
can help to avoid even more explicit for loops. See DisplayHelpTopic
"Indexing with an index wave".
- For function profiling you can also use BeginFunctionProfiling() and
EndFunctionProfiling() so you don't have to use the panel.
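
A minimal sketch combining free waves, MultiThread, MatrixOP and the
programmatic profiling calls. The function names, the wave size and the
sqrt(abs(...)) expression are assumptions chosen for illustration only;
which of the two variants is faster has to be measured for your own
expression and machine:

```igor
#include <FunctionProfiling>

// Sketch only: CompareExpressions and ProfileIt are hypothetical names.
Function CompareExpressions()
	// Free waves are not stored in any datafolder and vanish once the
	// last reference to them goes out of scope.
	// Make creates single precision floats unless /D is given.
	Make/FREE/N=(1e6) src = gnoise(1)
	Make/FREE/N=(1e6) viaMultiThread

	// Same calculation written as a MultiThread wave assignment ...
	MultiThread viaMultiThread = sqrt(abs(src[p]))
	// ... and as a MatrixOP expression
	MatrixOP/FREE viaMatrixOP = sqrt(abs(src))
End

Function ProfileIt()
	BeginFunctionProfiling()
	CompareExpressions()
	EndFunctionProfiling()
End
```

Calling ProfileIt() from the command line should then show where the
time is spent without opening the profiling panel by hand.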

From your earlier question on the mailing list I presume you are
fetching the data from a database using the SQL XOP? Are you always
fetching it again or do you cache the waves in the experiment?

How many waves do you have in the experiment at a time? Is it 50k in
total or per datafolder?

Thomas

[1]: https://www.wavemetrics.com/node/20924
[2]: https://alleninstitute.github.io/MIES/asyncframework.html