Repeated processing of large datasets


Repeated processing of large datasets

John Logsdon

Greetings to all

I am processing some large datasets that are currently stored as .csv
files, and I can slurp them all into memory, but I only want specific columns.

The datasets are typically a few million records, each with up to 100
columns of which I am only interested in 20 or so.

So the first thing I do is to slurp it all into memory and discard the
unwanted data thus:

local function readValues(f)
  local Line = f:read("*l")                 -- read one line (nil at end of file)
  if Line ~= nil then
    -- strip stray CR/LF, collapse runs of spaces, then split on commas
    -- (split is a user-defined helper that returns a table of fields)
    Line = split(string.gsub(string.gsub(Line, "[\n\r]", ""), " +", " "), ",")
    -- keep only the wanted columns
    return {Line[i1], Line[i2], Line[i3]}
  end
end

where i1, i2, i3, etc. have been pre-calculated from the header line.
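
For illustration, a minimal sketch of how those indices might be derived from the header line; it assumes the same split helper used in readValues, and the column names are placeholders, not taken from the original data:

-- Minimal sketch: derive the wanted column indices from the CSV header.
-- Assumes the same split(str, sep) helper used in readValues; the column
-- names "bid", "ask" and "size" are placeholders.
local function columnIndices(headerLine, wanted)
  local position = {}
  for i, name in ipairs(split(headerLine, ",")) do
    position[name] = i
  end
  local indices = {}
  for k, name in ipairs(wanted) do
    indices[k] = position[name]   -- nil if the column is missing
  end
  return indices
end

-- local idx = columnIndices(tickStream:read("*l"), {"bid", "ask", "size"})
-- local i1, i2, i3 = idx[1], idx[2], idx[3]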

Then in the main program I read one line at a time:

local Linez=readValues(tickStream)
while Linez ~= nil and #AllLinez < maxLines do
        table.insert(AllLinez,Linez)
        Linez=readValues(tickStream)
end

So that AllLinez holds the data I need.

The processing is then a matter of looping over AllLinez:

for thisLine = 1,#AllLinez do
        V1,V2,V3 = unpack(AllLinez[thisLine])
-- ... and then the data are processed
--
end

Processing involves a very large number of repeated optimisation steps, so
it is important that the data are handled as efficiently as possible.  I
am using LuaJIT, of course.

My question is whether this is an efficient way to process the data or
would it be better to use a database such as SQLITE3?

[apologies that the mail nipped out before completion - finger problem!]

TIA

John

John Logsdon
Quantex Research Ltd
+44 161 445 4951/+44 7717758675



Re: Repeated processing of large datasets

Francisco Olarte
On Tue, Mar 28, 2017 at 1:19 PM, John Logsdon
<[hidden email]> wrote:
( Swapped order )
> My question is whether this is an efficient way to process the data or
> would it be better to use a database such as SQLITE3?

It depends on your concrete data & processing, but I'd like to point out:

> Then in the main program I read each line at a time:
> local Linez=readValues(tickStream)
> while Linez ~= nil and #AllLinez < maxLines do
>         table.insert(AllLinez,Linez)
>         Linez=readValues(tickStream)
> end

I'm not sure if this is efficient in LuaJIT, but IIRC in Lua # is not
constant time on tables, and table.insert uses # as the default insert
position, so if I were worried about speed I would normally do:

local nlines = 0
while nlines < maxLines do
   Linez = readValues(tickStream)
   if Linez == nil then break end
   nlines = nlines + 1
   AllLinez[nlines] = Linez
end
-- Probably stash nlines in AllLinez.n for easier passing around...

> The processing is then a matter of looping over AllLinez:
> for thisLine = 1,#AllLinez do
And here I would use nlines instead of #AllLinez, although I think an
ipairs loop may be faster (just time it).
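
A minimal sketch of that variant, for comparison:

-- ipairs walks the array part until the first nil, so it never needs #
for _, Linez in ipairs(AllLinez) do
   local V1, V2, V3 = unpack(Linez)
   -- ... process the values
end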

Francisco Olarte.


Re: Repeated processing of large datasets

Peter Pimley
In reply to this post by John Logsdon
On 28 March 2017 at 12:19, John Logsdon <[hidden email]> wrote:

My question is whether this is an efficient way to process the data or
would it be better to use a database such as SQLITE3?


Using something like SQLite will almost certainly be faster, and probably significantly so.  But of course you lose the ability to simply load the text files into a text editor. Whether the benefits outweigh the costs is a decision only you can make.

Given that the data set is quite large (i.e. not just a 100-line config file), I'd move to SQLite if I were in your position.  The command-line sqlite3 program also lets you inspect and modify the data quite easily.
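
For illustration, a minimal sketch of reading selected columns back out of SQLite from Lua; it assumes the lsqlite3 binding, and the database, table and column names are placeholders:

-- Minimal sketch, assuming the lsqlite3 binding; "ticks.db", the ticks
-- table and the column names are placeholders, not from the original post.
local sqlite3 = require("lsqlite3")

local db = sqlite3.open("ticks.db")
for row in db:nrows("SELECT bid, ask, size FROM ticks") do
  -- each row is a Lua table keyed by column name
  -- ... process row.bid, row.ask, row.size
end
db:close()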

Re: Repeated processing of large datasets

Geoff Leyland
In reply to this post by John Logsdon

> On 29/03/2017, at 12:19 AM, John Logsdon <[hidden email]> wrote:
>
> for thisLine = 1,#AllLinez do
> V1,V2,V3 = unpack(AllLinez[thisLine])
> -- ... and then the data are processed
> --
> end

I'm sure you know that V1, V2 and V3 should be local variables.

> Processing involves a very large number of repeated optimisation steps so
> it is important that the data are handled as efficiently as possible.  I
> am using luajit of course.
>
> My question is whether this is an efficient way to process the data or
> would it be better to use a database such as SQLITE3?

If lua can hold all your data in memory, and you're just iterating through the rows in your processing, not performing sql-like queries to find subsets of rows, then I don't see why luajit shouldn't be pretty quick.

If your data is all numeric, then you might do well out of moving it to cdata rather than a lua table?
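
A minimal sketch of what that could look like with the LuaJIT FFI; the struct layout and field names are placeholders for whatever the 20-odd numeric columns actually are:

-- Minimal sketch: hold the rows as one contiguous cdata array instead of
-- a Lua table of tables (LuaJIT FFI; field names are placeholders).
local ffi = require("ffi")

ffi.cdef[[
typedef struct { double v1, v2, v3; } row_t;
]]

local rows = ffi.new("row_t[?]", maxLines)   -- one allocation, no per-row tables
local n = 0

-- when filling (inside the read loop):
--   rows[n].v1 = tonumber(Line[i1])
--   rows[n].v2 = tonumber(Line[i2])
--   rows[n].v3 = tonumber(Line[i3])
--   n = n + 1

-- when processing (note cdata arrays are zero-based):
--   for i = 0, n - 1 do
--     local r = rows[i]
--     -- ... use r.v1, r.v2, r.v3
--   end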

Re: Repeated processing of large datasets

John Logsdon
Thanks Geoff

Yes, they are local to the function.  cdata is a good idea, although I am
not sure how that would fit with sqlite, which may not be able to store
binary data.  In essence, though, if stored as cdata I could just read the
whole table in a single read.  Hmm.  Interesting.
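
A minimal sketch of that "single read" idea, with row_t, rows and n as in the FFI sketch above; the file name is a placeholder:

-- Minimal sketch: dump the whole cdata array to a binary file, then later
-- read it back in one go (row_t/rows/n as in the FFI sketch above;
-- "rows.bin" is a placeholder file name).
local ffi = require("ffi")

-- write once, after parsing the CSV
local out = assert(io.open("rows.bin", "wb"))
out:write(ffi.string(rows, n * ffi.sizeof("row_t")))
out:close()

-- read back in a single read on later runs
local inp = assert(io.open("rows.bin", "rb"))
local bytes = inp:read("*a")
inp:close()
local count = math.floor(#bytes / ffi.sizeof("row_t"))
local loaded = ffi.new("row_t[?]", count)
ffi.copy(loaded, bytes, #bytes)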


>
>> On 29/03/2017, at 12:19 AM, John Logsdon
>> <[hidden email]> wrote:
>>
>> for thisLine = 1,#AllLinez do
>> V1,V2,V3 = unpack(AllLinez[thisLine])
>> -- ... and then the data are processed
>> --
>> end
>
> I'm sure you know that V1, V2 and V3 should be local variables.
>
>> Processing involves a very large number of repeated optimisation steps
>> so
>> it is important that the data are handled as efficiently as possible.  I
>> am using luajit of course.
>>
>> My question is whether this is an efficient way to process the data or
>> would it be better to use a database such as SQLITE3?
>
> If lua can hold all your data in memory, and you're just iterating through
> the rows in your processing, not performing sql-like queries to find
> subsets of rows, then I don't see why luajit shouldn't be pretty quick.
>
> If your data is all numeric, then you might do well out of moving it to
> cdata rather than a lua table?


Best wishes

John

John Logsdon
Quantex Research Ltd
+44 161 445 4951/+44 7717758675



Re: Repeated processing of large datasets

Geoff Leyland

> On 29/03/2017, at 12:21 PM, John Logsdon <[hidden email]> wrote:
>
> Thanks Geoff
>
> Yes, they are local to the function.  cdata is a good idea although I am
> not sure how that would fit with sqlite which may not be able to store
> binary data.  In essence though if stored as cdata I could just read the
> whole table in a single read.  Hmm.  Interesting.

Can you explain your use case a little further?

If you're provided with a CSV file, which you read once, and then process once, then I don't see a problem with reading the CSV into memory in lua and then processing it in memory (possibly using cdata) without getting SQLite involved.

If you read and process the CSV file multiple times, then it might be worth converting the CSV into something quicker to read, if the CSV reading is taking a significant amount of time (which I doubt if you're running some statistical/machine learning/optimisation model on the data).

My current trick with large data I want to read fast is to mmap it [1], so that loading is more or less instant.  So you'd read the CSV file once, write it to a memory-mapped file, and then for each processing run, mmap the data back in.
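
A rough sketch of that workflow; the mmapfile.create/open/close calls below are assumptions about the lua-mmapfile API and should be checked against the README at [1], and the file name and row_t struct are placeholders:

-- Rough sketch only: the mmapfile.create/open/close signatures are
-- assumptions about the lua-mmapfile API ([1]) and should be checked
-- against its README; "rows.map" and row_t are placeholders.
local ffi = require("ffi")
local mmapfile = require("mmapfile")

ffi.cdef[[
typedef struct { double v1, v2, v3; } row_t;
]]

-- one-off conversion: parse the CSV, write the rows into a mapped file
local rows = mmapfile.create("rows.map", nrows, "row_t")
-- ... fill rows[0] .. rows[nrows - 1] from the CSV ...
mmapfile.close(rows)

-- each processing run: map the data straight back in
local rows, nrows = mmapfile.open("rows.map", "row_t")
for i = 0, nrows - 1 do
  -- ... process rows[i].v1, rows[i].v2, rows[i].v3
end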

(I have nothing against SQLite, I use it often, I'm just not sure it offers anything for what I imagine you're doing)

Cheers,
Geoff


[1] Shameless plug for https://github.com/geoffleyland/lua-mmapfile



Re: Repeated processing of large datasets

Parke
In reply to this post by John Logsdon
On Tue, Mar 28, 2017 at 4:21 PM, John Logsdon
<[hidden email]> wrote:
> Yes, they are local to the function.  cdata is a good idea although I am
> not sure how that would fit with sqlite which may not be able to store
> binary data.

SQLite can store binary data.  See: http://sqlite.org/limits.html

The default maximum blob size is 10^9 (or one billion bytes).  If you
recompile SQLite, this can be increased up to 2^31 minus 1 (or 2GB
minus one).
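
For illustration, a minimal sketch of writing binary data as a blob from Lua, assuming the lsqlite3 binding; the table layout is a placeholder, and rows/n are as in the earlier FFI sketch:

-- Minimal sketch, assuming the lsqlite3 binding; the blobs table is a
-- placeholder, and rows/n come from the earlier FFI sketch.
local sqlite3 = require("lsqlite3")
local ffi = require("ffi")

local db = sqlite3.open("ticks.db")
db:exec("CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, data BLOB)")

local stmt = db:prepare("INSERT OR REPLACE INTO blobs (name, data) VALUES (?, ?)")
stmt:bind(1, "rows")
stmt:bind_blob(2, ffi.string(rows, n * ffi.sizeof("row_t")))
stmt:step()
stmt:finalize()
db:close()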

-Parke