How to calculate MD5 of huge files?

"J.Jørgen von Bargen"
Hi folks.
I'm using LuaForWindows (migrating from Perl for various reasons) and I
wonder how to calculate MD5 over huge files (e.g. ISO files).

In Perl I can add files via a handle to a sum (excuse the use of a
foreign language in this group ;-)

    $MD5->reset;
    open(FILE,$file) or die "open($file) failed($!)\n";
    binmode(FILE);
    $MD5->addfile(*FILE);
    close(FILE);
    my $hex=$MD5->hexdigest;

In Lua I've only found the module md5, which provides

     md5.sumhexa(message)

but this requires reading the whole message into memory, which is
practically impossible for 7 GB files.
Is there any other solution?

Regards Jørgen


Re: How to calculate MD5 of huge files?

Rob Kendrick-2
On Thu, 22 Apr 2010 12:35:18 +0200
"J.Jørgen von Bargen" <[hidden email]> wrote:

> In lua I've only found the module md5, which provides
>
>      md5.sumhexa(message)
>
> but this requires reading the whole message into memory, which is
> quite impossible for 7GB large files.
> Is there any other solution?

http://www.rjek.com/luahash-0.00.tar.bz2 gives you :init, :update,
and :final methods for MD5, SHA-1, and the variations of SHA-2.  I have
no idea if it works under Windows, as I've never tried.  Should be
pretty simple to build, though.

B.

Re: How to calculate MD5 of huge files?

Jerome Vuarand
In reply to this post by "J.Jørgen von Bargen"
2010/4/22 "J.Jørgen von Bargen" <[hidden email]>:

> Hi folks.
> I'm using LuaForWindows (migrating from perl for various reasons) and I
> wonder, how to calculate MD5 over huge files (e.g. ISO-Files).
>
> In perl I can add files via a handle to a sum (excuse the use of a foreign
> language in this group ;-)
>
>   $MD5->reset;
>   open(FILE,$file) or die "open($file) failed($!)\n";
>   binmode(FILE);
>   $MD5->addfile(*FILE);
>   close(FILE);
>   my $hex=$MD5->hexdigest;
>
> In lua I've only found the module md5, which provides
>
>    md5.sumhexa(message)
>
> but this requires reading the whole message into memory, which is quite
> impossible for 7GB large files.
> Is there any other solution?

You can use LuaCrypto :

require 'crypto'

local evp = crypto.evp.new('md5')

local file = io.open(... or "bigfile.iso", "rb")
while true do
        local chunk = file:read(16*1024) -- 16kB at a time
        if not chunk then break end
        evp:update(chunk)
end
file:close()

print(evp:digest())

Re: How to calculate MD5 of huge files?

Luiz Henrique de Figueiredo
In reply to this post by Rob Kendrick-2
> http://www.rjek.com/luahash-0.00.tar.bz2 gives you :init, :update,
> and :final methods for MD5, SHA-1, and the variations of SHA-2.

So does http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lmd5 but you need
OpenSSL to build it out of the box. If you don't have that, then there's
C code for MD5 in many places around, eg
        http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
        http://archiv.tu-chemnitz.de/pub/1999/0004/data/md5/
Reply | Threaded
Open this post in threaded view
|

Re: How to calculate MD5 of huge files?

Petite Abeille
In reply to this post by Jerome Vuarand

On Apr 22, 2010, at 12:49 PM, Jerome Vuarand wrote:

> local file = io.open(... or "bigfile.iso", "rb")
> while true do
> local chunk = file:read(16*1024) -- 16kB at a time
> if not chunk then break end
> evp:update(chunk)
> end

A bit off topic, but... out of curiosity, I was wondering about the use of 'while true do' + 'if not chunk then break' vs. 'while aChunk do' + 'aChunk = aFile:read()', e.g.:

-- using LHF's previous version of lmd5, the one before it was married to openssl :D
local sha1 = require( 'sha1' )
local aHash = sha1.new()
local aFile = assert( io.open( 'bigfile.iso', 'rb' ) )
local aChunk = aFile:read( 16384 )

while aChunk do
    aHash:update( aChunk )
    aChunk = aFile:read( 16384 )
end

Any reasons, aside from stylistic ones, to favor one form or another? I personally prefer an explicit conditional when using a 'while' clause, as I find it easier to comprehend, but I have noticed the above style quite often. Just curious :)

All things considered, I would rather use an iterator anyway, but perhaps that's just me:

local aReader = function() return aFile:read( 16384 ) end

for aChunk in aReader do
    aHash:update( aChunk )
end




Re[2]: How to calculate MD5 of huge files?

Arseny Vakhrushev
>> local file = io.open(... or "bigfile.iso", "rb")
>> while true do
>>       local chunk = file:read(16*1024) -- 16kB at a time
>>       if not chunk then break end
>>       evp:update(chunk)
>> end

> A bit off topic, but... out of curiosity, I was wondering about the use of 'while true do' + 'if
> not chunk then break' vs. 'while aChunk do' + 'aChunk = aFile:read()', e.g.:

> local sha1 = require( 'sha1' )
> local aHash = sha1.new()
> local aFile = assert( io.open( 'bigfile.iso', 'rb' ) )
> local aChunk = aFile:read( 16384 )

> while aChunk do
>     aHash:update( aChunk )
>     aChunk = aFile:read( 16384 )
> end

For me it matters because 'aFile:read( 16384 )' appears twice: it doesn't look nice when the
buffer size is mentioned twice (though you could use a constant for that) and the fetch call is
made in two places. So, like the original author, I'd prefer the 'while true do ...' or
'repeat ... until false' form.

Be well,
Seny


Re: How to calculate MD5 of huge files?

Doug Rogers-3
In reply to this post by Petite Abeille
Petite Abeille wrote:
> A bit off topic, but... out of curiosity, I was wondering about the use of 'while true do' + 'if not chunk then break' vs. 'while aChunk do' + 'aChunk = aFile:read()', e.g.:
> All things considered, I would rather use an iterator anyway, but perhaps that's just me:
>
> local aReader = function() return aFile:read( 16384 ) end
> for aChunk in aReader do
>     aHash:update( aChunk )
> end

Yes, I like your iterator version best.

But I do prefer the 'while true do  .. break .. end' construct over
multiple :read() calls. When things are repeated, they often become
broken later. I've been bitten by the latter construct in more painful
ways than I have the former. This is mostly because an infinite loop
almost always shows up earlier and in a more obvious way than an
interaction caused by some subtle linking between the two repeated lines
(especially when one is later changed without the other). Those sorts of
bugs look like they work but miss the last read, say, under some strange
circumstance.

Of course in this limited example, it doesn't really matter. Either way
is clear and readily understood.

Doug


Re: How to calculate MD5 of huge files?

Jerome Vuarand
In reply to this post by Petite Abeille
2010/4/22 Petite Abeille <[hidden email]>:

>
> On Apr 22, 2010, at 12:49 PM, Jerome Vuarand wrote:
>
>> local file = io.open(... or "bigfile.iso", "rb")
>> while true do
>>       local chunk = file:read(16*1024) -- 16kB at a time
>>       if not chunk then break end
>>       evp:update(chunk)
>> end
>
> A bit off topic, but... out of curiosity, I was wondering about the use of 'while true do' + 'if not chunk then break' vs. 'while aChunk do' + 'aChunk = aFile:read()', e.g.:
>
> -- using LHF's previous version of lmd5, the one before it was married to openssl :D
> local sha1 = require( 'sha1' )
> local aHash = sha1.new()
> local aFile = assert( io.open( 'bigfile.iso', 'rb' ) )
> local aChunk = aFile:read( 16384 )
>
> while aChunk do
>    aHash:update( aChunk )
>    aChunk = aFile:read( 16384 )
> end
>
> Any reasons, aside from stylistic ones, to favor one form or another? I personally prefer an explicit conditional when using a 'while' clause, as I find it easier to comprehend, but I have noticed the above style quite often. Just curious :)
>
> All things considered, I would rather use an iterator anyway, but perhaps that's just me:
>
> local aReader = function() return aFile:read( 16384 ) end
>
> for aChunk in aReader do
>    aHash:update( aChunk )
> end

I prefer both syntaxes you present over mine (with the difference that
I wouldn't name the iterator), but I tend to write the one I sent
previously because of a very subtle combination of circumstances that
I understand, despise, but yet repeat helplessly.

Re: How to calculate MD5 of huge files?

Duncan Cross
In reply to this post by Petite Abeille
On Thu, Apr 22, 2010 at 7:03 PM, Petite Abeille
<[hidden email]> wrote:

>
> On Apr 22, 2010, at 12:49 PM, Jerome Vuarand wrote:
>
>> local file = io.open(... or "bigfile.iso", "rb")
>> while true do
>>       local chunk = file:read(16*1024) -- 16kB at a time
>>       if not chunk then break end
>>       evp:update(chunk)
>> end
>
> A bit off topic, but... out of curiosity, I was wondering about the use of 'while true do' + 'if not chunk then break' vs. 'while aChunk do' + 'aChunk = aFile:read()', e.g.:
>
> -- using LHF's previous version of lmd5, the one before it was married to openssl :D
> local sha1 = require( 'sha1' )
> local aHash = sha1.new()
> local aFile = assert( io.open( 'bigfile.iso', 'rb' ) )
> local aChunk = aFile:read( 16384 )
>
> while aChunk do
>    aHash:update( aChunk )
>    aChunk = aFile:read( 16384 )
> end
>
> Any reasons, aside from stylistic ones, to favor one form or another? I personally prefer an explicit conditional when using a 'while' clause, as I find it easier to comprehend, but I have noticed the above style quite often. Just curious :)

I dislike having the "aChunk = aFile:read(16384)" bit in two places
because it feels too easy to change one and forget about the other.

> All things considered, I would rather use an iterator anyway, but perhaps that's just me:

I agree, that's not a bad way to do it.

-Duncan

Re: How to calculate MD5 of huge files?

Luiz Henrique de Figueiredo
In reply to this post by Petite Abeille
> -- using LHF's previous version of lmd5, the one before it was married to openssl :D

The code in lmd5.c is not married to OpenSSL at all: it works for any digest
library that has an API similar to the original Rivest API described in
RFC 1321 (ie, almost all of them). The current code in lmd5.h is married to
OpenSSL, but it is simple to add support for whatever library you have. The
current Makefile is married to OpenSSL because the previous Makefile was too
complicated and OpenSSL seems to be widely available. Just drop me a line
if you need help adapting the code to support another digest library.
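
To illustrate what "an API similar to the original Rivest API" looks like, here is a minimal sketch. It is not taken from lmd5. RFC 1321 declares the triple MD5Init, MD5Update and MD5Final; the digest_api table, the toy 4-byte XOR "digest" and the digest_two_parts driver below are all hypothetical, invented only to show that code written against an init/update/final triple does not care which library sits behind it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table describing a digest library that offers an
   RFC 1321 style init/update/final triple. Not part of lmd5. */
typedef struct {
    size_t digest_size;
    void (*init)(void *ctx);
    void (*update)(void *ctx, const unsigned char *data, size_t len);
    void (*final)(unsigned char *digest, void *ctx);
} digest_api;

/* Toy 4-byte XOR "digest", standing in for a real digest library. */
typedef struct { unsigned char acc[4]; size_t pos; } xor_ctx;

static void xor_init(void *p) { memset(p, 0, sizeof(xor_ctx)); }

static void xor_update(void *p, const unsigned char *data, size_t len)
{
    xor_ctx *c = p;
    size_t i;
    for (i = 0; i < len; i++) {
        c->acc[c->pos] ^= data[i];      /* fold each byte into the state */
        c->pos = (c->pos + 1) % 4;
    }
}

static void xor_final(unsigned char *digest, void *p)
{
    memcpy(digest, ((xor_ctx *)p)->acc, 4);
}

static const digest_api xor_api = { 4, xor_init, xor_update, xor_final };

/* Generic driver: feeds a message in two pieces through any such table. */
static void digest_two_parts(const digest_api *api, void *ctx,
                             const char *a, const char *b,
                             unsigned char *out)
{
    api->init(ctx);
    api->update(ctx, (const unsigned char *)a, strlen(a));
    api->update(ctx, (const unsigned char *)b, strlen(b));
    api->final(out, ctx);
}
```

Feeding "ab" and then "cd" gives the same result as feeding "abcd" in one call, which is the property that makes chunked hashing of a large file possible.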

Re: How to calculate MD5 of huge files?

Joshua Jensen
----- Original Message -----
From: Luiz Henrique de Figueiredo
Date: 4/26/2010 6:12 AM
>> -- using LHF's previous version of lmd5, the one before it was married to openssl :D
>>      
My version of lmd5 has this useful function.  It calculates the MD5 of
an entire file and does so from C code.  This means the garbage
collector isn't pegged heavily during a read of a large file.

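/**
    Feeds the contents of the named file (argument 2) into the MD5 context
    (argument 1), reading it in BLOCK_SIZE chunks.
**/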
static int Lupdatefile(lua_State *L)
{
     MD5_CTX *c=Pget(L,1);
     const char *fileName=luaL_checkstring(L,2);
     const size_t BLOCK_SIZE = 32768;
     unsigned char* buffer;
     size_t fileLen;
     size_t bytesToDo;
     FILE* file = fopen(fileName, "rb");
     if (!file)
         return 0;
     buffer = malloc(BLOCK_SIZE);
     fseek(file, 0, SEEK_END);
     fileLen = ftell(file);
     fseek(file, 0, SEEK_SET);
     bytesToDo = fileLen;
     while (bytesToDo > 0)
     {
         size_t bytesToRead = bytesToDo < BLOCK_SIZE ? bytesToDo : BLOCK_SIZE;

         fread(buffer, bytesToRead, 1, file);
         bytesToDo -= bytesToRead;
         MD5Update(c, buffer, bytesToRead);
     }

     fseek(file, 0, SEEK_SET);
     fclose(file);

     free(buffer);
     return 0;
}


-Josh

Re: How to calculate MD5 of huge files?

Luiz Henrique de Figueiredo
> My version of lmd5 has this useful function.  It calculates the MD5 of
> an entire file and does so from C code.

Allow me some suggestions for that code:
 
>     const size_t BLOCK_SIZE = 32768;
>     unsigned char* buffer;

Use a fixed buffer of size BUFSIZ.  No need to malloc anything.
fread buffers data anyway. No need to buffer it more than that.

>     fseek(file, 0, SEEK_END);

Just read until fread returns zero. If you want to check for errors,
call ferror at the end.
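
As a rough illustration of these two suggestions (a sketch only, not the actual lmd5 code): the loop below uses a fixed BUFSIZ buffer, reads until fread returns zero, and calls ferror once at the end. It never asks for the file length, which also sidesteps ftell's long return value, too small for a 7 GB file where long is 32 bits, as on Windows. The toy_ctx type and the toy_* functions are hypothetical stand-ins for MD5_CTX and MD5Update, so that the example compiles without a digest library.

```c
#include <stdio.h>

/* Hypothetical stand-in for an MD5 context: it only counts bytes and keeps
   a running byte sum, so the example stays self-contained. */
typedef struct {
    unsigned long long length;
    unsigned long sum;
} toy_ctx;

static void toy_init(toy_ctx *c) { c->length = 0; c->sum = 0; }

/* Same shape as an MD5Update-style call: context, data, byte count. */
static void toy_update(toy_ctx *c, const unsigned char *data, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        c->sum = (c->sum + data[i]) & 0xFFFFFFFFUL;
    c->length += n;
}

/* Feed a whole file to the context. Returns 0 on success, -1 on error. */
static int toy_update_file(toy_ctx *c, const char *path)
{
    unsigned char buffer[BUFSIZ];   /* fixed buffer, no malloc */
    size_t n;
    int failed;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    /* fread returns 0 at end of file or on error; no length is needed */
    while ((n = fread(buffer, 1, sizeof buffer, f)) > 0)
        toy_update(c, buffer, n);
    failed = ferror(f);             /* tells a read error from end of file */
    fclose(f);
    return failed ? -1 : 0;
}
```

In a real binding, toy_update would be replaced by the digest library's update call, and the -1 result could be turned into a Lua error rather than returning nothing.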

Re: Re: How to calculate MD5 of huge files?

Aurora-4
Dear,

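If you need the MD5 of the entire file, the code can be:

unsigned char block[buf_size];

open_file
md5_init_ctx()
/* loop on the read result rather than on feof, so that the last
   (possibly short) block is hashed exactly once */
while ((n = read_a_block_from_file(block)) > 0)
{
        md5_update(block, n)
}
md5_fini()

If not, you can use a step to skip some of the data.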
                         
Aurora
2010-04-27

-------------------------------------------------------------

>> My version of lmd5 has this useful function.  It calculates the MD5 of
>> an entire file and does so from C code.
>
>Allow me some suggestions for that code:
>
>>     const size_t BLOCK_SIZE = 32768;
>>     unsigned char* buffer;
>
>Use a fixed buffer of size BUFSIZ.  No need to malloc anything.
>fread buffers data anyway. No need to buffer it more than that.
>
>>     fseek(file, 0, SEEK_END);
>
>Just read until fread returns zero. If you want to check for errors,
>call ferror at the end.


Re: How to calculate MD5 of huge files?

Joshua Jensen
In reply to this post by Luiz Henrique de Figueiredo
----- Original Message -----
From: Luiz Henrique de Figueiredo
Date: 4/27/2010 5:05 AM
>> My version of lmd5 has this useful function.  It calculates the MD5 of
>> an entire file and does so from C code.
>>      
>>      const size_t BLOCK_SIZE = 32768;
>>      unsigned char* buffer;
>>      
> Use a fixed buffer of size BUFSIZ.  No need to malloc anything.
> fread buffers data anyway. No need to buffer it more than that.
>    
Well, this was written years ago, and it was tuned to certain data
access patterns.

To check, according to Process Monitor running on Windows 7, I found
that default fread() using BUFSIZ (512 bytes) reads in 4,096 bytes at a
time.  fread(), then, hits the disk 8 times more often than the version
of the code I have.  This is implementation dependent, of course, but
there you go.
>>      fseek(file, 0, SEEK_END);
>>      
> Just read until fread returns zero. If you want to check for errors,
> call ferror at the end.
>    
I think the original code read the entire file into memory and
calculated the MD5 in one fell swoop.  Then it evolved into its current
form.

When I have some time to test the updates, I'll be sure to integrate the
shorter forms you've suggested.

Thanks!

Josh
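
For anyone who wants to keep a small read buffer but still have stdio fetch larger blocks, standard C offers setvbuf. The following is a minimal sketch under assumptions: the helper name open_with_buffer is made up for this example, and how closely a given C runtime follows the requested size is implementation-dependent, so the effect should be checked with a tool such as Process Monitor.

```c
#include <stdio.h>

/* Hypothetical helper: open a file for binary reading and ask stdio to use
   a fully buffered stream with a buffer of bufsize bytes. */
static FILE *open_with_buffer(const char *path, size_t bufsize)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    /* setvbuf must be called before any other operation on the stream.
       Passing NULL leaves the allocation of the buffer to the library. */
    if (setvbuf(f, NULL, _IOFBF, bufsize) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

A call such as open_with_buffer(fileName, 32768) would request the same 32 KB granularity as BLOCK_SIZE, after which the stream is read with fread as usual.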

Re[2]: How to calculate MD5 of huge files?

Bulat Ziganshin
Hello Joshua,

Tuesday, April 27, 2010, 6:32:29 PM, you wrote:

> To check, according to Process Monitor running on Windows 7, I found
> that default fread() using BUFSIZ (512 bytes) reads in 4,096 bytes at a
> time.  fread(), then, hits the disk 8 times more often than the version

:)  there are many levels of caching

I recommend reading data in 32-64 KB chunks. Our experiments showed that
this is the optimal block size for maximum disk I/O performance in Vista
(and probably Win7).


--
Best regards,
 Bulat                            mailto:[hidden email]