Capture patterns

Capture patterns

Jeff Wise
>What the pattern actually captures is columns separated by whitespace.
>The pattern just forces that the columns corresponding to numbers are the
>last three ones, and forces that the first field contains the rest of the
>line.

Thanks again for the explanation. After some consideration, I see how it
works.

I did a very poor job of transferring my problem to a suitable example. When
I tried to adapt the pattern to my 13-number real-world situation, I ran into
difficulty.  I have no doubt this is my bug, not an error in the
provided solution.  The problem with my original pattern was that it captured
dashes in the text and periods in abbreviations.  The dollar
amounts are always preceded by a text description.

If I were doing this on my native platform, I would search for periods.  At
each period, I would look for two trailing digits (and perhaps one preceding
digit; consider that some use .00 and some 0.00).  Then I would capture
everything in the string from space to space.  Look at all these possible
formats:

1. $100.00
2. .00
3. 100,000.00
4. -1,234.56
5. (1,234.56)
6. 1,234.56CR
7. 1,234.56-
8. +100.00

My original problem statement was to get a pattern that would do something
similar to the above strategy. My perception is that the code would be more
utilitarian if I did not look for a fixed number of fields, and if I allowed
all the common financial notations used in reports.
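
The strategy described above could be sketched in Lua roughly as follows.
This is only an illustration under stated assumptions: looks_like_amount and
find_amounts are hypothetical names, the sample line is made up, and a token
counts as an amount whenever it contains a period followed by exactly two
digits.

```lua
-- Hypothetical helper: true if the token has a period followed by exactly
-- two digits, either at the end of the token or before a non-digit.
function looks_like_amount(token)
  return token:find("%.%d%d$") ~= nil or token:find("%.%d%d%D") ~= nil
end

-- Hypothetical helper: collect every space-to-space token that looks like
-- an amount, however many of them the line contains.
function find_amounts(line)
  local amounts = {}
  for token in line:gmatch("%S+") do
    if looks_like_amount(token) then
      amounts[#amounts + 1] = token
    end
  end
  return amounts
end

-- Made-up report line: the abbreviation and the lone dash are skipped.
local line = "Misc. exp - office  $100.00  (1,234.56)  1,234.56CR  .00"
for _, amount in ipairs(find_amounts(line)) do
  print(amount)
end
```

Because the tokens are taken whole, the sign markers (parentheses, CR, a
trailing minus) stay attached to the figure and can be interpreted later.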

I read the section in PiL multiple times on patterns (20.3, p. 180ff), and I
think that the section could be better organized by covering the "magic
characters" in order instead of randomly.  This would make the text more
useful as a reference (at least to someone scatter-brained like me).

I think that Lua is an excellent language for this type of problem.  I am
awed by its power and flexibility.

CONFIDENTIALITY NOTICE:  This E-mail message and all attachments, which originated from Sealy Management Company Inc, are intended solely for the use of the intended recipient or entity and may contain legally privileged and confidential information.  If the reader of this message is not the intended recipient, you are hereby notified that any reading, disclosure, dissemination, distribution, copying or other use of this message is strictly prohibited.  If you have received this message in error, please notify the sender of the message immediately and delete this message and all attachments, including all copies or backups thereof, from your system.  You may also reach us by phone at 205-391-6000.  Thank you.

Re: Capture patterns

Eike Decker-2
Hi

Looking at your given samples:

1. $100.00
2. .00
3. 100,000.00
4. -1,234.56
5. (1,234.56)
6. 1,234.56CR
7. 1,234.56-
8. +100.00

I made this pattern, which matches all the examples given above and
captures only the digits and the characters needed to figure out whether it
is a negative value:

%$?([%d%.,%-%(%)CR]+)

You could maybe replace all matches with "normalized" number values using
gsub. Since some locales swap the . and , (as in 1.000,00), this
function treats the two separators interchangeably:

function normalize(str)
 str = str:gsub("%$?([%d%.,%-%(%)CR]+)",
   function(num)
     num = num:gsub("^(.*)%-$","-%1") -- let minus be in front
     num = num:gsub("^%((.-)%)$","-%1") -- if in brackets, negate it
     num = num:gsub("^(.*)CR$","-%1") -- I assume CR means negative?
     num = num:gsub("^(.-)[%.%,](..)$","%1.%2") -- last ,/. becomes the decimal point
     num = num:gsub("[%,%.](%d%d%d)","%1") -- remove every ,/. before a group of three digits
     return num
   end)
 return str
end

The given samples above are transformed to 

1. 100.00
2. .00
3. 100000.00
4. -1234.56
5. -1234.56
6. -1234.56
7. -1234.56
8. +100.00

All of them can be converted to a real number using tonumber
and are matched by the pattern [%-+]?%d*%.?%d* (the sign is optional).
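
A quick way to check that claim is sketched below. The helper name
is_plain_number is hypothetical, and the pattern is an anchored variant with
an optional sign (several of the outputs are unsigned). Since that pattern
on its own also matches the empty string and a lone ".", the sketch asks
tonumber to confirm as well.

```lua
-- Hypothetical helper: true if s has the normalized shape AND converts.
function is_plain_number(s)
  return s:match("^[%-+]?%d*%.?%d*$") ~= nil and tonumber(s) ~= nil
end

-- The distinct normalized outputs listed above.
local normalized = { "100.00", ".00", "100000.00", "-1234.56", "+100.00" }
for _, s in ipairs(normalized) do
  print(s, is_plain_number(s), tonumber(s))
end
```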

Maybe there are simpler (or more efficient) ways to do that... that's just how I
would approach this problem ;)

One problem remains: the pattern might also match strings that don't contain
any digits. This might be fixed by changing the matching pattern:

%$?([%d%.,%-%(%)CR]+)

to 

%$?([%d%.,%-%(%)CR]*%d+[%d%.,%-%(%)CR]*)

which would require the match to contain at least one digit among all the
special chars. But I haven't tested that.
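
A small sketch that exercises both patterns against the eight samples and
against a string with no digits. The function names loose_capture and
strict_capture and the input "Rent-Car" are made up for the illustration.

```lua
-- The two patterns from the post, wrapped in hypothetical helpers.
function loose_capture(s)
  return s:match("%$?([%d%.,%-%(%)CR]+)")
end

function strict_capture(s)
  return s:match("%$?([%d%.,%-%(%)CR]*%d+[%d%.,%-%(%)CR]*)")
end

local samples = { "$100.00", ".00", "100,000.00", "-1,234.56",
                  "(1,234.56)", "1,234.56CR", "1,234.56-", "+100.00" }
for _, s in ipairs(samples) do
  print(s, loose_capture(s), strict_capture(s))
end

-- A string without digits: the loose pattern captures the "R",
-- the strict pattern finds nothing.
print(loose_capture("Rent-Car"), strict_capture("Rent-Car"))
```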

Lua's pattern functions are quite simple, but also faster, compared to the
full regular expressions supplied in Perl. Maybe it would be possible to replace
this function I wrote with one single regular expression which does all this in
one stroke. But I doubt that you could do this transformation with a single Lua
pattern.

Eike
 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> CONFIDENTIALITY NOTICE:  This E-mail message and all attachments, which
> originated from Sealy Management Company Inc, are intended solely for the use
> of the intended recipient or entity and may contain legally privileged and
> confidential information.  If the reader of this message is not the intended
> recipient, you are hereby notified that any reading, disclosure,
> dissemination, distribution, copying or other use of this message is strictly
> prohibited.  If you have received this message in error, please notify the
> sender of the message immediately and delete this message and all
> attachments, including all copies or backups thereof, from your system.  You
> may also reach us by phone at 205-391-6000.  Thank you.
> 




Re: Capture patterns

Eric Tetz
In reply to this post by Jeff Wise
On Jan 8, 2008 7:04 AM, Jeff Wise <[hidden email]> wrote:
> Look at all these possible formats:
>
> 1. $100.00
> 2. .00
> 3. 100,000.00
> 4. -1,234.56
> 5. (1,234.56)
> 6. 1,234.56CR
> 7. 1,234.56-
> 8. +100.00

If you strip out the superfluous characters:

  input = string.gsub(input, "[$,%+%s]", "")

It reduces the number of formats to 6.
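
One way to see that count, as a rough sketch: strip the same characters,
collapse every run of digits to a placeholder, and count the distinct shapes
that remain. The names shape_of and count_shapes are hypothetical.

```lua
-- Hypothetical helper: strip the superfluous characters, then replace each
-- run of digits with "N" so that only the notation is left.
function shape_of(s)
  s = s:gsub("[$,%+%s]", "")
  s = s:gsub("%d+", "N")
  return s
end

-- Hypothetical helper: number of distinct shapes in a list of strings.
function count_shapes(list)
  local seen, count = {}, 0
  for _, s in ipairs(list) do
    local shape = shape_of(s)
    if not seen[shape] then
      seen[shape] = true
      count = count + 1
    end
  end
  return count
end

local samples = { "$100.00", ".00", "100,000.00", "-1,234.56",
                  "(1,234.56)", "1,234.56CR", "1,234.56-", "+100.00" }
print(count_shapes(samples))  --> 6
```

The six shapes are N.N, .N, -N.N, (N.N), N.NCR and N.N-.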

One approach is just to try every valid pattern until one sticks:

   -- I have no idea what 'CR' appended to a number means,
   -- so I'll pretend it means negative...

   function match_number(s)
      s = s:gsub("[$,+%s]", "") -- strip garbage
      local n =
            s:match("%-(%d+%.%d+)")    -- -22.22
         or s:match("(%d+%.%d+)%-")    -- 22.22-
         or s:match("%((%d+%.%d+)%)")  -- (22.22)
         or s:match("(%d+%.%d+)CR")    -- 22.22CR
      return n and -n
          or s:match("(%d+%.%d+)")     -- 22.22
          or s:match("(%.%d+)")        --   .22
   end

Or you could use one pattern and test if you caught any of the extra stuff:

   function match_number(s)
      s = s:gsub("[$,+%s]", "") -- strip garbage
      local pre, n, post = s:match("([-(]?)(%d*%.%d+)([CR-]?)")
      return (pre ~= "" or post ~= "") and -n or n
   end
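
A slightly more defensive sketch of the one-pattern idea, for input that may
contain no number at all. As written above, the function would attempt
arithmetic on nil in that case, and it returns a number for negative values
but a string for positive ones. The name match_amount is hypothetical, and
CR is still assumed to mean negative.

```lua
-- Hypothetical variant: returns a Lua number, or nil if no figure is found.
function match_amount(s)
  s = s:gsub("[$,+%s]", "") -- strip garbage, as in the original
  local pre, digits, post = s:match("([-(]?)(%d*%.%d+)([CR-]?)")
  if not digits then
    return nil              -- nothing that looks like a figure
  end
  local n = tonumber(digits)
  if pre ~= "" or post ~= "" then
    n = -n                  -- any leading or trailing marker means negative
  end
  return n
end

print(match_amount("$100.00"))
print(match_amount("(1,234.56)"))
print(match_amount("no figure here"))
```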

I'm sure there are a dozen other ways to skin this cat, but I've gotta
get to work...

> I think that the section could be better
> organized by covering the "magic characters" in
> order instead of randomly.  This would make the
> text more useful as a reference

Check out the online manual. It has exactly what you want. :)

Cheers,
Eric

Re: Capture patterns

KHMan
Eric Tetz wrote:
> On Jan 8, 2008 7:04 AM, Jeff Wise <[hidden email]> wrote:
>> Look at all these possible formats:
>> [snip]
> 
> If you strip out the superfluous characters:
> [snip]
> Or you could use one pattern and test if you caught any of the extra stuff:
> 
>    function match_number(s)
>       s = s:gsub("[$,+%s]", "") -- strip garbage
>       local pre, n, post = s:match("([-(]?)(%d*%.%d+)([CR-]?)")
>       return (pre ~= "" or post ~= "") and -n or n
>    end

A minor quibble...

Any attempt at converting a financial figure to a floating point
representation without scaling gets me really worried... it might
work now, but somewhere down the line, extra processing tacked in
by a later developer trusting the data integrity of the input
layer might cause some values to go off by a wee bit.

In this case, I think Jeff just needs them as strings or lightly
processed strings, and I do think that not converting the figures is safer.
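
If a numeric value is needed at some point, one common form of the scaling
mentioned above is to hold the amount as a whole number of cents. The sketch
below is only an illustration: to_cents is a hypothetical name, and it
assumes an already normalized string with exactly two decimal digits, such
as "-1234.56".

```lua
-- Hypothetical helper: "-1234.56" -> -123456 (whole cents), or nil if the
-- string is not in the assumed normalized form.
function to_cents(s)
  local sign, whole, frac = s:match("^([%-+]?)(%d*)%.(%d%d)$")
  if not sign then
    return nil
  end
  if whole == "" then
    whole = "0"             -- allow ".00" as well as "0.00"
  end
  local cents = tonumber(whole) * 100 + tonumber(frac)
  if sign == "-" then
    cents = -cents
  end
  return cents
end

-- Sums of whole cents stay exact, while the same sum in binary floating
-- point is off by a wee bit.
print(to_cents("0.10") + to_cents("0.20") == to_cents("0.30"))
print(0.1 + 0.2 == 0.3)
```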

-- 
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia