Tricky little string parsing challenge

Geoff Smith
This one has got me stuck for the moment. Can anyone come up with an elegant solution for this without needing an external library, please?

I have a long string of text that I need to split into sentences. Here is a sort-of-working attempt:

local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"

local sentences = {}
for i in string.gmatch(text, "[^%.]+") do
  sentences[#sentences+1] = i
end

for i = 1, #sentences do
  print(i, sentences[i])
end

Of course I had forgotten about not splitting on decimal points in numbers.  How can I adapt this to ignore the full stop character if surrounded by numbers?

Thanks for any solutions.

Geoff
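
One way to do literally what the question asks is to hide every full stop that sits between two digits, split as before, and then put the decimal points back. The following is only a sketch: the function name is made up for illustration, and it assumes the byte "\1" never occurs in the input. It does not address abbreviations or other terminators.

```lua
-- Sketch: protect decimal points, split on full stops, restore the decimals.
-- Assumption: the byte "\1" never appears in the input text.
local function split_protecting_decimals(text)
  -- A full stop between two digits is treated as a decimal point.
  -- The frontier %f[%d] checks the following digit without consuming it,
  -- so chains such as "1.2.3" are fully protected.
  local masked = text:gsub("(%d)%.%f[%d]", "%1\1")
  local sentences = {}
  for chunk in masked:gmatch("[^%.]+") do
    -- Trim surrounding whitespace and put the decimal points back.
    local sentence = chunk:match("^%s*(.-)%s*$"):gsub("\1", ".")
    if sentence ~= "" then
      sentences[#sentences + 1] = sentence
    end
  end
  return sentences
end

local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
for i, sentence in ipairs(split_protecting_decimals(text)) do
  print(i, sentence)
end
```

Note that the terminating full stop is dropped from each piece, just as in the original loop.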


Re: Tricky little string parsing challenge

Luiz Henrique de Figueiredo
Try this

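text = text .. ". "                 -- make sure the last sentence also ends in ". "
for s in text:gmatch(".-%. ") do    -- shortest run ending in a full stop plus a space
        print(s)
end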

Re: Tricky little string parsing challenge

Sean Conner
In reply to this post by Geoff Smith
It was thus said that the Great Geoff Smith once stated:

> This one has got me stuck for the moment, can anyone come up with an
> elegant solution for this without needing external library please.
>
> I have a long string of text that i need to split into sentences, here is
> a sort of working attempt
>
>  local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
>
> local sentences = {}
> for i in string.gmatch(text,  "[^%.]+" ) do
> sentences[#sentences+1] = i
>  end
>
> for i = 1, #sentences do
> print(i, sentences[i])
> end
>
> Of course I had forgotten about not splitting on decimal points in
> numbers.  How can I adapt this to ignore the full stop character if
> surrounded by numbers?

  I had a similar issue back in 2014 [1] where I used LPEG to do the
parsing.  What I found is that just breaking on a period wasn't enough, and
so I had to special case the following [2]:

        MR.
        Mrs.
        MRS.
        Dr.
        DR.
        P. S.
        P.S.
        T. E.
        T.E.
        Gen.
        N. B.
        N.B.
        H.
        M.
        O.
        Z.

  The nice thing about LPEG was not only how easy it was to add exceptions
(like the list above) but I could also transform the input into a canonical
format (like converting N.B. to N. B.).

  So yes, I do have a solution, but it does violate your constraint.

  -spc (My use case was breaking the input into words, but it's similar
        enough ... )

[1] https://github.com/spc476/NaNoGenMo-2014

        Code I used:

        https://github.com/spc476/NaNoGenMo-2014/blob/master/word.lua

[2] Some, like Mrs. are generic, while T. E. were initials specific to
        the document.


Re: Tricky little string parsing challenge

szbnwer@gmail.com
hi there! :)

another gotcha is when an abbreviation with a point after it
stands at the end of a sentence, and in that case there's no extra
full stop, at least in hungarian, but i think in english too... and the
previous wasn't 3 sentences, while this is one! :D btw what about
factorials (5!=120)? about malformed texts, like mine? u can only
reach higher and higher precision, but it's a hard nut to make it
perfect... what u can actually achieve is to make it semi-automated,
and completed on a per-document basis, like collecting the nasty
bits (punctuation, float numbers and capital letters) and hand-selecting
them, or giving it some additional rules in general, or on a per-document
basis...

good luck, have fun! :D
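
The "collect the nasty bits and hand-select them" idea could start from something like the sketch below. It only gathers punctuation candidates (not capital letters or numbers), and the function name, the `radius` parameter and the record layout are all made up for illustration.

```lua
-- Sketch: list every ".", "!" or "?" together with a little surrounding
-- context, so that a human can decide which ones really end a sentence.
-- `radius` is a hypothetical parameter: characters of context on each side.
local function collect_suspects(text, radius)
  radius = radius or 10
  local suspects = {}
  -- The empty capture () yields the position of each candidate character.
  for pos in text:gmatch("()[.!?]") do
    suspects[#suspects + 1] = {
      pos = pos,
      context = text:sub(math.max(1, pos - radius), pos + radius),
    }
  end
  return suspects
end

for _, s in ipairs(collect_suspects("Pi is 3.14. What about 5!=120? Fine.", 6)) do
  print(s.pos, s.context)
end
```

The resulting list could then be filtered by document-specific rules before anything is split.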

Re: Tricky little string parsing challenge

szbnwer@gmail.com
+1:

1.2.3:
a) ...
b) ...
ba) ...
bb) ...
c) ...

just try to be smart based on the parentheses! :D btw the law stuffs
are kinda very well formed stuffs actually, so thats a good
playground. :D (<--my future plans) nlp is actually a thing where i'd
seriously consider getting a heavy-weight tool for anything
universal, while i'd still prefer my own one, for my custom needs, my
own mental models, better understanding, and what not. if i can parse
the language, that the nlp stuff was written in, then i can easily
collect out its data set goodies, while probably i can automatize it
for a next version....

bests! :)

2019-03-19 17:08 GMT, [hidden email] <[hidden email]>:

> hi there! :)
>
> an another gotcha is when an abbreviation with a point after it can
> stand on the end of a sentence, and in that case, theres no extra
> fullstop, at least in hungarian, but i think in english too... and the
> previous wasnt 3 sentences, while this is one! :D btw what about
> factorials (5!=120)? about malformed texts, like mine? u can only
> reach higher and higher precision, but its a hard nut to make it it
> perfect... what u can actually achieve is to make it semi automated,
> and completed on a single document base, like collecting the nasty
> bits (punctuation, float numbers and capital letters), and handselect
> them, or give it some additional rules in general, or on a document
> base...
>
> good luck, have fun! :D
>

Re: Tricky little string parsing challenge

Dirk Laurie-2
In reply to this post by Geoff Smith
On Tue, 19 Mar 2019 at 14:40, Geoff Smith <[hidden email]> wrote:

>
> This one has got me stuck for the moment, can anyone come up with an elegant solution for this without needing external library please.
>
> I have a long string of text that i need to split into sentences, here is a sort of working attempt
>
>  local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
>
> local sentences = {}
> for i in string.gmatch(text,  "[^%.]+" ) do
> sentences[#sentences+1] = i
>  end
>
> for i = 1, #sentences do
> print(i, sentences[i])
> end
>
> Of course I had forgotten about not splitting on decimal points in numbers.  How can I adapt this to ignore the full stop character if surrounded by numbers?

You can use the pattern (full stop, whitespace, capital letter) to
terminate a sentence.

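function break_into_sentences(str)
  local prose = {}
  local start = 1
  repeat
    -- Lazily capture the text before a full stop that is followed by
    -- whitespace and a capital letter; () captures where that letter sits.
    local sentence,found = str:match("(.-)%.%s+()%u",start)
    prose[#prose+1] = sentence  -- no-op when nothing matched (sentence is nil)
    if found then start=found end
  until not found
  -- Whatever is left over is the final sentence.
  prose[#prose+1] = str:sub(start)
  return prose
end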

Of course, there will still be some cases like "Dr. No" where the
author did not mean to end a sentence. That is no longer a string
parsing challenge, but a question of designing an unambiguous grammar.
For example:

* Bernard Shaw 100 years ago already dropped full stops in
abbreviations, and it is fairly common practice nowadays.
* I used to (but no longer do) put two spaces after each genuine full stop.
* You could put hard spaces after abbreviations.
* You could gsub a list of allowed exceptions replacing a full stop by
a central dot or other unused Unicode character, and afterwards gsub
them back in.

Etc.
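
The last suggestion (gsub the exceptions out, split, gsub them back in) might look roughly like this. It is a sketch only: the exception list is hypothetical, the helper names are made up, and the byte "\1" is used as the stand-in character instead of a central dot, on the assumption that it never occurs in the text.

```lua
-- Hypothetical exception list; a real one would be tuned per document.
local exceptions = { "Dr.", "Mr.", "Mrs.", "No." }

-- Assumption: "\1" never occurs in the text, so it can stand in for a
-- protected full stop (used here in place of a central dot).
local PLACEHOLDER = "\1"

local function escape_pattern(s)
  -- Escape every non-alphanumeric character so it is matched literally.
  return (s:gsub("%W", "%%%0"))
end

local function split_with_exceptions(text)
  -- 1. Hide the full stops of the listed abbreviations. The frontier
  --    %f[%w] keeps "Dr." from matching in the middle of a longer word.
  for _, abbr in ipairs(exceptions) do
    local masked = abbr:gsub("%.", PLACEHOLDER)
    text = text:gsub("%f[%w]" .. escape_pattern(abbr), masked)
  end
  -- 2. Split on (full stop, whitespace, capital letter), keeping the stop.
  local sentences = {}
  local start = 1
  while true do
    local sentence, next_start = text:match("(.-%.)%s+()%u", start)
    if not sentence then break end
    sentences[#sentences + 1] = sentence
    start = next_start
  end
  sentences[#sentences + 1] = text:sub(start)
  -- 3. Put the hidden full stops back.
  for i, sentence in ipairs(sentences) do
    sentences[i] = (sentence:gsub(PLACEHOLDER, "."))
  end
  return sentences
end

local demo = "Dr. No met Mr. Smith at 0.47 past. He left. Then Mrs. Jones arrived."
for i, sentence in ipairs(split_with_exceptions(demo)) do
  print(i, sentence)
end
```

An abbreviation that genuinely ends a sentence is still mishandled by this approach, so it narrows the problem rather than solving it.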

Re: Tricky little string parsing challenge

Steve Litt
In reply to this post by Geoff Smith
On Tue, 19 Mar 2019 12:07:19 +0000
Geoff Smith <[hidden email]> wrote:


> Of course I had forgotten about not splitting on decimal points in
> numbers.  How can I adapt this to ignore the full stop character if
> surrounded by numbers?
>
> Thanks for any solutions.

The problem is in the specification. It's not easy to describe what's a
sentence ender and what's a decimal point. I'd split on a dot followed
immediately by whitespace: Space, Tab, Newline or Formfeed.
 
SteveT
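
A minimal sketch of the rule described above, splitting on a dot followed immediately by whitespace. Lua's %s class covers space, tab, newline and formfeed (plus carriage return and vertical tab). The function name is made up for illustration.

```lua
-- Sketch: a sentence ends at a full stop that is immediately followed by
-- at least one whitespace character.
local function split_on_dot_whitespace(text)
  local sentences = {}
  local start = 1
  while true do
    -- dot: index of the full stop; last_space: index of the last
    -- whitespace character in the run that follows it.
    local dot, last_space = text:find("%.%s+", start)
    if not dot then break end
    sentences[#sentences + 1] = text:sub(start, dot)
    start = last_space + 1
  end
  -- Whatever remains (possibly without a final full stop) is the last sentence.
  if start <= #text then
    sentences[#sentences + 1] = text:sub(start)
  end
  return sentences
end

local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
for i, sentence in ipairs(split_on_dot_whitespace(text)) do
  print(i, sentence)
end
```

Because "0.47" has a digit rather than whitespace after its dot, it is left alone; abbreviations such as "Mr. " are still split.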

Re: Tricky little string parsing challenge

Sean Conner
It was thus said that the Great Steve Litt once stated:

> On Tue, 19 Mar 2019 12:07:19 +0000
> Geoff Smith <[hidden email]> wrote:
>
>
> > Of course I had forgotten about not splitting on decimal points in
> > numbers.  How can I adapt this to ignore the full stop character if
> > surrounded by numbers?
> >
> > Thanks for any solutions.
>
> The problem is in the specification. It's not easy to describe what's a
> sentence ender and what's a decimal point. I'd split on a dot followed
> immediately by whitespace: Space, Tab, Newline or Formfeed.

  Mr. Litt would break on a dot followed by whitespace. Mr. Conner would
disagree, as he thinks e. e. cummings would also disagree.  What constitutes
a sentence?  Is this a sentence?

  -spc (He who took the No. 9 train.)




Re: Tricky little string parsing challenge

Steve Litt
On Thu, 21 Mar 2019 18:08:15 -0400
Sean Conner <[hidden email]> wrote:

> It was thus said that the Great Steve Litt once stated:
> > On Tue, 19 Mar 2019 12:07:19 +0000
> > Geoff Smith <[hidden email]> wrote:
> >
> >  
> > > Of course I had forgotten about not splitting on decimal points in
> > > numbers.  How can I adapt this to ignore the full stop character
> > > if surrounded by numbers?
> > >
> > > Thanks for any solutions.  
> >
> > The problem is in the specification. It's not easy to describe
> > what's a sentence ender and what's a decimal point. I'd split on a
> > dot followed immediately by whitespace: Space, Tab, Newline or
> > Formfeed.  
>
>   Mr. Litt would break on a dot followed by whitespace. Mr. Conner
> would disagree, as he thinks e. e. cummings would also disagree.
> What constitutes a sentence?  Is this a sentence?
>
>   -spc (He who took the No. 9 train.)

I have no idea whether the preceding is a sentence. Maybe the Chicago
Manual of Style would help?

You bring up an interesting point. No matter how wonderful our
sentence detection algorithm, there will always be exceptions. Maybe
the key is to go as far as possible with the general algorithm, and
then use a blacklist and a whitelist, both for specific text in the
document and for specific phrases.

Also, as Dirk pointed out, it might be better to split on a dot, one
or two spaces, and a capital letter. Unless, of course, you're
beginning the sentence with "systemd", whose producers insist on
spelling it with all small characters. Also, I forgot that sentences
can end with an exclamation point or a question mark.

Is regex the best way, or might this better be done with callback
routines?

SteveT
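
On patterns versus callbacks, the two can be combined: a pattern finds candidate boundaries (any of ".", "!" or "?" followed by whitespace), and a callback decides whether each candidate really ends a sentence. This is a sketch under stated assumptions: the function names, the callback signature and the abbreviation table are all hypothetical.

```lua
-- Hypothetical abbreviation table consulted by the default callback.
local abbreviations = { Mr = true, Mrs = true, Dr = true, No = true }

-- Default rule (an assumption, not a definition of "sentence"): the next
-- sentence must start with a capital letter, and the candidate must not
-- end in a listed abbreviation.
local function default_is_boundary(candidate, rest)
  if not rest:find("^%u") then return false end
  local last_word = candidate:match("(%a+)%.$")
  return not (last_word and abbreviations[last_word])
end

local function split_sentences(text, is_boundary)
  is_boundary = is_boundary or default_is_boundary
  local sentences = {}
  local start, search = 1, 1
  while true do
    -- Candidate: a run of terminators followed by whitespace.
    local s, _, terminators, next_start = text:find("([.!?]+)%s+()", search)
    if not s then break end
    local candidate = text:sub(start, s + #terminators - 1)
    -- Copying the rest of the text is wasteful but keeps the sketch simple.
    if is_boundary(candidate, text:sub(next_start)) then
      sentences[#sentences + 1] = candidate
      start = next_start
    end
    search = next_start
  end
  if start <= #text then
    sentences[#sentences + 1] = text:sub(start)
  end
  return sentences
end

local demo = "Is this a sentence? Yes! Mr. Litt took the No. 9 train. Done."
for i, sentence in ipairs(split_sentences(demo)) do
  print(i, sentence)
end
```

Passing a different function as the second argument swaps in per-document rules (for example, one that allows a sentence to begin with "systemd") without touching the pattern.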

Re: Tricky little string parsing challenge

Sean Conner
It was thus said that the Great Steve Litt once stated:

> On Thu, 21 Mar 2019 18:08:15 -0400
> Sean Conner <[hidden email]> wrote:
>
> > It was thus said that the Great Steve Litt once stated:
> > > On Tue, 19 Mar 2019 12:07:19 +0000
> > > Geoff Smith <[hidden email]> wrote:
> > >
> > >  
> > > > Of course I had forgotten about not splitting on decimal points in
> > > > numbers.  How can I adapt this to ignore the full stop character
> > > > if surrounded by numbers?
> > > >
> > > > Thanks for any solutions.  
> > >
> > > The problem is in the specification. It's not easy to describe
> > > what's a sentence ender and what's a decimal point. I'd split on a
> > > dot followed immediately by whitespace: Space, Tab, Newline or
> > > Formfeed.  
> >
> >   Mr. Litt would break on a dot followed by whitespace. Mr. Conner
> > would disagree, as he thinks e. e. cummings would also disagree.
> > What constitutes a sentence?  Is this a sentence?
> >
> >   -spc (He who took the No. 9 train.)
>
> I have no idea whether the preceding is a sentence. Maybe the Chicago
> Manual of Style would help?
>
> You bring up an interesting point. No matter how wonderful our
> sentence detection algorithm, there will always be exceptions. Maybe
> the key is to go as far as possible with the general algorithm, and
> then use a blacklist and whitelist for each of specific text in the
> document and specific phrases.
>
> Also, as Dirk pointed out, it might be better to split on a dot, one
> or two spaces, and a capital letter.

        Mr.
        Litt would break on a dot followed by whitespace.
        Mr.
        Conner would disagree, as he thinks e.
        e.
        cummings would also disagree.

> Unless, of course, you're
> beginning the sentence with "systemd",

  e. e. cummings is also an exception here.

> whose producers insist on
> spelling it with all small characters. Also, I forgot that sentences
> can end with an exclamation point or a question mark.

  You forgot the interrobang‽

> Is regex the best way, or might this better be done with callback
> routines?

  LPEG.  

  -spc (Definitely LPEG)