splitting camel case?

splitting camel case?

Petite Abeille
Hello,

Does anyone have a simple pattern to split camel case strings?

For example:

"componentWithName" -> { "component", "With", "Name" }

Here is what I have presently:

( "componentWithName" ):gsub( "%u?%l+", print )

 > component
 > With
 > Name

However this doesn't quite cover all potential cases, especially
acronyms (e.g. "componentWithURL"):

http://www.cincomsmalltalk.com/userblogs/vbykov/blogView?showComments=true&entry=3271976088
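
To make the gap concrete, here is a minimal sketch that gathers the matches into a table instead of printing them. It assumes Lua 5.1 or later (for "gmatch" and the "#" operator), and the "collect" helper is a made-up name for illustration only:

```lua
-- Collect every match of a pattern into a table.
-- "collect" is an illustrative helper, not part of the standard library.
local function collect(s, pattern)
  local words = {}
  for word in s:gmatch(pattern) do
    words[#words + 1] = word
  end
  return words
end

local plain = collect("componentWithName", "%u?%l+")
local acronym = collect("componentWithURL", "%u?%l+")

print(table.concat(plain, ", "))    --> component, With, Name
print(table.concat(acronym, ", "))  --> component, With
```

The pattern requires at least one lowercase letter per match, so a run of capitals such as "URL" never matches and is silently dropped.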

Also, is the "frontier pattern" aka "%f" something which is officially
supported? It doesn't seem to be documented anywhere besides this wiki
page:

http://lua-users.org/wiki/FrontierPattern
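
For readers who have not met it, "%f[set]" matches the empty string at a position where the previous character is not in the set and the current one is. The sketch below is only an illustration of that behaviour applied to camel case, not a complete solution, and the variable names are invented:

```lua
-- Insert a space at every transition from a non-uppercase character
-- to an uppercase one. %f[%u] matches the empty string at that spot.
local spaced = ("componentWithURL"):gsub("%f[%u]", " ")
print(spaced)  --> component With URL

-- A run of capitals counts as a single transition, so an acronym
-- followed by another word is not separated from it.
local glued = ("componentWithURLAndParameters"):gsub("%f[%u]", " ")
print(glued)   --> component With URLAnd Parameters
```

Note that a string starting with a capital would get a leading space, because the position before the first character also counts as "not in the set".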

Ideas?

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

Re: splitting camel case?

David Gish

> Here is what I have presently:
>
> ( "componentWithName" ):gsub( "%u?%l+", print )
>
> > component
> > With
> > Name
>
> However this doesn't quite cover all potential cases, specially
> acronyms (e.g. "componentWithURL"):

Such limitations are inherent to "camel case" (aka "inner-caps"). For
example, there is no way to parse "componentWithURLAndParameters" with
a simple regular expression. The solution is to use a limited
dictionary containing "URL" and any other problematic tokens. Then you
parse in two phases:

1. Parse dictionary terms:
        "componentWithURLAndParameters" ->

        > "componentWith"
        > "URL" (flag as dictionary term)
        > "AndParameters"

2. Parse "camel" cases for each resulting token into a secondary
buffer, passing dictionary terms unchanged:

        > "component"
        > "With"
        > "URL"
        > "And"
        > "Parameters"
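
The two phases above could be sketched roughly as follows. This is only one possible reading of the idea, assuming Lua 5.1 or later; the dictionary contents and the function names "split_plain" and "split_with_dictionary" are invented for the example:

```lua
-- Hypothetical dictionary of tokens that must pass through unchanged.
local DICTIONARY = { "URL", "API", "CPU", "ID" }

-- Phase 2: split an ordinary camel-case fragment and append its words.
local function split_plain(fragment, out)
  for word in fragment:gmatch("%u?%l+") do
    out[#out + 1] = word
  end
end

local function split_with_dictionary(s)
  local out = {}
  local pos = 1
  while pos <= #s do
    -- Phase 1: find the earliest dictionary term at or after pos
    -- (plain find, no pattern characters).
    local first, last
    for _, term in ipairs(DICTIONARY) do
      local i, j = s:find(term, pos, true)
      if i and (not first or i < first) then
        first, last = i, j
      end
    end
    if not first then
      split_plain(s:sub(pos), out)   -- no more terms: split the rest
      break
    end
    split_plain(s:sub(pos, first - 1), out)
    out[#out + 1] = s:sub(first, last) -- dictionary term, kept as is
    pos = last + 1
  end
  return out
end

print(table.concat(split_with_dictionary("componentWithURLAndParameters"), " "))
--> component With URL And Parameters
```

Any acronym missing from the dictionary is still dropped by the phase 2 pattern, so the result is only as good as the dictionary.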

Cheers,

David Gish
Senior Systems Engineer
Aspyr Media
www.aspyr.com

Re: splitting camel case?

Jens Alfke
In reply to this post by Petite Abeille

On 14 Jan '06, at 4:53 AM, PA wrote:

> Does anyone have a simple pattern to split camel case strings?


I don't, but I can suggest looking at the source code to various wiki markup engines, since they all have to detect (if not split) CamelCase, and they probably use regexps to do it, which would probably port well to Lua patterns. 

IIRC, Instiki <http://instiki.org> does split apart the words in a page title, so that might be the best place to start. (E.g., if there is a page called "componentWithURL", when you view that page the title will show as "component With URL".) Plus, Instiki is written in Ruby, which is sort of a cousin of Lua so the source should be more readable than, say, Perl :)

> Also, is the "frontier pattern" aka "%f" something which is officially supported? It doesn't seem to be documented anywhere beside this wiki page:

Interesting! Thanks for pointing that out; it's useful. Somewhat like the usual regexp "\b" but more general-purpose. If it's supported, could it please be added to the 5.1 reference manual?

--Jens

Re: splitting camel case?

Michael Richter
In reply to this post by David Gish
On Sat, 2006-14-01 at 12:10 -0600, David Gish wrote:
> Such limitations are inherent to "camel case" (aka "inner-caps"). For
> example, there is no way to parse "componentWithURLAndParameters" with
> a simple regular expression. The solution is to use a limited
> dictionary containing "URL" and any other problematic tokens.

This doesn't strike me as a robust solution.  Why not instead use two regular expressions?  The first would find you the subtokens that are all-caps and the second would then split apart what's left at the single-cap borders.

Michael T. Richter
Email: [hidden email], [hidden email]
MSN: [hidden email], [hidden email]; YIM: michael_richter_1966; AIM: YanJiahua1966; ICQ: 241960658; Jabber: [hidden email]

Re: splitting camel case?

Petite Abeille

On Jan 14, 2006, at 22:59, Michael T. Richter wrote:

>  On Sat, 2006-14-01 at 12:10 -0600, David Gish wrote:
>> Such limitations are inherent to "camel case" (aka "inner-caps"). For
>> example, there is no way to parse "componentWithURLAndParameters" with
>> a simple regular expression. The solution is to use a limited
>> dictionary containing "URL" and any other problematic tokens.
>>
>
>  This doesn't strike me as a robust solution.  Why not instead use two
> regular expressions? 

At first glance, the "URLAnd" sequence would appear problematic.

>  The first would find you the subtokens that are all-caps and the
> second would then split apart what's left at the single-cap borders.

Perhaps something along these lines:

local aString = "componentWithURLAndParameters"

aString = aString:gsub( "(%l)(%u)", "%1 %2" )
aString = aString:gsub( "(%u)(%u)(%l)", "%1 %2%3" )
aString:gsub( "%a+", print )

 > component
 > With
 > URL
 > And
 > Parameters

Taken from "Programmer's Purgatory":
http://secretgeek.net/progr_purga.asp

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/

Re: splitting camel case?

David Gish
On Jan 14, 2006, at 5:16 PM, PA wrote:

> At first glance, the "URLAnd" sequence would appear problematic.
>
> Perhaps something along these lines:
>
> local aString = "componentWithURLAndParameters"
>
> aString = aString:gsub( "(%l)(%u)", "%1 %2" )
> aString = aString:gsub( "(%u)(%u)(%l)", "%1 %2%3" )
> aString:gsub( "%a+", print )
>
> > component
> > With
> > URL
> > And
> > Parameters
>
Much better, but problematic cases are still fairly easy to construct:

"componentWithAURL"
"registerAPIs"
"getCPUID"

The first one's a bit contrived, I admit, but the symbol table of any
moderately sized program probably contains more than a few like these.
As someone once said, "The problem with making something idiot-proof is
that idiots are so clever."
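
To see how these cases actually come out, here is a small sketch that wraps the two substitutions quoted above in a function and returns the words as a table. It assumes Lua 5.1 or later, and the name "split_two_pass" is made up for the example:

```lua
-- The two gsub passes quoted above, wrapped in a function.
local function split_two_pass(s)
  s = s:gsub("(%l)(%u)", "%1 %2")        -- lower followed by upper
  s = s:gsub("(%u)(%u)(%l)", "%1 %2%3")  -- end of an acronym before a word
  local words = {}
  for word in s:gmatch("%a+") do
    words[#words + 1] = word
  end
  return words
end

print(table.concat(split_two_pass("componentWithAURL"), " "))  --> component With AURL
print(table.concat(split_two_pass("registerAPIs"), " "))       --> register AP Is
print(table.concat(split_two_pass("getCPUID"), " "))           --> get CPUID
```

Adjacent acronyms are fused into one token, and a plural "s" pulls the last capital away from its acronym.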

Cheers,

David

Re: splitting camel case?

Rob Kendrick
On Sat, 2006-01-14 at 20:09 -0600, David Gish wrote:

<snip>

> Much better, but problematic cases are still fairly easy to construct:
>
> "componentWithAURL"

<snip>

> The first one's a bit contrived, I admit,

Well, yes - it should be "componentWithAnURL" which also solves the
problem :)

--
Rob Kendrick <[hidden email]>

Re: splitting camel case?

Michael Abbott
In reply to this post by David Gish
On Sat, 2006-01-14 at 20:09 -0600, David Gish wrote:
> On Jan 14, 2006, at 5:16 PM, PA wrote:
> "componentWithAURL"
> "registerAPIs"
> "getCPUID"
>
> The first one's a bit contrived, I admit, but the symbol table of any
> moderately sized program probably contains more than a few like these.
> As someone once said, "The problem with making something idiot-proof is
> that idiots are so clever."

I'm not sure whether you're trying to read particular camel case entries
(for a pre-defined Wiki environment or something like that), but it's
cases like the one above which have made me prefer the
"componentWithAUrl" style camel-case (where the second and subsequent
letters of an acronym are lower-case).  I'm not sure what that
particular camel-case style is called.

The advantage is that since acronyms are always treated as words ("Which
URL were you using?") they are split properly.  Of course they don't get
capitalised properly but you have the same issue with the
"componentWithAURL" case anyway.
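
As a rough check of that claim, the sketch below runs PA's two substitutions from earlier in the thread over identifiers written in this style. It assumes Lua 5.1 or later; "split_two_pass" is an invented name and the sample identifiers are adapted from the problem cases quoted above:

```lua
-- PA's two gsub passes from earlier in the thread, wrapped in a function.
local function split_two_pass(s)
  s = s:gsub("(%l)(%u)", "%1 %2")
  s = s:gsub("(%u)(%u)(%l)", "%1 %2%3")
  local words = {}
  for word in s:gmatch("%a+") do
    words[#words + 1] = word
  end
  return words
end

-- Acronyms written as ordinary capitalised words.
print(table.concat(split_two_pass("componentWithAUrl"), " "))  --> component With A Url
print(table.concat(split_two_pass("registerApis"), " "))       --> register Apis
print(table.concat(split_two_pass("getCpuId"), " "))           --> get Cpu Id
```

With this convention no dictionary is needed, at the cost of losing the original capitalisation of each acronym.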

My 2c,
- Mab

Splitting hairs. Was Re: splitting camel case?

Rici Lake-2
In reply to this post by Rob Kendrick

On 14-Jan-06, at 9:35 PM, Rob Kendrick wrote:

> Well, yes - it should be "componentWithAnURL" which also solves the
> problem :)

An URL? As in http://www.william-le-gros.gov/, the last URL of York?

Maybe I'm just behind the times, pronunciation-wise, but I pronounce
it "you are ell", so it's a URL to me.

Re: Splitting hairs. Was Re: splitting camel case?

Michael Abbott
On Sat, 2006-01-14 at 22:12 -0500, Rici Lake wrote:
> On 14-Jan-06, at 9:35 PM, Rob Kendrick wrote:
>
> > Well, yes - it should be "componentWithAnURL" which also solves the
> > problem :)
>
> An URL? As in http://www.william-le-gros.gov/, the last URL of York?
>
> Maybe I'm just behind the times, pronunciation-wise, but I pronounce
> it "you are ell", so it's a URL to me.

Shouldn't it still strictly be "an URL" even if you pronounce it that
way?  That is, you'd say "Give me an ETA".  I thought that it might be
because URL sounds like it has a silent 'y' but that also isn't the case
for something like "I'll be there in an hour".

Oh, and I told my housemate about this discussion and his response was
"well that's the split-hair that broke the camel's back".  Quite.

- Mab

Re: Splitting hairs. Was Re: splitting camel case?

David Haley
On this day of 1-14-2006 7:26 PM, Michael Abbott saw fit to scribe:

> On Sat, 2006-01-14 at 22:12 -0500, Rici Lake wrote:
>  
>> An URL? As in http://www.william-le-gros.gov/, the last URL of York?
>>
>> Maybe I'm just behind the times, pronunciation-wise, but I pronounce
>> it "you are ell", so it's a URL to me.
>>    
>
> Shouldn't it still strictly be "an URL" even if you pronounce it that
> way?  That is, you'd say "Give me an ETA".  I thought that it might be
> because URL sounds like it has a silent 'y' but that also isn't the case
> for something like "I'll be there in an hour".
>  

If you pronounce it "you are ell", it should be "a URL", not "an URL". The
'y' isn't silent, else it'd be "oo are ell", in which case you'd use "an"
before it. The 'a' vs. 'an' rule doesn't depend strictly on whether the
following word is spelled with a vowel, but on how you pronounce it. I don't
know the technical term for that, though...

--
~David-Haley
http://david.the-haleys.org


Re: Splitting hairs. Was Re: splitting camel case?

Ben Sunshine-Hill
On 1/14/06, David Haley <[hidden email]> wrote:
> The 'a' vs. 'an' rule doesn't have to do strictly with the
> following word having a vowel, but with how you pronounce it. I don't
> know the technical term for that, though...

Lexical scoping.

Re: Splitting hairs. Was Re: splitting camel case?

David Gish
In reply to this post by David Haley
On Jan 14, 2006, at 10:05 PM, David Haley wrote:
> The 'a' vs. 'an' rule doesn't have to do strictly with the following
> word having a vowel, but with how you pronounce it.

Since the day that I was born,
I've never seen an unicorn.

Cheers,

David