Lua pattern to match an url

luciano de souza-2
Hello all,
Cultura FM radio has some interesting audio programs about classical music.
I'd like to download them automatically.
The steps are:
1. Get the page with http.request;
2. Match URLs that start with 'http://' and end with '.mp3';
3. Record the URLs in a file.
My problem is step 2. I could not find a pattern to match URLs like:
http://midia.cmais.com.br/assets/audio/default/CENA_00087___P___24_12_10_1293450448.mp3

Here is my attempt:

local http = require('socket.http')

local target = 'http://culturafm.cmais.com.br/cena-brasileira/cena-brasileira'

local content, status = http.request(target)

if status == 200 then
        local file = io.open('url.txt', 'w')
        local pattern = '(http://[a-zA-Z0-9_/]-%.mp3)'
        for url in content:gmatch(pattern) do
                file:write(url)
        end
        file:close()
end

Does anyone know a Lua pattern that matches URLs starting with "http://"
and ending with '.mp3'?
Best regards,

--
Luciano de Souza


Re: Lua pattern to match an url

Philippe Verdy
There's no standard requiring download URLs to end in .mp3. The standard way to identify content is the MIME type returned when the URL is queried.

You can query the MIME type of an http(s) URL without downloading the resource by using a HEAD request rather than GET.

URLs have a standard syntax for parsing them, which distinguishes the protocol, the host name or address, a possible port number, a path and a query string. Not all of these are required, and none of them indicates a MIME type. Although the path is frequently used to hint at the MIME type, this is not required: the actual MP3 you request may be selected by the query string, and both the path and the query string may use an arbitrary encoding defined by the server, possibly depending on the user's session, i.e. cookies, additional parameters in the query string, or encoded form data submitted outside the URL, such as authentication parameters or user preferences set by form input variables (possibly hidden). Each web site therefore defines its own encodings and API for paths and query strings as well as form data.

When your request succeeds, the response carries the binary MP3, its MIME type, and possibly a suggested name for storing the file in your local file system. The response may just as well be an error status or an HTML page (such as a log-on form, or an explanation of why the server denied your request).
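
The HEAD approach described above could look roughly like this with LuaSocket. It is only a sketch under stated assumptions: `is_mp3` is a hypothetical helper name, its optional `request` parameter exists only so the function can be tried without a network connection, and `audio/mpeg` is assumed to be the MIME type the server reports for MP3 files (some servers may report something else).

```lua
-- Sketch: check the Content-Type of a URL with a HEAD request.
-- `is_mp3` and its `request` parameter are illustrative names,
-- not part of LuaSocket.
local function is_mp3(url, request)
  -- Default to LuaSocket's http.request when no stand-in is supplied.
  request = request or require('socket.http').request
  -- The table form of http.request returns 1, the status code and the
  -- headers on success, or nil plus an error message on failure.
  local ok, code, headers = request{ url = url, method = 'HEAD' }
  if not ok or code ~= 200 or type(headers) ~= 'table' then
    return false
  end
  -- LuaSocket stores header names in lowercase.
  local ctype = headers['content-type'] or ''
  -- Assumption: the server labels MP3 audio as audio/mpeg.
  return ctype:lower():find('audio/mpeg', 1, true) ~= nil
end
```

Calling `is_mp3(url)` with a single argument uses the real `socket.http` module and therefore needs LuaSocket installed and network access.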



Re: Lua pattern to match an url

luciano de souza-2
Actually, the problem is not how to download the file. That can be
done as follows:

local http = require('socket.http')

local url = 'http://midia.cmais.com.br/assets/audio/default/CENA_00087___P___24_12_10_1293450448.mp3'

-- http.request returns the response body and the numeric status code
local content, status = http.request(url)
if status == 200 then
    -- 'wb' keeps the binary data intact on Windows
    local file = assert(io.open('file.mp3', 'wb'))
    file:write(content)
    file:close()
end

The problem is that the URLs have to be found in an HTML file. In my
example I used a known URL, but to discover it a web page needs to be
scanned.
I have this URL: http://culturafm.cmais.com.br/cena-brasileira/cena-brasileira.
Inside the HTML there are lots of URLs. Some of them are direct links to
audio files. I don't need MIME types because I know the links end
with ".mp3".
So my problem is which pattern matches URLs that start with "http" and
end with ".mp3". After obtaining the list of URLs pointing to audio
files, I can download each of them as I did in the example
above.
Simplifying: if I have the URL 'http://www.a.com/b/c/d.mp3' or
something like it, which Lua pattern matches it?
Perhaps I explained more than necessary in my first message and was
confusing. Sorry!
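
The "download each of them" step mentioned above could be sketched as a loop like the following. The names `download_all`, `fetch`, `save` and `save_to_disk` are hypothetical; `fetch` stands in for `http.request`, and each file is assumed to be named after the last path segment of its URL.

```lua
-- Sketch: download every URL in a list, naming each file after the
-- last path segment. All function names here are illustrative.
local function download_all(urls, fetch, save)
  local saved = {}
  for _, url in ipairs(urls) do
    local name = url:match('([^/]+)$')   -- text after the last '/'
    local content, status = fetch(url)
    if name and status == 200 then
      save(name, content)
      saved[#saved + 1] = name
    end
  end
  return saved
end

-- Writes binary data to disk; 'wb' avoids newline translation on Windows.
local function save_to_disk(name, content)
  local file = assert(io.open(name, 'wb'))
  file:write(content)
  file:close()
end

-- Usage with LuaSocket (not run here):
-- download_all(urls, require('socket.http').request, save_to_disk)
```

Passing `fetch` and `save` as arguments keeps the loop independent of the network and the file system, which makes it easy to try with stand-in functions first.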



--
Luciano de Souza


Re: Lua pattern to match an url

nobody
In reply to this post by luciano de souza-2

> On 25. Dec 2019, at 18:02, luciano de souza <[hidden email]> wrote:
>
> I could not find a pattern to match urls like:
> http://midia.cmais.com.br/assets/audio/default/CENA_00087___P___24_12_10_1293450448.mp3
>
> *snip*
>
> local pattern = '(http://[a-zA-Z0-9_/]-%.mp3)'
>
> Would someone know a lua pattern to match urls started with "http://"
> and ended with '.mp3'?

I'm not near a computer right now, so I can't check whether that's all, but I think you just forgot a dot in the character class? (The host name contains dots, which [a-zA-Z0-9_/] does not match.)

Failing that, maybe try
"http://[^ ]+%.mp3"
which might still work and is simpler. (Also if you capture the full match, you don't need to parenthesize it IIRC.)
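
A small sketch of the missing-dot diagnosis, run against a made-up HTML fragment rather than the real page. The `collect` helper and the `index.html` link are assumptions added for illustration; the two patterns differ only by the dot added to the character class.

```lua
-- Sample HTML; only the first two links point at .mp3 files.
local html = [[
<a href="http://midia.cmais.com.br/assets/audio/default/CENA_00087___P___24_12_10_1293450448.mp3">Cena</a>
<a href="http://www.a.com/b/c/d.mp3">d</a>
<a href="http://www.a.com/index.html">home</a>
]]

local original = 'http://[a-zA-Z0-9_/]-%.mp3'   -- no dot in the class
local fixed    = 'http://[a-zA-Z0-9_/.]-%.mp3'  -- dot added to the class

-- Gathers every match of `pattern` in `text` into a list.
local function collect(text, pattern)
  local urls = {}
  for url in text:gmatch(pattern) do
    urls[#urls + 1] = url
  end
  return urls
end

local before = collect(html, original)
local after = collect(html, fixed)
print(#before, #after)   -- prints 0 and 2
```

The original pattern stops at the first dot of the host name, so it never reaches ".mp3". Inside a character class the dot is a literal dot, so adding it does not turn the class into a wildcard.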

-- nobody


Re: Lua pattern to match an url

luciano de souza-2
In reply to this post by luciano de souza-2
I know my English needs to improve. I'm sure my difficulty in
explaining comes from that.
I said "Lua pattern"; perhaps it's better to say "Lua string
pattern" or "Lua regex".
I get a web page:

local content, status = http.request(url)

Now the "content" variable holds the entire content of the web page.
Using Lua string patterns, I would like to get all URLs that start
with "http" and end with ".mp3". I know in advance that the target URLs
end with ".mp3".
I tried something like:
for url in content:gmatch('(http[a-zA-Z0-9_/]-)%.mp3') do
    print(url)
end

But the above string pattern can't match URLs like:
http://midia.cmais.com.br/assets/audio/default/CENA_00087___P___24_12_10_1293450448.mp3



--
Luciano de Souza


Re: Lua pattern to match an url

luciano de souza-2
Oh! Wonderful! Problem solved. The dot was indeed missing.
Thank you very much.



--
Luciano de Souza


Re: Lua pattern to match an url

Philippe Verdy
It matches too many things. Not all characters are valid in a URI: look at the specs in the relevant RFC, and at the RFC describing URL-encoding for why other punctuation and whitespace characters in the ASCII range must be escaped. A dot pattern would match those characters unexpectedly, making your matched URI much longer than expected.
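
To make the concern concrete: in Lua patterns a bare dot outside a character class matches any character, spaces included, while a dot inside a class is a literal dot. The sketch below compares the two on a made-up sentence; the host names and variable names are assumptions for illustration only.

```lua
local text = 'see http://a.com/page.html and then http://b.com/song.mp3 here'

-- Bare dot: '.-' may run across spaces and into the next URL.
local by_wildcard = text:match('http://.-%.mp3')

-- Dot inside a class: only letters, digits, '_', '/' and '.' may repeat.
local by_class = text:match('http://[%w_/.]-%.mp3')

print(by_wildcard)
print(by_class)
```

The wildcard version returns everything from the first "http://" up to the first ".mp3", which here spans both URLs; the class version returns only the second URL.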


Re: Lua pattern to match an url

nobody

> On 25. Dec 2019, at 22:44, Philippe Verdy <[hidden email]> wrote:
>
> It matches too many things […]

When dealing with _valid HTML,_ heuristically, any string that starts with 'http://', ends with '.mp3' and doesn't contain spaces is almost certainly exactly a URL pointing at (something that claims to be) an MP3. (The other pattern works, too.) [So a somewhat better pattern than what I initially suggested would be "http://%S+%.mp3", which also excludes line breaks.]

When you're not dealing with random / adversarial strings, that is good enough and you don't have to care about all those intricacies. From what I gathered, the goal is one-off semi-manual extraction of links from HTML generated by some other party, so even potential errors don't really matter… (The human in the loop can notice / fix things.)

-- nobody
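
One case where the "%S+" heuristic does grab too much is a link whose visible text is itself a file name. The sketch below uses a made-up anchor tag and compares it with a class that also excludes quotes and angle brackets; the tighter class is an assumption added here, not a pattern proposed in the thread.

```lua
local html = '<a href="http://www.a.com/b/c/d.mp3">d.mp3</a>'

-- Greedy run of non-space characters, then backtrack to the last '.mp3'.
local loose = html:match('http://%S+%.mp3')

-- Same idea, but the run also stops at quotes and angle brackets.
local tight = html:match('http://[^%s"\'<>]+%.mp3')

print(loose)
print(tight)
```

Here `loose` runs past the closing quote into the link text, while `tight` stops at the attribute's closing quote. For one-off extraction where a person reviews the result, either may be good enough.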


Re: Lua pattern to match an url

Philippe Verdy
Wrong: there are typically quotation marks around attributes, and the link anchor tag is followed by text which can be "MP3", so you would match too much.

The URL can use various extensions (including with variable letter case). Also note that the dot pattern does match spaces. And .mp3 links can also use non-ASCII characters, not necessarily URL-encoded, and you cannot safely guess which text encoding is used in the path or query string of the URL, as it is not necessarily the same as the encoding of the HTML page itself (including when it is URL-encoded to become ASCII).

URLs are designed to be opaque for most purposes, except that HTTP(S) URLs are designed to be "hierarchical" and give special behavior only to "/" and to ".." relative references; "." alone is supported by the file systems on which the web server is installed, but is not needed for HTTP(S), which defines its own file system space with web semantics, not the local OS semantics of the server. Besides that, path elements in HTTP are opaque binary: they cannot contain control or whitespace characters that have not been URL-encoded in %xx hexadecimal form, and they carry no semantics of their own; those are conveyed separately by the MIME type headers.

Once again, you need to stick to the URI RFC. Don't reinvent the wheel; there are already tons of URL parsers. Many MP3s on the web are never accessible through a URL ending with ".mp3" in its path or query string, and a URL ending in ".mp3" may just as well return no valid MP3 at all but plain HTML, plain text or another file format (as properly indicated by the MIME type header and the HTTP status code).

So please read the RFC: https://tools.ietf.org/html/rfc3986

Then look at the HTML references to see how these URLs are further re-encoded by another layer in HTML, which applies additional escaping when needed (such as character references "&name;", "&#numericDecimal;" or "&#xnumericHexadecimal;") using Unicode code points independently of the encoding in the URI itself. Three distinct encoding layers are applied to encapsulate the actual resource names; two of them are standard, but one of them is resolved only on the server side.
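
Several of the points above (quoted attributes, letter case, query strings, HTML character references) can be illustrated with a sketch that stays well short of a real HTML or URI parser. The names `decode_refs` and `mp3_links` are hypothetical, only double-quoted href attributes are handled, and only "&amp;" plus ASCII numeric references are decoded; anything else is left untouched.

```lua
-- Decode "&amp;" and ASCII numeric character references in one pass.
local function decode_refs(s)
  return (s:gsub('&(#?%w+);', function(ref)
    if ref == 'amp' then return '&' end
    local hex = ref:match('^#[xX](%x+)$')
    local dec = ref:match('^#(%d+)$')
    local n = hex and tonumber(hex, 16) or dec and tonumber(dec)
    if n and n < 128 then return string.char(n) end
    -- Returning nothing leaves any other reference unchanged.
  end))
end

-- Collect http(s) links whose path ends in .mp3, in any letter case.
local function mp3_links(html)
  local links = {}
  for href in html:gmatch('href%s*=%s*"([^"]+)"') do
    local url = decode_refs(href)
    local path = url:match('^[^?#]*')   -- drop query string and fragment
    if url:match('^https?://') and path:lower():match('%.mp3$') then
      links[#links + 1] = url
    end
  end
  return links
end

local page = [[
<a href="http://www.a.com/b/c/d.MP3">d.mp3</a>
<a href="http://www.a.com/get.mp3?id=1&amp;x=2">MP3</a>
<a href="http://www.a.com/index.html">home</a>
]]
local links = mp3_links(page)
print(#links)   -- prints 2
```

This still guesses at the file type from the extension, so it does not address the MIME type objection; it only narrows what the pattern can pick up from the page.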


