Catching and linkifying arbitrary urls in user input

written on Saturday, March 30, 2013

RogueShell users have asked for a couple of improvements to the public site chat. A few People wanted to be able to use the IRC command "/me" and everyone thought that pasted links should be made clickable.

Adding support for "/me" was pretty simple. But matching urls was trickier. The solutions I found were all relying on their being some kind of url'ish prefix (like "://" or "www") but I wanted to match the most basic url pattern that people might use: example.com or example.co.nz as well. Google does it in gchat after all.

So I approached it from a slightly different position. I decided that a url would be a "." that isn't followed by another period and does not have a space on either side

I decided it would be better to have a few false positives than to limit people to having to use a protocol marker. And I could safely assume that any urls without a protocol should be http. So the regex I came up with was:

re.compile(r"""[^\s]             # not whitespace
               [a-zA-Z0-9:/\-]+  # the protocol and domain name
               \.(?!\.)          # A literal '.' not followed by another
               [\w\-\./\?=&%~#]+ # country and path components
               [^\s]             # not whitespace""", re.VERBOSE)

You can see the regex working in an interactive Python console here

I'm sure there are a few edge cases that I haven't thought of. But as they come up I'll keep this post updated with them.

This entry was tagged python, regex and web