Bite my bytes

What I learn by day I blog at night - A blog from Microsoft Consultant working from Ljubljana, Slovenia

Copyright © by David Vidmar
 

I needed a regular expression that would match HTTP links in HTML documents, and this is what I came up with:

<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>

It matches text that meets all of these conditions:

    • it is an <a …>…</a> tag
    • the tag has an “href” attribute
    • the value of href has matching quotes or no quotes
    • the value of href starts with http:// (my requirement)

The interesting part is (?=\1), which asserts that the next character is the same quotation mark (' or ") that the value started with, without consuming it. This construct is a positive lookahead containing a backreference.
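To make the lookahead-plus-backreference idea concrete, here is a minimal sketch in Python. Python's `re` module is an assumption of this example (the post does not say which engine the pattern was written for), and the names `QUOTED` and `OPTIONAL_QUOTE` are made up for illustration.

```python
import re

# (['"]) captures the opening quote; (?=\1) requires that the very same
# quote character comes next, without consuming it.
QUOTED = re.compile(r"""(['"])(.*?)(?=\1)""")

m = QUOTED.search('''title="it's fine" rest''')
opening_quote = m.group(1)   # the double quote that opened the value
value = m.group(2)           # the apostrophe inside does not end the match

# The same construct with an optional quote, as in the post's pattern.
OPTIONAL_QUOTE = re.compile(r"""href=('|")?(http://.*?(?=\1))""")

quoted_hit = OPTIONAL_QUOTE.search('href="http://example.com/" x')
unquoted_hit = OPTIONAL_QUOTE.search("href=http://example.com/ x")

print(opening_quote, value)
print(quoted_hit.group(2))
print(unquoted_hit)
```

In Python, a backreference to a group that did not take part in the match always fails, so the unquoted form is not matched here. Other engines treat an unset backreference differently, so the "no quotes" case is worth testing in whichever engine you use.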

The other interesting part is \s*([^<]+|.*?)?\s*, which matches the linked text. The text may contain whitespace and new lines, and the \s* on either side is meant to strip leading and trailing whitespace for easier reading.
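To see what the capture groups hold, here is a small Python sketch that runs the pattern above unchanged on one made-up link. The URL, the link text, and the use of Python's `re` module are all assumptions of this example, and other engines may differ in details.

```python
import re

# The pattern from the post, copied as-is.
LINK = re.compile(r"""<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>""")

html = '<a href="http://example.com/page">  Example  </a>'
m = LINK.search(html)

quote = m.group(1)          # the quote character around the href value
url = m.group(2)            # the http:// address
raw_text = m.group(3)       # the link text as captured
link_text = raw_text.strip()

print(repr(quote), url, repr(raw_text), repr(link_text))
```

In this run the leading spaces are dropped, but `[^<]+` is greedy and keeps the trailing spaces in group 3, which is why the sketch calls `strip()` afterwards.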

Posted on Thursday, September 10, 2009 12:20 PM | Filed under: Developement |

Feedback

# re: Matching links with regular expression in HTML 9/10/2009 6:00 PM Mark
What about <a class="href='...

In other words, .* between <a and href is a little tricky.
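A minimal Python sketch of that concern, assuming the pattern from the post and a made-up tag whose class attribute merely contains the text href=. The tag has no real href attribute, yet the pattern still reports a link.

```python
import re

# The pattern from the post, copied as-is.
LINK = re.compile(r"""<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>""")

# No href attribute here: "href=" only appears inside the class value.
html = """<a class="href='http://not-a-link.example/'">text</a>"""

m = LINK.search(html)
false_url = m.group(2) if m else None
print(false_url)
```

The `.*` between `<a` and `href=` accepts any characters, including the opening quote of another attribute, so the engine cannot tell an attribute name from attribute content.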

# re: Matching links with regular expression in HTML 9/10/2009 6:16 PM Mark
A little Googling revealed this:

Richard Ponton said something along the lines of

"HTML is a non-regular language.

Like balanced parenthesis, matching balanced HTML tags is impossible to do with Regular Expressions (in CS terms, an NFA or DFA) alone. Impossible as in "halting problem" impossible.

You can write complex RegExes that will work on the sample documents you have, but you can never find a correct RegEx that will work on all valid HTML, let alone the invalid HTML.

What you *can* do is use regular expressions to match individual open and close tags and use a stack to keep track of your depth. In CS terms, this would be a Push-down Automata, or PDA."
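The tokenise-then-stack idea from the quote can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the names `TAG` and `is_balanced` are hypothetical, every tag that is not written as self-closing (`<br/>`) is expected to have a closing tag, and comments, scripts, and malformed markup are ignored.

```python
import re

# One regex finds individual tags: optional "/", a tag name, anything up to
# ">", and an optional "/" just before ">" for self-closing tags.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")

def is_balanced(html):
    """Return True if every opening tag is closed in the right order."""
    stack = []  # names of the tags that are currently open
    for m in TAG.finditer(html):
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if self_closing:
            continue                      # <br/> opens and closes at once
        if closing:
            if not stack or stack.pop() != name:
                return False              # close without a matching open
        else:
            stack.append(name)            # one level deeper
    return not stack                      # leftovers mean unclosed tags

print(is_balanced("<div><p>hi</p></div>"))
print(is_balanced("<div><p>hi</div></p>"))
```

The regex only recognises single tags; the nesting depth lives in the stack, which is the part a regular expression alone cannot keep track of.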

And Wes Haggard suggested this regex (for matching _balanced_ a tags):

(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)

# re: Matching links with regular expression in HTML 9/10/2009 8:20 PM David
Sure, it's not 100% complete. HTML is tricky. If it weren't, we wouldn't have had browser wars.

I need a "good enough" solution and my regex is just that.

Thanks for the comments, though.

# re: Matching links with regular expression in HTML 10/6/2009 1:50 PM Zaki Sahheen
I did a small project for this type of thing. Basically I needed an HTML parser that could also parse dirty HTML (unclosed tags, attributes without proper quotation marks, dirty attributes, etc.).

These regexes extract the name/value pairs of attributes once you have found a tag.
Regex rxTagAttributeUnclosed = new Regex("(?<name>([a-zA-Z0-9\\-]+))(\\s)*=(?<value>([a-zA-Z0-9_&%=#\\-\\.,:\\?;\\(\\)/\\\\])+)");
Regex rxTagAttributeClosed = new Regex("(?<name>([a-zA-Z0-9\\-]+))(\\s)*=(\"|')(?<value>([a-zA-Z0-9_&%=#\\-\\.\\s,:\\?;\\(\\)/\\\\])+)(\"|')");

Finding a tag is fairly simple.
Regex rxStartTagName = new Regex("<(?<tagname>\\w+)");
Regex rxEndTagName = new Regex("</\\s*(?<tagname>\\w+)\\s*>");

They are not the most performant regexes :) but they helped me earn a few bucks from a client for this project in my uni days :)
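For readers who want to try the attribute-extraction step, here is a rough Python sketch. It is not a translation of the regexes above: it uses one simplified pattern that accepts double-quoted, single-quoted, and unquoted values, and the names `ATTR` and `parse_attributes` are hypothetical.

```python
import re

# name = "value" | 'value' | bare-value  (whitespace in the pattern is
# ignored because of re.VERBOSE)
ATTR = re.compile(
    r"""(?P<name>[a-zA-Z0-9\-]+)\s*=\s*
        (?:"(?P<dq>[^"]*)"|'(?P<sq>[^']*)'|(?P<bare>[^\s"'>]+))""",
    re.VERBOSE,
)

def parse_attributes(tag_text):
    """Return a dict of attribute name -> value for one opening tag."""
    attrs = {}
    for m in ATTR.finditer(tag_text):
        # Exactly one of the three value groups took part in the match.
        value = next(v for v in m.group("dq", "sq", "bare") if v is not None)
        attrs[m.group("name").lower()] = value
    return attrs

print(parse_attributes("""<a href="http://x.example/" class='big' id=main>"""))
```

Running the attribute pattern only on text that has already been identified as a tag keeps it from picking up name=value lookalikes in ordinary page text.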


# re: Matching links with regular expression in HTML 10/13/2009 8:31 AM Andrew
Regex quantifiers are usually greedy unless told otherwise, in which case the original regex is flawed. For instance:
http\://.*
matches 'http://', then everything (all the HTML) until it finds the last possible end of a tag. So if you have two or more tags in an HTML document, this regex will not work if it is greedy. So make it ungreedy with a flag. :)
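The difference between greedy and lazy matching is easy to see on a line with two links. This is a minimal Python sketch with made-up URLs and deliberately tiny patterns, not the regex from the post.

```python
import re

html = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'

# Greedy: .* runs to the last double quote it can still match against.
greedy = re.findall(r'href="(http://.*)"', html)

# Lazy: .*? stops at the first double quote after each URL.
lazy = re.findall(r'href="(http://.*?)"', html)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy version returns a single capture that swallows the first link's text and the start of the second tag, while the lazy version returns the two URLs separately.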

"HTML is a non-regular language."
It might be, but a regex will still match all valid tags. Of course, you have to take into account that your regex can be tricked into accepting hrefs that are not inside valid tags. But this might be acceptable, depending on usage.

# re: Matching links with regular expression in HTML 1/6/2010 12:26 AM Triston J. Taylor
K.I.S.S. (Keep it simple stupid)

Why validate the html? Title says match links.

Here ya go: href=["'](https?://[^><\s"']+)

Find a particular extension: href=["'](https?://[^><\s"']+\.htm)

In the above snippet replace htm with desired extension.
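A quick way to try this href-only approach is a few lines of Python. The sketch below is an illustration under assumptions, not the commenter's code: it spells the scheme as `https?://` (literal `http` with an optional `s`), and the names `find_links` and `ext` are made up for the example.

```python
import re

def find_links(html, ext=None):
    """Return absolute http(s) URLs found in href attributes.

    If ext is given (for example "htm"), only URLs containing that
    extension are returned, cut off right after it.
    """
    suffix = re.escape("." + ext) if ext else ""
    pattern = r"""href=["'](https?://[^><\s"']+""" + suffix + ")"
    return re.findall(pattern, html)

sample = (
    '<a href="http://a.example/index.htm">A</a> '
    "<a href='https://b.example/logo.png'>B</a> "
    '<a href="/local.htm">C</a>'
)

print(find_links(sample))
print(find_links(sample, "htm"))
```

Because nothing anchors the end of the URL, asking for "htm" also matches the first part of a ".html" address; add a check for the closing quote if that matters.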

# re: Matching links with regular expression in HTML 1/6/2010 11:22 PM Triston J. Taylor
Here's another, matching four groups of info.

(<a\s[\W\w]*?href=["|']([\W\w]*?)["|'|\s|>][\W\w]*?["|/]>)([\W\w]*?)()

You should note that Script urls won't be caught by this 100% of the time because of quirks like this:

javascript:Proc(this, that) has a space in it.

I'm not sure, but I don't think a URL should include a space, even if it is quoted (scripts and some servers/clients being the exception). It should be a '+' or a '%20' escape sequence.

I am sure the above quirk can be fixed with some Conditional matching, however my project does not intend to incorporate inline script URIs or malformed URLs at this time. However, you can also just write another match pattern to extract the script URIs from match group one. Doing this would increase the speed of processing tremendously.

Match Group One = everything between '<a ' and '>'
Match Group Two = Target URL
Match Group Three = Visible Link Content (text & html)
Match Group Four = anchor end tag '</a>'

If you think this is the best solution, please give me the thumbs up!

# re: Matching links with regular expression in HTML 1/6/2010 11:25 PM Triston J. Taylor
Sorry the above posting has a typo in it.

Correct code:

(<a\s[\W\w]*?href=["|']([\W\w]*?)["|'|\s|>][\W\w]*?["|/]>)([\W\w]*?)(</a>)
