pr0g33k

Using PATINDEX to parse HTML with a Regular Expression

When I created the RSS feed for this blog, I didn't want to return each entire blog post but, rather, just the first paragraph. I tend to get a little wordy (it's a personal flaw; I'm working on it) and I wanted to display as many entries in the RSS as I could without it taking too long to download to the client. I first thought I would query the blog posts from the database and use a regular expression to extract the first paragraph since SQL doesn't have native support for regular expressions. (There's the SQL Server CLR route, sure, but since I'm using SQL Azure, I don't have that option.) Then I remembered PATINDEX. According to the online documentation, PATINDEX returns "the starting position of the first occurrence of a pattern in a specified expression, or zeros if the pattern is not found, on all valid text and character data types." Perfect! The pattern language is pretty limited but so is the HTML I'm wanting to match. Here's a little test script I threw together to test it out:

Declare @Html table
(
	Markup varchar(1000)
)

Insert @Html (Markup) Select '<p>This is what I want.</p><p>I want to ignore this.</p>'
Insert @Html (Markup) Select '<div>I do not want this.</div><pre>Some preformatted text I do not want.</pre><p class="my-css">This is what I want.</p>'
Insert @Html (Markup) Select '<div>I do not want this.</div><span>I do not want this, either</span>'
Insert @Html (Markup) Select '<param>A param I do not want.</param><progress>I do not want progress (well, not this progress)</progress><p>This is what I want!</p>'

Declare @Begin varchar(50) = '%<p[^re,aram,rogress]%'
Declare @End varchar(50) = '%</p>%'

Select PATINDEX(@Begin, Markup) As FirstMarker
	, Case When PATINDEX(@End, Markup) > 0 Then PATINDEX(@End, Markup) + 3 Else 0 End As SecondMarker
	, SUBSTRING(Markup, PATINDEX(@Begin, Markup), Case When PATINDEX(@End, Markup) > 0 Then PATINDEX(@End, Markup) + 3 Else 0 End) As Body
From @Html
    

The @Begin variable holds a pattern that matches all HTML tags starting with "p" but ignores <pre>, <param>, and <progress> which just leaves <p>. I left the begin tag "open" (notice there's no closing ">") so that I could capture any attributes on the paragraph tag. I'm only interested in a part of the string that begins with "<p" and ends with "</p>". Using SUBSTRING, I can extract the first paragraph of the HTML. Now I just have to remember to put a summary in the first paragraph of each blog post...

Posted on 4/29/2013 at 04:04 PM
Tags: SQL

Comments:

Leave a comment
  1. CAPTCHA