<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Chris Jean &#187; regular expressions</title>
	<atom:link href="http://chrisjean.com/tag/regular-expressions/feed/" rel="self" type="application/rss+xml" />
	<link>http://chrisjean.com</link>
	<description>Linux, WordPress, programming, anime, and other stuff</description>
	<lastBuildDate>Mon, 16 Jan 2012 15:22:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Unicode Support on CentOS 5.2 with PHP and PCRE</title>
		<link>http://chrisjean.com/2009/01/31/unicode-support-on-centos-52-with-php-and-pcre/</link>
		<comments>http://chrisjean.com/2009/01/31/unicode-support-on-centos-52-with-php-and-pcre/#comments</comments>
		<pubDate>Sat, 31 Jan 2009 06:00:32 +0000</pubDate>
		<dc:creator>Chris Jean</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Tips 'n Tricks]]></category>
		<category><![CDATA[CentOS]]></category>
		<category><![CDATA[PCRE]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[regular expressions]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://gaarai.com/?p=820</guid>
		<description><![CDATA[Yesterday, I talked about how to get the most out of running regular expressions in PHP. The reason that I needed to dig in deep on regular expression syntax with PHP is because I needed to write some regular expressions that deal with Unicode characters. After much reading, I believed that I knew everything that [...]]]></description>
			<content:encoded><![CDATA[
<p><a href="http://chrisjean.com/2009/01/30/php-regular-expression-syntax-references/" target="_blank">Yesterday</a>, I talked about how to get the most out of running regular expressions in PHP. I needed to dig deep into PHP&#8217;s regular expression syntax because I had to write some regular expressions that deal with Unicode characters.</p>
<p>After much reading, I believed that I knew everything that I needed. I started writing some regex strings and testing the code. Unfortunately, every time I ran a test with a string that contained Unicode characters, the match failed. When I removed the Unicode characters from the string and tested again, it would work. I was baffled.</p>
<p><span id="more-820"></span></p>
<h3>Finding the Problem</h3>
<p>I had the Unicode escape sequences (&#8216;\X&#8217;, &#8216;\pL&#8217;, etc.) inside of a character class, such as &#8216;[\X-]&#8217;, since I was creating a regex to test for domains. To narrow things down, I wrote a very simple pattern, &#8216;/^\X$/&#8217;, and tested it against a single Unicode character. Amazingly, having the &#8216;\X&#8217; outside of the square brackets changed everything, as I now received the following very concerning warning:</p>
<div class="code">PHP Warning:  preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled at offset 2 in wp-content/plugins/dnsyogi/testunicode.php on line 4</div>
<p>Since PHP uses the PCRE engine to run regular expressions, I started to dig into it. I found out that I could query PCRE directly, and doing so produced a very similar error:</p>
<div class="code">$ pcregrep '/\X*/u' character.txt<br />
pcregrep: Error in command-line regex at offset 2: support for \P, \p, and \X has not been compiled</div>
<p>It looked like the error was coming from PCRE itself. I searched around for a while thinking that I could simply install a new package using yum. I hoped to find something like pcre-utf8, pcre-unicode, php-pcre-unicode, or something to make it simple and quick to add this support, since I much prefer using package management tools to compiling and installing from source.</p>
<p>Unfortunately, no such package exists. This support must be enabled when PCRE is compiled, and my CentOS repository only has packages that were built without it. After much digging around, I found that this isn&#8217;t necessarily CentOS&#8217;s fault, as this package has carried over from the RHEL (Red Hat Enterprise Linux) side of things.</p>
<p>A great way of checking to see if this is an issue on your system is by running the following:</p>
<pre style="padding-left: 30px;">$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  No Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack</pre>
<p>This is the output that I received. Notice the &#8220;UTF-8 support&#8221; and the &#8220;No Unicode properties support&#8221; lines. This means that PCRE was compiled with the &#8220;--enable-utf8&#8221; configure option, which allows PCRE to recognize and work with UTF-8 encoded strings. However, it wasn&#8217;t compiled with the &#8220;--enable-unicode-properties&#8221; configure option, which works in conjunction with the --enable-utf8 option to add support for the &#8216;\p&#8217;, &#8216;\P&#8217;, and &#8216;\X&#8217; escape sequences.</p>
<p>This seems to have been an oversight when the rpm file was first put together. Fortunately, there is a way to fix it.</p>
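<p>If you would rather script the pcretest check than read the output by eye, something like the following shell function should do it. Treat it as a sketch: has_unicode_properties is a name made up for illustration, and it assumes that the output uses the same wording as the listing above.</p>

```shell
#!/bin/sh
# has_unicode_properties (hypothetical helper): reads `pcretest -C`
# output on stdin and succeeds only if Unicode properties support
# was compiled in.
has_unicode_properties() {
    # Trim surrounding whitespace, then require an exact whole-line
    # match so that "No Unicode properties support" does not count.
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -qx 'Unicode properties support'
}

# Usage (assumes pcretest is installed and on the PATH):
#   if pcretest -C | has_unicode_properties; then
#       echo "Unicode properties are supported"
#   fi
```

<p>The whole-line match is the important part, since a plain substring search would also match the &#8220;No Unicode properties support&#8221; line.</p>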
<h3>Fixing the Problem</h3>
<p>Since I&#8217;m sure that many of you are like me and would rather not manually compile and install software outside of the package management system, the solution is to rebuild the rpm with the option that it needs and then install it.</p>
<p>I had never done this before. Fortunately, I found a very helpful guide that details this process very nicely: <a href="http://bradthemad.org/tech/notes/patching_rpms.php" target="_blank">How to patch and rebuild an RPM package</a>.</p>
<p>I have provided the new rpm file that I have built at the bottom of this post. If you don&#8217;t care about all this jibber-jabber, you can skip down there and grab the file. However, if you would like to learn how to address this issue yourself or have a system that my file will not support, please read on to see how I rebuilt the rpm with the new option.</p>
<h3>Rebuilding the rpm</h3>
<ol>
<li>The first thing I did was set up my ~/.rpmmacros file and src/rpm folder structure as detailed in the Setup section of <a href="http://bradthemad.org/tech/notes/patching_rpms.php" target="_blank">the guide</a> that I followed. I&#8217;ll simply refer you over there as it doesn&#8217;t need repeating here.</li>
<li>I needed to grab the source rpm for the current version of PCRE on my platform. I&#8217;m on CentOS 5.2 with version 6.6 of PCRE. I found the matching source rpm file (pcre-6.6-2.el5_1.7.src.rpm) <a href="http://mirrors.kernel.org/centos/5.2/os/SRPMS/pcre-6.6-2.el5_1.7.src.rpm" target="_blank">here</a>.</li>
<li>I then installed the source rpm in order to gain access to its files:
<pre style="padding-left: 30px;">$ rpm -ivh pcre-6.6-2.el5_1.7.src.rpm</pre>
<p>This put the necessary files into my ~/src/rpm/SOURCES and ~/src/rpm/SPECS folders.</p></li>
<li>I opened up the ~/src/rpm/SPECS/pcre.spec file and found the following line:
<pre style="padding-left: 30px;">%configure --enable-utf8</pre>
<p>I changed it to include the Unicode properties option:</p>
<pre style="padding-left: 30px;">%configure --enable-utf8 --enable-unicode-properties</pre>
<p>I then saved and closed the file.</p></li>
<li>This was the only change that I needed to make, so it was time to build the new rpm file. I ran the following to build it:
<pre style="padding-left: 30px;">$ rpmbuild -ba ~/src/rpm/SPECS/pcre.spec</pre>
<p>Toward the end of the large amount of output, I received the following:</p>
<pre style="padding-left: 30px;">Wrote: ~/src/rpm/SRPMS/pcre-6.6-2.7.src.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-devel-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-debuginfo-6.6-2.7.x86_64.rpm</pre>
<p>This tells me exactly where I can find my new source rpm and rpm files.</p></li>
</ol>
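<p>If you need to repeat the spec file edit from step 4 on more than one machine, it could be scripted along these lines. This is only a sketch: add_unicode_properties is a name made up for illustration, it assumes GNU sed (for the -i option), and it assumes that the spec file contains a line starting with &#8220;%configure --enable-utf8&#8221; as mine did.</p>

```shell
#!/bin/sh
# add_unicode_properties (hypothetical helper): adds the
# --enable-unicode-properties option to the %configure line of a
# pcre.spec file, editing the file in place.
add_unicode_properties() {
    spec_file=$1
    # Lines that already carry the option are skipped, so running
    # the function twice does not add the option twice.
    sed -i '/--enable-unicode-properties/!s/^%configure --enable-utf8/%configure --enable-utf8 --enable-unicode-properties/' "$spec_file"
}

# Usage, following the paths used in this post:
#   add_unicode_properties ~/src/rpm/SPECS/pcre.spec
#   rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
```

<p>It is still worth opening the spec file afterwards to confirm that the %configure line reads the way you expect before running rpmbuild.</p>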
<h3>Updated rpm File for CentOS 5.2 64-bit</h3>
<p>If you are running a 64-bit version of CentOS 5.2, the following file should work for you. If you have a different architecture, Linux distro, or encounter any errors when trying to install this file, then you should follow the instructions above to build an rpm that is suitable for your distribution.</p>
<p><a href="http://chrisjean.com/wp-content/uploads/2009/01/pcre-66-27x86_64.rpm" target="_blank">pcre-6.6-2.7.x86_64.rpm</a> &#8211; PCRE 6.6 for CentOS 5.2 64-bit</p>
<div class="post-notice" style="margin-bottom:10px;">Thanks <a href="http://chrisjean.com/2009/01/31/unicode-support-on-centos-52-with-php-and-pcre/comment-page-1/#comment-1144">Robin</a> for providing a 32-bit version: <a href="http://chrisjean.com/wp-content/uploads/2009/01/pcre-6.6-2.7.i386.rpm">pcre-6.6-2.7.i386.rpm</a></div>
<h3>Installing New rpm</h3>
<p>Now that I have my new rpm file, I just need to install it. Since I already have a pcre package installed, I need to tell the rpm command to update rather than install. The following command does this for me:</p>
<pre style="padding-left: 30px;"># rpm -Uvh ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm</pre>
<p>Notice that I need to be root to run this command.</p>
<p>Finally, to verify that everything worked, I ran the pcretest program again:</p>
<pre style="padding-left: 30px;">$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack</pre>
<p>Looks good.</p>
<p>Finally, time to move on with life.</p>
]]></content:encoded>
			<wfw:commentRss>http://chrisjean.com/2009/01/31/unicode-support-on-centos-52-with-php-and-pcre/feed/</wfw:commentRss>
		<slash:comments>63</slash:comments>
		</item>
		<item>
		<title>PHP Regular Expression Syntax References</title>
		<link>http://chrisjean.com/2009/01/30/php-regular-expression-syntax-references/</link>
		<comments>http://chrisjean.com/2009/01/30/php-regular-expression-syntax-references/#comments</comments>
		<pubDate>Fri, 30 Jan 2009 16:47:09 +0000</pubDate>
		<dc:creator>Chris Jean</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Tips 'n Tricks]]></category>
		<category><![CDATA[WordPress]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://gaarai.com/?p=812</guid>
		<description><![CDATA[Since beginning work on my DNS Yogi site, I&#8217;ve had to write numerous regular expressions to match all sorts of string bits. I quickly ran into problems when I realized that I needed to add support for Unicode characters since certain TLD registrars support registrations with non-Latin characters. The main issue is that there are [...]]]></description>
			<content:encoded><![CDATA[
<p>Since beginning work on my <a href="http://dnsyogi.com/">DNS Yogi</a> site, I&#8217;ve had to write numerous <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">regular expressions</a> to match all sorts of string bits. I quickly ran into problems when I realized that I needed to add support for <a href="http://en.wikipedia.org/wiki/Unicode" target="_blank">Unicode</a> characters since certain <a href="http://en.wikipedia.org/wiki/Top-level_domain" target="_blank">TLD</a> registrars support registrations with non-Latin characters.</p>
<p>The main issue is that there are multiple regular expression engines. <a href="http://php.net/" target="_blank">PHP</a> uses a flavor of the <a href="http://www.pcre.org/" target="_blank">PCRE</a> (Perl Compatible Regular Expressions) engine. Each engine, and each variant of an engine, handles regular expression syntax slightly differently. I needed to find out exactly how the PHP regular expression engine worked, and finding that information was not easy.</p>
<p><span id="more-812"></span></p>
<p>I&#8217;d have to say that there isn&#8217;t a single resource that will provide everything needed, and I certainly don&#8217;t believe that I can produce and maintain a better one. So, this post will act as a compilation of resources that together provide a robust overview of how PHP handles regular expressions.</p>
<ul>
<li>Your first stop should be the <a href="http://www.php.net/manual/" target="_blank">PHP Manual</a>&#8217;s page on the <a href="http://www.php.net/manual/en/function.preg-match.php" target="_blank">preg_match</a> function. This page will get you started on how to run regular expressions using PHP. In addition, you should look at the <a href="http://www.php.net/manual/en/function.preg-match-all.php" target="_blank">preg_match_all</a>, <a href="http://www.php.net/manual/en/function.preg-replace.php" target="_blank">preg_replace</a>, and <a href="http://www.php.net/manual/en/function.preg-split.php" target="_blank">preg_split</a> function references so you get a good overview of what each function can do.</li>
<li>Next, stop by <a href="http://www.regular-expressions.info/" target="_blank">Regular-Expressions.info</a> and read through their page on <a href="http://www.regular-expressions.info/php.html" target="_blank">PHP</a> to get an outsider view of how PHP handles regular expressions. If you aren&#8217;t familiar with regular expressions, you will gain an amazing amount of knowledge about how to build and use them by reading through their <a href="http://www.regular-expressions.info/tutorialcnt.html" target="_blank">Regex Tutorial</a>.</li>
<li>Now it&#8217;s time to really dig in and get dirty on the internals of the PCRE engine that PHP uses. The <a href="http://www.regextester.com/pregsyntax.html" target="_blank">PCRE Regular Expression Pattern Syntax Reference (PHP preg*)</a> document is an extremely in-depth reference that details the finer points of how the regular expression engine in PHP really works. If you want to know how to build advanced regex patterns in PHP, this is the document for you.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://chrisjean.com/2009/01/30/php-regular-expression-syntax-references/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

