Yesterday, I talked about how to get the most out of running regular expressions in PHP. I had to dig deep into PHP’s regular expression syntax because I needed to write some regular expressions that deal with Unicode characters.
After much reading, I believed that I knew everything that I needed. I started writing some regex strings and testing the code. Unfortunately, every time I ran a test with a string that contained Unicode characters, the match failed. When I removed the Unicode characters from the string and tested again, it would work. I was baffled.
Finding the Problem
I had the regex testing characters (\pL, etc.) inside of a character class, such as [\X-], since I was creating a regex to test for domains. I wrote a really simple rule by simply looking for /^\X$/ and testing the regex with a single Unicode character. Amazingly, having the \X outside of the square brackets changed everything as I now received the following very concerning warning:
PHP Warning: preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled at offset 2 in wp-content/plugins/dnsyogi/testunicode.php on line 4
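A rough sketch of how this probe could be scripted from the command line follows. The helper name classify_probe, the php -r one-liner, the /u modifier, and the display_errors override are all assumptions made for illustration; none of them come from the original test file. The helper reads any command output on stdin and reports whether the PCRE error quoted above is present.

```shell
# classify_probe (hypothetical helper): read command output on stdin and
# print "missing" if it contains the PCRE error quoted above, else "ok".
classify_probe() {
    # -F treats the pattern as a fixed string, so the backslashes are literal.
    if grep -qF 'support for \P, \p, and \X has not been compiled'; then
        echo "missing"
    else
        echo "ok"
    fi
}

# Only attempt the probe when a php binary is available.
if command -v php >/dev/null 2>&1; then
    # "\xC3\xA9" is the UTF-8 encoding of a single accented character.
    # display_errors=1 is an assumption so the warning is actually printed.
    php -d display_errors=1 -r 'preg_match("/^\X$/u", "\xC3\xA9");' 2>&1 | classify_probe
fi
```

The same helper can be fed the output of any other tool that reports this error, since it only looks for the message text.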
Since PHP uses the PCRE engine to run regular expressions, I started to dig into it. I found out that I could query PCRE directly. I ended up with something very similar:
[chris@home ~]$ pcregrep '/\X*/u' character.txt
pcregrep: Error in command-line regex at offset 2: support for \P, \p, and \X has not been compiled
It looked like the error was coming from PCRE itself. I searched around for a while thinking that I could simply install a new package using yum. I hoped to find something like php-pcre-unicode, or something to make it simple and quick to add this support since I much prefer using package management tools rather than compiling and installing from source.
Unfortunately, no such package exists. This support is something that must be an option that PCRE is compiled with, and my CentOS repository only has packages that don’t include that support. After much digging around, I found that this isn’t necessarily CentOS’s fault as this package has carried over from the RHEL (Red Hat Enterprise Linux) side of things.
A great way of checking to see if this is an issue on your system is by running the following:
[chris@home ~]$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  No Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
This is the output that I received. Notice the “UTF-8 support” and the “No Unicode properties support” lines. This means that PCRE was compiled with the --enable-utf8 configure option, which allows PCRE to recognize and work with UTF-8 encoded strings. However, it wasn’t compiled with the --enable-unicode-properties configure option, which works in conjunction with the --enable-utf8 option to add support for the \p, \P, and \X escape sequences.
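The check described above can be sketched as a small shell function. This is only an illustration: the name has_unicode_properties is made up, and the pattern assumes the wording that pcretest -C printed for PCRE 6.6 in the output shown here; other versions may phrase it differently.

```shell
# has_unicode_properties (hypothetical helper): read `pcretest -C` output on
# stdin; succeed only if Unicode properties support is compiled in.
has_unicode_properties() {
    # A build without support prints "No Unicode properties support", so the
    # pattern requires the line to begin (after any indentation) with "Unicode".
    grep -Eq '^[[:space:]]*Unicode properties support'
}

# Only run the live check when pcretest is installed.
if command -v pcretest >/dev/null 2>&1; then
    if pcretest -C | has_unicode_properties; then
        echo "Unicode properties: supported"
    else
        echo "Unicode properties: NOT supported"
    fi
fi
```

Anchoring the pattern at the start of the line is what keeps the “No Unicode properties support” line from being mistaken for a positive result.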
This seems to have been an oversight when the rpm file was first put together. Fortunately, there is a way to fix it.
Fixing the Problem
Since I’m sure that many of you are like me and would rather not manually compile and install software outside of the package management system, the solution is to update the rpm to have the option that it needs and install it.
I had never done this before. Fortunately, I found a very helpful guide that lays this process out very nicely: How to patch and rebuild an RPM package.
I have provided the new rpm file that I have built at the bottom of this post. If you don’t care about all this jibber-jabber, you can skip down there and grab the file. However, if you would like to learn how to address this issue yourself or have a system that my file will not support, please read on to see how I rebuilt the rpm with the new option.
Rebuilding the rpm
- The first thing I did was set up my src/rpm folder structure as detailed in the Setup section of the guide that I’m following. I’ll simply refer you over there as it doesn’t need repeating here.
- I needed to grab the source rpm for the current version of PCRE on my platform. I’m on CentOS 5.2 with version 6.6 of PCRE. I found the matching source rpm file (
- I then installed the source rpm in order to gain access to its files:
[chris@home ~]$ rpm -ivh pcre-6.6-2.el5_1.7.src.rpm
This put the necessary files into my
- I opened up the ~/src/rpm/SPECS/pcre.spec file and found the following line:
I changed it to include the Unicode properties option:
%configure --enable-utf8 --enable-unicode-properties
I then saved and closed the file.
- This is the only change that I needed to make. So, now it is time to build the new rpm file. I simply ran the following to build it:
[chris@home ~]$ rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
Toward the end of the large amount of output, I received the following:
Wrote: ~/src/rpm/SRPMS/pcre-6.6-2.7.src.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-devel-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-debuginfo-6.6-2.7.x86_64.rpm
This tells me exactly where I can find my new source rpm and rpm files.
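If you would rather script the spec file edit than make it by hand, something like the following sketch might work. The function name enable_unicode_properties is hypothetical, and it assumes the %configure call sits on a single line with no trailing backslash. The demo input line is also an assumption used only to show the effect; it runs against a throwaway file, not the real pcre.spec.

```shell
# enable_unicode_properties (hypothetical helper): append the
# --enable-unicode-properties flag to the %configure line of a spec file.
enable_unicode_properties() {
    spec=$1
    # Do nothing if the flag is already present, so reruns are harmless.
    if grep -q -e '--enable-unicode-properties' "$spec"; then
        return 0
    fi
    # Assumes %configure sits on one line with no trailing backslash.
    sed 's/^%configure.*/& --enable-unicode-properties/' "$spec" > "$spec.tmp" &&
        mv "$spec.tmp" "$spec"
}

# Example run against a throwaway file (assumed input, not the real spec).
demo=$(mktemp)
printf '%%configure --enable-utf8\n' > "$demo"
enable_unicode_properties "$demo"
cat "$demo"
rm -f "$demo"
```

Checking for the flag first keeps the edit idempotent, which matters if the rebuild gets repeated after a failed rpmbuild run.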
Updated rpm File for CentOS 5.2 64-bit
If you are running a 64-bit version of CentOS 5.2, the following file should work for you. If you have a different architecture, Linux distro, or encounter any errors when trying to install this file, then you should follow the instructions above to build an rpm that is suitable for your distribution.
pcre-6.6-2.7.x86_64.rpm – PCRE 6.6 for CentOS 5.2 64-bit
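A quick way to see whether the prebuilt file is likely to fit a given machine could look like the sketch below. The function name rpm_matches_platform is made up for illustration, and it assumes that /etc/redhat-release on CentOS 5.2 contains the text “CentOS release 5.2”; treat a mismatch as a hint to build your own rpm rather than as a definitive answer.

```shell
# rpm_matches_platform (hypothetical helper): succeed only for the platform
# the prebuilt rpm targets.
# $1 = machine architecture, $2 = contents of the release file.
rpm_matches_platform() {
    [ "$1" = "x86_64" ] || return 1
    case $2 in
        *"CentOS release 5.2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the release file if it exists (its location is an assumption).
release=""
if [ -r /etc/redhat-release ]; then
    release=$(cat /etc/redhat-release)
fi

if rpm_matches_platform "$(uname -m)" "$release"; then
    echo "prebuilt rpm should fit this system"
else
    echo "build your own rpm using the steps above"
fi
```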
Installing New rpm
Now that I have my new rpm file, I just need to install it. Since I already have a pcre package installed, I need to tell the rpm command to update rather than install. The following command does this for me:
[root@home ~]# rpm -Uvh ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Notice that I need to be root to run this command.
Finally, to verify that everything worked, I ran the pcretest program again:
[chris@home ~]$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
Finally, time to move on with life.