Yesterday, I talked about how to get the most out of running regular expressions in PHP. I needed to dig deep into regular expression syntax in PHP because I had to write some regular expressions that deal with Unicode characters.
After much reading, I believed that I knew everything that I needed. I started writing some regex strings and testing the code. Unfortunately, every time I ran a test with a string that contained Unicode characters, the match failed. When I removed the Unicode characters from the string and tested again, it would work. I was baffled.
Finding the Problem
I had the regex testing characters (\X, \pL, etc.) inside of a character class, such as [\X-], since I was creating a regex to test for domains. I wrote a really simple rule by simply looking for /^\X$/ and testing the regex with a single Unicode character. Amazingly, having the \X outside of the square brackets changed everything as I now received the following very concerning warning:
PHP Warning: preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled at offset 2 in wp-content/plugins/dnsyogi/testunicode.php on line 4
Since PHP uses the PCRE engine to run regular expressions, I started to dig into it. I found out that I could query PCRE directly. I ended up with something very similar:
[chris@home ~]$ pcregrep '/\X*/u' character.txt
pcregrep: Error in command-line regex at offset 2: support for \P, \p, and \X has not been compiled
It looked like the error was coming from PCRE itself. I searched around for a while thinking that I could simply install a new package using yum. I hoped to find something like pcre-utf8, pcre-unicode, php-pcre-unicode, or something to make it simple and quick to add this support since I much prefer using package management tools rather than compiling and installing from source.
Unfortunately, no such package exists. This support is something that must be an option that PCRE is compiled with, and my CentOS repository only has packages that don’t include that support. After much digging around, I found that this isn’t necessarily CentOS’s fault as this package has carried over from the RHEL (Red Hat Enterprise Linux) side of things.
A great way of checking to see if this is an issue on your system is by running the following:
[chris@home ~]$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  No Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
This is the output that I received. Notice the “UTF-8 support” and the “No Unicode properties support” lines. This means that PCRE was compiled with the --enable-utf8 configure option, which allows PCRE to recognize and work with UTF-8 encoded strings. However, it wasn’t compiled with the --enable-unicode-properties configure option, which works in conjunction with the --enable-utf8 option to add support for the \p, \P, and \X character classes.
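If you want to script this check rather than read the output by eye, something like the following could work. It is only a sketch: the function name check_pcre_unicode is made up for this example, and it assumes the build info uses the same wording as the output shown above.

```shell
#!/bin/sh
# Classify PCRE build-info text (the format printed by `pcretest -C`).
# Reads the text on stdin and prints one of: supported, missing, unknown.
# check_pcre_unicode is a hypothetical name, not part of PCRE.
check_pcre_unicode() {
    info=$(cat)
    case "$info" in
        # Test the negative wording first, since it contains the positive one.
        *"No Unicode properties support"*) echo "missing" ;;
        *"Unicode properties support"*)    echo "supported" ;;
        *)                                 echo "unknown" ;;
    esac
}

# Usage, assuming pcretest is on the PATH:
#   pcretest -C | check_pcre_unicode
```

The "No Unicode properties support" pattern is tested first because the positive wording is a substring of it.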
This seems to have been an oversight when the rpm file was first put together. Fortunately, there is a way to fix it.
Fixing the Problem
Since I’m sure that many of you are like me and would rather not manually compile and install software outside of the package management system, the solution is to update the rpm to have the option that it needs and install it.
I had never done this before. Fortunately, I found a very helpful guide that lays this process out very nicely: How to patch and rebuild an RPM package.
I have provided the new rpm file that I have built at the bottom of this post. If you don’t care about all this jibber-jabber, you can skip down there and grab the file. However, if you would like to learn how to address this issue yourself or have a system that my file will not support, please read on to see how I rebuilt the rpm with the new option.
Rebuilding the rpm
- The first thing I did was set up my ~/.rpmmacros file and src/rpm folder structure as detailed in the Setup section of the guide that I’m following. I’ll simply refer you over there as it doesn’t need repeating here.
- I needed to grab the source rpm for the current version of PCRE on my platform. I’m on CentOS 5.2 with version 6.6 of PCRE. I found the matching source rpm file (pcre-6.6-2.el5_1.7.src.rpm) here.
- I then installed the source rpm in order to gain access to its files:
  [chris@home ~]$ rpm -ivh pcre-6.6-2.el5_1.7.src.rpm
  This put the necessary files into my ~/src/rpm/SOURCES and ~/src/rpm/SPECS folders.
- I opened up the ~/src/rpm/SPECS/pcre.spec file and found the following line:
  %configure --enable-utf8
  I changed it to include the Unicode properties option:
  %configure --enable-utf8 --enable-unicode-properties
  I then saved and closed the file.
- This is the only change that I needed to make. So, now it is time to build the new rpm file. I simply ran the following to build it:
  [chris@home ~]$ rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
  Toward the end of the large amount of output, I received the following:
  Wrote: ~/src/rpm/SRPMS/pcre-6.6-2.7.src.rpm
  Wrote: ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
  Wrote: ~/src/rpm/RPMS/x86_64/pcre-devel-6.6-2.7.x86_64.rpm
  Wrote: ~/src/rpm/RPMS/x86_64/pcre-debuginfo-6.6-2.7.x86_64.rpm
  This tells me exactly where I can find my new source rpm and rpm files.
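The spec file edit in the steps above can also be scripted, which may help if you need to repeat the rebuild on several machines. This is a rough sketch, not what I actually ran: the function name enable_unicode_properties is made up, and it assumes the spec has a single-line %configure entry like the one quoted above (no backslash continuation).

```shell
#!/bin/sh
# Add --enable-unicode-properties to the %configure line of a spec file.
# Sketch only: enable_unicode_properties is a hypothetical helper, and it
# assumes a single-line entry that starts with "%configure".
enable_unicode_properties() {
    spec=$1
    # Nothing to do if the option is already there (safe to run twice).
    if grep -q -e '--enable-unicode-properties' "$spec"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Append the option to every line that begins with %configure,
    # then write the result back over the original file.
    sed '/^%configure/ s/$/ --enable-unicode-properties/' "$spec" > "$tmp" &&
        cat "$tmp" > "$spec"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage:
#   enable_unicode_properties ~/src/rpm/SPECS/pcre.spec
```

Checking for the option first means the function does nothing on a spec file that already has Unicode properties enabled.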
Updated rpm File for CentOS 5.2 64-bit
If you are running a 64-bit version of CentOS 5.2, the following file should work for you. If you have a different architecture, Linux distro, or encounter any errors when trying to install this file, then you should follow the instructions above to build an rpm that is suitable for your distribution.
pcre-6.6-2.7.x86_64.rpm – PCRE 6.6 for CentOS 5.2 64-bit
Installing New rpm
Now that I have my new rpm file, I just need to install it. Since I already have a pcre package installed, I need to tell the rpm command to update rather than install. The following command does this for me:
[root@home ~]# rpm -Uvh ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Notice that I need to be root to run this command.
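If pcre-devel is also installed, rpm may refuse to upgrade pcre on its own because of the version dependency between the two, so it can help to upgrade the rebuilt packages together. The sketch below pulls the installable package paths out of the “Wrote:” lines that rpmbuild prints. The function name installable_rpms and the build.log file are made up for this example, and it assumes the “Wrote:” format shown earlier.

```shell
#!/bin/sh
# Print the binary rpm paths from rpmbuild's "Wrote:" lines on stdin,
# skipping the source rpm and the debuginfo package.
# installable_rpms is a hypothetical helper name.
installable_rpms() {
    sed -n 's/^Wrote: //p' | grep -v -e '\.src\.rpm$' -e '-debuginfo-'
}

# Usage, assuming the build output was saved to build.log:
#   rpmbuild -ba ~/src/rpm/SPECS/pcre.spec | tee build.log
#   rpm -Uvh $(installable_rpms < build.log)    # run as root
```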
Finally, to verify that everything worked, I ran the pcretest program again:
[chris@home ~]$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
  UTF-8 support
  Unicode properties support
  Newline character is LF
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
Looks good.
Finally, time to move on with life.
Thanks for this – your RPM works perfectly, and right now I just needed to get this working 🙂
That’s great news Grant. I’m glad that it helped you out.
Thanks a ton! This is exactly what I needed after about 2 hours of searching the internet.
Good deal Adam. I checked out your sites. What anime is that in your “Awesome” category? I don’t recognize it, but I’m intrigued.
Your walk through rocked, I have the new rpm installed, and get unicode goodness :
$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
UTF-8 support
Unicode properties support
…
But I still get the errors in my PHP scripts. Were there any mods to PHP you made here?
I’m glad that you liked the tutorial and got PCRE to work properly Cameron.
As for the errors in your scripts, my best guess is that you need to restart your server process so that PHP reloads. I’ve run into many situations where modifications of PHP wouldn’t change the behavior until I restarted Apache.
If that doesn’t fix your problem, do your error messages match what I have in the post?
Right you are. I did a graceful restart at the time which didn’t work, but I just did a full stop and start cycle and it works well. Thanks again!
That’s great news Cameron.
Happy UTF-8ing. 🙂
Worked for me, thanks!
You’re welcome Sebastiaan. Thanks for the blog link. 🙂
Thanks for the RPM, I just received a confirmation from our RHEL sales rep that this feature will be in RHEL 5.4
Thanks for the instructions on how to enable the UTF-8 properties. I had an oddity in Laconica (a microblogging script) where the RSS feed has asterisks in it. It turns out their regex to replace control characters (“\p{Cc}\p{Cs}”) with asterisks wasn’t working right and was replacing all Ps, Cs and Ss with an asterisk!
Just for reference, I’m on CentOS4 and the CentOS 5 RPM rebuilds fine there 🙂
Thanks for sharing about CentOS 4. I’m sure that others will find the info helpful.
Hi –
Thanks for the info — everything works well until I try the
install….. (I’m running CentOS 5.3 on Intel 386 platform)
>>>>>>>>>>>>>>>
rpm -Uvh ./src/rpm/RPMS/i386/pcre-6.6-2.7.i386.rpm
error: Failed dependencies:
pcre = 6.6-2.el5_1.7 is needed by (installed) pcre-devel-6.6-2.el5_1.7.i386
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I don’t want to start removing installed RPMs and risk wandering into the woods – any idea what I need to do to get back on the right road?
Sorry for the delay KC.
If you follow my “Rebuilding the rpm” instructions, you will get both the pcre and pcre-devel rpms. Try installing both at the same time:
rpm -Uvh ./src/rpm/RPMS/i386/pcre-6.6-2.7.i386.rpm ./src/rpm/RPMS/i386/pcre-devel-6.6-2.7.i386.rpm
This should upgrade both packages at the same time and should bypass the problem. Let me know if this works for you.
Works like a champ! – Thank you so much.
Good to hear KC. I’m glad that it worked.
[…] for this short tutorial I relied on the guides Unicode Support on CentOS 5.2 with PHP and PCRE and How to patch and rebuild an RPM package Tagged as: centos, pcre, PHP, unicode There are no […]
Many thanks,
Worked like a treat.. really appreciate not having to create the RPM!
All the best,
Paul Hudson
Thank you! I am far from an expert on cmd line but I was able to get through this successfully – thanks for taking the time to put it together.
[…] I found a way to solve the problem; this might be useful: Unicode Support on CentOS 5.2 with PHP and PCRE | gaarai.com […]
Thank you SO VERY MUCH for this guide. I would be completely stuck without it.
I’ve uploaded a compiled version for Centos 5.2 on i386 which can be found here:
http://www.ngse.co.uk/pcre-6.6-2.7.i386.rpm
Not sure if you want to host/add this to the article downloads at all.
Thanks again!
I’m glad that it was helpful to you Robin. Thanks for the i386 version. I’ll add it to the post.
Holy Criminey!
That 32-bit package just saved my life. Great post…I stand humbled before the linux gurus.
Thanks, worked for me as well.
I’m running CentOS 4.7 so I had to track down the source RPM’s. Was able to thanks to Alan Dixon’s post here: http://homeofficekernel.blogspot.com/2009/01/centos4-and-civicrm-21.html
I know there are a number of folks out there still running CentOS 4 – I’d be happy to send you my 4.7 compatible rebuilt rpm’s to post, if you’d like. Just shoot me an email.
Thanks again!
Thanks Chris. I’m adding your RPM to the post.
This totally helped me out. I was going crazy trying to get the compile to work with pcre-4.5-4.el4_6.6.src.rpm but it just does not work.
Thank You Very Much,
Pete.
Thank you very much for your detailed explanation. I am running RHEL5 x86_64 but not an expert at all in this kind of thing. It worked out very well.
Best regards,
Jaap
I’m glad that it worked for you Jaap.
Thank you! Worked perfectly in one step:
rpm -Uvh http://gaarai.com/wp-content/uploads/2009/01/pcre-66-27x86_64.rpm
I learn something new every day. I didn’t realize that you could run RPM using a URL.
Thanks for the tip.
much love for the quick fix. providing the rpm is appreciated.
All this time later I’m in a situation where I’m forced to use CentOS and still the problem exists. Meh.
Your RPM saved me plenty of time here – thanks so much for your good work!
Another big thanks for the RPM! I am running Centos 5.4 64-bit and it worked perfectly.
Thanks again!
Many thanks for this. I was trying to do some crazy workarounds with htmlentities. It was driving me nuts.
You rock.
I had to do this on a 32-bit Centos 5.4 machine to get Status.net working. It was a really hard bug to find, a regex in the rss feed generating code was failing and replacing all C’s, c’s and s’s with asterisks.
Thanks for the tip.
Thanks for sharing this tip. It solved my problem when I was trying to get kohana 2.4 to work on my CentOS 5.5.
Before, Kohana complained:
PCRE has not been compiled with Unicode property support
Now it shows PCRE UTF-8 “PASS”.
Thanks VERY much for sharing this!
Very nice. Thanks a lot!
You are amazing!
It works also with CentOS release 5.5 (Final)
Thank you SO much, you saved me!
The messages you get from the above version are only:
Preparing… ########################################### [100%]
1:pcre ########################################### [100%]
It runs for just a few seconds and you’re done!
[…] Unicode Support on CentOS 5.2 with PHP and PCRE […]
In the above, Manos said it works with CentOS release 5.5 (Final),
but it did not work for me with CentOS 5.5 (Final).
The messages I get are:
rpm -ivh /home/l0l/oldfilesFromMybook/CentOS5..5/pcre-6.6-2.7.i386.rpm
Preparing… ########################################### [100%]
file /lib/libpcre.so.0.0.1 from install of pcre-6.6-2.7.i386 conflicts with file from package pcre-6.6-2.el5_1.7.i386
file /usr/bin/pcregrep from install of pcre-6.6-2.7.i386 conflicts with file from package pcre-6.6-2.el5_1.7.i386
file /usr/bin/pcretest from install of pcre-6.6-2.7.i386 conflicts with file from package pcre-6.6-2.el5_1.7.i386
file /usr/lib/libpcrecpp.so.0.0.0 from install of pcre-6.6-2.7.i386 conflicts with file from package pcre-6.6-2.el5_1.7.i386
file /usr/lib/libpcreposix.so.0.0.0 from install of pcre-6.6-2.7.i386 conflicts with file from package pcre-6.6-2.el5_1.7.i386
That’s because you have the pcre-6.6-2.el5_1.7.i386 package installed just as the message indicates. It has to be removed before you can install the new one.
Great writeup. Used it to patch PCRE on CentOS 5.5 – Many thanks!
Chris, thank you for sharing. Works well.
Robin, thank you for your 32bit rpm.
I just used
rpm -ivh --force --nodeps --replacepkgs --replacefiles http://gaarai.com/wp-content/uploads/2009/01/pcre-6.6-2.7.i386.rpm
Worked like a charm.
Chris,
This article was absolutely instrumental in helping us solve an issue we were seeing on our CentOS server using the Sluggable Behavior in CakePHP – the slugs were coming out as complete garbage.
After several hours of head scratching, we eventually came across your article. I downloaded the pcre-66-27x86_64.rpm file, uploaded it to our server, ran the rpm -Uvh pcre-66-27x86_64.rpm command, restarted apache and… problem sorted!
Thank you, thank you, thank you!
Keith
Glad to help Keith. Best of luck with your project.
I’m wondering why Centos does NOT have the 6.6-6 version out yet. Redhat has this version and with unisupport enabled.
It’s hard to say. Since it causes so many issues with applications and since it is so clearly a mistake, I really don’t know why this hasn’t been passed into the distro repos.
[…] it was not difficult though, based on this blog post and these directions. Then rebuild PHP, bounce Apache, and you’re […]
Rules!
You saved my day!
Thanks for this post and this tip.
Thanks. All works fine.
Thanks, it saved me lot of headache!
reason #65,537 why i love the internet.
Thanks man,
you saved hours of my life! 🙂
Thanks for pointing me in the right direction.
If running CentOS 5 the latest PCRE version 6.6.6 has this enabled, and I think that was updated recently (I am CentOS 5.3). Just update now. Since I was 5.3 and 32 bit, I was following the rebuild instructions (my steps below), and editing pcre.spec I found enable-unicode-properties already there. Yum update got me rolling on my machine:
Pre-update:
%uname -i
– Linux system_name 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:53:09 EST 2011 i686 i686 i386 GNU/Linux
%yum list pcre*
– told me I was running 6.6.2 and 6.6.6 was available.
%pcretest -C
– said “No Unicode properties support”.
After “yum update”…
%pcretest -C
– said “Unicode properties support”
Here were the steps I “was” up to in case it helps someone else…
su - my_normal_user
echo "%_topdir /home/elliottcg/src/rpm" >> ~/.rpmmacros
mkdir -p ~/src/rpm
cd ~/src/rpm
mkdir BUILD RPMS SOURCES SPECS SRPMS
mkdir RPMS/{i386,i486,i586,i686,noarch,athlon}
cd ~
wget http://mirrors.kernel.org/centos/5/os/SRPMS/pcre-6.6-6.el5.src.rpm
rpm -ivh pcre-6.6-6.el5.src.rpm
note: ignored “warning: xxx mockbuild…” messages
vi ~/src/rpm/SPECS/pcre.spec
FIX (hey! already enabled in 6.6.6): %configure --enable-utf8 --enable-unicode-properties
rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
rpm -Uvh ~/src/rpm/RPMS/xxx/pcre-6.6-2.7.i386.rpm ~/src/rpm/RPMS/xxx/pcre-devel-6.6-2.7.i386.rpm
As of `pcre-6.6-6.el5_6.1`, Unicode properties are enabled officially, so feel free to upgrade pcre in the usual way.
Thank you very much for the help, it saved me a lot of time!
[…] The solution varies from distro to distro. Here you’ll find the way to get Unicode support on Centos. […]
Thanks, I used this for a new installation of Joomla. There was a blank page on the new install; then I looked at my nginx error log and found the problem: a PCRE issue.
I had 8.10 and updated to 8.13, downloaded from the main website http://www.pcre.org/ .
After that I just ran tar -zxvf, configured as the author of this post said ( ./configure --enable-utf8 --enable-unicode-properties ), ran make and make install, and restarted my nginx. (Note that I stopped and started the web server and also FastCGI; I don’t know why a normal “restart” does not take effect.)
Now everything works great. Thank you !
yum update pcre did it for us
An excellent write up, solution worked beautifully. Magento was yelling at me about this and I thought for sure I was going to have to compile from source or something. Very elegant and no errors encountered. Thanks!
Thanks, it really helped. I was facing similar issues while installing MediaWiki 1.19.0.
[…] Follow this wonderful guide for rebuilding the pcre package to include the “--enable-unicode-properties”. Once built, use RPM to force the re-installation/upgrade of the package. (rpm -Uvh --force ~/src/rpm/RPMS/x86_64/pcre-6.6-2.el5_1.7.x86_64.rpm) […]
I know this is an old post but I just downloaded the 32 bit version that Robin provided.
I then uploaded the file pcre-6.6-2.7.i386.rpm to my VPS (CentOS 5, 32bit, Plesk 10.4)
Installed the new rpm using :
rpm -Uvh /myuploadpath/pcre-6.6-2.7.i386.rpm
Restarted Apache
Works great. unicode properties enabled
Saved me a lot of time.
Thanks,
[…] several hours of searching, I came across this post: Unicode Support on CentOS 5.2 with PHP and PCRE, which says that the problem is that PCRE, […]
Thanks a lot for this page. In fact, I am new to MediaWiki setup. I did everything, but my wiki setup was showing a fatal error related to PCRE. I got the idea from your page and tried to download PCRE for my setup, i.e. i386. I downloaded the RPM, but it showed a compatibility error. Then I tried “yum install pcre” on CentOS 5.5, and when I ran the configuration again, it worked and I finally got my wiki.
[…] If you get this error, this page describes a fix […]