Problem searching accented Arabic text

cestmoi - January 23, 2008 - 00:00

Hi all,
The search module works fine as long as the searched (Arabic) text has no accents (diacritics). As soon as any word in the search phrase carries accents, the search returns no results. A Google search still finds those words without any problem.

I installed the "accents" module, but I still get no results.

Example:

Good:
هذه جملة
Bad:
هَذِه جُملَةٌ

I hope someone has an answer or a solution for this problem.

Many thanks

accents not for Arabic

Emad - January 25, 2008 - 07:31

Hi,
The accents module doesn't support Arabic; you would need to customize it. Google, on the other hand, has a complex and smart engine.
If I find anything on the web about this issue, I'll let you know.
BTW, users rarely search with accents, and most sites will return no results.
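
For what it's worth, "customizing" could be as simple as adding the Arabic diacritics to the kind of replacement table the module passes to strtr(). A minimal sketch, assuming PHP 7 or later (for the \u{...} escapes) and UTF-8 input. The function name is made up for illustration and is not part of the accents module:

```php
<?php
// Hypothetical helper, not part of accents.module: delete the common
// Arabic diacritics (tashkeel) by mapping each one to an empty string.
function example_strip_arabic_marks($text) {
  $marks = array(
    "\u{064B}" => '', // fathatan
    "\u{064C}" => '', // dammatan
    "\u{064D}" => '', // kasratan
    "\u{064E}" => '', // fatha
    "\u{064F}" => '', // damma
    "\u{0650}" => '', // kasra
    "\u{0651}" => '', // shadda
    "\u{0652}" => '', // sukun
  );
  // strtr() with an array replaces every key it finds with its value.
  return strtr($text, $marks);
}

// The accented sentence from the first post.
echo example_strip_arabic_marks('هَذِه جُملَةٌ'), "\n";
```

The list covers the same eight marks that appear in the "Normalization" rule of the function posted further down; other marks would have to be added by hand.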

some clarification

cestmoi - January 25, 2008 - 17:20

thanks shobaki

I did a search on the web and have yet to find any answer.

"BTW users rarely search with accents and most sites will return no result."

My understanding is that the accents module removes accents from the indexed words in the site's content, not from the search string.

The problem is with accents in the content being searched. For example, if one or more articles on my website contain an Arabic word with accents and I search for that word, with or without accents in the search string, no results are returned either way.

The problem does not occur for certain accents (e.g. dammah). For others, the search gives this message for any word or sentence:

عليك أن تضمن على الأقل كلمة واحدة (غير منفية) بها 3 أحرف أو أكثر.
(In English: "You must include at least one positive keyword with 3 characters or more.")

Also, searching for text without accents in the search form returns no results if that text has accents in the site's content.

I know very little about handling languages on non-English websites.

Thanks always
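
One way to make both cases match is to run the same normalizer over the content when it is indexed and over the query when it is searched. The following is a toy sketch of that idea only, not Drupal's actual search code. The function names and the array-based index are assumptions for illustration:

```php
<?php
// Hypothetical normalizer: remove the Arabic diacritics U+064B..U+0652.
function example_normalize($text) {
  return preg_replace('/[\x{064B}-\x{0652}]/u', '', $text);
}

// Toy index: normalized word => set of document ids.
function example_index(array $documents) {
  $index = array();
  foreach ($documents as $id => $body) {
    $words = preg_split('/\s+/u', example_normalize($body), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
      $index[$word][$id] = true;
    }
  }
  return $index;
}

// Look up a single word after applying the same normalizer to it.
function example_search(array $index, $word) {
  $key = example_normalize($word);
  return isset($index[$key]) ? array_keys($index[$key]) : array();
}

$index = example_index(array(
  // Document 1 is stored with diacritics, document 2 without.
  1 => 'هَذِه جُملَةٌ',
  2 => 'هذه جملة',
));

// Query once without and once with diacritics.
print_r(example_search($index, 'جملة'));
print_r(example_search($index, 'جُملَةٌ'));
```

Because the index keys and the query key go through the same function, it no longer matters which side carried the accents.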

Arabic Query

Emad - January 26, 2008 - 22:10

Thanks cestmoi,
You are right, accents in content are common, especially in religion-related text.

I found this function at http://www.phpclasses.org/browse/package/2875.html

function lex($arg) {
$patterns = array();
$replacements = array();

// Prefixes
array_push($patterns, '/^ال/'); array_push($replacements, '(ال)?');

// Singular
array_push($patterns, '/(\S{3,})تين$/'); array_push($replacements, '\\1(تين|ة)?');
array_push($patterns, '/(\S{3,})ين$/'); array_push($replacements, '\\1(ين)?');
array_push($patterns, '/(\S{3,})ون$/'); array_push($replacements, '\\1(ون)?');
array_push($patterns, '/(\S{3,})ان$/'); array_push($replacements, '\\1(ان)?');
array_push($patterns, '/(\S{3,})تا$/'); array_push($replacements, '\\1(تا)?');
array_push($patterns, '/(\S{3,})ا$/'); array_push($replacements, '\\1(ا)?');
array_push($patterns, '/(\S{3,})(ة|ات)$/'); array_push($replacements, '\\1(ة|ات)?');

// Suffixes
array_push($patterns, '/(\S{3,})هما$/'); array_push($replacements, '\\1(هما)?');
array_push($patterns, '/(\S{3,})كما$/'); array_push($replacements, '\\1(كما)?');
array_push($patterns, '/(\S{3,})ني$/'); array_push($replacements, '\\1(ني)?');
array_push($patterns, '/(\S{3,})كم$/'); array_push($replacements, '\\1(كم)?');
array_push($patterns, '/(\S{3,})تم$/'); array_push($replacements, '\\1(تم)?');
array_push($patterns, '/(\S{3,})كن$/'); array_push($replacements, '\\1(كن)?');
array_push($patterns, '/(\S{3,})تن$/'); array_push($replacements, '\\1(تن)?');
array_push($patterns, '/(\S{3,})نا$/'); array_push($replacements, '\\1(نا)?');
array_push($patterns, '/(\S{3,})ها$/'); array_push($replacements, '\\1(ها)?');
array_push($patterns, '/(\S{3,})هم$/'); array_push($replacements, '\\1(هم)?');
array_push($patterns, '/(\S{3,})هن$/'); array_push($replacements, '\\1(هن)?');
array_push($patterns, '/(\S{3,})وا$/'); array_push($replacements, '\\1(وا)?');
array_push($patterns, '/(\S{3,})ية$/'); array_push($replacements, '\\1(ي|ية)?');
array_push($patterns, '/(\S{3,})ن$/'); array_push($replacements, '\\1(ن)?');

// Writing errors
array_push($patterns, '/(ة|ه)$/'); array_push($replacements, '(ة|ه)');
array_push($patterns, '/(ة|ت)$/'); array_push($replacements, '(ة|ت)');
array_push($patterns, '/(ي|ى)$/'); array_push($replacements, '(ي|ى)');
array_push($patterns, '/(ا|ى)$/'); array_push($replacements, '(ا|ى)');
array_push($patterns, '/(ئ|ىء|ؤ|وء|ء)/'); array_push($replacements, '(ئ|ىء|ؤ|وء|ء)');

// Normalization
array_push($patterns, '/ّ|َ|ً|ُ|ٌ|ِ|ٍ|ْ/'); array_push($replacements, '(ّ|َ|ً|ُ|ٌ|ِ|ٍ|ْ)?');
array_push($patterns, '/ا|أ|إ|آ/'); array_push($replacements, '(ا|أ|إ|آ)');

$arg = preg_replace($patterns, $replacements, $arg);

return $arg;
}

I hope it helps you.
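
One thing worth knowing before using this: when preg_replace() is given arrays, it applies the patterns one after another, each to the output of the previous one, so a later rule can rewrite text that an earlier rule inserted. A cut-down sketch with just two of the rules above. The function name is made up, and the /u modifier is my addition so the patterns are treated as UTF-8:

```php
<?php
// Two of the rules from lex(), applied in the same order.
function example_lex_two_rules($word) {
  $patterns = array(
    // Leading definite article.
    '/^ال/u',
    // Any alef form.
    '/ا|أ|إ|آ/u',
  );
  $replacements = array(
    // Make the article optional.
    '(ال)?',
    // Accept every alef form.
    '(ا|أ|إ|آ)',
  );
  return preg_replace($patterns, $replacements, $word);
}

$regex = example_lex_two_rules('الكتاب');

// The second rule also expands the alef inside the "(ال)?" group that
// the first rule inserted, so the article accepts any alef form too.
echo $regex, "\n";

// The result is a regular expression, not a plain word.
// First test: the same word without the article.
var_dump(preg_match('/^' . $regex . '$/u', 'كتاب'));
// Second test: the word spelled with a hamza on its second alef.
var_dump(preg_match('/^' . $regex . '$/u', 'الكتأب'));
```

So the return value is meant to be matched against text (the original class uses it in a MySQL query), not stored or displayed as a word.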

thanks shobaki, I hope this

cestmoi - January 28, 2008 - 23:38

Thanks shobaki, I hope this is what I need.

Any hints on what to do with it, or where to start?

(I am not a programmer but a good learner)

thanks for the help

accents.module

Emad - January 29, 2008 - 00:25

Well, I am new to Drupal and I just tried the search module; it returns no results for me whatever I search for! Maybe I missed something.
Anyway, I changed the accents.module file, but I cannot confirm that my changes are correct, as my searches still always return nothing.
I don't know how to attach a file, so I'll put the content of accents.module here (I apologize for any inconvenience):
<?php
// $Id: accents.module,v 1.3 2006/11/11 00:38:16 canen Exp $
/**
* @file
* Remove accents from words before searching. Will require a re-indexing.
*
* The remove_accents and seems_utf8 functions were taken from Wordpress.
* They can be found here:
* http://trac.wordpress.org/browser/trunk/wp-includes/formatting.php
*/

function accents_search_preprocess($text) {
$text = _accents_remove_accents($text);
$text = _accents_lex($text);
return $text;
}

/**
* From Wordpress:
* http://trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L86
*/
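function _accents_seems_utf8($str) { // by bmorel at ssi dot fr
for ($i = 0; $i < strlen($str); $i++) {
if (ord($str[$i]) < 0x80) continue; // 0bbbbbbb
elseif ((ord($str[$i]) & 0xE0) == 0xC0) $n = 1; // 110bbbbb
elseif ((ord($str[$i]) & 0xF0) == 0xE0) $n = 2; // 1110bbbb
elseif ((ord($str[$i]) & 0xF8) == 0xF0) $n = 3; // 11110bbb
elseif ((ord($str[$i]) & 0xFC) == 0xF8) $n = 4; // 111110bb
elseif ((ord($str[$i]) & 0xFE) == 0xFC) $n = 5; // 1111110b
else return false; // Does not match any model
for ($j = 0; $j < $n; $j++) { // n bytes matching 10bbbbbb follow?
if ((++$i == strlen($str)) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}

/**
* From Wordpress:
* http://trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L151
*/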
function _accents_remove_accents($string) {
if (!preg_match('/[\x80-\xff]/', $string))
return $string;

if (_accents_seems_utf8($string)) {
$chars = array(
// Decompositions for Latin-1 Supplement
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
chr(195).chr(191) => 'y',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
// Euro Sign
chr(226).chr(130).chr(172) => 'E'
);

$string = strtr($string, $chars);
}
else {
// Assume ISO-8859-1 if not UTF-8
$chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
.chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
.chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
.chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
.chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
.chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
.chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
.chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
.chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
.chr(252).chr(253).chr(255);

$chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";

$string = strtr($string, $chars['in'], $chars['out']);
$double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
$double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
$string = str_replace($double_chars['in'], $double_chars['out'], $string);
}

return $string;
}

// ----------------------------------------------------------------------
// Copyright (C) 2006 by Khaled Al-Shamaa.
// http://www.al-shamaa.com/
// --
/**
* @return String Regular Expression format to be used in MySQL query statement
* @param String $arg String of one word user want to search for
* @desc Lex method will implement various regular expression rules based on pre-defined Arabic lexical rules
* @author Khaled Al-Shamaa
*/
function _accents_lex($arg) {
$patterns = array();
$replacements = array();

// Prefixes
array_push($patterns, '/^ال/'); array_push($replacements, '(ال)?');

// Singular
array_push($patterns, '/(\S{3,})تين$/'); array_push($replacements, '\\1(تين|ة)?');
array_push($patterns, '/(\S{3,})ين$/'); array_push($replacements, '\\1(ين)?');
array_push($patterns, '/(\S{3,})ون$/'); array_push($replacements, '\\1(ون)?');
array_push($patterns, '/(\S{3,})ان$/'); array_push($replacements, '\\1(ان)?');
array_push($patterns, '/(\S{3,})تا$/'); array_push($replacements, '\\1(تا)?');
array_push($patterns, '/(\S{3,})ا$/'); array_push($replacements, '\\1(ا)?');
array_push($patterns, '/(\S{3,})(ة|ات)$/'); array_push($replacements, '\\1(ة|ات)?');

// Suffixes
array_push($patterns, '/(\S{3,})هما$/'); array_push($replacements, '\\1(هما)?');
array_push($patterns, '/(\S{3,})كما$/'); array_push($replacements, '\\1(كما)?');
array_push($patterns, '/(\S{3,})ني$/'); array_push($replacements, '\\1(ني)?');
array_push($patterns, '/(\S{3,})كم$/'); array_push($replacements, '\\1(كم)?');
array_push($patterns, '/(\S{3,})تم$/'); array_push($replacements, '\\1(تم)?');
array_push($patterns, '/(\S{3,})كن$/'); array_push($replacements, '\\1(كن)?');
array_push($patterns, '/(\S{3,})تن$/'); array_push($replacements, '\\1(تن)?');
array_push($patterns, '/(\S{3,})نا$/'); array_push($replacements, '\\1(نا)?');
array_push($patterns, '/(\S{3,})ها$/'); array_push($replacements, '\\1(ها)?');
array_push($patterns, '/(\S{3,})هم$/'); array_push($replacements, '\\1(هم)?');
array_push($patterns, '/(\S{3,})هن$/'); array_push($replacements, '\\1(هن)?');
array_push($patterns, '/(\S{3,})وا$/'); array_push($replacements, '\\1(وا)?');
array_push($patterns, '/(\S{3,})ية$/'); array_push($replacements, '\\1(ي|ية)?');
array_push($patterns, '/(\S{3,})ن$/'); array_push($replacements, '\\1(ن)?');

// Writing errors
array_push($patterns, '/(ة|ه)$/'); array_push($replacements, '(ة|ه)');
array_push($patterns, '/(ة|ت)$/'); array_push($replacements, '(ة|ت)');
array_push($patterns, '/(ي|ى)$/'); array_push($replacements, '(ي|ى)');
array_push($patterns, '/(ا|ى)$/'); array_push($replacements, '(ا|ى)');
array_push($patterns, '/(ئ|ىء|ؤ|وء|ء)/'); array_push($replacements, '(ئ|ىء|ؤ|وء|ء)');

// Normalization
array_push($patterns, '/ّ|َ|ً|ُ|ٌ|ِ|ٍ|ْ/'); array_push($replacements, '(ّ|َ|ً|ُ|ٌ|ِ|ٍ|ْ)?');
array_push($patterns, '/ا|أ|إ|آ/'); array_push($replacements, '(ا|أ|إ|آ)');

$arg = preg_replace($patterns, $replacements, $arg);

return $arg;
}
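
A note on the approach: according to its own docblock, the lex function returns a regular expression meant for a MySQL query, whereas a search preprocess hook is expected to hand back plain text for the indexer. An alternative worth trying is to normalize the text instead of expanding it into a pattern. A rough sketch under that assumption. "mymodule" is a placeholder module name, and the set of characters handled here (including the tatweel) is my own pick, loosely following the "Normalization" rules above:

```php
<?php
// Hypothetical implementation of hook_search_preprocess() for a module
// called "mymodule". Returns plain text, not a regular expression.
function mymodule_search_preprocess($text) {
  // Drop the diacritics U+064B..U+0652 and the tatweel (kashida) U+0640.
  $text = preg_replace('/[\x{064B}-\x{0652}\x{0640}]/u', '', $text);
  // Fold alef with madda or hamza (U+0622, U+0623, U+0625) to bare alef.
  $text = preg_replace('/[\x{0622}\x{0623}\x{0625}]/u', "\u{0627}", $text);
  return $text;
}

// A fully vowelled name that starts with alef-hamza.
echo mymodule_search_preprocess('أَحْمَد'), "\n";
```

As the header of accents.module says, a change like this requires re-indexing the site before it has any effect on existing content.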

Accents work with Arabic

Emad - January 29, 2008 - 19:54

I tested it, and Accents works perfectly with Arabic.
I just discovered why the search was not working: no cron. I installed Poormanscron; it takes a long time, and it forced me to set max_execution_time = 200 in php.ini to give cron more time. I don't know whether this was caused by my changes to the accents module or is normal.
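
For anyone hitting the same timeout: where the host allows it, the limit can also be raised from PHP for a single request instead of editing php.ini. A small sketch; 200 is simply the value mentioned above, and whether your host permits changing it at runtime is an assumption:

```php
<?php
// Read the current limit (in seconds; 0 means no limit).
$before = ini_get('max_execution_time');

// Raise it for this request only. ini_set() returns false on failure.
if (ini_set('max_execution_time', '200') === false) {
  echo "Could not change max_execution_time\n";
}

echo "before: $before, now: ", ini_get('max_execution_time'), "\n";
```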

what about this error

cestmoi - January 30, 2008 - 00:15

Thanks shobaki, but doesn't it give you this error after patching the accents module?

Parse error: syntax error, unexpected ':', expecting ';' in /home/drupal/public_html/sites/all/modules/accents/accents.module on line 23

This error appears just by visiting the homepage, without doing anything at all.

accents.module link

Emad - January 30, 2008 - 04:03

Download link for accents.module:
http://depositfiles.com/files/3298499

 
 
