Hi, I'm not sure exactly where this belongs.
Please move it to the right forum if I posted it in the wrong place.

I searched drupal.org for "site search noise list" (in several variations) and could not find anything, so I decided to contribute one.

This is a premade list of noise words that Drupal will not search for (it is administered/added from the site settings page).

===drupal noise words list 0.1 start (do not insert this)===

about,after,all,also,an,and,another,any,are,as,at,be,because,been,before
being,between,both,but,by,came,can,come,could,did,do,each,for,from,get
got,has,had,he,have,her,here,him,himself,his,how,if,in,into,is,it,like
make,many,me,might,more,most,much,must,my,never,now,of,on,only,or,other
our,out,over,said,same,see,should,since,some,still,such,take,than,that
the,their,them,then,there,these,they,this,those,through,to,too,under,up
very,was,way,we,well,were,what,where,which,while,who,with,would,you,your,a
b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,$,1,2,3,4,5,6,7,8,9,0,_

===drupal noise list 0.1 stop (do not insert this)===
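
For anyone curious what a list like this does to a query, here is a minimal sketch in Python. It is not Drupal's own code; the function names `parse_noise_list` and `filter_query` are made up for illustration, and the sample uses only the first two lines of the list above. Note that the list breaks lines without a trailing comma, so line breaks have to be treated as separators too.

```python
def parse_noise_list(text):
    """Turn a comma-separated noise list (possibly spanning lines) into a set."""
    words = set()
    for line in text.splitlines():
        for item in line.split(","):
            item = item.strip().lower()
            if item:
                words.add(item)
    return words


def filter_query(query, noise_words):
    """Return the query terms that are not noise words, in their original order."""
    return [term for term in query.lower().split() if term not in noise_words]


# First two lines of the 0.1 list above, used as sample data.
NOISE = parse_noise_list("""
about,after,all,also,an,and,another,any,are,as,at,be,because,been,before
being,between,both,but,by,came,can,come,could,did,do,each,for,from,get
""")

print(filter_query("Search for noise words before indexing", NOISE))
```

With this sample, "for" and "before" are dropped and the remaining terms are kept.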

Comments

===drupal noise words list 0.1 start (do not insert this)===

aan, alle, als, andere, arial, backgroundcolor, biedt, bij, but, color, dan, dat, de, deze, die, dit, door, drie, dus, een, elk, elke, en, enkel, even, friday, geen, hebben, hebt, het, hij, http, in, is, je, kan, kunnen, maar, mag, maken, meer, mensen, met, moet, moeten, more, naar, nbsp, niet, nieuwe, nog, not, of, on, ook, op, over, party, sansserif, sex, steeds, te, tot, uit, uur, van, veel, verdana, voor, voor, vooral, way, web, wel, welke, werd, willen, worden, wordt, zal, zet, zijn, zoals, zonder, zou

===drupal noise list 0.1 stop (do not insert this)===

I was searching for a solution to a problem I have been having with Drupal search and came across this thread. I thought I would add the 27 most common words in the English language as a starting point for people.

=== top 27 noise words list (do not insert this) ===
the, and, a, to, of, in, i, is, that, it, on, you, this, for, but, with, are, have, be, at, or, as, was, so, if, out, not
=== /end list (do not insert this) ===

andre

culled and merged from a few online sources:

------ cut ------

able, available, bad, big, black, central, certain, clear, close, common, concerned, current, different, difficult, due, early, easy, economic, far, final, financial, fine, following, foreign, free, full, general, good, great, happy, hard, high, human, individual, industrial, international, important, large, last, late, legal, likely, line, little, local, long, low, main, major, modern, new, name, national, natural, necessary, nice, normal, old, only, open, other, particular, personal, political, poor, possible, present, previous, prime, private, public, real, recent, red, right, royal, serious, short, significant, simple, similar, single, small, social, sorry, special, strong, sure, true, various, was, white, whole, wide, wrong, young, labor, left, dead, specific, total, appropriate, military, basic, original, successful, aware, popular, professional, heavy, top, dark, ready, useful, not, out, up, so, then, more, now, just, also, well, only, very, how, when, as, mean, even, there, down, back, still, here, too, on, turn, where, over, much, is, however, again, never, all, most, about, in, why, away, really, cause, off, always, next, rather, quite, right, often, yet, perhaps, already, least, almost, long, together, are, later, less, both, once, probably, ever, no, far, actually, today, enough, therefore, around, soon, particularly, early, else, sometimes, thus, further, ago, yesterday, usually, indeed, certainly, home, simply, especially, better, either, clearly, instead, round, to, finalty, please, forward, quickly, recently, anyway, suddenly, generality, nearly, obviously, though, hard, okay, exactly, above, maybe, and, that, help, but, or, as, it, think, than, when, because, so, while, where, although, whether, until, though, since, alter, before, nor, unless, once, the, a, form, this, this, that, which, an, their, what, all, her, some, its, my, your, no, these, any, such, our, many, those, own, more, same, each, another, next, most, both, every, much, little, 
several, half, whose, few, former, whatever, either, less, to, yeah, no, yes, well, will, would, can, could, should, may, must, might, shall, used, come, get, give, go, keep, let, make, put, seem, take, be, do, have, say, see, send, may, will, about, across, after, against, among, at, before, between, by, down, from, in, off, on, over, through, to, under, up, with, as, for, of, till, than, a, the, all, any, every, little, much, no, other, some, such, that, this, I , he, you, who, and, because, but, or, if, though, while, how, when, where, why, again, ever, far, forward, here, near, now, out, still, then, there, together, well, almost, enough, even, not, only, quite, so, very, tomorrow, yesterday, north, south, east, west, please, yes, able, acid, angry, automatic, beautiful, black, boiling, bright, broken, brown, cheap, chemical, chief, clean, clear, common, complex, conscious, cut, deep, dependent, early, elastic, electric, equal, fat, fertile, first, fixed, flat, free, frequent, full, general, good, great, grey/gray, hanging, happy, hard, healthy, high, hollow, important, kind, like, living, long, male, married, material, medical, military, natural, necessary, new, normal, open, parallel, past, physical, political, poor, possible, present, private, probable, quick, quiet, ready, red, regular, responsible, right, round, same, second, separate, serious, sharp, smooth, sticky, stiff, straight, strong, sudden, sweet, tall, thick, tight, tired, true, violent, waiting, warm, wet, wide, wise, yellow, young, awake, bad, bent, bitter, blue, certain, cold, complete, cruel, dark, dead, dear, delicate, different, dirty, dry, false, feeble, female, foolish, future, green, ill, last, late, left, loose, loud, low, mixed, narrow, old, opposite, public, rough, sad, safe, secret, short, shut, simple, slow, small, soft, solid, special, strange, thin, white, wrong

------ cut ------

/marky

;,0 ,1 ,2 ,3 ,4 ,5 ,6 ,7 ,8 ,9 , ‍,  ,  ,  , `, `, ´, ˜, ^, ^, ¯, ‾, ¨, ¨, ¸, _, ­, -, –, —, ,, ;, :, !, ¡, ?, ¿, ., ., …, ·, ', ', ‘, ’, ‚, ‹, ›, ", “, ”, „, «, », (, ), [, ], {, }, §, ¶, ©, ®, @, *, /, ⁄", \, &, #, %, ‰, †, ‡, •, ′, ″, ˆ, °, ←, →, ↑, ↓, ↔, ↵, ⇐, ⇑, ⇒, ⇓, ⇔, ∀, ∂, ∃, ∅, ∇, ∈, ∉, ∋, ∏, ∑, +, ±, ÷, ×, <, =, =, ≠, >, ¬, |, ¦, ~, −, ∗, √, ∝, ∞, ∠, ∧, ∨, ∩, ∪, ∫, ∴, ∼, ≅, ≈, ≡, ≤, ≥, ⊂, ⊄, ⊃, ⊆, ⊇, ⊕, ⊗, ⊥, ⋅, ◊, ♠, ♣, ♥, ♦, ¤, ¢, $, $, £, ¥, €, ℘, ¹, ½, ¼, ², ³, ¾, ª, a, á, Á, À, à, à, Â, â, å, Å, ä, Ä, ã, Ã, æ, Æ, ai, alors, après, as, aussi, autre, autres, avant, b, c, Ç, ç, ça, ceci, cela, ces, ceux-ci, comme, d, dans, de, depuis, des, du, ð, Ð, e, É, é, È, è, Ê, ê, ë, Ë, elle, elle-même, est, et, eux, f, ƒ, ƒ, g, h, i, ℑ, í, Í, Ì, ì, î, Î, Ï, ï, ici, il, ils, j, je, k, l, la, là, le, les, leur, leurs, lui, lui-même, m, ma, mais, mes, moi, moins, mon, n, Ñ, ñ, nos, notre, nous, º, o, Ó, ó, ò, Ò, ô, Ô, Ö, ö, õ, Õ, œ, Œ, on, ou, où, ø, Ø, p, par, plus, pour, q, qu’, que, r, ℜ, s, Š, š, sa, ses, son, ß, t, ta, tel, tes, ™, toi, ton, tous, tout, tu, u, ú, Ú, Ù, ù, û, Û, ü, Ü, un, une, v, vos, votre, vous, w, x, y, Ý, ý, Ÿ, ÿ, z, þ, Þ, Α, α, β, Β, γ, Γ, δ, Δ, ε, Ε, Ζ, ζ, Η, η, Θ, θ, ι, Ι, κ, Κ, λ, Λ, μ, Μ, µ, ν, Ν, Ξ, ξ, Ο, ο, Π, π, Ρ, ρ, σ, Σ, ς, Τ, τ, υ, Υ, Φ, φ, χ, Χ, Ψ, ψ, ω, Ω, ℵ

Nautile Bleu comics and pictures

Looks like noise words have been removed in 4.6.

Rick Cogley :: rick.cogley@esolia.co.jp
Tokyo, Japan

Stop or noise words are not a wise idea, as they can, on the one hand, remove what one is looking for and, on the other hand, create more interference or noise than they intend to reduce. They are little more than a throwback to the limitations of the inverted index.

Removing Info:

The most classic example is: "To be or not to be".

Each of these words is traditionally a stop word, and yet the combination is hardly without meaning. What about Vitamin-A? What about C++? Some engines, like Google, store words as pairs (a throwback to the Ur-WAIS of 15 years ago), so "to be or not to be" goes in, with the stopwords removed, as "tobe" "beor" "ornot" "notto" "tobe".

To search for the phrase "to be or not to be" is to search for the phrase "tobe ornot tobe".

Pairs have their limits, so some systems have adopted n-grams: tobe, tobeor, tobeornot, tobeornotto, tobeornottobe, beor, beornot, beornotto, beornottobe, ornot, ornotto, ornottobe, notto, nottobe
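
To make the pair and n-gram idea concrete, here is a small Python sketch. It is only an illustration of the word-joining described above, not how any particular engine stores its index; the function names are hypothetical. Duplicate n-grams are dropped so that each joined term is listed once.

```python
def word_pairs(phrase):
    """Join each word with its successor: 'to be or' -> ['tobe', 'beor']."""
    words = phrase.lower().split()
    return [first + second for first, second in zip(words, words[1:])]


def word_ngrams(phrase, min_n=2):
    """Join every run of min_n or more consecutive words, without duplicates."""
    words = phrase.lower().split()
    grams = []
    for start in range(len(words)):
        for end in range(start + min_n, len(words) + 1):
            grams.append("".join(words[start:end]))
    # dict.fromkeys keeps the first occurrence of each gram, in order.
    return list(dict.fromkeys(grams))


print(word_pairs("to be or not to be"))
print(word_ngrams("to be or not to be"))
```

Even for a six-word phrase the n-gram list is already 14 terms long, which gives a feel for how quickly the index grows.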

This is not a very good solution either, since it tends to introduce some noise as well, especially given that these systems don't parse the lexical structure and don't try to guess or use structure.

Increasing accuracy:

Adverbs, articles, etc. are widely considered to carry less content, and many systems have been developed that focus just on nouns. While verbs might contain "less", they still contain information, and my experience, at least, seems to indicate that removing them decreases rather than increases accuracy.

Worse still: what is a stopword? Stopword lists are inherently flawed. For what language? For what content?

The trick is to use the good old document vector model of Gerard Salton with adjusted weighting: common words get less weight, rarer words get more weight. Despite four decades of computer advancement and numerous attempts, this model still seems to meet most of our criteria best. Its main drawbacks, however, are speed (there is no way to calculate ranks beforehand, and sorting, especially of large sets, is slow) and memory (sets can get very large).
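
As a rough illustration of "common words get less weight, rarer words get more weight", here is a Python sketch using the usual inverse document frequency, log(N / df). That formula is my assumption of what is meant by adjusted weighting; the post does not name one, and the sample documents and the name `idf_weights` are invented for the example.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Weight each term by log(N / df), where df is the number of documents containing it."""
    total = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(total / count) for term, count in doc_freq.items()}


docs = [
    "the search module",
    "the noise list",
    "the index",
]
weights = idf_weights(docs)
for term in sorted(weights, key=weights.get):
    print(term, round(weights[term], 3))
```

A word that appears in every document ("the" here) ends up with weight zero without ever being put on a stop list, which is the point of the approach.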

It's also difficult to design these systems to handle larger amounts of data and especially long terms. The most popular algorithms tended to be based around inverted indexes, and these are ill-suited to "large" collections and can't really be used for frequent words, whence the call to filter these words out of the index to keep the table size down.

What's wrong with pre-set weighting?

Using pre-set weights for documents allows one to design systems that store, at most, only the unique occurrences of words, rather than the words or terms in the documents. This allows for much simpler algorithms and lets one use a B-tree to much advantage.

On the Internet, models that allow one to pre-set weights (and even allow them to reflect "popularity") are the current vogue. They work reasonably well to satisfy the want of **anything** about a given term but fail on the issue of specifics. They associate the popularity of a site with the quality and relevance of all content contained in it. According to "Google", for instance, the second most significant site on the subject of "Jews" is "Jews for Jesus" followed by "Jews for Judaism" followed by an Anti-Zionist site, Jews for Guns, Jews for Islam and a wide assortment of anti-semitic sites. Looking for "Islam" is hardly better. Now what happens when we type in "Islam Jews"? We get as most relevant according to Google: "Radio Islam", a well-known wack anti-semitic site in Sweden. How about "slaves jews" thinking about the exodus? Well.. right up on the top of the list.. is Louis "Hitler" Farrakhan's "Jews and the Black Holocaust".

Do these racist views indicate what people are looking for or what they are finding?

Indexes for Drupal?

A typical Drupal site is small, so there is really no problem, even using an inverted index, in allowing for all the words (save perhaps the single letters). What is important is to use something like a normalized cosine metric for relevance ranking.
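
Here is a minimal Python sketch of that idea: an inverted index that keeps every word, with results ranked by cosine similarity between the query and each document. It is not Drupal's search code. The names `build_index` and `rank`, the sample documents, and the smoothed weight log(1 + N / df) are all assumptions chosen for the example; the smoothing just keeps very common words above zero so nothing is thrown away.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: term count}. No words are dropped."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def rank(query, docs, index):
    """Return (doc_id, cosine score) pairs for matching documents, best first."""
    total = len(docs)
    # Rarer terms get larger weights; the +1 keeps common terms above zero.
    weight = {t: math.log(1 + total / len(p)) for t, p in index.items()}
    # Length of each document vector, needed to normalize the scores.
    norms = [0.0] * total
    for term, postings in index.items():
        for doc_id, count in postings.items():
            norms[doc_id] += (count * weight[term]) ** 2
    norms = [math.sqrt(value) for value in norms]
    # Query terms unknown to the index cannot match anything, so skip them.
    q = Counter(t for t in query.lower().split() if t in index)
    q_norm = math.sqrt(sum((c * weight[t]) ** 2 for t, c in q.items()))
    scores = defaultdict(float)
    for term, q_count in q.items():
        for doc_id, count in index[term].items():
            scores[doc_id] += (q_count * weight[term]) * (count * weight[term])
    ranked = [(d, s / (norms[d] * q_norm)) for d, s in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = [
    "to be or not to be",
    "vitamin a is not vitamin c",
    "the search index",
]
index = build_index(docs)
results = rank("to be or not to be", docs, index)
print(results)
```

Because no stop words are removed, the classic phrase still finds its document, and the second document only matches weakly through the shared word "not".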

... for this excellent introduction. Very interesting!

Rick Cogley :: rick.cogley@esolia.co.jp
Tokyo, Japan

Posting just to be notified of anything else that is said in this thread. I'm considering implementing a B-tree search module for Drupal. Of course, this would only be useful for large Drupal sites.