Hello,
I'm running Drupal 4.7.2 with the swish module, and everything works swimmingly. That is, everything except the filters.
I've got catdoc, pdftotext, and xls2csv installed, but Swish-e will not use them. In the return values that the engine passes back I can see that Swish sees all the files that it is supposed to be indexing (the conf file is correct), but it ends up only indexing the txt files in the directory. For example, here are the last three lines of the return values:
Indexing done!
Elapsed time: 00:00:00 CPU time: 00:00:00
28 files indexed. 6,023,282 total bytes. 17 total words.
Is this most likely a perms issue? Wouldn't the return values show an error in that case? My conf file builds as follows:
IncludeConfigFile /home/myuser/public_html/dev/modules/swish/conf/common.conf
IndexDir /home/myuser/zDocs
IgnoreWords file: /home/myuser/public_html/dev/modules/swish/conf/stopwords/english.txt
FileRules filename contains .php .inc .module .sql index.
IndexOnly .txt .doc .xls .pdf
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
FileFilter .xls /usr/local/bin/xls2csv "'%p'"
FileFilter .pdf /usr/local/bin/pdftotext "'%p'"
Any ideas? I'm hugely invested in this project and I simply can't beat my head against the wall any longer! Thanks in advance for any help.
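One quick way to rule out a permissions problem is to check that each filter program exists and is executable by the user that runs the indexer. A minimal sketch, assuming the filters live at the paths used in the conf above; the `check_filter` name is made up for illustration:

```shell
#!/bin/sh
# check_filter is a hypothetical helper: report whether a filter
# program exists and is executable by the current user.
check_filter() {
  if [ -x "$1" ] && [ ! -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not executable: $1"
    return 1
  fi
}

# Paths taken from the FileFilter lines above; adjust to your install.
for f in /usr/local/bin/catdoc /usr/local/bin/xls2csv /usr/local/bin/pdftotext; do
  check_filter "$f" || true
done
```

The result only means something when the script runs as the same user that runs the index, which for an index started from Drupal is most likely the web server user.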
| Comment | File | Size | Author |
|---|---|---|---|
| #24 | swish.gif | 21.12 KB | geme4472 |
| #22 | swish_6.module | 13.76 KB | geme4472 |
| #21 | swish_5.module | 13.93 KB | geme4472 |
| #18 | swish_4.module | 13.45 KB | Kenny_au |
| #13 | swish_3.module | 13.47 KB | Kenny_au |
Comments
Comment #1
geme4472 commented
Addition: I'm also chmoding the temporary conf file to 0777 after it is created on-the-fly, plus the Swish-e output suggests that my problem does not lie with the config file.
Comment #2
Kenny_au commented
I am having this exact same problem: txt file indexing works fine, the folders with the documents are getting scanned, and swish-e reports the correct file size.
If it was a permissions problem, the txt file wouldn't be indexed.
Please help, if you could give any suggestions or things to try, please let me know.
Comment #3
Kenny_au commented
Swish-e 2.4.4 resolves this issue; it was related to a problem with the quoting around the filename variable.
Was getting worried for a moment :)
Comment #4
apapendieck commented
Hi Kenny,
I'm in precisely the same boat, and already using Swish-e 2.4.4. Can you please go into the problem a bit more? Which variable? Is it a Drupal config problem, or a swish-e problem? Is there a fix?
Thanks,
Adam
Comment #5
Kenny_au commented
I found the only way to get the FileFilters to work was to remove all the double quotes and just use single quotes around the %p variable.
Swish-e by default outputs the following:
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
This does not work! I'm not sure exactly what it ends up passing to catdoc, but the result is a "Cannot find file" error each time a .doc is indexed.
I changed it to the following:
FileFilter .doc /usr/local/bin/catdoc '%p'
This works fine and you can test this by adding it to a custom conf and running it from bash.
/usr/local/bin/swish -v 4 -c /tmp/swish-tmp-conf -f /var/www/blah/files/my_swish_index
-v 4 sets verbose (debugging) level 4 and -f specifies the index file. You can use the existing one and then do an uploaded-files search on your website to confirm it's being indexed.
Same for pdf:
FileFilter .pdf /usr/local/bin/pdftotext '%p -'
Changing the FileFilter lines in the swish.module to output the quotes correctly will resolve this issue. I tried to change it myself but ended up just breaking the script.
Now, I'm still getting incorrect links in search results: regardless of what sub-folder the document is in, the result still links to it in the files directory. This may be because I am not indexing through the swish.module, not sure though.
To debug the conf file swish.module outputs, comment out the line near the bottom that rm's the tmp conf file and go open it from your temp drive after running an index.
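The manual test described above can be scripted. This is only a sketch: the `write_test_conf` helper, the /tmp file names, and the path to the swish binary are assumptions, and the FileFilter lines are the simplified ones from this comment.

```shell
#!/bin/sh
# write_test_conf is a hypothetical helper: write a minimal Swish-e
# config that uses the simplified FileFilter lines from this comment.
# $1 = config file to create, $2 = directory to index
write_test_conf() {
  cat > "$1" <<EOF
IndexDir $2
IndexOnly .txt .doc .pdf
FileFilter .doc /usr/local/bin/catdoc '%p'
FileFilter .pdf /usr/local/bin/pdftotext '%p -'
EOF
}

conf=/tmp/swish-test-conf
write_test_conf "$conf" /home/myuser/zDocs

# Path to the swish binary is an assumption; adjust to your install.
swish=/usr/local/bin/swish
if [ -x "$swish" ]; then
  "$swish" -v 4 -c "$conf" -f /tmp/swish-test-index
else
  echo "swish binary not found at $swish; conf written to $conf"
fi
```

Writing to a separate index file keeps the test from touching the index the site is using.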
Comment #6
Kenny_au commented
Wish you could edit these posts :(
The incorrect link in search results still appears when indexing just the txt files, so it wasn't because I didn't build the index through the swish.module.
The fixes posted in this project related to this don't solve anything. That patch didn't break anything, but documents are still being linked to the files directory :(
Not sure what I'm doing wrong.
If some hero can upload a swish.module that definitely links to documents correctly in search results, I can check against it to narrow down the problem.
Comment #7
Kenny_au commented
Platform:
Latest drupal
swish-e 2.4.4
swish.module cvs 1.3
latest debian unstable
apache/php4
Comment #8
geme4472 commented
I can confirm Kenny_au's estimation of the situation. It can be run from bash w/o the dbl quotes and catdoc works. It (swish engine, not drupal mod) also gives the following error if I remove the double quotes:
Yes, the spelling error was funny, but that was about four thousand errors ago. So, how can you pass the "[options]" arg without double quotes? Well, I removed them, to no avail. I added braces instead, to no avail (yeah, this was in desperation). How else can I pass a string with spaces? I ended up passing just the '%p' arg and catdoc worked fine. Let me restate: catdoc worked fine! I have yet to try the other filters.
As for the link's filepath, I think that's just something that needs to be correctly parsed from swish's results. In other words, I think it's the module's fault. I'll see if I can't get this to work correctly. If I do post a patch, it'll be messy, cuz I've hacked this module from dusk till dawn.
Comment #9
geme4472 commented
Wow, swish is returning almost nothin' after that search command is executed. I truly thought filepath would be passed back... perhaps we just need to ask for it? Unfortunately, I don't know the engine very well (yet).
I think another huge task to be done with this module is access rights checking, unless I missed a whole pantload of code somewhere? If I can get the resources to write in some additional functionality, I'll certainly post.
Comment #10
geme4472 commented
I am fool. Of course swish is sending back the right path. The module is killing it. I'll work on this a bit later and post.
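For reference, a sketch of pulling the path out of one result line in shell. It assumes a result line laid out as rank, path, quoted title, size; that layout, the sample line, and the `result_path` name are illustrative and not taken from the module.

```shell
#!/bin/sh
# result_path is a hypothetical helper. It assumes a result line of the
# form:  RANK PATH "TITLE" SIZE
result_path() {
  rest=${1#* }          # drop the rank and the space after it
  path=${rest%% \"*}    # keep everything before the first space-quote pair
  printf '%s\n' "$path"
}

# Sample line is made up; note the space inside the file name.
result_path '1000 /var/www/intranet/files/my manual.doc "my manual.doc" 24576'
```

Cutting at the first space followed by a double quote, rather than at the first space, keeps paths that contain spaces intact.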
Comment #11
geme4472 commented
Here's a couple of lines of code that seem to cure the filepath problem. This is not tested, so beware. Insert these lines in function swish_search() right before the line
$basename .= basename(substr($result, $k+1, $i-$k-1));
Comment #12
geme4472 commented
Aw shoot. One last thing, change the line:
to be:
Comment #13
Kenny_au commented
Thought you might have forgotten about this geme4472, welcome back :)
The fix you posted for the results path nearly works. Here is an example of what I'm getting with your fix:
http://domain.com/files/var/www/intranet//manuals/path/to/manual.doc
Get rid of the "/var/www/intranet/" bit and we have a winner. My files folder is at /var/www/intranet/files and docroot is /var/www/intranet/. I have an alias in apache that makes intranet/ the default site to load. I am no longer using private files, because they caused filenames with spaces to break and I seriously cannot be fuxed debugging that.
Also, can you share the code to get catdoc working?
And one more thing: how on earth can this release be at 1.3 when the previous versions' filefilters were still incorrect? Surely this would have had to have been reported earlier?
Comment #14
Kenny_au commented
Oh, I attached my swish module for reference :)
Comment #15
Kenny_au commented
So need a bloody edit function. How hard would it be to extract teasers from around the keywords and display them underneath each result, Google style?
Comment #16
geme4472 commented
Kenny_au, are you using public downloads?
As for catdoc, I'm using your code from #5!! Or are you looking for the changes I made to the module to get there?
Again, I'll probably be in and out of this thread, but eventually I'll fix this. If you have specific deadlines, let me know. I might get a chance this weekend or next to commit some of these hacks into real code. Reason being, my job is completely out of hand. I have that boss--you know, the one whom you show your new module that runs 543 lines of regex to add 2% relevance to search results, and he replies, "Isn't that logo a couple of pixels too far to the left?"
Comment #17
geme4472 commented
I couldn't let this one rest. Here's some new code to get the $basename variable. I don't have my stuff set to public downloads, so I really wasn't able to test this at all. I just parsed the $basename to not have the filepath nor the leading slash. If it fails, just post your search output link, like you did last time.
So, scratch the other lines of code and go with this (so these are the only lines that use $basename, aside from the final building of the link in the line that starts $link)
That should yield: 1) a filename with 2) any filepath prefix that may exist (not below public_html) and 3) no leading slash
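The PHP itself did not make it into this thread, so as an illustration only, here is the same transformation sketched in shell: remove the document-root prefix, then any leading slashes. The `strip_docroot` name is hypothetical, and the sample path and docroot are the ones from comment #13.

```shell
#!/bin/sh
# strip_docroot is a hypothetical helper: remove a document-root prefix
# and any leading slashes from an indexed file path.
# $1 = path as returned by the search, $2 = prefix to remove
strip_docroot() {
  p=${1#"$2"}                 # drop the prefix if the path starts with it
  while [ "${p#/}" != "$p" ]; do
    p=${p#/}                  # drop leading slashes one at a time
  done
  printf '%s\n' "$p"
}

# Prints: manuals/path/to/manual.doc
strip_docroot /var/www/intranet//manuals/path/to/manual.doc /var/www/intranet
```

Stripping slashes in a loop also covers the doubled slash seen in the reported links.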
As for your other questions, I think the google teaser idea is fantastic. If you can figure out the command to send swish-e to return some sort of teaser, I'll take care of the balance--though highlighting the search word might get hairy, or might be a bunch of code, we'll see. I'm not sure swish-e can do this, but it seems pretty robust, so it's probably just that we're not asking it to do so.
My take on your version 1.3 comment is this: At the time it came out, v1.3 was probably working swimmingly. It's just that either the engine or drupal evolved faster than the module did, thus breaking the conf files. Actually, there are even more moving pieces than I'm addressing, given that swish-e is external to drupal. Incidentally, this is totally speculation.
Let me know how your links turn out on the search page.
Comment #18
Kenny_au commented
I'll do some investigating to check the teaser thing out.
Ok back to the search results, no change :(
The output with private files is:
http://domain.com/?q=system/files/var/www/intranet//procedure_manuals54/...
Public files:
http://domain.com/files/var/www/intranet//procedure_manuals54/path/to/ma...
Uploaded my swish module too, I'll post more when I get back from lunch :)
Comment #19
Kenny_au commented
line 221: //TODO: parse swish result comments
Let's parse these comments, shall we :)
Comment #20
geme4472 commented
Interestingly enough, the reason my parsing isn't doing what it is supposed to do is because I hacked the mod elsewhere. I'll post as soon as I can write something that isn't completely broken.
Comment #21
geme4472 commented
So, it took me about three hours to un-hack all the work I'd done in the original troubleshooting of the module. Then, it took me about five minutes to change the string parsing. Phew (wiping sweat from brow).
I don't have uploads going to more than one folder, so all I can verify is that the outputted link does have the extra node(s) in the filepath when I add child folders under my upload folder (and index some files in that child folder). I did test w/public and private, and it seemed to work, but I sort of feel blind since I only have one upload folder. What module are you using for uploads?
Let me know how your results work. Module attached.
Comment #22
geme4472 commented
Ignore that last upload, as it had some debug code in it. This should work better.
Comment #23
Kenny_au commented
Nice work geme4472, works a treat :)
Comment #24
geme4472 commented
What! Really?!
As of late, I've really juiced this thing up in my install. Of most interest: it uses db_rewrite logic to dictate node access within the search (incredibly useful for my needs), but it also becomes dependent on the core upload module in doing so. Also, I snagged some great icons from the silk collection (open) and use the same theme function as the regular search, so as to keep a consistent look/feel. If any of the above interests you, contact me over email and I'll zip and send it your way.
Cheers!
I almost forgot to mention: the last module I uploaded needs the tweaks that you suggested for filefilters for pdftotext and unrtf.