Proposal: Significant redesign of search_attachments.module
| Project: | Search attachments |
| Version: | 5.x-3.0 |
| Component: | Documentation |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
This issue is a proposal to change the way search_attachments works. It is based on a file called 'nextgen.txt' distributed with version 5.x-3-dev.
Currently, search_attachments has two significant design limitations: 1) it indexes files based on whether the parent node has been altered (not whether the file itself has been altered), and 2) it cannot index files uploaded via FTP (i.e., not uploaded using one of the file management modules) or files not attached to a node.
To remedy these limitations, search_attachments could use a database table (call it search_attachments_files) that records the managing module and filepath of each managed file, and also the filepath of every file uploaded into the /files directory or any other directory that might store FTPed files. The site admin would be able to configure which directories fall into the last category; 'none' could be used as the value in the managing module column for FTPed files.
This table would also record the modification time (mtime) of each file, as returned by the PHP stat() function. Using hook_cron(), search_attachments would iterate through the new search_attachments_files table, get the current mtime of each file from stat(), and compare it with the mtime stored in the table. If stat() returned a more recent mtime, the file would be reindexed by calling search_index(). For files that are not attached to a node, the 'sid' parameter passed to search_index() could be the file's index in search_attachments_files; for files that are attached to nodes, the node's ID would be passed, as is currently the case. The display of files not attached to nodes in search results would probably be a subset of the current display.
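The cron pass described above could be sketched roughly as follows. This is a language-neutral illustration in Python, not Drupal code: the row layout (fid, nid, filepath, mtime), the function names, and the reindex callback are all assumptions made for the example, with reindex standing in for the call to search_index().

```python
import os

def find_stale_files(rows, get_mtime=os.path.getmtime):
    """Return (row, current_mtime) pairs for files changed since the stored mtime."""
    stale = []
    for row in rows:
        try:
            current = get_mtime(row["filepath"])
        except OSError:
            # The file has disappeared; cleanup is outside this sketch.
            continue
        if current > row["mtime"]:
            stale.append((row, current))
    return stale

def run_cron(rows, reindex, get_mtime=os.path.getmtime):
    """Reindex every stale file, store its new mtime, and return the count."""
    count = 0
    for row, current in find_stale_files(rows, get_mtime):
        # Attached files are indexed under their node ID (hypothetical "nid" column);
        # unattached files are indexed under their own row ID ("fid").
        sid = row["nid"] if row["nid"] is not None else row["fid"]
        reindex(sid, row["filepath"])
        row["mtime"] = current
        count += 1
    return count
```

Because only the file's own mtime is compared, editing the parent node without touching the file would not trigger a reindex, which is the first limitation the proposal sets out to remove.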
The rows in the search_attachments_files table would need to be populated by iterating through each driver and determining which files have been added since the last check. To find files that are not managed by a file manager, the /files directory and the other configured directories would need to be scanned for eligible files, and any found would be registered in the table. This synchronization would need to be performed each time cron.php was run, before the indexing took place.
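The scan for unmanaged files might look something like the sketch below. Again this is illustrative Python rather than Drupal code; the function name, the row layout, and the list of eligible extensions are assumptions, and 'none' is the managing-module value suggested earlier in the proposal.

```python
import os

def sync_unmanaged_files(rows, directories, extensions=(".pdf", ".doc", ".txt")):
    """Register eligible files found in `directories` that are not yet in `rows`."""
    known = {row["filepath"] for row in rows}
    added = []
    for directory in directories:
        for root, _dirs, names in os.walk(directory):
            for name in sorted(names):
                path = os.path.join(root, name)
                if path in known or not name.lower().endswith(extensions):
                    continue
                # mtime 0 guarantees the file looks modified on the next indexing pass.
                rows.append({"filepath": path, "module": "none", "mtime": 0})
                known.add(path)
                added.append(path)
    return added
```

Files already registered by a driver keep their existing row, so running the scan on every cron run only adds what is new.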
Any feedback on this proposal is welcome.

#1
I'm surely interested in this, but will need to wait for my intranet site to be updated to drupal-5...
#2
This sounds great. How soon are you planning on starting? Could you use any help/sponsorship?
#3
I'd like to move the current version (5.x-3-dev) from -dev to stable, which I'm happy to do any time unless anyone reports problems with it. I introduced a lot of new UI/configuration features in 5.x-3 so I want to make sure it is stable. As soon as 5.x-3 is stable I could start working on implementing the proposal above, hoping to have it done by the beginning of January, and also a 6.x version at that time as well. If you'd like to see it sooner than that, I could use some help since I don't think I can do it myself before then. Perhaps sponsoring someone to do it more quickly would be a good idea. Thanks for your offer and interest.
#4
January sounds great—I'm not in that much of a rush. I'm using search_attachments on a client site we're launching in December and I have a feeling they're eventually going to ask "Can I search files uploaded outside Drupal?" This way I'll have a tentative timeframe to suggest to them. Thanks a lot for your work on this.
#5
I'd like to get some indication of how people who are wanting to index files not attached to any node restrict access to those files. Are they accessible to all users, including anonymous users? Do you use any access/authentication modules to restrict access to them? Presumably if these files are uploaded without using Drupal modules, Drupal can't (as far as I can see) control access to them.
Thanks for any info you can provide.
#6
In my case, everything I would want to index is publicly accessible. I'm envisioning a /documents/public directory with PDFs and such in it, things that are linked to from Drupal nodes but that were uploaded via FTP. I might also have a /documents/private directory, which I wouldn't want to index.
#7
Update, 2008-01-03: January is here and I haven't made much progress on the rewrite. However, the recently released version 5.x-3 has fixed a significant issue that prevented the indexing and display of characters with diacritics. Now that this version is out I will be moving on to the planned rewrite.
#8
Another update on 5.x-4 -- this rewrite is moving along nicely and I have completed the following:
-General code cleanup and addition of inline documentation.
-Replaced search_attachments_nodeapi() with search_attachments_register_files() function as per http://drupal.org/node/188895.
-Added search_attachments_update_index().
-Replaced 'file_xxx' where xxx is driver name with 'file' in search_dataset and search_index tables.
-Changed driver filenames from .inc to _driver.inc.
-Updated database field definitions to be consistent with system.module.
-Created search_attachments_files db table.
-Renamed search_attachments table to search_attachments_helpers.
-Changed admin tabs to be more explicit; added 'Files' tab (done but form not implemented).
-Added get_XXX_register_files() function to upload and webfm drivers (attachment_driver.inc not done yet).
-Added 'clone' action in helper list (but needs PostgreSQL fix).
-Written a proper .install file that includes updates for db table changes.
The approach described in the proposal is working as intended. I would like to release a -dev version of 5.x-4 as soon as I complete the admin tools for identifying directories where FTPed files are kept and a generic driver for these files, probably in the next week or two.
#9
Let me know if I can help test. The search FTP'd files functionality is important to my client's site. If I ever get it working, I'll commit to making a brief screencast about it.
(I find jingproject.com to be terrific, and easy for screencasting, if you want to do one, too.)
#10
Yes, that would be great. I will be taking a week off (vacation) on Feb. 10 so would like to get 5.x-4-dev out before then for testing. I am just fixing a few bugs that have come up at the last minute but might defer them (none are critical) so I can get the module out before I go.
A screencast would be awesome, thanks for the offer and pointer to jingproject.
#11
The new version of the module, which allows searching of FTPed files, is now available for testing at http://interoperating.info/mark/node/74 . Please give it a try.