Problem/Motivation

Search API Attachments does an awesome job of extracting text from attachments. In Drupal 8 it's pretty common to nest entities, though (e.g. a node containing a media document). Right now it's possible to index the entity that directly holds the file (the media document in the example), but there's no way of connecting it to the parent node.

Proposed resolution

Provide a way of indexing the documents attached to nested content within the context of the parent entity by adding an ExtractedText field formatter for file fields. With that in hand, one can add a Search index view mode to every entity type along the reference path and select those view modes for indexing.

Example:

  • Add a Search index view mode to the media bundle.
  • Set Text extracted from attachment as the formatter for the file field in the media bundle's display settings.
  • Add a Search index view mode to the node type.
  • Set Rendered entity as the formatter for the media reference in the node type's display settings and, in the formatter settings, choose to render it as Search index.
  • Set Search index as the view mode used for indexing in the Search API index (Search API -> Index -> Fields -> Rendered HTML output (edit) -> View mode for content X = Search index).

Result:
After reindexing, the node can be found by searching for text contained in the attachment associated with the media entity.
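
To make the flow in the example easier to follow, here is a minimal conceptual sketch in Python. It is not Drupal or Search API code: the names (Entity, File, DISPLAYS, render, extract_text, the bundle and field names) are all hypothetical and only model the idea that rendering the parent in a Search index view mode recursively pulls in the text extracted from the nested file.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    data: bytes


@dataclass
class Entity:
    bundle: str
    fields: dict = field(default_factory=dict)


def extract_text(file: File) -> str:
    """Hypothetical extractor; a real site would delegate to Tika, pdftotext, etc."""
    return file.data.decode("utf-8")


# Hypothetical display settings: (bundle, view mode) -> {field name: formatter}.
DISPLAYS = {
    # Media bundle: the file field is shown as its extracted text.
    ("document", "search_index"): {"field_file": "extracted_text"},
    # Node type: the media reference is rendered in the same view mode.
    ("article", "search_index"): {"field_media": "rendered_entity:search_index"},
}


def render(entity: Entity, view_mode: str) -> str:
    """Render an entity to plain text according to DISPLAYS."""
    parts = []
    for name, formatter in DISPLAYS.get((entity.bundle, view_mode), {}).items():
        value = entity.fields[name]
        if formatter == "extracted_text":
            parts.append(extract_text(value))
        elif formatter.startswith("rendered_entity:"):
            # Recurse into the referenced entity using the configured view mode.
            nested_mode = formatter.split(":", 1)[1]
            parts.append(render(value, nested_mode))
    return " ".join(parts)


media = Entity("document", {"field_file": File(b"quarterly budget report")})
node = Entity("article", {"field_media": media})

# The text that would be handed to the index for the parent node.
indexed_text = render(node, "search_index")
print(indexed_text)
```

In this model the node has no display configured for any other view mode, so rendering it with one yields no attachment text. That mirrors why each entity type along the reference path needs the Search index view mode set up.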

Remaining tasks

Maybe abstract text extraction from Search API integration at some point in time?

User interface changes

New field formatter available for file fields.

API changes

None.


Comments

blazey created an issue. See original summary.

blazey’s picture

Assigned: blazey » Unassigned
Status: Active » Needs review
FileSize
6.87 KB

Attaching the patch that fixed the problem for me.

blazey’s picture

Issue summary: View changes
ekes’s picture

Interesting different approach, I'll have a look. I guess this means that you can only apply the same filters (stemming, case, etc.) to all attachments? It also still has only one extraction configuration. Is there no chance, if not using tika/tika-solr, of an exif extractor for images and a pdf extractor per field?

I was thinking about the related issue #2832407: Refactor configuration for search_api. My thought there was to use the same pattern as the aggregate field or rendered item, with a property that appears as an additional 'field/s' for extracted data.

blazey’s picture

Hey ekes. Processors are set at the index level, so I think this is out of scope here. Setting the extraction method per field is possible, but it would probably be best to decouple the extraction process from the Search API integration first. Do you have a real use case for that?

I will have some time for community contribution during the global sprint weekend next week in Wrocław (https://groups.drupal.org/node/515817). I could work on this, but since it's a big change, it would be nice to get a sign-off from the maintainer first.

ekes’s picture

All I'm proposing in #2832407: Refactor configuration for search_api is moving all the configuration to a Property. The processor stays much the same (although hidden, and with less need for some of the magic constants). It makes your use case of selecting any field possible, along with others. I've pinged one of the maintainers and started the issue, but nothing yet. I'll be around on IRC for the global sprint weekend, but I'm also helping organise the sprint itself, and the space, in Amsterdam.

trzcinski.t@gmail.com’s picture

Status: Needs review » Reviewed & tested by the community

I have tested it, and it works as described and expected.

nicholas.alipaz’s picture

+1, this is working great for me on a document that is attached to a nested paragraph.

  • izus committed a4e4c07 on 8.x-1.x authored by blazey
    Issue #2844979 by blazey, ekes, tom_ek, nicholas.alipaz, izus: Index...
izus’s picture

Status: Reviewed & tested by the community » Fixed

Thanks all, and sorry for the long delay.
This will be part of the next beta, which will be available today.

rferguson’s picture

I have the latest release (8.x-1.0-alpha5) and am running the Paragraphs module. I have created a paragraph with a nested file field, but I am not seeing the usual Search API Attachments option under Add fields -> General. I also have the File attachments processor turned on.

Is there something different I might be missing?

EDIT:

I see it now if I add separate data sources. But I was wondering if there's a way to get the content of the file from within the paragraph. Previously, in Drupal 7 with Search API Attachments and Field Collections, there was a sub-module that could do this.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

acbramley’s picture

@rferguson you need to follow the steps in the issue summary (these should probably be documented, as this doesn't seem to be the normal way of adding fields to the search index).