This is a bug report used to push a small piece of code needed to fix remote extraction of attachments with Tika.
This could have gone into the #1289222 patch, but it fits better here, as it requires a modification to the search_api_solr classes to allow a custom POST request with custom headers (multipart content, etc.).

I'll attach the patch as soon as I get a bug id

Comments

regilero’s picture

So here's the patch.

Sutharsan’s picture

Status: Active » Needs review
FileSize
3.03 KB

Code style fixes and function documentation (only). The function description could use some extra explanation.

gp.mazzola’s picture

Hi, the update of the Search API Solr search module to rc4, which removed the dependency on the SolrPhpClient library, breaks this patch. I was using it together with a patch to the Search API Attachments module to allow remote extraction of files on the machines where Solr/Tomcat live.

Is it possible to update this patch to work with the rewritten module? Otherwise, could this patch be committed to the module, to enable the remote extraction feature?

lotyrin’s picture

Status: Needs review » Needs work

drunken monkey’s picture

Might it be that SearchApiSolrConnection::makeServletRequest() already does what you want?
Otherwise, maybe we could add a method for that to the connection class, yes. (Or, rather, make the method for that public.)

gp.mazzola’s picture

Hello drunken monkey, I am only reading your answer now.
The patch to search_api_attachments adds this code to allow remote file extraction on the Solr machine:

  /**
   * Extraction done via a remote Solr with Tika extraction on the 'extract/tika' servlet.
   */
  protected function extract_solr($file, $filepath) {
    try {
      $filename = basename($filepath);
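      $conditions = array('class' => 'search_api_solr_service', 'enabled' => TRUE);
      // Use the connection of the first enabled Solr server, if there is one.
      $solr = NULL;
      foreach (search_api_server_load_multiple(FALSE, $conditions) as $server) {
        $solr = $server->getSolrConnection();
        break;
      }
      if (!$solr) {
        // Without an enabled Solr server there is nothing to send the file to.
        return FALSE;
      }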
      $params = array(
        'resource.name' => $filename,
        'extractFormat' => 'text', // Matches the -t command for the tika CLI app.
        'wt' => 'json',
      );

      // Construct a multi-part form-data POST body in $data.
      $boundary = '--' . md5(uniqid(REQUEST_TIME));
      $data = "--{$boundary}\r\n";
      // The 'filename' used here becomes the property name in the response.
      $data .= 'Content-Disposition: form-data; name="file"; filename="extracted"';
      $data .= "\r\nContent-Type: application/octet-stream\r\n\r\n";
      $data .= file_get_contents($filepath);
      $data .= "\r\n--{$boundary}--\r\n";
      $headers = array('Content-Type' => 'multipart/form-data; boundary=' . $boundary);
      // PHP's built in http_build_query() doesn't give us the format Solr wants.
      $query_string = $this->httpBuildQuery($params);
      if ($query_string) {
        $query_string = '?' . $query_string;
      }
      $result = $solr->sendHttpRequest('extract/tika'. $query_string, 'POST', $headers, $data, FALSE);
      $response = json_decode($result->getRawResponse());
      if (isset($response->extracted)) {
        return $response->extracted;
      }
    }
    catch (Exception $e) {
      // Exceptions from Solr may be transient, or indicate a problem with a specific file.
      watchdog('Search API Solr Attachments', "Exception occurred sending %filepath to Solr\n!message", array('%filepath' => $file['uri'], '!message' => nl2br(check_plain($e->getMessage()))), WATCHDOG_ERROR);
      return FALSE;
    }
    return FALSE;
  }
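
As an aside, the multipart construction in the function above can be isolated into a small standalone helper, which makes it easier to check the body format without a Solr server. This is only a sketch: example_build_multipart() is a hypothetical name and is not part of either module. It takes the boundary as a plain token and adds the leading "--" only where the delimiter lines need it.

```php
<?php
/**
 * Builds the headers and body for a single-file multipart/form-data POST.
 *
 * Hypothetical helper for illustration; not part of search_api_solr or
 * search_api_attachments.
 */
function example_build_multipart($contents, $boundary) {
  // Opening delimiter line: "--" followed by the boundary token.
  $body = "--{$boundary}\r\n";
  // The 'filename' used here becomes the property name in the response.
  $body .= 'Content-Disposition: form-data; name="file"; filename="extracted"' . "\r\n";
  $body .= "Content-Type: application/octet-stream\r\n\r\n";
  $body .= $contents;
  // Closing delimiter line: the boundary token wrapped in "--" on both sides.
  $body .= "\r\n--{$boundary}--\r\n";
  $headers = array('Content-Type' => 'multipart/form-data; boundary=' . $boundary);
  return array($headers, $body);
}

// Usage: a random boundary that is unlikely to occur in the file contents.
list($headers, $body) = example_build_multipart('hello', md5(uniqid('', TRUE)));
```

The returned $headers and $body correspond to the $headers and $data variables in the function above.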

This function calls $solr->sendHttpRequest('extract/tika'. $query_string, 'POST', $headers, $data, FALSE);, which was introduced by another patch on the search_api_solr module (solr_connection.inc file):


  /**
   * Allow extended queries.
   */
  public function sendHttpRequest($uri = '', $method = 'GET', $headers = array(), $content = '', $timeout = FALSE) {
    // Default to XML unless the caller has set its own content type.
    if (!array_key_exists('Content-Type', $headers)) {
      $headers['Content-Type'] = 'text/xml; charset=UTF-8';
    }
    $url = 'http://' . $this->_host . ':' . $this->_port . $this->_path . $uri;
    list($data, $headers) = $this->_makeHttpRequest($url, $method, $headers, $content, $timeout);
    if ($this->newClient) {
      $status = 0;
      $contentType = false;
      //iterate through headers for real status, type, and encoding
      if (is_array($headers) && count($headers) > 0) {
        while (isset($headers[0]) && substr($headers[0], 0, 4) == 'HTTP') {
          // we can do a intval on status line without the "HTTP/1.X " to get the code
          $status = intval(substr($headers[0], 9));
          // remove this from the headers so we can check for more
          array_shift($headers);
        }

        //Look for the Content-Type response header and determine type
        //and encoding from it (if possible - such as 'Content-Type: text/plain; charset=UTF-8')
        foreach ($headers as $header) {
          // look for the header that starts appropriately
          if (strncasecmp($header, 'Content-Type:', 13) == 0) {
            $contentType = substr($header, 13);
            break;
          }
        }
      }
      $httpResponse = new Apache_Solr_HttpTransport_Response($status, $contentType, $data);
      $response = new Apache_Solr_Response($httpResponse, $this->_createDocuments, $this->_collapseSingleValueArrays);
    } else {
      $response = new Apache_Solr_Response($data, $headers, $this->_createDocuments, $this->_collapseSingleValueArrays);
    }
    $code = (int) $response->getHttpStatus();
    if ($code != 200) {
      $message = $response->getHttpStatusMessage();
      if ($code >= 400 && $code != 403 && $code != 404) {
        // Add details, like Solr's exception message.
        $message .= $response->getRawResponse();
      }
      throw new Exception('"' . $code . '" Status: ' . $message. '::'.$url);
    }
    return $response;
  }

Do you think that the method you suggest could do the job? Or would another method need to be added to the connection class?

drunken monkey’s picture

Do you think that the method you suggest could do the job?

Yes, I'm fairly certain. Just try it and report back, please.

torpy’s picture

Thanks for the pointer! Posted a working patch making use of makeServletRequest in #1289222: Allow remote document processing (comment #15).
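
For anyone landing here first, a rough sketch of what such a call might look like. This is not the patch from #1289222. It assumes that makeServletRequest() accepts a servlet path, an array of query parameters and a drupal_http_request()-style options array ('method', 'headers', 'data'), and that the returned object carries the raw response body in ->data. The function name example_extract_via_servlet() is hypothetical, and $headers and $body are expected to hold an already built multipart request.

```php
<?php
/**
 * Sends a prepared multipart body to the 'extract/tika' servlet.
 *
 * Hypothetical helper for illustration. $solr is assumed to be a
 * SearchApiSolrConnection-like object with a public makeServletRequest().
 */
function example_extract_via_servlet($solr, $filename, array $headers, $body) {
  $params = array(
    'resource.name' => $filename,
    // Matches the -t command for the tika CLI app.
    'extractFormat' => 'text',
    'wt' => 'json',
  );
  // Assumption: these keys mirror drupal_http_request() options.
  $options = array(
    'method' => 'POST',
    'headers' => $headers,
    'data' => $body,
  );
  $response = $solr->makeServletRequest('extract/tika', $params, $options);
  // Assumption: the raw JSON body is available in ->data.
  $decoded = json_decode($response->data);
  return isset($decoded->extracted) ? $decoded->extracted : FALSE;
}
```

If the assumptions hold, this takes over the role of the sendHttpRequest() call above without any change to the connection class.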

OanaIlea’s picture

Status: Needs work » Closed (outdated)

This issue was closed due to lack of activity over a long period of time. If the issue is still acute for you, feel free to reopen it and describe the current state.