I'm trying to use EntityFieldQuery::fieldOrderBy to sort results, but it filters out results where the field is empty. I think this is not the proper behaviour (sort criteria shouldn't modify the results set).

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

tim.plunkett’s picture

Version: 7.14 » 8.x-dev
Issue tags: +Needs backport to D7

I don't think this is major, but leaving for now.

This may not even be a bug, per se. What is the behavior you'd expect?

jordisan’s picture

I would expect empty values to be at the top or at the bottom (that even could be a parameter). But I don't think that sorting results by any criteria should filter out any result.

Do you know any possible patch or workaround? I'm trying to use Drupal APIs and avoid making my own SQL sentences...

pivanov’s picture

Agreed.
With the current behavior, the method name is misleading as well.
A default value argument would be nice to have in the future.

jordisan’s picture

So... what about a solution in D7?

bxtaylor’s picture

Here are the steps I used to reproduce this bug:

  1. Create a new content type (Page)
  2. Add a field to that content type (field_field_test)
  3. Create three Page nodes - two with the field populated, one with the field left blank
  4. Create a custom module that implements EntityFieldQuery without the fieldOrderBy method (see module code)
  5. Inspect results (all three Page nodes are returned)
  6. Add the fieldOrderBy method to the module code
  7. Inspect results (only the Page nodes with the field populated are returned)

I agree with the OP that the returned node should not be filtered out just because the field is not populated.

I think the breakdown is actually in the execute() method when this line is called:

// Execute the query using the correct callback.
$result = call_user_func($this->queryCallback(), $this);

The function being called here is hook_field_storage_query, which if I understand correctly, goes through the field table (in this case field_data_field_field_test) and returns results based on what it finds there - which would explain why using the fieldOrderBy method only gets you two results (in my example code) and not three.

Module code
(with fieldOrderBy method commented out)


function orderbytest_menu(){
  $items = array();
  $items['orderby'] = array(
    'title' => 'EntityFieldQuery::fieldOrderBy Test Page',
    'page callback' => '_orderbytest',
    'access callback' => TRUE,
    'type' => MENU_NORMAL_ITEM
  );
  return $items;
}

function _orderbytest(){
  $query = new /\Drupal/\entity/\EntityFieldQuery;
  $query->entityCondition('entity_type', 'node')
        ->entityCondition('bundle', 'page');
  //$query->fieldOrderBy('field_field_test', 'value');
  $result = $query->execute();
  $nodes = array();
  if(isset($result['node'])){
    $item_nids = array_keys($result['node']);
    foreach($item_nids as $nid){
      $nodes[] = node_load($nid);
    }
  }
  $output = '';
  foreach($nodes as $node){
    $output .= '<h3>' . $node->title . '</h3>';
  }
  return $output;
}
bxtaylor’s picture

Component: field system » entity system

Changing issue component is appropriate, I think.

bxtaylor’s picture

Read through the entity related issues and these seem related:

#1226622: EntityFieldQuery doesn't support query for entities without a value for a specific field
#1240566: In field_sql_storage_field_storage_query, any condition on a field's column excludes entities with a NULL value for this column

Essentially, those issues state that because the field_sql_storage_field_storage_query() hook uses an inner join to return results, any NULL values won't be returned.

Potential workaround here: http://drupal.org/node/1157006#comment-5465034

I'll see if that works at all and post any results here.

tim.plunkett’s picture

Assigned: Unassigned » tim.plunkett
Issue tags: +Needs tests
tim.plunkett’s picture

Assigned: tim.plunkett » Unassigned
Status: Active » Needs review
Issue tags: -Needs tests
FileSize
1.15 KB

Here's a patch proving that this is broken.

Status: Needs review » Needs work

The last submitted patch, drupal-1611438-9.patch, failed testing.

Berdir’s picture

The test shows why it can't work. Looking at the SQL queries that it generates:

SELECT DISTINCT field_data_enwazqqz_field_name0.entity_type AS entity_type, field_data_enwazqqz_field_name0.entity_id AS entity_id, field_data_enwazqqz_field_name0.revision_id AS revision_id, field_data_enwazqqz_field_name0.bundle AS bundle
FROM 
{field_data_enwazqqz_field_name} field_data_enwazqqz_field_name0
WHERE  (field_data_enwazqqz_field_name0.deleted = :db_condition_placeholder_0) AND (field_data_enwazqqz_field_name0.entity_type = :db_condition_placeholder_1) 
ORDER BY field_data_enwazqqz_field_name0.enwazqqz_field_name_shape ASC

There is no JOIN involved. Instead, it directly queries the field data table, and if there is no value in there, it obviously does not return anything.

Even when adding a property condition to force an involvement of the entity base table means that we start on the field table and then join the entity table.

'SELECT DISTINCT field_data_cwlvlhtb_field_name0.entity_type AS entity_type, field_data_cwlvlhtb_field_name0.entity_id AS entity_id, field_data_cwlvlhtb_field_name0.revision_id AS revision_id, field_data_cwlvlhtb_field_name0.bundle AS bundle
FROM 
{field_data_cwlvlhtb_field_name} field_data_cwlvlhtb_field_name0
INNER JOIN {test_entity_bundle_key} test_entity_bundle_key ON test_entity_bundle_key.ftid = field_data_cwlvlhtb_field_name0.entity_id
WHERE  (field_data_cwlvlhtb_field_name0.deleted = :db_condition_placeholder_0) AND (test_entity_bundle_key.fttype = :db_condition_placeholder_1) AND (field_data_cwlvlhtb_field_name0.entity_type = :db_condition_placeholder_2) 
ORDER BY field_data_cwlvlhtb_field_name0.cwlvlhtb_field_name_shape ASC'

So, the only way to fix this would be to always start with the entity base table and then join the field table with a left join which would have quite some performance regressions.

tim.plunkett’s picture

I discussed this with chx in IRC:

if we left join i think it might (or might not) also affect the number of results
that needs research
and also benchmark

There is a chance that fixing the bug is so horrible for performance that we just leave it, but we won't know until we try.

chx’s picture

Trying does not involve coding it: just time some plausible left join queries vs inner join ones. Check what efq generates (berdir already posted samples) and change inner to left. And check the number of results too.

Berdir’s picture

Ok, made a few tests on a real, large database. 540k nodes, tested two fields, one with 40k rows and another with 500

mysql> select count(*) from node;
+----------+
| count(*) |
+----------+
|   544430 |
+----------+

mysql> select count(*) from field_data_field_A;
+----------+
| count(*) |
+----------+
|    40277 |
+----------+

mysql> select count(*) from field_data_field_B;
+----------+
| count(*) |
+----------+
|   524848 |
+----------+
1 row in set (0.23 sec)

First the current query, directly on the field table, no joins involved.

mysql> SELECT sql_no_cache DISTINCT entity_type, entity_id, revision_id, bundle FROM field_data_field_A WHERE deleted = 0 AND entity_type = 'node' ORDER BY field_A_tid DESC LIMIT 10;
+-------------+-----------+-------------+------------+
| entity_type | entity_id | revision_id | bundle     |
+-------------+-----------+-------------+------------+
| node        |   1288721 |     1365755 | bt_article |
| node        |   1227269 |     1304310 | bt_article |
| node        |   1227327 |     1304368 | bt_article |
| node        |   1227283 |     1304324 | bt_article |
| node        |   1227282 |     1304323 | bt_article |
| node        |   1227261 |     1304302 | bt_article |
| node        |   1227318 |     1304359 | bt_article |
| node        |   1227253 |     1304294 | bt_article |
| node        |   1227340 |     1304381 | bt_article |
| node        |   1227313 |     1304354 | bt_article |
+-------------+-----------+-------------+------------+
10 rows in set (0.02 sec)

mysql> SELECT  sql_no_cache DISTINCT entity_type, entity_id, revision_id, bundle FROM field_data_field_B WHERE deleted = 0 AND entity_type = 'node' ORDER BY field_B_value DESC LIMIT 10;
+-------------+-----------+-------------+---------+
| entity_type | entity_id | revision_id | bundle  |
+-------------+-----------+-------------+---------+
| node        |     14380 |       91424 | dossier |
| node        |     14392 |       91436 | dossier |
| node        |     14423 |       91467 | dossier |
| node        |     14424 |       91468 | dossier |
| node        |     14436 |       91480 | dossier |
| node        |     14461 |       91505 | dossier |
| node        |     14479 |       91523 | dossier |
| node        |     14493 |       91537 | dossier |
| node        |     14552 |       91596 | dossier |
| node        |     14555 |       91599 | dossier |
+-------------+-----------+-------------+---------+
10 rows in set (7.79 sec)

(Hint: We have a custom index on the value column of field_B_value which gets the second query down to 0.00)

Now, a query that starts on the node table and uses a left join

mysql> SELECT sql_no_cache DISTINCT entity_type, entity_id, revision_id, bundle FROM node n LEFT JOIN field_data_field_A f ON n.nid = f.entity_id WHERE deleted = 0 AND entity_type = 'node' ORDER BY field_A_tid DESC LIMIT 10;
+-------------+-----------+-------------+------------+
| entity_type | entity_id | revision_id | bundle     |
+-------------+-----------+-------------+------------+
| node        |   1288721 |     1365755 | bt_article |
| node        |   1227269 |     1304310 | bt_article |
| node        |   1227327 |     1304368 | bt_article |
| node        |   1227282 |     1304323 | bt_article |
| node        |   1227283 |     1304324 | bt_article |
| node        |   1227261 |     1304302 | bt_article |
| node        |   1227318 |     1304359 | bt_article |
| node        |   1227253 |     1304294 | bt_article |
| node        |   1227340 |     1304381 | bt_article |
| node        |   1227280 |     1304321 | bt_article |
+-------------+-----------+-------------+------------+
10 rows in set (1.20 sec) (same for inner join)

mysql> SELECT sql_no_cache DISTINCT entity_type, entity_id, revision_id, bundle FROM node n LEFT JOIN field_data_field_B f ON n.nid = f.entity_id WHERE deleted = 0 AND entity_type = 'node' ORDER BY field_B_value DESC LIMIT 10;
+-------------+-----------+-------------+------------+
| entity_type | entity_id | revision_id | bundle     |
+-------------+-----------+-------------+------------+
| node        |    625021 |      702064 | bt_article |
| node        |    627352 |      704395 | bt_article |
| node        |    627596 |      704639 | bt_article |
| node        |    627737 |      704780 | bt_article |
| node        |    631257 |      708300 | bt_article |
| node        |    635430 |      712473 | bt_article |
| node        |    637061 |      714104 | bt_article |
| node        |    637153 |      714196 | bt_article |
| node        |    637657 |      714700 | bt_article |
| node        |    637817 |      714860 | bt_article |
+-------------+-----------+-------------+------------+
10 rows in set (1 min 12.76 sec) (21s for inner join)

(Hint: the mentioned index does not improve this query, although you might be able to create one that does)

Slow enough?

You're of course invited to do your own tests and do not need to believe me :)

Berdir’s picture

Given the above performance regressions that this change would imply to document this as by design and keep the issue open as a non-major feature request in case someone comes up with an idea to solve this in a nice way.

tim.plunkett’s picture

Title: fieldOrderBy filters out results with empty field values » Document that fieldOrderBy filters out results with empty field values
Category: bug » task
Priority: Major » Normal

Agreed.

bxtaylor’s picture

Title: Document that fieldOrderBy filters out results with empty field values » fieldOrderBy filters out results with empty field values
Category: task » bug
Status: Needs work » Closed (works as designed)

Thanks for those tests @Berdir!

While the fieldOrderby() method does not have the behavior I'd expect, I agree that this is a "works as designed" situation.

I think, however, it'd be better to close this as "works as designed" and open a new feature request issue that references this issue.

That way, anybody that lands on this issue can see that we put it to the test and closed it appropriately.

The behavior does need to be documented as @tim.plunkett suggests, but that should be a separate issue, too.

Here are the follow up issues:
#1662942: Make EntityFieldQuery::fieldOrderBy include NULL field values
#1662950: Document that EntityFieldQuery::fieldOrderBy will exclude Entities with empty field values.

matt2000’s picture

Status: Closed (works as designed) » Needs work

I couldn't reproduce anything like Berdir's results on a database of 134k nodes. I found LEFT JOINs to be no worse than a couple hundredths of a second slower than INNER JOINs, and even slightly faster on one table.

I'm currently generating another 400k nodes for further testing, but I'd like to raise the question of whether performance consideration for large sites should justify disallowing developers of small sites from even the option to use LEFT JOIN. (I am not suggesting it should be the default. My proposed solution proof-of-concept is http://drupal.org/node/1057110#comment-6179538 . And yes, I know about hook_query_tag_alter, and I feel that it's terrible DX for this requirement.)

Berdir’s picture

The real performance difference isn't between LEFT or INNER JOIN.

It's between a query directly on the field table (which is what EFQ does by default) and one that starts of on the node table and joins the field table. And as you can see in my results, I had equal performance on a set of 40k nodes between left and inner join, it only really made a difference with a much larger result set, possibly because it had to use a temporary table on the disk or something.

bxtaylor created a feature for allowing to do a left join if you really want to, what we agreed on here is that it's not a *bug* (which would have to be fixed by default). Because that would mean that you'd always need to start of the base table and then left join the field table, no matter what condition/orderBy's you have. Not about LEFT vs. INNER JOIN, that doesn't affect this.

tim.plunkett’s picture

Status: Needs work » Closed (works as designed)
forssto’s picture

I just have to say I'm incredibly bummed out by this issue and the way people seem to regard it.

To say that a query interface that may change the contents of the result set depending on its ordering is working as designed is pretty much reckless.

Yes, there may be performance benefits in taking the broken path, but what use are the performance benefits if they result in functionality that's counterintuitive.

Anyways, I just patched this in our class that wraps EFQ. The solution can be summarized this way:

/* Wherever you're using the entityFieldQuery */

$efq = /* whatever you need in your entityFieldQuery */;

/* Some example order parameters */
$orderdata = array(
  array(
    "type" => "field",
    "field" => "field_example",
    "column" => "value",
    "direction" => "asc"
  ),
  array(
    "type" => "property",
    "property" => "title",
    "direction" => "asc"
  ),
);
$efq->addTag("customorder");
$efq->addMetaData("customorderdata", $orderdata);
/* In a custom module of your choice */

function yourmodulenamehere_query_customorder_alter(QueryAlterableInterface $query) {
  $order_data = (array) $query->getMetaData('customorderdata');
  if(!$order_data) return;

  foreach($order_data as $order){
    switch($order["type"]){
      case "property":
        yourmodulenamehere_query_add_order_by_property($order, $query);
        break;

      case "field":
        yourmodulenamehere_query_add_order_by_field($order, $query);
        break;
    }
  }
}

function yourmodulenamehere_util_query_add_order_by_property($propertyorder_data, $query){
  $property = $propertyorder_data["property"];
  $direction = $propertyorder_data["direction"];
  $basetable = yourmodulenamehere_get_basetable_from_query($query);
  $query->orderBy("$basetable.$property", $direction);
}


function yourmodulenamehere_query_add_order_by_field($fieldorder_data, $query){
  $field = $fieldorder_data["field"];
  $column = $fieldorder_data["column"];
  $direction = $fieldorder_data["direction"];

  $basetable = yourmodulenamehere_get_basetable_from_query($query);
  $entity_type = yourmodulenamehere_get_entity_type_from_query($query);
  $entity_info = entity_get_info($entity_type);
  $entity_id_column = $entity_info['entity keys']['id'];

  $join_column = $entity_info['entity keys']['id'];
  $join_table = "field_data_$field";

  $query->leftJoin($join_table, "fieldorder_$field", "fieldorder_$field.entity_id = $basetable.$entity_id_column AND fieldorder_$field.entity_type = '$entity_type'");
  $query->orderBy("fieldorder_$field.{$field}_$column", $direction);
}

function yourmodulenamehere_get_basetable_from_query(QueryAlterableInterface $query) {
  $basetable = "";
  $entity_type = yourmodulenamehere_get_entity_type_from_query($query);
  if($entity_type){
    $entityinfo = entity_get_info($entity_type);
    if($entityinfo){
      $basetable = $entityinfo["base table"];
    }
  }
  if(!$basetable){
    array_shift(array_keys($query->getTables()));
  }
  return $basetable;
}

function yourmodulenamehere_get_entity_type_from_query(QueryAlterableInterface $query) {
  $entity_type = "";
  if(!empty($query->alterMetaData["entity_field_query"]->entityConditions["entity_type"]["value"])){
    $entity_type = $query->alterMetaData["entity_field_query"]->entityConditions["entity_type"]["value"];
  }
  return $entity_type;
}

Apologies if there's any bugs of missing functions. I tore this off a bit more comprehensive suite without any in-depth testing or sanity checks. I'm sure the code could be cleaner, more concise or more performant, but at least it works... Which is more than I can say about EFQ's sorting interface. Send me a message at tommi@kollabora.com if you want to discuss further.

Rob230’s picture

I just have to say I'm incredibly bummed out by this issue and the way people seem to regard it.

To say that a query interface that may change the contents of the result set depending on its ordering is working as designed is pretty much reckless.

I agree it is a shame. Regardless of performance implications, I fail to see how it can be considered correct that ordering a query changes the result set. Ordering should change the order; it should not remove rows.

I found this bug when using tableSort(). If you click on some of the column headings then the results get ordered, but if you click on one of the headings, you end up with half of the rows removed. It doesn't make any sense to the user that trying to order the results removes some of them. You would expect that blank entries appear first or last.

I understand that it is difficult because of performance reasons, but it's a shame that it was just given up on when (in my opinion) it is not correct behaviour.

tim.plunkett’s picture