In CCK as in many other areas of contrib, the introduction of translation sets into Drupal core presents new challenges.

Prior to Drupal 6, a node ID was unambiguously the primary identifier of content.

With translation, however, there are two potential ways a piece of content might be identified. The first is as a "translation set"--the group of nodes that are translations of one another. A translation set is identified by {node}.tnid--the nid of the original or "source" translation. When a node has not yet been translated, it has a tnid of 0. In these cases, the nid is the primary identifier.

The second is the familiar nid. For a translation set, the various members each have different nids. The nid in this context uniquely identifies a *translation* rather than a piece of content per se.

How does this relate to CCK?

In CCK, we attach field data to nodes. Currently, these data can be attached only by nid. This means that each translation has its own set of values for any given field.

There are many cases in which attaching by nid indeed makes sense. In particular, any data that vary by language (e.g., that include strings that need to be translated) should attach to the translation, not the translation set.

But other data don't need to change per translation. They are properties or attributes of the content per se--not of a particular translation. Examples might include a date field for a historical event. The display of a date might change by language, but the data stored will remain the same.

Currently it's possible, if somewhat awkward, to synchronize data for a given field across a translation set using the field synchronization module that comes with Internationalization (i18nsync). But this workaround doesn't address the underlying issues, such as duplicate data storage.

For D7, it will probably be worth considering a refactoring so that, when the translation module is present, a field can be attached either by translation (the current implementation) or by translation set.

If by translation set, we store the tnid, if present, or the nid if not. If by nid, we always use the nid.

This approach would parallel what is being done in various other contrib modules; see:

* Flag: #307810: Multilingual support for flagging
* Fivestar: #307207: Multilingual voting: option to tally votes by translation set
* Nodequeue: #251092: Multilingual Support

Implementation ideas:

* Add field, e.g., 'translation_source', to field definition schema. Value indicates whether nid or tnid should be used.
* Method to return the correct value to use for a given node.
* Method to handle changes in tnid. See #318328: Hook to respond to change of source translation.
* For fields set to attach to the translation set, data are stored and loaded using the tnid (or nid if tnid = 0).
* Admin UI for field creation includes choice to apply to translation set or to individual translations, with appropriate defaults (e.g., text fields default to individual translations, number fields default to translation set).

Example:

Consider one untranslated node and a translation set that contains two members:

nid 20, tnid 0, language ''--an untranslated node
nid 21, tnid 21, language 'fr'--the source translation
nid 22, tnid 21, language 'en'--a member of the translation set

Field 'field_birthdate', type date, is set to apply to the translation set.

'field_birthdate' data are stored only once, in association with the source translation nid.

When editing nid 20, field_birthdate is registered to nid 20 (its tnid is 0, so the nid is used).
When editing nid 22, field_birthdate is registered to nid 21 (its tnid).
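
The lookup rule in this example can be sketched as follows. This is only a language-neutral illustration in Python, not CCK code: the `Node` class, the `storage_nid` function and the `per_translation_set` flag are hypothetical names standing in for the node record and the proposed 'translation_source' field setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    nid: int
    tnid: int  # 0 means the node is not part of a translation set
    language: str


def storage_nid(node: Node, per_translation_set: bool) -> int:
    """Return the nid under which a field's data would be stored.

    Hypothetical helper: fields attached to the translation set use the
    tnid when one is present; everything else uses the node's own nid.
    """
    if per_translation_set and node.tnid != 0:
        return node.tnid
    return node.nid


# The three nodes from the example above.
nodes = [Node(20, 0, ""), Node(21, 21, "fr"), Node(22, 21, "en")]

for node in nodes:
    print(node.nid, "->", storage_nid(node, per_translation_set=True))
```

With the flag switched off, the same function always returns the node's own nid, which matches the current per-translation behaviour.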

See #308188: Refactor nodereference to handle multilingual content? for discussion of the tnid issue related to nodereference fields.

Comments

yched:

I guess this requires storing 'per translation' and 'per translation set' fields in separate tables?
So we have 6 storage patterns:
- shared field, 'per translation': own table: nid, vid + field columns
- shared field, 'per translation set': own table: tnid, (vid?) + field columns
- multiple field, 'per translation': own table: nid, vid, delta + field columns
- multiple field, 'per translation set': own table: tnid, (vid?), delta + field columns
- single, non shared, 'per translation': per type table: nid, vid + field columns
- single, non shared, 'per translation set': per type table: tnid, (vid?) + field columns

:-(

Also, we currently don't attach to nids, but to vids. How would this translate for translation sets?

nedjo:

I'm not sure we need different storage mechanisms. Rather, I think, we just need to add this schema information to the field definition. We would continue to store by nid/vid. The field schema information, however, lets us know if it's a translation set or a translation that we're talking about. If it's a translation set, the nid/vid is that of the translation source; otherwise, it's that of the individual node. When e.g. loading the data, we consult the schema and determine what data we need to load. Does that sound right?

yched:

So, when loading a node, you fetch data for two sets of nid/vid: based on the node's own nid/vid for 'per translation' fields, and based on the translation set's nid/vid for 'per translation set' fields. And similarly on node_save.
Maybe... :-).

- Still raises the issue of fields that are currently stored together in a single per-type table:
Say field_description is a 'regular' (per translation) field, and field_birthdate is shared across all translations of a node. Say both fields are single and not shared, and thus lumped together in the {content_type_story} table.

How do you store stuff in {content_type_story}? Filling in with NULL like below? (using the node examples in your original post)

nid | vid | field_description_value | field_birthdate_value
20  | 20  | some text               | 01/01/1900 
21  | 21  | some french text        | 02/02/1950
22  | 22  | some english text       | NULL (the value from nid 21 will be used)

Makes it rather convoluted to know what belongs to what just by looking at the data tables...

Or do we need two different 'per type' tables - one with 'per translation' fields, one with 'per translation set' fields?

Both options would IMO raise another wave of 'CCK storage is just weird' feeling...
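
To make the NULL-filling option concrete, here is a rough sketch of what loading would look like. It is a language-neutral illustration in Python rather than CCK code: the dictionaries stand in for the {content_type_story} and {node} tables (vid is left out for brevity), and `PER_SET_COLUMNS` and `load_fields` are hypothetical names.

```python
# Stand-in for the {content_type_story} rows shown above, keyed by nid.
content_type_story = {
    20: {"field_description_value": "some text",
         "field_birthdate_value": "01/01/1900"},
    21: {"field_description_value": "some french text",
         "field_birthdate_value": "02/02/1950"},
    22: {"field_description_value": "some english text",
         "field_birthdate_value": None},  # value lives on nid 21
}

# Stand-in for {node}: nid -> tnid (0 means untranslated).
tnids = {20: 0, 21: 21, 22: 21}

# Hypothetical field setting: columns attached to the translation set.
PER_SET_COLUMNS = {"field_birthdate_value"}


def load_fields(nid):
    """Read 'per translation' columns from the node's own row and
    'per translation set' columns from the source translation's row."""
    own_row = content_type_story[nid]
    source_nid = tnids[nid] or nid  # fall back to nid when tnid is 0
    source_row = content_type_story[source_nid]
    return {
        column: (source_row[column] if column in PER_SET_COLUMNS
                 else own_row[column])
        for column in own_row
    }


print(load_fields(22))
```

The second row lookup is the extra cost on load; on save, 'per translation set' columns would have to be written to the source row instead of the node's own row.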

- I'm also not sure about the impact on Views integration (although maybe Flag / Fivestar / Nodequeue successfully answer that already?)

nedjo:

Yes, we would need one or the other of these two approaches. I hadn't thought of having separate tables. It's an interesting idea and would be more fully normalized, but does increase the complexity. I guess my first inclination would be the NULL values (or, I guess, the default DB values, which might be e.g. zero-length strings).

If we stick with NULLs, I don't think this really adds a lot of overhead or complexity. Another table for insert/update, and another table to join on when loading.

plach:

I've been thinking about this since I read the Fields in Core code sprint announcement. Now that any entity in Drupal is fieldable, a node may be thought of simply as a collection of fields. In this scenario we could get rid of the translation set concept and make the fields themselves translatable. This way the bundle would become what previously was the translation set, and field instances might store the translatable property.

A field_data_foo entry could have an additional column specifying its language id. For example, if language_id is set to fr, the entry is the French version of the foo field value; if no language is specified, the foo field value is shared between all the translations of the entity it belongs to.

The language_id column would become part of the primary key, and it might actually hold an integer value to improve performance: PK(etid,entity_id,language_id).

The "translation needs update" flag could be set on a per-field basis too, allowing us to specify which portions of the entity need translation, if necessary.

The entity creation date, as well as any other similar data, might become a translatable field too, allowing us to keep treating an entity translation as a whole standalone object, the equivalent of a translation node in Drupal 6.

Revision ids might differ between field translations; this should not be a problem, as the field_data_* tables would hold the current revision for each translation.

In a per-bundle storage context translatable fields would behave exactly as multiple ones, with the possible exception of revision ids which might differ between translations.

Comments could be bound to node translations by also storing the language id of the commented translation. This way, when showing a node translation, we could choose whether to show all the comments bound to the given node id or just the ones matching the given language id.

With this approach I see these advantages:

  • translated data might be attached to any entity;
  • entity translations might share all the untranslatable data with no need for synchronization;
  • no need to distinguish between translation and translation set: if a field instance is translatable then its values will have a language id, otherwise they will have none; when aggregating an entity, all the field values matching the given language id or having none are loaded;
  • no need for dedicated storage patterns.
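
As a rough illustration of the loading rule in the third point, here is a language-neutral sketch in Python. It is not Field API code: `field_data` and `load_entity` are hypothetical names, the etid column is left out for brevity, and `None` stands for "no language specified".

```python
# Hypothetical stand-in for the field_data_* tables:
# field name -> {(entity_id, language_id): value}.
# A language_id of None marks a value shared by all translations.
field_data = {
    "field_description": {
        (21, "fr"): "some french text",
        (21, "en"): "some english text",
    },
    "field_birthdate": {
        (21, None): "02/02/1950",
    },
}


def load_entity(entity_id, language_id):
    """Collect, for each field, the value matching the requested language,
    or the language-neutral value when the field is not translatable."""
    values = {}
    for field_name, rows in field_data.items():
        if (entity_id, language_id) in rows:
            values[field_name] = rows[(entity_id, language_id)]
        elif (entity_id, None) in rows:
            values[field_name] = rows[(entity_id, None)]
    return values


print(load_entity(21, "fr"))
print(load_entity(21, "en"))
```

Both calls return the same birthdate without any synchronization step, because the shared value is stored only once.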

If you find this idea worth thinking about a little more seriously, I'd be in favor of changing the issue properties to:

Title: Make fields translatable
Project: Drupal
Version: 7.x-dev
Component: field system
Category: task

plach:

Status: Active » Closed (duplicate)