The current serialization of PHP values (arrays / objects) to CMI config files uses a convention of :
array key / object property name is used as XML tag.

This doesn't work when the thing you need to store is a numeric array, since that would result in <0>, <1> tags, which are not valid XML (and would be pretty ugly to begin with)

Config files for image styles (which include an ordered list of image effects) worked around that by generating keys based on the effect machine name + settings :

  <name>large</name>
  <effects>
    <image_scale_480_480_1>
      <name>image_scale</name>
      <data>
        <width>480</width>
        <height>480</height>
        <upscale>1</upscale>
      </data>
      <weight>0</weight>
    </image_scale_480_480_1>
    <image_crop_200_100>
      ...
    </image_crop_200_100>
  </effects>

Not too nice looking, but works, mostly because what's within $image_style['effects'] is predictable (an image effect with a machine name and settings - although : is it *that* predictable that all values for all settings will be scalars ?)

Field API cannot really do that, since the values stored in $field['settings']['some_setting_name'] can be arbitrary, up to each field type : a scalar, a numerically ordered array, an associative array, an associative array whose keys could happen to be numeric (e.g a vocab id...)
As an example of another problematic subsystem : input formats (= an ordered collection of filters with settings, which can definitely be non-scalars)

Proposal :
standardize on the use of the some specific tag, e.g. <item> with a "key" attribute, for numeric keys :

example 1: array($a, $b) -->

<item key="0">[XML for $a]</item>
<item key="1">[XML for $b]</item>

example 2: array(25 => $a, 12 => $b) -->

<item key="25">[XML for $a]</item>
<item key="12">[XML for $b]</item>

example 3: array('key_a' => $a, 12 => $b) -->

<key_a>[XML for $a]</key_a>
<item key="12">[XML for $b]</item>

example 4: array('key_a' => $a, 'key_b' => $b) (like currently ) -->

<key_a>[XML for $a]</key_a>
<key_b>[XML for $b]</key_b>

As a side note, would be nicer if the first case (contiguous numeric keys starting from 0) gets serialized with implicit keys :

<item>[XML for $a]</item>
<item>[XML for $b]</item>

but I'm not sure that's doable, nor a big deal...

CommentFileSizeAuthor
#156 drupal8.config-yaml.154.patch1.43 KBwebchick
#154 drupal8.config-yaml.154.patch2.46 KBsun
#153 drupal8.config-yaml.153.patch3.11 KBsun
#151 drupal8.config-yaml.151.patch3.12 KBsun
#145 xml-to-yaml.patch2.37 KBwebchick
#142 xml-to-yaml.patch2.36 KBwebchick
#136 interdiff-134-136.txt374 bytesalexpott
#136 config-yaml-1470824-136.patch182.88 KBalexpott
#134 config-yaml-1470824-133.patch182.92 KBalexpott
#134 interdiff-131-133.txt421 bytesalexpott
#132 interdiff-130-131.txt400 bytesalexpott
#132 config-yaml-1470824-131.patch182.92 KBalexpott
#130 interdiff-122-130.txt1.36 KBalexpott
#130 config-yaml-1470824-130.patch182.62 KBalexpott
#124 interdiff-121-122.txt2.55 KBalexpott
#122 config-yaml-1470824-122.patch182.82 KBalexpott
#122 interdiff-121-122.patch2.55 KBalexpott
#121 config-yaml-1470824-121.patch182.74 KBalexpott
#121 interdiff-119-121.txt823 bytesalexpott
#119 config-yaml-1470824-119.patch182.72 KBalexpott
#119 interdiff-116-119.txt983 bytesalexpott
#116 config.yaml_.116.patch181.97 KBsun
#114 config.yaml_.112.patch181.98 KBsun
#107 config-yaml-1470824-107.patch180.28 KBalexpott
#107 config-yaml-104-107-interdiff.txt11.82 KBalexpott
#104 config-yaml-1470824-104.patch175.34 KBalexpott
#104 config-yaml-98-104-interdiff.txt1.7 KBalexpott
#98 config-yaml-1470824-98.patch175.29 KBalexpott
#98 config-yaml-93-98-interdiff.txt2.48 KBalexpott
#93 config-yaml-1470824-93.patch173.82 KBalexpott
#93 interdiff-yaml.txt1.19 KBalexpott
#92 config-yaml-1470824-92.patch173.32 KBalexpott
#68 config-yaml-1470824-68.patch177 KBalexpott
#66 config-yaml-1470824-66.patch177.04 KBalexpott
#60 config-yaml-1470824-60.patch172.01 KBalexpott
#59 config-yaml-1470824-59.patch172.02 KBalexpott
#51 config-yaml-1470824-51.patch169.11 KBalexpott
#49 config-yaml-1470824-49.patch170.63 KBalexpott
#44 config-yaml-1470824-44.patch167.16 KBalexpott
#43 config-yaml-1470824-43.patch167.26 KBalexpott
#42 config-yaml-1470824-42.patch166.71 KBalexpott
#40 config-yaml-1470824-40.patch166.37 KBalexpott
#32 config.yaml_.32.patch72.6 KBsun
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

yched’s picture

[edit : removed the 'unserialize to object or array ?' considerations - CMI always unserializes to arrays right now]

yched’s picture

Title: Storage format for unkeyed (numeric) arrays » Storage format for arrays with numeric keys

more accurate title

yched’s picture

Issue summary: View changes

removed the 'unserialize as object or array' considerations - CMI always unserializes to arrays right now

yched’s picture

Title: Storage format for arrays with numeric keys » XML encoder can only handle a small subset of PHP arrays
Category: task » bug

Actually, it's not just about numeric keys. config_encode() can currently only serialize arrays whose keys would be valid xml tags. Any key not matching [a-z]([_a-z0-9])* breaks the encoder. Upper case letters, while valid, might also be an issue.

So the propsal above would become :
an array key that matches the regexp above gets serialized to a tag : <some_key>...</some_key>
other keys get serialized to an 'item' tag + 'key' attribute : <item key="Some other key">...</item>

Can't say I'm thrilled :-/ this gives little control on the overall consistency and prettyness of the generated XML markup.

Or we state that only [a-z]([_a-z0-9])* AND numbers (we have to support numeric indexes) are allowed as keys in CMI-bound structures. Can be difficult to enforce on pluggable stuff like 'value of a given [field | image effect | views handler] setting' (can be non-literals), though.

This issue is probably an argument in the larger JSON vs. XML discussion, but as is, and while we are on XML, that's a stopper for Field API, for instance. Right now if I want to start playing with CMI for fields, I'd need to manually (php-)serialize()/unserialize() the parts of a $field or $instance definition array that would make config_encode()/decode() choke. Doable, but obviously not sustainable in the long run.

Changing to a bug report, actually.

gdd’s picture

Issue tags: +Configuration system

Tagging

naxoc’s picture

I ran into this while working on #1479466: Convert IP address blocking to config system. I ended up using a prefix on all tags, so I guess like example 3 or 4 from #1

nod_’s picture

Why not just use example 1 and 2? <item key="0"> right now it looks like this is trying to be too smart. It introduce a different XML schema which mean that unless you know the node name you can't access it directly. I mean we're already using XML, might as well use the nice things it provides.

That way we can have a xsd schema (or a dtd, or something) that can validate config files, which can be nice to check if they are not broken at some point. Also, testing FTW: to check the export/import just need to make a XSD schema and let XML do it's thing :)

I don't really see why <key_a> or <key_b> would be better. XML is not meant for humans, It's supposed to be readable but ultimately it's not for us.

I'm not for or against, just giving a few arguments.

Anonymous’s picture

nod_: we're trying to keep the format simple, and definitely avoid an XML schema at this point.

the solution in #5 is annoying, but is lowfi and works.

i'm torn on this issue in general. if prefixing is too unpalatable (and the other restrictions), i'm not sure what i'd choose between a more complicated XML schema and not-xml.

yched’s picture

Just a remark that, as mentioned in the OP, I can't predict what I find in $field['settings'], so manual massaging (e.g. #5) to account for our encoder/decoder restrictions is not really an option.

pounard’s picture

The issue description is missing something, you can have both keys and order into items. The XML specification doesn't guarantee items order when they are at the same level, some we might end up in the end:

<item key="a" order="2">ITEM VALUE A</item>
<item key="b" order="1">ITEM VALUE B</item>

Which starts to make sense. This means that if we want to guarantee all that (order, keys, and other array features) we need to write a decent parser and a DTD sounds like the least of our problems.

I'm just saying, if we are using XML, let's use it fully and properly, we could make both the order and key attributes, which makes sense.

This is an hashmap where order matters:

<item key="a" order="2">ITEM VALUE A</item>
<item key="b" order="1">ITEM VALUE B</item>

This is a list where order matters but key doesn't:

<item order="2">ITEM VALUE A</item>
<item order="1">ITEM VALUE B</item>

This is an array where key is numeric (order doesn't really maters since keys are numeric):

<item key="1">ITEM VALUE A</item>
<item key="2">ITEM VALUE B</item>

Plus, having a more descriptive schema based on heavy attributes usage also allow us to proceed to efficient XPath usage if needed, easier than by doing <item_key_a>

This also means, by extension, that the config system might want to handle data types more carefuly than just storing values, when storing a PHP array there is ambiguity to what we are actually storing: a configuration subtree, an element list, an hashmap?

This is where I imagined in the begining (but that the config system doesn't seem to do at all) that stored types should be restricted to a simpler subset that an generic UI would allow any user to edit easily, such as:

  • boolean: checkbox
  • string: textbox
  • number: textbox with validation
  • list: textbox with comma separated values (and eventually taking care of enclosing values)
  • floats: textbox

From this would derivate a stricter, but more predictable XML schema, as a more featured developer API. Each one of those types can be introspected without ambiguity even in PHP, and can be stored without ambiguity with a good schema preserving values type.

We could end up with something more explicit such as:

<section name="something">
  <item key="dosomething" type="bool">1</item>
  <item key="class" type="string">\\Something\\MyImplementation</item>
  <item key="somevalues" type="list">some,"comma,separated",values</list>
  <!-- And this would be eventually our hashmap -->
  <section name="stringplurals">
    <item key="cat" type="string">Cats</list>
    <item key="poney" type="string">Ponies</list>
  </section>
</section>

Which would be more predictable and readable, also easier to parse, and easier to work with for let's say, building a nice cutting edge registry-like magic advanced edition UI.

When I see this in the DrupalConfig.php file:

  public function castValue($value) {
    if (is_scalar($value)) {
      // Handle special case of FALSE, which should be '0' instead of ''.
      if ($value === FALSE) {
        $value = '0';
      }
      else {
        $value = (string) $value;
      }
    }
    else {
      // Any non-scalar value must be an array.
      if (!is_array($value)) {
        $value = (array) $value;
      }
      // Recurse into any nested keys.
      foreach ($value as $key => $nested_value) {
        $value[$key] = $this->castValue($nested_value);
      }
    }
    return $value;
  }

I'm actually afraid that the config system won't keep datatypes and thus can cause wrong behaviors in runtime. This seems muddled and we don't know what will be the storage behavior regarding all of this.

EDIT:
When I say that the DTD is the least of our problem, I'm not saying we don't want it, I'm saying we need the schema to be more formalized. With a formalized schema a simple DTD would come naturally and would be easy to implement (and a really good way to ensure configuration files format).

By having more strict type checking and restriction over the allowed configuration data types, we can also support more formats and potential backends, with what I proposed upper, we can use JSON naturally, but also ini files, YML and such.

nod_’s picture

Very nicely said, that is what I imagined CMI XML files would be like.

yched’s picture

Siblings order will be critical to many subsystems that will store stuff in CMI. Image styles are ordered collections of effects, Input formats are ordered collections of filters, etc...

Even though XML 1.0 spec doesn't say anything about order, it seems finding an actual XML parser implementation that hands out sibings in a *different* order than document order is a tough job ?

And relying on DTDs is a pipe dream IMO. Given the variety of subsystems that will store their stuff in CMI files, each with its own "thing" structure, forcing each one to write a proper DTD before it can use CMI sounds like a showstopper.

pounard’s picture

And relying on DTDs is a pipe dream IMO. Given the variety of subsystems that will store their stuff in CMI files, each with its own "thing" structure, forcing each one to write a proper DTD before it can use CMI sounds like a showstopper.

XML format is just a serialized representation of data handled by the CMI API. If the the API is unclear about data structure and data type, which we can see throught the lack of schema today, then the API is unusable because there is no boundaries about what we can or cannot do. This also means that storage is not plugable due to the lack of specification which makes the interfaces unclear. I don't see the showstopper in the DTD, I see the showstopper way upper in the API itself.

EDIT: Don't take it harsh, the config initiative did good job, I'm just surprised that people are cautious about formalizing a bit more our XML file, if we don't do that why would us use XML anyway?

EDIT: And I meant a unique DTD for all, not one per subsystem.

gdd’s picture

Another problem we came up with this weekend is that there are various places where we need to store html (for instance the list of allowed tags.) There are two options there:

- htmlspecialchars() everything going into the config system and decode on the way out. Sucks because it makes the files a lot less readable.
- add either setCData() or an options array to the existing set() where you can specify 'cdata' => TRUE.

Neither of these is particularly joyous. I also spent a lot of time with people new to the config system, and the process of mapping Drupal arrays to a format easily represented in XML was a very common point of confusion.

Given all these issues, and despite my deep reluctance to open this discussion again, I spent some time at DrupalCon talking to various groups about the possibility of switching from XML to YAML. YAML has some real advantages. It is VERY readable, it is the default configuration format for Symfony and their YAML library is very nice and easy to use. The one downside is that it is a userspace library, however if we switch to having serialized PHP in the active store (which seems likely, see #1500238: Encode/decode data via PHP serialize()/unserialize() for DatabaseStorage (storage controller specific data formats)) then the only time it will ever happen is on export/import which is not a critical path performance-wise.

The other big downside is that if we dump XML, then we have to come up with our own Drupalism for storing properties. Right now, for instance, we need a way to mark individual pieces of config as translatable or not. I had thought about doing something like

  <site_name translatable="true">Biff Bang Pow!</site_name>

This is really elegant and it is exactly what XML was made for. If we dump XML, we would have to come up with our own attributes/property storage standard, probably something like

site-name: Biff Bang Pow!
  #properties:
    translatable: true

or something (just making this up on the fly, but I assume everyone gets the idea.) Alternately we could come up with a delimiter or something to express translatability on one line, although I'm not sure how we would make this work consistently.

Note: this is not me opening up the format wars again. It is me offering an alternative that will address some of the problems of XML that we are encountering. Thoughts welcomed.

eaton’s picture

I'm a huge sucker for YAML, and tinkering around with jekyll for a few months has really made it clear how a pleasant, simply syntax can make a huge difference when it comes time to slap down some metadata. On the other hand, the cdata options mentioned in #2 seem to preserve consistent syntax, special casing HTML/tagged text rather than "anything that is a property." The latter seems like it would be a bit difficult to explain clearly. If it just came down to these questions I'd lean towards keeping XML because of its flexibility and the fact that so many systems can parse and manipulate it efficiently.

More worrying is the real-world testing with users not familiar with XML syntax and the trouble they had. Do you think it's the sort of "new stuff takes time" adjustment we'd face with any system? I used to work with Java and .Net, so it's not unfamiliar -- I'm probably not the best judge of how hard it is to pick up.

gdd’s picture

I think the biggest problem is that XML does not represent data the way Drupal does, and ultimately this will confuse people which makes it a pretty big DX issue. We can sit here and say 'Well Drupal shouldn't have giant indexed arrays of doom' all we want, but we do and its what people are used to. This same line of argument goes for pounard's comments in #9. This is very nice XML, with good datatypes and all sorts of information, that would work great in a system that contained this type of information. Drupal does not have strong typing or anything like this. I'm implementing a config system that represents Drupal, not changing Drupal to match a config system. Once the system is in place for 8, maybe we can approach that in 9. As it stands now we just don't have that kind of luxury.

This is really why I mentally have started re-opening the file format issue. XML is fine I just don't think it works very well for the kind of system we have.

pounard’s picture

I'm not sure having a stronger schema for XML would confuse people, when we store data into a database we always have a strong schema, and when we store serialized PHP we have an even more strong schema (we have PHP types kept). And as I described it, the datatypes can be introspected when writing the conf quite easily. I think that a strong schema doesn't imply that its representation changes of the real nature of it, on the contrary, it allows to keep it in a saner and standardized serialization representation which would make life easier for people.

alexpott’s picture

I've posted a possible solution for the html issue discovered at the core sprint in Denver #1504770: HTML in the config xml files is very messy

fietserwin’s picture

PHP arrays are not ordered by numeric key but by order of insertion (see e.g. http://technosophos.com/content/php-arrays-are-not-arrays). So if one wants to preserve order it should use an additional property for that. In Drupal we normally call this weight, which is a property of the item. So I think that the idea in #9 of an order attribute is not needed (it won't work as suggested).

Looking a bit longer at the problem I came to the same conclusion as #3 as a viable solution. However I don't have that many problems with it. As long as Drupal uses array for both "objects" and lists of "objects" (and our lists of objects are keyed for fast direct access) we will have difficulties in distinguishing those 2 cases. But as long as we can serialize and deserialize without loss of information, I don't see a problem. Nor would I have a problem with restricting the keys I can use to just those that lead to valid XML tags, but am not sure that we change existing naming schemes (#type, #theme, etc.)

<ïmage_style key="large" >
  <name>large</name>
  <effects>
    <item key="1">
      <eid>1</eid>
      <name>image_scale</name>
      <data>
        <width>480</width>
        <height>480</height>
        <upscale>1</upscale>
      </data>
      <weight>0</weight>
    </item>
    <item key="2">
      <eid>2</eid>
      <data>
        ...
      </data>
      <weight>5</weight>
    </item>
  </effects>
</image_style>

Looks fine for me.

pounard’s picture

The fact that some XML elements are arbitrary names upon the module business (such as <image_style [attributes]>) actually makes potential XPath queries over configuration virtually impossible because we have no spec or ideas on how to actually query it. It lacks normalization and specification in order to be able to create additional tooling around the configuration API or even the files themselves. I know that no-one is supposed to modify those files directly in Drupal side we are meant to use the high-level API instead, but still, a complete API will only be possible alongside a rigorous spec of what is the data, what possibilities does it offer, depending on the final trade-off made between performance and portability, wich doesn't really seem to have been done yet. The initiative higly focused on the multiple layer design, which is good, and now that this goal has been met, we need it to become an high API for the sake of DX, and for this we need to spec the API capabilities, especially for data introspection, consistency and high level usability.

dixon_’s picture

The problem CMI is trying to solve is better management of configuration, i.e. the ability to not break your site horribly when moving configuration around. And CMI addresses that with the current architecture.

I agree with heyrocker:

Drupal does not have strong typing or anything like this. I'm implementing a config system that represents Drupal, not changing Drupal to match a config system. Once the system is in place for 8, maybe we can approach that in 9. As it stands now we just don't have that kind of luxury.

Let's focus on better typing for everything and a more efficient format in Drupal 9. Doing everything at once will simply be too much work before code freeze. Let's realize our limitations ;)

pounard’s picture

The problem if we don't start worrying about this right now -at least for the files spec- is that we'd need to rewrite all configuration files for D9, also including modules' files, this doesn't seem a good tradeoff to me. CMI is already breaking completely the API with the goal to make something durable, but the files aren't right now, so the goal is not met, and later versions may be even harder than this one in the end, thus defying the initial purpose.

Revisiting file format and spec may be hard and a long run task, but we can do it without modfying the actual CMI PHP API modification, and we could ensure a more durable and interoperable file format.

jide’s picture

I support the general idea in #9, using known keywords for nodes (in the XML meaning !) such as "item" with keyed attributes. I believe this is more in the spirit of XML, having node named after arbitrary properties does not really make sense.

Even if we do not use a DTD, we can see why they exist in the first place : create a predictible set of rules on how the data should be structured. And that's impossible when nodes can have any name. XML nodes are not made to store values, and arbitrary property names ARE values, in a way.

Moreover, numeric keys are not the only problem, we can imagine properties with spaces in the name, or other problematic cases.

Let's drop items order and typing, add CDATA for long strings, and we have something solid, conform to what XML is designed for, future-proof etc.

To be short, I support #9 without ordering and typing. It just makes sense.

jide’s picture

My proposal :

<section name="something">
<item key="mykey">my value</item>
<item key="myotherkey">my other value</item>
</section>
<section name="numericlist">
<item key="0">my value</item>
<item key="1">my other value</item>
</section>
<section name="cdatalist">
<item key="text"><![CDATA[
The term CDATA is used about text data that should not be parsed by the XML parser.
]]></item>
</section>

gdd’s picture

I would really like to see his get resolved one way or the other. Re-reading this thread from scratch, I definitely agree that if we are to move forward with xml, we will need to adopt some kind of structure as described in #24 which is pretty nice. One thing I like about this over the earlier suggestions is that everything is in the format <item key="foo"> and not just numeric or incompatible keys. This will keep the system simpler and the XML more consistent and thus easier to read. acrollet made a proposal for how to deal with cdata at #1504770: HTML in the config xml files is very messy which I'm not completely sold on but its worth discussing. Finally I agree we don't need to have order and type keys on all entries. Systems that want this can add it.

So if we go this route, that means some rewriting of the XML parser which is not a horrible task. Do people feel like this is a good solution? Will it solve all the functional issues we currently face? It still means maintaining our own parser, whereas if we moved to YAML we could use Symfony's, but on the other hand as we work more and more with the system, the harder and harder it becomes to change formats and I don't really have a problem with this new proposal.

pounard’s picture

SimpleXML and DOM are both good ways to do it. DOM is a more conventionnal interface and supposedly more reliable, so I'd use it rather than SimpleXML. Both work loading fully the files in memory, but our configuration files won't be huge, so that's not a problem. I don't know for SimpleXML, but DOM uses the standard libxml, which is actually very fast and reliable (and flexible). In any case, any people used to XML will be able to write a custom reader for CMI really quickly, especially if the syntax is that simple as #24 is proposing.

Note: Even if the CMI PHP API doesn't care about types, I would strongly suggest to type data (using attributes) in the XML files. We can guess types when setting values, force type when manually writing the XML files, and we can cast values to the right PHP type at runtime when reading it. It won't change anything for the consumer, at least we would have it if we decide to go this way later. Plus, having the type info in files can be a good hint for people reading the files when searching for variables definition and their meaning.

yched’s picture

Settling on <item key="foo">value</item> on everything works for me.

As per the larger XML/other format, I guess no-one wants to be the one to shoot the first bullet on Format War II (III ?)...
I for one could definitely go with YAML...

ksenzee’s picture

Isn't PHP DOM intended for XHTML? I guess maybe you could use it for arbitrary HTML XML but it seems a little odd.

If we stick with XML, the structure in #24 makes sense. It might make even more sense to just switch to YAML while we still can. IIRC the arguments against it were that it's a bit brittle (you can mess things up royally just by adding a space or two) and that the libraries are all userspace. I think since we had Format War I, we decided a slow userspace parser was okay, and we're less worried about people editing the files manually. So I will tentatively place myself in the YAML camp as well.

Crell’s picture

PHP's DOM API is an implementation of the W3C DOM API, which is horribly clunky in every language but is intended to work on any arbitrary DOM tree; 99% of the time that is serialized as XML. XHTML is just a special case of that. It works fine for arbitrary XML. That said there, are other libraries out there to help make its ugliness less ugly (such as QueryPath).

I still have an irrational hatred of YAML, but I'm going to play Swiss for now. :-)

gdd’s picture

Based on the IRC meeting today and other discussions I've had with people, it appears that there is a lot of support for YAML and it would solve a lot of problem for us. So let me frame this a different way, does anyone have a passionate defense of XML at this point?

pounard’s picture

XML uses a native implementation in PHP based upon libxml, which is actually really fast, as it exists in a lot of other languages as well, is well known, well supported, supports schema and files validation. I don't have a an opinion about YAML because I don't know it enough. Plus I really don't see what problem you actually can't solve with it that YAML can solve?

sun’s picture

Priority: Normal » Major
Status: Active » Needs review
FileSize
72.6 KB

Yes. I don't see why we'd want to pester our lives any further with those rigorously ugly XML problems.

The keys are not the only problem with XML. Check the current config_decode() and config_encode() functions to see how much insanity XML causes us.

Symfony comes with a working YAML parser/dumper component. It's already there. It solves all problems. Zero development and maintenance effort.

Attached patch is a hack-ish kick-start.

  • Dumps latest Yaml component from Symfony master into core. Needs to be changed to a specific package/release.
  • Hacks in public access to the file storage object within DrupalConfig / DrupalConfigVerifiedStorage. (Good lord, this API badly needs a massive clean-up and simplification.)
  • Hacks out the awkward file signature stuff to facilitate file conversions. To be properly removed in #1444620: Remove file signing from configuration system.
  • Contains config_xml2yaml*() conversion functions to ease conversions until this lands.

Code lives in the config-yaml-1470824-sun branch in CMI.

---
Lastly, before doing this patch, I additionally explored whether PHP's native support for .ini files would be an option - but alas, nested keys of infinite depth are not supported. Thus, not remotely an option.

Status: Needs review » Needs work

The last submitted patch, config.yaml_.32.patch, failed testing.

gdd’s picture

Can we just worry about the matter at hand and not about hacking in file access or signatures? One thing at a time please. (note I am working on a restructuring of the objects now, will hopefully have a patch tomorrow)

pounard’s picture

When reading this thread without any information from the outside world, it sounds like nobody even really tried to use XML. I still don't see what problems YAML can solve that XML cannot? One good thing with XML is that we have natively have the choice between 3 different implementations, meanwhile YAML has none native one, and the only one everyone speaks about is the Symfony's one (other exists) from which we would depend upon as soon as we make this choice. It doesn't sounds like a reasonable choice to me.

sun’s picture

meanwhile YAML has none native [PHP extension]

@see http://php.net/yaml

Granted, needs to be installed manually. But there is a native extension. Optional/conditional support for that could easily baked in (and bonus points for doing that upstream in Symfony's YAML component).

To clarify on the XML problems - all in once:

use Symfony\Component\Yaml\Yaml;

$config = array(
  // Indexed arrays; the order of elements is essential.
  'numeric keys' => array('i', 'n', 'd', 'e', 'x', 'e', 'd'),
  // Infinitely nested keys using arbitrary element names.
  'nested keys' => array(
    // HTML/XML in values.
    'HTML' => '<strong> <bold> <em> <blockquote>',
    // UTF-8 in values.
    'UTF-8' => 'FrançAIS is ÜBER-åwesome',
    // Unicode in keys and values.
    'ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΣὨ' => 'αβγδεζηθικλμνξοσὠ',
  ),
);
var_dump($config);
array(2) {
  ["numeric keys"]=>
  array(7) {
    [0]=>
    string(1) "i"
    [1]=>
    string(1) "n"
    [2]=>
    string(1) "d"
    [3]=>
    string(1) "e"
    [4]=>
    string(1) "x"
    [5]=>
    string(1) "e"
    [6]=>
    string(1) "d"
  }
  ["nested keys"]=>
  array(3) {
    ["HTML"]=>
    string(33) "<strong> <bold> <em> <blockquote>"
    ["UTF-8"]=>
    string(27) "FrançAIS is ÜBER-åwesome"
    ["ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΣὨ"]=>
    string(35) "αβγδεζηθικλμνξοσὠ"
  }
}
$yaml = Yaml::dump($config);
var_dump($yaml);
'numeric keys':
    - i
    - n
    - d
    - e
    - x
    - e
    - d
'nested keys':
    HTML: '<strong> <bold> <em> <blockquote>'
    UTF-8: 'FrançAIS is ÜBER-åwesome'
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΣὨ: αβγδεζηθικλμνξοσὠ
$config_reloaded = Yaml::parse($yaml);
var_dump($config == $config_reloaded);
bool(true)
// Extract the numeric keys, and reverse them for extra ordering fun.
$numeric_keys = array_reverse($config['numeric keys']);
// Unset the numeric keys, for even more ordering fun.
unset($config['numeric keys']);
$config['numeric keys'] = $numeric_keys;
$yaml = Yaml::dump($config);
var_dump($yaml);
'nested keys':
    HTML: '<strong> <bold> <em> <blockquote>'
    UTF-8: 'FrançAIS is ÜBER-åwesome'
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΣὨ: αβγδεζηθικλμνξοσὠ
'numeric keys':
    - d
    - e
    - x
    - e
    - d
    - n
    - i
pounard’s picture

Solutions for XML:
- numeric keys: implicit if no "key" attribute specified
- ordering: most XML writers and readers will preserve order, if none specify a "delta" or "order" attribute
- nested keys: CDATA
- Encoding: just as YAML, values will be encoded with the document encoding
- "and reverse them for extra ordering fun" was totally irrelevant

I agree XML will need extra code for parsing, but that's not a relevant factor for choosing one or the other: syntax and capabilities are.

XML allows to use a DTD and easy document validation (really useful feature). I don't know if YAML can do that.

I didn't knew about the YAML PHP extension, that is a good thing.

Thanks for your answer.

EDIT: I still don't trust indentation based formats.

pounard’s picture

From http://www.yaml.org/spec/1.2/spec.html#id2765608

In the representation model, mapping keys do not have an order. To serialize a mapping, it is necessary to impose an ordering on its keys. This order is a serialization detail and should not be used when composing the representation graph (and hence for the preservation of application data). In every case where node order is significant, a sequence must be used. For example, an ordered mapping can be represented as a sequence of mappings, where each mapping is a single key: value pair. YAML provides convenient compact notation for this case.

If I understand well, the example you wrote is actually wrong, you should have used a sequence for the "numeric keys" key:

'numeric keys': [a, d, e, x, a, d, n, i]

See http://www.yaml.org/spec/1.2/spec.html#sequence and http://www.yaml.org/spec/1.2/spec.html#id2790088 (this is 1.2 YAML version)

This keeps the order (it's the only case --it seems-- in YAML for which order is preserved).

Moreover, this means that YAML libraries usage is not as simple as your sample shows it, in order to preserve ordering, you have to explicitely generate such YAML, which is not generated per default. The integration code will be as complex as the XML one given that we actually find a good YAML library that supports the full spec.

The more I read documentation about YAML, the more it seem as complex as XML to me. I don't see it neither more readable neither less. I just think XML is better supported in many technologies including PHP.

EDIT: To be clear, I am not really preaching XML here, but just trying to make a reasonable argument. I dislike YAML not because it's not a good format, but because it's a whitespace based format: this is prone to more human error than a format such as XML which may be a bit more verbose when manually editing the files.

alexpott’s picture

Yaml seems to solve a lot of our issues with config without any work:

  • Yaml is a "data serialization language" - whereas xml is much much more. The problem we are trying to solve is the purpose of Yaml. Thus obeying the UNIX philosophy - "Write programs that do one thing and do it well."
  • numeric keys just work and (afaik) Sun's example is correct - see http://www.yaml.org/spec/1.2/spec.html#id2760118 - a sequence of scalars - which is exactly what we want.
    'numeric keys':
        - i
        - n
        - d
        - e
        - x
        - e
        - d
    

    is equivalent to

    'numeric keys': [a, d, e, x, a, d, n, i]
    
  • Variable type is maintained - integers are integers etc...
  • html in values just works

It's also nice because:

  • Symfony has done most of the work for us
  • The native php extension is fast - I've done 10000 yaml_emit() of an image style in 0.832s - this compares to 8.632s for the Symfony Yaml component and 2.317s for the current config_encode function. (I'm working on the patch for the component)
alexpott’s picture

Status: Needs work » Needs review
FileSize
166.37 KB

Done some work on the yaml patch

  • Completely remove config_decode and config_encode as these can be replaced by the equivalent static functions provided by the Yaml class (parse and dump)
  • Removed all the functions to convert xml to arrays
  • Removed casting of all values to strings... that's why Yaml's a good idea
  • Use the extension .yml as that is the extension used by Symfony's tests
  • Updated tests so that they pass
  • Removed changes to other parts of CMI not related to saving config as Yaml
  • Updated config.yml files to use integer values where it currently would when saving eg.
    cache: 0
    cache_lifetime: '0'
    page_cache_maximum_age: '0'
    page_compression: 0
    preprocess_css: 0
    preprocess_js: 0
    

Status: Needs review » Needs work

The last submitted patch, config-yaml-1470824-40.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
FileSize
166.71 KB

Ha... system.performance.cache is now an integer so the test that rewrites it with a string now fails :)

New patch to address this...

alexpott’s picture

FileSize
167.26 KB

Realised we can remove some more code as we no longer need to check keys for a specific set of characters. Removed the following line from the config->set() method.

 $key = preg_replace('@[^a-zA-Z0-9_.-]@', '', $key);

Now we can have config keys like:

ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΣὨ: αβγδεζηθικλμνξοσὠ
FrançAIS: ÜBER-åwesome
alexpott’s picture

FileSize
167.16 KB

Cleaned up an a test config value in system.performance.yml and a space at the end of a line.

pounard’s picture

#39 The equivalency is not true, as soon as it specified that the order is an implementation detail: this means that using the same library for parsing and reading should keep the order, while using different libraries for parsing and reading may not.

Both YAML and XML are serialization formats, using XML here would be as revelant as using YAML, the only question here is which one gives us the best flexibility for what we intend to do, not which one is against UNIX philosophy (honestly WhyTF are you using this argument here)?

I don't really care about writing performances since it's not something we are going to do every now and then.

Aside of that, if YAML is choosen, then YAML it is. But I still think that using a whitespace based format will be a pain in the ass for manual files editing. I have to admit XML is not the best one too for manual editing (honestly, a good INI file would be as good for us here too) but at least it's less subject to human input errors due to missing or extra whitespaces (which is something kind of hard to debug, especially when we'll deal with blobs, i.e. mail contents or UI text variables).

gdd’s picture

I'm not nearly as concerned with what gives us the most flexibility as I am with what is easiest and doesn't block us from doing things we already do. I am not really interested in trying to fix Drupal weaknesses through the config system (by adding support for typing for instance.) I'll leave that kind of work to Larry.

I am also much more concerned with readability than writeability, so from that standpoint YAML is a pretty big win IMO. Yes, manual editing will be more difficult, but I do believe that is not by any means a horribly common activity (although I realize you disagree.)

I will try and get some time to look at this patch soon.

alexpott’s picture

As an example of how Yaml makes things a little bit easier - see #1496534: Convert account settings to configuration system. This currently failing the HTML in page titles (PageTitleFiltering) [System] test because XML is failing to save the title when it is set in testTitleXSS() to

  $title = '</title><script type="text/javascript">alert("Title XSS!");</script> & < > " \' ';
 

After applying the patch in #44 and converting system.site.xml to Yaml this test passes.

pounard’s picture

I guess it's because the actual XML handling code doesn't care about data. Using CDATA would fix this.
The actual XML handling code is about this:

function config_decode($data) {
  if (empty($data)) {
    return array();
  }

  // This is the fastest and easiest way to get from a string of XML to a PHP
  // array since SimpleXML and json_decode()/encode() are native to PHP. Our
  // only other choice would be a custom userspace implementation which would
  // be a lot less performant and more complex.
  $xml = new SimpleXMLElement($data);
  $json = json_encode($xml);
  return json_decode($json, TRUE);
}

function config_xml_to_array($data) {
  $out = array();
  $xmlObject = simplexml_load_string($data);

  if (is_object($xmlObject)) {
    $attributes = (array) $xmlObject->attributes();
    if (isset($attributes['@attributes'])) {
      $out['#attributes'] = $attributes['@attributes'];
    }
  }
  if (trim((string) $xmlObject)) {
    return trim((string) $xmlObject);
  }
  foreach ($xmlObject as $index => $content) {
    if (is_object($content)) {
      $out[$index] = config_xml_to_array($content);
    }
  }

  return $out;
}

function config_encode($data) {
  // Convert the supplied array into a SimpleXMLElement.
  $xml_object = new SimpleXMLElement("<?xml><config></config>");
  config_array_to_xml($data, $xml_object);

  // Pretty print the result.
  $dom = new DOMDocument('1.0');
  $dom->preserveWhiteSpace = false;
  $dom->formatOutput = true;
  $dom->loadXML($xml_object->asXML());

  return $dom->saveXML();
}

function config_array_to_xml($array, &$xml_object) {
  foreach ($array as $key => $value) {
    if (is_array($value)) {
      if (!is_numeric($key)){
        $subnode = $xml_object->addChild("$key");
        config_array_to_xml($value, $subnode);
      }
      else {
        config_array_to_xml($value, $xml_object);
      }
    }
    else {
      $xml_object->addChild($key, $value);
    }
  }
}

Which is:

  1. Does not care about the language spec.
  2. Does not care about the serialized data typing.
  3. Global state instead of being a clean interface implementation.
  4. Not reliable nor complete.
  5. Contains ugly hacks.
  6. Has no revelant structure, nor applicative data shape spec.

My humble opinion here is that until the CMI doesn't care about data type and structure it stores, you will have surprises due to the choosen serialization mecanism, whichever it is. The current code is not realiable, and design seems to be a bit too light: there is no data definition, specification, no storage backend interface, it is not swappable, not testable. In this current state, it doesn't make any sense, and using any of INI, XML, YAML, you will always have bad surprises as the one you just highlighted.

Even before thinking about the file format to use, you should first rethink the overall CMI API design. The whole storage API will be reliable only once you'll have specified what are the data types you can or cannot store, and what are the rules applying when serializing complex data structures. If you do not, every now and then you will have problems due the serialization mecanism: each one of them will bring good and bad, XML as YAML will, in both ways.

alexpott’s picture

FileSize
170.63 KB

Rerolled #44 patch:

Status: Needs review » Needs work

The last submitted patch, config-yaml-1470824-49.patch, failed testing.

alexpott’s picture

FileSize
169.11 KB

Bah... new patch

aspilicious’s picture

Status: Needs work » Needs review
alexpott’s picture

Status: Needs review » Needs work

Interestingly every single function in the code snippet in #48 is removed by the Yaml patch. I think with Yaml we are closer to having a CMI that doesn't care about the data type and structure it stores.

However, potentially, we are further away from solving #1448330: [META] Discuss internationalization of configuration since we can't add additional properties to keys eg:

<item key="site_name" lang="cy">Mae hon yn wefan cŵl!</item>
alexpott’s picture

Status: Needs work » Needs review

Changing status back... crosspost :(

pounard’s picture

One bright side of this Yaml side is surely that the code is lighter, but in this patch we don't have a lot of edge cases (sequences, etc..).

neclimdul’s picture

So the escaping thing is a non issue. CDATA is actually the wrong solution for the record because then you have to escape ]]> from the string and can't really be done sanely. The actually solution is to escape < and & (ref).

So apparently I'd always misunderstood what was being demonstrated in the <item /> example and that's horrifying. If you think that simplifies anything I've got megs of XSLT sitting on a server for translating MS CRM packets to correct you. I wish I could give a better solution but that is a truly miserable one to work with. The problem with XML inevitably always boils down to the fact that it doesn't translate to common language constructs in any language. .NET has a freaking system for treating them like a database its so bad. That's one of the impetus for JSON even existing(not suggesting it, I'm well aware of the percieved pitfalls for us). I thought we'd reached a working solution but it seems we really haven't if this is our best option now.

ksenzee’s picture

#55: If there are edge cases we need to support but that are not in our tests, it would be great to get some tests for them.

#56: I'm not sure I follow what is horrifying. :) Since you brought up JSON, it's worth noting that JSON is a subset of YAML 1.2. If you specify YAML's "inline style", what you get is valid JSON. So unless there is a hole in the Symfony YAML library, or a hole in the YAML spec, YAML can by definition do anything JSON can do.

gdd’s picture

Status: Needs review » Needs work

#48:

Even before thinking about the file format to use, you should first rethink the overall CMI API design. The whole storage API will be reliable only once you'll have specified what are the data types you can or cannot store, and what are the rules applying when serializing complex data structures. If you do not, every now and then you will have problems due the serialization mecanism: each one of them will bring good and bad, XML as YAML will, in both ways.

Unfortunately this is not something CMI can do in a vacuum. I have been trying to point out this entire time that we can only store what Drupal gives us, and any rules about data types and storage have to be done at that level and not ours. We can not impose standards onto Drupal, we are here to store what Drupal has.

#53: The newest proposal for internationalization has this moving out into the property system, so this may not ever be an issue. If we ever want to store metadata, then we are going to end up either with a naming convention (like the '#' for properties in FAPI) or some kind of metadata key that every entry has. This would be an ugly hack, but I would argue its a less-ugly hack than having everything named item. That's just me though. (insert pounard response here.)

One thing I note in this patch is that we have a LOT of places where we hardcode a file extension. I'm wondering if it might be nice to put that into a constant to make this easier down the road when we have to go through this whole exercise again. The same goes for comments like 'Saves the configuration object to disk as Yaml.', I'd like to see these changed to 'as the active format' or something along those lines.

I would like to keep config_decode() and config_encode() for similar reasons. It looks dumb now but it is more self-documenting and will make things easier if we find we need to do some kind of data munging on the way in or out.

+++ b/core/modules/image/config/image.style.large.ymlundefined
@@ -0,0 +1,10 @@
+effects:
+  image_scale_480_480_1:
+    name: image_scale
+    ieid: image_scale_480_480_1
+    data:
+      width: '480'
+      height: '480'
+      upscale: 1
+    weight: '0'

We don't need the machine names here anymore, we can just move to indexes. We can also get rid of the code that generates machine names in image.module (I think it is in image_effect_save())

alexpott’s picture

Status: Needs work » Needs review
FileSize
172.02 KB

The patch attached here takes a slightly different approach to the one outlined in #58. Basically it moves the serialization "problem" to the class that is writing the array to storage. So the Yaml code is moved to FileStorage.

The advantage of moving the serialization implementation to storage class is if we want to swap out Yaml for something else - we only have to change FileStorage.php

Another advantage of this is to allow the SQL storage to use a different serialization method - in this case php's serialize() function which will have considerable performance benefits.

The idea behind this is that where we are storing the config makes different serialization options preferable based on performance, readability, editability... so:

  • in memory we store it in a php array contained in an object
  • in database we store it in a string in a row of a table
  • in a filesystem we store it in a readable / editable format in a file
alexpott’s picture

FileSize
172.01 KB

Fixed up some line endings... must remember to run grep -n "^\+.* $" patch_file before submitting a patch...

catch’s picture

There's already #1500238: Encode/decode data via PHP serialize()/unserialize() for DatabaseStorage (storage controller specific data formats) for the active store (which should probably be major or critical), if it's not extra review overhead I'd not object to solving that here but otherwise it'd be good to keep the two issues focused.

marcingy’s picture

The failures in #1496542: Convert site information to config system with regards invalid xml, really says that YAML might be a more robust solution, otherwise we are going to get issues being reported all over the place for exceptions unless we validate the content of all fields being sent to config to ensure they are valid xml before actually saving them.

pounard’s picture

@alexpott your FileStorage class has static, which means it's still hardcoded, not swappable. Its business is no masqued behind an interface either. A good way to improve this would be to start with writing a nice interface then implement it. If this is done, this means the file format won't not hardcoded anymore. The class name could be something configurable (we could get from the settings.php file), then file format would be fully swappable.

gdd’s picture

I don't see the benefit in a truly swappable file format. Drupal core and all contrib modules are going to ship with default configuration in a single format, so what will swappability buy you? You would have to go through and convert all these files yourself to the new format, as well as keep up with them when changes occur. While I understand that supporting pluggability forces you to make sure your design is not coupled and remains independent, I think there is very little real world value to it.

webchick’s picture

I agree with heyrocker here. We need to standardize on a single file format and use it through all of core + contrib or we haven't actually solved this problem in D8.

So... I don't see any real, technical objections to YAML in this thread other than "whitespace can cause problems" but we already know XML is causing problems, and that YAML addresses the concerns brought up so far... so should we re-focus this issue to change the XML format to YAML?

alexpott’s picture

FileSize
177.04 KB

Added some tests based on the config data in #36.

#61 - I think the issue of what serialization technique is used for file storage and what is used for database storage is best solved together as at the moment we choose a serialization format to pass to the different storage engines at the config interface level.

Status: Needs review » Needs work

The last submitted patch, config-yaml-1470824-66.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
FileSize
177 KB

Rerolled patch against latest 8.x

sun’s picture

I'd recommend to postpone further re-rolls on #1567812: Remove "Verified" from configuration class names, since that will change a lot of class and file names (for the better).

I agree with @alexpott in #59 to move the encoding/decoding onto the storage controller, but as long as there's still no definite agreement on YAML, it probably makes more sense to do that as part of #1500238: Encode/decode data via PHP serialize()/unserialize() for DatabaseStorage (storage controller specific data formats) or in a completely separate issue.

Crell’s picture

To #63 and #64, whether or not the actual file format is swappable in practice API code should be divorced from it. Yes, that means avoiding statics. Swappable implementations (even if it's just two different XML/Yaml/whatever-we-use implementations) is just one of the benefits of avoiding statics. Static == singleton == global == anger == hate == the Dark Side.

pounard’s picture

Yeah, I lost hope that someday SuperCrell would come and say that! I'm not just saying that because I deeply hate YAML :) Which is no true, I do not hate YAML, I just think it's not the best here.

gdd’s picture

Looking at the XML files side by side with the YAML ones, it is really striking how much more readable the YAML ones are. You also realize that in most cases (providing basic config with a module) whitespace isn't much of an issue because there is no indentation. I feel pretty ready to pull the switch on the YAML change.

A couple comments on the patch itself

- It will need a reroll because of #1567812: Remove "Verified" from configuration class names
- A lot of the comments here (like the need for an interface) mirror comments we've been making in #1560060: [meta] Configuration system roadmap and architecture clean-up. I'd prefer to manage just the format change in as minimally invasive a way as possible in this patch, and leave the rest for followups. We can then do the PHP serialization in #1500238: Encode/decode data via PHP serialize()/unserialize() for DatabaseStorage (storage controller specific data formats) and work on cleaning up the abstraction of the objects in another followup. Given that I'm happy leaving the statics for the moment knowing we will be cleaning up stuff down the road. Does this sound good to people?

gdd’s picture

Status: Needs review » Needs work
Crell’s picture

I am OK with incremental steps toward cleaning up the API and leaving the statics in place for now, as long as they're slated for refactoring sooner or later.

However, I do disagree about the YAML file being more readable than XML.

Should this issue be highlighted further since it would be making a rather significant change in an issue whose title does not at all imply what the latest patch is doing?

(Note: That's not intended as a stalling tactic or undermining tactic. If we end up with YAML I will grumble a lot and move on with my life. But given how much discussion there was before, I am concerned about the implications of that being perceived as a "stealth change", rightly or wrongly.)

webchick’s picture

Title: XML encoder can only handle a small subset of PHP arrays » XML encoder can only handle a small subset of PHP arrays, so switch to YAML

Amending title, because I agree it's currently false advertising. :)

moshe weitzman’s picture

I read the whole thread and reviewed the YAML patch. I think YAML is quite a bit more readable, and more easily edited as well. Pull that trigger, heyrocker.

bojanz’s picture

I like this as well. Go go go!

pounard’s picture

So I have no other choice than accept YAML if anyone thinks it's better. I still think that changing every now and then of file format is a red flag and that something else higher in the API is wrong. Let's go YAML and open other issues for other concerns.

effulgentsia’s picture

I have no objection to YAML. However, given some people's concerns with a format where whitespace has meaning, I'm wondering why JSON is off the table. In http://groups.drupal.org/node/167584, the arguments were:

First is that you can't put comments in it, and this really won't work for files that we expect devs and sysadmins to be able to hack around on at will.

This can be solved by preprocessing the string before calling json_decode(). See https://github.com/getify/JSON.minify for details and an example PHP implementation.

Additionally, json_encode() does not properly encode UTF8 characters

This is solved in PHP 5.4 with a JSON_UNESCAPED_UNICODE option. We can add a drupal_json_encode() wrapper to emulate this for PHP 5.3.

Are there any other benefits that YAML has over JSON that's relevant for our needs?

pounard’s picture

JSON.minify sounds like a huge hack: comments are not part of the JSON standard, and may disturb some parser (even if most of them sounds OK with that) - Anyway, rewriting a wrapper around the native JS handling functions sounds like another one Drupal WTF, the main argument pro-JSON is that it's parsing and writing is really fast, but if we start adding our own layer, this won't be true anymore. Remember that YAML actually is a subset of JSON regarding its capabilities. -- I'm still anti-YAML, but let's face the truth, it's better than JSON for configuration.

EDIT: Also remember that every question we ask here probably already has an answer on StackOverflow.

jide’s picture

Not to mention escaping quotes...

effulgentsia’s picture

Whether YAML is actually a superset of JSON is disputed, but regardless, the choice of YAML vs. JSON will affect how we in practice structure and comment core config files. If we choose YAML, we'll use its commenting format (# to end of line) which only works in whitespace-delimited block flow, not in JSONish inline flow. Even without the commenting consideration, we'll probably still use whitespace-delimited block flow in most places. I'm okay with that, but want to make sure we're choosing that over JSON (with C/JS/PHP style commenting) explicitly, not because of inaccurate perceived limitations of JSON.

pounard’s picture

I'm not convinced we are misperceiving JSON, it's a simple format in the end. However YAML is much more complicated than it seems, and I just wanted to be sure people are not misperceiving it! I think people are misperceiving XML too, but I bet this one always seems a lot harder than it really is. In the end, XML has a set of really really nice features that people often ignore. But YAML it is, I will just grunt every time I will manually edit one of those files, let's proceed.

gdd’s picture

In my mind, having to maintain our own code and wrapper functions is a critical disadvantage. The more code we can push off to Symfony the better, and their YAML library is very good.

It has turned out in practice that the commenting stuff isn't really useful regardless of format. Any parser is going to strip them, and then they are gone and we have no way of saving them back out to the files when we export later. So that whole line of thought was largely abandoned.

chx’s picture

Reading this issue aside from vague generalizations ("you should first rethink the overall CMI API design") I can not see any technical objections against YAML. Our concerns were first that YAML en/decode is userspace hence slower but as the active storage format can be different from the file format and sun has the code to do just that, this is moot. Everyone with a web programming affinity hates XML with a passion why do you think YAML and JSON were created?

YAML readable, commentable, maps better to what we want to do. Go with it.

damiankloip’s picture

The Symfony YAML lib is good, it will be fine for what we need. chx has summed it up perfectly in #85 I think.

pounard’s picture

Just for posterity, my generalizations will go to other issues.

gdd’s picture

#1500238: Encode/decode data via PHP serialize()/unserialize() for DatabaseStorage (storage controller specific data formats) was committed last night, so this will need a reroll. #1493098: Convert cron settings to configuration system was also committed, so we will need to add that config file into this patch as well.

ksenzee’s picture

Assigned: Unassigned » ksenzee

I'll reroll and review at the same time.

ksenzee’s picture

I'm finding it really difficult to incorporate all the stuff that's been committed since I last had two seconds to look at CMI. I'm still rerolling the patch (more like a rewrite than a reroll what with all the changes) so if someone else is dying to work on it feel free to assign yourself.

alexpott’s picture

Assigned: ksenzee » alexpott

I've rolled this patch a few times :)... will take it on...

alexpott’s picture

Status: Needs work » Needs review
FileSize
173.32 KB

Here we go... I haven't attached an interdiff because of the changes due to the serialisation patch.

Whilst converting the system.cron.xml file I noticed that we have a key/value set in there that is never used - cron_max_threshold. But I think this as well as the removal of the image style ieid key/value can be covered by a follow up patch.

One addition to the previous patches is the ability to pass a directory to the FileStorage constructor which makes the code to parse config files in a module's directory quite a bit nicer as then the decode won't have to be a static.

alexpott’s picture

Realised that the read function in FileStorage could be simplified and needed to always return an array - or throw an exception.

tim.plunkett’s picture

+++ b/core/lib/Drupal/Core/Config/FileStorage.phpundefined
@@ -10,14 +11,24 @@ namespace Drupal\Core\Config;
    *   The name for the configuration data. Should be lowercase.
+   *
+   * @param string $directory

In case this needs to be rerolled, there shouldn't be a blank line here.

+++ b/core/lib/Drupal/Core/Config/FileStorage.phpundefined
@@ -99,43 +117,7 @@ class FileStorage {
+    return Yaml::dump($data, 5);

Any use in documenting what this 5 is? Or would I know that by looking at the Yaml docs?

alexpott’s picture

Tim I think you're right we should document the 5. I'll include this in any re-roll once more comments are in.

  public static function encode($data) {
    // Once the array nesting level reaches 5 switch to inline YAML for
    // readability.
    return Yaml::dump($data, 5);
  }
catch’s picture

<?php
  public static function encode($data) {
    // Once the array nesting level reaches 5 switch to inline YAML for
    // readability.
    return Yaml::dump($data, 5);
  }
?>

Something like field instance settings could end up with variable levels of array nesting depending on what the settings are, at least potentially. Won't this make it harder to diff files in that case?

fietserwin’s picture

core/lib/Drupal/Core/Config/FileStorage.php:

+      throw new FileStorageReadException('Read file is does not exist.');

Incorrect English. BTW: "... is does ..." appears 3 times in the patch.

alexpott’s picture

Made changes to #93 according to comments above. There was only one incorrect "is does" in the patch - now fixed.

Also removed the public statics from the StorageInterface as they are unnecessary.

Regarding #96 - shouldn't diffs be ok as long as the nesting level at which we switch does not change?

pounard’s picture

You should put back the "public" keyword, non scoped methods are public historically, but if we want to maintain consistency across all classes, we should always make the scope operator explicit.

pounard’s picture

FileStorage::getFileExtention() is defined as is:

+  public static function getFileExtension () {
+    return 'yml';
   }

And you call at this place as an object method, while it's defined as a class method:

+  public function setFilePath($directory) {
+    $this->filePath = $directory . '/' . $this->name . '.' . $this->getFileExtension();
   }

PHP won't yell but this is semantically wrong.

gdd’s picture

One addition to the previous patches is the ability to pass a directory to the FileStorage constructor which makes the code to parse config files in a module's directory quite a bit nicer as then the decode won't have to be a static.

Do we ever have to parse config files in a module directory? We have to copy them but we should never be actually parsing and using the data in them. That should always come from the files in $config_directory.

Something like field instance settings could end up with variable levels of array nesting depending on what the settings are, at least potentially. Won't this make it harder to diff files in that case?

I agree, I'm not really in favor of changing to inline at a certain level of nesting. I'd rather have everything be consistent even if it impacts readability to some extent.

yched’s picture

Something like field instance settings could end up with variable levels of array nesting depending on what the settings are

Just to weigh in on that, we should definitely leave out the assumption that every subsystem has complete knowledge of the inner structure of its config objects. Any "pluggable" component will have nested parts that are up to the specific plugin in use, and that can be any arbitrary data structure (scalars, arrays, nested arrays...) : field settings, widget settings, image effect settings, views filter options...
The location of those parts is "predictable" by the subsystem, but the content is arbitrary.

fietserwin’s picture

#98: the other "is does" (1 left now) is in core/vendor/Symfony/Component/Yaml/Tests/Fixtures/YtsSpecificationExamples.yml. I guess we do not want to change that ourselves.

alexpott’s picture

New patch that:

  • sets the nesting level to PHP_MAX_INT
  • implements code improvement in #100
  • changes encode and decode to private for File Storage
  • removes the static from the Storage Interface and Database function - I think the encode and decode functions have no business being in the interface. A class that implements StorageInterface should implement a write function that accepts an array and a read that returns the array as it was when it was written. How it does that is the concern of the implementing class and should be transparent to the caller. Once this patch has landed there will be further patches to clean up the interface #1560060: [meta] Configuration system roadmap and architecture clean-up
alexpott’s picture

#101: We have been reading the config files from the module directory in config_install_default_config(). If you want to replace this with a copy from the module directory and then a read from the copied file should we do this is a follow up patch?

The change this patch makes is to not use a static decode method to do this but to use a FileStorage object which I think makes the code more consistent - ie. in all instances where we read a configuration file we use a FileStorage object to do so.

pounard’s picture

Various notes

  • Private is evil, you should consider downgrading the scope to protected, it achieve the same goal for us and allow inheritace, and why is that bad to keep them public anyway? It could be useful to some modules eventually
  • If FileStorage doesn't implement StorageInterface, then you cannot document the various Implements StorageInterface::encode() since it's not an implementation of
  • The fact that FileStorage doesn't implement StorageInterface is confusing, is that wanted? if so, why and this should probably be documented because the names are too close not to be confusing to people
  • The FileStorage class should probably be called YamlFileStorage instead, even if it's the only one, this would be more explicit, it also would help eventually when we'll have more than one file type handled
  • The fact that you are calling getFileExtension() statically everywhere definitely defies the purpose of having an object in the end, for consistency I'd switch this method to be an instance method and I'd always work with instances instead
alexpott’s picture

#106:

  • Changed to protected - I don't think that encoding / decoding should be accessible from the public interface
  • It will do - but a follow up will be needed
  • As above
  • Agreed makes a lot of sense
  • This should be explored in a follow up
sun’s picture

Status: Needs review » Postponed

NACK on the YamlFileStorage rename. We'll only support one format for the file storage, and you will not be able to simply change the default format.

Second, FileStorage will implement StorageInterface. This has been discussed before. Since we remain to be confused about that, let's postpone this issue on #1447686: Allow importing and synchronizing configuration when files are updated

pounard’s picture

Even if we support only one file format, having Yaml in the class name makes a lot of sense. At least we know it's gonna deal with Yaml even without opening the source code file, we can see it thanks to the file name, which is a win for comprehensability IMHO. And in any case, the class is hardcoded with Yaml support inside, so it's just legit to say it in its name.

And I know incremental patches and all, but still, anticipating a potential future where we could support more than one file format starts with naming classes correctly. Even if we don't, it's still a win for the reason upper, and probably many more.

If I stand correct regarding

We'll only support one format for the file storage, and you will not be able to simply change the default format.

then it has nothing to do its in own encapsulated class, so either way, let's name it correctly.

gdd’s picture

Status: Postponed » Needs review

I don't want to postpone this issue on the import one, because import is still going to take forever to sort out and I'd like to see this land as quickly as possible, otherwise its going to get stuck in reroll hell.

pounard’s picture

Yes, agree, let's do this.

neclimdul’s picture

I fully support the storage rename. I think us supporting only one format is a poor excuse for descriptively naming the class especially if its an interface. We don't know how the class will end up being used at the end of the day and possibly in contrib and furthermore having an appropriately named class is just the right thing to do.

alexpott’s picture

Just to clarify... the comments on recent change of class name... I asked on IRC:

Alexpott: neclimdul: re: http://drupal.org/node/1470824#comment-6014114 just so I understand because it's not 100% clear… you want the class to be to FileStorage and not YamlFileStorage...

webchick: alexpott: no, the opposite

neclimdul: alexpott: no, i want it to be yamlfilestorage

sun’s picture

Assigned: alexpott » sun
FileSize
181.98 KB

I already mentioned that I'd prefer to get the other patch in first, but alas, we want to tackle this one first. In turn, I spent the last couple of hours to bring those two architecturally in line, which was a major PITA.

That is, because we have the CMI sandbox for a reason: Major development efforts need to be able to run in parallel. And at this point, we're not really using the sandbox, so we're stepping on each other's foot by doing X here and a conflicting Y change there.

In this patch:

* Reverted and rewrote various things to be in line with parallel patches.

- Prepare for FileStorage implementing StorageInterface. (no custom constructor arguments allowed)
- config_get_files_with_prefix() is obsolete.
- Retain encode() and decode() as public static methods, and explain why (meh).
- Revert removal of type-casting of all config values to strings. Needs separate issue + discussion.
- Only the path on FileStorage can and must vary, not the full filepath+name.

And lastly, re: #109 + #112:

No. Again: The format/encoding is specific to each configuration storage controller. For the same reason, we don't call the active storage "PHPSerializedDBTNGSQLDatabaseStorage". We already discussed in the past whether it would make sense to abstract the encoding/decoding into separate handlers, but as mentioned before, the format/encoding is factually hard-coded for a storage controller, and even more so, for the file configuration format. You can exchange controllers, but you will NOT be able to override the default file storage controller that is required for loading and parsing default configuration provided by all modules, themes, and installation profiles. That said, I'm happy to discuss this further - but that's a completely separate issue, so absolutely not in this issue.

Status: Needs review » Needs work

The last submitted patch, config.yaml_.112.patch, failed testing.

sun’s picture

Status: Needs work » Needs review
FileSize
181.97 KB

Sorry, stupid mistake.

Status: Needs review » Needs work

The last submitted patch, config.yaml_.116.patch, failed testing.

pounard’s picture

#107 passed the tests.

alexpott’s picture

Status: Needs work » Needs review
FileSize
983 bytes
182.72 KB

#1496462: Convert RSS publishing settings to configuration system had been committed... confirming Heyrocker's suspicion that if we don't get this in fast we're into re-roll after re-roll after...

Patch attached should work...

Status: Needs review » Needs work

The last submitted patch, config-yaml-1470824-119.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
FileSize
823 bytes
182.74 KB

Bah!

alexpott’s picture

Rerolled to ensure that yml files match what Drupal would create.

Status: Needs review » Needs work

The last submitted patch, interdiff-121-122.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
FileSize
2.55 KB

Ho hum... my interdiff failed testing... :)

alexpott’s picture

Issue tags: -Configuration system

#122: config-yaml-1470824-122.patch queued for re-testing.

alexpott’s picture

Issue tags: +Configuration system

#122: config-yaml-1470824-122.patch queued for re-testing.

gdd’s picture

Status: Needs review » Needs work
+++ b/core/includes/config.incundefined
@@ -28,7 +28,7 @@ function config_get_config_directory() {
- * Moves the default config supplied by a module to the live config directory.
+ * Moves the default config supplied by a module to the active store.

This comment change isn't quite true is it? It is copying it to the active store AND moving it to the live config directory.

+++ b/core/includes/config.incundefined
@@ -40,55 +40,23 @@ function config_install_default_config($module) {
       $storage = new DatabaseStorage($config_name);
-      $data = FileStorage::decode(file_get_contents($module_config_dir . '/' . $file));
-      $storage->write($data);
+      $filestorage = new FileStorage($config_name);
+      $filestorage->setPath($module_config_dir);
+      $storage->write($filestorage->read());

It seems like $storage should be $dbstorage here, if the other is going to be $filestorage. I was a little confused reading this at first.

+++ b/core/modules/system/tests/modules/config_upgrade/config/config.test.ymlundefined
@@ -0,0 +1,2 @@
+config_test_foo: 'bar'

I notice that here we are surrounding text strings with single quotes, whereas elsewhere in the patch we are not. What's the standard for this?

This is all pretty nitpicky, this is very close. Thanks for all the work people.

11 days to next Drupal core point release.

neclimdul’s picture

I tried to take a stab at YamlFileStorage and realized why people where uncomfortable with it being in this patch. TL:DNR I take back my call to rename FileStorage, at least in this patch.

There's a bit too much responsibility built into FileStorage and assumptions made by the calling methods. Its expected to be both the global interface for any "file storage activity" and also "the abstraction of file format activities." This makes the interface a bit of a lie because this coupling makes it impossible to actually use the interface. There is a noble effort to generalize some of that responsibility in this patch which it probably good but it is going to be impossible to refactor these to responsibilities to make a sane YamlFileStorage class and convert from XML to Yaml in the same patch.

neclimdul’s picture

It seems like $storage should be $dbstorage here, if the other is going to be $filestorage. I was a little confused reading this at first.

It took me a couple reads to catch what was happening also. The way the patch reads makes it harder but the interaction between the two storage mechanisms here is a bit confusing and a better name for $storage might help.

alexpott’s picture

Status: Needs work » Needs review
FileSize
182.62 KB
1.36 KB

Incorporated comments #127 and #129

yched’s picture

Shouldn't the vars be $file_store and $active_store rather than $file_storage and $db_storage ?
The logic dance here is between file store as opposed to active store, not really between file and db.
If I'm not mistaken, active store *will* be pluggable, and could very well live in memcache.

alexpott’s picture

Changed comment to "Moves the default config supplied by a module to the active store and the live config directory."

Status: Needs review » Needs work

The last submitted patch, config-yaml-1470824-131.patch, failed testing.

alexpott’s picture

Status: Needs work » Needs review
FileSize
421 bytes
182.92 KB

fixed phpdoc issue...

alexpott’s picture

re Everyone who has brought up Architectural / Nomenclature issues (eg #131 and several Pounard's posts) these will be covered by follow up issues which will be collected in the meta issue: #1560060: [meta] Configuration system roadmap and architecture clean-up

alexpott’s picture

Improve code docs for config_install_default_config()

gdd’s picture

Status: Needs review » Reviewed & tested by the community

After further review and discussion with everyone involved, I think we are ready to go.

webchick’s picture

Issue tags: +Avoid commit conflicts

Woo!

This is one of those big hairy issues that could probably benefit from the "avoid commit conflicts" tag.

sun’s picture

catch’s picture

catch’s picture

Title: XML encoder can only handle a small subset of PHP arrays, so switch to YAML » Change notification for: XML encoder can only handle a small subset of PHP arrays, so switch to YAML
Category: bug » task
Priority: Major » Critical
Status: Reviewed & tested by the community » Active

Thanks folks. Committed/pushed this one to 8.x.

I'm sticking this into change notification purgatory in case there's references to XML in existing change notifications.

webchick’s picture

Title: Change notification for: XML encoder can only handle a small subset of PHP arrays, so switch to YAML » XML encoder can only handle a small subset of PHP arrays, so switch to YAML
Priority: Critical » Normal
Status: Active » Needs work
FileSize
2.36 KB

Searched http://drupal.org/list-changes for XML, found:

http://drupal.org/node/1511542
http://drupal.org/node/1500260

...and updated all XML/.xml things to YAML/.yml. So that part's fixed.

However, that compelled me to grep the 8.x source code for references to 'xml' and found a couple of stragglers, including some tests that.. I have no idea why they didn't fail above. Patch attached. I'm not sure what to do with the invalid XML test, so left that as a todo.

webchick’s picture

Status: Needs work » Needs review

Well, for the bot...

Status: Needs review » Needs work

The last submitted patch, xml-to-yaml.patch, failed testing.

webchick’s picture

Status: Needs work » Needs review
FileSize
2.37 KB

LOL I fail at life. :)

webchick’s picture

Priority: Normal » Major

Since those tests are extremely broken right now, escalating to major.

webchick’s picture

Category: task » bug
Crell’s picture

Status: Needs review » Reviewed & tested by the community

#145 looks a no-brainer.

webchick’s picture

Hm. What about this bit though?


       'invalid xml' => '</title><script type="text/javascript">alert("Title XSS!");</script> & < > " \' ',

Is it possible to replicate XSS vulnerability / other parsing error in YAML?

alexpott’s picture

The reason the invalid xml test is here is that the old xml dumper could not handle it. The point being, if we were to move to another file format (hopefully not) then it has to pass the "invalid xml" test - which the Yaml dumper does :)

Perhaps this needs explaining more in the test.

sun’s picture

Category: bug » task
Priority: Major » Normal
Status: Reviewed & tested by the community » Needs review
Issue tags: -Avoid commit conflicts
FileSize
3.12 KB

Thanks for the follow-up patch. I looked at that yesterday already, but wasn't entirely sure how to follow up.

To move the trivial thing out of the way, @alexpott is right, the "invalid xml" key just needs a better key name, because it really only exists to prove that YAML can natively handle values that caused sick problems before. Sames applies to the phpDoc change from xml to yml; trivial.

Now, let's get to the meat.

ModuleTestCase shouldn't have been able to pass with this change to YAML. I've no idea who wrote that test and I definitely don't want to step on anyone's toes, but this is yet another example for a poorly coded test. The test goes to great lengths to assert that config files are read and parsed and installed and whatever, but alas, blatantly forgets to assert that there is some config to assert in the first place.

Almost all of our tests throughout core are commonly suffering from this lack of properly asserting the most basic and most trivial conditions you could think of. :-/

This "minor detail" is able to lead to fundamental regressions all over the place (as nicely demonstrated here), and that's the only reason for why I'm pointing it out. Again, I don't want to blame whoever wrote this test. But rather: this happens all over the place, and all of the existing tests perfectly serve as examples for how to not write tests. We absolutely have to fix that.

(And before it comes up: No, switching to another test framework will not magically fix the actual assertions being performed in a test. Thanks.)

neclimdul’s picture

Status: Needs review » Needs work
+++ b/core/modules/config/config.testundefined
@@ -251,7 +251,7 @@ class ConfigFileContentTestCase extends WebTestBase {
-      'invalid xml' => '</title><script type="text/javascript">alert("Title XSS!");</script> & < > " \' ',
+      'invalid if it were xml' => '</title><script type="text/javascript">alert("Title XSS!");</script> & < > " \' ',
     );
 

What about "properly escape html" or some sort of similar comment. This could totally be valid in an XML system it would just have to be trivially escaped. Also, we don't need to explain the history of why the key exists, that's what git is for. When looking at the code we need to know what its testing.

sun’s picture

Status: Needs work » Needs review
FileSize
3.11 KB

Better.key.

sun’s picture

Removing the phpDoc fix in update.inc, as that will be corrected much more properly in #1496542: Convert site information to config system

cosmicdreams’s picture

Status: Needs review » Reviewed & tested by the community

Simple & clean patch

webchick’s picture

I personally find the old test key name much more obvious, so here's an alternate patch with only the second hunk.

It sounds like we're not sure how to come up with a test for invalid YAML. I wonder if we should add a @todo for that?

ZenDoodles’s picture

I agree with webchick re: yaml validation, but I think it should be a new issue! Also, there is yaml validation functionality in Symphony. I was just reading https://github.com/symfony/symfony/issues/4021 not too long ago.

catch’s picture

Status: Reviewed & tested by the community » Fixed

Thanks. Committed/pushed the follow-up.

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

added 'input filters' example