Investigate better alternatives for encode() / decode() functionality [#1288090]

Right now the config_encode() and config_decode() functions are really hacky. The encode() functionality is just something I found on the internet, and decode() does a conversion through JSON because it is fast. There are probably better options for both. The plan right now is to keep the actual XML as simple as humanly possible, basically not much more than a key/value pair store, so keep that in mind. Performance and functionality need to be weighed pretty evenly.
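
To make the discussion concrete, here is a minimal sketch of the kind of round trip described above: an associative array written out as very simple key/value XML, and read back through a JSON conversion. The function names `example_config_encode()`, `example_config_decode()` and `example_array_to_xml()` are hypothetical and are not the actual code in the repository.

```php
<?php
// Hypothetical helpers for illustration; not the real config_encode()/config_decode().

/**
 * Recursively appends an associative array to an XML element as child elements.
 */
function example_array_to_xml(array $data, SimpleXMLElement $parent) {
  foreach ($data as $key => $value) {
    if (is_array($value)) {
      example_array_to_xml($value, $parent->addChild($key));
    }
    else {
      // addChild() does not escape '&', so escape the value first.
      $parent->addChild($key, htmlspecialchars((string) $value));
    }
  }
}

function example_config_encode(array $data) {
  $xml = new SimpleXMLElement('<config/>');
  example_array_to_xml($data, $xml);
  return $xml->asXML();
}

function example_config_decode($xml_string) {
  // Round trip through JSON to turn the SimpleXML tree into nested arrays.
  return json_decode(json_encode(simplexml_load_string($xml_string)), TRUE);
}

$config = array('site' => array('name' => 'Example', 'slogan' => 'Hello'));
$decoded = example_config_decode(example_config_encode($config));
```

Note that every scalar comes back as a string, since XML text carries no type information.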

The current status is described here: http://www.heyrocker.com/node/238 and here: http://www.heyrocker.com/how-use-drupal-8-configuration-system.

Comments

Comment #1

pounard commented 4 October 2011 at 19:18

Further, those functions might need to be able to parse attributes (at least the language attribute, from what I heard).

Comment #2

gdd commented 6 October 2011 at 09:50

Some other issues to consider:

- Right now the JSON conversion forces everything into objects OR arrays; we have no way to mix and match. A different decode mechanism would allow us to use an object/array specification in the schema.
- Ideally the serialize/deserialize process will retain the XML comments, otherwise they will get lost when written back out to files.
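
The objects-or-arrays limitation in the first point comes from how `json_decode()` works: its second argument switches the whole structure at once. A small sketch, with made-up sample data:

```php
<?php
// Sample data invented for illustration.
$json = '{"name": "frontpage", "fields": ["title", "body"], "display": {"pager": 10}}';

// Every JSON object becomes a PHP associative array...
$as_arrays = json_decode($json, TRUE);
// ...or every JSON object becomes a stdClass. There is no per-key choice.
$as_objects = json_decode($json);

$display_is_array = is_array($as_arrays['display']);
$display_is_object = is_object($as_objects->display);
// JSON lists stay plain PHP arrays in both modes.
$fields_is_array = is_array($as_objects->fields);
```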

Comment #3

rwohleb commented 12 October 2011 at 18:20

I'm trying to get caught up on this initiative, so I'm sorry if this is addressed elsewhere. What is the reasoning behind using XML for storage rather than just JSON? Drupal already has decent JSON encode/decode support.

Comment #4

mitchell commented 15 October 2011 at 07:21

Status: Active » Needs review

I found this array-to-domdocument library which might be of value. Here are a few notes:

- It uses DOM instead of SimpleXML.
- It supports attributes.
- I found it in this discussion: http://www.devexp.eu/2009/04/11/php-domdocument-convert-array-to-xml/
- A sandboxed project in contrib, Services Docs, uses it for generating a WSDL/WADL file of a Drupal app.
- It is MIT licensed.

@rwohleb: Here you go: "Configuration management sprint - file formats" and "File format discussion continued".

Comment #5

gdd commented 8 December 2011 at 12:39

I have spent some time in the past week with this and other XML parsers, and here are my findings and associated thoughts.

This parser generates very complicated arrays. This means that existing Drupal code that relies on these arrays would be more complicated and difficult to maintain.
The main reason these arrays are so complicated is because they acknowledge two things XML has that Drupal's simple associative arrays don't - attributes and the ability to have multiple keys at the same level with the same name.
There are parsers that make simpler arrays, but they inevitably don't support one of these situations. And honestly, if you don't support items at the same level with the same name then you don't meaningfully support attributes anyway.
We also have two design constraints which push us towards not supporting these properties of XML. One is that we want to be able to have a different data format in the active store than in the files. This is because the active store has different priorities than the files (files need to focus on human readability, the active store needs to focus on performance). So whatever structure we have needs to map easily between various formats. The other is that we want to keep things as portable as possible in case we decide to change the format along the way. We can't do that easily if we start spreading code all over core that relies on having a '#attributes' key always available (since no other format will have this construct).
All of these things have pushed me towards a very simplified XML format that is nothing more than an XML representation of an associative array.
Additionally, one of the biggest arguments we made for supporting XML was that it allows native commenting. However, in practice this turns out to be very troublesome. For instance, say that we have XML for the files and serialized PHP arrays in the active store (I think this is a reasonably likely scenario in the end). The XML gets read and transformed into a PHP array, so what happens to the comments? We don't really want them in that array, because they will eat memory we don't need and make the arrays more complicated, since we'll just want to strip them out whenever we iterate items. So we strip them out, and then what happens when this data is written back out to the files? The comments are gone forever and the rewritten file doesn't have them. This is even a problem if we keep straight XML in the database, because at some point we have to transform out of XML to something PHP can read natively, and we won't want the comments in this structure. Note this would also be a problem if we implemented '#comment' or whatever in a JSON structure. I'm not sure comments are really something we'll be able to support in a meaningful way.
Given this and the general lack of support for XML from the community, I am now questioning just what gains we're getting from XML and whether we shouldn't just move to JSON and be done with it. It will be faster and the parsing will be easier. We won't be able to represent objects and arrays mixed in one structure, but I'm willing to just say 'Look, everything is an array, deal with it.' I've already had a preliminary discussion with Earl about switching Views to a real export format in order to support this and he seems generally open to it. The one big problem with JSON is still the encoding of UTF-8 data, which is ugly.
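
As an illustration of the UTF-8 point, this is what `json_encode()` does with a non-ASCII string by default. The sample string is invented. The `JSON_UNESCAPED_UNICODE` flag shown as a contrast only exists from PHP 5.4 on, so whether it is an option depends on the minimum PHP version, which is an assumption here.

```php
<?php
// UTF-8 bytes for "Contenu récent" (sample value for illustration).
$title = "Contenu r\xc3\xa9cent";

// Default output escapes every non-ASCII character as \uXXXX.
$escaped = json_encode($title);

// JSON_UNESCAPED_UNICODE (PHP 5.4+) keeps the characters readable in the file.
$readable = json_encode($title, JSON_UNESCAPED_UNICODE);
```

The first form is valid JSON but hard to read and edit by hand, which matters for files meant to be human readable.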

So that's where I'm at right now. For the moment, the code in the repo has been updated to have a better parser (courtesy of EclipseGC and rszrama) and the XML files are very simple. More discussion welcomed!
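
The comment-loss problem described above can be reproduced in a few lines. This is only a sketch using DOMDocument with an invented two-key config file; it is not the parser that is in the repo.

```php
<?php
$xml = <<<XML
<config>
  <!-- Cache lifetime in seconds. -->
  <cache_lifetime>300</cache_lifetime>
  <site_name>Example</site_name>
</config>
XML;

// Read: keep only element nodes, as a plain key/value array.
$doc = new DOMDocument();
$doc->loadXML($xml);
$data = array();
foreach ($doc->documentElement->childNodes as $node) {
  if ($node->nodeType === XML_ELEMENT_NODE) {
    $data[$node->nodeName] = $node->textContent;
  }
}

// Write: rebuild the file from the array alone.
$out = new DOMDocument('1.0', 'UTF-8');
$out->formatOutput = TRUE;
$root = $out->appendChild($out->createElement('config'));
foreach ($data as $key => $value) {
  $root->appendChild($out->createElement($key, $value));
}
$rewritten = $out->saveXML();

// The array never held the comment, so the rewritten file cannot contain it.
$comment_survived = strpos($rewritten, '<!--') !== FALSE;
```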

Comment #6

Damien Tournoud commented 8 December 2011 at 13:25

There is nothing to gain from XML given those constraints. If you want XML anyway, you can use some standard key/value DTD, like the Apple Property List format, which should be supported by most IDEs out there.

One additional question that has not been a big focus yet is mergeability. None of the serialization formats that have been suggested (JSON, XML, PHP) have good mergeability. If you have already worked collaboratively using the Features module (to export Views, Panels, Fields, etc.) you have probably already bumped into this issue: none of the common VCSs and IDEs out there know how to reasonably merge this type of file, because you need more than the standard line-based merge technique (see A State-of-the-Art Survey on Software Merging by T Mens for an overview of merge techniques).

This is going to be even more of a problem if we package all the configuration using the same format. I'm not completely sure what the solution is at this point, but it is a discussion really worth having.

Comment #7

pounard commented 8 December 2011 at 14:17

My guess is that any machine/serialization oriented configuration format won't be easily mergeable if it has not been created with human readability in mind.

The best format ever for this is a plain good old ini file, or possibly YAML.

Comment #8

Damien Tournoud commented 8 December 2011 at 14:26

Human readability and machine mergeability are two independent concepts.

The main mergeability issue we currently have with structured text formats (pretty JSON, pretty XML, YAML) is that the tree structure is not represented in each line, so the merge tool is going to try to merge independent parts of the tree.

Typical example: two developers are adding a different field to a View; instead of adding those below each other, a line-based merge tool is going to try to merge them, because some of the lines are common in those blocks.

One way of fixing that would be to materialize the whole tree path at every line.
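
A rough sketch of what "materializing the whole tree path at every line" could look like. The function name `example_flatten()`, the dot separator and the sample View data are all assumptions made for illustration.

```php
<?php
/**
 * Flattens a nested array into "path = value" lines, one per leaf.
 *
 * Hypothetical helper; the separator and line format are arbitrary choices.
 */
function example_flatten(array $data, $prefix = '') {
  $lines = array();
  foreach ($data as $key => $value) {
    $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
    if (is_array($value)) {
      $lines = array_merge($lines, example_flatten($value, $path));
    }
    else {
      $lines[] = $path . ' = ' . $value;
    }
  }
  return $lines;
}

$view = array(
  'fields' => array(
    'title' => array('table' => 'node', 'field' => 'title'),
    'created' => array('table' => 'node', 'field' => 'created'),
  ),
);
$lines = example_flatten($view);
```

Because each line carries its full path, the lines for two different fields are never identical, so a line-based merge tool has no common lines to confuse between the two blocks.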

Comment #9

pounard commented 8 December 2011 at 14:33

"Human readability and machine mergeability are two independent concepts."

While they are indeed two different concepts, in real life pretty much all merge algorithms (at least those we use every day: git, svn, diff, etc.) merge on a per-line basis, pretty much the same way you format your own code to make it human readable.

Comment #10

Damien Tournoud commented 8 December 2011 at 14:35

"While they are indeed two different concepts, in real life pretty much all merge algorithms (at least those we use every day: git, svn, diff, etc.) merge on a per-line basis, pretty much the same way you format your own code to make it human readable."

Not exactly: human readability is often necessary for line-merging algorithms, but it is *far* from sufficient.

Comment #11

pounard commented 8 December 2011 at 14:40

Yes of course, but it plays its role. The most common diff algorithm is LCS (longest common subsequence), and it definitely plays very well with human-readable text, probably a lot better than with any compiled (binary or not) data. We can consider XML to be almost binary when you write it with no pretty formatting, given that the order doesn't matter.
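
To illustrate the point, here is a small line-based LCS sketch applied to a pretty-printed and a single-line version of the same (invented) XML. It is a plain textbook dynamic-programming LCS, not the exact algorithm any particular tool implements.

```php
<?php
/**
 * Returns the length of the longest common subsequence of two line lists.
 */
function example_lcs_length(array $a, array $b) {
  $n = count($a);
  $m = count($b);
  // $table[$i][$j] = LCS length of the first $i lines of $a and first $j of $b.
  $table = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
  for ($i = 1; $i <= $n; $i++) {
    for ($j = 1; $j <= $m; $j++) {
      if ($a[$i - 1] === $b[$j - 1]) {
        $table[$i][$j] = $table[$i - 1][$j - 1] + 1;
      }
      else {
        $table[$i][$j] = max($table[$i - 1][$j], $table[$i][$j - 1]);
      }
    }
  }
  return $table[$n][$m];
}

// Pretty-printed: only the changed line differs between the two versions.
$pretty_old = array('<view>', '  <title>Recent</title>', '  <pager>10</pager>', '</view>');
$pretty_new = array('<view>', '  <title>Recent</title>', '  <pager>25</pager>', '</view>');
$pretty_common = example_lcs_length($pretty_old, $pretty_new);

// Single line: the whole file is one line, so nothing is shared.
$compact_old = array('<view><title>Recent</title><pager>10</pager></view>');
$compact_new = array('<view><title>Recent</title><pager>25</pager></view>');
$compact_common = example_lcs_length($compact_old, $compact_new);
```

With pretty formatting three of the four lines are recognized as unchanged; with the compact form a line-based tool sees the entire file as changed.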

Comment #12

Ralt commented 8 January 2012 at 18:37

What about ASN.1? There is a PHP library for it, the format is standardized, and the language is made to define rules and structure, which is what configuration is.

Just a wild idea, though.

P.S.: sorry for being somewhat off-topic, I couldn't find anywhere else to post this idea.

Comment #13

bojanz commented 28 January 2012 at 00:29

#4 looks interesting.
That said, I've always preferred JSON over XML (and didn't think the "_comment" convention was a bad idea back in the original discussions).
Still, it's partially-irrational (as with everyone when XML is discussed), so I haven't felt the need to jump into the holy wars until now.

Comment #14

philippejadin commented 15 February 2012 at 16:10

I don't really understand what we get with xml or json that we don't already have with php.

- any drupal user knows php
- php files are protected by the webserver
- php files are quick to parse
- php files are even quicker to parse when there is an opcode cache like apc
- php files (arrays) are quick to merge in php
- php files can be made easy to merge by a machine, look at this :


// this is the recent content view config code updated by bob on 12/01/02
// ^^^^^ look, comments !
// Titles are keyed by language so that 'title' stays an array; assigning a
// plain string to 'title' and then adding ['fr'] to it would fail.
$config['view']['recent_content']['title']['en'] = 'Recent content view title';
$config['view']['recent_content']['title']['fr'] = 'Contenu récent'; // note this could come from another file
$config['view']['recent_content']['handler']['display']['fields']['title']['id'] = 'title';
$config['view']['recent_content']['handler']['display']['fields']['title']['table'] = 'node';
$config['view']['recent_content']['handler']['display']['fields']['title']['field'] = 'title';
$config['view']['recent_content']['handler']['display']['fields']['title']['label'] = 'Title';
// [...]

PHP is so anchored in Drupal that my config is even highlighted (in color) in this comment :-)

I really don't see the point of using a file format that at the end, you will need to convert to a php array, instead of directly using a php array.

Using xml <-> array = impedance mismatch = developer nightmare

Comment #15

Crell commented 15 February 2012 at 18:19

1. PHP cannot be taken out of memory, ever, so it leaks memory.
2. PHP is a potential security attack vector.
3. PHP is not as human editable as you might think.

Point 1 is the deal killer. The other two are just icing. PHP was rejected months ago for good reason. Let's please not reopen that debate.

Additional data points: Composer uses JSON, and there's discussion of using Composer in core. However, Fabien from Symfony noted this weekend that he hates JSON as a config format (mostly due to the stupid trailing comma issue), and wondered why we were using XML without a schema. Of course his preference is YAML, which we also already rejected. :-)
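
For anyone who has not run into the trailing comma issue: a single extra comma after the last item, which is easy to leave behind when hand-editing, makes the whole document invalid. A minimal demonstration with invented data:

```php
<?php
// Sample JSON strings invented for illustration.
$strict = '{"name": "frontpage", "pager": 10}';
$trailing = '{"name": "frontpage", "pager": 10,}';

$ok = json_decode($strict, TRUE);

// The trailing comma is a syntax error: json_decode() returns NULL.
$broken = json_decode($trailing, TRUE);
$error = json_last_error();
```

Here `$error` is `JSON_ERROR_SYNTAX`, and nothing in the return value says where in the file the problem is.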

Take those data points as you will.

Comment #16

gdd commented 15 February 2012 at 18:26

The other problem with serialized PHP is that it is not interchangeable with any other systems unless they are also PHP. We have a stated desire of wanting to be able to easily integrate with deployment systems like Chef or Puppet, as well as people who roll their own. Serialized PHP is really terrible for this.
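
For comparison, this is the same small array (sample data, invented) in PHP's serialized form and as JSON. The serialized form is specific to PHP's `serialize()` and `unserialize()`, whereas JSON parsers are available in most languages.

```php
<?php
$config = array('site_name' => 'Example', 'cache' => 300);

// PHP-specific format with type tags and string lengths baked in.
$php_serialized = serialize($config);
// a:2:{s:9:"site_name";s:7:"Example";s:5:"cache";i:300;}

// Language-neutral format.
$as_json = json_encode($config);
// {"site_name":"Example","cache":300}
```

The embedded string lengths also make the serialized form fragile to edit by hand: changing a value without updating its length breaks `unserialize()`.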

However, it has to be said, all the formats suck in their own special way. We may yet switch away from XML, but whatever we switch to will have its own irritations. It is all about what we decide to prioritize.

Comment #17

pounard commented 15 February 2012 at 19:56

@heyrocker I think philippejadin was not talking about *serialized* PHP but about plain good old PHP files.

Comment #18

philippejadin commented 16 February 2012 at 08:42

I've seen the discussions about config formats. Sorry for reopening this, coming late to the discussion.

I have to say it, just because I have gone this way a long time ago (xml vs json vs php vs xyz), and it has been a very painful ride.

That XML converted to PHP arrays creates ugly structures should be taken as a fact.

Here is what I came to in 2006 for my homemade CMS (later I switched to Drupal ;-) ):

- http://svn.berlios.de/wsvn/thinkedit/trunk/config/tables-dist.php
- http://svn.berlios.de/wsvn/thinkedit/trunk/config/sample_config/yapaka.php

I can tell you that it was very easy to use, parse, even for a non developer. Please take it into account too.

I'll stop there, because I guess there is a bigger picture I probably don't understand.
This config initiative is in any case a great thing for Drupal. I hope module developers (Views for example) will use it!