As described in the referenced spec, the underscore is a perfectly valid and widely used character in CSS identifiers. drupal_clean_css_identifier should not transliterate underscores by default! Even the comments in the code say that the underscore is a valid character.

Files: 
CommentFileSizeAuthor
#33 drupal-remove_double_underscore_from_css_filter-1109854-33.patch652 bytesmbrett5062
PASSED: [[SimpleTest]]: [MySQL] 49,096 pass(es).
[ View ]
#31 drupal-revise_css_identifier_filter-1109854-31.patch651 bytesmbrett5062
FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in core/includes/common.inc.
[ View ]
#28 drupal-remove_underscore_from_css_filter-1109854-28.patch626 bytesmbrett5062
FAILED: [[SimpleTest]]: [MySQL] 49,175 pass(es), 1,169 fail(s), and 947 exception(s).
[ View ]
#15 drupal_clean_css_identifier-1109854-15.patch2.62 KBPisco
FAILED: [[SimpleTest]]: [MySQL] 29,387 pass(es), 20 fail(s), and 0 exception(es).
[ View ]
#14 drupal_clean_css_identifier-1109854-13.patch1.51 KBPisco
FAILED: [[SimpleTest]]: [MySQL] 29,377 pass(es), 20 fail(s), and 0 exception(es).
[ View ]
#13 drupal_clean_css_identifier.test1.23 KBPisco

Comments

It also appears as though the function was written towards the CSS specification, though the intended target is HTML!

It is also not usable for IDs and Names even though the PHPdoc indicates this.

AFAIK we should pick http://www.w3.org/TR/html401/ as specification. While CSS specification prohibits certain literals, their escape sequences can still be used.

Of course, the output of a strict function should be passed through check_plain as well before insertion into HTML.

To illustrate, see http://heine.familiedeelstra.com/html-css-spec-classes

Version:7.x-dev» 8.x-dev

Underscore is traditionally escaped, for legacy reasons that do not apply to nowadays browsers.

We could consider changing that, but that's D8 material.

The function most definitely seems to have been written towards the CSS specification, that's what the function name css_identifier suggests, and also what the documentation very clearly and unambiguously states.

I would not rely on the 12 year old HTML 4.01 spec. Even if you wish to be as lax as possible, I'd recommend commplying to a common denominator of both the HTML and CSS specification. It seems as if the CSS spec has stricter requirements, still being a subset of the other, that's why I'd recommend complying to the CSS requirements.

On a side node: the current HTML spec seems to be similar to the HTML 4.01 spec mentioned by Heine. The only requirement for the id attribute is, that the value must not contain any space characters. Similar for the class attribute which must have a value that is a set of space-separated tokens.

We could consider changing that, but that's D8 material.

I'd most definitely change this in D8! From what the name and the documentation of drupal_clean_css_identifier suggests, and according the aforementioned specs, transliterating dashes to underscores is a bug.

Off the record: this is how I noticed this issue: #939992: "CSS classes must be alphanumeric or dashes only." I think underscores should be included. This bug is already causing many problems to users, as popular CSS "frameworks" rely heavily on CSS identifiers having underscores. This means you have hack with workaround because Views relying on drupal_clean_css_identifier makes it impossible to use these CSS identifiers through the UI.

I know that it was based on CSS, but it seems the intended target of the returned result is an HTML attribute. It doesn't make sense to base that on the CSS spec, does it?

"{foo" is a perfectly valid class attribute value. All it needs is an escape sequence to be used in a CSS selector.

I know that it was based on CSS, but it seems the intended target of the returned result is an HTML attribute. It doesn't make sense to base that on the CSS spec, does it?

I got it, you're right!
This means that all drupal_clean_css_identifier has to do, is to escape this caracter sequence: ]]>.

In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

Extensible Markup Language (XML) 1.0 (Fifth Edition), 2.4 Character Data and Markup

I think the code comments in the function explain why underscores are removed:

  // By default, we filter using Drupal's coding standards.

Drupal's coding standards prefer dashes to underscores (although honestly, I'm not sure that's written down anywhere)... so the function is just designed to help Drupal's HTML output be consistent with the standards by default.

And that seems to have been the explicit intent of #464862: Add drupal_css_class() to clean class names and rename form_clean_id (the issue where this was added).

Not to say we couldn't change Drupal's standards if we wanted to, of course :) But I think that does explain why the function currently behaves the way that it does.

Relaxing the standards would actually help with issues like this: #936798: Dashboard uses unreliable method for identifying blocks.

Because currently, a whole bunch of characters get munged into dashes, so you can't go backwards from the CSS identifier to figure out what was fed into the function to produce it.

Pisco, a CDATA section starts with <![CDATA[. It is not the same as the CDATA SGML type (See 6.2 SGML basic types).

Heine, you're absolutely right, a CDATA section starts with <![CDATA[, but I was not taking about CDATA section. I was referring to CDATA as an attribute type.

Unfortunately section 6.2 SGML basic types is only a summary of key information, I was not able to find a free version of the the SGML spec. I think, however, that it is save to stick to the XML and XHTML spec (2.4 Character Data and Markup):

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

Notice the distinction between the CDATA section and content of elements.

I think the start-delimiter of any markup is "<", thus drupal_clean_css_identifier should do the following transliteration:

  • < to &lt;
  • ]]> to ]]&gt;
  • ' to &apos;
  • " to &quot;

I'm afraid that is not going to work. The problem is that the use of drupal_html_class hasn't been thought about sufficiently.

In theme_table it is used as an 'attributes' array value, then later passed through check_plain via drupal_attributes.

The best we can do is IMO have the function that produces a "potentially valid attribute value" that is later passed through check_plain before being inserted into HTML. This also means we need to fix the theme's use of the classes strings.

I have given this issue another thought: on one hand we have the HTML class attribute as defined in the HTML 4.01 and the current HTML spec, on the other hand we have CSS class identifiers as defined in the CSS spec. The HTML class attribute is a particularity of HTML, it has, on itself, nothing to do with CSS. CSS uses this particularity of HTML to add functionality on top of HTML, this is by design. These different specifications target different applications and impose different requirement on the class attribute and the class identifier respectively. It is also by design that the requirements for one (HTML) is a subset of the other (CSS). This means that every CSS class identifier is a valid HTML class attribute, but not every HTML class attribute is a valid CSS class identifier!

As said before, I think that drupal_clean_css_identifier is clearly targeted at CSS class identifiers, and these are the requirement it should enforce:

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

StatusFileSize
new1.23 KB

This is what I propose:

<?php
function drupal_clean_css_identifier($identifier, $filter = array(' ' => '-')) {
 
// replace whitespaces
 
$identifier = strtr($identifier, $filter);
 
// CSS class identifiers must not start with
  // - a digit
  // - two hyphens
  // - hyphen followed by a digit
 
$prev = '';
  while (
$prev != $identifier) {
   
$prev = $identifier;
   
$identifier = preg_replace(array('/^\d*/u', '/^\x{002D}*(?=\x{002D})/u', '/^\x{002D}\d/u'),'', $identifier);
  }
 
// Valid characters in a CSS identifier are:
  // - the hyphen (U+002D)
  // - a-z (U+0030 - U+0039)
  // - A-Z (U+0041 - U+005A)
  // - the underscore (U+005F)
  // - 0-9 (U+0061 - U+007A)
  // - ISO 10646 characters U+00A1 and higher,
  // We strip out any character not in the above list.
  // as opposed to the spec we do not allow U+00A0
  // as opposed to the spec we only allow code points up to U+FFFF
 
$identifier = preg_replace('/[^\x{002D}\x{0030}-\x{0039}\x{0041}-\x{005A}\x{005F}\x{0061}-\x{007A}\x{00A1}-\x{FFFF}]/u', '', $identifier);
  return
$identifier;
}
?>

See the attached example.

StatusFileSize
new1.51 KB
FAILED: [[SimpleTest]]: [MySQL] 29,377 pass(es), 20 fail(s), and 0 exception(es).
[ View ]

Added a patch for Drupal 8.x, no unit tests yet.

Status:Active» Needs review
StatusFileSize
new2.62 KB
FAILED: [[SimpleTest]]: [MySQL] 29,387 pass(es), 20 fail(s), and 0 exception(es).
[ View ]

Added patch with unit tests.

It is also by design that the requirements for one (HTML) is a subset of the other (CSS). This means that every CSS class identifier is a valid HTML class attribute, but not every HTML class attribute is a valid CSS class identifier!

Nope. The CSS specification does not limit the allowed values of HTML classes, at least not to the extent you assume.

1foo

is a perfectly valid class attribute value, but not a valid CSS identifier. So what? To address the class 1foo in CSS, all you need to do is to create a valid CSS identifier. In this case:

.\31 foo

You're right. So back to #10 then, I think that's what's technically correct. Can you explain why you think this wouldn't work ( #11 )?

drupal_html_class is sometimes used to produce the value of an attribute][class entry. These will be passed through check_plain, resulting in double escaping.

&foo

Should become in HTML:

&amp;foo

But will become in some cases

&amp;amp;foo

It might be best for drupal_html_class to deliver a valid class value, then transform it to valid HTML via check_plain the moment it is inserted in the attribute value.

Status:Needs review» Needs work

The last submitted patch, drupal_clean_css_identifier-1109854-15.patch, failed testing.

It might be best for drupal_html_class to deliver a valid class value, then transform it to valid HTML via check_plain the moment it is inserted in the attribute value.

What exactly do you mean, I'd say that a valid class value is valid HTML.

Ok i'm confused. Can someone summarize why we can't change the filter from

$filter = array(' ' => '-', '_' => '-', '/' => '-', '[' => '-', ']' => ''))
to
$filter = array(' ' => '-', '/' => '-', '[' => '-', ']' => ''))

Views is creating its own wrazpper function just to remove that part of the filter.
And I can't find anything in the css standards about underscores.

I have no glue of css, but there seems to be only one grid system which uses "_" as part of their css classes.
http://960.gs/

As far as i see, it's not that unpopular even the trend seems to move to other systems.

Title:Overly aggressiv transliteration in drupal_clean_css_identifierOverly aggressive transliteration in drupal_clean_css_identifier
Issue tags:+VDC

We're removing the wrapper in Views 8.x-3.x, and this seems to be the best place to discuss changes to the core function.

Adding a tag,

I opened a more or less related issue (which would be dependent on this considering that we agree we want it):

http://drupal.org/node/1900768

So... Our new CSS Syntax requires the _ to NOT be stripped. Ignore my last comment because BEM apparently was already decided on.

Take a look at http://drupal.org/node/1887918 to see an example of the planned syntax (especially @see "6. Formatting Class Names"). Thanks JohnAlbin, r5yn and anyone else involved in that draft... It REALLY looks amazing.

So... To make it short: We need underscores. But, more specifically, we only really need double underscores. Currently modules seem to use drupal_html_class() also to simply avoid running an additional strtr() to replace underscores of machine-names before they are printed... Take node types for example.

Do we want to run them through a regex replace to only replace single occurrences of underscores? Does that even make sense or should modules take care of that?

Title:Overly aggressive transliteration in drupal_clean_css_identifierOverly aggressive transliteration in drupal_clean_css_identifier removes underscores from classes

So adding code for _strict_ CSS class name validation is out. I had it as part of the original patch for drupal_clean_css_identifier. See #464862-12: Add drupal_css_class() to clean class names and rename form_clean_id But was nixed by Sun in comment #15 when he said:

However, in general, I'm not completely sold on the idea that Drupal should all of this extra work due to sluggish developers. AFAIK, Drupal does not do this elsewhere. CSS IDs were the only special case, because those can't appear twice on a page - while a single developer is not able to control with other CSS IDs are on a page.

So. Bearing that, I would suggest to remove that extra validation for starting numbers/dashes.

Status:Needs work» Needs review
StatusFileSize
new626 bytes
FAILED: [[SimpleTest]]: [MySQL] 49,175 pass(es), 1,169 fail(s), and 947 exception(s).
[ View ]

Have read through all comments here, and AFAICT the final conclusion is that we should just remove the filtering of underscores from drupal_clean_css_identifier.
This will allow class names for BEM such as .messages--error or .messages__icon.
It does not touch ID's as there is no need to change the current filtering. Also, as per #26, if modules are incorrectly using this function to filter and alter machine names, I think that should be their issue to resolve, provided core does not follow this bad practice.
AFAICS it does not, so no issue with this route.
Here is a patch, I would like to get this moving ASAP, so we can get onto cleaning the core HTML and CSS, either before or during Twig conversion.

hhmm maybe I was wrong above. @fubhy did you mean that modules are replacing '_' with '-' in machine names for use in classes? If so, then maybe we could just change the filter to something like this:

drupal_clean_css_identifier($identifier, $filter = array(' ' => '-', '_' => '-', '__' => '__',  '/' => '-', '[' => '-', ']' => '')) {

Or would that not work? This needs a little more thought, I am not so great with regex.

Status:Needs review» Needs work

The last submitted patch, drupal-remove_underscore_from_css_filter-1109854-28.patch, failed testing.

Status:Needs work» Needs review
StatusFileSize
new651 bytes
FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in core/includes/common.inc.
[ View ]

I believe that single underscore filter must remain in place, just adding a filter for double underscores, so they remain untouched. If this does not work, then we will need some kind of regex to get this right.

Status:Needs review» Needs work

The last submitted patch, drupal-revise_css_identifier_filter-1109854-31.patch, failed testing.

StatusFileSize
new652 bytes
PASSED: [[SimpleTest]]: [MySQL] 49,096 pass(es).
[ View ]

Oopss missed a comma, still not sure this will work, I need 2 things, first a regex to replace single underscore with a hyphen, then remove underscore from filter, then I need a test to be sure it is working. Will work on that next.

Status:Needs work» Needs review

And send to test bot.

Having read the PHP docs, I think my last patch will work as expected.

The longer string '__' will be matched first and replaced with '__' (itself), but then will not be touched again, so then the shorter '_' will be checked and matched and replaced with '-'.

From the docs:

The next example shows the behavior of strtr() when called with only two arguments. Note the preference of the replacements ("h" is not picked because there are longer matches) and how replaced text was not searched again.

Example #2 strtr() example with two arguments

<?php
$trans
= array("h" => "-", "hello" => "hi", "hi" => "hello");
echo
strtr("hi all, I said hello", $trans);
?>

The above example will output:

hello all, I said hi

Hope I have read that correctly.

Of course I know you PHP geniuses out there know this already, it's more for my reference and anyone not familiar to understand the change made.

@mbrett5062 - I guess I don't understand why are we not allowing single underscrores in drupal_clean_css_identifier? I found this issue after seeing my underscores where getting converted into hypens when adding classes in the views ui.

Of course you are correct @gmclelland, I know of no reason why a single underscore can not be used, according to W3G standards, it was something introduced a long while ago in Drupal, and I am unable to find the reasoning.
However, according our new, upcoming standards for CSS, we are moving towards a SMACCS syntax, please see the following: (I pulled this out, so it is quicker to find, on an otherwise long page, Draft - CSS Architecture)

6. Formatting Class Names

The HTML class attribute has been made to do a lot. Drupal uses a naming convention to make clear what a particular class is for and what other parts of the system it affects:

/* Component Rules */
.component-name
.component-name--variant
.component-name__sub-object
/* Layout Rules */
.l-layout-method  /* e.g. `.l-container` */
.grid-*
/**
* State Classes
*
* These are always applied via JavaScript, and describe a non-
* default state.
*/
.is-state-type
/* Feature-Dependent Rules (with Modernizr) */
.supports-feature
/**
* Functional JavaScript Hooks
*
* Apply JS behavior using `js-` prefixed classes. These behavior hooks should never
* be used to apply CSS, and non-js-prefixed classes should never be used for JS hooks.
*/
.js-behavior-hook  /* e.g. .js-slider, .js-dropdown */

While this does not preclude the use of single underscores, if we stick to the syntax as detailed, then everyone gets a clear understanding of the semantic inference within the structure.

I hope this helps.

I guess I don't understand why are we not allowing single underscrores in drupal_clean_css_identifier?

From reading the issue, it seems there are a couple reasons, which I'd like to just summarize all in one place, to have as a reference:

1) Legacy browser support (comment #2).
2) Hyphens are a de-facto unofficial Drupal standard, used for cleanliness, readability, and consistency (comment #7).
3) Perhaps a continuation of the previous point, but the new CSS Architecture draft doc makes no mention of the use of single underscores in class name formatting (comment #37).
4) Because of the way the function works, PHP Drupal developers have come to make their modules depend on the underscore to hyphen transliteration, which allows for interesting conversion such as book_parent[bid][title] into book-parent-bid-title (Sun's example). Changing this, will have far-reaching effects in both contrib and core, and themers will be left with the mess of digging up and converting previously working styles to insert underscores in random places, creating inconsistencies and disorder (excuse the FUD ;).

Because of the wide-reaching effects that allowing underscores in class names will have (read: breaking lots of existing styles and scripts), and the relative innocuous addition of writing an exception for the double underscore and double hyphen situation... My suggestion is that we should spin off the double underscore / double hyphen exception to a separate issue, and leave this one just for the discussion of simply changing from '_' => '-' to '_' => '_'.

I see no problem keeping the existing behaviour for single underscores as long as double underscores are preserved in the final output. #33 removes a blocker for #1921610: [Meta] Architect our CSS without the side effects described in #39 4).

I agree with @jwilson in #39, so I’ve opened #2009584: Allow double underscores to pass through drupal_clean_css_identifier as per new CSS standards and posted @mbrett5062’s patch from #33. I’d love to get a review and hopefully RTBC.

Status:Needs review» Reviewed & tested by the community

I think this issue is fixed because #2009584: Allow double underscores to pass through drupal_clean_css_identifier as per new CSS standards is fixed, so RTBC. Feel free to reopen it if I am wrong.

Status:Reviewed & tested by the community» Needs work

Having #2009584: Allow double underscores to pass through drupal_clean_css_identifier as per new CSS standards lessens the need for this issue because it allows us to move forward with the proposed Drupal 8 class naming convention, but I think it is reasonable that other coders would want to use a different class naming scheme. And I don't really see a reason why Drupal should prohibit them from doing so. So I'm going to leave this open for that use-case.