When importing clean HTML with only one title element, <title>Title text</title>, it works fine.

But when importing messy HTML that contains several things named "title", it grabs something other than <title>Title text</title> as the title.

The html I'm importing has:
<title>Title text</title>
<h2 class='title'>Other text</h2>
'title': 'Just another text' <--- this is inside a <script type='text/javascript'> block in the HTML file

When importing, 'title': 'Just another text' is selected as the title instead. This makes all imported pages have the same title.

I have tried the code below in the XSL, but it grabs <h2 class='title'>Other text</h2> instead:

  <xsl:template name="get-title">
      <xsl:value-of select="//title" />
  </xsl:template>

I want it to choose only <title>Title text</title> as the title.

Comments

dman’s picture

  <xsl:template name="get-title">
      <xsl:value-of select="//title" />
  </xsl:template>

Yeah, that should only get the <title> element.

There is some fallback code that tries for other options, but it's supposed to only try that if the early lookups fail.

The code in the generic html2simplehtml.xsl template does this:


  <xsl:template name="get-title">
    <!-- 
      Note: TODO 
      Need to deal with the possible difference between 
      document title and H1 display title .
      Look for H1 or anything id=pagetitle first
      If that fails, use the meta title.
      Some exceptions to do so are defined later.
      
      Sometimes need to remove the header from the body once i've found it?
    -->
      <xsl:choose>
  
        <xsl:when test="descendant::*[local-name()='h1']">
            <xsl:value-of select="normalize-space(descendant::*[local-name()='h1'])" />
        </xsl:when>
        
        <xsl:when test="//*[@id='pagetitle']">
          <!-- 
            Nice sites set the id of their title.
            May add other #ids here as I find them
          -->
            <xsl:value-of select="normalize-space(descendant::*[@id='pagetitle'])" />
        </xsl:when>

        <xsl:when test="count(descendant::h2|descendant::xhtml:h2)=1">
          <!-- 
            Maybe it used an h2 because h1 was ugly.
            If there is ONLY ONE h2, use that
          -->
            <xsl:value-of select="normalize-space(descendant::h2|descendant::xhtml:h2)" />
        </xsl:when>
        
        <xsl:otherwise>
          <!-- 
            In practice, the html head title is often more verbose than we want
            and too often it's the same across whole sections, but it'll have to do.
          -->
            <xsl:value-of select="normalize-space(descendant::*[name() = 'title'])" />
        </xsl:otherwise>
        <!--
          Any of these steps may fail if the heading contains only html, eg
          <h1><img src='header.gif' /></h1>
        -->
      </xsl:choose>
  </xsl:template>

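The "choose/when" means only one branch is used: the first "when" whose test succeeds wins, and the "otherwise" (the <title> lookup) is only reached if every earlier test fails.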
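
Because the order of the branches decides the result, putting the <title> test first should make it win whenever a <title> exists. In the sample from the question there is no h1 and no id='pagetitle', but there is exactly one h2, so the generic template would presumably stop at the single-h2 branch before reaching the fallback. The following is only a sketch of a reordered template, not the module's own code; the choice of h1 as the fallback and the local-name() matching are assumptions made for illustration.

```xml
<xsl:template name="get-title">
  <xsl:choose>
    <!--
      Checked first: the <title> inside <head>.
      Matching on local-name() is an assumption here, so that the test
      works whether or not the source is in the XHTML namespace.
    -->
    <xsl:when test="//*[local-name()='head']/*[local-name()='title']">
      <xsl:value-of select="normalize-space(//*[local-name()='head']/*[local-name()='title'])" />
    </xsl:when>
    <!-- Only reached when the document has no head title at all -->
    <xsl:otherwise>
      <xsl:value-of select="normalize-space(//*[local-name()='h1'])" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```

An attribute value such as class='title' or a 'title' key inside a script block is not an element named title, so neither can satisfy the first test.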

Your version would also work, and is the expected way to get what you want (as long as namespaces don't get in the way).
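
To illustrate the namespace caveat: in XSLT 1.0 an unprefixed name such as //title only matches elements in no namespace, so it finds nothing if the imported document declares the XHTML namespace. A minimal sketch of a stylesheet that covers both cases is shown here; the xhtml prefix binding and the text output method are assumptions chosen for the example, not taken from the module.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Text output so the result is just the title string -->
  <xsl:output method="text" />

  <xsl:template match="/">
    <xsl:call-template name="get-title" />
  </xsl:template>

  <xsl:template name="get-title">
    <!--
      The union matches a namespaced or an un-namespaced <title>;
      [1] keeps only the first one in document order.
    -->
    <xsl:value-of select="normalize-space((//xhtml:title | //title)[1])" />
  </xsl:template>

</xsl:stylesheet>
```

If the source parses as well-formed XML, this should select only the <title> element and ignore the h2 and the script content.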

Attach your XSL and source doc, and it may reveal what's going wrong.

dman’s picture

The example starter is easier to look at.
It uses H1 or TITLE, in that order:

  <xsl:template name="get-title">
    <!-- 
    Returns the H1, or if unavailable, the <title>
    -->
    <xsl:choose>
      <xsl:when test="//xhtml:h1">
        <xsl:value-of select="normalize-space(//xhtml:h1)" />
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="normalize-space(//xhtml:title)" />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

apasaja’s picture

Status: Active » Fixed

Now it works. Thanks for the second code example.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.