

Discussion (26 Comments)

Sephr 1 day ago
Manual string replacement with a hardcoded list of cases for escaping as suggested by the article isn't good advice for the use case of 'support inserting arbitrary text'.

Do use CDATA nodes, but only work on XML with an actual XML DOM library instead of string manipulation. Browsers have these built-in (DOMParser).

blenderob 1 day ago
I totally understand the general advice of using an actual XML DOM library to build the DOM. But for my own understanding, I want to ask why the 5 escapes the OP suggests (&, <, >, " and ') aren't good enough? Do you see any way to exploit it if these 5 are escaped? Someone kind enough to enlighten me?
moebrowne 1 day ago
They are:

> The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and MUST, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

> In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

> To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".

https://www.w3.org/TR/xml/#syntax
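
To make the five escapes concrete, here is a minimal JavaScript sketch. It is not taken from the article, and the function name `escapeXml` is made up for illustration. The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.

```javascript
// Hypothetical helper: escapes the five characters discussed above.
// "&" must be handled first, otherwise "&lt;" would become "&amp;lt;".
function escapeXml(text) {
  return String(text)
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&apos;");
}

const sample = `Tom & "Jerry" <b>it's</b> ]]>`;
console.log(escapeXml(sample));
// Tom &amp; &quot;Jerry&quot; &lt;b&gt;it&apos;s&lt;/b&gt; ]]&gt;
```

Note that this only covers markup-significant characters. It does not remove characters that XML 1.0 forbids outright (most C0 control characters), which no escaping can make legal.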

aji 1 day ago
i never really liked CDATA but i'm not buying the argument here since you can do the escaping with replaceAll("]]>", "]]]]><![CDATA[>") instead of four replaceAlls. (assuming you are writing your own xml serializer in javascript in 2026 for some reason)
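
A minimal sketch of that single replacement, assuming the goal is to wrap arbitrary text in CDATA. The function name `wrapCdata` is hypothetical. Each `]]>` in the input is split so that `]]` ends one section and `>` starts the next.

```javascript
// Hypothetical helper: wraps text in CDATA, splitting any "]]>" terminator
// across two adjacent sections so the text cannot close the section early.
function wrapCdata(text) {
  const safe = String(text).replaceAll("]]>", "]]]]><![CDATA[>");
  return "<![CDATA[" + safe + "]]>";
}

console.log(wrapCdata("<p>a ]]> b</p>"));
// <![CDATA[<p>a ]]]]><![CDATA[> b</p>]]>
```

A conforming parser reads the two sections back as the original `<p>a ]]> b</p>`.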
senfiaj 1 day ago
Just out of curiosity, I looked at the HN RSS feed and they still use regular escaping for titles (and some other things, except description). That means they use 2 versions of escaping instead of 1. So why not just use 1?
kijin 1 day ago
Different requirements.

The description contains HTML markup, such as <p></p> for paragraph breaks. CDATA is a nice and clean way to encode them without breaking anything.

The title doesn't contain any markup, and shouldn't. A good old escape function covers both the "doesn't" part and the "shouldn't" part.

senfiaj 1 day ago
What requirements are you talking about? Human readability? IMHO, RSS is for feed readers, not humans. When looking at https://news.ycombinator.com/rss , the RSS isn't that human friendly at all: all line breaks are removed. The point is simplicity and uniformity: regular escaping works well for many cases, not just description.
kijin 1 day ago
That assumes that you don't have anything else to escape or sanitize.

I see people stuffing all sorts of HTML tags and nonstandard attributes in an RSS <description>, just because CDATA allows them to do so without breaking the parser. Images, videos, inline SVGs with maybe some scripts inside...

The RSS spec should never have allowed this. Reading a feed would have been much more pleasant (not to mention safer for everyone!) if the contents were required to be in plain text.

tpmoney 1 day ago
I’m not sure I understand why this is a problem. RSS is a spec for publishing a list of available content, or publishing the content directly. Formatting that content was always going to be something people wanted to do, so whether it was rich text, html or what became markdown, it was inevitable that aggregators were always going to have to deal with both publishers wanting their publication to have styles and users wanting their aggregator software to either handle that style or hide it.

At least with a cdata tag you're being explicitly told “here be dragons”.

xg15 about 6 hours ago
I guess the difference is if you want the descriptions to be readable by simple tools, or if you assume that every reader has a full-fledged Chrome available.
Pikamander2 1 day ago
In my experience, liberal use of CDATA is often the only way to get third-party data-importing software to work correctly.

Whether it's efficient is a far second to whether it successfully imports the data.

Looking at you, WP All Import...

xg15 about 20 hours ago
CData's "ad-hoc escaping" by closing and reopening a CData section always felt to me as if it could be a compatibility hazard - I think most examples of CData sections have a single section spanning the whole text node - so I wouldn't be surprised if some homegrown parsers didn't handle "edge cases" like multiple sections inside a text node correctly.

But I'd want to see evidence that this is actually the case. The OP seems to argue "don't use CData, because the escape sequence for ]]> looks confusing" - and that's just vibes, not a proper argument.

If it's for "looks" I think CData would actually be the much better choice. ]]> appears extremely rarely in RSS content while <>& are guaranteed to appear if your content is HTML. So in 99.99% of cases, you won't need any escaping at all for CData and can just insert the HTML verbatim, while "regular" escaping will change every single angle bracket of your HTML.
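
To illustrate the kind of edge case mentioned above, here is a purely hypothetical sketch, not evidence that any real parser behaves this way. Both function names are invented. It contrasts a homegrown reader that assumes a single CDATA section per text node with one that concatenates all of them.

```javascript
// Hypothetical naive reader: only looks at the first CDATA section.
function naiveCdataText(xml) {
  const m = xml.match(/<!\[CDATA\[([\s\S]*?)\]\]>/);
  return m ? m[1] : xml;
}

// Hypothetical reader that joins every CDATA section in the string.
// (It ignores any text outside CDATA sections, which is fine for this demo.)
function allCdataText(xml) {
  const parts = [...xml.matchAll(/<!\[CDATA\[([\s\S]*?)\]\]>/g)];
  return parts.map((m) => m[1]).join("");
}

const splitNode = "<![CDATA[a ]]]]><![CDATA[> b]]>";
console.log(naiveCdataText(splitNode)); // a ]]
console.log(allCdataText(splitNode));   // a ]]> b
```

The naive version silently truncates the text, which is the failure mode a real XML parser avoids.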

nreece 1 day ago
Imo, ease and multi-line HTML readability of CDATA outweighs that one edge case (it cannot directly contain its own terminator).
senfiaj 1 day ago
But are most people going to read raw RSS? Just out of curiosity, I checked the HN RSS, and it still escapes characters the regular way (without CDATA) for titles. So, just keep 2 versions of XML escaping instead of one?
gradientsrneat 1 day ago
RSS is truly in its own little universe.

I recently became aware of RSS stylesheets. Apparently there is a specification for that called XSLT which is distinct from CSS in both form and function. However, there are plans by Google/Mozilla to remove XSLT from their browser engines for security/maintainability reasons. Apparently RSS supports javascript though, so it's possible to manipulate the RSS DOM that way. One could imagine a javascript polyfill that interprets XSLT, although I'm not sure if there are some cross-site security issues that would make that impractical.

perilunar 1 day ago
> RSS is truly in its own little universe.

More like a little island in the XML archipelago.

> RSS stylesheets. Apparently there is a specification for that called XSLT

XSLT is a bit more than just “RSS stylesheets”.

moebrowne 1 day ago
The removal was discussed here: https://news.ycombinator.com/item?id=45873434

No need to imagine a polyfill, they already exist: https://github.com/mfreed7/xslt_polyfill

Fileformat 1 day ago
If you want to style RSS, you can just go straight to JavaScript and avoid all the XSLT mishegas.

I made a site to get people started: https://www.rss.style/

Example RSS feed: https://www.rss.style/changelog.xml

Cross-site is fine by default, though the script is small enough to easily self-host. If you have a content-security-policy, you'll need to allow the host in script-src.

donohoe 1 day ago
I would not follow this advice. The most trouble I’ve had with RSS was usually from not having it. I also have never used CDATA at a word level - just wrap the full text block in it.
lapcat 1 day ago
I come from a very different, old-school perspective, because I hand-write my blog posts in HTML and also hand-write my RSS feed in XML.

I've found CDATA invaluable, because I can just copy and paste the content from the HTML file to the XML file. I've never used the CDATA terminator characters in a blog post, so that's a non-problem.

senfiaj 1 day ago
This is mostly about when you write your automated feed generator.
lapcat 1 day ago
> This is mostly about when you write your automated feed generator.

Yes, that's why I said, "I come from a very different, old-school perspective."

However, I don't find the points persuasive:

1. A special case for the CDATA terminator doesn't seem any worse than special cases for every HTML character that needs to be escaped in XML.

2. I'm not sure who exactly the hypothetical misled people are (straw men?) who would think "the content is raw HTML or somehow safer."

3. I'm not sure how splitting CDATA blocks is "less uniform" than escaped characters or why less uniform output is a downside, especially as you state in another comment, "IMHO, RSS is for feed readers, not humans."

4. I'm not sure how CDATA makes "debugging confusing," and in any case using CDATA blocks inside an article seems like a pretty rare case; like I said, I haven't done that myself.

DonHopkins 1 day ago
The worst use of the <BLINK> tag ever was the discussion held in the early days of RSS about escaping HTML in titles, whose attention-grabbing title went something like this: "Hey, what happens when you put a <BLINK> tag in the title???!!!"

The content of that notorious discussion went on and off and on and off for weeks, giving all the netizens of the RSS community blogosphere terrible headaches, with people's entire blogs disappearing and reappearing every second, until it finally reached a flashing point, when Dave Winer humbly conceded that it wasn't the user's fault for being an idiot, and maybe just maybe there was a tiny teeny little design flaw in RSS, and it wasn't actually such a great idea to allow HTML tags in RSS titles.

I Wanna Be <![CDATA[

Sung to the tune of “I Wanna Be Sedated”, with apologies to The Ramones.

  Twenty-twenty-twenty four escapes to go, I wanna be <![CDATA[
  Nothin’ to markup and no where to quo-o-ote, I wanna be <![CDATA[
  Just get me through the parser, put me in a node
  Hurry hurry hurry before I go inline
  I can’t control my syntax, I can’t control my name
  Oh no no no no no
  Twenty-twenty-twenty four escapes to go….
  Just put me in a stylesheet, get me in a namespace
  Hurry hurry hurry before I go inline
  I can’t control my syntax, I can’t control my name
  Oh no no no no no
  Twenty-twenty-twenty four escapes to go, I wanna be <![CDATA[
  Nothin’ to markup and no where to quo-o-ote, I wanna be <![CDATA[
  Just get me through the parser, put me in a node
  Hurry hurry hurry before I go loco
  I can’t control my syntax I can’t control my name
  Oh no no no no no
  Twenty-twenty-twenty escapes to go…
  Just get me through the parser…
  Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
  Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
  Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
  Ba-ba-bamp-ba ba-ba-ba-bamp-ba I wanna be <![CDATA[
eqvinox 1 day ago
Is it just me or is the back button broken on this website?
senfiaj 1 day ago
For me it seems fine. Sometimes when I click on the headings I have to press back several times, but this is because they have anchored links.
eqvinox 1 day ago
Hm, on my desktop it's fine too, just on the phone it plain doesn't want to go back…