Malformed HTML parsed differently from browsers #512

demurgos · 2023-10-01T15:43:58Z

I have an HTML file with markup that can be reduced to the following:

<html>
<body>
<div id="div0">
  <a hr
</div>
<div id="div1">
  <div id="div2"></div>
  <div id="div3">
    <a href="/">bar</a>
  </div>
</div>
</body>
</html>

Notice the truncated <a tag on line 4 (caused by an HTML fragment accidentally truncated in the DB).

If I create a file with this content, load it in Firefox, and print the resulting DOM with document.getElementsByTagName("html")[0].outerHTML, Firefox returns:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>

The truncated link results in 3 nodes in the DOM
The well-formed link with text bar is still present in the output

However, if I parse the input with html5ever and print back the result, I get:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </div>


</div></body></html>

The truncated link only appears twice
The well-formed link with bar completely disappeared!

EDIT: See next message, there are still some differences but the ones here seem to be caused by the TreeSink impl I used, not the parser.

This difference in interpretation between Firefox/Chrome and html5ever is causing me issues when processing these documents to recover them. I'm well aware that the input is broken, but I would expect html5ever to produce the same structure as real browsers.

EDIT: Even smaller repro, removing the newline fixes the mismatch.

<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>

demurgos · 2023-10-01T17:00:02Z

Running the arena example, I actually get a result close to real browsers.

I added Debug to html5ever/examples/arena:

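impl<'arena> std::fmt::Debug for Node<'arena> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Print only the forward links; also printing parent or
        // previous_sibling would recurse forever through the cycles.
        f.debug_struct("Node")
            .field("data", &self.data)
            .field("first_child", &self.first_child)
            .field("next_sibling", &self.next_sibling)
            .finish()
    }
}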

And then executed:

$ cat ./malformed.html
<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>
$ cargo run --example arena < ./malformed.html

This produced a tree corresponding to:

<document>
  <html>
    <head></head>
    <body>
      <div>
        <a hr<="" div=""></a>
        <div>
          <a hr<="" div="">
            <div></div>
            "\n"
          </a>
          <div>
            <a hr<="" div=""></a>
            <a href="/">bar</a>
          </div>
        </div>
        "\n"
      </div>
    </body>
  </html>
</document>

The differences from real browsers are:

There is a <div> inside the second anchor, while it is empty in browsers.
The broken anchors have two attributes instead of three.
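
The attribute count may depend on the input rather than on the parser: the first repro has a newline between hr and </div>, while the smaller repro has none. Below is a minimal sketch of the attribute-name states of the WHATWG HTML tokenizer, as I read the spec. It is not html5ever's code; the function name attr_names and the State enum are made up for illustration, and attribute values (anything with =) and duplicate-attribute removal are deliberately left out.

```rust
// Simplified subset of the WHATWG tokenizer states that run after a tag name.
#[derive(Clone, Copy)]
enum State {
    BeforeName,
    Name,
    AfterName,
    SelfClosing,
}

/// Returns the attribute names found in `rest`, the text that follows the
/// tag name (for `<a hr</div>` that is ` hr</div>`), up to the first `>`.
/// Hypothetical helper: inputs containing `=` are out of scope and stop the scan.
fn attr_names(rest: &str) -> Vec<String> {
    let mut names: Vec<String> = Vec::new();
    let mut state = State::BeforeName;
    let mut chars = rest.chars().peekable();

    // Peeking instead of consuming models the spec's "reconsume" steps.
    while let Some(&c) = chars.peek() {
        match state {
            State::BeforeName => match c {
                c if c.is_ascii_whitespace() => {
                    chars.next();
                }
                '/' | '>' => state = State::AfterName,
                '=' => break, // values not modelled in this sketch
                _ => {
                    // Start a new attribute, then reconsume in Name.
                    names.push(String::new());
                    state = State::Name;
                }
            },
            State::Name => match c {
                c if c.is_ascii_whitespace() => state = State::AfterName,
                '/' | '>' => state = State::AfterName,
                '=' => break, // values not modelled in this sketch
                _ => {
                    // `<` is a parse error here but is still appended to the name.
                    if let Some(name) = names.last_mut() {
                        name.push(c.to_ascii_lowercase());
                    }
                    chars.next();
                }
            },
            State::AfterName => match c {
                c if c.is_ascii_whitespace() => {
                    chars.next();
                }
                '/' => {
                    chars.next();
                    state = State::SelfClosing;
                }
                '>' => break, // the start tag is emitted here
                '=' => break, // values not modelled in this sketch
                _ => {
                    // Any other character starts a new attribute.
                    names.push(String::new());
                    state = State::Name;
                }
            },
            State::SelfClosing => match c {
                '>' => break, // the start tag is emitted here
                // Parse error: reconsume in BeforeName.
                _ => state = State::BeforeName,
            },
        }
    }
    names
}
```

Under these assumptions, attr_names(" hr\n</div>") yields the three names hr, < and div, while attr_names(" hr</div>") yields the two names hr< and div, because the < is appended to the name that is still being read. If that reading of the spec is right, the two-versus-three difference comes from comparing the two repros, and a browser given the smaller repro should also report two attributes.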

The other differences may be caused by my TreeSink. I'm using html5ever through scraper, so I'll check there too.

demurgos changed the title ~~Malformed HTML parsed differently from Firefox and Chrome~~ Malformed HTML parsed differently from browsers Oct 1, 2023

demurgos mentioned this issue Oct 1, 2023

Malformed HTML parsed differently from browsers rust-scraper/scraper#147

Open
