Skip to content

Commit

Permalink
test: update tests to reflect new libxml2 HTML5 parsing behaviors (#3308
Browse files Browse the repository at this point in the history
)

**What problem is this PR intended to solve?**

@nwellnhof has done some work in libxml2 to move towards HTML5 parsing
behaviors. This branch is intended to update Nokogiri's expectations
(primarily tests) as a feedback loop for both projects.

See https://gitlab.gnome.org/GNOME/libxml2/-/issues/758#note_2217350


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

Not so far.
  • Loading branch information
flavorjones authored Oct 8, 2024
2 parents d992447 + a31c095 commit af81faf
Show file tree
Hide file tree
Showing 11 changed files with 198 additions and 103 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ We've resolved many long-standing bugs in the various schema classes, validation
* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `XML::SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* `XML::SAX::Parser#parse_memory` and `#parse_file` now accept an optional `encoding` argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#3288] @flavorjones
* `XML::SAX::ParserContext.memory` now accepts an optional `encoding` argument. When not provided, the encoding will be autodetected. [#3288] @flavorjones
* `XML::DocumentFragment#parse_options` and `HTML4::DocumentFragment#parse_options` return the options used to parse the document fragment. @flavorjones
* [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones
* [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
* [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones
Expand Down
1 change: 1 addition & 0 deletions lib/nokogiri/html4/document_fragment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ def initialize(document, tags = nil, ctx = nil, options = XML::ParseOptions::DEF
return self unless tags

options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
@parse_options = options
yield options if block_given?

if ctx
Expand Down
6 changes: 6 additions & 0 deletions lib/nokogiri/xml/document_fragment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,11 @@
module Nokogiri
module XML
class DocumentFragment < Nokogiri::XML::Node
# The options used to parse the document fragment. Returns the value of any options that were
# passed into the constructor as a parameter or set in a config block, else the default
# options for the specific subclass.
attr_reader :parse_options

####
# Create a Nokogiri::XML::DocumentFragment from +tags+
def self.parse(tags, options = ParseOptions::DEFAULT_XML, &block)
Expand All @@ -20,6 +25,7 @@ def initialize(document, tags = nil, ctx = nil, options = ParseOptions::DEFAULT_
return self unless tags

options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
@parse_options = options
yield options if block_given?

children = if ctx
Expand Down
9 changes: 2 additions & 7 deletions test/html4/sax/test_document_error.rb
Original file line number Diff line number Diff line change
Expand Up @@ -20,15 +20,10 @@ def start_document
end

def test_warning_document_encounters_error_but_terminates_normally
# Probably I'm doing something wrong, but I can't make nekohtml report errors,
# despite setting http://cyberneko.org/html/features/report-errors.
# See https://nekohtml.sourceforge.net/settings.html for more info.
# I'd love some help here if someone finds this comment and cares enough to dig in.
skip_unless_libxml2("nekohtml sax parser does not seem to report errors?")

warning_parser = Nokogiri::HTML4::SAX::Parser.new(Nokogiri::SAX::TestCase::Doc.new)
warning_parser.parse("<html><body><<div att=")
refute_empty(warning_parser.document.errors, "error collector did not collect an error")

assert(warning_parser.document.end_document_called)
end
end
end
55 changes: 50 additions & 5 deletions test/html4/test_comments.rb
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,19 @@ class TestComment < Nokogiri::TestCase
let(:doc) { Nokogiri::HTML4(html) }
let(:subject) { doc.at_css("div#under-test") }

if Nokogiri.uses_libxml?
if Nokogiri.uses_libxml?(">= 2.14.0")
it "behaves as if the comment is closed immediately before the end of the input stream" do # COMPLIANT
assert_pattern do
subject => {
name: "div",
attributes: [{ name: "id", value: "under-test" }],
children: [
{ name: "comment", content: "start of unterminated comment" }
]
}
end
end
elsif Nokogiri.uses_libxml?
it "behaves as if the comment is unterminated and doesn't exist" do # NON-COMPLIANT
assert_equal 0, subject.children.length
assert_equal 1, doc.errors.length
Expand Down Expand Up @@ -132,8 +144,12 @@ class TestComment < Nokogiri::TestCase
assert_equal inner_div, subject.children[1]
assert_predicate subject.children[2], :comment?
assert_equal "bar", subject.children[2].content
assert_equal 1, doc.errors.length
assert_match(/Comment incorrectly closed/, doc.errors.first.to_s)
if Nokogiri.uses_libxml?(">= 2.14.0")
assert_empty doc.errors
else
assert_equal 1, doc.errors.length
assert_match(/Comment incorrectly closed/, doc.errors.first.to_s)
end
end
else # jruby, or libxml2 system lib less than 2.9.11
it "behaves as if the comment encompasses the inner div" do # NON-COMPLIANT
Expand Down Expand Up @@ -161,7 +177,22 @@ class TestComment < Nokogiri::TestCase
let(:body) { doc.at_css("body") }
let(:subject) { doc.at_css("div#under-test") }

if Nokogiri.uses_libxml?("= 2.9.14")
if Nokogiri.uses_libxml?(">= 2.14.0")
it "parses as comments" do # COMPLIANT
assert_pattern do
body.children => [
{
name: "div",
children: [
{ name: "comment", content: " comment <div id=do-i-exist" },
{ name: "text", content: "inner content" },
]
},
{ name: "text", content: "-->hello" },
]
end
end
elsif Nokogiri.uses_libxml?("= 2.9.14")
it "parses as PCDATA" do # NON-COMPLIANT
assert_equal 1, body.children.length
assert_equal subject, body.children.first
Expand Down Expand Up @@ -212,7 +243,21 @@ class TestComment < Nokogiri::TestCase
let(:body) { doc.at_css("body") }
let(:subject) { doc.at_css("div#under-test") }

if Nokogiri.uses_libxml?("= 2.9.14")
if Nokogiri.uses_libxml?(">= 2.14.0")
it "parses the <! tags as comments" do
assert_pattern do
body.children => [
{
name: "div", children: [
{ name: "comment", content: "[if foo]" },
{ name: "div", attributes: [{name: "id", value: "do-i-exist"}] },
{ name: "comment", content: "[endif]" },
]
}
]
end
end
elsif Nokogiri.uses_libxml?("= 2.9.14")
it "parses the <! tags as PCDATA" do
assert_equal(1, body.children.length)
assert_equal(subject, body.children.first)
Expand Down
21 changes: 11 additions & 10 deletions test/html4/test_document.rb
Original file line number Diff line number Diff line change
Expand Up @@ -363,10 +363,8 @@ def test_document_has_error
html = Nokogiri::HTML4(<<~HTML)
<html>
<body>
<div awesome="asdf>
<p>inside div tag</p>
</div>
<p>outside div tag</p>
<div>
</foo>
</body>
</html>
HTML
Expand Down Expand Up @@ -660,14 +658,15 @@ def test_capturing_nonparse_errors_during_document_clone

def test_capturing_nonparse_errors_during_node_copy_between_docs
# Errors should be emitted while parsing only, and should not change when moving nodes.
doc1 = Nokogiri::HTML4("<html><body><diva id='unique'>one</diva></body></html>")
doc2 = Nokogiri::HTML4("<html><body><dive id='unique'>two</dive></body></html>")
doc1 = Nokogiri::HTML4("<html><body><div id='unique'>one</foo1></body></html>")
doc2 = Nokogiri::HTML4("<html><body><div id='unique'>two</foo2></body></html>")
node1 = doc1.at_css("#unique")
node2 = doc2.at_css("#unique")
original_errors1 = doc1.errors.dup
original_errors2 = doc2.errors.dup
assert(original_errors1.any? { |e| e.to_s.include?("Tag diva invalid") }, "it should complain about the tag name")
assert(original_errors2.any? { |e| e.to_s.include?("Tag dive invalid") }, "it should complain about the tag name")

refute_empty(original_errors1)
refute_empty(original_errors2)

node1.add_child(node2)

Expand Down Expand Up @@ -734,6 +733,8 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
doc = Nokogiri::HTML4::Document.parse(html)
expected = if Nokogiri.jruby?
[Nokogiri::XML::Node::COMMENT_NODE, Nokogiri::XML::Node::PI_NODE]
elsif Nokogiri.uses_libxml?(">= 2.14.0")
[Nokogiri::XML::Node::COMMENT_NODE, Nokogiri::XML::Node::COMMENT_NODE]
elsif Nokogiri.uses_libxml?(">= 2.10.0")
[Nokogiri::XML::Node::COMMENT_NODE]
else
Expand Down Expand Up @@ -802,7 +803,7 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
end

describe "read memory" do
let(:input) { "<html><body><div" }
let(:input) { "<html><body><div></foo>" }

describe "strict parsing" do
let(:parse_options) { html_strict }
Expand All @@ -824,7 +825,7 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
end

describe "read io" do
let(:input) { StringIO.new("<html><body><div") }
let(:input) { StringIO.new("<html><body><div></foo>") }

describe "strict parsing" do
let(:parse_options) { html_strict }
Expand Down
2 changes: 1 addition & 1 deletion test/html4/test_document_encoding.rb
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,7 @@ def binopen(file)
end

describe "error handling" do
RAW = "<html><body><div"
RAW = "<html><body><div></foo>"

{ "read_memory" => RAW, "read_io" => StringIO.new(RAW) }.each do |flavor, input|
it "#{flavor} should handle errors" do
Expand Down
Loading

0 comments on commit af81faf

Please sign in to comment.