-
Notifications
You must be signed in to change notification settings - Fork 409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a basic HTML compressor #956
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks pretty good! JFYI it only saved 900 bytes across my entire blog, mainly by removing a comment in one of the pages. By comparing the results with what I had before, I also found a couple questionable changes that this code makes -- please see my comments below.
I'll tag this with "hacktoberfest-accepted" now because I think the remaining issues are minor and I trust you'll stick around to fix them :)
go [] = [] | ||
go [c] = [c] | ||
go (c1 : c2 : rest) | ||
| isSpace c1 && isSpace c2 = go (c2 : rest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make a distinction between various kinds of spaces, most notably no-break spaces. I can't find the reference in the spec, but browsers definitely avoid collapsing consecutive no-break spaces. Authors could rely on this behaviour, so can you please make it so no-breaking spaces are preserved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for catching this! I was considering non-breaking spaces only as
, and didn't think about isSpace
swallowing those as well.
I can add a check against c1
not being one of U+00A0, U+2007, U+202F, or U+2060. I think that should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently there are dozens of kinds of spaces in Unicode, so I think it's better to be restrictive here: check that c1
and c2
are one of space, \t, \v, \n, \r, or \f. This seems like a safe subset that can be expanded later.
@0xd34df00d, gentle ping ;) It'd be nice to get this merged. |
Ah! This would be most excellent to have. This is important to me because local dev on my website includes spaces that don't show up when the HTML eventually gets compressed over a CDN. I believe this PR would allow me to not have to worry about design differences between local and prod |
Ok, @Minoru, I just ended up trying to do this all from scratch and ended up doing something very similar: compressHtmlCompiler :: Item String -> Compiler (Item String)
compressHtmlCompiler = pure . fmap compressHtml
compressHtml :: String -> String
compressHtml = withTagList compressTags
compressTags :: [TS.Tag String] -> [TS.Tag String]
compressTags = go S.empty
where
go :: S.Set String -> [TS.Tag String] -> [TS.Tag String]
go stack =
\case [] -> []
((TS.TagComment _):rest) -> go stack rest
(tag@(TS.TagOpen name _):rest) -> tag : go (S.insert name stack) rest
(tag@(TS.TagClose name):rest) -> tag : go (S.delete name stack) rest
(tag@(TS.TagText _):rest)
| hasSignificantWhitespace stack -> tag : go stack rest
| hasTextContent stack -> fmap cleanTabsNewLines tag : go stack rest
| otherwise -> fmap cleanAll tag : go stack rest
(tag:rest) -> tag : go stack rest
-- Whitespace-sensitive content that shouldn't be compressed
hasSignificantWhitespace :: S.Set String -> Bool
hasSignificantWhitespace stack =
any (`S.member` stack) content
where
content = [ "pre", "script", "textarea" ]
-- Elements that can hold text content and should
-- hold on to leading and trailing whitespace
hasTextContent :: S.Set String -> Bool
hasTextContent stack = any (`S.member` stack) content
where
content =
[ "a", "abbr", "b", "bdi", "bdo", "blockquote", "button", "cite"
, "code", "del", "dfn", "em", "figcaption", "h1", "h2", "h3", "h4"
, "h5", "h6", "i", "img", "input", "ins", "kbd", "label", "li", "mark"
, "math", "noscript", "object", "p", "picture", "q", "rp"
, "rt", "ruby", "s", "samp", "select", "small", "span", "strong"
, "sub", "sup", "svg", "td", "textarea", "time", "var", "wbr"
]
-- Replace tab characters with spaces
replaceTab :: Char -> Char
replaceTab '\t' = ' '
replaceTab s = s
-- Replace newline characters with spaces
replaceNewLine :: Char -> Char
replaceNewLine '\n' = ' '
replaceNewLine s = s
-- Remove the following:
-- '\f' (form feed)
-- '\n' (newline [line feed])
-- '\r' (carriage return)
-- '\v' (vertical tab)
rmNewLines :: String -> String
rmNewLines = filter (not . (`elem` ("\f\n\r\v" :: String)))
cleanTabsNewLines :: String -> String
cleanTabsNewLines = fmap (replaceNewLine . replaceTab)
cleanAll :: String -> String
cleanAll = rmNewLines . trim . fmap replaceTab Feel free to use this if you like on another PR, or anybody can use it if they like. |
Actually, I iterated on this and got it really simple and solid, I think. This cleans up 99.9% of the whitespace scenarios I could reasonably come up with: compressHtmlCompiler :: Item String -> Compiler (Item String)
compressHtmlCompiler = pure . fmap compressHtml
compressHtml :: String -> String
compressHtml = withTagList compressTags
compressTags :: [TS.Tag String] -> [TS.Tag String]
compressTags = go S.empty
where
go :: S.Set String -> [TS.Tag String] -> [TS.Tag String]
go stack =
\case [] -> []
-- Removes comments by not prepending the tag
-- and, instead, continuing on with the other tags
((TS.TagComment _str):rest) ->
go stack rest
-- When we find an open tag, like `<div>`, prepend it
-- and continue through the rest of the tags while
-- keeping a separate stack of what elements a given
-- tag is currently "inside"
(tag@(TS.TagOpen name _attrs):rest) ->
tag : go (S.insert name stack) rest
-- When we find a closing tag, like `</div>`, prepend it
-- it and continue through the rest of the tags, making
-- sure to remove it from our stack of currently opened
-- elements
(tag@(TS.TagClose name):rest) ->
tag : go (S.delete name stack) rest
-- When a text/string tag is encountered, if it has
-- significant whitespace that should be preserved,
-- then prepend it without change; otherwise, clean up
-- the whitespace, and prepend it
(tag@(TS.TagText _str):rest)
| hasSignificantWhitespace stack -> tag : go stack rest
| otherwise -> fmap cleanWhitespace tag : go stack rest
-- If none of the above match, then this is unexpected,
-- so we should prepend the tag without change
(tag:rest) ->
tag : go stack rest
-- Whitespace-sensitive content that shouldn't be compressed
hasSignificantWhitespace :: S.Set String -> Bool
hasSignificantWhitespace stack =
any (`S.member` stack) content
where
content = [ "pre", "textarea" ]
cleanWhitespace :: String -> String
cleanWhitespace " " = " "
cleanWhitespace str = cleanWS str (clean str)
where
-- Strips out newlines, spaces, etc
clean :: String -> String
clean = unwords . words
-- Clean the whitespace while preserving
-- single leading and trailing whitespace
-- characters when it makes sense
cleanWS :: String -> String -> String
cleanWS _originalStr "" = ""
cleanWS originalStr trimmedStr =
keepSpaceWhen head originalStr ++
trimmedStr ++
keepSpaceWhen last originalStr
-- Determine when to keep a space based on a
-- string and a function that returns a character
-- within that string
keepSpaceWhen :: ([Char] -> Char) -> String -> String
keepSpaceWhen _fn "" = ""
keepSpaceWhen fn originalStr
| (isSpace . fn) originalStr = " "
| otherwise = ""
|
@rpearce, thanks for picking this up! Doesn't your solution suffer the same problem as the original one though, i.e. it swallows all kinds of spaces because it uses Other than that, it looks great, so if you want it merged it's probably time to send a pull request ;) |
The latest only uses It leverages I'll see about opening a PR! |
My point is: if a string starts with multiple non-break spaces, |
Okay, I've got this working for me for allowing non-breaking unicode spaces, and I'm going to continue evaluating it before going further. Thanks for your feedback on somebody's abandoned PR 😅 Expand to view codecompressHtmlCompiler :: Item String -> Compiler (Item String)
compressHtmlCompiler = pure . fmap compressHtml
compressHtml :: String -> String
compressHtml = withTagList compressTags
compressTags :: [TS.Tag String] -> [TS.Tag String]
compressTags = go Set.empty
where
go :: Set.Set String -> [TS.Tag String] -> [TS.Tag String]
go stack =
\case [] -> []
-- Removes comments by not prepending the tag
-- and, instead, continuing on with the other tags
((TS.TagComment _str):rest) ->
go stack rest
-- When we find an open tag, like `<div>`, prepend it
-- and continue through the rest of the tags while
-- keeping a separate stack of what elements a given
-- tag is currently "inside"
(tag@(TS.TagOpen name _attrs):rest) ->
tag : go (Set.insert name stack) rest
-- When we find a closing tag, like `</div>`, prepend it
-- it and continue through the rest of the tags, making
-- sure to remove it from our stack of currently opened
-- elements
(tag@(TS.TagClose name):rest) ->
tag : go (Set.delete name stack) rest
-- When a text/string tag is encountered, if it has
-- significant whitespace that should be preserved,
-- then prepend it without change; otherwise, clean up
-- the whitespace, and prepend it
(tag@(TS.TagText _str):rest)
| hasSignificantWhitespace stack -> tag : go stack rest
| otherwise -> fmap cleanWhitespace tag : go stack rest
-- If none of the above match, then this is unexpected,
-- so we should prepend the tag without change
(tag:rest) ->
tag : go stack rest
-- Whitespace-sensitive content that shouldn't be compressed
hasSignificantWhitespace :: Set.Set String -> Bool
hasSignificantWhitespace stack =
any (`Set.member` stack) content
where
content = [ "pre", "script", "textarea" ]
cleanWhitespace :: String -> String
cleanWhitespace " " = " "
cleanWhitespace str = cleanSurroundingWhitespace str (cleanHtmlWhitespace str)
where
-- Tests for the following:
-- ' ' (space)
-- '\f' (form feed)
-- '\n' (newline [line feed])
-- '\r' (carriage return)
-- '\v' (vertical tab)
isSpaceOrNewLineIsh :: Char -> Bool
isSpaceOrNewLineIsh = (`elem` (" \f\n\r\v" :: String))
-- Strips out newlines, spaces, etc
cleanHtmlWhitespace :: String -> String
cleanHtmlWhitespace = unwords . words'
where
-- Alternate `words` function that uses a different
-- predicate than `isSpace` in order to avoid dropping
-- certain types of spaces.
-- https://hackage.haskell.org/package/base-4.17.0.0/docs/src/Data.OldList.html#words
words' :: String -> [String]
words' s = case dropWhile isSpaceOrNewLineIsh s of
"" -> []
s' -> w : words' s''
where (w, s'') =
break isSpaceOrNewLineIsh s'
-- Clean the whitespace while preserving
-- single leading and trailing whitespace
-- characters when it makes sense
cleanSurroundingWhitespace :: String -> String -> String
cleanSurroundingWhitespace _originalStr "" = ""
cleanSurroundingWhitespace originalStr trimmedStr =
leadingStr ++ trimmedStr ++ trailingStr
where
leadingStr = keepSpaceWhen head originalStr
trailingStr = keepSpaceWhen last originalStr
-- Determine when to keep a space based on a
-- string and a function that returns a character
-- within that string
keepSpaceWhen :: ([Char] -> Char) -> String -> String
keepSpaceWhen _fn "" = ""
keepSpaceWhen fn originalStr
| (isSpaceOrNewLineIsh . fn) originalStr = " "
| otherwise = "" |
This adds a basic HTML compressor compiler, which, on my blog, reduces the size of an average code listing-heavy page by about 3-4%.