Add a basic HTML compressor #956

0xd34df00d · 2022-10-15T02:59:21Z

This adds a basic HTML compressor compiler, which, on my blog, reduces the size of an average code listing-heavy page by about 3-4%.

Minoru

Thanks, this looks pretty good! JFYI it only saved 900 bytes across my entire blog, mainly by removing a comment in one of the pages. By comparing the results with what I had before, I also found a couple questionable changes that this code makes -- please see my comments below.

I'll tag this with "hacktoberfest-accepted" now because I think the remaining issues are minor and I trust you'll stick around to fix them :)

Minoru · 2022-10-22T20:36:18Z

lib/Hakyll/Web/Html/Compress.hs

+    go [] = []
+    go [c] = [c]
+    go (c1 : c2 : rest)
+      | isSpace c1 && isSpace c2 = go (c2 : rest)


This doesn't make a distinction between various kinds of spaces, most notably no-break spaces. I can't find the reference in the spec, but browsers definitely avoid collapsing consecutive no-break spaces. Authors could rely on this behaviour, so can you please make it so no-breaking spaces are preserved?

Sure, thanks for catching this! I was considering non-breaking spaces only as  , and didn't think about isSpace swallowing those as well.

I can add a check against c1 not being one of U+00A0, U+2007, U+202F, or U+2060. I think that should do.

Apparently there are dozens of kinds of spaces in Unicode, so I think it's better to be restrictive here: check that c1 and c2 are one of space, \t, \v, \n, \r, or \f. This seems like a safe subset that can be expanded later.

lib/Hakyll/Web/Html/Compress.hs

Minoru · 2022-12-11T14:47:12Z

@0xd34df00d, gentle ping ;) It'd be nice to get this merged.

rpearce · 2023-02-16T01:19:31Z

Ah! This would be most excellent to have. This is important to me because local dev on my website includes spaces that don't show up when the HTML eventually gets compressed over a CDN.

I believe this PR would allow me to not have to worry about design differences between local and prod

rpearce · 2023-02-18T01:54:21Z

Ok, @Minoru, I just ended up trying to do this all from scratch and ended up doing something very similar:

compressHtmlCompiler :: Item String -> Compiler (Item String)
compressHtmlCompiler = pure . fmap compressHtml

compressHtml :: String -> String
compressHtml = withTagList compressTags

compressTags :: [TS.Tag String] -> [TS.Tag String]
compressTags = go S.empty
  where
    go :: S.Set String -> [TS.Tag String] -> [TS.Tag String]
    go stack =
      \case [] -> []
            ((TS.TagComment _):rest) -> go stack rest
            (tag@(TS.TagOpen name _):rest) -> tag : go (S.insert name stack) rest
            (tag@(TS.TagClose name):rest) -> tag : go (S.delete name stack) rest
            (tag@(TS.TagText _):rest)
              | hasSignificantWhitespace stack -> tag : go stack rest
              | hasTextContent stack -> fmap cleanTabsNewLines tag : go stack rest
              | otherwise -> fmap cleanAll tag : go stack rest
            (tag:rest) -> tag : go stack rest

    -- Whitespace-sensitive content that shouldn't be compressed
    hasSignificantWhitespace :: S.Set String -> Bool
    hasSignificantWhitespace stack =
      any (`S.member` stack) content
      where
        content = [ "pre", "script", "textarea" ]

    -- Elements that can hold text content and should
    -- hold on to leading and trailing whitespace
    hasTextContent :: S.Set String -> Bool
    hasTextContent stack = any (`S.member` stack) content
      where
        content =
          [ "a", "abbr", "b", "bdi", "bdo", "blockquote", "button", "cite"
          , "code", "del", "dfn", "em", "figcaption", "h1", "h2", "h3", "h4"
          , "h5", "h6", "i", "img", "input", "ins", "kbd", "label", "li", "mark"
          , "math", "noscript", "object", "p", "picture", "q", "rp"
          , "rt", "ruby", "s", "samp", "select", "small", "span", "strong"
          , "sub", "sup", "svg", "td", "textarea", "time", "var", "wbr"
          ]

    -- Replace tab characters with spaces
    replaceTab :: Char -> Char
    replaceTab '\t' = ' '
    replaceTab s    = s

    -- Replace newline characters with spaces
    replaceNewLine :: Char -> Char
    replaceNewLine '\n' = ' '
    replaceNewLine s    = s

    -- Remove the following:
    --   '\f' (form feed)
    --   '\n' (newline [line feed])
    --   '\r' (carriage return)
    --   '\v' (vertical tab)
    rmNewLines :: String -> String
    rmNewLines = filter (not . (`elem` ("\f\n\r\v" :: String)))

    cleanTabsNewLines :: String -> String
    cleanTabsNewLines = fmap (replaceNewLine . replaceTab)

    cleanAll :: String -> String
    cleanAll = rmNewLines . trim . fmap replaceTab

Feel free to use this if you like on another PR, or anybody can use it if they like.

rpearce · 2023-02-22T00:25:50Z

Actually, I iterated on this and got it really simple and solid, I think. This cleans up 99.9% of the whitespace scenarios I could reasonably come up with:

compressHtmlCompiler :: Item String -> Compiler (Item String)
compressHtmlCompiler = pure . fmap compressHtml

compressHtml :: String -> String
compressHtml = withTagList compressTags

compressTags :: [TS.Tag String] -> [TS.Tag String]
compressTags = go S.empty
  where
    go :: S.Set String -> [TS.Tag String] -> [TS.Tag String]
    go stack =
      \case [] -> []
            -- Removes comments by not prepending the tag
            -- and, instead, continuing on with the other tags
            ((TS.TagComment _str):rest) ->
              go stack rest

            -- When we find an open tag, like `<div>`, prepend it
            -- and continue through the rest of the tags while
            -- keeping a separate stack of what elements a given
            -- tag is currently "inside"
            (tag@(TS.TagOpen name _attrs):rest) ->
              tag : go (S.insert name stack) rest

            -- When we find a closing tag, like `</div>`, prepend it
            -- it and continue through the rest of the tags, making
            -- sure to remove it from our stack of currently opened
            -- elements
            (tag@(TS.TagClose name):rest) ->
              tag : go (S.delete name stack) rest

            -- When a text/string tag is encountered, if it has
            -- significant whitespace that should be preserved,
            -- then prepend it without change; otherwise, clean up
            -- the whitespace, and prepend it
            (tag@(TS.TagText _str):rest)
              | hasSignificantWhitespace stack -> tag : go stack rest
              | otherwise -> fmap cleanWhitespace tag : go stack rest

            -- If none of the above match, then this is unexpected,
            -- so we should prepend the tag without change
            (tag:rest) ->
              tag : go stack rest

    -- Whitespace-sensitive content that shouldn't be compressed
    hasSignificantWhitespace :: S.Set String -> Bool
    hasSignificantWhitespace stack =
      any (`S.member` stack) content
      where
        content = [ "pre", "textarea" ]

    cleanWhitespace :: String -> String
    cleanWhitespace " " = " "
    cleanWhitespace str = cleanWS str (clean str)
      where
        -- Strips out newlines, spaces, etc
        clean :: String -> String
        clean = unwords . words

        -- Clean the whitespace while preserving
        -- single leading and trailing whitespace
        -- characters when it makes sense
        cleanWS :: String -> String -> String
        cleanWS _originalStr "" = ""
        cleanWS originalStr trimmedStr =
          keepSpaceWhen head originalStr ++
            trimmedStr ++
            keepSpaceWhen last originalStr

        -- Determine when to keep a space based on a
        -- string and a function that returns a character
        -- within that string
        keepSpaceWhen :: ([Char] -> Char) -> String -> String
        keepSpaceWhen _fn ""  = ""
        keepSpaceWhen fn originalStr
          | (isSpace . fn) originalStr = " "
          | otherwise = ""

Minoru · 2023-02-22T18:54:11Z

@rpearce, thanks for picking this up! Doesn't your solution suffer the same problem as the original one though, i.e. it swallows all kinds of spaces because it uses isSpace?

Other than that, it looks great, so if you want it merged it's probably time to send a pull request ;)

rpearce · 2023-02-22T21:03:32Z

The latest only uses isSpace for checking a for if there's a leading or trailing space it should hang on to.

It leverages unwords . words for cleanup, and I'm having pretty good results so far.

I'll see about opening a PR!

Minoru · 2023-02-24T14:23:49Z

My point is: if a string starts with multiple non-break spaces, isSpace would return True and non-break spaces will be collapsed into a single ordinary space, which is clearly wrong.

rpearce · 2023-03-02T04:47:13Z

Okay, I've got this working for me for allowing non-breaking unicode spaces, and I'm going to continue evaluating it before going further. Thanks for your feedback on somebody's abandoned PR 😅

Expand to view code

compressHtmlCompiler :: Item String -> Compiler (Item String)
compressHtmlCompiler = pure . fmap compressHtml

compressHtml :: String -> String
compressHtml = withTagList compressTags

compressTags :: [TS.Tag String] -> [TS.Tag String]
compressTags = go Set.empty
  where
    go :: Set.Set String -> [TS.Tag String] -> [TS.Tag String]
    go stack =
      \case [] -> []
            -- Removes comments by not prepending the tag
            -- and, instead, continuing on with the other tags
            ((TS.TagComment _str):rest) ->
              go stack rest

            -- When we find an open tag, like `<div>`, prepend it
            -- and continue through the rest of the tags while
            -- keeping a separate stack of what elements a given
            -- tag is currently "inside"
            (tag@(TS.TagOpen name _attrs):rest) ->
              tag : go (Set.insert name stack) rest

            -- When we find a closing tag, like `</div>`, prepend it
            -- it and continue through the rest of the tags, making
            -- sure to remove it from our stack of currently opened
            -- elements
            (tag@(TS.TagClose name):rest) ->
              tag : go (Set.delete name stack) rest

            -- When a text/string tag is encountered, if it has
            -- significant whitespace that should be preserved,
            -- then prepend it without change; otherwise, clean up
            -- the whitespace, and prepend it
            (tag@(TS.TagText _str):rest)
              | hasSignificantWhitespace stack -> tag : go stack rest
              | otherwise -> fmap cleanWhitespace tag : go stack rest

            -- If none of the above match, then this is unexpected,
            -- so we should prepend the tag without change
            (tag:rest) ->
              tag : go stack rest

    -- Whitespace-sensitive content that shouldn't be compressed
    hasSignificantWhitespace :: Set.Set String -> Bool
    hasSignificantWhitespace stack =
      any (`Set.member` stack) content
      where
        content = [ "pre", "script", "textarea" ]

    cleanWhitespace :: String -> String
    cleanWhitespace " " = " "
    cleanWhitespace str = cleanSurroundingWhitespace str (cleanHtmlWhitespace str)
      where
        -- Tests for the following:
        --   ' '  (space)
        --   '\f' (form feed)
        --   '\n' (newline [line feed])
        --   '\r' (carriage return)
        --   '\v' (vertical tab)
        isSpaceOrNewLineIsh :: Char -> Bool
        isSpaceOrNewLineIsh = (`elem` (" \f\n\r\v" :: String))

        -- Strips out newlines, spaces, etc
        cleanHtmlWhitespace :: String -> String
        cleanHtmlWhitespace = unwords . words'
          where
            -- Alternate `words` function that uses a different
            -- predicate than `isSpace` in order to avoid dropping
            -- certain types of spaces.
            -- https://hackage.haskell.org/package/base-4.17.0.0/docs/src/Data.OldList.html#words
            words' :: String -> [String]
            words' s = case dropWhile isSpaceOrNewLineIsh s of
              "" -> []
              s' -> w : words' s''
                    where (w, s'') =
                           break isSpaceOrNewLineIsh s'

        -- Clean the whitespace while preserving
        -- single leading and trailing whitespace
        -- characters when it makes sense
        cleanSurroundingWhitespace :: String -> String -> String
        cleanSurroundingWhitespace _originalStr "" = ""
        cleanSurroundingWhitespace originalStr trimmedStr =
          leadingStr ++ trimmedStr ++ trailingStr
          where
            leadingStr  = keepSpaceWhen head originalStr
            trailingStr = keepSpaceWhen last originalStr

        -- Determine when to keep a space based on a
        -- string and a function that returns a character
        -- within that string
        keepSpaceWhen :: ([Char] -> Char) -> String -> String
        keepSpaceWhen _fn ""  = ""
        keepSpaceWhen fn originalStr
          | (isSpaceOrNewLineIsh . fn) originalStr = " "
          | otherwise = ""

Add a basic HTML compressor

5bfce16

Minoru requested changes Oct 22, 2022

View reviewed changes

Minoru added the hacktoberfest-accepted label Oct 22, 2022

This was referenced Feb 18, 2023

feat: compress HTML rpearce/robertwpearce.com#109

Merged

feat: compress HTML rpearce/robertwpearce.com#111

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a basic HTML compressor #956

Add a basic HTML compressor #956

0xd34df00d commented Oct 15, 2022 •

edited

Loading

Minoru left a comment

Minoru Oct 22, 2022

0xd34df00d Oct 22, 2022

Minoru Oct 22, 2022

Minoru commented Dec 11, 2022

rpearce commented Feb 16, 2023

rpearce commented Feb 18, 2023 •

edited

Loading

rpearce commented Feb 22, 2023

Minoru commented Feb 22, 2023

rpearce commented Feb 22, 2023

Minoru commented Feb 24, 2023

rpearce commented Mar 2, 2023

Add a basic HTML compressor #956

Are you sure you want to change the base?

Add a basic HTML compressor #956

Conversation

0xd34df00d commented Oct 15, 2022 • edited Loading

Minoru left a comment

Choose a reason for hiding this comment

Minoru Oct 22, 2022

Choose a reason for hiding this comment

0xd34df00d Oct 22, 2022

Choose a reason for hiding this comment

Minoru Oct 22, 2022

Choose a reason for hiding this comment

Minoru commented Dec 11, 2022

rpearce commented Feb 16, 2023

rpearce commented Feb 18, 2023 • edited Loading

rpearce commented Feb 22, 2023

Minoru commented Feb 22, 2023

rpearce commented Feb 22, 2023

Minoru commented Feb 24, 2023

rpearce commented Mar 2, 2023

0xd34df00d commented Oct 15, 2022 •

edited

Loading

rpearce commented Feb 18, 2023 •

edited

Loading