This version adds redirect following for the first URL (if it points to the same second-level domain or one of its subdomains), detection of a large number of similar URLs returning 404 because of a wrong relative path (discovered in the svelte docs) together with URL-skipping behavior, further improvements to exporting/cloning sites built on modern JS frameworks, better handling of some edge cases, and many minor improvements (see the changelog).
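The URL-skipping rule (at most 5 requests for URLs that share a basename) can be illustrated with a small Python sketch. This is not the crawler's own code: the class `BasenameLimiter` and its method names are hypothetical, and counting only non-200 responses per basename is an assumption based on the "non-200 url basename detection" item in the changes.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_REQUESTS_PER_BASENAME = 5  # the limit mentioned in the changelog


class BasenameLimiter:
    """Hypothetical helper: counts non-200 responses per URL basename."""

    def __init__(self, limit=MAX_REQUESTS_PER_BASENAME):
        self.limit = limit
        self.non200_counts = Counter()

    @staticmethod
    def basename(url):
        # Part of the path after the last slash; the query string is ignored.
        return urlsplit(url).path.rsplit("/", 1)[-1]

    def record_non200(self, url):
        # Call this after a request for `url` ended with a non-200 status.
        name = self.basename(url)
        if name:
            self.non200_counts[name] += 1

    def should_skip(self, url):
        # Skip once the same basename has already failed `limit` times.
        name = self.basename(url)
        return bool(name) and self.non200_counts[name] >= self.limit
```

With this sketch, a broken relative link such as relative/my-img.jpg repeated on every page would be requested five times and then skipped, while URLs ending in a slash (empty basename) are never skipped.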
Changes
- reports: changed file name composition from report.mydomain.com.* to mydomain.com.report.*
#9
- crawler: fixed a rare edge case in which queue processing had already finished, but the last outstanding coroutine still found a new URL
a85990d
- javascript processor: improved webpack JS processing so that paths from VueJS are replaced correctly during offline export (e.g. on docs.netlify.com). Without this, the HTML had the correct paths in the left menu, but JS immediately broke them because they started with an absolute path with a leading slash
9bea99b
- offline export: detect and process fonts.googleapis.com/css* as CSS even if there is no .css extension
da33100
- js processor: removed the forgotten var_dump
5f2c36d
- offline export: improved search for external JS in the case of webpack (dynamic composition of URLs from an object with the definition of chunks) - it was debugged on docs.netlify.com
a61e72e
- offline export: in case the URL ends with a dot and a number (so it looks like an extension), we must not recognize it as an extension in some cases
c382d95
- offline url converter: better support for SVG in case the URL does not contain an extension at all, but has e.g. 'icon' in the URL (it's not perfect)
c9c01a6
- offline exporter: warning instead of exception for some edge cases, e.g. failing to save an SVG without an extension no longer stops the export
9d285f4
- cors: do not set Origin request header for images (otherwise error 403 on cdn.sanity.io for svg, etc.)
2f3b7eb
- best practice analyzer: when checking for missing quotes, ignore values longer than 1000 characters (fixes e.g. the error 'Compilation failed: regular expression is too large at offset 90936' on skoda-auto.cz)
8a009df
- html report: added loading of extra headers to the visited URL list in the HTML report
781cf17
- Frontload the report names
62d2aae
- robots.txt: added option --ignore-robots-txt (we often need to view internal or preview domains that are otherwise prohibited from indexing by search engines)
9017c45
- http client: added an explicit 'Connection: close' header and an explicit call to $client->close(), even though Swoole was doing it automatically after exiting the coroutine
86a7346
- javascript processor: parse url addresses to import the JS module only in JS files (otherwise imports from HTML documentation, e.g. on the websites svelte.dev or nextjs.org, were parsed by mistake)
592b618
- html processor: added extraction of urls from HTML attributes that are not wrapped in quotes (but I am aware that the current regexps can cause problems when the values contain spaces that are not properly escaped)
f00abab
- offline url converter: swapped the woff2/woff order in the regex, because their priority matters in this case and woff2 did not work properly before
3f318d1
- non-200 url basename detection: URLs of e.g. image generators that share a basename and carry the url of the image in the query parameters are no longer treated as having the same basename
bc15ef1
- supertable: activation of automatic creation of active links also for homepage '/'
c2e228e
- analysis and robots.txt: improving the display of url addresses for SEO analysis in the case of a multi-domain website, so that it cannot happen that the same url, e.g. '/', is in the overview multiple times without recognizing the domain or scheme + improving the work with robots.txt in SEO detection and displaying urls banned for indexing
47c7602
- offline website exporter: we add the suffix '_' to the folder name only in the case of a typical extension of a static file - we don't want this to happen with domain names as well
d16722a
- javascript processor: extract JS urls also from imports like import {xy} from "./path/foo.js"
aec6cab
- visited url: added 'txt' extension to looksLikeStaticFileByUrl()
460c645
- html processor: extract JS urls also from <link href="*.js">, typically with rel="modulepreload"
c4a92be
- html processor: extracting repeated calls to getFullUrl() into a variable
a5e1306
- analysis: do not include urls that failed to load (timeout, skipping, etc.) in the analysis of content-types and source-domains - prevention of displaying content type 'unknown'
b21ecfb
- cli options: improved method of removing quotes even for options that can be arrays - also fixes --extra-columns='Title'
97f2761
- url skipping: if there are a lot of URLs with the same basename (ending after the last slash), we will allow a maximum of 5 requests for URLs with the same basename - the purpose is to prevent a lot of 404 from being triggered when there is an incorrect relative link to relative/my-img.jpg on all pages (e.g. on 404 page on v2.svelte.dev)
4fbb917
- analysis: perform most of the analysis only on URLs from domains for which we have crawling enabled
313adde
- audio & video: added audio/video file search in <audio> and <video> tags, if file crawling is not disabled
d72a5a5
- best practices: reworded the misleading warning '<h2> after <h0>' to '<h2> without previous heading'
041b383
- initial url redirect: if the entered url redirects to another url/domain within the same 2nd-level domain (typically http->https or mydomain.tld -> www.mydomain.tld redirects), we continue crawling with the new url/domain and declare the new url as the initial url
166e617
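The initial url redirect rule from the last item can be sketched in Python as follows. This is only an illustration, not the crawler's implementation: the function names `second_level_domain` and `resolve_initial_url` are hypothetical, the redirect target is assumed to be an absolute URL, and the domain comparison naively keeps the last two host labels (so multi-part suffixes such as co.uk are not handled).

```python
from urllib.parse import urlsplit


def second_level_domain(host):
    # Naive assumption: the last two labels form the 2nd-level domain,
    # e.g. "www.mydomain.tld" -> "mydomain.tld".
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def resolve_initial_url(initial_url, redirect_target):
    """Return the URL that should be treated as the initial URL."""
    old_host = urlsplit(initial_url).hostname or ""
    new_host = urlsplit(redirect_target).hostname or ""
    # Follow the redirect only within the same 2nd-level domain.
    if new_host and second_level_domain(old_host) == second_level_domain(new_host):
        return redirect_target
    return initial_url
```

In this sketch a redirect from http://mydomain.tld/ to https://www.mydomain.tld/ is adopted as the new initial url, while a redirect to a different 2nd-level domain leaves the original initial url unchanged.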