Release v1.0.8 · janreges/siteone-crawler

This version adds redirect following for the initial URL (when it redirects within the same second-level domain, including its subdomains), detection of large numbers of similar URLs returning 404 because of a wrong relative path (discovered in the Svelte docs) together with URL-skipping behavior, improvements to exporting/cloning sites built on modern JS frameworks, better handling of some edge cases, and many minor improvements (see the changelog).
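As a rough illustration of the initial-URL redirect rule, the following Python sketch checks whether a redirect target stays within the same second-level domain. It is not the crawler's own code: the function names are hypothetical, and taking the last two host labels is a simplifying assumption that does not handle multi-part public suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def second_level_domain(url: str) -> str:
    """Return the last two labels of the host, e.g. 'mydomain.tld'.

    Assumption: a naive split on dots; multi-part public suffixes
    (co.uk and similar) are not handled in this sketch.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def should_follow_initial_redirect(initial_url: str, redirect_url: str) -> bool:
    """Hypothetical helper: follow the redirect only when both URLs
    share the same second-level domain."""
    initial = second_level_domain(initial_url)
    return initial != "" and initial == second_level_domain(redirect_url)


if __name__ == "__main__":
    print(should_follow_initial_redirect("http://mydomain.tld/", "https://www.mydomain.tld/"))
    print(should_follow_initial_redirect("https://mydomain.tld/", "https://other.tld/"))
```

Under these assumptions, typical http->https and mydomain.tld -> www.mydomain.tld redirects are followed, while a redirect to an unrelated domain is not.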

Changes

reports: changed file name composition from report.mydomain.com.* to mydomain.com.report.* #9
crawler: fixed an edge case that very rarely occurred when queue processing had already finished but the last outstanding coroutine still found a new URL a85990d
javascript processor: improved webpack JS processing so that paths from VueJS are replaced correctly during offline export (e.g. on docs.netlify.com). Without this, the HTML had correct paths in the left menu, but JS immediately broke them because they started with an absolute path with a leading slash 9bea99b
offline export: detect and process fonts.googleapis.com/css* as CSS even if there is no .css extension da33100
js processor: removed the forgotten var_dump 5f2c36d
offline export: improved search for external JS in the case of webpack (dynamic composition of URLs from an object with the definition of chunks) - it was debugged on docs.netlify.com a61e72e
offline export: in case the URL ends with a dot and a number (so it looks like an extension), we must not recognize it as an extension in some cases c382d95
offline url converter: better support for SVG in case the URL does not contain an extension at all, but has e.g. 'icon' in the URL (it's not perfect) c9c01a6
offline exporter: emit a warning instead of an exception for some edge cases, e.g. failing to save an SVG without an extension no longer stops the export 9d285f4
cors: do not set Origin request header for images (otherwise error 403 on cdn.sanity.io for svg, etc.) 2f3b7eb
best practice analyzer: when checking for missing quotes, ignore values longer than 1000 characters (fixes e.g. the error 'Compilation failed: regular expression is too large at offset 90936' on skoda-auto.cz) 8a009df
html report: added loading of extra headers to the visited URL list in the HTML report 781cf17
Frontload the report names 62d2aae
robots.txt: added option --ignore-robots-txt (we often need to view internal or preview domains that are otherwise prohibited from indexing by search engines) 9017c45
http client: added an explicit 'Connection: close' header and an explicit call to $client->close(), even though Swoole was doing it automatically after exiting the coroutine 86a7346
javascript processor: parse url addresses to import the JS module only in JS files (otherwise imports from HTML documentation, e.g. on the websites svelte.dev or nextjs.org, were parsed by mistake) 592b618
html processor: added extraction of urls from HTML attributes that are not wrapped in quotes (but I am aware that the current regexps can cause problems when the values contain spaces that are not properly escaped) f00abab
offline url converter: swapped the woff2/woff order in the regex because match priority matters here; with the previous order woff2 did not work properly 3f318d1
non-200 url basename detection: URLs such as image generators, which share a basename and carry the url of the image in the query parameters, are no longer treated as having the same basename bc15ef1
supertable: activation of automatic creation of active links also for homepage '/' c2e228e
analysis and robots.txt: improved the display of url addresses in SEO analysis for multi-domain websites, so that the same url, e.g. '/', no longer appears in the overview multiple times without its domain or scheme + improved the handling of robots.txt in SEO detection and the display of urls banned from indexing 47c7602
offline website exporter: we add the suffix '_' to the folder name only in the case of a typical extension of a static file - we don't want this to happen with domain names as well d16722a
javascript processor: extract JS urls also from imports like import {xy} from "./path/foo.js" aec6cab
visited url: added 'txt' extension to looksLikeStaticFileByUrl() 460c645
html processor: extract JS urls also from <link href="*.js">, typically with rel="modulepreload" c4a92be
html processor: extracting repeated calls to getFullUrl() into a variable a5e1306
analysis: do not include urls that failed to load (timeout, skipping, etc.) in the analysis of content-types and source-domains - prevention of displaying content type 'unknown' b21ecfb
cli options: improved method of removing quotes even for options that can be arrays - also fixes --extra-columns='Title' 97f2761
url skipping: if many URLs share the same basename (the part after the last slash), we allow a maximum of 5 requests for URLs with that basename. The purpose is to avoid triggering a lot of 404s when every page contains an incorrect relative link such as relative/my-img.jpg (e.g. on the 404 page of v2.svelte.dev) 4fbb917
analysis: perform most of the analysis only on URLs from domains for which we have crawling enabled 313adde
audio & video: added audio/video file search in <audio> and <video> tags, if file crawling is not disabled d72a5a5
best practices: reworded the nonsensical warning '<h2> after <h0>' to '<h2> without previous heading' 041b383
initial url redirect: when the entered url redirects to another url/domain within the same 2nd-level domain (typically http->https or mydomain.tld -> www.mydomain.tld redirects), we continue crawling with the new url/domain and declare the new url as the initial url 166e617
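
The url skipping entry above can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not the crawler's implementation: the class and method names are hypothetical, the sketch counts every request rather than only failed ones, and it looks at the URL path alone, so the query-parameter handling from the non-200 basename entry is out of scope. The limit of 5 comes from the changelog.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_REQUESTS_PER_BASENAME = 5  # limit stated in the changelog entry


class BasenameLimiter:
    """Hypothetical helper that caps requests per URL basename."""

    def __init__(self, limit: int = MAX_REQUESTS_PER_BASENAME) -> None:
        self.limit = limit
        self.requested = Counter()  # basename -> number of allowed requests

    def allow(self, url: str) -> bool:
        # Basename = the part of the path after the last slash.
        basename = urlsplit(url).path.rsplit("/", 1)[-1]
        if not basename:
            # Assumption: URLs ending with a slash are never limited.
            return True
        if self.requested[basename] >= self.limit:
            return False  # skip: too many URLs with this basename already
        self.requested[basename] += 1
        return True


if __name__ == "__main__":
    limiter = BasenameLimiter()
    for i in range(7):
        url = f"https://example.com/page{i}/relative/my-img.jpg"
        print(url, limiter.allow(url))
```

With a broken relative link repeated on seven pages, the first five requests are allowed and the remaining two are skipped.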
