Documentation

Web Crawler

Technical documentation of the server-side web crawler. Covers architecture, URL normalisation pipeline, deduplication, internal PageRank engine, adaptive priority scoring, orphan detection and all export formats.

v1.0 · 250 pages default limit · 10 analysis signals · Node.js · Express · Cheerio · Axios · Port 5000

Overview

The Web Crawler is a lightweight but feature-rich server-side crawling engine built entirely in Node.js without browser automation. Starting from a given base URL it recursively discovers internal links, analyses each page and produces a comprehensive dataset covering SEO signals, content quality, canonical state, link structure and sitemap data.

If the target site already has a sitemap.xml, the crawler uses it as a seed list and enriches the crawl results with the original priority and lastmod values. After crawling, a multi-stage post-processing pipeline runs duplicate detection, internal PageRank calculation, orphan detection and priority scoring before the final result is returned.

Capabilities at a glance

Max pages: 250 (configurable per request)
Concurrency: Sequential with 200 ms politeness delay
Redirect handling: Up to 5 hops (301/302/303/307/308)
robots.txt: Full parsing, wildcard support, optional
Sitemap integration: Reads existing sitemap.xml as seed
Duplicate detection: Path-normalised key comparison
Internal PageRank: 25 iterations, damping 0.85
Orphan detection: 3 categories (true / sitemap / near)
Thin content scoring: 5 weighted signals
Export formats: JSON, CSV, TXT, XML Sitemap
JS navigation: Not supported (static HTML only)
Dependencies: Express, Axios, Cheerio

URL Normalisation Pipeline

Every URL discovered during crawling passes through a multi-step normalisation pipeline before being added to the queue. This prevents duplicate visits caused by trailing slashes, index files, tracking parameters or inconsistent query string ordering.

Normalisation steps

Trailing slash after query string: Removed (e.g. ?tid=2/ → ?tid=2)
Double slashes: Collapsed to single slash
index.php / index.html: Replaced with /
Trailing slash enforcement: Added to directory-style paths
Tracking parameters: Stripped (UTM, fbclid, gclid, ref etc.)
Query string sorting: Alphabetical – prevents order-based duplicates
Path + query conflict: Trailing slash removed if query params present
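
The steps above can be sketched with Node's built-in URL class. This is a minimal illustration, not the crawler's actual code: the function name normaliseUrl, the order of the steps and the rule that "directory-style" means "last path segment has no file extension" are assumptions.

```javascript
// Hypothetical helper: names and details are illustrative, not taken from server.js.
const TRACKING_PARAMS = new Set([
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "fbclid", "gclid", "msclkid", "mc_eid", "ref", "source", "affiliate",
]);

function normaliseUrl(rawUrl) {
  const url = new URL(rawUrl);

  // Trailing slash after the query string: ?tid=2/ -> ?tid=2
  if (url.search.endsWith("/")) url.search = url.search.slice(0, -1);

  // Collapse double slashes and replace index files with "/"
  let path = url.pathname.replace(/\/{2,}/g, "/");
  path = path.replace(/\/index\.(php|html)$/i, "/");

  // Assumption: "directory-style" means the last segment has no file extension
  const lastSegment = path.split("/").pop();
  if (lastSegment !== "" && !lastSegment.includes(".")) path += "/";

  // Strip tracking parameters, then sort the rest alphabetically by name
  const kept = [...url.searchParams.entries()]
    .filter(([name]) => !TRACKING_PARAMS.has(name.toLowerCase()))
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  const query = new URLSearchParams(kept).toString();

  // Path + query conflict: no trailing slash when query params remain
  if (query !== "" && path.length > 1 && path.endsWith("/")) {
    path = path.slice(0, -1);
  }

  url.pathname = path;
  url.search = query;
  return url.toString();
}
```

Under these assumptions, normaliseUrl("https://example.com/shop?b=2&a=1") and normaliseUrl("https://example.com/shop/?a=1&b=2") produce the same string, so the queue sees a single URL.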

URL validation rules

Same domain only: External links discarded
Protocol: http: and https: only
Max query params: 1 (prevents filter/forum explosions)
PHP with params: Whitelist only (showthread, viewtopic, topic.php etc.)
Binary assets: Images, fonts, CSS, JS, PDF, ZIP filtered out
Feeds: RSS, Atom in all variants excluded
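
A rough sketch of how these rules might combine into a single check. The function name isCrawlable, the exact-hostname comparison, the asset extension list and the feed pattern are assumptions; the PHP whitelist only contains the three names mentioned above, while the real list is longer.

```javascript
// Hypothetical validation helper; patterns are illustrative, not the crawler's real rule set.
const PHP_WHITELIST = ["showthread", "viewtopic", "topic.php"];
const ASSET_EXT = /\.(png|jpe?g|gif|svg|webp|ico|woff2?|ttf|css|js|pdf|zip)$/i;
const FEED = /\/(feed|rss|atom)(\/|$)|\.(rss|atom)$/i;

function isCrawlable(candidate, baseUrl) {
  let url;
  try {
    url = new URL(candidate, baseUrl); // resolves relative links against the base
  } catch {
    return false; // unparsable href
  }
  const base = new URL(baseUrl);

  // Protocol: http: and https: only
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  // Same domain only (assumption: exact hostname match)
  if (url.hostname !== base.hostname) return false;

  // Max query params: 1
  const paramCount = [...url.searchParams.keys()].length;
  if (paramCount > 1) return false;

  // Binary assets and feeds are skipped
  const path = url.pathname.toLowerCase();
  if (ASSET_EXT.test(path) || FEED.test(path)) return false;

  // PHP with params: whitelist only
  if (path.endsWith(".php") && paramCount > 0) {
    return PHP_WHITELIST.some((name) => path.includes(name));
  }
  return true;
}
```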

Exclude patterns (43 rules)

WordPress: /wp-admin/, /wp-login, /wp-json/, login, cart, checkout
Forum systems: MyBB, phpBB, vBulletin, SMF action URLs
System pages: /print/, /feed/, /xmlrpc, /cgi-bin/, /admin/, /uploads/
Action parameters: action=, sort=, filter=, doing_wp_cron= etc.

Tracking parameters stripped

Analytics: utm_source, utm_medium, utm_campaign, utm_term, utm_content
Ad platforms: fbclid, gclid, msclkid, mc_eid
General: ref, source, affiliate

Page Analysis & Metadata Extraction

Each crawled page is parsed with Cheerio and a comprehensive set of signals is extracted. Navigation, header and footer elements are removed before text analysis to avoid inflating word counts with boilerplate content.

SEO signals

Title: Text content, length tracked
Meta description: Presence and content
Meta robots: noindex detection
X-Robots-Tag header: noindex / none detection
Canonical tag: Extracted, normalised, compared to page URL
H1 / H2 / H3 count: All heading levels counted
Image count: All <img> tags
Internal link count: All <a href> tags
External link count: Links to other domains

Content quality signals

Word count: Words >2 chars after nav/header/footer removal
Text/HTML ratio: Body text length ÷ raw HTML length
Last modified: HTTP header → article:modified_time → dateModified
JSON-LD: Organization / WebSite schema extraction (email, description)
Redirect detection: Final URL after all hops recorded

Thin content scoring (thinScore 0.0–1.0)

Word count <300: −0.3
Word count <100: −0.2 (cumulative)
Text/HTML ratio <0.1: −0.2
Link count <2: −0.1
External link ratio >70%: −0.2 (link directory signal)
Result: Penalties are subtracted from a starting score of 1.0; the result is clamped to a minimum of 0.0 and rounded to 1 decimal
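
The scoring table can be read as a simple subtraction chain. The sketch below assumes the score starts at 1.0 (implied by the 0.0–1.0 range and the negative weights) and that the external link ratio is externalLinkCount divided by linkCount; the function name is hypothetical.

```javascript
// Hypothetical helper mirroring the five weighted signals listed above.
function computeThinScore({ wordCount, textHtmlRatio, linkCount, externalLinkCount }) {
  let score = 1.0; // assumption: a page with no thin-content signal scores 1.0

  if (wordCount < 300) score -= 0.3;
  if (wordCount < 100) score -= 0.2; // cumulative with the <300 penalty
  if (textHtmlRatio < 0.1) score -= 0.2;
  if (linkCount < 2) score -= 0.1;
  // Link directory signal; guarded against division by zero
  if (linkCount > 0 && externalLinkCount / linkCount > 0.7) score -= 0.2;

  // Clamp to 0.0 and round to one decimal
  return Math.round(Math.max(0, score) * 10) / 10;
}
```

For example, a page with 200 words, a text/HTML ratio of 0.3 and 10 links (1 external) triggers only the first penalty and scores 0.7.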

Per-page output fields

url, title, description, bodyText, jsonLd
canonical, isCanonical, isNoIndex
h1Count, h2Count, h3Count
imageCount, linkCount, externalLinkCount
wordCount, textHtmlRatio, thinScore, qualityScore
depth, rawDepth, lastModified, changefreq
priority, pageRank, pageRankRaw
source, inOriginalSitemap
isDuplicate, duplicateOf
isOrphan, isSitemapOrphan, isNearOrphan, incomingLinks
isLegalPage

Sitemap Integration

Before crawling begins, the crawler fetches the existing sitemap.xml from the target domain's root. The sitemap URLs are added to the crawl queue with depth=null (meaning: present in sitemap but not yet reached via hyperlinks). This ensures that sitemap-only pages are always checked even if they are not linked from the crawled pages.

The original sitemap's priority and lastmod values are stored per URL and used later in the hybrid priority calculation. Pages found both in the sitemap and via crawling receive the BOTH source flag; those found only via crawling get CRAWLED; those only in the sitemap get a null depth and SITEMAP_ONLY treatment in the priority engine.
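
A minimal sketch of the seeding and source-flag logic described above. The function names, the shape of the sitemap entries and the linkedUrls set (URLs reached via hyperlinks) are assumptions made for illustration.

```javascript
// Hypothetical helpers; the real data structures in server.js are not documented here.
function seedFromSitemap(entries) {
  const sitemapMeta = new Map(); // url -> original priority / lastmod
  const queue = [];
  for (const { loc, priority, lastmod } of entries) {
    sitemapMeta.set(loc, { priority, lastmod });
    // depth=null: present in the sitemap, not yet reached via hyperlinks
    queue.push({ url: loc, depth: null });
  }
  return { queue, sitemapMeta };
}

function sourceFlag(url, sitemapMeta, linkedUrls) {
  const inSitemap = sitemapMeta.has(url);
  const linked = linkedUrls.has(url);
  if (inSitemap && linked) return "BOTH";
  if (linked) return "CRAWLED";
  return inSitemap ? "SITEMAP_ONLY" : null;
}
```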

Duplicate Detection

After all pages are crawled, a path-normalised deduplication pass runs over the full results. The normalisation key is derived from the URL pathname, lowercased and with trailing slashes stripped. The first occurrence keeps isDuplicate=false; all subsequent occurrences for the same key are marked isDuplicate=true with a duplicateOf reference.

This detects URL variants such as /blog/page1 vs /blog/page1?ref=share, or /page/ vs /page/index.html. Duplicate pages are excluded from all frontend display views, CSV/TXT/JSON exports and the generated sitemap.xml.
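
The deduplication pass could look roughly like the following. The function name markDuplicates is hypothetical, and the sketch assumes duplicateOf holds the URL of the first page seen for a key.

```javascript
// Hypothetical post-crawl pass: first occurrence per key wins.
function markDuplicates(pages) {
  const firstSeen = new Map(); // normalised key -> URL of first occurrence

  return pages.map((page) => {
    // Key: pathname only, lowercased, trailing slashes stripped
    const key = new URL(page.url).pathname.toLowerCase().replace(/\/+$/, "");

    if (firstSeen.has(key)) {
      return { ...page, isDuplicate: true, duplicateOf: firstSeen.get(key) };
    }
    firstSeen.set(key, page.url);
    return { ...page, isDuplicate: false, duplicateOf: null };
  });
}
```

Because the key ignores the query string, /blog/page1 and /blog/page1?ref=share collapse to the same key, and the later one is flagged.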

Internal PageRank Engine

After crawling and deduplication, a simplified internal PageRank calculation runs over all crawled pages. The algorithm uses 25 iterations with a damping factor of 0.85, consistent with the standard PageRank formulation. Outgoing link counts are approximated from each page's linkCount field.

The raw PageRank values are normalised to a 0.1–1.0 scale. The normalised value (pageRank) is then applied as an additive bonus or penalty in the final priority engine via (pageRank − 0.5) × 0.2: pages with pageRank=1.0 receive +0.1 and pages at 0.5 receive ±0. A pageRank of 0.0 would give −0.1, but with the stated lower bound of 0.1 the largest penalty actually reached is −0.08.
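
A simplified sketch of such a calculation. The source does not describe how link targets are tracked or how the normalisation is done, so the internalLinks array, the min-max scaling and both function names are assumptions; only the 25 iterations, the 0.85 damping factor and the use of linkCount as out-degree come from the description above.

```javascript
// Hypothetical sketch: each page is { url, linkCount, internalLinks: [targetUrl, ...] }.
function internalPageRank(pages, iterations = 25, damping = 0.85) {
  const n = pages.length;
  let rank = new Map(pages.map((p) => [p.url, 1 / n]));

  for (let i = 0; i < iterations; i++) {
    // Every page starts each round with the "random jump" share
    const next = new Map(pages.map((p) => [p.url, (1 - damping) / n]));
    for (const page of pages) {
      // linkCount approximates the number of outgoing links
      const outDegree = Math.max(page.linkCount, 1);
      const share = (damping * rank.get(page.url)) / outDegree;
      for (const target of page.internalLinks) {
        if (next.has(target)) next.set(target, next.get(target) + share);
      }
    }
    rank = next;
  }
  return rank; // raw values (pageRankRaw)
}

// Assumption: min-max scaling onto 0.1–1.0
function normaliseRanks(rank) {
  const values = [...rank.values()];
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scaled = new Map();
  for (const [url, raw] of rank) {
    const t = max === min ? 1 : (raw - min) / (max - min);
    scaled.set(url, 0.1 + 0.9 * t);
  }
  return scaled;
}
```

Because linkCount may include links that never enter the crawl set, the raw values do not necessarily sum to 1; only their relative order matters for the normalised score.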

Adaptive Priority Engine

The final page priority (used in the generated sitemap.xml) is computed in three stages. The result is clamped to 0.1–1.0 and rounded to one decimal place. The root URL always receives priority 1.0.

Stage 1 – Depth-based base priority

Depth 0 (root): 1.0
Depth 1: 0.8
Depth 2: 0.6
Depth 3: 0.5
Depth 4+: 0.3
Sitemap-only (depth=null): Median site depth used

Stage 2 – Hybrid weighting

BOTH (crawled + sitemap): 70% crawler + 30% original priority
SITEMAP_ONLY: 50% crawler + 50% original priority
CRAWLED only: 100% crawler priority

Stage 3 – Final priority signals

PageRank bonus: (pageRank − 0.5) × 0.2 → range ±0.1
Thin content penalty: (thinScore − 1.0) × 0.1 → max −0.1
Legal page floor: Minimum 0.3, changefreq forced yearly
Orphan penalty: True orphan −0.2, near-orphan −0.1, sitemap-orphan −0.1
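
The three stages can be combined as in the sketch below. The order in which the Stage 3 signals are applied, the originalPriority field name, the 0.5 default for a missing sitemap priority (the Sitemap Protocol default) and the rounding of a fractional median depth are assumptions.

```javascript
// Hypothetical sketch of the three-stage priority calculation.
const DEPTH_PRIORITY = [1.0, 0.8, 0.6, 0.5]; // depth 0..3, depth 4+ -> 0.3

function basePriority(depth, medianDepth) {
  // Sitemap-only pages (depth=null) use the median site depth
  const d = Math.round(depth === null ? medianDepth : depth);
  return d >= 4 ? 0.3 : DEPTH_PRIORITY[d];
}

function finalPriority(page, medianDepth) {
  if (page.depth === 0) return 1.0; // root URL always gets 1.0

  // Stage 1: depth-based base priority
  let p = basePriority(page.depth, medianDepth);

  // Stage 2: hybrid weighting with the original sitemap priority
  const original = page.originalPriority ?? 0.5;
  if (page.source === "BOTH") p = 0.7 * p + 0.3 * original;
  else if (page.source === "SITEMAP_ONLY") p = 0.5 * p + 0.5 * original;

  // Stage 3: final signals
  p += (page.pageRank - 0.5) * 0.2;
  p += (page.thinScore - 1.0) * 0.1;
  if (page.isOrphan) p -= 0.2;
  else if (page.isNearOrphan || page.isSitemapOrphan) p -= 0.1;
  if (page.isLegalPage) p = Math.max(p, 0.3); // legal page floor

  // Clamp to 0.1–1.0 and round to one decimal
  return Math.round(Math.min(1.0, Math.max(0.1, p)) * 10) / 10;
}
```

As a worked case: a depth-1 page with source BOTH, original priority 0.5, pageRank 1.0 and thinScore 1.0 gets 0.7 × 0.8 + 0.3 × 0.5 = 0.71, plus 0.1 from PageRank, which rounds to 0.8.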

Adaptive changefreq

Root (/): daily
Legal pages: yearly (forced)
Blog/news paths: weekly
Depth ≤1: weekly
lastmod ≤1 day old: daily
lastmod ≤14 days: weekly
lastmod ≤365 days: monthly
lastmod >365 days: yearly
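
One possible reading of this table is a first-match rule chain evaluated from top to bottom. That precedence is an assumption, as are the blog/news path pattern, the function name and the "monthly" fallback for pages without a lastmod value, which the table does not cover.

```javascript
// Hypothetical sketch: rules are checked in the order listed above, first match wins.
function adaptiveChangefreq(page, now = new Date()) {
  const path = new URL(page.url).pathname;

  if (path === "/") return "daily";
  if (page.isLegalPage) return "yearly";
  if (/\/(blog|news)(\/|$)/i.test(path)) return "weekly"; // assumed pattern
  if (page.depth !== null && page.depth <= 1) return "weekly";

  // Assumption: no lastmod available -> "monthly"
  if (!page.lastModified) return "monthly";

  const ageMs = now.getTime() - new Date(page.lastModified).getTime();
  const ageDays = ageMs / (24 * 60 * 60 * 1000);
  if (ageDays <= 1) return "daily";
  if (ageDays <= 14) return "weekly";
  if (ageDays <= 365) return "monthly";
  return "yearly";
}
```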

Orphan Detection

After the PageRank pass, every page is classified for link isolation. Three categories are distinguished to avoid false positives for pages that are legitimately only in the sitemap.

True orphan

Condition: 0 incoming links, not in original sitemap, no crawled parent path
Priority penalty: −0.2
Severity: Critical – page may be inaccessible

Sitemap orphan

Condition: 0 incoming links but present in original sitemap
Priority penalty: −0.1
Severity: Warning – page exists but internal linking missing

Near-orphan

Condition: Exactly 1 incoming link AND depth >2
Priority penalty: −0.1
Severity: Notice – weakly linked, may be hard to discover
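
The three conditions map onto a small classifier. In this sketch, hasCrawledParent is a hypothetical boolean standing in for the "no crawled parent path" check, and the function and constant names are illustrative only.

```javascript
// Hypothetical classifier for the three orphan categories described above.
const ORPHAN_PENALTY = { TRUE_ORPHAN: -0.2, SITEMAP_ORPHAN: -0.1, NEAR_ORPHAN: -0.1 };

function classifyOrphan({ incomingLinks, inOriginalSitemap, hasCrawledParent, depth }) {
  if (incomingLinks === 0) {
    // Present in the sitemap: a warning rather than a critical finding
    if (inOriginalSitemap) return "SITEMAP_ORPHAN";
    if (!hasCrawledParent) return "TRUE_ORPHAN";
    return null;
  }
  // Weakly linked and deep in the structure
  if (incomingLinks === 1 && depth !== null && depth > 2) return "NEAR_ORPHAN";
  return null;
}
```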

Legal Page Protection

Pages matching known legal/trust URL patterns (impressum, imprint, datenschutz, privacy, terms, agb, contact, about, kontakt, ueber-uns) are never filtered or penalised. Their priority floor is 0.3, changefreq is forced to yearly, and their qualityScore is clamped to a minimum of 0 regardless of thin content signals. This ensures trust signals for search engines are never accidentally removed from generated sitemaps.

Export Formats

All exports apply the same filter: noindex pages and pages marked isDuplicate are excluded. The totalPages count in the JSON export reflects the filtered set.

JSON

Content: Full crawl dataset including all analysis fields
totalPages: Reflects deduplicated count
Use case: Integration, further processing, archiving

CSV

Columns: Page, Title, Canonical, isCanonical, H1, H2, Links, Images
Use case: Spreadsheet analysis, client reporting

TXT

Format: URL | Title | Canonical:Yes/No | H1:n H2:n Links:n Images:n
Use case: Quick overview, copy-paste, plain text audits
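
As an illustration of the shared export filter and the TXT line format, a sketch using the per-page field names listed earlier. The function names are hypothetical.

```javascript
// Hypothetical export helpers built on the documented per-page fields.
function exportablePages(pages) {
  // Shared filter: drop noindex pages and flagged duplicates
  return pages.filter((page) => !page.isNoIndex && !page.isDuplicate);
}

function toTxtLine(page) {
  const canonical = page.isCanonical ? "Yes" : "No";
  return (
    `${page.url} | ${page.title} | Canonical:${canonical} | ` +
    `H1:${page.h1Count} H2:${page.h2Count} Links:${page.linkCount} Images:${page.imageCount}`
  );
}

function toTxt(pages) {
  return exportablePages(pages).map(toTxtLine).join("\n");
}
```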

XML Sitemap

Standard: Sitemap Protocol 0.9
Filter: noindex, duplicates, non-canonical, binary assets excluded
Fields: loc, lastmod (today), priority (depth-based base, adjusted by the adaptive priority engine), changefreq
Use case: Direct submission to Search Console
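
A minimal sketch of how such a sitemap could be assembled from the filtered page list. The function names are hypothetical, and binary assets are assumed to have been filtered out during crawling already.

```javascript
// Hypothetical generator for a Sitemap Protocol 0.9 document.
const XML_ENTITIES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;" };
const escapeXml = (text) => text.replace(/[&<>"']/g, (ch) => XML_ENTITIES[ch]);

function buildSitemap(pages, today = new Date().toISOString().slice(0, 10)) {
  const entries = pages
    // Filter: noindex, duplicates and non-canonical pages are left out
    .filter((page) => !page.isNoIndex && !page.isDuplicate && page.isCanonical)
    .map((page) =>
      [
        "  <url>",
        `    <loc>${escapeXml(page.url)}</loc>`,
        `    <lastmod>${today}</lastmod>`,
        `    <changefreq>${page.changefreq}</changefreq>`,
        `    <priority>${page.priority.toFixed(1)}</priority>`,
        "  </url>",
      ].join("\n")
    );

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
    "",
  ].join("\n");
}
```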

Tree Visualisation

After a completed crawl, the frontend can generate an SVG tree diagram of the entire site structure. The tree is built from all crawled pages, grouped by URL path segments into a hierarchical node structure. Layout is computed with a recursive space-partitioning algorithm that distributes nodes proportionally based on subtree width.

The SVG is generated entirely client-side without external libraries and can be downloaded as a standalone file. Node colours indicate depth: root in red, first-level categories in purple, deeper levels in blue. Each node shows page title, path, H1/link/image counts and a child-count badge.
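
The grouping of pages into a hierarchy by path segment can be sketched as follows. The node shape and both function names are assumptions, and subtree width is taken here to mean the number of leaf nodes, which is one plausible basis for a proportional layout; the actual layout code in crawler.js is not shown in this documentation.

```javascript
// Hypothetical tree builder: one node per URL path segment.
function buildTree(pages) {
  const root = { segment: "/", page: null, children: new Map() };

  for (const page of pages) {
    const segments = new URL(page.url).pathname.split("/").filter(Boolean);
    let node = root;
    for (const segment of segments) {
      if (!node.children.has(segment)) {
        node.children.set(segment, { segment, page: null, children: new Map() });
      }
      node = node.children.get(segment);
    }
    node.page = page; // intermediate nodes without a crawled page keep page=null
  }
  return root;
}

// Assumption: subtree width = number of leaves below (or at) the node
function subtreeWidth(node) {
  if (node.children.size === 0) return 1;
  let width = 0;
  for (const child of node.children.values()) width += subtreeWidth(child);
  return width;
}
```

A layout pass could then give each child a horizontal slice proportional to subtreeWidth(child) divided by subtreeWidth(parent).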

Backend Architecture

Server

Runtime: Node.js (CommonJS)
Framework: Express
HTTP client: Axios (redirects, timeouts, status validation)
HTML parser: Cheerio
Port: 5000
CORS: ai-ready-check.de + llmshub.de allowed

Network

Main page timeout: 10 seconds
robots.txt timeout: 5 seconds
Sitemap fetch timeout: 5 seconds
Max redirects: 5 hops
Politeness delay: 200 ms between requests
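
These settings could be expressed as Axios-style request options plus a sequential loop. The sketch is an assumption about structure, not the crawler's real code: timeout, maxRedirects and headers are standard Axios request config keys, but whether the crawler relies on maxRedirects or follows redirects manually is not stated here. The fetchPage parameter is a hypothetical function supplied by the caller.

```javascript
// Hypothetical request settings derived from the values listed above.
const PAGE_REQUEST = {
  timeout: 10000, // main page timeout: 10 seconds
  maxRedirects: 5, // up to 5 hops
  headers: { "User-Agent": "Mozilla/5.0 (compatible; Crawler/1.0)" },
};
const ROBOTS_REQUEST = { ...PAGE_REQUEST, timeout: 5000 };
const SITEMAP_REQUEST = { ...PAGE_REQUEST, timeout: 5000 };

const POLITENESS_DELAY_MS = 200;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sequential crawl loop: one request at a time, with a pause after each
async function crawlSequentially(urls, fetchPage) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url, PAGE_REQUEST));
    await sleep(POLITENESS_DELAY_MS);
  }
  return results;
}
```

With these values, a crawl at the default limit of 250 pages spends at least 250 × 200 ms = 50 seconds in politeness delays alone, before any network time.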
User-Agent: Mozilla/5.0 (compatible; Crawler/1.0)

API endpoints

POST /api/webcrawl/start: Start crawl, returns full result
GET /api/webcrawl/health: Health check
GET /api/fetch-sitemap?url=: CORS proxy for sitemap fetching
Security: Origin-restricted, rate limited

Frontend

JavaScript: Vanilla ES6+, no framework
Deduplication: Client-side filter on all views and exports
Search: Real-time URL filter
Canonical filter: Toggle to show only canonical pages
SVG visualisation: Client-side tree generation and download
Stats recalculation: All stats computed from filtered pages

Known Limitations

Current limitations

No JavaScript execution: SPA / JS-rendered navigation not followed
No caching: Every crawl fetches all pages fresh
Sequential requests: No parallel page fetching (politeness by design)
PageRank approximation: linkCount used as outgoing link proxy, not actual link graph
No authentication: Password-protected pages not crawlable
No cookie handling: Session-dependent pages may return empty content
Memory-bound: All pages held in memory during crawl – large sites (>500 pages) may strain the Raspberry Pi 5's RAM
No incremental crawling: Full re-crawl required for updates

Technical implementation: server.js (WebsiteCrawler class) · crawler.js (frontend) | Production system running on Raspberry Pi 5 · Node.js · Port 5000