Web Information Gathering: The Operator’s Complete Recon Playbook
#

Reconnaissance is not a formality before the “real” hacking starts. It is the hacking. Every shell you land on, every misconfiguration you exploit, every credential you harvest — traces back to something you found in the recon phase. Operators who invest here find attack vectors that automated scanners will never surface.

This is a complete field reference for web information gathering. Passive to active. Domain to subdomain. DNS to full technology stack. No padding — every technique maps to a real operational outcome.

The Four Goals of Web Recon
#

Before running a single tool, lock in what you’re actually hunting:

Asset Identification — Web pages, subdomains, IP ranges, technology stacks. You cannot attack what you haven’t mapped.

Hidden Information Discovery — Backup files, configuration files, exposed documentation, forgotten endpoints. These rarely appear in automated scans.

Attack Surface Analysis — Entry points, misconfigurations, outdated software versions, exposed admin interfaces.

Intelligence Gathering — Email addresses, employee data, organizational patterns. Fuel for phishing, password spraying, and social engineering.

DNS Reconnaissance — The Foundation
#

DNS is the GPS of the internet. Every web target has a DNS footprint. Understanding it gives you the full picture of what exists, what’s been forgotten, and what might be reachable.

DNS Resolution Chain
#

Understanding the resolution path matters when you’re diagnosing filtering or performing poisoning research:

Client Query → Local Cache → Recursive Resolver → Root Server → TLD Server → Authoritative Server → IP Returned

Essential DNS Tools
#

Tool	Purpose	Key Strength
`dig`	Versatile DNS queries	Multiple record types, detailed output
`nslookup`	Basic DNS lookups	A, AAAA, MX records
`host`	Quick DNS checks	Concise output
`dnsenum`	Automated enumeration	Subdomain discovery, brute-forcing
`dnsrecon`	Comprehensive recon	Multiple techniques, multiple output formats

Dig — Master the Tool
#

# Basic record queries
dig $TARGET_DOMAIN
dig $TARGET_DOMAIN A
dig $TARGET_DOMAIN MX
dig $TARGET_DOMAIN NS
dig $TARGET_DOMAIN TXT

# Advanced usage
dig @1.1.1.1 $TARGET_DOMAIN        # Query specific nameserver
dig +trace $TARGET_DOMAIN           # Full resolution path — trace every hop
dig -x $TARGET_IP                   # Reverse DNS lookup
dig +short $TARGET_DOMAIN           # Clean, concise output
dig +noall +answer $TARGET_DOMAIN   # Answer section only — no noise

DNS Record Type Reference
#

Record	Purpose	Example
`A`	IPv4 address mapping	`www IN A $TARGET_IP`
`AAAA`	IPv6 address mapping	`www IN AAAA 2001:db8::1`
`CNAME`	Hostname alias	`blog IN CNAME webserver.net`
`MX`	Mail server routing	`@ IN MX 10 mail.domain.com`
`NS`	Authoritative nameservers	`@ IN NS ns1.domain.com`
`TXT`	Arbitrary text data (SPF, DKIM, verification)	`@ IN TXT "v=spf1 mx -all"`
`SOA`	Zone authority and admin contact	Admin email, serial number, TTL timing
`PTR`	Reverse lookup — IP to hostname	Maps IPs back to hostnames

Operator note: TXT records are a goldmine. SPF records expose mail infrastructure. Verification tokens (Google, Microsoft, etc.) confirm which cloud services the target uses. DKIM records expose mail signing keys.

DNS Zone Transfers — Low Probability, High Payoff
#

A misconfigured DNS server will hand you the complete zone file — every hostname, every IP, every internal record. Always attempt it.

dig axfr @$NAMESERVER $TARGET_DOMAIN

A successful zone transfer exposes the complete subdomain list, all host IP addresses, internal infrastructure names, and mail server configuration. Misconfigured zone transfers are rare in modern environments but still appear on internal assessments and older infrastructure.

Subdomain Enumeration — Expanding the Attack Surface
#

The main domain is the front door. Subdomains are the side entrances, the basement windows, and the unlocked service hatches. Development and staging environments live here. Legacy apps with no patch cadence live here. Forgotten admin panels live here.

Passive vs. Active Enumeration
#

Passive — No traffic touches the target. Uses external data sources: certificate transparency logs, search engines, online DNS databases.

Active — Direct interaction with target DNS. Brute-forcing wordlists against the resolver, attempting zone transfers.

Tool Comparison
#

Tool	Primary Use	Standout Feature
`dnsenum`	Comprehensive DNS recon	Zone transfer attempts + Google scraping
`fierce`	Recursive subdomain discovery	Wildcard detection, clean output
`amass`	Active + passive discovery	Extensive multi-source data integration

Brute-Force Subdomain Discovery
#

dnsenum --enum $TARGET_DOMAIN \
  -f /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
  -r

--enum enables full enumeration mode. -f points to your wordlist. -r enables recursive enumeration — subdomains of subdomains. That last flag surfaces things like dev.internal.corp.target.com that single-level enumeration misses.

WHOIS Investigation — Who Owns What
#

WHOIS data is often underused. Beyond confirming a registrant, it’s a threat intelligence tool.

Basic Query
#

whois $TARGET_DOMAIN

Three Operational Use Cases
#

Phishing Investigation — Red Flags to Hunt

Domain registered days or weeks ago
Registrant information hidden behind privacy services
Nameservers associated with bulletproof hosting providers

Malware C2 Analysis — Indicators

Registrant using anonymous or disposable email services
Registration origin in jurisdictions with lax abuse enforcement
Registrar with a history of ignoring abuse reports

Threat Intelligence — Pattern Analysis

Clusters of domain registrations preceding known attack campaigns
Registrants using aliases or provably fake identities
Shared nameservers across multiple suspicious domains
Prior ICANN takedown or abuse history

What Legitimate vs. Suspicious WHOIS Looks Like
#

A domain like facebook.com shows: established registrar (RegistrarSafe LLC), creation date in 1997, verified organizational registrant (Meta Platforms Inc.), and multiple ICANN protection status flags. A phishing domain looks like the inverse — registered last week, privacy-protected, hosted on a bulletproof provider with a generic registrar.

Certificate Transparency Logs — Definitive Subdomain Intel
#

CT logs are public, append-only ledgers that record every SSL/TLS certificate ever issued. Every certificate. Including the ones for subdomains the target thought were internal-only.

Why CT Logs Beat Brute-Forcing
#

Definitive: These are actual issued certificates, not guesses
Historical: Includes expired and revoked certificates — surfaces decommissioned assets
Comprehensive: Catches subdomains that would never appear in a wordlist
Passive: Zero traffic to the target

Tools
#

Tool	Access Method	Notes
crt.sh	Web UI + API	Free, no registration, JSON output
Censys	API	Advanced filtering, richer metadata

Command Line — Extract Subdomains via API
#

# All subdomains for a domain
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
  | jq -r '.[].name_value' | sort -u

# Filter for specific environment (dev, staging, admin, etc.)
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
  | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' \
  | sort -u

Example output from a real engagement against a major platform:

*.dev.target.com
*.newdev.target.com
dev.target.com
newdev.target.com
secure.dev.target.com

Every one of those is a potential low-security entry point into the same organization.

Fingerprinting — Mapping the Technology Stack
#

You’ve found the hosts. Now you need to know exactly what’s running on them. Fingerprinting is the bridge between enumeration and targeted exploitation.

What Fingerprinting Tells You
#

Exact web server software and version
CMS platform and version (WordPress, Drupal, Joomla)
Backend language and framework
WAF presence and vendor
Outdated components with known CVEs

Technique 1: Banner Grabbing
#

curl -I $TARGET_DOMAIN

Example output:

HTTP/1.1 200 OK
Server: Apache/2.4.41 (Ubuntu)
X-Powered-By: PHP/7.4.3
X-Redirect-By: WordPress

Three findings in three lines: Apache version (check for CVEs), PHP version (check for CVEs), WordPress confirmed.

Technique 2: HTTP Header Analysis
#

Key headers to parse:

Server — Web server software and version
X-Powered-By — Backend language or framework
X-Generator — CMS fingerprint (common in WordPress, Drupal)
Set-Cookie — Session cookie names reveal platforms (PHPSESSID = PHP, JSESSIONID = Java)
Custom headers — Often reveal CDN provider, load balancer, or proprietary platform

Technique 3: WAF Detection
#

wafw00f $TARGET_DOMAIN

Identifies the WAF vendor and type before you attempt any active testing. Knowing you’re hitting a Cloudflare WAF vs. ModSecurity vs. Wordfence changes your entire approach.

Technique 4: Nikto Web Server Scan
#

# Standard scan
nikto -h $TARGET_DOMAIN -Tuning b

# Against a specific virtual host
nikto -h http://preprod-payroll.$TARGET_DOMAIN -Tuning b

-Tuning b focuses on software identification. Nikto will surface exposed configuration files, outdated headers, missing security headers, and default credentials pages.

Fingerprinting Tool Summary
#

Tool	Purpose	Best For
Wappalyzer	Technology profiling	Browser-based passive fingerprinting
WhatWeb	Command-line fingerprinting	Scripted recon, large target lists
Nmap NSE	Network + service detection	Combined port scan + tech ID
wafw00f	WAF identification	Pre-exploitation WAF mapping
Nikto	Web server vulnerability scan	Quick misconfiguration detection

Web Crawling & Spidering — Mapping Site Structure
#

Crawling maps what exists. It surfaces hidden directories, exposed files, comment fields with sensitive data, and form fields that reveal application logic.

How Crawlers Work
#

Seed URL → Fetch Page → Extract Links → Add to Queue → Repeat

Crawling Strategies
#

Strategy	Approach	Best For
Breadth-First	Width before depth	Full structure mapping, finding all sections
Depth-First	Depth before width	Following specific application paths

What Crawlers Extract
#

Internal and external links — full URL map
HTML comments — developers leave credentials, TODOs, and internal hostnames here
Page metadata — titles, descriptions, author fields
Form fields — reveals application functionality and input parameters
Sensitive files — .bak, .config, .log, .sql files left in web root

Crawler Tools
#

Tool	Type	Best For
Burp Suite Spider	Active crawler	Web app mapping during manual testing
OWASP ZAP	Security scanner	Automated + manual combined workflow
Scrapy	Python framework	Custom crawlers, structured data extraction
Apache Nutch	Scalable crawler	Large-scale multi-target crawling

ReconSpider — Scrapy-Based Custom Tool
#

/root/.local/bin/scrapy runspider /opt/ReconSpider.py \
  -a start_url=http://$TARGET_DOMAIN

Output is structured JSON — directly parseable and importable into other tools:

{
  "emails": ["admin@target.com"],
  "links": ["https://target.com/admin/dashboard"],
  "external_files": ["https://target.com/backup.zip"],
  "js_files": ["https://target.com/assets/app.js"],
  "form_fields": ["username", "password"],
  "images": ["https://target.com/logo.png"],
  "comments": ["<!-- staging server: 10.x.x.x -->"]
}

That comments field is where operators find internal IP addresses, hardcoded credentials, and API keys left by developers.

Virtual Host Discovery — Finding Hidden Applications
#

One IP. Multiple websites. Virtual hosting lets a single server run dozens of applications, differentiated only by the HTTP Host header. Applications that don’t appear in DNS. Applications with no public-facing links. Applications that the security team may not know exist.

VHosts vs. Subdomains
#

Subdomains exist at the DNS level — they have their own DNS records and resolve independently. Virtual hosts are web server configurations — they share an IP and are differentiated only by the Host header sent in the HTTP request.

A vhost with no DNS record is completely invisible to passive DNS enumeration. It only surfaces through active brute-forcing.

Types of Virtual Hosting
#

Type	Method	Tradeoff
Name-Based	HTTP Host header	Cost-effective, most common, SSL complications
IP-Based	Unique IP per site	Protocol-independent, requires multiple IPs
Port-Based	Different port per site	IP-conserving, non-standard user-facing URLs

Virtual Host Discovery — Gobuster
#

gobuster vhost \
  -u http://$TARGET_IP \
  -w /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-5000.txt \
  --append-domain \
  -t 50

--append-domain appends the base domain to each wordlist entry, generating dev.target.com, staging.target.com, etc. -t 50 sets thread count — tune based on target stability.

Vhost Discovery Tools
#

Tool	Standout Feature
gobuster	Fast, highly configurable, multi-mode
Feroxbuster	Recursive discovery, wildcard filtering
ffuf	Extremely flexible input/output filtering

Search Engine Discovery — Google Dorking
#

Search engines have indexed things the target never intended to expose. Dorking is passive OSINT — zero traffic to the target, massive intelligence return.

Basic Operators
#

Operator	Purpose	Example
`site:`	Limit results to domain	`site:target.com`
`inurl:`	Term must appear in URL	`inurl:login`
`filetype:`	Filter by file extension	`filetype:pdf`
`intitle:`	Term in page title	`intitle:"admin panel"`
`intext:`	Term in page body	`intext:"password reset"`

Advanced Operators
#

Operator	Purpose	Example
`cache:`	View cached page version	`cache:target.com`
`related:`	Find similar sites	`related:target.com`
`"exact phrase"`	Exact string match	`"confidential report"`
`numrange:`	Number range filter	`numrange:1000-2000`

Boolean Operators
#

Operator	Example
`AND`	`site:target.com AND inurl:admin`
`OR`	`inurl:admin OR inurl:login`
`NOT`	`site:target.com NOT www`

High-Value Dork Combinations
#

Login and Admin Panel Discovery

site:target.com inurl:login
site:target.com (inurl:admin OR inurl:dashboard OR inurl:portal)
intitle:"admin panel" site:target.com

Sensitive File Exposure

site:target.com filetype:pdf
site:target.com (filetype:xls OR filetype:xlsx OR filetype:docx)
site:target.com filetype:sql
site:target.com filetype:log

Configuration and Backup Files

site:target.com inurl:config.php
site:target.com (ext:conf OR ext:cnf OR ext:ini)
site:target.com inurl:backup
site:target.com (ext:bak OR ext:old OR ext:backup)

Exposed Credentials and Keys

site:target.com intext:"api_key"
site:target.com intext:"password" filetype:txt
site:target.com inurl:wp-config.php

OPSEC: Google dorks generate no traffic to the target. They are fully passive. Run them before any active scanning.

Web Archives — The Historical Record
#

The Wayback Machine (web.archive.org) has been archiving the internet since 1996. For an operator, it’s a time machine to find what’s been deleted, what used to be exposed, and how the application has evolved.

Operational Value
#

Historical endpoint discovery — Old API endpoints, admin panels, and pages that still exist on the server but have been removed from navigation
Vulnerability research — Older versions of the application with known CVEs
OSINT — Past employee names, org structure, contact information
Credential exposure — Configuration files that were briefly exposed before being removed
Fully passive — Zero interaction with the live target

How to Use It
#

Navigate to web.archive.org, enter the target URL, and browse the capture timeline. Select specific dates to view historical snapshots. Pay particular attention to captures from 2-5 years ago — that’s when legacy endpoints were often still active and documented in the UI.

Limitations to Account For
#

Coverage is not complete — low-traffic sites may have sparse captures
Site owners can request content removal via robots.txt exclusions
Dynamic content (AJAX-loaded, session-dependent) often doesn’t archive cleanly
Capture frequency varies — some sites have daily snapshots, others monthly

Automation — Integrating the Full Recon Stack
#

Manual techniques build understanding. Automation builds coverage. Use both.

Recon Framework Comparison
#

Framework	Focus	Standout Feature
FinalRecon	All-in-one web recon	Modular Python, covers every technique in this post
Recon-ng	Comprehensive framework	DNS, subdomains, port scanning, data storage
theHarvester	Email + subdomain discovery	Multi-search-engine OSINT aggregation
SpiderFoot	OSINT automation	Broadest data source integration

FinalRecon — Installation and Usage
#

git clone https://github.com/thewhiteh4t/FinalRecon.git
cd FinalRecon
pip3 install -r requirements.txt
chmod +x ./finalrecon.py

# Targeted run — headers + whois
./finalrecon.py --headers --whois --url http://$TARGET_DOMAIN

# Full recon — all modules
./finalrecon.py --full --url http://$TARGET_DOMAIN

FinalRecon Module Reference
#

Module	What It Does
`--headers`	HTTP header analysis and fingerprinting
`--whois`	Domain registration data
`--sslinfo`	SSL certificate details — extracts SANs for subdomain discovery
`--crawl`	Website crawling and link extraction
`--dns`	Full DNS enumeration
`--sub`	Subdomain discovery
`--dir`	Directory brute-forcing
`--wayback`	Historical URL retrieval from Wayback Machine
`--full`	All modules — complete recon run

Recon Playbook — Phase-Ordered Execution
#

Phase 1: Passive Reconnaissance (Zero Target Traffic)
#

# 1. WHOIS — ownership, registrar, creation date
whois $TARGET_DOMAIN

# 2. DNS records — full picture
dig $TARGET_DOMAIN ANY
dig $TARGET_DOMAIN MX
dig $TARGET_DOMAIN NS
dig $TARGET_DOMAIN TXT

# 3. Certificate transparency — passive subdomain discovery
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
  | jq -r '.[].name_value' | sort -u > recon/subdomains_ct.txt

# 4. Google dorking — exposed files, admin panels, credentials
# (run manually in browser — see dork combinations above)

# 5. Wayback Machine — historical endpoint discovery
# (manual review at web.archive.org)

Phase 2: Active Reconnaissance (Target Interaction Begins)
#

# 1. Zone transfer attempt — always try, rarely succeeds
dig axfr @$(dig NS $TARGET_DOMAIN +short | head -1) $TARGET_DOMAIN

# 2. Subdomain brute-force
dnsenum --enum $TARGET_DOMAIN \
  -f /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
  -r > recon/subdomains_brute.txt

# 3. Fingerprinting
curl -I $TARGET_DOMAIN
wafw00f $TARGET_DOMAIN
nikto -h $TARGET_DOMAIN -Tuning b

# 4. Web crawling
/root/.local/bin/scrapy runspider /opt/ReconSpider.py \
  -a start_url=http://$TARGET_DOMAIN

Phase 3: Advanced Techniques
#

# 1. Virtual host discovery
gobuster vhost \
  -u http://$TARGET_IP \
  -w /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-5000.txt \
  --append-domain -t 50

# 2. Directory brute-forcing
ffuf -w /usr/share/wordlists/dirbuster/directory-list-2.3-medium.txt \
  -u http://$TARGET_DOMAIN/FUZZ \
  -mc 200,301,302,403 -t 50

# 3. Full automation run
./finalrecon.py --full --url http://$TARGET_DOMAIN

Phase 4: Correlation and Documentation
#

All recon data gets correlated into a unified target profile before any exploitation begins:

Merge subdomain lists from CT logs + brute-force + crawling
Map technology stack per subdomain
Cross-reference WHOIS data with nameserver findings
Flag highest-value targets: dev/staging environments, admin panels, outdated software versions
Document everything with timestamps

Quick Command Reference
#

DNS
#

dig $TARGET_DOMAIN ANY
dig @1.1.1.1 $TARGET_DOMAIN MX
dig +trace $TARGET_DOMAIN
dig +short $TARGET_DOMAIN
nslookup $TARGET_DOMAIN
host $TARGET_DOMAIN
dig axfr @$NAMESERVER $TARGET_DOMAIN

Subdomain Discovery
#

dnsenum --enum $TARGET_DOMAIN -f wordlist.txt -r
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" | jq -r '.[].name_value' | sort -u
amass enum -passive -d $TARGET_DOMAIN

Fingerprinting
#

curl -I $TARGET_DOMAIN
wafw00f $TARGET_DOMAIN
nikto -h $TARGET_DOMAIN -Tuning b
whatweb $TARGET_DOMAIN

Virtual Hosts
#

gobuster vhost -u http://$TARGET_IP -w wordlist.txt --append-domain -t 50
ffuf -w wordlist.txt -u http://$TARGET_IP -H "Host: FUZZ.$TARGET_DOMAIN" -mc 200

Automation
#

finalrecon --full --url http://$TARGET_DOMAIN
/root/.local/bin/scrapy runspider /opt/ReconSpider.py -a start_url=http://$TARGET_DOMAIN

Key Principles
#

Start passive. Zero-traffic techniques first — CT logs, WHOIS, Google dorks, Wayback Machine. Build your target profile before sending a single packet.

Layer techniques. No single tool sees everything. CT logs miss vhosts. Brute-force misses expired subdomains. Crawling misses DNS-only assets. Stack them.

Validate findings. Cross-reference subdomains from CT logs against brute-force results. Confirm fingerprinting output against multiple tools. False positives in recon become wasted exploitation attempts downstream.

Document in real time. Save every output to a structured directory. You cannot reconstruct recon state from memory in a debrief or a court of law.

Respect the scope. Everything above applies strictly within authorized boundaries. Out-of-scope assets discovered during recon get documented and disclosed to the client — not touched.

Recon is where the engagement is won or lost. The operator who maps the full attack surface before touching an exploit lands shells. The operator who skips it fires blind.

Web Information Gathering: The Operator’s Complete Recon Playbook#

The Four Goals of Web Recon#

DNS Reconnaissance — The Foundation#

DNS Resolution Chain#

Essential DNS Tools#

Dig — Master the Tool#

DNS Record Type Reference#

DNS Zone Transfers — Low Probability, High Payoff#

Subdomain Enumeration — Expanding the Attack Surface#

Passive vs. Active Enumeration#

Tool Comparison#

Brute-Force Subdomain Discovery#

WHOIS Investigation — Who Owns What#

Basic Query#

Three Operational Use Cases#

What Legitimate vs. Suspicious WHOIS Looks Like#

Certificate Transparency Logs — Definitive Subdomain Intel#

Why CT Logs Beat Brute-Forcing#

Tools#

Command Line — Extract Subdomains via API#

Fingerprinting — Mapping the Technology Stack#

What Fingerprinting Tells You#

Technique 1: Banner Grabbing#

Technique 2: HTTP Header Analysis#

Technique 3: WAF Detection#

Technique 4: Nikto Web Server Scan#

Fingerprinting Tool Summary#

Web Crawling & Spidering — Mapping Site Structure#

How Crawlers Work#

Crawling Strategies#

What Crawlers Extract#

Crawler Tools#

ReconSpider — Scrapy-Based Custom Tool#

Virtual Host Discovery — Finding Hidden Applications#

VHosts vs. Subdomains#

Types of Virtual Hosting#

Virtual Host Discovery — Gobuster#

Vhost Discovery Tools#

Search Engine Discovery — Google Dorking#

Basic Operators#

Advanced Operators#

Boolean Operators#

High-Value Dork Combinations#

Web Archives — The Historical Record#

Operational Value#

How to Use It#

Limitations to Account For#

Automation — Integrating the Full Recon Stack#

Recon Framework Comparison#

FinalRecon — Installation and Usage#

FinalRecon Module Reference#

Recon Playbook — Phase-Ordered Execution#

Phase 1: Passive Reconnaissance (Zero Target Traffic)#

Phase 2: Active Reconnaissance (Target Interaction Begins)#

Phase 3: Advanced Techniques#

Phase 4: Correlation and Documentation#

Quick Command Reference#

DNS#

Subdomain Discovery#

Fingerprinting#

Virtual Hosts#

Automation#

Key Principles#