Web Information Gathering: The Operator’s Complete Recon Playbook#
Reconnaissance is not a formality before the “real” hacking starts. It is the hacking. Every shell you land on, every misconfiguration you exploit, every credential you harvest — traces back to something you found in the recon phase. Operators who invest here find attack vectors that automated scanners will never surface.
This is a complete field reference for web information gathering. Passive to active. Domain to subdomain. DNS to full technology stack. No padding — every technique maps to a real operational outcome.
The Four Goals of Web Recon#
Before running a single tool, lock in what you’re actually hunting:
Asset Identification — Web pages, subdomains, IP ranges, technology stacks. You cannot attack what you haven’t mapped.
Hidden Information Discovery — Backup files, configuration files, exposed documentation, forgotten endpoints. These rarely appear in automated scans.
Attack Surface Analysis — Entry points, misconfigurations, outdated software versions, exposed admin interfaces.
Intelligence Gathering — Email addresses, employee data, organizational patterns. Fuel for phishing, password spraying, and social engineering.
DNS Reconnaissance — The Foundation#
DNS is the GPS of the internet. Every web target has a DNS footprint. Understanding it gives you the full picture of what exists, what’s been forgotten, and what might be reachable.
DNS Resolution Chain#
Understanding the resolution path matters when you’re diagnosing filtering or performing poisoning research:
Client Query → Local Cache → Recursive Resolver → Root Server → TLD Server → Authoritative Server → IP ReturnedEssential DNS Tools#
| Tool | Purpose | Key Strength |
|---|---|---|
dig | Versatile DNS queries | Multiple record types, detailed output |
nslookup | Basic DNS lookups | A, AAAA, MX records |
host | Quick DNS checks | Concise output |
dnsenum | Automated enumeration | Subdomain discovery, brute-forcing |
dnsrecon | Comprehensive recon | Multiple techniques, multiple output formats |
Dig — Master the Tool#
# Basic record queries
dig $TARGET_DOMAIN
dig $TARGET_DOMAIN A
dig $TARGET_DOMAIN MX
dig $TARGET_DOMAIN NS
dig $TARGET_DOMAIN TXT
# Advanced usage
dig @1.1.1.1 $TARGET_DOMAIN # Query specific nameserver
dig +trace $TARGET_DOMAIN # Full resolution path — trace every hop
dig -x $TARGET_IP # Reverse DNS lookup
dig +short $TARGET_DOMAIN # Clean, concise output
dig +noall +answer $TARGET_DOMAIN # Answer section only — no noiseDNS Record Type Reference#
| Record | Purpose | Example |
|---|---|---|
A | IPv4 address mapping | www IN A $TARGET_IP |
AAAA | IPv6 address mapping | www IN AAAA 2001:db8::1 |
CNAME | Hostname alias | blog IN CNAME webserver.net |
MX | Mail server routing | @ IN MX 10 mail.domain.com |
NS | Authoritative nameservers | @ IN NS ns1.domain.com |
TXT | Arbitrary text data (SPF, DKIM, verification) | @ IN TXT "v=spf1 mx -all" |
SOA | Zone authority and admin contact | Admin email, serial number, TTL timing |
PTR | Reverse lookup — IP to hostname | Maps IPs back to hostnames |
Operator note: TXT records are a goldmine. SPF records expose mail infrastructure. Verification tokens (Google, Microsoft, etc.) confirm which cloud services the target uses. DKIM records expose mail signing keys.
DNS Zone Transfers — Low Probability, High Payoff#
A misconfigured DNS server will hand you the complete zone file — every hostname, every IP, every internal record. Always attempt it.
dig axfr @$NAMESERVER $TARGET_DOMAINA successful zone transfer exposes the complete subdomain list, all host IP addresses, internal infrastructure names, and mail server configuration. Misconfigured zone transfers are rare in modern environments but still appear on internal assessments and older infrastructure.
Subdomain Enumeration — Expanding the Attack Surface#
The main domain is the front door. Subdomains are the side entrances, the basement windows, and the unlocked service hatches. Development and staging environments live here. Legacy apps with no patch cadence live here. Forgotten admin panels live here.
Passive vs. Active Enumeration#
Passive — No traffic touches the target. Uses external data sources: certificate transparency logs, search engines, online DNS databases.
Active — Direct interaction with target DNS. Brute-forcing wordlists against the resolver, attempting zone transfers.
Tool Comparison#
| Tool | Primary Use | Standout Feature |
|---|---|---|
dnsenum | Comprehensive DNS recon | Zone transfer attempts + Google scraping |
fierce | Recursive subdomain discovery | Wildcard detection, clean output |
amass | Active + passive discovery | Extensive multi-source data integration |
Brute-Force Subdomain Discovery#
dnsenum --enum $TARGET_DOMAIN \
-f /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
-r--enum enables full enumeration mode. -f points to your wordlist. -r enables recursive enumeration — subdomains of subdomains. That last flag surfaces things like dev.internal.corp.target.com that single-level enumeration misses.
WHOIS Investigation — Who Owns What#
WHOIS data is often underused. Beyond confirming a registrant, it’s a threat intelligence tool.
Basic Query#
whois $TARGET_DOMAINThree Operational Use Cases#
Phishing Investigation — Red Flags to Hunt
- Domain registered days or weeks ago
- Registrant information hidden behind privacy services
- Nameservers associated with bulletproof hosting providers
Malware C2 Analysis — Indicators
- Registrant using anonymous or disposable email services
- Registration origin in jurisdictions with lax abuse enforcement
- Registrar with a history of ignoring abuse reports
Threat Intelligence — Pattern Analysis
- Clusters of domain registrations preceding known attack campaigns
- Registrants using aliases or provably fake identities
- Shared nameservers across multiple suspicious domains
- Prior ICANN takedown or abuse history
What Legitimate vs. Suspicious WHOIS Looks Like#
A domain like facebook.com shows: established registrar (RegistrarSafe LLC), creation date in 1997, verified organizational registrant (Meta Platforms Inc.), and multiple ICANN protection status flags. A phishing domain looks like the inverse — registered last week, privacy-protected, hosted on a bulletproof provider with a generic registrar.
Certificate Transparency Logs — Definitive Subdomain Intel#
CT logs are public, append-only ledgers that record every SSL/TLS certificate ever issued. Every certificate. Including the ones for subdomains the target thought were internal-only.
Why CT Logs Beat Brute-Forcing#
- Definitive: These are actual issued certificates, not guesses
- Historical: Includes expired and revoked certificates — surfaces decommissioned assets
- Comprehensive: Catches subdomains that would never appear in a wordlist
- Passive: Zero traffic to the target
Tools#
| Tool | Access Method | Notes |
|---|---|---|
| crt.sh | Web UI + API | Free, no registration, JSON output |
| Censys | API | Advanced filtering, richer metadata |
Command Line — Extract Subdomains via API#
# All subdomains for a domain
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
| jq -r '.[].name_value' | sort -u
# Filter for specific environment (dev, staging, admin, etc.)
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
| jq -r '.[] | select(.name_value | contains("dev")) | .name_value' \
| sort -uExample output from a real engagement against a major platform:
*.dev.target.com
*.newdev.target.com
dev.target.com
newdev.target.com
secure.dev.target.comEvery one of those is a potential low-security entry point into the same organization.
Fingerprinting — Mapping the Technology Stack#
You’ve found the hosts. Now you need to know exactly what’s running on them. Fingerprinting is the bridge between enumeration and targeted exploitation.
What Fingerprinting Tells You#
- Exact web server software and version
- CMS platform and version (WordPress, Drupal, Joomla)
- Backend language and framework
- WAF presence and vendor
- Outdated components with known CVEs
Technique 1: Banner Grabbing#
curl -I $TARGET_DOMAINExample output:
HTTP/1.1 200 OK
Server: Apache/2.4.41 (Ubuntu)
X-Powered-By: PHP/7.4.3
X-Redirect-By: WordPressThree findings in three lines: Apache version (check for CVEs), PHP version (check for CVEs), WordPress confirmed.
Technique 2: HTTP Header Analysis#
Key headers to parse:
Server— Web server software and versionX-Powered-By— Backend language or frameworkX-Generator— CMS fingerprint (common in WordPress, Drupal)Set-Cookie— Session cookie names reveal platforms (PHPSESSID= PHP,JSESSIONID= Java)- Custom headers — Often reveal CDN provider, load balancer, or proprietary platform
Technique 3: WAF Detection#
wafw00f $TARGET_DOMAINIdentifies the WAF vendor and type before you attempt any active testing. Knowing you’re hitting a Cloudflare WAF vs. ModSecurity vs. Wordfence changes your entire approach.
Technique 4: Nikto Web Server Scan#
# Standard scan
nikto -h $TARGET_DOMAIN -Tuning b
# Against a specific virtual host
nikto -h http://preprod-payroll.$TARGET_DOMAIN -Tuning b-Tuning b focuses on software identification. Nikto will surface exposed configuration files, outdated headers, missing security headers, and default credentials pages.
Fingerprinting Tool Summary#
| Tool | Purpose | Best For |
|---|---|---|
| Wappalyzer | Technology profiling | Browser-based passive fingerprinting |
| WhatWeb | Command-line fingerprinting | Scripted recon, large target lists |
| Nmap NSE | Network + service detection | Combined port scan + tech ID |
| wafw00f | WAF identification | Pre-exploitation WAF mapping |
| Nikto | Web server vulnerability scan | Quick misconfiguration detection |
Web Crawling & Spidering — Mapping Site Structure#
Crawling maps what exists. It surfaces hidden directories, exposed files, comment fields with sensitive data, and form fields that reveal application logic.
How Crawlers Work#
Seed URL → Fetch Page → Extract Links → Add to Queue → RepeatCrawling Strategies#
| Strategy | Approach | Best For |
|---|---|---|
| Breadth-First | Width before depth | Full structure mapping, finding all sections |
| Depth-First | Depth before width | Following specific application paths |
What Crawlers Extract#
- Internal and external links — full URL map
- HTML comments — developers leave credentials, TODOs, and internal hostnames here
- Page metadata — titles, descriptions, author fields
- Form fields — reveals application functionality and input parameters
- Sensitive files —
.bak,.config,.log,.sqlfiles left in web root
Crawler Tools#
| Tool | Type | Best For |
|---|---|---|
| Burp Suite Spider | Active crawler | Web app mapping during manual testing |
| OWASP ZAP | Security scanner | Automated + manual combined workflow |
| Scrapy | Python framework | Custom crawlers, structured data extraction |
| Apache Nutch | Scalable crawler | Large-scale multi-target crawling |
ReconSpider — Scrapy-Based Custom Tool#
/root/.local/bin/scrapy runspider /opt/ReconSpider.py \
-a start_url=http://$TARGET_DOMAINOutput is structured JSON — directly parseable and importable into other tools:
{
"emails": ["admin@target.com"],
"links": ["https://target.com/admin/dashboard"],
"external_files": ["https://target.com/backup.zip"],
"js_files": ["https://target.com/assets/app.js"],
"form_fields": ["username", "password"],
"images": ["https://target.com/logo.png"],
"comments": ["<!-- staging server: 10.x.x.x -->"]
}That comments field is where operators find internal IP addresses, hardcoded credentials, and API keys left by developers.
Virtual Host Discovery — Finding Hidden Applications#
One IP. Multiple websites. Virtual hosting lets a single server run dozens of applications, differentiated only by the HTTP Host header. Applications that don’t appear in DNS. Applications with no public-facing links. Applications that the security team may not know exist.
VHosts vs. Subdomains#
Subdomains exist at the DNS level — they have their own DNS records and resolve independently. Virtual hosts are web server configurations — they share an IP and are differentiated only by the Host header sent in the HTTP request.
A vhost with no DNS record is completely invisible to passive DNS enumeration. It only surfaces through active brute-forcing.
Types of Virtual Hosting#
| Type | Method | Tradeoff |
|---|---|---|
| Name-Based | HTTP Host header | Cost-effective, most common, SSL complications |
| IP-Based | Unique IP per site | Protocol-independent, requires multiple IPs |
| Port-Based | Different port per site | IP-conserving, non-standard user-facing URLs |
Virtual Host Discovery — Gobuster#
gobuster vhost \
-u http://$TARGET_IP \
-w /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-5000.txt \
--append-domain \
-t 50--append-domain appends the base domain to each wordlist entry, generating dev.target.com, staging.target.com, etc. -t 50 sets thread count — tune based on target stability.
Vhost Discovery Tools#
| Tool | Standout Feature |
|---|---|
| gobuster | Fast, highly configurable, multi-mode |
| Feroxbuster | Recursive discovery, wildcard filtering |
| ffuf | Extremely flexible input/output filtering |
Search Engine Discovery — Google Dorking#
Search engines have indexed things the target never intended to expose. Dorking is passive OSINT — zero traffic to the target, massive intelligence return.
Basic Operators#
| Operator | Purpose | Example |
|---|---|---|
site: | Limit results to domain | site:target.com |
inurl: | Term must appear in URL | inurl:login |
filetype: | Filter by file extension | filetype:pdf |
intitle: | Term in page title | intitle:"admin panel" |
intext: | Term in page body | intext:"password reset" |
Advanced Operators#
| Operator | Purpose | Example |
|---|---|---|
cache: | View cached page version | cache:target.com |
related: | Find similar sites | related:target.com |
"exact phrase" | Exact string match | "confidential report" |
numrange: | Number range filter | numrange:1000-2000 |
Boolean Operators#
| Operator | Example |
|---|---|
AND | site:target.com AND inurl:admin |
OR | inurl:admin OR inurl:login |
NOT | site:target.com NOT www |
High-Value Dork Combinations#
Login and Admin Panel Discovery
site:target.com inurl:login
site:target.com (inurl:admin OR inurl:dashboard OR inurl:portal)
intitle:"admin panel" site:target.comSensitive File Exposure
site:target.com filetype:pdf
site:target.com (filetype:xls OR filetype:xlsx OR filetype:docx)
site:target.com filetype:sql
site:target.com filetype:logConfiguration and Backup Files
site:target.com inurl:config.php
site:target.com (ext:conf OR ext:cnf OR ext:ini)
site:target.com inurl:backup
site:target.com (ext:bak OR ext:old OR ext:backup)Exposed Credentials and Keys
site:target.com intext:"api_key"
site:target.com intext:"password" filetype:txt
site:target.com inurl:wp-config.phpOPSEC: Google dorks generate no traffic to the target. They are fully passive. Run them before any active scanning.
Web Archives — The Historical Record#
The Wayback Machine (web.archive.org) has been archiving the internet since 1996. For an operator, it’s a time machine to find what’s been deleted, what used to be exposed, and how the application has evolved.
Operational Value#
- Historical endpoint discovery — Old API endpoints, admin panels, and pages that still exist on the server but have been removed from navigation
- Vulnerability research — Older versions of the application with known CVEs
- OSINT — Past employee names, org structure, contact information
- Credential exposure — Configuration files that were briefly exposed before being removed
- Fully passive — Zero interaction with the live target
How to Use It#
Navigate to web.archive.org, enter the target URL, and browse the capture timeline. Select specific dates to view historical snapshots. Pay particular attention to captures from 2-5 years ago — that’s when legacy endpoints were often still active and documented in the UI.
Limitations to Account For#
- Coverage is not complete — low-traffic sites may have sparse captures
- Site owners can request content removal via
robots.txtexclusions - Dynamic content (AJAX-loaded, session-dependent) often doesn’t archive cleanly
- Capture frequency varies — some sites have daily snapshots, others monthly
Automation — Integrating the Full Recon Stack#
Manual techniques build understanding. Automation builds coverage. Use both.
Recon Framework Comparison#
| Framework | Focus | Standout Feature |
|---|---|---|
| FinalRecon | All-in-one web recon | Modular Python, covers every technique in this post |
| Recon-ng | Comprehensive framework | DNS, subdomains, port scanning, data storage |
| theHarvester | Email + subdomain discovery | Multi-search-engine OSINT aggregation |
| SpiderFoot | OSINT automation | Broadest data source integration |
FinalRecon — Installation and Usage#
git clone https://github.com/thewhiteh4t/FinalRecon.git
cd FinalRecon
pip3 install -r requirements.txt
chmod +x ./finalrecon.py
# Targeted run — headers + whois
./finalrecon.py --headers --whois --url http://$TARGET_DOMAIN
# Full recon — all modules
./finalrecon.py --full --url http://$TARGET_DOMAINFinalRecon Module Reference#
| Module | What It Does |
|---|---|
--headers | HTTP header analysis and fingerprinting |
--whois | Domain registration data |
--sslinfo | SSL certificate details — extracts SANs for subdomain discovery |
--crawl | Website crawling and link extraction |
--dns | Full DNS enumeration |
--sub | Subdomain discovery |
--dir | Directory brute-forcing |
--wayback | Historical URL retrieval from Wayback Machine |
--full | All modules — complete recon run |
Recon Playbook — Phase-Ordered Execution#
Phase 1: Passive Reconnaissance (Zero Target Traffic)#
# 1. WHOIS — ownership, registrar, creation date
whois $TARGET_DOMAIN
# 2. DNS records — full picture
dig $TARGET_DOMAIN ANY
dig $TARGET_DOMAIN MX
dig $TARGET_DOMAIN NS
dig $TARGET_DOMAIN TXT
# 3. Certificate transparency — passive subdomain discovery
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
| jq -r '.[].name_value' | sort -u > recon/subdomains_ct.txt
# 4. Google dorking — exposed files, admin panels, credentials
# (run manually in browser — see dork combinations above)
# 5. Wayback Machine — historical endpoint discovery
# (manual review at web.archive.org)Phase 2: Active Reconnaissance (Target Interaction Begins)#
# 1. Zone transfer attempt — always try, rarely succeeds
dig axfr @$(dig NS $TARGET_DOMAIN +short | head -1) $TARGET_DOMAIN
# 2. Subdomain brute-force
dnsenum --enum $TARGET_DOMAIN \
-f /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
-r > recon/subdomains_brute.txt
# 3. Fingerprinting
curl -I $TARGET_DOMAIN
wafw00f $TARGET_DOMAIN
nikto -h $TARGET_DOMAIN -Tuning b
# 4. Web crawling
/root/.local/bin/scrapy runspider /opt/ReconSpider.py \
-a start_url=http://$TARGET_DOMAINPhase 3: Advanced Techniques#
# 1. Virtual host discovery
gobuster vhost \
-u http://$TARGET_IP \
-w /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-5000.txt \
--append-domain -t 50
# 2. Directory brute-forcing
ffuf -w /usr/share/wordlists/dirbuster/directory-list-2.3-medium.txt \
-u http://$TARGET_DOMAIN/FUZZ \
-mc 200,301,302,403 -t 50
# 3. Full automation run
./finalrecon.py --full --url http://$TARGET_DOMAINPhase 4: Correlation and Documentation#
All recon data gets correlated into a unified target profile before any exploitation begins:
- Merge subdomain lists from CT logs + brute-force + crawling
- Map technology stack per subdomain
- Cross-reference WHOIS data with nameserver findings
- Flag highest-value targets: dev/staging environments, admin panels, outdated software versions
- Document everything with timestamps
Quick Command Reference#
DNS#
dig $TARGET_DOMAIN ANY
dig @1.1.1.1 $TARGET_DOMAIN MX
dig +trace $TARGET_DOMAIN
dig +short $TARGET_DOMAIN
nslookup $TARGET_DOMAIN
host $TARGET_DOMAIN
dig axfr @$NAMESERVER $TARGET_DOMAINSubdomain Discovery#
dnsenum --enum $TARGET_DOMAIN -f wordlist.txt -r
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" | jq -r '.[].name_value' | sort -u
amass enum -passive -d $TARGET_DOMAINFingerprinting#
curl -I $TARGET_DOMAIN
wafw00f $TARGET_DOMAIN
nikto -h $TARGET_DOMAIN -Tuning b
whatweb $TARGET_DOMAINVirtual Hosts#
gobuster vhost -u http://$TARGET_IP -w wordlist.txt --append-domain -t 50
ffuf -w wordlist.txt -u http://$TARGET_IP -H "Host: FUZZ.$TARGET_DOMAIN" -mc 200Automation#
finalrecon --full --url http://$TARGET_DOMAIN
/root/.local/bin/scrapy runspider /opt/ReconSpider.py -a start_url=http://$TARGET_DOMAINKey Principles#
Start passive. Zero-traffic techniques first — CT logs, WHOIS, Google dorks, Wayback Machine. Build your target profile before sending a single packet.
Layer techniques. No single tool sees everything. CT logs miss vhosts. Brute-force misses expired subdomains. Crawling misses DNS-only assets. Stack them.
Validate findings. Cross-reference subdomains from CT logs against brute-force results. Confirm fingerprinting output against multiple tools. False positives in recon become wasted exploitation attempts downstream.
Document in real time. Save every output to a structured directory. You cannot reconstruct recon state from memory in a debrief or a court of law.
Respect the scope. Everything above applies strictly within authorized boundaries. Out-of-scope assets discovered during recon get documented and disclosed to the client — not touched.
Recon is where the engagement is won or lost. The operator who maps the full attack surface before touching an exploit lands shells. The operator who skips it fires blind.


