Skip to main content

Web Information Gathering: The Operator's Complete Recon Playbook

·2855 words·14 mins
AskBlueCat
Author
AskBlueCat
Just a another typical human trying……!

Web Information Gathering: The Operator’s Complete Recon Playbook
#

Reconnaissance is not a formality before the “real” hacking starts. It is the hacking. Every shell you land on, every misconfiguration you exploit, every credential you harvest — traces back to something you found in the recon phase. Operators who invest here find attack vectors that automated scanners will never surface.

This is a complete field reference for web information gathering. Passive to active. Domain to subdomain. DNS to full technology stack. No padding — every technique maps to a real operational outcome.


The Four Goals of Web Recon
#

Before running a single tool, lock in what you’re actually hunting:

Asset Identification — Web pages, subdomains, IP ranges, technology stacks. You cannot attack what you haven’t mapped.

Hidden Information Discovery — Backup files, configuration files, exposed documentation, forgotten endpoints. These rarely appear in automated scans.

Attack Surface Analysis — Entry points, misconfigurations, outdated software versions, exposed admin interfaces.

Intelligence Gathering — Email addresses, employee data, organizational patterns. Fuel for phishing, password spraying, and social engineering.


DNS Reconnaissance — The Foundation
#

DNS is the GPS of the internet. Every web target has a DNS footprint. Understanding it gives you the full picture of what exists, what’s been forgotten, and what might be reachable.

DNS Resolution Chain
#

Understanding the resolution path matters when you’re diagnosing filtering or performing poisoning research:

Client Query → Local Cache → Recursive Resolver → Root Server → TLD Server → Authoritative Server → IP Returned

Essential DNS Tools
#

ToolPurposeKey Strength
digVersatile DNS queriesMultiple record types, detailed output
nslookupBasic DNS lookupsA, AAAA, MX records
hostQuick DNS checksConcise output
dnsenumAutomated enumerationSubdomain discovery, brute-forcing
dnsreconComprehensive reconMultiple techniques, multiple output formats

Dig — Master the Tool
#

# Basic record queries
dig $TARGET_DOMAIN
dig $TARGET_DOMAIN A
dig $TARGET_DOMAIN MX
dig $TARGET_DOMAIN NS
dig $TARGET_DOMAIN TXT

# Advanced usage
dig @1.1.1.1 $TARGET_DOMAIN        # Query specific nameserver
dig +trace $TARGET_DOMAIN           # Full resolution path — trace every hop
dig -x $TARGET_IP                   # Reverse DNS lookup
dig +short $TARGET_DOMAIN           # Clean, concise output
dig +noall +answer $TARGET_DOMAIN   # Answer section only — no noise

DNS Record Type Reference
#

RecordPurposeExample
AIPv4 address mappingwww IN A $TARGET_IP
AAAAIPv6 address mappingwww IN AAAA 2001:db8::1
CNAMEHostname aliasblog IN CNAME webserver.net
MXMail server routing@ IN MX 10 mail.domain.com
NSAuthoritative nameservers@ IN NS ns1.domain.com
TXTArbitrary text data (SPF, DKIM, verification)@ IN TXT "v=spf1 mx -all"
SOAZone authority and admin contactAdmin email, serial number, TTL timing
PTRReverse lookup — IP to hostnameMaps IPs back to hostnames

Operator note: TXT records are a goldmine. SPF records expose mail infrastructure. Verification tokens (Google, Microsoft, etc.) confirm which cloud services the target uses. DKIM records expose mail signing keys.

DNS Zone Transfers — Low Probability, High Payoff
#

A misconfigured DNS server will hand you the complete zone file — every hostname, every IP, every internal record. Always attempt it.

dig axfr @$NAMESERVER $TARGET_DOMAIN

A successful zone transfer exposes the complete subdomain list, all host IP addresses, internal infrastructure names, and mail server configuration. Misconfigured zone transfers are rare in modern environments but still appear on internal assessments and older infrastructure.


Subdomain Enumeration — Expanding the Attack Surface
#

The main domain is the front door. Subdomains are the side entrances, the basement windows, and the unlocked service hatches. Development and staging environments live here. Legacy apps with no patch cadence live here. Forgotten admin panels live here.

Passive vs. Active Enumeration
#

Passive — No traffic touches the target. Uses external data sources: certificate transparency logs, search engines, online DNS databases.

Active — Direct interaction with target DNS. Brute-forcing wordlists against the resolver, attempting zone transfers.

Tool Comparison
#

ToolPrimary UseStandout Feature
dnsenumComprehensive DNS reconZone transfer attempts + Google scraping
fierceRecursive subdomain discoveryWildcard detection, clean output
amassActive + passive discoveryExtensive multi-source data integration

Brute-Force Subdomain Discovery
#

dnsenum --enum $TARGET_DOMAIN \
  -f /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
  -r

--enum enables full enumeration mode. -f points to your wordlist. -r enables recursive enumeration — subdomains of subdomains. That last flag surfaces things like dev.internal.corp.target.com that single-level enumeration misses.


WHOIS Investigation — Who Owns What
#

WHOIS data is often underused. Beyond confirming a registrant, it’s a threat intelligence tool.

Basic Query
#

whois $TARGET_DOMAIN

Three Operational Use Cases
#

Phishing Investigation — Red Flags to Hunt

  • Domain registered days or weeks ago
  • Registrant information hidden behind privacy services
  • Nameservers associated with bulletproof hosting providers

Malware C2 Analysis — Indicators

  • Registrant using anonymous or disposable email services
  • Registration origin in jurisdictions with lax abuse enforcement
  • Registrar with a history of ignoring abuse reports

Threat Intelligence — Pattern Analysis

  • Clusters of domain registrations preceding known attack campaigns
  • Registrants using aliases or provably fake identities
  • Shared nameservers across multiple suspicious domains
  • Prior ICANN takedown or abuse history

What Legitimate vs. Suspicious WHOIS Looks Like
#

A domain like facebook.com shows: established registrar (RegistrarSafe LLC), creation date in 1997, verified organizational registrant (Meta Platforms Inc.), and multiple ICANN protection status flags. A phishing domain looks like the inverse — registered last week, privacy-protected, hosted on a bulletproof provider with a generic registrar.


Certificate Transparency Logs — Definitive Subdomain Intel
#

CT logs are public, append-only ledgers that record every SSL/TLS certificate ever issued. Every certificate. Including the ones for subdomains the target thought were internal-only.

Why CT Logs Beat Brute-Forcing
#

  • Definitive: These are actual issued certificates, not guesses
  • Historical: Includes expired and revoked certificates — surfaces decommissioned assets
  • Comprehensive: Catches subdomains that would never appear in a wordlist
  • Passive: Zero traffic to the target

Tools
#

ToolAccess MethodNotes
crt.shWeb UI + APIFree, no registration, JSON output
CensysAPIAdvanced filtering, richer metadata

Command Line — Extract Subdomains via API
#

# All subdomains for a domain
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
  | jq -r '.[].name_value' | sort -u

# Filter for specific environment (dev, staging, admin, etc.)
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
  | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' \
  | sort -u

Example output from a real engagement against a major platform:

*.dev.target.com
*.newdev.target.com
dev.target.com
newdev.target.com
secure.dev.target.com

Every one of those is a potential low-security entry point into the same organization.


Fingerprinting — Mapping the Technology Stack
#

You’ve found the hosts. Now you need to know exactly what’s running on them. Fingerprinting is the bridge between enumeration and targeted exploitation.

What Fingerprinting Tells You
#

  • Exact web server software and version
  • CMS platform and version (WordPress, Drupal, Joomla)
  • Backend language and framework
  • WAF presence and vendor
  • Outdated components with known CVEs

Technique 1: Banner Grabbing
#

curl -I $TARGET_DOMAIN

Example output:

HTTP/1.1 200 OK
Server: Apache/2.4.41 (Ubuntu)
X-Powered-By: PHP/7.4.3
X-Redirect-By: WordPress

Three findings in three lines: Apache version (check for CVEs), PHP version (check for CVEs), WordPress confirmed.

Technique 2: HTTP Header Analysis
#

Key headers to parse:

  • Server — Web server software and version
  • X-Powered-By — Backend language or framework
  • X-Generator — CMS fingerprint (common in WordPress, Drupal)
  • Set-Cookie — Session cookie names reveal platforms (PHPSESSID = PHP, JSESSIONID = Java)
  • Custom headers — Often reveal CDN provider, load balancer, or proprietary platform

Technique 3: WAF Detection
#

wafw00f $TARGET_DOMAIN

Identifies the WAF vendor and type before you attempt any active testing. Knowing you’re hitting a Cloudflare WAF vs. ModSecurity vs. Wordfence changes your entire approach.

Technique 4: Nikto Web Server Scan
#

# Standard scan
nikto -h $TARGET_DOMAIN -Tuning b

# Against a specific virtual host
nikto -h http://preprod-payroll.$TARGET_DOMAIN -Tuning b

-Tuning b focuses on software identification. Nikto will surface exposed configuration files, outdated headers, missing security headers, and default credentials pages.

Fingerprinting Tool Summary
#

ToolPurposeBest For
WappalyzerTechnology profilingBrowser-based passive fingerprinting
WhatWebCommand-line fingerprintingScripted recon, large target lists
Nmap NSENetwork + service detectionCombined port scan + tech ID
wafw00fWAF identificationPre-exploitation WAF mapping
NiktoWeb server vulnerability scanQuick misconfiguration detection

Web Crawling & Spidering — Mapping Site Structure
#

Crawling maps what exists. It surfaces hidden directories, exposed files, comment fields with sensitive data, and form fields that reveal application logic.

How Crawlers Work
#

Seed URL → Fetch Page → Extract Links → Add to Queue → Repeat

Crawling Strategies
#

StrategyApproachBest For
Breadth-FirstWidth before depthFull structure mapping, finding all sections
Depth-FirstDepth before widthFollowing specific application paths

What Crawlers Extract
#

  • Internal and external links — full URL map
  • HTML comments — developers leave credentials, TODOs, and internal hostnames here
  • Page metadata — titles, descriptions, author fields
  • Form fields — reveals application functionality and input parameters
  • Sensitive files — .bak, .config, .log, .sql files left in web root

Crawler Tools
#

ToolTypeBest For
Burp Suite SpiderActive crawlerWeb app mapping during manual testing
OWASP ZAPSecurity scannerAutomated + manual combined workflow
ScrapyPython frameworkCustom crawlers, structured data extraction
Apache NutchScalable crawlerLarge-scale multi-target crawling

ReconSpider — Scrapy-Based Custom Tool
#

/root/.local/bin/scrapy runspider /opt/ReconSpider.py \
  -a start_url=http://$TARGET_DOMAIN

Output is structured JSON — directly parseable and importable into other tools:

{
  "emails": ["admin@target.com"],
  "links": ["https://target.com/admin/dashboard"],
  "external_files": ["https://target.com/backup.zip"],
  "js_files": ["https://target.com/assets/app.js"],
  "form_fields": ["username", "password"],
  "images": ["https://target.com/logo.png"],
  "comments": ["<!-- staging server: 10.x.x.x -->"]
}

That comments field is where operators find internal IP addresses, hardcoded credentials, and API keys left by developers.


Virtual Host Discovery — Finding Hidden Applications
#

One IP. Multiple websites. Virtual hosting lets a single server run dozens of applications, differentiated only by the HTTP Host header. Applications that don’t appear in DNS. Applications with no public-facing links. Applications that the security team may not know exist.

VHosts vs. Subdomains
#

Subdomains exist at the DNS level — they have their own DNS records and resolve independently. Virtual hosts are web server configurations — they share an IP and are differentiated only by the Host header sent in the HTTP request.

A vhost with no DNS record is completely invisible to passive DNS enumeration. It only surfaces through active brute-forcing.

Types of Virtual Hosting
#

TypeMethodTradeoff
Name-BasedHTTP Host headerCost-effective, most common, SSL complications
IP-BasedUnique IP per siteProtocol-independent, requires multiple IPs
Port-BasedDifferent port per siteIP-conserving, non-standard user-facing URLs

Virtual Host Discovery — Gobuster
#

gobuster vhost \
  -u http://$TARGET_IP \
  -w /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-5000.txt \
  --append-domain \
  -t 50

--append-domain appends the base domain to each wordlist entry, generating dev.target.com, staging.target.com, etc. -t 50 sets thread count — tune based on target stability.

Vhost Discovery Tools
#

ToolStandout Feature
gobusterFast, highly configurable, multi-mode
FeroxbusterRecursive discovery, wildcard filtering
ffufExtremely flexible input/output filtering

Search Engine Discovery — Google Dorking
#

Search engines have indexed things the target never intended to expose. Dorking is passive OSINT — zero traffic to the target, massive intelligence return.

Basic Operators
#

OperatorPurposeExample
site:Limit results to domainsite:target.com
inurl:Term must appear in URLinurl:login
filetype:Filter by file extensionfiletype:pdf
intitle:Term in page titleintitle:"admin panel"
intext:Term in page bodyintext:"password reset"

Advanced Operators
#

OperatorPurposeExample
cache:View cached page versioncache:target.com
related:Find similar sitesrelated:target.com
"exact phrase"Exact string match"confidential report"
numrange:Number range filternumrange:1000-2000

Boolean Operators
#

OperatorExample
ANDsite:target.com AND inurl:admin
ORinurl:admin OR inurl:login
NOTsite:target.com NOT www

High-Value Dork Combinations
#

Login and Admin Panel Discovery

site:target.com inurl:login
site:target.com (inurl:admin OR inurl:dashboard OR inurl:portal)
intitle:"admin panel" site:target.com

Sensitive File Exposure

site:target.com filetype:pdf
site:target.com (filetype:xls OR filetype:xlsx OR filetype:docx)
site:target.com filetype:sql
site:target.com filetype:log

Configuration and Backup Files

site:target.com inurl:config.php
site:target.com (ext:conf OR ext:cnf OR ext:ini)
site:target.com inurl:backup
site:target.com (ext:bak OR ext:old OR ext:backup)

Exposed Credentials and Keys

site:target.com intext:"api_key"
site:target.com intext:"password" filetype:txt
site:target.com inurl:wp-config.php

OPSEC: Google dorks generate no traffic to the target. They are fully passive. Run them before any active scanning.


Web Archives — The Historical Record
#

The Wayback Machine (web.archive.org) has been archiving the internet since 1996. For an operator, it’s a time machine to find what’s been deleted, what used to be exposed, and how the application has evolved.

Operational Value
#

  • Historical endpoint discovery — Old API endpoints, admin panels, and pages that still exist on the server but have been removed from navigation
  • Vulnerability research — Older versions of the application with known CVEs
  • OSINT — Past employee names, org structure, contact information
  • Credential exposure — Configuration files that were briefly exposed before being removed
  • Fully passive — Zero interaction with the live target

How to Use It
#

Navigate to web.archive.org, enter the target URL, and browse the capture timeline. Select specific dates to view historical snapshots. Pay particular attention to captures from 2-5 years ago — that’s when legacy endpoints were often still active and documented in the UI.

Limitations to Account For
#

  • Coverage is not complete — low-traffic sites may have sparse captures
  • Site owners can request content removal via robots.txt exclusions
  • Dynamic content (AJAX-loaded, session-dependent) often doesn’t archive cleanly
  • Capture frequency varies — some sites have daily snapshots, others monthly

Automation — Integrating the Full Recon Stack
#

Manual techniques build understanding. Automation builds coverage. Use both.

Recon Framework Comparison
#

FrameworkFocusStandout Feature
FinalReconAll-in-one web reconModular Python, covers every technique in this post
Recon-ngComprehensive frameworkDNS, subdomains, port scanning, data storage
theHarvesterEmail + subdomain discoveryMulti-search-engine OSINT aggregation
SpiderFootOSINT automationBroadest data source integration

FinalRecon — Installation and Usage
#

git clone https://github.com/thewhiteh4t/FinalRecon.git
cd FinalRecon
pip3 install -r requirements.txt
chmod +x ./finalrecon.py

# Targeted run — headers + whois
./finalrecon.py --headers --whois --url http://$TARGET_DOMAIN

# Full recon — all modules
./finalrecon.py --full --url http://$TARGET_DOMAIN

FinalRecon Module Reference
#

ModuleWhat It Does
--headersHTTP header analysis and fingerprinting
--whoisDomain registration data
--sslinfoSSL certificate details — extracts SANs for subdomain discovery
--crawlWebsite crawling and link extraction
--dnsFull DNS enumeration
--subSubdomain discovery
--dirDirectory brute-forcing
--waybackHistorical URL retrieval from Wayback Machine
--fullAll modules — complete recon run

Recon Playbook — Phase-Ordered Execution
#

Phase 1: Passive Reconnaissance (Zero Target Traffic)
#

# 1. WHOIS — ownership, registrar, creation date
whois $TARGET_DOMAIN

# 2. DNS records — full picture
dig $TARGET_DOMAIN ANY
dig $TARGET_DOMAIN MX
dig $TARGET_DOMAIN NS
dig $TARGET_DOMAIN TXT

# 3. Certificate transparency — passive subdomain discovery
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" \
  | jq -r '.[].name_value' | sort -u > recon/subdomains_ct.txt

# 4. Google dorking — exposed files, admin panels, credentials
# (run manually in browser — see dork combinations above)

# 5. Wayback Machine — historical endpoint discovery
# (manual review at web.archive.org)

Phase 2: Active Reconnaissance (Target Interaction Begins)
#

# 1. Zone transfer attempt — always try, rarely succeeds
dig axfr @$(dig NS $TARGET_DOMAIN +short | head -1) $TARGET_DOMAIN

# 2. Subdomain brute-force
dnsenum --enum $TARGET_DOMAIN \
  -f /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-20000.txt \
  -r > recon/subdomains_brute.txt

# 3. Fingerprinting
curl -I $TARGET_DOMAIN
wafw00f $TARGET_DOMAIN
nikto -h $TARGET_DOMAIN -Tuning b

# 4. Web crawling
/root/.local/bin/scrapy runspider /opt/ReconSpider.py \
  -a start_url=http://$TARGET_DOMAIN

Phase 3: Advanced Techniques
#

# 1. Virtual host discovery
gobuster vhost \
  -u http://$TARGET_IP \
  -w /usr/share/wordlists/seclists/Discovery/DNS/subdomains-top1million-5000.txt \
  --append-domain -t 50

# 2. Directory brute-forcing
ffuf -w /usr/share/wordlists/dirbuster/directory-list-2.3-medium.txt \
  -u http://$TARGET_DOMAIN/FUZZ \
  -mc 200,301,302,403 -t 50

# 3. Full automation run
./finalrecon.py --full --url http://$TARGET_DOMAIN

Phase 4: Correlation and Documentation
#

All recon data gets correlated into a unified target profile before any exploitation begins:

  • Merge subdomain lists from CT logs + brute-force + crawling
  • Map technology stack per subdomain
  • Cross-reference WHOIS data with nameserver findings
  • Flag highest-value targets: dev/staging environments, admin panels, outdated software versions
  • Document everything with timestamps

Quick Command Reference
#

DNS
#

dig $TARGET_DOMAIN ANY
dig @1.1.1.1 $TARGET_DOMAIN MX
dig +trace $TARGET_DOMAIN
dig +short $TARGET_DOMAIN
nslookup $TARGET_DOMAIN
host $TARGET_DOMAIN
dig axfr @$NAMESERVER $TARGET_DOMAIN

Subdomain Discovery
#

dnsenum --enum $TARGET_DOMAIN -f wordlist.txt -r
curl -s "https://crt.sh/?q=$TARGET_DOMAIN&output=json" | jq -r '.[].name_value' | sort -u
amass enum -passive -d $TARGET_DOMAIN

Fingerprinting
#

curl -I $TARGET_DOMAIN
wafw00f $TARGET_DOMAIN
nikto -h $TARGET_DOMAIN -Tuning b
whatweb $TARGET_DOMAIN

Virtual Hosts
#

gobuster vhost -u http://$TARGET_IP -w wordlist.txt --append-domain -t 50
ffuf -w wordlist.txt -u http://$TARGET_IP -H "Host: FUZZ.$TARGET_DOMAIN" -mc 200

Automation
#

finalrecon --full --url http://$TARGET_DOMAIN
/root/.local/bin/scrapy runspider /opt/ReconSpider.py -a start_url=http://$TARGET_DOMAIN

Key Principles
#

Start passive. Zero-traffic techniques first — CT logs, WHOIS, Google dorks, Wayback Machine. Build your target profile before sending a single packet.

Layer techniques. No single tool sees everything. CT logs miss vhosts. Brute-force misses expired subdomains. Crawling misses DNS-only assets. Stack them.

Validate findings. Cross-reference subdomains from CT logs against brute-force results. Confirm fingerprinting output against multiple tools. False positives in recon become wasted exploitation attempts downstream.

Document in real time. Save every output to a structured directory. You cannot reconstruct recon state from memory in a debrief or a court of law.

Respect the scope. Everything above applies strictly within authorized boundaries. Out-of-scope assets discovered during recon get documented and disclosed to the client — not touched.


Recon is where the engagement is won or lost. The operator who maps the full attack surface before touching an exploit lands shells. The operator who skips it fires blind.