11---
2-
2+ type : resource-note
3+ status : done
4+ created : 2026-02-28
5+ updated : 2026-03-12
6+ tags : [security-writeup, tryhackme, osint, google-dorking]
7+ source : TryHackMe - Google Hacking
38platform : tryhackme
49room : Google Hacking
510slug : google-hacking
6- path : notes/00-foundations/google-dorking.md
11+ path : TryHackMe/10-web/google-dorking.md
712topic : 10-web
8- domain : [osint, web-recon]
9- skills : [search-engines, crawling-indexing, seo-basics, robots-sitemaps, google-dorking]
10- artifacts : [concept-notes, pattern-cards, cookbook]
11- status : done
12- date : 2026-02-28
13+ domain : [osint, web]
14+ skills : [search-engines, crawling-indexing, web-enum, google-dorking]
15+ artifacts : [concept-notes, pattern-card, cookbook]
16+ sanitized : true
1317---
1418
15- 0 . Summary
19+ # Google Hacking
20+
21+ ## Summary
1622
1723* Search engines are *public, large-scale indexes* built by crawlers/spiders that fetch URLs, parse content, and store signals for retrieval.
1824* “Google dorking” is precision querying with operators (`site:`, `filetype:`, `intitle:`, …) to shrink the search space and surface exposed content.
1925* `robots.txt` governs *crawling behavior* (advisory) and is not access control; blocked URLs can still appear as “URL-only” results. Don’t treat robots.txt as a secrecy mechanism.
2026* `sitemap.xml` accelerates discovery by listing canonical URLs; each file is capped at 50,000 URLs or 50 MB uncompressed, and sitemap index files cover larger sites.
2127* Defensive takeaway: periodically “dork your own org” to find exposures before others do.
2228
23- 1 . Key Concepts (plain language)
29+ ## Key Concepts
2430
25- 1.1 Crawl → index → rank → serve (the pipeline)
31+ ### 1.1 Crawl → index → rank → serve (the pipeline)
2632
2733* Crawling: fetch pages and discover new URLs.
2834* Indexing: extract content + metadata and store it in an index.
@@ -47,7 +53,7 @@ Key vocabulary
4753* Index: database mapping terms/signals → documents.
4854* SERP: Search Engine Results Page.
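
To make the "index" entry above concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration only: real search engines store far richer signals, and the names `build_index`, `search`, and the `example.com` URLs are assumptions made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Placeholder documents; no real site is referenced.
docs = {
    "https://example.com/a": "backup schedule for staging",
    "https://example.com/b": "public press release",
}
index = build_index(docs)
matches = search(index, "backup staging")
```

The point of the sketch is that once a page is crawled and its terms are stored, finding it is a lookup rather than a scan.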
4955
50- 1.2 Google operators: what works and what drifts
56+ ### 1.2 Google operators: what works and what drifts
5157
5258Reality check:
5359
@@ -77,7 +83,7 @@ Important nuance for `filetype:`
7783
7884* It filters results by file extension, and only among formats Google actually indexes. If Google doesn’t index a format, `filetype:` won’t help.
7985
80- 1.3 robots.txt (Robots Exclusion Protocol; advisory)
86+ ### 1.3 robots.txt (Robots Exclusion Protocol; advisory)
8187
8288What it is
8389
@@ -104,7 +110,7 @@ Practical OSINT heuristic
104110
105111* Treat `Disallow:` entries as *high-signal leads* (admin panels, backups, staging, old paths). Verify carefully and ethically.
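
The heuristic above can be sketched as a small parser that pulls `Disallow:` paths out of robots.txt text you have already fetched. This is a simplified reading of the format (it ignores user-agent grouping), and `disallowed_paths` and the sample content are hypothetical.

```python
def disallowed_paths(robots_txt):
    """Return unique, non-empty Disallow paths in order of first appearance."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow":
            path = value.strip()
            if path and path not in paths:
                paths.append(path)
    return paths


# Made-up sample; an empty "Disallow:" means "nothing is disallowed".
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/  # old exports
Disallow:
Allow: /public/
"""

leads = disallowed_paths(SAMPLE)
```

Each returned path is only a lead to review, not proof that anything sensitive is reachable.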
106112
107- 1.4 Meta robots and X-Robots-Tag (index control)
113+ ### 1.4 Meta robots and X-Robots-Tag (index control)
108114
109115Crawling vs indexing
110116
@@ -116,7 +122,7 @@ Operational consequence
116122* If you block a page via robots.txt, Googlebot won’t crawl it and therefore won’t read ` noindex ` on the page.
117123* If you allow crawling but set ` noindex ` , Google can crawl and then drop it from results.
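
A minimal sketch of the crawl-versus-index interaction described above, using Python's standard `urllib.robotparser` for the crawl check. The `expected_outcome` helper and its wording are assumptions that summarize the two bullets; it is not Google's actual decision logic.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, agent="Googlebot"):
    """Check whether the given robots.txt text allows `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def expected_outcome(crawl_allowed, has_noindex):
    """Summarize what the note says happens for each combination."""
    if not crawl_allowed:
        return "not crawled; noindex is never read, URL-only listing possible"
    if has_noindex:
        return "crawled, then dropped from results"
    return "crawled and eligible for indexing"


# Placeholder rules and URLs for illustration.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
blocked = can_crawl(ROBOTS, "https://example.com/private/page")
allowed = can_crawl(ROBOTS, "https://example.com/public/page")
```

In this sketch, a page that must stay out of results needs crawling allowed plus `noindex`, or real access control at the source.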
118124
119- 1.5 sitemap.xml (Sitemaps Protocol)
125+ ### 1.5 sitemap.xml (Sitemaps Protocol)
120126
121127What it is
122128
@@ -131,15 +137,15 @@ Why it matters
131137
132138* Sitemaps reduce discovery cost for crawlers and help with crawl efficiency.
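
As a rough illustration of why sitemaps lower discovery cost, the sketch below lists every `<loc>` entry from sitemap XML that has already been downloaded. It assumes the standard sitemaps.org 0.9 namespace; `sitemap_locations` and the sample URLs are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps Protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(xml_text):
    """Return (kind, urls), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    return kind, locs


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/</loc></url>
</urlset>"""

kind, urls = sitemap_locations(SAMPLE)
```

When `kind` is `"index"`, each returned URL points at another sitemap file that would be fetched and parsed the same way.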
133139
134- 1.6 Ethical boundary (OSINT vs intrusion)
140+ ### 1.6 Ethical boundary (OSINT vs intrusion)
135141
136142* OSINT (including dorking) uses publicly reachable information.
137143* Crossing the line typically happens when you attempt access to restricted resources, mass-download sensitive data, or exploit what you find.
138144* In public notes: do not publish real targets or sensitive URLs; use placeholders.
139145
140- 2 . Pattern Cards (generalizable)
146+ ## Pattern Cards
141147
142- 2.1 Query design card (minimize → sharpen)
148+ ### 2.1 Query design card (minimize → sharpen)
143149
144150* Step 1: reduce scope
145151
@@ -154,7 +160,7 @@ Why it matters
154160
155161 * ` "confidential" ` , ` "password" ` , ` "backup" ` , ` "api key" `
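
The card can be sketched as a small query builder that starts from a scoped domain and then adds sharpening terms. `build_query` and its parameters are hypothetical helpers; the output only mirrors the operator syntax used elsewhere in this note, and `TARGET_DOMAIN` remains a placeholder.

```python
def build_query(domain, filetypes=(), phrases=(), inurl=None):
    """Compose a scoped search query string from optional sharpening parts."""
    parts = [f"site:{domain}"]  # step 1: reduce scope
    if inurl:
        parts.append(f"inurl:{inurl}")
    if len(filetypes) == 1:
        parts.append(f"filetype:{filetypes[0]}")
    elif filetypes:
        joined = " OR ".join(f"filetype:{ft}" for ft in filetypes)
        parts.append(f"({joined})")
    parts.extend(f'"{phrase}"' for phrase in phrases)  # exact-phrase keywords
    return " ".join(parts)


dump_query = build_query("TARGET_DOMAIN", filetypes=["sql", "bak"])
keyword_query = build_query("TARGET_DOMAIN", phrases=["confidential"])
```

Keeping the scope term first makes it harder to run an unscoped keyword search by accident.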
156162
157- 2.2 “robots + sitemap first” card
163+ ### 2.2 “robots + sitemap first” card
158164
159165* Check early:
160166
@@ -164,14 +170,14 @@ Why it matters
164170
165171 * ` site:TARGET_DOMAIN inurl:<path> `
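
One possible way to turn paths collected from robots.txt into the `inurl:` pattern above is sketched here. `inurl_queries` is a hypothetical helper, and trimming wildcard suffixes is a simplifying assumption rather than part of the original walkthrough.

```python
def inurl_queries(domain, paths):
    """Build one scoped inurl: query per usable path."""
    queries = []
    for path in paths:
        # Keep only the literal prefix before any robots.txt wildcard or anchor.
        cleaned = path.split("*", 1)[0].rstrip("$")
        if cleaned and cleaned != "/":
            queries.append(f"site:{domain} inurl:{cleaned}")
    return queries


queries = inurl_queries("TARGET_DOMAIN", ["/admin/", "/", "/tmp/*.bak"])
```

A bare `/` is skipped because it would match the whole site and add no signal.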
166172
167- 2.3 Sensitive filetype shortlist (defensive awareness)
173+ ### 2.3 Sensitive filetype shortlist (defensive awareness)
168174
169175* Config/secrets: ` env ` , ` ini ` , ` conf ` , ` yml ` , ` yaml ` , ` properties `
170176* Data dumps: ` sql ` , ` bak ` , ` db ` , ` sqlite ` , ` csv ` , ` json `
171177* Keys/certs: ` pem ` , ` key ` , ` pfx ` , ` p12 ` , ` crt `
172178* “internal docs”: ` pdf ` , ` docx ` , ` xlsx ` , ` pptx `
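
For triage during a self-audit, the shortlist above could be encoded as a lookup that labels a result URL by its extension. The category names follow the list; `classify_url` and the sample URLs are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlparse

SENSITIVE_TYPES = {
    "config/secrets": {"env", "ini", "conf", "yml", "yaml", "properties"},
    "data dumps": {"sql", "bak", "db", "sqlite", "csv", "json"},
    "keys/certs": {"pem", "key", "pfx", "p12", "crt"},
    "internal docs": {"pdf", "docx", "xlsx", "pptx"},
}


def classify_url(url):
    """Return the shortlist category for a URL's file extension, or None."""
    name = posixpath.basename(urlparse(url).path)  # query string is ignored
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    for category, extensions in SENSITIVE_TYPES.items():
        if ext in extensions:
            return category
    return None


label = classify_url("https://example.com/app/.env")
```

An extension match only prioritizes review; the file content decides whether the finding is a real exposure.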
173179
174- 2.4 Defensive remediation mapping
180+ ### 2.4 Defensive remediation mapping
175181
176182* If it’s publicly accessible, fix at source:
177183
@@ -180,9 +186,9 @@ Why it matters
180186 * add ` noindex ` /` X-Robots-Tag ` where appropriate
181187 * remove/rotate exposed credentials
182188
183- 3 . Command Cookbook (placeholders only)
189+ ## Command Cookbook
184190
185- 3.1 Operator templates
191+ ### 3.1 Operator templates
186192
187193``` text
188194# Domain scoping
@@ -210,23 +216,23 @@ site:TARGET_DOMAIN inurl:/admin/
210216site:TARGET_DOMAIN "incident report"
211217```
212218
213- 3.2 robots + sitemap retrieval
219+ ### 3.2 robots + sitemap retrieval
214220
215221``` bash
216222curl -s https://TARGET_DOMAIN/robots.txt | sed -n '1,200p'
217223curl -s https://TARGET_DOMAIN/sitemap.xml | sed -n '1,200p'
218224```
219225
220- 3.3 Defensive self-audit (run on your own assets)
226+ ### 3.3 Defensive self-audit (run on your own assets)
221227
222228``` text
223- site:YOUR_DOMAIN filetype:env
224- site:YOUR_DOMAIN (filetype:sql OR filetype:bak)
225- site:YOUR_DOMAIN intitle:"index of" "backup"
226- site:YOUR_DOMAIN "BEGIN PRIVATE KEY"
229+ site:TARGET_DOMAIN filetype:env
230+ site:TARGET_DOMAIN (filetype:sql OR filetype:bak)
231+ site:TARGET_DOMAIN intitle:"index of" "backup"
232+ site:TARGET_DOMAIN "BEGIN PRIVATE KEY"
227233```
228234
229- 4 . Evidence (sanitized; assets/)
235+ ## Evidence
230236
231237* This note was expanded from a walkthrough transcript provided by the user.
232238* If you later add screenshots, store under ` assets/ ` and redact:
@@ -235,14 +241,14 @@ site:YOUR_DOMAIN "BEGIN PRIVATE KEY"
235241 * user identifiers
236242 * unique query outputs that expose sensitive paths
237243
238- 5 . Takeaways
244+ ## Takeaways
239245
240246* Indexing turns “unknown paths” into “search queries.” Attackers can recon at scale with no scanning.
241247* The strongest dorks are not complicated; they are *well scoped*.
242248* robots.txt is not a lock; it is public metadata and often a recon hint.
243249* Defensive action item: schedule periodic “self-dorking” and treat findings like vuln reports.
244250
245- 6 . References (official/docs-first; list titles in public notes)
251+ ## References
246252
247253* Google Search: Advanced Search page
248254* Google Search Central: File types Google can index (mentions ` filetype: ` operator)
@@ -253,7 +259,7 @@ site:YOUR_DOMAIN "BEGIN PRIVATE KEY"
253259* Google Search Central: Build and submit a sitemap + sitemap index files
254260* Google Search Central: Control what you share on Search (noindex, robots meta, X-Robots-Tag)
255261
256- CN–EN Glossary (mini)
262+ ## CN–EN Glossary (mini)
257263
258264* Search engine: 搜索引擎
259265* Crawler / spider: 爬虫/蜘蛛