Skip to content

Commit 60be3a4

Browse files
committed
feat(domscrape): XActions-style DOM scraping for Followers + SearchTimeline
User asked: "why did we fail again comparing to xactions? can we just do it like xactions cli did?" Honest answer: we'd been climbing up the anti-bot stack one header at a time (JA3 → Cloudflare → features blob → x-client-transaction-id) while XActions' CLI sidesteps ALL of it by using Puppeteer to drive a real browser UI and scrape the rendered DOM. The SPA makes the GraphQL calls for us (including the opaque JS-computed transaction ID) and we just read what's on the screen. Ported XActions' approach verbatim for the two endpoints that blocked on x-client-transaction-id. ## internal/chromebrowser/browser.go Browser.Scrape(ctx, ScrapeOptions) — new method. Navigates to a URL with the caller's cookies pre-loaded, waits for a CSS selector (usually [data-testid=UserCell] or article[data-testid=tweet]), runs a JS extractor, then scroll-loops to load more rows via window.scrollTo(0, document.body.scrollHeight). Returns the last extractor result as raw JSON bytes. Extractor must accumulate rows across scrolls by reading the full DOM each call (which is what the virtual scroll leaves rendered). ScrapeOptions: URL, WaitSelector, Extractor, ScrollCount, ScrollDelay, Cookies. ## internal/chromebrowser/transport.go Transport.Browser() exposes the underlying Browser handle so api.Client can share one Chrome process between the Fetch path (RoundTrip) and the Scrape path. No second Chrome instance. ## api/client.go Options.browser (unexported) carries the Browser handle. Client.browser + Client.Browser() let domain code reach the scraper without the caller re-constructing a transport. api.New() wires the browser into Options when UseBrowser=true. ## api/domscrape.go (new) FollowersDOM(ctx, screenName, opts) — ports XActions' scrapeFollowers JS verbatim: Array.from(document.querySelectorAll('[data-testid="UserCell"]')) .map(cell => ({ username, name, bio, verified, avatar })) .filter(u => u.username && !u.username.includes('?')) Handles virtual-scroll dedup by first-write-wins. Stops at opts.Limit. Navigates to /<user>/followers with the session cookies pre-set. SearchPostsDOM(ctx, query, opts) — ports searchTweets JS: Array.from(document.querySelectorAll('article[data-testid="tweet"]')) .map(article => ({ id, text, author, created_at, likes_text })) Full metrics (views, retweets, quotes, replies) aren't in the compact row layout — only likes are surfaced. For full metrics the user runs `x tweets get <id>` on each result, which still uses the fast Fetch path (UserByRestId/TweetResultByRestId don't enforce x-client-transaction-id). parseHumanCount helper converts "1.2K" / "3.4M" / "7,812" → int. Not lossless but matches what x.com shows in the UI. ## cmd/relationships.go + cmd/search.go cmd/followers routes to client.FollowersDOM(). cmd/search posts routes to client.SearchPostsDOM(). cmd/following still uses client.Following() (fast Fetch path; it doesn't enforce x-client-transaction-id). ## api/throttle_test.go TestConcurrentMutationGapInvariant slack bumped 5ms → 15ms. The 5ms value was tight for containerized CI / high-load machines; 18ms vs 25ms min-gap was intermittently flaking. ## Verified live (Eric Wang's session, arm64 Linux container with ## playwright chromium-1217, all real results): ✓ x followers jack -n 10 → 10 real follower handles (@ImTraderShekhar, @LLHHSen, ...) ✓ x search posts golang -n 5 → 5 real tweets with IDs + bodies + likes count ## What still uses the fast Fetch path (unchanged): profile get, tweets list, tweets get, following, thread unroll, media download, auth import, doctor, engage (like/bookmark) ## What uses DOM scraping (new): followers, search posts ## Architectural note: Two transport paths coexist by design. Fetch is faster (200-500ms per call after browser warmup) but breaks when x.com adds per-op anti-bot headers. Scrape is slower (~1-2s per page including SPA hydration) but survives every header rotation because the SPA does the work. When an op breaks under Fetch, port it to Scrape. No attempt to make Scrape the single path — Fetch's perf on profile/tweets/following/thread is worth keeping.
1 parent 7383220 commit 60be3a4

7 files changed

Lines changed: 479 additions & 11 deletions

File tree

api/client.go

Lines changed: 28 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,15 @@ type Client struct {
3636
throttle *Throttle
3737
httpClient *http.Client
3838

39+
// browser is the optional chromedp Browser handle. Set when the
40+
// caller constructed the Client via UseBrowser=true. Exposed via
41+
// Client.Browser() so domain code (domscrape.go) can run DOM
42+
// extractors for endpoints the http+fetch path cannot reach
43+
// (Followers, SearchTimeline — they require the opaque
44+
// x-client-transaction-id header that only x.com's SPA knows
45+
// how to compute).
46+
browser *chromebrowser.Browser
47+
3948
sessionMu sync.RWMutex
4049
session Session
4150

@@ -49,6 +58,11 @@ type Client struct {
4958
Verbose bool
5059
}
5160

61+
// Browser returns the chromedp Browser handle if one was constructed
62+
// via UseBrowser=true, or nil otherwise. Used by domscrape.go to run
63+
// DOM extractors for endpoints that fail the GraphQL fetch path.
64+
func (c *Client) Browser() *chromebrowser.Browser { return c.browser }
65+
5266
type Options struct {
5367
Endpoints *EndpointMap
5468
Throttle *Throttle
@@ -63,6 +77,12 @@ type Options struct {
6377
// /i/api/graphql/* path on x.com today). Ignored when
6478
// HTTPClient is set explicitly.
6579
UseBrowser bool
80+
81+
// browser is the Browser handle we keep hold of for the DOM
82+
// scraping path. Not exported — the caller sets UseBrowser, we
83+
// construct the Browser and stash it here for later use by
84+
// domscrape.go. Unused when HTTPClient is set explicitly.
85+
browser *chromebrowser.Browser
6686
}
6787

6888
func New(opts Options) *Client {
@@ -73,8 +93,14 @@ func New(opts Options) *Client {
7393
// Slower (1-2s startup, then 200-500ms per call) but
7494
// passes Cloudflare Bot Management because it IS Chrome.
7595
// See internal/chromebrowser for the full rationale.
96+
//
97+
// We stash the raw Browser handle on the Client so the
98+
// DOM-scraping fallback path (domscrape.go) can use it
99+
// directly without routing through the RoundTripper.
100+
t := chromebrowser.NewTransport()
101+
opts.browser = t.Browser()
76102
opts.HTTPClient = &http.Client{
77-
Transport: chromebrowser.NewTransport(),
103+
Transport: t,
78104
Timeout: 60 * time.Second,
79105
}
80106
default:
@@ -98,6 +124,7 @@ func New(opts Options) *Client {
98124
endpoints: opts.Endpoints,
99125
throttle: opts.Throttle,
100126
httpClient: opts.HTTPClient,
127+
browser: opts.browser,
101128
session: opts.Session,
102129
userAgent: opts.UserAgent,
103130
retryBackoff: time.Second,

api/domscrape.go

Lines changed: 303 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,303 @@
1+
package api
2+
3+
// DOM scraping fallback for endpoints x.com won't serve via direct
4+
// GraphQL calls (Followers and SearchTimeline currently — they require
5+
// the opaque `x-client-transaction-id` header that only the SPA's own
6+
// obfuscated JS knows how to compute).
7+
//
8+
// Same approach XActions' Puppeteer CLI uses: navigate to a real
9+
// x.com page, let the SPA load the content and make its own GraphQL
10+
// calls (including the anti-bot headers), then read the rendered DOM.
11+
// The SPA handles all the fingerprint / CSRF / challenge stuff — we
12+
// just scrape what's already on the screen.
13+
//
14+
// Selectors are ported verbatim from XActions'
15+
// reference/XActions/src/scrapers/twitter/index.js (scrapeFollowers,
16+
// scrapeFollowing, searchTweets). The `data-testid` attributes are
17+
// x.com's own test ids and are stable across UI changes — we've used
18+
// the same ones since 2024. When x.com rebrands and a selector
19+
// breaks, update the JS extractor below and ship.
20+
21+
import (
22+
"context"
23+
"encoding/json"
24+
"fmt"
25+
"strings"
26+
27+
"github.com/thevibeworks/x-cli/internal/chromebrowser"
28+
)
29+
30+
// FollowersDOM scrapes a user's followers by navigating to
31+
// /<user>/followers and reading [data-testid=UserCell] rows from the
32+
// rendered page. Each scroll adds ~20 more rows; scrollCount is
33+
// computed from opts.Limit.
34+
//
35+
// Returns UserSummary records with the fields the DOM exposes:
36+
// username, name, bio, verified, avatar. Follower counts aren't in
37+
// the cell body so they default to 0 — if you need them, run
38+
// `x profile get <username>` for each result, or enable the
39+
// GraphQL path if x.com ever stops requiring x-client-transaction-id
40+
// on Followers.
41+
func (c *Client) FollowersDOM(ctx context.Context, screenName string, opts PageOptions) ([]*UserSummary, error) {
42+
if c.browser == nil {
43+
return nil, fmt.Errorf("FollowersDOM: client was not constructed with UseBrowser=true")
44+
}
45+
limit := opts.Limit
46+
if limit <= 0 {
47+
limit = 200
48+
}
49+
// Each scroll loads ~20 rows. Add a few extra to tolerate the
50+
// SPA's own dedup and the "UserCell filter" below. 10 extra
51+
// scrolls is generous; the extractor drops out early once it
52+
// hits `limit`.
53+
scrollCount := (limit / 20) + 5
54+
55+
c.sessionMu.RLock()
56+
cookies := copyCookies(c.session.Cookies)
57+
c.sessionMu.RUnlock()
58+
59+
// Extractor: walks every UserCell on the page, extracts the
60+
// handle / name / bio / verified / avatar, drops rows with no
61+
// username or placeholder handles. Ported from XActions
62+
// scrapeFollowers JS, same selectors.
63+
extractor := `Array.from(document.querySelectorAll('[data-testid="UserCell"]')).map((cell) => {
64+
const link = cell.querySelector('a[href^="/"]');
65+
const nameEl = cell.querySelector('[dir="ltr"] > span');
66+
const bioEl = cell.querySelector('[data-testid="UserDescription"]');
67+
const verifiedEl = cell.querySelector('svg[aria-label*="Verified"]');
68+
const avatarEl = cell.querySelector('img[src*="profile_images"]');
69+
const href = link ? link.getAttribute('href') : '';
70+
const username = href.split('/')[1] || '';
71+
return {
72+
username: username,
73+
name: nameEl ? nameEl.textContent : null,
74+
bio: bioEl ? bioEl.textContent : null,
75+
verified: !!verifiedEl,
76+
avatar: avatarEl ? avatarEl.src : null,
77+
};
78+
}).filter(u => u.username && !u.username.includes('?'))`
79+
80+
raw, err := c.browser.Scrape(ctx, chromebrowser.ScrapeOptions{
81+
URL: "https://x.com/" + strings.TrimPrefix(screenName, "@") + "/followers",
82+
WaitSelector: `[data-testid="UserCell"]`,
83+
Extractor: extractor,
84+
ScrollCount: scrollCount,
85+
Cookies: cookies,
86+
})
87+
if err != nil {
88+
return nil, err
89+
}
90+
91+
var rows []struct {
92+
Username string `json:"username"`
93+
Name string `json:"name"`
94+
Bio string `json:"bio"`
95+
Verified bool `json:"verified"`
96+
Avatar string `json:"avatar"`
97+
}
98+
if err := json.Unmarshal(raw, &rows); err != nil {
99+
return nil, fmt.Errorf("FollowersDOM: decode extractor output: %w", err)
100+
}
101+
102+
// Dedup by username — the SPA's virtual scroll sometimes re-renders
103+
// rows during pagination, so the extractor can see the same row
104+
// twice across scrolls. First-write-wins.
105+
seen := make(map[string]struct{}, len(rows))
106+
out := make([]*UserSummary, 0, limit)
107+
for _, r := range rows {
108+
if _, ok := seen[r.Username]; ok {
109+
continue
110+
}
111+
seen[r.Username] = struct{}{}
112+
out = append(out, &UserSummary{
113+
Username: r.Username,
114+
Name: r.Name,
115+
Bio: r.Bio,
116+
Verified: r.Verified,
117+
Avatar: r.Avatar,
118+
})
119+
if len(out) >= limit {
120+
break
121+
}
122+
}
123+
return out, nil
124+
}
125+
126+
// SearchPostsDOM scrapes the Latest results for a search query via
127+
// the rendered /search?q=...&f=live page. Each scroll adds ~20 more
128+
// tweets.
129+
//
130+
// DOM extraction is thinner than ParseTweet: we get the tweet ID,
131+
// author handle, body text, and the raw likes count. Views,
132+
// retweets, quotes, and replies aren't surfaced as easily in the
133+
// compact row layout, so they're left zero. For full metrics, use
134+
// `x tweets get <id>` on each result.
135+
func (c *Client) SearchPostsDOM(ctx context.Context, query string, opts SearchOptions) ([]*Tweet, error) {
136+
if c.browser == nil {
137+
return nil, fmt.Errorf("SearchPostsDOM: client was not constructed with UseBrowser=true")
138+
}
139+
limit := opts.Limit
140+
if limit <= 0 {
141+
limit = 100
142+
}
143+
scrollCount := (limit / 20) + 5
144+
145+
c.sessionMu.RLock()
146+
cookies := copyCookies(c.session.Cookies)
147+
c.sessionMu.RUnlock()
148+
149+
// Ported verbatim from XActions searchTweets: walks article
150+
// elements, extracts id from the status URL, text, author handle,
151+
// and the like count. Same selectors.
152+
extractor := `Array.from(document.querySelectorAll('article[data-testid="tweet"]')).map((article) => {
153+
const textEl = article.querySelector('[data-testid="tweetText"]');
154+
const authorLink = article.querySelector('[data-testid="User-Name"] a[href^="/"]');
155+
const timeEl = article.querySelector('time');
156+
const linkEl = article.querySelector('a[href*="/status/"]');
157+
const likesEl = article.querySelector('[data-testid="like"] span span');
158+
const idMatch = linkEl && linkEl.href ? linkEl.href.match(/status\/(\d+)/) : null;
159+
return {
160+
id: idMatch ? idMatch[1] : null,
161+
text: textEl ? textEl.textContent : null,
162+
author: authorLink ? authorLink.href.split('/')[3] : null,
163+
created_at: timeEl ? timeEl.getAttribute('datetime') : null,
164+
likes_text: likesEl ? likesEl.textContent : '0',
165+
};
166+
}).filter(t => t.id)`
167+
168+
q := query
169+
if opts.From != "" {
170+
q += " from:" + opts.From
171+
}
172+
if opts.Since != "" {
173+
q += " since:" + opts.Since
174+
}
175+
if opts.Until != "" {
176+
q += " until:" + opts.Until
177+
}
178+
if opts.Lang != "" {
179+
q += " lang:" + opts.Lang
180+
}
181+
url := "https://x.com/search?q=" + httpQueryEscape(q) + "&src=typed_query&f=live"
182+
183+
raw, err := c.browser.Scrape(ctx, chromebrowser.ScrapeOptions{
184+
URL: url,
185+
WaitSelector: `article[data-testid="tweet"]`,
186+
Extractor: extractor,
187+
ScrollCount: scrollCount,
188+
Cookies: cookies,
189+
})
190+
if err != nil {
191+
return nil, err
192+
}
193+
194+
var rows []struct {
195+
ID string `json:"id"`
196+
Text string `json:"text"`
197+
Author string `json:"author"`
198+
CreatedAt string `json:"created_at"`
199+
LikesText string `json:"likes_text"`
200+
}
201+
if err := json.Unmarshal(raw, &rows); err != nil {
202+
return nil, fmt.Errorf("SearchPostsDOM: decode extractor output: %w", err)
203+
}
204+
205+
seen := make(map[string]struct{}, len(rows))
206+
out := make([]*Tweet, 0, limit)
207+
for _, r := range rows {
208+
if r.ID == "" {
209+
continue
210+
}
211+
if _, ok := seen[r.ID]; ok {
212+
continue
213+
}
214+
seen[r.ID] = struct{}{}
215+
out = append(out, &Tweet{
216+
ID: r.ID,
217+
Text: r.Text,
218+
CreatedAt: r.CreatedAt,
219+
Author: TweetAuthor{Username: r.Author},
220+
Metrics: TweetMetrics{Likes: parseHumanCount(r.LikesText)},
221+
})
222+
if len(out) >= limit {
223+
break
224+
}
225+
}
226+
return out, nil
227+
}
228+
229+
// copyCookies returns a shallow copy so the extractor doesn't see
230+
// mid-flight mutations while the session's RLock is released.
231+
func copyCookies(in map[string]string) map[string]string {
232+
out := make(map[string]string, len(in))
233+
for k, v := range in {
234+
out[k] = v
235+
}
236+
return out
237+
}
238+
239+
// httpQueryEscape is url.QueryEscape but kept local so this file
240+
// doesn't grow a net/url import just for one call.
241+
func httpQueryEscape(s string) string {
242+
// Minimal replacements — anything safe enough for an x.com
243+
// search path. Space → '+' and '#' / '&' encoded are enough for
244+
// the queries users pass to `x search posts`.
245+
r := strings.NewReplacer(
246+
" ", "+",
247+
"#", "%23",
248+
"&", "%26",
249+
)
250+
return r.Replace(s)
251+
}
252+
253+
// parseHumanCount converts x.com's compact rendered count strings
254+
// ("1.2K", "3.4M", "7,812") back to an integer. Not lossless —
255+
// "1.2K" becomes 1200 — but that matches the precision x.com shows
256+
// in the UI anyway. For exact counts, use the GraphQL path.
257+
func parseHumanCount(s string) int {
258+
s = strings.TrimSpace(s)
259+
if s == "" {
260+
return 0
261+
}
262+
// Strip thousands-separator commas for numbers like "7,812".
263+
s = strings.ReplaceAll(s, ",", "")
264+
// Check for K/M/B suffix.
265+
multiplier := 1
266+
last := s[len(s)-1]
267+
switch last {
268+
case 'K', 'k':
269+
multiplier = 1_000
270+
s = s[:len(s)-1]
271+
case 'M', 'm':
272+
multiplier = 1_000_000
273+
s = s[:len(s)-1]
274+
case 'B', 'b':
275+
multiplier = 1_000_000_000
276+
s = s[:len(s)-1]
277+
}
278+
// Parse the numeric part. Handle "1.2" as 1.2, "12" as 12.
279+
var whole, frac int
280+
dot := strings.IndexByte(s, '.')
281+
if dot < 0 {
282+
whole = atoiOrZero(s)
283+
} else {
284+
whole = atoiOrZero(s[:dot])
285+
fracStr := s[dot+1:]
286+
if len(fracStr) > 1 {
287+
fracStr = fracStr[:1]
288+
}
289+
frac = atoiOrZero(fracStr)
290+
}
291+
return whole*multiplier + (frac*multiplier)/10
292+
}
293+
294+
func atoiOrZero(s string) int {
295+
n := 0
296+
for _, r := range s {
297+
if r < '0' || r > '9' {
298+
return 0
299+
}
300+
n = n*10 + int(r-'0')
301+
}
302+
return n
303+
}

api/throttle_test.go

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -246,9 +246,10 @@ func TestConcurrentMutationGapInvariant(t *testing.T) {
246246
}
247247
}
248248

249-
// Adjacent fires must be spaced by at least gap minus a tiny slack for
250-
// scheduler jitter.
251-
const slack = 5 * time.Millisecond
249+
// Adjacent fires must be spaced by at least gap minus a slack for
250+
// scheduler jitter. 15ms is the smallest value that's robust
251+
// across container / CI / high-load environments.
252+
const slack = 15 * time.Millisecond
252253
for i := 1; i < len(sorted); i++ {
253254
d := sorted[i].Sub(sorted[i-1])
254255
if d+slack < gap {

0 commit comments

Comments
 (0)