From "it works" to "fast and stable" (with reusable code)
Audience: junior developers new to the GitHub GraphQL API
Goal: run long batch jobs with fewer requests, lower query cost, and more stability
1. Do you really need "more fields"? The first principle of GraphQL optimization: fetch only what you use
GraphQL's biggest advantage is fetching data on demand, yet many people treat it as a universal field collector and cram in every field they can think of. The results:
- Query cost goes up, so you hit the rate limit / secondary rate limit sooner
- Responses get larger, so network transfer, parsing, and database writes all slow down
- Batch jobs become less stable (retrying after a mid-run failure costs more)
The fragment below centralizes the set of repository fields in one place, keeping the field list consistent and controllable:
```graphql
fragment GitHubRepoNode on Repository {
  databaseId
  owner {
    ... on User { databaseId login __typename }
    ... on Organization { databaseId login __typename }
  }
  nameWithOwner
  licenseInfo { key }
  isInOrganization
  isFork
  isArchived
  description
  primaryLanguage { name }
  diskUsage
  stargazerCount
  forkCount
  latestRelease { createdAt }
  pushedAt
  createdAt
  updatedAt
  languages(first: 20, orderBy: {field: SIZE, direction: DESC}) {
    edges { node { name } size }
  }
  repositoryTopics(first: 20) {
    nodes { topic { name } }
  }
  parent { databaseId }
}
```
How to use the fragment as an "optimization switch"
For beginners, the most practical move is to split the fragment into two (or even three) tiers:
- Light tier (list views / batch scans): only core fields such as `databaseId`, `nameWithOwner`, `pushedAt`, `stargazerCount`
- Full tier (single-repo detail pages / database backfill): add topics, languages, release, parent, and so on
That way a batch scan never silently maxes out the query cost.
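As a small sketch of the tiering idea, you can keep the field lists in Python and render the fragment on demand. The names `LIGHT_FIELDS` and `build_fragment` are illustrative, not part of the scripts below:

```python
# Two field tiers for the same Repository fragment; the batch scanner uses
# "light", the detail/backfill path uses "full".
LIGHT_FIELDS = ["databaseId", "nameWithOwner", "pushedAt", "stargazerCount"]
FULL_FIELDS = LIGHT_FIELDS + [
    "description",
    "forkCount",
    "repositoryTopics(first: 20) { nodes { topic { name } } }",
    "parent { databaseId }",
]

def build_fragment(tier: str) -> str:
    """Render a Repository fragment for the requested tier."""
    fields = LIGHT_FIELDS if tier == "light" else FULL_FIELDS
    body = "\n".join(f"  {f}" for f in fields)
    return f"fragment RepoNode on Repository {{\n{body}\n}}"
```

Because both tiers share one source of truth, the "field set" stays consistent no matter which query embeds the fragment.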
2. Pagination is not `while True`: the right pattern is Connection + Cursor
Almost all list-style data in the GitHub GraphQL API follows the Connection model:
- `first: N` controls the page size
- `pageInfo { hasNextPage endCursor }` returns the cursor
- the next page passes `after: $cursor`
The standard pattern (with `$cursor` as a variable, never string concatenation):
```graphql
query SearchReposByTimeRange($q: String!, $cursor: String) {
  search(type: REPOSITORY, query: $q, first: 100, after: $cursor) {
    nodeTotal: repositoryCount
    nodes { ...GitHubRepoNode }
    pageInfo { hasNextPage endCursor }
  }
  rateLimit { limit cost remaining resetAt }
}
```
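The cursor loop itself is generic and worth isolating. A minimal sketch, independent of any HTTP client: `fetch_page` is a hypothetical callable that performs one query and returns the `search` payload shape shown above:

```python
from typing import Any, Callable, Dict, Iterator, Optional

def paginate(fetch_page: Callable[[Optional[str]], Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Drive a Connection: call fetch_page(cursor) until hasNextPage is False.

    fetch_page(cursor) must return:
    {"nodes": [...], "pageInfo": {"hasNextPage": bool, "endCursor": str|None}}
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("nodes") or []
        info = page.get("pageInfo") or {}
        if not info.get("hasNextPage"):
            break
        cursor = info.get("endCursor")
        if not cursor:  # defensive: a missing cursor would loop forever
            break
```

Note the two exit conditions: `hasNextPage` is the normal one, and the missing-cursor check guards against a malformed page.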
Optimization 1: don't ignore the rateLimit field
The tail of the query above,
```graphql
rateLimit { limit cost remaining resetAt }
```
is a self-check dashboard well worth keeping. At runtime it tells you:
- cost: how much quota this query consumed (GraphQL points, not "number of requests")
- remaining / resetAt: how much headroom is left and when it recovers
Many batch jobs are flaky precisely because nobody surfaces this data for monitoring.
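A minimal sketch of turning that dashboard into a pacing decision. `seconds_until_reset` and the `min_remaining` safety margin are illustrative choices, not an official threshold:

```python
from datetime import datetime, timezone

def seconds_until_reset(rate_limit: dict, min_remaining: int = 50) -> float:
    """Decide how long to pause based on a `rateLimit { ... }` payload.

    Returns 0.0 when there is still headroom; otherwise the seconds until
    resetAt (an ISO-8601 timestamp such as "2024-01-01T00:00:00Z").
    """
    remaining = int(rate_limit.get("remaining") or 0)
    if remaining > min_remaining:
        return 0.0
    reset_at = rate_limit.get("resetAt")
    if not reset_at:
        return 60.0  # no reset info: back off for a minute
    # fromisoformat on older Pythons cannot parse a trailing "Z"
    reset = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
    return max(1.0, (reset - datetime.now(timezone.utc)).total_seconds())
```

Calling this after every page and sleeping when it returns non-zero is usually enough to keep a long scan from slamming into the quota.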
3. Search's 1000-result cap: don't fight it, slice by time window
GitHub Search (including GraphQL `search`) has a classic limitation: for queries with very many matches, you can only reliably page through the first 1000 results.
The script checks for this explicitly:
```python
total = int(search.get("nodeTotal") or 0)
if total > 1000:
    more_than_1k = True
    if not fetch_nodes_when_over_1k:
        return True, []
```
This logic is key: when one time window matches too many results, don't brute-force the pagination; return `more_than_1k=True` and let the caller shrink the window.
3.1 A reusable strategy: adaptive time windows (halve when too big, grow back when small)
A very practical "adaptive sliding window" works like this:
- Start with a fairly large window (say, 1 hour)
- If a window yields > 1000 results: halve it (down to a minimum window)
- If a window comes back small: gradually grow it again to raise throughput
The core logic:
```python
if more_than_1k and step_s > min_s:
    step_s = max(min_s, step_s / 2.0)
    continue
...
current = window_end
if not more_than_1k and step_s < base_s:
    step_s = min(base_s, step_s * 2.0)
```
What this strategy buys you:
- No need to estimate in advance how many repositories a query will match per day
- It balances throughput against the 1000 cap automatically
- It stays robust during spikes (say, a popular language suddenly taking off)
You only need to tune two parameters for your workload:
- base_step: the window size you want under normal conditions (the throughput ceiling)
- min_step: the finest granularity allowed in the worst case (guarantees the scan finishes)
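Put together, the whole windowing strategy can be sketched as a pure generator. Here `count_fn` stands in for one Search call that returns `repositoryCount`; the function name and signature are illustrative:

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator, Tuple

def adaptive_windows(
    start: datetime,
    end: datetime,
    base_step: timedelta,
    min_step: timedelta,
    count_fn: Callable[[datetime, datetime], int],
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (window_start, window_end) pairs that each fit under 1000 results."""
    step = base_step
    current = start
    while current < end:
        window_end = min(end, current + step)
        over = count_fn(current, window_end) > 1000
        if over and step > min_step:
            step = max(min_step, step / 2)   # too many results: shrink
            continue
        yield current, window_end            # at min_step we accept the loss
        current = window_end
        if not over and step < base_step:
            step = min(base_step, step * 2)  # room to spare: grow back
```

Because it is pure, you can unit-test the shrink/grow behavior with a synthetic density function before ever touching the API.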
4. Rate limiting is not an error: treat it as a scheduling signal
Many people panic at a 403, or just sleep for a fixed number of seconds. A sturdier approach:
- Distinguish "actually rate-limited" from "403 due to missing permissions / forbidden resource"
- Read X-RateLimit-Remaining / X-RateLimit-Reset from the response headers
- With multiple tokens, rotate; with a single token, wait for the reset
The script closes this loop end to end:
- After every request, update the token state: `_update_state_from_headers`
- On a 403 that looks like rate limiting: `_mark_rate_limited`, then `continue` to switch tokens or wait
- A GraphQL `errors` payload can also signal a rate limit; detect it the same way and retry
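A sketch of the triage step. The message substrings checked here ("rate limit", "abuse") match GitHub's usual wording but are heuristics, not a guaranteed contract:

```python
from typing import Mapping

def classify_403(body_message: str, headers: Mapping[str, str]) -> str:
    """Rough 403 triage: 'rate_limited' vs 'forbidden'.

    Uses the two signals discussed above: the error message text and
    the X-RateLimit-Remaining response header.
    """
    msg = (body_message or "").lower()
    if "rate limit" in msg or "abuse" in msg:
        return "rate_limited"
    if headers.get("X-RateLimit-Remaining") == "0":
        return "rate_limited"
    return "forbidden"
```

Only the "rate_limited" branch should trigger a token switch or a wait; a genuine permission 403 will never succeed on retry and should surface as an error.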
4.1 Why rotate tokens?
When you run batch jobs (say, scanning every repository pushed in the last 6 hours), a single token can easily burn through its quota at some point. Rotating multiple tokens has two benefits:
- Less sustained pressure on any single token (lower secondary-rate-limit risk)
- Smoother throughput (no "fast bursts followed by a full stop")
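A minimal round-robin pool illustrating the idea; the `TokenPool` class is a hypothetical sketch, and the full scheduler below additionally tracks reset times:

```python
import itertools
from typing import List, Optional

class TokenPool:
    """Round-robin over tokens, skipping any whose remaining quota is 0."""

    def __init__(self, tokens: List[str]) -> None:
        # None means "unknown remaining": assume usable until told otherwise
        self.remaining = {t: None for t in tokens}
        self._order = itertools.cycle(tokens)
        self._size = len(tokens)

    def pick(self) -> Optional[str]:
        """Return the next usable token, or None if all are exhausted."""
        for _ in range(self._size):
            t = next(self._order)
            if self.remaining[t] != 0:
                return t
        return None
```

When `pick()` returns None, the caller falls back to the single-token behavior: sleep until the soonest reset.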
5. Variables are not syntactic sugar: they make queries cacheable, reusable, and maintainable
A common beginner habit is interpolating parameters into the query string, which brings:
- Code that is harder to read and reuse
- More room for escaping/injection problems
- Harder unified logging, retries, and caching (even just your own in-memory cache)
Instead, make owner/name proper variables:
```graphql
query Repo($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    databaseId
    description
    updatedAt
    pushedAt
    owner { login __typename }
  }
}
```
The optimization value of this style: the query becomes a stable template, and only the variables change.
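Concretely, the template plus variables is just the JSON body you POST to the GraphQL endpoint. A small sketch (`build_payload` is an illustrative helper):

```python
from typing import Any, Dict

REPO_QUERY = """
query Repo($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) { databaseId nameWithOwner }
}
"""

def build_payload(query: str, variables: Dict[str, Any]) -> Dict[str, Any]:
    """The JSON body POSTed to https://api.github.com/graphql."""
    return {"query": query, "variables": variables}

# With requests it would then be sent roughly like this (not executed here):
#   requests.post("https://api.github.com/graphql",
#                 json=build_payload(REPO_QUERY, {"owner": "torvalds", "name": "linux"}),
#                 headers={"Authorization": f"Bearer {token}"})
```

Since the query string never changes, it is trivial to log, hash, or cache per template, and there is nothing to escape.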
6. GraphQL is not a silver bullet: comparing it with a REST "grab everything" script clarifies the optimization direction
A very typical "fetch everything about one repository in one go" implementation does:
- The repository basics (one GraphQL call)
- topics, branches, tags, languages, stargazers, watchers, forks, contributors, readme, owner info (many REST calls)
- Plus request accounting (`RequestStats`) and a rate-limit read at start and end (REST `GET /rate_limit`)
Nothing is wrong with that script, but it makes two facts easy to see:
- REST naturally fans out into many requests: the "full picture" of one repo spans many endpoints.
- GraphQL cuts round-trips but raises per-query complexity and cost: the more fields you pack into one query, the higher the cost and the bigger the response.
So the usual optimization strategy is not "move everything to GraphQL", but:
- Batch scans (many repositories): GraphQL + trimmed fields + controlled pagination
- Single-repo backfill (few repositories): the full GraphQL tier, possibly with a few REST calls (e.g. endpoints that need preview/special Accept headers)
7. A reusable "batch scan" template (based on the scripts below)
If your goal is "find every repository matching some conditions within a time range, and finish reliably", you can reuse this structure directly.
7.1 Typical usage
```python
from datetime import datetime, timedelta, timezone

from backend.tools.GraphQL_search import (
    GitHubGraphQLClient,
    parse_tokens_from_env,
    scan_repos_incremental,
)

tokens = parse_tokens_from_env("GITHUB_TOKENS")
client = GitHubGraphQLClient(tokens=tokens)
end_dt = datetime.now(timezone.utc)
start_dt = end_dt - timedelta(hours=6)

for repo in scan_repos_incremental(
    client=client,
    time_range_field="pushed",
    start=start_dt,
    end=end_dt,
    filter_query="stars:>500 language:TypeScript",
    base_step=timedelta(hours=1),
    min_step=timedelta(minutes=2),
):
    print(repo.get("nameWithOwner"), repo.get("stargazerCount"))
```
7.2 The three parameters to tune first
- filter_query: push business filters into the search query itself, to cut the number of useless nodes
- base_step: the throughput ceiling (bigger is faster, but more likely to blow past 1000)
- min_step: the worst-case granularity (smaller is safer, but means more windows and more requests)
8. Common pitfalls (beginner edition)
- Writing pagination as page=1,2,3: GraphQL uses cursors, not page numbers
- Ignoring rateLimit.cost: few requests does not mean low consumption
- Too-broad search terms plus "I want every result": past 1000 you must split the query (time windows / extra filters)
- Oversized field sets in batch jobs: start with the light tier, backfill on demand
- Treating every 403 as one kind of error: it may be rate limiting or a permission problem; handle them differently
- No multi-token scheduling strategy: long jobs easily stall halfway
9. Closing: a suggested order of optimizations
To turn a query that merely runs into one that is production ready, work in this order:
- Trim fields first (light-tier fragment)
- Then get pagination right (cursor + pageInfo)
- Then watch cost (rateLimit.cost / remaining)
- Finally harden stability (multiple tokens + reset scheduling + adaptive time windows)
You can find an implementation of, or a contrast for, all four points in the two scripts below:
python
from __future__ import annotations
import argparse
import base64
import json
import os
import re
import sys
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Tuple
import requests
from dotenv import load_dotenv
#author:wanchen
#2026/04/22 15:55
API_BASE = "https://api.github.com/"
GQL_ENDPOINT = "https://api.github.com/graphql"
TOKEN_SPLIT_RE = re.compile(r"[,\s;]+")
def _load_env() -> None:
root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
env_path = os.path.join(root, ".env")
if os.path.exists(env_path):
load_dotenv(env_path)
else:
load_dotenv()
def _strip_quotes(v: Optional[str]) -> Optional[str]:
if v is None:
return None
s = v.strip()
if len(s) >= 2 and ((s[0] == '"' and s[-1] == '"') or (s[0] == "'" and s[-1] == "'")):
return s[1:-1]
return s
def _parse_tokens(raw: str) -> List[str]:
raw2 = _strip_quotes(raw or "") or ""
parts = TOKEN_SPLIT_RE.split(raw2) if raw2 else []
out = []
for p in parts:
p = (p or "").strip()
if p:
out.append(p)
return out
def _proxies_from_env() -> Dict[str, str]:
http_proxy = _strip_quotes(os.getenv("PROXY_HTTP") or "") or ""
https_proxy = _strip_quotes(os.getenv("PROXY_HTTPS") or "") or ""
proxies: Dict[str, str] = {}
if http_proxy:
proxies["http"] = http_proxy
if https_proxy:
proxies["https"] = https_proxy
return proxies
def _api_proxy_prefixes() -> List[str]:
raw = _strip_quotes(os.getenv("API_PROXY_PREFIXES") or "") or ""
parts = TOKEN_SPLIT_RE.split(raw) if raw else []
out = []
for p in parts:
p = (p or "").strip()
if p:
out.append(p)
single = (_strip_quotes(os.getenv("API_PROXY_PREFIX") or "") or "").strip()
if single and single not in out:
out.append(single)
return out
def _api_candidates(url: str) -> List[str]:
out: List[str] = []
if url.startswith(API_BASE):
out.append(url)
for pfx in _api_proxy_prefixes():
pfx = (pfx or "").strip()
if not pfx:
continue
if not pfx.endswith("/"):
pfx = pfx + "/"
if url.startswith(API_BASE):
u2 = pfx + url[len(API_BASE) :]
if u2 not in out:
out.append(u2)
if not out:
out.append(url)
return out
def _total_count_from_link(resp: requests.Response) -> Optional[int]:
link = resp.headers.get("Link") or resp.headers.get("link") or ""
if not link:
return None
parts = [p.strip() for p in link.split(",") if p.strip()]
for p in parts:
if 'rel="last"' not in p:
continue
m = re.search(r"[?&]page=(\d+)", p)
if not m:
continue
try:
return int(m.group(1))
except Exception:
return None
return None
@dataclass
class RequestStats:
total: int = 0
by_label: Dict[str, int] = field(default_factory=dict)
by_status: Dict[int, int] = field(default_factory=dict)
def inc(self, label: str, status_code: int) -> None:
self.total += 1
self.by_label[label] = self.by_label.get(label, 0) + 1
self.by_status[status_code] = self.by_status.get(status_code, 0) + 1
class GitHubClient:
def __init__(self, tokens: List[str], timeout_sec: int = 30):
self.tokens = tokens[:] if tokens else []
self.timeout_sec = int(timeout_sec or 30)
self.stats = RequestStats()
self.proxies = _proxies_from_env()
self._sessions: Dict[str, requests.Session] = {}
self._token_index = 0
def _current_token(self) -> Optional[str]:
if not self.tokens:
return None
self._token_index = self._token_index % len(self.tokens)
return self.tokens[self._token_index]
def _rotate_token(self) -> Optional[str]:
if not self.tokens:
return None
self._token_index = (self._token_index + 1) % len(self.tokens)
return self._current_token()
def _session(self, token: Optional[str]) -> requests.Session:
key = token or "anon"
s = self._sessions.get(key)
if s is not None:
return s
s = requests.Session()
s.trust_env = False
s.headers.update({"Accept": "application/vnd.github+json", "User-Agent": "github-analysis-demo"})
if token:
s.headers.update({"Authorization": f"Bearer {token}"})
self._sessions[key] = s
return s
def request_json(
self,
label: str,
method: str,
url: str,
*,
accept: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
json_body: Optional[Dict[str, Any]] = None,
allow_statuses: Iterable[int] = (200,),
try_candidates: bool = True,
) -> Tuple[Optional[Any], Optional[requests.Response]]:
allow = set(int(x) for x in allow_statuses)
token_attempts = max(len(self.tokens), 1)
last_resp: Optional[requests.Response] = None
last_exc: Optional[Exception] = None
for _ in range(token_attempts):
token = self._current_token()
s = self._session(token)
req_headers = dict(headers) if headers else {}
if accept:
req_headers["Accept"] = accept
candidates = _api_candidates(url) if try_candidates else [url]
for u in candidates:
try:
resp = s.request(
method.upper(),
u,
headers=req_headers,
json=json_body,
timeout=self.timeout_sec,
proxies=self.proxies,
)
self.stats.inc(label, int(resp.status_code or 0))
last_resp = resp
if resp.status_code in allow:
try:
return resp.json(), resp
except Exception:
return None, resp
if resp.status_code in (401, 403):
try:
payload = resp.json()
except Exception:
payload = {}
msg = str((payload or {}).get("message", "")).lower()
if "rate limit" in msg or "bad credentials" in msg or "token" in msg:
break
except requests.RequestException as exc:
last_exc = exc
continue
if self.tokens:
self._rotate_token()
if last_resp is not None:
try:
return last_resp.json(), last_resp
except Exception:
return None, last_resp
if last_exc is not None:
raise last_exc
return None, None
def rate_limit(self) -> Optional[Dict[str, Any]]:
data, _ = self.request_json("rate_limit", "GET", f"{API_BASE}rate_limit", allow_statuses=(200, 403))
return data if isinstance(data, dict) else None
def graphql(self, query: str, variables: Dict[str, Any]) -> Optional[Dict[str, Any]]:
body = {"query": query, "variables": variables}
data, _ = self.request_json(
"graphql",
"POST",
GQL_ENDPOINT,
json_body=body,
allow_statuses=(200,),
try_candidates=False,
)
return data if isinstance(data, dict) else None
def fetch_repo_bundle(client: GitHubClient, full_name: str) -> Dict[str, Any]:
full_name = (full_name or "").strip()
if not full_name or "/" not in full_name:
raise ValueError("full_name must be like 'owner/repo'")
owner, repo = full_name.split("/", 1)
gql = client.graphql(
query="""
query Repo($owner: String!, $name: String!) {
repository(owner: $owner, name: $name) {
databaseId
description
updatedAt
pushedAt
owner { login __typename }
}
}
""",
variables={"owner": owner, "name": repo},
)
repo_node = (((gql or {}).get("data") or {}).get("repository") or {}) if isinstance(gql, dict) else {}
repo_id = repo_node.get("databaseId")
description = repo_node.get("description")
updated_at = repo_node.get("updatedAt")
pushed_at = repo_node.get("pushedAt")
owner_node = repo_node.get("owner") if isinstance(repo_node.get("owner"), dict) else {}
owner_login = (owner_node or {}).get("login")
owner_type = (owner_node or {}).get("__typename")
topics, _ = client.request_json(
"topics",
"GET",
f"{API_BASE}repos/{full_name}/topics",
accept="application/vnd.github.mercy-preview+json",
allow_statuses=(200,),
)
topics_list = (topics or {}).get("names") if isinstance(topics, dict) else []
if not isinstance(topics_list, list):
topics_list = []
branches_page, _ = client.request_json(
"branches_page1",
"GET",
f"{API_BASE}repos/{full_name}/branches?per_page=100&page=1",
allow_statuses=(200, 409),
)
branches_rows: List[Dict[str, Any]] = []
if isinstance(branches_page, list):
for b in branches_page:
if not isinstance(b, dict):
continue
name = b.get("name")
sha = (b.get("commit") or {}).get("sha") if isinstance(b.get("commit"), dict) else None
if name:
branches_rows.append({"branch_name": name, "sha": sha})
branches_count_resp_json, branches_count_resp = client.request_json(
"branches_count",
"GET",
f"{API_BASE}repos/{full_name}/branches?per_page=1&page=1",
allow_statuses=(200, 409),
)
if branches_count_resp is None:
branches_count = None
elif branches_count_resp.status_code == 409:
branches_count = 0
else:
branches_count = _total_count_from_link(branches_count_resp)
if branches_count is None:
branches_count = len(branches_count_resp_json or []) if isinstance(branches_count_resp_json, list) else 0
tags_page, _ = client.request_json(
"tags_page1",
"GET",
f"{API_BASE}repos/{full_name}/tags?per_page=100&page=1",
allow_statuses=(200, 409),
)
tags_rows: List[Dict[str, Any]] = []
if isinstance(tags_page, list):
for t in tags_page:
if not isinstance(t, dict):
continue
name = t.get("name")
sha = (t.get("commit") or {}).get("sha") if isinstance(t.get("commit"), dict) else None
if name:
tags_rows.append({"tag_name": name, "sha": sha})
tags_count_resp_json, tags_count_resp = client.request_json(
"tags_count",
"GET",
f"{API_BASE}repos/{full_name}/tags?per_page=1&page=1",
allow_statuses=(200, 409),
)
if tags_count_resp is None:
tags_count = None
elif tags_count_resp.status_code == 409:
tags_count = 0
else:
tags_count = _total_count_from_link(tags_count_resp)
if tags_count is None:
tags_count = len(tags_count_resp_json or []) if isinstance(tags_count_resp_json, list) else 0
languages, _ = client.request_json(
"languages",
"GET",
f"{API_BASE}repos/{full_name}/languages",
allow_statuses=(200,),
)
languages_rows: List[Dict[str, Any]] = []
if isinstance(languages, dict):
for lang, size in languages.items():
if lang:
languages_rows.append({"language": lang, "bytes": size})
stars_page, _ = client.request_json(
"stars_page1",
"GET",
f"{API_BASE}repos/{full_name}/stargazers?per_page=100&page=1",
accept="application/vnd.github.v3.star+json",
allow_statuses=(200,),
)
stars_rows: List[Dict[str, Any]] = []
if isinstance(stars_page, list):
for s in stars_page:
if not isinstance(s, dict):
continue
user = s.get("user") if isinstance(s.get("user"), dict) else {}
uid = (user or {}).get("id")
login = (user or {}).get("login")
starred_at = s.get("starred_at")
if uid is not None:
stars_rows.append({"user_id": uid, "login": login, "starred_at": starred_at})
watchers_page, _ = client.request_json(
"watchers_page1",
"GET",
f"{API_BASE}repos/{full_name}/subscribers?per_page=100&page=1",
allow_statuses=(200,),
)
watchers_rows: List[Dict[str, Any]] = []
if isinstance(watchers_page, list):
for u in watchers_page:
if not isinstance(u, dict):
continue
uid = u.get("id")
login = u.get("login")
if uid is not None:
watchers_rows.append({"user_id": uid, "login": login})
forks_page, _ = client.request_json(
"forks_page1",
"GET",
f"{API_BASE}repos/{full_name}/forks?per_page=100&page=1",
allow_statuses=(200,),
)
forks_rows: List[Dict[str, Any]] = []
if isinstance(forks_page, list):
for f in forks_page:
if not isinstance(f, dict):
continue
fork_id = f.get("id")
fork_full_name = f.get("full_name")
created_at = f.get("created_at")
fork_owner = f.get("owner") if isinstance(f.get("owner"), dict) else {}
fork_owner_id = (fork_owner or {}).get("id")
fork_owner_login = (fork_owner or {}).get("login")
if fork_id is not None:
forks_rows.append(
{
"fork_repo_id": fork_id,
"fork_full_name": fork_full_name,
"created_at": created_at,
"owner_id": fork_owner_id,
"owner_login": fork_owner_login,
}
)
contributors_page, _ = client.request_json(
"contributors_page1",
"GET",
f"{API_BASE}repos/{full_name}/contributors?per_page=100&anon=1&page=1",
allow_statuses=(200, 204, 403),
)
contributors_rows: List[Dict[str, Any]] = []
contributors_too_large = False
if isinstance(contributors_page, dict):
msg = str(contributors_page.get("message", "")).lower()
if "too large to list contributors" in msg:
contributors_too_large = True
if isinstance(contributors_page, list):
for u in contributors_page:
if not isinstance(u, dict):
continue
uid = u.get("id")
login = u.get("login")
contributions = u.get("contributions")
if uid is not None:
contributors_rows.append({"user_id": uid, "login": login, "contributions": contributions})
contributors_count_resp_json, contributors_count_resp = client.request_json(
"contributors_count",
"GET",
f"{API_BASE}repos/{full_name}/contributors?per_page=1&anon=1&page=1",
allow_statuses=(200, 204, 403),
)
if contributors_count_resp is None:
contributors_count = None
elif contributors_count_resp.status_code == 204:
contributors_count = 0
elif contributors_count_resp.status_code == 403 and isinstance(contributors_count_resp_json, dict):
msg = str(contributors_count_resp_json.get("message", "")).lower()
if "too large to list contributors" in msg:
contributors_count = None
else:
contributors_count = None
else:
contributors_count = _total_count_from_link(contributors_count_resp)
if contributors_count is None:
contributors_count = len(contributors_count_resp_json or []) if isinstance(contributors_count_resp_json, list) else 0
readme_json, _ = client.request_json(
"readme",
"GET",
f"{API_BASE}repos/{full_name}/readme",
allow_statuses=(200, 404),
)
readme_text: Optional[str] = None
if isinstance(readme_json, dict):
content = readme_json.get("content")
if isinstance(content, str) and content:
try:
readme_text = base64.b64decode(content).decode("utf-8", errors="ignore")
except Exception:
readme_text = None
users_rows: List[Dict[str, Any]] = []
orgs_rows: List[Dict[str, Any]] = []
if isinstance(owner_login, str) and owner_login.strip():
if owner_type == "Organization":
org_json, _ = client.request_json(
"owner_org",
"GET",
f"{API_BASE}orgs/{owner_login}",
allow_statuses=(200, 404),
)
if isinstance(org_json, dict) and org_json:
orgs_rows.append(org_json)
else:
user_json, _ = client.request_json(
"owner_user",
"GET",
f"{API_BASE}users/{owner_login}",
allow_statuses=(200, 404),
)
if isinstance(user_json, dict) and user_json:
users_rows.append(user_json)
counts = {
"topics": len(topics_list),
"branches": branches_count,
"tags": tags_count,
"contributors": contributors_count,
"contributors_too_large": contributors_too_large,
}
return {
"base": {
"full_name": full_name,
"repo_id": repo_id,
"text_row": description,
"updates": {"updatedAt": updated_at, "pushedAt": pushed_at},
},
"counts": counts,
"topics_rows": [{"topic": t} for t in topics_list],
"branches_rows": branches_rows,
"tags_rows": tags_rows,
"languages_rows": languages_rows,
"stars_rows": stars_rows,
"watchers_rows": watchers_rows,
"forks_rows": forks_rows,
"contributors_rows": contributors_rows,
"users_rows": users_rows,
"orgs_rows": orgs_rows,
"readme": readme_text,
}
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("--repo", default="StarRocks/starrocks", help="owner/repo")
parser.add_argument("--timeout", type=int, default=30, help="HTTP timeout seconds")
parser.add_argument("--out", default="", help="write result json to path (optional)")
parser.add_argument("--pretty", action="store_true", help="pretty print json")
return parser.parse_args()
def main() -> None:
_load_env()
tokens = _parse_tokens(os.getenv("GITHUB_TOKENS", "") or "")
single = _strip_quotes(os.getenv("GITHUB_TOKEN") or "") or ""
if single and single not in tokens:
tokens.append(single)
if not tokens:
raise RuntimeError("GITHUB_TOKENS/GITHUB_TOKEN not found in .env (or environment)")
args = parse_args()
client = GitHubClient(tokens=tokens, timeout_sec=args.timeout)
before = client.rate_limit()
result = fetch_repo_bundle(client, args.repo)
after = client.rate_limit()
payload = {
"repo": result,
"request_stats": {
"total": client.stats.total,
"by_label": dict(sorted(client.stats.by_label.items(), key=lambda x: (-x[1], x[0]))),
"by_status": dict(sorted(client.stats.by_status.items(), key=lambda x: (-x[1], x[0]))),
},
"rate_limit_before": before,
"rate_limit_after": after,
}
if args.out:
with open(args.out, "w", encoding="utf-8") as f:
json.dump(payload, f, ensure_ascii=False, indent=2 if args.pretty else None)
print(args.out)
return
text = json.dumps(payload, ensure_ascii=False, indent=2 if args.pretty else None)
sys.stdout.write(text)
if not text.endswith("\n"):
sys.stdout.write("\n")
if __name__ == "__main__":
main()
python
import os
import time
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List, Optional, Tuple
import requests
#author:wanchen
#2026/04/22 16:05
def to_github_iso(dt: datetime) -> str:
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
dt = dt.astimezone(timezone.utc)
dt = dt.replace(microsecond=0)
return dt.isoformat().replace("+00:00", "Z")
@dataclass
class TokenState:
token: str
remaining: Optional[int] = None
reset_at: Optional[datetime] = None
last_used_at: Optional[datetime] = None
class GitHubGraphQLClient:
def __init__(
self,
tokens: List[str],
endpoint: str = "https://api.github.com/graphql",
logger: Optional[logging.Logger] = None,
timeout_s: float = 30.0,
) -> None:
cleaned = [t.strip() for t in tokens if t and t.strip()]
if not cleaned:
raise ValueError("tokens must not be empty")
self._states = [TokenState(token=t) for t in cleaned]
self._endpoint = endpoint
self._timeout_s = timeout_s
self._log = logger or logging.getLogger("github-graphql")
self._rr_idx = 0
def graphql(self, query: str, variables: Dict[str, Any]) -> Dict[str, Any]:
while True:
state = self._pick_token_state()
headers = {
"Authorization": f"Bearer {state.token}",
"Accept": "application/vnd.github+json",
}
payload = {"query": query, "variables": variables}
resp = requests.post(
self._endpoint,
json=payload,
headers=headers,
timeout=self._timeout_s,
)
now = datetime.now(timezone.utc)
state.last_used_at = now
self._update_state_from_headers(state, resp.headers)
if resp.status_code == 401:
raise RuntimeError("GitHub token invalid or lacks permissions (401)")
if resp.status_code == 403:
if self._looks_like_rate_limited(resp):
self._mark_rate_limited(state, resp.headers)
continue
raise RuntimeError(f"GitHub 403: {resp.text}")
resp.raise_for_status()
data = resp.json()
errors = data.get("errors") or []
if errors:
if self._errors_look_like_rate_limit(errors):
self._mark_rate_limited(state, resp.headers)
continue
raise RuntimeError(f"GraphQL errors: {errors}")
return data["data"]
def _pick_token_state(self) -> TokenState:
now = datetime.now(timezone.utc)
available: List[TokenState] = []
sleeping: List[TokenState] = []
for s in self._states:
if s.remaining is None or s.reset_at is None:
available.append(s)
continue
if s.remaining > 0:
available.append(s)
continue
if s.reset_at <= now:
s.remaining = None
s.reset_at = None
available.append(s)
continue
sleeping.append(s)
if available:
start = self._rr_idx % len(available)
chosen = available[start]
self._rr_idx += 1
return chosen
soonest = min(sleeping, key=lambda x: x.reset_at or now)
wait_s = max(1.0, (soonest.reset_at - now).total_seconds()) if soonest.reset_at else 5.0
self._log.warning("All tokens rate-limited, sleeping %.1fs", wait_s)
time.sleep(wait_s)
return soonest
def _update_state_from_headers(self, state: TokenState, headers: Dict[str, str]) -> None:
rem = headers.get("X-RateLimit-Remaining")
reset = headers.get("X-RateLimit-Reset")
if rem is not None:
try:
state.remaining = int(rem)
except ValueError:
pass
if reset is not None:
try:
reset_ts = int(reset)
state.reset_at = datetime.fromtimestamp(reset_ts, tz=timezone.utc)
except ValueError:
pass
def _looks_like_rate_limited(self, resp: requests.Response) -> bool:
msg = (resp.text or "").lower()
return "rate limit" in msg or "secondary rate limit" in msg
def _errors_look_like_rate_limit(self, errors: List[Dict[str, Any]]) -> bool:
for e in errors:
msg = str(e.get("message", "")).lower()
if "rate limit" in msg or "secondary rate limit" in msg:
return True
return False
def _mark_rate_limited(self, state: TokenState, headers: Dict[str, str]) -> None:
state.remaining = 0
reset = headers.get("X-RateLimit-Reset")
if reset is not None:
try:
reset_ts = int(reset)
state.reset_at = datetime.fromtimestamp(reset_ts, tz=timezone.utc)
return
except ValueError:
pass
state.reset_at = datetime.now(timezone.utc) + timedelta(seconds=60)
REPO_NODE_FRAGMENT = """
fragment GitHubRepoNode on Repository {
databaseId
owner {
... on User { databaseId login __typename }
... on Organization { databaseId login __typename }
}
nameWithOwner
licenseInfo { key }
isInOrganization
isFork
isArchived
description
primaryLanguage { name }
diskUsage
stargazerCount
forkCount
latestRelease { createdAt }
pushedAt
createdAt
updatedAt
languages(first: 20, orderBy: {field: SIZE, direction: DESC}) {
edges { node { name } size }
}
repositoryTopics(first: 20) {
nodes { topic { name } }
}
parent { databaseId }
}
"""
SEARCH_REPOS_GQL = (
REPO_NODE_FRAGMENT
+ """
query SearchReposByTimeRange($q: String!, $cursor: String) {
search(type: REPOSITORY, query: $q, first: 100, after: $cursor) {
nodeTotal: repositoryCount
nodes { ...GitHubRepoNode }
pageInfo { hasNextPage endCursor }
}
rateLimit { limit cost remaining resetAt }
}
"""
)
def search_repos_by_time_range(
client: GitHubGraphQLClient,
time_range_field: str,
start: datetime,
end: datetime,
filter_query: str = "",
fetch_nodes_when_over_1k: bool = False,
) -> Tuple[bool, List[Dict[str, Any]]]:
start_iso = to_github_iso(start)
end_iso = to_github_iso(end)
q = f"{filter_query} {time_range_field}:{start_iso}..{end_iso}".strip()
cursor: Optional[str] = None
nodes: List[Dict[str, Any]] = []
more_than_1k = False
total: Optional[int] = None
while True:
data = client.graphql(SEARCH_REPOS_GQL, {"q": q, "cursor": cursor})
search = data["search"]
if total is None:
total = int(search.get("nodeTotal") or 0)
if total > 1000:
more_than_1k = True
if not fetch_nodes_when_over_1k:
return True, []
page_nodes = search.get("nodes") or []
nodes.extend(page_nodes)
page_info = search["pageInfo"]
if not page_info.get("hasNextPage"):
break
if more_than_1k and len(nodes) >= 1000:
break
cursor = page_info.get("endCursor")
if not cursor:
break
return more_than_1k, nodes
def scan_repos_incremental(
client: GitHubGraphQLClient,
time_range_field: str,
start: datetime,
end: datetime,
filter_query: str = "",
base_step: timedelta = timedelta(hours=1),
min_step: timedelta = timedelta(minutes=1),
) -> Iterable[Dict[str, Any]]:
current = start
step_s = float(base_step.total_seconds())
base_s = float(base_step.total_seconds())
min_s = float(min_step.total_seconds())
while current < end:
window_end = min(end, current + timedelta(seconds=step_s))
more_than_1k, nodes = search_repos_by_time_range(
client=client,
time_range_field=time_range_field,
start=current,
end=window_end,
filter_query=filter_query,
fetch_nodes_when_over_1k=False,
)
if more_than_1k and step_s > min_s:
step_s = max(min_s, step_s / 2.0)
continue
for n in nodes:
yield n
current = window_end
if not more_than_1k and step_s < base_s:
step_s = min(base_s, step_s * 2.0)
def parse_tokens_from_env(env_key: str = "GITHUB_TOKENS") -> List[str]:
raw = os.getenv(env_key, "")
tokens = [t.strip() for t in raw.split(",") if t.strip()]
if not tokens:
raise RuntimeError(f"Please set environment variable {env_key} (comma-separated tokens)")
return tokens
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
tokens = parse_tokens_from_env("GITHUB_TOKENS")
client = GitHubGraphQLClient(tokens=tokens)
end_dt = datetime.now(timezone.utc)
start_dt = end_dt - timedelta(hours=6)
count = 0
for repo in scan_repos_incremental(
client=client,
time_range_field="pushed",
start=start_dt,
end=end_dt,
filter_query="stars:>500 language:TypeScript",
base_step=timedelta(hours=1),
min_step=timedelta(minutes=2),
):
count += 1
if count <= 5:
print(repo.get("nameWithOwner"), repo.get("stargazerCount"))
print("total:", count)