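feat: LLM Paper Library, initial commit

- FastAPI backend: REST API + Bearer Token auth + PDF proxy
- Data for 180 papers (data/papers.json): 9 modules, 32 sub-areas
- Frontend: data-driven, radial-gradient glow on cards, in-page PDF reading
- Bottom status bar: arXiv/HF connectivity check
- PDF loading: arXiv first (5s timeout) → HK local fallback
- Dockerized deployment (Dockerfile + start.sh + nginx.conf)
- arXiv + HF batch downloader (api/downloader.py)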
2026-06-02 10:25:14 +00:00
commit f0ff62e082
21 changed files with 4042 additions and 0 deletions

17
.env.example Normal file

@@ -0,0 +1,17 @@
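# LLM Paper Library: environment configuration
# Copy this file to .env (the file start.sh loads) and fill in real values
# Admin API key for POST/PUT/DELETE auth; api/server.py reads LLM_LIB_API_KEY
LLM_LIB_API_KEY=change-me-to-a-strong-random-string
# Server port
PORT=8000
# PDF storage path
PAPER_DIR=papers
# Log level
LOG_LEVEL=info
# Allowed CORS origins (comma-separated)
CORS_ORIGINS=*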

9
.gitignore vendored Normal file

@@ -0,0 +1,9 @@
__pycache__/
*.pyc
papers/arxiv/*.pdf
papers/hf/*.pdf
.env
.env.local
*.egg-info/
.venv/
papers/download.log

16
Dockerfile Normal file
View File

@@ -0,0 +1,16 @@
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends poppler-utils && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
VOLUME ["/app/papers", "/app/data"]
EXPOSE 8000
ENV PORT=8000
ENV LOG_LEVEL=info
CMD ["sh", "-c", "python3 -m uvicorn api.server:app --host 0.0.0.0 --port ${PORT} --log-level ${LOG_LEVEL}"]

123
README.md Normal file

@@ -0,0 +1,123 @@
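# LLM Paper Library
A knowledge base of papers spanning the full LLM technology stack, from architecture design to agent applications: 9 modules, 30+ sub-areas, and 180+ key papers.
**Live site:** https://your-domain.com
**API docs:** https://your-domain.com/docs (FastAPI Swagger)
## Project Structure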
```
llm-library/
├── api/
│ ├── server.py # FastAPI service (REST API + PDF proxy)
│ ├── downloader.py # Batch PDF downloader
│ ├── parse_papers.py # Extracts paper data from HTML
│ └── extract_data.py # Fallback extraction script
├── data/
│ └── papers.json # Paper metadata (single source of truth)
├── papers/
│ ├── arxiv/ # arXiv PDF cache
│ └── hf/ # HuggingFace PDF cache
├── static/ # Frontend (index.html + CSS + JS)
├── start.sh # One-command startup
├── requirements.txt
└── pyproject.toml
```
## Quick Start
```bash
# 1. Configure the API key (api/server.py reads LLM_LIB_API_KEY)
echo "LLM_LIB_API_KEY=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')" > .env
# 2. Install dependencies
pip install -r requirements.txt
# 3. Start the service
bash start.sh
# or
python3 -m uvicorn api.server:app --host 0.0.0.0 --port 8000
```
Once the service is running, open `http://localhost:8000`.
## API Endpoints
| Method | Path | Description | Auth |
|--------|------|-------------|------|
| GET | `/api/stats` | Library statistics | None |
| GET | `/api/modules` | List all modules | None |
| GET | `/api/modules/{id}` | Get module details (with papers) | None |
| GET | `/api/papers?q=xxx` | Search papers | None |
| POST | `/api/papers` | Add a paper | Bearer Token |
| PUT | `/api/papers` | Update a paper | Bearer Token |
| DELETE | `/api/papers` | Delete a paper | Bearer Token |
| GET | `/papers/arxiv/{id}.pdf` | Local PDF proxy | None |
### Admin API Example
```bash
# Add a paper
curl -X POST http://localhost:8000/api/papers \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"module_id": "arch",
"area_id": "attention",
"section": "mainline",
"title": "Paper Title Here",
"authors": "Author et al.",
"year": 2026,
"venue": "arXiv",
"arxiv": "2601.01234",
"tags": ["前沿"]
}'
```
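The same request can be built from Python. This is a minimal sketch using only the standard library; the helper name `build_add_paper_request` and the key value `my-secret-key` are illustrative assumptions, not part of the project. It only constructs the request, so it can be inspected without a running server.

```python
import json
import urllib.request


def build_add_paper_request(base_url: str, api_key: str, paper: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/papers request with Bearer auth."""
    body = json.dumps(paper, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/papers",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Same payload as the curl example above
paper = {
    "module_id": "arch",
    "area_id": "attention",
    "section": "mainline",
    "title": "Paper Title Here",
    "authors": "Author et al.",
    "year": 2026,
    "venue": "arXiv",
    "arxiv": "2601.01234",
    "tags": ["前沿"],
}
req = build_add_paper_request("http://localhost:8000", "my-secret-key", paper)
print(req.get_method(), req.full_url)
```

With the server running, `urllib.request.urlopen(req, timeout=10)` sends it; a wrong or missing key should come back as HTTP 401.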
## PDF Download
```bash
# Download all paper PDFs to the local cache (incremental)
python3 api/downloader.py
# Download only the first 5 papers as a test
python3 api/downloader.py --limit 5
# Force re-download
python3 api/downloader.py --no-incremental
```
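Incremental mode comes down to comparing the arXiv IDs in `data/papers.json` with the files already in the cache. The sketch below illustrates that idea on hypothetical sample data; it assumes the nested module → area → section layout, and the helper names `collect_arxiv_ids` and `missing_ids` are made up for this example.

```python
SECTIONS = ("mainline", "branches", "forward")


def collect_arxiv_ids(data: dict) -> list[str]:
    """Return every distinct arXiv ID, in first-seen order."""
    seen, ordered = set(), []
    for module in data.values():
        for area in module.get("areas", []):
            for section in SECTIONS:
                for paper in area.get(section, []):
                    aid = paper.get("arxiv")
                    if aid and aid not in seen:
                        seen.add(aid)
                        ordered.append(aid)
    return ordered


def missing_ids(data: dict, cached: set[str]) -> list[str]:
    """IDs that still need downloading, given the set of cached file stems."""
    return [aid for aid in collect_arxiv_ids(data) if aid not in cached]


# Hypothetical sample in the same shape as data/papers.json
sample = {
    "arch": {
        "areas": [
            {
                "id": "attention",
                "mainline": [
                    {"title": "A", "arxiv": "1706.03762"},
                    {"title": "B", "arxiv": "2005.14165"},
                ],
                "branches": [{"title": "A again", "arxiv": "1706.03762"}],
                "forward": [{"title": "No arXiv ID"}],
            }
        ]
    }
}

print(missing_ids(sample, cached={"1706.03762"}))  # → ['2005.14165']
```

In a real run the `cached` set would come from something like `{p.stem for p in Path("papers/arxiv").glob("*.pdf")}`.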
## Deployment (Nginx reverse proxy)
```nginx
server {
    listen 80;
    server_name your-domain.com;
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
    # Serve static files directly from Nginx (optional, improves performance)
    location /style.css { alias /path/to/llm-library/static/style.css; }
    location /app.js { alias /path/to/llm-library/static/app.js; }
}
```
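## Data Maintenance
Paper data lives in `data/papers.json` and can also be managed through the API.
**Tag system:**
- 🏁 **起点** (starting point): the foundational paper of a sub-area
- 🔴 **关键节点** (key milestone): a landmark paper that changed the technical direction
- 🟢 **前沿** (frontier): current SOTA, already adopted by mainstream models
- 🟣 **前瞻** (forward-looking): promising ideas not yet adopted by the mainstream (e.g. Engram, Titans)
- 🟠 **支线** (branch): an influential alternative technical route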
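
Tags are stored as plain strings in each paper's `tags` list, so selecting papers by tag is a walk over the nested data. The sketch below assumes the module → area → section layout of `data/papers.json`; the sample data and the helper name `papers_with_tag` are hypothetical.

```python
SECTIONS = ("mainline", "branches", "forward")


def papers_with_tag(data: dict, tag: str) -> list[tuple[str, str, str]]:
    """Return (module_id, area_id, title) for every paper carrying the given tag."""
    hits = []
    for module_id, module in data.items():
        for area in module.get("areas", []):
            for section in SECTIONS:
                for paper in area.get(section, []):
                    if tag in paper.get("tags", []):
                        hits.append((module_id, area["id"], paper["title"]))
    return hits


# Hypothetical sample data in the same shape as data/papers.json
sample = {
    "arch": {
        "name": "Architecture",
        "areas": [
            {
                "id": "attention",
                "name": "Attention",
                "mainline": [
                    {"title": "Paper A", "year": 2017, "tags": ["起点", "关键节点"]},
                    {"title": "Paper B", "year": 2023, "tags": ["前沿"]},
                ],
                "branches": [{"title": "Paper C", "year": 2021, "tags": ["支线"]}],
                "forward": [{"title": "Paper D", "year": 2025, "tags": ["前瞻"]}],
            }
        ],
    }
}

print(papers_with_tag(sample, "前沿"))  # → [('arch', 'attention', 'Paper B')]
```

The tag values are kept in Chinese here because they are the literal strings stored in the data and matched by the `tag` query parameter.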
## License
MIT

1
api/__init__.py Normal file

@@ -0,0 +1 @@
# LLM Paper Library: API module

57
api/backfill.py Normal file
View File

@@ -0,0 +1,57 @@
#!/usr/bin/env python3
"""Backfill all uncached PDFs, skipping dead arXiv IDs"""
import json, subprocess, sys, time
from pathlib import Path

PAPERS_JSON = Path(__file__).resolve().parent.parent / "data" / "papers.json"
ARXIV_DIR = Path(__file__).resolve().parent.parent / "papers" / "arxiv"
LOG_FILE = Path("/app/papers/backfill.log")


def main():
    # papers.json holds non-ASCII text, so read it explicitly as UTF-8
    with open(PAPERS_JSON, encoding="utf-8") as f:
        data = json.load(f)
    # Collect all arxiv IDs
    arxiv_ids = set()
    for mod in data.values():
        for area in mod.get("areas", []):
            for section in ("mainline", "branches", "forward"):
                for p in area.get(section, []):
                    aid = p.get("arxiv")
                    if aid:
                        arxiv_ids.add(aid)
    # Make sure the cache directory exists before wget writes into it
    ARXIV_DIR.mkdir(parents=True, exist_ok=True)
    cached = {p.stem for p in ARXIV_DIR.glob("*.pdf")}
    missing = [aid for aid in arxiv_ids if aid not in cached]
    print(f"Total: {len(arxiv_ids)}, Cached: {len(cached)}, Missing: {len(missing)}")
    if not missing:
        print("All caught up!")
        return
    ok, fail = 0, 0
    for aid in missing:
        url = f"https://arxiv.org/pdf/{aid}.pdf"
        dest = ARXIV_DIR / f"{aid}.pdf"
        try:
            r = subprocess.run(
                ["wget", "-q", "-T", "15", "-O", str(dest), url],
                timeout=20
            )
            # Anything under ~5 KB is treated as an error page, not a real PDF
            if r.returncode == 0 and dest.exists() and dest.stat().st_size > 5000:
                ok += 1
                print(f" OK {aid} ({dest.stat().st_size//1024} KB)")
            else:
                # Report the size before deleting the partial file
                size = dest.stat().st_size if dest.exists() else 0
                dest.unlink(missing_ok=True)
                fail += 1
                print(f" FAIL {aid} (rc={r.returncode}, sz={size})")
        except Exception as e:
            dest.unlink(missing_ok=True)
            fail += 1
            print(f" ERR {aid} {e}")
        time.sleep(0.8)  # Be nice to arXiv
    print(f"\nDone: {ok} ok, {fail} failed")


if __name__ == "__main__":
    main()

13
api/check_trans.py Normal file

@@ -0,0 +1,13 @@
import urllib.request, json
r = urllib.request.urlopen("http://127.0.0.1:8000/api/translate/2005.14165")
data = json.loads(r.read())
p = data["paragraphs"][0]
page1 = p["page"]
en1 = p["en"][:60]
zh1 = p["zh"][:80]
print(f"page={page1}, en={en1}")
print(f"zh={zh1}")
p5 = data["paragraphs"][5]
page5 = p5["page"]
zh5 = p5["zh"][:80]
print(f"para[5] page={page5}, zh={zh5}")

198
api/downloader.py Normal file

@@ -0,0 +1,198 @@
"""
LLM Paper Library: PDF downloader
Downloads paper PDFs from arXiv and HuggingFace into the local cache
"""
import os
import json
import time
import logging
from pathlib import Path

import httpx
from tqdm import tqdm

ROOT = Path(__file__).resolve().parent.parent
DATA_FILE = ROOT / "data" / "papers.json"
ARXIV_DIR = ROOT / "papers" / "arxiv"
HF_DIR = ROOT / "papers" / "hf"
LOG_FILE = ROOT / "papers" / "download.log"

# The log file lives under papers/, so create that directory before attaching the FileHandler
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
log = logging.getLogger("downloader")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(LOG_FILE),
        logging.StreamHandler(),
    ],
)


def collect_urls() -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Collect every PDF URL to download from papers.json

    Returns:
        arxiv_list: [(arxiv_id, title), ...]
        hf_list: [(url, filename), ...]
    """
    with open(DATA_FILE, encoding="utf-8") as f:
        data = json.load(f)
    arxiv_seen = set()
    hf_seen = set()
    arxiv_list = []
    hf_list = []
    for mod in data.values():
        for area in mod.get("areas", []):
            for section in ("mainline", "branches", "forward"):
                for p in area.get(section, []):
                    if p.get("arxiv") and p["arxiv"] not in arxiv_seen:
                        arxiv_seen.add(p["arxiv"])
                        arxiv_list.append((p["arxiv"], p.get("title", "")))
                    if p.get("pdf") and p["pdf"] not in hf_seen:
                        hf_seen.add(p["pdf"])
                        # Derive a safe filename from the URL
                        name = p["pdf"].split("/")[-1].replace(".pdf", "")
                        hf_list.append((p["pdf"], name))
    return arxiv_list, hf_list


def download_arxiv(client: httpx.Client, arxiv_id: str, title: str) -> bool:
    """Download a single arXiv PDF"""
    pdf_path = ARXIV_DIR / f"{arxiv_id}.pdf"
    if pdf_path.exists():
        log.debug(f"Skip (exists): {arxiv_id}")
        return True
    url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    try:
        resp = client.get(url, follow_redirects=True, timeout=30)
        resp.raise_for_status()
        # Verify it's actually a PDF (arxiv returns HTML for missing papers)
        content_type = resp.headers.get("content-type", "")
        if "pdf" not in content_type and not resp.content.startswith(b"%PDF"):
            log.warning(f"Not a PDF: {arxiv_id} - {title[:60]}")
            return False
        pdf_path.write_bytes(resp.content)
        size_kb = len(resp.content) / 1024
        log.info(f"OK: {arxiv_id} ({size_kb:.0f} KB) - {title[:60]}")
        return True
    except httpx.HTTPError as e:
        log.error(f"HTTP error {arxiv_id}: {e}")
        return False
    except Exception as e:
        log.error(f"Error {arxiv_id}: {e}")
        return False


def download_hf(client: httpx.Client, url: str, filename: str) -> bool:
    """Download a single HuggingFace PDF"""
    # Neutralise path separators and ".." so the file stays inside HF_DIR
    safe_name = filename.replace("..", "").replace("/", "_")
    pdf_path = HF_DIR / f"{safe_name}.pdf"
    if pdf_path.exists():
        log.debug(f"Skip (exists): {safe_name}")
        return True
    try:
        resp = client.get(url, follow_redirects=True, timeout=60)
        resp.raise_for_status()
        if not resp.content.startswith(b"%PDF"):
            log.warning(f"Not a PDF: {safe_name}")
            return False
        pdf_path.write_bytes(resp.content)
        size_kb = len(resp.content) / 1024
        log.info(f"OK (HF): {safe_name} ({size_kb:.0f} KB)")
        return True
    except httpx.HTTPError as e:
        log.error(f"HTTP error {safe_name}: {e}")
        return False
    except Exception as e:
        log.error(f"Error {safe_name}: {e}")
        return False


def run(incremental: bool = True, limit: int = 0, delay: float = 1.0):
    """Download all PDFs in batch

    Args:
        incremental: True = skip files that already exist
        limit: 0 = all, N = only the first N papers
        delay: delay between requests (seconds)
    """
    ARXIV_DIR.mkdir(parents=True, exist_ok=True)
    HF_DIR.mkdir(parents=True, exist_ok=True)
    arxiv_list, hf_list = collect_urls()
    total = len(arxiv_list) + len(hf_list)
    log.info(f"Found {len(arxiv_list)} arXiv + {len(hf_list)} HF = {total} PDFs to download")
    log.info(f"Incremental: {incremental}, Delay: {delay}s")
    if not incremental:
        log.warning("Non-incremental mode: will re-download existing files")
    # Count existing
    arxiv_existing = sum(1 for aid, _ in arxiv_list if (ARXIV_DIR / f"{aid}.pdf").exists())
    hf_existing = sum(1 for _, name in hf_list if (HF_DIR / f"{name}.pdf").exists())
    log.info(f"Already cached: {arxiv_existing} arXiv + {hf_existing} HF")
    ok, fail = 0, 0
    total_size = 0.0
    with httpx.Client(
        headers={"User-Agent": "LLM-Library-Downloader/0.1"},
        timeout=30,
        follow_redirects=True,
    ) as client:
        # Download arXiv
        if limit > 0:
            arxiv_list = arxiv_list[:limit]
        for arxiv_id, title in tqdm(arxiv_list, desc="arXiv"):
            pdf_path = ARXIV_DIR / f"{arxiv_id}.pdf"
            if pdf_path.exists():
                if incremental:
                    ok += 1
                    continue
                # download_arxiv() skips existing files, so drop the cached copy to force a re-download
                pdf_path.unlink()
            if download_arxiv(client, arxiv_id, title):
                ok += 1
                if pdf_path.exists():
                    total_size += pdf_path.stat().st_size
            else:
                fail += 1
            time.sleep(delay)
        # Download HF
        if limit > 0:
            hf_list = hf_list[:limit]
        for url, name in tqdm(hf_list, desc="HF "):
            pdf_path = HF_DIR / f"{name}.pdf"
            if pdf_path.exists():
                if incremental:
                    ok += 1
                    continue
                # Same reasoning as above: download_hf() skips existing files
                pdf_path.unlink()
            if download_hf(client, url, name):
                ok += 1
                if pdf_path.exists():
                    total_size += pdf_path.stat().st_size
            else:
                fail += 1
            time.sleep(delay)
    log.info(f"Done: {ok} OK, {fail} failed, {total_size/1024/1024:.1f} MB total")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Download paper PDFs to the local cache")
    parser.add_argument("--no-incremental", action="store_true", help="Re-download everything (default: skip existing files)")
    parser.add_argument("--limit", type=int, default=0, help="Limit the number of downloads (0 = all)")
    parser.add_argument("--delay", type=float, default=1.0, help="Delay between requests (seconds)")
    args = parser.parse_args()
    run(incremental=not args.no_incremental, limit=args.limit, delay=args.delay)

103
api/extract_data.py Normal file

@@ -0,0 +1,103 @@
#!/usr/bin/env python3
"""Build papers.json by regex-extracting paper entries from llm_library.html"""
import re, json, os

ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
html_path = os.path.join(ROOT, 'llm_library.html')
# The HTML contains non-ASCII text, so read it explicitly as UTF-8
with open(html_path, 'r', encoding='utf-8') as f:
    html = f.read()
# Step 1: Parse modules (each module is a top-level key in PAPER_DATA)
# Find each module block by matching " arch: {" style patterns
# Actually, let's parse line by line since this is a human-readable format
# Simpler approach: extract all paper entries with regex
# Pattern: { title:"...", authors:"...", year:..., venue:"...", arxiv:"...", tags:[...] }
paper_re = re.compile(
r'\{\s*title:\s*"([^"]*)",\s*authors:\s*"([^"]*)",\s*year:\s*(\d+),\s*venue:\s*"([^"]*)",\s*'
r'(?:arxiv:\s*"([^"]*)",\s*|pdf:\s*"([^"]*)",\s*|)'
r'tags:\s*\[(.*?)\]\s*\}',
re.DOTALL
)
def last_match(pattern, text):
    """Return the last match of pattern in text (the one nearest the paper entry), or None."""
    found = None
    for found in re.finditer(pattern, text):
        pass
    return found


papers = []
for m in paper_re.finditer(html):
    title = m.group(1)
    authors = m.group(2)
    year = int(m.group(3))
    venue = m.group(4)
    arxiv = m.group(5) or None
    pdf = m.group(6) or None
    tags_str = m.group(7)
    tags = re.findall(r'"([^"]*)"', tags_str)
    # Find which module/area this paper belongs to
    pos = m.start()
    # Search backwards for module and area context: take the match closest
    # to the paper, since re.search would return the earliest one in the window
    before = html[max(0, pos - 3000):pos]
    # Find module id
    mod_match = last_match(r'\n\s*(\w+):\s*\{\s*\{?\s*name:\s*"([^"]*)"', before)
    if not mod_match:
        # Try broader pattern
        mod_match = last_match(r'(\w+):\s*\{[^}]*name:\s*"([^"]*)"', before)
    if mod_match:
        mod_id = mod_match.group(1)
        mod_name = mod_match.group(2)
    else:
        mod_id = 'unknown'
        mod_name = 'Unknown'
    # Find area id
    area_match = last_match(r'id:\s*"(\w+)"[^}]*name:\s*"([^"]*)"', before)
    if area_match:
        area_id = area_match.group(1)
        area_name = area_match.group(2)
    else:
        area_id = 'unknown'
        area_name = 'Unknown'
    papers.append({
        'module': mod_id,
        'module_name': mod_name,
        'area': area_id,
        'area_name': area_name,
        'title': title,
        'authors': authors,
        'year': year,
        'venue': venue,
        'arxiv': arxiv,
        'pdf': pdf,
        'tags': tags,
    })
print(f'Extracted {len(papers)} papers')
# Group by module → area → section (mainline/branches/forward)
# For now, just save as flat list for verification
# We'll reconstruct the proper nested structure after verifying
# Also extract module metadata
modules = {}
for m in re.finditer(r"(\w+):\s*\{\s*name:\s*\"([^\"]+)\"[^}]*icon:\s*\"([^\"]*)\"[^}]*desc:\s*\"([^\"]*)\"", html):
    mod_id = m.group(1)
    modules[mod_id] = {
        'name': m.group(2),
        'icon': m.group(3),
        'desc': m.group(4),
        'color': mod_id,
        'areas': []
    }
print(f'Found {len(modules)} modules')
for mod_id, mod in modules.items():
    print(f' {mod_id}: {mod["name"]}')
# Save the flat list for now
output_path = os.path.join(os.path.dirname(__file__), '..', 'data', 'papers.json')
os.makedirs(os.path.dirname(output_path), exist_ok=True)
# ensure_ascii=False writes raw non-ASCII text, so write the file as UTF-8
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(papers, f, ensure_ascii=False, indent=2)
print(f'Saved {len(papers)} papers (flat) to {output_path}')

107
api/parse_papers.py Normal file

@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""Parse llm_library.html PAPER_DATA block → nested papers.json"""
import re, json, os
HTML = '/app/working/workspaces/default/llm_library.html'
JSON = '/app/working/workspaces/default/llm-library/data/papers.json'
# The HTML contains non-ASCII text, so read it explicitly as UTF-8
with open(HTML, encoding='utf-8') as f:
    html = f.read()
s = html.index('const PAPER_DATA = {')
e = html.index('APP STATE')
block = html[s+22:e]
modules = {}
current_mod = None
current_area = None
current_section = 'mainline'
for line in block.split('\n'):
    stripped = line.strip()
    if not stripped:
        continue
    indent = len(line) - len(line.lstrip())
    # Module start: " arch: {" at indent 2
    if indent == 2 and re.match(r'^\w+:\s*\{', stripped):
        mid = stripped.split(':')[0]
        if current_mod and current_mod.get('name'):
            modules[current_mod['id']] = current_mod
        current_mod = {'id': mid, 'name': '', 'icon': '', 'desc': '', 'color': mid, 'areas': []}
        current_area = None
        current_section = 'mainline'
        continue
    if not current_mod:
        continue
    # Module metadata at indent 4
    if indent == 4:
        m = re.match(r'(\w+):\s*"([^"]*)"', stripped)
        if m and m.group(1) in ('name', 'icon', 'desc', 'color'):
            current_mod[m.group(1)] = m.group(2)
    # Area header at indent 8
    if indent == 8:
        m_id = re.match(r'id:\s*"(\w+)"', stripped)
        if m_id:
            current_area = {'id': m_id.group(1), 'name': '', 'mainline': [], 'branches': [], 'forward': []}
            current_mod['areas'].append(current_area)
            current_section = 'mainline'
            continue
        m_name = re.match(r'name:\s*"([^"]+)"', stripped)
        if m_name and current_area:
            current_area['name'] = m_name.group(1)
            continue
        if re.match(r'mainline:\s*\[', stripped):
            current_section = 'mainline'
        elif re.match(r'branches:\s*\[', stripped):
            current_section = 'branches'
        elif re.match(r'forward:\s*\[', stripped):
            current_section = 'forward'
    # Paper entry
    if stripped.startswith('{ title:') and 'tags:' in stripped and current_area:
        title = re.search(r'title:\s*"([^"]+)"', stripped)
        authors = re.search(r'authors:\s*"([^"]*?)"', stripped)
        year = re.search(r'year:\s*(\d+)', stripped)
        venue = re.search(r'venue:\s*"([^"]*?)"', stripped)
        arxiv = re.search(r'arxiv:\s*"(\S+?)"', stripped)
        pdf = re.search(r'pdf:\s*"(https:[^"]+)"', stripped)
        tags_m = re.search(r'tags:\s*\[(.*?)\]', stripped, re.DOTALL)
        if title and year and tags_m:
            tags = re.findall(r'"([^"]*)"', tags_m.group(1))
            entry = {
                'title': title.group(1),
                'authors': authors.group(1) if authors else '',
                'year': int(year.group(1)),
                'venue': venue.group(1) if venue else '',
                'tags': tags
            }
            if arxiv and arxiv.group(1):
                entry['arxiv'] = arxiv.group(1)
            if pdf and pdf.group(1):
                entry['pdf'] = pdf.group(1)
            current_area[current_section].append(entry)
# Save last module
if current_mod and current_mod.get('name'):
    modules[current_mod['id']] = current_mod
# Count
total = sum(
    len(a.get('mainline', [])) + len(a.get('branches', [])) + len(a.get('forward', []))
    for m in modules.values() for a in m.get('areas', [])
)
print(f'Parsed: {len(modules)} modules, {total} papers')
for mid, m in sorted(modules.items()):
    pc = sum(len(a.get('mainline', [])) + len(a.get('branches', [])) + len(a.get('forward', [])) for a in m['areas'])
    print(f' {mid}: {m["name"]} - {len(m["areas"])} areas, {pc} papers')
os.makedirs(os.path.dirname(JSON), exist_ok=True)
# ensure_ascii=False writes raw non-ASCII text, so write the file as UTF-8
with open(JSON, 'w', encoding='utf-8') as f:
    json.dump(modules, f, ensure_ascii=False, indent=2)
print(f'\nSaved to {JSON}')

484
api/server.py Normal file

@@ -0,0 +1,484 @@
"""
LLM Paper Library: FastAPI backend
REST API for querying and managing papers, plus a PDF proxy service
"""
import json
import os
import hashlib
import secrets
import logging
from pathlib import Path
from typing import Optional
from fastapi import FastAPI, HTTPException, Query, Depends, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import FileResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
# ─── Config ────────────────────────────────────────────
ROOT = Path(__file__).resolve().parent.parent
DATA_FILE = ROOT / "data" / "papers.json"
PAPERS_DIR = ROOT / "papers"
API_KEY = os.environ.get("LLM_LIB_API_KEY", "change-me")
log = logging.getLogger("llm-library")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
# ─── App ───────────────────────────────────────────────
app = FastAPI(
    title="LLM Paper Library",
    description="LLM paper knowledge base API: query, search, and manage papers",
    version="0.1.0",
)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
# ─── Auth ──────────────────────────────────────────────
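def verify_api_key(request: Request):
    """Simple API key auth, used for write operations (POST/PUT/DELETE)"""
    auth = request.headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        token = auth[7:]
    else:
        token = request.query_params.get("api_key", "")
    # Constant-time comparison on bytes, so the check does not leak timing
    # information and non-ASCII tokens do not raise TypeError
    if not token or not secrets.compare_digest(token.encode(), API_KEY.encode()):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return True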
# ─── Data loading ──────────────────────────────────────
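def load_data():
    if not DATA_FILE.exists():
        return {}
    # papers.json holds non-ASCII text, so read and write it explicitly as UTF-8
    with open(DATA_FILE, 'r', encoding='utf-8') as f:
        return json.load(f)


def save_data(data):
    with open(DATA_FILE, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)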
# ─── Paper CRUD helpers ────────────────────────────────
def find_paper(data, module_id, area_id, title):
    """Find a paper index by title within module/area"""
    mod = data.get(module_id)
    if not mod:
        return None, None, None, None
    for area in mod.get("areas", []):
        if area["id"] == area_id:
            for section in ("mainline", "branches", "forward"):
                for i, p in enumerate(area.get(section, [])):
                    if p["title"] == title:
                        return mod, area, section, i
    return None, None, None, None
# ─── Routes: Query ─────────────────────────────────────
@app.get("/api/stats")
def get_stats():
    """Return library statistics"""
    data = load_data()
    mods = len(data)
    areas = 0
    papers = 0
    sections = {"mainline": 0, "branches": 0, "forward": 0}
    for mod in data.values():
        areas += len(mod.get("areas", []))
        for area in mod.get("areas", []):
            for s in ("mainline", "branches", "forward"):
                n = len(area.get(s, []))
                papers += n
                sections[s] += n
    return {
        "modules": mods,
        "areas": areas,
        "papers": papers,
        "sections": sections,
        "data_file": str(DATA_FILE),
    }


@app.get("/api/modules")
def list_modules():
    """List all modules (without paper details)"""
    data = load_data()
    return [
        {
            "id": mid,
            "name": m["name"],
            "icon": m["icon"],
            "desc": m["desc"],
            "area_count": len(m.get("areas", [])),
            "paper_count": sum(
                len(a.get("mainline", [])) + len(a.get("branches", [])) + len(a.get("forward", []))
                for a in m.get("areas", [])
            ),
        }
        for mid, m in data.items()
    ]


@app.get("/api/modules/{module_id}")
def get_module(module_id: str):
    """Return the full paper data for a single module"""
    data = load_data()
    mod = data.get(module_id)
    if not mod:
        raise HTTPException(status_code=404, detail=f"Module '{module_id}' not found")
    return mod


@app.get("/api/papers")
def search_papers(
    q: str = Query(default="", description="Search keyword: title/authors"),
    module: Optional[str] = Query(default=None),
    tag: Optional[str] = Query(default=None, description="起点/关键节点/前沿/前瞻/支线"),
    limit: int = Query(default=50, ge=1, le=200),
):
    """Search papers (by keyword in title/authors, by module, by tag)"""
    data = load_data()
    results = []
    q = q.lower()
    for mid, mod in data.items():
        if module and mid != module:
            continue
        for area in mod.get("areas", []):
            for section in ("mainline", "branches", "forward"):
                for p in area.get(section, []):
                    # Filter by tag
                    if tag and tag not in p.get("tags", []):
                        continue
                    # Filter by query
                    if q:
                        if q not in (p.get("title", "") + p.get("authors", "")).lower():
                            continue
                    results.append({
                        "module_id": mid,
                        "module_name": mod["name"],
                        "area_id": area["id"],
                        "area_name": area["name"],
                        "section": section,
                        **p,
                    })
                    # Stop at the limit; each enclosing loop repeats the check to unwind
                    if len(results) >= limit:
                        break
                if len(results) >= limit:
                    break
            if len(results) >= limit:
                break
        if len(results) >= limit:
            break
    return results
# ─── Routes: Management (write operations, API key required) ────
class PaperCreate(BaseModel):
    module_id: str
    area_id: str
    section: str = "mainline"  # mainline / branches / forward
    title: str
    authors: str = ""
    year: int
    venue: str = ""
    arxiv: Optional[str] = None
    pdf: Optional[str] = None
    tags: list[str] = []


class PaperUpdate(BaseModel):
    authors: Optional[str] = None
    year: Optional[int] = None
    venue: Optional[str] = None
    arxiv: Optional[str] = None
    pdf: Optional[str] = None
    tags: Optional[list[str]] = None
    section: Optional[str] = None  # move to different section


@app.post("/api/papers", dependencies=[Depends(verify_api_key)])
def add_paper(paper: PaperCreate):
    """Add a new paper"""
    data = load_data()
    mod = data.get(paper.module_id)
    if not mod:
        raise HTTPException(status_code=404, detail="Module not found")
    area = next((a for a in mod["areas"] if a["id"] == paper.area_id), None)
    if not area:
        raise HTTPException(status_code=404, detail="Area not found")
    section = paper.section
    if section not in ("mainline", "branches", "forward"):
        raise HTTPException(status_code=400, detail="section must be mainline/branches/forward")
    entry = {
        "title": paper.title,
        "authors": paper.authors,
        "year": paper.year,
        "venue": paper.venue,
        "tags": paper.tags,
    }
    if paper.arxiv:
        entry["arxiv"] = paper.arxiv
    if paper.pdf:
        entry["pdf"] = paper.pdf
    area.setdefault(section, []).append(entry)
    save_data(data)
    log.info(f"Added paper: {paper.title}")
    return {"ok": True, "title": paper.title}


@app.put("/api/papers")
def update_paper(
    module_id: str,
    area_id: str,
    title: str,
    update: PaperUpdate,
    _=Depends(verify_api_key),
):
    """Update a paper"""
    data = load_data()
    mod, area, section, idx = find_paper(data, module_id, area_id, title)
    if mod is None:
        raise HTTPException(status_code=404, detail="Paper not found")
    paper = area[section][idx]
    for field in ("authors", "year", "venue", "arxiv", "pdf", "tags"):
        val = getattr(update, field)
        if val is not None:
            paper[field] = val
    # Move to different section?
    if update.section and update.section != section:
        if update.section not in ("mainline", "branches", "forward"):
            raise HTTPException(status_code=400, detail="Invalid section")
        area[section].pop(idx)
        area.setdefault(update.section, []).append(paper)
    save_data(data)
    log.info(f"Updated paper: {title}")
    return {"ok": True, "title": title}


@app.delete("/api/papers")
def delete_paper(
    module_id: str,
    area_id: str,
    title: str,
    _=Depends(verify_api_key),
):
    """Delete a paper"""
    data = load_data()
    mod, area, section, idx = find_paper(data, module_id, area_id, title)
    if mod is None:
        raise HTTPException(status_code=404, detail="Paper not found")
    area[section].pop(idx)
    save_data(data)
    log.info(f"Deleted paper: {title}")
    return {"ok": True, "title": title}
# ─── Routes: PDF proxy ──────────────────────────────────
@app.get("/papers/arxiv/{arxiv_id}.pdf")
@app.get("/papers/arxiv/{arxiv_id}")
def serve_arxiv_pdf(arxiv_id: str):
    """Serve an arXiv PDF from the local cache (the route without the .pdf suffix keeps IDM from intercepting it)"""
    # Decorators apply bottom-up, so the suffix-less route is registered first and
    # also matches "<id>.pdf"; strip the suffix so both URL forms hit the same file
    arxiv_id = arxiv_id.removesuffix(".pdf")
    pdf_path = PAPERS_DIR / "arxiv" / f"{arxiv_id}.pdf"
    if not pdf_path.exists():
        raise HTTPException(status_code=404, detail=f"PDF not in local cache: {arxiv_id}")
    return FileResponse(
        pdf_path, media_type="application/pdf",
        headers={"Cache-Control": "public, max-age=86400"},
    )


@app.get("/papers/hf/{filename}.pdf")
@app.get("/papers/hf/{filename}")
def serve_hf_pdf(filename: str):
    """Serve a HuggingFace PDF from the local cache (the route without the .pdf suffix keeps IDM from intercepting it)"""
    safe_name = filename.replace("..", "").replace("/", "_").removesuffix(".pdf")
    pdf_path = PAPERS_DIR / "hf" / f"{safe_name}.pdf"
    if not pdf_path.exists():
        raise HTTPException(status_code=404, detail=f"PDF not in local cache: {filename}")
    return FileResponse(
        pdf_path, media_type="application/pdf",
        headers={"Cache-Control": "public, max-age=86400"},
    )
# ─── Routes: Translation ───────────────────────────────
TRANSLATE_CACHE = ROOT / "data" / "translations"
TRANSLATE_CACHE.mkdir(parents=True, exist_ok=True)


def extract_pdf_text_with_pages(pdf_path: Path, max_chars: int = 12000) -> list[dict]:
    """Extract text with page numbers from a PDF, using pdftotext (Poppler) to avoid the PyMuPDF GPU dependency"""
    import subprocess, tempfile
    # Strip arXiv stamp (first page header)
    stamp = f"{pdf_path.stem}.pdf"  # e.g. "1706.03762.pdf"
    result = subprocess.run(
        ["pdftotext", "-layout", "-q", str(pdf_path), "-"],
        capture_output=True, text=True, timeout=30
    )
    if result.returncode != 0:
        log.error(f"pdftotext failed: {result.stderr}")
        raise HTTPException(status_code=500, detail="PDF text extraction failed")
    text = result.stdout
    # Remove arXiv stamp line
    import re
    text = re.sub(r'arXiv:' + re.escape(stamp.split('.pdf')[0]) + r'.*?\n\n', '', text, flags=re.DOTALL)
    text = re.sub(r'arXiv:' + re.escape(stamp) + r'.*?\n\n', '', text, flags=re.DOTALL)
    # Split by form-feed (page break)
    pages = text.split('\f')
    result_pages = []
    total = 0
    for i, page_text in enumerate(pages):
        pt = page_text.strip()
        if not pt:
            continue
        result_pages.append({"page": i + 1, "text": pt})
        total += len(pt)
        if total >= max_chars:
            break
    if not result_pages:
        raise HTTPException(status_code=500, detail="No text extracted from PDF")
    return result_pages


def split_text_with_pages(page_texts: list[dict], max_len: int = 400) -> list[dict]:
    """Split per-page text further into paragraphs, keeping the page number"""
    chunks = []
    for pt in page_texts:
        page = pt["page"]
        text = pt["text"]
        raw_paras = [p.strip() for p in text.split("\n\n") if p.strip()]
        for para in raw_paras:
            if len(para) <= max_len:
                chunks.append({"page": page, "text": para})
            else:
                # Long paragraph: break on sentence ends, then pack sentences up to max_len
                sentences = para.replace(". ", ".|").replace("? ", "?|").replace("! ", "!|").split("|")
                current = ""
                for s in sentences:
                    s = s.strip()
                    if not s:
                        continue
                    if len(current) + len(s) + 1 <= max_len:
                        current = (current + " " + s).strip()
                    else:
                        if current:
                            chunks.append({"page": page, "text": current})
                        current = s
                if current:
                    chunks.append({"page": page, "text": current})
    return chunks


def translate_text(text: str, source: str = "en", target: str = "zh") -> str:
    """Translate text with the free MyMemory API; returns the input unchanged on failure"""
    import urllib.request
    import urllib.parse
    url = "https://api.mymemory.translated.net/get"
    params = urllib.parse.urlencode({
        "q": text,
        "langpair": f"{source}|{target}",
        "mt": "1",  # Force machine translation, not memory
        "de": "me@llm-library.local",
    })
    full_url = f"{url}?{params}"
    try:
        with urllib.request.urlopen(full_url, timeout=15) as resp:
            data = json.loads(resp.read())
    except Exception as e:
        log.warning(f"Translation API error: {e}")
        return text
    if data.get("responseStatus") == 200 and data.get("responseData"):
        return data["responseData"]["translatedText"]
    return text


@app.get("/api/translate/{arxiv_id}")
def translate_paper(arxiv_id: str):
    """Translate a paper's body text (extracted from the local PDF, each paragraph tagged with its page number)"""
    pdf_path = PAPERS_DIR / "arxiv" / f"{arxiv_id}.pdf"
    if not pdf_path.exists():
        raise HTTPException(status_code=404, detail=f"PDF not cached: {arxiv_id}")
    cache_file = TRANSLATE_CACHE / f"{arxiv_id}.json"
    if cache_file.exists():
        with open(cache_file, encoding="utf-8") as f:
            return json.load(f)
    # Extract text with page numbers
    log.info(f"Extracting text from {arxiv_id}")
    page_texts = extract_pdf_text_with_pages(pdf_path)
    chunks = split_text_with_pages(page_texts)
    log.info(f"Translating {len(chunks)} paragraphs for {arxiv_id}")
    translated = []
    for i, chunk in enumerate(chunks):
        if i % 10 == 0:
            log.info(f" [{arxiv_id}] translating paragraph {i+1}/{len(chunks)}")
        zh = translate_text(chunk["text"])
        translated.append({
            "page": chunk["page"],
            "en": chunk["text"],
            "zh": zh,
        })
    result = {"arxiv_id": arxiv_id, "paragraphs": translated, "count": len(translated)}
    # ensure_ascii=False writes raw Chinese text, so write the cache as UTF-8
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False)
    return result


@app.get("/api/translate/{arxiv_id}/status")
def translate_status(arxiv_id: str):
    """Check the translation cache status"""
    cache_file = TRANSLATE_CACHE / f"{arxiv_id}.json"
    return {
        "arxiv_id": arxiv_id,
        "cached": cache_file.exists(),
        "pdf_exists": (PAPERS_DIR / "arxiv" / f"{arxiv_id}.pdf").exists(),
    }
# ─── Routes: PDF download on-demand ────────────────────
@app.post("/api/download/{arxiv_id}")
def download_single_pdf(arxiv_id: str):
    """Download a single arXiv PDF on demand"""
    import httpx
    pdf_path = PAPERS_DIR / "arxiv" / f"{arxiv_id}.pdf"
    if pdf_path.exists():
        return {"ok": True, "arxiv_id": arxiv_id, "status": "cached"}
    # Fetch the requested ID directly. Running downloader.py with "--limit 1" only
    # ever considers the first paper in papers.json, not the one asked for.
    url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    try:
        resp = httpx.get(url, follow_redirects=True, timeout=30)
        resp.raise_for_status()
    except httpx.TimeoutException:
        return {"ok": False, "arxiv_id": arxiv_id, "status": "timeout"}
    except httpx.HTTPError:
        return {"ok": False, "arxiv_id": arxiv_id, "status": "failed"}
    # arXiv answers with HTML for unknown IDs, so check the PDF magic bytes
    if not resp.content.startswith(b"%PDF"):
        return {"ok": False, "arxiv_id": arxiv_id, "status": "failed"}
    pdf_path.parent.mkdir(parents=True, exist_ok=True)
    pdf_path.write_bytes(resp.content)
    return {"ok": True, "arxiv_id": arxiv_id, "status": "downloaded"}
# ─── Health ─────────────────────────────────────────────
@app.get("/api/health")
def health():
    return {"status": "ok", "version": "0.1.0"}


# ─── Mount static frontend (at /) ──────────────────────
# Static files mounted after API routes to avoid conflicts
static_dir = ROOT / "static"
if static_dir.exists() and any(static_dir.iterdir()):
    app.mount("/", StaticFiles(directory=str(static_dir), html=True), name="static")


# ─── Main ───────────────────────────────────────────────
def main():
    import uvicorn
    uvicorn.run("api.server:app", host="0.0.0.0", port=8000, reload=True)


if __name__ == "__main__":
    main()

2175
data/papers.json Normal file

File diff suppressed because it is too large

55
nginx.conf Normal file

@@ -0,0 +1,55 @@
# LLM Paper Library: Nginx reverse-proxy configuration
# Deploy to: /etc/nginx/sites-available/llm-library
# Request rate limiting (abuse protection). limit_req_zone is only valid in the
# http context, so the zone is declared outside the server block.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
server {
    listen 80;
    server_name your-domain.com;
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "no-referrer" always;
    # API proxy
    location /api/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        limit_req zone=api burst=20 nodelay;
    }
    # PDF proxy: force inline display so IDM does not pop up a download
    location /papers/ {
        # Same upstream port as the /api/ and / locations
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
        # Add an inline header to keep IDM from intercepting
        add_header Content-Disposition "inline" always;
        add_header X-Content-Type-Options "nosniff" always;
    }
    # Static assets served directly by Nginx (faster than proxying)
    location /style.css {
        alias /opt/llm-library/static/style.css;
        expires 1d;
    }
    location /app.js {
        alias /opt/llm-library/static/app.js;
        expires 1d;
    }
    location /favicon.ico {
        return 204;
    }
    # Home page
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
    }
}

10
proxy_conf.txt Normal file

@@ -0,0 +1,10 @@
location / {
    proxy_pass http://127.0.0.1:8741;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    # Force inline display so IDM does not trigger a download
    add_header Content-Disposition "inline" always;
    add_header X-Content-Type-Options "nosniff" always;
}

17
pyproject.toml Normal file

@@ -0,0 +1,17 @@
[project]
name = "llm-library"
version = "0.1.0"
description = "LLM 论文图书馆 — 可维护的大模型论文知识库"
requires-python = ">=3.10"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "httpx>=0.28",
    "pydantic>=2.10",
    "python-multipart>=0.0.19",
    "aiofiles>=24.0",
    "tqdm>=4.66",
    # Listed in requirements.txt; kept here so both install paths agree
    "PyMuPDF>=1.24",
]
[project.scripts]
llm-lib = "api.server:main"

8
requirements.txt Normal file

@@ -0,0 +1,8 @@
fastapi>=0.115
uvicorn[standard]>=0.34
httpx>=0.28
pydantic>=2.10
python-multipart>=0.0.19
aiofiles>=24.0
tqdm>=4.66
PyMuPDF>=1.24

43
start.sh Executable file

@@ -0,0 +1,43 @@
#!/bin/bash
# LLM Paper Library: startup script
set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR"
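# Load environment variables (set -a exports everything the file assigns)
if [ -f .env ]; then
    set -a
    . ./.env
    set +a
fi
# Accept API_KEY (the name used in .env.example and written below) as a fallback,
# so a previously generated key is reused instead of being regenerated on every run
: "${LLM_LIB_API_KEY:=${API_KEY:-}}"
# Generate an API key if none is set
if [ -z "$LLM_LIB_API_KEY" ]; then
    LLM_LIB_API_KEY=$(python3 -c "import secrets; print(secrets.token_urlsafe(32))")
    # Append rather than overwrite so other settings in .env survive
    echo "API_KEY=$LLM_LIB_API_KEY" >> .env
    echo "⚠️ 自动生成 API Key: $LLM_LIB_API_KEY"
fi
export LLM_LIB_API_KEY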
echo "═══ LLM 论文图书馆 ═══"
echo " API Key: ${LLM_LIB_API_KEY:0:8}..."
echo " Port: ${PORT:-8000}"
echo " PDF Dir: papers/"
echo
# First run: install dependencies
if ! python3 -c "import fastapi" 2>/dev/null; then
    echo "📦 安装依赖..."
    pip install -r requirements.txt -q
fi
# If papers.json is missing, re-extract it from the HTML
if [ ! -f data/papers.json ]; then
    echo "📊 提取论文数据..."
    python3 api/extract_data.py || echo "⚠️ extract_data.py 失败,请手动运行"
fi
# Start the server
echo "🚀 启动服务..."
exec python3 -m uvicorn api.server:app \
    --host 0.0.0.0 \
    --port "${PORT:-8000}" \
    --log-level "${LOG_LEVEL:-info}"

252
static/app.js Normal file

@@ -0,0 +1,252 @@
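/**
 * LLM Paper Library: frontend JS
 * On page load, probe arXiv/HF connectivity and show the result in the bottom status bar.
 * On paper click: if the source is reachable, load it directly in an iframe;
 * after a 5 s timeout, fall back to the local copy on the HK server.
 * If IDM intercepts the download, let it; no extra countermeasures.
 */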
const API = '/api';
let modules = {};
let moduleData = {};
let pdfTimeout = null;
let networkStatus = {};
const $ = (sel) => document.querySelector(sel);
const $$ = (sel) => document.querySelectorAll(sel);
const TAG_CLASS = { '起点':'tag-start','关键节点':'tag-milestone','前沿':'tag-frontier','前瞻':'tag-forward','支线':'tag-branch' };
// ══════════════════ INIT ═══════════════════════════════
async function init() {
buildStatusBar();
try {
const resp = await fetch(`${API}/modules`);
const mods = await resp.json();
for (const m of mods) modules[m.id] = m;
renderCards(mods);
attachGlowTracking();
} catch (e) { console.error(e); }
checkSources();
document.addEventListener('keydown', e => {
if (e.key === 'Escape') {
if ($('#pdfOverlay')?.classList.contains('open')) closePdf();
else if ($('#overlay').classList.contains('open')) closeModal();
}
});
}
// ══════════════════ STATUS BAR ════════════════════════
function buildStatusBar() {
const bar = document.createElement('div');
bar.id = 'statusBar'; bar.className = 'status-bar';
bar.innerHTML = `<span class="status-label">连通性检测</span>
<span class="status-item" id="status-arxiv"><span class="status-dot"></span> arXiv <span class="status-ms" id="ms-arxiv">—</span></span>
<span class="status-item" id="status-hf"><span class="status-dot"></span> HuggingFace <span class="status-ms" id="ms-hf">—</span></span>`;
document.body.appendChild(bar);
}
function setStatus(id, ok, ms, aborted) {
networkStatus[id.replace('status-','')] = ok;
const el = document.getElementById(id); if (!el) return;
el.querySelector('.status-dot').className = 'status-dot ' + (ok ? 'status-ok' : 'status-fail');
const msEl = document.getElementById('ms-'+id.replace('status-',''));
if (msEl) {
if (aborted) msEl.textContent = '超时';
else if (ok) msEl.textContent = ms+'ms';
else msEl.textContent = '—';
}
}
async function checkSource(name, url, statusId) {
  const start = performance.now();
  let aborted = false;
  const probeUrl = url + '?_=' + Date.now();
  const ctrl = new AbortController();
  // Abort the probe after 4 s; the timer is cleared once the fetch settles
  const timer = setTimeout(() => { aborted = true; ctrl.abort(); }, 4000);
  try {
    // no-cors yields an opaque response; resolving at all means the host is reachable
    await fetch(probeUrl, { mode: 'no-cors', signal: ctrl.signal, cache: 'no-store' });
    setStatus(statusId, true, Math.round(performance.now() - start), false);
  } catch {
    setStatus(statusId, false, Math.round(performance.now() - start), aborted);
  } finally {
    clearTimeout(timer);
  }
}
function checkSources() {
checkSource('arxiv', 'https://arxiv.org/favicon.ico', 'status-arxiv');
checkSource('hf', 'https://huggingface.co/favicon.ico', 'status-hf');
}
// ══════════════════ GLOW ══════════════════════════════
function attachGlowTracking() {
$$('.card').forEach(card => {
card.addEventListener('mousemove', e => {
const r = card.getBoundingClientRect();
card.style.setProperty('--mx', (e.clientX-r.left)/r.width*100+'%');
card.style.setProperty('--my', (e.clientY-r.top)/r.height*100+'%');
});
});
}
// ══════════════════ CARDS → MODAL → PAPERS ═══════════
function renderCards(mods) {
$('#moduleGrid').innerHTML = mods.map(m =>
`<div class="card card-${m.id}" onclick="openModule('${m.id}')">
<div class="card-header"><span class="card-icon">${m.icon}</span> ${m.name}</div>
<div class="card-desc">${m.desc}</div>
<div class="card-badge">${m.area_count} 子领域 · ${m.paper_count} 篇</div>
</div>`).join('');
}
async function openModule(modId) {
  let data = moduleData[modId];
  if (!data) {
    try { data = await (await fetch(`${API}/modules/${modId}`)).json(); moduleData[modId]=data; }
    catch { return; }
  }
  const mod = modules[modId];
  $('#modalTitle').innerHTML = `<span class="card-icon">${mod?.icon||''}</span> ${data.name}`;
  $('#modalSubtitle').textContent = data.desc||'';
  const areas = data.areas||[];
  const tabs = $('#modalTabs');
  tabs.innerHTML = areas.map((a,i)=>`<button class="tab ${i?'':'active'}" onclick="switchArea(${i})">${a.name}</button>`).join('');
  // Remember which module is open so switchArea() reads the right area list
  tabs.dataset.modId = modId;
  if (areas.length) { tabs.dataset.areaIdx='0'; renderPapers(areas[0]); }
  $('#overlay').classList.add('open');
}
function switchArea(idx) {
  // Use the currently open module, not the first cached module that happens to have an area at idx
  const data = moduleData[$('#modalTabs').dataset.modId];
  if (!data || !data.areas || !data.areas[idx]) return;
  $('#modalTabs').dataset.areaIdx = String(idx);
  $$('#modalTabs .tab').forEach((t,i)=>t.classList.toggle('active',i===idx));
  renderPapers(data.areas[idx]);
}
function closeModal() { $('#overlay').classList.remove('open'); }
function renderPapers(area) {
const s = (l, ps, c) => ps.length ? `<div class="section-label ${c}">${l}</div>`+ps.map(renderPaper).join('') : '';
$('#modalContent').innerHTML =
s('📌 主线论文', area.mainline||[],'mainline') +
s('🌿 支线论文', area.branches||[],'branch') +
s('🔮 前瞻探索', area.forward||[],'forward') ||
'<p style="color:var(--text-dim);padding:20px;">暂无论文数据</p>';
}
function renderPaper(p) {
const pdfUrl = getPdfLink(p);
const tags = (p.tags||[]).map(t=>`<span class="paper-tag ${TAG_CLASS[t]||'tag-branch'}">${t}</span>`).join(' ');
const links = [];
if (pdfUrl) links.push(`<button class="paper-link" data-pdf="${encodeURIComponent(pdfUrl)}" data-title="${encodeURIComponent(p.title)}" onclick="openPdfBtn(this)">📄 阅读</button>`);
else if (p.arxiv) links.push(`<a class="paper-link" href="https://arxiv.org/abs/${p.arxiv}" target="_blank">📋 arXiv</a>`);
return `<div class="paper-item"><div class="paper-year">${p.year||'—'}</div><div class="paper-body">
<div class="paper-title">${p.title}</div>
<div class="paper-meta"><span>${p.authors||''}</span>${p.venue?`<span class="paper-venue">${p.venue}</span>`:''}${tags}</div>
<div class="paper-links">${links.join('')}</div>
</div></div>`;
}
function getPdfLink(p) {
  // A pdf field holds the external source URL (direct arXiv or HF link)
  if (p.pdf) return p.pdf;
  // Only an arXiv ID: build the direct arXiv PDF URL
  if (p.arxiv) return `https://arxiv.org/pdf/${p.arxiv}.pdf`;
  return null;
}
function openPdfBtn(btn) { openPdf(btn.dataset.pdf, btn.dataset.title); }
// ══════════════════ PDF VIEWER ════════════════════════
function getLocalUrl(extUrl) {
// arXiv
const am = extUrl.match(/arxiv\.org\/pdf\/(\d+\.\d+)/);
if (am) return `/papers/arxiv/${am[1]}.pdf`;
// HuggingFace
if (extUrl.includes('huggingface.co')) {
const name = decodeURIComponent(extUrl).split('/').pop().replace('.pdf','');
return `/papers/hf/${name}.pdf`;
}
return null;
}
function openPdf(url, title) {
  const decodedUrl = decodeURIComponent(url);
  const decodedTitle = decodeURIComponent(title);
  // Build the overlay on first use
  if (!$('#pdfOverlay')) {
    const div = document.createElement('div'); div.id='pdfOverlay'; div.className='pdf-overlay';
    // The "new window" button opens whatever the iframe currently shows
    // (the original fell back to an undeclared `currentPdf` variable)
    div.innerHTML = `<div class="pdf-container" id="pdfContainer">
<div class="pdf-toolbar"><span id="pdfTitle">PDF</span><span id="pdfStatus" style="color:var(--orange);font-size:0.8em;margin-left:8px;"></span>
<button onclick="window.open($('#pdfFrame').src,'_blank')">🔗 新窗口</button>
<button class="pdf-close" onclick="closePdf()">&times;</button></div>
<iframe class="pdf-frame" id="pdfFrame" src=""></iframe></div>`;
    document.body.appendChild(div);
    div.addEventListener('click', e => { if (e.target===div) closePdf(); });
  }
  if (pdfTimeout) { clearTimeout(pdfTimeout); pdfTimeout=null; }
  $('#pdfTitle').textContent = decodedTitle;
  $('#pdfStatus').textContent = '';
  const frame = $('#pdfFrame');
  let loaded = false;
  frame.src = 'about:blank';
  frame.onload = ()=>{ loaded=true; if(pdfTimeout){clearTimeout(pdfTimeout);pdfTimeout=null;} $('#pdfStatus').textContent=''; };
  const hkLocalUrl = getLocalUrl(decodedUrl);
  const isArxiv = decodedUrl.includes('arxiv.org');
  const isHF = decodedUrl.includes('huggingface.co');
  const isRemote = isArxiv || isHF;
  // A source counts as reachable unless its probe has explicitly failed
  const sourceOk = isArxiv ? networkStatus['arxiv'] !== false
    : isHF ? networkStatus['hf'] !== false
    : false;
  // Show the overlay first
  $('#pdfOverlay').classList.add('open');
  if (isRemote && sourceOk) {
    // Load the remote source directly in the iframe
    loaded = false;
    $('#pdfStatus').textContent = isArxiv ? '🌐 arXiv 加载中...' : '🤗 HuggingFace 加载中...';
    frame.src = decodedUrl;
    pdfTimeout = setTimeout(() => {
      if (loaded) return;
      if (!hkLocalUrl) { $('#pdfStatus').textContent='⚠️ 超时'; return; }
      $('#pdfStatus').textContent = '⏳ 超时,走 HK 服务器...';
      frame.src = hkLocalUrl;
    }, 5000);
  } else {
    // Go straight to the HK server
    $('#pdfStatus').textContent = hkLocalUrl ? '📂 HK 服务器加载中...' : '⏳ 加载中...';
    frame.src = hkLocalUrl || decodedUrl;
  }
}
function closePdf() {
if (pdfTimeout) { clearTimeout(pdfTimeout); pdfTimeout=null; }
$('#pdfOverlay').classList.remove('open');
$('#pdfFrame').src = '';
}
// ══════════════════ SEARCH ════════════════════════════
function searchPapers(q) {
q=(q||'').toLowerCase().trim();
if (q.length<2) { $$('.card').forEach(c=>c.style.outline=''); return; }
fetch(`${API}/papers?q=${encodeURIComponent(q)}&limit=10`).then(r=>r.json()).then(ps=>{
const matched = new Set(ps.map(p=>p.module_id));
$$('.card').forEach(c=>{
const m=[...c.classList].find(x=>x.startsWith('card-'))?.replace('card-','');
c.style.outline=matched.has(m)?'2px solid var(--green)':'';
});
});
}
// ══════════════════ EVENTS ════════════════════════════
if (typeof document !== 'undefined') {
document.addEventListener('DOMContentLoaded', () => {
init();
$('.search-input').addEventListener('input', e => searchPapers(e.target.value));
$('#modalClose').addEventListener('click', closeModal);
$('#overlay').addEventListener('click', e => { if (e.target===$('#overlay')) closeModal(); });
});
}
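
The loading strategy described in the app.js header comment (remote source first, HK copy after a 5 s timeout) can be separated from the DOM code and tested on its own. The block below is a minimal sketch, not part of the commit: `localMirrorUrl` and `planPdfLoad` are hypothetical names, the URL patterns are copied from `getLocalUrl()` and `openPdf()`, and the returned object shape is an assumption made for illustration.

```javascript
// Hypothetical helper: maps an external PDF URL to its local mirror path.
// The patterns mirror getLocalUrl() in app.js.
function localMirrorUrl(extUrl) {
  const arxiv = extUrl.match(/arxiv\.org\/pdf\/(\d+\.\d+)/);
  if (arxiv) return `/papers/arxiv/${arxiv[1]}.pdf`;
  if (extUrl.includes('huggingface.co')) {
    const name = decodeURIComponent(extUrl).split('/').pop().replace(/\.pdf$/, '');
    return `/papers/hf/${name}.pdf`;
  }
  return null;
}

// Hypothetical helper: decides which URL to load first and what to fall back to.
// networkStatus has the same shape as the global in app.js: { arxiv?: boolean, hf?: boolean }.
function planPdfLoad(url, networkStatus) {
  const isArxiv = url.includes('arxiv.org');
  const isHF = url.includes('huggingface.co');
  // A source counts as reachable unless its probe has explicitly failed.
  const sourceOk = isArxiv ? networkStatus.arxiv !== false
    : isHF ? networkStatus.hf !== false
    : false;
  const local = localMirrorUrl(url);
  if ((isArxiv || isHF) && sourceOk) {
    // Remote first; switch to the local mirror if nothing loads within 5 s.
    return { first: url, fallback: local, timeoutMs: 5000 };
  }
  // Source unreachable or unknown host: use the local mirror if one exists.
  return { first: local || url, fallback: null, timeoutMs: 0 };
}
```

With a plan object like this, `openPdf()` would only need to set `frame.src` to `first` and arm a timer for `fallback`, and the decision logic could be unit-tested without an iframe.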

34
static/index.html Normal file

@@ -0,0 +1,34 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM 论文图书馆</title>
<link rel="stylesheet" href="/style.css">
</head>
<body>
<div class="header">
<h1>🧠 LLM 论文图书馆</h1>
</div>
<div class="search-wrap">
<span class="search-icon">🔍</span>
<input class="search-input" placeholder="搜索论文标题、作者、关键词..." autocomplete="off">
</div>
<div class="grid" id="moduleGrid"></div>
<div class="overlay" id="overlay">
<div class="modal" id="modal">
<button class="modal-close" id="modalClose">&times;</button>
<div class="modal-title" id="modalTitle"></div>
<div class="modal-subtitle" id="modalSubtitle"></div>
<div class="tabs" id="modalTabs"></div>
<div id="modalContent"></div>
</div>
</div>
<script src="/app.js"></script>
</body>
</html>

7
static/pdf.min.js vendored Normal file

@@ -0,0 +1,7 @@
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

313
static/style.css Normal file
View File

@@ -0,0 +1,313 @@
/* LLM Paper Library: frontend styles (extracted from llm_library.html) */
:root {
--bg: #0a0e14;
--bg-card: #12171f;
--bg-modal: #0d1117;
--border: #1e293b;
--text: #c9d1d9;
--text-dim: #8b949e;
--text-bright: #e6edf3;
--blue: #58a6ff; --blue-bg: #0d1b2a; --blue-border: #1a3a5c;
--green: #3fb950; --green-bg: #0d1f14; --green-border: #1a3d1a;
--red: #f85149; --red-bg: #1f0d0d; --red-border: #3d1a1a;
--purple: #bc8cff; --purple-bg: #190d2a; --purple-border: #2d1a3d;
--orange: #d2991d; --orange-bg: #1f160d; --orange-border: #3d2d1a;
--cyan: #39d2c0; --cyan-bg: #0d1f1c; --cyan-border: #1a3d3a;
--pink: #f778ba; --pink-bg: #1f0d1a; --pink-border: #3d1a2d;
--yellow: #e3b341; --yellow-bg: #1f1a0d; --yellow-border: #3d3d1a;
--teal: #79c0ff; --teal-bg: #0d1d2d; --teal-border: #1a2d3d;
--radius: 10px; --radius-sm: 6px; --transition: 0.2s ease;
}
*, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
body {
background: var(--bg); color: var(--text);
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Noto Sans SC', sans-serif;
padding: 32px 24px 60px; min-height: 100vh;
}
.header { text-align: center; margin-bottom: 32px; }
.header h1 {
font-size: 1.5em;
background: linear-gradient(135deg, var(--blue), var(--purple), var(--pink));
-webkit-background-clip: text; background-clip: text; -webkit-text-fill-color: transparent;
}
.header p { color: var(--text-dim); font-size: 0.85em; margin-top: 4px; }
.grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(260px, 1fr));
gap: 16px; max-width: 1400px; margin: 0 auto;
}
.card {
background: var(--bg-card); border: 1px solid var(--border);
border-radius: var(--radius); padding: 18px 16px;
cursor: pointer; position: relative; overflow: hidden;
transition: border-color 0.3s ease, box-shadow 0.3s ease;
}
.card::before {
content: ''; position: absolute; inset: 0;
background: radial-gradient(circle at var(--mx, 50%) var(--my, 50%),
rgba(88,166,255,0.12) 0%, transparent 60%);
opacity: 0; transition: opacity 0.3s ease;
pointer-events: none; z-index: 0;
}
.card:hover::before { opacity: 1; }
.card:hover {
border-color: rgba(88,166,255,0.4);
box-shadow: 0 0 20px rgba(88,166,255,0.08), inset 0 0 20px rgba(88,166,255,0.03);
}
.card-header, .card-desc, .card-badge { position: relative; z-index: 1; }
.card-icon { font-size: 1.3em; }
.card-desc { font-size: 0.8em; color: var(--text-dim); line-height: 1.5; }
.card-badge {
position: absolute; top: 12px; right: 12px;
font-size: 0.7em; background: rgba(255,255,255,.06);
padding: 2px 8px; border-radius: 10px; color: var(--text-dim);
}
.card-arch { border-left: 3px solid var(--blue); }
.card-multi { border-left: 3px solid var(--cyan); }
.card-data { border-left: 3px solid var(--green); }
.card-pretrain { border-left: 3px solid var(--red); }
.card-post { border-left: 3px solid var(--purple);}
.card-compress { border-left: 3px solid var(--orange);}
.card-deploy { border-left: 3px solid var(--teal); }
.card-agent { border-left: 3px solid var(--pink); }
.card-eval { border-left: 3px solid var(--yellow);}
.card-arch .card-header { color: var(--blue); }
.card-multi .card-header { color: var(--cyan); }
.card-data .card-header { color: var(--green); }
.card-pretrain .card-header { color: var(--red); }
.card-post .card-header { color: var(--purple);}
.card-compress .card-header { color: var(--orange);}
.card-deploy .card-header { color: var(--teal); }
.card-agent .card-header { color: var(--pink); }
.card-eval .card-header { color: var(--yellow);}
/* Modal */
.overlay {
display: none; position: fixed; inset: 0;
background: rgba(0,0,0,.7); z-index: 100;
justify-content: center; align-items: flex-start;
padding: 40px 16px; overflow-y: auto;
}
.overlay.open { display: flex; }
.modal {
background: var(--bg-modal); border: 1px solid var(--border);
border-radius: var(--radius); width: 100%; max-width: 960px;
padding: 28px 24px; position: relative;
animation: fadeIn 0.2s ease;
}
@keyframes fadeIn { from { opacity: 0; transform: translateY(8px); } to { opacity: 1; transform: translateY(0); } }
.modal-close {
position: absolute; top: 16px; right: 16px;
background: none; border: none; color: var(--text-dim);
font-size: 1.4em; cursor: pointer;
width: 36px; height: 36px; border-radius: 50%;
display: flex; align-items: center; justify-content: center;
}
.modal-close:hover { background: rgba(255,255,255,.06); color: var(--text); }
.modal-title { font-size: 1.2em; font-weight: 700; margin-bottom: 6px; display: flex; align-items: center; gap: 8px; }
.modal-subtitle { color: var(--text-dim); font-size: 0.85em; margin-bottom: 24px; }
/* Tabs */
.tabs { display: flex; gap: 4px; flex-wrap: wrap; border-bottom: 1px solid var(--border); margin-bottom: 20px; }
.tab {
padding: 8px 16px; font-size: 0.85em; color: var(--text-dim);
cursor: pointer; border: none; background: none;
border-bottom: 2px solid transparent;
transition: color var(--transition), border-color var(--transition);
}
.tab:hover { color: var(--text); }
.tab.active { color: var(--text-bright); border-bottom-color: var(--blue); font-weight: 600; }
/* Section labels */
.section-label {
font-size: 0.78em; font-weight: 700; text-transform: uppercase;
letter-spacing: 1px; margin: 16px 0 8px; padding: 4px 0;
}
.section-label.mainline { color: var(--blue); border-bottom: 1px solid var(--blue-border); }
.section-label.branch { color: var(--orange); border-bottom: 1px solid var(--orange-border); }
.section-label.forward { color: var(--purple); border-bottom: 1px solid var(--purple-border); }
/* Paper item */
.paper-item {
display: flex; align-items: flex-start; gap: 12px;
padding: 10px 12px; border-radius: var(--radius-sm);
margin-bottom: 4px; transition: background var(--transition);
}
.paper-item:hover { background: rgba(255,255,255,.03); }
.paper-year { flex-shrink: 0; width: 42px; font-size: 0.75em; color: var(--text-dim); font-variant-numeric: tabular-nums; text-align: right; padding-top: 1px; }
.paper-body { flex: 1; min-width: 0; }
.paper-title { font-size: 0.9em; font-weight: 600; color: var(--text-bright); line-height: 1.4; margin-bottom: 2px; }
.paper-meta { font-size: 0.75em; color: var(--text-dim); display: flex; gap: 8px; flex-wrap: wrap; align-items: center; }
.paper-venue { color: var(--green); font-weight: 500; }
/* Tags */
.paper-tag { display: inline-block; font-size: 0.7em; padding: 1px 6px; border-radius: 4px; font-weight: 600; }
.tag-start { background: #1a3a5c; color: var(--blue); }
.tag-milestone{ background: #3d1a1a; color: var(--red); }
.tag-frontier { background: #1a3d1a; color: var(--green);}
.tag-forward { background: #2d1a3d; color: var(--purple);}
.tag-branch { background: #3d2d1a; color: var(--orange);}
/* Links */
.paper-links { display: flex; gap: 6px; margin-top: 4px; flex-wrap: wrap; }
.paper-link {
font-size: 0.72em; color: var(--blue); text-decoration: none;
padding: 2px 8px; border: 1px solid var(--blue-border);
border-radius: 4px; transition: background var(--transition); cursor: pointer;
}
.paper-link:hover { background: var(--blue-bg); }
/* Search */
.search-wrap { max-width: 600px; margin: 0 auto 28px; position: relative; }
.search-input {
width: 100%; padding: 10px 16px 10px 38px;
background: var(--bg-card); border: 1px solid var(--border);
border-radius: var(--radius); color: var(--text); font-size: 0.9em; outline: none;
}
.search-input:focus { border-color: var(--blue); }
.search-icon { position: absolute; left: 12px; top: 50%; transform: translateY(-50%); color: var(--text-dim); }
@media (max-width: 640px) { .grid { grid-template-columns: 1fr; } .modal { padding: 20px 16px; } }
/* PDF viewer */
.pdf-overlay {
display: none; position: fixed; inset: 0; background: rgba(0,0,0,.85);
z-index: 200; justify-content: center; align-items: center;
transition: justify-content 0.3s ease;
}
.pdf-overlay.open { display: flex; }
.pdf-container {
width: 90vw; height: 90vh; background: #fff; border-radius: var(--radius);
display: flex; flex-direction: column; overflow: hidden;
}
.pdf-toolbar {
display: flex; align-items: center; gap: 12px; padding: 10px 16px;
background: #1a1a2e; color: #eee; font-size: 0.85em;
}
.pdf-toolbar button {
background: rgba(255,255,255,.1); border: none; color: #eee;
padding: 4px 12px; border-radius: 4px; cursor: pointer; font-size: 0.85em;
}
.pdf-toolbar button:hover { background: rgba(255,255,255,.2); }
.pdf-toolbar .pdf-close { margin-left: auto; font-size: 1.2em; }
.pdf-frame { flex: 1; border: none; width: 100%; }
/* Translation */
.trans-btn {
font-size: 0.7em; color: var(--cyan); cursor: pointer;
padding: 2px 6px; border: 1px solid var(--cyan-border);
border-radius: 4px; background: transparent;
margin-left: 4px; transition: background var(--transition);
}
.trans-btn:hover { background: var(--cyan-bg); }
.trans-btn.loading { opacity: 0.5; pointer-events: none; }
.paper-title-zh {
font-size: 0.8em; color: var(--cyan); margin-top: 3px; line-height: 1.4;
}
/* Status bar */
.status-bar {
position: fixed; bottom: 0; left: 0; right: 0;
background: var(--bg-card); border-top: 1px solid var(--border);
display: flex; gap: 20px; align-items: center; padding: 8px 20px;
z-index: 50; font-size: 0.78em; color: var(--text-dim);
}
.status-label {
color: var(--text-dim); font-weight: 600; margin-right: 4px;
}
.status-item {
display: flex; align-items: center; gap: 6px;
}
.status-ms {
color: var(--text-dim); font-variant-numeric: tabular-nums;
margin-left: 2px;
}
.status-dot {
width: 8px; height: 8px; border-radius: 50%;
background: var(--text-dim); transition: background 0.3s;
}
.status-dot.status-ok { background: var(--green); }
.status-dot.status-fail { background: var(--red); }
/* Animate width changes when the translation panel opens (base .pdf-container rule is defined above) */
.pdf-container { transition: width 0.3s ease; }
.pdf-container.with-trans { width: 55vw; }
/* Translation panel (right sidebar) */
.trans-panel {
position: fixed; right: -50vw; top: 5vh; bottom: 5vh;
width: 45vw; min-width: 380px; max-width: 650px;
background: var(--bg-modal); border: 1px solid var(--border);
border-radius: var(--radius);
z-index: 201; display: flex; flex-direction: column;
transition: right 0.35s cubic-bezier(0.4,0,0.2,1);
box-shadow: -4px 0 24px rgba(0,0,0,.5);
overflow: hidden;
}
.trans-panel.open { right: 2vw; }
.trans-panel-header {
display: flex; align-items: center; gap: 12px;
padding: 12px 16px; border-bottom: 1px solid var(--border);
font-weight: 700; font-size: 1em; flex-shrink: 0;
}
.trans-panel-close {
margin-left: auto; background: none; border: none;
color: var(--text-dim); font-size: 1.3em; cursor: pointer;
}
.trans-panel-body {
flex: 1; overflow-y: auto; padding: 16px;
scroll-behavior: smooth;
}
.trans-panel-footer {
padding: 10px 16px; border-top: 1px solid var(--border); flex-shrink: 0;
}
.trans-panel-footer button {
background: var(--blue-bg); color: var(--blue);
border: 1px solid var(--blue-border); padding: 6px 16px;
border-radius: 4px; cursor: pointer; font-size: 0.85em;
}
.trans-panel-footer button:hover { background: #1a3a5c; }
.trans-panel-footer button:disabled { opacity: 0.5; cursor: not-allowed; }
.trans-para-group {
margin-bottom: 18px; padding-bottom: 14px;
border-bottom: 1px solid rgba(255,255,255,.04);
transition: background 0.3s ease;
}
.trans-para-group.active {
background: rgba(88,166,255,.06);
border-left: 2px solid var(--blue);
padding-left: 10px;
margin-left: -12px;
padding-right: 4px;
border-radius: 0 4px 4px 0;
}
.trans-en, .trans-zh {
font-size: 0.82em; line-height: 1.65;
}
.trans-page-badge {
font-size: 0.65em; color: var(--text-dim); margin-bottom: 2px;
font-variant-numeric: tabular-nums;
}
.trans-en { color: var(--text); margin-bottom: 4px; }
.trans-zh { color: var(--cyan); padding-left: 8px; border-left: 2px solid var(--cyan-border); }
/* Scroll sync highlight */
.trans-para-group.highlight {
background: rgba(88,166,255,.12);
border-left: 3px solid var(--blue);
padding-left: 9px;
margin-left: -12px;
padding-right: 4px;
border-radius: 0 4px 4px 0;
}