--- title: 可自动化、可扩展、AI增强的电商数据采集与处理系统 tags: - "#ecommerce" - "#scrapy" - ai - "#playwright" created: 2025-11-11 published: author: - Wei Shen description: 可自动化、可扩展、AI增强的电商数据采集与处理系统 link: source: --- 非常好的思路——你想要的是一个**可自动化、可扩展、AI增强的数据采集与处理系统**,基于 Docker + Ubuntu + n8n 搭建。下面我给你一个详细的设计与工具选择建议,从爬取到分析的整体架构。 --- ## 🧩 一、系统整体架构建议 你的目标系统可以分为三个层次: |层次|组件|说明| |---|---|---| |**数据采集层(爬虫)**|Scrapy / Playwright / Selenium / Apify|从各大电商网站采集结构化信息(标题、描述、图片、视频等)| |**数据处理层(自动化管道)**|n8n + LLM API (e.g., OpenAI, Ollama, LM Studio)|对采集数据进行清洗、分类、摘要、翻译、属性提取等AI处理| |**存储与展示层**|PostgreSQL / SQLite + MinIO / NAS + Grafana / Metabase|存储文本、图片和视频元数据,并可视化结果| --- ## 🕷️ 二、爬虫工具推荐与对比 |工具|适用场景|优点|缺点| |---|---|---|---| |**Scrapy**|静态页面、电商产品信息|轻量高效、插件生态丰富、可Docker化部署|对JS渲染页面支持弱,需要配合Splash或Playwright| |**Playwright (Python/Node.js)**|动态渲染页面、滚动加载、视频图片加载|可模拟浏览器、支持无头模式、可靠性高|相对重,适合单站点深度采集| |**Apify (Open Source SDK)**|通用网页爬取+API接口+调度|已内置防封禁策略、支持Docker|学习曲线略陡、对纯本地化部署需定制| |**Colly (Go语言)**|高性能爬虫服务、轻量API爬取|性能强、可编译为二进制Docker镜像|JS支持弱,不适合电商复杂页面| |**Crawlee (Node.js)**|Apify的开源核心框架,支持Playwright/Selenium|与n8n、LangChain容易集成|需要JS/TS开发基础| **推荐组合:** > ✅ **Scrapy + Playwright(或Crawlee + Playwright)** - Scrapy 负责结构化抓取、分页调度、下载媒体; - Playwright 负责加载动态页面; - 两者可通过 Docker Compose 容器化; - 输出 JSON 或 CSV 文件供 n8n 消费。 --- ## ⚙️ 三、Docker 架构示例 Scrapy 爬完的数据(JSON/CSV)存放在 `/data`,由 n8n 定时读取和处理。 --- ## 🤖 四、n8n 中的AI集成与自动化 在 n8n 中可以通过 workflow 实现整个管线自动化: **工作流示例:** 1. **Cron Trigger** → 定时启动 2. **Execute Command Node** → 执行 `docker exec scraper scrapy crawl amazon` 3. **Read File Node** → 读取爬取的 JSON 文件 4. **OpenAI / Ollama Node** → 调用 LLM 对产品描述提取属性(品牌、型号、价格区间、关键词等) 5. **Database Node** → 写入 PostgreSQL / SQLite 6. **Webhook / Email Node** → 生成报表或通知 你也可以用 n8n 的 “HTTP Request Node” 抓取 API 页面(适合公开电商API如阿里、Shopee)。 --- ## 🧠 五、AI 处理建议 结合 LLM 做如下任务: - **内容摘要与分类**(提炼产品特点) - **多语言翻译** - **特征提取**(品牌、价格、类别) - **异常检测**(自动识别异常价格或缺图产品) - **结构化JSON输出**,便于数据库存储 可以本地使用 **Ollama** (Mistral, Llama3) 模型,通过 n8n 的 HTTP Request 调用本地 `http://localhost:11434/api/generate`。 --- ## 🗄️ 六、数据存储与管理建议 |数据类型|存储建议|工具| |---|---|---| |文本/结构化数据|PostgreSQL / SQLite|适合AI分析、可视化| |图片/视频|MinIO / 本地NAS|模拟S3存储方式| |Dashboard|Grafana / Metabase|生成电商趋势与分析报表| --- ## 🛡️ 七、防封与安全建议 - 使用 **User-Agent轮换**(Scrapy 内置 `scrapy-user-agents`) - **代理池**(BrightData / ScraperAPI / 免费代理) - 对同一网站设置 **下载延迟 + 随机化访问** - **分布式调度** 可用 `Scrapyd` 或 `Archetype`(Scrapy集群) - 不采集个人隐私信息或侵权内容(仅公开页面信息) --- ## 🚀 八、后续可扩展方向 - 增加 **FastAPI 服务层**:暴露 REST API 给前端或 BI 工具; - 使用 **LangChain** 结合 Vector DB(e.g. Qdrant / Milvus)存储产品语义信息; - n8n 可触发自动生成市场趋势报告(AI Summarizer); - 可视化电商类目趋势(Grafana + Postgres)。 --- ## ✅ 推荐起步路径 1. **在 Ubuntu 上安装 Docker + Docker Compose** 2. 启动基础环境:`scrapy + playwright + n8n` 3. 选择 1–2 个电商站点(如 Amazon / JD / Taobao) 4. 构建 Scrapy 爬虫模板 5. 用 n8n 处理数据并测试 AI 工作流 6. 逐步扩展至全自动管线 --- 太好了 ✅ 我将为你生成一个**可直接部署的爬虫自动化系统骨架**,基于以下目标设计: > **目标:** 在 Ubuntu 上使用 Docker Compose 部署一个自动化电商爬虫系统,采集电商网站产品信息 → 存入本地文件 → 由 n8n 自动读取并利用 AI 处理。 下面是完整结构与代码示例。 --- ## 🧩 一、项目目录结构 ```bash ecommerce-crawler/ ├── docker-compose.yml ├── scrapy/ │ ├── Dockerfile │ ├── requirements.txt │ ├── scrapy.cfg │ └── ecommerce/ │ ├── __init__.py │ ├── items.py │ ├── middlewares.py │ ├── pipelines.py │ ├── settings.py │ └── spiders/ │ └── amazon_spider.py ├── n8n_data/ # n8n 数据持久化 ├── data/ # 爬取结果输出目录 └── README.md ``` --- ## 🧱 二、`docker-compose.yml` ```yaml version: '3.8' services: # Scrapy 爬虫容器 scraper: build: ./scrapy container_name: ecommerce-scraper working_dir: /app/scrapy volumes: - ./data:/app/data depends_on: - playwright environment: - PLAYWRIGHT_BROWSERS_PATH=/ms-playwright networks: - crawler-net # Playwright 浏览器支持容器 playwright: image: mcr.microsoft.com/playwright/python:v1.48.0-jammy shm_size: 2gb networks: - crawler-net # n8n 自动化平台 #n8n: # image: n8nio/n8n:latest # container_name: n8n # ports: # - 5678:5678 # environment: # - N8N_BASIC_AUTH_ACTIVE=true # - N8N_BASIC_AUTH_USER=admin # - N8N_BASIC_AUTH_PASSWORD=changeme # - N8N_PATH=/workflows # volumes: # - ./n8n_data:/home/node/.n8n # - ./data:/data # networks: # - crawler-net networks: crawler-net: ``` --- ## 🐍 三、Scrapy 部分 ### `scrapy/Dockerfile` ```dockerfile FROM mcr.microsoft.com/playwright/python:v1.48.0-jammy WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . # 安装 Playwright 浏览器 RUN playwright install WORKDIR /app CMD ["scrapy", "crawl", "amazon"] ``` --- ### `scrapy/requirements.txt` ```txt scrapy==2.13.3 playwright==1.48.0 scrapy-playwright==0.0.44 ``` > 说明:`scrapy-playwright` 插件可直接让 Scrapy 调用 Playwright 渲染动态页面,非常适合电商网站。 --- ### `scrapy/scrapy.cfg` ```ini [settings] default = settings [deploy] # 如果你将来要用 scrapyd 部署,可以在这里定义目标(可忽略) # url = http://localhost:6800/ # project = crawler ``` --- ### `scrapy/ecommerce/settings.py` ```python BOT_NAME = "scrapy" SPIDER_MODULES = ["spiders"] # 指向当前目录下的 spiders NEWSPIDER_MODULE = "spiders" # 新建 spider 时默认放在这里 ROBOTSTXT_OBEY = False DOWNLOAD_DELAY = 2 DOWNLOAD_HANDLERS = { "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler", } TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" PLAYWRIGHT_LAUNCH_OPTIONS = { "headless": True, "args": ["--no-sandbox", "--disable-setuid-sandbox"], } PLAYWRIGHT_BROWSER_TYPE = "chromium" FEEDS = { "/app/data/amazon.json": {"format": "json", "overwrite": True}, } ``` --- ### `scrapy/ecommerce/items.py` ```python import scrapy class ProductItem(scrapy.Item): title = scrapy.Field() price = scrapy.Field() rating = scrapy.Field() image_urls = scrapy.Field() images = scrapy.Field() product_url = scrapy.Field() ``` --- ### `scrapy/ecommerce/pipelines.py` ```python import json class JsonWriterPipeline: def open_spider(self, spider): self.file = open("data/products.json", "w", encoding="utf-8") self.file.write("[\n") def close_spider(self, spider): self.file.write("]\n") self.file.close() def process_item(self, item, spider): line = json.dumps(dict(item), ensure_ascii=False) + ",\n" self.file.write(line) return item ``` --- ### `scrapy/ecommerce/spiders/amazon_spider.py` > ⚠️ 仅作演示用途,使用通用搜索页采集,不涉及登录或侵权内容。 ```python import scrapy from scrapy_playwright.page import PageMethod class QuotesSpider(scrapy.Spider): name = "amazon" start_urls = ["https://quotes.toscrape.com/js/"] custom_settings = { "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True}, "PLAYWRIGHT_CONTEXT_ARGS": {"viewport": {"width": 1280, "height": 720}}, } def start_requests(self): for url in self.start_urls: yield scrapy.Request( url, meta={ "playwright": True, "playwright_page_methods": [ PageMethod("wait_for_selector", "div.quote") ], }, ) async def parse(self, response): quotes = response.css("div.quote") for quote in quotes: text = quote.css("span.text::text").get() author = quote.css("small.author::text").get() yield {"text": text, "author": author} # 翻页 next_page = response.css("li.next a::attr(href)").get() if next_page: next_url = response.urljoin(next_page) yield scrapy.Request( next_url, meta={ "playwright": True, "playwright_page_methods": [ PageMethod("wait_for_selector", "div.quote") ], }, callback=self.parse ) ``` ### `scrapy/ecommerce/spiders/__init__.py` ```python #empty ``` --- ## ⚙️ 四、运行方式 ```bash # 1. 构建并启动所有容器 docker compose up --build -d # 2. 进入 Scrapy 容器执行爬虫 docker exec -it ecommerce-scraper scrapy crawl amazon # 3. 查看数据 cat data/products.json ``` ### Docker 命令 ``` bash docker compose down -v docker compose up --build -d docker exec -it ecommerce-scraper bash #进入交互模式 docker run -it --entrypoint bash ecommerce-scraper ``` --- ## 🤖 五、n8n 自动化流程(示例) 你可以在 n8n Web UI([http://localhost:5678)导入如下逻辑:](http://localhost:5678%EF%BC%89%E5%AF%BC%E5%85%A5%E5%A6%82%E4%B8%8B%E9%80%BB%E8%BE%91%EF%BC%9A/) **Workflow 示例逻辑:** 1. **Cron Node**:每天凌晨触发 2. **Execute Command Node**:执行命令 `docker exec ecommerce-scraper scrapy crawl amazon` 3. **Read File Node**:读取 `/data/products.json` 4. **OpenAI Node**(或 HTTP Request 调本地 Ollama): - Prompt 示例: ```text 从以下JSON中提取每个产品的简短摘要(不超过30字)并分类。 ``` 5. **Write to Database Node**(或 Google Sheet / CSV) 保存提炼结果 6. **Telegram / Email Node** 发送日报通知 --- ## 💡 六、后续增强建议 |功能|工具|说明| |---|---|---| |图像下载/视频下载|Scrapy 自带 `ImagesPipeline` 或 `requests`|下载并存储本地或 MinIO| |多站点并行爬取|Scrapy 多 Spider + n8n 并行调度|自动轮询不同站点| |数据可视化|Metabase / Grafana|接 PostgreSQL 实时展示产品趋势| |向量搜索|Qdrant / Weaviate|用于AI语义检索| |本地AI处理|Ollama + Llama3|替代外部API| --- 是否希望我帮你再生成一个 **完整可导入的 n8n workflow JSON 文件模板**(包含 Cron、执行 Scrapy、读取 JSON、调用 AI、存入数据库)? 这样你只需导入到 n8n UI 就能立即自动化整个流程。 非常好 👍 下面是一个**完整可导入的 n8n Workflow JSON 模板**,它能自动: 1️⃣ 定时运行 Scrapy 爬虫(通过 Docker 命令) 2️⃣ 读取爬取的数据(`products.json`) 3️⃣ 调用 AI(可选择 OpenAI 或本地 Ollama)进行摘要与分类 4️⃣ 存入数据库或文件 5️⃣ 发送报告通知 --- ## 🧩 一、Workflow 功能概览 |步骤|节点类型|功能| |---|---|---| |1|**Cron Trigger**|每天凌晨 2:00 自动触发| |2|**Execute Command**|执行 `docker exec ecommerce-scraper scrapy crawl amazon`| |3|**Read Binary File**|读取 `/data/products.json`| |4|**OpenAI (或 HTTP Request)**|提炼摘要与分类(可切换 Ollama)| |5|**Write Binary File**|输出 `data/products_summary.json`| |6|**Email (或 Telegram)**|发送日报通知| --- ## 📦 二、Workflow JSON 模板(可直接导入) 将以下 JSON 内容保存为 👉 `workflow_ecommerce_automation.json` 然后在 n8n Web UI → **Import from file** 导入。 ```json { "name": "Ecommerce Crawler + AI Summary", "nodes": [ { "parameters": { "triggerTimes": { "item": [ { "mode": "everyDay", "hour": 2 } ] } }, "id": "1", "name": "Cron Trigger", "type": "n8n-nodes-base.cron", "typeVersion": 1, "position": [250, 250] }, { "parameters": { "command": "docker exec ecommerce-scraper scrapy crawl amazon" }, "id": "2", "name": "Run Scrapy Crawler", "type": "n8n-nodes-base.executeCommand", "typeVersion": 1, "position": [500, 250] }, { "parameters": { "path": "/data/products.json", "options": {} }, "id": "3", "name": "Read Products JSON", "type": "n8n-nodes-base.readBinaryFile", "typeVersion": 1, "position": [750, 250] }, { "parameters": { "functionCode": "const data = JSON.parse(Buffer.from(items[0].binary.data.data, 'base64').toString());\nreturn data.map(p => ({ json: p }));" }, "id": "4", "name": "Parse JSON", "type": "n8n-nodes-base.function", "typeVersion": 1, "position": [1000, 250] }, { "parameters": { "model": "gpt-4-turbo", "prompt": "你是一个电商产品分析助手。请从以下产品信息中提取每个产品的简短摘要(不超过30字)并归类到相应产品类别。\n\n输入数据:{{$json[\"title\"]}},价格:{{$json[\"price\"]}},评分:{{$json[\"rating\"]}}。\n\n输出格式:{\"title\":\"...\",\"summary\":\"...\",\"category\":\"...\"}" }, "id": "5", "name": "AI Summarize & Categorize", "type": "n8n-nodes-base.openAi", "typeVersion": 2, "position": [1250, 250], "credentials": { "openAIApi": { "id": "YOUR-OPENAI-CREDENTIAL-ID", "name": "OpenAI API" } } }, { "parameters": { "fileName": "/data/products_summary.json", "options": {}, "dataPropertyName": "json" }, "id": "6", "name": "Write Summary JSON", "type": "n8n-nodes-base.writeBinaryFile", "typeVersion": 1, "position": [1500, 250] }, { "parameters": { "fromEmail": "noreply@yourdomain.com", "toEmail": "your@email.com", "subject": "Daily Product Summary Report", "text": "今日电商产品摘要已生成,请查看 /data/products_summary.json 文件。" }, "id": "7", "name": "Send Email Notification", "type": "n8n-nodes-base.emailSend", "typeVersion": 1, "position": [1750, 250] } ], "connections": { "Cron Trigger": { "main": [[{ "node": "Run Scrapy Crawler", "type": "main", "index": 0 }]] }, "Run Scrapy Crawler": { "main": [[{ "node": "Read Products JSON", "type": "main", "index": 0 }]] }, "Read Products JSON": { "main": [[{ "node": "Parse JSON", "type": "main", "index": 0 }]] }, "Parse JSON": { "main": [[{ "node": "AI Summarize & Categorize", "type": "main", "index": 0 }]] }, "AI Summarize & Categorize": { "main": [[{ "node": "Write Summary JSON", "type": "main", "index": 0 }]] }, "Write Summary JSON": { "main": [[{ "node": "Send Email Notification", "type": "main", "index": 0 }]] } }, "active": true } ``` --- ## 🧠 三、使用说明 1. **导入流程** - 访问 `http://<你的服务器IP>:5678` - 点击右上角「Import from File」→ 选择上面的 JSON 文件 2. **配置 OpenAI 凭证** - n8n → Credentials → 新建 OpenAI API Key - 将 `YOUR-OPENAI-CREDENTIAL-ID` 替换为对应 ID - 若使用本地 Ollama,则: - 替换节点 `AI Summarize & Categorize` 为 “HTTP Request” - URL 填入 `http://ollama:11434/api/generate` - POST body 示例: ```json { "model": "llama3", "prompt": "分析以下产品信息:{{$json[\"title\"]}} ..." } ``` 3. **设置邮箱凭证** - n8n → Credentials → Email Send (SMTP) - 支持 Gmail, Outlook, AWS SES 等 4. **测试运行** - 手动点击 ▶️ 运行 - 查看 `/data/products_summary.json` - 邮箱将收到日报通知 --- ## 🚀 四、可选扩展 |目标|实现方式| |---|---| |改为多站点爬取|添加更多 spider(如 `jd_spider.py`, `ebay_spider.py`)并在 n8n 串联运行| |增加数据库写入|在 n8n 中加入 “PostgreSQL Node” 写入产品与AI摘要| |生成图表报表|把数据库接入 Grafana / Metabase| |向量检索增强|在 n8n 中调用 Qdrant 插入语义向量| --- 是否希望我在下一步帮你生成一个适配 **Ollama 本地模型(离线AI处理)** 的 workflow 版本? 它将不依赖外部API,完全在本地自动运行分析。