9 Python Libraries for Automating the Web, and the Last One Is Wild!

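The page is stuck on the second screen after login. The API clearly returns 200, yet the HTML is nothing but an empty shell. With this kind of automation task, writing a scraper is not my first move. First I check whether the data comes back in an API response or is rendered by the browser.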
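That first check can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name data_in_raw_html and the two sample response bodies are made up, and it assumes you already know a piece of text that is visible in the browser.

```python
def data_in_raw_html(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser already exists in the raw response body."""
    return visible_text in raw_html


# Hypothetical response bodies, as a plain HTTP client would receive them.
server_rendered = "<table><tr><td>A1001</td><td>paid</td></tr></table>"
empty_shell = '<div id="app"></div><script src="/static/app.js"></script>'

for name, body in [("server_rendered", server_rendered), ("empty_shell", empty_shell)]:
    if data_in_raw_html(body, "A1001"):
        print(name, "-> data is in the HTML, an HTTP client plus a parser is enough")
    else:
        print(name, "-> empty shell, look for the underlying API call or use a browser")
```

If the text is missing from the raw body, the browser's network panel usually shows which API call actually carries the data.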

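Python has plenty of libraries for web automation, but the ones that see regular use in production scripts fall into just a few categories.

1. requests: don't reach for a browser yet; if a plain request works, use it

requests suits the plainest kind of HTTP automation: pulling a report from an internal system on a schedule, calling an API, downloading a file. It is nothing fancy; its strength is that it is stable. The project also describes itself as a simple HTTP library.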

import requests

url = "https://example.com/api/report"
headers = {"User-Agent": "ops-checker/1.0"}

# Explicit timeout so a stalled server cannot hang the script.
resp = requests.get(url, headers=headers, timeout=8)
resp.raise_for_status()  # raise on 4xx/5xx instead of saving an error page

with open("daily_report.json", "w", encoding="utf-8") as f:
    f.write(resp.text)

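I always set an explicit timeout. requests has no default timeout, so without one a script can hang in the middle of the night; the next morning the process is still alive and you are left wondering what happened.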
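One way to avoid forgetting the timeout on individual calls is to attach a default at the session level. The sketch below uses a transport adapter for that. The class name TimeoutAdapter, the helper build_session, and the (3, 8) values are assumptions for illustration, not something requests ships.

```python
import requests
from requests.adapters import HTTPAdapter


class TimeoutAdapter(HTTPAdapter):
    """Transport adapter that applies a default timeout when the caller passes none."""

    def __init__(self, *args, timeout=(3, 8), **kwargs):
        # (connect timeout, read timeout) in seconds; values are illustrative.
        self.default_timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Session passes timeout=None when the caller did not set one.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.default_timeout
        return super().send(request, **kwargs)


def build_session() -> requests.Session:
    session = requests.Session()
    adapter = TimeoutAdapter()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_session()
# session.get("https://example.com/api/report") now times out instead of waiting forever.
```

A per-call timeout still wins, so individual requests can override the default when they need a longer budget.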

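2. httpx: once there are many endpoints, stop waiting on them one by one

httpx is a better fit than requests for batch API work like this. It offers both sync and async APIs and also supports HTTP/2. HTTP/2 is opt-in: install the httpx[http2] extra and pass http2=True to the client.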

import asyncio

import httpx


async def check_one(client, order_id):
    # One request per order; the shared client reuses connections.
    r = await client.get(
        f"https://example.com/api/orders/{order_id}",
        timeout=6,
    )
    return order_id, r.status_code


async def main():
    ids = ["A1001", "A1002", "A1003"]
    async with httpx.AsyncClient() as client:
        # Fire all requests concurrently; results come back in input order.
        rows = await asyncio.gather(*(check_one(client, i) for i in ids))
    print(rows)


asyncio.run(main())

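I often use this kind of script to sweep the status of a batch of orders. It is far more pleasant than firing requests calls one after another in a for loop.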
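To see why the concurrent version wins, here is a small offline sketch that replaces the HTTP call with a fake 50 ms delay. The function names and the latency figure are invented for the demonstration; real numbers depend on the server and the network.

```python
import asyncio
import time

LATENCY = 0.05  # pretend each "request" takes 50 ms


async def fake_check(order_id: str) -> tuple[str, int]:
    # Stand-in for an HTTP call; no network involved.
    await asyncio.sleep(LATENCY)
    return order_id, 200


async def one_by_one(ids):
    # Each call waits for the previous one to finish.
    return [await fake_check(i) for i in ids]


async def all_at_once(ids):
    # All calls wait at the same time.
    return await asyncio.gather(*(fake_check(i) for i in ids))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


ids = ["A1001", "A1002", "A1003", "A1004", "A1005"]
seq_rows, seq_seconds = timed(one_by_one(ids))
con_rows, con_seconds = timed(all_at_once(ids))
print(f"one by one: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

The sequential run pays the latency once per order, while the concurrent run pays it roughly once in total.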

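3. BeautifulSoup: regex can't rescue HTML

If the page already returns complete HTML, don't dig through it with regular expressions. BeautifulSoup was built for pulling data out of HTML and XML.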

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")
items = []
# Each order row carries its ID in a data-id attribute.
for tr in soup.select("table.order-list tr[data-id]"):
    items.append({
        "id": tr["data-id"],
        "amount": tr.select_one(".amount").get_text(strip=True),
    })
print(items)

I don't like long chained calls here. When the page structure changes, it is hard to tell where the error came from.
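One way to keep failures easy to locate is to split the chain and check each step. The sketch below is an assumption-laden illustration: the inline markup is invented, and the second row deliberately lacks its .amount cell to show what happens when the page changes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second row has lost its .amount cell.
html = """
<table class="order-list">
  <tr data-id="A1001"><td class="amount"> 12.50 </td></tr>
  <tr data-id="A1002"><td class="price">8.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
items, problems = [], []
for tr in soup.select("table.order-list tr[data-id]"):
    order_id = tr["data-id"]
    amount_cell = tr.select_one(".amount")
    if amount_cell is None:
        # Record which row broke instead of failing with AttributeError on None.
        problems.append(order_id)
        continue
    items.append({"id": order_id, "amount": amount_cell.get_text(strip=True)})

print(items)
print("rows missing .amount:", problems)
```

With the intermediate variable in place, a structural change shows up as a named row in problems rather than a traceback in the middle of a chain.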

4. aiohttp: for network scripts that need to run for a while

aiohttp is an HTTP client and server library built on asyncio. The client also supports WebSocket.

import asyncio

import aiohttp

# Total time budget per request, in seconds.
TIMEOUT = aiohttp.ClientTimeout(total=10)


async def fetch_text(session, url):
    async with session.get(url, timeout=TIMEOUT) as resp:
        return await resp.text()


async def main():
    urls = [
        "https://example.com/a",
        "https://example.com/b",
    ]
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch_text(session, u) for u in urls))
    print([len(p) for p in pages])


asyncio.run(main())

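It suits jobs like "run overnight and pull a few thousand public pages". The precondition is that you follow the target site's rules and don't knock their service over.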
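A common way to stay polite is to cap how many requests are in flight at once. The sketch below does that with asyncio.Semaphore. The helper fetch_all, the fake fetcher, and the limit of 5 are assumptions for illustration; the fake fetcher only exists so the example runs without a network.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(url):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(limited(u) for u in urls))


# Stand-in fetcher so the sketch runs offline; it also tracks peak concurrency.
in_flight = 0
peak = 0


async def fake_fetch(url):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"<html>{url}</html>"


urls = [f"https://example.com/page/{n}" for n in range(20)]
pages = asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=5))
print(len(pages), "pages fetched, peak concurrency:", peak)
```

With aiohttp, the fetch argument would be a small wrapper around a session-bound coroutine such as fetch_text(session, url).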

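5. Scrapy: not a script, a crawling project

Scrapy suits crawling tasks once the scale grows a bit. It ships with an asynchronous engine, retries, throttling, and caching.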

import scrapy


class NoticeSpider(scrapy.Spider):
    name = "notice_spider"
    start_urls = ["https://example.com/notices"]
    # Per-spider throttling: one-second delay, at most four requests in flight.
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "CONCURRENT_REQUESTS": 4,
    }

    def parse(self, response):
        for item in response.css(".notice"):
            yield {
                "title": item.css("a::text").get("").strip(),
                "link": response.urljoin(item.css("a::attr(href)").get("")),
            }

Scrapy is a poor fit for one-off scripts. It earns its place once the project needs deduplication, pagination, and re-running failed requests.
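The retries, throttling, and caching mentioned earlier in this section are switched on through settings. Here is a hedged sketch of what that might look like. The setting names are Scrapy's own, while every value is a placeholder chosen for illustration rather than a recommendation.

```python
# Setting names are Scrapy's; the values are illustrative placeholders.
custom_settings = {
    "ROBOTSTXT_OBEY": True,             # respect robots.txt
    "RETRY_ENABLED": True,              # retry failed requests
    "RETRY_TIMES": 3,                   # retries per request after the first attempt
    "AUTOTHROTTLE_ENABLED": True,       # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 1,      # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 10,       # upper bound on the delay in seconds
    "HTTPCACHE_ENABLED": True,          # cache responses on disk
    "HTTPCACHE_EXPIRATION_SECS": 3600,  # cached entries expire after one hour
}
```

These keys could be merged into a spider's custom_settings or placed in the project's settings.py.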

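6. Selenium: legacy admin systems still need it

Some admin systems show no data until you click a button, won't let you in without logging in, and bury their APIs deep. Selenium exists for browser automation, and the official documentation describes it as supporting automated browser interaction.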

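from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/admin")
    driver.find_element(By.NAME, "keyword").send_keys("refund")
    driver.find_element(By.CSS_SELECTOR, ".search-btn").click()
    # Wait up to 10 s for result rows instead of reading the table too early.
    rows = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tbody tr"))
    )
    print("rows:", len(rows))
finally:
    driver.quit()  # always release the browser, even on failure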

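My attitude toward Selenium: avoid it when you can. Not because it is bad, but because browser automation is inherently brittle. Rename one button and the script starts playing dead.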

7. Playwright: for browser automation today, I'd look at it first

Playwright supports Chromium, WebKit, and Firefox, and offers both sync and async Python APIs.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/orders")
    # Placeholder and button text depend on the target page.
    page.get_by_placeholder("Order number").fill("A1001")
    page.get_by_text("Search").click()
    page.wait_for_selector(".result-row")
    print(page.locator(".result-row").count())
    browser.close()

Playwright's selectors and waiting mechanism feel smoother. The piles of sleep calls in older Selenium scripts were annoying just to look at.

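8. mitmproxy: capture API traffic, modify responses, great for troubleshooting

mitmproxy is more than a proxy tool. You can write addons in Python and hook them into request and response events. The official documentation also notes that addons change mitmproxy's behavior through events (mitmproxy docs).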

class TraceHeader:
    def response(self, flow):
        # Tag responses from the API under test so they are easy to spot.
        if "example.com/api" in flow.request.pretty_url:
            flow.response.headers["x-debug-from"] = "local-mitm"


addons = [TraceHeader()]

It suits the "frontend says it's fine, backend says nothing arrived" scenario. Capturing the request beats arguing about it.
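For the "capture the request" part, an addon can hook the request event in the same way. The sketch below records the method, URL, and body size of matching requests. The class name RequestLog and the "example.com/api" filter are assumptions for illustration.

```python
class RequestLog:
    """Addon sketch: remember method, URL and body size of matching requests."""

    def __init__(self, needle="example.com/api"):
        self.needle = needle
        self.seen = []

    def request(self, flow):
        # mitmproxy calls this hook for each client request.
        req = flow.request
        if self.needle in req.pretty_url:
            self.seen.append((req.method, req.pretty_url, len(req.content or b"")))
            print(self.seen[-1])


addons = [RequestLog()]
```

Saved as a script, it can be loaded with mitmdump -s followed by the script path. If the request shows up here, the frontend did send it.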

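9. browser-use: this one is a bit wild

With all the libraries above, you tell the code where to click and what to extract. browser-use and tools like it are different: they let an AI agent use the browser. The project's own description is about making websites usable by AI agents.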

import asyncio

from browser_use import Agent
from browser_use.llm import ChatOpenAI


async def main():
    agent = Agent(
        task="Open the example site, find the latest notice title, and summarize it in three lines",
        llm=ChatOpenAI(model="gpt-4.1-mini"),
    )
    result = await agent.run()
    print(result)


asyncio.run(main())

The wild part is that you don't write:

click("#login")
fill("#keyword", "xxx")

Instead, you hand it a one-sentence task. It looks at the page, clicks buttons, scrolls, and decides the next step on its own.

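But I wouldn't throw this straight into production. The reason is simple: there is too much uncertainty. It is worth trying on low-risk internal pages, as an aid in operations back offices, or for one-off information gathering. Anything involving payments, deleting data, or submitting forms must go through manual confirmation.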
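The manual-confirmation idea can be sketched independently of any agent library. Everything below is hypothetical: the keyword list, needs_confirmation, and run_action are illustration only and are not part of browser-use. A real gate would need a far more careful definition of "risky" than a keyword match.

```python
RISKY_KEYWORDS = ("pay", "delete", "submit")  # illustrative list, not exhaustive


def needs_confirmation(action: str) -> bool:
    """Flag actions whose description mentions a risky operation."""
    lowered = action.lower()
    return any(word in lowered for word in RISKY_KEYWORDS)


def run_action(action: str, execute, confirm) -> bool:
    """Execute the action unless it is risky and the human declines."""
    if needs_confirmation(action) and not confirm(action):
        print("skipped:", action)
        return False
    execute(action)
    return True


# Offline demo: confirm would normally be an input() prompt or an approval step.
done = []
run_action("open the notices page", done.append, confirm=lambda a: False)
run_action("delete order A1001", done.append, confirm=lambda a: False)
print(done)
```

The point of the design is that the risky path defaults to doing nothing unless a person explicitly says yes.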

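With web automation, don't start with a browser, and don't start with AI. Try requests first. If that is not enough, parse the HTML. If that still falls short, bring in Playwright. The AI-driven browser at the end really is powerful, but powerful things generally need a leash all the more.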

Permalink: http://www.phpxs.com/post/14247/
Reposting: reposts must credit the source in the body and keep the original link.