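While writing a crawler, I worked out the URL pattern of the list pages, but a plain GET request comes back with a page asking me to enter a captcha or complete some other verification. This is the response (the page title, 验证码, means "captcha"):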

<!DOCTYPE html>
<html lang="zh-cn">

<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">  <meta name="format-detection" content="telephone=no">
  <title>验证码</title>

    <link rel="stylesheet" type="text/css" href="http://tb1.bdstatic.com/tb/_/cui/frscaptcha/node_modules/tb-captcha/node_modules/tb-icon/lib/font_af3f10d.css" />
    <link rel="stylesheet" type="text/css" href="http://tb1.bdstatic.com/tb/_/cui/frscaptcha/node_modules/tb-captcha/lib/captcha/core/index_6cc78b2.css" />
    <link rel="stylesheet" type="text/css" href="http://tb1.bdstatic.com/tb/_/cui/frscaptcha/routes/home/index_1357dd3.css" />
    <link rel="stylesheet" type="text/css" href="http://tb1.bdstatic.com/tb/_/cui/frscaptcha/index_75e7e66.css" />
</head>

<body>
    <div id="react-dom"></div>

<script type="text/javascript" src="http://tb1.bdstatic.com/tb/_/cui/frscaptcha/mod_c630892.js"></script>
<script type="text/javascript">!function(){var e=500,t=function(){var t=document.documentElement.clientWidth/e;t=screen.width/e;var n=document.querySelector('meta[name="viewport"]');n.setAttribute("content","width="+e+",initial-scale="+t+",maximum-scale="+t+", minimum-scale="+t+",user-scalable=no,target-densitydpi=device-dpi")};t(),window.onload=function(){document.documentElement.clientWidth>750&&(document.getElementById("react-dom").style.margin="0 auto",document.getElementById("react-dom").style.width=e+"px")}}();</script>
<script type="text/javascript" src="http://tb1.bdstatic.com/tb/_/cui/frscaptcha/pkg/aio_1436556.js"></script>
</body></html>

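Earlier I used so-called high-anonymity proxies found online and checked them with Python code to see whether they really were high-anonymity. Only 1 or 2 out of 100 passed. When I crawled tieba before, I assumed my proxies were high-anonymity, so I added no sleep and turned on multiprocessing. My host's IP has probably been banned as a result.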

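But now, even when I crawl the site through proxies that I have verified as high-anonymity, I am still asked to enter a captcha. What is going on?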
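
For reference, the kind of anonymity check mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact code used: it assumes you send a request through the proxy to some header-echo endpoint you control or trust, and that the endpoint reports the client IP it saw plus the request headers it received. The function name `classify_proxy` and the list of forwarding headers are illustrative assumptions.

```python
# Headers that commonly reveal a proxy is in use (assumed, not exhaustive).
FORWARDING_HEADERS = ("x-forwarded-for", "via", "x-real-ip", "forwarded", "proxy-connection")


def classify_proxy(real_ip, echoed_headers, seen_ip):
    """Classify a proxy from what a header-echo endpoint observed.

    real_ip        -- our own public IP address (string)
    echoed_headers -- dict of request headers as the endpoint received them
    seen_ip        -- the client IP the endpoint saw for the connection
    """
    headers = {name.lower(): str(value) for name, value in echoed_headers.items()}

    # Transparent: our real IP is visible, either as the client IP
    # or inside any header value (simple substring check).
    if seen_ip == real_ip or any(real_ip in value for value in headers.values()):
        return "transparent"

    # Anonymous: the real IP is hidden, but headers admit a proxy is involved.
    if any(name in headers for name in FORWARDING_HEADERS):
        return "anonymous"

    # High-anonymity: neither the real IP nor proxy-revealing headers are present.
    return "high-anonymity"
```

Passing this check only shows that the proxy hides the original IP. It says nothing about whether the target site already distrusts the proxy's own IP, which may be one reason a captcha still appears.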

7 replies • 2017-03-12 12:57:22 +08:00

dsg001

2017-03-12 09:27:43 +08:00

Copy the complete set of request headers from Chrome into requests.
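
A minimal sketch of that suggestion: the helper below turns a header block pasted from Chrome DevTools into a dict that can be passed to `requests.get`. The name `parse_raw_headers` and the sample header values are hypothetical, and the sketch assumes one `Name: value` pair per line.

```python
def parse_raw_headers(raw):
    """Convert a pasted 'Name: value' header block into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so values like URLs stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# Placeholder values; paste the real block copied from the browser here.
RAW = """
Accept: text/html
User-Agent: Mozilla/5.0 (X11; Linux x86_64)
Referer: http://tieba.baidu.com/
"""

headers = parse_raw_headers(RAW)
# With the requests library installed:
#   requests.get(url, headers=headers)
```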

changwei

2017-03-12 09:37:33 +08:00 via Android

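If you crawl the client (app) API directly, there is no captcha. Whenever you write a crawler, the client API should be the first thing you consider.

I have a related project at github.com/cw1997 (I forget the repository name, so you will have to look for it) that you can use as a reference.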

wisefree

2017-03-12 09:39:59 +08:00

@dsg001 Should I copy the cookie as well?
headers = {
"Cookie": "...."
....
}
requests.get(url, headers=headers, proxies=random.choice(proxy_dicts))
I copied the cookie and use that single cookie for every request, but the proxy is different each time. Does that have any effect?

zhihaofans

2017-03-12 10:19:41 +08:00 via iPhone

@wisefree Yes, it does. It will look like the account is logging in from a different location.
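
One possible way to avoid that mismatch, offered as an assumption rather than something confirmed in this thread, is to bind each cookie to one fixed proxy so the same cookie is never seen from changing IPs. The sketch below uses the hypothetical name `build_identities`; `proxy_dicts` is assumed to be a list of `requests`-style proxy dicts as in the snippet above.

```python
import random


def build_identities(cookies, proxy_dicts, seed=None):
    """Pair each cookie string with exactly one proxy dict.

    Raises ValueError when there are fewer proxies than cookies,
    since sharing a proxy is fine but this sketch keeps pairs one-to-one.
    """
    if len(proxy_dicts) < len(cookies):
        raise ValueError("need at least one proxy per cookie")
    rng = random.Random(seed)
    proxies = list(proxy_dicts)
    rng.shuffle(proxies)  # random assignment, but fixed once assigned
    return [
        {"headers": {"Cookie": cookie}, "proxies": proxy}
        for cookie, proxy in zip(cookies, proxies)
    ]


# Usage with the requests library installed:
#   identity = random.choice(identities)
#   requests.get(url, headers=identity["headers"], proxies=identity["proxies"])
```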

lecher

2017-03-12 10:32:50 +08:00 via Android

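Don't crawl the PC version's list pages, which have too many verification checks. Crawl the wap version's list pages instead; you can run as many concurrent requests as you like without any trouble.
If you need the post content from the PC version, then after crawling the list pages, take each post's kz ID and convert it into the PC URL yourself, along the lines of tieba.baidu.com/p/kzid.

If you feel like analyzing the app API, intercepting the app's requests and constructing your own is even easier.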
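
The kz conversion described above can be sketched like this. It is a rough sketch under one stated assumption: that wap post links carry the post ID in a `kz` query parameter (the sample link in the usage line is hypothetical). The function name `wap_to_pc_url` is also illustrative.

```python
from urllib.parse import urlparse, parse_qs


def wap_to_pc_url(wap_url):
    """Build the PC post URL from a wap link that carries a kz post ID."""
    query = parse_qs(urlparse(wap_url).query)
    kz = query.get("kz", [None])[0]
    if not kz or not kz.isdigit():
        raise ValueError("no numeric kz parameter in: " + wap_url)
    # PC posts live under tieba.baidu.com/p/<kz>, as described above.
    return "http://tieba.baidu.com/p/" + kz


# Hypothetical wap link, for illustration only.
pc_url = wap_to_pc_url("http://tieba.baidu.com/mo/m?kz=123456&pn=0")
```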

wisefree

2017-03-12 12:57:06 +08:00

@zhihaofans I see. Is there a good way to work around that?

wisefree

2017-03-12 12:57:22 +08:00

@lecher OK, I'll give it a try. Thanks!