Does Python's regex engine have a maximum length limit on the string it can match?

#!/usr/bin/env python3

import re

html = """
<html>
        <something>
        </something>
        <!--
                aaa
                bbb
                ccc
        -->
        111<!--
                ccc
                bbb
                aaa
        -->11
</html>
"""

item = re.findall(r"(?<=<!--).+?(?=-->)",html,re.S)
for i in item:
        print(i)

The example above matches successfully.

This one fails to match:

#!/usr/bin/env python3

import requests
import re
import json
import sys

s = requests.session()

params = {
        "ie" : "utf-8",
        "kw" : "linux"
}
page = s.get("http://tieba.baidu.com/f",params = params)
text = page.text

tiezi_data = re.findall(r"(?<=<!--).+?(?=-->)",text,re.S)
print(tiezi_data)
print(len(tiezi_data))

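The Tieba page contains a lot of comments, and those comments hold a lot of information; I can see them in the browser. But my regex only matches the first one, and I don't know why.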


5 replies • 2017-04-05 16:29:06 +08:00

zsz

2017-04-05 00:49:49 +08:00

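Try searching inside the text string. It looks like the content the server returns to you directly doesn't actually contain those comments.
You can see them in the page probably because the front end fetches that content asynchronously. Double-check that.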
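
One way to run that check, as a minimal sketch: count the raw `<!--` openers in the response body and compare that with the number of regex matches. If both numbers are small, the comments were never in the response. `count_comment_markers` is a hypothetical helper name, and `sample` is a made-up stand-in for `page.text`.

```python
import re

# Same pattern as in the question, compiled once.
COMMENT_RE = re.compile(r"(?<=<!--).+?(?=-->)", re.S)

def count_comment_markers(text):
    """Return (number of '<!--' openers, number of regex matches)."""
    return text.count("<!--"), len(COMMENT_RE.findall(text))

# Stand-in for page.text; the real response body would go here.
sample = "<html><!-- a --><p>x</p><!-- b\n c --></html>"
openers, matches = count_comment_markers(sample)
print(openers, matches)
```

If the two counts agree, the regex is finding every comment that is actually present in the string.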

fyyz

2017-04-05 00:51:07 +08:00 via Android

@zsz I only just saw this and have already shut down my computer. I'll take another look tomorrow.

IanPeverell

2017-04-05 08:12:42 +08:00

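That part of the Tieba page is rendered by script, so the raw HTML doesn't contain it. And even if there were a length limit, you would get an error outright rather than a silently missing match.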
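
A rough way to see that input size alone is not the problem, as a sketch with a made-up input: run the same pattern from the question over a single comment about one million characters long. This does not prove that no limit exists anywhere, only that a body of this size still matches.

```python
import re

# A single comment whose body is one million characters long (synthetic data).
body = "x" * 1_000_000
text = "<html><!--" + body + "--></html>"

# Same pattern and flag as in the question.
matches = re.findall(r"(?<=<!--).+?(?=-->)", text, re.S)
print(len(matches), len(matches[0]))
```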

ijustdo

2017-04-05 09:25:25 +08:00

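```Python
import re

html = """
<html>
<something>
</something>
<!--
 aaa
 bbb
 ccc
 -->
111<!--
 ccc
 bbb
 aaa
 -->11
</html>
"""
# (?is) turns on IGNORECASE and DOTALL inline; the capture group holds the comment body
print(re.findall(r"(?is)(?:\<\!\-\-)(.+?)(?:\-\-\>)", html))
# ['\n aaa\n bbb\n ccc\n ', '\n ccc\n bbb\n aaa\n ']
```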

kghch

2017-04-05 16:29:06 +08:00

Apart from the asynchronous loading issue, you can also use bs4 to match comments. http://stackoverflow.com/questions/33138937/how-to-find-all-comments-with-beautiful-soup
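
To illustrate the parser-based idea without a third-party dependency, here is a minimal sketch that swaps bs4 for the standard library's `html.parser`, whose `handle_comment` hook receives each comment body. `CommentCollector`, `extract_comments`, and `sample` are hypothetical names and data for illustration. With bs4 itself, the usual approach is to filter text nodes by the `Comment` type, as the linked answer describes.

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collect the body of every HTML comment seen while parsing."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between '<!--' and '-->'
        self.comments.append(data)

def extract_comments(html):
    """Parse html and return the list of comment bodies in document order."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments

sample = "<html><!-- first --><p>text</p>111<!-- second\nline -->11</html>"
print(extract_comments(sample))
```

Either way, this only helps once the comments are actually present in the fetched HTML.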