What should I do when a regular expression from a tutorial extracts nothing? - V2EX

This topic was created 2501 days ago; the information in it may have since evolved or changed.

I'm scraping the top 100 movies on Maoyan, but when I check what the regular expression extracts, the result comes back empty.

import re

def parse_one_page(html):
    # re.S lets "." match newlines, so one pattern can span a whole <dd> block.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    print(items)
That is the first version.

import re

def parse_one_page(html):
    # Raw strings keep \d from being treated as a string escape.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
                         + r'<a.*?>(.*?)</a>.*?"star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the leading 3-character label
            'time': item[4].strip()[5:],   # drop the leading 5-character label
            'score': item[5] + item[6]     # integer part + fractional part
        }

def main():
    url = 'http://maoyan.com/board/4'
    # get_one_page is not shown in this post; it is expected to return the page HTML as a string.
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
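That is the second version.
Neither of them extracts anything on its own, yet when I run the complete code the results are displayed correctly at the end...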

8 replies • 2019-02-25 16:37:12 +08:00

1

Kacxxia

2019-02-24 18:49:05 +08:00

1

https://regex101.com
I recommend testing your regex with this site; the panel at the upper right explains the syntax.
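The same kind of check can be run locally. The following is only a minimal sketch: it applies the second pattern from the question to a hand-written sample entry and to a page with no `<dd>` blocks. `SAMPLE_HTML` and `OTHER_HTML` are hypothetical stand-ins, and the real Maoyan markup may differ. If the pattern matches the sample but returns nothing on the downloaded page, the HTML passed in probably does not contain the expected markup.

```python
import re

# Second pattern from the question, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
    r'<a.*?>(.*?)</a>.*?"star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Hypothetical sample entry shaped the way the pattern expects (an assumption).
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Example Film">Example Film</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

# Hypothetical response without any <dd> entries, e.g. an error page.
OTHER_HTML = "<html><body><p>Please verify you are human</p></body></html>"

matches = PATTERN.findall(SAMPLE_HTML)
print(matches)

no_matches = PATTERN.findall(OTHER_HTML)
print(no_matches)  # [] because there is no <dd> block to match
```

Printing the first few hundred characters of the fetched `html` before calling `findall` is a quick way to see which of the two cases applies.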

2

xiaozaiziwyt

OP

2019-02-24 18:57:02 +08:00

@Kacxxia Thanks. After rewriting the code, though, I found that it actually runs.

3

fzinfz

2019-02-25 00:25:26 +08:00

1

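A tutorial that parses HTML with a regex this long seems written to show off to peers rather than to teach. I'd advise the OP to find another tutorial... Keyword: bs4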

4

msg7086

2019-02-25 08:10:38 +08:00

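A better approach is to use a regex only to extract the main chunks of data, then parse those with an XML / HTML parser into structured data, and then traverse that.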
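One way to read this suggestion is sketched below, using only the standard library: a short regex isolates each `<dd>...</dd>` block, and `html.parser.HTMLParser` turns each block into a dictionary keyed by class name. `ClassTextCollector`, `parse_entries`, and `SAMPLE_HTML` are hypothetical names and data made up for illustration; the real page markup may differ.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ClassTextCollector(HTMLParser):
    """Collect text, keyed by the class of the nearest enclosing element."""

    def __init__(self):
        super().__init__()
        self._stack = []  # class attribute of each currently open element
        self.fields = {}  # class name -> text found inside it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(dict(attrs).get("class") or "")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost open element that has a class.
        for cls in reversed(self._stack):
            if cls:
                self.fields[cls] = self.fields.get(cls, "") + text
                break


# Hypothetical sample markup (an assumption, not the real page).
SAMPLE_HTML = """
<dl class="board-wrapper">
<dd><i class="board-index">1</i><img class="board-img" src="poster1.jpg">
<p class="name"><a href="/films/1">Example Film</a></p>
<p class="releasetime">1993-01-01</p></dd>
<dd><i class="board-index">2</i><img class="board-img" src="poster2.jpg">
<p class="name"><a href="/films/2">Another Film</a></p>
<p class="releasetime">1994-09-10</p></dd>
</dl>
"""


def parse_entries(html):
    rows = []
    # Step 1: the regex only cuts the page into per-movie blocks.
    for block in re.findall(r"<dd>.*?</dd>", html, re.S):
        # Step 2: a real parser handles the markup inside each block.
        collector = ClassTextCollector()
        collector.feed(block)
        collector.close()
        rows.append(collector.fields)
    return rows


rows = parse_entries(SAMPLE_HTML)
print(rows)
```

The regex stays short because it no longer has to describe every field, and a change in attribute order inside a block does not break the extraction.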

5

hakono

2019-02-25 10:42:05 +08:00 via Android

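OP, what kind of lousy tutorial are you reading that teaches people to extract data from complex web pages with regex...
Just go use Beautiful Soup; a single CSS selector pulls the data out. Save yourself some time and some of your life.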
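As a rough illustration of the CSS-selector approach, here is a minimal sketch that assumes the `beautifulsoup4` package is installed and uses the built-in `html.parser` backend. `SAMPLE_HTML`, `parse_board`, and the selectors are assumptions based on the class names that appear in the question's regex; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE_HTML = """
<dl class="board-wrapper">
  <dd>
    <i class="board-index board-index-1">1</i>
    <p class="name"><a href="/films/1">Example Film</a></p>
    <p class="releasetime">1993-01-01</p>
  </dd>
  <dd>
    <i class="board-index board-index-2">2</i>
    <p class="name"><a href="/films/2">Another Film</a></p>
    <p class="releasetime">1994-09-10</p>
  </dd>
</dl>
"""


def parse_board(html):
    soup = BeautifulSoup(html, "html.parser")
    # One selector per field, evaluated inside each <dd> entry.
    for dd in soup.select("dd"):
        yield {
            "index": dd.select_one("i.board-index").get_text(strip=True),
            "title": dd.select_one("p.name a").get_text(strip=True),
            "time": dd.select_one("p.releasetime").get_text(strip=True),
        }


movies = list(parse_board(SAMPLE_HTML))
print(movies)
```

`select_one` returns `None` when nothing matches, so on a real page it may be worth checking each result before calling `get_text`.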

6

xpresslink

2019-02-25 11:30:16 +08:00

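I suggest the OP learn XPath syntax; writing extraction code with it is far more efficient than with regex, and it can be used directly in scrapy. For an easy start, though, I'd still recommend BS4.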
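To give a feel for XPath-style queries without extra dependencies, the sketch below uses the standard library's `xml.etree.ElementTree`. It supports only a limited subset of XPath and needs well-formed markup, so real pages would normally go through scrapy's selectors or another HTML-aware library instead. `SAMPLE_XML` is a hypothetical, well-formed stand-in for the page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample (an assumption, not the real page).
SAMPLE_XML = """<dl>
  <dd><i class="board-index">1</i><p class="name"><a href="/films/1">Example Film</a></p></dd>
  <dd><i class="board-index">2</i><p class="name"><a href="/films/2">Another Film</a></p></dd>
</dl>"""

root = ET.fromstring(SAMPLE_XML)

# Each path describes where the data lives instead of what text surrounds it.
titles = [a.text for a in root.findall(".//dd/p[@class='name']/a")]
indexes = [i.text for i in root.findall(".//i[@class='board-index']")]

print(list(zip(indexes, titles)))
```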

7

E1n

2019-02-25 13:50:19 +08:00 via Android

Writing regex feels great

8

hjq98765

2019-02-25 16:37:12 +08:00

+1 for bs4
