Python XML parsing bug: coercing to Unicode: need string or buffer, NoneType found



I've just started learning Python and ran into a strange bug while parsing XML:

`TypeError: coercing to Unicode: need string or buffer, NoneType found`

----------------------
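Gist attached.

dblp_part.xml is a portion of the source data to be parsed. Near the end, for some reason, author.text is None.

Could someone take a look?

The dblp_part.xml file is large.

gist.github.com/bigsquirrel/396cc6c88a337505bb38 You only need to download this one; it contains both required files.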

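Addendum 1 · 2015-05-08 14:54:06 +08:00

What makes this bug strange is that the structure is identical throughout: the earlier records parse fine, and then it suddenly throws a NoneType error. Searching for the keywords ElementTree NoneType turned up nothing either.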

Addendum 2 · 2015-05-08 14:57:43 +08:00

I could simply skip every entry where the value is None, but I counted more than 40,000 such cases, so I had no choice but to ask.

Addendum 3 · 2015-05-08 15:41:40 +08:00

The error is raised at line 24:
author.text is None


10 replies • 2015-05-08 16:45:13 +08:00

imn1

2015-05-08 15:23:53 +08:00

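Check what article.findall('author') returns at the point of failure.
Also, don't get hung up on a single approach. That's fine for pure study, but when you need to get work done, don't fixate on one problem.
If you hit a wall somewhere, go around it: try lxml, for example, or try XPath.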

binux

2015-05-08 15:24:25 +08:00

You're using it incorrectly.
See the Note at https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse:

> iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point.

And by the time the end event arrives, you have already called article.clear().
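To make the event order concrete, here is a minimal sketch in Python 3 (the thread itself is about Python 2). `SAMPLE` is a made-up miniature of a dblp-style record and `event_order` is an illustrative helper, neither comes from the original gist. It only lists which events arrive and in what order; it says nothing about what `text` holds at `start`.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of a dblp-style file, not the real data.
SAMPLE = b"<dblp><article key='a1'><author>Alice</author></article></dblp>"


def event_order(xml_bytes):
    """Return the (event, tag) pairs in the order iterparse reports them."""
    source = io.BytesIO(xml_bytes)
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=("start", "end"))]


for pair in event_order(SAMPLE):
    print(pair)
```

The `start` of `article` is reported before anything about its `author` child, so code that reads children at `start` is relying on something the Note does not promise.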

ivanchou

2015-05-08 15:34:01 +08:00

@binux But when I comment out article.clear(), I still get the same error. The problem is at line 24, where author.text is None.

ivanchou

2015-05-08 15:40:09 +08:00

@imn1 It returns None.

ivanchou

2015-05-08 15:45:48 +08:00

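@binux So you're the expert who wrote pyspider!
Then what would the correct usage be? I had read that Note before but didn't really understand it. I'd appreciate some guidance.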

binux

2015-05-08 15:59:25 +08:00

@ivanchou
https://gist.github.com/binux/7bcdcac8c5959c4c50b8

ivanchou

2015-05-08 16:15:51 +08:00

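@binux
I'm curious why the problem persists if I only comment out article.clear() or only remove events=("start", "end"), yet everything works once I do both. Could you explain briefly?
Also, doesn't article need to be cleared promptly? If it isn't, how is this different from a plain parse? I noticed that memory usage has already started to climb sharply.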

binux

2015-05-08 16:24:50 +08:00

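@ivanchou Once you remove events=("start", "end"), only end events are emitted. According to the Note, at start the child elements may be incomplete.
As for removing clear: the variable you call article does not only ever hold article elements. If an author is cleared before its article's end event arrives, there is nothing left to read. If you want to free memory, call clear at line 26.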

The right way to use iterparse is to treat it as a state machine and decide what to do at each tag. When article.tag == 'author', output the author; when article.tag == 'article', clear and start a new line, instead of using findall.
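One way the state-machine idea could look, as a sketch in Python 3 rather than the code from the linked gist. `SAMPLE` and `author_lines` are made-up names, and the `inproceedings` record is included only to show why some state is needed. At `start` the code relies on the tag alone; text is read only at `end`.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the inproceedings record must not leak into the output.
SAMPLE = b"""<dblp>
<inproceedings key="p1"><author>Dave</author></inproceedings>
<article key="a1"><author>Alice</author><author>Bob</author></article>
<article key="a2"><author>Carol</author></article>
</dblp>"""


def author_lines(source):
    """Return one comma-joined line of authors per <article>."""
    lines = []
    current = []        # authors collected for the article being read
    in_article = False  # state: are we inside an <article>?
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "article":
                in_article = True  # only the tag is used at "start"
        elif elem.tag == "author":
            if in_article:
                current.append(elem.text or "")  # text is complete at "end"
        elif elem.tag == "article":
            lines.append(",".join(current))
            current = []
            in_article = False
            elem.clear()  # safe: everything needed has been copied out
    return lines


lines = author_lines(io.BytesIO(SAMPLE))
print("\n".join(lines))
```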

ivanchou

2015-05-08 16:41:33 +08:00

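@binux I think I'm starting to get it: so it was being cleared at start. The odd part is that it only failed after parsing a number of elements. I'll dig into it some more...
I'm just starting out and used findall simply because it was there; it does seem inefficient. Thanks for the pointers.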

binux

2015-05-08 16:45:13 +08:00

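@ivanchou The Note says "guarantees". At start it only guarantees that the attributes are present; it does not guarantee that the child elements are. It never says they are definitely absent.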
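A sketch of why the failure can show up only after many records, assuming Python 3.4+ (`XMLPullParser` does not exist in the Python 2 used in the thread). As far as I understand CPython's implementation, iterparse feeds the file to the parser in fixed-size chunks and reports the events queued for each chunk afterwards, so an element that fits inside one chunk is already complete when its `start` event is read, while one that straddles a chunk boundary is not. `DOC` and `author_events` are made-up names, and the split point is chosen by hand to imitate such a boundary.

```python
import xml.etree.ElementTree as ET

DOC = "<dblp><article><author>Alice</author></article></dblp>"


def author_events(chunks):
    """Feed chunks one by one; record (event, text) for <author> as each event becomes readable."""
    parser = ET.XMLPullParser(events=("start", "end"))
    seen = []

    def drain():
        for event, elem in parser.read_events():
            if elem.tag == "author":
                seen.append((event, elem.text))

    for chunk in chunks:
        parser.feed(chunk)
        drain()
    parser.close()
    drain()  # events still queued after close() can be read as well
    return seen


whole = author_events([DOC])             # element fits in a single chunk
split = author_events([DOC[:25], DOC[25:]])  # boundary falls inside "Alice"
print(whole)
print(split)
```

In the split case the text seen at `start` is whatever the parser happens to have stored so far, often `None`; only the value seen at `end` is dependable.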