I have an HTML file (gist linked below) and want to export it to JSON in this format:
{"AC1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AC1101","name":"ACCOUNTING I","duration":"2.5"},"AD1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AD1101","name":"FINANCIAL ACCOUNTING","duration":"2.5"},"BA2201":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"BA2201","name":"ACTUARIAL ECONOMICS","duration":"2.5"}}
https://gist.github.com/wudaown/c4f46daa4bd6edc42b8d870fd77c7322
How can I do this export with bs4? I'd rather not use regular expressions.
Thanks.
1
15015613 2017-05-16 23:55:52 +08:00
In [1]: from lxml import etree

In [2]: with open('tmp.html','r') as f:
   ...:     tree=etree.HTML(f.read())

In [10]: tmp=tree.xpath('//tr')

In [29]: import json

In [37]: out=list()
    ...: for tmp1 in tmp[1:]:
    ...:     i=0
    ...:     dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:' Course Title',6:'Duration'}
    ...:     t1=dict()
    ...:     for t in tmp1:
    ...:         i=i+1
    ...:         t2=t.xpath('text()')[0]
    ...:         t1[dict_d[i]]=t2
    ...:     out.append(t1)

In [45]: out2=dict()
    ...: for o in out:
    ...:     try:
    ...:         out2[o['Course']]={'Course Title':o[' Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
    ...:     except:
    ...:         pass

In [46]: out2
Out[46]:
{' AC1101 ': {'Course Title': ' ACCOUNTING I ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 2.5 ',
  'Time': ' 9.00 am '},
 ' AD1101 ': {'Course Title': ' FINANCIAL ACCOUNTING ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 2.5 ',
  'Time': ' 9.00 am '},
 ' BA3201 ': {'Course Title': ' LIFE CONTINGENCIES AND DEMOGRAPHY ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 3 ',
  'Time': ' 9.00 am '}}
2
15015613 2017-05-16 23:59:35 +08:00
import json
from lxml import etree

# Parse the HTML file and collect every table row.
with open('tmp.html','r') as f:
    tree=etree.HTML(f.read())
tmp=tree.xpath('//tr')

# Map each cell position (1-based) to a column name.
dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:' Course Title',6:'Duration'}

# Build one dict per row, skipping the header row.
out=list()
for tmp1 in tmp[1:]:
    i=0
    t1=dict()
    for t in tmp1:
        i=i+1
        t2=t.xpath('text()')[0]
        t1[dict_d[i]]=t2
    out.append(t1)

# Re-key the rows by course code; rows missing a column are skipped.
out2=dict()
for o in out:
    try:
        out2[o['Course']]={'Course Title':o[' Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
    except KeyError:
        pass
print(out2)
3
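wudaown OP @15015613 Thank you very much for the answer. These are all things I haven't seen before, so I'll need some time to digest them. While waiting, I already got it working with dict, list, and bs4; the code just looks rather beginner-level.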
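For reference, a dict/list/bs4 approach along those lines might look roughly like the sketch below. This is not the OP's actual code, which was not posted. The table layout in `SAMPLE_HTML` (one header row, then six `<td>` cells per row in the order date, day, time, code, name, duration) is an assumption, since the gist's markup is not reproduced in this thread; `FIELDS` and `table_to_dict` are hypothetical names. `get_text(strip=True)` removes the padding spaces that show up in the lxml output above.

```python
import json

from bs4 import BeautifulSoup

# Assumed layout: the real gist markup is not shown in the thread.
# The row values are taken from the JSON example in the question.
SAMPLE_HTML = """
<table>
  <tr><td>Date</td><td>Day</td><td>Time</td><td>Course</td><td>Course Title</td><td>Duration</td></tr>
  <tr><td> 21 April 2017 </td><td> Friday </td><td> 9.00 am </td><td> AC1101 </td><td> ACCOUNTING I </td><td> 2.5 </td></tr>
  <tr><td> 21 April 2017 </td><td> Friday </td><td> 9.00 am </td><td> AD1101 </td><td> FINANCIAL ACCOUNTING </td><td> 2.5 </td></tr>
</table>
"""

# Hypothetical key names, chosen to match the JSON format in the question.
FIELDS = ("date", "day", "time", "code", "name", "duration")


def table_to_dict(html):
    """Return {course_code: {field: value}} for every data row in the table."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.find_all("tr")[1:]:  # skip the header row
        # strip=True trims the whitespace around each cell's text
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != len(FIELDS):
            continue  # ignore rows that do not have exactly six cells
        record = dict(zip(FIELDS, cells))
        result[record["code"]] = record
    return result


exams = table_to_dict(SAMPLE_HTML)
print(json.dumps(exams))
```

To run it on the real file, read `tmp.html` and pass its contents to `table_to_dict` instead of `SAMPLE_HTML`; the column order may need adjusting if the actual table differs from the assumed one.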
4
justtery 2017-05-17 08:17:03 +08:00 via Android
Why not use pyquery? [smirk]