Play with some machine learning this weekend: new version of the cherry classifier released


This topic was created 2809 days ago; the information in it may have since evolved or changed.

cherry classifier

Added English support
Much faster training
Added a confusion matrix and ROC curve to make analysis easier:

Quick start

>>> import cherry
>>> result = cherry.classify('她们对计算机很有热情，也希望学习到数据分析，网络爬虫，人工智能等方面的知识，从而运用在她们工作上')
Building prefix dict from the default dictionary ...
Loading model from cache 	/var/folders/md/0251yy51045d6nknpkbn6dc80000gn/T/jieba.cache
Loading model cost 0.899 seconds.
Prefix dict has been built succesfully.

This returns a Result object

>>> result.percentage
[('normal.dat', 0.837), ('politics.dat', 0.108), ('gamble.dat', 0.053), ('sex.dat', 0.002)]

The percentage attribute of the Result shows the probability of each category for the input: 83.7% normal, 10.8% politically sensitive, 5.3% gambling, and 0.2% pornographic.
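As a small illustration of reading this output, here is a sketch that takes the list printed above (copied as a literal, so cherry does not need to be installed), picks the most likely category, and formats each probability as a percentage. The variable names are my own and are not part of cherry.

```python
# The list printed by result.percentage above, copied as a literal.
percentage = [('normal.dat', 0.837), ('politics.dat', 0.108),
              ('gamble.dat', 0.053), ('sex.dat', 0.002)]

# The category with the highest probability is the predicted class.
best_label, best_prob = max(percentage, key=lambda pair: pair[1])

# Format every probability as a percentage string with one decimal place.
formatted = {label: f"{prob:.1%}" for label, prob in percentage}

print(best_label, formatted)
```

The four probabilities sum to 1, so they can be read as a distribution over the four categories.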

>>> result.word_list
[('工作', 7.0784042046861373), ('学习', 4.2613376272953198), ('方面', 3.795076414904381), ('希望', 2.1552995125795613), ('人工智能', 1.1353997980863895), ('网络', 0.41148095885968772), ('从而', 0.27235358073104443), ('数据分析', 0.036787509418279463), ('热情', 0.036787509418278574), ('她们', -4.660672209426675)]

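The word_list attribute of the Result shows how much each effective part of the sentence influences the classification. The effective parts are determined by the tokenizer function: by default for Chinese, a word must be longer than one character in the jieba segmentation output, must not be in the stop_word list, and must also appear elsewhere in the training data.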
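The three filtering conditions described above can be sketched roughly like this. This is not cherry's actual code: the function name effective_words, the token list, the stop words, and the vocabulary are all hypothetical and made up for illustration. In practice the tokens would come from jieba.

```python
def effective_words(tokens, stop_words, vocabulary):
    """Keep only tokens that pass the three conditions described above."""
    return [
        tok for tok in tokens
        if len(tok) > 1              # longer than one character
        and tok not in stop_words    # not a stop word
        and tok in vocabulary        # seen in the training data
    ]

# Hypothetical inputs, invented for this example.
tokens = ['她们', '对', '计算机', '很', '有', '热情', '因为', '学习']
stop_words = {'因为'}
vocabulary = {'她们', '热情', '学习', '因为'}

print(effective_words(tokens, stop_words, vocabulary))
```

In this toy run the single-character tokens fail the length check, '因为' is dropped as a stop word, and '计算机' is dropped because it is not in the toy vocabulary.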

Run the tests to get the confusion matrix and ROC curve

$ python runanalysis.py -t 10 -p

+Cherry---------------+------------+---------+------------+--------------+
| Confusion matrix    | gamble.dat | sex.dat | normal.dat | politics.dat |
+---------------------+------------+---------+------------+--------------+
| (Real)gamble.dat    |        141 |       0 |          0 |            0 |
| (Real)sex.dat       |          0 |     165 |          0 |            0 |
| (Real)normal.dat    |          3 |       8 |        118 |           11 |
| (Real)politics.dat  |          0 |       0 |          2 |          152 |
| Error rate is 4.00% |            |         |            |              |
+---------------------+------------+---------+------------+--------------+
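To show where the 4.00% comes from, here is a short sketch that recomputes the error rate from the matrix above and also derives per-class recall and precision. The numbers are copied from the table; the variable names are my own, and this is not how runanalysis.py itself is written.

```python
labels = ['gamble.dat', 'sex.dat', 'normal.dat', 'politics.dat']
# Rows are the real classes, columns are the predicted classes
# (values copied from the table above).
matrix = [
    [141, 0, 0, 0],
    [0, 165, 0, 0],
    [3, 8, 118, 11],
    [0, 0, 2, 152],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
error_rate = (total - correct) / total

# Recall: share of each real class that was predicted correctly.
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(labels))}
# Precision: share of each predicted class that really belongs to it.
precision = {
    labels[j]: matrix[j][j] / sum(row[j] for row in matrix)
    for j in range(len(labels))
}

print(f"error rate: {error_rate:.2%}")  # prints "error rate: 4.00%"
```

24 of the 600 test samples are misclassified, and 22 of those are normal sentences flagged as one of the other categories, so normal.dat has the lowest recall (118/140).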

[Figure: ROC curve]

For the principles behind the classifier and some analysis, see Bayes classifier
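For readers who want a feel for the idea before following the reference, below is a minimal textbook multinomial naive Bayes sketch with add-one smoothing. It is only an assumption about the general approach, not cherry's actual implementation. The functions train and classify here are my own, unrelated to cherry.classify, and the training samples are invented toy data.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (tokens, label). Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    doc_counts = Counter()               # label -> number of documents
    vocab = set()
    for tokens, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    """Return [(label, probability)] sorted from most to least likely."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)   # log prior
        for tok in tokens:
            if tok not in vocab:
                continue   # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return sorted(((label, v / norm) for label, v in exp_scores.items()),
                  key=lambda pair: pair[1], reverse=True)

# Toy training data, invented for illustration only.
samples = [
    (['学习', '工作', '希望'], 'normal'),
    (['学习', '数据分析'], 'normal'),
    (['下注', '赔率', '彩票'], 'gamble'),
    (['彩票', '中奖'], 'gamble'),
]
model = train(samples)
ranking = classify(['希望', '学习', '工作'], model)
print(ranking)
```

Working in log space and subtracting the largest score before exponentiating keeps the normalisation numerically stable for long sentences.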
