Can Python report the probability that a piece of text is in a given encoding?

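The pseudocode looks like this:
s = '**************'
print(detectRisk(s, 'gbk'))   # => 80%: the probability that s is GBK-encoded is 80%
print(detectRisk(s, 'utf8'))  # => 30%: the probability that s is UTF-8-encoded is 30%
Does Python have a function like this?
There are cchardet.detect(s) and chardet.detect(s), but neither lets you ask for the probability of one specific encoding.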

Encoding

Python

detectrisk

7 replies • 2017-09-10 23:08:30 +08:00

nullcc

September 9, 2017

chardet https://pypi.python.org/pypi/chardet

mskip

September 9, 2017

chardet
{'confidence': 0.7525, 'language': '', 'encoding': 'utf-8'}
{'confidence': 1.0, 'language': '', 'encoding': 'ascii'}
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}

vtoexsir

September 9, 2017

@nullcc
@mskip
chardet can't be told which encoding to score. For example, you can't hand it a piece of text and ask how likely it is to be GBK.

ltux

September 9, 2017

Then implement it yourself.
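
One way to read "implement it yourself", as a minimal sketch using only the standard library: strict decoding can at least rule an encoding out. The helper name `candidate_encodings` is hypothetical, and what it gives is a yes/no validity check, not a probability.

```python
def candidate_encodings(data: bytes, encodings=("utf-8", "gbk")):
    """Return the encodings under which `data` decodes without error."""
    valid = []
    for name in encodings:
        try:
            data.decode(name)  # strict mode raises on any invalid byte sequence
        except UnicodeDecodeError:
            continue  # these bytes cannot be this encoding
        valid.append(name)
    return valid


gbk_bytes = "你好，世界！".encode("gbk")
print(candidate_encodings(gbk_bytes))
```

Passing this check only means the bytes are legal in that encoding. The same bytes can be legal in several encodings at once, so ranking the survivors still needs a statistical detector such as chardet.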

janxin

September 9, 2017

In [1]: import chardet

In [2]: prober = chardet.utf8prober.UTF8Prober()

In [3]: prober.feed('你好，世界！'.encode('utf-8'))
Out[3]: 1

In [4]: prober.get_confidence()
Out[4]: 0.99

param

September 9, 2017

chardet sometimes gets it wrong, and it is also very slow.

yucongo

September 10, 2017

s = '**************' looks like it is missing a b prefix; it should be s = b'**************'. A str has no encoding to detect.
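
To illustrate the str versus bytes point, a small Python 3 sketch (variable names are only illustrative): a str holds Unicode code points, and an encoding only comes into play once it is turned into bytes, which is the form a detector has to be fed.

```python
text = "你好，世界！"            # str: Unicode code points, no encoding attached

as_utf8 = text.encode("utf-8")   # bytes: 3 bytes per character here
as_gbk = text.encode("gbk")      # bytes: 2 bytes per character here

print(len(text), len(as_utf8), len(as_gbk))  # 6 18 12

# Different byte strings, same text once decoded with the matching codec.
assert as_utf8 != as_gbk
assert as_utf8.decode("utf-8") == as_gbk.decode("gbk") == text
```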