寻求一个概率型计数器，效果类似 BloomFilter + Counter，能告诉我每个 item 大约出现了多少次

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

现在注册

已注册用户请登录

这是一个创建于 1005 天前的主题，其中的信息可能已经有所发展或是发生改变。

背景

在一些流处理场景中，因为数据太多，不想把所有东西都持久化保存，所以需要一些采样的手段。已有的采样策略会按照例如“是否包含错误”、“是否 bla bla bla”来采样，这些都是可以的。

目前在寻求新的思路，尽可能覆盖到不同的数据，例如所有数据的可能组合有1 亿种，那每种出现的频率可能不一样，如果用一个 map 来计数的话很精确，但是会用很多内存，例如：

{"组合_uniq_id_1": 999, "组合_uniq_id_2": 12, ..., "组合_uniq_id_192818939291": 21}

需求

BloomFilter 可以把 1 亿种可能映射到一个 BitMap 上，空间很小，但是可以告诉我这种组合是否出现过。如果我不光想知道某种组合（可能）出现过，还想大致了解一下它出现过多少次，有没有合适的数据结构满足这个要求呢？

bla

采样

组合

例如

8 条回复 • 2023-05-06 11:47:39 +08:00

lxy42

2023 年 5 月 5 日 via Android

hyperloglog ，Redis 就支持

aphorism

2023 年 5 月 5 日

CM Sketch or Counting Bloom filter.

RedisMasterNode

2023 年 5 月 5 日

@lxy42 HyperLogLog 是不行的，HyperLogLog 能告诉我某个东西，例如 “今日见过的 IP”，里面有多少个不同的 item ，我要的不是不同的次数，每个不同的 item 出现的次数

RedisMasterNode

2023 年 5 月 5 日

@aphorism 谢谢，搜了一下很接近了，我详细了解下

matrix1010

2023 年 5 月 5 日

如果你需要比较精确而空间又有限的话 Count-min Sketch 不一定合适，通常来说使用 Count-min Sketch 来进行相对比较，比如 LFU cache 。可以看看这个: https://redis.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/ 里面的 Count-min Sketch example 和 What is Count-min Sketch good for?部分

lxy42

2023 年 5 月 5 日 via Android

@RedisMasterNode 我记混了，看到 bloom filter 和你 ID 里的 Redis ，就想起了 hyperloglog

RedisMasterNode

2023 年 5 月 5 日

@matrix1010 非常有帮助，感谢

RedisMasterNode

2023 年 5 月 6 日

@matrix1010
@aphorism
谢谢两位，信息很有用，Redis 里面已经有拓展的模块可以试验一下。
同时还有几种概率型数据结构的“一句话用途描述”，快速理解： https://redis.io/docs/stack/bloom/

Use these data structures to answer a set of common questions concerning data streams:

- HyperLogLog: How many unique values appeared so far in the data stream?
- Bloom filter and Cuckoo filter: Did the value v already appear in the data stream?
- Count-min sketch: How many times did the value v appear in the data stream?
- Top-K: What are the k most frequent values in the data stream?
- t-digest can be used to answer these questions:
- - What fraction of the values in the data stream are smaller than a given value?
- - How many values in the data stream are smaller than a given value?
- - Which value is smaller than p percent of the values in the data stream? (what is the p-percentile value)?
- - What is the mean value between the p1-percentile value and the p2-percentile value?
- - What is the value of the n-th smallest / largest value in the data stream? (what is the value with [reverse] rank n)?