Bag-of-Words Model: Concepts and Python Implementation

Bag-of-Words Model

- 1. Basic concepts
- 2. Code implementation

1. Basic concepts

Before a text can be classified, it must first be represented as a vector; the bag-of-words model is a common way to do this.

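The bag-of-words model (BoW, Bag of Words) ignores the contextual relationships between the words in a text and considers only the weight of each word, which depends on how often the word occurs in the text. It is like putting all the words into one bag: each word is treated independently and carries no semantic information.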

Building the bag-of-words representation of a text takes three steps:

Tokenizing
Counting (term frequencies)
Normalizing (feature standardization)
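The three steps can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `bag_of_words` is hypothetical, tokenizing is assumed to be a lowercase whitespace split, and normalizing is assumed to mean dividing each count by the total number of tokens in the text.

```python
from collections import Counter

def bag_of_words(text):
    """Return raw and normalized term frequencies for one text."""
    tokens = text.lower().split()      # 1. tokenizing (assumed: whitespace split)
    counts = Counter(tokens)           # 2. counting
    total = sum(counts.values())
    # 3. normalizing (assumed: divide by the number of tokens in the text)
    normalized = {word: count / total for word, count in counts.items()}
    return counts, normalized

counts, normalized = bag_of_words("I come to China to travel")
print(counts)
print(normalized)
```

Real tokenizers usually also handle punctuation and stop words, which this sketch leaves out.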

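The set-of-words model (SoW, Set of Words) is similar to the bag-of-words model; the only difference is that it records whether a word occurs in the text and ignores how often. In most cases the bag-of-words model is used.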
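The difference between the two models shows up as soon as a word repeats. The sketch below compares both views of one sentence; the variable names are illustrative only.

```python
from collections import Counter

text = "I come to China to travel"
tokens = text.lower().split()

bag = Counter(tokens)                       # bag of words: word -> count
word_set = {word: 1 for word in set(tokens)}  # set of words: word -> 1 if present

print(bag["to"])       # how many times "to" occurs
print(word_set["to"])  # only whether "to" occurs
```

Both structures contain the same words; only the values differ.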

For example, suppose the corpus contains 4 texts:

I come to China to travel
This is a car polupar in China
I love tea and Apple
The work is to write some papers in science

The dictionary built from this corpus (lowercased and sorted alphabetically) contains 21 words:

‘a’,
‘and’,
‘apple’,
‘car’,
‘china’,
‘come’,
‘i’,
‘in’,
‘is’,
‘love’,
‘papers’,
‘polupar’,
‘science’,
‘some’,
‘tea’,
‘the’,
‘this’,
‘to’,
‘travel’,
‘work’,
‘write’

The one-hot representation of each word is:

‘a’: $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]$
‘and’: $[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]$
…
‘write’: $[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]$
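A one-hot vector has a single 1 at the word's position in the dictionary, and summing the one-hot vectors of all words in a text gives that text's count vector. The following is a rough sketch using the 21-word dictionary above; the helper names `one_hot` and `text_vector` are hypothetical.

```python
vocabulary = ["a", "and", "apple", "car", "china", "come", "i", "in", "is",
              "love", "papers", "polupar", "science", "some", "tea", "the",
              "this", "to", "travel", "work", "write"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector with a single 1 at the word's dictionary position."""
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

def text_vector(text):
    """Element-wise sum of the one-hot vectors of the words in the text."""
    vec = [0] * len(vocabulary)
    for token in text.lower().split():
        for i, bit in enumerate(one_hot(token)):
            vec[i] += bit
    return vec

print(one_hot("a"))
print(text_vector("I come to China to travel"))
```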

The bag-of-words representation of the four texts is:

$[0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0]$
$[1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]$
$[0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$
$[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]$

Normalizing the term frequencies (dividing each count by the number of words in the text) gives:

$[0, 0, 0, 0, 1 / 6, 1 / 6, 1 / 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 / 3, 1 / 6, 0, 0]$
$[1 / 7, 0, 0, 1 / 7, 1 / 7, 0, 0, 1 / 7, 1 / 7, 0, 0, 1 / 7, 0, 0, 0, 0, 1 / 7, 0, 0, 0, 0]$
$[0, 1 / 5, 1 / 5, 0, 0, 0, 1 / 5, 0, 0, 1 / 5, 0, 0, 0, 0, 1 / 5, 0, 0, 0, 0, 0, 0]$
$[0, 0, 0, 0, 0, 0, 0, 1 / 9, 1 / 9, 0, 1 / 9, 0, 1 / 9, 1 / 9, 0, 1 / 9, 0, 1 / 9, 0, 1 / 9, 1 / 9]$
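The normalized vectors can be recomputed with exact fractions, which also gives a simple sanity check: every row should sum to 1. This is a sketch under the assumption that tokens are obtained by a lowercase whitespace split and that the dictionary is the sorted set of all tokens.

```python
from fractions import Fraction

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple",
    "The work is to write some papers in science",
]

tokenized = [text.lower().split() for text in corpus]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def normalized_vector(tokens):
    """Count of each dictionary word divided by the text length."""
    total = len(tokens)
    return [Fraction(tokens.count(word), total) for word in vocabulary]

vectors = [normalized_vector(tokens) for tokens in tokenized]
for vec in vectors:
    print([str(value) for value in vec])
```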

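In large-scale text processing, the feature dimension equals the size of the vocabulary and can become very high, so the hashing trick is often used to reduce dimensionality.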
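The idea of the hashing trick is to map each word to one of a fixed number of buckets with a hash function, so no dictionary has to be stored. The sketch below is a simplified signed variant: it assumes MD5 from the standard library as the hash (production implementations typically use a faster non-cryptographic hash such as MurmurHash3), and the function name `hashed_bow` is hypothetical.

```python
import hashlib

def hashed_bow(text, n_features=6):
    """Map tokens into n_features buckets, adding +1 or -1 per token."""
    vec = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_features
        # A separate byte of the digest decides the sign, so that words
        # colliding in one bucket tend to cancel instead of piling up.
        sign = 1 if digest[8] & 1 == 0 else -1
        vec[bucket] += sign
    return vec

print(hashed_bow("I come to China to travel"))
```

The price of the fixed dimension is that different words can land in the same bucket, and the mapping cannot be inverted to recover the words.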

In addition, the values in the bag-of-words vector can be each word's TF-IDF value instead of its raw count.
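As a rough illustration of TF-IDF weighting, the sketch below uses one common textbook definition: term frequency is the count divided by the text length, and inverse document frequency is $\ln(N / df)$, where $N$ is the number of texts and $df$ is the number of texts containing the word. This exact formula is an assumption; libraries often use smoothed variants, so their numbers will differ.

```python
import math

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple",
    "The work is to write some papers in science",
]
tokenized = [text.lower().split() for text in corpus]
n_docs = len(tokenized)

def document_frequency(word):
    """Number of texts that contain the word at least once."""
    return sum(1 for tokens in tokenized if word in tokens)

def tf_idf(word, tokens):
    tf = tokens.count(word) / len(tokens)               # term frequency
    idf = math.log(n_docs / document_frequency(word))   # unsmoothed IDF
    return tf * idf

first = tokenized[0]
for word in sorted(set(first)):
    print(word, round(tf_idf(word, first), 4))
```

Words that occur in only one text (such as "travel") get a larger IDF factor than words shared by several texts (such as "china").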

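2. Code implementation

The implementation mainly uses the CountVectorizer class from sklearn.feature_extraction.text.

CountVectorizer is a commonly used feature-extraction class (it accepts a stop-word list). Its fit_transform method counts how many times each word occurs in each text and returns a term-frequency matrix.
get_feature_names_out lists the vocabulary terms (older scikit-learn releases used get_feature_names, which was removed in version 1.2), and toarray shows the bag-of-words result as a dense array.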

Input: a list whose elements are strings
Output: a term-frequency matrix whose element $a[i][j]$ is the count of word $j$ in text $i$
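One detail worth knowing: with its default settings, CountVectorizer lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. For the example corpus this drops "a" and "i", so the fitted vocabulary has 19 entries rather than the 21 listed earlier. The sketch below only emulates that tokenization with the standard `re` module; it is not scikit-learn's own code, and the helper name `vocabulary` is hypothetical.

```python
import re

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple",
    "The work is to write some papers in science",
]

DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")   # tokens of 2+ word characters
KEEP_SHORT_PATTERN = re.compile(r"(?u)\b\w+\b")  # also keeps 1-character tokens

def vocabulary(pattern):
    """Sorted set of lowercase tokens matched by the pattern."""
    words = set()
    for text in corpus:
        words.update(pattern.findall(text.lower()))
    return sorted(words)

default_vocab = vocabulary(DEFAULT_PATTERN)
full_vocab = vocabulary(KEEP_SHORT_PATTERN)
print(len(default_vocab), len(full_vocab))
print(sorted(set(full_vocab) - set(default_vocab)))
```

If the one-character words matter, passing `token_pattern=r"(?u)\b\w+\b"` to CountVectorizer should keep them.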

scikit-learn's HashingVectorizer class implements an algorithm based on the signed hash trick.

The code is as follows:

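from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple",
          "The work is to write some papers in science"]
# Default settings lowercase the text and drop one-character tokens ("a", "i"),
# so the fitted vocabulary has 19 entries rather than 21.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse term-frequency matrix
print("Term frequencies:")
# Each printed entry is (text index, word index) followed by the count
print(counts)
print("\nBag-of-words model:")
print(counts.toarray())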

The output is as follows:
[Figure: screenshot of the program output]

from sklearn.feature_extraction.text import HashingVectorizer

# Hash the 19-word vocabulary down to 6 features; norm=None keeps raw signed counts
vectorizerH = HashingVectorizer(n_features=6, norm=None)
hashed = vectorizerH.fit_transform(corpus)  # corpus is the list defined above
print("Term frequencies:")
print(hashed)
print("\nBag-of-words model:")
print(hashed.toarray())