Making Custom Word Clouds with Python - 摩杜云开发者社区

I. Read the file and segment the text

1 Install and import jieba, a Chinese word segmentation library

pip install jieba

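2 Read the text and segment it

According to the official wordcloud documentation, Chinese text has to be split into words first (in English text the words are already separated by spaces).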

[Image: screenshot from the wordcloud documentation]

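Use jieba.lcut() for precise-mode segmentation and keep only the words longer than one character.

source.txt is a text file prepared in advance, with no preprocessing applied.

Sample code: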

from pprint import pprint

import jieba


def word_cloud(source_file):
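    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)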
    return text_result


if __name__ == "__main__":
    file = 'source.txt'
    pprint(word_cloud(file))

Output:

[Image: the segmented text printed to the console]
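To make the filtering and joining steps easier to see in isolation, here is a minimal sketch that runs them on a hand-written list. The list only stands in for what jieba.lcut() might return (the real segmentation depends on jieba's dictionary), and the helper name filter_and_join is made up for this illustration.

```python
def filter_and_join(words, min_length=2, separator="/ "):
    """Keep words with at least min_length characters and join them."""
    kept = [word for word in words if len(word) >= min_length]
    return separator.join(kept)


# Hand-written stand-in for a jieba.lcut() result (an assumption, not real output)
sample_words = ["我", "喜欢", "用", "Python", "制作", "词云", "图", "，"]
result = filter_and_join(sample_words)
print(result)  # 喜欢/ Python/ 制作/ 词云
```

Single characters and punctuation marks have length 1, so the length filter drops them along with most particles.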

II. Generate and save the word cloud

1 Install and import wordcloud

pip install wordcloud

2 Create a WordCloud object, load the text, and save the word cloud

Sample code:

import jieba
from wordcloud import WordCloud


def word_cloud(source_file, target_path):
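    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)

    # Generate the word cloud
    wc = WordCloud()
    wc.generate(text_result)

    # Save the word cloud
    wc.to_file(target_path)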


if __name__ == "__main__":
    file = 'source.txt'
    target = "wordcloud.png"
    word_cloud(file, target)

Output:

[Image: the generated word cloud, with Chinese characters rendered as boxes]

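Problem 1 -- Chinese characters show up as boxes when font_path is not set:

As the image shows, the Chinese text in the generated word cloud is rendered incorrectly, while digits and English text display normally.

Solution 1:

The official documentation shows that the cause is a missing Chinese font:

[Image: the font_path parameter in the wordcloud documentation]

Find the path of a font on the local machine and pass it in as the font_path parameter:

[Image: font files on the local machine]

Sample code 1: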

import jieba
from wordcloud import WordCloud


def word_cloud(source_file, font_path, target_path):
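    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)

    # Generate the word cloud with a font that contains Chinese glyphs
    wc = WordCloud(font_path=font_path)
    wc.generate(text_result)

    # Save the word cloud
    wc.to_file(target_path)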


if __name__ == "__main__":
    file = 'source.txt'
    font = "C:\\Windows\\Fonts\\STXINGKA.TTF"
    target = "wordcloud.png"
    word_cloud(file, font, target)

Output 1:

[Image: word cloud with the Chinese text rendered correctly]
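Because a wrong font path is easy to miss, it can help to check that the file exists before handing it to WordCloud. The sketch below is only an illustration: the helper first_existing_font is hypothetical, and apart from the Windows path used in this article, the candidate locations are assumptions that may not exist on a given machine.

```python
from pathlib import Path


def first_existing_font(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# Example locations only; whether any of them exists depends on the machine
font_candidates = [
    "C:\\Windows\\Fonts\\STXINGKA.TTF",                # path used in this article
    "/System/Library/Fonts/PingFang.ttc",              # assumed macOS location
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # assumed Linux location
]
font = first_existing_font(font_candidates)
if font is None:
    print("No candidate font found; Chinese text would still render as boxes")
else:
    print("Using font:", font)
```

The returned string can then be passed as font_path in the sample code above.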

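Problem 2 -- words appear twice when collocations is left at its default:

Many words appear twice in the word cloud, for example "51CTO", "博客" (blog), and "计划" (plan).

Solution 2:

wordcloud has a collocations parameter that controls whether two-word collocations (bigrams) are included. It defaults to True, so a word can be drawn both on its own and as part of a pair. Setting it to False removes these repeats.

[Image: the collocations parameter in the wordcloud documentation]

Sample code 2: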

import jieba
from wordcloud import WordCloud


def word_cloud(source_file, font_path, target_path):
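    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)

    # Generate the word cloud; collocations=False leaves out bigrams
    wc = WordCloud(font_path=font_path, collocations=False)
    wc.generate(text_result)

    # Save the word cloud
    wc.to_file(target_path)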


if __name__ == "__main__":
    file = 'source.txt'
    font = "C:\\Windows\\Fonts\\STXINGKA.TTF"
    target = "wordcloud.png"
    word_cloud(file, font, target)

Output 2:

[Image: word cloud without repeated words]
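To see why bigrams lead to repeats, the following simplified sketch counts single words and adjacent word pairs in a short list. It is an illustration only: wordcloud's own bigram detection is more involved than plain counting, and the helper name unigram_and_bigram_counts is made up here.

```python
from collections import Counter


def unigram_and_bigram_counts(words):
    """Count single words and adjacent word pairs in a word list."""
    unigrams = Counter(words)
    bigrams = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    return unigrams, bigrams


words = ["博客", "计划", "博客", "计划", "博客", "计划"]
unigrams, bigrams = unigram_and_bigram_counts(words)
print(unigrams["博客"])      # 3
print(bigrams["博客 计划"])  # 3
```

When both kinds of entries are drawn, "博客" can appear once by itself and again inside "博客 计划", which matches the repetition seen before collocations=False was set.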

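III. Customize the shape of the word cloud

1 Install and import PIL and numpy

The imageio.imread() method that the official sample code uses to read images is about to be deprecated, so the other approach it offers is used here: PIL and numpy.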

pip install pillow
pip install numpy
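As a quick check of what the PIL plus numpy route produces, this small sketch opens an image with PIL and converts it to an array. A tiny white image built in memory stands in for mask.jpeg, which is an assumption made so that the snippet runs without any file on disk.

```python
import io

import numpy as np
from PIL import Image

# Build a small 4x2 white RGB image in memory instead of reading mask.jpeg
buffer = io.BytesIO()
Image.new("RGB", (4, 2), color="white").save(buffer, format="PNG")
buffer.seek(0)

image_array = np.array(Image.open(buffer))
print(image_array.shape)  # (2, 4, 3): height, width, channels
print(image_array.dtype)  # uint8
```

The array is indexed as height, width, channels, and this is the kind of array passed to WordCloud through the mask parameter.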

2 Read the mask image and configure the wordcloud parameters

The mask image mask.jpeg looks like this:

[Image: mask.jpeg]

Sample code:

import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud


def word_cloud(source_file, mask_image, font_path, target_path):
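    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)

    # Read the mask image (the shape of the generated word cloud)
    image = Image.open(mask_image)
    image_array = np.array(image)

    # Generate the word cloud
    wc = WordCloud(font_path=font_path, collocations=False, mask=image_array)
    wc.generate(text_result)

    # Save the word cloud
    wc.to_file(target_path)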


if __name__ == "__main__":
    file = 'source.txt'
    img = 'mask.jpeg'
    font = "C:\\Windows\\Fonts\\STXINGKA.TTF"
    target = "wordcloud.png"
    word_cloud(file, img, font, target)

Output:

[Image: word cloud drawn in the shape of the mask]

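Problem 1 -- no words are drawn in the white areas of the mask image:

In the image of Luffy from One Piece, none of the white areas contain any words.

Solution 1:

According to the official documentation, wordcloud draws words only in the areas of the mask image that are not white.

[Image: the mask parameter in the wordcloud documentation]

Option 1: color in the white areas of mask.jpeg other than the background (not used here).

Option 2: remove the background from mask.jpeg, remap the image's RGB values when reading the mask, and add a contour line to the word cloud.

Sample code 1: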

import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud


def word_cloud(source_file, mask_image, font_path, target_path):
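    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)

    # Read the mask image (the shape of the generated word cloud)
    image = Image.open(mask_image)
    # Remap pixel values: above the threshold -> 0, otherwise -> 255 (white, masked out)
    threshold = 0  # threshold
    new_image = image.point(lambda p: 0 if p > threshold else 255)
    image_array = np.array(new_image)

    # Generate the word cloud
    wc = WordCloud(
        font_path=font_path,  # font
        collocations=False,  # leave out bigrams so words are not repeated
        contour_width=1,  # contour line width
        contour_color="white",  # contour color
        mask=image_array)
    wc.generate(text_result)

    # Save the word cloud
    wc.to_file(target_path)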


if __name__ == "__main__":
    file = 'source.txt'
    img = 'mask.png'
    font = "C:\\Windows\\Fonts\\STXINGKA.TTF"
    target = "wordcloud.png"
    word_cloud(file, img, font, target)

Output 1:

[Image: word cloud filling the whole figure, with a contour line]
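The effect of the point() remapping can be traced on a two-pixel image. This is a sketch under one assumption: the removed background of mask.png is stored as all-zero RGBA pixels, which depends on the tool used to cut it out. The final check mirrors the documented rule that white entries are masked out and is not taken from the library's source.

```python
import numpy as np
from PIL import Image

# Two-pixel RGBA image: a fully transparent pixel and an opaque colored pixel
image = Image.new("RGBA", (2, 1))
image.putpixel((0, 0), (0, 0, 0, 0))         # stands in for the removed background
image.putpixel((1, 0), (200, 100, 50, 255))  # stands in for the subject

threshold = 0
new_image = image.point(lambda p: 0 if p > threshold else 255)
image_array = np.array(new_image)

# Pixels whose RGB channels are all 255 count as white, i.e. masked out
is_white = np.all(image_array[:, :, :3] == 255, axis=-1)
print(image_array[0, 0].tolist())  # [255, 255, 255, 255]
print(image_array[0, 1].tolist())  # [0, 0, 0, 0]
print(is_white.tolist())           # [[True, False]]
```

The background turns white and is excluded, while the subject, including any parts that were white in the original picture, becomes non-white and can hold words.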

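IV. Preview the word cloud

So far the generated word clouds have been opened in an image viewer. Next, the word cloud is displayed from code.

1 Install and import matplotlib

pip install matplotlib

2 Create a matplotlib figure and display it

Complete code: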

import jieba
import numpy as np
from PIL import Image
from matplotlib import pyplot as plt
from wordcloud import WordCloud


def word_cloud(source_file, mask_image, font_path, target_path):
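    # Read the text; the with block closes the file afterwards
    with open(source_file, "r", encoding='UTF-8') as f:
        text = f.read()
    # Precise-mode segmentation (no overlapping words); keep words longer than one character
    cut_text = [word for word in jieba.lcut(text) if len(word) > 1]
    # Join the segmented words with a separator
    text_result = "/ ".join(cut_text)

    # Read the mask image (the shape of the generated word cloud)
    image = Image.open(mask_image)
    # Remap pixel values: above the threshold -> 0, otherwise -> 255 (white, masked out)
    threshold = 0  # threshold
    new_image = image.point(lambda p: 0 if p > threshold else 255)
    image_array = np.array(new_image)

    # Generate the word cloud
    wc = WordCloud(
        font_path=font_path,  # font
        collocations=False,  # leave out bigrams so words are not repeated
        contour_width=1,  # contour line width
        contour_color="white",  # contour color
        mask=image_array)
    wc.generate(text_result)

    # Save the word cloud
    wc.to_file(target_path)

    # Preview the image
    plt.figure("wordcloud")  # create the figure
    plt.imshow(wc)  # draw the word cloud
    plt.axis("off")  # hide the axes
    plt.show()  # show the figure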


if __name__ == "__main__":
    file = 'source.txt'
    img = 'mask.png'
    font = "C:\\Windows\\Fonts\\STXINGKA.TTF"
    target = "wordcloud.png"
    word_cloud(file, img, font, target)

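Output:

[Image: the word cloud shown in a matplotlib window]

Note:

The matplotlib preview can also be saved as an image, but the saved file includes the white canvas around the plot. The original image saved by wordcloud's to_file() method is sharper.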
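For completeness, here is a hedged sketch of saving the matplotlib figure itself. Random pixels stand in for the word cloud, the output file name is arbitrary, and the Agg backend is chosen only so that the snippet runs without a display. The bbox_inches and pad_inches arguments of savefig trim most of the surrounding canvas.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
from matplotlib import pyplot as plt

# Random pixels stand in for the word cloud image
fake_cloud = np.random.randint(0, 256, size=(40, 80, 3), dtype=np.uint8)

fig = plt.figure("wordcloud")
plt.imshow(fake_cloud)
plt.axis("off")

preview_path = os.path.join(tempfile.gettempdir(), "wordcloud_preview.png")
# bbox_inches="tight" with pad_inches=0 crops most of the surrounding canvas
fig.savefig(preview_path, bbox_inches="tight", pad_inches=0)
plt.close(fig)
print(os.path.exists(preview_path))  # True
```

Even with the margin trimmed, the figure is rendered at matplotlib's own resolution, so to_file() remains the simpler way to keep the word cloud at its original pixel size.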