Counting Word Frequencies in Poetry Data with Python - 摩杜云 Developer Community

1. Problem

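We have a table of poetry data and need to count word frequencies over the columns '诗歌名称' (poem title) and '内容' (content). Stopwords come from chinese_stopwords.txt, plus numbers of any length. The output is the frequency of the 1000 most common words.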

2. Solution

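Import the required libraries and modules:

pandas: reads the Excel table and writes the result.
jieba: a Chinese word segmentation library, used to split text into words.
re: the regular expression module, used to strip punctuation from the text.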

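Define a function cut_words that segments the text:

The parameter text is the input text.
jieba.cut segments the text, and the result is converted to a list.
The function returns the list of tokens; stopword filtering happens afterwards, in calculate_word_frequency.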
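The stopword-filtering step can be sketched on its own as a list comprehension. This is only an illustration: filter_tokens and NUMBER_RE are hypothetical names that do not appear in the original script, and the sample tokens are made up to resemble what a segmenter such as jieba might return.

```python
import re

# Hypothetical helper, not part of the original script.
NUMBER_RE = re.compile(r'\d+')  # one or more digits, i.e. a number of any length

def filter_tokens(tokens, stopwords):
    """Drop whitespace-only tokens, stopwords, and purely numeric tokens."""
    stopword_set = set(stopwords)  # set lookup is O(1) on average
    return [
        tok for tok in tokens
        if tok.strip()
        and tok not in stopword_set
        and not NUMBER_RE.fullmatch(tok)
    ]

# Made-up sample tokens, as a segmenter might produce them.
sample_tokens = ['春天', '的', '花', ' ', '2023', '一首', '开']
filtered = filter_tokens(sample_tokens, ['的', '一首'])
print(filtered)
```

Converting the stopwords to a set matters on larger inputs, since membership tests on a list scan every entry.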

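Define a function calculate_word_frequency that counts word frequencies:

The parameter text_list is a list of texts.
The parameter stopwords is the stopword list.
Create an empty dictionary word_frequency to hold the counts.
For each text in text_list:

Strip punctuation with a regular expression.
Call cut_words to segment the text.
For each token that is not a stopword (numbers count as stopwords):

Look up the token's current count with the dictionary's get method, which returns the default 0 if the token is absent.
Add 1 to the count and store it back in the dictionary.

Return the word frequency dictionary.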
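As an alternative to the manual get-and-increment pattern, the same counting loop could be written with collections.Counter from the standard library. The sketch below is an assumption-laden variant, not the original code: count_words is a hypothetical name, and the tokenize parameter defaults to whitespace splitting only so that the example runs without third-party packages. For Chinese text one would pass a real segmenter such as jieba.lcut instead.

```python
import re
from collections import Counter

def count_words(texts, stopwords, tokenize=str.split):
    """Count tokens across texts, skipping stopwords and pure numbers.

    `tokenize` is a stand-in (assumption) so the sketch needs no
    third-party package; pass a segmenter such as jieba.lcut for Chinese.
    """
    stopword_set = set(stopwords)
    counts = Counter()
    for text in texts:
        # Replace punctuation with spaces so neighbouring words stay separate.
        cleaned = re.sub(r'[^\w\s]', ' ', text)
        for token in tokenize(cleaned):
            token = token.strip()
            if token and token not in stopword_set and not token.isdigit():
                counts[token] += 1
    return counts

# Made-up English sample so that whitespace splitting is meaningful.
demo = count_words(['moon, river 12', 'the moon rises'], ['the'])
print(demo.most_common(1))
```

A Counter is a dict subclass, so the result can be used wherever the plain dictionary was expected.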

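Load the stopwords with load_stopwords, which reads chinese_stopwords.txt line by line and adds the custom stopwords '组诗' and '一首'.
Select the text columns to analyze from the DataFrame ('诗歌名称' and '内容') and concatenate them row by row into a single text column, text_list.
Call calculate_word_frequency with text_list and stopwords to count the frequencies, then sort by count and write the top 1000 words to an Excel file.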

3. Code

import pandas as pd
import jieba
import re

# Load the stopword list (one stopword per line) into a set for fast lookup
def load_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            word = line.strip()
            if word:
                stopwords.add(word)
    # Add custom stopwords
    stopwords.update(['组诗', '一首'])
    return stopwords

# Segment text with jieba
def cut_words(text):
    return list(jieba.cut(text))

# Count word frequencies
def calculate_word_frequency(text_list, stopwords):
    word_frequency = {}
    for text in text_list:
        # Replace punctuation with spaces so adjacent words do not merge
        text = re.sub(r'[^\w\s]', ' ', text)
        for word in cut_words(text):
            word = word.strip()
            # Skip whitespace tokens, stopwords, and numbers of any length
            if not word or word in stopwords or re.fullmatch(r'\d+', word):
                continue
            word_frequency[word] = word_frequency.get(word, 0) + 1
    return word_frequency

# Read the Excel file
df = pd.read_excel('poem_1w.xlsx')

# Load the stopword list
stopwords = load_stopwords('chinese_stopwords.txt')

# Combine the '诗歌名称' and '内容' columns into one text per row
text_list = df['诗歌名称'].astype(str) + ' ' + df['内容'].astype(str)

# Count word frequencies
word_frequency = calculate_word_frequency(text_list, stopwords)

# Sort by count and keep the top 1000 words
top_words = sorted(word_frequency.items(), key=lambda x: x[1], reverse=True)[:1000]

# Write the result to an Excel file (columns: word, frequency)
output_df = pd.DataFrame(top_words, columns=['词语', '词频'])
output_df.to_excel('word_frequency.xlsx', index=False)
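
The sort-and-slice step can also be expressed with Counter.most_common, which returns the highest counts first. A minimal sketch, assuming the frequencies are already in a plain dictionary; top_n is a hypothetical helper name and the sample counts are made up.

```python
from collections import Counter

def top_n(word_frequency, n=1000):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return Counter(word_frequency).most_common(n)

# Made-up counts for illustration.
freq = {'月': 5, '山': 3, '风': 8, '水': 1}
print(top_n(freq, n=2))
```

The returned list of (word, count) tuples has the same shape as top_words, so it could be passed to pd.DataFrame in the same way.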