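P.4 Text Statistics Tool

I. Features

Read a given text file and report the character count (with and without spaces and punctuation), the word count, the line count, and the most frequent words (Top 10).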
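The two character counts can be pictured with a small sketch. It assumes that "spaces and punctuation" means whatever std::isspace and std::ispunct accept in the default "C" locale, which is slightly broader than the ASCII ranges registered in the full listing (tabs count as separators here). CharCounts and count_chars are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: both character counts for one line.
struct CharCounts
{
	std::size_t with_separators;    // every character in the line
	std::size_t without_separators; // characters that are neither space nor punctuation
};

// Hypothetical helper: count the characters of one line in both ways.
CharCounts count_chars(const std::string& line)
{
	// Assumption: a separator is anything isspace or ispunct accepts.
	const auto is_separator = [](unsigned char c)
	{
		return std::isspace(c) || std::ispunct(c);
	};
	const std::size_t separators = static_cast<std::size_t>(
		std::count_if(line.begin(), line.end(), is_separator));
	return { line.size(), line.size() - separators };
}
```

For the line "Hi, you!" this gives 8 characters in total and 5 once the comma, the space, and the exclamation mark are left out.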

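II. Practice focus

Reading large files with ifstream; traversing and processing string; counting word frequency with map<string, int>; sorting a map by value; STL algorithms (count/replace); filtering stop words (is/a/an/the and so on); multi-file statistics; writing the results to a new file.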
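Two of the STL algorithms named above, std::count and std::replace, can be tried on a single string. This is only a minimal sketch; the helper names count_spaces and dashes_to_spaces are hypothetical and do not appear in the tool's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: number of space characters in a line.
std::size_t count_spaces(const std::string& line)
{
	return static_cast<std::size_t>(std::count(line.begin(), line.end(), ' '));
}

// Hypothetical helper: copy of the line with every '-' turned into a space.
std::string dashes_to_spaces(std::string line)
{
	std::replace(line.begin(), line.end(), '-', ' ');
	return line;
}
```

Replacing hyphens with spaces before splitting is one simple way to make sure a compound such as ultra-processed is treated as two words on purpose.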

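III. My approach

The task splits into three parts: (1) read the text; (2) sort the words by frequency, from high to low; (3) write out the statistics. The driver function run executes them in this order.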

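1) Read the text

  1. Read the file line by line with getline; each line is read into a string text and then analyzed.
  2. Convert every uppercase letter in the line to lowercase, then walk through the string with an index idx:
  • if text[idx] is a space or a punctuation mark, skip it with idx++;
  • otherwise text[idx] starts a word, which the read_word function reads:
    • if the word is on the filter list (see the code for the list), it is ignored;
    • otherwise the word's occurrence count is incremented.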
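The steps above can be sketched in a compact form. This is not the tool's read_word function: it assumes a word is a maximal run of ASCII letters or digits and treats every other byte as a separator, and the names split_words and count_line are hypothetical.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

// Split one line into lowercase words.
// Assumption: letters and digits form words; everything else separates them.
std::vector<std::string> split_words(const std::string& line)
{
	std::vector<std::string> words;
	std::string word;
	for (unsigned char c : line)
	{
		if (std::isalnum(c))
		{
			word += static_cast<char>(std::tolower(c));
		}
		else if (!word.empty())
		{
			words.push_back(word);
			word.clear();
		}
	}
	if (!word.empty()) words.push_back(word); // line ended in the middle of a word
	return words;
}

// Add the words of one line to freq, skipping the stop words.
void count_line(const std::string& line,
	const std::unordered_set<std::string>& stop_words,
	std::map<std::string, int>& freq)
{
	for (const std::string& w : split_words(line))
	{
		if (!stop_words.count(w)) ++freq[w];
	}
}
```

Calling count_line once per getline result accumulates the frequencies for a whole file in the same map.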

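2) Sort the words by frequency, from high to low

A map is ordered by key, so the entries are first copied into a vector of pairs, which I then sort directly with sort from the STL.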
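The sorting step can be isolated into a small function. The sketch below swaps std::sort for std::stable_sort, which is my variation rather than what the tool does: because a map iterates in alphabetical key order, a stable sort keeps words with equal counts in alphabetical order. The name sorted_by_count is hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Return the (word, count) pairs ordered by count, highest first.
// Ties keep the map's alphabetical order because the sort is stable.
std::vector<std::pair<std::string, int>> sorted_by_count(const std::map<std::string, int>& freq)
{
	std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
	std::stable_sort(items.begin(), items.end(),
		[](const auto& a, const auto& b) { return a.second > b.second; });
	return items;
}
```

With plain std::sort the relative order of words that share a count is unspecified, so two runs on different standard libraries may list such words in a different order.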

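3) Write out the statistics

The output has 5 + n lines, where n = min(10, num) and num is the number of distinct words kept after filtering. The first four lines are the character count including spaces and punctuation, the character count excluding spaces and punctuation, the word count, and the line count. The fifth line is the header High-frequency words(Top 10), and the next n lines list the most frequent words (all of them if there are fewer than 10).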

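IV. Notes

  1. The code requires C++20 or later, because it uses the ranges header (views::transform and views::take).
  2. Only all-English text is supported; the text must not contain any Chinese (I wanted to add Chinese statistics, but it is beyond my current ability).
  3. No line may end with part of a word: a word that does not fit on a line must not be split with a hyphen and continued on the next line, otherwise it is read as two incomplete words.
  4. Compound words are not recognized: the hyphen counts as punctuation, so a compound such as nutrient-dense is read as two separate words.
  5. The code may contain unknown bugs and room for optimization. If you find any, please let me know; I am a university student and want to improve.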
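The restriction in note 3 could be relaxed with a preprocessing pass that glues a line ending in "letter + hyphen" to the line that follows it. This is only a heuristic sketch and not part of the tool: it also merges a genuine compound that happens to break at its hyphen, and it reduces the number of lines, so the line count would have to be taken before the pass. The name join_hyphenated is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Merge every line that ends with a letter followed by '-' with the next line.
std::vector<std::string> join_hyphenated(const std::vector<std::string>& lines)
{
	std::vector<std::string> joined;
	bool continue_previous = false;
	for (const std::string& line : lines)
	{
		if (continue_previous) joined.back() += line; // glue onto the unfinished word
		else joined.push_back(line);

		std::string& current = joined.back();
		const std::size_t n = current.size();
		// A single trailing hyphen directly after a letter marks a split word.
		continue_previous = n >= 2 && current[n - 1] == '-'
			&& std::isalpha(static_cast<unsigned char>(current[n - 2]));
		if (continue_previous) current.pop_back(); // drop the hyphen itself
	}
	return joined;
}
```

A line ending in a run of dashes, such as "hunger---", is left alone because the character before the last hyphen is not a letter.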

V. Demo

Input text (for some reason, each paragraph is displayed on a single line):

Food and Health: The Foundation of a Good Life
"You are what you eat" is an old saying that holds more truth today than ever before. Our diet is not just about satisfying hunger---it directly shapes our physical health, energy levels, mental clarity, and even our long-term lifespan. In a world dominated by fast food and ultra-processed snacks, understanding the connection between what we eat and how we feel has become essential.
A balanced diet is the cornerstone of good health. It does not mean strict restrictions or giving up all the foods we love. Instead, it means eating a variety of nutrient-dense foods in the right proportions. This includes plenty of colorful fruits and vegetables, which are packed with vitamins, minerals, and antioxidants that protect our bodies from diseases. Whole grains like brown rice and oats provide sustained energy, while lean proteins such as fish, chicken, and beans help build and repair our muscles. Healthy fats from nuts, avocados, and olive oil are crucial for brain function and heart health.
Unfortunately, modern lifestyles have led many people to rely heavily on processed foods. These convenient options are often loaded with added sugars, salt, and unhealthy trans fats, but lack essential nutrients. Regular consumption of these foods can lead to a range of health problems, including obesity, heart disease, type 2 diabetes, and high blood pressure. They can also cause energy crashes, leaving us feeling tired and sluggish throughout the day.
What many people do not realize is that diet also has a profound impact on our mental health. Research has shown that a diet rich in whole foods can reduce the risk of depression and anxiety. The gut is often called our "second brain," and the food we eat affects the production of neurotransmitters like serotonin, which regulates our mood. On the contrary, a diet high in sugar and processed foods can disrupt this balance and worsen mood swings.
Making small, sustainable changes to your eating habits can have a huge impact on your overall health. Start by adding one extra serving of vegetables to your meals each day. Swap sugary drinks for water or herbal tea. Try cooking at home more often, so you can control the ingredients in your food. Remember, healthy eating is a journey, not a destination. It is okay to enjoy your favorite treats occasionally, as long as you maintain balance most of the time.
In conclusion, the food we choose to eat is one of the most powerful tools we have for taking care of our health. By nourishing our bodies with wholesome foods, we can increase our energy, improve our mood, and reduce the risk of chronic diseases. A healthy diet is not just about living longer---it is about living better.

Output statistics:

character count including spaces: 3193
character count excluding spaces: 2212
Totally words: 151
Totally lines: 7
High-frequency words(Top 10)
our: 12
health: 8
foods: 7
diet: 6
food: 5
not: 5
your: 5
energy: 4
eat: 4
healthy: 3

VI. Code

#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <unordered_set>
#include <ranges>
#include <cctype>
using namespace std;

class Text_Statistics {
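private: // data members
	int chars_with_spaces, chars_without_spaces;	// character counts (with / without spaces and punctuation)
	map<string, int> words_frequency;				// word frequencies
	vector<pair<string, int>> high_frequency_words;	// words sorted by frequency, highest first
	int word_count, line_count;						// word count, line count
	unordered_set<char> punctuation;				// space and punctuation characters

private: // helper functions
	/*----- read one word -----*/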
	void read_word(const string& text, int& i)
	{
		const int SIZE = text.size();
		string word;
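		while (i < SIZE)
		{
			// stop at a space or punctuation mark; the caller counts that character
			if (punctuation.count(text[i])) break;
			word += text[i];
			i++;
			chars_with_spaces++;
			chars_without_spaces++;
		}
		if (filter(word)) words_frequency[word]++; // record the word unless it is filtered out
		if (words_frequency.count(word)) word_count++; // only words that passed the filter are counted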
	}

	/*----- filter -----*/
	bool filter(const string& word)
	{
		// reject empty strings
		if (word.empty()) return false;

		// reject common function words (stop words)
		static const unordered_set<string> stop_words = {
			"a", "an", "the", "am", "is", "are", "was", "were",
			"be", "been", "being", "have", "has", "had", "do", "does", "did",
			"will", "would", "shall", "should", "may", "might", "can", "could",
			"of", "in", "on", "at", "to", "for", "with", "by", "about", "as",
			"but", "and", "or", "so", "if", "because", "when", "where", "which",
			"that", "this", "these", "those", "he", "she", "it", "we", "you", "they",
			"i", "me"
		};
		return !stop_words.count(word);
	}

	/*----- sort words by frequency -----*/
	void sort_by_value()
	{
		high_frequency_words.assign(words_frequency.begin(), words_frequency.end());
		sort(high_frequency_words.begin(), high_frequency_words.end(), [](const pair<string, int>& a, const pair<string, int>& b)
		{
			return a.second > b.second;
		});
	}

private: // core functions
	/*----- load the text -----*/
	void load_from_local()
	{
		ios::sync_with_stdio(false);
		cin.tie(nullptr);

		ifstream ifs;
		ifs.open("Text.txt", ios::in);
		if (!ifs.is_open()) throw runtime_error("Error: Failed to open file 'Text.txt'!");

		string buf;
		while (getline(ifs, buf))
		{
			line_count++;
			int idx = 0;
			const int SIZE = buf.size();
			// build a lowercase copy of the line
			auto lowerView = buf | views::transform([](unsigned char c)
			{
				return tolower(c);
			});
			string lowerStr(lowerView.begin(), lowerView.end());
			// read the words
			while (idx < SIZE)
			{
				const char character = buf[idx];
				// space or punctuation mark: count it and move on
				if (punctuation.count(character))
				{
					idx++;
					chars_with_spaces++;
				}
				else read_word(lowerStr, idx);
			}
		}
		ifs.close();
	}

	/*----- save the statistics -----*/
	void save_to_local()
	{
		ofstream ofs;
		ofs.open("Result.txt", ios::out);
		ofs << "character count including spaces: " << chars_with_spaces << endl;		// character count (with spaces and punctuation)
		ofs << "character count excluding spaces: " << chars_without_spaces << endl;	// character count (without spaces and punctuation)
		ofs << "Totally words: " << word_count << endl;	// word count (filtered stop words are not counted)
		ofs << "Totally lines: " << line_count << endl;	// line count
		ofs << "High-frequency words(Top 10)" << endl;	// most frequent words (top 10)
		for (const auto& p : high_frequency_words | views::take(10))
		{
			ofs << p.first << ": " << p.second << endl;
		}
		ofs.close();
	}

public:
	/*----- constructor -----*/
	Text_Statistics()
	{
		chars_with_spaces = chars_without_spaces = 0;
		word_count = line_count = 0;

		// register the space and ASCII punctuation: codes 32-47, 58-64, 91-96, 123-126
		for (int i = 32; i <= 47; i++) punctuation.insert(i);
		for (int i = 58; i <= 64; i++) punctuation.insert(i);
		for (int i = 91; i <= 96; i++) punctuation.insert(i);
		for (int i = 123; i <= 126; i++) punctuation.insert(i);
	}

	/*----- public entry point -----*/
	void run()
	{
		load_from_local();	// load first
		sort_by_value();	// then sort
		save_to_local();	// then write the results
	}
};

int main()
{
	try
	{
		Text_Statistics text_statistics;
		text_statistics.run();
		cout << "Statistics completed successfully!" << endl << "Results saved to Result.txt!" << endl;
	}
	catch (const runtime_error& e)
	{
		cerr << e.what() << endl;
		return 1;
	}
	return 0;
}