Author: Yu Lou
Written for Python 3.
Blog post: Analyzing 《红楼梦》 (Dream of the Red Chamber) with Python
Preprocess text.
- hlm.txt: original text
- preprocessing.txt: preprocessed text.
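The preprocessing step above could look like the following sketch, which assumes "preprocessing" means stripping everything except Chinese characters (the function name and the exact filtering rule are assumptions, not the script's actual code):

```python
import re

def preprocess(text):
    # Keep only CJK characters (U+4E00..U+9FFF); punctuation,
    # whitespace, and Latin text are stripped. This filtering rule
    # is an assumption about what the preprocessing script does.
    return re.sub(r'[^\u4e00-\u9fff]', '', text)
```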
Split text into chapters and preprocess them.
- hlm.txt: original text
- chapters(folder): preprocessed text. One file per chapter, numbered starting from "1.text".
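Chapter splitting can be sketched as a regex split on chapter headings; the heading pattern "第…回" is an assumption about the text's formatting:

```python
import re

def split_chapters(text):
    # Split on chapter headings of the form "第...回" (e.g. 第一回).
    # The heading pattern is an assumption; the real script may also
    # keep the heading or handle front matter differently.
    parts = re.split(r'第[零一二三四五六七八九十百]+回', text)
    # parts[0] is whatever precedes the first chapter heading.
    return parts[1:]
```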
Create dictionary.
- preprocessing.txt: preprocessed text.
- dict.csv: dictionary.
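Dictionary building from raw text is usually done by collecting frequent character n-grams as word candidates. The sketch below shows only that counting step; the real pipeline likely also scores candidates (e.g. by cohesion or boundary entropy), and all names here are assumptions:

```python
from collections import Counter

def build_candidates(text, max_len=4, min_count=2):
    # Count every substring of length 1..max_len and keep the
    # frequent ones as word candidates. A real dictionary builder
    # would additionally filter candidates by a quality score.
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {w: c for w, c in counts.items() if c >= min_count}
```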
Split words apart.
- preprocessing.txt: preprocessed text.
- dict.csv: dictionary.
- word_split.text: split text.
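One common dictionary-based way to split words is forward maximum matching; the actual algorithm used by the script may differ, and this function is only an illustrative sketch:

```python
def forward_max_match(text, dictionary, max_len=4):
    # Greedy forward maximum matching: at each position take the
    # longest dictionary word; fall back to a single character.
    result, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            w = text[i:i + n]
            if n == 1 or w in dictionary:
                result.append(w)
                i += n
                break
    return result
```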
Split words apart in all chapters.
- preprocessing.txt: preprocessed text.
- dict.csv: dictionary.
- chapters(folder): preprocessed text for all chapters.
- chapter_split(folder): split text. One file for each chapter, numbered from "1.text".
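Running the splitter over every chapter file amounts to a loop over the chapter folder; this sketch assumes the file layout described above (one "*.text" file per chapter) and takes the segmentation function as a parameter:

```python
from pathlib import Path

def split_all_chapters(chapter_dir, out_dir, segment):
    # Apply a segmentation function to every chapter file and write
    # the result, words joined by "/", to a same-named file in
    # out_dir. The "/" separator is an assumption.
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for f in sorted(Path(chapter_dir).glob("*.text")):
        words = segment(f.read_text(encoding="utf-8"))
        (out / f.name).write_text("/".join(words), encoding="utf-8")
```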
Count words.
- word_split.text: split text.
- word_count.csv: counting result, sorted by number of occurrences.
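Counting the split text reduces to tallying words and sorting by frequency, for example (the "/" separator is an assumption about the split-file format):

```python
from collections import Counter

def count_words(split_text, sep="/"):
    # Count occurrences of each word in segmented text and return
    # (word, count) pairs sorted by descending count.
    counts = Counter(w for w in split_text.split(sep) if w)
    return counts.most_common()
```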
Count words in each chapter.
- chapter_split(folder): split text for each chapter.
- word_count_chapters.csv: counting result. One row per word and one column per chapter.
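The per-chapter table described above (one row per word, one column per chapter) can be built like this; the function and its return shape are assumptions for illustration:

```python
from collections import Counter

def count_by_chapter(chapter_words):
    # chapter_words: a list of word lists, one per chapter.
    # Returns (vocab, matrix): one row per word, one column per
    # chapter, matching the CSV layout described above.
    counters = [Counter(ws) for ws in chapter_words]
    vocab = sorted({w for c in counters for w in c})
    matrix = [[c[w] for c in counters] for w in vocab]
    return vocab, matrix
```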
Perform PCA and show the result on screen.
"sklearn", "numpy" and "matplotlib" are needed to run this program.
- word_count_chapters.csv: word counting result for each chapter.
- components.csv: weights for each component.
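The actual script relies on sklearn's PCA; to illustrate what that computes, here is a minimal equivalent using only numpy's SVD (the function name and interface are assumptions):

```python
import numpy as np

def pca(X, n_components=2):
    # Minimal PCA: center the data, then project onto the top
    # right-singular vectors. The components array corresponds to
    # the per-component weights written to components.csv.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components
```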
Library for suffix tree.
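For orientation, a suffix structure supports fast substring queries over the whole text. The library presumably implements a proper suffix tree; the sketch below is only a naive uncompressed suffix trie, which is simpler but quadratic in space:

```python
class SuffixTrie:
    # Naive uncompressed suffix trie: insert every suffix; supports
    # substring membership queries. A real suffix tree (e.g. built
    # with Ukkonen's algorithm) compresses edges and builds in O(n).
    def __init__(self, text):
        self.root = {}
        for i in range(len(text)):
            node = self.root
            for ch in text[i:]:
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        # Walk the trie; pattern is a substring iff the walk succeeds.
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True
```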
Calculate the correctness of the word-splitting algorithm.
- *_answer.txt: answer.
- *_result.txt: result of the program.
("*" is file prefix)
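A standard way to score a segmentation against a reference is span-based precision/recall/F1; the script's exact metric may differ, and these function names are assumptions:

```python
def word_spans(words):
    # Convert a word list to a set of (start, end) character spans.
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def f1_score(answer, result):
    # Precision/recall/F1 over exactly matching word spans, a common
    # metric for word segmentation quality.
    a, r = word_spans(answer), word_spans(result)
    tp = len(a & r)
    p, rec = tp / len(r), tp / len(a)
    return 2 * p * rec / (p + rec) if p + rec else 0.0
```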