Author: Yu Lou
Written for Python 3.
Preprocess text.
- hlm.txt: original text
- preprocessing.txt: preprocessed text.
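The exact cleaning rules applied to hlm.txt are not documented above; a minimal sketch, assuming preprocessing means keeping only CJK characters and dropping punctuation, whitespace, and Latin text:

```python
import re

def preprocess(text):
    # Keep only CJK characters; punctuation, whitespace, and Latin
    # text are dropped. This is an illustrative guess at the cleaning
    # rules, not the documented behavior of the original script.
    return "".join(re.findall(r"[\u4e00-\u9fff]+", text))

# The result would be written to preprocessing.txt.
```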
Split text into chapters and preprocess them.
- hlm.txt: original text
- chapters(folder): preprocessed text. One file per chapter, numbered starting from "1.text".
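A possible chapter splitter, assuming each chapter in hlm.txt opens with a heading of the form "第…回" (the real heading pattern may differ):

```python
import re

def split_chapters(text):
    # re.split with a capturing group alternates prose and headings,
    # so pair each heading with the text that follows it.
    # The heading regex is an assumption about hlm.txt's format.
    parts = re.split(r"(第[一二三四五六七八九十百]+回)", text)
    return [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
```

Each returned chapter would then be written to "1.text", "2.text", ... inside the chapters folder.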
Create dictionary.
- preprocessing.txt: preprocessed text.
- dict.csv: dictionary.
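The criterion the dictionary builder uses is not described; one common unsupervised stand-in is to count every short substring and keep the frequent ones as candidate words:

```python
from collections import Counter

def build_dict(text, max_len=4, min_count=2):
    # Count every substring of length 1..max_len and keep those
    # occurring at least min_count times as candidate words.
    # A stand-in only; the original criterion is undocumented.
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {w: c for w, c in counts.items() if c >= min_count}
```

The resulting word/count pairs would be written out as dict.csv.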
Split words apart.
- preprocessing.txt: preprocessed text.
- dict.csv: dictionary.
- word_split.text: split text.
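One standard dictionary-based segmenter is forward maximum matching; the original splitter may use a different strategy, so treat this as an assumed sketch:

```python
def split_words(text, words, max_len=4):
    # Forward maximum matching: at each position take the longest
    # dictionary entry that matches, falling back to a single
    # character when nothing in the dictionary fits.
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in words:
                out.append(text[i:i + n])
                i += n
                break
    return out
```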
Split words apart in all chapters.
- preprocessing.txt: preprocessed text.
- dict.csv: dictionary.
- chapter(folder): preprocessed text for all chapters.
- chapter_split(folder): split text. One file per chapter, numbered starting from "1.text".
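The per-chapter variant applies the same segmentation to every chapter file. A sketch operating on in-memory chapter texts (file reading/writing omitted; the segmentation strategy is assumed, as above):

```python
def split_all_chapters(chapter_texts, words, max_len=4):
    # chapter_texts: {chapter number: chapter text}. Returns
    # {chapter number: token list}, one entry per chapter file,
    # using forward maximum matching (assumed algorithm).
    def segment(text):
        out, i = [], 0
        while i < len(text):
            for n in range(min(max_len, len(text) - i), 0, -1):
                if n == 1 or text[i:i + n] in words:
                    out.append(text[i:i + n])
                    i += n
                    break
        return out
    return {k: segment(t) for k, t in chapter_texts.items()}
```

Each token list would then be written to the matching "1.text", "2.text", ... in chapter_split.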
Count words.
- word_split.text: split text.
- word_count.csv: counting result, sorted by number of occurrences.
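The counting step can be sketched as:

```python
from collections import Counter

def count_words(tokens):
    # Sort word counts with the most frequent first, matching the
    # ordering described for word_count.csv.
    return sorted(Counter(tokens).items(), key=lambda kv: kv[1], reverse=True)
```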
Count words in each chapter.
- chapter_split(folder): split text for each chapter.
- word_count_chapters.csv: counting result. One line per word and one chapter per column.
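A sketch of the word-by-chapter table (one row per word, one column per chapter), taking already-split token lists as input:

```python
from collections import Counter

def count_by_chapter(chapter_tokens):
    # chapter_tokens: one token list per chapter. Returns
    # {word: [count in chapter 1, count in chapter 2, ...]}.
    counters = [Counter(toks) for toks in chapter_tokens]
    vocab = sorted(set().union(*counters))
    return {w: [c[w] for c in counters] for w in vocab}
```

Each row of the returned mapping corresponds to one line of word_count_chapters.csv.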
Do PCA analysis. Show result on screen.
"sklearn", "numpy" and "matplotlib" is needed to run this program.
- word_count_chapters.csv: word counting result for each chapters.
- components.csv: weights for each components.
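Since sklearn is named as a dependency, the PCA step presumably looks roughly like this (the matrix orientation and component export are assumptions; matplotlib plotting is left out):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_chapters(matrix, n_components=2):
    # matrix: one row per chapter, one column per word (the transpose
    # of word_count_chapters.csv). Returns the projected chapter
    # coordinates and the per-component word weights that would be
    # written to components.csv.
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(matrix)
    return coords, pca.components_
```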
Library for suffix tree.
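The library itself is not shown; as an illustration of the data structure, a naive suffix trie (not a compressed suffix tree, and O(n^2) space, so only suitable for short texts) that counts substring occurrences:

```python
class SuffixTrie:
    # Naive suffix trie: insert every suffix of the text. Each node
    # stores its children plus an occurrence counter under the
    # reserved key "#" (so the text must not contain "#" itself).
    # This is a sketch, not the library used by the original code.
    def __init__(self, text):
        self.root = {}
        for i in range(len(text)):
            node = self.root
            for ch in text[i:]:
                node = node.setdefault(ch, {"#": 0})
                node["#"] += 1

    def count(self, s):
        # Number of occurrences of the non-empty substring s.
        node = self.root
        for ch in s:
            if ch not in node:
                return 0
            node = node[ch]
        return node["#"]
```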
Calculate the correctness of word splitting algorithm.
- *_answer.txt: answer.
- *_result.txt: result of the program.
("*" is file prefix)