In the first phase of our project, we analyzed and tagged text from three datasets, each drawn from a different domain. The goal was to apply the same natural language processing steps (tokenization, stemming, and sentence tagging) to very different kinds of writing and compare what they revealed.
The first dataset came from the Archive of Our Own (AO3), a platform that hosts a wide range of fanfiction. We extracted the first chapters of the ten worst-rated English-language works last edited within the past 8 years. This sample covers a variety of writing styles within the fanfiction community and offers a look at works that received less favorable feedback. We applied the Treebank tokenizer and stemming to standardize the text, then tagged sentences to look for patterns or themes in the narrative.
The second dataset consisted of articles from The Onion, a satirical news website. Here the analysis focused on the linguistic characteristics of satirical writing. We tokenized and stemmed the text, then tagged a random selection of sentences to highlight features that distinguish satire.
The third dataset came from gaming: the descriptions of twenty randomly selected spells from the Dungeons & Dragons (D&D) tabletop roleplaying game. We applied the Treebank tokenizer and stemming to examine the language and structure of this rules-oriented text, and used sentence tagging to look for trends or patterns in the spell descriptions and the narrative style they share.
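The tokenize-then-stem step shared by all three datasets can be sketched as follows. This is a minimal illustration, not our exact pipeline: it assumes NLTK's TreebankWordTokenizer and, since the stemmer is not named above, NLTK's PorterStemmer. If NLTK is not installed, the sketch falls back to a plain regular-expression tokenizer and a crude plural stripper, which only approximate the real tools. The function names tokenize, stem, and preprocess are hypothetical.

```python
import re

try:
    # Assumed tooling: NLTK's Treebank tokenizer and Porter stemmer.
    # Neither needs any downloaded corpus data.
    from nltk.stem import PorterStemmer
    from nltk.tokenize import TreebankWordTokenizer

    _tokenizer = TreebankWordTokenizer()
    _stemmer = PorterStemmer()

    def tokenize(text):
        """Split text into Treebank-style tokens."""
        return _tokenizer.tokenize(text)

    def stem(token):
        """Reduce a token to its Porter stem (lowercased)."""
        return _stemmer.stem(token)

except ImportError:
    # Fallback stand-ins so the sketch still runs without NLTK.
    def tokenize(text):
        """Split text into runs of word characters and single punctuation marks."""
        return re.findall(r"\w+|[^\w\s]", text)

    def stem(token):
        """Lowercase and strip a plain trailing 's' (a rough approximation)."""
        token = token.lower()
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            return token[:-1]
        return token


def preprocess(text):
    """Tokenize a passage and stem every token, keeping token order."""
    return [stem(tok) for tok in tokenize(text)]


if __name__ == "__main__":
    print(preprocess("The wizard casts spells."))
```

The stemmed token list is what a later tagging step would consume. Punctuation tokens are kept here because sentence boundaries may matter for tagging.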
The second phase of our project focused on building a search engine over the DBLP (Digital Bibliography & Library Project) dataset. This dataset contains thousands of scientific research articles across a wide range of domains. The search engine lets users query by author name, title, publication year, and journal. It parses the bibliographic records and returns the entries that match the criteria in each query.
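One common way to support queries over author, title, year, and journal is a fielded inverted index, sketched below. This is an illustrative assumption, not a description of the engine we built: the Record class, the FieldedIndex class, and the sample records are all invented placeholders, and real DBLP records would first have to be parsed into this shape.

```python
from __future__ import annotations

import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one bibliographic entry."""
    key: str
    title: str
    authors: tuple[str, ...]
    year: int
    journal: str


def _terms(text):
    """Lowercase text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FieldedIndex:
    """Inverted index mapping (field, term) pairs to sets of record keys."""

    def __init__(self, records):
        self.postings = defaultdict(set)
        for rec in records:
            for term in _terms(rec.title):
                self.postings[("title", term)].add(rec.key)
            for author in rec.authors:
                for term in _terms(author):
                    self.postings[("author", term)].add(rec.key)
            for term in _terms(rec.journal):
                self.postings[("journal", term)].add(rec.key)
            self.postings[("year", str(rec.year))].add(rec.key)

    def search(self, **criteria):
        """Return sorted keys of records matching every term of every criterion."""
        result = None
        for field, value in criteria.items():
            for term in _terms(str(value)):
                hits = self.postings.get((field, term), set())
                result = hits if result is None else result & hits
        return sorted(result or [])


# Invented placeholder records, not taken from DBLP.
SAMPLE_RECORDS = [
    Record("rec/1", "Indexing Text Collections", ("Ann Alpha", "Bob Beta"), 2001, "Placeholder Journal A"),
    Record("rec/2", "Query Processing for Text Search", ("Bob Beta",), 2005, "Placeholder Journal B"),
    Record("rec/3", "Indexing Graph Data", ("Ann Alpha",), 2005, "Placeholder Journal A"),
]

if __name__ == "__main__":
    index = FieldedIndex(SAMPLE_RECORDS)
    print(index.search(title="indexing", year=2005))
```

Each criterion is treated as a conjunction: a record must contain every query term in the named field. Ranking, phrase matching, and partial-name matching are left out to keep the sketch short.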