
Clustering: What are the most common mechanisms nature uses for a particular function? #130

Open
bruffridge opened this issue Oct 5, 2022 · 2 comments

@bruffridge
Member

While we wait for mechanisms and functions to be extracted using OpenAI, we can try out clustering algorithms on AskNature functions and summaries as an approximation of mechanism.
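
As a starting point, a minimal sketch of that approach could look like the snippet below: TF-IDF over the AskNature function + summary text, k-means for the clusters, and the top terms per cluster as a rough proxy for mechanism. The file name, column names, and n_clusters=20 are placeholders, not real AskNature export fields.

```python
# Rough sketch only: cluster AskNature strategy text with TF-IDF + k-means.
# "asknature.csv", the column names, and n_clusters=20 are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

df = pd.read_csv("asknature.csv")                      # assumed: one row per strategy
texts = (df["function"] + " " + df["summary"]).fillna("")

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(texts)

km = KMeans(n_clusters=20, random_state=0, n_init=10)
df["cluster"] = km.fit_predict(X)

# Print the top TF-IDF terms closest to each cluster centroid to eyeball
# what "mechanism" each cluster approximates.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = [terms[j] for j in center.argsort()[-10:][::-1]]
    print(i, ", ".join(top))
```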

@bruffridge
Member Author

Some interns developed clusters of biology papers a few years ago. I'm putting a link to the code here in case it is of use.

https://github.com/nasa-petal/PeTaL/blob/legacy/petal/cluster.py

@AI-Complete

# -*- coding: utf-8 -*-

'''
Created on Wed Jul 11 07:52:04 2018
@author: bwhiteak and cbaumler
'''

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import spacy
from spacy.matcher import Matcher
import pyLDAvis
import pyLDAvis.sklearn
import string, json, sys, pickle

ignore = pickle.load(open("petal/data/cluster/ignore/ignore.p", "rb"))

def create_df0(text):
    """
    Format and parse a long string of abstracts into a DataFrame.

    Parameters:
    - text (str): A long string containing multiple abstracts.

    Returns:
    - DataFrame: A pandas DataFrame containing parsed abstracts.
    """
    try:
        if sys.platform == 'win32':
            line_sep = '\r\n'
        else:
            line_sep = '\n'

        # Records are separated by two blank lines; the first line of each
        # record is the title and the remainder is the abstract body.
        text_list = [y.strip() for y in text.split(sep=(line_sep * 3))]
        title_list = [x.split(line_sep, 1) for x in text_list]
        if title_list[-1] == ['']:
            del title_list[-1]

        nlp = spacy.load('en_core_web_sm')
        matcher = Matcher(nlp.vocab)
        # spaCy 2.x Matcher signature: add(key, on_match, pattern).
        # Matches a token, a hyphen, and a token (e.g. "self-cleaning").
        matcher.add("hyphen", None, [{}, {"TEXT": "-"}, {}])

        for i in range(len(title_list)):
            if len(title_list[i]) > 1:
                doc = nlp(title_list[i][1])
                doc = match_n_merge(matcher, doc)
                n, v, vd, al = get_split_tokens(doc)
                title_list[i] = [title_list[i][0], n, v, vd, al]

        return pd.DataFrame(title_list, columns=['Title', 'n', 'v', 'vd', 'all'])

    except Exception as e:
        print(f"Error processing abstracts: {e}")
        return pd.DataFrame()
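
# Usage sketch (not in cluster.py itself): create_df0 expects titles and
# abstracts separated by two blank lines, with the title on the first line of
# each record. "abstracts.txt" is a hypothetical input file, and the call only
# works alongside match_n_merge / get_split_tokens from the full script.
#
#   raw = open("abstracts.txt", encoding="utf-8").read()
#   df0 = create_df0(raw)
#   print(df0[['Title', 'all']].head())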

def make_doc_dict(doc_list, assoc_topics):
    """
    Create a dictionary mapping document titles to their associated topics.

    Parameters:
    - doc_list (list): List of document titles.
    - assoc_topics (list): List of topics associated with each document.

    Returns:
    - dict: A dictionary where each key is a document title and the value is the associated topics.
    """
    try:
        # Pair each document title with its associated topics, as the docstring describes.
        return {doc: topic for doc, topic in zip(doc_list, assoc_topics)}
    except Exception as e:
        print(f"Error creating document dictionary: {e}")
        return {}

def make_topic_dict(model_list, feature_names, n_top_words):
    """
    Generates a dictionary of topics with their corresponding terms.

    Parameters:
    - model_list (list): List of model components.
    - feature_names (list): List of feature names from the model.
    - n_top_words (int): Number of top words to include for each topic.

    Returns:
    - dict: A dictionary where each key is a topic number and value is a string of top terms.
    """
    try:
        all_mat = np.vstack([model.components_ for model in model_list])
        topic_dict = {
            i: " ".join(feature_names[idx] for idx in topic.argsort()[:-n_top_words - 1:-1])
            for i, topic in enumerate(all_mat)
        }
        return topic_dict
    except Exception as e:
        print(f"Error in making topic dictionary: {e}")
        return {}
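
# Usage sketch (not from the original script): fit a small LDA model on a toy
# corpus and label its topics with make_topic_dict. The corpus, n_components=3,
# and n_top_words=5 are placeholder values for illustration only.
#
#   example_docs = ["gecko feet adhere using van der waals forces",
#                   "lotus leaves shed water via surface microstructure",
#                   "spider silk combines strength with elasticity"]
#   cv = CountVectorizer(stop_words='english')
#   counts = cv.fit_transform(example_docs)
#   lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
#   topics = make_topic_dict([lda], cv.get_feature_names_out(), 5)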

def make_subset_dict(df):
    """
    Creates a dictionary mapping index values to document titles from a DataFrame.

    Parameters:
    - df (DataFrame): The DataFrame containing document indices and titles.

    Returns:
    - dict: Dictionary with index values as keys and document titles as values.
    """
    try:
        return dict(zip(df.index.values, df['Title'].tolist()))
    except Exception as e:
        print(f"Error creating subset dictionary: {e}")
        return {}

def get_lemma(token):
    """
    Extracts the lemma of a given token, with special handling for certain cases.

    Parameters:
    - token (Token): A Spacy token object.

    Returns:
    - str: The lemma of the token.
    """
    if token.text == "species":
        return "species"  # Handling exception for the word 'species'
    return token.lemma_

[Continuation of other functions and script logic]
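
This is not the actual continuation of cluster.py, but a rough guess at how the pieces above might be tied together: build a document-term matrix from the parsed abstracts, fit LDA, label topics with make_topic_dict, and export an interactive pyLDAvis panel. The function name, n_topics=10, and the output path are made up for illustration, and the sketch relies on the imports at the top of the script.

```python
# Rough end-to-end sketch, not the remainder of cluster.py.
# `docs` is assumed to be a list of abstract strings, e.g. the 'all' column of
# the DataFrame returned by create_df0.
def cluster_sketch(docs, n_topics=10, out_html="lda.html"):
    cv = CountVectorizer(stop_words='english', max_df=0.95, min_df=2)
    dtm = cv.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(dtm)
    # Human-readable topic labels via the helper defined above.
    topics = make_topic_dict([lda], cv.get_feature_names_out(), 10)
    # Interactive topic browser (pyLDAvis < 3.4 ships the sklearn module imported above).
    panel = pyLDAvis.sklearn.prepare(lda, dtm, cv)
    pyLDAvis.save_html(panel, out_html)
    return topics
```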
