- Introduction
- Documentation
Self-Supervised Classifier is a set of functions that together form a model for classifying sentences as relevant or not. The approach was inspired by Banko et al.'s 2007 paper "Open Information Extraction from the Web", which used a self-supervised learner to perform open information extraction. We take much the same approach to relevancy classification: the learner first tags certain sentences as relevant or irrelevant based on keyword input, and Doc2Vec is then trained on these tagged sentences to learn more complex features.
Converts the text to lower case and uses a regular expression to remove all non-alphanumeric characters except the apostrophe (') and hyphen (-). All words are then lemmatized and stemmed.
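As a rough illustration of the keyword-based tagging step described above, the sketch below labels a sentence as relevant when it contains at least one keyword. The function name `label_sentence`, the whole-word matching rule, and the sample sentences are assumptions made for this example; the project's actual tagging rules may differ.

```python
import re


def label_sentence(sentence, keywords):
    """Label a sentence 'relevant' if it contains any keyword as a whole word.

    Hypothetical helper: the matching rule is an assumption, not the
    project's actual tagging logic.
    """
    words = set(re.findall(r"[a-z0-9'-]+", sentence.lower()))
    wanted = {keyword.lower() for keyword in keywords}
    return "relevant" if words & wanted else "irrelevant"


sentences = [
    "The lab studies machine learning for robotics.",
    "Parking is available behind the building.",
]
keywords = ["learning", "robotics"]
labels = [label_sentence(s, keywords) for s in sentences]
print(labels)
```

Sentences labeled this way can then serve as training data for Doc2Vec, which is what makes the approach self-supervised: no hand-labeled corpus is needed.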
Parameters
- doc (string or list of strings) : a sentence, either as a string or as a list of words
Returns
- list of strings : the sentence after the Natural Language Processing techniques have been applied
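The preprocessing steps above could be sketched as follows. This is a minimal approximation, not the project's implementation: the name `normalize` is hypothetical, and the lemmatizer and stemmer are passed in as plain callables (defaulting to the identity function) so that the example runs without any NLP library installed. In practice one might pass in a lemmatizer and a stemmer from a library such as NLTK.

```python
import re


def normalize(doc, lemmatize=lambda word: word, stem=lambda word: word):
    """Lower-case, strip punctuation, then lemmatize and stem each word.

    `lemmatize` and `stem` are hypothetical hooks; the identity defaults
    leave words unchanged.
    """
    # Accept either a single string or a list of words.
    text = doc if isinstance(doc, str) else " ".join(doc)
    text = text.lower()
    # Drop everything except ASCII letters, digits, apostrophe, hyphen
    # and whitespace (whitespace is kept so words can still be split).
    text = re.sub(r"[^a-z0-9'\s-]", "", text)
    return [stem(lemmatize(word)) for word in text.split()]


print(normalize("Don't STOP-me, now!"))
print(normalize(["Hello,", "World"]))
```

Note that this character class is ASCII-only, so accented letters would also be removed; whether the original regular expression behaves the same way is not stated.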
Produces TaggedDocuments from the documents in the Profile Manager instance, labels them appropriately, and saves them in "../data/profilemanager/TaggedDocuments/Labeled/". Multithreading-safe via the instances/iam parameters.
Parameters
- pm (ProfileManager) : a ProfileManager with the documents you would like to tag
- instances (int) : the number of instances you'd like to run in parallel
- iam (int) : the current instance's assignment, in the range [0, instances) (default = 0)
Returns
- None
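One common way to split work with an `instances`/`iam` pair is to give each instance every `instances`-th document, starting at offset `iam`. The sketch below shows that scheme; it is an assumption about how the partitioning might work, and `my_share` is a hypothetical helper rather than a function from this module.

```python
def my_share(items, instances, iam=0):
    """Return the items assigned to instance `iam` out of `instances` workers.

    Assumed scheme: instance k handles items k, k + instances, k + 2 * instances, ...
    """
    if not 0 <= iam < instances:
        raise ValueError("iam must be in the range [0, instances)")
    return items[iam::instances]


docs = [f"doc{i}" for i in range(7)]
shares = [my_share(docs, instances=3, iam=i) for i in range(3)]
print(shares)
```

Because the shares are disjoint and together cover every document, parallel instances never process the same document twice.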
Produces TaggedDocuments from the documents in the Web Resource Manager instance, labels them appropriately, and saves them in "../data/TaggedDocuments/Labeled/". Multithreading-safe using instances/iam.
Parameters
- manager (WebResourceManager) : a WebResourceManager with the documents you would like to tag
Returns
- None
Returns the TF-IDF score of the document, computed with the given tfidf instance.
Parameters
- tfidf (sklearn.feature_extraction.text.TfidfVectorizer) : a TfidfVectorizer with 'english' stop words, fitted to the corpus
- document (string) : the document to score
Returns
- float : the TF-IDF score of the document
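The document does not say how the per-term TF-IDF weights are reduced to a single float. One plausible reading, shown below purely as an assumption, is to sum the weights of the document's terms. `tfidf_score` and the sample corpus are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus used only to obtain a fitted vectorizer for the example.
corpus = [
    "the professor studies machine learning",
    "the campus has a large parking lot",
    "machine learning models need data",
]
tfidf = TfidfVectorizer(stop_words="english").fit(corpus)


def tfidf_score(tfidf, document):
    """Sum the TF-IDF weights of the document's terms (assumed aggregation)."""
    # transform() returns a 1 x vocabulary-size sparse row of weights.
    return float(tfidf.transform([document]).sum())


score = tfidf_score(tfidf, "machine learning data")
unknown = tfidf_score(tfidf, "zebra")
print(score, unknown)
```

A document made up only of words outside the fitted vocabulary scores 0.0 under this scheme, since it has no weighted terms.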
Scores the TaggedDocuments in "../data/profilemanager/TaggedDocuments/Labeled" using the Profile Manager Doc2Vec model and saves them in "../data/profilemanager/TaggedDocuments/Classified/".
Returns
- None
Scores the TaggedDocuments in "../data/TaggedDocuments/Labeled" using the Web Resource Manager Doc2Vec model and saves them in "../data/TaggedDocuments/Classified/".
Returns
- None
Scores the documents in the Profile Manager instance using TF-IDF and saves them in "../data/profilemanager/TaggedDocuments/Classified/". Multithreading safe using instances/iam.
Parameters
- tfidf (sklearn.feature_extraction.text.TfidfVectorizer) : a TfidfVectorizer with 'english' stop words, fitted to the Profile Manager corpus
- instances (int) : the number of instances you'd like to run in parallel
- iam (int) : the current instance's assignment, in the range [0, instances) (default = 0)
Returns
- None
Scores the documents in the Web Resource Manager instance using TF-IDF and saves them in "../data/TaggedDocuments/Classified/".
Parameters
- tfidf (sklearn.feature_extraction.text.TfidfVectorizer) : a TfidfVectorizer with 'english' stop words, fitted to the Web Resource Manager corpus
Returns
- None
Trains a Profile Manager Doc2Vec model using the TaggedDocuments in "../data/profilemanager/TaggedDocuments/Labeled/" and saves the model at "../data/profilemanager/doc2vec_model"
Returns
- None
Trains a Web Resource Manager Doc2Vec model using the TaggedDocuments in "../data/TaggedDocuments/Labeled" and saves the model at "../data/doc2vec_model"
Returns
- None
Instantiates and trains a TF-IDF Vectorizer using the English stop words
Parameters
- corpus (list of strings) : the corpus you would like to perform TF-IDF on
Returns
- sklearn.feature_extraction.text.TfidfVectorizer : a TfidfVectorizer with 'english' stop words, fitted to the data
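A minimal sketch of fitting such a vectorizer with scikit-learn follows. The function name `fit_tfidf` and the two-sentence corpus are hypothetical; only `TfidfVectorizer` and its `stop_words="english"` option come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_tfidf(corpus):
    """Fit a TfidfVectorizer with scikit-learn's built-in English stop words."""
    return TfidfVectorizer(stop_words="english").fit(corpus)


corpus = [
    "The lab studies machine learning",
    "The lab is on the ground floor",
]
tfidf = fit_tfidf(corpus)

# Stop words such as "the" are excluded from the learned vocabulary.
print(sorted(tfidf.vocabulary_))
```

The fitted instance can then be handed to the TF-IDF scoring functions, which expect a vectorizer already fitted to the relevant corpus.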
Returns a TF-IDF instance trained on the Profile Manager instance
Returns
- sklearn.feature_extraction.text.TfidfVectorizer : a TfidfVectorizer with 'english' stop words, fitted to the Profile Manager corpus
Returns a TF-IDF instance trained on the Web Resource Manager instance
Returns
- sklearn.feature_extraction.text.TfidfVectorizer : a TfidfVectorizer with 'english' stop words, fitted to the Web Resource Manager corpus