categorization

Notes

This software requires AutoPhrase to extract domain keyphrases. Please refer to the AutoPhrase documentation for installation and usage.

Requirements

The instructions below use Ubuntu as an example.

python 2.7

$ sudo apt-get install python

Other Python packages

$ sudo pip install -r requirements.txt

Build

Build AutoPhrase from the terminal using its Makefile.

$ cd SegPhrase
$ make

Run

First, to fit our model to a corpus, run

$ bash ./train.sh

and specify the input/output by changing the following variables at the beginning of the script

TEXT: the input text file
MODEL: the path to store the model

Then, to query a specific set of documents, run

$ bash ./test.sh

and specify the input/output by changing the following variables at the beginning of the script

TEXT: the input text file
CATEGORY_SEEDCONCEPTS: one or more sets of concepts to query
MODEL: the path for the stored model you wish to use
SEGGED_TEXT_categorized: the final output

Input Format

The input files specified by TEXT for both train.sh and test.sh should contain one document per line.
Each line of the CATEGORY_SEEDCONCEPTS file should follow the format [category name]\t[concept1],[concept2],[concept3]...; the file may contain one or more lines.
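
As a quick way to check a CATEGORY_SEEDCONCEPTS file before running test.sh, the format above can be parsed with a few lines of Python. This is a minimal sketch rather than code from this repository: the function name `parse_seed_concepts` and the example categories and concepts are made up for illustration.

```python
def parse_seed_concepts(lines):
    """Parse lines of the form '<category name>\\t<concept1>,<concept2>,...'.

    Hypothetical helper for illustration; not part of this repository.
    """
    categories = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split the category name from the concept list at the first tab.
        name, _, concept_field = line.partition("\t")
        concepts = [c.strip() for c in concept_field.split(",") if c.strip()]
        categories[name.strip()] = concepts
    return categories


# Two made-up lines in the documented format.
example_lines = [
    "method\tfourier transform,wavelet transform,kalman filter\n",
    "application\tspeech recognition,image compression\n",
]
seeds = parse_seed_concepts(example_lines)
print(seeds["method"])
```

To check a real file, pass an open file object instead of `example_lines`; a category whose concept list comes back empty usually means the tab separator is missing on that line.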

Output Format

The query relevance output, written to the path specified by SEGGED_TEXT_categorized at the beginning of test.sh, has the format [relevance to concept set1], [relevance to concept set2]...
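
The output can be read back and matched against the category names. The sketch below assumes, without confirmation from the source, that each output line corresponds to one input document, that the scores are plain numbers separated by commas, and that they appear in the same order as the lines of the CATEGORY_SEEDCONCEPTS file. The function names and sample values are hypothetical.

```python
def parse_relevance_line(line, category_names):
    """Map one output line to {category name: relevance score}.

    Assumes comma-separated numeric scores in category order (an assumption,
    not something the README states explicitly).
    """
    scores = [float(field) for field in line.strip().split(",")]
    if len(scores) != len(category_names):
        raise ValueError(
            "expected {} scores, got {}".format(len(category_names), len(scores))
        )
    return dict(zip(category_names, scores))


def best_category(line, category_names):
    """Return the category with the highest relevance score on this line."""
    scores = parse_relevance_line(line, category_names)
    return max(scores, key=scores.get)


# Made-up output line for two made-up categories.
names = ["method", "application"]
row = parse_relevance_line("0.82, 0.10\n", names)
top = best_category("0.82, 0.10\n", names)
print(row, top)
```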

Hyper-Parameters

The run-time parameters are located in the conf.d folder, which includes autoPhrase.conf and pyConfig.conf.

autoPhrase.conf contains the parameters for concept extraction and segmentation.

pyConfig.conf contains the parameters for all other training steps.
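
The parameters documented in this section are written as KEY=VALUE pairs (for example MIN_SUP=10). If you want to inspect them from Python, a small reader along the following lines may help. It is a sketch under the assumption that the files contain one KEY=VALUE pair per line; treating `#` as a comment marker is also an assumption, and `parse_conf` is a hypothetical name, not a function in this repository.

```python
def parse_conf(text):
    """Read KEY=VALUE lines into a dict of strings.

    Assumes one pair per line; '#' starting a comment is an assumption.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue  # skip blanks and anything that is not KEY=VALUE
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


# Sample text built from parameter values quoted in this README.
sample = "MIN_SUP=10\nHIGHLIGHT_MULTI=0.5\nMIN_NEIGHBOR_SIMILARITY=.6\n"
params = parse_conf(sample)
min_sup = int(params["MIN_SUP"])
min_similarity = float(params["MIN_NEIGHBOR_SIMILARITY"])
print(min_sup, min_similarity)
```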

segphrase.conf

MIN_SUP=10

A hard threshold of raw frequency is specified for frequent phrase mining, which will generate a candidate set.
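
To illustrate what a hard frequency threshold does, the sketch below counts word n-grams and keeps only those that occur at least `min_sup` times. It is a deliberately naive stand-in: the frequent phrase mining actually performed by the phrase-extraction tool is more involved, and the `max_len` parameter and the sample documents are invented for this example.

```python
from collections import Counter


def frequent_phrases(documents, min_sup, max_len=3):
    """Return {phrase: count} for n-grams with raw frequency >= min_sup.

    Naive n-gram counting for illustration only; max_len is hypothetical.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # The hard threshold: anything rarer than min_sup is discarded.
    return {phrase: c for phrase, c in counts.items() if c >= min_sup}


docs = [
    "fast fourier transform",
    "fourier transform of a signal",
    "discrete fourier transform",
]
candidates = frequent_phrases(docs, min_sup=3)
print(sorted(candidates))
```

Raising MIN_SUP shrinks the candidate set and tends to suit larger corpora; lowering it admits rarer phrases at the cost of more noise.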

HIGHLIGHT_MULTI=0.5

The threshold for multi-word phrases to be recognized as quality phrases.

HIGHLIGHT_SINGLE=0.5

The threshold for single-word phrases to be recognized as quality phrases.
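
The two highlight thresholds can be pictured as a filter over scored phrases. The sketch below assumes that HIGHLIGHT_MULTI applies to phrases of two or more words and HIGHLIGHT_SINGLE to single words, and that a phrase passes when its score is at least the threshold; the comparison direction, the function name, and the scores are assumptions made for illustration.

```python
def is_quality_phrase(phrase, score, highlight_multi=0.5, highlight_single=0.5):
    """Apply the multi-word or single-word threshold depending on phrase length.

    Using '>=' for the comparison is an assumption, not taken from the source.
    """
    is_multi_word = len(phrase.split()) > 1
    threshold = highlight_multi if is_multi_word else highlight_single
    return score >= threshold


# Made-up quality scores.
scored = {"fourier transform": 0.9, "signal": 0.4, "of the": 0.1}
selected = [p for p, s in sorted(scored.items()) if is_quality_phrase(p, s)]
print(selected)
```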

learning_embedding.conf

USE_CONCEPT_GRAPH=1

Whether to compute relevance with the concept graph distance algorithm (when set to 1) or with the basic query expansion algorithm (when set to 0).

MIN_NEIGHBOR_SIMILARITY=.6

Minimum similarity for a pair of concepts to be considered neighbors in the concept graph.

MIN_CATEGORY_NEIGHBOR=3

Minimum number of neighbors that each concept in the query concept set should have.

MAX_NEIGHBORS=100

Maximum number of neighbors that a concept in the concept graph will have.
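
One way to picture how the three neighbor parameters could interact is sketched below. This is an interpretation, not the repository's implementation: it assumes that neighbors are ranked by similarity, that those below MIN_NEIGHBOR_SIMILARITY are dropped, that query concepts are topped up to MIN_CATEGORY_NEIGHBOR from the best remaining candidates, and that the list is capped at MAX_NEIGHBORS. The function name and similarity values are hypothetical.

```python
def select_neighbors(similarities, min_similarity=0.6, min_neighbors=0,
                     max_neighbors=100):
    """Pick graph neighbors for one concept from {candidate: similarity}.

    Illustrative interpretation of the parameters, not the actual algorithm.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    kept = [(c, s) for c, s in ranked if s >= min_similarity]
    if len(kept) < min_neighbors:
        # Too few neighbors pass the threshold: fall back to the top-ranked
        # candidates, even if some are below it (assumed behavior).
        kept = ranked[:min_neighbors]
    return kept[:max_neighbors]


# Made-up similarities to one source concept.
sims = {"wavelet transform": 0.8, "fft": 0.7, "kalman filter": 0.5, "speech": 0.2}
regular = select_neighbors(sims)
query = select_neighbors(sims, min_neighbors=3)
capped = select_neighbors(sims, max_neighbors=1)
print(regular, query, capped)
```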
