Guidance on Aspect Extraction with a Taxonomy #13752
elspanishgeek started this conversation in Help: Best practices
Hi spaCy community,
I hope this is the right place for my question. I’ve looked around for similar discussions but haven’t found a definitive answer. I’m trying to avoid some common pitfalls, so I’d appreciate any guidance on the best approach to perform aspect extraction given a taxonomy. For concreteness, I’ll use restaurant reviews as an example.
So far I'm seeing a few approaches:
1) NER + DependencyMatcher
Entity Recognition:
Build a rich set of entities for each taxonomy category. For the SERVICE category, for example: chef, cook, waiter, waitress, waitstaff, server, bartender, barman, barwoman, etc.
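One way to seed such entities is a rule-based entity ruler. The snippet below is a minimal sketch, assuming spaCy v3's built-in `entity_ruler` component as a stand-in for (or complement to) a trained NER. The term list is the SERVICE example above; the `TAXONOMY` name and the review sentence are made up for illustration.

```python
import spacy

# SERVICE terms from the example above; extend per taxonomy category.
TAXONOMY = {
    "SERVICE": ["chef", "cook", "waiter", "waitress", "waitstaff",
                "server", "bartender", "barman", "barwoman"],
}

# A blank pipeline keeps the sketch runnable without downloading a model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Token pattern on LOWER so "Waiter" and "waiter" both match.
    {"label": label, "pattern": [{"LOWER": term}]}
    for label, terms in TAXONOMY.items()
    for term in terms
])

# Hypothetical review sentence, for illustration only.
doc = nlp("The Waiter was rude, but the bartender made up for it.")
aspects = [(ent.text, ent.label_) for ent in doc.ents]
print(aspects)
```

Plain string matching will also tag ambiguous words ("cook" as a verb, "server" as hardware), so in a full pipeline it may be worth constraining the patterns with POS tags or letting a trained NER decide.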
Dependency Pattern Matching:
Create multiple DependencyMatcher patterns to capture phrases anchored to the entities, like:
Example Pattern:
I can also expand the matched tokens to include the subtrees they belong to or add modifiers (like an "advmod" on the adjective) to capture intensifiers such as “very” or “kinda.”
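To make the idea concrete, here is a minimal sketch of one such pattern. It is not the pattern from my project: the pattern, the `SERVICE_OPINION` name, and the sentence are hypothetical, and the parse is built by hand so the snippet runs without a model (in practice the parser in en_core_web_trf would supply heads and dependency labels).

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

SERVICE_TERMS = ["chef", "cook", "waiter", "waitress", "waitstaff",
                 "server", "bartender", "barman", "barwoman"]

nlp = spacy.blank("en")

# Hand-built parse of a hypothetical sentence; heads are absolute indices.
words = ["The", "waiter", "was", "very", "rude", "."]
heads = [1, 2, 2, 4, 2, 2]
deps = ["det", "nsubj", "ROOT", "advmod", "acomp", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Hypothetical pattern: a SERVICE noun is the subject of a verb that
# also governs an adjectival complement, e.g. "waiter ... rude".
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"DEP": "ROOT"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "aspect",
     "RIGHT_ATTRS": {"DEP": "nsubj", "LOWER": {"IN": SERVICE_TERMS}}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "opinion",
     "RIGHT_ATTRS": {"DEP": "acomp"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("SERVICE_OPINION", [pattern])

extractions = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the order of the pattern nodes: verb, aspect, opinion.
    _, aspect_i, opinion_i = token_ids
    opinion = doc[opinion_i]
    # Expand the adjective to its subtree to pick up intensifiers ("very").
    phrase = doc[opinion.left_edge.i : opinion.right_edge.i + 1]
    extractions.append((doc[aspect_i].text, phrase.text))
print(extractions)
```

Expanding to the subtree avoids writing a second pattern just for the optional "advmod" child, at the cost of occasionally pulling in more tokens than wanted.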
(Optional) Classification:
Use a multiclass TextCat on the extracted phrases to assign a final category label (or none), and use that label to filter out extractions that are not useful.
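A rough sketch of that filtering step, assuming spaCy v3's `textcat` component (mutually exclusive classes) with an explicit NONE label. The toy training phrases, the FOOD label, the `keep` helper, and the 0.5 threshold are all hypothetical; with this little data the predictions are not meaningful, so the snippet only shows the mechanics.

```python
import spacy
from spacy.training import Example

# Hypothetical toy annotations: matched phrase -> single category (or NONE).
TRAIN = [
    ("waiter was very rude", "SERVICE"),
    ("bartender was friendly", "SERVICE"),
    ("pasta was cold", "FOOD"),
    ("pizza was delicious", "FOOD"),
    ("parking was easy", "NONE"),
    ("weather was nice", "NONE"),
]
LABELS = sorted({label for _, label in TRAIN})

nlp = spacy.blank("en")
nlp.add_pipe("textcat")  # "textcat" treats the classes as mutually exclusive

examples = [
    Example.from_dict(
        nlp.make_doc(text),
        {"cats": {label: float(label == gold) for label in LABELS}},
    )
    for text, gold in TRAIN
]

# Labels are inferred from the examples during initialization.
optimizer = nlp.initialize(lambda: examples)
for _ in range(10):
    nlp.update(examples, sgd=optimizer)

def keep(phrase, threshold=0.5):
    """Return the predicted category, or None if the phrase should be dropped."""
    cats = nlp(phrase).cats
    label = max(cats, key=cats.get)
    if label == "NONE" or cats[label] < threshold:
        return None
    return label

print(keep("waiter was very rude"))
```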
Sentiment Scoring:
Run a separate sentiment classification model on the matched phrases.
2) TextCat/SpanCat
Data Preparation:
Keep documents short (from a sentence up to a few sentences) and train a multilabel TextCat or SpanCat model based on the available annotations.
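For reference, this is roughly what one annotated training document could look like for either component. It is a minimal sketch, assuming spaCy v3: `doc.cats` holds the multilabel TextCat annotation and the span group "sc" (spaCy's default key) holds the SpanCat annotation. The sentence and the FOOD and AMBIENCE labels are hypothetical.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Hypothetical short document (one sentence) with hand-written annotations.
doc = nlp.make_doc("The waiter was rude but the pasta was great.")

# Multilabel TextCat: every label gets an independent 0/1 value.
doc.cats = {"SERVICE": 1.0, "FOOD": 1.0, "AMBIENCE": 0.0}

# SpanCat: labelled spans live in a span group keyed by "sc".
doc.spans["sc"] = [
    Span(doc, 1, 2, label="SERVICE"),  # "waiter"
    Span(doc, 6, 7, label="FOOD"),     # "pasta"
]

# DocBin is the binary format that `spacy train` reads; round-trip it here
# to check that both kinds of annotation survive serialization.
doc_bin = DocBin(docs=[doc])
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab))[0]
span_annotations = [(span.text, span.label_) for span in restored.spans["sc"]]
print(restored.cats, span_annotations)
```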
For this approach, I do have some extra questions:
I’m currently using the en_core_web_trf pipeline, which includes a Tagger, Parser, Lemmatizer, and NER all sharing the same transformer. If I add a TextCat or SpanCat, should I:
If I opt for the transformer approach, should I:
[...] replace_listeners so that the new component gets its own transformer (at the expense of a heftier pipeline)?
If added as a listener, my understanding is that there are two main options:
[...] en_core_web_trf) during the SpanCat/TextCat Prodigy annotation process so that my dataset would have everything? And if so, is this desirable anyway?
Sentiment Scoring:
3) NER/SpanCat + Relation
If I'm reading this correctly, this could be a viable approach if the NER is treated as a SpanCat for the "aspect" and the relation label then allows for more nuanced extraction. Continuing my above example and making it more complex:
Sentiment Scoring:
Follow this model with a TextCat for sentiment classification on the aspect-labeled entities/spans.
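To illustrate the data layout I have in mind for approach 3, here is a minimal sketch. Everything in it is hypothetical: the sentence, the OPINION and FOOD labels, the DESCRIBED_AS relation, and the `rel` extension attribute. The relations are written by hand where a trained relation component would predict them.

```python
import spacy
from spacy.tokens import Doc, Span

# Hypothetical extension: maps (aspect_start, opinion_start) -> {relation: score}.
# A trained relation component would fill this in; here it is set by hand.
if not Doc.has_extension("rel"):
    Doc.set_extension("rel", default=None)

nlp = spacy.blank("en")
doc = nlp.make_doc("The waiter was rude but the pasta was great.")

# Aspect and opinion spans, as a SpanCat might predict them.
doc.spans["sc"] = [
    Span(doc, 1, 2, label="SERVICE"),  # aspect: "waiter"
    Span(doc, 3, 4, label="OPINION"),  # "rude"
    Span(doc, 6, 7, label="FOOD"),     # aspect: "pasta"
    Span(doc, 8, 9, label="OPINION"),  # "great"
]
doc._.rel = {
    (1, 3): {"DESCRIBED_AS": 1.0},
    (6, 8): {"DESCRIBED_AS": 1.0},
}

# Join aspects to their opinions; these pairs are what a sentiment
# TextCat would then score.
by_start = {span.start: span for span in doc.spans["sc"]}
pairs = [
    (by_start[a].label_, by_start[a].text, by_start[o].text)
    for (a, o), scores in doc._.rel.items()
    if scores.get("DESCRIBED_AS", 0.0) >= 0.5
]
print(pairs)
```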
Thank you in advance to anyone who can share their thoughts/recommendations on these approaches!