- The National Assembly of the Republic of Korea is making various efforts to detect and respond to public opinion on major social issues through opinion polls and the media.
- However, these methods are limited in how objectively they can predict public opinion, and considerable time and cost are incurred in moving from drafting a solution to actual legislation. For example, in the 20th National Assembly, only 13.2% of all bills were approved, and the average processing time was 577.2 days.
- Therefore, in this project we propose a natural language processing based artificial intelligence model that can efficiently predict online public opinion, and we expect it to be used as a policy decision-support tool in various fields.
- The Korean National Assembly provided news articles, tweets, and online community data related to major legislation in Korea.
- Online comment and review data were additionally collected for fine-tuning the language model.
To summarize the entire process, the 『Online Public Opinion Prediction Model』 we designed consists of the following four steps.
STEP 1) Sentiment analysis corpus preparation: Twitter posts and comments labeled as positive or negative.
STEP 2) Text embedding classifier fine-tuning: train a BERT-based language model that builds text embeddings (vectors) and classifies them, using three embedding strategies.
STEP 3) Time series data conversion: convert the language model's positive/negative predictions into a time series table.
STEP 4) Transformer-based time series prediction: train our Transformer-based forecasting model on the time series data and predict the future trend of positive/negative public opinion.
A total of 530k texts, combining the legislative news articles and tweets provided by the Korean National Assembly with the online comments collected for sentiment analysis, were merged and pre-processed.
This text data was labeled according to positive or negative public opinion (negative: 0, positive: 1); an illustrative sketch of the resulting format follows.
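As a rough illustration of that format (a minimal sketch; the column names, example sentences, and layout are assumptions, not the actual corpus):

```python
# Minimal sketch of the labeled corpus format (all values are invented;
# "date", "text", and "label" are assumed column names).
import pandas as pd

corpus = pd.DataFrame(
    {
        "date": ["2020-07-30", "2020-07-30", "2020-07-31"],
        "text": [
            "This law will finally protect tenants.",      # positive opinion
            "Rents will only rise because of this bill.",  # negative opinion
            "A sensible step toward housing stability.",   # positive opinion
        ],
        "label": [1, 0, 1],  # negative: 0, positive: 1
    }
)
print(corpus.head())
```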
Using a pre-trained BERT-based language model (PLM), we obtain a fixed-size contextual vector for each token, i.e., a token embedding.
We use the [CLS] token or apply a pooling technique to obtain a sentence-level embedding instead of token-level embeddings, in one of three ways (see the sketch after this list):
- [CLS] Token : a single token-level vector trained to capture the meaning of the entire sentence.
- Mean Pooling : a sentence-level vector averaging the semantic representations of all tokens.
- Max Pooling : a sentence-level vector keeping, per dimension, the strongest representation, emphasizing the most salient tokens.
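The following is a minimal sketch of the three strategies with the Hugging Face transformers library; the checkpoint name klue/bert-base is an assumption, since the exact PLM is not specified here:

```python
# Sketch of [CLS], mean-pooling, and max-pooling sentence embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")  # assumed PLM
model = AutoModel.from_pretrained("klue/bert-base")

texts = ["임대차 3법이 세입자를 보호한다.", "이 법은 전세난을 키울 뿐이다."]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state      # (batch, seq_len, hidden)
mask = enc["attention_mask"].unsqueeze(-1)       # 1 for real tokens, 0 for padding

# 1) [CLS] token: the vector at the first position.
cls_emb = hidden[:, 0, :]
# 2) Mean pooling: average over non-padding tokens.
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
# 3) Max pooling: element-wise max over non-padding tokens.
max_emb = hidden.masked_fill(mask == 0, float("-inf")).max(dim=1).values

print(cls_emb.shape, mean_emb.shape, max_emb.shape)  # each (2, 768)
```

Any of these fixed-size vectors can then be fed to a classification head for the positive/negative prediction.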
As a result, the text embedding classifier achieves a classification accuracy of roughly 91–92% or higher with all three methods.
- Through the 'Text Embedding Classifier Model', articles and tweets related to the 'Lease 3 Law' are classified as positive or negative and then converted into time series data through the 'Hash Table Function' (sketched below).
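A minimal sketch of how this conversion could look; the per-item prediction tuples and the daily positive-ratio aggregation are our assumptions about the intermediate format:

```python
# Aggregate per-text predictions into a daily time series via a dict
# (hash table). All dates and labels below are invented for illustration.
from collections import defaultdict

predictions = [            # (date, predicted label) from the classifier
    ("2020-07-30", 1), ("2020-07-30", 0), ("2020-07-30", 1),
    ("2020-07-31", 0), ("2020-07-31", 0),
]

counts = defaultdict(lambda: {"pos": 0, "neg": 0})
for date, label in predictions:
    counts[date]["pos" if label == 1 else "neg"] += 1

# Daily positive ratio: one row of the time series table per day.
series = {d: c["pos"] / (c["pos"] + c["neg"]) for d, c in sorted(counts.items())}
print(series)  # {'2020-07-30': 0.666..., '2020-07-31': 0.0}
```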
- The Transformer model solves the problems of existing RNN-based models by applying the attention mechanism, and its parallel computation greatly improves calculation speed.
- In particular, attention is the core concept of the Transformer: it enables the network to understand contextual information by focusing on words relevant to the current token during both training and inference.
- Inspired by this model, our time series prediction model is a Seq2Seq model whose encoder stacks three Transformer encoders and whose decoder is a single linear regression layer (a minimal sketch follows).
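A minimal PyTorch sketch of such an architecture; the dimensions, window length, and forecast horizon are illustrative (the actual hyperparameters are not given here), and positional encoding is omitted for brevity:

```python
# Sketch: three stacked Transformer encoder layers + one linear "decoder".
import torch
import torch.nn as nn

class OpinionForecaster(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4,
                 window: int = 30, horizon: int = 7):
        super().__init__()
        self.input_proj = nn.Linear(1, d_model)   # scalar series -> d_model
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        # "Decoder": one linear regression layer mapping the encoded
        # window to the next `horizon` steps.
        self.decoder = nn.Linear(window * d_model, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, 1) past daily positive-ratio values
        h = self.encoder(self.input_proj(x))          # (batch, window, d_model)
        return self.decoder(h.flatten(start_dim=1))   # (batch, horizon)

model = OpinionForecaster()
past = torch.rand(8, 30, 1)   # 8 windows of 30 daily values
print(model(past).shape)      # torch.Size([8, 7])
```

Flattening the encoded window and regressing it onto the next steps with a single linear layer mirrors the "one linear regression decoder" described above.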
Our model attends strongly (i.e., assigns high weights) to the major inflection points in the input data, such as sharp rebounds or drops, and learns fine-grained time series patterns.
Figure. Blue line: actual values; red line: model predictions; grey line: residuals (true - prediction).
- Seoul National University NLP Labs
- Navy Lee