This project aims to build a named entity recognition (NER) system for Chinese entities and personal names.
-
colah's blog: http://colah.github.io/ (we can study some ML topics in this blog)
-
machine learning book: http://machinelearningbook.com (800+ pages covering ML topics in detail)
-
English and Hindi NER: https://github.com/monikkinom/ner-lstm
-
Stanford assignment: https://github.com/Observerspy/CS224n/tree/master/assignment3
-
tushare news: http://tushare.org (retrieving data)
- Marking the Beginning of an entity, the Inside of an entity, and tokens Outside any entity is called BIO notation, used in sequence tagging tasks: https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
-
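
As a minimal sketch of BIO notation for Chinese text, the snippet below tags each character of a sentence, assuming character-level tagging and hand-annotated entity spans. The function name `bio_tags`, the span format `(start, end, type)`, and the labels `PER` and `LOC` are illustrative assumptions, not part of this project's code.

```python
from typing import List, Tuple


def bio_tags(sentence: str, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per character of `sentence`.

    `entities` holds (start, end, type) spans; `end` is exclusive.
    Span format and type names are assumptions made for this example.
    """
    tags = ["O"] * len(sentence)          # default: outside any entity
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"        # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"        # remaining characters of the entity
    return tags


sentence = "小明在北京工作"
spans = [(0, 2, "PER"), (3, 5, "LOC")]    # 小明 = person, 北京 = location
for ch, tag in zip(sentence, bio_tags(sentence, spans)):
    print(ch, tag)
```
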
To build a GUI for constructing the training and testing datasets (checked)
-
To download news data and split it into sentences
-
To become familiar with Recurrent Neural Networks (hold a seminar)
-
To carefully study current NER code for English and Hindi (hold a seminar later)
-
To design a deep learning model for the Chinese NER system
-
To test the model on our own testing data as well as on external data
-
To design a simple prototype for the NER system
-
Further studies: improve the NER system, Company Logo detection project (using CNN), web-crawling plugins, etc.
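
The task of cutting downloaded news into sentences could be sketched roughly as follows. This is only an assumption about how the splitting might be done: it splits on common Chinese and ASCII sentence-ending punctuation with the standard `re` module, and the helper name `split_sentences` is hypothetical. It does not handle closing quotation marks that follow the punctuation.

```python
import re
from typing import List

# Sentence-ending punctuation: Chinese 。！？ plus ASCII ! and ?
# (an assumed set; extend it as the news data requires).
_SENT_END = re.compile(r"([。！？!?]+)")


def split_sentences(text: str) -> List[str]:
    """Split `text` into sentences, keeping the ending punctuation."""
    # With a capturing group, re.split alternates text and delimiter:
    # [text0, delim0, text1, delim1, ..., textN]
    parts = _SENT_END.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = parts[i].strip()
        end = parts[i + 1] if i + 1 < len(parts) else ""
        if body:                          # skip empty fragments
            sentences.append(body + end)
    return sentences


print(split_sentences("今天股市上涨。明天会怎样？拭目以待！"))
```
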
Part 1 -- data gathering/pre-processing (this is very important too), e.g. how to collect data systematically, perhaps by writing a web crawler to crawl forums
Part 2 -- use an existing model to get a sense of the "goodness" of the training data
Part 3 -- implement the model in TensorFlow, or reuse an existing implementation if there is one
Part 4 -- tweak the model so that it works well for our problem/data
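
One possible way to put a number on how well a model does on our data in Parts 2 and 4 is entity-level precision, recall, and F1 over BIO tag sequences. The sketch below assumes exact-match scoring (an entity counts only if its boundaries and type both match); the notes above do not fix a metric, and the helper names `extract_spans` and `prf1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):    # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))      # current entity ends before i
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]         # a new entity begins at i
    return spans


def prf1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 for one tagged sequence."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)                           # spans matching in boundary and type
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "O"]  # LOC boundary is wrong
print(prf1(gold, pred))
```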