Add splitexp dataset doc #7
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
# SplitExp | ||
|
||
SplitExp is a rule-based method for determining sentence boundaries in Turkish news texts. It covers direct quotations, which had not been addressed in earlier work on the problem, and resolves end-of-sentence punctuation ambiguities with a single regular expression. | ||
|
||
GitHub repository of the dataset: https://github.com/ideateknoloji/SplitExp | ||
|
||
## Dataset Details | ||
|
||
This dataset was created from Turkish news texts, in which quotations are frequent. Several end-of-sentence cases were evaluated, and a matcher was created over the samples that fit each case. | ||
Distribution of end-of-sentence marker cases: | ||
|
||
| case | percentage | | ||
|---------------|------------| | ||
| quotation | 21.2% | | ||
| numbers | 8.7% | | ||
| abbreviations | 2.2% | | ||
| extensions | 0.4% | | ||
|
||
|
||
|
||
Review comment: The related paper reports how the different end-of-sentence cases, such as numbers and quotations, are distributed in the dataset; those distributions could also be added here as a table. |
||
|
||
### Samples | ||
|
||
Samples of data instances from all types of data present in the dataset. | ||
|
||
Example: | ||
|
||
``` | ||
{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]} | ||
``` | ||
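As a minimal sketch of how a record like the one above could be read, the snippet below parses the sample and looks up each annotated position. It assumes that `indexes` holds 0-based character offsets into `text` and that `types` is parallel to `indexes`; both assumptions are consistent with this sample but are not confirmed by the source. The helper name `markers` is hypothetical.

```python
import json

# Sample record copied from the example above.
raw = '{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}'


def markers(record):
    """Return (offset, character, type) for each annotated marker.

    Assumption: `indexes` are 0-based character offsets into `text`,
    stored as strings, and `types` has one entry per index.
    """
    text = record["text"]
    return [
        (int(i), text[int(i)], t)
        for i, t in zip(record["indexes"], record["types"])
    ]


record = json.loads(raw)
result = markers(record)
print(result)
```

For this sample the single offset points at the final period of the sentence, which is what suggests the 0-based reading.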
|
||
### Fields | ||
|
||
Each instance has the following fields: | ||
Review comment: The example given at https://raw.githubusercontent.com/ideateknoloji/SplitExp/master/Dataset/sbd.json will be shown in this section. |
||
|
||
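| field | description | | ||
|---------|-------------| | ||
| _id | identifier of the instance | | ||
| text | text of the instance | | ||
| indexes | character offsets of the end-of-sentence markers in `text` (0-based in the sample above) | | ||
| types | type code of each marker in `indexes` (`"0"` in the sample above); the meaning of the codes is not specified here | | ||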
|
||
### Splits | ||
|
||
Experiments were conducted on 9343 end-of-sentence markers obtained from 685 unambiguous documents by means of a marking tool developed for testing. | ||
|
||
Review comment: Add a Splits section and state there how many documents and samples the dataset consists of. |
||
|
||
## Dataset Creation | ||
|
||
### Curation Rationale | ||
|
||
This dataset was created to support the development of a sentence boundary detection method for news texts that also covers direct quotations, which had not been addressed before. | ||
|
||
### Data Source | ||
|
||
The source of this dataset is news articles from different newspapers in Turkey. | ||
|
||
### Annotations | ||
|
||
The sentence boundary method was developed against multiple cases, including direct quotation sentences in newspaper articles, which had not been addressed before. The cases range from sentences ending with numbers to sentences with quotations nested within quotations. | ||
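To illustrate the kind of ambiguity described here, the following is a toy sketch only. It is not the SplitExp regular expression; the pattern `BOUNDARY` and the helper `split_sentences` are hypothetical names written for this example. It treats `.`, `!` or `?` as a boundary only when followed by whitespace and an uppercase letter (or the end of the text), and it skips periods that directly follow a digit.

```python
import re

# Toy pattern, NOT the SplitExp expression: a terminator, optionally followed
# by a closing quote, counts as a boundary only before an uppercase letter or
# at the end of the text. Periods right after a digit are skipped.
BOUNDARY = re.compile(r'(?<!\d)[.!?]["”’]?(?=\s+[A-ZÇĞİÖŞÜ]|\s*$)')


def split_sentences(text):
    """Split text at every position matched by BOUNDARY."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = 'Bakan "Karar alındı." dedi. Toplantı 3. katta yapıldı.'
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

The period inside the quotation is not treated as a boundary because a lowercase word follows it, and the period after `3` is skipped because of the digit check. That same digit check means this toy pattern misses sentences that really do end with a number, one of the cases the method described above is said to handle.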
|
||
|
||
|
||
## Additional Information | ||
|
||
Review comment: Add a heading named Version and state which commit of the repository the dataset was taken from. |
||
### Dataset Curators | ||
|
||
Published by Can Özbey and Özge Dinçsoy. | ||
|
||
### Version | ||
|
||
This dataset was taken from commit cd6e457 of the repository. | ||
|
||
### Citation Information | ||
|
||
Please cite the following paper if you find this dataset useful: | ||
|
||
Özbey, C., and Dinçsoy, Ö. (2019). Sentence Boundary Detection in Turkish News with Regular Expressions. In 2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU). | ||
|
||
``` | ||
@inproceedings{inproceedings, | ||
title={Sentence Boundary Detection in Turkish News with | ||
Regular Expressions}, | ||
author={Can Ozbey and Ozge Dincsoy}, | ||
booktitle={2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU)}, | ||
year={2019}, | ||
month={aug}, | ||
isbn={978-1-7281-1904-5}, | ||
doi={10.1109/SIU.2019.8806556} | ||
} | ||
``` | ||
|
||
{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]} | ||
Review comment: I think this was left over from the example section; it needs to be deleted. |
Review comment: I did not understand exactly what you meant in this part; could you rewrite it more simply? I think we are trying to write something similar to the following part of the paper to explain the table below: