Add splitexp dataset doc #7

92 changes: 92 additions & 0 deletions docs/TDD-C-202105-UNL-002.md
@@ -0,0 +1,92 @@
# SplitExp

A rule-based method for detecting sentence boundaries has been developed for Turkish news texts. The method covers direct quotations, which had not previously been addressed for this problem, and resolves end-of-sentence punctuation ambiguities with a single regular expression.

GitHub repository of the dataset: https://github.com/ideateknoloji/SplitExp

## Dataset Details

This dataset was created from quotations that frequently appear in Turkish news texts. Several cases were evaluated, and a matcher was built from the samples that fit each case.
Collaborator comment: I did not fully understand what you mean in this part; could you rewrite it more simply? I think we are trying to write something similar to the following passage from the paper, to explain the table below:

"… mark '!', the question mark '?' and the ellipsis character '...' (U+2026). In the method proposed in our work, the cases in which these characters do not end a sentence are examined under 4 headings: quotations, numbers, abbreviations and extensions. Regular expressions matching these cases were built using features such as recursion and conditional constructs, and were combined into a single main expression that matches the text without interruption."

Cases of end-of-sentence markers:

| case | percentage |
|---------------|------------|
| quotations | 21.2% |
| numbers | 8.7% |
| abbreviations | 2.2% |
| extensions | 0.4% |



Collaborator comment: The related paper gives the distribution within the dataset of the different cases that involve end-of-sentence markers, such as numbers and quotations; those distributions could also be added as a table.


### Samples

Samples of data instances from all types of data present in the dataset.

Example:

```
{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}
```
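A minimal sketch of how a record like the one above might be read in Python. It assumes, based only on this single sample, that each entry in `indexes` is a zero-based character offset into `text`, stored as a string. The meaning of `types` is not documented here, so the code only carries it along.

```python
import json
import unicodedata

# The sample record from above, as a single JSON line.
line = '{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}'

record = json.loads(line)

# Normalise to NFC so that letters such as "ğ" count as one character.
text = unicodedata.normalize("NFC", record["text"])

# Assumption: indexes are zero-based character offsets stored as strings.
offsets = [int(i) for i in record["indexes"]]
markers = [text[i] for i in offsets]

print(len(text))                      # 114
print(markers)                        # ['.']
print(list(zip(offsets, record["types"])))
```

In this sample the only offset, 113, lands on the final period of the 114-character text, which is consistent with the assumption but does not confirm it for the rest of the dataset.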

### Fields

Explain the fields of the instances.

| field | description |
|----------|------------|
| _id | id of the token |
| text | token |
| indexes | ? |
| types | ? |

### Splits

Experiments were conducted on 9343 end-of-sentence markers obtained from 685 unambiguous documents by means of a marking tool developed for testing.

Collaborator comment: Add a Splits section and state there how many documents and samples the dataset consists of.


## Dataset Creation

### Curation Rationale

This dataset was created to support the development of a sentence boundary detection method for news texts that also handles direct quotations, which had not been addressed before.

### Data Source

The source of this dataset is news articles from different newspapers in Turkey.

### Annotations

The sentence boundary method was developed with multiple cases in mind, including direct quotations in newspaper articles, which had not been addressed before. The conditions it covers range from sentences ending with numbers to sentences containing quotations within quotations.
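The idea of letting one regular expression decide which punctuation marks end a sentence can be illustrated with a heavily simplified Python sketch. This is not the authors' expression: the pattern, the function name `end_of_sentence_offsets` and the sample sentence are made up for illustration. It covers only flat quotations and numbers, ignores abbreviations and extensions, and cannot handle quotations within quotations, since Python's standard `re` module has no recursive patterns.

```python
import re

# One pattern, tried left to right. Protected spans are consumed whole,
# so punctuation inside them is never seen as a sentence end.
TOKEN = re.compile(
    r"""
    (?P<quote>"[^"]*"|“[^”]*”)      # a direct quotation, taken as one unit
  | (?P<number>\d+(?:[.,]\d+)+)     # digits joined by '.' or ',' stay together
  | (?P<end>\.\.\.|…|[.!?])         # a candidate end-of-sentence marker
    """,
    re.VERBOSE,
)


def end_of_sentence_offsets(text):
    """Return the offset where each unprotected end marker starts."""
    return [
        m.start("end")
        for m in TOKEN.finditer(text)
        if m.lastgroup == "end"
    ]


sample = 'Dolar 5.85 lira oldu. Bakan "Sorun yok. Merak etmeyin!" dedi.'
print(end_of_sentence_offsets(sample))  # [20, 60]
```

The period in `5.85` and the two markers inside the quotation are skipped because the earlier alternatives consume them first; only the periods after `oldu` and `dedi` are reported.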



## Additional Information

Collaborator comment: Add a heading named "Version" and state which commit of the related repository the dataset was taken from.

### Dataset Curators

Published by Can Özbey and Özge Dinçsoy.

### Version

This dataset was taken from commit cd6e457 of the repository.

### Citation Information

Please cite the following paper if you find this dataset useful:

Özbey, C., and Dinçsoy, Ö. (2019). Sentence Boundary Detection in Turkish News with Regular Expressions. In 2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU).

```
@inproceedings{inproceedings,
title={Sentence Boundary Detection in Turkish News with Regular Expressions},
author={Can Ozbey and Ozge Dincsoy},
booktitle={2019 IEEE 27th Signal Processing and Communications Applications Conference (SIU)},
year={2019},
month={aug},
isbn={978-1-7281-1904-5},
doi={10.1109/SIU.2019.8806556}
}
```

{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}
Collaborator comment: I think this was left over from the example section; it needs to be deleted.