DocIE-Probing Repository Overview

This repository contains code and data for the paper "Probing Representations for Document-level Event Extraction" (EMNLP 2023 Findings).

The project focuses on analyzing different models' capabilities in document-level event extraction.

Code Safety Warning: Some models depend on outdated packages (such as older versions of transformers) that have known critical vulnerabilities. For compatibility, this repository does not update them. Run the code in an isolated, separate environment.

Repository Structure

1. Corpora

Contains two main datasets, each converted into the formats expected by the different models:

MUC 3/4 (Message Understanding Conference):
- Historical dataset from the 1990s
- Contains various versions including:
  - Base MUC dataset
  - Two attempts at adding triggers, which are expanded in later works.
- Includes both full text and sentence-level (SentCat) variants
- Split into 200 test, 200 dev, 1300 training examples
WikiEvents:
- Similar to MUC, but has a 20/20/206 test/dev/train split.
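
For a quick sense of scale, the split sizes listed above can be tabulated as follows. This is only an illustrative sketch: the `SPLITS` dictionary and the `split_fractions` helper are hypothetical names, not part of the repository, and the counts are simply the ones stated in this section.

```python
# Split sizes as stated in this README (number of examples per split).
SPLITS = {
    "MUC": {"train": 1300, "dev": 200, "test": 200},
    "WikiEvents": {"train": 206, "dev": 20, "test": 20},
}


def split_fractions(splits):
    """Return each split's share of the dataset total, rounded to 3 places."""
    total = sum(splits.values())
    return {name: round(count / total, 3) for name, count in splits.items()}


if __name__ == "__main__":
    for dataset, splits in SPLITS.items():
        # Print the dataset name, its total size, and the per-split shares.
        print(dataset, sum(splits.values()), split_fractions(splits))
```

WikiEvents is roughly seven times smaller than MUC, which is worth keeping in mind when comparing probing results across the two corpora.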

2. Models

The repository implements several event extraction models:

DyGIE++: A document-level information extraction system
- Includes configuration files and training scripts
- Contains output processing and error analysis tools
GTT:
- Template filling with generative transformers
- Includes evaluation scripts and utilities
TANL:
- Text-to-text approach for information extraction
- Contains pre/post processing scripts and evaluation tools
Naive BERT Baseline

Models are modified to read environment variables that tell them to save the embeddings they used. More elegant solutions could use transformer_lens or PyTorch forward hooks.
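
The environment-variable pattern described above might look roughly like the sketch below. It is an assumption-laden illustration, not the repository's code: the variable name `EMBEDDING_DUMP_DIR`, the function `maybe_save_embedding`, and the JSON output format are all hypothetical, and the embedding is represented as plain nested lists so the example runs without any deep-learning dependency.

```python
import json
import os
from pathlib import Path

# Hypothetical variable name; the repository's actual variable may differ.
ENV_VAR = "EMBEDDING_DUMP_DIR"


def maybe_save_embedding(doc_id, vectors):
    """Write `vectors` to <dump_dir>/<doc_id>.json if ENV_VAR is set.

    Returns the output path, or None when saving is disabled.
    """
    dump_dir = os.environ.get(ENV_VAR)
    if not dump_dir:
        return None  # Variable unset: behave like the unmodified model.
    out_path = Path(dump_dir) / f"{doc_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump({"doc_id": doc_id, "embedding": vectors}, f)
    return out_path


if __name__ == "__main__":
    # With the variable unset this prints None and writes nothing.
    print(maybe_save_embedding("example-doc", [[0.1, 0.2], [0.3, 0.4]]))
```

The appeal of this design is that the model code needs only one extra call at the point where the embedding is computed, and leaving the variable unset keeps normal training and inference unchanged.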

3. Probing

Contains code and data for analyzing model representations:

Code/:
- getX_embeddings.py: Extracts embeddings from different models
- getY_labels.py: Processes dataset labels
- probing.py: Main probing implementation
- confusion_matrix.py: Visualization and analysis tools
- Additional utilities and analysis notebooks
X_Embeddings/: Contains extracted embeddings from different models
- bfloat16 version included in the GitHub repo
Y_labels/: Contains processed labels for both MUC and WikiEvents
Y_labels_Tokenizer_Specific/: Model-specific label processing
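
Because the embeddings shipped in the repo are stored as bfloat16, some precision is lost relative to float32. The sketch below illustrates the size of that loss using only the standard library. The helper `to_bfloat16_trunc` is a hypothetical name, and it truncates rather than rounds, so it is an approximation of what a tensor library would produce, not a description of how the repository converted its files.

```python
import struct


def to_bfloat16_trunc(value):
    """Round-trip a Python float through a truncated bfloat16 representation.

    bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa
    bits of an IEEE 754 float32, i.e. its upper 16 bits. This sketch simply
    drops the lower 16 bits (truncation). Tensor libraries typically round
    to nearest instead, so their results can differ in the last kept bit.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", value))[0]
    bits16 = bits32 >> 16  # keep sign, exponent, and 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


if __name__ == "__main__":
    for v in (1.0, 0.1, 3.14159):
        print(v, to_bfloat16_trunc(v))
```

Only about two to three significant decimal digits survive, which is usually acceptable for probing classifiers but should be kept in mind when comparing against full-precision embeddings.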

Getting Started

- First, familiarize yourself with the dataset formats in the Corpora directory
- Check the model implementations in the Model directory
- Use the probing tools in the Probing directory to analyze model representations
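
To make the last step concrete, the idea behind probing is to train a small classifier on frozen embeddings and check how well it predicts a label. The sketch below is a minimal, dependency-free illustration of that idea using binary logistic regression. It is not the implementation in probing.py: the function names, hyperparameters, and toy data are all assumptions made for the example.

```python
import math


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit a binary logistic-regression probe with batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - target  # gradient of the log-loss with respect to z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average the gradients over the batch and take one step.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for x, target in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(target))
    return correct / len(X)


# Toy 2-d "frozen embeddings" with binary labels; not real model outputs.
X_toy = [[1.0, 0.2], [0.9, -0.1], [1.2, 0.3],
         [-1.0, 0.1], [-0.8, -0.2], [-1.1, 0.0]]
y_toy = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    w, b = train_linear_probe(X_toy, y_toy)
    print("train accuracy:", probe_accuracy(w, b, X_toy, y_toy))
```

In a real experiment the probe would be trained on the training split and scored on the dev or test split; the toy example scores on its own training data only to stay short.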

Key Features

- Multiple model implementations for document-level event extraction
- Comprehensive probing framework for analyzing model representations
- Support for both full text and sentence-level processing
- Tools for data preprocessing, model training, and analysis
