refactored package for automated data handling
merillium committed Feb 20, 2024
1 parent df2a2c9 commit 5786dcd
Showing 8 changed files with 262 additions and 125 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,7 +2,7 @@
**/__pycache__/
!/.gitignore
lichess_player_data/
-lichess-games-database/
+lichess_downloaded_games/
exploratory_plots/
model_plots/
saved_models/
17 changes: 13 additions & 4 deletions README.md
@@ -1,8 +1,15 @@
# filter_suspicious_players

This is a work-in-progress package that retrieves training data from the [lichess.org open database](https://database.lichess.org/), then trains a statistical model to detect suspicious players.
This is a work-in-progress package that retrieves training data from the [lichess.org open database](https://database.lichess.org/), then trains a statistical model to detect suspicious players. Currently the app is not functional, and has not yet been built.

Currently the app is not functional, and has not been deployed. If cloning this repo for personal use, the structure of the python scripts assumes that there is a folder called `lichess-games-database` to which .pgn and .pgn.zst files are downloaded and unzipped (this may be automated in the future using a bash script), and that there is a folder called `lichess_player_data` to which .csv files are saved (this folder is created by `parse_pgn.py` if it doesn't exist).
### Data Download and Preprocessing
To download and preprocess data from the lichess.org open database, you can run the following command:

```bash
python3 download_and_preprocess_data.py --year 2015 --month 1 --filetype lichess-open-database
```

The `download_and_preprocess_data.py` script downloads the `.pgn.zst` file for the specified year and month into the `lichess_downloaded_games` directory (creating the directory if needed), then decompresses it to a `.pgn` file in the same directory. It then preprocesses the `.pgn` file, extracts the relevant features, and saves them as a `.csv` file in the `lichess_player_data` directory, which is created if it does not exist. Passing `--remove-raw-files` deletes the raw files in `lichess_downloaded_games` afterwards, since they are typically large and not needed after preprocessing. (This process could be streamlined by reading directly from the decompressed stream instead of first saving the `.pgn` file.)
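The streamlining idea in the parenthetical above could look roughly like the following sketch, which yields one game at a time from any text stream so that the decompressed `.pgn` never has to be written to disk. The function name `iter_pgn_games` is hypothetical and not part of this repository, and the sketch assumes every game starts with an `[Event ` tag line.

```python
import io
from typing import Iterator, List, TextIO


def iter_pgn_games(stream: TextIO) -> Iterator[str]:
    """Yield the raw text of each game from a PGN text stream.

    Assumption: every game begins with a line starting with '[Event '.
    """
    current: List[str] = []
    for line in stream:
        # A new '[Event ' tag marks the start of the next game.
        if line.startswith("[Event ") and current:
            game = "".join(current).strip()
            if game:
                yield game
            current = []
        current.append(line)
    # Emit the final game, if anything is left in the buffer.
    game = "".join(current).strip()
    if game:
        yield game


# Small in-memory stand-in for a decompressed PGN stream.
sample = io.StringIO(
    '[Event "Rated Blitz game"]\n[White "a"]\n[Black "b"]\n\n1. e4 e5 1-0\n\n'
    '[Event "Rated Bullet game"]\n[White "c"]\n[Black "d"]\n\n1. d4 d5 0-1\n'
)
games = list(iter_pgn_games(sample))
print(len(games))
```

In the real script, the stream could plausibly be obtained by wrapping a `zstandard` stream reader in `io.TextIOWrapper`, so that decompression happens incrementally as lines are read; that wiring is an assumption here and is left out so the sketch runs with the standard library alone.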

### Model Description
This is a simple statistical model that flags players whose performance exceeds their expected performance under the Glicko-2 rating system by more than a certain threshold. The expected performance takes into account each player's complete game history and opponents within the span of the training data. The thresholds are initialized to default values, then adjusted separately for each 100-point rating bin in the training data.
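As a rough illustration of the idea, the sketch below computes an expected score per game, averages the excess of actual over expected score, and compares it with a per-bin threshold. It is not the package's implementation: the form of `g` is the standard Glicko attenuation factor (the repository's own `g` is not shown here), and `rating_bin`, `is_flagged`, and all threshold values are hypothetical.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant


def g(rd: float) -> float:
    """Glicko attenuation factor for a rating deviation (assumed form)."""
    return 1 / math.sqrt(1 + 3 * (Q * rd) ** 2 / math.pi**2)


def expected_score(rating, opp_rating, rd=80.0, opp_rd=80.0):
    """Expected score against one opponent, combining both rating deviations."""
    a = g(math.sqrt(rd**2 + opp_rd**2)) * (rating - opp_rating)
    return 1 / (1 + 10 ** (-a / 400))


def rating_bin(rating: float, width: int = 100) -> int:
    """Lower edge of the rating bin, e.g. 1547 -> 1500."""
    return int(rating // width) * width


def is_flagged(games, player_rating, thresholds, default_threshold=0.15):
    """Flag a player whose mean (actual - expected) score exceeds the bin threshold.

    games: list of (opponent_rating, actual_score) pairs, score in {0, 0.5, 1}.
    thresholds: dict mapping rating bin -> threshold (hypothetical values).
    """
    expected = sum(expected_score(player_rating, opp) for opp, _ in games)
    actual = sum(score for _, score in games)
    mean_excess = (actual - expected) / len(games)
    threshold = thresholds.get(rating_bin(player_rating), default_threshold)
    return mean_excess > threshold


thresholds = {1500: 0.20}  # hypothetical per-bin threshold
upset_streak = [(1700, 1.0)] * 5  # five wins against much stronger opponents
even_results = [(1500, 0.5)] * 5  # draws against equal opponents
print(is_flagged(upset_streak, 1500, thresholds))
print(is_flagged(even_results, 1500, thresholds))
```

Averaging the excess per game keeps the statistic comparable between players with different numbers of games, although a real model would also need to account for small sample sizes.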
@@ -37,7 +44,9 @@ Currently working on unit tests, which can be run with the following command:
```make test```, or if you want to run test files individually ```PYTHONPATH=. pytest tests/test_model.py```

To-do:
- write a bash script to download and unzip data from the lichess.org open database
- restructure `make_player_features.py` to parse arguments from `download_and_preprocess_data.py`
- complete data labelling using lichess API calls, with a workaround or retry request if API rate limiting occurs
- write unit tests for scripts that perform feature extraction and data labelling
- write unit tests for `PlayerAnomalyDetectionModel` class and methods (in-progress)
- possibly benchmark how long data downloading, preprocessing, and model training take, depending on the size of the raw data
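For the rate-limiting item in the to-do list, one possible shape for the retry logic is sketched below. Everything here is an assumption rather than code from this repository: `fetch_with_retry` and the `fetch` callable are hypothetical, and the 60-second default wait reflects the guidance in the lichess API documentation for HTTP 429 responses, which should be checked before relying on it.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it succeeds or the attempts run out.

    fetch is a hypothetical callable returning (http_status, body).
    Returns the body on HTTP 200, or None if every attempt was rate limited.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429:
            # Only rate limiting is retried; other errors are surfaced.
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < max_attempts:
            sleep(wait_seconds)
    return None


# Offline demonstration: the first call is rate limited, the second succeeds.
responses = iter([(429, ""), (200, "player data")])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the function testable without actually waiting, which also fits the unit-test items in the list.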
112 changes: 112 additions & 0 deletions download_and_preprocess.py
@@ -0,0 +1,112 @@
import argparse
import os
import re
from pathlib import Path
import subprocess
import zstandard as zstd
import pyzstd

from enums import Folders


def download_data(year, month, filetype):
    """Download one month of rated standard games; return the filename, or None if aborted."""
    month = str(month).zfill(2)
    url = f"https://database.lichess.org/standard/lichess_db_standard_rated_{year}-{month}.pgn.zst"
    filename = f"lichess_db_standard_rated_{year}-{month}.pgn.zst"
    if not os.path.exists(Folders.LICHESS_DOWNLOADED_GAMES.value):
        os.mkdir(Folders.LICHESS_DOWNLOADED_GAMES.value)

    # Check file size before downloading
    response = subprocess.run(
        ["wget", "--spider", "--server-response", url],
        stderr=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    # print(f"response: {response}")
    # wget writes the server response headers to stderr
    result = re.search(r"Content-Length: (\d+)", response.stderr.decode())
    # stays None when the header is missing, instead of raising on None.isdigit()
    download_file_size = int(result.group(1)) if result else None

    # Warn user if file size exceeds 1 GB
    if download_file_size and download_file_size > 10**9:
        user_response = input(
            "Warning: File size exceeds 1GB. Do you want to proceed with the download? (Y/N): "
        )
        if user_response.lower() != "y":
            print("Download aborted.")
            return None

    subprocess.run(["wget", url, "-P", Folders.LICHESS_DOWNLOADED_GAMES.value])
    return filename


def preprocess_data(filename, remove_raw_files):
    """This function calls parse_pgn.py and make_player_features.py with the filename argument"""

    # removes .pgn.zst from extension
    base_filename = Path(filename).stem.split(".")[0]
    PGN_FILE_PATH = f"{Folders.LICHESS_DOWNLOADED_GAMES.value}/{base_filename}.pgn"
    ZST_FILE_PATH = f"{Folders.LICHESS_DOWNLOADED_GAMES.value}/{base_filename}.pgn.zst"

    # decompress .pgn.zst and save as .pgn
    # note: for large files, this operation must be done in chunks
    # with open(ZST_FILE_PATH, "rb") as compressed_file:
    #     dctx = zstd.ZstdDecompressor()
    #     with open(PGN_FILE_PATH, "wb") as output_file:
    #         for chunk in dctx.read_to_iter(compressed_file):
    #             output_file.write(chunk)

    # with open(ZST_FILE_PATH, 'rb') as compressed_file:
    #     decompressor = zstd.ZstdDecompressor()
    #     with decompressor.stream_reader(compressed_file) as reader:
    #         with open(PGN_FILE_PATH, 'wb') as output_file:
    #             for chunk in reader:
    #                 output_file.write(chunk)

    with open(ZST_FILE_PATH, "rb") as f_in:
        compressed_data = f_in.read()

    decompressed_data = pyzstd.decompress(compressed_data)

    with open(PGN_FILE_PATH, "wb") as f_out:
        f_out.write(decompressed_data)

    subprocess.run(["python3", "parse_pgn.py", PGN_FILE_PATH])

    CSV_FILE_PATH = f"{Folders.LICHESS_PLAYER_DATA.value}/{base_filename}.csv"
    # subprocess.run(["python3", "make_player_features.py", filename[:-4]])

    # Remove the downloaded .pgn.zst and .pgn files
    if remove_raw_files:
        os.remove(PGN_FILE_PATH)
        os.remove(ZST_FILE_PATH)


def main():
    parser = argparse.ArgumentParser(description="Download and preprocess Lichess data")
    parser.add_argument("--year", type=int, help="Year of the data", required=True)
    parser.add_argument("--month", type=int, help="Month of the data", required=True)
    parser.add_argument(
        "--filetype",
        type=str,
        help="Type of file to download",
        choices=["lichess-open-database"],
        required=True,
    )
    parser.add_argument(
        "--remove-raw-files",
        action="store_true",
        help="Remove raw files after preprocessing",
    )
    args = parser.parse_args()

    filename = download_data(args.year, args.month, args.filetype)
    if filename is None:
        # the download was aborted by the user, so there is nothing to preprocess
        return
    preprocess_data(filename, args.remove_raw_files)


if __name__ == "__main__":
    main()
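The size check in `download_data` depends on `wget` being installed and on parsing its stderr output. A standard-library alternative could look like the sketch below, which sends a `HEAD` request and reads the `Content-Length` header. The helper `remote_file_size` and the stand-in response class are hypothetical and not part of this commit, and the sketch assumes the server answers `HEAD` requests with that header.

```python
import urllib.request
from typing import Optional


def remote_file_size(url: str, opener=urllib.request.urlopen) -> Optional[int]:
    """Return the Content-Length reported for url, or None if it is missing.

    opener is injectable so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length and length.isdigit() else None


class FakeResponse:
    """Offline stand-in for the object returned by urlopen (demo only)."""

    def __init__(self, headers):
        self.headers = headers

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


# Demonstration without touching the network; the URL is a placeholder.
size = remote_file_size(
    "https://example.invalid/file.pgn.zst",
    opener=lambda request: FakeResponse({"Content-Length": "1500000000"}),
)
print(size, size > 10**9)
```

Returning `None` for a missing or malformed header lets the caller skip the 1 GB warning rather than fail.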
2 changes: 2 additions & 0 deletions enums.py
@@ -15,6 +15,8 @@ class TimeControl(Enum):
class Folders(Enum):
    """Enum to represent the default folder name(s) in the project."""

    LICHESS_DOWNLOADED_GAMES = "lichess_downloaded_games"
    LICHESS_PLAYER_DATA = "lichess_player_data"
    MODEL_PLOTS = "model_plots"
    SAVED_MODELS = "saved_models"
    EXPLORATORY_PLOTS = "exploratory_plots"
1 change: 0 additions & 1 deletion exploratory_plots.py
@@ -6,7 +6,6 @@
## constants could eventually go into enums
BASE_FILE_NAME = "lichess_db_standard_rated_2015-01"


if not os.path.exists(Folders.EXPLORATORY_PLOTS.value):
    os.mkdir(Folders.EXPLORATORY_PLOTS.value)

4 changes: 1 addition & 3 deletions make_player_features.py
@@ -43,9 +43,7 @@ def get_player_expected_score(
     player_rating, opponent_rating, player_rd=80.0, opponent_rd=80.0
 ):
     """Returns expected score of player based on player rating, opponent rating, and RDs (if known)."""
-    A = g(np.sqrt(player_rd**2 + opponent_rd**2)) * (
-        player_rating - opponent_rating
-    )
+    A = g(np.sqrt(player_rd**2 + opponent_rd**2)) * (player_rating - opponent_rating)
     return 1 / (1 + np.exp(-A))

