KleptoSyn

Synthetic data generation for investigative graphs based on patterns of bad-actor tradecraft.

Default input data sources:

  • OpenSanctions (risk data)
  • Open Ownership (link data)
  • OCCRP "Azerbaijani Laundromat" leaked dataset (event data)

Ontologies used:

The simulation uses the following process:

  1. Construct a Network that represents bad-actor subgraphs
    • Use OpenSanctions (risk data) and Open Ownership (link data) for real-world UBO topologies
    • Run Senzing entity resolution to generate a "backbone" for organizing the graph
    • Partition into subgraphs and run centrality measures to identify the ultimate beneficial owners (UBOs)
  2. Configure a Simulation for generating patterns of bad-actor tradecraft
    • Analyze the transactions of the OCCRP "Azerbaijani Laundromat" leaked dataset (event data)
    • Sample probability distributions for shell topologies, transfer amounts, and transfer timing
    • Generate a large proportion of "legit" transfers (a 49:1 legit-to-bad ratio)
  3. Generate the SynData (synthetic data) by applying the simulation to the network
    • Track the generated bad-actor transactions
    • Serialize the transactions and the people/companies involved
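The steps above can be sketched with NetworkX. This is a minimal illustration, not the project's implementation: the toy graph, node names, choice of centrality measure, and sampling distributions are all assumptions.

```python
# Illustrative sketch only -- the real pipeline uses Senzing entity
# resolution plus OpenSanctions/Open Ownership data; everything here
# (graph, node names, distributions) is a stand-in.
import random
import networkx as nx

random.seed(42)

# 1. Network: a toy ownership graph with two disconnected subgraphs,
#    each rooted at a hypothetical owner controlling shell companies
G = nx.DiGraph()
G.add_edges_from([
    ("owner_a", "shell_1"), ("owner_a", "shell_2"), ("shell_1", "shell_3"),
    ("owner_b", "shell_4"), ("owner_b", "shell_5"),
])

# Partition into weakly connected subgraphs, then use a centrality
# measure to pick the likely ultimate beneficial owner (UBO) of each
subgraphs = [G.subgraph(c) for c in nx.weakly_connected_components(G)]
ubos = []
for sg in subgraphs:
    scores = nx.out_degree_centrality(sg)  # hub-like owners score high
    ubos.append(max(scores, key=scores.get))

# 2. Simulation: sample transfers, mixing roughly 49 "legit" transfers
#    for every bad-actor transfer, with a skewed amount distribution
transactions = []
for _ in range(500):
    is_bad = random.random() < 1.0 / 50.0           # 49:1 legit-to-bad
    amount = round(random.lognormvariate(9, 1), 2)  # heavy-tailed amounts
    transactions.append({"amount": amount, "bad": is_bad})
```

The same subgraph-then-centrality pattern generalizes: swap in betweenness or PageRank if out-degree proves too naive for real ownership topologies.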

Note that much of the heavy lifting here is entity resolution performed by Senzing and network analytics performed by NetworkX.

As simulations scale, both the data generation and the fraud-pattern detection would benefit from using the cuGraph high-performance backend for NetworkX.
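One plausible way to enable that backend, sketched under the assumption of an NVIDIA GPU with CUDA 12 and the RAPIDS nx-cugraph package (package name, index URL, and environment variable follow the RAPIDS documentation; verify against the current release):

```shell
# Sketch only: assumes an NVIDIA GPU and CUDA 12; adjust the package
# suffix (-cu12) to match your installed CUDA toolkit.
pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com

# Ask NetworkX to dispatch supported algorithms to the cuGraph backend
export NX_CUGRAPH_AUTOCONFIG=True
poetry run python3 demo.py
```

With the autoconfig variable set, unmodified NetworkX calls are dispatched to the GPU where the backend supports them, falling back to the default implementation otherwise.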

build a local environment

This project uses poetry for dependency management, virtual environments, builds, packaging, and so on. To set up an environment locally:

git clone https://github.com/DerwenAI/kleptosyn.git
cd kleptosyn

poetry install --extras=demo

The source code currently requires Python 3.11 or later.

load the default data

wget https://raw.githubusercontent.com/Kineviz/senzing_starter_kit/refs/heads/main/senzing_rootfs/data/open-sanctions.json
wget https://raw.githubusercontent.com/Kineviz/senzing_starter_kit/refs/heads/main/senzing_rootfs/data/open-ownership.json

wget https://storage.googleapis.com/erkg/starterkit/export.json

wget https://raw.githubusercontent.com/cj2001/senzing_occrp_mapping_demo/refs/heads/main/occrp_17k.csv

run the demo script and notebooks

poetry run python3 demo.py
poetry run jupyter-lab

use the results

By default, the output results will be serialized as:

  • graph.json: the network representation
  • transact.csv: transactions generated by the simulation
  • entities.csv: entities generated by the simulation
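A quick way to sanity-check the serialized results is to aggregate transaction amounts per payer. The sketch below uses an in-memory stand-in for transact.csv, and the payer/payee/amount column names are hypothetical, not the project's actual schema; graph.json could similarly be inspected with the json module.

```python
# Sketch of consuming the serialized results; the transact.csv column
# names (payer, payee, amount) are assumptions, not the real schema.
import csv
import io

# Stand-in for open("transact.csv") so the sketch is self-contained
sample = io.StringIO(
    "payer,payee,amount\n"
    "shell_1,shell_2,1200.50\n"
    "shell_2,shell_3,980.00\n"
)

# Total outgoing transfer amount per paying entity
totals = {}
for row in csv.DictReader(sample):
    totals[row["payer"]] = totals.get(row["payer"], 0.0) + float(row["amount"])
```

Replacing the StringIO stand-in with `open("transact.csv")` applies the same aggregation to the real simulation output.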

development

First, to set up the dev and test environment:

poetry install --extras=dev
poetry install --extras=test

This project uses pre-commit hooks for code linting, etc., whenever git is used to commit or push. To run pre-commit explicitly:

poetry run pre-commit