Synthetic data generation for investigative graphs based on patterns of bad-actor tradecraft.
Default input data sources:
- https://www.opensanctions.org/
- https://www.openownership.org/
- https://www.occrp.org/en/project/the-azerbaijani-laundromat/the-raw-data
Ontologies used:
The simulation uses the following process:

- Construct a `Network` that represents bad-actor subgraphs:
  - Use [OpenSanctions](https://www.opensanctions.org/) (risk data) and [Open Ownership](https://www.openownership.org/) (link data) for real-world UBO topologies
  - Run [Senzing](https://senzing.com/) entity resolution to generate a "backbone" for organizing the graph
  - Partition into subgraphs and run centrality measures to identify UBO owners (see the NetworkX sketch below)
  - Use …
- Configure a `Simulation` for generating patterns of bad-actor tradecraft:
  - Analyze the transactions of the OCCRP ["Azerbaijani Laundromat"](https://www.occrp.org/en/project/the-azerbaijani-laundromat/the-raw-data) leaked dataset (event data)
  - Sample probability distributions for shell topologies, transfer amounts, and transfer timing (a sampling sketch follows this list)
  - Generate a large portion of "legit" transfers (49:1 ratio)
- Generate the `SynData` (synthetic data) by applying the simulation on the network:
  - Track the generated bad-actor transactions
  - Serialize the transactions and the people/companies involved
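To make the sampling and mixing step concrete, here is a minimal sketch of how transfer amounts, transfer timing, and the 49:1 legit-to-bad mix might be generated. The log-normal and exponential distributions and every parameter value below are illustrative assumptions, not the project's actual fitted values:

```python
import random

import numpy as np

rng = np.random.default_rng(42)

def sample_transfer() -> dict:
    # Illustrative assumptions: log-normal amounts and exponential
    # inter-transfer delays, in the spirit of distributions sampled
    # from the OCCRP "Azerbaijani Laundromat" event data.
    return {
        "amount_usd": round(float(rng.lognormal(mean=10.0, sigma=1.5)), 2),
        "delay_days": float(rng.exponential(scale=3.0)),
    }

def generate_transfers(n_bad: int, legit_ratio: int = 49) -> list[dict]:
    """Emit `legit_ratio` "legit" transfers for every bad-actor transfer."""
    transfers: list[dict] = []
    for _ in range(n_bad):
        transfers.append(sample_transfer() | {"label": "bad"})
        transfers.extend(
            sample_transfer() | {"label": "legit"} for _ in range(legit_ratio)
        )
    random.shuffle(transfers)
    return transfers

txns = generate_transfers(n_bad=10)
print(len(txns), "total;", sum(t["label"] == "bad" for t in txns), "bad")
```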
Note that much of the "heavy-lifting" here is entity resolution performed by [Senzing](https://senzing.com/) and network analytics performed by [NetworkX](https://networkx.org/).
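As a rough illustration of that network-analytics step, the following sketch partitions a toy graph into connected components and ranks the nodes of each by centrality. The toy edges and the choice of betweenness centrality are assumptions for illustration only; which measure best surfaces a UBO is a modeling decision:

```python
import networkx as nx

# Toy ownership graph; in the project, the edges come from the
# Senzing-resolved OpenSanctions / Open Ownership records.
G = nx.Graph()
G.add_edges_from([
    ("owner_a", "shell_1"), ("shell_1", "shell_2"), ("shell_2", "bank_x"),
    ("owner_b", "shell_3"), ("shell_3", "bank_y"),
])

# Partition into subgraphs, then rank each component's nodes by
# betweenness centrality (an illustrative choice of measure).
for nodes in nx.connected_components(G):
    sub = G.subgraph(nodes)
    ranked = nx.betweenness_centrality(sub)
    top = max(ranked, key=ranked.get)
    print(sorted(nodes), "-> most central:", top)
```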
As simulations scale, both the data generation and the fraud-pattern detection would benefit from using the [cuGraph](https://github.com/rapidsai/cugraph) high-performance backend for NetworkX.
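For example, assuming the separately packaged `nx-cugraph` backend is installed (it does not ship with NetworkX itself), existing NetworkX calls can be dispatched to the GPU with little or no code change:

```python
import networkx as nx

G = nx.erdos_renyi_graph(10_000, 0.001, seed=1)

# With nx-cugraph installed, NetworkX (>= 3.2) can dispatch supported
# algorithms to the GPU explicitly via the backend keyword:
scores = nx.betweenness_centrality(G, backend="cugraph")

# ...or automatically, by setting the NETWORKX_BACKEND_PRIORITY=cugraph
# environment variable before running unmodified NetworkX code.
```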
This project uses [poetry](https://python-poetry.org/) for dependency management, virtual environments, builds, packaging, etc.
To set up an environment locally:

```bash
git clone https://github.com/DerwenAI/kleptosyn.git
cd kleptosyn
poetry install --extras=demo
```
The source code requires Python 3.11 or later.
Next, download the default input datasets:

```bash
wget https://raw.githubusercontent.com/Kineviz/senzing_starter_kit/refs/heads/main/senzing_rootfs/data/open-sanctions.json
wget https://raw.githubusercontent.com/Kineviz/senzing_starter_kit/refs/heads/main/senzing_rootfs/data/open-ownership.json
wget https://storage.googleapis.com/erkg/starterkit/export.json
wget https://raw.githubusercontent.com/cj2001/senzing_occrp_mapping_demo/refs/heads/main/occrp_17k.csv
```
Then run the demo script:

```bash
poetry run python3 demo.py
```

Or explore interactively in a notebook:

```bash
poetry run jupyter-lab
```
By default, the output results will be serialized as:

- `graph.json`: the network representation
- `transact.csv`: transactions generated by the simulation
- `entities.csv`: entities generated by the simulation
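The following sketch loads the three output files for inspection; treating `graph.json` as NetworkX node-link format is an assumption worth verifying against the actual file:

```python
import csv
import json

from networkx.readwrite import json_graph

# transact.csv and entities.csv are plain CSV files:
with open("transact.csv", newline="") as f:
    transactions = list(csv.DictReader(f))

with open("entities.csv", newline="") as f:
    entities = list(csv.DictReader(f))

print(len(transactions), "transactions,", len(entities), "entities")

# If graph.json follows NetworkX's node-link format (an assumption),
# it can be rehydrated back into a graph object:
with open("graph.json") as f:
    G = json_graph.node_link_graph(json.load(f))
```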
First, to set up the `dev` and `test` environment:

```bash
poetry install --extras=dev
poetry install --extras=test
```
This project uses [pre-commit](https://pre-commit.com/) hooks for code linting, etc., whenever `git` is used to commit or push. To run `pre-commit` explicitly:

```bash
poetry run pre-commit
```