# How to mitigate crawlers in case of too much traffic

It happens that some program hammers Open Food Facts, causing serious latency and 504 errors for all users.

Most of the time it's someone who is not aware that the data should be downloaded from https://world.openfoodfacts.org/data instead of harvested by crawling the site.

## Find the offending IPs

On the production server we have a script, `concatenate_by_ip.pl`.

Go to the access log of your site and run:

```bash
tail -n 100000 access_log | ./concatenate_by_ip.pl | tail -n 10
```

This shows a list of IP addresses with the number of requests each of them made.
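
If the script is not at hand, a similar top-ten can be produced with standard tools. This is a minimal sketch, assuming the usual combined log format where the client IP is the first field:

```bash
# Count requests per client IP over the last 100000 log lines
# (assumes the client IP is the first whitespace-separated field)
tail -n 100000 access_log | awk '{print $1}' | sort | uniq -c | sort -rn | head -n 10
```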

```bash
tail -n 100000 access_log | grep <ip_address>
```

Use this to see the activity of a single IP address and try to understand what is going on.

Search and facet requests are the most resource-consuming (product read requests are just a file read). See the sketch below for a per-path breakdown.
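
To see whether a given IP is mostly hitting search or facet pages, you can break down its requests by path. A sketch, assuming the combined log format where the request path is the 7th field:

```bash
# Top requested paths for one IP address
# (the path is field 7 in the combined log format)
tail -n 100000 access_log | grep '<ip_address>' | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 20
```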

## Warn

If the User-Agent gives some indication of who is using the platform, try to contact them.
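
To extract the User-Agent strings sent by the offending IP, something like this can help (a sketch assuming the combined log format, where the User-Agent is the third double-quoted field):

```bash
# List User-Agent strings used by one IP address
tail -n 100000 access_log | grep '<ip_address>' | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```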

## Blacklist

### On off2 reverse proxy

```bash
fail2ban-client set nginx-botsearch banip <IP>
```

See "How to use fail2ban to ban bots" for details.
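
To check that the ban took effect, or to lift it later, `fail2ban-client` can be queried directly:

```bash
# Show the jail status, including currently banned IPs
fail2ban-client status nginx-botsearch

# Unban the IP once the issue is resolved
fail2ban-client set nginx-botsearch unbanip <IP>
```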

### On off1

If needed, blacklist the IP address with an iptables rule:

```bash
iptables -A INPUT -s <ip_address> -j DROP
```
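
To verify the rule, or to remove it later:

```bash
# Check that the rule is in place and whether it is matching packets
iptables -L INPUT -v -n | grep <ip_address>

# Remove the rule once the issue is resolved
iptables -D INPUT -s <ip_address> -j DROP
```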

## Make it better

We could instead implement traffic rate limiting with NGINX.
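
As a starting point, a minimal sketch of NGINX rate limiting using `ngx_http_limit_req_module`; the zone name, rate, burst and location path are hypothetical values that would need to be tuned for our traffic:

```nginx
# In the http {} block: one shared zone keyed by client IP,
# allowing on average 10 requests per second per IP (hypothetical rate)
limit_req_zone $binary_remote_addr zone=crawler_limit:10m rate=10r/s;

# In the relevant server {} / location {} block, e.g. the costly search/facet pages
location /cgi/ {
    limit_req zone=crawler_limit burst=20 nodelay;
    # ... existing proxy / fastcgi configuration ...
}
```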