Aggressive crawler blocking
Overview
Over the past year, crawlers have become much more aggressive and prevalent, and some offer no direct means of identifying themselves. We think this is related to the mass ingestion of content for machine learning training.
The tools used for this activity typically run from cloud service providers and use many different IP addresses, which makes the offending crawlers harder to identify and block. However, we have noticed some patterns that help identify some of the worst offenders.
The LUNA viewers make use of faceted data (Who, What, Where, When) to encourage useful crawlers to traverse collections and index the content for web searching. This is a double-edged sword: helpful with well-behaved crawlers, harmful with aggressive ones.
Blocking IP addresses using Fail2ban
We have had some luck banning IP addresses using a tool called fail2ban.
The luna-w4 filter described below works by detecting more than a single facet search within one second from the same IP address, which is a clear indicator that a human is not using the interface. A second filter, luna-agent, bans known crawlers based on their user-agent name.
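To check whether such bursts of facet requests show up in your own logs before enabling a ban, a rough shell sketch like the following may help. It groups facet requests by client IP and calendar second, which only approximates fail2ban's sliding findtime window. The sample log lines, the IP addresses, and the function name burst_ips are made up for illustration.

```shell
# Hypothetical sample in Apache combined log format; substitute your real access log.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /luna/servlet/view/all/what/Maps HTTP/1.1" 200 512 "-" "Mozilla/5.0"
203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /luna/servlet/view/all/where/Paris HTTP/1.1" 200 498 "-" "Mozilla/5.0"
198.51.100.4 - - [10/Oct/2024:13:55:40 +0000] "GET /luna/servlet/view/all/who/Smith HTTP/1.1" 200 530 "-" "Mozilla/5.0"
EOF

# Print "IP count" for each client and second with more than one facet request.
burst_ips() {
  grep -E '"(GET|HEAD) /luna/servlet/view/all/wh' "$1" \
    | awk '{print $1, $4}' \
    | sort | uniq -c \
    | awk '$1 > 1 {print $2, $1}'
}

burst_ips "$sample"
```

With the sample above, only 203.0.113.7 is listed, because it made two facet requests in the same second.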
Filters
Create file /etc/fail2ban/filter.d/luna-w4.conf
# Fail2Ban configuration file
#
# Author: David Wong
#
# $Revision$
#
[INCLUDES]
# Read common prefixes. If any customizations available -- read them from
# common.local
#before = common.conf
[Definition]
# Option: failregex
# Notes.: regex to match LUNA facet-search and create requests in access.log.
# host must be matched by a group named "host". The tag "<HOST>" can
# be used for standard IP/hostname matching and is only an alias for
# (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
failregex = ^<HOST> -.*(GET|HEAD) /luna/servlet/view/all/wh.+
            ^<HOST> -.*(GET|HEAD) /luna/servlet/user/presentations/create.+
            ^<HOST> -.*(GET|HEAD) /luna/servlet/user/groups/create.+
# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
#ignoreregex = ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+width=750.*
# ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+height=750.*
ignoreregex = ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*Googlebot.*)"
              ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*GoogleOther.*)"
datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}
Create file /etc/fail2ban/filter.d/luna-agent.conf
# Fail2Ban configuration file
#
# Author:
#
# $Revision$
#
[INCLUDES]
# Read common prefixes. If any customizations available -- read them from
# common.local
#before = common.conf
[Definition]
# Option: failregex
# Notes.: regex to match crawler user-agent names in access.log.
# host must be matched by a group named "host". The tag "<HOST>" can
# be used for standard IP/hostname matching and is only an alias for
# (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
# failregex = ^<HOST> -.*(GET|HEAD) /luna/servlet/user/presentations/create.+
# block various bots based on agent name
failregex = ^<HOST> -.*GET /luna/servlet.+"(.*meta-externalagent.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*Amazonbot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*facebookexternalhit.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*Bytespider.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*ClaudeBot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*HawaiiBot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*GPTBot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*Applebot.*)"
# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
#ignoreregex = ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+width=750.*
# ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+height=750.*
#ignore Google crawlers
ignoreregex = ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*Googlebot.*)"
              ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*GoogleOther.*)"
datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}
At the bottom of /etc/fail2ban/jail.conf, append the following:
[luna-w4]
enabled = true
port = http,https
filter = luna-w4
logpath = /var/log/apache*/acces*.log
findtime = 1
maxretry = 2
bantime = 3600
[luna-agent]
enabled = true
port = http,https
filter = luna-agent
logpath = /var/log/apache*/acces*.log
findtime = 1
maxretry = 1
bantime = 3600
Each jail bans the offending IP address for 3600 seconds (1 hour).
Useful commands
Restart:
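Assuming a systemd-based host (an assumption; use your init system's equivalent otherwise), one way to restart the service so that new filters and jails are loaded:

```shell
# Assumes fail2ban runs as a systemd unit named "fail2ban".
sudo systemctl restart fail2ban
```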
Check currently banned IPs for this filter:
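A sketch using fail2ban-client, with the jail names defined above; the status output includes the list of currently banned IP addresses for that jail:

```shell
# Jail names match the [luna-w4] and [luna-agent] sections in jail.conf.
sudo fail2ban-client status luna-w4
sudo fail2ban-client status luna-agent
```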
Testing the filters:
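One way to dry-run a filter against an existing log is fail2ban-regex, which reports how many lines match failregex and ignoreregex without banning anything. The log path below is an assumption; point it at a file matched by the logpath glob in your jail.

```shell
# Assumed log location; adjust to your Apache installation.
LOG=/var/log/apache2/access.log

sudo fail2ban-regex "$LOG" /etc/fail2ban/filter.d/luna-w4.conf
sudo fail2ban-regex "$LOG" /etc/fail2ban/filter.d/luna-agent.conf
```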
Unban a specific IP:
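A sketch using fail2ban-client; the address 203.0.113.7 is a placeholder, and the jail name should be whichever jail issued the ban:

```shell
# Replace the placeholder address with the IP to release.
sudo fail2ban-client set luna-w4 unbanip 203.0.113.7
```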