Overview
Over the past year, crawlers have become much more aggressive and prevalent, and some provide no direct means of identifying themselves. We think this is related to the mass ingestion of content for machine learning training.
The tools used for this activity are typically run from cloud service providers and use many different IP addresses, which makes it more difficult to identify and block the offending crawlers. However, we have noticed some patterns that help us identify some of the worst offenders.
The LUNA viewers make use of faceted data (Who, What, Where, When) to encourage useful crawlers to traverse collections and index the content for web searching. This is a double-edged sword: helpful in many cases, but a liability with aggressive crawlers.
Blocking IP addresses using Fail2ban
We have had some luck banning IP addresses using a tool called fail2ban.
The filter described below works by detecting more than one facet search within one second from the same IP address, which is a clear indicator that a human is not using the interface.
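The counting idea can be illustrated with a small Python sketch. This is not fail2ban's own code and its bookkeeping may differ in detail. It only approximates the rule, assuming the thresholds used in the jail settings on this page (2 hits within a 1-second window). The function name, the event format, and the sample addresses are invented for illustration.

```python
from collections import defaultdict, deque


def find_offenders(events, findtime=1.0, maxretry=2):
    """Return the set of IPs with at least `maxretry` hits inside any
    `findtime`-second window.

    `events` is an iterable of (ip, unix_time) pairs sorted by time; each
    pair stands for one log line that matched the facet-search filter.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    offenders = set()
    for ip, ts in events:
        window = recent[ip]
        window.append(ts)
        # Drop hits that have aged out of the window.
        while ts - window[0] > findtime:
            window.popleft()
        if len(window) >= maxretry:
            offenders.add(ip)
    return offenders


# Hypothetical traffic: one client firing two facet searches in the same
# second, and one person clicking facets several seconds apart.
events = [
    ("203.0.113.7", 100.0),
    ("198.51.100.4", 100.2),
    ("203.0.113.7", 100.4),
    ("198.51.100.4", 105.0),
]
print(find_offenders(events))  # {'203.0.113.7'}
```

Only the first address is flagged, because the second client's two requests fall almost five seconds apart.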
Filter
Create the file /etc/fail2ban/filter.d/luna-w4.conf:
# Fail2Ban configuration file
#
# Author: David Wong
#
# $Revision$
#

[INCLUDES]

# Read common prefixes. If any customizations available -- read them from
# common.local
#before = common.conf

[Definition]

# Option: failregex
# Notes.: regex to mrsid requests in access.log.
#         host must be matched by a group named "host". The tag "<HOST>" can
#         be used for standard IP/hostname matching and is only an alias for
#         (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
failregex = ^<HOST> -.*(GET|HEAD) /luna/servlet/view/all/wh.+

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
#ignoreregex = ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+width=750.*
#              ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+height=750.*

datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}
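Before enabling the jail, it can help to check which requests the failregex would catch. The Python sketch below is a rough check and does not replace testing with fail2ban itself. It swaps the <HOST> tag for a simplified host pattern (an assumption; fail2ban's real expansion is more elaborate) and applies the result to whole log lines. Both sample log lines, including the URL paths after the matched prefix, are invented for illustration.

```python
import re

# Simplified stand-in for fail2ban's <HOST> tag (assumption: the real
# expansion is more elaborate than this).
HOST = r"(?P<host>[\w\-.:]+)"

failregex = r"^<HOST> -.*(GET|HEAD) /luna/servlet/view/all/wh.+"
pattern = re.compile(failregex.replace("<HOST>", HOST))

# Hypothetical access-log lines: a facet search and an unrelated request.
facet_line = (
    '203.0.113.7 - - [12/Mar/2025:10:15:30 +0000] '
    '"GET /luna/servlet/view/all/what/Maps HTTP/1.1" 200 5123'
)
other_line = (
    '203.0.113.7 - - [12/Mar/2025:10:15:31 +0000] '
    '"GET /luna/servlet/detail/example HTTP/1.1" 200 812'
)

for line in (facet_line, other_line):
    match = pattern.match(line)
    if match:
        print("match, host =", match.group("host"))
    else:
        print("no match")
```

The "wh" at the end of the expression is what lets a single pattern cover the who, what, where, and when facets. On a server with fail2ban installed, the bundled fail2ban-regex tool can run the real filter file against a real access log.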
At the bottom of /etc/fail2ban/jail.conf, append the following:
[luna-w4]
enabled = true
port = http,https
filter = luna-w4
logpath = /var/log/apache*/access.log
findtime = 1
maxretry = 2
bantime = 3600
This will ban the IP address for 3600 seconds (1 hour).
Useful commands
Restart:
sudo service fail2ban restart
Check currently banned IPs for this filter:
sudo fail2ban-client status luna-w4
Unban a specific IP:
sudo fail2ban-client set luna-w4 unbanip 192.168.1.1
You can also whitelist any IP addresses that you never want to ban. Here is a link to some instructions on this.