Aggressive crawler blocking

Overview

Over the past year, crawlers have become much more aggressive and prevalent, and some provide no direct means of identifying themselves. We think this is related to the mass ingestion of content for machine learning training.

The tools used for this activity are typically run from cloud service providers and use many different IP addresses, which makes the offending crawlers harder to identify and block. However, we have noticed some patterns that help identify some of the worst offenders.

The LUNA viewers use faceted data (Who, What, Where, When) to encourage useful crawlers to traverse collections and index the content for web searching. This is a double-edged sword: helpful in many cases, but harmful with aggressive crawlers.

Blocking IP addresses using Fail2ban

We have had some luck banning IP addresses using a tool called fail2ban.

The filter described below works by detecting more than a single facet search within one second from the same IP address. This is a clear indicator that a human is not using the interface.
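
To see what this rule would catch before enabling it, the same idea can be approximated offline against an access log. The sketch below is only an illustration: the function name facet_bursts is hypothetical, it assumes the Apache combined log format, it only looks at the facet path /luna/servlet/view/all/wh..., and it groups requests by clock second rather than using fail2ban's sliding time window.

```shell
# Hypothetical helper: list client IPs that made two or more facet
# requests within the same clock second.
# Assumes Apache combined log format: $1 = client IP, $4 = "[timestamp",
# $7 = request path.
facet_bursts() {
    awk '$7 ~ /^\/luna\/servlet\/view\/all\/wh/ {
             key = $1 " " substr($4, 2)   # IP plus timestamp to the second
             if (++hits[key] == 2) print $1
         }' "$1" | sort -u
}
```

Run it as facet_bursts followed by the path to an access log. An IP that shows up here would likely be banned by the luna-w4 jail settings shown further down.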

Filters

Create file /etc/fail2ban/filter.d/luna-w4.conf

# Fail2Ban configuration file
#
# Author: David Wong
#
# $Revision$
#

[INCLUDES]

# Read common prefixes. If any customizations available -- read them from
# common.local
#before = common.conf

[Definition]

# Option: failregex
# Notes.: regex to mrsid requests in access.log.
#         host must be matched by a group named "host". The tag "<HOST>" can
#         be used for standard IP/hostname matching and is only an alias for
#         (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
failregex = ^<HOST> -.*(GET|HEAD) /luna/servlet/view/all/wh.+
            ^<HOST> -.*(GET|HEAD) /luna/servlet/user/presentations/create.+
            ^<HOST> -.*(GET|HEAD) /luna/servlet/user/groups/create.+

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
#ignoreregex = ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+width=750.*
#              ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+height=750.*
ignoreregex = ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*Googlebot.*)"
              ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*GoogleOther.*)"

datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}

 

Create file /etc/fail2ban/filter.d/luna-agent.conf

# Fail2Ban configuration file
#
# Author:
#
# $Revision$
#

[INCLUDES]

# Read common prefixes. If any customizations available -- read them from
# common.local
#before = common.conf

[Definition]

# Option: failregex
# Notes.: regex to mrsid requests in access.log.
#         host must be matched by a group named "host". The tag "<HOST>" can
#         be used for standard IP/hostname matching and is only an alias for
#         (?:::f{4,6}:)?(?P<host>[\w\-.^_]+)
# Values: TEXT
#
# failregex = ^<HOST> -.*(GET|HEAD) /luna/servlet/user/presentations/create.+

# block various bots based on agent name
failregex = ^<HOST> -.*GET /luna/servlet.+"(.*meta-externalagent.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*Amazonbot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*facebookexternalhit.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*Bytespider.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*ClaudeBot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*HawaiiBot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*GPTBot.*)"
            ^<HOST> -.*GET /luna/servlet.+"(.*Applebot.*)"

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
#ignoreregex = ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+width=750.*
#              ^<HOST> -.*GET /mrsid/bin/image_jpeg.pl.+height=750.*

#ignore Google crawlers
ignoreregex = ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*Googlebot.*)"
              ^<HOST> -.*GET /luna/servlet/view/all/wh.+"(.*GoogleOther.*)"

datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}
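
When deciding which agent names belong in a list like the one above, it can help to see which user agents generate the most requests. The following is a rough sketch, not part of the original setup: the function name top_agents is hypothetical, and it assumes the Apache combined log format, where the user agent is the sixth field when a line is split on double quotes.

```shell
# Hypothetical helper: show the most frequent user-agent strings in a log.
# $1 = access log path, $2 = how many entries to show (default 10).
top_agents() {
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example, top_agents /var/log/apache2/access.log 20 would list the twenty busiest agents, where the log path is an assumption about the local Apache layout.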

 

At the bottom of /etc/fail2ban/jail.conf, append the following:

[luna-w4]
enabled = true
port = http,https
filter = luna-w4
logpath = /var/log/apache*/acces*.log
findtime = 1
maxretry = 2
bantime = 3600

[luna-agent]
enabled = true
port = http,https
filter = luna-agent
logpath = /var/log/apache*/acces*.log
findtime = 1
maxretry = 1
bantime = 3600

 

This bans an offending IP address for 3600 seconds (1 hour).

Useful commands

Restart:
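
A restart is needed for fail2ban to pick up new filters and jails. As a minimal sketch, assuming a systemd-based host where the service unit is named fail2ban (the wrapper name restart_fail2ban is hypothetical):

```shell
# Restart the fail2ban service so edited filters and jails are loaded.
# Assumes systemd and a unit named "fail2ban"; adjust for other init systems.
restart_fail2ban() {
    sudo systemctl restart fail2ban
}
```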

Check currently banned IPs for this filter:
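
The fail2ban-client status command, given a jail name, reports that jail's failure counts and its list of currently banned IP addresses. A small sketch using the jail names defined on this page (the wrapper name banned_ips is hypothetical):

```shell
# Show the status of one jail, including its currently banned IPs.
# $1 = jail name, e.g. luna-w4 or luna-agent.
banned_ips() {
    sudo fail2ban-client status "$1"
}
```

For example, banned_ips luna-w4 checks the facet-search jail and banned_ips luna-agent checks the user-agent jail.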

 

 

Testing the filters:
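
fail2ban ships a fail2ban-regex tool that runs a filter file against a log file and reports how many lines matched, which is a safe way to check a filter before enabling its jail. A sketch, assuming the filters live in /etc/fail2ban/filter.d as described above (the wrapper name test_filter is hypothetical, and the log path must be supplied for the local Apache layout):

```shell
# Run a filter against a log file and report matched and missed lines.
# $1 = filter name (e.g. luna-w4), $2 = path to an Apache access log.
# Reading the log may require elevated privileges on some hosts.
test_filter() {
    fail2ban-regex "$2" "/etc/fail2ban/filter.d/$1.conf"
}
```

For example, test_filter luna-agent /var/log/apache2/access.log, where the log path is an assumption.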

 

Unban a specific IP:
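
A ban can be lifted before bantime expires with the unbanip action of fail2ban-client. A minimal sketch (the wrapper name unban_ip is hypothetical, and 192.0.2.10 in the example is a placeholder documentation address):

```shell
# Remove a single IP address from a jail's ban list.
# $1 = jail name (e.g. luna-w4), $2 = IP address to unban.
unban_ip() {
    sudo fail2ban-client set "$1" unbanip "$2"
}
```

For example, unban_ip luna-w4 192.0.2.10.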

You can also whitelist any IP addresses that you never want to ban (fail2ban provides the ignoreip jail option for this).

Here is a link to some instructions on this.