Etsy Icon>

Code as Craft

Robots, Graphs, and Binary Search main image
robots-graphs-binary-search

Robots, Graphs, and Binary Search

  image

save-the-robots-new-york-flyer-1

We love our human customers. That said we get a lot of traffic from robots too. This is great, especially when they use the Etsy API. However, they sometimes misbehave. And they misbehave frequently in the late hours, not unlike a legendary East Village night club. This time Craig Ferguson isn't at the door to keep an eye on things. Instead we monitor the robots attending our night club with graphs:

datacenter-logins

To do this we first compiled a list of where the robots live. This includes datacenters, places that make search engines and crawlers, and companies that lease out CPUs by the slice. The full list has over 1800 records, but here's a sample:

ipcat_datacenters-csv-at-master-c2b7-client9_ipcat-c2b7-github

So given this list, and an incoming IP address, how can one quickly tell who it belongs too, if anyone? Linear scan? No. Binary Search? Yes, and we like binary search.

To do this, the raw file is converted into a static PHP file. It converts all the dotted IPv4 addresses into the equivalent plain integers and turns all the data into an PHP array. As part of the file, there is a find method that uses a binary search algorithm modified to work with disjoint intervals. Can you spot the difference?


<?php
/* Autogenerated.  Do not edit */
class IpDb {
    public static function find($ipstr) {
        $ip = ip2long($ipstr);
        $haystack = self::$db;
        $high = count($haystack) - 1;
        $low = 0;
        while ($high >= $low) {
            $probe = floor(($high + $low) / 2);
            $row = $haystack[$probe];
            if ($row['_ip0'] > $ip) {
                $high = $probe - 1;
            } else if ($row['_ip1'] < $ip) {
                $low = $probe + 1;
            } else {
                return $row;
            }
        }
        return null;
    }
    static public $db = array (
  0 =>
  array (
    '_ip0' => 34603008,
    '_ip1' => 35127295,
    'owner' => 'http://akamai.com/',
  ),
  1 =>
  array (
    '_ip0' => 134488576,
    '_ip1' => 134489087,
    'owner' => 'http://www.peakwebhosting.com/',
  ),
  2 => ... etc...

The resulting file is pushed out in the same way all our other code is. Using this mini WHOIS database is just a PHP function call without any database or network access. The same technique can easily be done in other languages or used for other applications involving date-time intervals.

fraud-engineering

This is just one of many ways we monitor the site for bad behavior (robots or otherwise). Seeing how much traffic comes from where "hands aren't on keyboard" has given us many insights into how the site is used. Some more details on how this data is used can be found in presentation on Fraud Engineering given at the Merchant Risk Council conference in March.

Finally, let's not forget that not all robots are bad.

Update: as soon as I posted this, my colleague Jerry Soung informed he implemented the same thing for a different IP database. Binary search FTW.