Code as Craft

Mining Facebook for Gifts on Etsy

Buying gifts is hard. We created the Facebook gift recommender on Etsy to help you overcome the gift-giving equivalent of writer’s block. To do so, it surfaces your friends’ interests from Facebook’s social graph and matches them against millions of items in Etsy’s marketplace. The product works by (1) connecting with your Facebook account and pulling entities for each of your Facebook friends: interests, activities, favorite movies, music, and more, (2) matching these entities to relevant items from Etsy’s marketplace, and then (3) making recommendations on a per-friend basis. Each individual recommendation consists of a context (‘Michael Jackson’), along with a sample set of 4 items from the marketplace. At the core of the gift ideas finder is the matching algorithm, which we train by mining billions of Etsy searches, purchases, and listing favorites collected over many months.

Critical to the matching algorithm is an understanding of Facebook entities and how they relate to items on Etsy. For each Facebook keyword, our algorithm first measures the quality of the keyword in Etsy’s marketplace, and then compares the semantic meaning of the keyword on Facebook with its meaning on Etsy. As an example, the musician ‘Pink’ has over 4 million followers on Facebook, and a quick search for ‘pink’ on Etsy returns over 500,000 unique items. Although many listings are relevant to the keyword ‘pink’, the entity has a very different meaning in the context of Etsy than it does on Facebook. On the other hand, ‘Michael Jackson’ has a large following on Facebook and also has lots of relevant items on Etsy: Michael Jackson dolls, Michael Jackson 1980s Thriller jackets, etc.

To understand the semantic context of keywords on Etsy, we start with an analysis of billions of searches on Etsy.com. We mine these searches on a per-visit level: the key assumption here is that when people search within a visit to the site, they’re generally searching for semantically similar items. For example, someone may search for the term ‘tutu’, then ‘pink tutu’, and then ‘pink skirt’. And in a separate visit, someone else may search for ‘michael jackson’, ‘thriller’, and ‘thriller jacket’. By mining hundreds of millions of visits in aggregate over many months of search data, we’re able to form semantically similar sets for many of the queries that people have searched for on Etsy. As an example, the semantic set for ‘pink’ includes ‘hot pink’, ‘pink jewelry’, ‘pale pink’, and ‘fuchsia’. Since none of these keywords are bands, we then infer that ‘pink’ in the context of Etsy has nothing to do with the musician.
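The per-visit mining described above can be sketched in a few lines of Python. This is a minimal illustration, not our production job: the function name `build_semantic_sets`, the `min_count` and `top_n` parameters, and the sample visits are all assumptions made for the example, and plain co-occurrence counting stands in for whatever similarity measure a real system would use.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_semantic_sets(visits, min_count=2, top_n=5):
    """Map each query to the queries that most often share a visit with it.

    `visits` is an iterable of query lists, one list per site visit.
    `min_count` and `top_n` are hypothetical tuning knobs for this sketch.
    """
    cooccur = defaultdict(Counter)
    for queries in visits:
        # Normalize and de-duplicate so a repeated query counts once per visit.
        unique = sorted({q.strip().lower() for q in queries if q.strip()})
        for a, b in combinations(unique, 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1

    semantic_sets = {}
    for query, neighbors in cooccur.items():
        # Most frequent co-searched queries first; ties broken alphabetically.
        ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = [q for q, n in ranked if n >= min_count][:top_n]
        if kept:
            semantic_sets[query] = kept
    return semantic_sets


# Toy visits, invented for illustration only.
visits = [
    ["tutu", "pink tutu", "pink skirt"],
    ["pink", "hot pink", "pale pink"],
    ["pink", "hot pink", "fuchsia"],
    ["michael jackson", "thriller", "thriller jacket"],
    ["michael jackson", "thriller"],
]
sets = build_semantic_sets(visits)
print(sets["pink"])             # ['hot pink']
print(sets["michael jackson"])  # ['thriller']
```

With only five toy visits, the `min_count` threshold drops pairs seen once, which is why ‘tutu’ ends up with no semantic set here. At real scale the same threshold would filter out noise from unrelated searches that happen to share a visit.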

The second component of the recommendation algorithm requires understanding quality: a keyword like ‘BMW’ may be semantically similar on Facebook and Etsy, yet Etsy isn’t the best place to buy a BMW. To understand quality, we analyze item page view, purchase, and favoriting behavior that originates from searches on Etsy. High-quality, popular items tend to have lots of searches, page views, purchases, and favorites, and mapping these events back to specific search terms is an important data quality measure. Our tracking infrastructure logs all listing ids shown for every search that appears on Etsy. From this, we’re able to join the listing ids of purchase, favorite, and item view events back to their originating searches, and then precisely attribute the sale of an item to a specific search, for example either ‘pink tutu’ or ‘tutu’. We implement this funnel analysis using a custom event sequence analyzer that runs on our Cascading data flow framework (we describe how we use Hadoop and Cascading in more detail elsewhere).
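To make the attribution join concrete, here is a small in-memory sketch in Python. It is not the Cascading event sequence analyzer itself: the record layouts, the function name `attribute_events`, and the tie-breaking rule (credit the most recent earlier search in the same visit that displayed the listing) are assumptions chosen for illustration.

```python
from collections import defaultdict


def attribute_events(search_log, event_log):
    """Credit each view/favorite/purchase to an originating search.

    search_log rows: (visit_id, timestamp, query, listing_ids_shown)
    event_log rows:  (visit_id, timestamp, event_type, listing_id)
    Assumed rule: the latest earlier search in the same visit that
    displayed the listing gets the credit.
    """
    searches_by_visit = defaultdict(list)
    for visit_id, ts, query, shown_ids in search_log:
        searches_by_visit[visit_id].append((ts, query, set(shown_ids)))
    for searches in searches_by_visit.values():
        searches.sort(key=lambda s: s[0])  # chronological order

    counts = defaultdict(lambda: defaultdict(int))
    for visit_id, ts, event_type, listing_id in event_log:
        origin = None
        for search_ts, query, shown in searches_by_visit.get(visit_id, []):
            if search_ts > ts:
                break  # searches after the event cannot have caused it
            if listing_id in shown:
                origin = query  # keep overwriting: the latest match wins
        if origin is not None:
            counts[origin][event_type] += 1
    return {query: dict(c) for query, c in counts.items()}


# Toy logs, invented for illustration only.
search_log = [
    ("v1", 1, "tutu", [101, 102]),
    ("v1", 5, "pink tutu", [102, 103]),
]
event_log = [
    ("v1", 2, "view", 101),
    ("v1", 6, "view", 102),
    ("v1", 7, "purchase", 102),
    ("v1", 8, "favorite", 999),  # never shown in a search: unattributed
]
attributed = attribute_events(search_log, event_log)
print(attributed)
# {'tutu': {'view': 1}, 'pink tutu': {'view': 1, 'purchase': 1}}
```

Listing 102 appears in both result sets, so the purchase is credited to ‘pink tutu’ rather than ‘tutu’ under the assumed rule. Events on listings that no search displayed are simply left out of the per-query counts.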

The final step in the process involves combining these two components to decide which recommendations to show and in what order. The process is similar to search and information retrieval algorithms that must order results based on analogous quality and relevance metrics. We also leverage other data sources that we won’t get into here (for example, popular Facebook likes per category), and of course we apply lots of heuristics as well, since bands in particular often have ambiguous names (Cream, Queen, Tool, Traffic).
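One simple way to picture the combination step is to filter out keywords whose meaning differs between Facebook and Etsy, and then order the rest by a combined score. The sketch below assumes both signals have already been normalized to the range 0 to 1; the `Candidate` class, the `min_relevance` cutoff, the product scoring rule, and every number in the example are hypothetical and are not the actual ranking function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    keyword: str
    relevance: float  # assumed: semantic agreement between Facebook and Etsy
    quality: float    # assumed: strength of views/favorites/purchases on Etsy


def rank_candidates(candidates, min_relevance=0.2):
    """Drop semantically mismatched keywords, then sort by relevance * quality."""
    eligible = [c for c in candidates if c.relevance >= min_relevance]
    return sorted(eligible, key=lambda c: c.relevance * c.quality, reverse=True)


# Made-up scores that mirror the examples discussed in the post.
candidates = [
    Candidate("pink", relevance=0.05, quality=0.9),            # musician vs. color
    Candidate("bmw", relevance=0.9, quality=0.1),              # same meaning, poor fit
    Candidate("michael jackson", relevance=0.9, quality=0.8),  # strong on both
]
ranked_keywords = [c.keyword for c in rank_candidates(candidates)]
print(ranked_keywords)  # ['michael jackson', 'bmw']
```

A product penalizes a keyword that is weak on either signal, which matches the intuition behind the ‘Pink’ and ‘BMW’ examples. A weighted sum or a learned model would be equally reasonable choices for a sketch like this.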