Pareto’s Principle at Play in Polyvore

One of Fred Wilson’s great insights is the importance of mobile-first design: mobile form factor constraints force design simplicity. This insight remains valid despite recent criticisms of a mobile-first, web-second strategy.

[Figure: usage_dist, access distribution across Polyvore’s endpoints]

The above graph is a good illustration of this insight. The graph shows access distribution to Polyvore’s endpoints (I have removed the scales and endpoint names). We compiled this data using Splunk during a recent audit of our services to identify places where we could simplify.

The bulk of Polyvore’s activity is concentrated in a handful of endpoints. The 50th endpoint barely registers on the graph, and I am embarrassed to confess that we have 630+ endpoints, of which only the first 100 are plotted :(

This graph tells me that our product could be far simpler and still deliver the bulk of its utility to most of our users. In fact, this is exactly what is happening in our recently launched iOS app.

The best part about simplifying products is that it lets us concentrate effort on improving the parts that are used the most. Going deep is a better long-term strategy than going wide.

Polyvore 2012 Infographic

2012 has been another amazing year of growth for Polyvore (20M monthly UV and 2.3x revenue growth) and the team created a great infographic to share it with the world.

On a personal note, when we started out, we never imagined reaching a point like this. We always had our heads down, focused on the next 3 months. Today, however, after watching the company grow year over year, I am convinced that Polyvore will continue to grow far beyond this point. It is all thanks to our amazing team and equally amazing community.

Polyvore Infographic

Positive Feedback Loops in Local Reviews

While in Paris I made the mistake of using TripAdvisor for finding a good restaurant. The place I picked was ranked #5 in all of Paris but in reality it was just an average place — hardly better than what you could get at a random local cafe. The place was not a typical tourist trap, with neon lights on a busy street. It is a tiny inconspicuous place tucked away in a little side street.

I think what happened is that some American tourists randomly stumbled into this place. Due to an elixir of serendipitous discovery and hunger, and maybe a few glasses of wine, they had a great time and gave it a euphoric review on TripAdvisor. The high rating brought in more clueless tourists who did the same, just like how a coral reef grows. I am pretty sure this is what happened, because everyone there was a tourist.

This is a symptom of not enough liquidity in TripAdvisor’s Paris restaurant reviews. In places where there is a lot of liquidity, subsequent reviews quickly correct these inaccuracies. Still, TripAdvisor could do a better job of computing rankings. For example, if all of a restaurant’s reviews come from tourists rather than locals, they should carry less weight. Or if a few restaurants in a huge city like Paris are getting the bulk of the reviews, it’s a good signal that something has gone loopy.

Unfortunately, Yelp is no better in this case (even though I constantly rely on them in the US). Apparently the best restaurant in Paris is a falafel place.

It seems local ratings and reviews are still largely an unsolved problem except for a few narrow regions. Next time traveling in Europe, I will use the Michelin guides or ask a local.

Please leave me a comment with a good Paris restaurant recommendation, ideally $$ – $$$ and near Les Halles.

Smartphones + Free Online Education = Disruption

Ever since the industrial age, technology has been widening the gap between the rich and poor. Either you fall on the side of the curve that benefits from technology, in which case you become wealthier, or you fall below the threshold, in which case technology commoditizes you and you become poorer.

Actually, for a while, it was possible for people below the threshold to move up by learning new skills: farmers doing manual labor became more skilled factory workers. But the relentless improvement in technology pushed the bar higher and higher until people now have a hard time keeping up. Where does the average person go once a computer can beat Kasparov in chess? Hence the erosion of the middle class.

There is however still room in high tech. If you are lucky and of above average intelligence, you can still be educated and secure one of the increasingly scarce jobs that are above the threshold of technological obsolescence. I say lucky because you have to be born in the developed world with access to higher education, etc.

With the advent of global Internet connectivity, affordable computing devices and free online education (information wants to be free), a few billion brains will soon have the potential to compete with the lucky few for the same increasingly scarce jobs.

I think something amazing is going to happen when all this comes to pass.  Either a weird dystopian future where everything is completely commoditized, or something amazingly wonderful, like all the brains working together to enable some breakthrough science that opens new frontiers for humanity.

My hope is for the latter.

Polyvore’s Awesome Crawler System

Note: This is cross-posted to Polyvore’s Engineering Blog.

Polyvore’s product index spans millions of items. The bulk of these arrive via our awesome user community who are constantly scouring the web for interesting products using our clipper bookmarklet.


Our clipper is quite smart: it auto-detects the correct price, landing page, etc. We also use a background task to scrape the Facebook Open Graph meta information to glean the correct description and title for each product. However, this information is essentially a snapshot taken at the time of clipping. We don’t get notified about changes in a product’s price or availability. Since Polyvore is a social commerce platform, we felt it was important to have up-to-date price and availability information about the products in our index.



To augment our product index, we started by integrating data feeds directly from retailers that offered them. But we soon found that these feeds were constantly breaking, out of date and missing useful metadata. So, we decided to write our own crawlers to regularly crawl retail sites and extract accurate, up-to-date product catalogue data.



We split the problem into two parts: a crawler framework and site specific definitions. The crawler framework’s job is to start at a URL, fetch it, optionally extract product data from it, find new URLs to crawl, and repeat the process until it has visited all parts of a given site.  

The site specific definitions tell the crawler where to start and how to extract information from a subset of the pages that have been crawled.  We outsourced writing these definitions to an external team.



Here is a sample crawler definition:



my $scraper = Polyvore::Scraper->new({
  start => 'http://www.happysocks.com/us/',

  # invoke the scraper whenever the URL matches this pattern
  scrape_re => qr{\/us\/[a-z].*},

  # declare what needs to be extracted using CSS or XPATH expressions
  scraper => scraper {
    process 'div#content h1',            'title'      => \&html2text;
    process 'div.product_container img', 'imgurl[]'   => '@src';
    process 'div#product_properties h2', 'price'      => 'TEXT';
    process 'div.size span.single',      'sizes[]'    => 'TEXT';
    process 'select#form_size option',   'sizes[]'    => 'TEXT';
    process 'span.sold_out_big',         'outofstock' => 'TEXT';
  }
});

$scraper->crawl();



The crawler framework takes care of the rest.  It spins up EC2 instances as needed, deploys crawlers, performs the crawls, monitors the health of each crawler (using our stats collection system) and automatically opens trouble tickets for our outsourced team whenever it detects an issue.

Challenges

Extracting information from HTML pages

Anyone who has ever dealt with extracting structured information from HTML pages knows that it is a total pain in the ass. It is tedious to specify which elements of the page content you want to extract. Most definitions are very fragile and susceptible to slight site changes or the presence or absence of additional page content. Even though we were outsourcing this part of the process, we still wanted to make it easier. Fortunately, we found Tatsuhiko Miyagawa (@miyagawa)’s amazing Web::Scraper. It allows you to declare what you want extracted using XPath or CSS selectors (internally translated to XPath expressions). Declarative systems make life a lot easier for developers because you let the machine work out what needs to happen to satisfy your declarations. We have also found that most sites are fairly stable in their CSS structure, and therefore our crawlers are less fragile.

Infinite Crawl Loops

Writing generic crawlers is hard.  Your crawler can get stuck in infinite loops because of dynamically generated sites that offer infinite combinations of pagination / sort / filter permutations.  As a safeguard against this, we adopted the following strategy: As we crawl and discover new URLs to crawl, we always prioritize processing URLs that need to be scraped (typically product detail pages).  We also keep track of how many pages we have crawled since the last time we encountered a detail page.  If we have been crawling for a while (say 5000 pages) and have not encountered a new product detail page, we assume we are stuck in a loop and abort.
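The safeguard above can be sketched roughly as follows. This is a minimal illustration, not Polyvore's actual framework: the function name crawl_with_loop_guard and its parameters (fetch_links, is_detail, max_pages_without_detail) are hypothetical, and the two callbacks stand in for real fetching and URL matching.

```perl
use strict;
use warnings;

# Hypothetical sketch of the loop guard described above.
#   fetch_links: callback, given a URL returns the URLs linked from it
#   is_detail:   callback, true when a URL is a product detail page
sub crawl_with_loop_guard {
    my (%args) = @_;
    my $fetch_links = $args{fetch_links};
    my $is_detail   = $args{is_detail};
    my $max_dry     = $args{max_pages_without_detail} // 5000;

    # Two queues so that detail pages are always processed first.
    my (@detail_queue, @other_queue);
    my %seen = ($args{start} => 1);
    push @other_queue, $args{start};

    my $pages_since_detail = 0;
    my @scraped;

    while (@detail_queue || @other_queue) {
        my $url = @detail_queue ? shift @detail_queue : shift @other_queue;

        if ($is_detail->($url)) {
            push @scraped, $url;
            $pages_since_detail = 0;    # progress was made, reset the counter
        }
        else {
            $pages_since_detail++;
            # Too many pages without a new detail page: assume a loop and abort.
            return { aborted => 1, scraped => \@scraped }
                if $pages_since_detail >= $max_dry;
        }

        for my $next ($fetch_links->($url)) {
            next if $seen{$next}++;     # never queue the same URL twice
            if ($is_detail->($next)) { push @detail_queue, $next }
            else                     { push @other_queue,  $next }
        }
    }
    return { aborted => 0, scraped => \@scraped };
}
```

Note that the %seen set alone does not protect against dynamically generated pages, because every pagination or filter permutation is a new URL; the counter is what ends the crawl in that case.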

Error Detection

We are now crawling hundreds of sites. When you are crawling that many sites, there are always a few crawlers that are misbehaving. It is similar to managing a data center with thousands of machines. Probabilistically, something is going to be broken at any given moment. We value our time way too much to spend it baby-sitting hundreds of crawlers and looking for breakage. So, using our stats system, we built a statistical model of what the results from a healthy crawler look like. Then, we compare stats about each completed crawl against our model. If they are out of whack, we know there is a problem. Our system automatically shoots off a trouble ticket to the crawler team for further investigation.
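The post does not say what the statistical model looks like. As one possible reading, here is a minimal sketch that assumes a per-metric mean and standard deviation built from past healthy crawls, and flags a completed crawl when any metric is more than a few standard deviations away. The function names and the metric names (pages, products) are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Build a baseline (mean and standard deviation per metric) from past
# healthy crawls. $history is an array ref of hash refs such as
# { pages => 1000, products => 300 }. Metric names are hypothetical.
sub build_baseline {
    my ($history) = @_;
    my %baseline;
    for my $metric (keys %{ $history->[0] }) {
        my @values = map { $_->{$metric} } @$history;
        my $mean   = sum(@values) / scalar(@values);
        my $var    = sum(map { ($_ - $mean) ** 2 } @values) / scalar(@values);
        $baseline{$metric} = { mean => $mean, stddev => sqrt($var) };
    }
    return \%baseline;
}

# Return the metrics of a completed crawl that deviate from the baseline
# by more than $max_z standard deviations (default 3).
sub anomalous_metrics {
    my ($baseline, $crawl, $max_z) = @_;
    $max_z //= 3;
    my @bad;
    for my $metric (sort keys %$baseline) {
        my ($mean, $sd) = @{ $baseline->{$metric} }{qw(mean stddev)};
        my $delta = abs(($crawl->{$metric} // 0) - $mean);
        # With zero variance in the history, any difference counts as anomalous.
        my $is_bad = $sd > 0 ? ($delta / $sd > $max_z) : ($delta > 0);
        push @bad, $metric if $is_bad;
    }
    return @bad;
}
```

A non-empty result from anomalous_metrics would be the point at which a trouble ticket gets opened. A real model would likely need per-site baselines and some tolerance for seasonal catalogue changes.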

Guaranteed Updates aka Update SLA

Even though our crawlers run continuously and keep our shopping data fresh, we don’t have a mechanism to guarantee that updates posted to a retailer’s site will make it into Polyvore’s index within a given period of time. This guaranteed time to update, or update SLA, is especially important during the holiday season, Black Friday and Cyber Monday, but it is also a useful feature to have year-round. We are considering a couple of options for how to design and implement this feature, but if you have ideas or suggestions we would love to hear from you!


Summary

Accurate, up to date data about products is important to providing an awesome user experience. We built a highly scalable, easy to maintain crawler system that enables us to keep our product index updated on an ongoing basis. We made our own lives easier by using declarative tools that are less fragile and also taking the time to develop robust monitoring systems that do not require our constant attention. Creating a crawler framework, combined with crawler rules per site allowed us to keep our design and implementation simple, yet scalable and manageable. Designing for simplicity and efficiency is a principle that guides us in many of the decisions we make in engineering at Polyvore.

Measure Everything

Note: This is cross-posted to Polyvore’s Engineering Blog.

Measuring and acting on stats is an essential part of building successful products. There are many direct and indirect benefits to pervasive measurement and tracking of stats:

  • Accurate, real-time data enables better and faster decisions.
  • It empowers a data-driven culture where ideas can come from anyone: ideas can be easily tested, and the best ones can be chosen on their merit rather than on pure intuition or because they are the HiPPO (the highest paid person’s opinion).
  • Tracking stats and watching them improve in response to our iterations helps teams stay focused and is incredibly motivating.
  • Historical stats are a great way to keep an eye on how code changes affect the health of a product and processes.

Early on in the life of Polyvore, we decided to weave stats into everything we did. One of our engineering philosophies is to invest a fair amount of our resources into making ourselves more efficient. This approach introduces some latency into our projects, but it is made up for later by an increase in overall team bandwidth. Consistent with this philosophy, the first thing we did was to invest in building a system that made it easy to collect stats and made them immediately useful.

We started by identifying different types of stats that people were already collecting in ad hoc ways or wished that they could collect. Based on these requirements, we came up with the following, super simple API:

my $stats = Polyvore::Stats->new({
    name   => 'db_layer',
    roleup => MINUTE
});

# observe a value for a given key
$stats->observe(45, 'file_size');

# observe the occurrence of events
$stats->inc('facebook_post');
$stats->add(5, 'facebook_disconnects');

# utility hi-res timing functions that observe elapsed time for a given key
$stats->timer_begin('select_user_by_id');
$stats->timer_end('select_user_by_id');

To measure anything, a developer on our team instantiates a stats object with a given name and granularity. Arbitrary stats can then be recorded using this object. Once the measurement is set up and running, we can access the collected data via a generic web UI, programmatically through an API, or as graphs on a dashboard.
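To make the idea of granularity concrete, here is a minimal sketch of how observations could be rolled up into per-minute buckets that track count, sum, min and max. RollupStats is a hypothetical stand-in written for this illustration; it is not the Polyvore::Stats implementation, and the explicit timestamp argument exists only to make the example deterministic.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a stats object with a roll-up granularity.
package RollupStats {
    sub new {
        my ($class, %args) = @_;
        my $self = {
            rollup  => $args{rollup} // 60,   # bucket size in seconds
            buckets => {},
        };
        return bless $self, $class;
    }

    # Record one observation for $key; $time defaults to now (epoch seconds).
    sub observe {
        my ($self, $value, $key, $time) = @_;
        $time //= time();
        # Round the timestamp down to the start of its bucket.
        my $bucket = $time - ($time % $self->{rollup});
        my $agg = $self->{buckets}{$key}{$bucket}
            //= { cnt => 0, sum => 0, min => $value, max => $value };
        $agg->{cnt}++;
        $agg->{sum} += $value;
        $agg->{min} = $value if $value < $agg->{min};
        $agg->{max} = $value if $value > $agg->{max};
        return $agg;
    }

    # Fetch the aggregate for $key in the bucket starting at $start.
    sub bucket {
        my ($self, $key, $start) = @_;
        return $self->{buckets}{$key}{$start};
    }
}
```

Storing only the aggregate per bucket keeps the volume of archived data proportional to the number of keys and time periods rather than to the number of raw observations.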

Some Examples

Polyvore’s stats collection system is integrated into almost every subsystem (except itself!)

One of the earliest applications was in instrumenting our page generation times. Site performance is very important to us and we needed to profile where we were spending time in generating the response to each request. Tracking the number of calls and the time spent on each call allowed us to identify bottlenecks and to prioritize our time on optimizing the most important ones (via tuning SQL, caching, parallelization, pre-computation, etc). We make all our back-end calls through an abstraction layer and this allowed us to easily instrument our DB reads by collecting stats in just a few places:

sub _read_sql {
    …

    # retrieve a stats object from a factory
    # (these are reused across persistent server processes)
    my $stats = $instance->stats({ name => 'backend_stats' });

    # find the calling method that is requesting the read so that we can log it.
    # eg: select_user_by_id
    my $caller = Polyvore::Util::first_caller_inside_pkg();

    # begin hi-res timer
    $stats->timer_begin($caller, 'read');

    # execute the query
    my $result = $self->_read($sql, $data, $shard_name);

    # end hi-res timer
    $stats->timer_end($caller, 'read');

    return $result;
}

We did the same thing for our other back-ends — Memcached, Solr and Cassandra.

Collecting this data for each request allows us to do other cool things, like dumping it at the end of every page as an HTML comment, so that we can view source and see what happened during a particular request:

shard ‘read’
shard main read: db17.polyvore.com
                                         cnt  max       min       sum
are_contacts                             1    0.001073  0.001073  0.001073
count_collection_comments                1    0.001174  0.001174  0.001174
count_collection_favorites               1    0.00403   0.00403   0.00403
… some stats omitted …
select_sponsored_brands_for_collections  1    0.001372  0.001372  0.001372
select_sponsored_hosts_for_collections   1    0.00511   0.00511   0.00511
select_template_collection_basedon       1    0.000991  0.000991  0.000991
select_user_preferences                  1    0.000869  0.000869  0.000869
set.handle_counter 25
write_counter 0

This request took 180ms to generate (somewhat average for a typical page) and performed 25 reads, including some memcached hits.

We can also look at historical data in the web UI for each of the calls (past 24 hours):

path avg
select_collection_comments 0.00170376

select_collection_comments on average takes 1.7ms to execute.

Challenges

Scalability

We started out by storing the stats archives in MySQL using a very simple auto-generated table schema. This approach got us off the ground quickly and worked reasonably well until the stats system became a victim of its own success. We started to collect stats in every aspect of our system, and as the volume of data grew, we ran into scalability issues with the storage system. Given the simple schema (key, timestamp, value), we decided to move the storage to Cassandra.

Today, we are logging and archiving 35M data points / day from over 100 different subsystems.

Write Performance

We have been using the stats system to collect data in our production environment and very early on it became clear that we could not simply write out all the stats without overwhelming our storage system.

Fortunately, we had already started working on a distributed job queue system. Our stats objects internally buffer the stats and occasionally flush to a RabbitMQ based job queue. The data is picked up by a bank of workers that write it to Cassandra. This approach has allowed us to not worry about the volume of data we are collecting and let the job queue smooth out and manage the write volume. The other benefit is that writing to a queue is essentially async and very fast. Our front end code never has to wait while stats are being flushed.
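The buffering described above might look something like the following sketch. BufferedStats and its flush_size parameter are hypothetical, and the publish callback stands in for whatever call enqueues a job on the RabbitMQ-based queue; no real queue client is used here.

```perl
use strict;
use warnings;

# Hypothetical sketch of a stats object that buffers in memory and
# flushes to a queue in batches.
package BufferedStats {
    sub new {
        my ($class, %args) = @_;
        my $self = {
            publish    => $args{publish},            # code ref: enqueue one batch
            flush_size => $args{flush_size} // 100,  # updates per batch
            buffer     => {},
            pending    => 0,
        };
        return bless $self, $class;
    }

    # Add $n to the counter for $key, flushing once enough updates are buffered.
    sub add {
        my ($self, $n, $key) = @_;
        $self->{buffer}{$key} += $n;
        $self->flush if ++$self->{pending} >= $self->{flush_size};
        return;
    }

    sub inc {
        my ($self, $key) = @_;
        return $self->add(1, $key);
    }

    # Hand a copy of the buffered counters to the queue and reset the buffer.
    sub flush {
        my ($self) = @_;
        return unless $self->{pending};
        $self->{publish}->({ %{ $self->{buffer} } });
        $self->{buffer}  = {};
        $self->{pending} = 0;
        return;
    }
}
```

Because many updates collapse into one message, the number of writes reaching storage depends on the flush size rather than on how often the front-end code records a stat. A production version would also need to flush on a timer and at process exit so that buffered values are not lost.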

Getting Everyone Onboard

We made it easy, simple and useful to collect stats. But we’ve also made a point of celebrating stats. Having actual data that showed how a metric was improved was very gratifying, and we made a point of encouraging everyone to present their work in the context of the metrics they had improved.

A Work in Progress

As with everything else we do, our stats system is a work in progress.

The first area for improvement is to keep track of how values are distributed. Because of the way we’re rolling up data, we’re only looking at average measurements over a given time period. Looking at averages alone can be misleading because it hides patterns in how the values are distributed. For example, if we’re measuring the performance of a single database select over time in order to spot slow queries (something that we do!), it would be useful to know that 99% run in under 4ms but 1% take 200ms, versus an overall average value of 6ms. Knowing that the query is generally fast but slows down dramatically in 1% of cases would suggest a need for a fix, where the average of 6ms might seem just fine.
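To illustrate the difference with numbers close to the example above, here is a small sketch that computes a mean and nearest-rank percentiles over raw samples. The function names are chosen for this illustration, and it assumes the raw samples are still available, which is exactly what an average-only roll-up throws away.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use POSIX qw(ceil);

# Nearest-rank percentile: the smallest sample such that at least
# $p percent of all samples are less than or equal to it.
sub percentile {
    my ($p, @samples) = @_;
    my @sorted = sort { $a <=> $b } @samples;
    my $rank = ceil($p * scalar(@sorted) / 100);
    $rank = 1 if $rank < 1;
    return $sorted[$rank - 1];
}

# Arithmetic mean of the samples.
sub mean {
    my (@samples) = @_;
    return sum(@samples) / scalar(@samples);
}

# 99 fast queries at 4 ms and a single slow one at 200 ms.
my @timings_ms = ((4) x 99, 200);
printf "mean=%.2f p50=%d p99=%d max=%d\n",
    mean(@timings_ms),
    percentile(50,  @timings_ms),
    percentile(99,  @timings_ms),
    percentile(100, @timings_ms);
```

For these samples the mean is 5.96 ms, while the median and 99th percentile are both 4 ms and the maximum is 200 ms. The mean alone hides the fact that one query in a hundred is fifty times slower than the rest.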

Another interesting possibility for our stats system is integration with Splunk. We’ve recently started using Splunk, an amazing tool for slicing and dicing log data in real time, and it would be interesting to tee our stats off to Splunk for real-time monitoring and ad hoc queries.

Summary

Measuring and collecting stats is an essential tool for building successful products. The key to success is to make it easy, immediately useful, ensure you’re making decisions based on the data and to celebrate the results.

Why there is no Flash on the iPhone

If the iPhone supported Flash, then most iPhone apps (or games at least) would have been written in Flash. This would have meant a lot of apps available for the iPhone’s competitors (e.g. RIM, Android, etc.).

Polyvore Jobs

[Image: Polyvore Growth Curve]

At Polyvore, we strive to build products that delight people.

By design, we keep things simple on the surface and yet go to great lengths to make them run better under the hood.  We prefer simple solutions to technical problems but always try to include novelty where it matters — in our products.

If you care about the same things and enjoy working with like-minded people, drop me a line: pasha [at] polyvore [dot] com.

Polyvore Badges

We recently introduced Polyvore Badges. We wanted something that reflected each person’s profile on Polyvore and was also simple and streamlined. I am happy with what we came up with: a strip of colors reflective of one’s recent activity on Polyvore, refreshed once a day. Enjoy :)

Sprint

I have not been a Sprint customer in recent history, but I did use a Sprint phone for a few weeks. The following exchange with the Sprint Voicemail should illustrate why so many people are fleeing to other providers (each ‘-’ represents a small mechanical pause between each computer utterance).

  • Sprint Voicemail: You have 4 new messages and 3 saved messages. To listen to your new messages, press 1.
  • Me: [ Press 1 ]
  • SV: First message, [ pause ] from caller 4-0-8-5-5-5-1-2-3-5, left on [pause ] December-11th at 4-20-A-M.
  • Message: “Hi Pasha… call me back” [ a 2 second message ]
  • SV: To replay to this message, press 7, to save, press 8, to erase, press 9, for all other options, press 0. (Note: you can’t press a key to interrupt the computer voice and jump to the next message, you have to listen to the entire explanation, every time).
  • Me: [ Press 9 ]
  • SV: Message erased. [pause] Next message from caller 4-0-8-5-5-5-1-2-3-5, left on [pause ] December-11th at 5-30-P-M.
  • Message: “Hey man… just calling to say hi”
  • SV: To replay to this message, press 7, to save, press 8, to erase, press 9, for all other options, press 0. (Note: this 15 second explanation repeats for each message and you can’t interrupt it).

And on and on… the point is that while checking voice mail, you spend 3 minutes listening to the computer voice and about 5 seconds listening to your messages. Now, I am sure there is some way to configure the voice mail system to be more terse, but it would probably take you a few hours to find it. With this level of disregard for people’s time, I am not surprised they are losing customers.
