Dec 21, 2011

Part 1: Classifying and Quantifying Historical Private Equity Investments

In a couple of my talks this year I showed a demonstration of how we can mine crowd-sourced data from the web to quantify whether we are in a tech bubble. I thought people might find it useful to see a step-by-step walkthrough of how this can be done. Keep in mind, if you find this too complex, it's going to get a lot easier for non-programmers to ask the same questions of the web once the tooling gets more mature.

So let's be a bit more explicit. What are we trying to achieve here? I've always been curious as to where the hot spots are for Venture Capital (VC) investing, so that I could see how Austin stacks up against Boston or Portland in terms of the amount of VC invested. Also, in late 2010 and 2011 there was a lot of talk about the US startup tech scene being in a bubble. Unfortunately, most of the information I came across about the private equity market was subjective commentary from various investors, based on the deal flow they and their personal networks were seeing. The most quantitative work I had come across was the article in Fortune Magazine by David Kaplan, "Don't call it the next tech bubble-yet", but its data largely focused on the public market rather than the private one. So I thought to myself: with the right data, you should be able to ask some high-level questions of the private market, such as how much was invested this year versus previous years and how those investments were distributed in size and frequency, to determine whether 2011 was showing signs of a bubble.

DISCLAIMER: I am not a Venture Capitalist, nor do I work in Finance. I am, however, interested in understanding the technological ecosystem better, which includes its economy. This article is merely an example of how curious people can look to the web to explore data for whatever intrigues them.

Gathering the Data

The first step to exploring these questions involves finding the data. Data needs to be in a structured format to be queried, so before you embark on the process of working with data of arbitrary structure on the web, you should see whether you can find the data already in a structured format. There are some great data marketplaces (follow the icons/links below) which provide searchable repositories of already structured data. I actually think it's pretty neat that we already have so many companies providing this service. There are also some institutional repositories, like, but I don't know how much longer that site will be around.
You can also attempt to gather your data via API. For example, you could use the Twitter API to gather tweets around a particular set of topics. This usually involves setting up a persistent connection where data is continually streamed in via the API and you dump it to disk or the Hadoop Distributed File System (HDFS) to be analyzed later.
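
As a rough sketch of that "stream it in and dump it to disk" pattern, the snippet below appends each incoming record to a newline-delimited JSON file. It deliberately does not call the Twitter API: fake_stream is a hypothetical stand-in for whatever iterator your API client hands you, and the names dump_stream and fake_stream are my own, not part of any library.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def dump_stream(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Append each record to `path` as one JSON object per line; return the count."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            written += 1
    return written


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streaming API client (assumption, not a real API)."""
    yield {"id": 1, "text": "is this a tech bubble?"}
    yield {"id": 2, "text": "seed round closed"}
```

Calling dump_stream(fake_stream(), "tweets.jsonl") would leave one JSON object per line in the file, which is a convenient shape for loading into HDFS later.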

Lastly, if you've exhausted those alternatives, you can web crawl (or spider). Web crawlers typically start by pulling down one or more web pages; these are known as the seeds of the crawl and make up crawl depth 1. The crawler then extracts the links from those pages, builds a fetch list of pages to get, and pulls those pages down. This is known as crawl depth 2. The process repeats for however many crawl depths you have specified for the crawl. As you can imagine, the number of pages retrieved grows greatly with each consecutive crawl depth. One can also specify crawl URL filters, where each link in the fetch list is matched against the filter and discarded if it doesn't meet the filter criteria. This allows one to constrain a crawl to just one website, or to just a part of a website.
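
To make the depth and filter mechanics concrete, here is a minimal sketch of a depth-limited crawl. It is not how Nutch is implemented; it runs against a tiny made-up link graph (FAKE_WEB) instead of the network, and every URL and name in it is hypothetical.

```python
import re
from typing import Callable, Dict, List, Set

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB: Dict[str, List[str]] = {
    "http://example.com/index?c=A": [
        "http://example.com/company/acme",
        "http://example.com/company/apex",
        "http://other.example.org/ad",
    ],
    "http://example.com/company/acme": ["http://example.com/company/apex"],
    "http://example.com/company/apex": ["http://example.com/people/jane"],
}


def crawl(seeds: List[str], max_depth: int,
          fetch_links: Callable[[str], List[str]],
          url_filter: Callable[[str], object]) -> Dict[int, List[str]]:
    """Return the pages fetched at each crawl depth (the seeds are depth 1)."""
    seen: Set[str] = set(seeds)
    fetch_list = list(seeds)
    fetched: Dict[int, List[str]] = {}
    for depth in range(1, max_depth + 1):
        if not fetch_list:
            break
        fetched[depth] = fetch_list
        next_list: List[str] = []
        for page in fetch_list:
            for link in fetch_links(page):
                # Discard links already queued or rejected by the URL filter.
                if link in seen or not url_filter(link):
                    continue
                seen.add(link)
                next_list.append(link)
        fetch_list = next_list
    return fetched


result = crawl(
    seeds=["http://example.com/index?c=A"],
    max_depth=2,
    fetch_links=lambda url: FAKE_WEB.get(url, []),
    url_filter=re.compile(r"^http://example\.com/").match,
)
```

With max_depth=2 the crawl stops after the company pages linked from the seed, and the filter keeps the off-site link out of the fetch list.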

Identifying the Website and Seeding the Crawl

After spending some time searching the web, I found that provided a website that was rich with crowd-sourced private equity data specifically related to tech. The site was set up so that each company had a page, and that page contained the sector (or category), the location, the name, and the details of each round of funding. This was perfect! However, if you're planning on crawling a site, you need to *carefully* read the website's terms of service and the license for its content. Otherwise, you might get sued.

Crunchbase has an awesome and flexible license for folks who are not gaining financially from the use of its content. One last thing to keep in mind: websites have a "gentlemen's agreement" with web crawlers in which they specify, via a robots.txt file, which parts of the site may be crawled and which may not. The file sits at the root of the website and is observed by most web crawlers, although it is possible to modify the source code of open source crawlers to remove the compliance. You need to check the allowed crawling scheme in this file to make sure you can actually get to the content you want to retrieve.
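
If you want to check a robots file programmatically before crawling, Python's standard library ships a parser for it. The sketch below feeds it a made-up robots file (SAMPLE_ROBOTS_TXT is hypothetical, not Crunchbase's actual file) and asks whether two example URLs may be fetched; the user agent name and URLs are also just placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots file content; a real check would download the file
# from the root of the site you intend to crawl.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


company_ok = is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler",
                        "http://example.com/company/acme")
search_ok = is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler",
                       "http://example.com/search?q=vc")
```

Here company_ok comes back True and search_ok False, since only the /search and /admin/ paths are disallowed in the sample.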


Crunchbase Crawl Seed List
An ideal crawl for content retrieval should be at a crawl depth of 2. This keeps the crawl specific and highly targeted, but it requires some thought beforehand as to exactly which URLs the crawl will be seeded with. After spending some time with the site, I was able to determine that it could provide alphabetical indexes of just private companies, and that I could change which index was rendered simply by changing a parameter in the query string (such as from "c=A" to "c=B"). This let me create a seed list such that the next crawl depth would pull down only the individual company pages that had the data I sought.

Now, there are a couple of options for the choice of web crawler. If you don't want to do the hardware/software setup of a crawler, you can use a service like 80legs (also the creators of DataFiniti); otherwise I recommend the Apache Hadoop-based web crawler, Apache Nutch. Nutch is awesome, and we'll be using it as our crawler for this post.

The setup information for Apache Nutch is available on the project Wiki. Setting up Nutch is outside the scope of this post; however, there are a few things that are non-obvious. If you are going to use Nutch for content retrieval rather than to build a search engine, modify conf/nutch-site.xml so that the file.content.limit, http.content.limit, and db.max.inlinks properties are all set to -1. This ensures that neither the page content nor the fetch lists are truncated. In addition, make sure your crawl's conf/*-urlfilter.txt files comment out the regular expression that skips URLs that contain queries, i.e. it should look like this:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
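
Going back to the nutch-site.xml overrides mentioned above: Nutch reads Hadoop-style configuration files, where each setting is a property element with a name and a value inside a configuration root. As a small sketch, the helper below (build_nutch_site is my own name, not part of Nutch) builds that structure for the three properties so you can see the shape of what needs to go in the file.

```python
import xml.etree.ElementTree as ET
from typing import List

# The three properties the text says to set to -1.
UNLIMITED_PROPERTIES = [
    "file.content.limit",
    "http.content.limit",
    "db.max.inlinks",
]


def build_nutch_site(properties: List[str], value: str = "-1") -> ET.Element:
    """Build a <configuration> element with one <property> per name."""
    root = ET.Element("configuration")
    for name in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Serialize to a string; paste or merge the property elements into
# conf/nutch-site.xml rather than overwriting settings you already have.
xml_text = ET.tostring(build_nutch_site(UNLIMITED_PROPERTIES), encoding="unicode")
```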
This isn't a terribly big crawl, and you can run it in single-process local mode if you want. Once Nutch is set up, you create a seeds directory and copy in a seeds.txt file containing the seed URLs from the seed list above.
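
If you would rather generate the seed file than type it, something like the following would do. Note the assumptions: INDEX_URL_TEMPLATE is a placeholder rather than the real index URL, I assume one index page per letter A through Z, and the helper names are mine.

```python
import os
import string
from typing import List

# Placeholder only: substitute the real index URL of the site you are crawling.
INDEX_URL_TEMPLATE = "http://example.com/companies?c={letter}"


def build_seed_urls(template: str = INDEX_URL_TEMPLATE) -> List[str]:
    """One index URL per letter, varying only the `c` query parameter."""
    return [template.format(letter=letter) for letter in string.ascii_uppercase]


def write_seeds(urls: List[str], seeds_dir: str = "seeds") -> str:
    """Write one URL per line to <seeds_dir>/seeds.txt and return the file path."""
    os.makedirs(seeds_dir, exist_ok=True)
    path = os.path.join(seeds_dir, "seeds.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
    return path
```

Running write_seeds(build_seed_urls()) from the Nutch directory would create seeds/seeds.txt with 26 lines.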

Then you launch your crawl with the following command:

    bin/nutch crawl seeds -dir content -depth 2

Nutch will then run the crawl from the seeds specified in the "seeds" directory and store the results of the crawl in a "content" directory. For each crawl depth, the content directory will contain a unique segments directory corresponding to that depth. We can ignore the segments directory for the first crawl depth and just focus on the other one, which contains all the data. By default, Nutch runs quite a few jobs that build indexes for each segment and then merge them. All of this is unnecessary if you're using Nutch for content retrieval. I've embedded a talk I did on Nutch at the beginning of the year, where I step through the jobs that are run for a crawl, how to run Nutch in distributed mode, and how to orchestrate Nutch to run just the jobs you need for content retrieval. I also spend a bit of time talking about processing Nutch Content Objects in Nutch Sequence files.
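
To pick out the segment for the second crawl depth without eyeballing the directory, a small helper like this one can work. It assumes the usual Nutch layout, where segments live under a "segments" subdirectory of the crawl directory and are named with a timestamp, so that sorting the names puts the most recent (deepest) segment last. The function names are mine.

```python
import os
from typing import List


def list_segments(crawl_dir: str) -> List[str]:
    """Return segment directories under <crawl_dir>/segments, oldest first."""
    segments_root = os.path.join(crawl_dir, "segments")
    names = sorted(
        name for name in os.listdir(segments_root)
        if os.path.isdir(os.path.join(segments_root, name))
    )
    return [os.path.join(segments_root, name) for name in names]


def deepest_segment(crawl_dir: str) -> str:
    """Return the most recent segment, i.e. the last crawl depth fetched."""
    segments = list_segments(crawl_dir)
    if not segments:
        raise FileNotFoundError("no segments found under " + crawl_dir)
    return segments[-1]
```

For the depth-2 crawl above, deepest_segment("content") should point at the segment holding the company pages.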

Extracting the data out of the sequence files in the segments directory and doing the analytics will be covered in the next blog post in this series.
