Identifying and Extracting the Entities within the Semi-Structured Data
At this point, you have all the data you need, but it is in a semi-structured (HTML) format and cannot yet be queried. We need to extract the entities our queries require and store them in structured form so that we can start analyzing the data. So, what is the quickest way to do this?
The first step is to find consistent markers in the HTML that you can pattern match against to identify each entity. We need to know where each entity begins and ends so we can extract the information. Most websites these days use a Content Management System (CMS) to deliver their content in HTML, which means they use templates and therefore have consistent markers in the HTML that don't change from page to page. If you are a Mozilla Firefox fan, you can use Firebug's inspect option to click on the text (such as the company name) in the browser, and it will take you directly to the corresponding markup in the HTML source. Google's Chrome browser has a similar function, accessible by right-clicking on the page and selecting "Inspect element". This is WAY faster than searching through the source. Keep in mind, though, that the HTML pulled down in the crawl might differ from what an inspect function displays. This can happen when the developer dynamically manipulates the browser's Document Object Model (DOM), often on page load. In that case, the entity markers are sometimes in the script rather than in the static markup.
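To make the idea of template markers concrete, here is a minimal sketch of pulling out the text that sits between a start marker and an end marker. The class name, the helper name and the markup in main are all hypothetical; the real markers are whatever you find when you inspect the page.

```java
import java.util.Optional;

// Sketch: pull the text that sits between two consistent template markers.
class MarkerExtractor {

    // Returns the trimmed text between startMarker and the next endMarker,
    // or Optional.empty() if either marker is missing from the page.
    static Optional<String> between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(start, end).trim());
    }

    public static void main(String[] args) {
        // Hypothetical markup; the markers on a real page will differ.
        String html = "<h1 class=\"h1_first\"> Acme Widgets </h1><p>About</p>";
        System.out.println(
                between(html, "<h1 class=\"h1_first\">", "</h1>").orElse("(not found)"));
    }
}
```

Returning an empty Optional when a marker is missing makes it easy to spot pages where the template differs from what you expected.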
You might have guessed it by now: this is effectively screen scraping. Before we start doing this at scale using Map/Reduce, you should first write a little Java POJO that can handle the extraction for each Crunchbase Company page. I like to do this by dropping the entire page source into a constructor that privately calls a set of methods to extract the data and then makes the normalized structured entities (Company, Street Address, ZipCode, City, State, etc.) available via getter methods. You then write a static main method where you can pass in URLs of different Company pages to test how well your extraction techniques are working. Once you've run this through a decent set of sample pages and you are comfortable that your extractors are working consistently, you can move on to doing this with Map/Reduce.
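The shape of such a POJO might look roughly like the sketch below. This is not the actual class from my job: the class name, the markers and the two fields shown are assumptions for illustration, and the main method feeds in an inline HTML sample instead of fetching URLs so that it runs offline.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the extraction POJO: the constructor takes the page source,
// private methods do the extraction, getters expose the normalized values.
class CompanyPage {
    // Hypothetical markers; replace with the ones found in the real template.
    private static final Pattern NAME =
            Pattern.compile("<h1 class=\"name\">(.*?)</h1>", Pattern.DOTALL);
    private static final Pattern ZIP =
            Pattern.compile("<span class=\"zip\">(.*?)</span>", Pattern.DOTALL);
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    private final String name;
    private final String zipCode;

    CompanyPage(String pageSource) {
        this.name = extract(NAME, pageSource);
        this.zipCode = normalizeZip(extract(ZIP, pageSource));
    }

    // Returns the first capture group, trimmed, or "" when the marker is absent.
    private static String extract(Pattern pattern, String source) {
        Matcher m = pattern.matcher(source);
        return m.find() ? m.group(1).trim() : "";
    }

    // Keeps only the first five-digit run, so "94107-1234" becomes "94107".
    private static String normalizeZip(String raw) {
        Matcher m = FIVE_DIGITS.matcher(raw);
        return m.find() ? m.group() : "";
    }

    public String getName() { return name; }

    public String getZipCode() { return zipCode; }

    // Test harness: a real version would fetch each URL passed in args.
    public static void main(String[] args) {
        String sample = "<h1 class=\"name\"> Acme Widgets </h1>"
                + "<span class=\"zip\">94107-1234</span>";
        CompanyPage page = new CompanyPage(sample);
        System.out.println(page.getName() + "\t" + page.getZipCode());
    }
}
```

Keeping all the extraction behind the constructor means the Mapper later only has to construct the object and call getters.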
If you're new to Hadoop & Map/Reduce, I suggest you take a Cloudera or Hortonworks training class or read Tom White's outstanding "Hadoop: The Definitive Guide". The short version is that Map/Reduce is a two-phase framework (the Reduce phase is optional) that processes records out of a block of data, one record at a time. Each record is passed as a key and a value to the Mapper to be processed. It is designed to handle data of arbitrary format, so for each Hadoop Job you need to specify a Reader that knows how to parse out the records contained within the block of data.
In the example to the left, you can see a very simple Map/Reduce Job I've created. The key configuration property of this job is the InputFormat (Record Reader):
job.setInputFormatClass(SequenceFileInputFormat.class);
This tells the job to use a class that knows how to read Sequence Files. Nutch stores all the web pages in a given crawl depth/segment as a sequence of Content objects (one Content object per web page) inside a Sequence File. The record reader passes a Content object to the Mapper as the value of each record (we ignore the key). Inside the Map, we are then free to do whatever we want in processing the web page. In this example I drop it into my Crunchbase Company POJO's constructor and then write out the name of the company and the sector the company belongs to.
In the full example I don't just write out those two properties for each Company, but rather a tab-delimited record that looks like the following:
Company Address City State ZipCode Sector Investor FundingRound Amount Month Day Year
As you can imagine, most companies have multiple rounds of funding, so each company gets one record per round. Having the data broken out like this allows one to Group By a variety of factors and SUM(Amount). This is all we need to quantify the disbursement of funds for a given factor and analyze it over a given time dimension. Once the Hadoop Job is complete and all the data is extracted and normalized for each Company, we are ready to start answering some of the questions that we have around the Tech Bubble. I'll cover this in my next post.
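As a rough illustration of the Group By and SUM(Amount) idea, the sketch below aggregates a few made-up records in plain Java. It assumes the twelve-column layout listed above and that Amount has already been normalized to a whole number; the class name, method name and sample rows are hypothetical, and at real scale you would do this in a query tool or another Hadoop job.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch: GROUP BY one column and SUM(Amount) over tab-delimited records.
class FundingTotals {
    // 0-based column positions in the twelve-column record layout.
    static final int SECTOR = 5;
    static final int AMOUNT = 8;
    static final int YEAR = 11;

    // Groups records by the chosen column and sums the Amount column per group.
    static Map<String, Long> sumAmountBy(List<String> records, int groupColumn) {
        return records.stream()
                .map(line -> line.split("\t", -1))
                .collect(Collectors.groupingBy(
                        fields -> fields[groupColumn],
                        TreeMap::new,
                        Collectors.summingLong(fields -> Long.parseLong(fields[AMOUNT]))));
    }

    public static void main(String[] args) {
        // Made-up records for illustration only.
        List<String> records = List.of(
                "Acme\t1 Main St\tSan Francisco\tCA\t94107\tWeb\tFund A\tSeed\t500000\t3\t15\t2010",
                "Acme\t1 Main St\tSan Francisco\tCA\t94107\tWeb\tFund B\tSeries A\t5000000\t6\t1\t2011",
                "Bolt\t9 Elm Ave\tNew York\tNY\t10001\tMobile\tFund A\tSeed\t750000\t9\t20\t2011");

        System.out.println("By sector: " + sumAmountBy(records, SECTOR));
        System.out.println("By year:   " + sumAmountBy(records, YEAR));
    }
}
```

Because every funding round is its own record, switching the grouping factor is just a matter of picking a different column.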
I chose to go into detail to show how one can extract data out of HTML for analysis, since so few locations on the web have a structured equivalent of what is represented in HTML. Crunchbase, however, actually does have this: each Company HTML page contains a link to a JSON representation of the Company data. The fastest way to get at the JSON data is to write a Map Job that reads each Company page and then writes out just the URL of the page containing the JSON data. Once this is complete, you have a new seed list that you can crawl to a depth of 1. This will create a new Nutch segment that you can run a new Map job over, now having a much easier time extracting the pertinent data (using a library like org.json) and writing out the same schema in the same fashion described earlier.
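The link-extraction step inside that Map Job could be sketched as follows. I am assuming here that the JSON link is an href whose target ends in ".js" or ".json"; that pattern, the class name and the sample markup are illustrative guesses, so check the actual page source before relying on them.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: find the link to the JSON representation inside a Company page.
class JsonLinkExtractor {
    // Assumption: the JSON link is an href ending in ".json" or ".js".
    private static final Pattern JSON_HREF =
            Pattern.compile("href=\"([^\"]+\\.(?:json|js))\"", Pattern.CASE_INSENSITIVE);

    // Returns the first matching link target, or empty if the page has none.
    static Optional<String> findJsonLink(String html) {
        Matcher m = JSON_HREF.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical markup; each link found becomes one line of the seed list.
        String html = "<a href=\"/about.html\">About</a>"
                + "<a href=\"http://example.com/company/acme.js\">JSON</a>";
        findJsonLink(html).ifPresent(System.out::println);
    }
}
```

Writing one URL per line gives you a file you can hand to Nutch as the new seed list.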