Jan 28, 2013

Leveraging CapMetro Rail for SXSW Hotels and Commutes

Every year, all the hotels within a certain radius of the SXSW location (the Austin Convention Center) inevitably sell out. If you run into this problem, here's a little hack that might save the day.

Austin recently added a nice two-car light rail option known as the Capital Metro Rail. It just so happens to stop right outside the Austin Convention Center and also runs a greatly extended timetable during SXSW. There are nice (Westin) and decent (La Quinta) hotels within walking distance of several of the stops on the line.

This is a list of the stops on the line. Punch the address for a station into Google Maps and search nearby for hotels. The example I've provided below uses the Kramer station address; that station is right next to The Domain complex, which has several hotels within walking distance.



Aug 14, 2012

Building an Iron Man Suit - Part 2

Given that I've had a ton of hits on the rather unsubstantive original post, I thought I'd post an update on what I've learned so far to save other folks the time I've spent figuring out how to approach this. Note: this post provides all the resources you need to build a full replica fiberglass Mark III Iron Man suit.

As I mentioned in the previous post, there is an existing Iron Man Suit Builders community called Stark Industries Weapons, Data and Armor Technology (SIWDAT). The site is largely driven by a gentleman named TMP (Timeless Movie Prop), who is a professional sculptor and prop builder. It's a great site with a good forum where you can see some of the existing suits folks have built, but unfortunately it lost a lot of its content in a forum crash in 2011, and users are slowly re-adding it. I personally didn't find it super useful for learning how to build suits, other than discovering what people use for full-size blueprints of the various Iron Man suits (War Machine, Mark I, Mark III, etc.): a tool called Pepakura.

Pepakura is essentially a way of breaking a 3D model into a number of flat parts: you print out the parts, and it tells you how to attach the tabs of the various parts together to recreate the 3D model. The CliffsNotes version is that you print the templates on cardstock, cut them out with an X-Acto knife, glue the tabs together to build the model, resin the model, glue on fiberglass cloth, resin the cloth, apply a finishing product, sand it and paint it, and you end up with an amazing replica Iron Man suit like the one to the right.

Pepakura is actually pretty simple. The software is free; you just need to find the templates, which are PDO files, and figure out what to do with them. If you watch all seven of the Pepakura tutorials below, you'll learn everything you need to see how Pepakura is used to build an Iron Man suit from start to finish. The link to the next tutorial appears at the end of each video. Plus, Stealth, their creator, is pretty funny.
 
Another gentleman, who goes by the handle Dancin_Fools, has one of the most popular and detailed Mark III Pepakura suit designs, built using 3D Studio Max. This is the link to the thread where you can see some of the models and the outcome. You can download the template (PDO) files directly here.

I'm actually attempting to build an aluminum suit, since I'm having some fun exploring how far my son and I can get trying to build a real suit and not a costume. Aluminum is a much harder medium to work with than paper, so at this point I've decided to first work through the process with paper to make sure I have the scale correctly identified. I'm currently working on the chest piece from the PDO files provided above.

Do not underestimate the huge amount of time it takes to cut the Pepakura templates out with an X-Acto knife; it is taking me days (in my spare time) to cut out just the chest pieces. Someone should start a business selling pre-cut templates at a scale specified by the customer.

I was pretty encouraged to find that other people have actually built aluminum suits. I have decided that welding is going to really complicate things, so my current plan has changed to using steel rivets to join the pieces. Check out the really cool fully aluminum Mark VI suit below; the builder's thread for it is available here. Granted, there's not a whole lot out there that I've found (yet) on how to build aluminum suits. I'll post more as I find it.

UPDATE: Samurai169 over at RPF has an awesome welded 20 gauge steel suit he is building.

  

The next step is to actually build your Arc Reactor replica. I've at least got this bit accomplished already. To begin with, here's a refresher of what an actual Mark I Arc Reactor looks like:

Now, unless you've got a 3-inch hole in your solar plexus, or you've already made the suit and it sits an inch or two off your chest, you're not going to be able to wear a true replica. Instructables has a great section on how to build your own Iron Man Arc Reactor. I got all that I needed from RadioShack, Hobby Lobby and Lowe's. The images below are some examples of Arc Reactors that can be built.

     

Lastly, if you want to start wiring the suit for functionality, such as automating the face plate opening and closing, then the XRobots website has a lot of tutorials and is likely what you're looking for.

Also, don't forget about the Arduino platform, which gives you a way to start building out the suit's motorized functions.

Another option is to use a 3D printer, as it is possible to print some of the components in their entirety. For example, MakerBot has a 3D printer that can print components the size of a loaf of bread. Some folks are using it to print entire helmets.

Happy Making!
  

Jul 12, 2012

Building an Iron Man Suit

This is the first post. My 7-year-old son and I are going to try and see how far we get. I'm hoping the project will be fun, that it will be some good bonding time, and that we'll learn a little bit about mechanics and science along the way. I plan to blog our progress as it happens in case anyone else is interested in attempting the same thing.

Our approach (As of today) 

I can program, so I'm not too worried about the sensors and display. I am going to take some old iPhones and hack them to add some functionality to the suit. The hard part is going to be building the actual suit. I want to do it out of sheet metal, though I'm not sure if this will be too heavy. This also involves learning how to weld. I've been reading this blog post on Instructables about how to MIG weld. For the record, conducting current to produce extreme heat with the help of a gas, based on a tutorial I read off the internet, scares the hell out of me.

So the base design will be welded sheet metal cut-outs to produce the core suit (we haven't figured out how we'll pull off the helmet), with carefully placed halogen LEDs to produce the Arc Reactor. The plan is to then add to it once (and if) we pull off the core suit. I'm gonna build the suit for my son and not me, which is tricky because he's growing like a weed. However, I'm sure it will look better than a suit with me in it, aka an Iron Man suit with a rather pronounced midsection.

Update: There is an Iron Man Suit builders community!

We're using foamboard and the model below to build our prototype. The foamboard is key to help us get the scale right before we move to sheet metal.


May 3, 2012

Art, Engineering and the Digital Afterlife

I love storytellers. Their ability to envision the future is amazing. While artists, in a lot of ways they are visionary scientists and engineers as well. How often do we see some concept imagined in a comic, book or movie by an artist, only to see it made real shortly thereafter by a scientist or engineer? These artists inspire the makers. For instance, take the concepts envisioned within the Iron Man franchise, which inspired one guy to make his own Iron Man suit (low-tech but freaking awesome), and how the US military is building the real thing (which could definitely use a coat of hot-rod red).

      

With that said, I've been noodling a little on the thoughts behind the Battlestar Galactica prequel "Caprica". The premise is that one could create a digital afterlife where a soul could be reanimated provided enough data about the original human is preserved. In the show, the premise is powerful enough to disrupt contemporary religion.

The basic engineering concepts to make this a reality are divided into two areas:
  1. A device that humans wear on their head which allows them to enter incredibly realistic digital three-dimensional worlds which they navigate as avatars. These worlds are limited by the fact that each avatar is directed by a real human in a real world outside the digital one. Think of it as an uber-realistic Second Life where you control your avatar with your mind.

  2. Technology that extends the previous concept to allow one to create autonomous avatars and inject them into these worlds. The avatar's behaviour originates from rules divined from data about the original human that it reflects. In other words, if it's your data, this avatar is you... except that it's an autonomous copy.
So the second invention is the bit that captivated my imagination. Presuming that, at the point of someone's death, one could access data such as their entire purchase history, every word they ever spoke or wrote, a three-dimensional rendering of them and every action they ever undertook (captured with something like a Zoe chip from The Final Cut), could we create an avatar that would behave the same way they did and have the same memories? I think we could.

      

Dec 21, 2011

Part 2: Classifying and Quantifying Historical Private Equity Investments

This is part 2 of a series of posts. My previous post describes how to obtain the data you need for what is described in this section.

Identifying and Extracting the entities within the semi-structured data

At this point, you have all the data you need, but it is in a semi-structured (HTML) format and still not accessible to query. We need to extract the entities that we require for our queries and store them in structured form so that we can start analyzing the data. So, what is the quickest way to do this?

The first step is to find consistent markers in the HTML that you can pattern match against to identify each entity. We need to know where each entity begins and ends so we can extract the information. Most websites these days use a Content Management System (CMS) to deliver their content as HTML, which means they use templates and therefore have consistent markers in the HTML that don't change from page to page. If you are a Mozilla Firefox fan, you can use the Firebug inspect option to click on the text (such as the company name) in the browser and be taken directly to the corresponding markup in the HTML source. Google's Chrome browser has a similar function, accessible by right-clicking on the page and selecting "Inspect element". This is WAY faster than searching through the source. Keep in mind, though, that the HTML pulled down in the crawl might actually be different from what is displayed through an inspect function. This can happen when the developer has chosen to dynamically manipulate the browser's Document Object Model (DOM), often on page load; i.e., sometimes the entity markers are in the script.

Identifying the Markers for the Company Name

You might have guessed it by now: this is effectively screen scraping. Before we start doing this at scale using Map/Reduce, you should first write a little Java POJO that handles the extraction for each Crunchbase company page. I like to do this by dropping the entire page source into a constructor that privately calls a set of methods to extract the data and then makes the normalized, structured entities (Company, Street Address, ZipCode, City, State, etc.) available via getter methods. You then write a static main method where you can pass in URLs of different company pages to test how well your extraction techniques are working. Once you've run this through a decent set of sample pages and you are comfortable your extractors are working consistently, we can move on to doing this with Map/Reduce.
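To make this concrete, here is a minimal sketch of what such a POJO might look like. Everything in it is hypothetical: the class name, the marker strings and the two fields are placeholders, and the real markers will be whatever you find through the inspect step above.

import java.net.URL;
import java.util.Scanner;

// Hypothetical extraction POJO; the marker strings are placeholders to be
// replaced with the actual markers found via Firebug / Inspect element.
public class CrunchbaseCompanyPage {

    private final String html;
    private final String companyName;
    private final String sector;

    public CrunchbaseCompanyPage(String pageSource) {
        this.html = pageSource;
        // Each extractor pulls the text between a pair of consistent markers.
        this.companyName = between("<h1 class=\"company-name\">", "</h1>");
        this.sector = between("<span class=\"sector\">", "</span>");
    }

    // Returns the text between two markers, or null if either marker is missing.
    private String between(String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) return null;
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        return end < 0 ? null : html.substring(start, end).trim();
    }

    public String getCompanyName() { return companyName; }
    public String getSector()      { return sector; }

    // Test harness: pass in a few company page URLs and eyeball the output.
    public static void main(String[] args) throws Exception {
        for (String url : args) {
            Scanner s = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A");
            CrunchbaseCompanyPage page = new CrunchbaseCompanyPage(s.hasNext() ? s.next() : "");
            System.out.println(url + "\t" + page.getCompanyName() + "\t" + page.getSector());
            s.close();
        }
    }
}

Once the getters return sensible values across a decent sample of pages, the same class drops straight into the Map/Reduce job described next.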


M/R Job Reading from Nutch Segments
If you're new to Hadoop and Map/Reduce, I suggest you take a Cloudera or Hortonworks training class or read Tom White's outstanding "Hadoop: The Definitive Guide". The short version is that Map/Reduce is a two-phase framework (with the Reduce phase being optional) that processes records out of a block of data, one record at a time. Each record is passed as a key and a value to the Mapper to be processed. The framework is designed to handle data of arbitrary format, so for each Hadoop job you need to specify a Reader that knows how to parse out the records contained within the block of data.

In the example to the left (click for a bigger picture), you can see a very simple Map/Reduce job I've created. The key configuration property of this job is the InputFormat (record reader):
 job.setInputFormatClass(SequenceFileInputFormat.class);
This tells the job to use a class that knows how to read sequence files. Nutch stores all the web pages in a given crawl depth/segment as a sequence of Content objects (one Content object per web page) inside a sequence file. The record reader passes a Content object to the Mapper as the value of each record (we ignore the key). Inside the Map, we are then free to do whatever we want with the web page. In this example I drop it into my Crunchbase company POJO's constructor and then write out the name of the company and the sector the company belongs to.
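Since the screenshot can be hard to read, here is a rough, hedged sketch of what the Map side of such a job can look like with the new Hadoop API, reusing the hypothetical CrunchbaseCompanyPage POJO from the earlier sketch (this is not the exact code in the image):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.protocol.Content;

// Map-only pass over a Nutch segment: the key is the page URL and the value is
// the Nutch Content object holding the raw page bytes.
public class CompanySectorMapper extends Mapper<Text, Content, Text, Text> {

    @Override
    protected void map(Text url, Content content, Context context)
            throws IOException, InterruptedException {
        // Hand the raw HTML to the extraction POJO sketched earlier.
        String html = new String(content.getContent(), "UTF-8");
        CrunchbaseCompanyPage page = new CrunchbaseCompanyPage(html);
        if (page.getCompanyName() != null && page.getSector() != null) {
            // Write out company name -> sector; with the default text output
            // this becomes one tab-delimited line per company.
            context.write(new Text(page.getCompanyName()), new Text(page.getSector()));
        }
    }
}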

In the full example I don't just write out those two properties for each company, but rather a tab-delimited record that looks like the following:
Company Address City State ZipCode Sector Investor FundingRound Amount Month Day Year
As you can imagine, most companies have multiple rounds of funding and therefore have a unique record for each round. Having the data broken out like this allows one to Group By a variety of factors and SUM(Amount), which is all we need to quantify the disbursement of funds for a given factor and analyze it over a given time dimension. Once the Hadoop job is complete and all the data is extracted and normalized for each company, we are ready to start answering some of the questions we have around the tech bubble. I'll cover this in my next post.
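As a hedged illustration of that Group By / SUM(Amount) step, a small follow-on Map/Reduce job over the tab-delimited output could key each funding record on whichever factor you care about (state and year in this sketch) and sum the amounts in the reducer. The column positions assume the exact field order shown above; the class names are mine.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: key each funding-round record on (State, Year), value = Amount.
public class FundingByStateYearMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Columns: Company Address City State ZipCode Sector Investor FundingRound Amount Month Day Year
        String[] cols = line.toString().split("\t");
        if (cols.length < 12) return;                      // skip malformed lines
        String digits = cols[8].replaceAll("[^0-9]", "");  // strip $ and commas
        if (digits.isEmpty()) return;
        String state = cols[3], year = cols[11];
        context.write(new Text(state + "\t" + year), new LongWritable(Long.parseLong(digits)));
    }
}

// Reducer: SUM(Amount) for each (State, Year) group.
class FundingByStateYearReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> amounts, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable a : amounts) total += a.get();
        context.write(key, new LongWritable(total));
    }
}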

I chose to go into detail to show how one can extract data out of HTML for analysis since so few locations on the web have an alternate structured data equivalent of what is represented in HTML. Crunchbase, however, actually does have this, in that each Company HTML page contains a link to a JSON representation of the Company data. The fastest way to get at the JSON data is write a Map Job that reads each Company Page and then writes out just the link for the URL of the page containing the JSON data. Once this is complete, you now have a new seed list that you can crawl to a depth of 1. This will create a new Nutch segment that you can run a new Map job over, now having a much easier time extracting the pertinent data (using a library like org.json) and writing out the same schema in the same fashion described earlier.
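If you go the JSON route, the per-page extraction inside the Map becomes much simpler. Below is a hedged sketch using org.json; the field names ("name", "category_code", "funding_rounds", and so on) are my assumptions about the Crunchbase JSON of the time, so check them against an actual response before relying on them.

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

// Sketch: turn a company's JSON representation into tab-delimited funding rows.
public class CompanyJsonExtractor {

    public static String toTabDelimited(String json) throws JSONException {
        JSONObject company = new JSONObject(json);
        String name   = company.optString("name");
        String sector = company.optString("category_code");
        JSONArray rounds = company.optJSONArray("funding_rounds");
        if (rounds == null) return "";

        // One output line per funding round, mirroring the schema used earlier.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < rounds.length(); i++) {
            JSONObject round = rounds.getJSONObject(i);
            out.append(name).append('\t')
               .append(sector).append('\t')
               .append(round.optString("round_code")).append('\t')
               .append(round.optLong("raised_amount")).append('\t')
               .append(round.optInt("funded_year")).append('\n');
        }
        return out.toString();
    }
}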

Part 1: Classifying and Quantifying Historical Private Equity Investments

In a couple of my talks this year I showed a demonstration of how we can mine crowd-sourced data from the web to quantify whether we are in a tech bubble. I thought people might find it useful to see a step-by-step walkthrough of how this can be done. Keep in mind, if you find this too complex, it's going to get a lot easier for non-programmers to ask the same questions of the web once the tooling matures.
So let's be a bit more explicit. What are we trying to achieve here? I've always been curious about where the hot spots are for Venture Capital (VC) investing, so I could discover how Austin stacks up against Boston or Portland in terms of the amount of VC invested. Also, in late 2010 and 2011 there was a lot of talk about the US startup tech scene being in a bubble. Unfortunately, all the information I tended to come across about the private equity market was simply subjective commentary from various investors based on the deal flow they and their personal networks were seeing. The most quantitative work I had come across was the Fortune Magazine article by David Kaplan, "Don't call it the next tech bubble-yet", but the data largely focused on the public market rather than the private one. So I thought to myself: with the right data, you should be able to ask some high-level questions of the private market, comparing how much was invested this year vs. previous years along with the distribution and frequency of those investments, to determine whether 2011 was showing signs of a bubble.

DISCLAIMER: I am not a Venture Capitalist, nor am I in finance. I am, however, interested in understanding the technology ecosystem better, which includes its economy. This article is merely an example of how curious people can look to the web to explore data on whatever intrigues them.
Gathering the Data

The first step to exploring these questions involves finding the data. Data needs to be in a structured format to be queried, so before you embark on the process of working with data of arbitrary structure on the web, you should first see if you can find the data in a structured format. There are some great data marketplaces (follow the icons/links below) that provide searchable repositories of already structured data. I actually think it's pretty neat that we already have so many companies providing this service. There are also some institutional repositories, like data.gov, but I don't know how much longer that site will be around.
You can also attempt to gather your data via API. For example, you could use the Twitter API to gather tweets around a particular set of topics. This usually involves setting up a persistent connection where data is continually streamed in via the API and you dump it to disk or the Hadoop Distributed File System (HDFS) to be analyzed later.

Lastly, if you've exhausted those alternatives, you can web crawl (or spider). Web crawlers typically start by pulling down one or more web pages; these are known as the seeds of the crawl and constitute crawl depth 1. The crawler then extracts the links from those pages, builds a fetch list of pages to get, and pulls those pages down. This is crawl depth 2. The process repeats for however many crawl depths you have specified. As you can imagine, the number of pages retrieved grows greatly with each consecutive crawl depth. One can also specify crawl URL filters, where each link in the fetch list is matched against the filter criteria and discarded if it doesn't meet them. This allows one to constrain crawls to just one website, or just a part of a website.

Identifying the Website and Seeding the Crawl

After spending some time searching the web, I found that Crunchbase.com was rich with crowd-sourced private equity data specifically related to tech. The site was set up so that each company had a page containing the sector (or category), the location, the name and details of each round of funding. This was perfect! However, if you're planning on crawling a site, you need to *carefully* read the website's terms of service and the license for its content. Otherwise, you might get sued.

Crunchbase has an awesome and flexible license for folks who are not gaining financially from the use of their content. One last thing to keep in mind: websites have a "gentlemen's agreement" with web crawlers in which they specify which parts of the site are allowed to be crawled and which are not, via a robots.txt file. This is accessible at websitedomainname.com/robots.txt and is observed by most web crawlers, although it is possible to modify the source code of open source crawlers and remove the compliance. You need to check the allowed crawling scheme in this file to make sure you can actually get to the content you want to retrieve.

TOS, License & ROBOTS.TXT

Crunchbase Crawl Seed List
http://www.crunchbase.com/companies?c=a&q=privately_held
http://www.crunchbase.com/companies?c=b&q=privately_held
http://www.crunchbase.com/companies?c=c&q=privately_held
http://www.crunchbase.com/companies?c=d&q=privately_held
http://www.crunchbase.com/companies?c=e&q=privately_held
http://www.crunchbase.com/companies?c=f&q=privately_held
http://www.crunchbase.com/companies?c=g&q=privately_held
http://www.crunchbase.com/companies?c=h&q=privately_held
http://www.crunchbase.com/companies?c=i&q=privately_held
http://www.crunchbase.com/companies?c=j&q=privately_held
http://www.crunchbase.com/companies?c=k&q=privately_held
http://www.crunchbase.com/companies?c=l&q=privately_held
http://www.crunchbase.com/companies?c=m&q=privately_held
http://www.crunchbase.com/companies?c=n&q=privately_held
http://www.crunchbase.com/companies?c=o&q=privately_held
http://www.crunchbase.com/companies?c=p&q=privately_held
http://www.crunchbase.com/companies?c=q&q=privately_held
http://www.crunchbase.com/companies?c=r&q=privately_held
http://www.crunchbase.com/companies?c=s&q=privately_held
http://www.crunchbase.com/companies?c=t&q=privately_held
http://www.crunchbase.com/companies?c=u&q=privately_held
http://www.crunchbase.com/companies?c=v&q=privately_held
http://www.crunchbase.com/companies?c=w&q=privately_held
http://www.crunchbase.com/companies?c=x&q=privately_held
http://www.crunchbase.com/companies?c=y&q=privately_held
http://www.crunchbase.com/companies?c=z&q=privately_held
http://www.crunchbase.com/companies?c=other&q=privately_held
An ideal crawl for content retrieval should be at a crawl depth of 2. This means the crawl is specific and highly targeted, and it requires some thought beforehand as to exactly which URLs the crawl will be seeded with. After spending some time with the site, I determined that it could provide alphabetical indexes of just private companies, and that I could change which index was rendered by simply changing a parameter (such as from "c=A" to "c=B") in the query string. This let me create a seed list such that the next crawl depth would only pull down the individual company pages that had the data I sought. Now, there are a couple of options for a web crawler. If you don't want to do the hardware/software setup of a crawler, you can use a service like 80legs (also the creators of DataFiniti); otherwise I recommend the Apache Hadoop based web crawler, Apache Nutch. Nutch is awesome, and we'll be using it as our crawler for the purposes of this post.

The setup information for Apache Nutch is available on the project wiki. Setting up Nutch is outside the scope of this post; however, there are a few things that are non-obvious. If you are going to use Nutch for content retrieval rather than to build a search engine, you want to modify conf/nutch-site.xml so that the file.content.limit, http.content.limit, db.max.outlinks.per.page and db.max.inlinks properties are all set to -1 (a sample override is shown below, after the filter example). This ensures that neither the page content nor the fetch lists are truncated. In addition, make sure your crawl conf/*-urlfilter.txt files comment out the regular expression that skips URLs containing queries, i.e. it should look like this:
# skip URLs containing certain characters as probable queries, etc.
#-.*[?*!@=].*
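For completeness, the corresponding conf/nutch-site.xml overrides would look something like the following (only the properties called out above are shown; the rest of the file is left as shipped):

<!-- conf/nutch-site.xml overrides so neither page content nor fetch lists get truncated -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <property>
    <name>db.max.inlinks</name>
    <value>-1</value>
  </property>
</configuration>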
This isn't a terribly big crawl, and you could run it in single-process local mode if you wanted. Once Nutch is set up, you create a seeds directory and copy in a seeds.txt file containing the seed URLs provided above.

Then you launch your crawl with the following command:

bin/nutch crawl seeds -dir content -depth 2

Nutch will then run the crawl from the seeds specified in the "seeds" directory and store the results of the crawl in a "content" directory. For each crawl depth, the content directory will contain a unique segments directory corresponding to that depth. We can ignore the segments directory for the first crawl depth and just focus on the other one, which contains all the data. By default, Nutch runs quite a few jobs that build indexes for each segment and then merge them; all of this is unnecessary if you're using Nutch for content retrieval. I've embedded a talk I did on Nutch at the beginning of the year where I step through the jobs that are run for a crawl, how to run Nutch in distributed mode, and how to orchestrate Nutch to run just the jobs you need for content retrieval. I also spend a bit of time talking about processing Nutch Content objects in Nutch sequence files.

Extracting the data out of the sequence files in the segments directory and doing the analytics will be covered in the next blog post in this series.

Dec 6, 2011

Down the rabbit hole

"This is your LAST CHANCE. After this, there is no turning back. You take the blue pill, the story ends. You wake up and believe whatever you want to believe. You take the red pill and you stay in wonderland, I show you just how deep the rabbit hole goes" - Morpheus, The Matrix


Image Credit: CC - timsnell (Flickr)
I personally define Big Data as the challenges, solutions and opportunities around the storage, processing and delivery of data at scale. Clearly it's a broad and hot topic, and vendors from startups to the enterprise are actively providing solutions throughout the space. As a community and industry, we're also doing a good job of moving the ball further up the field, making this easier for everyone. However, I see one aspect that does not appear to be progressing at the same rate: the general understanding required to effectively apply Apache Hadoop.

I get this. I started working on Apache Hadoop a few years ago, somewhere around the 0.16.4 release. It took me a while to wrap my head around exactly how to use it and what use cases it supported. Eventually there was an "Aha!" moment when it sunk in, and then I started to get excited. Very excited. It was as if all data was the Matrix, Hadoop was the red pill that made it accessible, and now I could fully explore this new world of data and all the opportunity that came with it. Hadoop is an amazingly flexible architecture for analyzing arbitrarily structured data. The fact that the Apache Nutch web crawler runs on Hadoop means the potential for analyzing the enormous amount of data on the web is also there. One can run all kinds of interesting analyses on private and public web repositories (like Wikipedia) alike.

So, back to the issue of struggling to apply Hadoop. Most conversations I have with people, whether they work with the enterprise or inside it, touch on this problem. The story I both see and hear is that they can wrap their heads around getting the platform up and running, and might even be able to use it for an obvious use case such as Extract-Transform-Load (ETL), but as the cluster gets turned into a service, both the business and the technical side of the house struggle to match Hadoop with the opportunities they have in private and public data. The people responsible for information discovery (whether they be engineers or business intelligence analysts) are standing there looking at the Matrix, unable to properly digest the red pill.
 
So we have a problem here. For those of us who find this exciting and long to see people crawling WAY down the rabbit hole and coming up with all kinds of awesome discoveries, how can we deliver a red pill to those who don't see the Matrix? I think this goes beyond companies making the platform easier to run and providing tooling that makes queries easier to express. What do you do when your audience struggles to envision the query (or job) itself?

Formally training data scientists might be one approach. In its most canonical description, a data scientist is someone who can (among other things) bridge that gap in applying Hadoop. Most folks I know whom I would classify as data scientists sort of stumbled into the field, have non-traditional backgrounds like linguistics, physics and psychology, and their skill set is largely self-taught. Because Hadoop is so new, the definition of the term "Data Scientist" is a little nebulous. James Kobielus has written an article series on what it means to be a Data Scientist that explores various aspects of the role. Another approach might be to create and socialize a wide array of use cases for Hadoop and some easy on-ramps to exploring them. Either way, this is something that, as a community, we need to fix.

What are your ideas?

Image Credit: CC - Someday Soon (Flickr)