UPDATE: I forgot to mention that PiCloud was bought by Dropbox (congrats!) at the end of 2013. Fortunately, MultyVac will rise from its ashes. How well MultyVac supports web mining remains to be seen.
Recently, I've done a number of projects utilizing web mining techniques. Also known as web scraping, web mining is a set of methods to extract useful information from websites, be it content, connections, or other data. With the stagnation of the semantic web and the often-lacking API endpoints, web mining has become a popular way to retrieve and consume publicly available data in a scalable fashion.
A web scraper typically has three components: an HTTP client, content parsing, and data consumption. As I have been using mostly Python for my recent work, many of my recommended libraries will be in Python.
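As a rough sketch of how the three components fit together (the URL and function names here are purely illustrative, not a prescribed structure):

```python
import requests
from bs4 import BeautifulSoup

def fetch(url):
    # HTTP client: retrieve the raw response from the web
    response = requests.get(url)
    response.raise_for_status()
    return response.text

def parse(html):
    # Content parsing: extract the pieces we actually care about
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]

def consume(links):
    # Data consumption: store or otherwise use the extracted data
    for link in links:
        print(link)

if __name__ == "__main__":
    consume(parse(fetch("http://example.com")))
```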
The first step in mining the web for information is to search for an API. It's surprising how trigger-happy developers can get with web scraping. If an API already exists for the needed data, don't scrape it! Web mining often requires significant work to set up and maintain, as any HTML change may throw off the script. The best way to scrape the web is to not do it.
The HTTP client is responsible for communicating with the internet. It needs to be able to make basic HTTP requests (e.g. GET and POST) and output the response. A good example is GNU wget, which retrieves the raw response at the URI of interest. Extra points go to libraries and tools that can understand and format the response. My favorite is requests, which saves developers time on common routines, including status code parsing and header extraction.
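For example, a minimal GET request with requests looks like this (the URL is only a placeholder):

```python
import requests

response = requests.get("http://example.com/data")
print(response.status_code)               # e.g. 200; no manual parsing of the status line
print(response.headers["Content-Type"])   # headers exposed as a dict-like object
print(response.text)                      # decoded response body
```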
Often, this client may also disguise its requests as if they came from a popular browser, in order to better mirror the traditional user experience. Alternatively, a pool of IP addresses may be used, through cloud computing services such as AWS and PiCloud, to distribute the load. If the data you need is populated through an AJAX call, use tools such as Fiddler to examine the particular HTTP request that populated the data of interest. If the same response can be replicated by manipulating the GET arguments (i.e. the stuff following ? in the URI), you can make fewer requests to get the same information. Depending on the website, AJAX responses are often JSON objects, making them easier to parse and traverse.
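Here is a small sketch of both ideas together; the endpoint, parameter names, and user-agent string are made up for illustration:

```python
import requests

# Pretend to be a common browser; many sites serve different content otherwise.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"}

# Replicate the AJAX call directly by manipulating the GET arguments
# (hypothetical endpoint and parameters).
params = {"query": "python", "page": 2}
response = requests.get("http://example.com/ajax/search",
                        headers=headers, params=params)

data = response.json()  # AJAX responses are often JSON, which requests can parse for us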
We can traverse the HTML document tree with tools that support XPath expressions, or, if you are using Python, with BeautifulSoup. By identifying a node's specific order or unique traits, these parsers can quickly traverse to the node containing the target data. This method is sensitive to UI updates, as node orders and attributes may change. If the website doesn't update often, this can be a good and easy-to-follow technique.
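For instance, using lxml's XPath support or BeautifulSoup to reach the same node (the page structure below is assumed):

```python
from lxml import html
from bs4 import BeautifulSoup

page = "<html><body><div class='price'>$9.99</div></body></html>"

# XPath: jump straight to the node by its attribute
tree = html.fromstring(page)
price_xpath = tree.xpath("//div[@class='price']/text()")[0]

# BeautifulSoup: find the same node by its unique trait
soup = BeautifulSoup(page, "html.parser")
price_bs = soup.find("div", class_="price").get_text()
```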
While tree traversal and regular expressions are two very different techniques, they are not mutually exclusive. I have often traversed to a set of relevant HTML nodes using BeautifulSoup, then used a regex to extract the actual data of interest.
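A small sketch of that combination, with made-up markup and pattern:

```python
import re
from bs4 import BeautifulSoup

page = "<ul><li>Order #1234 shipped</li><li>Order #5678 pending</li></ul>"
soup = BeautifulSoup(page, "html.parser")

order_ids = []
# Traverse to the relevant nodes first...
for item in soup.find_all("li"):
    # ...then let a regex pull out the actual data of interest.
    match = re.search(r"#(\d+)", item.get_text())
    if match:
        order_ids.append(match.group(1))

print(order_ids)  # ['1234', '5678']
```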
An important step many developers forget, or do not emphasize enough, is actually consuming the data. While it is the last step of importing data from the web, it should be one of the first to be considered when designing the solution. How is the data ultimately used? What is the nature of the data, and how can it be made easier to manipulate later? We need to think about these questions before we set out to build a web mining solution. If you are dealing with text data, you may want to apply text analytics techniques afterwards, requiring you to clean up the text and strip any HTML tags or encoded characters. If you are importing data tables, you may need to properly cast each row into useful data types (e.g. Int32, Boolean, String, etc). Ask any database engineer, and they can give you a much more thorough list of the dangers of improperly processed data.
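For example, stripping leftover HTML tags and casting scraped cells before storage might look like this (the row layout and field names are hypothetical):

```python
from bs4 import BeautifulSoup

raw_rows = [["<b>42</b>", " true ", "Widget &amp; Co."]]

clean_rows = []
for count, flag, name in raw_rows:
    clean_rows.append({
        "count": int(BeautifulSoup(count, "html.parser").get_text()),  # strip tags, cast to int
        "in_stock": flag.strip().lower() == "true",                    # cast to boolean
        "name": BeautifulSoup(name, "html.parser").get_text(),         # unescape &amp;
    })

print(clean_rows)  # [{'count': 42, 'in_stock': True, 'name': 'Widget & Co.'}]
```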
Finally, it is important to have an attitude of gratitude when mining information from the web. Often, the information we seek is not the primary purpose of the target website, so its quality, support, or even access is not guaranteed. If we can avoid overstepping our bounds with traffic load and usage, the web community as a whole may be more inclined to share its data.
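One simple courtesy is to throttle your requests so the load on the host stays light; a minimal sketch, with an arbitrary delay and placeholder URLs:

```python
import time
import requests

urls = ["http://example.com/page/%d" % i for i in range(1, 4)]

for url in urls:
    response = requests.get(url)
    # ... parse and consume the response here ...
    time.sleep(2)  # be a polite guest: pause between requests
```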