I have recently released Mida as a Gem for parsing/extracting Microdata from web pages. Not many sites are using Microdata at the moment; in fact, apart from this site, I only know of one other: Trust a Friend, which is another site that I work on. However, as HTML5 becomes more widely adopted, I am sure that this will change and Mida will become more useful.
Microdata
Microdata is part of the upcoming HTML5 standard and is a way to label the content on web pages so that it is machine readable. I prefer this to its main rival, Microformats, because it is simpler to implement, though I do recognize that this simplicity can make it less precise.
As an example, the following HTML is marked up with Microdata:
<div itemscope itemtype="http://data-vocabulary.org/Product"
itemid="urn:isbn:1-86207-839-4">
<h2 itemprop="name">Electronic Brains: Stories from the Dawn of the Computer Age</h2>
By <span itemprop="brand">Mike Hally</span>
<meta itemprop="category" content="Media > Books > Non-Fiction > Computer Books" />
<meta itemprop="identifier" content="isbn:1-86207-839-4" />
<div itemprop="review" itemscope itemtype="http://data-vocabulary.org/Review">
<h3 itemprop="summary">Great short history of early computers</h3>
Rated <span itemprop="rating">5.0</span>/5.0 by
<span itemprop="reviewer">Lawrence Woodman</span> on
<time itemprop="dtreviewed" datetime="2009-06-03">3rd June 2009</time>
<div itemprop="description">
While reading this I came across quite a few surprises, such as the early
successes in Australia. There is also a chapter on Remington Rand's Rand
409, another early computer which I don't think has been covered much elsewhere.
Finally it tries to explain how IBM became the market leader despite its late entry
into the field.
</div>
</div>
</div>
If the above were parsed it would yield the following information:
{
:type => "http://data-vocabulary.org/Product",
:id => "urn:isbn:1-86207-839-4",
:properties => {
"name" => ["Electronic Brains: Stories from the Dawn of the Computer Age"],
"brand" => ["Mike Hally"],
"category" => ["Media > Books > Non-Fiction > Computer Books"],
"identifier" => ["isbn:1-86207-839-4"],
"review" => [{
:type => "http://data-vocabulary.org/Review",
:id => nil,
:properties => {
"summary" => ["Great short history of early computers"],
"rating" => ["5.0"],
"reviewer" => ["Lawrence Woodman"],
"dtreviewed" => ["2009-06-03"],
"description" => [
"While reading this I came across quite a few surprises, such as the early
successes in Australia. There is also a chapter on Remington Rand's Rand
409, another early computer which I don't think has been covered much elsewhere.
Finally it tries to explain how IBM became the market leader despite its late
entry into the field."
]
}
}]
}
}
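To give a feel for working with a structure like this, here is a minimal sketch in plain Ruby that walks a Hash shaped like the one above and collects every property name, descending into nested items. The property_names helper and the trimmed-down PRODUCT Hash are illustrative only and are not part of Mida.

```ruby
# A trimmed-down copy of the parsed structure shown above.
PRODUCT = {
  :type => "http://data-vocabulary.org/Product",
  :id => "urn:isbn:1-86207-839-4",
  :properties => {
    "name" => ["Electronic Brains: Stories from the Dawn of the Computer Age"],
    "brand" => ["Mike Hally"],
    "review" => [{
      :type => "http://data-vocabulary.org/Review",
      :id => nil,
      :properties => {
        "summary" => ["Great short history of early computers"],
        "rating" => ["5.0"]
      }
    }]
  }
}

# Hypothetical helper (not part of Mida): returns the property names of an
# item, prefixing the names of nested items with their parent property.
def property_names(item, prefix = nil)
  item[:properties].flat_map do |name, values|
    full_name = prefix ? "#{prefix}.#{name}" : name
    nested = values.select { |value| value.is_a?(Hash) }
    if nested.empty?
      [full_name]
    else
      nested.flat_map { |child| property_names(child, full_name) }
    end
  end
end

puts property_names(PRODUCT)
```

Every value is an Array, even when there is only one entry, so the helper always iterates over the values rather than assuming a single String.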
Installation
I have made a Gem for Mida and hosted it on RubyGems, so installation is as easy as:
gem install mida
Usage
Mida is very easy to use, as the following examples illustrate. They all assume that you have required mida and open-uri.
This is the current usage and is likely to change, so please see the RDocs for the current version.
Extracting Microdata from a page
All the Microdata is extracted from a page when a new Mida::Document instance is created.
To extract all the Microdata from a webpage:
url = 'http://example.com'
# Assign the block's result so that doc is still in scope after the block
doc = open(url) {|f| Mida::Document.new(f, url)}
The top-level Items will be held in an Array accessible via doc.items.
To simply list all the top-level Items that have been found:
puts doc.items
Searching
If you want to search for an Item that has a specific itemtype/vocabulary, this can be done with the search method.
To return all the Items that use one of Google’s Review vocabularies:
doc.search(%r{http://data-vocabulary\.org.*?review.*?}i)
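As a rough illustration of what that pattern selects, the following plain Ruby sketch applies the same regular expression to a few itemtype strings. The list of itemtypes and the constant names are examples chosen for this sketch, and it only shows the pattern matching itself, not how Mida performs the search internally.

```ruby
# The same pattern as passed to doc.search above.
REVIEW_VOCABULARY = %r{http://data-vocabulary\.org.*?review.*?}i

# Example itemtype values (chosen for illustration).
ITEMTYPES = [
  "http://data-vocabulary.org/Product",
  "http://data-vocabulary.org/Review",
  "http://data-vocabulary.org/Review-aggregate",
  "http://example.com/Review"
]

# Keep only the itemtypes that the pattern matches.
MATCHING = ITEMTYPES.grep(REVIEW_VOCABULARY)
puts MATCHING
```

The i flag makes the match case-insensitive, which is why the capitalised Review in the itemtype still matches the lower-case review in the pattern, while the itemtype from a different domain is left out.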
Inspecting an Item
Each Item is a Mida::Item instance and has three main methods of interest: type, properties and id.
To find out the itemtype of the Item:
puts doc.items.first.type
To find out the itemid of the Item:
puts doc.items.first.id
Properties are returned as a Hash containing name/value pairs. The values will be an Array of either String or Mida::Item instances.
To see the properties of the Item:
puts doc.items.first.properties
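Because a property value can be either a String or another Item, code that consumes the properties usually needs to check which one it has. The sketch below shows one way this might look. FakeItem is a stand-in Struct that only mimics the type, id and properties methods described above so that the example runs without the gem, and outline is a hypothetical helper, not part of Mida.

```ruby
# Stand-in for Mida::Item, used so this sketch runs without the gem.
# It only mimics the three methods described above.
FakeItem = Struct.new(:type, :id, :properties)

# Hypothetical helper: builds an indented outline of an item, recursing
# whenever a property value is itself an item rather than a String.
def outline(item, depth = 0)
  indent = "  " * depth
  lines = ["#{indent}#{item.type}"]
  item.properties.each do |name, values|
    values.each do |value|
      if value.is_a?(String)
        lines << "#{indent}  #{name}: #{value}"
      else
        lines << "#{indent}  #{name}:"
        lines.concat(outline(value, depth + 2))
      end
    end
  end
  lines
end

REVIEW = FakeItem.new(
  "http://data-vocabulary.org/Review", nil,
  { "rating" => ["5.0"], "reviewer" => ["Lawrence Woodman"] }
)
PRODUCT_ITEM = FakeItem.new(
  "http://data-vocabulary.org/Product", "urn:isbn:1-86207-839-4",
  { "brand" => ["Mike Hally"], "review" => [REVIEW] }
)

puts outline(PRODUCT_ITEM)
```

With real Mida::Item instances the same approach should apply, though the exact class check may need adjusting to whatever the current version of Mida returns.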
Contributing
Mida is still in its early stages and much is likely to change. If you would like to contribute to the project, the source is hosted on GitHub. The best place to raise ideas, bugs and feature requests is Mida's Issues page.