I have recently released Mida, a Gem for parsing and extracting Microdata from web pages. Not many sites are using Microdata at the moment; in fact, apart from this site, I know of only one other: Trust a Friend, another site that I work on. However, as HTML5 becomes more widely adopted, I am sure that this will change and Mida will become more useful.
Microdata is part of the upcoming HTML5 standard and is a way to label the content on web pages so that it is machine readable. I prefer it to its main rival, Microformats, because it is simpler to implement, though I do recognize that this simplicity can make it less precise.
As an example, the following HTML is marked up with Microdata:
If the above were parsed it would yield the following information:
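To make this concrete, here is a minimal sketch in Ruby. The markup, the vocabulary URL and the naive_microdata helper are all hypothetical, and the regex-based extraction only copes with flat, well-formed markup; it is an illustration of the idea, not how Mida parses a page.

```ruby
# Hypothetical markup: a single Item using a made-up vocabulary URL.
SAMPLE_HTML = <<~HTML
  <div itemscope itemtype="http://example.com/vocab/review">
    <span itemprop="itemreviewed">Romeo Pizza</span>
    <span itemprop="rating">4.5</span>
  </div>
HTML

# Naive extraction for flat markup only (assumption: no nested Items,
# no itemref, values held as element text). A real parser handles far more.
def naive_microdata(html)
  type = html[/itemtype="([^"]+)"/, 1]
  properties = {}
  html.scan(/itemprop="([^"]+)"[^>]*>([^<]*)</) do |name, value|
    # Each property name maps to an Array, since a name may appear more than once.
    (properties[name] ||= []) << value.strip
  end
  { type: type, properties: properties }
end

parsed = naive_microdata(SAMPLE_HTML)
puts parsed[:type]
parsed[:properties].each { |name, values| puts "#{name}: #{values.join(', ')}" }
```

The result is the Item's type plus a Hash of property names, each mapped to an Array of values.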
I have made a Gem for Mida and hosted it on RubyGems, so installation is as easy as running gem install mida.
Mida is very easy to use, as the following examples illustrate. They all
assume that you have required mida.
This is the current usage and is likely to change, so please see the RDocs for
the current version.
Extracting Microdata from a page
All the Microdata is extracted from a page when a new
Mida::Document instance is created.
To extract all the Microdata from a webpage:
The top-level Items will be held in an
Array accessible via the Document's items method.
To simply list all the top-level Items that have been found:
If you want to search for an Item that has a specific itemtype/vocabulary, this
can be done with the search method.
To return all the Items that use one of Google’s Review vocabularies:
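As a rough illustration of searching by itemtype, the sketch below filters items with a regular expression. FoundItem and search_items are hypothetical stand-ins written for this example, not Mida's own classes or methods; the sketch also assumes that nested Items held as property values should be searched too.

```ruby
# Hypothetical stand-in for a parsed Item; this is not Mida's own class.
FoundItem = Struct.new(:type, :id, :properties)

# Return every item whose itemtype matches the pattern, descending into
# nested items that appear as property values.
def search_items(items, pattern)
  items.flat_map do |item|
    nested = item.properties.values.flatten.grep(FoundItem)
    matches = pattern.match?(item.type.to_s) ? [item] : []
    matches + search_items(nested, pattern)
  end
end

review = FoundItem.new('http://data-vocabulary.org/Review', nil,
                       { 'rating' => ['4'] })
aggregate = FoundItem.new('http://data-vocabulary.org/Review-aggregate', nil,
                          { 'count' => ['12'] })
page = FoundItem.new('http://example.com/vocab/page', nil,
                     { 'review' => [review] })

# One pattern covers both Review and Review-aggregate.
reviews = search_items([page, aggregate], %r{http://data-vocabulary\.org/Review}i)
reviews.each { |item| puts item.type }
```

Matching on a regular expression rather than an exact string is what lets a single search cover a family of related vocabularies.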
Inspecting an Item
Each Item is a
Mida::Item instance and has three main methods of interest: type, id and properties.
To find out the itemtype of the Item:
To find out the itemid of the Item:
Properties are returned as a
Hash containing name/value pairs. The values will be an
Array of either String or Mida::Item instances.
To see the properties of the Item:
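Because a property value can itself be an Item, inspecting properties is naturally recursive. The following sketch walks such a structure; NestedItem and describe are hypothetical names used only for this illustration, and the Struct merely mimics the type/id/properties shape described above rather than using Mida itself.

```ruby
# Hypothetical stand-in for a parsed Item; not Mida's own class.
NestedItem = Struct.new(:type, :id, :properties)

# Build indented lines describing an item. Each property maps to an Array
# whose elements are either Strings or nested items.
def describe(item, indent = 0)
  pad = ' ' * indent
  lines = ["#{pad}type: #{item.type.inspect}", "#{pad}id: #{item.id.inspect}"]
  item.properties.each do |name, values|
    values.each do |value|
      if value.is_a?(NestedItem)
        lines << "#{pad}#{name}:"
        lines.concat(describe(value, indent + 2))
      else
        lines << "#{pad}#{name}: #{value}"
      end
    end
  end
  lines
end

reviewer = NestedItem.new('http://data-vocabulary.org/Person', nil,
                          { 'name' => ['Alice'] })
review = NestedItem.new('http://data-vocabulary.org/Review', nil,
                        { 'itemreviewed' => ['Romeo Pizza'],
                          'reviewer' => [reviewer] })
puts describe(review)
```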
Mida is still in the early stages and much is likely to change. If you would like to contribute to the project, the source is hosted on GitHub. The best place to raise ideas, bugs and feature requests is Mida’s Issues page.