Well, after much teasing it's about time I posted something concrete about the work I've been doing with OpenCalais. As mentioned previously, I was working on my own .NET plugin / helper class but, as it turns out, an ex-colleague of mine (who is now hiding away in sunny Australia), Chris Fulstow, pipped me to the post and released his .NET Open Calais project on CodePlex. Rather than have two separate projects doing exactly the same thing, I began to merge my work with his, and the result can be found here: CalaisDotNet
New Version
While I was busy merging code, OpenCalais released version 2 of their web service, adding more entities and relationships as well as new 'Simple' and 'Microformats' output types. I was attending a museum data mashup day as part of the UK Museums on the Web Conference 2008 last week and wanted to run some museum data through OpenCalais, so I branched the code and began rewriting it to support the new output types. After a busy weekend finishing off the code and adding relationship (Events/Facts) support, I am pleased to say this new version is almost ready for release. I wanted to show some examples of how easy it is to process any type of data you submit to it, and how to use the power of LINQ to manipulate and query the result set you get back.
Requirements
CalaisDotNet is written using the wonderfulness of C# 3.0 and requires the .NET Framework v3.5 to be installed. Remember, .NET v3.5 is essentially just a new set of libraries - the CLR is still the same 2.0 version - which means that if you're running .NET 2.0 at the moment then it isn't too much of a big deal to install v3.5, as all your old stuff will still work exactly the same way.
You will also need an OpenCalais API Key which can be freely requested after you register at the OpenCalais website.
Here we go ..
CalaisDotNet can be broken down into two parts. The first is the call to the web service, handled by the CalaisDotNet object, and the second is the processed response data, contained within one of three Calais*Document types that represent the three different types of output from the OpenCalais web service. Calling each is trivial.
    var calais = new CalaisDotNet(_apiKey, _content);
    var document = calais.Call<CalaisRdfDocument>();
.. where:
_apiKey = (string) Your 24 digit API key
_content = (string) Your content to be processed. OpenCalais can accept input in three different formats - plain text, HTML and XML - and CalaisDotNet supports all three. CalaisDotNet will take a guess at the format of your text, or you can specify the input type with an extra parameter in the constructor.
    var calais = new CalaisDotNet(_apiKey, _content, CalaisInputFormat.Text);
    var document = calais.Call<CalaisRdfDocument>();
The document returned represents the processed output from the web service and gives you the ability to access various collections of data such as Entities or Relationships.
Simple Format
Documentation: HERE
This is a new format introduced in the latest version, with a reduced set of properties and entities. It is ideal for things such as tag clouds, as it only exposes a simple list of basic entities, their frequency in the document and the value of each. Its document description information is also much reduced, with only five properties.
You can still build up a LINQ query to filter or order the data in any way that you choose.
Use the CalaisSimpleEntityType enum to filter by entity type.
An example query would look something like this, filtering for results where the entity type is 'Country':
    var calais = new CalaisDotNet(_apiKey, _content, CalaisInputFormat.Text);
    var document = calais.Call<CalaisSimpleDocument>();
    var results = from item in document.Entities
                  where item.Type == CalaisSimpleEntityType.Country
                  select item;
    foreach (var result in results)
    {
        Console.WriteLine(result.Value);
    }
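Since the Simple format gives you each entity's frequency, a tag cloud is mostly a matter of ordering and scaling. Here is a minimal sketch of that idea. Note that it uses a stand-in SimpleEntity class rather than the real CalaisDotNet types, and the Value and Frequency property names (and the TagCloud helper itself) are my assumptions for illustration, so check them against the library before copying anything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an entity from a CalaisSimpleDocument.
// The property names are assumptions, not the library's confirmed API.
public class SimpleEntity
{
    public string Value { get; set; }
    public int Frequency { get; set; }
}

public static class TagCloud
{
    // Maps each entity to a font size between minSize and maxSize,
    // scaled linearly by frequency, most frequent entity first.
    public static List<KeyValuePair<string, int>> Build(
        IEnumerable<SimpleEntity> entities, int minSize, int maxSize)
    {
        var list = entities.ToList();
        if (list.Count == 0)
        {
            return new List<KeyValuePair<string, int>>();
        }

        int low = list.Min(e => e.Frequency);
        int high = list.Max(e => e.Frequency);
        int range = Math.Max(1, high - low); // avoids dividing by zero when all frequencies match

        var sized = from e in list
                    orderby e.Frequency descending, e.Value
                    select new KeyValuePair<string, int>(
                        e.Value,
                        minSize + (e.Frequency - low) * (maxSize - minSize) / range);

        return sized.ToList();
    }
}
```

With the real library you would project document.Entities into whatever shape your page needs; the scaling step stays the same.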
Microformats
Documentation: HERE
This is a very basic implementation as, frankly, I don't know how we can add any value to it, since a lot of the work is done by the web service to format the data into hCalendars and hCards. Most of my time was spent making the RDF stuff work, so suggestions are welcome on how best to process this into something useful :)
To grab the unprocessed output, use the RawOutput property, available on all the Calais*Documents, to see the original response.
    var calais = new CalaisDotNet(_apiKey, _content, CalaisInputFormat.Text);
    var document = calais.Call<CalaisMicroFormatsDocument>();
    Console.WriteLine(document.RawOutput);
RDF Magic
The meat of the semantic data is (of course) contained within the CalaisRdfDocument class. It has a much richer set of document description metadata than the 'Simple' format.
The document also contains an IEnumerable list of Entities and an IEnumerable list of Relationships (Events/Facts). These can, for example, then be filtered by entity type (CalaisRdfEntityType) or relationship type (CalaisRdfRelationshipType).
Each entity/relationship also contains a list of all instances of that entity/relationship in the submitted document. Some examples:
Filtering where CalaisRdfEntityType is 'Company' and printing their location offsets.
    var calais = new CalaisDotNet(_apiKey, _content);
    var document = calais.Call<CalaisRdfDocument>();
    var results = from item in document.Entities
                  where item.EntityType == CalaisRdfEntityType.Company
                  select item;
    foreach (var result in results)
    {
        Console.WriteLine(result);
        // List every place this entity occurs in the submitted text.
        foreach (var instance in result.Instances)
        {
            Console.WriteLine(
                " - Found at offset: " +
                instance.Offset + " (" +
                instance.Length + " chars)"
            );
        }
    }
Returns only 'PersonPolitical' relationships
    var results = from item in document.Relationships
                  where item.RelationshipType == CalaisRdfRelationshipType.PersonPolitical
                  select item;
    foreach (var result in results)
    {
        Console.WriteLine(result);
        // List every place this relationship occurs in the submitted text.
        foreach (var instance in result.Instances)
        {
            Console.WriteLine(
                " - Found at offset: " +
                instance.Offset + " (" +
                instance.Length + " chars)"
            );
        }
    }
Slightly more complicated ..
Filters results by country and then looks up any relationships that are related to that country.
    var calais = new CalaisDotNet(_apiKey, _content);
    var document = calais.Call<CalaisRdfDocument>();
    var results = from item in document.Entities
                  where item.EntityType == CalaisRdfEntityType.Country
                  select item;
    foreach (var result in results)
    {
        Console.WriteLine(result);
        foreach (var instance in result.Instances)
        {
            Console.WriteLine(
                " - Found at offset: " +
                instance.Offset + " (" +
                instance.Length + " chars)"
            );
        }
        // Find any relationships whose details mention this country.
        var rels = from item in document.Relationships
                   where item.RelationshipDetails.Values.Contains(result.Value)
                   select item;
        foreach (var rel in rels)
        {
            Console.WriteLine(" - Relationship: " + rel);
        }
    }
Download
Currently this version is still in a branch, so you will have to compile it using the solution in the 'CalaisDotNet-NewFeatures_200805' folder .. you can download it from the source tab of the CodePlex project site (HERE). I'm hoping to make this a release soon, once it's been QA'd and also when I work out how to do it hehe :P
TO DO
- MicroFormats - As mentioned earlier, we need to look at the Microformats output and work out how to present it usefully.
- RDFa - It would be really nice to be able to output the submitted document as RDFa .. we have the entities and we know where they are in the text, so this shouldn't be too hard a jump .. personally I just need to understand RDFa better first.
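To give a feel for the RDFa idea, here is a rough sketch of the offset-wrapping step only. It is not part of CalaisDotNet: the Mention class and RdfaSketch helper are hypothetical stand-ins, and the markup (a span with a bare type name in its typeof attribute) is just a placeholder that would need a proper vocabulary and properties to be meaningful RDFa. It also assumes the offsets are character positions in the same string that was submitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security;
using System.Text;

// Hypothetical stand-in for one occurrence of an entity in the submitted text.
public class Mention
{
    public string EntityType { get; set; }
    public int Offset { get; set; }
    public int Length { get; set; }
}

public static class RdfaSketch
{
    // Walks the text once in offset order, wrapping each mention in a span
    // and escaping all text. Overlapping or out-of-range mentions are skipped.
    public static string Annotate(string text, IEnumerable<Mention> mentions)
    {
        var output = new StringBuilder();
        int position = 0;

        foreach (var m in mentions.OrderBy(x => x.Offset))
        {
            bool overlapsPrevious = m.Offset < position;
            bool outOfRange = m.Length <= 0 || m.Offset + m.Length > text.Length;
            if (overlapsPrevious || outOfRange)
            {
                continue;
            }

            // Plain text between the previous mention and this one.
            output.Append(SecurityElement.Escape(text.Substring(position, m.Offset - position)));

            // The mention itself, wrapped in placeholder markup.
            output.Append("<span typeof=\"" + SecurityElement.Escape(m.EntityType) + "\">");
            output.Append(SecurityElement.Escape(text.Substring(m.Offset, m.Length)));
            output.Append("</span>");

            position = m.Offset + m.Length;
        }

        // Whatever text is left after the last mention.
        output.Append(SecurityElement.Escape(text.Substring(position)));
        return output.ToString();
    }
}
```

Building the output in a single forward pass means the original offsets never shift underneath you, which is the main trap when inserting markup into a string.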