Semanlink, presentation paper

François-Paul Servant <fps@semanlink.net>

Abstract

Semanlink is a tool, based on RDF and built with Jena, that can be used to manage files, bookmarks and short text notes. Basically, it is a tagging utility, but it provides a simple way to organize tags in a graph, thus allowing a user to incrementally define the vocabulary he/she uses when annotating documents with metadata. Compared with current tagging systems, Semanlink adds an organization of the tag space that they generally lack, and that is the basis for "concept navigation" among tags and documents. Compared with hierarchical file systems, it adds the possibility to store each document in several "directories" at the same time, and it cleanly separates system-level addressing (file path or URI) from classification data (metadata information).

Introduction

Semanlink has been developed as a personal information management system. It is a tagging utility, based on RDF, that can be used to add tags (and other RDF metadata) to files, to bookmarks, and to short notes that it lets you write, organize and display.

A graph of tags

Tags are organized by the user in a graph of tags: each tag can have several "parents" and "children", as well as other RDF properties. In this sense, Semanlink is a tag tagging tool. The user builds his/her own vocabulary, which models his/her representation of concepts and their relations in a simple yet useful way. Semanlink provides a GUI to easily navigate through the graph of tags (making extensive use of trees). And of course, it uses that taxonomy when searching: a document tagged with "Jena" will be found when searching for "RDF and Java", because a search also matches the descendants of the tags it names.
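The taxonomy-aware search described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not Semanlink's actual code (Semanlink is built in Java with Jena). The tag data is a toy example: "RDF" and "Java dev" as parents of "Jena" come from the screen shots, while the "Java dev" to "Java" link, the "Semantic Web" parent and the document names are assumptions made for illustration.

```python
# Hypothetical toy data: tag -> set of parent tags (a tag may have several parents).
PARENTS = {
    "Jena": {"RDF", "Java dev"},
    "Java dev": {"Java"},        # assumed link, for illustration only
    "RDF": {"Semantic Web"},     # assumed link, for illustration only
}

# Hypothetical documents: document id -> tags attached directly to it.
DOCUMENTS = {
    "jena-tutorial.html": {"Jena"},
    "rdf-primer.html": {"RDF"},
}


def with_ancestors(tags):
    """Return the given tags plus every tag reachable by following parents."""
    seen = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag not in seen:
            seen.add(tag)
            stack.extend(PARENTS.get(tag, ()))
    return seen


def search(*query_tags):
    """Return documents matching ALL query tags, directly or via a descendant."""
    wanted = set(query_tags)
    return sorted(doc for doc, tags in DOCUMENTS.items()
                  if wanted <= with_ancestors(tags))


print(search("RDF", "Java"))  # the document tagged only with "Jena" matches
```

In this sketch a document matches a query tag when that tag is among the ancestors of one of its own tags, which is equivalent to saying that the document carries a descendant of the query tag.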

A file based RDF store

Metadata about local files is stored in small RDF files, which are saved in the same directory as the files they describe (or in a parent directory). This is an important feature, and not only because no database is needed to run Semanlink. For instance, when you save a copy of a directory to a CD or move it to another location, the metadata about its files is copied or moved with it and is ready to be used without any modification, because relative URLs are used to identify the files.
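Why relative URLs make a directory portable can be shown with a small sketch. It assumes a metadata file named "metadata.rdf" (a hypothetical name, as are the directory paths) and simply resolves a relative reference against the location of the metadata file itself, which is the standard way relative URLs are interpreted.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin


def resolve(metadata_file, relative_ref):
    """Resolve a relative reference against the metadata file's own location."""
    base = PurePosixPath(metadata_file).as_uri()
    return urljoin(base, relative_ref)


# The same relative reference, before and after the directory is moved.
# "metadata.rdf" and both directories are hypothetical names.
print(resolve("/home/me/docs/metadata.rdf", "report.pdf"))
print(resolve("/media/cdrom/docs/metadata.rdf", "report.pdf"))
```

The stored reference ("report.pdf") never changes; only the base it is resolved against does, so the metadata keeps pointing at the right file wherever the directory ends up.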

Metadata about bookmarks, as well as the short notes created within the application, are stored in a directory organized with a "year/month" structure.

The definition of tags is stored separately from the metadata about files: a vocabulary can therefore be reused in another context. Several graphs of tags, each with its own URI, can be used to mark a set of documents.

All these RDF files are loaded into memory at startup.

Semanlink is based on Jena

Semanlink runs as a servlet and has been developed with Jena. It uses plain vanilla RDF memory models, which give very fast response times for any operation, at least on my models (more than 20,000 statements about 3,000 documents, involving 2,000 tags, as of this writing).

Semanlink on the web

A limited demo of Semanlink has just been put online at http://www.semanlink.net. For now, it is just a rough copy of a subset of my personal Semanlink store: content cannot be edited, and access is limited to bookmarks (links to documents lead to a password-protected area, as they are often copyrighted material that I saved for my own use). This clearly needs some polishing, but I wanted to give readers a chance to try Semanlink first-hand at the time of this submission. I intend to improve this situation in the coming weeks, and to release Semanlink as open source in the near future.

The following gives a look at the functionalities of Semanlink through some screen shots.

Screen shots

Main page: newest entries

The main page looks very much like the main page of a blog. Each entry has a title, a creation date, tags, and possibly a text describing it (generally an abstract or a quote).

Tag

One of the views of a tag's page, namely the "Jena" tag. This tag has several "parents" ("HP", "Java dev", "RDF"...), and also several "children" ("Jena : introduction", "Jena rules", "semblog"). The tree with root "Jena" is shown here partially opened. (The complete tree of descendants is not loaded when accessing the tag's page: subtrees are loaded on request using Ajax.) On the right side are the "linked tags": other tags used to mark the documents listed in the page.

Other views of a tag are available: a simple list of the related documents instead of the tree view (so that you can sort them by a given property), an expanded view of the tree, a list of images displayed as snapshots, and so on. Here is a view displaying all the documents marked with the tag "Archéologie du Niger" (Archeology in Niger) or any of its descendants. The screen shot also illustrates that you can have a quick look at images marked by the tag (loaded on request via Ajax).

Adding a document

A "bookmarklet" lets you easily bookmark a web page. Once you have confirmed the creation of the bookmark (or of a local copy of the page), you can annotate it with metadata. Some metadata has already been added automatically, such as a creation date and a title, but that is not all. If you had some text selected in the page when you clicked the bookmarklet, it is added to a "comment" property. A basic tag extraction mechanism, which looks for tags in the title and the comment, also proves useful. In the example below, this procedure has been sufficient to add three tags to the document, two of which are actually correct. Note that the text "P2P" has automatically been replaced by "Peer to peer" (because "P2P", in the taxonomy, is defined as an "alias" of the tag "Peer to peer"). "Philosophie", which has "Philosophy" as alias, should not have been selected: just check the box on its left side and click "Remove".
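A basic tag extraction of this kind might look like the following sketch. It is only an illustration under stated assumptions, not the mechanism Semanlink actually uses: the vocabulary structure, the function names and the sample sentence are hypothetical, and matching is done on whole words, ignoring case.

```python
import re

# Hypothetical vocabulary: canonical tag label -> list of aliases.
VOCABULARY = {
    "Peer to peer": ["P2P"],
    "Semantic Web": [],
    "Philosophie": ["Philosophy"],
}


def build_lookup(vocabulary):
    """Map every label and alias (lower-cased) to its canonical tag label."""
    lookup = {}
    for label, aliases in vocabulary.items():
        for name in [label, *aliases]:
            lookup[name.lower()] = label
    return lookup


def extract_tags(text, vocabulary):
    """Return the canonical tags whose label or alias occurs in the text."""
    lowered = text.lower()
    found = set()
    for name, label in build_lookup(vocabulary).items():
        # \b keeps a short name such as "p2p" from matching inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.add(label)
    return sorted(found)


title = "P2P and the Semantic Web: a philosophy of sharing"
print(extract_tags(title, VOCABULARY))
```

As in the example described above, an alias found in the text is reported under its canonical tag, and a purely textual match can produce a tag that the user then has to remove by hand.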

At this step, you'll probably want to add more metadata to the document, other tags in particular. The "live search" among the tags is very useful for that purpose. In our example, you may want to add some tags related to "Semantic Web" - but you don't know exactly which ones. Type "sem web" in the search field to get all the tags containing words beginning with "sem" and "web". You can have a quick look at the descendants of any of the tags in the returned list, and choose among them which tag to add to the document.
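The matching rule of the live search (every fragment typed must be the beginning of some word of the tag) can be sketched as follows. The function name and the list of tags are assumptions made for the example; "Semantic Web services" and "Web 2.0" in particular are hypothetical tags.

```python
def live_search(query, tags):
    """Return the tags in which every query fragment starts some word."""
    fragments = query.lower().split()
    matches = []
    for tag in tags:
        words = tag.lower().split()
        if all(any(word.startswith(f) for word in words) for f in fragments):
            matches.append(tag)
    return matches


TAGS = ["Semantic Web", "Semantic Web P2P", "Semantic Web services",
        "Web 2.0", "Jena"]

print(live_search("sem web", TAGS))
```

With this rule, typing "sem web" keeps the three tags that contain both a word beginning with "sem" and a word beginning with "web", and discards "Web 2.0" and "Jena".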

Creating and editing a tag

Suppose, in the previous example, that you want to add a new tag with label "Semantic Web P2P" to the document. Just type that text in the form area and click add. The new tag is added to the list of tags of the document. You can click it to get to the new tag's page which, in editing mode, displays forms to add (or remove) parents, children and aliases, and to define some other properties. Again, the live search proves useful in selecting parents (or children) to be added to the new tag.

Searching for documents: the graph structure of the set of tags is used.

In this example (where display mode is set to "display images"), searching for the tags "Archéologie" (archeology) and "Nigeria" returns documents that carry neither of these tags directly: they are marked only with descendants of them.

Technical points

Semanlink is a very simple application. In the following, I'll briefly describe some key points of its design, and explain how I came up with them.

Semanlink runs as a servlet, using JSP for the GUI. A set of Java packages models Semanlink's main concepts (graph of tags, set of documents). Another layer provides the actual implementation, which is based on Jena and uses memory models that are loaded from files at startup.

Why RDF?

As previously seen, the main goal of Semanlink is to provide improved access to files and bookmarks by combining tagging and incremental definition of a graph of tags. From there on, the choice of RDF seems natural, as it is precisely about handling metadata. SKOS was not available when this work began, but it was obvious that RDF could be used to define thesauri. Furthermore:

This qualifies RDF, at the very least, to be the data model. But there is more. A relational database such as MySQL could be used to store the information. A file based RDF store has been preferred, for several reasons:

To conclude on this topic, RDF stored in files provides a truly portable database that is easy to set up, that does not need to be centralized, and that provides very fast response times when loaded into memory. These are truly key advantages over relational databases.

The graph of tags

The structure of the graph of tags is very simple. It consists of a single "hasParent" property, whose domain and range are tags. The inverse "hasChild" property is not used: when adding a child to a tag, you just add a parent to the child tag. This way, when recursively traversing the tree whose root is a given tag, you only have to list the values of one property on each node, instead of two. This improves response time.
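The single-property design can be sketched with plain triples. This is an illustration only, written in Python rather than with Jena: the way children are looked up here (an index built once from the "hasParent" statements) is an assumption, and the function names are hypothetical. The tag names come from the "Jena" screen shot.

```python
from collections import defaultdict

HAS_PARENT = "hasParent"

# Statements are (subject, property, object); only hasParent is ever stored.
STATEMENTS = [
    ("Jena", HAS_PARENT, "RDF"),
    ("Jena", HAS_PARENT, "Java dev"),
    ("Jena rules", HAS_PARENT, "Jena"),
    ("semblog", HAS_PARENT, "Jena"),
]


def add_child(statements, parent, child):
    """Adding a child to a tag is recorded as adding a parent to the child."""
    statements.append((child, HAS_PARENT, parent))


def index_children(statements):
    """Build, once, a parent -> children index from the hasParent statements."""
    children = defaultdict(list)
    for subject, prop, obj in statements:
        if prop == HAS_PARENT:
            children[obj].append(subject)
    return children


def descendants(children, root):
    """List every tag below root, visiting each tag only once."""
    result, seen, stack = [], {root}, [root]
    while stack:
        tag = stack.pop()
        for child in children.get(tag, ()):
            if child not in seen:
                seen.add(child)
                result.append(child)
                stack.append(child)
    return result


statements = list(STATEMENTS)
add_child(statements, "Jena", "Jena : introduction")
print(descendants(index_children(statements), "RDF"))
```

Because a tag can have several parents, the traversal keeps track of the tags already seen, so a tag reachable through two paths is listed only once.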

This pattern - converting statements to forms that require less work at search time - has been used as often as possible. Aliases, for instance, are handled that way: when stating that an existing tag is to be considered as an alias of another one (cf. owl:sameAs), all statements involving the alias are converted and saved to disk. You may note that this is not completely sufficient, as we are not using a centralized database and some RDF files may not be available for modification (for instance because they are on an unmounted disk, or on a CD). This is not handled by the application at this time, but it should not be difficult to correct: statements involving aliases just need to be converted at load time. Without any doubt, this can be handled efficiently by Jena's advanced features - the important point being that this (light) kind of inference must be done once and for all, at load time (and not at search time).
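Converting aliases at load time, as suggested above, could be sketched as follows. This is a hypothetical illustration in Python, not existing Semanlink code and not a Jena feature: the "tag" property name, the "Networking" tag and the document name are assumptions, and every subject or object that happens to be a known alias is replaced.

```python
def canonical(tag, alias_of):
    """Follow alias links until a tag that is not itself an alias is reached."""
    seen = set()
    while tag in alias_of and tag not in seen:  # 'seen' guards against cycles
        seen.add(tag)
        tag = alias_of[tag]
    return tag


def rewrite_aliases(statements, alias_of):
    """Return the statements with every aliased subject or object replaced."""
    return [(canonical(s, alias_of), p, canonical(o, alias_of))
            for s, p, o in statements]


# Hypothetical data: "P2P" has been declared an alias of "Peer to peer".
ALIAS_OF = {"P2P": "Peer to peer"}
LOADED = [
    ("doc1.html", "tag", "P2P"),
    ("P2P", "hasParent", "Networking"),
]

for statement in rewrite_aliases(LOADED, ALIAS_OF):
    print(statement)
```

Once the statements have been rewritten this way, a search never has to consider aliases: it only ever sees canonical tags, which is the point of doing the conversion once at load time.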

Integrating Semanlink into an application

If an application is built with Java, the easiest way is probably to use the Semanlink Java API and/or the JSP. Access to Semanlink as a service still needs to be improved. An interesting option, if Semanlink and the application are running on the same host, is to use Ajax scripts that dynamically insert Semanlink data (with GUI) into a page. This can be used, for instance, to insert into a static HTML page the metadata that describes it.

Perspectives of improvements

The only limit to improvements and future developments is the time I have. In the short term, here is what I am planning to do:

Conclusion

I have been using Semanlink for years now, improving it over time, and it has become an essential tool for me, one that I use every day. Actually, it's a way to master the information I am concerned with. Finding interesting information takes time and work - you use search engines, you read RSS feeds and newspaper articles... When finding something important to you, you don't want to lose it. That's what bookmarks are for, but better tools are necessary to handle them efficiently. I am very pleased to be able to get back, from my hard disk, a newspaper article that I read years ago, in a matter of seconds.

One more word about using RDF and storing metadata in text files. I had made an implementation of Semanlink fifteen years ago, in a pre-web context (on a Macintosh running HyperCard). But the information I entered at that time has been lost: I no longer have any way to read it. By using a standard and open format such as RDF (and, furthermore, by not relying on extra programs such as a database), I can be confident that I won't lose the metadata that I produce again. What a relief!