Open citation data is coming. It’s a matter of when, not if
‘In the beginning was the link…’ – Most of us know what a citation is: a relationship between two publications. But what is open citation data? Unsurprisingly, it’s citation data that’s open, free to use, re-usable…and useful in ways you probably haven’t thought about yet.
Over the years citations have become the key currency of academic reputation, helping to measure the degree of influence any one scholar’s works have had on the academic community. At the most basic level, there are two important aspects of citations associated with any one paper: who is cited in it and who it’s cited by. The first is easy to establish; the information should be there in the document. However, a crystal ball is needed to know who will do the citing. Those links are yet to be made, so some form of citation data storage, coupled with regular analysis to ferret them out, will be required: something like the lifecycle below.
Citation data lifecycle [From the soon to be published Jisc report Access to Citation Data: Cost-benefit and Risk Review and Forward Look]
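The two directions of citation described above can be illustrated with a minimal Python sketch. It assumes citation data has already been extracted into a simple mapping from each paper to the papers it cites; the paper identifiers and the `build_cited_by` helper are hypothetical and are not taken from any of the services mentioned in this post. The ‘cites’ direction is read straight from each document, while the ‘cited by’ direction only appears once the stored data is inverted, and it has to be rebuilt whenever new papers are indexed.

```python
# Hypothetical sample data: each paper lists the papers it cites.
# This is the "who is cited in it" direction, read from the document itself.
references = {
    "paper-A": ["paper-B", "paper-C"],
    "paper-B": ["paper-C"],
    "paper-C": [],
}


def build_cited_by(references):
    """Invert a 'cites' mapping into a 'cited by' mapping."""
    cited_by = {}
    for citing, cited_papers in references.items():
        # Make sure papers that nobody has cited yet still appear.
        cited_by.setdefault(citing, set())
        for cited in cited_papers:
            cited_by.setdefault(cited, set()).add(citing)
    # Sort for stable, readable output.
    return {paper: sorted(citers) for paper, citers in cited_by.items()}


cited_by = build_cited_by(references)
for paper, citers in sorted(cited_by.items()):
    print(paper, "is cited by", citers)
```

Adding a new paper to `references` and running the inversion again is, in miniature, the regular analysis the lifecycle calls for.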
So, who’s doing this indexing and analysing and then supplying the information? In the not too distant past there were only two sources, SciVerse Scopus and Web of Knowledge/Science. Citation information and associated value added services could be provided to you if you were lucky enough to be associated with an institution that had an appropriate subscription. Then, not so long ago, Google started providing citation information for free (through Google Scholar), followed more recently by Microsoft’s Academic Search. Along with CiteSeerx and the Jisc Open Citations Corpus, these six players now make up the core providers of citation information.
So how has open citation data appeared on the scene and why are more publishers now making their citation information available? Firstly, although publishers clearly see value in their citation data, many now recognise that the improvement in discoverability of publications outweighs the loss of subscription revenue (probably). Secondly, the increase in open publication means that much of this information becomes open by default.
Is it really open and is it useful?
Herein lies the rub. Where the data is available, it’s often only provided for tightly controlled use cases, or through a web interface that returns results rather than access to the underlying data: fine if your use case is supported, but not so good if you’re trying to achieve something a little different. What’s more, if you should get hold of raw data from one or more sources, the chances are that it will be either out of date (access may have been provided to a downloadable snapshot from a database) or in a proprietary format that makes it difficult to use with information from other sources. That matters because many use cases for citation data exploitation require extensive, if not complete, coverage, which implies multiple sources.
So could linked open data provide a way forward? Potentially yes. The Jisc Open Citations Corpus is testing the waters for this type of data exposure, providing access to approximately 40 million citations. However, when you consider the relatively small range of sources that went to make up that 40-million-record dataset, and the fact that the data gathering is still a ‘pull’ process as opposed to an automated ‘push’ process, the real scale of the challenge becomes apparent. Once again, we’re faced with a situation where the data is incomplete and not completely up to date. (David Shotton’s blog about the project covers these challenges and makes for entertaining reading.)
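To give a flavour of what a citation looks like as linked open data, here is a minimal Python sketch that writes a single citation as an N-Triples statement. It assumes that publications are identified by DOI-based URIs and that the `cites` property from the CiTO citation ontology is used as the predicate; both are illustrative choices rather than a description of how any particular corpus is built, and the DOIs are made up.

```python
# Assumption: the CiTO "cites" property is used as the predicate.
CITO_CITES = "http://purl.org/spar/cito/cites"


def doi_uri(doi):
    """Turn a bare DOI into a resolvable URI."""
    return "https://doi.org/" + doi


def citation_triple(citing_doi, cited_doi):
    """Express 'citing paper cites cited paper' as one N-Triples line."""
    return f"<{doi_uri(citing_doi)}> <{CITO_CITES}> <{doi_uri(cited_doi)}> ."


# Hypothetical DOIs, for illustration only.
print(citation_triple("10.1234/abc", "10.1234/xyz"))
```

Because every statement is self-describing, triples published by different sources can in principle be pooled without first agreeing on a shared file layout.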
What we really need is some way of automatically interconnecting the citation data from numerous sources as regularly as possible and then exposing it. Sounds familiar? Indexing and aggregating services such as CrossRef are perfectly placed to provide such access.
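As a rough illustration of what interconnecting citation data from several sources involves, the Python sketch below merges citation links from two hypothetical sources. It assumes each source supplies (citing DOI, cited DOI) pairs, and that DOIs can be matched after lower-casing and stripping common prefixes. The source names, the DOIs and the `normalise_doi` and `merge_citations` helpers are all invented for the example; a real aggregator would face far messier data.

```python
def normalise_doi(doi):
    """Reduce a DOI to a bare, lower-case form so that variants match."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_citations(sources):
    """Merge links from many sources, recording who reported each one.

    `sources` maps a source name to a list of (citing, cited) DOI pairs.
    Returns a dict mapping each normalised pair to a sorted list of the
    source names that reported it.
    """
    merged = {}
    for name, links in sources.items():
        for citing, cited in links:
            key = (normalise_doi(citing), normalise_doi(cited))
            merged.setdefault(key, set()).add(name)
    return {key: sorted(names) for key, names in merged.items()}


# Hypothetical sources and DOIs, for illustration only.
sources = {
    "source_one": [("10.1234/ABC", "doi:10.1234/xyz")],
    "source_two": [
        ("https://doi.org/10.1234/abc", "10.1234/XYZ"),
        ("10.1234/def", "10.1234/abc"),
    ],
}

for link, reported_by in sorted(merge_citations(sources).items()):
    print(link, reported_by)
```

Keeping track of which sources reported each link is a deliberate choice here: it makes gaps in coverage visible instead of hiding them in the merged result.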
So what’s holding us back? A recent Jisc workshop I attended considered this very question and came to the conclusion that there is very little holding us back. It’s more a question of will. The technology is there. The data is mostly there (barring a few standardisation problems and errors). It just needs everyone to sign up to the concept and tick the box.
So does all this add up to a rosy future for open citation data? Personally, I see a future where open access through linked open data to a complete corpus of standardised citation data is considered the norm. Where virtual aggregation of such data on the fly is possible and practical. Where new citation data use cases are developed and research can take place with citation data as the subject of the research. A pipe dream? I don’t think so (apart perhaps from the virtual aggregation). It’s all possible now. But does the market (in the UK at least) have the will to make it happen? Now that’s a different question altogether, one that is explored in the soon to be released Jisc Digital Infrastructure Directions report ‘Access to Citation Data: Cost-benefit and Risk Review and Forward Look’ (published on 10 September). So am I an idealistic dreamer or a practical visionary? You tell me.
What can you do with citation data (now and in the future)?
The analysis and exploitation of citation data has come a long way from the simple ‘standing on the shoulders of giants’ concept it started out as. We’ve progressed (if that is the right word) onto using it for performance management (for everyone from the individual researcher through to departments and institutions), and for business intelligence. To work effectively these use cases need current, comprehensive and trustworthy source data. It is also essential that those using them understand the limitations of the tools.
As more citations become openly available, the citations themselves are becoming the subject of research, with investigators examining the inter-relationships between disciplines and generating new knowledge. Making the data open and usable means that hitherto unimagined avenues of exploration can and will appear.