Do blog citations correlate with a higher number of future citations? In many
cases, yes, at least for ResearchBlogging.org (RB). Judit Bar-Ilan, Mike
Thelwall and I already used RB, a science blogging aggregator for posts
citing peer-reviewed research, in our previous article.

Snippet of a blog post aggregated in RB
RB has many advantages (if you read the previous article's post, you can
probably skip this part), the most important being the structured
citation(s) at the end of each post. It has human editors, so we didn't
have to check for spam or pseudo-science blogs. In short, RB gives us
bloggers who care about research and are familiar enough with it to
refer to it in a formal way. Of course, it also has its disadvantages:
it's self-selecting, so we could only gather data from bloggers who
bothered to register with it, and RB is life-science oriented, so the
results aren't necessarily true for other disciplines.

In our previous research we found that RB bloggers are highly educated (32%
earned a PhD) and that most (59%) are part of the academic system in one way
or another. So we knew that many RB bloggers either belong or used to
belong to the academic system, and we wanted to see whether, as a group, they
cover articles that will be better cited in future peer-reviewed
literature than articles from the same journal and year that they didn't
cover.

As a rule, we differentiate between blog mentions and blog citations. Blog
mentions are any sort of reference to scholarly material in blogs, while blog
citations are mentions of scholarly material written in structured
citation styles (e.g., APA, MLA) that appear in blog posts.

Methodology

As I wrote earlier, the idea was to take blog posts that covered articles
from the same year and see whether those articles, as a group, would receive
more citations later on than articles from the same year and
journal that weren't covered. The problem was that RB was
launched around 2008. Since we studied the citations at the beginning of
2013, citations from peer-reviewed journals had not had much time to
accumulate. We knew from previous research (Glänzel &
Schoepflin, 1995) that in the life sciences, to which most of the
journals and articles in the sample belonged, articles reach their citation
peak about three years after publication, counting the
publication year itself (biomedical fields tend to be fast-moving). That gave
us 2009 and 2010 to work with. We downloaded all RB data from 2009-2010
and looked at all the posts from a given year that covered
articles from the same year (e.g., a 2009 post covering a 2009 article).
There were 4013 posts of that kind in 2009 and 6116 in 2010. Next, we
limited the sample to journals with 20 or more articles published
and covered during 2009 and 2010, respectively. The cut-off of 20
articles was a compromise: we wanted to have as many journals
as possible in the sample, but we also wanted the results to be
statistically reliable. The cut-off left 12 journals from 2009 and 19
from 2010. For both years, the most popular journals were PLoS ONE,
PNAS, Science and Nature (not necessarily in that order).
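To make the bookkeeping concrete, here's a minimal sketch of that sampling step in Python with pandas. The file name and the column names (post_year, article_year, journal, doi) are hypothetical, since RB doesn't ship its data as a CSV; the sketch only illustrates the same-year filter and the 20-article threshold.

```python
import pandas as pd

# Hypothetical export of RB post metadata; file and column names are
# illustrative only.
posts = pd.read_csv("rb_posts.csv")

# Keep only posts covering an article from the same year
# (e.g., a 2009 post about a 2009 article).
same_year = posts[posts["post_year"] == posts["article_year"]]

samples = {}
for year in (2009, 2010):
    year_posts = same_year[same_year["post_year"] == year]
    # Count distinct covered articles per journal (an article may be
    # covered by more than one post, so count unique DOIs).
    counts = year_posts.groupby("journal")["doi"].nunique()
    # Keep journals with 20 or more covered articles that year.
    eligible = counts[counts >= 20].index
    samples[year] = year_posts[year_posts["journal"].isin(eligible)]
```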

Tables 1 and 2 show the journals for 2009 and 2010. Three journals (Current
Biology, Journal of the American Chemical Society and Nature
Neuroscience) didn't make the threshold for 2010, and 10 new journals
were added to the old ones.

Table 1: Journals with more than 20 articles published in 2009 and reviewed in 2009 in blog posts aggregated by RB
Table 2: Journals with more than 20 articles published in 2010 and reviewed in 2010 in blog posts aggregated by RB
Medians: for each journal we calculated the median citation count of the
articles covered by bloggers and the median of the articles that
weren't covered. We used medians rather than averages
because citation counts for articles in the same journal tend to be
highly skewed, and averages would have been affected by the extreme
values. For 10 of the 12 journals in 2009, the medians of the covered
groups were higher than those of the non-covered groups. The same was true
for 17 out of 19 journals in 2010.
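As a sketch of that comparison, assuming a hypothetical per-article table with a journal name, a covered flag and a citation count (none of these names come from the article):

```python
import pandas as pd

# Hypothetical per-article table: journal, covered (bool), citations (int).
articles = pd.read_csv("articles_2009.csv")

# Median citation count per journal, split by blog coverage; medians
# rather than means, because citation distributions are highly skewed.
medians = (articles
           .groupby(["journal", "covered"])["citations"]
           .median()
           .unstack())
print(medians)  # one row per journal, one column per coverage group
```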

We tested the differences between the covered and non-covered groups with Mann-Whitney tests. In 2009, 7 out of the 12 journals (58%) had significant differences at p<.05
(the citation window was 2009-2011; there's a mistake in Table 6's
column headline that says 2010-2012; please ignore it, as it won't appear
in the final version). In 2010, 12 out of the 19 journals (68%) had significant differences at p<.05
for the citation window 2010-2012. We also calculated the 2010-2011
citation window for 2009 and the 2011-2012 citation window for 2010 to
see if there was any difference, but the results were very similar (the
data for these citation windows isn't shown in the article).
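The Mann-Whitney test compares the two groups' citation counts by rank, so it doesn't assume normally distributed data. Here is a minimal sketch with SciPy, reusing the hypothetical articles table from the median sketch above:

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Same hypothetical per-article table as in the median sketch above.
articles = pd.read_csv("articles_2009.csv")

# Per journal, compare the citation counts of covered vs. non-covered
# articles; Mann-Whitney works on ranks, so skewed counts are fine.
for journal, group in articles.groupby("journal"):
    covered = group.loc[group["covered"], "citations"]
    uncovered = group.loc[~group["covered"], "citations"]
    u_stat, p_value = mannwhitneyu(covered, uncovered, alternative="two-sided")
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{journal}: U={u_stat:.1f}, p={p_value:.4f} ({verdict})")
```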



Martin: “But why? Why? I mean, why? Why?”



Douglas: “Four excellent questions.”

Cabin Pressure, “Douz”

We believe it's mainly the “wisdom of crowds” in action here. It makes
sense that a large group of people with a scholarly background in a field
can guess which articles are likely to have more impact in that field
more accurately than an editor and 2-3 peer reviewers. Notice the improved
accuracy of the bloggers between 2009 (887 items overall in the journals
studied) and 2010 (1394 items). It's true that the bloggers didn't have a
citation advantage in all journals, but that could have had something to do
with the 20-article threshold. Had we chosen, say, a 50-article
threshold, we would have had 10 journals in 2009 and 2010 combined, of
which only 2 would have had non-significant results.

We also looked into other “whys”. We know that reviews are over-represented
among highly cited articles, so we checked whether reviews were also
over-represented among the articles covered by blogs, compared with their
share of each journal's overall output in the same year. However, reviews
don't seem to be over-represented among the covered articles (though we
couldn't establish statistical significance because of the small number of
reviews), so this speculation fell through.
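For what it's worth, one standard way to test such a difference in proportions when counts are small is Fisher's exact test on a 2x2 contingency table. This is not the analysis from the article, just a hypothetical sketch with made-up counts:

```python
from scipy.stats import fisher_exact

# Made-up counts for one journal-year, for illustration only.
# Rows: covered by blogs / not covered; columns: reviews / other articles.
table = [[3, 47],     # covered: 3 reviews, 47 other articles
         [40, 960]]   # not covered: 40 reviews, 960 other articles
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio={odds_ratio:.2f}, p={p_value:.3f}")
```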

Another “why” we looked at was a possible media-blogs connection. The median
differences for the New England Journal of Medicine (NEJM) between the
blog-covered and non-covered groups were especially high (172 vs. 56 in 2009;
138 vs. 51 in 2010). Since NEJM is an elite journal that has many of
its articles covered in the media, we wanted to see if bloggers tend to
choose NEJM articles that were also reported by the New York Times and
Reuters. The results weren't surprising: 21 out of 26 articles
in 2009 (81%) and 20 out of 38 articles in 2010 (53%) were covered by
Reuters and/or the New York Times. (The numbers of NEJM articles differ
from those in the earlier tables because some articles were covered by
more than one post, some posts covered more than one journal article and
some news articles covered more than one journal article.) The bloggers
were usually not far behind the mainstream media: for most articles, there
was up to a month between the news article and the blog post.
So at least for NEJM there could be a media-blog connection, though we
can't tell what kind of connection. However, most journals aren't covered
by the media as thoroughly as NEJM is, so we can't say
bloggers take their cues from the media.

The main limitations of the study were the time frame (we could only take
posts from 2009 and 2010) and the relatively small number of articles.
Despite these limitations, I think the results are rather promising, and I
would love to repeat the study in the future to see if they hold.

The article doesn't have an official publication date yet, but it will be published in the Journal of the American Society for Information Science and Technology (JASIST) and can, for now, be found on Professor Thelwall's site (PDF).

References

Glänzel, W., & Schoepflin, U. (1995). A bibliometric study on ageing and reception processes of scientific literature. Journal of Information Science, 21(1), 37-53. DOI: 10.1177/016555159502100104

Shema, H., Bar-Ilan, J., & Thelwall, M. (2012). Research blogs and the discussion of scholarly information. PLoS ONE, 7(5). PMID: 22606239

Shema, H., Bar-Ilan, J., & Thelwall, M. (in press). Do blog citations correlate with a higher number of future citations? Research blogs as a potential source of alternative metrics. Journal of the American Society for Information Science and Technology.