FT to open archives to researchers via British Library

@MESandbu: FT digital archives open to academic research | FT Data http://on.ft.com/1KzYrh0

The Financial Times is working with the British Library to open up access to FT Digital Archive for academic research.

Extracts from the archive materials were used to produce the feature about Britain’s 1975 referendum on European Community membership that was published on FT.com today.

The archive consists of scanned images of each of the 903,029 pages comprising all 37,464 print editions of the Financial Times published between 1888 and 2010.

For each page, the archive consists of a high-resolution image file of the scanned page and a large XML file that includes the full text of the page (generated by optical character recognition software) and detailed metadata about the position of each scanned word. The full 123-year dataset is 2.5 terabytes in size.
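As a rough illustration of how a per-page XML file of this kind might be used, the sketch below parses word-level OCR output into records with positions, rebuilds the page text, and selects the words inside a rectangular region of the scan. The announcement does not describe the actual schema, so the element name `word`, the attributes `x`, `y`, `width` and `height`, and the sample page itself are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Made-up sample page. Element and attribute names are assumptions,
# NOT the real FT Digital Archive schema.
SAMPLE_PAGE_XML = """
<page number="1">
  <word x="120" y="80" width="130" height="22">Financial</word>
  <word x="260" y="80" width="80" height="22">Times</word>
  <word x="120" y="110" width="90" height="22">London</word>
  <word x="220" y="110" width="95" height="22">Edition</word>
</page>
"""


@dataclass
class Word:
    """One OCR'd word and its bounding box on the scanned page (pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def parse_page(xml_text: str) -> list[Word]:
    """Read every <word> element (hypothetical name) into a Word record."""
    root = ET.fromstring(xml_text.strip())
    return [
        Word(
            text=(el.text or "").strip(),
            x=int(el.attrib["x"]),
            y=int(el.attrib["y"]),
            width=int(el.attrib["width"]),
            height=int(el.attrib["height"]),
        )
        for el in root.iter("word")
    ]


def full_text(words: list[Word]) -> str:
    """Join the words in document order to recover the page text."""
    return " ".join(w.text for w in words)


def words_in_region(words: list[Word], left: int, top: int,
                    right: int, bottom: int) -> list[Word]:
    """Keep only words whose bounding box lies fully inside the rectangle."""
    return [
        w for w in words
        if w.x >= left and w.y >= top
        and w.x + w.width <= right and w.y + w.height <= bottom
    ]


if __name__ == "__main__":
    words = parse_page(SAMPLE_PAGE_XML)
    print(full_text(words))
    print([w.text for w in words_in_region(words, 0, 0, 400, 105)])
```

Keeping the position of each word alongside the text is what would make layout-aware work possible, such as isolating a headline or a single column, rather than treating the page as one undifferentiated string.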

Beyond researchers in 20th-century economic history, this vast dataset is likely to interest linguists studying a large corpus of specialist news and computer scientists working on techniques for digitising large volumes of printed documents.

FT journalists and developers will be participating in a British Library hackathon on November 16 to explore how these datasets can be used.


From: «McKernan, Luke» <Luke.McKernan @ bl.uk>

Subject: RE: FT Archives

Date: 23 September 2015 17:14:59 BST

To: «b.batiz-lazo@bangor.ac.uk» <b.batiz-lazo @ bangor.ac.uk>

Dear Bernardo,

Thank you for your email about the FT’s newspaper archives. Their announcement is a little misleading, because it implies that the whole of the archive is available now for academic research, which isn’t quite the case. The FT’s newspaper archive has been available under subscription via CengageGale and continues to be so (http://gale.cengage.co.uk/financial-times-historical-archive.aspx), but no longer exclusively so. We have been in discussion with the FT about how to open up the archive to different kinds of research, particularly data-driven research, and to make this freely available. To kick things off, they have made available four years of content, as JPEG images and XML – 1888, 1939, 1966 and 1991 – with a licence agreement that research teams can sign in to, and access to be provided via a BL FTP server.

This arrangement will last until the end of this calendar year, after which we hope to negotiate an extension with the FT, opening up more of the archive. So we are currently in a test phase, and we are keen for researchers to start working with the test data, so that we can feed back the results to the FT and plan ahead appropriately. If you are interested in making use of the test data, let me know and I’ll get you added to the licence agreement. Or you may be interested in the news hackathon that we will be hosting on November 16th, which will feature data from the FT archives and other news collections that we hold here.