BBC Datasets - Exploratory Workshop

Time: 10.30 - 12.30

Venue: Atlas Room, Kilburn Building, Oxford Road, University of Manchester, M13 9PL

Sorry, this event has now ended.

Digital Futures and the Cathie Marsh Institute (CMI) will host a follow-up workshop on June 26th, 10.30-12.30, in the Atlas Room, Kilburn Building. The workshop will be led by George Wright (Head of Internet Research and Future Services, BBC Research & Development, and Visiting Simon Fellow at CMI) and will focus on accessing and working with BBC datasets and on discussing possible collaborations with the BBC’s R&D team.

The datasets are listed below with summary information; more details can be found here. If you are interested in using one of them, please email George directly by the end of this week, 21st June. He will seek to arrange your access to those data for the workshop. Some are more flexible in use than others. In descending order of simplicity: Drama, MGB, PIPs, Genome.

Please feel free to let George know if you have any questions. Also, if you want to attend the workshop and are not sure which datasets might be relevant, please register your interest with Digital Futures by emailing

BBC Datasets





PIPS Programme Metadata

The main dataset of programme information starts in July 2007 and represents a continuous broadcast history from that point. This data includes programme description, transmission details, some cast and crew, genre and format. In addition, there is sporadic programme information from before 2007, which is added when earlier programmes are repeated.


GENOME Programme Metadata

Scanned copies of 4,500 issues of the Radio Times from 1929 to 2009 (PIPs data is used as the record of transmission after 2009). The scanned data has been OCR’d and is available via a web interface here. Mo McRoberts is planning to make this data available via an API later in the year, but there are some issues around redacting data.



Elvis is the BBC's publicity stills and photo library – 1.1m photos, of which 330,000 are green lit (BBC Copyright). Metadata for photos is inconsistent, but when it is good it is fairly rich. For example, well-documented photos of famous people will usually list all the other notable people in the photograph, their position or job (e.g. MP for Bexley Heath) and the location, alongside rights information.

SUBTITLES Programme Metadata

There are two sets of subtitle data. The first is the historic subtitle dataset supplied by Red Bee, which contains a variety of subtitle files from the early 1980s onwards. The second is the Redux/Snippets subtitle dataset, which contains subtitles for all BBC broadcasts from July 2007 onwards. The API to it can be found here.

PROTEUS Programme Metadata

Proteus is a programme metadata repository that was originally designed as a commissioning and reporting tool for Radio 4. There are entries for 1.3 million transmissions that use the PIPS episode model for identification. It has very good metadata for Radio 2, 3, 4 and 6 Music, and thinner data for Radio 1, 1Extra, 5 Live, 5 Live Sports Extra, 4 Extra and Asian Network.

JUPITER Video

Jupiter is the video server and content editing system for BBC News. It contains tens of thousands of daily-changing videos from news feeds and correspondents around the world. The hardware is provided by Quantel (self-supported by News). The software (Colledia) and its interface development, support and maintenance were transferred by Atos to the BBC, although Atos still owns the core software.

INFAX Programme Metadata

Infax is I&A's longest running programme information store. It contains details of programmes going back to 1922, but the data from the early days is very patchy and is often given incorrect dates.

PasCs Production metadata

Steve Daly has a database of thousands of scripts and "programme as completed" forms from various programmes from 1980 to 2000. They have been scanned in as TIFFs, but no OCR or any other form of extraction has been performed on them yet.

P4A Production metadata

Acquired footage within programmes, music use within programmes, other production data.

DRAMA SCRIPTS Programme metadata

A large number of Post Production Scripts in Word or PDF format, representing a large chunk of the BBC drama output from 2007-2014. These contain full dialogue, character names, scene description and some timing and music data.


A large number of .flv viewing copy files of BBC Newsreel programmes from 1948-1959, given to us by the Rewind Project.

Radio Permanent Archive Collection Audio

Thousands of radio programmes permanently archived and available to anyone on the open web via the Radio 4 permanent archive collection. The two biggest programme collections are Desert Island Discs and In Our Time, but there are hundreds of factual and news strands featured.

World Service Archive Data Metadata

Machine-generated and user-generated tags for the c.50,000 programmes processed via the World Service Archive project. Programme descriptions (original and user-edited), genres, tx dates.

Home Front Assets Audio plus metadata

All episodes, scripts, scene description, storyline description, character description and associated story structure metadata.

Twitter Data Dump Twitter metadata

A full archive of tweets from the Twitter firehose covering a time period of 3 months during 2010. The firehose data includes *all* tweets posted on Twitter, not just the filtered subset you normally get through their APIs.

To register for this event, please email