2020 Annual Digital Lecture: Staff research poster exhibition

Welcome to The National Archives’ Annual Digital Lecture poster exhibition.

Exploration of digital ideas is an important part of our archival work. The Annual Digital Lecture offers the opportunity to hear from a leading speaker on a topic related to digital research. This poster exhibition showcases the innovative digital work taking place at The National Archives.

In previous years, the event and poster display have taken place in person, but this year, we’ve moved things online. As you scroll down, you’ll see each project title, followed by a short video describing the project, and an online poster display.

Enjoy the exhibition, and please get in touch with us with any questions or comments by emailing research@nationalarchives.gov.uk.

Please use this menu to navigate directly to a poster:
Explore and Discover: Applying AI vision search technologies to our collections – Lora Angelova and Lucia Pereira-Pardo (The National Archives)

Maps, merchants and machines: AI for revealing maritime trade and mapmaking practice – Lucia Pereira-Pardo (The National Archives)

Explainable AI: Explaining AI in an archival context – Mark Bell, Jo Pugh, Jenny Bunn (The National Archives) and Leontien Talboom (University College London and The National Archives)

Computing Cholera? ‘Distant Reading’ General Board of Health catalogue data – Chris Day (The National Archives)

Computational Archival Science (CAS): An international research collaboration network – Eirini Goudarouli (The National Archives), Mark Hedges (King’s College London), Richard Marciano (University of Maryland), David Beavan (The Alan Turing Institute)

Network analysis of visual collections: Entry form records of the Copyright Office, 1837-1912 – Dr Katherine Howells (The National Archives)

DiAGRAM: Digital Archiving Graphical Risk Assessment Model – Alex Green, Hannah Merwood, David Underdown, Alec Mulinder and Sonia Ranade (The National Archives)

Social Media Archive: Expanding the collection and improving the service – Tom Storrar and Claire Newing (The National Archives)

Project ALPHA: The future of The National Archives on the web – Jenifer Klepfer and Simon Wilkes (The National Archives)

Preserving Google Docs: Enabling transfer and preservation of cloud native formats – Paul Young (The National Archives)

Lora Angelova and Lucia Pereira-Pardo (The National Archives)

Deep Discoveries: Computer vision searching across our national digitised image collections

Can AI recognise a herbarium rose specimen from the Royal Botanic Garden collections and find the same flower on a stylised pattern print in The National Archives’ design registers? Can a trained algorithm match an artist’s sketch of a flower with a similar image on a 3D ceramic vase?

Deep Discoveries is one of eight foundation projects in the Arts and Humanities Research Council programme ‘Towards a National Collection: Opening UK Heritage to the World’, which aims to establish and develop the technologies and infrastructures necessary to virtually link our nation’s heritage collections.

In collaboration with the Royal Botanic Garden Edinburgh, V&A, and University of Surrey, we will explore the potential of computer vision search to make our image collections discoverable in new and unexpected ways.

A rose is a rose is a rose? Can a computer vision algorithm recognise the same flower across a range of styles and materials?

Searching for watermarks

Watermarks help us trace the origin of early paper production and are important in provenance studies. However, recording them poses challenges, as they are faint features and often obscured by the ink on the paper.

By combining reflected and transmitted light images, we can reduce the interference of text with watermark detection in historic papers from our collection.

The images can then be quickly processed and made searchable and discoverable using open access computer vision search tools, such as those developed by the Visual Geometry Group (VGG) at the University of Oxford (VISE) or the École des chartes in Paris (Filigranes pour tous).

Image post-processing is used to enhance watermarks and make them more legible in historic paper documents.
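
The exact processing applied to these images is not described here. As a purely illustrative sketch, one simple way to combine the two captures is to divide the transmitted-light image by the reflected-light image, pixel by pixel. The assumption is that ink darkens both captures and so largely cancels in the ratio, while variations in paper thickness (the watermark) show up mainly in transmitted light. The function name `suppress_ink` and the tiny pixel grids are hypothetical.

```python
def suppress_ink(transmitted, reflected, eps=1e-6):
    """Return the per-pixel ratio transmitted / reflected.

    Both inputs are equally sized 2D lists of intensities in [0, 1].
    This is an assumed, simplified approach, not the project's pipeline.
    """
    return [
        [t / max(r, eps) for t, r in zip(t_row, r_row)]
        for t_row, r_row in zip(transmitted, reflected)
    ]


# Hypothetical pixels: the middle column is covered by ink (dark in both
# captures); the bottom row lies on a watermark line (thinner paper, so
# brighter in transmitted light only).
reflected = [[0.8, 0.2, 0.8],
             [0.8, 0.2, 0.8]]
transmitted = [[0.4, 0.1, 0.4],
               [0.6, 0.15, 0.6]]

ratio = suppress_ink(transmitted, reflected)
```

In this toy example the inked column becomes indistinguishable from its neighbours in the ratio image, while the watermark row stays brighter than the plain paper row.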

 

Query

Computer vision can then be used to search the digital database of watermarks and find the best matches across the collection.

Maps, merchants and machines: AI for revealing maritime trade and mapmaking practice

Lucia Pereira-Pardo (The National Archives)

Automated recognition of merchants’ marks in bills of lading

The Prize Papers collection comprises the documents seized by the British Royal Navy from captured enemy ships during the wars at sea of the 16th to 19th centuries. It is one of the most important archives for maritime history and includes sacks of letters, the captain’s notebooks, the ship’s cargo manifests and the bills of lading of the goods transported on commercial voyages around the globe.

The manifests and bills show the marks used by merchants to identify their consignments. Being able to track these marks across the thousands of bills of lading in the collection would provide an unprecedented insight into global trade in the modern era. To perform such a huge task we are developing computing tools to automatically locate, annotate, compare and match the merchants’ marks in the Prize Papers.

This project is undertaken in collaboration with the University of Oldenburg and the Bioengineering department at Imperial College London.

Example of annotations of the merchants’ marks in the English translation of the bills of lading for the Royal Court of Admiralty.

Sack of unopened letters from the Prize Papers collection (image from the Prize Papers website)

AI for pigment identification in historical maps

The National Archives houses a large collection of historic maps, some of them hand-drawn and incredibly colourful. Imaging techniques, such as X-ray fluorescence (XRF) scanning or multispectral imaging (MSI), are useful for investigating the materials used by the mapmakers. Algorithms will be used to analyse the large datasets produced by these techniques and determine the inks, pigments and dyes present in the maps, shedding light on their production context, the trade in the materials, and possible influences between mapmakers and across media.

The AI for Digilab project is funded by the Arts and Humanities Research Council, and the maps case study is undertaken in collaboration with ISAAC at Nottingham Trent University and the Osher Map Library at the University of Southern Maine.
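
The algorithms to be used for pigment identification are not specified here. As a hedged illustration of how imaging data might be compared against pigment references, the sketch below uses spectral angle matching, a common technique for comparing spectra that is insensitive to overall brightness. The reference names, band count and reflectance values are all hypothetical.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1]
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)


def closest_reference(pixel, references):
    """Name of the reference spectrum with the smallest angle to the pixel."""
    return min(references, key=lambda name: spectral_angle(pixel, references[name]))


# Hypothetical reflectance values in four spectral bands
references = {
    "reference_a": [0.10, 0.15, 0.60, 0.70],
    "reference_b": [0.50, 0.40, 0.10, 0.05],
    "reference_c": [0.30, 0.30, 0.30, 0.30],
}
pixel = [0.08, 0.12, 0.50, 0.55]  # darker overall, but shaped like reference_a

best = closest_reference(pixel, references)
```

Because the angle depends only on the shape of the spectrum, a pigment seen in shadow or under weaker illumination can still match its reference.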

Multispectral images of a map of Ulster by Richard Bartlett (1603)

Pigment reference

UVL test samples

Multispectral images of a map of Ulster by Richard Bartlett (1603) compared with pigment references.

Explainable AI: Explaining AI in an archival context

Mark Bell, Jo Pugh, Jenny Bunn (The National Archives) and Leontien Talboom (University College London and The National Archives)

 

HeXAI Workshop

In July 2019, University College London and The National Archives ran a workshop entitled Human-centred eXplainable AI (HeXAI). The motivation was that explainable AI research is being driven by technologists. The focus of the workshop was very much on the process of engagement, both with the topic and with other participants, who ranged from expert to novice. We began by explaining AI itself, not just as technology but as a wide-ranging academic field with a rich history.

Two group discussions followed, focused firstly on defining the problem, and then setting out a research agenda.

The main outputs from the discussions were:

  • We need to change the metaphor: From Black Box to Tip of the Iceberg
  • We should shift from XAI to (Why) ‘Y’-AI
  • We are all designers of explanations

Learn more about this project.

Coffee, cards and decision boundaries

 

Image of a black box and an iceberg with the text From Black Box to Tip of the Iceberg: Creative Engagement with the Emergence of XAI (Explainable Artificial Intelligence)

Changing the metaphor

What next?

The National Archives started a series of four internal workshops on Machine Learning (ML). The sessions are primarily non-technical with a focus on building an intuition for ML, working from the perspective of the data creator or expert who will one day be ‘training’ AI systems.

Having built this intuition over three workshops, the final session will ask the participants (archivists, policy makers, Freedom of Information experts, user interface designers, and technologists) how they would archive, contextualise, and explain the ML algorithms they have worked with.

Hear more about the ML workshops on an episode of the Information and Records Management Society podcast.

Computing Cholera? ‘Distant Reading’ General Board of Health catalogue data

Chris Day (The National Archives)

 

Catalogue as data

Catalogue metadata is often seen as simply a tool to enhance the ‘human findability’ of archival material, but detailed item descriptions offer us vast corpora of machine-readable data to analyse.

MH 13 contains the records of the mid-19th century public health body the General Board of Health (1848-1871), comprising c. 89,000 items of correspondence, individually described.

We can ‘read’ this vast corpus of text at a distance, using machine learning to group texts by topic and get a picture of trends and subjects across the collection.

Anti-General Board of Health poster, 1851. Catalogue reference: MH 13/81/161

Topic modelling

A test corpus of the 1,967 descriptions dated 1848 was topic modelled. An algorithm called Latent Dirichlet Allocation was used, in which the machine uses Bayesian probability to discover the latent or underlying topics of texts across a corpus and sorts them into a user-defined number of groups (in this case five). The Python library pyLDAvis was then used to visualise each of these topics. The results show clear groupings of the Board’s correspondence in its first year and further analysis will be carried out across the entire collection.
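
To make the idea concrete, the sketch below fits Latent Dirichlet Allocation with collapsed Gibbs sampling, one standard inference method for the model. It is a from-scratch illustration and not necessarily the implementation used in this project. The function name `lda_gibbs`, the hyperparameters `alpha`, `beta` and `n_iter`, and the four short descriptions are all assumptions made for the example, which uses two topics rather than five to suit the tiny corpus.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of token lists. Returns (doc_topic, topic_word, vocab);
    every row of doc_topic and topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    ndk = [[0] * n_topics for _ in docs]            # topic counts per document
    nkw = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    nk = [0] * n_topics                             # total tokens per topic
    z = []                                          # topic of every token

    # Start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], z[d][i]
                # Remove this token's current assignment from the counts
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                # Unnormalised conditional probability of each topic
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta) / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in ndk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(c + beta) / (nk[k] + n_words * beta) for c in nkw[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word, vocab


# Invented descriptions, standing in for catalogue entries
descriptions = [
    "cholera outbreak report medical officer",
    "cholera cases report medical inspector",
    "sewer drainage plan surveyor",
    "drainage sewer works plan engineer",
]
docs = [d.split() for d in descriptions]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)

# Show the three most probable words for each topic
for k, row in enumerate(topic_word):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(k, [w for _, w in top])
```

Each row of `doc_topic` gives the estimated mix of topics in one description, and each row of `topic_word` gives the word probabilities that characterise one topic. A visualisation tool such as pyLDAvis works from distributions of this kind.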

Computational Archival Science (CAS): An international research collaboration network

Eirini Goudarouli (The National Archives), Mark Hedges (King’s College London), Richard Marciano (University of Maryland), David Beavan (The Alan Turing Institute)

 

The National Archives has partnered with King’s College London, the Digital Curation Innovation Center at the University of Maryland iSchool, and the Maryland State Archives in the US to explore the application of computational methods and tools to large-scale digital heritage collections.

The partnership received the International Research Networking grant for UK-US Collaboration in Digital Scholarship in Cultural Institutions from the Arts and Humanities Research Council (02/2019 – 02/2020).

The CAS research network mainly seeks to:

  • Define more clearly the concept of context and contextualisation within archives.
  • Investigate how the various forms of contextualisation relate to computational methods, and explore the relevance of these methods for archival practice.
  • Develop methodologies for working with computational approaches in practical contexts within interdisciplinary teams.

Alan Turing Institute workshop

Image of CAS network group

CAS team

The CAS network’s next goals are to:

  • Explore the opportunities and challenges of ‘disruptive technologies’.
  • Pursue multidisciplinary collaborations to share relevant knowledge across domains.
  • Leverage the latest technologies to unlock the hidden information in massive stores of records.
  • Train information professionals to think computationally and rapidly adapt new technologies to their everyday work.
  • Promote ethical information access and use.

Find out more at the CAS website and at AI-collaboratory.net.

Network analysis of visual collections: Entry form records of the Copyright Office, 1837-1912

Dr Katherine Howells (The National Archives)

 

How can network analysis be used to investigate 19th-century creative industries?

This project employs network analysis methods to visualise relationships between creators of visual materials and owners of copyright hidden within the catalogue data of the COPY 1 record series.

This is achieved through mining catalogue data and creating network graphs using Gephi network analysis software.
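
As a rough sketch of the data-mining step, the example below turns creator and copyright owner pairs into a weighted edge table with Source, Target and Weight columns, which Gephi's spreadsheet importer recognises for edge lists. The records are invented placeholders, and the real COPY 1 catalogue data would first need parsing into pairs like these.

```python
import csv
import io
from collections import Counter

# Hypothetical (creator, copyright owner) pairs mined from catalogue data
records = [
    ("Creator A", "Owner X"),
    ("Creator A", "Owner X"),
    ("Creator B", "Owner X"),
    ("Creator B", "Owner Y"),
]

# Count repeated pairs so that each edge carries a weight
edge_weights = Counter(records)

# Degree: the number of distinct partners each node is linked to
degree = Counter()
for creator, owner in edge_weights:
    degree[creator] += 1
    degree[owner] += 1

# Build the edge table in memory; write it to a file to import it elsewhere
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
for (creator, owner), weight in sorted(edge_weights.items()):
    writer.writerow([creator, owner, weight])
edges_csv = buffer.getvalue()
```

Nodes with a high degree are candidates for the influential actors the project hopes to identify, although a fuller analysis would use the centrality measures available in the network analysis software itself.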

The project seeks to shed light on the structure of creative industries in the 19th and early 20th centuries, identify influential actors and stimulate further research. It also aims to demonstrate how network analysis can be used to enhance record data and engage academic and other audiences.

Geomapping COPY 1 data using Esri ArcGIS

Network of creators and copyright owners

What next?

This project is in its initial phase of testing to determine the most effective method.

Next steps

  • Explore geomapping to plot creative relationships geographically.
  • Create an interactive online resource to help users find COPY 1 records.
  • Produce a journal article and conference paper.

DiAGRAM: Digital Archiving Graphical Risk Assessment Model

Alex Green, Hannah Merwood, David Underdown, Alec Mulinder and Sonia Ranade (The National Archives)

 

A statistical tool for digital preservation risk management

The National Archives is developing the Digital Archiving Graphical Risk Assessment Model (DiAGRAM) with statisticians from the University of Warwick and partners from other UK archives, to map and quantify the risks involved in digital preservation. This project is supported by the National Lottery Heritage Fund and the Engineering and Physical Sciences Research Council.

DiAGRAM logo

DiAGRAM will be a decision support tool for risk management founded on data and evidence. The underlying methodology will be based on a Bayesian Network – a statistical model that estimates the probability of outcomes by considering conditional events (e.g. storage life depends on media type). Some examples of the risks DiAGRAM will cover include poor metadata, software obsolescence and storage conditions.
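
The conditional reasoning behind a Bayesian Network can be sketched with a two-node example based on the storage case mentioned above. All probabilities here are invented for illustration and are not taken from the DiAGRAM model.

```python
# Hypothetical probabilities, for illustration only
p_media = {"tape": 0.3, "disk": 0.7}                 # P(media type)
p_survives_given_media = {"tape": 0.9, "disk": 0.8}  # P(storage survives | media type)

# Marginalise over the parent node:
# P(survives) = sum over media of P(survives | media) * P(media)
p_survives = sum(p_survives_given_media[m] * p_media[m] for m in p_media)

# Bayes' rule runs the same network backwards: P(media | storage failed)
p_fail = 1 - p_survives
p_media_given_fail = {
    m: (1 - p_survives_given_media[m]) * p_media[m] / p_fail for m in p_media
}
```

A full network chains many such conditional tables together, so that changing one assumption (for example, the mix of media types) updates the estimated probability of every outcome that depends on it.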

Using statistical techniques, we will create a new tool for managing digital preservation risk which will:

  • Improve users’ understanding of the complex digital archiving risk landscape and of the interplay between digital archiving risk factors.
  • Enable archivists to compare and prioritise very different types of threats to the digital archive.
  • Aid in quantifying the impact of risk events and risk management strategies on archival outcomes for use in decision making, communication with stakeholders and developing business cases for targeted action.

The model will be flexible enough to be relevant for all types of digital archiving organisations and users will be able to adapt the model to reflect their own institution’s policies and resources.

The risk model network for the preservation outcomes of renderability and intellectual control

An example of the risk score comparison chart used in the tool’s user interface. The higher the score, the lower the risk.

What next?

The prototype tool is now available. Feedback is welcome.

We are continuing to talk about the model in a variety of virtual settings, including through webinars, remote conferences and blogs. We will also be publishing a paper on this project for a special edition of the Archives and Records journal entitled “Interdisciplinarity and Archives”.

Our long-term ambition is that DiAGRAM will become a product managed by The National Archives alongside existing digital preservation tools such as DROID and PRONOM, and that we will regularly review and update it with new evidence as it becomes available.

Social Media Archive: Expanding the collection and improving the service

Tom Storrar and Claire Newing (The National Archives)

 

A novel service supporting research

Sitting alongside the websites archived in the UK Government Web Archive (UKGWA), our social media archive now includes over 1.2 million posts, from more than a decade of government Twitter, YouTube and Flickr use.

The data, captured through each platform’s Application Programming Interface (API), is presented through new interfaces, and linkages with the UKGWA are made where possible.

The content is now fully searchable. This is a world first, opening it up to researchers like never before.

Users can search by keyword and date, and restrict results by platform and department, making this unique resource usable as a research source.

Try it for yourself.

 

Image of the social media archive

Social media archive search – a world first!

Images of social media snaps from various government websites

A unique view on recent history.

What next?

We will continue to research how to capture and present the full range of platforms government uses: for example, Facebook, LinkedIn, Instagram and GitHub all present fascinating challenges from capture, through access, to presentation.

We plan to integrate search with the main web archive, unifying the use of our web-born collection.

Further ahead still, we believe that this service will provide even more context to future users trying to understand the nature of government in these times.

Project ALPHA: The future of The National Archives on the web

Jenifer Klepfer and Simon Wilkes (The National Archives)

 

Building an archive for everyone

Between January and April 2020 we built and tested, in public, experimental prototypes of a new website for The National Archives.

Consistent with Archives for Everyone, we took a ‘blank sheet of paper’ approach to Project ALPHA, challenging ourselves to look beyond our current technological and cultural limitations, to define what an inclusive, entrepreneurial and disruptive archive might look like.

Collaborating with content and service design experts Digirati – who have a formidable track record of delivering large-scale, user-focused development projects – the aim was to unlock the potential of our content and data, to create an entirely new web experience for our millions of users.

View our prototypes-in-progress at the project website.

Ideation workshop with colleagues from across The National Archives

Image of a data visualisation of The National Archives

A visual way to explore The National Archives’ collection

What next?

Moving into the Beta phase we will be:

  • Building a new cloud-first platform from scratch with the latest technologies
  • Delivering powerful new ways for all audiences to explore our collection
  • Bringing The National Archives’ collection to more users than ever before
  • Testing and refining our ideas openly with the public

Preserving Google Docs: Enabling transfer and preservation of cloud native formats

Paul Young (The National Archives)

 

This research considers how The National Archives can ensure the effective transfer, preservation and presentation of Google Docs and G-Suite files created by UK Government departments. Reliable preservation of cloud native formats is an essential part of The National Archives’ mission to preserve the government record.

There are three main challenges involved in this process:

  • Export. You cannot export a Google Doc in its native form. Google holds a complete revision log of the file, and what you see in your browser is a representation of the final version of this log. To export from Google, the file has to be converted into one of several export formats.
  • Metadata. Google Drive collects a range of metadata, which can be useful for contextual and preservation purposes.
  • Integrity. The National Archives’ current transfer practice relies on generating a checksum of the file to prove integrity and this is not possible with Google Docs.
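
The integrity challenge can be illustrated with a small sketch that contrasts a checksum of the exported bytes with a checksum of the extracted content. The two "exports" below use a made-up format in which an embedded timestamp is the only difference. That is an assumed cause chosen for illustration, not a documented property of Google's exports, and the helper names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def content_of(export: bytes) -> str:
    """Extract the text from this made-up format and normalise whitespace."""
    body = export.decode("utf-8").split("\n", 1)[1]  # drop the metadata line
    return " ".join(body.split())


# Hypothetical exports of the same document made five minutes apart
export_1 = b"exported=2020-01-01T10:00:00\nMinutes of the meeting.\n"
export_2 = b"exported=2020-01-01T10:05:00\nMinutes of the meeting.\n"

# File-level check: fails, because the bytes differ
same_bytes = sha256_hex(export_1) == sha256_hex(export_2)

# Content-level check: passes, because the extracted text is identical
same_content = (
    sha256_hex(content_of(export_1).encode("utf-8"))
    == sha256_hex(content_of(export_2).encode("utf-8"))
)
```

A file-level checksum can only prove that one particular export has not changed since it was made. Showing that two exports represent the same document needs a comparison at the level of content.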

Our investigation into export formats built on work undertaken by Jenny Mitcham in 2017 and uncovered similar results:

  • Formatting of document characteristics (e.g. font, comments, number of pages) was close to the original Google Doc but not exact.
  • Metadata for dates became corrupted on export (extracting date metadata via the API prior to export can mitigate this).
  • Tests of the consistency and reliability of exports showed that, while export checksums changed, the content was consistent.

Top image shows an original Google Doc along with export versions in formats Microsoft Word Document, PDF and OpenDocument Text.

A Google Doc viewed via browser interface

An extract of the Google ‘Kix’ format (Revision log) which is used to render the Google Doc. This can be exported, but it is hard to present!

What next?

  • Formalise a transfer process for Google Doc formats from UK Government departments. Ensure process is adaptable to change.
  • Identify departments to undertake test transfers to The National Archives. This can assist with planned developments for the ‘Transfer Digital Records’ system to include transfer from G-Suite and Office 365.
  • Continue investigation into the integrity of Google Docs over time.
  • Investigate additional capture methods (e.g. web archiving).
  • Explore the potential that Google Docs can offer, as the revision log can show every edit made to a document throughout its lifetime.