Derivative data for Web Archives for Longitudinal Knowledge (WALK)

Description: These are derivative files generated by the Web Archives for Longitudinal Knowledge (WALK) project, which ran between 2016 and 2018. WALK was an interdisciplinary project spearheaded by scholars at York University, the University of Waterloo, and the University of Alberta. The project's goal was to bring together major Canadian web archive holdings and provide researcher access to search indexes and derivative files, including plain text, network diagrams, and domain frequency information. These will be useful to digital humanists who want to work with text at scale or the hyperlink networks of large parts of the archived Web.

Six universities participated: the University of Toronto, University of Alberta, University of Victoria, University of Winnipeg, Dalhousie University, and Simon Fraser University. These files reflect the state of their public web archives between late 2017 and mid-2018.

Each xz file contains the derivative files for a given collection: a GraphML file, which you can load with Gephi (no layout or transformations have been applied to it, so you will need to do this manually); a csv file that gives the distribution of domains within the web archive; and a txt file that contains the plain text extracted from HTML documents within the web archive, along with the crawl date and full URL of each page. An xz file may also contain a GEXF file, which you can likewise load with Gephi. The GEXF file has a basic layout applied by our GraphPass program, allowing you to see major nodes and communities in the network.
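As an illustration of how the domain distribution csv might be used, the following is a minimal Python sketch that computes each domain's share of a collection. The record does not document the csv layout, so the `domain,count` column order, the absence of a header row, the function name `domain_shares`, and the sample data are all assumptions made for illustration; check them against an actual file before relying on this.

```python
import csv
import io
from collections import Counter


def domain_shares(csv_text, top_n=10):
    """Return (domain, count, share) tuples for the most frequent domains.

    Assumes each row is `domain,count`. This layout is a guess, not
    something the dataset record confirms.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        domain, raw_count = row[0].strip(), row[1].strip()
        if not raw_count.isdigit():
            continue  # skip a header row or a non-numeric count
        counts[domain] += int(raw_count)
    total = sum(counts.values())
    # most_common() is empty when nothing was counted, so no division by zero
    return [(d, c, c / total) for d, c in counts.most_common(top_n)]


# Hypothetical sample data standing in for a real domain distribution file.
sample = "example.ca,60\nexample.org,30\nexample.com,10\n"
for domain, count, share in domain_shares(sample):
    print(f"{domain}\t{count}\t{share:.1%}")
```

For a real file, the text could be read with `open(path, newline="").read()` once the csv has been extracted from the xz file.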

This project has evolved into the Archives Unleashed Project. Information on Archives Unleashed and the WALK project can be found at and on our blog at
Authors: Ruest, Nick; York University
Milligan, Ian; University of Waterloo
Lin, Jimmy; University of Waterloo
Deschamps, Ryan; University of Waterloo
Fritz, Samantha; University of Waterloo
Keywords: web archives
Date: 24-Aug-2018
Publisher: Federated Research Data Repository / dépôt fédéré de données de recherche

Access to this dataset is subject to the following terms:
Creative Commons Attribution 4.0 International (CC BY 4.0)
Ruest, N., Milligan, I., Lin, J., Deschamps, R., Fritz, S. (2018). Derivative data for Web Archives for Longitudinal Knowledge (WALK). Federated Research Data Repository.