

Derivative data for Web Archives for Longitudinal Knowledge (WALK)

Description: These are derivative files generated by the Web Archives for Longitudinal Knowledge (WALK) project, which ran between 2016 and 2018. WALK was an interdisciplinary project spearheaded by scholars at York University, the University of Waterloo, and the University of Alberta. The project's goal was to bring together major Canadian web archive holdings and provide researcher access to search indexes and derivative files, including plain text, network diagrams, and domain frequency information. These will be useful to digital humanists who want to work with text at scale or the hyperlink networks of large parts of the archived Web.

Six universities participated: the University of Toronto, University of Alberta, University of Victoria, University of Winnipeg, Dalhousie University, and Simon Fraser University. These files reflect the state of their public web archives from late 2017 to mid-2018.

Each xz file contains the derivative files for a given collection: a GraphML file, which you can load with Gephi (no layout or transformations have been applied, so you will need to do that manually); a csv file that describes the distribution of domains within the web archive; and a txt file that contains the plain text extracted from HTML documents within the web archive. The txt file gives the crawl date, full URL, and plain text of each page. An xz file may also contain a GEXF file, which you can likewise load with Gephi. It has a basic layout applied by our GraphPass program, allowing you to see major nodes and communities in the network.
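As a rough illustration of how the files described above might be sorted out once an archive has been extracted, here is a minimal Python sketch. It is not part of the WALK tooling. The file names in the usage note are hypothetical, and the two-column "domain,count" layout of the csv file is an assumption, so check an actual file before relying on it.

```python
import csv
import io
from pathlib import PurePosixPath

# Extension-to-role mapping, taken from the dataset description.
KINDS = {
    ".graphml": "network, no layout (Gephi)",
    ".gexf": "network, basic layout (Gephi)",
    ".csv": "domain distribution",
    ".txt": "plain text with crawl date and URL",
}


def classify_derivatives(names):
    """Group file names by the role their extension suggests."""
    groups = {}
    for name in names:
        suffix = PurePosixPath(name).suffix.lower()
        kind = KINDS.get(suffix, "other")
        groups.setdefault(kind, []).append(name)
    return groups


def top_domains(csv_text, n=10):
    """Return the n most frequent domains.

    ASSUMPTION: each row is 'domain,count'. Rows that do not fit
    (blank lines, a header row, non-integer counts) are skipped.
    """
    counts = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        try:
            counts.append((row[0].strip(), int(row[1])))
        except ValueError:
            continue
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]
```

For example, `classify_derivatives(["collection.graphml", "collection-domains.csv"])` groups the two hypothetical names under their respective roles, and `top_domains` can be given the text of the csv file to list the most heavily represented domains in a collection.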

This project has evolved into the Archives Unleashed Project. Information on Archives Unleashed and the WALK project can be found at https://archivesunleashed.org and on our blog at https://news.archivesunleashed.org.
Authors: Ruest, Nick; York University; ORCID iD 0000-0003-1891-1112
Milligan, Ian; University of Waterloo; ORCID iD 0000-0002-1470-7723
Lin, Jimmy; University of Waterloo
Deschamps, Ryan; University of Waterloo
Fritz, Samantha; University of Waterloo
Keywords: web archives
Field of Research: Computer and information sciences > Library science and information studies > Archival, repository and related studies
Publication Date: 2018-08-24
Publisher: Federated Research Data Repository / dépôt fédéré de données de recherche
Funder: Social Sciences and Humanities Research Council of Canada (SSHRC)
Andrew W. Mellon Foundation (AWMF)
Compute Canada
URI: https://doi.org/10.20383/101.036

Download the entire dataset using Globus Transfer. This method requires a Globus account and installing software.

Access to this dataset is subject to the following terms:
Creative Commons Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Citation
Ruest, N., Milligan, I., Lin, J., Deschamps, R., Fritz, S. (2018). Derivative data for Web Archives for Longitudinal Knowledge (WALK). Federated Research Data Repository. https://doi.org/10.20383/101.036