ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph

Published 25 May 2013 in cs.IR | (1305.5959v2)

Abstract: Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide extensive content and structural metadata about the web graph. While the live web has developed a rich ecosystem of APIs to facilitate web applications (e.g., APIs from Google and Twitter), the web archiving community has not yet broadly implemented this level of access. We present ArcLink, a proof-of-concept system that complements open source Wayback Machine installations by optimizing the construction and storage of, and access to, the temporal web graph. We divide the web graph construction into four stages (filtering, extraction, storage, and access) and explore optimization for each stage. ArcLink extends the current web archive interfaces to return content and structural metadata for each URI. We show how this API can be applied to such applications as retrieving inlinks, outlinks, anchortext, and PageRank.

Citations (6)

Authors (2)
