The internet is a vast treasure trove of knowledge. But it is fleeting and there are no guarantees that the content you like will be there in the future. If you can't afford to lose that content, you can use a web archiving tool to store a copy of the web page.
Many people use read-later services for saving web articles. These apps work best with text-based content and do not handle complicated webpage designs or media properly. Want some more control?
Let's see how you can create a clone of Instapaper or Pocket on your computer without losing any web page assets.
ArchiveBox is an open-source tool that lets you host your own alternative to an archiving service like the Wayback Machine. You don't give up your privacy or get locked into a service you cannot control.
It takes the list of URLs you want to archive and creates a local, browsable clone of the content in multiple formats. For each page, it saves the HTML, a screenshot, a PDF file, and a WARC (Web ARChive) file.
These copies stay with you even if the original webpage disappears in the future.
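Since everything starts from a list of URLs, it can help to tidy that list before importing it. The following is a minimal sketch in Python, not part of ArchiveBox itself: the function name clean_url_list is hypothetical, and it assumes your links are plain text with one URL per line.

```python
from urllib.parse import urlsplit


def clean_url_list(raw_lines):
    """Return unique http(s) URLs in first-seen order, skipping blanks and junk."""
    seen = set()
    urls = []
    for line in raw_lines:
        url = line.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            # Malformed input (for example a broken IPv6 host); skip it.
            continue
        # Keep only web links that have both a scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = [
        "https://example.com/article\n",
        "",
        "not a url",
        "https://example.com/article",
    ]
    for url in clean_url_list(sample):
        print(url)
```

The cleaned list can then be written to a text file and handed to the archiver in whatever way your installed version expects.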
ArchiveBox is written in Python 3. It also relies on dependencies like Wget, Headless Chrome, youtube-dl, and other Unix tools to save the webpage. You don't need a constantly running backend server. Just run it each time you want to import new links and update the static output.
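Before the first run, you may want to confirm that these helper tools are actually installed. Here is a rough sketch using Python's standard shutil.which. The executable names in CANDIDATES are assumptions, since they vary by operating system and package manager, and the helper find_tools is hypothetical rather than anything ArchiveBox ships.

```python
import shutil

# Assumed executable names; adjust them for your platform.
CANDIDATES = {
    "Wget": ["wget"],
    "Headless Chrome": ["chromium-browser", "chromium", "google-chrome"],
    "youtube-dl": ["youtube-dl"],
}


def find_tools(candidates, which=shutil.which):
    """Map each tool to the first matching executable path, or None if absent."""
    found = {}
    for tool, names in candidates.items():
        # Try each possible name in order and keep the first one on PATH.
        found[tool] = next((path for path in map(which, names) if path), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools(CANDIDATES).items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A tool reported as not found may still be installed under a different name, so treat the result as a hint rather than a verdict.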
Once the archiving completes, you can open the generated output/index.html in your browser to view the archive.
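If you prefer not to hunt for the file by hand, a few lines of Python can open it for you. This is only a sketch: it assumes the archive was written to a folder named output in the current directory, as described above, and index_uri is a hypothetical helper name.

```python
import webbrowser
from pathlib import Path


def index_uri(output_dir="output"):
    """Return a file:// URI for index.html inside the given archive folder."""
    index = Path(output_dir).resolve() / "index.html"
    if not index.is_file():
        raise FileNotFoundError(f"No archive index found at {index}")
    return index.as_uri()


if __name__ == "__main__":
    try:
        # Opens the archive index in the system's default browser.
        webbrowser.open(index_uri())
    except FileNotFoundError as err:
        print(err)
```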
Advantages of ArchiveBox
It archives the links in several file formats that work as backups.
It tries to retain the original webpage using sophisticated capturing methods.
It can automatically extract the content and save it to a single folder.
It also provides a simple command-line interface for handling multiple links, feeds, and bookmarks. You only have to set it up once and then run it on a schedule to archive newer links.
Disadvantages of ArchiveBox
ArchiveBox extracts all the assets from the webpage. As a result, it consumes significant disk space and is CPU-intensive.
The app requires three or more dependencies beyond Python 3.5. It can take some trial and error to make these components work together.
The app does not fully support Windows. You have to install Docker or enable Windows Subsystem for Linux (WSL), and even then some features may not work.
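To keep an eye on the disk space concern mentioned above, you could measure how large the archive folder has grown. The snippet below is a simple sketch using only Python's standard library. It assumes the archive lives in a folder named output, and dir_size_bytes is a hypothetical helper.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, ignoring symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    size_mb = dir_size_bytes("output") / (1024 * 1024)
    print(f"Archive size: {size_mb:.1f} MB")
```

Running it before and after importing a batch of links gives a rough idea of how much space each batch costs.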
Supported Operating Systems
ArchiveBox officially supports the following operating systems:
macOS: 10.12 Sierra with Homebrew.
Linux: Ubuntu, Debian (with APT). The app may or may not work on other distros like Fedora, CentOS, SUSE, and Arch.
BSD: FreeBSD, OpenBSD, NetBSD (with pkg).