The Australian Web Archive (AWA) is a publicly available online database of archived Australian websites, hosted by the National Library of Australia (NLA) on its Trove platform, an online library database aggregator. It comprises the NLA's own PANDORA archive, the Australian Government Web Archive (AGWA) and the NLA's ".au" domain collections. Access is through a single public interface in Trove. The Australian Web Archive was created in March 2019, and is one of the biggest web archives in the world. Its purpose is to provide a resource for historians and researchers, now and into the future.
24-636: The PANDORA service started archiving websites in October 1996. In 2005, the NLA started archiving annual snapshots of the entire Australian web domain (URLs with the suffix ".au"), collected via large crawl harvests. Later, the earliest websites from the .au web domain, dating back to 1996, were obtained from the Internet Archive. In 2019 this content was first made publicly accessible through Trove. The PANDORA infrastructure, which works well for
48-473: A secure connection to the website. Internet users are distributed throughout the world using a wide variety of languages and alphabets, and expect to be able to create URLs in their own local alphabets. An Internationalized Resource Identifier (IRI) is a form of URL that includes Unicode characters. All modern browsers support IRIs. The parts of the URL requiring special treatment for different alphabets are
72-489: A selective small-scale archiving, does not adapt to large-scale "bulk harvesting" of web content, so a new technical system had to be developed: a web archiving service that integrates the delivery of archived websites within a live website interface and presents them seamlessly to the user, which is technically difficult to achieve. Australian Government websites are Commonwealth records, and are therefore publications to be managed in accordance with
96-437: Is empty if it has no characters; the scheme component is always non-empty. The authority component consists of subcomponents: This is represented in a syntax diagram as: [REDACTED] The URI comprises: A web browser will usually dereference a URL by performing an HTTP request to the specified host, by default on port number 80. URLs using the https scheme require that requests and responses be made over
120-555: Is a "Limit to the gov.au web domain" option before searching, and government websites archived via AGWA can still be searched separately using the "Advanced Search" option. Other options in Advanced Search are to limit by timespan of the snapshots, domain and file type. With many of the earlier websites from the 1990s now lost, mainly because of the frequent change of web platforms, the Australian Web Archive
144-672: Is a significant initiative that will help to save current and future web pages, especially Australian content. Material will continue to be added to the Archive, and other online material collected in accordance with the National Library Act 1960, the legal deposit provisions of the Copyright Act 1968 and the NLA's digital collections selection policy. Websites in the Asia Pacific region are not included in
168-416: Is a specific type of Uniform Resource Identifier (URI), although many people use the two terms interchangeably. URLs occur most commonly to reference web pages (HTTP/HTTPS) but are also used for file transfer (FTP), email (mailto), database access (JDBC), and many other applications. Most web browsers display the URL of a web page above the page in an address bar. A typical URL could have
192-447: Is fully searchable, based on a combination of techniques used by the developers. Each team created a unique and complex search algorithm, by adapting a version of Google's page ranking algorithm (based on frequency of clicks on a page), modified to lead to better, high-quality resources. Other technologies include a Bayesian filter (effectively a spam filter), a Not Safe For Work classifier from Yahoo, and machine learning. There
216-606: The Archives Act 1983. The Australian Government Web Archive (AGWA) consists of bulk archiving of Commonwealth Government websites. The NLA began regular harvests of the websites in June 2011, after a significant obstacle had been overcome with an administrative agreement made in May 2010 allowing the NLA to collect, preserve and make accessible government websites without having to seek prior permission for each website or document, as
240-459: The Archives Act; however videos and document files (such as PDFs or Word documents) are not always captured, so must be managed separately. As of early 2015, the AGWA included content dating from 2005, which amounted to about 144 million files occupying 15 terabytes. It only included Commonwealth Government websites collected through bulk harvests of nearly 1000 seed URLs. The scheduling of
264-667: The Internet at workplaces or schools that have policies prohibiting access to sexual and graphic subject matter. Conversely, safe for work (SFW) is used for links that do not contain such material, especially where the title might otherwise lead people to think that the content is NSFW. The similar expression not safe for life (NSFL) is also used, referring to content which is so nauseating or disturbing that it might be emotionally scarring to view. Links marked NSFL may contain fetish pornography, gore, or murder. Some websites, such as Reddit and OnlyFans, give users
288-568: The Wayback Machine, hosted by the Internet Archive, allowing full-text searching using a search engine built in-house. The developers also devised techniques to filter out unwanted "noise". The data remains on the Library servers, although a move to the cloud is envisaged in the future, as content grows. Usability by a wide range of users, and in particular the search functionality, were major focuses during development. The archive
312-713: The "Advanced Search" option. A web archive is described by the NLA as a "collection of snapshots of websites captured while they are accessible on the web, and then preserved in a static copy". The collection archived in the AWA is "relevant to the cultural, social, political, research and commercial life and activities of Australia and Australians". It collects web material via both scheduled archiving of selected websites and publications as well as some ad hoc harvesting relating to significant events. As of March 2019, when it began, AWA already contained around 600 terabytes of data, with 9 billion records. It contains more functionality than
336-447: The AWA, but NLA partners with the Internet Archive to collect and preserve "selected Asia/Pacific websites related to specific events or socio-political groups". URL A uniform resource locator (URL), colloquially known as an address on the Web, is a reference to a resource that specifies its location on a computer network and a mechanism for retrieving it. A URL
360-523: The HTML Specification referred to "Universal" Resource Locators. This was dropped some time between June 1994 (RFC 1630) and October 1994 (draft-ietf-uri-url-08.txt). In his book Weaving the Web, Berners-Lee emphasizes his preference for the original inclusion of "universal" in the expansion rather than the word "uniform", to which it was later changed, and he gives a brief account of
384-478: The IETF Living Documents birds of a feather session in 1992. The format combines the pre-existing system of domain names (created in 1985) with file path syntax, where slashes are used to separate directory and filenames. Conventions already existed where server names could be prefixed to complete file paths, preceded by a double slash (//). Berners-Lee later expressed regret at
408-455: The contention that led to the change. Every HTTP URL conforms to the syntax of a generic URI. The URI generic syntax consists of five components organized hierarchically in order of decreasing significance from left to right: A component is undefined if it has an associated delimiter and the delimiter does not appear in the URI; the scheme and path components are always defined. A component
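The five-part generic syntax can be illustrated with Python's standard `urllib.parse` module. This is only a minimal sketch: the URL below is a hypothetical example chosen to exercise every component, and the component names (scheme, authority, path, query, fragment) follow the generic URI syntax rather than anything listed in this text.

```python
from urllib.parse import urlsplit

# Hypothetical URL (an assumption, not taken from the text) that contains
# all five generic URI components.
example = "https://www.example.com:8042/over/there?name=ferret#nose"

parts = urlsplit(example)

# Map the generic component names onto the fields urlsplit() returns.
components = {
    "scheme": parts.scheme,      # text before the first ":"
    "authority": parts.netloc,   # text after "//" up to the next "/"
    "path": parts.path,
    "query": parts.query,        # text after "?"
    "fragment": parts.fragment,  # text after "#"
}

for name, value in components.items():
    print(f"{name:<9} {value}")
```

A component whose delimiter is absent comes back from `urlsplit` as an empty string, so this helper does not distinguish "undefined" from "empty" in the sense described above.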
432-665: The domain name and path. The domain name in the IRI is known as an Internationalized Domain Name (IDN). Web and Internet software automatically convert the domain name into punycode usable by the Domain Name System; for example, the Chinese URL http://例子.卷筒纸 becomes http://xn--fsqu00a.xn--3lr804guic/. The xn-- indicates that the character was not originally ASCII. The URL path name can also be specified by
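The domain-name conversion can be sketched with Python's built-in `idna` codec. This is an illustration under assumptions: the codec implements the older IDNA 2003 rules, which browsers may not follow exactly, and the variable names are hypothetical.

```python
# Domain name taken from the Chinese example URL in the text.
domain = "例子.卷筒纸"

# Encode each label to its ASCII-compatible (punycode) form.
ascii_domain = domain.encode("idna").decode("ascii")
labels = ascii_domain.split(".")

# Decoding reverses the conversion and recovers the original name.
round_trip = ascii_domain.encode("ascii").decode("idna")

print(ascii_domain)
print(round_trip)
```

Each converted label carries the `xn--` prefix, which is how resolvers and browsers recognise that the label was not originally ASCII.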
456-601: The form http://www.example.com/index.html, which indicates a protocol (http), a hostname (www.example.com), and a file name (index.html). Uniform Resource Locators were defined in RFC 1738 in 1994 by Tim Berners-Lee, the inventor of the World Wide Web, and the URI working group of the Internet Engineering Task Force (IETF), as an outcome of collaboration started at
480-608: The harvests was not yet routinely established, but harvests were being conducted roughly three times per year. In 2017, the AGWA and the PANDORA archive were amalgamated with the other web archive collections, to form the Trove web archive collection. After further development and the creation of the Australian Web Archive, government websites archived via AGWA and now included in AWA can still be searched separately using
504-615: The protocol of the current page, typically HTTP or HTTPS. Not Safe For Work Not safe for work (NSFW) is Internet slang or shorthand used to mark links to content, videos, or website pages the viewer may not wish to be seen viewing in a public, formal, or controlled environment. The marked content may contain graphic violence, pornography, profanity, nudity, slurs, or other potentially disturbing subject matter. Environments that may be problematic include workplaces, schools, and family settings. NSFW has particular relevance for people trying to make personal use of
528-451: The use of dots to separate the parts of the domain name within URIs, wishing he had used slashes throughout, and also said that, given the colon following the first component of a URI, the two slashes before the domain name were unnecessary. Early WorldWideWeb collaborators including Berners-Lee originally proposed the use of UDIs: Universal Document Identifiers. An early (1993) draft of
552-626: The user in the local writing system. If not already encoded, it is converted to UTF-8, and any characters not part of the basic URL character set are escaped as hexadecimal using percent-encoding; for example, the Japanese URL http://example.com/引き割り.html becomes http://example.com/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html. The target computer decodes the address and displays the page. Protocol-relative links (PRL), also known as protocol-relative URLs (PRURL), are URLs that have no protocol specified. For example, //example.com will use
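Both behaviours described in this passage can be sketched with Python's standard `urllib.parse` module. The path is the Japanese example from the text. The base page `https://example.org/index.html` is a hypothetical assumption standing in for "the current page".

```python
from urllib.parse import quote, unquote, urljoin

# Percent-encoding: the path is converted to UTF-8 and every byte outside
# the basic URL character set is escaped as %XX. "/" is kept by default.
path = "/引き割り.html"
encoded = quote(path)
decoded = unquote(encoded)  # what the receiving side does to recover the path

print("http://example.com" + encoded)

# Protocol-relative reference: resolved against a (hypothetical) current
# page, it inherits that page's scheme.
current_page = "https://example.org/index.html"
resolved = urljoin(current_page, "//example.com")

print(resolved)
```

Resolving the same `//example.com` reference against an `http://` page would yield an `http://` URL instead, which is the point of omitting the protocol.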
576-543: Was the case before that. The service uses the Heritrix web crawler for harvesting, WARC files for storage and Open Wayback for delivery of the service. The government publishes a huge amount of material, but preserving it poses many challenges, such as content suddenly disappearing. In March 2014, the AGWA was made publicly accessible. The AGWA meets the preservation and retention requirements for websites as "retain as national archives" (RNA) material under