
robots.txt

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.


Robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit. The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard

A pathname to be the character string that must be entered into a file system by a user in order to identify a file. On early personal computers using the CP/M operating system, filenames were always 11 characters. This was referred to as the 8.3 filename, with a maximum of an 8-byte name and a maximum of a 3-byte extension. Utilities and applications allowed users to specify filenames without trailing spaces and include

A 500 kibibyte file size restriction for robots.txt files.

Filename

A filename or file name is a name used to uniquely identify a computer file in a file system. Different file systems impose different restrictions on filename lengths. A filename may (depending on the file system) include: The components required to identify a file by utilities and applications varies across operating systems, as does

A Fortran compiler might use the extension FOR for the source input file, OBJ for the object output and LST for the listing. Although there are some common extensions, they are arbitrary and a different application might use REL and RPT. Extensions have been restricted, at least historically on some systems, to a length of 3 characters, but in general can have any length, e.g., html. There

A dot before the extension. The dot was not actually stored in the directory. Using only 7-bit characters allowed several file attributes to be included in the actual filename by using the high-order bit; these attributes included Read-only, Archive, and System. Eventually this was too restrictive and the number of characters allowed increased. The attribute bits were moved to a special block of

A file: additionally, the exact byte representation of the filename on the storage device is needed. This can be solved at the application level, with some tricky normalization calls. The issue of Unicode equivalence is known as "normalized-name collision". A solution is the Non-normalizing Unicode Composition Awareness used in the Subversion and Apache technical communities. This solution does not normalize paths in

A filename, although most utilities do not handle them well. Filenames may include things like a revision or generation number of the file, a numerical sequence number (widely used by digital cameras through the DCF standard), a date and time (widely used by smartphone camera software and for screenshots), or a comment such as the name of a subject or a location or any other text to help identify

A filesystem to storing components of names, so increasing limits often requires an incompatible change, as well as reserving more space. A particular issue with filesystems that store information in nested directories is that it may be possible to create a file with a complete pathname that exceeds implementation limits, since length checking may apply only to individual parts of the name rather than

A joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin. This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed. The same result can be accomplished with an empty or missing robots.txt file. This example tells all robots to stay out of
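A minimal sketch of the two cases just described, using the standard User-agent and Disallow fields (the comments are explanatory and not part of the original listings):

    # Allow every robot to visit every file; equivalent to an empty or missing robots.txt
    User-agent: *
    Disallow:

    # Tell every robot to stay out of the entire website
    User-agent: *
    Disallow: /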

A maximum of eight plus three characters was a filename alias of "long file name.???" as a way to conform to 8.3 limitations for older programs. This property was used by the move command algorithm that first creates a second filename and then only removes the first filename. Other filesystems, by design, provide only one filename per file, which guarantees that alteration of one filename's file does not alter

A page that is crawled. A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/. The robots.txt protocol


A period had to occur at least once in each 8 characters, two consecutive periods could not appear in the name, and the name had to end with a letter or digit. By convention, the letters and numbers before the first period were the account number of the owner or the project it belonged to, but there was no requirement to use this convention. On the McGill University MUSIC/SP system, file names consisted of The Univac VS/9 operating system had file names consisting of In 1985, RFC 959 officially defined

A request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operates on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from

A security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as

A security technique. Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots. Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites such as GitHub redirect humans.txt to an About page. Previously, Google had
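As a rough sketch of that server-side approach, an Apache .htaccess fragment along the following lines returns a failure response to any client whose user-agent string matches a given robot; the name "BadBot" is a placeholder, and mod_rewrite is assumed to be available:

    RewriteEngine On
    # Return 403 Forbidden to any client whose User-Agent contains "BadBot"
    RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
    RewriteRule .* - [F,L]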

A website: This example tells all robots not to enter three directories: This example tells all robots to stay away from one specific file: All other files in the specified directory will be processed. This example tells two specific robots not to enter one specific directory: Example demonstrating how comments can be used: It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings. Example demonstrating multiple user-agents: The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when multiple bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google ignores this directive, but provides an interface in its search console for webmasters to control
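The examples referred to above, including the non-standard Crawl-delay field, can be sketched with conventional robots.txt directives; the directory, file, and robot names below (such as /cgi-bin/ or BadBot) are illustrative placeholders rather than values from the original listings:

    # Keep all robots out of three directories
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /junk/

    # Keep all robots away from one specific file
    User-agent: *
    Disallow: /directory/file.html

    # Keep two specific robots out of one directory
    User-agent: BadBot
    User-agent: Googlebot
    Disallow: /private/

    # Comments run from a "#" character to the end of the line
    User-agent: *   # applies to all robots
    Disallow: /     # keep them out of everything

    # Multiple robots, each with their own rules
    User-agent: googlebot        # one specific crawler
    Disallow: /private/

    User-agent: *                # every other robot
    Disallow: /something/

    # Non-standard crawl-delay directive, interpreted differently by Yandex and Bing
    User-agent: bingbot
    Crawl-delay: 10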

Is no general encoding standard for filenames. File names have to be exchanged between software environments for network file transfer, file system storage, backup and file synchronization software, configuration management, data compression and archiving, etc. It is thus very important not to lose file name information between applications. This led to wide adoption of Unicode as a standard for encoding file names, although legacy software might not be Unicode-aware. Traditionally, a filename could contain any character as long as it was file system safe. Although this permitted

Is that different instances of the script or program can use different files. This makes an absolute or relative path composed of a sequence of filenames. Unix-like file systems allow a file to have more than one name; in traditional Unix-style file systems, the names are hard links to the file's inode or equivalent. Windows supports hard links on NTFS file systems, and provides the command fsutil in Windows XP, and mklink in later versions, for creating them. Hard links are different from Windows shortcuts, classic Mac OS/macOS aliases, or symbolic links. The introduction of LFNs with VFAT allowed filename aliases. For example, longfi~1.??? with
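For example, a hard link to an existing file might be created with the Windows commands mentioned above roughly as follows (the paths are placeholders):

    rem fsutil, available since Windows XP
    fsutil hardlink create D:\data\copy.txt D:\data\original.txt

    rem mklink with the /H switch, built into cmd.exe in later versions of Windows
    mklink /H D:\data\copy.txt D:\data\original.txt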

Is widely complied with by bot operators. Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Kagi, Google, Yahoo!, and Yandex. Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond


The null character and the path separator / are prohibited. File system utilities and naming conventions on various systems prohibit particular characters from appearing in filenames or make them problematic: Except as otherwise stated, the symbols in the Character column, " and < for example, cannot be used in Windows filenames.

The Googlebot's subsequent visits. Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url: The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement. In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through
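A minimal sketch of the Sitemap directive in that form (the URLs are placeholders):

    Sitemap: https://www.example.com/sitemap.xml
    Sitemap: https://www.example.com/news-sitemap.xml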

The file type. Some other file systems, such as Unix file systems, VFAT, and NTFS, treat a filename as a single string; a convention often used on those file systems is to treat the characters following the last period in the filename, in a filename containing periods, as the extension part of the filename. Multiple output files created by an application may use the same basename and various extensions. For example,

The website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site. A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google. A robots.txt file on a website will function as

The Unicode version in use. For instance, UDF is limited to Unicode 2.0; macOS's HFS+ file system applies NFD Unicode normalization and is optionally case-sensitive (case-insensitive by default). Filename maximum length is not standard and might depend on the code unit size. Although it is a serious issue, in most cases this is a limited one. On Linux, this means the filename is not enough to open

The attributes separately from the file name. Around 1995, VFAT, an extension to the MS-DOS FAT filesystem, was introduced in Windows 95 and Windows NT. It allowed mixed-case long filenames (LFNs), using Unicode characters, in addition to classic "8.3" names. Programs and devices may automatically assign names to files such as a numerical counter (for example IMG_0001.JPG) or a time stamp with

The clock of their camera. Internet-connected devices such as smartphones may synchronize their clock from an NTP server. An absolute reference includes all directory levels. In some systems, a filename reference that does not include the complete directory path defaults to the current working directory. This is a relative reference. One advantage of using a relative reference in program configuration files or scripts

The current date and time. The benefit of a time-stamped file name is that it facilitates searching files by date, given that file managers usually feature file searching by name. In addition, files from different devices can be merged in one folder without file naming conflicts. Numbered file names, on the other hand, do not require that the device has a correctly set internal clock. For example, some digital camera users might not bother setting
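As an illustration of the two naming schemes, a short Python sketch (the prefix and format strings are arbitrary choices, not part of any standard):

    from datetime import datetime

    # Counter-based name in the style used by digital cameras (e.g. IMG_0001.JPG)
    counter = 1
    counter_name = f"IMG_{counter:04d}.JPG"

    # Timestamp-based name in the style used by smartphone cameras and screenshots
    timestamp_name = datetime.now().strftime("IMG_%Y%m%d_%H%M%S.jpg")

    print(counter_name)    # IMG_0001.JPG
    print(timestamp_name)  # e.g. IMG_20240101_120000.jpg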

The encoding used for a filename as part of the extended file information. This forced costly filename encoding guessing with each file access. A solution was to adopt Unicode as the encoding for filenames. In the classic Mac OS, however, encoding of the filename was stored with the filename attributes. The Unicode standard solves the encoding determination issue. Nonetheless, some limited interoperability issues remain, such as normalization (equivalence), or

The entire name. Many Windows applications are limited to a MAX_PATH value of 260, but Windows file names can easily exceed this limit. From Windows 10, version 1607, MAX_PATH limitations have been removed. Filenames in some file systems, such as FAT and the ODS-1 and ODS-2 levels of Files-11, are composed of two parts: a base name or stem and an extension or suffix used by some applications to indicate


The exact capitalization by which it is named. On a case-insensitive, case-preserving file system, on the other hand, only one of "MyName.Txt", "myname.txt" and "Myname.TXT" can be the name of a file in a given directory at a given time, and a file with one of these names can be referenced by any capitalization of the name. From its original inception, the file systems on Unix and its derivative systems were case-sensitive and case-preserving. However, not all file systems on those systems are case-sensitive; by default, HFS+ and APFS in macOS are case-insensitive but case-preserving, and SMB servers usually provide case-insensitive behavior (even when

The file including additional information. The original File Allocation Table (FAT) file system, used by Standalone Disk BASIC-80, had a 6.3 file name, with a maximum of 6 bytes in the name and a maximum of 3 bytes in the extension. The FAT12 and FAT16 file systems in IBM PC DOS / MS-DOS and Microsoft Windows prior to Windows 95 used the same 8.3 convention as the CP/M file system. The FAT file systems supported 8-bit characters, allowing them to support non-ASCII characters in file names, and stored

The file. Some people use the term filename when referring to a complete specification of device, subdirectories and filename such as the Windows C:\Program Files\Microsoft Games\Chess\Chess.exe. The filename in this case is Chess.exe. Some utilities have settings to suppress the extension, as with MS Windows Explorer. During the 1970s, some mainframe and minicomputers had operating systems where files on

The introduction of VFAT, store filenames as upper-case regardless of the letter case used to create them. For example, a file created with the name "MyName.Txt" or "myname.txt" would be stored with the filename "MYNAME.TXT" (VFAT preserves the letter case). Any variation of upper and lower case can be used to refer to the same file. These kinds of file systems are called case-insensitive and are not case-preserving. Some filesystems prohibit

The main communication channel for WWW-related activities at the time. Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server. The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet

The new Unicode encoding. Mac OS X 10.3 marked Apple's adoption of Unicode 3.2 character decomposition, superseding the Unicode 2.1 decomposition used previously. This change caused problems for developers writing software for Mac OS X. Within a single directory, filenames must be unique. Since the filename syntax also applies for directories, it is not possible to create a file and directory entries with

The ones that appeared on popular blocklists. Despite the use of the terms allow and disallow, the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be

The other filename's file. Some filesystems restrict the length of filenames. In some cases, these lengths apply to the entire file name, as in 44 characters in IBM z/OS. In other cases, the length limits may apply to particular portions of the filename, such as the name of a file in a directory, or a directory name. For example, 9 (e.g., 8-bit FAT in Standalone Disk BASIC), 11 (e.g. FAT12, FAT16, FAT32 in DOS), 14 (e.g. early Unix), 21 (Human68K), 31, 30 (e.g. Apple DOS 3.2 and 3.3), 15 (e.g. Apple ProDOS), 44 (e.g. IBM S/370), or 255 (e.g. early Berkeley Unix) characters or bytes. Length limits often result from assigning fixed space in

The page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place. The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google maintains as

The repository. Paths are only normalized for the purpose of comparisons. Nonetheless, some communities have patented this strategy, forbidding its use by other communities. To limit interoperability issues, some ideas described by Sun are to: Those considerations create a limitation not allowing a switch to a future encoding different from UTF-8. One issue was migration to Unicode. For this purpose, several software companies provided software for migrating filenames to


The robots.txt standard and gives advice to web operators about how to disallow it, but The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options. 404 Media reported that companies like Anthropic and Perplexity.ai circumvented robots.txt by renaming or spinning up new scrapers to replace

The same character set for composing a filename. Before Unicode became a de facto standard, file systems mostly used a locale-dependent character set. By contrast, some new systems permit a filename to be composed of almost any character of the Unicode repertoire, and even some non-Unicode byte sequences. Limitations may be imposed by the file system, operating system, application, or requirements for interoperability with other systems. Many file system utilities prohibit control characters from appearing in filenames. In Unix-like file systems,

The same name in a single directory. Multiple files in different directories may have the same name. The uniqueness approach may differ both in case sensitivity and in the Unicode normalization form, such as NFC or NFD. This means two separate files might be created with the same text filename and a different byte implementation of the filename, such as L"\x00C0.txt" (UTF-16, NFC) (Latin capital A with grave) and L"\x0041\x0300.txt" (UTF-16, NFD) (Latin capital A, grave combining). Some filesystems, such as FAT prior to
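The normalization issue can be sketched in Python; the two strings below are canonically equivalent but consist of different code points (and bytes), so a file system that does not normalize names could store them as two distinct files:

    import unicodedata

    nfc_name = "\u00C0.txt"        # precomposed "À.txt" (NFC form)
    nfd_name = "\u0041\u0300.txt"  # "A" followed by a combining grave accent (NFD form)

    print(nfc_name == nfd_name)                                # False: different code points
    print(unicodedata.normalize("NFC", nfd_name) == nfc_name)  # True: canonically equivalent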

The syntax and format for a valid filename. The characters allowed in filenames depend on the file system. The letters A–Z and digits 0–9 are allowed by most file systems; many file systems support additional characters, such as the letters a–z, special characters, and other printable characters such as accented letters, symbols in non-Roman alphabets, and symbols in non-alphabetic scripts. Some file systems allow even unprintable characters, including Bell, Null, Return and Linefeed, to be part of

The system were identified by a user name, or account number. For example, on the TOPS-10 and RSTS/E operating systems from Digital Equipment Corporation, files were identified by On the OS/VS1, MVS, and OS/390 operating systems from IBM, a file name was up to 44 characters, consisting of upper case letters, digits, and the period. A file name must start with a letter or number,

The thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times. In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers". GPTBot complies with

The underlying file system is case-sensitive, e.g. Samba on most Unix-like systems), and SMB client file systems provide case-insensitive behavior. File system case sensitivity is a considerable challenge for software such as Samba and Wine, which must interoperate efficiently with both systems that treat uppercase and lowercase files as different and with systems that treat them the same. File systems have not always provided

The use of Robots meta tags and X-Robots-Tag HTTP headers. The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files. The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after
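For illustration, the two mechanisms typically look like this; the noindex value and the PDF file pattern are example choices, and the .htaccess fragment assumes an Apache server with mod_headers enabled:

    <!-- robots meta tag inside an HTML page -->
    <meta name="robots" content="noindex">

    # X-Robots-Tag header added from an Apache .htaccess or httpd.conf file
    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>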

The use of any encoding, and thus allowed the representation of any local text on any local system, it caused many interoperability issues. A filename could be stored using different byte strings in distinct systems within a single country, such as if one used Japanese Shift JIS encoding and another Japanese EUC encoding. Conversion was not possible as most systems did not expose a description of

The use of lower case letters in filenames altogether. Some file systems store filenames in the form that they were originally created; these are referred to as case-retentive or case-preserving. Such a file system can be case-sensitive or case-insensitive. If case-sensitive, then "MyName.Txt" and "myname.txt" may refer to two different files in the same directory, and each file must be referenced by


The website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives. According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed. Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI. In 2023, Originality.AI found that 306 of

Was published in September 2022 as RFC 9309. When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from
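A compliant robot can be sketched in Python with the standard library's urllib.robotparser, which fetches and parses the file before any other URL is requested (the site URL and user-agent name below are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # fetch and parse robots.txt before crawling anything else

    # Check whether a given user-agent may fetch a particular URL
    if rp.can_fetch("ExampleBot", "https://www.example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")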

Was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern. By June 1994 it had become a de facto standard; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista. On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force. A proposed standard

Was used in the 1990s to mitigate server overload. In the 2020s many websites began denying bots that collect information for generative artificial intelligence. The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites. The standard was proposed by Martijn Koster, when working for Nexor in February 1994 on the www-talk mailing list,
