This is an accepted version of this page
141-647: Portable Document Format ( PDF ), standardized as ISO 32000 , is a file format developed by Adobe in 1992 to present documents , including text formatting and images, in a manner independent of application software , hardware , and operating systems . Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts , vector graphics , raster images and other information needed to display it. PDF has its roots in "The Camelot Project" initiated by Adobe co-founder John Warnock in 1991. PDF
282-517: A RIP for Raster Image Processor) for the PostScript language was a common component of laser printers during the 1980s and 1990s. However, the cost of implementation was high; computers output raw PS code that would be interpreted by the printer into a raster image at the printer's natural resolution. This required high performance microprocessors and ample memory . The LaserWriter used a 12 MHz Motorola 68000 , making it faster than any of
423-572: A page description . PDF has (as of version 2.0) 25 graphics state properties, of which some of the most important are: As in PostScript, vector graphics in PDF are constructed with paths . Paths are usually composed of lines and cubic Bézier curves , but can also be constructed from the outlines of text. Unlike PostScript, PDF does not allow a single path to mix text outlines with lines and curves. Paths can be stroked, filled, fill then stroked, or used for clipping . Strokes and fills can use any color set in
564-555: A trade secret . Paxton worked on several other related improvements, such as font hinting . Adobe was also responsible for porting PostScript to the Canon's Motorola 68000 chip. Apple and Adobe announced the LaserWriter at Apple's annual stockholder meeting on January 23, 1985. It was the first printer to ship with PostScript, sparking the desktop publishing (DTP) revolution in the mid-1980s. The original PostScript royalty
705-494: A PDF are: In later PDF revisions, a PDF document can also support links (inside document or web page), forms, JavaScript (initially available as a plugin for Acrobat 3.0), or any other types of embedded contents that can be handled using plug-ins. PDF combines three technologies: PostScript is a page description language run in an interpreter to generate an image. It can handle graphics and has standard features of programming languages such as branching and looping . PDF
846-455: A PDF. Within text strings, characters are shown using character codes (integers) that map to glyphs in the current font using an encoding . There are several predefined encodings, including WinAnsi , MacRoman , and many encodings for East Asian languages and a font can have its own built-in encoding. (Although the WinAnsi and MacRoman encodings are derived from the historical properties of
987-419: A PostScript file could be accurately rendered only as the cumulative result of executing all preceding commands to draw all previous pages—any of which could affect subsequent pages—plus the commands to draw that particular page, and there was no easy way to bypass that process to skip around to different pages. Traditionally, to go from PostScript to PDF, a source PostScript file (that is, an executable program)
1128-463: A PostScript program: the execution of which results in the original document. This program can be sent to an interpreter in a printer, which results in a printed document, or to one inside another application, which will display the document on-screen. Since the document-program is the same regardless of its destination, it is called device-independent . PostScript is noteworthy for implementing on-the-fly rasterization in which everything, even text,
1269-472: A Web browser plugin without waiting for the entire file to download, since all objects required for the first page to display are optimally organized at the start of the file. PDF files may be optimized using Adobe Acrobat software or QPDF . Page dimensions are not limited by the format itself. However, Adobe Acrobat imposes a limit of 15 million by 15 million inches, or 225 trillion in (145,161 km). The basic design of how graphics are represented in PDF
1410-434: A byte frequency distribution to build the representative models for file type and use any statistical and data mining techniques to identify file types. There are several types of ways to structure data in a file. The most usual ones are described below. Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file. This has several drawbacks. Unless
1551-408: A company logo may be needed both in .eps format (for publishing) and .png format (for web sites). With the extensions visible, these would appear as the unique filenames: " CompanyLogo.eps " and " CompanyLogo.png ". On the other hand, hiding the extensions would make both appear as " CompanyLogo ", which can lead to confusion. Hiding extensions can also pose a security risk. For example,
SECTION 10
#17327651901151692-520: A few bytes long. The metadata contained in a file header are usually stored at the start of the file, but might be present in other areas too, often including the end, depending on the file format or the type of data contained. Character-based (text) files usually have character-based headers, whereas binary formats usually have binary headers, although this is not a rule. Text-based file headers usually take up more space, but being human-readable, they can easily be examined by using simple software such as
1833-448: A file based on the end of its name, more specifically the letters following the final period. This portion of the filename is known as the filename extension . For example, HTML documents are identified by names that end with .html (or .htm ), and GIF images by .gif . In the original FAT file system , file names were limited to an eight-character identifier and a three-character extension, known as an 8.3 filename . There are
1974-401: A file format is to use information regarding the format stored inside the file itself, either information meant for this purpose or binary strings that happen to always be in specific locations in files of some formats. Since the easiest place to locate them is at the beginning, such area is usually called a file header when it is greater than a few bytes , or a magic number if it is just
2115-416: A file unusable (or "lose" it) by renaming it incorrectly. This led most versions of Windows and Mac OS to hide the extension when listing files. This prevents the user from accidentally changing the file type, and allows expert users to turn this feature off and display the extensions. Hiding the extension, however, can create the appearance of two or more identical filenames in the same folder. For example,
2256-401: A formal specification document, letting precedent set by other already existing programs that use the format define the format via how these existing programs use it. If the developer of a format does not publish free specifications, another developer looking to utilize that kind of file must either reverse engineer the file to find out how to read it or acquire the specification document from
2397-399: A hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image , which itself conforms to a supertype of public.data . A UTI can exist in multiple hierarchies, which provides great flexibility. In addition to file formats, UTIs can also be used for other entities which can exist in macOS, including: In IBM OS/VS through z/OS ,
2538-479: A language suitable for running the entire GUI of a computer. Sun added a number of new commands for timers, mouse control, interrupts and other systems needed for interactivity, and added data structures and language elements to allow it to be completely object oriented internally. A complete GUI, three in fact, were written in NeWS and provided for a time on their workstations. However, the ongoing efforts to standardize
2679-406: A laser printer makes it possible to position high-quality graphics and text on the same page. PostScript made it possible to fully exploit these characteristics by offering a single control language that could be used on any brand of printer. PostScript went beyond the typical printer control language and was a complete programming language of its own. Many applications can transform a document into
2820-442: A limited number of three-letter extensions, which can cause a given extension to be used by more than one program. Many formats still use three-character extensions even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse both the operating system and users. One artifact of this approach
2961-414: A malicious user could create an executable program with an innocent name such as " Holiday photo.jpg.exe ". The " .exe " would be hidden and an unsuspecting user would see " Holiday photo.jpg ", which would appear to be a JPEG image, usually unable to harm the machine. However, the operating system would still see the " .exe " extension and run the program, which would then be able to cause harm to
SECTION 20
#17327651901153102-507: A number of technologies for this task, but most shared the property that the glyphs were physically difficult to change, as they were stamped onto typewriter keys, bands of metal, or optical plates. This changed to some degree with the increasing popularity of dot matrix printers . The characters on these systems were drawn as a series of dots, as defined by a font table inside the printer. As they grew in sophistication, dot matrix printers started including several built-in fonts from which
3243-452: A page description as an inline image .) Images are typically filtered for compression purposes. Image filters supported in PDF include the following general-purpose filters: Normally all image content in a PDF is embedded in the file. But PDF allows image data to be stored in external files by the use of external streams or Alternate Images . Standardized subsets of PDF, including PDF/A and PDF/X , prohibit these features. Text in PDF
3384-415: A paper for a project then code-named Camelot, in which he proposed the creation of a simplified version of PostScript called Interchange PostScript (IPS). Unlike traditional PostScript, which was tightly focused on rendering print jobs to output devices, IPS would be optimized for displaying pages to any screen and any platform. Adobe Systems made the PDF specification available free of charge in 1993. In
3525-408: A particular file's format, with each approach having its own advantages and disadvantages. Most modern operating systems and individual applications need to use all of the following approaches to read "foreign" file formats, if not work with them completely. One popular method used by many operating systems, including Windows , macOS , CP/M , DOS , VMS , and VM/CMS , is to determine the format of
3666-462: A printer. When Steve Jobs left Apple and started NeXT , he pitched Adobe on the idea of using PS as the display system for his new workstation computers. The result was Display PostScript , or DPS. DPS added basic functionality to improve performance by changing many string lookups into 32 bit integers, adding support for direct output with every command, and adding functions to allow the GUI to inspect
3807-437: A printing device. PostScript was not intended for long-term storage and real-time interactive rendering of electronic documents to computer monitors , so there was no need to support anything other than consecutive rendering of pages. If there was an error in the final printed output, the user would correct it at the application level and send a new print job in the form of an entirely new PostScript file. Thus, any given page in
3948-469: A result, files that use a small amount of transparency might be viewed acceptably by older viewers, but files making extensive use of transparency could be viewed incorrectly by an older viewer. The transparency extensions are based on the key concepts of transparency groups , blending modes , shape , and alpha . The model is closely aligned with the features of Adobe Illustrator version 9. The blend modes were based on those used by Adobe Photoshop at
4089-441: A sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where file types do not lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if the file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing
4230-431: A specialized editor or IDE . However, this feature was often the source of user confusion, as which program would launch when the files were double-clicked was often unpredictable. RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions—e.g. the hexadecimal number FF5 is "aliased" to PoScript , representing a PostScript file. A Uniform Type Identifier (UTI)
4371-473: A stream may be used instead of the ASCII cross-reference table and contains the offsets and other information in binary format. The format is flexible in that it allows for integer width specification (using the /W array), so that for example, a document not exceeding 64 KiB in size may dedicate only 2 bytes for object offsets. At the end of a PDF file is a footer containing If a cross-reference stream
PDF - Misplaced Pages Continue
4512-485: A text editor or a hexadecimal editor. As well as identifying the file format, file headers may contain metadata about the file and its contents. For example, most image files store information about image format, size, resolution and color space , and optionally authoring information such as who made the image, when and where it was made, what camera model and photographic settings were used ( Exif ), and so on. Such metadata may be used by software reading or interpreting
4653-502: A value in a company/standards organization database), and the 2 following digits categorize the type of file in hexadecimal . The final part is composed of the usual filename extension of the file or the international standard number of the file, padded left with zeros. For example, the PNG file specification has the FFID of 000000001-31-0015948 where 31 indicates an image file, 0015948
4794-638: A way of identifying what type of file was attached to an e-mail , independent of the source and target operating systems. MIME types identify files on BeOS , AmigaOS 4.0 and MorphOS , as well as store unique application signatures for application launching. In AmigaOS and MorphOS, the Mime type system works in parallel with Amiga specific Datatype system. There are problems with the MIME types though; several organizations and people have created their own MIME types without registering them properly with IANA, which makes
4935-428: A well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type. So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to
5076-459: Is a page description language and dynamically typed , stack-based programming language . It is most commonly used in the electronic publishing and desktop publishing realm, but as a Turing complete programming language, it can be used for many other purposes as well. PostScript was created at Adobe Systems by John Warnock , Charles Geschke , Doug Brotz, Ed Taft and Bill Paxton from 1982 to 1984. The most recent version, PostScript 3,
5217-528: Is a method used in macOS for uniquely identifying "typed" classes of entities, such as file formats. It was developed by Apple as a replacement for OSType (type & creator codes). The UTI is a Core Foundation string , which uses a reverse-DNS string. Some common and standard types use a domain called public (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format ). UTIs can be defined within
5358-407: Is a risk that the file format can be misinterpreted. It may even have been badly written at the source. This can result in corrupt metadata which, in extremely bad cases, might even render the file unreadable. A more complex example of file headers are those used for wrapper (or container) file formats. One way to incorporate file type metadata, often associated with Unix and its derivatives,
5499-579: Is a static data structure made for efficient access and embeds navigational information suitable for interactive viewing. PostScript is a Turing-complete programming language, belonging to the concatenative group of programming languages. It is an interpreted , stack-based language similar to Forth but with strong dynamic typing , data structures inspired by those found in Lisp , scoped memory and, since language level 2, garbage collection . The language syntax uses reverse Polish notation , which makes
5640-414: Is a subset of PostScript, simplified to remove such control flow features, while graphics commands remain. PostScript was originally designed for a drastically different use case : transmission of one-way linear print jobs in which the PostScript interpreter would collect a series of commands until it encountered the showpage command, then execute all the commands to render a page as a raster image to
5781-464: Is also a shading pattern , which draws continuously varying colors. There are seven types of shading patterns of which the simplest are the axial shading (Type 2) and radial shading (Type 3). Raster images in PDF (called Image XObjects ) are represented by dictionaries with an associated stream. The dictionary describes the properties of the image, and the stream contains the image data. (Less commonly, small raster images may be embedded directly in
PDF - Misplaced Pages Continue
5922-449: Is an extensible scheme of persistent, unique, and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of the UK government and some digital preservation programs,
6063-469: Is called an embedded font while the former is called an unembedded font . The font files that may be embedded are based on widely used standard digital font formats: Type 1 (and its compressed variant CFF), TrueType , and (beginning with PDF 1.6) OpenType . Additionally PDF supports the Type 3 variant in which the components of the font are described by PDF graphic operators. Fourteen typefaces, known as
6204-475: Is its handling of fonts . The font system uses the PS graphics primitives to draw glyphs as curves, which can then be rendered at any resolution . A number of typographic issues had to be considered with this approach. One issue is that fonts do not scale linearly at small sizes and features of the glyphs will become proportionally too large or small and start to look displeasing. PostScript avoided this problem with
6345-598: Is needed for such a printer, Ghostscript can be used. There are also a number of commercial PostScript interpreters, such as TeleType Co. 's T-Script or Brother 's BR-Script3 . PostScript became commercially successful due to the introduction of the graphical user interface (GUI), allowing designers to directly lay out pages for eventual output on laser printers. However, the GUIs' own graphics systems were generally much less sophisticated than PostScript; Apple's QuickDraw , for instance, supported only basic lines and arcs, not
6486-417: Is not being used, the footer is preceded by the trailer keyword followed by a dictionary containing information that would otherwise be contained in the cross-reference stream object's dictionary: Within each page, there are one or multiple content streams that describe the text, vector and images being drawn on the page. The content stream is stack-based , similar to PostScript. There are two layouts to
6627-543: Is not required in situations where a PDF file is intended only for print. Since the feature is optional, and since the rules for tagged PDF were relatively vague in ISO 32000-1, support for tagged PDF among consuming devices, including assistive technology (AT), is uneven as of 2021. ISO 32000-2, however, includes an improved discussion of tagged PDF which is anticipated to facilitate further adoption. An ISO-standardized subset of PDF specifically targeted at accessibility, PDF/UA ,
6768-432: Is organized using ASCII characters, except for certain elements that may have binary content. The file starts with a header containing a magic number (as a readable string) and the version of the format, for example %PDF-1.7 . The format is a subset of a COS ("Carousel" Object Structure) format. A COS tree file consists primarily of objects , of which there are nine types: Comments using 8-bit characters prefixed with
6909-440: Is possible to write computer programs in PostScript just like any other programming language. A Hello World program , the customary way to show a small example of a complete program in a given language, might look like this in PostScript (level 2): or if the output device has a console PostScript uses the point as its unit of length. However, unlike some of the other versions of the point, PostScript uses exactly 72 points to
7050-402: Is represented by text elements in page content streams. A text element specifies that characters should be drawn at certain positions. The characters are specified using the encoding of a selected font resource . A font object in PDF is a description of a digital typeface . It may either describe the characteristics of a typeface, or it may include an embedded font file . The latter case
7191-424: Is small, and/or that chunks do not contain other chunks; many formats do not impose those requirements. The information that identifies a particular "chunk" may be called many different things, often terms including "field name", "identifier", "label", or "tag". The identifiers are often human-readable, and classify parts of the data: for example, as a "surname", "address", "rectangle", "font name", etc. These are not
SECTION 50
#17327651901157332-510: Is specified in terms of straight lines and cubic Bézier curves (previously found only in CAD applications), which allows arbitrary scaling, rotating and other transformations. When the PostScript program is interpreted, the interpreter converts these instructions into the dots needed to form the output. For this reason, PostScript interpreters are occasionally called PostScript raster image processors , or RIPs. Almost as complex as PostScript itself
7473-427: Is that the system can easily be tricked into treating a file as a different format simply by renaming it — an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt . Although this strategy was useful to expert users who could easily understand and manipulate this information, it was often confusing to less technical users, who could accidentally make
7614-648: Is the standard number and 000000001 indicates the International Organization for Standardization (ISO). Another less popular way to identify the file format is to examine the file contents for distinguishable patterns among file types. The contents of a file are a sequence of bytes and a byte has 256 unique permutations (0–255). Thus, counting the occurrence of byte patterns that is often referred to as byte frequency distribution gives distinguishable patterns to identify file types. There are many content-based file type identification schemes that use
7755-479: Is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginnings of files, but since any binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a , depending upon
7896-438: Is used as the basis for generating PostScript-like PDF code (see, e.g., Adobe Distiller ). This is done by applying standard compiler techniques like loop unrolling , inlining and removing unused branches, resulting in code that is purely declarative and static. The end result is then packaged into a container format , together with all necessary dependencies for correct rendering (external files, graphics, or fonts to which
8037-447: Is very similar to that of PostScript, except for the use of transparency, which was added in PDF 1.4. PDF graphics use a device-independent Cartesian coordinate system to describe the surface of a page. A PDF page description can use a matrix to scale , rotate , or skew graphical elements. A key concept in PDF is that of the graphics state , which is a collection of graphical parameters that may be changed, saved, and restored by
8178-449: The de facto standard for electronic document distribution. On high-end printers, PostScript processors remain common, and their use can dramatically reduce the CPU work involved in printing documents, transferring the work of rendering PostScript images from the computer to the printer. The first version of the PostScript language was released to the market in 1984. The qualifier Level 1
8319-638: The GIF file format required the use of a patented algorithm, and though the patent owner did not initially enforce their patent, they later began collecting royalty fees . This has resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the GIF patent expired in the US in mid-2003, and worldwide in mid-2004. Different operating systems have traditionally taken different approaches to determining
8460-579: The Harlequin RIP , both by Global Graphics . A free software version, with several other applications, is Ghostscript . Several compatible interpreters are listed on the Undocumented Printing Wiki. Some basic, inexpensive laser printers do not support PostScript, instead coming with drivers that simply rasterize the platform's native graphics formats rather than converting them to PostScript first. When PostScript support
8601-458: The Ogg format can act as a container for different types of multimedia including any combination of audio and video , with or without text (such as subtitles ), and metadata . A text file can contain any stream of characters, including possible control characters , and is encoded in one of various character encoding schemes . Some file formats, such as HTML , scalable vector graphics , and
SECTION 60
#17327651901158742-537: The Windows and Macintosh operating systems, fonts using these encodings work equally well on any platform.) PDF can specify a predefined encoding to use, the font's built-in encoding or provide a lookup table of differences to a predefined or built-in encoding (not recommended with TrueType fonts). The encoding mechanisms in PDF were designed for Type 1 fonts, and the rules for applying them to TrueType fonts are complex. For large fonts or fonts with non-standard glyphs,
8883-486: The X11 system led to its introduction and widespread use on Sun systems, and NeWS never became widely used. The PDF and PostScript share the same imaging model and both documents are mutually convertible to each other. Both documents produce the same result when printed. The difference between the PDF and PostScript is that the PDF lacks the general-purpose programming language framework of the PostScript language. A PDF document
9024-601: The array and dictionary types, but cannot be declared to the type system, which sees them all only as arrays and dictionaries, so any further typing discipline to be applied to such user-defined "types" is left to the code that implements them. The character "%" is used to introduce comments in PostScript programs. As a general convention, every PostScript program should start with the characters "%!PS" as an interpreter directive so that all devices will properly interpret it as PostScript. Typically, PostScript programs are not produced by humans, but by other programs. However, it
9165-468: The source code of computer software are text files with defined syntaxes that allow them to be used for specific purposes. File formats often have a published specification describing the encoding method and enabling testing of program intended functionality. Not all formats have freely available specification documents, partly because some developers view their specification documents as trade secrets , and partly because other developers never author
9306-417: The standard 14 fonts , have a special significance in PDF documents: These fonts are sometimes called the base fourteen fonts . These fonts, or suitable substitute fonts with the same metrics, should be available in most PDF readers, but they are not guaranteed to be available in the reader, and may only display correctly if the system has them installed. Fonts may be substituted if they are not embedded in
9447-460: The "level" terminology in favor of simple versioning) came at the end of 1997, and along with many new dictionary-based versions of older operators, introduced better color handling and new filters (which allow in-program compression/decompression, program chunking, and advanced error-handling). PostScript 3 was significant in terms of replacing the existing proprietary color electronic prepress systems, then widely used for magazine production, through
9588-419: The ASCII representation formed a sequence of meaningful characters, such as an abbreviation of the application's name or the developer's initials. For instance a HyperCard "stack" file has a creator of WILD (from Hypercard's previous name, "WildCard") and a type of STAK . The BBEdit text editor has a creator code of R*ch referring to its original programmer, Rich Siegel . The type code specifies
9729-478: The Macintosh computers to which it was attached. When the laser printer engines themselves cost over a thousand dollars the added cost of PS was marginal. But, as printer mechanisms fell in price, the cost of implementing PS became too great a fraction of overall printer cost. In addition, with desktop computers becoming more powerful during the 1990s than their attached printers, it no longer made sense to offload
9870-429: The PDF files: non-linearized (not "optimized") and linearized ("optimized"). Non-linearized PDF files can be smaller than their linear counterparts, though they are slower to access because portions of the data required to assemble pages of the document are scattered throughout the PDF file. Linearized PDF files (also called "optimized" or "web optimized" PDF files) are constructed in a manner that enables them to be read in
10011-402: The PS system in the computer rather than the printer. This led to the natural evolution of PS from a printing system to one that could also be used as the host's own graphics language. There were numerous advantages to this approach; not only did it help eliminate the possibility of different output on screen and printer, but it also provided a powerful graphics system for the computer, and allowed
10152-513: The PUID scheme does provide greater granularity than most alternative schemes. MIME types are widely used in many Internet -related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. These consist of a standardised system of identifiers (managed by IANA ) consisting of a type and a sub-type , separated by a slash —for instance, text/html or image/gif . These were originally intended as
10293-751: The VSAM catalog (prior to ICF catalogs ) and the VSAM Volume Record in the VSAM Volume Data Set (VVDS) (with ICF catalogs) identifies the type of VSAM dataset. In IBM OS/360 through z/OS , a format 1 or 7 Data Set Control Block (DSCB) in the Volume Table of Contents (VTOC) identifies the Dataset Organization ( DSORG ) of the dataset described by it. The HPFS , FAT12, and FAT16 (but not FAT32) filesystems allow
10434-510: The announcement of TrueType, Adobe published the specification for the Type 1 font format. Retail tools such as Altsys Fontographer (acquired by Macromedia in January 1995, owned by FontLab since May 2005) added the ability to create Type 1 fonts. Since then, many free Type 1 fonts have been released; for instance, the fonts used with the TeX typesetting system are available in this format. In
10575-442: The appropriate icons, but these will be located in different places on the storage medium thus taking longer to access. A folder containing many files with complex metadata such as thumbnail information may require considerable time before it can be displayed. If a header is binary hard-coded such that the header itself needs complex interpretation in order to be recognized, especially for metadata content protection's sake, there
10716-617: The basis for handling PostScript outlines in OpenType fonts. The CID-keyed font format was also designed, to solve the problems in the OCF/Type 0 fonts , for addressing the complex Asian-language ( CJK ) encoding and very large character set issues. The CID-keyed font format can be used with the Type 1 font format for standard CID-keyed fonts, or Type 2 for CID-keyed OpenType fonts. To compete with Adobe's system, Apple designed their own system, TrueType , around 1991. Immediately following
10857-452: The challenge of making PostScript render high-quality output to such a low-resolution device (which for most consumers would be their only printing device). In response, Warnock and Brotz solved the so-called "appearance problem" of making the stem width of letters scale properly so that they look good at all resolutions. Their breakthrough was so important that Adobe has never patented the technology, in order to keep its details concealed as
10998-454: The command interpreter. Another operating system using magic numbers is AmigaOS , where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system
11139-454: The complex B-splines and advanced region filling options of PostScript. In order to take full advantage of PostScript printing, applications on the computers had to re-implement those features using the host platform's own graphics system. This led to numerous issues where the on-screen layout would not exactly match the printed output, due to differences in the implementation of these features. As computer power grew, it became possible to host
11280-431: The computer. The same is true with files with only one extension: as it is not shown to the user, no information about the file can be deduced without explicitly investigating the file. To further trick users, it is possible to store an icon inside the program, in which case some operating systems' icon assignment for the executable file ( .exe ) would be overridden with an icon commonly used to represent JPEG images, making
11421-508: The contents. PDF 2.0 defines 256-bit AES encryption as the standard for PDF 2.0 files. The PDF Reference also defines ways that third parties can define their own encryption systems for PDF. PDF files may be digitally signed, to provide secure authentication; complete details on implementing digital signatures in PDF are provided in ISO 32000-2. PDF files may also contain embedded DRM restrictions that provide further controls that limit copying, editing, or printing. These restrictions depend on
11562-419: The data must be entirely parsed by applications. On Unix and Unix-like systems, the ext2 , ext3 , ext4 , ReiserFS version 3, XFS , JFS , FFS , and HFS+ filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique and a value can be accessed through its related name. The PRONOM Persistent Unique Identifier (PUID)
11703-430: The destination, the single file received has to be unzipped by a compatible utility to be useful. The problems of handling metadata are solved this way using zip files or archive files. The Mac OS ' Hierarchical File System stores codes for creator and type as part of the directory entry for each file. These codes are referred to as OSTypes. These codes could be any 4-byte sequence but were often selected so that
11844-542: The diagram. Additionally, a set of "bindings" was provided to allow PS code to be called directly from the C programming language . NeXT used these bindings in their NeXTStep system to provide an object oriented graphics system. Although DPS was written in conjunction with NeXT, Adobe sold it commercially and it was a common feature of most Unix workstations in the 1990s. Sun Microsystems took another approach, creating NeWS . Instead of DPS's concept of allowing PS to interact with C programs, NeWS instead extended PS into
11985-408: The document refers), and compressed . Modern applications write to printer drivers that directly generate PDF rather than going through PostScript first. As a document format, PDF has several advantages over PostScript: Its disadvantages are: PDF since v1.6 supports embedding of interactive 3D documents: 3D drawings can be embedded using U3D or PRC and various other data formats. A PDF file
12126-411: The document root. This dictionary contains an array of Optional Content Groups (OCGs), each describing a set of information and each of which may be individually displayed or suppressed, plus a set of Optional Content Configuration Dictionaries, which give the status (Displayed or Suppressed) of the given OCGs. A PDF file may be encrypted , for security, in which case a password is needed to view or edit
12267-415: The early 1990s, there were several other systems for storing outline-based fonts, developed by Bitstream and Metafont for instance, but none included a general-purpose printing solution and they were therefore not widely used. In the late 1990s, Adobe joined Microsoft in developing OpenType , essentially a functional superset of the Type 1 and TrueType formats. When printed to a PostScript output device,
12408-457: The early years PDF was popular mainly in desktop publishing workflows, and competed with several other formats, including DjVu , Envoy , Common Ground Digital Paper, Farallon Replica and even Adobe's own PostScript format. PDF was a proprietary format controlled by Adobe until it was released as an open standard on July 1, 2008, and published by the International Organization for Standardization as ISO 32000-1:2008, at which time control of
12549-435: The file during the loading process and afterwards. File headers may be used by an operating system to quickly gather information about a file without loading it all into memory, but doing so uses more of a computer's resources than reading directly from the directory information. For instance, when a graphic file manager has to display the contents of a folder, it must read the headers of many files before it can display
12690-448: The file format's definition. Throughout the 1970s, many programs used formats of this general kind. For example, word-processors such as troff , Script , and Scribe , and database export files such as CSV . Electronic Arts and Commodore - Amiga also used this type of file format in 1985, with their IFF (Interchange File Format) file format. A container is sometimes called a "chunk" , although "chunk" may also imply that each piece
12831-831: The file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types. The NTFS filesystem also allows storage of OS/2 extended attributes, as one of the file forks , but this feature is merely present to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but
12972-471: The format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, the user could have several text files all with the type code of TEXT , but each open in a different program, due to having differing creator codes. This feature was intended so that, for example, human-readable plain-text files could be opened in a general-purpose text editor, while programming or HTML code files would open in
13113-463: The format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need to check only one piece of data, and match it against
13254-586: The format's developers for a fee and by signing a non-disclosure agreement . The latter approach is possible only when a formal specification document exists. Both strategies require significant time, money, or both; therefore, file formats with publicly available specifications tend to be supported by more programs. Patent law, rather than copyright , is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats encode data using patented algorithms . For example, prior to 2004, using compression with
13395-619: The founders kept turning him down. In December 1983, the two companies finally signed off on the PostScript licensing deal, and Adobe had to shift focus immediately from high-end, high-resolution printing devices to the consumer-oriented Apple LaserWriter laser printer. At that time, the 300-dpi Canon laser printing engine to be used in LaserWriters was seen as good enough only for proof printing (i.e., for crude rough drafts of material whose final drafts would be sent to professional high-resolution devices), but Jobs presented Adobe with
13536-401: The full implementation of the ISO 32000-1 specification. These proprietary technologies are not standardized, and their specification is published only on Adobe's website. Many of them are not supported by popular third-party implementations of PDF. ISO published version 2.0 of PDF, ISO 32000-2 in 2017, available for purchase, replacing the free specification provided by Adobe. In December 2020,
13677-402: The graphics state, including patterns . PDF supports several types of patterns. The simplest is the tiling pattern in which a piece of artwork is specified to be drawn repeatedly. This may be a colored tiling pattern , with the colors specified in the pattern object, or an uncolored tiling pattern , which defers color specification to the time the pattern is drawn. Beginning with PDF 1.3 there
13818-492: The imaging model. A tagged PDF (see clause 14.8 in ISO 32000) includes document structure and semantics information to enable reliable text extraction and accessibility . Technically speaking, tagged PDF is a stylized use of the format that builds on the logical structure framework introduced in PDF 1.3. Tagged PDF defines a set of standard structure types and attributes that allow page content (text, graphics, and images) to be extracted and reused for other purposes. Tagged PDF
13959-405: The inch. Thus: For example, in order to draw a vertical line of 4 cm length, it is sufficient to type: More readably and idiomatically, one might use the following equivalent, which demonstrates a simple procedure definition and the use of the mathematical operators mul and div : (Technically, most printers have a construction-implied unprintable margin around the physical borders of
14100-431: The inclusion of font hinting , in which additional information is provided in horizontal or vertical bands to help identify the features in each letter that are important for the rasterizer to maintain. The result was significantly better-looking fonts even at low resolution. It had formerly been believed that hand-tuned bitmap fonts were required for this task. At the time, the technology for including these hints in fonts
14241-515: The introduction of smooth shading operations with up to 4096 shades of grey (rather than the 256 available in PostScript Level 2), as well as DeviceN, a color space that allowed the addition of additional ink colors (called spot colors ) into composite color pages. Prior to the introduction of Interpress and PostScript, printers were designed to print character output given the text—typically in ASCII —as input. There were
14382-408: The licensing fees for their implementation of PostScript for printers, known as a raster image processor or RIP . As a number of new RISC -based platforms became available in the mid-1980s, some found Adobe's support of the new machines to be lacking. This and issues of cost led to third-party implementations of PostScript becoming common, particularly in low-cost printers (where the licensing fee
14523-515: The main data and the name, but is also less portable than either filename extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions— for instance, for compatibility with MS-DOS 's three character limit— most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata. Note that zip files or archive files solve
14664-430: The memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal string is not recognized as such in C ). On the other hand, developing tools for reading and writing these types of files is very simple. The limitations of
14805-688: The need for a standard means of defining page images. In 1975–76 Bob Sproull and William Newman developed the Press format, which was eventually used in the Xerox Star system to drive laser printers. But Press, a data format rather than a language, lacked flexibility, and PARC mounted the Interpress effort to create a successor. In 1978, John Gaffney and Martin Newell then at Xerox PARC wrote J & M or JaM (for "John and Martin") which
14946-440: The objects in the file, and also allows for small changes to be made without rewriting the entire file ( incremental update ). Before PDF version 1.5, the table would always be in a special ASCII format, be marked with the xref keyword, and follow the main body composed of indirect objects. Version 1.5 introduced optional cross-reference streams , which have the form of a standard stream object, possibly with filters applied. Such
15087-419: The order of operations unambiguous, but reading a program requires some practice, because one has to keep the layout of the stack in mind. Most operators (what other languages term functions ) take their arguments from the stack, and place their results onto the stack. Literals (for example, numbers) have the effect of placing a copy of themselves on the stack. Sophisticated data structures can be built on
15228-552: The percent sign ( % ) may be inserted. Objects may be either direct (embedded in another object) or indirect . Indirect objects are numbered with an object number and a generation number and defined between the obj and endobj keywords if residing in the document root. Beginning with PDF version 1.5, indirect objects (except other streams) may also be located in special streams known as object streams (marked /Type /ObjStm ). This technique enables non-stream objects to have standard stream filters applied to them, reduces
15369-494: The printers to be "dumb" at a time when the cost of the laser engines was falling. In a production setting, using PostScript as a display system meant that the host computer could render low-resolution to the screen, higher resolution to the printer, or simply send the PS code to a smart printer for offboard printing. However, PostScript was written with printing in mind, and had numerous features that made it unsuitable for direct use in an interactive display system. In particular, PS
15510-416: The problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension .zip ). The new file is also compressed and possibly encrypted, but now is transmissible as a single file across operating systems by FTP transmissions or sent by email as an attachment. At
15651-447: The program look like an image. Extensions can also be spoofed: some Microsoft Word macro viruses create a Word file in template format and save it with a .doc extension. Since Word generally ignores extensions and looks at the format of the file, these would open as templates, execute, and spread the virus. This represents a practical problem for Windows systems where extension-hiding is turned on by default. A second way to identify
15792-435: The rasterization work onto the resource-constrained printer. By 2001, few low-end printer models came with onboard support for PostScript, largely due to growing competition from much cheaper non-PostScript inkjet printers, and new software-based methods to render PostScript images on computers, making them suitable for any printer. PDF , a descendant of PostScript, provides one such method, and has largely replaced PostScript as
15933-580: The reader software to obey them, so the security they provide is limited. File format A file format is a standard way that information is encoded for storage in a computer file . It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free . Some file formats are designed for very particular types of data: PNG files, for example, store bitmapped images using lossless data compression . Other file formats, however, are designed for storage of several different types of data:
16074-819: The same thing as identifiers in the sense of a database key or serial number (although an identifier may well identify its associated data as such a key). With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand. Depending on the actual meaning of the skipped data, this may or may not be useful ( CSS explicitly defines such behavior). This concept has been used again and again by RIFF (Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER ( Distinguished Encoding Rules ) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF) . Indeed, any data format must somehow identify
16215-499: The second edition of PDF 2.0, ISO 32000-2:2020, was published, with clarifications, corrections, and critical updates to normative references (ISO 32000-2 does not include any proprietary technologies as normative references). In April 2023 the PDF Association made ISO 32000-2 available for download free of charge. A PDF file is often a combination of vector graphics , text, and bitmap graphics . The basic types of content in
16356-544: The significance of its component parts, and embedded boundary-markers are an obvious way to do so: This is another extensible format, that closely resembles a file system ( OLE Documents are actual filesystems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk images , executables , OLE documents TIFF , libraries . PostScript PostScript ( PS )
16497-422: The size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF . Object streams do not support specifying an object's generation number (other than 0). An index table, also called the cross-reference table, is located near the end of the file and gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to
16638-432: The special encodings Identity-H (for horizontal writing) and Identity-V (for vertical) are used. With such fonts, it is necessary to provide a ToUnicode table if semantic information about the characters is to be preserved. A text document which is scanned to PDF without the text being recognised by optical character recognition (OCR) is an image, with no fonts or text properties. The original imaging model of PDF
16779-576: The specification passed to an ISO Committee of volunteer industry experts. In 2008, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for all patents owned by Adobe necessary to make, use, sell, and distribute PDF-compliant implementations. PDF 1.7, the sixth edition of the PDF specification that became ISO 32000-1, includes some proprietary technologies defined only by Adobe, such as Adobe XML Forms Architecture (XFA) and JavaScript extension for Acrobat, which are referenced by ISO 32000-1 as normative and indispensable for
16920-485: The spring of 1983, Steve Jobs came to visit Adobe and was dazzled by PostScript's potential, especially for the new Macintosh computer he was developing at Apple . To John Sculley 's frustration, Jobs licensed the PostScript technology from Adobe by offering a $ 1.5 million advance against PostScript royalties, plus $ 2.5 million in exchange for 20 percent of Adobe shares. During a series of meetings in 1983, Jobs also repeatedly offered for Apple to buy Adobe outright, but
17061-566: The standard outline font technology for both Windows and the Macintosh. Today, third-party PostScript-compatible interpreters are widely used in printers and multifunction peripherals (MFPs). For example, CSR plc 's IPS PS3 interpreter, formerly known as PhoenixPage, is standard in many printers and MFPs, including those developed by Hewlett-Packard and sold under the LaserJet and Color LaserJet lines. Other third-party PostScript solutions used by print and MFP manufacturers include Jaws and
17202-586: The standard to which they adhere. Many file types, especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE html , or, for XHTML , the XML identifier, which begins with <?xml . The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML. The magic number approach offers better guarantees that
17343-438: The storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2 ). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with
17484-472: The technology were left with the Type 3 Font (also known as PostScript Type 3 Font , PS3 or T3 ). Type 3 fonts allowed for all the sophistication of the PostScript language, but without the standardized approach to hinting. The Type 2 font format was designed to be used with Compact Font Format (CFF) charstrings, and was implemented to reduce the overall font file size. The CFF/Type2 format later became
17625-447: The time. When the PDF 1.4 specification was published, the formulas for calculating blend modes were kept secret by Adobe. They have since been published. The concept of a transparency group in PDF specification is independent of existing notions of "group" or "layer" in applications such as Adobe Illustrator. Those groupings reflect logical relationships among objects that are meaningful when editing those objects, but they are not part of
17766-460: The unneeded parts of the OpenType font are omitted, and what is sent to the device by the driver is the same as it would be for a TrueType or Type 1 font, depending on which kind of outlines were present in the OpenType font. Adobe supported Type 1 fonts in its products until January 2023, when it fully removed support in favor of OpenType fonts. In the 1980s, Adobe drew most of its revenue from
17907-414: The unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time. In this kind of file structure, each piece of data is embedded in a container that somehow identifies the data. The container's scope can be identified by start- and end-markers of some kind, by an explicit length field somewhere, or by fixed requirements of
18048-472: The use of this standard awkward in some cases. File format identifiers are another, not widely used way to identify file formats according to their origin and their file category. It was created for the Description Explorer suite of software. It is composed of several digits of the form NNNNNNNNN-XX-YYYYYYY . The first part indicates the organization origin/maintainer (this number represents
18189-461: The user could select, and some models allowed users to upload their own custom glyphs into the printer. Dot matrix printers also introduced the ability to print raster graphics . The graphics were interpreted by the computer and sent as a series of dots to the printer using a series of escape sequences . These printer control languages varied from printer to printer, requiring program authors to create numerous drivers . Vector graphics printing
18330-481: Was opaque, similar to PostScript, where each object drawn on the page completely replaced anything previously marked in the same location. In PDF 1.4 the imaging model was extended to allow transparency. When transparency is used, new objects interact with previously marked objects to produce blending effects. The addition of transparency to PDF was done by means of new extensions that were designed to be ignored in products written to PDF 1.3 and earlier specifications. As
18471-418: Was added when Level 2 was introduced. PostScript Level 2 was introduced in 1991, and included several improvements: improved speed and reliability, support for in-Raster Image Processing (RIP) separations, image decompression (for example, JPEG images could be rendered by a PostScript program), support for composite fonts , and the form mechanism for caching reusable content. PostScript 3 (Adobe dropped
18612-401: Was based on the idea of collecting up PS commands until the showpage command was seen, at which point all of the commands read up to that point were interpreted and output. In an interactive system, this was clearly not appropriate, nor did PS have any sort of interactivity built in; for example, supporting hit detection for mouse interactivity obviously did not apply when PS was being used on
18753-484: Was carefully guarded, and the hinted fonts were compressed and encrypted into what Adobe called a Type 1 Font (also known as PostScript Type 1 Font , PS1 , T1 or Adobe Type 1 ). Type 1 was effectively a simplification of the PS system to store outline information only, as opposed to being a complete language (PDF is similar in this regard). Adobe would then sell licenses to the Type 1 technology to those wanting to add hints to their own fonts. Those who did not license
18894-514: Was first published in 2012. With the introduction of PDF version 1.5 (2003) came the concept of Layers. Layers, more formally known as Optional Content Groups (OCGs), refer to sections of content in a PDF document that can be selectively viewed or hidden by document authors or viewers. This capability is useful in CAD drawings, layered artwork, maps, multi-language documents, etc. Basically, it consists of an Optional Content Properties Dictionary added to
19035-506: Was five percent of the list price for each laser printer sold, which was $ 350 of the original LaserWriter list price of $ 6,995, and such royalties provided nearly all of Adobe's income during its early years. (Apple later renegotiated the contract to pay a licensing fee based on volume of printers shipped.) The combination of technical merits and widespread availability made PostScript the language of choice for graphical output for printing applications. An interpreter (sometimes referred to as
19176-511: Was left to special-purpose devices, called plotters . Almost all plotters shared a common command language, HPGL , but were of limited use for anything other than printing graphics. In addition, they tended to be expensive and slow, and thus rare. Laser printers combine the best features of both printers and plotters. Like plotters, laser printers offer high quality line art, and like dot-matrix printers, they are able to generate pages of text and raster graphics. Unlike either printers or plotters,
19317-407: Was released in 1997. The concepts of the PostScript language were seeded in 1976 by John Gaffney at Evans & Sutherland , a computer graphics company. At that time, Gaffney and John Warnock were developing an interpreter for a large three-dimensional graphics database of New York Harbor . Concurrently, researchers at Xerox PARC had developed the first laser printer and had recognized
19458-690: Was standardized as ISO 32000 in 2008. The last edition as ISO 32000-2:2020 was published in December 2020. PDF files may contain a variety of content besides flat text and graphics including logical structuring elements, interactive elements such as annotations and form-fields, layers, rich media (including video content), three-dimensional objects using U3D or PRC , and various other data formats . The PDF specification also provides for encryption and digital signatures , file attachments, and metadata to enable workflows requiring these features. The development of PDF began in 1991 when John Warnock wrote
19599-450: Was the sticking point) or in high-end typesetting equipment (where the quest for speed demanded support for new platforms faster than Adobe could provide). At one point, Microsoft licensed to Apple a PostScript-compatible interpreter it had bought called TrueImage , and Apple licensed to Microsoft its new font format, TrueType . Apple ended up reaching an accord with Adobe and licensed genuine PostScript for its printers, but TrueType became
19740-579: Was then enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on Macintosh, later adapted by Interchange File Format (IFF) and derivatives. A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself. This approach keeps the metadata separate from both
19881-467: Was used for VLSI design and the investigation of type and graphics printing. This work later evolved and expanded into the Interpress language. Warnock left with Chuck Geschke and founded Adobe Systems in December 1982. They, together with Doug Brotz, Ed Taft and Bill Paxton created a simpler language, similar to Interpress, called PostScript, which went on the market in 1984. Meanwhile, in
#114885