Open Content Alliance - Misplaced Pages

The Open Content Alliance (OCA) was a consortium of organizations contributing to a permanent, publicly accessible archive of digitized texts. Its creation was announced in October 2005 by Yahoo!, the Internet Archive, the University of California, the University of Toronto and others. Scanning for the Open Content Alliance was administered by the Internet Archive, which also provided permanent storage and access through its website.

52-520: The OCA was, in part, a response to Google Book Search, which was announced in October 2004. OCA's approach to seeking permission from copyright holders differed significantly from that of Google Book Search. OCA digitized copyrighted works only after asking and receiving permission from the copyright holder ("opt-in"). By contrast, Google Book Search digitized copyrighted works unless explicitly told not to do so ("opt-out"), and contends that digitizing for

104-402: A Tumblr blog. Scholars have frequently reported rampant errors in the metadata information on Google Books, including misattributed authors and erroneous dates of publication. Geoffrey Nunberg, a linguist researching changes in word usage over time, noticed that a search for books published before 1950 and containing the word "internet" turned up an unlikely 527 results. Woody Allen

156-560: A "download a pdf" button to all its out-of-copyright, public domain books. It also added a new browsing interface along with new "About this Book" pages. August 2006: The University of California System announced that it would join the Books digitization project. This includes a portion of the 34 million volumes within the approximately 100 libraries managed by the System. September 2006: The Complutense University of Madrid became

208-451: A 300-page book. Soon after, however, the technology had been developed to the extent that scanning operators could scan up to 6,000 pages an hour. Google established designated scanning centers to which books were transported by trucks. The stations could digitize at the rate of 1,000 pages per hour. The books were placed in a custom-built mechanical cradle that adjusted the book spine in place while an array of lights and optical instruments scanned

260-640: A PDF copy. Books can also be made available for sale on Google Play. Unlike the Library Project, this does not raise any copyright concerns as it is conducted pursuant to an agreement with the publisher. The publisher can choose to withdraw from the agreement at any time. For many books, Google Books displays the original page numbers. However, Tim Parks, writing in The New York Review of Books in 2014, noted that Google had stopped providing page numbers for many recent publications (likely

312-583: A book's author chooses to add an ISBN, LCCN or OCLC record number, the service will update the book's URL to include it. Then, the author can set a specific page as the link's anchor. This option makes their book more easily discoverable. The Ngram Viewer is a service connected to Google Books that graphs the frequency of word usage across its book collection. The service is important for historians and linguists because it can provide an inside look into human culture through word use across time periods. This program has come under criticism because of errors in

364-477: A decade. The announcement soon triggered controversy, as publisher and author associations challenged Google's plans to digitize, not just books in the public domain, but also titles still under copyright. September–October 2005: Two lawsuits against Google charge that the company has not respected copyrights and has failed to properly compensate authors and publishers. One is a class action suit on behalf of authors (Authors Guild v. Google, September 20, 2005) and

416-409: A declaration from Google at the end of scanned books says: The digitization at the most basic level is based on page images of the physical books. To make this book available as an ePub formatted file we have taken those page images and extracted the text using Optical Character Recognition (or OCR for short) technology. The extraction of text from page images is a difficult engineering task. Smudges on

468-591: A list of titles that they do not want scanned, and the request would be respected. The company also stated that it would not scan any in-copyright books between August and 1 November 2005, to provide the owners with the opportunity to decide which books to exclude from the Project. Thus, copyright owners have three choices with respect to any work: Most scanned works are no longer in print or commercially available. In addition to procuring books from libraries, Google also obtains books from its publisher partners, through

520-425: A long time and we didn't need PR firms.’” In 2020, AAP released press statements to support four of its members in the case of Hachette v. Internet Archive (IA). President Maria Pallante said of the case, "As the complaint outlines, by illegally copying and distributing online a stunning number of literary works each day, IA displays an abandon shared only by the world’s most egregious pirate sites." This action

572-562: A perfect book are daunting, but we continue to make enhancements to our OCR and book structure extraction technologies. In 2009, Google stated that they would start using reCAPTCHA to help fix the errors found in Google Book scans. This method would only improve scanned words that are hard to recognize because of the scanning process and cannot solve errors such as turned pages or blocked words. Scanning errors have inspired works of art such as published collections of anomalous pages and

624-437: A range of sources, including the users, third-party sites like Goodreads, and often the book's author and publisher. In fact, to encourage authors to upload their own books, Google has added several functionalities to the website. The authors can allow visitors to download their ebook for free, or they can set their own purchase price. They can change the price back and forth, offering discounts whenever it suits them. Also, if

676-509: A remarkable efficiency and speed but also helped protect the fragile collections from being over-handled. Afterwards, the crude images went through three levels of processing: first, de-warping algorithms used the LIDAR data to fix the pages' curvature. Then, optical character recognition (OCR) software transformed the raw images into text, and, lastly, another round of algorithms extracted page numbers, footnotes, illustrations and diagrams. Many of

728-417: Is aimed at scanning and making searchable the collections of several major research libraries. Along with bibliographic information, snippets of text from a book are often viewable. If a book is out of copyright and in the public domain, the book is fully available to read or download. In-copyright books scanned through the Library Project are made available on Google Books for snippet view. Regarding

780-492: Is higher than one would expect to find in a typical library online catalog. The overall error rate of 36.75% found in this study suggests that Google Books' metadata has a high rate of error. While "major" and "minor" errors are a subjective distinction based on the somewhat indeterminate concept of "findability", the errors found in the four metadata elements examined in this study should all be considered major. Metadata errors based on incorrect scanned dates make research using

832-630: Is mentioned in 325 books ostensibly published before he was born. Google responded to Nunberg by blaming the bulk of errors on outside contractors. Other metadata errors reported include publication dates before the author's birth (e.g. 182 works by Charles Dickens prior to his birth in 1812); incorrect subject classifications (an edition of Moby Dick found under "computers", a biography of Mae West classified under "religion"); conflicting classifications (10 editions of Whitman's Leaves of Grass all classified as both "fiction" and "nonfiction"); incorrectly spelled titles, authors, and publishers (Moby Dick: or

884-416: Is verse or prose, and so forth). Getting this right allows us to render the book in a way that follows the format of the original book. Despite our best efforts you may see spelling mistakes, garbage characters, extraneous images, or missing pages in this book. Based on our estimates, these errors should not prevent you from enjoying the content of the book. The technical challenges of automatically constructing

936-521: The Cantonal and University Library of Lausanne. May 2007: The Boekentoren Library of Ghent University announced that it would participate with Google in digitizing and making digitized versions of 19th century books in the French and Dutch languages available online. American Association of Publishers The Association of American Publishers (AAP) is the national trade association of

988-498: The "Partner Program" – designed to help publishers and authors promote their books. Publishers and authors submit either a digital copy of their book in EPUB or PDF format, or a print copy to Google, which is made available on Google Books for preview. The publisher can control the percentage of the book available for preview, with the minimum being 20%. They can also choose to make the book fully viewable, and even allow users to download

1040-480: The "secret 'books' project." Google founders Sergey Brin and Larry Page came up with the idea that later became Google Books while still graduate students at Stanford in 1996. The history page on the Google Books website describes their initial vision for this project: "in a future world in which vast collections of books are digitized, people would use a 'web crawler' to index the books' content and analyze

1092-684: The 2000s. Google Books's scanning efforts have been subject to litigation, including Authors Guild v. Google, a class-action lawsuit in the United States, decided in Google's favor (see below). This was a major case that came close to changing copyright practices for orphan works in the United States. A 2023 study by scholars from the University of California, Berkeley and Northeastern University's business schools found that Google Books's digitization of books has led to increased sales for

1144-543: The American book publishing industry. AAP lobbies for book, journal and education publishers in the United States. AAP members include most of the major commercial publishers in the United States, as well as smaller and nonprofit publishers, university presses, and scholarly societies. Patricia Schroeder, a former United States representative, served as the association's CEO from 1997 until 2009, taking over

1196-544: The Book Search digitization project. At least one million volumes would be digitized from the university's 13 library locations. March 2007: The Bavarian State Library announced a partnership with Google to scan more than a million public domain and out-of-print works in German as well as English, French, Italian, Latin, and Spanish. May 2007: A book digitizing project partnership was announced jointly by Google and

1248-583: The Google Books Partner Program, or by Google's library partners through the Library Project. Additionally, Google has partnered with a number of magazine publishers to digitize their archives. The Publisher Program was first known as Google Print when it was introduced at the Frankfurt Book Fair in October 2004. The Google Books Library Project, which scans works in the collections of library partners and adds them to

1300-749: The Google Books Project database difficult. Google has shown only limited interest in cleaning up these errors. Some European politicians and intellectuals have criticized Google's effort on linguistic imperialism grounds. They argue that because the vast majority of books proposed to be scanned are in English, it will result in disproportionate representation of natural languages in the digital world. German, Russian, French, and Spanish, for instance, are popular languages in scholarship. The disproportionate online emphasis on English, however, could shape access to historical scholarship, and, ultimately,

1352-632: The Google Print Library Project. Google announced partnerships with several high-profile university and public libraries, including the University of Michigan, Harvard (Harvard University Library), Stanford (Green Library), Oxford (Bodleian Library), and the New York Public Library. According to press releases and university librarians, Google planned to digitize and make available through its Google Books service approximately 15 million volumes within

1404-567: The OCA: Biodiversity Heritage Library, a cooperative project of: Google Book Search Google Books (previously known as Google Book Search, Google Print, and by its code-name Project Ocean) is a service from Google that searches the full text of books and magazines that Google has scanned, converted to text using optical character recognition (OCR), and stored in its digital database. Books are provided either by publishers and authors through

1456-482: The White "Wall"), and metadata for one book incorrectly appended to a completely different book (the metadata for an 1818 mathematical work leads to a 1963 romance novel). A review of the author, title, publisher, and publication year metadata elements for 400 randomly selected Google Books records was undertaken. The results show that 36% of sampled books in the digitization project contained metadata errors. This error rate

1508-625: The base for such digitization projects as JSTOR and Making of America. When Page learned, in a conversation with then-university president Mary Sue Coleman, that the university's current estimate for scanning all the library's volumes was 1,000 years, he reportedly told Coleman that he "believes Google can help make it happen in six." 2003: The team works to develop a high-speed scanning process as well as software for resolving issues in odd type sizes, unusual fonts, and "other unexpected peculiarities." December 2004: Google signaled an extension to its Google Print initiative known as

1560-630: The book is still under copyright, a user sees "snippets" of text around the queried search terms. All instances of the search terms in the book text appear with a yellow highlight. The four access levels used on Google Books are: In response to criticism from groups such as the Association of American Publishers and the Authors Guild, Google announced an opt-out policy in August 2005, through which copyright owners could provide

1612-458: The books are scanned using a customized Elphel 323 camera at a rate of 1,000 pages per hour. A patent awarded to Google in 2009 revealed that Google had come up with an innovative system for scanning books that uses two cameras and infrared light to automatically correct for the curvature of pages in a book. By constructing a 3D model of each page and then "de-warping" it, Google is able to present flat-looking pages without having to really make

1664-616: The connections between them, determining any given book's relevance and usefulness by tracking the number and quality of citations from other books." This team visited the sites of some of the larger digitization efforts at that time including the Library of Congress's American Memory Project, Project Gutenberg, and the Universal Library to find out how they work, as well as the University of Michigan, Page's alma mater, and

1716-689: The content they had scanned and they relinquished the scanning equipment to their digitization partners and libraries to continue digitization programs. Between about 2006 and 2008 Microsoft sponsored the scanning of over 750,000 books, 300,000 of which are now part of the Internet Archive's on-line collections. Brewster Kahle, a founder of the Open Content Alliance, actively opposed the proposed Google Book Settlement until its defeat in March 2011. The following are contributors to

1768-457: The digital inventory, was announced in December 2004. The Google Books initiative has been hailed for its potential to offer unprecedented access to what may become the largest online body of human knowledge and to promote the democratization of knowledge. However, it has also been criticized for potential copyright violations, and lack of editing to correct the many errors introduced into

1820-444: The file sizes minimal to enable access by internet users with low bandwidth. For each work, Google Books automatically generates an overview page. This page displays information extracted from the book—its publishing details, a high frequency word map, the table of contents—as well as secondary material, such as summaries, reader reviews (not readable in the mobile version of the website), and links to other relevant texts. A visitor to

1872-759: The first Spanish-language library to join the Google Books Library Project. October 2006: The University of Wisconsin–Madison announced that it would join the Book Search digitization project along with the Wisconsin Historical Society Library. Combined, the libraries have 7.2 million holdings. November 2006: The University of Virginia joined the project. Its libraries contain more than five million volumes and more than 17 million manuscripts, rare books and archives. January 2007: The University of Texas at Austin announced that it would join

1924-577: The freedom to read, censorship and libel; the freedom to publish; funding for education and libraries; postal rates and regulations; tax and trade policy; and international copyright enforcement. AAP tracks publisher revenue on a monthly and annual basis with its StatShot programs. The association has also awarded books, journals, and electronic content through its annual PROSE Awards since 1976. In August 2019, AAP sued Audible for its Captions feature, through which machine-generated text could be displayed alongside audio narration. The lawsuit

1976-584: The growth and direction of future scholarship. Among these critics is Jean-Noël Jeanneney, the former president of the Bibliothèque nationale de France. While Google Books has digitized large numbers of journal back issues, its scans do not include the metadata required for identifying specific articles in specific issues. This has led the makers of Google Scholar to start their own program to digitize and host older journal articles (in agreement with their publishers). The Google Books Library Project

2028-464: The metadata used in the program. The project has received criticism that its stated aim of preserving orphaned and out-of-print works is at risk due to scanned data having errors and such problems not being solved. The scanning process is subject to errors. For example, some pages may be unreadable, upside down, or in the wrong order. Scholars have even reported crumpled pages, obscuring thumbs and fingers, and smeared or blurry images. On this issue,

2080-500: The ones acquired through the Partner Program) "presumably in alliance with the publishers, in order to force those of us who need to prepare footnotes to buy paper editions." The project began in 2002 under the codename Project Ocean. Google co-founder Larry Page had always had an interest in digitizing books. When he and Marissa Mayer began experimenting with book scanning in 2002, it took 40 minutes for them to digitize

2132-539: The other is a civil lawsuit brought by five large publishers and the Association of American Publishers (McGraw Hill v. Google, October 19, 2005). November 2005: Google changed the name of this service from Google Print to Google Book Search. Its program enabling publishers and authors to include their books in the service was renamed Google Books Partner Program, and the partnership with libraries became Google Books Library Project. 2006: Google added

2184-502: The page, for instance, might see a list of books that share a similar genre and theme, or they might see a list of current scholarship on the book. This content, moreover, offers interactive possibilities for users signed into their Google account. They can export the bibliographic data and citations in standard formats, write their own reviews, and add the book to their library to be tagged, organized, and shared with other people. Thus, Google Books collects these more interpretive elements from

2236-636: The pages flat, which requires the use of destructive methods such as unbinding or glass plates to individually flatten each page, which is inefficient for large scale scanning. Google decided to omit color information in favour of better spatial resolution, as most out-of-copyright books at the time did not contain colors. Each page image was passed through algorithms that distinguished the text and illustration regions. Text regions were then processed via OCR to enable full-text searching. Google expended considerable resources in coming up with optimal compression techniques, aiming for high image quality while keeping

2288-409: The physical books' pages, fancy fonts, old fonts, torn pages, etc. can all lead to errors in the extracted text. Imperfect OCR is only the first challenge in the ultimate goal of moving from collections of page images to extracted-text based books. Our computer algorithms also have to automatically determine the structure of the book (what are the headers and footers, where images are placed, whether text

2340-402: The physical versions of the books. Results from Google Books show up in both the universal Google Search and in the dedicated Google Books search website (books.google.com). In response to search queries, Google Books allows users to view full pages from books in which the search terms appear if the book is out of copyright or if the copyright owner has given permission. If Google believes

2392-558: The purposes of indexing is fair use. Microsoft had a special relationship with the Open Content Alliance until May 2008. Microsoft joined the Open Content Alliance in October 2005 as part of its Live Book Search project. However, in May 2008 Microsoft announced it would be ending the Live Book Search project and no longer funding the scanning of books through the Internet Archive. Microsoft removed any contractual restrictions on

2444-531: The quality of scans, Google acknowledges that they are "not always of sufficiently high quality" to be offered for sale on Google Play. Also, because of supposed technical constraints, Google does not replace scans with higher quality versions that may be provided by the publishers. The project is the subject of the Authors Guild v. Google lawsuit, filed in 2005 and ruled in favor of Google in 2013, and again, on appeal, in 2015. Copyright owners can claim

2496-457: The rights for a scanned book and make it available for preview or full view (by "transferring" it to their Partner Program account), or request Google to prevent the book text from being searched. The number of institutions participating in the Library Project has grown since its inception. Other institutional partners have joined the project since the partnership was first announced: 2002: A group of team members at Google officially launch

2548-439: The role from Nicholas A. Veliotes. On May 1, 2009, another former United States representative, Tom Allen, took over as president and CEO. In January 2017, Maria Pallante, a former United States Register of Copyrights, became the president and CEO of the organization. The association's core programs deal primarily with advocacy related to: intellectual property; new technology and digital issues of concern to publishers;

2600-451: The scanned texts by the OCR process. In October 2019, Google celebrated 15 years of Google Books and put the number of scanned books at more than 40 million titles. Google estimated in 2010 that there were about 130 million distinct titles in the world, and stated that it intended to scan all of them. However, the scanning process in American academic libraries has slowed since

2652-401: The two open pages. Each page would have two cameras directed at it capturing the image, while a range finder LIDAR overlaid a three-dimensional laser grid on the book's surface to capture the curvature of the paper. A human operator would turn the pages by hand, using a foot pedal to take the photographs. With no need to flatten the pages or align them perfectly, Google's system not only reached

2704-596: Was settled in February 2020, with Audible agreeing not to implement the Captions feature without obtaining express permission. The AAP initially supported the arrest of Dmitry Sklyarov. AAP was criticized after it contracted Eric Dezenhall's crisis management firm to promote its position regarding the open access movement. Schroeder told The Washington Post “the association hired Dezenhall when members realized they needed help. ‘We thought we were angels for
