Misplaced Pages

Apache Nutch

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Apache Nutch is a highly extensible and scalable open source web crawler software project.

#356643

47-526: Nutch is coded entirely in the Java programming language , but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering. The fetcher ("robot" or " web crawler ") has been written from scratch specifically for this project. Nutch originated with Doug Cutting , creator of both Lucene and Hadoop , and Mike Cafarella . In June, 2003,

94-490: A crucial role in supporting the Java community and its releases. It is essential that members possess expertise and in-depth technical knowledge, combined with strong professional experience, to significantly contribute to the growth and usage of the Java language . Membership for organizations and commercial entities requires annual fees, but it is free for individuals. JSRs undergo formal public reviews before becoming final, and

141-717: A lawsuit against Google shortly after that for using Java inside the Android SDK (see the Android section). On April 2, 2010, James Gosling resigned from Oracle . In January 2016, Oracle announced that Java run-time environments based on JDK 9 will discontinue the browser plugin. Java software runs on everything from laptops to data centers , game consoles to scientific supercomputers . Oracle (and others) highly recommend uninstalling outdated and unsupported versions of Java, due to unresolved security issues in older versions. There were five primary goals in creating

188-482: A number of other standard servlet classes available, for example for WebSocket communication. The Java servlet API has to some extent been superseded (but still used under the hood) by two standard Java technologies for web services: Typical implementations of these APIs on Application Servers or Servlet Containers use a standard servlet for handling all interactions with the HTTP requests and responses that delegate to

235-440: A small portion of code to which Sun did not hold the copyright. Sun's vice-president Rich Green said that Sun's ideal role with regard to Java was as an evangelist . Following Oracle Corporation 's acquisition of Sun Microsystems in 2009–10, Oracle has described itself as the steward of Java technology with a relentless commitment to fostering a community of participation and transparency. This did not prevent Oracle from filing

282-567: A subject of controversy during the 2010s. The class library contains features such as: Javadoc is a comprehensive documentation system, created by Sun Microsystems . It provides developers with an organized system for documenting their code. Javadoc comments have an extra asterisk at the beginning, i.e. the delimiters are /** and */ , whereas the normal multi-line comments in Java are delimited by /* and */ , and single-line comments start with // . Java Community Process The Java Community Process (JCP) , established in 1998,

329-673: A subproject of Lucene in June of that same year. Since April, 2010, Nutch has been considered an independent, top level project of the Apache Software Foundation . In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl. While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case. Branch Branch IBM Research studied

376-471: A successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system . The two facilities have been spun out into their own subproject, called Hadoop . In January, 2005, Nutch joined the Apache Incubator, from which it graduated to become

423-491: Is a high-level , class-based , object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose programming language intended to let programmers write once, run anywhere ( WORA ), meaning that compiled Java code can run on all platforms that support Java without the need to recompile. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of

470-486: Is a formal mechanism that enables interested parties to develop standard technical specifications for Java technology. Becoming a member of the JCP requires solid knowledge of the Java programming language, its specifications, and best practices in software development. Membership in the JCP involves a detailed review of the candidate's profile, including an assessment by current members. Typically, professionals are invited to join

517-576: Is actually two compilers in one; and with GraalVM (included in e.g. Java 11, but removed as of Java 16) allowing tiered compilation . Java itself is platform-independent and is adapted to the particular platform it is to run on by a Java virtual machine (JVM), which translates the Java bytecode into the platform's machine language. Programs written in Java have a reputation for being slower and requiring more memory than those written in C++ . However, Java programs' execution speed improved significantly with

SECTION 10

#1732798608357

564-409: Is implicitly allocated on the stack or explicitly allocated and deallocated from the heap . In the latter case, the responsibility of managing memory resides with the programmer. If the program does not deallocate an object, a memory leak occurs. If the program attempts to access or deallocate memory that has already been deallocated, the result is undefined and difficult to predict, and the program

611-626: Is insufficient free memory on the heap to allocate a new object; this can cause a program to stall momentarily. Explicit memory management is not possible in Java. Java does not support C/C++ style pointer arithmetic , where object addresses can be arithmetically manipulated (e.g. by adding or subtracting an offset). This allows the garbage collector to relocate referenced objects and ensures type safety and security. As in C++ and some other object-oriented languages, variables of Java's primitive data types are either stored directly in fields (for objects) or on

658-399: Is likely to become unstable or crash. This can be partially remedied by the use of smart pointers , but these add overhead and complexity. Garbage collection does not prevent logical memory leaks, i.e. those where the memory is still referenced but never used. Garbage collection may happen at any time. Ideally, it will occur when a program is idle. It is guaranteed to be triggered if there

705-420: Is no longer needed, typically when objects that are no longer needed are stored in containers that are still in use. If methods for a non-existent object are called, a null pointer exception is thrown. One of the ideas behind Java's automatic memory management model is that programmers can be spared the burden of having to perform manual memory management. In some languages, memory for the creation of objects

752-538: Is supported for interfaces . Java uses comments similar to those of C++. There are three different styles of comments: a single line style marked with two slashes ( // ), a multiple line style opened with /* and closed with */ , and the Javadoc commenting style opened with /** and closed with */ . The Javadoc style of commenting allows the user to run the Javadoc executable to create documentation for

799-400: Is the latest version (Java 22, and 20 are no longer maintained). Java 8, 11, 17, and 21 are previous LTS versions still officially supported. James Gosling , Mike Sheridan, and Patrick Naughton initiated the Java language project in June 1991. Java was originally designed for interactive television, but it was too advanced for the digital cable television industry at the time. The language

846-431: Is written inside classes, and every data item is an object, with the exception of the primitive data types, (i.e. integers, floating-point numbers, boolean values , and characters), which are not objects for performance reasons. Java reuses some popular aspects of C++ (such as the printf method). Unlike C++, Java does not support operator overloading or multiple inheritance for classes, though multiple inheritance

893-593: The ConcurrentMaps and other multi-core collections, and it was improved further with Java 1.6. Some platforms offer direct hardware support for Java; there are micro controllers that can run Java bytecode in hardware instead of a software Java virtual machine, and some ARM -based processors could have hardware support for executing Java bytecode through their Jazelle option, though support has mostly been dropped in current implementations of ARM. Java uses an automatic garbage collector to manage memory in

940-418: The object lifecycle . The programmer determines when objects are created, and the Java runtime is responsible for recovering the memory once objects are no longer in use. Once no references to an object remain, the unreachable memory becomes eligible to be freed automatically by the garbage collector. Something similar to a memory leak may still occur if a programmer's code holds a reference to an object that

987-516: The stack (for methods) rather than on the heap, as is commonly true for non-primitive data types (but see escape analysis ). This was a conscious decision by Java's designers for performance reasons. Java contains multiple types of garbage collectors. Since Java 9, HotSpot uses the Garbage First Garbage Collector (G1GC) as the default. However, there are also several other garbage collectors that can be used to manage

SECTION 20

#1732798608357

1034-605: The JCP Executive Committee votes on their approval. A finalized JSR provides a reference implementation , which is a free implementation of the technology in source code form, and a Technology Compatibility Kit to verify the API specification. The JCP itself is described by a JSR. As of 2020 , JSR 387 describes the current version (2.11) of the JCP. There are hundreds of JSRs. Some of the more visible JSRs include: The JCP's executive board has been characterized as

1081-575: The JCP based on their contributions and reputation within the Java community. Once invited, the new member undergoes an evaluation by the JCP Executive Committee, ensuring that they can effectively contribute to the Java Specification Requests (JSRs). These formal documents describe proposed specifications and technologies to be added to the Java platform . New members are encouraged to engage actively and play

1128-834: The Java language, as part of J2SE 5.0. Prior to the introduction of generics, each variable declaration had to be of a specific type. For container classes, for example, this is a problem because there is no easy way to create a container that accepts only specific types of objects. Either the container operates on all subtypes of a class or interface, usually Object , or a different container class has to be created for each contained class. Generics allow compile-time type checking without having to create many container classes, each containing almost identical code. In addition to enabling more efficient code, certain runtime exceptions are prevented from occurring, by issuing compile-time errors. If Java prevented all runtime type errors ( ClassCastException s) from occurring, it would be type safe . In 2016,

1175-950: The Java language: As of November 2024 , Java 8, 11, 17, and 21 are supported as long-term support (LTS) versions, with Java 25, releasing in September 2025, as the next scheduled LTS version. Oracle released the last zero-cost public update for the legacy version Java 8 LTS in January 2019 for commercial use, although it will otherwise still support Java 8 with public updates for personal use indefinitely. Other vendors such as Adoptium continue to offer free builds of OpenJDK's long-term support (LTS) versions. These builds may include additional security patches and bug fixes. Major release versions of Java, along with their release dates: Sun has defined and supports four editions of Java targeting different application environments and segmented many of its APIs so that they belong to one of

1222-440: The Java platform must run similarly on any combination of hardware and operating system with adequate run time support. This is achieved by compiling the Java language code to an intermediate representation called Java bytecode , instead of directly to architecture-specific machine code . Java bytecode instructions are analogous to machine code, but they are intended to be executed by a virtual machine (VM) written specifically for

1269-635: The ability to run Java applets within web pages, and Java quickly became popular. The Java 1.0 compiler was re-written in Java by Arthur van Hoff to comply strictly with the Java 1.0 language specification. With the advent of Java 2 (released initially as J2SE 1.2 in December 1998 – 1999), new versions had multiple configurations built for different types of platforms. J2EE included technologies and APIs for enterprise applications typically run in server environments, while J2ME featured APIs optimized for mobile applications. The desktop version

1316-586: The generated servlet creates the response. Swing is a graphical user interface library for the Java SE platform. It is possible to specify a different look and feel through the pluggable look and feel system of Swing. Clones of Windows , GTK+ , and Motif are supplied by Sun. Apple also provides an Aqua look and feel for macOS . Where prior implementations of these looks and feels may have been considered lacking, Swing in Java SE 6 addresses this problem by using more native GUI widget drawing routines of

1363-596: The heap, such as the Z Garbage Collector (ZGC) introduced in Java 11, and Shenandoah GC, introduced in Java 12 but unavailable in Oracle-produced OpenJDK builds. Shenandoah is instead available in third-party builds of OpenJDK, such as Eclipse Temurin . For most applications in Java, G1GC is sufficient. In prior versions of Java, such as Java 8, the Parallel Garbage Collector was used as the default garbage collector. Having solved

1410-658: The host hardware. End-users commonly use a Java Runtime Environment (JRE) installed on their device for standalone Java applications or a web browser for Java applets . Standard libraries provide a generic way to access host-specific features such as graphics, threading , and networking . The use of universal bytecode makes porting simple. However, the overhead of interpreting bytecode into machine instructions made interpreted programs almost always run more slowly than native executables . Just-in-time (JIT) compilers that compile byte-codes to machine code during runtime were introduced from an early stage. Java's Hotspot compiler

1457-556: The implementation of floating-point arithmetic, and a history of security vulnerabilities in the primary Java VM implementation HotSpot . Developers have criticized the complexity and verbosity of the Java Persistence API (JPA), a standard part of Java EE. This has led to increased adoption of higher-level abstractions like Spring Data JPA, which aims to simplify database operations and reduce boilerplate code. The growing popularity of such frameworks suggests limitations in

Apache Nutch - Misplaced Pages Continue

1504-537: The introduction of just-in-time compilation in 1997/1998 for Java 1.1 , the addition of language features supporting better code analysis (such as inner classes, the StringBuilder class, optional assertions, etc.), and optimizations in the Java virtual machine, such as HotSpot becoming Sun's default JVM in 2000. With Java 1.5, the performance was improved with the addition of the java.util.concurrent package, including lock-free implementations of

1551-455: The memory management problem does not relieve the programmer of the burden of handling properly other kinds of resources, like network or database connections, file handles, etc., especially in the presence of exceptions. The syntax of Java is largely influenced by C++ and C . Unlike C++, which combines the syntax for structured, generic, and object-oriented programming, Java was built almost exclusively as an object-oriented language. All code

1598-512: The performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5 . The ClueWeb09 dataset (used in e.g. TREC ) was gathered using Nutch, with an average speed of 755.31 documents per second. Java (programming language) Java

1645-496: The platforms. The platforms are: The classes in the Java APIs are organized into separate groups called packages . Each package contains a set of related interfaces , classes, subpackages and exceptions . Sun also provided an edition called Personal Java that has been superseded by later, standards-based Java ME configuration-profile pairings. One design goal of Java is portability , which means that programs written for

1692-451: The program and can be read by some integrated development environments (IDEs) such as Eclipse to allow developers to access documentation within the IDE. The following is a simple example of a "Hello, World!" program that writes a message to the standard output : Java applets are programs embedded in other applications, mainly in web pages displayed in web browsers. The Java applet API

1739-527: The selling of licenses for specialized products such as the Java Enterprise System. On November 13, 2006, Sun released much of its Java virtual machine (JVM) as free and open-source software (FOSS), under the terms of the GPL-2.0-only license. On May 8, 2007, Sun finished the process, making all of its JVM's core code available under free software /open-source distribution terms, aside from

1786-555: The specifications of the Java Community Process , Sun had relicensed most of its Java technologies under the GPL-2.0-only license. Oracle offers its own HotSpot Java Virtual Machine, however the official reference implementation is the OpenJDK JVM which is free open-source software and used by most developers and is the default JVM for almost all Linux distributions. As of September 2024 , Java 23

1833-468: The standard JPA implementation's ease-of-use for modern Java development. The Java Class Library is the standard library , developed to support application development in Java. It is controlled by Oracle in cooperation with others through the Java Community Process program. Companies or individuals participating in this process can influence the design and development of the APIs. This process has been

1880-409: The type system of Java was proven unsound in that it is possible to use generics to construct classes and methods that allow assignment of an instance one class to a variable of another unrelated class. Such code is accepted by the compiler, but fails at run time with a class cast exception. Criticisms directed at Java include the implementation of generics, speed, the handling of unsigned numbers,

1927-434: The underlying computer architecture . The syntax of Java is similar to C and C++ , but has fewer low-level facilities than either of them. The Java runtime provides dynamic capabilities (such as reflection and runtime code modification) that are typically not available in traditional compiled languages. Java gained popularity shortly after its release, and has been a very popular programming language since then. Java

Apache Nutch - Misplaced Pages Continue

1974-571: The underlying platforms. JavaFX is a software platform for creating and delivering desktop applications , as well as rich web applications that can run across a wide variety of devices. JavaFX is intended to replace Swing as the standard GUI library for Java SE , but since JDK 11 JavaFX has not been in the core JDK and instead in a separate module. JavaFX has support for desktop computers and web browsers on Microsoft Windows , Linux , and macOS . JavaFX does not have support for native OS look and feels. In 2004, generics were added to

2021-417: The web service methods for the actual business logic. JavaServer Pages ( JSP ) are server-side Java EE components that generate responses, typically HTML pages, to HTTP requests from clients . JSPs embed Java code in an HTML page by using the special delimiters <% and %> . A JSP is compiled to a Java servlet , a Java application in its own right, the first time it is accessed. After that,

2068-440: Was deprecated with the release of Java 9 in 2017. Java servlet technology provides Web developers with a simple, consistent mechanism for extending the functionality of a Web server and for accessing existing business systems. Servlets are server-side Java EE components that generate responses to requests from clients . Most of the time, this means generating HTML pages in response to HTTP requests, although there are

2115-665: Was initially called Oak after an oak tree that stood outside Gosling's office. Later the project went by the name Green and was finally renamed Java , from Java coffee , a type of coffee from Indonesia . Gosling designed Java with a C / C++ -style syntax that system and application programmers would find familiar. Sun Microsystems released the first public implementation as Java 1.0 in 1996. It promised write once, run anywhere (WORA) functionality, providing no-cost run-times on popular platforms . Fairly secure and featuring configurable security, it allowed network- and file-access restrictions. Major web browsers soon incorporated

2162-689: Was renamed J2SE. In 2006, for marketing purposes, Sun renamed new J2 versions as Java EE , Java ME , and Java SE , respectively. In 1997, Sun Microsystems approached the ISO/IEC JTC 1 standards body and later the Ecma International to formalize Java, but it soon withdrew from the process. Java remains a de facto standard , controlled through the Java Community Process . At one time, Sun made most of its Java implementations available without charge, despite their proprietary software status. Sun generated revenue from Java through

2209-625: Was the third most popular programming language in 2022 according to GitHub . Although still widely popular, there has been a gradual decline in use of Java in recent years with other languages using JVM gaining popularity. Java was originally developed by James Gosling at Sun Microsystems . It was released in May 1995 as a core component of Sun's Java platform . The original and reference implementation Java compilers , virtual machines, and class libraries were originally released by Sun under proprietary licenses . As of May 2007, in compliance with

#356643