[W3CHINA.ORG Computer Science Forum → 『 Web挖掘技术 』 (Web Mining) → Delve inside the Lucene indexing mechanism]
    Posted by admin (W3China站长):

    Delve inside the Lucene indexing mechanism

    Index your documents with Lucene, an IR library written in Java

    Level: Intermediate

    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#author]Deng Peng Zhou[/URL] ([URL=mailto:zhoudengpeng@yahoo.com.cn?subject=Delve inside the Lucene indexing mechanism&cc=htc@us.ibm.com]zhoudengpeng@yahoo.com.cn[/URL]), Software Engineer, Shanghai Jiaotong University


    27 Jun 2006

    Discover Lucene, a full-text information retrieval (IR) library written in the Java™ language. You can embed Lucene easily into your applications and implement indexing and searching functionality. Now it's an open source project in the popular Apache Jakarta Project family. Learn about Lucene's indexing mechanism, as well as its index file structure.
    This article introduces you to the indexing mechanism of Lucene, a popular full-text IR library written in the Java language. First, I'll demonstrate how to index your documents with Lucene, then I'll discuss how to improve the indexing performance. Finally, I'll analyze Lucene's index file structure. Keep in mind that Lucene is not a ready-to-use application, but rather an IR Library that lets you add searching and indexing functionality to your application.

    Architecture overview

    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#figure1]Figure 1[/URL] shows the indexing architecture of Lucene. Lucene uses different parsers for different document types. For HTML documents, for example, an HTML parser does some preprocessing, such as stripping the HTML tags. The parser outputs the plain text content, from which the Lucene Analyzer extracts tokens and related information, such as token frequency. The Analyzer then writes the tokens and related information into Lucene's index files.


    Figure 1. The Lucene indexing architecture
    [figure image omitted]

    Indexing your documents with Lucene

    I'll show you step by step how to create an index for your documents with Lucene. Lucene can index any data that you can convert into textual format. For example, if you want to index HTML or PDF documents, first you should extract the textual information from the documents and then send the information to Lucene for indexing. The example in this article uses Lucene to index text files with a .txt extension.

    1. Prepare the text files

    Put some text files with a .txt extension into a directory -- for example, C:\\files_to_index on the Microsoft® Windows® platform.

    2. Create the index

    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing1]Listing 1[/URL] shows you how to index the text files you prepared in the first step.


    Listing 1. Indexing your documents with Lucene
    package lucene.index;

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import java.util.Date;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /**
    * This class demonstrates the process of creating an index with Lucene
    * for text files in a directory.
    */
    public class TextFileIndexer {
    public static void main(String[] args) throws Exception{
       //fileDir is the directory that contains the text files to be indexed
       File   fileDir  = new File("C:\\files_to_index");

       //indexDir is the directory that hosts Lucene's index files
       File   indexDir = new File("C:\\luceneIndex");
       Analyzer luceneAnalyzer = new StandardAnalyzer();
       IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);
       File[] textFiles  = fileDir.listFiles();
       long startTime = new Date().getTime();

       //Add documents to the index
       for(int i = 0; i < textFiles.length; i++){
         if(textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")){
           System.out.println("File " + textFiles[i].getCanonicalPath()
                  + " is being indexed");
           Reader textReader = new FileReader(textFiles[i]);
           Document document = new Document();
           document.add(Field.Text("content",textReader));
           document.add(Field.Text("path",textFiles[i].getPath()));
           indexWriter.addDocument(document);
         }
       }

       indexWriter.optimize();
       indexWriter.close();
       long endTime = new Date().getTime();

       System.out.println("It took " + (endTime - startTime)
                  + " milliseconds to create an index for the files in the directory "
                  + fileDir.getPath());
      }
    }
          


    As [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing1]Listing 1[/URL] demonstrates, you can index your text files easily with Lucene. Let's interpret the key statements in [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing1]Listing 1[/URL], beginning with this one:

    Analyzer luceneAnalyzer = new StandardAnalyzer();


    This statement creates an instance of the StandardAnalyzer class, which is in charge of extracting tokens out of text to be indexed. StandardAnalyzer is just one implementation of the abstract class Analyzer; other implementations, such as SimpleAnalyzer, exist.

    Now, take a look at this statement:

    IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);


    This statement creates an instance of the IndexWriter class, which is a key component in the indexing process. This class can create a new index or open an existing index and add documents to it. You might notice that its constructor accepts three parameters. The first parameter specifies the directory that stores the index files; the second parameter specifies the analyzer that will be used in the indexing process; the last parameter is a Boolean variable. If true, the class creates a new index; if false, it opens an existing index.

    The following code snippet shows the process of adding one document to the index:

    Document document = new Document();
    document.add(Field.Text("content",textReader));
    document.add(Field.Text("path",textFiles[i].getPath()));
    indexWriter.addDocument(document);


    The first line creates an instance of the Document class, which consists of a collection of fields. You can think of this class as a virtual document, such as an HTML page, a PDF file, or a text file. The fields of a document are typically its attributes; for an HTML page, the fields can include title, contents, URL, and so on. Different Field types control whether a field's content is indexed, stored with the index, or both; for more information about Field, refer to Lucene's Javadoc. The second and third lines add two fields to the document, each consisting of a field name and its content. This example adds fields named "content" and "path", which store the content and the path of the text file, respectively. The last line adds the prepared document to the index.

    After you add the documents to the index, don't forget to close the index by calling this method, which guarantees that the index changes are written to the disk:

    indexWriter.close();


    Using the code in [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing1]Listing 1[/URL], you can add the text documents to the index successfully. Now, let's look at another operation on the index.

    3. Remove documents from the index

    The IndexReader class in Lucene is responsible for removing documents from the existing index, as demonstrated in [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing2]Listing 2[/URL].


    Listing 2. Removing documents from the index
    File   indexDir = new File("C:\\luceneIndex");
    IndexReader ir = IndexReader.open(indexDir);
    ir.delete(1);
    ir.delete(new Term("path","C:\\files_to_index\\lucene.txt"));
    ir.close();


    In [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing2]Listing 2[/URL], the second line initializes an instance of the IndexReader class using the static method IndexReader.open(indexDir); the parameter specifies the directory that stores the Lucene index files. IndexReader provides two methods to remove documents, as shown in the third and fourth lines. The third line deletes a document by document ID. Every document has a unique ID in the Lucene index, but because the system generates the ID, it's not convenient to use for deletion. The fourth line deletes the documents whose "path" field contains the string "C:\\files_to_index\\lucene.txt", which lets you easily specify a document to delete by its file path. Keep in mind that these operations don't physically remove the documents from the index; they just mark them as deleted by creating a file with a .del extension, so the documents are no longer searchable.

    You can easily recover the documents that have been marked as deleted, as shown in [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing3]Listing 3[/URL]. First, open the index, then call the ir.undeleteAll() method to complete the recovery process.


    Listing 3. Recovering deleted documents
    File   indexDir = new File("C:\\luceneIndex");
    IndexReader ir = IndexReader.open(indexDir);
    ir.undeleteAll();
    ir.close();


    You might want to know how to remove the documents from the index physically. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing4]Listing 4[/URL] shows the process.


    Listing 4. Removing documents from the index physically
    File   indexDir = new File("C:\\luceneIndex");
    Analyzer luceneAnalyzer = new StandardAnalyzer();
    IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,false);
    indexWriter.optimize();
    indexWriter.close();


    The third line in [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing4]Listing 4[/URL] initializes an instance of the IndexWriter class and opens the existing index specified by the first parameter. The fourth line cleans up the index. IndexWriter physically deletes from the disk the documents that have been marked as deleted.

    Lucene doesn't provide a method to update a document in the index directly. To update a document, first remove it from the index, then add the updated version to the index.
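    Following the style of the earlier listings, the remove-then-add pattern might be sketched as shown below. This is an illustrative sketch against the Lucene 1.x-era API used in this article, not a tested listing; the paths and field names are carried over from the earlier examples, and the path is stored with Field.Keyword (as in Listing 5) so that deletion by Term can match it as a single untokenized term.

```java
//Step 1: mark the old version of the document as deleted,
//using the "path" field as the key
File indexDir = new File("C:\\luceneIndex");
IndexReader ir = IndexReader.open(indexDir);
ir.delete(new Term("path","C:\\files_to_index\\lucene.txt"));
ir.close();

//Step 2: reopen the index with IndexWriter (false = open the existing index)
//and add the updated version of the document
Analyzer luceneAnalyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,false);
Reader textReader = new FileReader("C:\\files_to_index\\lucene.txt");
Document document = new Document();
document.add(Field.Text("content",textReader));
document.add(Field.Keyword("path","C:\\files_to_index\\lucene.txt"));
indexWriter.addDocument(document);
indexWriter.optimize();
indexWriter.close();
```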


    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#main]Back to top[/URL]

    Improving the indexing performance

    You can make full use of your hardware resources to improve indexing performance with Lucene. When you index a large number of documents, you'll notice that the bottleneck is the process of writing the documents into the index files on disk. To ease this problem, Lucene buffers documents in RAM. But how can you control that buffer? Fortunately, Lucene's IndexWriter class exposes three parameters that let you adjust the size of the buffer and the frequency of disk writes.

    mergeFactor

    This parameter determines how many documents are stored in each initial segment index and how often the segment indexes on disk are merged. For example, if mergeFactor is 10, a new segment is written to disk once 10 documents have accumulated in memory; and once the number of segments on disk reaches 10, they are merged into a single segment. The default value is 10, which isn't suitable when you have a large number of documents; larger values work better for batch index creation.

    minMergeDocs

    This parameter also affects the indexing performance. It determines the minimum number of documents that have to be buffered in the RAM before IndexWriter writes them to disk. The default value of this parameter is 10. If you have enough RAM, set the value of this parameter as large as possible to decrease the indexing time dramatically.

    maxMergeDocs

    This parameter determines the maximum number of documents per segment index. The default value is Integer.MAX_VALUE. Large values are better for batched indexing and speedier searches.

    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing5]Listing 5[/URL] shows the usage of these parameters. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing5]Listing 5[/URL] is similar to [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#Listing1]Listing 1[/URL] but adds the statements to set the parameters described previously.


    Listing 5. Improving indexing performance
            
    /**
    * This class demonstrates how to improve the indexing performance
    * by adjusting the parameters provided by IndexWriter.
    */
    public class AdvancedTextFileIndexer  {
      public static void main(String[] args) throws Exception{
        //fileDir is the directory that contains the text files to be indexed
        File   fileDir  = new File("C:\\files_to_index");

        //indexDir is the directory that hosts Lucene's index files
        File   indexDir = new File("C:\\luceneIndex");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        File[] textFiles  = fileDir.listFiles();
        long startTime = new Date().getTime();

        int mergeFactor = 10;
        int minMergeDocs = 10;
        int maxMergeDocs = Integer.MAX_VALUE;
        IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);        
        indexWriter.mergeFactor = mergeFactor;
        indexWriter.minMergeDocs = minMergeDocs;
        indexWriter.maxMergeDocs = maxMergeDocs;

        //Add documents to the index
        for(int i = 0; i < textFiles.length; i++){
          if(textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")){
            Reader textReader = new FileReader(textFiles[i]);
            Document document = new Document();
            document.add(Field.Text("content",textReader));
            document.add(Field.Keyword("path",textFiles[i].getPath()));
            indexWriter.addDocument(document);
          }
        }

        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();

        System.out.println("MergeFactor: " + indexWriter.mergeFactor);
        System.out.println("MinMergeDocs: " + indexWriter.minMergeDocs);
        System.out.println("MaxMergeDocs: " + indexWriter.maxMergeDocs);
        System.out.println("Document number: " + textFiles.length);
        System.out.println("Time consumed: " + (endTime - startTime) + " milliseconds");
      }
    }



    Notice that Lucene gives you enough flexibility to control the size of the buffer pool and the frequency of disk writes. Now, take a look at the key statements in this example. The following statements first create an instance of IndexWriter and then assign the defined values to the parameters of IndexWriter.

    int mergeFactor = 10;
    int minMergeDocs = 10;
    int maxMergeDocs = Integer.MAX_VALUE;
    IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);        
    indexWriter.mergeFactor = mergeFactor;
    indexWriter.minMergeDocs = minMergeDocs;
    indexWriter.maxMergeDocs = maxMergeDocs;


    Let's examine these parameters' influence on the indexing time. I prepared 10,000 documents for this test; [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table1]Table 1[/URL] shows the test results for several parameter combinations.


    Table 1. Testing results
    MergeFactor | MinMergeDocs | MaxMergeDocs      | Document number | Time consumed (seconds)
    10          | 10           | Integer.MAX_VALUE | 10,000          | 423
    100         | 10           | Integer.MAX_VALUE | 10,000          | 270
    100         | 100          | Integer.MAX_VALUE | 10,000          | 213
    100         | 100          | 100               | 10,000          | 220
    1000        | 1000         | Integer.MAX_VALUE | 10,000          | 194


    From [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table1]Table 1[/URL], you can easily see the influence these three parameters have on the indexing time. In practice, you'll most often change mergeFactor and minMergeDocs to improve indexing performance. As long as you have enough RAM, you can assign large values to mergeFactor and minMergeDocs to decrease the indexing time dramatically.


    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#main]Back to top[/URL]

    Lucene's index file structure analysis

    Before analyzing Lucene's index file structure, you should understand the inverted index concept. An inverted index is an inside-out arrangement of documents in which terms take center stage: each term points to the list of documents that contain it. By contrast, in a forward index, documents take center stage, and each document refers to the list of terms it contains. An inverted index lets you easily find which documents contain a given term, and it's the index structure Lucene uses.
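    The inverted index idea can be sketched in a few lines of plain Java, with no Lucene involved. This toy (class and method names are my own, not Lucene's) maps each term to the sorted set of document IDs that contain it:

```java
import java.util.*;

//A minimal sketch of an inverted index: each term maps to the
//sorted set of document IDs that contain it.
public class InvertedIndexDemo {
    //Build term -> sorted set of document IDs from an array of documents.
    static Map<String, SortedSet<Integer>> buildIndex(String[] docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            //Crude tokenization: lowercase, split on non-word characters
            for (String token : docs[docId].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Lucene is a Java library",
            "Java powers the Lucene index",
            "An inverted index maps terms to documents"
        };
        Map<String, SortedSet<Integer>> index = buildIndex(docs);
        System.out.println("lucene -> " + index.get("lucene"));  //prints [0, 1]
        System.out.println("index  -> " + index.get("index"));   //prints [1, 2]
    }
}
```

    A forward index would invert this map: document ID to list of terms. Answering "which documents contain 'lucene'?" is a single lookup here, whereas a forward index would require scanning every document's term list.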

    Logical view of index files

    A Lucene index is divided into segments, each of which contains some of the indexed documents and can be searched independently. Now look at Lucene's logical view of index files in [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#figure2]Figure 2[/URL]. The number of segments is determined by the number of documents to be indexed and the maximum number of documents one segment can contain.


    Figure 2. Logical view of index files
    [figure image omitted]


    Key index files in Lucene

    The following sections describe the main index files in Lucene. The tables omit some columns, but that won't affect your understanding of the index files.

    Segments file

    Each index has a single segments file that records information about the active segments: it lists the segments by name and contains the size of each segment. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table2]Table 2[/URL] describes the structure of this file.


    Table 2. Structure of the segments file
    Column name | Data type | Description
    Version     | UInt64    | The version information of the index files.
    SegCount    | UInt32    | The number of segments in the index.
    NameCounter | UInt32    | Used to generate names for new segment files.
    SegName     | String    | The name of one segment. If the index contains more than one segment, this column appears more than once.
    SegSize     | UInt32    | The size of one segment. If the index contains more than one segment, this column appears more than once.


    Fields information file

    As you know, documents in the index are composed of fields, and this file contains the fields information in the segment. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table3]Table 3[/URL] shows the structure of this file.


    Table 3. Structure of the fields information file
    Column name | Data type | Description
    FieldsCount | VInt      | The number of fields.
    FieldName   | String    | The name of one field.
    FieldBits   | Byte      | Various flags. For example, if the lowest bit is 1, this is an indexed field; if 0, a nonindexed field.


    Term information file

    This core index file stores all of the terms and related information in the index, sorted by term. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table4]Table 4[/URL] shows the structure of this file.


    Table 4. Structure of the term information file
    Column name | Data type | Description
    TIVersion   | UInt32    | The version of this file's format.
    TermCount   | UInt64    | The number of terms in this segment.
    Term        | Structure | Composed of three subcolumns: PrefixLength, Suffix, and FieldNum. Represents the contents of the term.
    DocFreq     | VInt      | The number of documents that contain the term.
    FreqDelta   | VInt      | Points into the frequency file.
    ProxDelta   | VInt      | Points into the position file.


    Frequency file

    This file contains, for each term, the list of documents that contain it, along with the term frequency in each document. If Lucene finds a term in the term information file that matches a search word, it visits the list in the frequency file to find which documents contain the term. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table5]Table 5[/URL] shows a brief structure of this file; it doesn't include all of the fields, but it can help you understand the file's usage.


    Table 5. Structure of the frequency file
    Column name | Data type | Description
    DocDelta    | VInt      | Determines both the document number and the term frequency. If the value is odd, the term frequency is 1; otherwise, the Freq column determines the term frequency.
    Freq        | VInt      | If the value of DocDelta is even, this column determines the term frequency.
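    The odd/even trick in Table 5 can be made concrete with a small decoder. The sketch below assumes one common encoding consistent with the table's description (the low bit of DocDelta flags a frequency of 1, and the remaining bits are a gap to the previous document number); the class and method names are my own, and the real on-disk format has additional details:

```java
import java.util.*;

//Decodes a toy version of the frequency file's DocDelta/Freq scheme:
//low bit of DocDelta set  -> term frequency is 1,
//low bit clear            -> the next value is the frequency.
//The high bits of DocDelta are a gap added to the previous doc number.
public class FreqDecoder {
    //Decode a stream of ints into (docId, freq) pairs.
    static List<int[]> decode(int[] stream) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        for (int i = 0; i < stream.length; i++) {
            int docDelta = stream[i];
            doc += docDelta >>> 1;           //high bits: gap to this document
            int freq;
            if ((docDelta & 1) == 1) {       //odd value: frequency is 1
                freq = 1;
            } else {                         //even value: next int is the frequency
                freq = stream[++i];
            }
            postings.add(new int[]{doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        //Encoded: doc 5 with freq 1 (5<<1 | 1 = 11),
        //then doc 8 (gap 3, 3<<1 = 6, even) with freq 4
        int[] stream = {11, 6, 4};
        for (int[] p : decode(stream))
            System.out.println("doc " + p[0] + " freq " + p[1]);
        //prints: doc 5 freq 1
        //        doc 8 freq 4
    }
}
```

    Storing gaps rather than absolute document numbers keeps the VInt values small, which is the point of the delta encoding.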

    Position file

    This file contains the list of positions at which the term occurs within each document. You can use this information to rank the search results. [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#table6]Table 6[/URL] shows the structure of this file.


    Table 6. Structure of the position file
    Column name   | Data type | Description
    PositionDelta | VInt      | The delta-encoded position at which the term occurs within a document.


    I've introduced you to the main index files in Lucene, hopefully allowing you to understand the physical storage structure of Lucene.

    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#main]Back to top[/URL]

    In conclusion

    A number of large, well-known organizations are using Lucene. For example, Lucene provides searching capabilities for the Eclipse help system, MIT's OpenCourseWare, and so on. Upon reading this article, I hope you've gained an understanding of Lucene's indexing system and will find it easy to create an index using Lucene's API.


    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#main]Back to top[/URL]

    Resources

    Learn

    [URL=http://www.ibm.com/developerworks/web/library/j-lucene/]Parsing, indexing, and searching XML with Digester and Lucene[/URL] by Otis Gospodnetic (developerWorks, June 2003): Manipulate XML in Lucene and cut your development time.


    [URL=http://www.ibm.com/developerworks/db2/library/techarticle/dm-0601chitiveli/]IBM Search and Index APIs (SIAPI) for WebSphere Information Integrator OmniFind Edition[/URL] by Srinivas Varma Chitiveli (developerWorks, January 2005): Build your own search solutions based on OmniFind technology, IBM's information retrieval library.


    [URL=http://lucene.apache.org/]Lucene's official Web site[/URL]: Explore numerous study materials for Lucene, including Javadoc and Lucene's latest release.


    [URL=http://lucene.sourceforge.net/talks/pisa/]A lecture on Lucene[/URL], presented by Doug Cutting at the University of Pisa on November 24, 2004: Explore this brief introduction to Lucene.


    [URL=http://www.amazon.com/gp/product/020139829X/104-7111632-8247925?v=glance&n=283155]Modern Information Retrieval[/URL] by Ricardo Baeza-Yates and Berthier Ribeiro-Neto: Read about changes in modern information retrieval and how to provide relevant information in this book about IR technology.


    developerWorks [URL=http://www.ibm.com/developerworks/web]Web Architecture zone[/URL]: Expand your site development skills with articles and tutorials that specialize in Web technologies.


    [URL=http://www.ibm.com/developerworks/offers/techbriefings/?S_TACT=105AGX08&S_CMP=art]developerWorks technical events and webcasts[/URL]: Stay current with jam-packed technical sessions that shorten your learning curve, and improve the quality and results of your most difficult software projects.

    Get products and technologies

    [URL=http://www.apache.org/dyn/closer.cgi/lucene/java/]Lucene[/URL]: Download the latest version.


    [URL=http://www.ibm.com/developerworks/downloads/?S_TACT=105AGX08&S_CMP=art]Free downloads and learning resources[/URL]: Improve your work with software downloads from developerWorks.

    Discuss

    [URL=http://lucene.apache.org/java/docs/mailinglists.html]Lucene mailing lists[/URL]: Ask questions, share knowledge, and discuss issues.


    [URL=http://www.ibm.com/developerworks/community/]developerWorks discussion forums[/URL]: Join and participate in the developerWorks community.


    [URL=http://www.ibm.com/developerworks/blogs/]developerWorks blogs[/URL]: Get involved in the developerWorks community.

    [URL=http://www-128.ibm.com/developerworks/library/wa-lucene/#main]Back to top[/URL]

    About the author


      Deng Peng Zhou is a graduate student from Shanghai Jiaotong University. He is interested in Java technology and modern information retrieval. You can contact him at [URL=mailto:zhoudengpeng@yahoo.com.cn?cc=htc@us.ibm.com]zhoudengpeng@yahoo.com.cn[/URL].


    [Posted 2006/7/1 20:51]
    Reply from Ambrosia (2006/7/2 9:34):

    I don't quite follow. It seems Lucene is a middle layer written in Java, a full-text retrieval engine; some articles I've read call this a traditional search engine. Is this piece just a brief introduction? Is it original, or borrowed from elsewhere?
     
    Reply from whale (2006/7/3 10:31):

    Lucene is very useful for building domain-oriented applications. Personally, I think combining Lucene with domain-specific semantic search is very meaningful and practical; we are currently researching this.
     