Building a Java application with Apache Nutch and Solr

(Author: Emre Çelikten)

Apache Nutch is a scalable web crawler that supports Hadoop. Apache Solr is a complete search engine that is built on top of Apache Lucene.

Let's make a simple Java application that crawls the "World" section of CNN.com with Apache Nutch and uses Solr to index the pages. We are going to use both as libraries, which means Solr must work without a servlet container or HTTP connections.

We will be using Eclipse as the IDE. This is going to be a long tutorial, so get yourself a cup of your favorite drink.

First of all, you need:
Apache Nutch 1.5. You must download the source code for Nutch.
Apache Solr 3.4.0.

For the development environment, you need a JDK, the Eclipse IDE and Ant. You can use this tutorial by Wencan for Ubuntu and similar distributions. If you wish to set them up yourself, here are the links for your convenience:

JDK.
Eclipse IDE.
Ant to make Nutch binaries.

Part 1: Extracting Nutch and Solr

Extract them to an appropriate place. Do not build anything yet. In this tutorial, /path/to/nutch and /path/to/solr will be used to refer to these folders.

Part 2: Adding EmbeddedSolrServer support to Nutch

As of writing, Nutch only supports Solr when Solr runs as a servlet. We want to create a Solr server inside our application, so we need to add some code to the Nutch sources. We will use EmbeddedSolrServer from the SolrJ library. The good part is that it shares the SolrServer base type with the HTTP client, so the same change lets us pass in any kind of SolrServer programmatically. Afterwards, we do not need to care whether it is a servlet or an embedded one. (Note that the approach we are taking here is a hack that gets the job done.)

If you want to skip this part, you can use this patch that I have created. To apply it, place it in the folder designated below and run patch < embeddedsolrserver.patch.

First, you need to navigate to /path/to/nutch/src/java/org/apache/nutch/indexer/solr.

We need to edit SolrIndexer.java and add a constructor to this class that will allow us to pass our SolrServer as a parameter.

[code lang="java"]public SolrIndexer(SolrServer solrServer) {
  super(null);
  SolrUtils.setSolrServer(solrServer);
}[/code]

Let's edit SolrUtils.java. This file contains an important method, getCommonsHttpSolrServer, which returns a SolrServer for Nutch. We need to add a method that uses the server that we have passed in the constructor. First, let's start by adding a necessary import:

[code lang="java"]
import org.apache.solr.client.solrj.SolrServer;
[/code]

and then add an attribute to SolrUtils class:

[code lang="java"]
private static SolrServer solrServer = null;
[/code]

and then paste this ugly snippet of code below:

[code lang="java"]
public static void setSolrServer(SolrServer server) {
  solrServer = server;
}

public static SolrServer getSolrServer(JobConf job) throws MalformedURLException {
  if (solrServer == null)
    return getCommonsHttpSolrServer(job);
  else
    return solrServer;
}

public static boolean isSolrServerSet() {
  return solrServer != null;
}
[/code]

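which will allow us to get our own SolrServer if we have one. Our next step is to replace every call to getCommonsHttpSolrServer with getSolrServer in the source code at /path/to/nutch/src/java/org/apache/nutch/indexer/solr. Leave SolrUtils.java itself alone: the original getCommonsHttpSolrServer method must keep its name, because getSolrServer falls back to it.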
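If you would rather not do the replacement by hand, it can be scripted. The following is a minimal sketch, not part of Nutch: the class name SolrCallRewriter and its methods are made up for this tutorial. It rewrites call sites in every .java file of a folder except SolrUtils.java, whose own getCommonsHttpSolrServer method has to keep its name because getSolrServer falls back to it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: swaps getCommonsHttpSolrServer calls for getSolrServer. */
final class SolrCallRewriter {

    private static final String OLD_CALL = "getCommonsHttpSolrServer(";
    private static final String NEW_CALL = "getSolrServer(";

    /** Plain textual replacement of the old method name at call sites. */
    static String rewrite(String source) {
        return source.replace(OLD_CALL, NEW_CALL);
    }

    /** Only Java sources are touched, and SolrUtils.java is skipped on purpose. */
    static boolean shouldRewrite(Path file) {
        String name = file.getFileName().toString();
        return name.endsWith(".java") && !name.equals("SolrUtils.java");
    }

    public static void main(String[] args) throws IOException {
        // Pass the indexer/solr source folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");

        // Collect the files first so we do not modify them while walking.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(SolrCallRewriter::shouldRewrite)
                        .collect(Collectors.toList());
        }

        for (Path file : files) {
            String before = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            String after = rewrite(before);
            if (!after.equals(before)) {
                Files.write(file, after.getBytes(StandardCharsets.UTF_8));
                System.out.println("Rewrote " + file);
            }
        }
    }
}
```

Run it with /path/to/nutch/src/java/org/apache/nutch/indexer/solr as the argument, and review the changed files before compiling.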

We can compile Nutch now. Run ant in /path/to/nutch and hope that all goes well.

Part 3: Preparing libraries, Nutch and Solr

There are some last things we need to do before making our Java application.

Go to /path/to/solr/dist and open apache-solr-3.4.0.war with your favorite archive manager. Go to /WEB-INF/lib/ and extract everything there to /path/to/solr/dist. This will allow us to include all the libraries we need in our Java application.
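If you prefer to script the extraction instead of using an archive manager, here is a small sketch that uses only the JDK's zip support. The class name WarLibExtractor is made up for this tutorial, and the sketch assumes you only want the .jar files that sit directly under WEB-INF/lib.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: copies the JARs inside a WAR's WEB-INF/lib to a folder. */
final class WarLibExtractor {

    private static final String LIB_PREFIX = "WEB-INF/lib/";

    /** Returns the bare JAR file name for entries directly under WEB-INF/lib, else null. */
    static String jarNameOf(String entryName) {
        if (!entryName.startsWith(LIB_PREFIX) || !entryName.endsWith(".jar")) {
            return null;
        }
        String name = entryName.substring(LIB_PREFIX.length());
        // Reject nested paths so nothing is written outside the target folder.
        return name.contains("/") ? null : name;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: WarLibExtractor <war file> <target folder>");
            return;
        }
        Path war = Paths.get(args[0]);
        Path targetDir = Paths.get(args[1]);
        Files.createDirectories(targetDir);

        try (ZipFile zip = new ZipFile(war.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String jarName = jarNameOf(entry.getName());
                if (jarName == null) {
                    continue; // not a library JAR
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, targetDir.resolve(jarName),
                            StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

For this tutorial the two arguments would be /path/to/solr/dist/apache-solr-3.4.0.war and /path/to/solr/dist.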

We now need to configure Nutch for crawling. Navigate to /path/to/nutch/runtime/local/conf. You will see a lot of configuration files in there. We are particularly interested in two of them. The first is nutch-default.xml, which holds the configuration settings for Nutch. The second is regex-urlfilter.txt, which contains regular expressions for the URLs that we are going to crawl. This is extremely useful, as we can limit our crawl to a domain, its subdomains, or even categories if the website's URLs expose them!

Make a new folder called nutchConf wherever you like and copy everything from /path/to/nutch/runtime/local/conf into it. Let's edit nutch-default.xml first. Open it with a text editor and search for http.agent.name. This property contains the name of our crawler and needs to be set. Set the value below it to anything you like, for example NutchCrawlingExperiment. Similarly, you can set a description for your agent in the property that follows, if you wish to do so. We also need to decrease our fetching delay a bit. Search for fetcher.server.delay and change the value below it from 5.0 to 2.0. You can set it lower if you want to, but please refrain from going too low, as it might put a load on the website depending on your connection. We are done here, so save your file and open regex-urlfilter.txt.
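To double-check an edit like this, you can read the value back. Below is a minimal sketch that uses only the JDK's XML parser rather than Nutch's own configuration classes. The class name NutchPropertyReader is made up for this tutorial, and the inline XML is a trimmed-down stand-in for nutch-default.xml that follows its configuration/property/name/value layout.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: looks up one property in Hadoop-style configuration XML. */
final class NutchPropertyReader {

    /** Returns the value of the named property, or null if it is absent. */
    static String readProperty(String xml, String propertyName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (propertyName.equals(textOf(property, "name"))) {
                    return textOf(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** Trimmed text of the first child element with the given tag, or null. */
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Stand-in for the edited file; read the real one from nutchConf instead.
        String xml =
              "<configuration>"
            + "<property><name>http.agent.name</name>"
            + "<value>NutchCrawlingExperiment</value></property>"
            + "<property><name>fetcher.server.delay</name>"
            + "<value>2.0</value></property>"
            + "</configuration>";

        System.out.println("http.agent.name = " + readProperty(xml, "http.agent.name"));
        System.out.println("fetcher.server.delay = " + readProperty(xml, "fetcher.server.delay"));
    }
}
```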

Since our goal is to crawl the World section of CNN.com, we need to write our regular expressions accordingly. To do so, we need to look at the URLs that CNN.com uses. Here are two examples:

[code]
http://edition.cnn.com/2012/06/07/world/asia/japan-british-explorer/index.html?hpt=wo_c2
http://edition.cnn.com/2012/06/08/world/asia/singapore-supertrees-gardens-bay/index.html
[/code]

We are interested in URLs like these, which follow the pattern

[code]
http://edition.cnn.com/number/number/number/world/text/text-text-.../index.html[?...]
[/code]

We can express this as the regular expression

[code]
^http://edition.cnn.com/[0-9]+/[0-9]+/[0-9]+/world/[a-z]+/[a-z]+(-[a-z]+)*/index.html.*$
[/code]

(You can learn about regular expressions here and check the correctness of your regular expressions here.)
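You can also check the expression from Java before handing it to Nutch. This is a small sketch with a made-up class name, CnnWorldUrlCheck. It uses the regular expression exactly as written above and tries it on the two example URLs plus one URL from another section, which is invented here for contrast.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: tries the tutorial's regular expression on sample URLs. */
final class CnnWorldUrlCheck {

    // The expression from the tutorial, copied verbatim.
    static final Pattern WORLD_ARTICLE = Pattern.compile(
            "^http://edition.cnn.com/[0-9]+/[0-9]+/[0-9]+/world/[a-z]+/[a-z]+(-[a-z]+)*/index.html.*$");

    /** True if the URL looks like an article in the World section. */
    static boolean isWorldArticle(String url) {
        return WORLD_ARTICLE.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://edition.cnn.com/2012/06/07/world/asia/japan-british-explorer/index.html?hpt=wo_c2",
            "http://edition.cnn.com/2012/06/08/world/asia/singapore-supertrees-gardens-bay/index.html",
            // Made-up URL from another section, used as a negative example.
            "http://edition.cnn.com/2012/06/08/sport/football/some-match-report/index.html"
        };
        for (String url : samples) {
            System.out.println(isWorldArticle(url) + "  " + url);
        }
    }
}
```

The first two URLs match and the third does not. Note that the expression only allows lowercase letters and hyphens in the last path segment, so an article whose URL contains a digit there would be skipped.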

So let's put this regular expression into our regex-urlfilter.txt file. Navigate down to the part that says # accept anything else and change +. to +^http://edition.cnn.com/[0-9]+/[0-9]+/[0-9]+/world/[a-z]+/[a-z]+(-[a-z]+)*/index.html.*$. Keep the leading +, which marks the line as an accept rule. We also need to comment out the rule under # skip URLs containing certain characters as probable queries, etc., as our links can contain question marks. Change -[?*!@=] to # -[?*!@=] by adding a # to the beginning of the line. Also append the line

[code]
+^http://edition.cnn.com/WORLD/$
[/code]

to the end as this will be our seed URL.
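To see why the order and the leading + or - matter, here is a rough sketch of how such a rule file is evaluated. It reflects my understanding of Nutch's regex URL filter (rules are tried from top to bottom, the first one that matches decides, and a URL that matches nothing is rejected). It is not Nutch's actual code, the class name UrlFilterSketch is made up, and only a few of the file's rules are reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of first-match-wins URL filtering with +/- rules. */
final class UrlFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank lines and comments are ignored
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first matching rule decides; URLs matching no rule are rejected. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** A shortened version of the rule file as edited in this tutorial. */
    static UrlFilterSketch tutorialRules() {
        return new UrlFilterSketch(Arrays.asList(
            "# skip file: ftp: and mailto: urls",
            "-^(file|ftp|mailto):",
            "# -[?*!@=]",
            "+^http://edition.cnn.com/[0-9]+/[0-9]+/[0-9]+/world/[a-z]+/[a-z]+(-[a-z]+)*/index.html.*$",
            "+^http://edition.cnn.com/WORLD/$"));
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = tutorialRules();
        System.out.println(filter.accepts("http://edition.cnn.com/WORLD/"));
        System.out.println(filter.accepts("http://edition.cnn.com/SPORT/"));
    }
}
```

With these rules the seed URL is accepted by the last line, while other section pages fall through every rule and are dropped.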

We need to create a seed URLs file to start crawling too, but we will do it later.

Now let's configure Solr. Just like we did for Nutch, make a new folder called solrConf and copy everything from /path/to/solr/example/solr/conf into it. We need to edit the schema file that defines the document structure for crawled pages. Nutch has its own schema, which is provided in the distribution, so we don't need to do much. Open schema.xml in your folder. Search for

[code]

[/code]

You can set the stored attribute to true if you want the page contents returned in search results. We need to add another field in this section. Duplicate the line above and change it to

[code]

[/code]

which will prevent an error we can encounter later. Save your file.

Part 4: Creating an Eclipse project

We can now create an Eclipse project. From the File menu, choose New->Java Project and follow the menus. After creating the project, add a package with File->New->Package. I named the package edu.cmu.sphinx.crawler. We also need to add a Java class to that package. I called it Crawler.

Let's copy the configuration and libraries into our project. Move your nutchConf and solrConf folders into your project folder. Make a new folder called lib, with two subfolders named nutch and solr. Copy everything from /path/to/nutch/runtime/local/lib to lib/nutch. Also make a new folder called plugins and copy everything from /path/to/nutch/runtime/local/plugins into it. For Solr, copy all JAR files from /path/to/solr/dist to lib/solr.

Now we need to include those libraries in our project to be able to use them. From the Project menu, click on Properties. Go to Java Build Path. Select Libraries and then Add JARs.... Go to the lib/nutch folder and add everything except plugins. Do the same for lib/solr. Remove duplicates in the list. You also need to remove either slf4j-jdk14-1.6.1.jar or slf4j-log4j12-1.6.1.jar, so that only one SLF4J binding remains on the classpath.

Almost done. Click on Add Class Folder... and add nutchConf and solrConf folders from your project. Go to Order and Export, find the entries for nutchConf and solrConf and move them to the top.

Lastly, let's create a file for the list of initial URLs to crawl. Make a new folder in your project named urls and create a text file called seed.txt in it. Put

[code]
http://edition.cnn.com/WORLD/
[/code]

inside of the text file.

Part 5: Coding time!

Finally we're there! Just paste this snippet into your Crawler.java and run.

[code lang="java"]
package edu.cmu.sphinx.crawler;

import java.io.IOException;
import java.util.StringTokenizer;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Crawl;
import org.apache.nutch.indexer.solr.SolrIndexer;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.core.CoreContainer;
import org.xml.sax.SAXException;

public class Crawler {

	public static void main(String[] args) {

		/*
		 * Arguments for crawling.
		 *
		 * -dir dir names the directory to put the crawl in.
		 *
		 * -threads threads determines the number of threads that will fetch in
		 * parallel.
		 *
		 * -depth depth indicates the link depth from the root page that should
		 * be crawled.
		 *
		 * -topN N determines the maximum number of pages that will be retrieved
		 * at each level up to the depth.
		 */

		String crawlArg = "urls -dir crawl -threads 5 -depth 3 -topN 20";

		// Run Crawl tool

		try {
			ToolRunner.run(NutchConfiguration.create(), new Crawl(),
					tokenize(crawlArg));
		} catch (Exception e) {
			e.printStackTrace();
			return;
		}

		// Let's dump the segments we have to see what we have obtained. You
		// need to refresh your workspace to see the new folders. You can see
		// plaintext by going into dump folder and examining "dump".
		String dumpArg = "-dump crawl/segments/* dump -nocontent -nofetch -nogenerate -noparse -noparsedata";

		// Run dump
		try {
			SegmentReader.main(tokenize(dumpArg));
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
			return;
		}

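		// Point Solr at its home directory (the folder that contains conf/).
		// This path is the author's; change it to match your own setup.
		System.setProperty("solr.solr.home", "/home/emre/solr");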
		CoreContainer.Initializer initializer = new CoreContainer.Initializer();
		CoreContainer coreContainer;

		try {
			coreContainer = initializer.initialize();
		} catch (IOException | ParserConfigurationException | SAXException e) {
			// Solr could not load its configuration; nothing more to do.
			e.printStackTrace();
			return;
		}
		EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");

		// Arguments for indexing
		String indexArg = "local crawl/crawldb -linkdb crawl/linkdb crawl/segments/*";

		// Run indexing tool
		try {
			ToolRunner.run(NutchConfiguration.create(),
					new SolrIndexer(server), tokenize(indexArg));
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
			return;
		}

		// Let's query for something!
		SolrQuery query = new SolrQuery();
		query.setQuery("title:queen"); // Search for "queen" in the title field
		query.addSortField("title", SolrQuery.ORDER.asc);
		QueryResponse rsp;
		try {
			rsp = server.query(query);
		} catch (SolrServerException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
			return;
		}

		// Display the results in the console
		SolrDocumentList docs = rsp.getResults();
		for (int i = 0; i < docs.size(); i++) {
			System.out.println(docs.get(i).get("title").toString() + " Link: "
					+ docs.get(i).get("url").toString());
		}

		// Shut down the container so JVM ends.
		coreContainer.shutdown();

	}

	/**
	 * Helper function to convert a string into an array of strings by
	 * separating them using whitespace.
	 *
	 * @param str
	 *            string to be tokenized
	 * @return an array of strings containing one word each
	 */
	public static String[] tokenize(String str) {
		StringTokenizer tok = new StringTokenizer(str);
		String[] tokens = new String[tok.countTokens()];
		int i = 0;
		while (tok.hasMoreTokens()) {
			tokens[i] = tok.nextToken();
			i++;
		}

		return tokens;

	}

}
[/code]
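As a side note, the tokenize helper above can be written more compactly with String.split. The sketch below is a roughly equivalent alternative under the assumption that arguments are separated by ordinary whitespace. The class name ArgSplitter is made up, and nothing in the tutorial depends on it.

```java
import java.util.Arrays;

/** Hypothetical alternative to the tokenize helper, based on String.split. */
final class ArgSplitter {

    /** Splits a command line on runs of whitespace; blank input gives an empty array. */
    static String[] tokenize(String str) {
        String trimmed = str.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String crawlArg = "urls -dir crawl -threads 5 -depth 3 -topN 20";
        // Prints [urls, -dir, crawl, -threads, 5, -depth, 3, -topN, 20]
        System.out.println(Arrays.toString(tokenize(crawlArg)));
    }
}
```

Trimming first matters because split would otherwise produce an empty first element for input that starts with a space.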

Congratulations! Now you have an application that crawls a subset of articles from CNN.com, indexes them, and searches the index. You can try adding new categories by changing the regular expressions. Or maybe you can try adding support for another news site!

For more information, you can explore Nutch wiki and Solr wiki.