
How to keep your content repository and Solr in synch using Camel

With recent contributions to Camel, the camel-jcr component now has a consumer which allows monitoring a Java Content Repository for changes. If your JCR supports OPTION_OBSERVATION_SUPPORTED, the consumer will register an EventListener and get notified about all kinds of events. Chances are you are not interested in every event from the whole repository, in which case it is possible to narrow down the notifications by specifying the path of interest, event types, node UUIDs, node types, etc.
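As a rough illustration, a narrowed-down consumer endpoint could look like the sketch below. The deep and eventTypes options are the ones used in the route further down; the nodeTypeNames option name is my assumption here, so check the camel-jcr documentation of your Camel version for the exact option names.

from("jcr://username:password@repository/path/site"
        + "?deep=true"               // also receive events for children of /path/site
        + "&eventTypes=3"            // NODE_ADDED (1) | NODE_REMOVED (2)
        + "&nodeTypeNames=nt:file")  // assumed option name: restrict events to a node type
    .log("JCR event received: ${body}");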

How can this consumer be useful? (hmm, you tell me) Let's say we have a CMS and we want to keep our external Solr index in sync with the content updates. So whenever a new node is added to the content repository, all of its properties get indexed in Solr, and if the node is deleted from the content repository, the corresponding document is removed from Solr.

Here is a Camel route that will listen for changes under the /path/site folder and all its children. The route will get notified only about two kinds of events, NODE_ADDED and NODE_REMOVED, because the value of the eventTypes option is a bit mask of the event types of interest (in this case 3, masking 1 and 2 respectively).
from("jcr://username:password@repository/path/site?deep=true&eventTypes=3")
   .split(body())
   .choice()

   .when(script("beanshell", "request.getBody().getType() == 1"))
   .to("direct:index")

   .when(script("beanshell", "request.getBody().getType() == 2"))
   .to("direct:delete")

   .otherwise()
   .log("Event type not recognized" + body().toString());
The route splits the event list into a separate message per event and, depending on the event type, sends node creation events to the direct:index route and node deletion events to the direct:delete route.
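For reference, the bit mask is just a bitwise OR of the javax.jcr.observation.Event constants, so instead of hard-coding 3 you can compute it:

import javax.jcr.observation.Event;

// The eventTypes value is a bitwise OR of the Event constants of interest.
int eventTypes = Event.NODE_ADDED | Event.NODE_REMOVED; // 1 | 2 == 3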

The delete route is a simple one: it sets the Solr operation to DELETE_BY_ID in the message header and the node identifier in the message body, which in our case also serves as the uniqueKey in the Solr schema, followed by a Solr commit.
from("direct:delete")
   .setHeader(SolrConstants.OPERATION, constant(SolrConstants.OPERATION_DELETE_BY_ID))
   .setBody(script("beanshell", "request.getBody().getIdentifier()"))

   .log("Deleting node with id: ${body}")
   .to(SOLR_URL)

   .setHeader("SolrOperation", constant("COMMIT"))
   .to(SOLR_URL);
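If you would rather avoid the BeanShell dependency, the identifier extraction can also be done with a plain Processor. This is just an equivalent sketch of the same delete route, not the code from the example project (Event is javax.jcr.observation.Event, Processor and Exchange come from org.apache.camel):

from("direct:delete")
   .setHeader(SolrConstants.OPERATION, constant(SolrConstants.OPERATION_DELETE_BY_ID))
   // the body is the JCR event; replace it with the identifier of the affected node
   .process(new Processor() {
       public void process(Exchange exchange) throws Exception {
           Event event = exchange.getIn().getBody(Event.class);
           exchange.getIn().setBody(event.getIdentifier());
       }
   })
   .log("Deleting node with id: ${body}")
   .to(SOLR_URL)
   .setHeader(SolrConstants.OPERATION, constant("COMMIT"))
   .to(SOLR_URL);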
The indexing part consists of two routes, where the nodeRetriever route retrieves the node from the content repository using the identifier carried by the update event:
from("direct:nodeRetriever")
   .setHeader(JcrConstants.JCR_OPERATION, constant(JcrConstants.JCR_GET_BY_ID))
   .setBody(script("beanshell", "request.getBody().getIdentifier()"))

   .log("Reading node with id: ${body}")
   .to("jcr://admin:admin@repository");
After the node is retrieved from the repository using the Content Enricher EIP, a processor extracts the node properties and sets them as Camel message headers so that they get indexed as Solr document fields.
from("direct:index")
   .enrich("direct:nodeRetriever", nodeEnricher)
   .process(jcrSolrPropertyMapper)

   .log("Indexing node with id: ${body}")
   .setHeader("SolrOperation", constant("INSERT"))
   .to(SOLR_URL);
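The nodeEnricher and jcrSolrPropertyMapper used above are part of the example project; below is only a rough sketch of what they might look like, assuming the enricher keeps the retrieved javax.jcr.Node as the message body and the mapper copies single-valued node properties into "SolrField."-prefixed headers, which the camel-solr producer maps to document fields on insert.

import javax.jcr.Node;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.processor.aggregate.AggregationStrategy;

// Keep the Node returned by direct:nodeRetriever as the body of the original exchange.
AggregationStrategy nodeEnricher = new AggregationStrategy() {
    public Exchange aggregate(Exchange original, Exchange resource) {
        original.getIn().setBody(resource.getIn().getBody(Node.class));
        return original;
    }
};

// Copy single-valued node properties into SolrField.* headers so they end up as document fields.
Processor jcrSolrPropertyMapper = new Processor() {
    public void process(Exchange exchange) throws Exception {
        Node node = exchange.getIn().getBody(Node.class);
        exchange.getIn().setHeader("SolrField.id", node.getIdentifier());
        for (PropertyIterator it = node.getProperties(); it.hasNext();) {
            Property property = it.nextProperty();
            if (!property.isMultiple()) {
                exchange.getIn().setHeader("SolrField." + property.getName(), property.getString());
            }
        }
    }
};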
You can find the complete working example on GitHub. In case your CMS is not JCR but CMIS compliant, have a look at this cmis component on my GitHub account.
