Web app using spark: Processing hdfs data from a spring web app

In the xdata project, I had to convert a standalone Spring web app into a “big-data” web app running on a hadoop cluster. To do that, I chose apache spark and spark-hive because they provided the most practical interface. However, I could not find any documentation or tutorial on using spark this way in a java spring web application.

To test how to set up such an application, I made two getting-started prototypes:

  1. A spring+spark web app: it implements a very simple web service that uses spark to read and convert files, either on the local file system or on hdfs.
  2. A spring+spark-hive web app: it implements simple web services that generate a hive table and query its content.

The main difficulty lies in the run-time dependencies: the dependencies used for compilation (such as those provided through maven) do not work together at run-time (at the time of writing this post).

To run a standalone app, one should add `$SPARK_HOME/lib/spark-assembly-X.X.X-hadoopY.Y.Y.jar` (provided by the spark installation) to the classpath. For the spark-hive case, the datanucleus dependencies found in the spark lib directory should also be added. Because web apps are run by a servlet container, such as tomcat or jetty, these jars should be added:

  1. Either to the war file, as is recommended for web apps. It is, however, a 140 MB dependency. This is what is used in the spring+spark web app (1).
  2. Or to the classpath of the servlet container. This is what is used in the spring+spark-hive web app (2), where the jar is added to the maven jetty plugin.
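
Whichever option is used, it can help to check at startup that the expected classes are actually loadable at run-time. The following is a minimal sketch using only the JDK; the class `RuntimeClasspathCheck` is hypothetical, and the class names listed in `main` are assumptions about what a spark 1.x assembly and the datanucleus jars provide, so they may need adjusting for your versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RuntimeClasspathCheck {

    /** Returns the names that the class loader of this class cannot load. */
    public static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = RuntimeClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only check that the class can be found and loaded
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // Absent jar, or a jar whose own dependencies are broken
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed class names (not taken from the post); adapt to your setup
        List<String> required = Arrays.asList(
                "org.apache.spark.SparkConf",
                "org.apache.spark.sql.hive.HiveContext",
                "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        List<String> missing = missingClasses(required);
        if (missing.isEmpty()) {
            System.out.println("All required classes are available at run-time.");
        } else {
            System.out.println("Missing at run-time: " + missing);
        }
    }
}
```

When called from inside the web app, the check uses the web app's class loader, so it should see both the jars packaged in the war and those added to the servlet container classpath.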

For my (professional) work, I chose the second solution: I added the spark jar to the maven jetty plugin, which is used during development, and I included jetty-runner in the project, which I run with the spark jar added to the classpath (using the `--jar` option).
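
As an illustration of the jetty-runner setup, here is a small launcher sketch that assembles and starts the corresponding command. The class `JettyRunnerLauncher` and its argument order are hypothetical, and all file paths are supplied by the caller; only the `java -jar`, jetty-runner `--jar` option and trailing war file layout reflect the approach described above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JettyRunnerLauncher {

    /** Builds: java -jar <jettyRunnerJar> --jar <extraJar> ... <warFile> */
    public static List<String> buildCommand(String jettyRunnerJar,
                                            List<String> extraJars,
                                            String warFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jettyRunnerJar);
        for (String jar : extraJars) {
            // One --jar option per additional jar (e.g. the spark assembly jar)
            cmd.add("--jar");
            cmd.add(jar);
        }
        // The web app to deploy comes after the options
        cmd.add(warFile);
        return cmd;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 3) {
            System.out.println("usage: JettyRunnerLauncher <jetty-runner.jar> <app.war> <extra.jar>...");
            return;
        }
        List<String> extraJars = Arrays.asList(args).subList(2, args.length);
        List<String> cmd = buildCommand(args[0], extraJars, args[1]);
        System.out.println("Running: " + String.join(" ", cmd));
        // Start jetty-runner and forward its output to this console
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```

For the spark-hive prototype, the datanucleus jars would be passed as further extra jars in the same way.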

XData project: Data integration on hadoop cluster

The XData project is a French collaborative project between industrial partners, startups as well as big companies, and academics. Its main objective is to develop innovative commercial products built by integrating private data with open data.

I mostly work on the xdata “movement analytics” application, more specifically on:

  1. The data integration of the movement data type: any type of data that represents the movement of people, such as households or companies relocating, as well as tourist travel. The integration is done in two main parts: first, a generalized data structure was defined with a generic data descriptor to allow importing any data set containing movement data; second, an automated data query algorithm was defined to select suitable movement entries with respect to geographical and temporal area and granularity.
  2. The transfer of the standalone prototype of the web application, which uses mysql and spring technologies, to the hadoop cluster of the xdata project, in particular using spark and hive.
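
To give a rough idea of what selecting movement entries by area and period (item 1) can look like, here is a purely illustrative sketch. It is not the xdata data structure or algorithm: the `MovementEntry` fields, the area codes and the `select` method are all hypothetical, and granularity is not handled.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MovementQuery {

    /** Hypothetical record: `count` units moved from `origin` to `destination` on `date`. */
    public static final class MovementEntry {
        public final String origin;
        public final String destination;
        public final LocalDate date;
        public final int count;

        public MovementEntry(String origin, String destination, LocalDate date, int count) {
            this.origin = origin;
            this.destination = destination;
            this.date = date;
            this.count = count;
        }
    }

    /**
     * Keeps the entries whose origin and destination both belong to `areas`
     * and whose date lies within [from, to], bounds included.
     */
    public static List<MovementEntry> select(List<MovementEntry> entries,
                                             Set<String> areas,
                                             LocalDate from,
                                             LocalDate to) {
        List<MovementEntry> selected = new ArrayList<>();
        for (MovementEntry entry : entries) {
            boolean inArea = areas.contains(entry.origin) && areas.contains(entry.destination);
            boolean inPeriod = !entry.date.isBefore(from) && !entry.date.isAfter(to);
            if (inArea && inPeriod) {
                selected.add(entry);
            }
        }
        return selected;
    }
}
```

A real implementation would also need the generic data descriptor to map each imported data set onto such entries, and would run the selection on the cluster rather than on an in-memory list.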