Note the three plugins I added to the build directive:

- Setting the compilation level to Java 8 enables lambda support.
- Defining a main class that references the WordCount class instructs Maven to create an executable JAR for WordCount.
- Having Maven copy all dependencies to the target/lib directory enables the executable JAR file to run.

With that out of the way, Listing 2 shows the source code for the WordCount application. Note that it shows how to write the Spark code in both Java 7 and Java 8.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

/**
 * Sample Spark application that counts the words in a text file
 */
public class WordCount
{
    public static void wordCountJava7( String filename )
    {
        // Define a configuration to use to interact with Spark
        SparkConf conf = new SparkConf().setMaster( "local" ).setAppName( "Word Count App" );

        // Create a Java version of the Spark Context from the configuration
        JavaSparkContext sc = new JavaSparkContext( conf );

        // Load the input data, which is a text file read from the command line
        JavaRDD<String> input = sc.textFile( filename );

        // Java 7 and earlier: split the input into words
        JavaRDD<String> words = input.flatMap( new FlatMapFunction<String, String>() {
            public Iterable<String> call( String s ) {
                return Arrays.asList( s.split( " " ) );
            }
        } );

        // Java 7 and earlier: transform the collection of words into pairs (word and 1)
        JavaPairRDD<String, Integer> counts = words.mapToPair( new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call( String s ) {
                return new Tuple2<String, Integer>( s, 1 );
            }
        } );

        // Java 7 and earlier: count the occurrences of each word
        JavaPairRDD<String, Integer> reducedCounts = counts.reduceByKey( new Function2<Integer, Integer, Integer>() {
            public Integer call( Integer x, Integer y ) {
                return x + y;
            }
        } );

        // Save the word count back out to a text file, causing evaluation.
        reducedCounts.saveAsTextFile( "output" );
    }

    public static void wordCountJava8( String filename )
    {
        // Define a configuration to use to interact with Spark
        SparkConf conf = new SparkConf().setMaster( "local" ).setAppName( "Word Count App" );

        // Create a Java version of the Spark Context from the configuration
        JavaSparkContext sc = new JavaSparkContext( conf );

        // Load the input data, which is a text file read from the command line
        JavaRDD<String> input = sc.textFile( filename );

        // Java 8 with lambdas: split the input string into words
        JavaRDD<String> words = input.flatMap( s -> Arrays.asList( s.split( " " ) ) );

        // Java 8 with lambdas: transform the collection of words into pairs (word and 1)
        // and then count them
        JavaPairRDD<String, Integer> counts =
            words.mapToPair( t -> new Tuple2<String, Integer>( t, 1 ) )
                 .reduceByKey( (x, y) -> x + y );

        // Save the word count back out to a text file, causing evaluation.
        counts.saveAsTextFile( "output" );
    }
}
```
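Listing 2 leaves out the application's entry point. A minimal main method consistent with the description that follows might look like this sketch, which would sit inside the WordCount class; the argument check and usage message are illustrative assumptions rather than part of the original listing:

```java
public static void main( String[] args )
{
    // Expect the source text file name as the first command-line argument
    // (the usage message here is an illustrative assumption)
    if( args.length < 1 ) {
        System.err.println( "Usage: WordCount <file>" );
        System.exit( 1 );
    }

    // Delegate to the Java 8 implementation of the word count
    wordCountJava8( args[0] );
}
```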
The WordCount application's main method accepts the source text file name from the command line and then invokes the wordCountJava8() method. The class defines two helper methods - wordCountJava7() and wordCountJava8() - that perform the same function (counting words), first in Java 7's notation and then in Java 8's. The wordCountJava7() method is more explicit, so we'll start there.

We first create a SparkConf object that points to our Spark instance, which in this case is "local." This means that we're going to be running Spark locally in our Java process space. In order to start interacting with Spark, we need a SparkContext instance, so we create a new JavaSparkContext that is configured to use our SparkConf. The first step is to leverage the JavaSparkContext's textFile() method to load our input from the specified file. This method reads the file from either the local file system or from a Hadoop Distributed File System (HDFS) and returns a resilient distributed dataset (RDD) of Strings. An RDD is Spark's core data abstraction and represents a distributed collection of elements.

Next, we use flatMap() to split each line of input into words and mapToPair() to transform the collection of words into pairs (word and 1). We then call reduceByKey() to reduce our words into a tuple pair that contains the word and the count of its occurrences. Finally, saveAsTextFile() saves the word count back out to a text file, causing evaluation. The wordCountJava8() method performs the same steps, but Java 8 lambdas replace the anonymous inner classes: one lambda splits the input string into words, and a chained mapToPair() and reduceByKey() transform the words into pairs and count them.
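All of the transformations in the listing - flatMap(), mapToPair(), and reduceByKey() - are lazy: Spark records the lineage of operations but computes nothing until an action runs, which is why the comment on saveAsTextFile() says "causing evaluation." If you want to inspect results in the driver instead of writing them to disk, an action such as take() works too. Here is a minimal sketch, assuming the counts variable from wordCountJava8() and an added java.util.List import:

```java
// take() is an action: it forces Spark to evaluate the lazy pipeline
// and return the first few (word, count) tuples to the driver
List<Tuple2<String, Integer>> sample = counts.take( 10 );
for( Tuple2<String, Integer> entry : sample ) {
    System.out.println( entry._1() + " -> " + entry._2() );
}
```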
