A few months ago I decided to jump in and start learning Spark. After hearing words like Hadoop, Kafka, Yarn, Hive, etc., I decided to find out what all the buzz was about.
As usual when jumping into a new technology, I ran into quite a few hiccups. The best way to learn anything is to just go for it and get your hands dirty, but people can often get frustrated and walk away if they can't get "Hello, world!" going quickly.
It's my hope that this guide will help people get over those humps quickly and start coding. Happy learning.
Installing IntelliJ IDEA
To start we need an IDE that supports Spark. I went with IntelliJ IDEA. Just head over to the IntelliJ website and choose the Community edition.
This will download the latest IntelliJ tar.gz file.
To install, open a terminal and cd into the Downloads folder. Once there, run:
tar -xvf <tar.gz file>
At the time of writing, that would be:
tar -xvf ideaIC-2021.1.tar.gz
Note: Here we’re using
f- to specify the file
If you don’t want to see the output you can just use -xf
Once it is done extracting you will see a new directory in the Downloads folder. We then need to move this to the /opt directory (the extracted directory's name will match the version you downloaded):
sudo mv <extracted directory> /opt/idea
We can now launch the IntelliJ IDE with:
/opt/idea/bin/idea.sh
You should see a screen like the one below.
Decide if you want to opt in to data sharing.
Setting up the environment
Now that that’s out of the way we need to add the Scala extension.
Let that install and then we will need to restart the IDE.
Once it has restarted, choose New Project.
Since this will be a Spark project, we need to choose Scala and sbt.
Two important things to note: Spark 3.1.1 runs on Java 8 or 11, and it needs a Scala version in the 2.12.x line (that is, 2.12 followed by any patch number).
So for the JDK we will select Download JDK and switch the version to 11.
I think any of the vendor options for 11 will work. I chose AdoptOpenJDK.
Give your project a name and optionally a package prefix.
Also, change the Scala Version to something that is 2.12.X
Ahhh!! It's like a pop-up ad bonanza. Yeah, just X out of all the pop-up stuff.
Okay, we're in the home stretch. One more thing before we start coding. In the build.sbt file add this block:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.spark" %% "spark-sql" % "3.1.1"
)
This tells sbt to go out and get Spark so the code we write can compile and run. Our build.sbt file should look like this.
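For reference, assuming the defaults IntelliJ generates (the project name here is just a placeholder for whatever you chose, and your exact version lines may differ), the whole build.sbt might look something like:

```scala
// Project metadata generated by the New Project wizard
name := "spark-tutorial"

version := "0.1"

// Must stay in the 2.12.x line for Spark 3.1.1
scalaVersion := "2.12.13"

// Pull in Spark core and Spark SQL
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.spark" %% "spark-sql" % "3.1.1"
)
```

After saving, let IntelliJ reload the sbt project so it downloads the dependencies.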
To create our first file, expand src >> main >> scala, right-click the scala directory, choose New > Scala Class, and select Object.
You should now have a file that looks like this:
There are a few things we need to add to get started.
Add these lines between the package and object lines at the top of the file:
import org.apache.log4j._
import org.apache.spark.sql._
log4j lets you set the logger level
spark.sql gives you access to Spark SQL
Define a main method in your object and set the logger level to error with the line below.
Logger.getLogger("org").setLevel(Level.ERROR)
The file should now look something like this.
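Putting it all together, the file might look something like this minimal sketch (the package line and object name are whatever you chose when creating the project; SparkSetupTest and the little range count are just placeholders to smoke-test the setup):

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object SparkSetupTest {
  def main(args: Array[String]): Unit = {
    // Quiet Spark's chatty INFO logging down to errors only
    Logger.getLogger("org").setLevel(Level.ERROR)

    // local[*] runs Spark on all local cores -- fine for learning
    val spark = SparkSession.builder()
      .appName("SparkSetupTest")
      .master("local[*]")
      .getOrCreate()

    // A tiny smoke test: count the numbers 0 through 99
    println(spark.range(100).count()) // prints 100

    spark.stop()
  }
}
```

If this runs and prints 100, your environment is working end to end.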
That's it. You can now start coding and running your Spark programs in either Scala or PySpark. The coding part goes beyond the scope of this article, but hopefully, if you are just getting started, this will ease the pain of setting up your environment.
You can create a directory under the main project and call it data in order to access data files with a simple relative path like "data/YourDataFile".
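For example, from the project root (sample.csv is just a hypothetical file name):

```shell
# Create the data directory at the project root and drop a file in it
mkdir -p data
printf "id,name\n1,spark\n" > data/sample.csv

# Code run from the project root can now reach it as data/sample.csv
cat data/sample.csv
```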
If you want to be able to launch the IntelliJ IDE from the menu button instead of running /opt/idea/bin/idea.sh every time, go to Tools >> Create Desktop Entry.
Now when you start typing "IntelliJ" from the start button you can simply select IntelliJ IDEA to launch it.