Spark MLlib and Installation of R in Jupyter Notebook
Category: Apache Spark and Scala, General Posted: Dec 15, 2016 By: Robert
Are you planning to learn Spark and looking for an overview of Spark MLlib, R, and Jupyter? This article introduces each of them. Let us begin with MLlib.
Introduction to MLlib
MLlib is Spark's machine learning library. It provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction. It fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Because it plugs into Hadoop workflows, it can read from any Hadoop data source, such as HDFS, HBase, or local files. On performance, the Spark project reports that MLlib's algorithms can run up to 100 times faster than MapReduce. The reason is that Spark excels at iterative computation: an algorithm can make many passes over the data cheaply, which generally yields better results than the one-pass approximations often used on Hadoop MapReduce.
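To make the point about iteration concrete, here is a minimal sketch in plain Python. It is not MLlib code, and the data, starting centers, and function names are invented for illustration. It runs a one-dimensional k-means and compares the clustering cost after a single pass over the data with the cost after several passes.

```python
def assign(points, centers):
    """Group each point under the index of its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
        groups[nearest].append(p)
    return groups


def kmeans_1d(points, centers, passes):
    """Scan the data `passes` times, refining the centers after each scan."""
    centers = list(centers)
    for _ in range(passes):
        groups = assign(points, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


# Made-up data with two obvious clusters and deliberately poor starting centers.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
start = [0.0, 1.0]

one_pass = kmeans_1d(points, start, passes=1)
ten_passes = kmeans_1d(points, start, passes=10)

print("cost after 1 pass:   ", cost(points, one_pass))
print("cost after 10 passes:", cost(points, ten_passes))
```

With this toy data the ten-pass run settles near the two cluster means and reports a much lower cost than the single pass. An engine that makes repeated scans cheap can afford algorithms of this kind.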
Why use MLlib?
MLlib is built on Spark, a fast, general-purpose engine for large-scale data processing. Spark lets you write application code in several languages, including Scala, Java, and Python.
MLlib Installation
Installing MLlib requires nothing beyond installing Spark, because MLlib ships as part of Spark.
Let us look at how to install Spark 1.1.0. First, download Apache Spark from the download link on the official website.
The download page offers prebuilt Apache Spark packages for several popular HDFS versions. If you want to build Apache Spark from scratch, see the guide on building Apache Spark with Maven. On the download page, choose the Spark release, package type, and download type.
Apache Spark runs on both Windows and Unix-like systems such as Linux and Mac OS. Running it locally on a single machine is straightforward: all you need is Java, either on the system PATH or with the JAVA_HOME environment variable pointing to a Java installation. Spark 1.1.0 requires Java 6+ and Python 2.6+, and it uses Scala 2.10 for the Scala API.
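Before launching Spark, it can help to confirm that Java is discoverable. The sketch below is one way to check, using only the Python standard library. The function name `locate_java` is hypothetical, and the lookup order (JAVA_HOME first, then PATH) is an assumption about how the launcher resolves Java.

```python
import os
import shutil


def locate_java(environ, which=shutil.which):
    """Return the path Spark would likely use for Java, or None if none is found.

    Assumption: JAVA_HOME, when set, takes precedence over the system PATH.
    """
    java_home = environ.get("JAVA_HOME")
    if java_home:
        # The Java launcher lives under the bin directory of the installation.
        return os.path.join(java_home, "bin", "java")
    # Fall back to whatever "java" resolves to on the PATH.
    return which("java")


if __name__ == "__main__":
    java = locate_java(os.environ)
    if java is None:
        print("Java not found: set JAVA_HOME or add java to the PATH")
    else:
        print("Java candidate:", java)
```

The `which` parameter exists only so the lookup can be exercised without depending on what is installed on the current machine.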
When building a machine learning model, you may find that the input dataset does not fit in the computer's memory. Developers usually turn to distributed computing tools such as Apache Spark and Hadoop to run the computation on a cluster of several machines. However, Spark can also process the input data locally on a single machine in standalone mode, and it can build models even when the dataset exceeds the machine's memory capacity.
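As an illustration of running on a single machine, the following sketch trains a k-means model through PySpark with the `local[*]` master, which runs Spark inside one process using all local cores. It assumes PySpark and NumPy are installed. The input file, its comma-separated layout, and the function names are hypothetical.

```python
def parse_point(line):
    """Turn a comma-separated line such as '1.0,2.0' into a list of floats."""
    return [float(value) for value in line.strip().split(",")]


def cluster_locally(path, k=2, max_iterations=10):
    """Train k-means on a text file of points, on this machine only."""
    # Imported lazily so parse_point stays usable without Spark installed.
    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    # "local[*]" means no cluster: Spark runs in-process on all local cores.
    sc = SparkContext("local[*]", "LocalKMeansSketch")
    try:
        # textFile reads the input in partitions rather than loading it whole.
        points = sc.textFile(path).map(parse_point).map(np.array)
        model = KMeans.train(points, k, maxIterations=max_iterations)
        return [center.tolist() for center in model.clusterCenters]
    finally:
        sc.stop()
```

A call such as `cluster_locally("points.csv")` would return the learned cluster centers as plain lists. Here `points.csv` stands for any file with one comma-separated point per line.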
Introduction to Jupyter Notebook
Jupyter Notebook is a web application that lets users create and share documents containing live code, equations, visualizations, and explanatory text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. When working on a data science problem, users often need an interactive environment in which to write code and share it with others, and a notebook meets that need. A notebook supports reproducible, transparent reporting, and it is ideal when plain text has to be combined with rich elements such as calculations and graphics.
R Notebook
Jupyter is becoming a standard choice for R users. It compares favorably with other notebooks such as Beaker and Apache Zeppelin, although alternatives such as R Markdown, Sweave, and knitr have so far been more popular in the R community.
Installation of R in Jupyter Notebook with the R Kernel
- One of the best ways to run R in Jupyter Notebook is to use the R kernel. To run R, you need to install IRkernel (the kernel for R, available on GitHub) and activate it before you start working with R.
- First, install the required packages. Do this in a regular R terminal; if you do it in the RStudio console, you will get an error.
- Next, when prompted, enter a number to choose a CRAN mirror. The installation of the required packages then continues.
- Then make the kernel visible to Jupyter.
- Finally, start Jupyter Notebook. R now appears in the kernel list whenever you create a new notebook.
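The steps above can be scripted. The sketch below drives them from Python by calling `Rscript`, under several assumptions: R and Jupyter are already installed and on the PATH, IRkernel is installed from CRAN rather than GitHub, and the CRAN mirror is fixed in the command instead of chosen interactively. The helper names are hypothetical.

```python
import shutil
import subprocess

# R statements: install the kernel package, then register it with Jupyter.
R_COMMANDS = [
    'install.packages("IRkernel", repos = "https://cloud.r-project.org")',
    "IRkernel::installspec()",
]


def build_rscript_command(r_commands=R_COMMANDS):
    """Build the argument list that runs the R statements in one Rscript call."""
    return ["Rscript", "-e", "; ".join(r_commands)]


def install_r_kernel():
    """Run the installation; raises if Rscript is missing or R reports an error."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH; install R first")
    subprocess.run(build_rscript_command(), check=True)
```

After `install_r_kernel()` succeeds, running `jupyter kernelspec list` in a terminal should show an entry for R.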
Advantages of using Jupyter
Jupyter's main purpose is to make it easy to share notebooks with other users. You can write some code, mix it with text, and publish the result as a notebook, so that readers see both the code and the output it produces when executed.
Jupyter is an ideal way to share a few experimental snippets or to publish detailed reports together with the full code and explanations. Its main advantage over other services is that it shows the output of the code in addition to the code snippets themselves.