Apache Spark VM by Anarion Technologies
Apache Spark is a powerful, open-source, distributed computing framework designed for big data processing and analytics. Originally developed at UC Berkeley, Spark has become one of the most widely used platforms for handling large-scale data processing tasks. It provides an in-memory computing architecture that significantly accelerates data processing by reducing the need for disk I/O, making it much faster than traditional batch processing systems like Hadoop MapReduce. Spark is capable of processing both batch and real-time data, supporting diverse workloads such as data querying, machine learning, graph processing, and stream processing.
Apache Spark offers a unified analytics engine that supports multiple programming languages, including Java, Scala, Python, and R, enabling a broad range of users, from developers to data scientists, to interact with the system using their preferred language. The platform includes several key libraries, such as MLlib for machine learning, Spark SQL for querying structured data, GraphX for graph processing, and Structured Streaming for real-time stream processing.
One of Spark’s major advantages is its ability to process data in memory, which significantly speeds up iterative algorithms and complex analytics tasks. Spark does not provide its own storage layer; instead, it integrates with Hadoop’s HDFS (Hadoop Distributed File System) and can work with a variety of other data sources, including NoSQL databases, cloud storage, and relational databases. Its scalability allows it to handle datasets ranging from gigabytes to petabytes, making it a go-to solution for industries dealing with vast amounts of data.
To subscribe to this product from Azure Marketplace and initiate an instance using the Azure compute service, follow these steps:
1. Navigate to Azure Marketplace and subscribe to the desired product.
2. Search for “virtual machines” and select “Virtual machines” under Services.
3. Click on “Add” in the Virtual machines page, which will lead you to the Create a virtual machine page.
4. In the Basics tab:
- Ensure the correct subscription is chosen under Project details.
- Create a new resource group by selecting “Create new resource group” and name it “myResourceGroup”.
5. Under Instance details:
- Enter “myVM” as the Virtual machine name.
- Choose “East US” as the Region.
- Select “Ubuntu 18.04 LTS” as the Image.
- Leave other settings as default.
6. For Administrator account:
- Pick “SSH public key.”
- Provide your user name and paste your public key, ensuring no leading or trailing white spaces.
7. Under Inbound port rules > Public inbound ports:
- Choose “Allow selected ports.”
- Select “SSH (22)” and “HTTP (80)” from the drop-down.
8. Keep the remaining settings at their defaults and click on “Review + create” at the bottom of the page.
9. The “Create a virtual machine” page will display the details of the VM you’re about to create. Once ready, click on “Create.”
10. The deployment process will take a few minutes. Once it’s finished, proceed to the next section.
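For readers who prefer the command line, the portal steps above can be sketched with the Azure CLI. This is a minimal sketch, not the publisher's documented procedure: the admin user name `azureuser`, the key path `~/.ssh/id_rsa.pub`, and the function name `create_vm` are assumptions, and the image must be supplied by you because this page does not give the image URN. `az vm create` normally opens SSH (22) for Linux VMs, so only port 80 is opened explicitly here; verify the resulting rules in your own environment.

```shell
# Values taken from the portal steps above.
RESOURCE_GROUP="myResourceGroup"
VM_NAME="myVM"
LOCATION="eastus"
# Assumptions (not stated in this guide): admin user name and public key path.
ADMIN_USER="azureuser"
SSH_KEY_FILE="$HOME/.ssh/id_rsa.pub"

# create_vm IMAGE
# IMAGE is an image URN or alias that you supply.
create_vm() {
  image="$1"
  if [ -z "$image" ]; then
    echo "usage: create_vm IMAGE" >&2
    return 1
  fi

  # Step 4: create the resource group.
  az group create --name "$RESOURCE_GROUP" --location "$LOCATION"

  # Steps 5 and 6: create the VM with SSH public key authentication.
  az vm create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$VM_NAME" \
    --image "$image" \
    --admin-username "$ADMIN_USER" \
    --ssh-key-values "$SSH_KEY_FILE"

  # Step 7: allow inbound HTTP (80).
  az vm open-port --resource-group "$RESOURCE_GROUP" --name "$VM_NAME" --port 80
}

# Usage (replace the placeholder with a real image URN):
#   create_vm "<image-urn>"
```

Keeping the names in variables at the top makes it easy to reuse the same resource group and VM name in later commands.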
To connect to the virtual machine:
1. Access the overview page of your VM and click on “Connect.”
2. On the “Connect to virtual machine” page:
- Keep the default options for connecting via IP address over port 22.
- A connection command for logging in will be displayed. Click the button to copy the command. Here’s an example of what the SSH connection command looks like:
```
ssh <username>@<public-ip-address>
```
3. Use the same bash shell that you used to generate your SSH key pair. You can reopen Cloud Shell by selecting >_ again or by going to https://shell.azure.com/bash.
4. Paste the SSH connection command into the shell to initiate an SSH session.
Usage/Deployment Instructions
Anarion Technologies – Apache Spark
Note: Search for the product on Azure Marketplace and click “Get it now”.
Click on Continue
Click on Create
When creating the virtual machine, enter or select appropriate values for zone, machine type, resource group, and so on, as you prefer.
After the virtual machine is created, a “Go to Resource Group” option appears.
Click Go to Resource Group
Copy the Public IP Address
Click on the Network Security Group: spark-nsg
Click on Inbound Security Rule
Click on Add
Add Port
In the Destination Port Ranges field (default value 8080), enter 8080
Set Protocol to TCP
Set Action to Allow
Click on Add
Click on Refresh
Copy the Public IP Address
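The inbound rule above can also be created from the Azure CLI. The following is a sketch under stated assumptions: the resource group and VM names (`myResourceGroup`, `myVM`), the rule name `AllowSparkUI`, and the priority `1010` are illustrative choices that do not come from this guide; only the NSG name `spark-nsg`, the port, the protocol, and the action do.

```shell
# Assumptions: replace with the resource group and VM name you actually used.
RESOURCE_GROUP="myResourceGroup"
VM_NAME="myVM"
# From the steps above.
NSG_NAME="spark-nsg"

# Add an inbound rule allowing TCP traffic to port 8080.
open_spark_ui_port() {
  az network nsg rule create \
    --resource-group "$RESOURCE_GROUP" \
    --nsg-name "$NSG_NAME" \
    --name "AllowSparkUI" \
    --priority 1010 \
    --direction Inbound \
    --access Allow \
    --protocol Tcp \
    --destination-port-ranges 8080
}

# Print the VM's public IP address (the value copied in the last step).
show_public_ip() {
  az vm show --show-details \
    --resource-group "$RESOURCE_GROUP" \
    --name "$VM_NAME" \
    --query publicIps --output tsv
}

# Usage:
#   open_spark_ui_port
#   show_public_ip
```

The priority must not collide with an existing rule in the same NSG, so adjust it if the command reports a conflict.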
SSH into the VM and run these commands:
$ sudo su
$ apt update
$ cd /opt/spark/
Start Spark: To start Spark in standalone mode, run:
$ start-master.sh
You can now access the Spark master web UI in your browser by navigating to the IP address of your server on port 8080:
http://<Instance IP Address>:8080
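Before opening a browser, it can help to confirm from a terminal that the web UI is reachable. The sketch below assumes `curl` is installed; the function names `spark_ui_url` and `check_spark_ui` are hypothetical helpers, and `203.0.113.10` is a documentation placeholder, not a real instance address.

```shell
# Build the web UI URL for a host; the port defaults to 8080.
spark_ui_url() {
  printf 'http://%s:%s\n' "$1" "${2:-8080}"
}

# Succeed only if the UI answers with HTTP 200 within 5 seconds.
check_spark_ui() {
  status="$(curl --silent --output /dev/null --max-time 5 \
    --write-out '%{http_code}' "$(spark_ui_url "$1")")"
  [ "$status" = "200" ]
}

# Usage (replace the placeholder with your instance's public IP address):
#   check_spark_ui 203.0.113.10 && echo "Spark UI is reachable"
```

If the check fails, the usual suspects are a master process that is not running or an inbound rule for port 8080 that is missing.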
Apache Spark is used for fast, distributed data processing and analytics on large-scale datasets across clusters, enabling high-performance computations and real-time stream processing.
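As a first distributed computation, you could attach a worker to the running master and submit the Pi example that ships with Spark. This is a sketch under several assumptions: Spark is installed in `/opt/spark` with the standard `sbin/`, `bin/`, and `examples/` layout, the release provides `start-worker.sh` (older releases name it `start-slave.sh`), Python is available on the VM, and the master listens on its default port 7077. The function names are hypothetical helpers.

```shell
# Assumption: standard Spark layout under /opt/spark.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Build the master URL for a host; the port defaults to 7077.
spark_master_url() {
  printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# Start a worker on this machine and register it with the master.
start_worker() {
  "$SPARK_HOME/sbin/start-worker.sh" "$(spark_master_url "$1")"
}

# Submit the bundled Pi estimation example, split into 10 partitions.
run_pi_example() {
  "$SPARK_HOME/bin/spark-submit" \
    --master "$(spark_master_url "$1")" \
    "$SPARK_HOME/examples/src/main/python/pi.py" 10
}

# Usage (use the host shown in the spark:// URL at the top of the master web UI):
#   start_worker "<master-host>"
#   run_pi_example "<master-host>"
```

Once the worker has registered, it should appear in the Workers table of the master web UI.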
Thank you!