This article walks through a simple dataset analysis using Python and PySpark code in a Google Colab notebook. When it comes to handling enormous datasets and running complicated models, Google Colab is a lifesaver for data scientists. At the same time, PySpark is often a hurdle for data engineers who are still learning and practicing it. So what happens when we combine these two, each a top player in its own category?
We get a setup that can handle nearly all of your data science and machine learning problems!
In this post, we’ll look at how to run Python and PySpark code in a Google Colaboratory notebook. We’ll also carry out a few fundamental data exploration tasks that are typical of most data science problems. So let’s get to work!
A class project at UC Berkeley in 2010 produced the research that would become Apache Spark, a powerful tool for real-time analysis and for building machine learning models. Spark is a distributed data processing platform whose scalability and computing power make it well suited to big data workloads. It is not a programming language like Python or Java.
PySpark is an interface for Apache Spark that enables users to examine large amounts of data in distributed systems interactively and to create Spark applications using Python APIs. Most Spark capabilities, including Spark SQL, DataFrame, Streaming, MLlib (Machine Learning Library), and Spark Core, are accessible in PySpark.
Installing Python and PySpark Libraries in a Google Colab Notebook
First of all, we will install PySpark in our Google Colab notebook. To do so, run the following command in a Colab cell:
!pip install pyspark
All the necessary libraries are listed here; we will import the remaining ones as we use them later.
# Necessary libraries
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, lit, desc, monotonically_increasing_id
from pyspark.sql.window import Window
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DateType, FloatType
)
Connecting to Google Drive
When using Colab, mounting your Google Drive should be your first priority. You can do this inside the Colab notebook to access any directory on your Drive.
Mounting Google Drive matters because it gives your Colab notebooks immediate access to files and data saved on your Google Drive. Google Colab itself is a free, hosted Jupyter notebook environment for data analysis and machine learning.
When you mount your Google Drive in Colab, a virtual filesystem is created that you can access from Python and PySpark code. As a result, you can read and write data, open files, and carry out other file-related tasks from your Google Colab notebook exactly as you would on a local computer.
Mounting your Google Drive in Colab is also useful because it allows you to store and share large datasets with your team without worrying about storage limitations on your local machine. Additionally, it makes it easier to work on projects across different machines, as you can access your data from anywhere, as long as you have an internet connection.
Overall, mounting your Google Drive in Colab is an essential step to take if you plan to work with data and files in your Colab notebooks. It provides a convenient way to manage and share your data and can greatly enhance productivity when working on data analysis or machine learning projects.
# Mount Google Drive first
from google.colab import drive
drive.mount('/content/drive')
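Once the mount completes, your Drive contents appear under a fixed path in the Colab filesystem. The sketch below builds a path to a dataset file on Drive; the `datasets` folder and file name are hypothetical examples, not paths from this article.

```python
from pathlib import Path

# After drive.mount, Drive contents appear under this root in Colab
# ("MyDrive" is the standard top-level folder of a mounted Drive)
DRIVE_ROOT = Path("/content/drive/MyDrive")

# Path to a hypothetical dataset file stored on Drive
dataset_path = DRIVE_ROOT / "datasets" / "IceCream.xlsx"
print(dataset_path)
```

You can then pass `str(dataset_path)` to `pd.read_excel` just like a local path.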
Now I assume you have already uploaded your dataset Excel file using the upload button in the top-left corner of your Google Colab notebook.
Creating Spark Session and Performing Analysis
Here we assume that you already know the basics of Spark and of working with datasets, so we will share the code directly. For a line-by-line explanation, you can read our other blogs mentioned here:
# May take a little while on a local computer
spark = SparkSession.builder.appName("Basics").getOrCreate()

# Upload the file first; you can copy the path of the uploaded file by
# right-clicking it in the Colab file browser and choosing "Copy path"
df = pd.read_excel('/content/IceCream.xlsx')

# Quick pandas exploration
df.head()
df.describe()
df.info()
df.columns

# Create a schema for your DataFrame
schema = StructType([
    StructField("SalesDate", DateType(), True),
    StructField("SalesQty", IntegerType(), True),
    StructField("SalesAmount", FloatType(), True),
    StructField("ProductCategory", StringType(), True),
    StructField("ProductSubCategory", StringType(), True),
    StructField("ProductName", StringType(), True),
    StructField("StoreName", StringType(), True),
    StructField("StoreRegion", StringType(), True),
    StructField("StoreProvince", StringType(), True),
    StructField("StoreZone", StringType(), True),
    StructField("StoreArea", StringType(), True),
    StructField("PaymentTerms", StringType(), True),
    StructField("SalesMan", StringType(), True),
    StructField("Route", StringType(), True),
    StructField("Category", StringType(), True),
])

df2 = spark.createDataFrame(df, schema=schema)
df2.show()