Google Colab Notebook | A Data Scientist's First Choice for Working Quickly with Python and PySpark

This article implements a simple dataset analysis using Python and PySpark code in a Google Colab notebook. When it comes to handling enormous datasets and running complicated models, Google Colab is a lifesaver for data scientists. At the same time, learning and practicing PySpark and Python can be a hurdle for data engineers. So what happens when we combine these two tools, each among the best in its own category?

Implement Python and PySpark code in a Google Colab notebook

Together, they nearly always provide the ideal answer to your data science and machine learning problems!

In this post, we’ll look at how to use a Google Colaboratory notebook to run Python and PySpark code. We’ll also carry out a few fundamental data exploration tasks that are typical of most data science problems. So let’s get to work!

Apache Spark grew out of a 2010 class project at UC Berkeley and has become a potent tool for real-time analysis and for building machine learning models [1]. Spark is a distributed data processing platform whose scalability and computing power make it well suited to big data processing; it is not a programming language like Python or Java.

PySpark is an interface for Apache Spark that enables users to examine large amounts of data in distributed systems interactively and to create Spark applications using Python APIs. Most Spark capabilities, including Spark SQL, DataFrame, Streaming, MLlib (Machine Learning Library), and Spark Core, are accessible in PySpark.
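
As a quick preview (the installation step is covered in the next section), here is a minimal sketch of how those capabilities are reached from Python through a single SparkSession; the data and column names below are purely illustrative:

from pyspark.sql import SparkSession

# One SparkSession is the entry point to the DataFrame and Spark SQL APIs
spark = SparkSession.builder.appName("QuickTour").getOrCreate()

# A tiny in-memory DataFrame (names and ages are illustrative only)
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

# DataFrame API
people.filter(people.age > 30).show()

# The same query through Spark SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()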

Installing Python and PySpark Libraries in Google Colab Notebook

First, we will install PySpark in our Google Colab notebook. To do so, run the following command in Colab:

!pip install pyspark
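
To confirm that the installation succeeded, you can optionally print the installed version:

# optional: check that PySpark is importable and print its version
import pyspark
print(pyspark.__version__)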

The libraries we need are imported below; any others will be imported later, when we use them.

# necessary libraries
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, lit, desc, monotonically_increasing_id
from pyspark.sql.window import Window
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType)

Connecting to Google Drive

When using Colab, mounting your Google Drive should be your first priority. You can do this inside the Colab notebook to access any directory on your Drive.

Mounting Google Drive matters in Colab because it gives your notebooks immediate access to the files and data saved on your Drive. Google Colab itself is a free, hosted environment for Jupyter notebooks that provides a platform for data analysis and machine learning.

When you mount your Google Drive in Colab, a virtual filesystem is created that you can access from Python and PySpark code. As a result, you can read and write files and carry out other file-related tasks from your Colab notebook exactly as you would on a local computer.

Mounting your Google Drive in Colab is also useful because it allows you to store and share large datasets with your team without worrying about storage limitations on your local machine. Additionally, it makes it easier to work on projects across different machines, as you can access your data from anywhere, as long as you have an internet connection.

Overall, mounting your Google Drive in Colab is an essential step to take if you plan to work with data and files in your Colab notebooks. It provides a convenient way to manage and share your data and can greatly enhance productivity when working on data analysis or machine learning projects.

# mount Google Drive first
from google.colab import drive
drive.mount('/content/drive')
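
Once mounted, the Drive behaves like an ordinary folder under /content/drive. The snippet below is only a sketch: the folder name "MyData" is a placeholder, so replace drive_path with a folder that actually exists in your Drive.

import os

# "MyData" is a placeholder folder name - adjust the path to your Drive
drive_path = '/content/drive/MyDrive/MyData'
if os.path.isdir(drive_path):
    print(os.listdir(drive_path))  # list the files in that folder
else:
    print("Folder not found - adjust drive_path to match your Drive.")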

At this point, I assume you have already uploaded your dataset's Excel file, either to your mounted Drive or via the upload button in the Files panel on the left side of your Colab notebook.

Creating a Spark Session and Performing Analysis

Here we assume that you already know the basics of Spark and of working with datasets, so we will share the code directly. For a line-by-line explanation, you can read our other blogs on the implementation.

# May take a little while on a local computer
spark = SparkSession.builder.appName("Basics").getOrCreate()

# Upload the file first, then read it with pandas. You can copy the path
# of the uploaded file by right-clicking it in the Files panel,
# choosing "Copy path", and pasting it here.
df = pd.read_excel('/content/IceCream.xlsx')

# quick pandas exploration of the raw data
df.head()
df.describe()
df.info()
df.columns

# create a schema for the Spark DataFrame
schema = StructType([
    StructField("SalesDate", DateType(), True),
    StructField("SalesQty", IntegerType(), True),
    StructField("SalesAmount", FloatType(), True),
    StructField("ProductCategory", StringType(), True),
    StructField("ProductSubCategory", StringType(), True),
    StructField("ProductName", StringType(), True),
    StructField("StoreName", StringType(), True),
    StructField("StoreRegion", StringType(), True),
    StructField("StoreProvince", StringType(), True),
    StructField("StoreZone", StringType(), True),
    StructField("StoreArea", StringType(), True),
    StructField("PaymentTerms", StringType(), True),
    StructField("SalesMan", StringType(), True),
    StructField("Route", StringType(), True),
    StructField("Category", StringType(), True),
])

# convert the pandas DataFrame into a Spark DataFrame using the schema
df2 = spark.createDataFrame(df, schema=schema)
df2.show()
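
With the data in a Spark DataFrame, you can already run a few exploratory aggregations. The queries below are only a sketch of the kind of analysis that typically follows: the column names come from the schema above, but the groupings you choose will depend on your own questions.

from pyspark.sql import functions as F

# total sales amount and quantity per product category
df2.groupBy("ProductCategory") \
   .agg(F.sum("SalesAmount").alias("TotalAmount"),
        F.sum("SalesQty").alias("TotalQty")) \
   .orderBy(F.desc("TotalAmount")) \
   .show()

# top five stores by revenue
df2.groupBy("StoreName") \
   .agg(F.sum("SalesAmount").alias("Revenue")) \
   .orderBy(F.desc("Revenue")) \
   .show(5)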

Star Schema Implementation