PySpark (First Approach)

πŸ“ Description

This code demonstrates how to integrate PySpark with datasets and perform simple data transformations. It loads a sample dataset using PySpark's built-in functionalities or reads data from external sources and converts it into a PySpark DataFrame for distributed processing and manipulation.

πŸ”₯ What's PySpark?

  • It's the Python API for Apache Spark, enabling the use of Spark with Python.

πŸ”‘ Key Features:

  1. Distributed Computing: Processes large datasets across a cluster of computers for scalability.
  2. In-Memory Processing: Speeds up computation by reducing disk I/O.
  3. Lazy Evaluation: Operations are only executed when an action is triggered, optimizing performance (see the sketch after this list).
  4. Rich Libraries:
    • Spark SQL: Structured data processing (like SQL operations).
    • MLlib: Machine learning library for scalable algorithms.
    • GraphX: Graph processing (via RDD API).
    • Spark Streaming: Real-time stream processing.
  5. Compatibility: Works with Hadoop, HDFS, Hive, Cassandra, etc.
  6. Resilient Distributed Datasets (RDDs): Low-level API for distributed data handling.
  7. DataFrames & Datasets: High-level APIs for structured data with SQL-like operations.
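
A minimal sketch of lazy evaluation in practice (the sample data here is made up for illustration): the filter is only planned when it is declared, and nothing runs until the show() action is called.

```python
# Minimal lazy-evaluation sketch; the sample data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.createDataFrame(
    [("China", 1_425_000_000), ("India", 1_428_000_000), ("Chile", 19_600_000)],
    ["country", "population"],
)

big = df.filter(df.population > 1_000_000_000)  # transformation: nothing runs yet
big.show()  # action: triggers the actual computation
```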

βœ… Pros

  • Handles massive datasets efficiently.
  • Compatible with many tools (Hadoop, Cassandra, etc.).
  • Built-in libraries for SQL, Machine Learning, Streaming, Graph Processing.

❌ Cons

  • Can be memory-intensive.
  • Complex configuration for cluster environments.

πŸ”§ Install PySpark

  1. Install via pip

pip install pyspark

  2. Verify the installation

python3 -c "import pyspark; print(pyspark.__version__)"

πŸ› οΈ Code Explanation

πŸ‘©β€πŸ’» 1. data_utils.py

πŸ”§ Install the libraries we will need:

pip install kaggle pandas numpy
pip install windows-curses

πŸ“– Explanation of the Code:

curses.wrapper(kaggle_connect):

  • Lets the user search for datasets and choose one to download.
  • Saves the dataset to a specified folder and loads the first CSV file into a DataFrame.
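
The real function wraps a curses menu around this; a simplified sketch of just the download-and-load step might look like the following (the dataset slug and paths are placeholders, not the repository's actual values):

```python
# Simplified sketch of the download step behind kaggle_connect
# (the curses UI is omitted; the dataset slug is a placeholder).
import glob
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
api.dataset_download_files("owner/example-dataset", path="./data", unzip=True)

first_csv = sorted(glob.glob("./data/*.csv"))[0]  # first CSV in the folder
df = pd.read_csv(first_csv)
print(df.head())
```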

col_name(folder_path):

  • Lists all files in the ./data folder.
  • Lets the user pick a CSV file.
  • Renames the columns: converts them to lowercase, replaces spaces (" ") with underscores ("_"), and removes parentheses ("()") and periods (".").
  • Saves the modified file as modified_data.csv in the same folder.
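
The renaming step could be sketched like this (the input file name is illustrative):

```python
# Sketch of the column-renaming logic in col_name; file names are illustrative.
import pandas as pd

df = pd.read_csv("./data/example.csv")
df.columns = (
    df.columns.str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace(".", "", regex=False)
)
df.to_csv("./data/modified_data.csv", index=False)
```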

clean_data(folder_path):

  • Reads a CSV, replaces "N.A." with NaN, converts NaN to 0, and saves the result as clean_data.csv.
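
In pandas terms, that cleaning step is roughly:

```python
# Sketch of clean_data: "N.A." -> NaN -> 0.
import numpy as np
import pandas as pd

df = pd.read_csv("./data/modified_data.csv")
df = df.replace("N.A.", np.nan).fillna(0)
df.to_csv("./data/clean_data.csv", index=False)
```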

βœ… Example Output:

  • kaggle_connect (screenshot)

  • col_name (screenshot)

  • clean_data (screenshot)


πŸ‘©β€πŸ’» 2. pyspark_first_approach.py

πŸ”§ Install the libraries we will need:

pip install pyspark matplotlib fpdf

πŸ“– Explanation of the Code:

new_col(spark):

  • Loads a CSV, calculates each country's share of the world population, adds it as a new column, and saves the result as cleaned_data.csv.
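
A sketch of that calculation, assuming the population column is named population (the real column names may differ):

```python
# Sketch of new_col; the column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("first-approach").getOrCreate()
df = spark.read.csv("./data/clean_data.csv", header=True, inferSchema=True)

total = df.agg(F.sum("population")).first()[0]  # world total
df = df.withColumn(
    "population_percentage", F.round(F.col("population") / total * 100, 2)
)
df.toPandas().to_csv("./data/cleaned_data.csv", index=False)  # single-file CSV
```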

plot_data(spark):

  • This code reads a CSV with PySpark, converts the data to Pandas, and creates a bar chart showing the top 10 countries by population percentage.
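
Roughly (reusing the spark session from the previous sketch; column names are assumptions):

```python
# Sketch of plot_data: PySpark -> Pandas -> matplotlib bar chart.
import matplotlib.pyplot as plt

pdf = (
    spark.read.csv("./data/cleaned_data.csv", header=True, inferSchema=True)
    .orderBy("population_percentage", ascending=False)
    .limit(10)
    .toPandas()
)

plt.bar(pdf["country_or_dependency"], pdf["population_percentage"])
plt.xticks(rotation=45, ha="right")
plt.ylabel("% of world population")
plt.tight_layout()
plt.show()
```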

plot_pie_chart(spark):

  • Reads the cleaned CSV, converts it to Pandas, selects the top 10 countries by population percentage, and generates a pie chart with percentages.
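
The pie-chart variant only changes the plotting call (reusing pdf, the top-10 Pandas frame from the sketch above):

```python
# Sketch of plot_pie_chart, reusing the top-10 Pandas frame from above.
plt.pie(
    pdf["population_percentage"],
    labels=pdf["country_or_dependency"],
    autopct="%1.1f%%",
)
plt.title("Top 10 countries by population share")
plt.show()
```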

create_table(spark):

  • This code reads a CSV file, selects the top 10 most populated countries, and generates a PDF with a table showing the country, population, and percentage. It then saves the PDF in a folder.
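
A sketch of the table step with fpdf (cell widths, fonts, and the output path are illustrative, and the column names are assumptions):

```python
# Sketch of create_table; layout values and the output path are illustrative.
from fpdf import FPDF

report = FPDF()
report.add_page()
report.set_font("Arial", "B", 12)
for header in ("Country", "Population", "Percentage"):
    report.cell(60, 10, header, border=1)
report.ln()

report.set_font("Arial", size=11)
for row in pdf.itertuples():  # pdf = top-10 Pandas frame from the plot sketch
    report.cell(60, 10, str(row.country_or_dependency), border=1)
    report.cell(60, 10, str(row.population), border=1)
    report.cell(60, 10, f"{row.population_percentage}%", border=1)
    report.ln()

report.output("./reports/table.pdf")
```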

create_report(spark):

  • This code reads a CSV file, selects data for all countries, and generates a PDF containing a table with country names, population, urban population, and percentages. It then saves the PDF to a specified folder.

βœ… Example Output:

  • new_col(spark): pyspark (screenshot)

  • plot_data(spark): plot (screenshot)

  • plot_pie_chart(spark): pie_chart (screenshot)

  • create_table(spark): table (screenshot)

  • create_report(spark): pdf (screenshot)


πŸ‘©β€πŸ’» 3. googlesheets.py

πŸ”§ Install the libraries we will need:

pip install gspread google-auth google-auth-oauthlib google-auth-httplib2 pyspark

πŸ“– Explanation of the Code:

pyspark(client):

  • This code reads a CSV file, clears a Google Sheets document, and saves the CSV data into that sheet.
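
A sketch of that upload, assuming a service-account credential file and a sheet name (both placeholders):

```python
# Sketch of the Sheets upload; credential file and sheet name are placeholders.
import gspread
import pandas as pd

client = gspread.service_account(filename="credentials.json")
sheet = client.open("pyspark_sheets").sheet1

df = pd.read_csv("./data/cleaned_data.csv")
sheet.clear()
sheet.update([df.columns.tolist()] + df.astype(str).values.tolist())
```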

analysis(client):

  • Does the same for the analysis output: reads its CSV, clears a separate Google Sheets document, and saves the data into that sheet.

βœ… Example Output:

  • pyspark_sheets (screenshot)

  • analysis_sheets (screenshot)


πŸ‘©β€πŸ’» 4. analysis.py

πŸ”§ Install the libraries we will need:

pip install pyspark

πŸ“– Explanation of the Code:

previous_years(spark):

  • The code loads a CSV into a PySpark DataFrame, calculates population estimates for multiple years, saves the result in a new CSV, renames the final file, removes unnecessary files, and handles logs and errors.
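
One plausible sketch of the estimation step, assuming yearly_change is stored as a fraction and the estimate is stepped back one year at a time (the column names and the exact formula in the repository may differ):

```python
# Hypothetical back-cast: divide by (1 + yearly_change) once per year stepped
# back. Column names and the formula itself are assumptions.
from pyspark.sql import functions as F

df = spark.read.csv("./data/cleaned_data.csv", header=True, inferSchema=True)
for offset in range(1, 6):  # estimate the five previous years
    df = df.withColumn(
        f"population_{2023 - offset}",
        F.round(F.col("population") / (1 + F.col("yearly_change")) ** offset, 0),
    )
df.coalesce(1).write.mode("overwrite").csv("./data/previous_years", header=True)
# Spark writes a part-xxxx file inside that folder; renaming it and removing
# the leftovers is the cleanup the function handles afterwards.
```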

βœ… Example Output:

  • previous_years(spark):

  • analysis_sheets (screenshot)


πŸ‘©β€πŸ’» 5. eda.py

πŸ”§ Install the libraries we will need:

pip install pyspark matplotlib

πŸ“– Explanation of the Code:

correlation_analysis(df):

  • This function will analyze the relationship between population change and net migration.
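
In PySpark, that Pearson correlation is a single call (the column names here are assumptions):

```python
# Sketch of the correlation step; column names are assumptions.
corr = df.stat.corr("yearly_change", "net_migration")
print(f"Correlation between population change and net migration: {corr:.3f}")
```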

growth_country(df):

  • This function visualizes population growth for China between 2015 and 2023.
  • You can change "CHINA" to any country name you want to analyze, as long as the country is present in the country_or_dependency column of the DataFrame.
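
A sketch, assuming per-year columns such as population_2015 ... population_2023 exist in the DataFrame (for example, the ones produced by previous_years):

```python
# Sketch of growth_country; assumes per-year population columns exist.
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

row = df.filter(F.col("country_or_dependency") == "CHINA").first()
years = list(range(2015, 2024))
values = [row[f"population_{y}"] for y in years]

plt.plot(years, values, marker="o")
plt.title("Population growth: CHINA")
plt.xlabel("Year")
plt.ylabel("Population")
plt.show()
```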

βœ… Example Output:

  • correlation_analysis(df): eda1 (screenshot)

  • growth_country(df): eda2 (screenshot)


πŸ‘©β€πŸ’» 6. prediction.py

πŸ”§ Install the libraries we will need:

pip install pyspark

πŸ“– Explanation of the Code:

  • This code trains a linear regression model to predict the 2025 population based on features such as yearly change, fertility rate, and net migration.
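
A sketch of such a model with Spark MLlib (the feature and label column names are assumptions; the script's actual target column may differ):

```python
# Sketch of the regression step; column names are assumptions.
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(
    inputCols=["yearly_change", "fertility_rate", "net_migration"],
    outputCol="features",
)
train = assembler.transform(df).select(
    "features", F.col("population").alias("label")
)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show(5)
```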

βœ… Example Output:

  • prediction: prediction (screenshot)
