PySpark (First Approach)

πŸ“ Description

This code demonstrates how to integrate PySpark with datasets and perform simple data transformations. It loads a sample dataset using PySpark's built-in functionalities or reads data from external sources and converts it into a PySpark DataFrame for distributed processing and manipulation.

πŸ”₯ What's PySpark?

  • It's the Python API for Apache Spark, enabling the use of Spark with Python.

πŸ”‘ Key Features:

  1. Distributed Computing: Processes large datasets across a cluster of computers for scalability.
  2. In-Memory Processing: Speeds up computation by reducing disk I/O.
  3. Lazy Evaluation: Operations are only executed when an action is triggered, optimizing performance (see the sketch after this list).
  4. Rich Libraries:
    • Spark SQL: Structured data processing (like SQL operations).
    • MLlib: Machine learning library for scalable algorithms.
    • GraphX: Graph processing (via RDD API).
    • Spark Streaming: Real-time stream processing.
  5. Compatibility: Works with Hadoop, HDFS, Hive, Cassandra, etc.
  6. Resilient Distributed Datasets (RDDs): Low-level API for distributed data handling.
  7. DataFrames & Datasets: High-level APIs for structured data with SQL-like operations.
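
A minimal sketch of lazy evaluation in practice (the sample data here is made up for illustration): the filter is only planned when it is declared, and nothing runs until the show() action is called.

```python
# Minimal lazy-evaluation sketch; the sample data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.createDataFrame(
    [("China", 1_425_000_000), ("India", 1_428_000_000), ("Chile", 19_600_000)],
    ["country", "population"],
)

big = df.filter(df.population > 1_000_000_000)  # transformation: nothing runs yet
big.show()  # action: triggers the actual computation
```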

βœ… Pros

  • Handles massive datasets efficiently.
  • Compatible with many tools (Hadoop, Cassandra, etc.).
  • Built-in libraries for SQL, Machine Learning, Streaming, Graph Processing.

❌ Cons

  • Can be memory-intensive.
  • Complex configuration for cluster environments.

πŸ”§ Install PySpark

  1. Install via pip

pip install pyspark

  2. Verify the installation

python3 -c "import pyspark; print(pyspark.__version__)"

πŸ› οΈ Code Explanation

πŸ‘©β€πŸ’» 1. data_utils.py

πŸ”§ Install the libraries we will need:

pip install kaggle pandas numpy
pip install windows-curses

πŸ“– Explanation of the Code:

curses.wrapper(kaggle_connect):

  • Lets the user search for datasets and choose one to download.
  • Saves the dataset to a specified folder and loads the first CSV file into a DataFrame.
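
The real function wraps a curses menu around this; a simplified sketch of just the download-and-load step might look like the following (the dataset slug and paths are placeholders, not the repository's actual values):

```python
# Simplified sketch of the download step behind kaggle_connect
# (the curses UI is omitted; the dataset slug is a placeholder).
import glob
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
api.dataset_download_files("owner/example-dataset", path="./data", unzip=True)

first_csv = sorted(glob.glob("./data/*.csv"))[0]  # first CSV in the folder
df = pd.read_csv(first_csv)
print(df.head())
```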

col_name(folder_path):

  • Lists all files in the ./data folder.
  • Lets the user pick a CSV file.
  • Renames the columns: converts them to lowercase, replaces spaces (" ") with underscores ("_"), and removes parentheses ("()") and periods (".").
  • Saves the modified file as modified_data.csv in the same folder.
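
The renaming step could be sketched like this (the input file name is illustrative):

```python
# Sketch of the column-renaming logic in col_name; file names are illustrative.
import pandas as pd

df = pd.read_csv("./data/example.csv")
df.columns = (
    df.columns.str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace(".", "", regex=False)
)
df.to_csv("./data/modified_data.csv", index=False)
```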

clean_data(folder_path):

  • Reads a CSV, replaces "N.A." with NaN, converts NaN to 0, and saves the result as clean_data.csv.
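
In pandas terms, that cleaning step is roughly:

```python
# Sketch of clean_data: "N.A." -> NaN -> 0.
import numpy as np
import pandas as pd

df = pd.read_csv("./data/modified_data.csv")
df = df.replace("N.A.", np.nan).fillna(0)
df.to_csv("./data/clean_data.csv", index=False)
```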

βœ… Example Output:

  • kaggle_connect (screenshot)

  • col_name (screenshot)

  • clean_data (screenshot)


πŸ‘©β€πŸ’» 2. pyspark_first_approach.py

πŸ”§ Install the libraries we will need:

pip install pyspark matplotlib fpdf

πŸ“– Explanation of the Code:

new_col(spark):

  • Loads a CSV, calculates each country's share of the world population, adds it as a new column, and saves the result as cleaned_data.csv.
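
A sketch of that calculation, assuming the population column is named population (the real column names may differ):

```python
# Sketch of new_col; the column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("first-approach").getOrCreate()
df = spark.read.csv("./data/clean_data.csv", header=True, inferSchema=True)

total = df.agg(F.sum("population")).first()[0]  # world total
df = df.withColumn(
    "population_percentage", F.round(F.col("population") / total * 100, 2)
)
df.toPandas().to_csv("./data/cleaned_data.csv", index=False)  # single-file CSV
```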

plot_data(spark):

  • This code reads a CSV with PySpark, converts the data to Pandas, and creates a bar chart showing the top 10 countries by population percentage.
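
Roughly (reusing the spark session from the previous sketch; column names are assumptions):

```python
# Sketch of plot_data: PySpark -> Pandas -> matplotlib bar chart.
import matplotlib.pyplot as plt

pdf = (
    spark.read.csv("./data/cleaned_data.csv", header=True, inferSchema=True)
    .orderBy("population_percentage", ascending=False)
    .limit(10)
    .toPandas()
)

plt.bar(pdf["country_or_dependency"], pdf["population_percentage"])
plt.xticks(rotation=45, ha="right")
plt.ylabel("% of world population")
plt.tight_layout()
plt.show()
```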

plot_pie_chart(spark):

  • Reads the cleaned CSV, converts it to Pandas, selects the top 10 countries by population percentage, and generates a pie chart with percentages.
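
The pie-chart variant only changes the plotting call (reusing pdf, the top-10 Pandas frame from the sketch above):

```python
# Sketch of plot_pie_chart, reusing the top-10 Pandas frame from above.
plt.pie(
    pdf["population_percentage"],
    labels=pdf["country_or_dependency"],
    autopct="%1.1f%%",
)
plt.title("Top 10 countries by population share")
plt.show()
```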

create_table(spark):

  • This code reads a CSV file, selects the top 10 most populated countries, and generates a PDF with a table showing the country, population, and percentage. It then saves the PDF in a folder.
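
A sketch of the table step with fpdf (cell widths, fonts, and the output path are illustrative, and the column names are assumptions):

```python
# Sketch of create_table; layout values and the output path are illustrative.
from fpdf import FPDF

report = FPDF()
report.add_page()
report.set_font("Arial", "B", 12)
for header in ("Country", "Population", "Percentage"):
    report.cell(60, 10, header, border=1)
report.ln()

report.set_font("Arial", size=11)
for row in pdf.itertuples():  # pdf = top-10 Pandas frame from the plot sketch
    report.cell(60, 10, str(row.country_or_dependency), border=1)
    report.cell(60, 10, str(row.population), border=1)
    report.cell(60, 10, f"{row.population_percentage}%", border=1)
    report.ln()

report.output("./reports/table.pdf")
```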

create_report(spark):

  • This code reads a CSV file, selects data for all countries, and generates a PDF containing a table with country names, population, urban population, and percentages. It then saves the PDF to a specified folder.

βœ… Example Output:

  • new_col(spark): pyspark (screenshot)

  • plot_data(spark): plot (screenshot)

  • plot_pie_chart(spark): pie_chart (screenshot)

  • create_table(spark): table (screenshot)

  • create_report(spark): pdf (screenshot)


πŸ‘©β€πŸ’» 3. googlesheets.py

πŸ”§ Install the libraries we will need:

pip install gspread google-auth google-auth-oauthlib google-auth-httplib2 pyspark

πŸ“– Explanation of the Code:

pyspark(client):

  • This code reads a CSV file, clears a Google Sheets document, and saves the CSV data into that sheet.
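
A sketch of that upload, assuming a service-account credential file and a sheet name (both placeholders):

```python
# Sketch of the Sheets upload; credential file and sheet name are placeholders.
import gspread
import pandas as pd

client = gspread.service_account(filename="credentials.json")
sheet = client.open("pyspark_sheets").sheet1

df = pd.read_csv("./data/cleaned_data.csv")
sheet.clear()
sheet.update([df.columns.tolist()] + df.astype(str).values.tolist())
```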

analysis(client):

  • Does the same for the analysis output: reads its CSV, clears a separate Google Sheets document, and saves the data into that sheet.

βœ… Example Output:

  • pyspark_sheets (screenshot)

  • analysis_sheets (screenshot)


πŸ‘©β€πŸ’» 4. analysis.py

πŸ”§ Install the libraries we will need:

pip install pyspark

πŸ“– Explanation of the Code:

previous_years(spark):

  • The code loads a CSV into a PySpark DataFrame, calculates population estimates for multiple years, saves the result in a new CSV, renames the final file, removes unnecessary files, and handles logs and errors.
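
One plausible sketch of the estimation step, assuming yearly_change is stored as a fraction and the estimate is stepped back one year at a time (the column names and the exact formula in the repository may differ):

```python
# Hypothetical back-cast: divide by (1 + yearly_change) once per year stepped
# back. Column names and the formula itself are assumptions.
from pyspark.sql import functions as F

df = spark.read.csv("./data/cleaned_data.csv", header=True, inferSchema=True)
for offset in range(1, 6):  # estimate the five previous years
    df = df.withColumn(
        f"population_{2023 - offset}",
        F.round(F.col("population") / (1 + F.col("yearly_change")) ** offset, 0),
    )
df.coalesce(1).write.mode("overwrite").csv("./data/previous_years", header=True)
# Spark writes a part-xxxx file inside that folder; renaming it and removing
# the leftovers is the cleanup the function handles afterwards.
```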

βœ… Example Output:

  • previous_years(spark):

  • analysis_sheets (screenshot)


πŸ‘©β€πŸ’» 5. eda.py

πŸ”§ Install the libraries we will need:

pip install pyspark matplotlib

πŸ“– Explanation of the Code:

correlation_analysis(df):

  • This function will analyze the relationship between population change and net migration.
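
In PySpark, that Pearson correlation is a single call (the column names here are assumptions):

```python
# Sketch of the correlation step; column names are assumptions.
corr = df.stat.corr("yearly_change", "net_migration")
print(f"Correlation between population change and net migration: {corr:.3f}")
```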

growth_country(df):

  • This function visualizes population growth for China between 2015 and 2023.
  • You can change "CHINA" to any country name you want to analyze, as long as the country is present in the country_or_dependency column of the DataFrame.
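
A sketch, assuming per-year columns such as population_2015 ... population_2023 exist in the DataFrame (for example, the ones produced by previous_years):

```python
# Sketch of growth_country; assumes per-year population columns exist.
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

row = df.filter(F.col("country_or_dependency") == "CHINA").first()
years = list(range(2015, 2024))
values = [row[f"population_{y}"] for y in years]

plt.plot(years, values, marker="o")
plt.title("Population growth: CHINA")
plt.xlabel("Year")
plt.ylabel("Population")
plt.show()
```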

βœ… Example Output:

  • correlation_analysis(df): eda1 (screenshot)

  • growth_country(df): eda2 (screenshot)


πŸ‘©β€πŸ’» 6. prediction.py

πŸ”§ Install the libraries we will need:

pip install pyspark

πŸ“– Explanation of the Code:

  • This code trains a linear regression model to predict the 2025 population based on features such as yearly change, fertility rate, and net migration.
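
A sketch of such a model with Spark MLlib (the feature and label column names are assumptions; the script's actual target column may differ):

```python
# Sketch of the regression step; column names are assumptions.
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(
    inputCols=["yearly_change", "fertility_rate", "net_migration"],
    outputCol="features",
)
train = assembler.transform(df).select(
    "features", F.col("population").alias("label")
)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show(5)
```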

βœ… Example Output:

  • prediction: prediction (screenshot)
