This project demonstrates how to use PySpark with datasets and perform simple data transformations. It loads a sample dataset with PySpark's built-in functionality, or reads data from external sources, and converts it into a PySpark DataFrame for distributed processing and manipulation.
- PySpark is the Python API for Apache Spark, enabling the use of Spark from Python.
- Distributed Computing: Processes large datasets across a cluster of computers for scalability.
- In-Memory Processing: Speeds up computation by reducing disk I/O.
- Lazy Evaluation: Operations are only executed when an action is triggered, optimizing performance.
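- Rich Libraries:
  - Spark SQL: Structured data processing (like SQL operations).
  - MLlib: Machine learning library for scalable algorithms.
  - GraphX: Graph processing (built on the RDD API). GraphX has no Python bindings, so PySpark users typically rely on the separate GraphFrames package instead.
  - Spark Streaming: Real-time stream processing.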
- Compatibility: Works with Hadoop, HDFS, Hive, Cassandra, etc.
- Resilient Distributed Datasets (RDDs): Low-level API for distributed data handling.
- DataFrames & Datasets: High-level APIs for structured data with SQL-like operations.
- Handles massive datasets efficiently.
- Compatible with many tools (Hadoop, Cassandra, etc.).
- Built-in libraries for SQL, Machine Learning, Streaming, Graph Processing.
- Can be memory-intensive.
- Complex configuration for cluster environments.
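To make the lazy-evaluation point from the feature list concrete, here is a minimal sketch. It assumes a local Spark installation; the function name `lazy_evaluation_demo` and the sample data are illustrative and not part of this project.

```python
def lazy_evaluation_demo():
    """Build a small query plan, then trigger it with an action."""
    # Imported inside the function so this file loads even without PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[1]").appName("lazy-demo").getOrCreate()
    )
    try:
        numbers = spark.range(1_000_000)  # single column named "id"

        # Transformations: these only extend the logical plan, no job runs yet.
        evens = numbers.filter(F.col("id") % 2 == 0)
        squared = evens.withColumn("squared", F.col("id") * F.col("id"))

        squared.explain()  # prints the plan Spark intends to run

        # Action: count() forces Spark to execute the plan.
        return squared.count()
    finally:
        spark.stop()
```

Calling `lazy_evaluation_demo()` prints the physical plan first and only then runs a job, which is why chaining many transformations is cheap until an action such as `count()`, `show()`, or a write is reached.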
- Install via pip
pip install pyspark
- Verify installation
python3 -c "import pyspark; print(pyspark.__version__)"
pip install kaggle pandas numpy
pip install windows-curses
- Lets the user search for datasets and choose one to download.
- Saves the dataset to a specified folder and loads the first CSV file into a DataFrame.
- Lists all files in the ./data folder.
- Lets the user pick a CSV file.
- Renames the columns: converts them to lowercase, replaces spaces (" ") with underscores ("_"), and removes parentheses ("()") and periods (".").
- Saves the modified file as modified_data.csv in the same folder.
- Reads a CSV, replaces "N.A." with NaN, converts NaN to 0, and saves the result as clean_data.csv.
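The column-renaming rule described above can be sketched as follows. This is an illustration, not the project's actual script: `clean_column_name` and `rename_columns` are hypothetical names, and the Spark helper assumes a PySpark DataFrame (it relies only on `df.columns` and `df.toDF`).

```python
import re


def clean_column_name(name: str) -> str:
    """Lowercase, turn spaces into underscores, drop parentheses and periods."""
    lowered = name.lower().replace(" ", "_")
    return re.sub(r"[().]", "", lowered)


def rename_columns(df):
    """Return a new PySpark DataFrame whose columns follow the rule above."""
    # DataFrame.toDF(*cols) returns a copy with the given column names.
    return df.toDF(*[clean_column_name(column) for column in df.columns])
```

For example, a header such as `Country (or dependency)` becomes `country_or_dependency`, and `Fert. Rate` becomes `fert_rate`.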
pip install pyspark matplotlib fpdf
- The code loads a CSV, calculates the population percentage of each country, adds that information to the file, and saves it as cleaned_data.csv.
- This code reads a CSV with PySpark, converts the data to Pandas, and creates a bar chart showing the top 10 countries by population percentage.
- Reads the cleaned CSV, converts it to Pandas, selects the top 10 countries by population percentage, and generates a pie chart with percentages.
- This code reads a CSV file, selects the top 10 most populated countries, and generates a PDF with a table showing the country, population, and percentage. It then saves the PDF in a folder.
- This code reads a CSV file, selects data for all countries, and generates a PDF containing a table with country names, population, urban population, and percentages. It then saves the PDF to a specified folder.
- new_col(spark):
- plot_data(spark):
- plot_pie_chart(spark):
- create_table(spark):
- create_report(spark):
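As a rough sketch of the population-percentage step, the plain-Python function below shows the arithmetic, and the Spark helper applies the same calculation to a DataFrame. The function names, the `out_col` default, and the rounding to two decimals are assumptions for illustration; the population column is assumed to be numeric.

```python
def percentage_of_total(values):
    """Plain-Python reference: each value as a percentage of the sum."""
    total = sum(values)
    if total == 0:
        raise ValueError("total must be non-zero")
    return [round(value / total * 100, 2) for value in values]


def add_population_percentage(df, population_col, out_col="population_percentage"):
    """Add a percentage-of-total column to a PySpark DataFrame."""
    # Imported here so the reference function works without PySpark installed.
    from pyspark.sql import functions as F

    # One aggregation job computes the grand total on the driver.
    total = df.agg(F.sum(population_col)).first()[0]
    share = F.col(population_col) / F.lit(total) * 100
    return df.withColumn(out_col, F.round(share, 2))
```

The top 10 rows for a chart or table could then be taken with `df.orderBy(F.col("population_percentage").desc()).limit(10).toPandas()`, which keeps the conversion to Pandas small.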
pip install gspread google-auth google-auth-oauthlib google-auth-httplib2 pyspark
- This code reads a CSV file, clears a Google Sheets document, and saves the CSV data into that sheet.
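The clear-then-write flow can be sketched roughly as below. Several things here are assumptions: the helper names are hypothetical, authentication uses a service-account JSON file (the project may use a different flow), and the CSV is read with the standard `csv` module rather than PySpark to keep the example small.

```python
import csv


def read_csv_rows(path):
    """Read a CSV file into a list of rows (header included)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)]


def replace_sheet_contents(worksheet, rows):
    """Clear the worksheet, then write all rows starting at cell A1."""
    worksheet.clear()
    # Keyword arguments avoid depending on the positional order of update().
    worksheet.update(values=rows, range_name="A1")


def upload_csv_to_sheet(csv_path, credentials_path, sheet_title):
    """Open the first worksheet of a spreadsheet and replace its contents."""
    import gspread  # imported lazily; only needed for the real upload

    client = gspread.service_account(filename=credentials_path)
    worksheet = client.open(sheet_title).sheet1
    replace_sheet_contents(worksheet, read_csv_rows(csv_path))
```

With a service account, the target spreadsheet has to be shared with the service account's email address before `client.open` can find it.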
pip install pyspark
- The code loads a CSV into a PySpark DataFrame, calculates population estimates for multiple years, saves the result in a new CSV, renames the final file, removes unnecessary files, and handles logs and errors.
- previous_years(spark):
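The description above does not say how the estimates for other years are computed. One plausible approach, assuming a constant yearly change rate, is to back-cast from the current population by dividing by `(1 + rate)` once per year. The sketch below uses that assumption; the function names and column parameters are hypothetical.

```python
def estimate_previous_population(population, yearly_change_pct, years_back):
    """Back-cast a population, assuming a constant yearly change in percent."""
    rate = yearly_change_pct / 100.0
    return round(population / (1.0 + rate) ** years_back)


def add_previous_years(df, base_year, years_back, population_col, change_col):
    """Add one estimated-population column per earlier year to a DataFrame."""
    from pyspark.sql import functions as F  # lazy import, Spark only needed here

    growth = 1.0 + F.col(change_col) / 100.0
    for k in range(1, years_back + 1):
        estimate = F.col(population_col) / F.pow(growth, F.lit(k))
        df = df.withColumn(f"population_{base_year - k}", F.round(estimate).cast("long"))
    return df
```

For instance, a population of 1,100 with a 10% yearly change back-casts to 1,000 one year earlier. Real growth rates vary from year to year, so these figures are rough estimates only.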
pip install pyspark matplotlib
- This function will analyze the relationship between population change and net migration.
- This function visualizes population growth for China between 2015 and 2023.
- You can change "CHINA" to any country name you want to analyze, as long as the country is present in the country_or_dependency column of the DataFrame.
- correlation_analysis(df):
- growth_country(df):
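For the correlation step, a minimal sketch is shown below. The plain-Python `pearson` function spells out the statistic, and `spark_correlation` (a hypothetical name, not the project's `correlation_analysis`) delegates to PySpark's `DataFrame.stat.corr`, which computes the Pearson coefficient. Both column arguments are assumed to refer to numeric columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spark_correlation(df, col_a, col_b):
    """Same statistic computed by Spark on two numeric DataFrame columns."""
    return df.stat.corr(col_a, col_b)
```

A value near 1 or -1 indicates a strong linear relationship between population change and net migration; a value near 0 indicates little linear relationship.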
pip install pyspark
- This code trains a linear regression model to predict the 2025 population based on features such as yearly change, fertility rate, and net migration.
- prediction:
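A linear regression of this kind could be set up with Spark MLlib roughly as follows. This is a sketch under assumptions: the column names in `FEATURE_COLS` and `LABEL_COL` are placeholders, the 80/20 split and seed are arbitrary, and all feature columns are assumed to be numeric with no nulls (`VectorAssembler` raises an error on nulls by default).

```python
# Placeholder column names; adjust to match the cleaned dataset.
FEATURE_COLS = ["yearly_change", "fertility_rate", "net_migration"]
LABEL_COL = "population_2025"


def train_population_model(df, feature_cols=FEATURE_COLS, label_col=LABEL_COL):
    """Fit a linear regression and return (model, predictions on held-out rows)."""
    # Lazy imports so this file can be loaded without PySpark installed.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # MLlib estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    assembled = assembler.transform(df).select("features", label_col)

    train, test = assembled.randomSplit([0.8, 0.2], seed=42)
    model = LinearRegression(featuresCol="features", labelCol=label_col).fit(train)
    return model, model.transform(test)
```

The returned predictions DataFrame contains a `prediction` column next to the label, and `model.summary.r2` gives a quick indication of fit on the training split.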