A collection of cheatsheets, tips, and reference guides for several data analysis libraries and frameworks
- Pandas
- Pyspark
- Pypolars
- Dask
- DataTables
- etc
- PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features, such as Spark SQL, DataFrames, Streaming, MLlib (machine learning) and Spark Core (see the sketch below the documentation link).
- https://spark.apache.org/docs/latest/api/python/reference/index.html
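- A minimal sketch of the DataFrame API and Spark SQL entry points mentioned above (the app name, sample rows and column names are illustrative, not from the original dataset):
from pyspark.sql import SparkSession
# Build (or reuse) a local SparkSession
spark = SparkSession.builder.appName("QuickStart").getOrCreate()
# DataFrame API: create a small in-memory DataFrame and filter it
df = spark.createDataFrame([(1, "Jane", 25), (2, "Mark", 30)], ["id", "name", "age"])
df.filter(df.age > 26).show()
# Spark SQL: query the same data through a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 26").show()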
pip install pyspark
from pyspark import SparkContext
# Create a Spark Context
sc = SparkContext(master="local[2]")
sc
from pyspark.sql import SparkSession
# Spark Session used for DF
spark = SparkSession.builder.appName("SampleSparkApp").getOrCreate()
df = spark.read.csv("data/dataset.csv")
df.show(5)
df = spark.read.csv("data/dataset.csv",header=True)
df.show(5)
df = spark.read.csv("data/dataset.csv",header=True,infer_schema=True)
df.show(5)
df.first()
df.head(5)
df.columns
df.dtypes
df.printSchema()
df.count()
len(df.columns)
print(df.count(),len(df.columns))
df.describe().show()
df.describe('age').show()
# Column name lookup is case-insensitive by default in Spark
df.select('Col1').show()
df.select('COL1').show()
df.select('col1').show()
df.select('Age','Category').show()
- Bracket or dot notation (df['Age'] or df.Age) returns a Column object and only prints the column name, not the data within the column; wrap it in select() to display the values
df.Age
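- For example, to actually display the values of a column referenced this way:
df.select(df.Age).show(5)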
# Method 1: Using Filter
df.filter(df['Age'] == 25).show()
# or
df.filter(df.Age == 25).show()
# Filter on ==, >, <, >=, <= conditions
df = df.filter(df.Age > 25)
- Combining conditions with & or | requires parentheses around each condition
df = df.filter((df.Age > 25) & (df.Gender == 'M'))
from pyspark.sql.functions import col
df = df.filter(col('first_name').isin([3, 4, 7]))
df.orderBy(df.age.asc())
df.orderBy(df.age.desc())
# Where: Method 1
df.where(df['sex'] == 'f').show()
# Filter and show only a selected column
df.where(df['sex'] == 'f').select('Age','Sex','Category').show(5)
# Add a Column
df.withColumn('Alb_by_10',df['ALB'] * 10).show()
# Persist the addition
df2 = df.withColumn('Alb_by_10',df['ALB'] * 10)
# Rename a column
df = df.withColumnRenamed('dob', 'date_of_birth')
# Select the columns to keep, optionally rename some
from pyspark.sql import functions as F
df = df.select(
    'name',
    'age',
    F.col('dob').alias('date_of_birth'),
)
# Remove columns
df = df.drop('monthly_salary', 'salary_for_year')
# Keep all the columns which also occur in another dataset
df = df.select(*(F.col(c) for c in df2.columns))
# Batch Rename/Clean Columns
for c in df.columns:
    df = df.withColumnRenamed(c, c.lower().replace(' ', '_').replace('-', '_'))
- PyPolars is an open-source Python DataFrame library similar to Pandas. PyPolars utilizes all the available CPU cores and therefore performs computations faster
- Ideally, PyPolars is used when the data is too big for Pandas and too small for Spark
- https://github.com/pola-rs/polars
- Eager API: very similar to Pandas; each operation executes immediately and the result is available as soon as execution completes.
- Lazy API: very similar to Spark; a query first builds a plan (map) of the work, which is then executed in parallel across all CPU cores (see the sketch after this list).
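- A minimal sketch contrasting the two APIs (assumes pypolars and the diamonds.csv dataset used below; exact lazy-API method names may differ between pypolars releases):
import pypolars as pl
import pypolars.lazy as plazy
# Eager: every call runs immediately and returns a concrete result
df = pl.read_csv("data/diamonds.csv")
eager_result = df.groupby("cut").select("price").sum()
# Lazy: calls only build a query plan; nothing runs until .collect()
plan = df.lazy().filter(plazy.col("price") > 300)
lazy_result = plan.collect()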
pip install py-polars
import pypolars as pl
df = pl.read_csv("data/diamonds.csv")
type(df)
pypolars.frame.DataFrame
df.head()
df.tail(10)
df.shape
df.dtypes
[pypolars.datatypes.Float64,
pypolars.datatypes.Utf8,
pypolars.datatypes.Utf8,
pypolars.datatypes.Utf8,
pypolars.datatypes.Float64,
pypolars.datatypes.Float64,
pypolars.datatypes.Int64,
pypolars.datatypes.Float64,
pypolars.datatypes.Float64,
pypolars.datatypes.Float64]
df.columns
# pypolars has no .describe(), so convert to pandas first
df.to_pandas().describe()
# Select Columns By Col Name
# same in pandas:: df['carat']
df['col1']
# Select Columns By Index
df[0]
# Select By Index Position for columns
# same in pandas:: df[0]
df.select_at_idx(0)
# Select Multiple Columns
df[['col1','col2']]
# Select rows 0 to 3 and all columns
# same in pandas:: df.iloc[0:3]
df[0:3]
# Slicing 0 to length of rows
# same in pandas:: df.iloc[0:3]
df.slice(0,3)
# Select the first 3 rows of a column
df[0:3,"col1"]
# Select from different rows of a column
df[[0,4,10],"col1"]
# Select specific rows from multiple columns
df[[0,4,10],["col1","col2"]]
df['col1'].value_counts()
# Get Unique Values
df['cut'].unique()
# Boolean Indexing & Condition & Find
# Find all rows/datapoint with cut as ideal
df[df['cut'] == "Ideal"]
# Check For Price More than A Value (300)
df[df['price'] > 300]
# Method 1
df['col1'].apply(round)
# Method 2 Using Lambda
df['col1'].apply(lambda x : round(x))
import pypolars.lazy as plazy
# Example mapping for the 'cut' categories (the values here are illustrative)
mydict = {'Ideal': 1, 'Premium': 2, 'Very Good': 3, 'Good': 4, 'Fair': 5}
# Define your custom function
def cust_mapcut_values(s: plazy.Series) -> plazy.Series:
    return s.apply(lambda x: mydict[x])
# Apply the function lazily to the 'cut' column
output = df.lazy().with_column(plazy.col('cut').map(cust_mapcut_values))
# Execute and Collect Results
output.collect()
# same in pandas:: df.groupby('cut')['price'].sum()
df.groupby('cut').select('price').sum()
# Selecting Multiple Columns From Groupby
df.groupby('cut').select(['price','carat']).sum()
# Select every column and apply a function (first value per group)
df.groupby(['cut','color']).select_all().first()
# Create A Series
x = pl.Series('X',['A','B','C'])
y = pl.Series('Y',[3,5,4])
# Load Data Vis Pkgs
import matplotlib.pyplot as plt
# Bar Chart Using Matplotlib
plt.bar(x,y)
Dask can parallelise operations just as easily on a single computer as on a server. It automatically detects the cores available on a machine and intelligently distributes the workload across them (see the sketch after the install commands).
pip install dask
pip install "dask[complete]"
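- A minimal sketch of the parallel read-and-inspect flow described above (reuses the data/dataset.csv file from the PySpark examples; the columns are whatever that file contains):
import dask.dataframe as dd
# Lazily read the CSV; Dask splits it into partitions that run across CPU cores
ddf = dd.read_csv("data/dataset.csv")
ddf.head(5)                 # computes only the first partition
ddf.describe().compute()    # .compute() runs the full parallel computation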
// import packages
const dataForge = require('data-forge');
require('data-forge-fs');
const df = new dataForge.DataFrame({
    columnNames: ["id", "name", "sex", "age"],
    rows: [
        [1, "Jesse", "male", 25],
        [2, "Jane", "female", 25],
        [3, "Mark", "male", 20],
        [4, "Peter", "male", 55],
        [5, "Paula", "female", 35],
    ]
})
df.toArray()
df.toString()
let data = [{ A: 1, B: 10 }, { A: 2, B: 20 }, { A: 3, B: 30 }];
let df2 = new dataForge.DataFrame(data);
df2.toString()
const df = dataForge.readFileSync('data/sodata.csv').parseCSV();
df.toArray()
df.toRows()
df.toPairs()
df.toJSON()
df.head(5).toString()
df.tail(5).toString()
df.detectTypes().toString()
df.detectValues().toString()
df.getColumnNames()
let df2 = df.renameSeries({"title":"text"})
df2.getColumnNames()
- pandas:: df.iloc[10]
let row10 = df.at(10)
row10
- pandas:: df.loc[10:20]
df.between(10,20).toString()
df.skip(10).take(20).toString()
- pandas:: df['col1'] or df.col1
df.getSeries("tags").toString()
df.deflate(row=>row.tags).toString()
df.subset(['tags','title']).toString()
df.dropSeries('tags').head(5).toString()
df.dropSeries(['title','tags']).head(5).toArray()
df.where(row=>row['python'] == 0.0).toString()
df.transformSeries({title: value => value.toUpperCase()}).toString()
let pyHeight = df.getSeries("python")
let pyHeight2 = pyHeight.select(value => value + 2.5)
pyHeight2.toString()
df.withSeries("Height", pyHeight2)
df.groupBy(row => row.sex).select(group => {
    return {
        Sex: group.first().sex,
        Count: group.count()
    };
}).inflate();
- Jesus Saves @JCharisTech
- By Jesse E.Agbe(JCharis)