Working with Data

Pandas cookbook

Applying functions to data

We can apply a function to a single column of a DataFrame with the method .apply(); see the following example:

# importing pandas and numpy libraries
import pandas as pd
import numpy as np
 
# creating and initializing a nested list
values_list = [[15, 2.5, 100], [20, 4.5, 50], [25, 5.2, 80],
               [45, 5.8, 48], [40, 6.3, 70], [41, 6.4, 90],
               [51, 2.3, 111]]
 
# creating a pandas dataframe
df = pd.DataFrame(values_list, columns=['Field_1', 'Field_2', 'Field_3'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
 
 
# Apply the function numpy.square() to square
# the values of the column 'Field_2' and store
# the result in a new column 'Field_4'
df['Field_4'] = df['Field_2'].apply(np.square)
 
 
# printing dataframe
df
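
The same method also works row-wise. A minimal sketch (the column name Row_max is just for illustration) that applies a function to each row by passing axis=1:

# apply a function to each row by setting axis=1;
# here we take the maximum value across the fields of each row
df['Row_max'] = df.apply(lambda row: row.max(), axis=1)
df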

Filtering values

Using the previous example, you can filter rows with conditional expressions. For example, to select all the rows where Field_1 is equal to 20.0:

df[df['Field_1'] == 20.0]

Or the rows where Field_2 is smaller than 3.0:

df[df['Field_2'] < 3.0]

It is possible to take just a random sample of the whole data set. In this example we take 1000 random rows (note that n cannot exceed the number of rows unless replace=True is passed):

df = df.sample(n=1000)

Operations among columns

It is possible to do arithmetic operations with columns. For example, adding Field_1 and Field_2 to create a new Field_5 looks like:

df['Field_5'] = df['Field_1'] + df['Field_2']

Plotting with Pandas is relatively easy. The DataFrame object has some basic integration with matplotlib. For example, a scatter plot can be produced like:

df.plot.scatter(x='Field_2', y='Field_5', s=2)

Another interesting package for visualization is Seaborn. Values from a DataFrame can be extracted as arrays and passed directly to Seaborn:

import seaborn as sns
sns.distplot(df['Field_1'].values[:])
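
Note that distplot has been deprecated in recent Seaborn releases; a minimal sketch of the same plot with the newer histplot interface (assuming Seaborn 0.11 or later):

import seaborn as sns

# histplot with a kernel density estimate replaces distplot
sns.histplot(df['Field_1'], kde=True)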

Loading Dataframes

Pandas can import data in a variety of formats, including CSV, Excel spreadsheets, XML, and more. For example, we could load a CSV file into a DataFrame as:

df = pd.read_csv('../data/mydata.csv').fillna(value = 0)

Notice that in this example we are assigning a zero value to those cells that are empty.
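
Other strategies for filling missing values are possible; for example, a sketch that fills empty cells with each numeric column's mean instead of zero:

df = pd.read_csv('../data/mydata.csv')

# fill empty cells with the mean of each numeric column
df = df.fillna(df.mean(numeric_only=True))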

Storing Dataframes

DataFrames can be saved in multiple formats, such as XML, CSV, JSON, Pickle, etc. For example, we could save in the Pickle (binary) format as:

df.to_pickle('../data/mydata.pkl')
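
Other output formats follow the same pattern; for example, CSV and JSON:

# save as a plain-text CSV file, without writing the index column
df.to_csv('../data/mydata.csv', index=False)

# save as JSON
df.to_json('../data/mydata.json')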

Useful methods for data description

When the amount of data is large, Pandas has methods that can give you a glimpse of the content. The method count() returns the number of non-null entries in each column.

df.count()

The method describe() produces a number of useful summaries for a dataset, including the minimum, maximum, mean, standard deviation, etc.

df.describe()

The method hist() gives you a visual reference of the data and how it is distributed.

import matplotlib.pyplot as plt
df.hist(bins=50, figsize=(15,15))
plt.show()

Machine Learning

Unsupervised learning - Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

In other words, PCA is a way of reducing the dimensionality of a dataset while preserving as much of the information as possible. This can be useful for visualization, as it can make it easier to see patterns in high-dimensional data.

PCA works by finding the directions in the data that have the most variance. These directions are called principal components. The first principal component is the direction with the most variance, the second principal component is the direction with the second most variance, and so on.

The principal components are then ranked in order of decreasing variance. The first principal component will contain the most information about the data, the second principal component will contain the second most information, and so on.

PCA can be used to reduce the dimensionality of a dataset by selecting a subset of the principal components. For example, if a dataset has 100 features, we could use PCA to reduce the dimensionality to 10 or 20 features.

The number of principal components to select depends on the specific application. In general, we want to select enough principal components to capture as much of the information in the data as possible, while still keeping the number of dimensions manageable.
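
One practical way to decide is to inspect the explained variance ratio of each component; a short sketch with scikit-learn on the Iris data used below:

from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# fit PCA with all components and inspect the fraction of
# variance explained by each principal component
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())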

# import packages
# datasets has the Iris dataset
from sklearn import datasets

# pandas and numPy for DataFrames and arrays
import pandas as pd
import numpy as np

# pyplot and seaborn for plots
import matplotlib.pyplot as plt
import seaborn as sns

# PCA Model
from sklearn.decomposition import PCA

# Get dataset
iris = datasets.load_iris()

# load the data and target values
X, y = iris.data, iris.target

# create DataFrame with the iris data
df = pd.DataFrame(X, columns=iris.feature_names)
df.head()

# fit PCA and project the data onto the first two principal components
X_reduced = PCA(n_components=2).fit_transform(X)
df = pd.DataFrame(X_reduced)
df.columns = ['x', 'y']
df.head()

sns.scatterplot(x='x', y='y', data=df, hue=iris.target, palette="Set1")

Cluster analysis with t-SNE

t-SNE (t-distributed stochastic neighbor embedding) is a non-linear dimensionality reduction algorithm that is commonly used to visualize high-dimensional data in a two or three-dimensional space. t-SNE works by finding a low-dimensional representation of the data that preserves the local structure of the data in the high-dimensional space. This means that similar data points in the high-dimensional space will be close together in the low-dimensional space, and dissimilar data points will be far apart.

t-SNE is a powerful tool for visualizing high-dimensional data, but it can be computationally expensive. It is also important to note that t-SNE is a non-deterministic algorithm, which means that the results of the algorithm may vary slightly each time it is run.

The data from a Pandas DataFrame can be passed directly to the t-SNE model, as in the following example:

# import packages
# datasets has the Iris dataset
from sklearn import datasets

# pandas and numPy for DataFrames and arrays
import pandas as pd
import numpy as np

# pyplot and seaborn for plots
import matplotlib.pyplot as plt
import seaborn as sns

# TSNE Model
from sklearn.manifold import TSNE

# Get dataset
iris = datasets.load_iris()


# load the data and target values
X, y = iris.data, iris.target

# create DataFrame with the iris data
df = pd.DataFrame(X, columns=iris.feature_names)
df.head()

# initialize the model
model = TSNE(learning_rate=100, random_state=2)

# fit the model to the Iris Data
transformed = model.fit_transform(X)
df = pd.DataFrame(transformed)
df.columns = ['x', 'y']
df.head()

sns.scatterplot(x='x', y='y', data=df, hue=iris.target, palette="Set1")

Clustering with HDBSCAN

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that is based on density. It works by first identifying clusters of high density, and then merging these clusters together until all of the data points have been assigned to a cluster.

HDBSCAN is a density-based algorithm, which means that it considers the local density of data points when forming clusters. This makes it different from other clustering algorithms, such as k-means, which assign each point to the nearest cluster centroid.

HDBSCAN is a relatively new algorithm, but it has been shown to be effective in a variety of clustering tasks. It is particularly well-suited for clustering data that is not well-separated, such as data that has noise or outliers.

Using the DataFrame from the t-SNE analysis in the previous section, we could obtain the clusters from HDBSCAN with the following script:

import hdbscan

# cluster the 2D t-SNE embedding from the previous section
cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=2, gen_min_span_tree=True)
cluster_tsne.fit(transformed)

plt.figure(figsize=(6, 8))
plt.scatter(transformed.T[0], transformed.T[1], marker='o',
            c=cluster_tsne.labels_, s=5, cmap='brg')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Iris t-SNE: HDBSCAN clusters')
cbar = plt.colorbar(orientation='horizontal')
cbar.set_label('Cluster Label')
plt.show()

Alternatively, we could append the HDBSCAN results to our DataFrame. For example:

df['cluster'] = cluster_tsne.labels_
df['clusterprob'] = cluster_tsne.probabilities_
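
With the labels stored in the DataFrame, simple per-cluster summaries are easy to compute; for example:

# number of points assigned to each cluster (label -1 marks noise in HDBSCAN)
df.groupby('cluster').size()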

Supervised learning - Gaussian Process Regression

Gaussian process regression (GPR) is a machine learning technique that can be used to predict a continuous value (such as temperature or height) from a set of input data. GPR is a nonparametric method, which means that it does not make any assumptions about the underlying distribution of the data. This makes GPR very flexible and can be used to model a wide variety of data.

GPR works by assuming that the function that maps from the input data to the output value is a Gaussian process. A Gaussian process is a distribution over functions, which means that any finite number of outputs from the function will be jointly Gaussian distributed. We can therefore use the properties of Gaussian distributions to make predictions about the output value for a new input.

One of the advantages of GPR is that it can provide uncertainty estimates for its predictions. This is because the Gaussian process distribution also defines a distribution over the uncertainty of the predictions. This can be useful for applications where it is important to know how confident we are in our predictions.

GPR is a powerful machine learning technique that can be used for a variety of tasks. It is particularly well-suited for tasks where the underlying distribution of the data is unknown or where it is important to be able to provide uncertainty estimates for the predictions.

import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Create the features and target variables
X = iris.data
y = iris.target

# Create a Gaussian process regressor
gpr = GaussianProcessRegressor()

# Fit the regressor to the data
gpr.fit(X, y)

# Predict the target values for the training data (the same data used for fitting)
predictions = gpr.predict(X)

# Calculate the mean squared error
mse = np.mean((predictions - y)**2)

print("Mean squared error:", mse)

Mean squared error: 0.006245333333333333
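
As mentioned above, GPR can also report how uncertain each prediction is; in scikit-learn this is requested with return_std=True (using the gpr and X defined in the example above):

# mean prediction and its standard deviation for each sample
mean, std = gpr.predict(X, return_std=True)
print(mean[:5])
print(std[:5])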

Supervised learning - Random Forest

Random forest is a machine learning algorithm that can be used for both classification and regression tasks. It is a type of ensemble learning algorithm, which means that it combines the predictions of multiple decision trees to make a final prediction.

To understand how random forest works, let's first take a look at decision trees. A decision tree is a simple machine learning algorithm that can be used to make predictions by splitting the data into smaller and smaller groups until the predictions are clear. For example, a decision tree could be used to predict whether a patient has cancer by asking a series of questions about the patient's symptoms.

Random forest works by creating a set of decision trees, each of which is trained on a different subset of the data. The trees are created by randomly selecting features from the data and then splitting the data based on those features. This process is repeated until the desired number of trees is created.

The final prediction is made by combining the predictions of the individual trees. This is done by taking the majority vote of the trees for classification tasks or the average of the trees for regression tasks.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Create the features and target variables
X = iris.data
y = iris.target

# Create a random forest regressor with 100 trees
rf = RandomForestRegressor(n_estimators=100)

# Fit the regressor to the data
rf.fit(X, y)

# Predict the target values for the training data (the same data used for fitting)
predictions = rf.predict(X)

# Calculate the mean squared error
mse = np.mean((predictions - y)**2)

print("Mean squared error:", mse)

Mean squared error: 0.006245333333333333
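
Note that the error above is computed on the same data used for training, so it is optimistic. A minimal sketch of a more honest evaluation with a held-out test set (reusing X and y from the example above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# evaluate on data the model has never seen
predictions = rf.predict(X_test)
mse = np.mean((predictions - y_test)**2)
print("Test mean squared error:", mse)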

Feature selection with Random Forest method

Feature selection is the process of selecting the most important features from a dataset for a machine learning model. This can be done for a variety of reasons, such as to improve the accuracy of the model, to reduce the complexity of the model, or to make the model more interpretable.

Random forest is a machine learning algorithm that can be used for both classification and regression tasks. It is a type of ensemble learning algorithm, which means that it combines the predictions of multiple decision trees to make a final prediction.

Random forest can also be used for feature selection. At each split, a tree considers only a random subset of the features, so different trees learn to focus on different features. The importance of a feature is then determined by how much it contributes to improving the splits across all the decision trees.

The features with the highest importance are the ones that are most important for making predictions. These features can then be selected for the machine learning model.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Create the features and target variables
X = iris.data
y = iris.target

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100)

# Fit the classifier to the data
rf.fit(X, y)

# Calculate the feature importances
importances = rf.feature_importances_

# Print the feature importances
print(importances)

# Select the top 2 features
top_2_features = np.argsort(importances)[-2:]

# Print the top 2 features
print(top_2_features)
[0.09179974 0.02709257 0.45509135 0.42601634]
[3 2]
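
The indices printed above refer to columns of iris.data; mapping them back to feature names (using the iris object and top_2_features from the example above) makes the result easier to read:

# translate the selected column indices into feature names
print([iris.feature_names[i] for i in top_2_features])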

RDKit

Converting SMILES to an RDKit molecule object

from rdkit.Chem import Draw
from rdkit.Chem import MolFromSmiles as smi2mol

smi = 'Clc1ccc(c(c1)Cl)OCCn1cncc1'
mol = smi2mol(smi)
Draw.MolToImage(mol, size=(800, 800))

Visualizing data in a grid

from rdkit.Chem import Draw
from rdkit.Chem import MolFromSmiles as smi2mol

smiles = ['N#CC(=Cc1ccc(N(c2ccccc2)c2ccccc2)cc1)c1ccc(Cl)cc1',
          'N#CC(=Cc1ccc(N(c2ccccc2)c2ccccc2)cc1)c1ccc(Br)cc1']
legend = ['Cl', 'Br']
mols = [smi2mol(e) for e in smiles]
Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(250, 250),
                     legends=['X = {}'.format(x) for x in legend])

Computing the fingerprint of a molecule

Fingerprints are one way to translate molecules into mathematical objects, such as vectors or matrices, that computers can work with. With this representation we can compare molecules and define their similarity. The RDKit package includes a good number of fingerprints; see this tutorial.

For example, the Morgan fingerprint constructs vectors by fragmenting the molecule around each atom within a given radius. The algorithm counts how many times each fragment is present and produces a vector of non-negative counts. Alternatively, it can produce a bit vector that just indicates whether each fragment is present or not. For small and medium size molecules a radius of 6 and a vector length of 1024 or 2048 is usually enough. See the following example for a bit vector.

import numpy as np
from rdkit.Chem.rdMolDescriptors import GetMorganFingerprintAsBitVect

np.array(GetMorganFingerprintAsBitVect(mol, radius=6, nBits=2048))

array([0, 0, 0, ..., 0, 0, 1])
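
With bit-vector fingerprints we can quantify the similarity between two molecules, for example with the Tanimoto coefficient. A minimal sketch reusing two of the SMILES from the grid example above (the radius of 2 is just a common choice for illustration):

from rdkit import DataStructs
from rdkit.Chem import MolFromSmiles
from rdkit.Chem.rdMolDescriptors import GetMorganFingerprintAsBitVect

mol_a = MolFromSmiles('N#CC(=Cc1ccc(N(c2ccccc2)c2ccccc2)cc1)c1ccc(Cl)cc1')
mol_b = MolFromSmiles('N#CC(=Cc1ccc(N(c2ccccc2)c2ccccc2)cc1)c1ccc(Br)cc1')

fp_a = GetMorganFingerprintAsBitVect(mol_a, radius=2, nBits=2048)
fp_b = GetMorganFingerprintAsBitVect(mol_b, radius=2, nBits=2048)

# Tanimoto similarity between the two bit vectors (1.0 means identical)
DataStructs.TanimotoSimilarity(fp_a, fp_b)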

Computing molecular descriptors

A molecular descriptor in cheminformatics is a numerical value that summarizes a particular property of a molecule. Descriptors are often not unique, so for machine learning it is preferable to use a set of descriptors. For an overview of the descriptors implemented in RDKit, visit this blog.

import rdkit.Chem.Descriptors as Descriptors
#Compute molecular weight
Descriptors.MolWt(mol)

257.11999999999995
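
Several descriptors can be collected per molecule to form a feature set; a small, illustrative sketch (using the Descriptors module and the mol object from above):

# a small, illustrative set of RDKit descriptors for the molecule above
desc = {
    'MolWt': Descriptors.MolWt(mol),
    'MolLogP': Descriptors.MolLogP(mol),
    'NumHDonors': Descriptors.NumHDonors(mol),
    'NumHAcceptors': Descriptors.NumHAcceptors(mol),
}
desc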