PandasVault — Advanced Pandas Functions and Code Snippets
The only Pandas utility package you would ever need. It has no exotic external dependencies. All functions have been compared and tested with alternatives, only the fastest equivalent functions have been developed and included in this package. The package has more than 20 wrapped functions and 100 snippets.
You have the option to view this Readme or run a Colab Notebook.
pipinstallpandasvault
If you can identify performance improvements, or improvements in code length and styling, please open a pull request. This package is new, all help and criticisms are appreciated. I would love to hear about any additional function ideas. If you have a function to contribute please open an issues tab or email me at d.snow(at)nyu.edu.
If you are running the code for the first time load this test dataframe:
!pipinstallpandasvault
importpandasaspdimportnumpyasnpimportpandasvaultaspvnp.random.seed(1)
"""quick way to create a data frame for testing"""df_test=pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd']) \
.assign(target=lambdax: (x['b']+x['a']/x['d'])*x['c'])
1) & (df['b'] <1)] """filter by conditions and the condition on row labels(index)""" df[(df.a > 0) & (df.index.isin([0, 1]))] """regexp filters on strings (vectorized), use .* instead of *""" df[df.category.str.contains(r'.*[0-9].*')] """logical NOT is like this""" df[~df.category.str.contains(r'.*[0-9].*')] """creating complex filters using functions on rows""" df[df.apply(lambda x: x['b'] > x['c'], axis=1)] """Pandas replace operation""" df["a"].round(2).replace(0.87, 17, inplace=True) df["a"][df["a"] < 4] = 19 """Conditionals and selectors""" df.loc[df["a"] > 1, ["a","b","target"]] """Selecting multiple column slices""" df.iloc[:, np.r_[0:2, 4:5]] """apply and map examples""" df[["a","b","c"]].applymap(lambda x: x+1) """add 2 to row 3 and return the series""" df[["a","b","c"]].apply(lambda x: x[0]+2,axis=0) """add 3 to col A and return the series""" df.apply(lambda x: x['a']+1,axis=1) """ Split delimited values in a DataFrame column into two new columns """ df['new1'], df['new2'] = zip(*df['k'].apply(lambda x: x.split(': ', 1))) """ Doing calculations with DataFrame columns that have missing values In example below, swap in 0 for df['col1'] cells that contain null """ df['new3'] = np.where(pd.isnull(df['b']),0,df['a']) + df['c'] """ Exclude certain data type or include certain data types """ df.select_dtypes(exclude=['O','float']) df.select_dtypes(include=['int']) """one liner to normalize a data frame""" (df[["a","b"]] - df[["a","b"]].mean()) / (df[["a","b"]].max() - df[["a","b"]].min()) """groupby used like a histogram to obtain counts on sub-ranges of a variable, pretty handy""" df.groupby(pd.cut(df.a, range(0, 1, 2))).size() """use a local variable use inside a query of pandas using @""" mean = df["a"].mean() df.query("a > @mean") """Calculate the % of missing values in each column""" df.isna().mean() """Calculate the % of missing values in each row""" rows = df.isna().mean(axis=1) ; df.head() ">
"""set display width, col_width etc for interactive pandas session"""pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.max_rows', 100)
"""when you have an excel sheet with spaces in column names"""df.columns= [c.lower().replace(' ', '_') forcindf.columns]
"""Add prefix to all columns"""df.add_prefix("1_")
"""Add suffix to all columns"""df.add_suffix("_Z")
"""Droping column where missing values are above a threshold"""df.dropna(thresh=len(df)*0.95, axis="columns")
"""Given a dataframe df to filter by a series ["a","b"]:"""df[df['category'].isin(["1","0"])]
"""filter by multiple conditions in a dataframe df"""df[(df['a'] >1) & (df['b'] <1)]
"""filter by conditions and the condition on row labels(index)"""df[(df.a>0) & (df.index.isin([0, 1]))]
"""regexp filters on strings (vectorized), use .* instead of *"""df[df.category.str.contains(r'.*[0-9].*')]
"""logical NOT is like this"""df[~df.category.str.contains(r'.*[0-9].*')]
"""creating complex filters using functions on rows"""df[df.apply(lambdax: x['b'] >x['c'], axis=1)]
"""Pandas replace operation"""df["a"].round(2).replace(0.87, 17, inplace=True)
df["a"][df["a"] <4] =19"""Conditionals and selectors"""df.loc[df["a"] >1, ["a","b","target"]]
"""Selecting multiple column slices"""df.iloc[:, np.r_[0:2, 4:5]]
"""apply and map examples"""df[["a","b","c"]].applymap(lambdax: x+1)
"""add 2 to row 3 and return the series"""df[["a","b","c"]].apply(lambdax: x[0]+2,axis=0)
"""add 3 to col A and return the series"""df.apply(lambdax: x['a']+1,axis=1)
""" Split delimited values in a DataFrame column into two new columns """df['new1'], df['new2'] =zip(*df['k'].apply(lambdax: x.split(': ', 1)))
""" Doing calculations with DataFrame columns that have missing values In example below, swap in 0 for df['col1'] cells that contain null """df['new3'] =np.where(pd.isnull(df['b']),0,df['a']) +df['c']
""" Exclude certain data type or include certain data types """df.select_dtypes(exclude=['O','float'])
df.select_dtypes(include=['int'])
"""one liner to normalize a data frame"""
(df[["a","b"]] -df[["a","b"]].mean()) / (df[["a","b"]].max() -df[["a","b"]].min())
"""groupby used like a histogram to obtain counts on sub-ranges of a variable, pretty handy"""df.groupby(pd.cut(df.a, range(0, 1, 2))).size()
"""use a local variable use inside a query of pandas using @"""mean=df["a"].mean()
df.query("a > @mean")
"""Calculate the % of missing values in each column"""df.isna().mean()
"""Calculate the % of missing values in each row"""rows=df.isna().mean(axis=1) ; df.head()
"""To avoid Unnamed: 0 when loading a previously saved csv with index""""""To parse dates""""""To set data types"""df_out=pd.read_csv("data.csv", index_col=0,
parse_dates=['D'],
dtype={"c":"category", "B":"int64"}).set_index("D")
"""Copy data to clipboard; like an excel copy and pastedf = pd.read_clipboard()""""""Read table from websitedf = pd.read_html(url, match="table_name")"""""" Read pdf into dataframe ()!pip install tabulafrom tabula import read_pdfdf = read_pdf('test.pdf', pages='all')"""df_out.head()
importpandasaspdfrompandas.api.typesimportCategoricalDtypeprint("Let's create our own categorical order.")
cat_type=CategoricalDtype(["bad", "good", "excellent"], ordered=True)
df["cats"] =df["cats"].astype(cat_type)
print("Now we can use logical sorting.")
df=df.sort_values("cats", ascending=True)
print("We can also filter this as if they are numbers.")
df[df["cats"] >"bad"]
Let's create our own categorical order.
Now we can use logical sorting.
We can also filter this as if they are numbers.
np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max: df[col] = df[col].astype(np.int8) elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max: df[col] = df[col].astype(np.int16) elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max: df[col] = df[col].astype(np.int32) elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max: df[col] = df[col].astype(np.int64) else: if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: df[col] = df[col].astype(np.float16) elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max: df[col] = df[col].astype(np.float32) else: df[col] = df[col].astype(np.float64) else: df[col] = df[col].astype('category') end_mem = df.memory_usage().sum() / 1024**2 print('Memory usage after optimization is: {:.2f} MB'.format(end_mem)) print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem)) return df df_out = pv.reduce_mem_usage(df); df_out ">
importgcdefreduce_mem_usage(df):
""" iterate through all the columns of a dataframe and modify the data type to reduce memory usage. """start_mem=df.memory_usage().sum() /1024**2print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
forcolindf.columns:
col_type=df[col].dtypegc.collect()
ifcol_type!=object:
c_min=df[col].min()
c_max=df[col].max()
ifstr(col_type)[:3] =='int':
ifc_min>np.iinfo(np.int8).minandc_max<np.iinfo(np.int8).max:
df[col] =df[col].astype(np.int8)
elifc_min>np.iinfo(np.int16).minandc_max<np.iinfo(np.int16).max:
df[col] =df[col].astype(np.int16)
elifc_min>np.iinfo(np.int32).minandc_max<np.iinfo(np.int32).max:
df[col] =df[col].astype(np.int32)
elifc_min>np.iinfo(np.int64).minandc_max<np.iinfo(np.int64).max:
df[col] =df[col].astype(np.int64)
else:
ifc_min>np.finfo(np.float16).minandc_max<np.finfo(np.float16).max:
df[col] =df[col].astype(np.float16)
elifc_min>np.finfo(np.float32).minandc_max<np.finfo(np.float32).max:
df[col] =df[col].astype(np.float32)
else:
df[col] =df[col].astype(np.float64)
else:
df[col] =df[col].astype('category')
end_mem=df.memory_usage().sum() /1024**2print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
print('Decreased by {:.1f}%'.format(100* (start_mem-end_mem) /start_mem))
returndfdf_out=pv.reduce_mem_usage(df); df_out
Memory usage of dataframe is 0.00 MB
Memory usage after optimization is: 0.00 MB
Decreased by 36.3%
defverify_primary_key(df, column_list):
'''Verify if columns in column list can be treat as primary key'''returndf.shape[0] ==df.groupby(column_list).size().reset_index().shape[0]
verify_primary_key(df, ["first_d","second_d"])
True
>>> Shift Columns to Front (func)
df=df_test.copy(); df
a
b
c
d
target
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
deflist_shuff(items, df):
"Bring a list of columns to the front"cols=list(df)
foriinrange(len(items)):
cols.insert(i, cols.pop(cols.index(items[i])))
df=df.loc[:, cols]
df.reset_index(drop=True, inplace=True)
returndfdf_out=pv.list_shuff(["target","c","d"],df); df_out
df_out= (df.assign(stringed=df["a"].astype(str),
ounces=df["b"]*12,# this will allow yo set a titlegalons=lambdadf: df["a"]/128)
.query("b > -1")
.style.set_caption("Average consumption")) ; df_out
b target 0.9215
a d 0.6605
target 0.3206
b a -0.0724
c d -0.1764
b -0.4545
target d -0.4994
c target -0.7647
b d -0.7967
a c -0.8555
dtype: float64
>>> Missing Data Report (func)
df=df_test.copy()
df[df>df.mean()] =None ; df
a
b
c
d
target
0
NaN
NaN
-0.5282
NaN
NaN
1
0.8654
-2.3015
NaN
NaN
-5.9994
2
0.3190
NaN
NaN
-2.0601
NaN
defmissing_data(data):
"Create a dataframe with a percentage and count of missing values"total=data.isnull().sum().sort_values(ascending=False)
percent= (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
returnpd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
pv.df_out=missing_data(df); df_out
defreplace_small_cat(df, columns, thresh=0.2, term="other"):
forcolincolumns:
# Step 1: count the frequenciesfrequencies=df[col].value_counts(normalize=True)
# Step 2: establish your threshold and filter the smaller categoriessmall_categories=frequencies[frequencies<thresh].indexdf[col] =df[col].replace(small_categories, "Other")
returndfdf_out=pv.replace_small_cat(df,["cat"]); df_out.head()
a
b
c
d
target
cat
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
bat
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
bat
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
Other
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
mat
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
mat
>>> Quasi-Constant Features Detection (func)
df=df_test.copy()
df["a"] =3
= threshold: quasi_constant_feature.append(feature) print(len(quasi_constant_feature),' variables are found to be almost constant') return quasi_constant_feature # the original dataset has no constant variable qconstant_col = pv.constant_feature_detect(data=df,threshold=0.9) df_out = df.drop(qconstant_col, axis=1) ; df_out ">
defconstant_feature_detect(data,threshold=0.98):
""" detect features that show the same value for the majority/all of the observations (constant/quasi-constant features) Parameters ---------- data : pd.Dataframe threshold : threshold to identify the variable as constant Returns ------- list of variables names """data_copy=data.copy(deep=True)
quasi_constant_feature= []
forfeatureindata_copy.columns:
predominant= (data_copy[feature].value_counts() /np.float(
len(data_copy))).sort_values(ascending=False).values[0]
ifpredominant>=threshold:
quasi_constant_feature.append(feature)
print(len(quasi_constant_feature),' variables are found to be almost constant')
returnquasi_constant_feature# the original dataset has no constant variableqconstant_col=pv.constant_feature_detect(data=df,threshold=0.9)
df_out=df.drop(qconstant_col, axis=1) ; df_out
0: df[i] = df[i].fillna(df[i].mean()+3*df[i].std()) else: warn("Column %s has no missing" % i) return df df_out = pv.impute_null_with_tail(df,cols=df.columns); df_out ">
defimpute_null_with_tail(df,cols=[]):
""" replacing the NA by values that are at the far end of the distribution of that variable calculated by mean + 3*std """df=df.copy(deep=True)
foriincols:
ifdf[i].isnull().sum()>0:
df[i] =df[i].fillna(df[i].mean()+3*df[i].std())
else:
warn("Column %s has no missing"%i)
returndfdf_out=pv.impute_null_with_tail(df,cols=df.columns); df_out
a
b
c
d
target
0
1.7512
NaN
-0.5282
NaN
NaN
1
0.8654
-2.3015
NaN
NaN
-5.9994
2
0.3190
NaN
NaN
-2.0601
NaN
>>> Detect Outliers (func)
df=df_test.copy(); df
a
b
c
d
target
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
threshold print('Num of outlier detected:',outlier_index.value_counts()[1]) print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index)) return outlier_index, (median_absolute_deviation, median_absolute_deviation) para = (Upper_fence, Lower_fence) tmp = pd.concat([data[col]>Upper_fence,data[col]
defwindsorization(data,col,para,strategy='both'):
""" top-coding & bottom coding (capping the maximum of a distribution at an arbitrarily set value,vice versa) """data_copy=data.copy(deep=True)
ifstrategy=='both':
data_copy.loc[data_copy[col]>para[0],col] =para[0]
data_copy.loc[data_copy[col]<para[1],col] =para[1]
elifstrategy=='top':
data_copy.loc[data_copy[col]>para[0],col] =para[0]
elifstrategy=='bottom':
data_copy.loc[data_copy[col]<para[1],col] =para[1]
returndata_copydf_out=pv.windsorization(data=df,col='a',para=para,strategy='both'); df_out
a
b
c
d
target
0
1.5712
-0.6118
-0.5282
-1.0730
1.1227
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
>>> Drop Outliers
## run the top two examplesdf=df_test.copy(); df
a
b
c
d
target
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
df_out=df[~index] ; df_out
a
b
c
d
target
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
>>> Impute Outliers
defimpute_outlier(data,col,outlier_index,strategy='mean'):
""" impute outlier with mean/median/most frequent values of that variable. """data_copy=data.copy(deep=True)
ifstrategy=='mean':
data_copy.loc[outlier_index,col] =data_copy[col].mean()
elifstrategy=='median':
data_copy.loc[outlier_index,col] =data_copy[col].median()
elifstrategy=='mode':
data_copy.loc[outlier_index,col] =data_copy[col].mode()[0]
returndata_copydf_out=pv.impute_outlier(data=df,col='a', outlier_index=index,strategy='mean'); df_out
defauto_dummy(df, unique=15):
# Creating dummies for small object uniquesiflen(df)<unique:
raiseValueError('unique is set higher than data lenght')
list_dummies=[]
forcolindf.columns:
if (len(df[col].unique()) <unique):
list_dummies.append(col)
print(col)
df_edit=pd.get_dummies(df, columns=list_dummies) # Saves original dataframe#df_edit = pd.concat([df[["year","qtr"]],df_edit],axis=1)returndf_editdf_out=pv.auto_dummy(df, unique=3); df_out
a
b
c
d
target
e_1
e_2
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
0
1
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
1
0
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
1
0
>>> Binarise Empty Columns (func)
df=df_test.copy()
df[df>df.mean()] =None ; df
a
b
c
d
target
0
NaN
NaN
-0.5282
NaN
NaN
1
0.8654
-2.3015
NaN
NaN
-5.9994
2
0.3190
NaN
NaN
-2.0601
NaN
frac: # if more than 70% is null binarise print(col) this.append(col) df[col] = df[col].astype(float) df[col] = df[col].apply(lambda x: 0 if (np.isnan(x)) else 1) df = pd.get_dummies(df, columns = this) return df df_out = pv.binarise_empty(df, frac=0.6); df_out ">
defbinarise_empty(df, frac=80):
# Binarise slightly empty columnsthis=[]
forcolindf.columns:
ifdf[col].dtype!="object":
is_null=df[col].isnull().astype(int).sum()
if (is_null/df.shape[0]) >frac: # if more than 70% is null binariseprint(col)
this.append(col)
df[col] =df[col].astype(float)
df[col] =df[col].apply(lambdax: 0if (np.isnan(x)) else1)
df=pd.get_dummies(df, columns=this)
returndfdf_out=pv.binarise_empty(df, frac=0.6); df_out
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 10.14 0.91 22 1 0 43.36m
a
b
c
d
target
gen_0
gen_1
gen_2
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
-1.8292
-2.6469
0.5059
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
-3.5190
99.1619
3.6243
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
-1.4668
1.3677
3.1826
>>> Prinicipal Component Features (func)
df=df_test.copy(); df
a
b
c
d
target
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
1: pca = PCA(n_components=variance_or_components) else: # automted selection based on variance pca = PCA(n_components=variance_or_components,svd_solver="full") X_pca = pca.fit_transform(df.drop(drop_cols,axis=1)) df = pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) for i in range(X_pca.shape[1])])),axis=1) return df df_out = pv.pca_feature(df,variance_or_components=0.80,drop_cols=["target","a"]); df_out ">
fromsklearn.decompositionimportPCA, IncrementalPCAdefpca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,drop_cols=None):
ifmemory_issues:
ifnotmem_iss_component:
raiseValueError("If you have memory issues, you have to preselect mem_iss_component")
pca=IncrementalPCA(mem_iss_component)
else:
ifvariance_or_components>1:
pca=PCA(n_components=variance_or_components)
else: # automted selection based on variancepca=PCA(n_components=variance_or_components,svd_solver="full")
X_pca=pca.fit_transform(df.drop(drop_cols,axis=1))
df=pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=["PCA_"+str(i+1) foriinrange(X_pca.shape[1])])),axis=1)
returndfdf_out=pv.pca_feature(df,variance_or_components=0.80,drop_cols=["target","a"]); df_out
target
a
PCA_1
PCA_2
0
1.1227
1.6243
-1.2944
-0.7684
1
-5.9994
0.8654
1.5375
-0.4537
2
-0.5910
0.3190
-0.2431
1.2220
>>> Multiple Lags (func)
df=df_test.copy(); df
a
b
c
d
target
0
1.6243
-0.6118
-0.5282
-1.0730
1.1227
1
0.8654
-2.3015
1.7448
-0.7612
-5.9994
2
0.3190
-0.2494
1.4621
-2.0601
-0.5910
defmultiple_lags(df, start=1, end=3,columns=None):
ifnotcolumns:
columns=df.columns.to_list()
lags=range(start, end+1) # Just two lags for demonstration.df=df.assign(**{
'{}_t_{}'.format(col, t): df[col].shift(t)
fortinlagsforcolincolumns
})
returndfdf_out=pv.multiple_lags(df, start=1, end=2,columns=["a","target"]); df_out
"""convert column to UPPERCASE"""col_name="C"df[col_name].str.upper()
"""count string occurence in each row"""df[col_name].str.count(r'\d') # counts number of digits"""count # o chars in each row"""df[col_name].str.count('o') # counts number of digits"""split rows"""s=pd.Series(["this is a regular sentence", "https://docs.p.org", np.nan])
s.str.split()
"""this creates new columns with the different split values (instead of lists)"""s.str.split(expand=True)
"""limit the number of splits to 1, and start spliting from the rights side"""s.str.rsplit("/", n=1, expand=True)
Mon)""" df[col_name].str.replace(r'(\w+day\b)', lambda x: x.groups[0][:3]) # () in r'' creates a group with one element, which we acces with x.groups[0] """create dataframe from regex groups (str.extract() uses first match of the pattern only)""" df[col_name].str.extract(r'(\d?\d):(\d\d)') df[col_name].str.extract(r'(?P
\d?\d):(?P
\d\d)') df[col_name].str.extract(r'(?P
This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).