Removing Non-Alphanumeric Characters From A Column
Removing non-alphanumeric characters and special symbols from a column in a pandas DataFrame
- Remove symbols and return alphanumerics
- Remove symbols & numbers and return alphabets only
- Bonus: Remove symbols & letters and return numbers only
This is a short blog post. I wanted to document this recipe for my own benefit, and hopefully it will help others. I was working with a very messy dataset in which some columns contained non-alphanumeric characters such as #, !, $, ^, *, ) and even emojis.
Python's built-in str type has two methods, isalnum and isalpha.
isalnum returns True if all characters in the string are alphanumeric, i.e. letters and numbers.
isalpha returns True if all characters in the string are alphabetic (letters only, no numbers).
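Because both methods test the whole string at once, a single space or symbol makes them return False. That is why the recipes in this post apply them one character at a time. A quick sketch, using a few made-up sample strings:

```python
# str.isalnum and str.isalpha test the whole string at once.
samples = ["abc", "abc111", "a b c", "abc111#@$@"]
checks = {s: (s.isalnum(), s.isalpha()) for s in samples}
for s, (is_alnum, is_alpha) in checks.items():
    print(f"{s!r:>14}  isalnum={is_alnum}  isalpha={is_alpha}")

# Applied per character, the same method works as a filter.
kept = "".join(ch for ch in "abc111#@$@" if ch.isalnum())
print(kept)
```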
import numpy as np
import pandas as pd
df = pd.DataFrame({'col':['abc', 'a b c', 'a_b_c', '#$#$abc', 'abc111', 'abc111#@$@', ' abc !!! 123', 'ABC']})
df
# Keep only the letters and digits of a string
def alphanum(element):
    return "".join(filter(str.isalnum, element))

df.loc[:,'alphanum'] = [alphanum(x) for x in df.col]
df
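An alternative worth knowing is the vectorized .str.replace with a regular expression. The sketch below is not the approach used in this post, and the column name alphanum_regex is just a placeholder. It assumes ASCII-only output is acceptable: the character class [^A-Za-z0-9] also strips accented letters, which str.isalnum would keep.

```python
import pandas as pd

# Same sample data as in the post
df = pd.DataFrame({'col': ['abc', 'a b c', 'a_b_c', '#$#$abc', 'abc111',
                           'abc111#@$@', ' abc !!! 123', 'ABC']})

# Drop every character that is not an ASCII letter or digit.
# Assumption: non-ASCII letters (e.g. 'é') do not need to be preserved.
df['alphanum_regex'] = df['col'].str.replace(r'[^A-Za-z0-9]', '', regex=True)
print(df)
```

One practical difference: the .str accessor passes missing values through as NaN, whereas the list comprehension raises a TypeError when it meets one.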
# Keep only the letters of a string
def alphabets(element):
    return "".join(filter(str.isalpha, element))

df.loc[:,'alphabets'] = [alphabets(x) for x in df.col]
df
# Keep only the numeric characters of a string
def numbers(element):
    return "".join(filter(str.isnumeric, element))

df.loc[:,'num'] = [numbers(x) for x in df.col]
df
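One caveat about str.isnumeric: it also accepts characters such as fractions and superscripts, which int() cannot parse. If the data might contain those, str.isdecimal is the stricter filter. A small sketch comparing the three related methods follows; digits_only is a hypothetical helper, not part of the recipe above.

```python
# Three similar-looking string methods accept different sets of characters.
chars = ["7", "²", "½"]
table = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in chars}
for ch, (dec, dig, num) in table.items():
    print(f"{ch!r}: isdecimal={dec} isdigit={dig} isnumeric={num}")

def digits_only(element):
    """Keep only decimal digits, i.e. characters int() can parse."""
    return "".join(filter(str.isdecimal, element))

cleaned = digits_only("abc½111²")
print(cleaned)
```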
df.dtypes
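Note that the num column is returned with dtype object (i.e. string) and not as a number, so be sure to convert it to int before doing any arithmetic. Rows that contained no digits hold an empty string, which int() cannot parse.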
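A plain astype(int) raises a ValueError as soon as it meets an empty string, so one option is pd.to_numeric followed by the nullable Int64 dtype. This is a minimal sketch on a stand-in Series, not the full DataFrame from this post:

```python
import pandas as pd

# Stand-in for the 'num' column: digit strings, plus an empty string
# for a row that contained no digits.
num = pd.Series(['', '111', '123'], name='num')

# With errors='coerce', values that cannot be parsed become NaN instead of
# raising; the nullable 'Int64' dtype then stores them as <NA>.
num_int = pd.to_numeric(num, errors='coerce').astype('Int64')
print(num_int)
```

Whether a missing value or a 0 is the right stand-in for "no digits" depends on the dataset; the sketch keeps them missing.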