Python - Find unique words in Text File
Finding the unique words in a text file takes three stages: cleaning the text, splitting it into words, and then keeping only one copy of each word.
In this tutorial, we will learn how to find unique words in a text file.
Steps to find unique words
To find unique words in a text file, follow these steps.
- Open the text file in read mode and read its contents.
- Convert the text to lower case or upper case. We do not want 'apple' to be different from 'Apple'.
- Split the file contents into a list of words.
- Clean the words that have punctuation marks attached, for example by stripping full stops, commas, etc. from the edges of each word.
- Also, remove the possessive apostrophe-s ('s).
- You may also add more text cleaning steps here.
- Now find the unique words in the list using a Python For Loop and the Python Membership Operator.
- After finding the unique words, sort them for presentation.
During text cleaning, you can also remove helping verbs and other words that you do not want in the result.
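As a minimal sketch of that optional cleaning step, the snippet below filters helping verbs out of a list of words. The name HELPING_VERBS, the helper remove_helping_verbs(), and the short verb list itself are assumptions made for illustration; they are not part of the tutorial's program, and the list is far from complete.

```python
# Hypothetical, incomplete set of helping verbs; extend it for your own text.
HELPING_VERBS = {"is", "am", "are", "was", "were", "be", "been",
                 "has", "have", "had", "do", "does", "did"}

def remove_helping_verbs(words):
    """Return the words that are not in HELPING_VERBS, keeping their order."""
    return [word for word in words if word not in HELPING_VERBS]

words = ["apple", "is", "a", "very", "big", "company"]
filtered = remove_helping_verbs(words)
print(filtered)  # ['apple', 'a', 'very', 'big', 'company']
```

The words are expected to be lower case already, so this filter would run after the case conversion step.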
Example 1: Find unique words in text file
Now, we will put all the above steps to work in a Python program.
Consider the following text file, saved as data.txt.
Apple is a very big company. An apple a day keeps doctor away. A big fat cat came across the road beside doctor's office.
The doctor owns apple device.
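If you want to follow along, the sample text has to exist as data.txt in the working directory. One possible way to create it, assuming the file name data.txt and UTF-8 encoding, is the short sketch below. The variable name sample_text is chosen here only for illustration.

```python
from pathlib import Path

# The sample text from this tutorial, on two lines.
sample_text = (
    "Apple is a very big company. An apple a day keeps doctor away. "
    "A big fat cat came across the road beside doctor's office.\n"
    "The doctor owns apple device."
)

# Write the text to data.txt in the current working directory.
Path("data.txt").write_text(sample_text, encoding="utf-8")

# Read it back to confirm the file was written.
print(Path("data.txt").read_text(encoding="utf-8"))
```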
Python Program
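# read the whole file; the with statement closes it automatically
with open('data.txt', 'r') as text_file:
    text = text_file.read()
# cleaning: lower case, split on whitespace, strip punctuation, drop 's
text = text.lower()
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]
# finding unique words
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
# sort alphabetically for presentation
unique.sort()
# print the result
print(unique)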
Output
['a', 'across', 'an', 'apple', 'away', 'beside', 'big', 'came', 'cat', 'company', 'day', 'device', 'doctor', 'fat', 'is', 'keeps', 'office', 'owns', 'road', 'the', 'very']
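As an alternative to the for loop and membership operator, the same result can be sketched with Python's built-in set, which discards duplicates on its own. This is a swapped-in technique rather than the method used in the program above; the function name unique_words() and the sample string are assumptions for illustration.

```python
def unique_words(text):
    """Return the sorted unique words of text, cleaned as in the tutorial."""
    words = text.lower().split()
    words = [word.strip('.,!;()[]') for word in words]
    words = [word.replace("'s", '') for word in words]
    # set() drops duplicates; sorted() returns a new sorted list
    return sorted(set(words))

sample = "The doctor owns apple device. An apple a day keeps doctor away."
result = unique_words(sample)
print(result)
# ['a', 'an', 'apple', 'away', 'day', 'device', 'doctor', 'keeps', 'owns', 'the']
```

A membership test on a list scans the list each time, while a set looks words up by hash, so the set version tends to scale better on large files.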
Translation of Steps into Python Code
Following is the list of Python concepts we used in the above program to find the unique words.
- open() function to get a reference to the file object.
- file.read() method to read contents of the file.
- str.lower() method to convert text to lower case.
- str.split() method to split the text into words separated by white space characters like single space, new line, tab, etc.
- str.strip() method to strip the punctuation marks from the edges of words.
- str.replace() method to replace 's with an empty string. Note that it removes every occurrence of 's in a word, not only one at the end.
- for loop to iterate over each word in the words list.
- in - membership operator to check if the word is present in unique.
- list.append() method to append the word to unique list.
- list.sort() method to sort unique words in lexicographic ascending order.
- print() function to print the unique words list.
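The cleaning calls listed above can be tightened in a couple of ways. The sketch below is one possible variation, not the tutorial's own code: it strips every character in string.punctuation instead of a hand-written list, and it removes 's only when it appears at the end of a word. The helper name clean_word() is an assumption for illustration.

```python
import string

def clean_word(word):
    """Strip punctuation from both edges, then drop a trailing 's."""
    # string.punctuation holds all ASCII punctuation characters
    word = word.strip(string.punctuation)
    # remove the possessive only when it ends the word
    if word.endswith("'s"):
        word = word[:-2]
    return word

raw = ["doctor's", "(apple),", '"road?"', "office."]
cleaned = [clean_word(word) for word in raw]
print(cleaned)  # ['doctor', 'apple', 'road', 'office']
```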
Summary
In this tutorial of Python Examples, we learned how to find the unique words in a text file, with the help of an example program.