Python - Find unique words in Text File


Finding unique words in a text file involves cleaning the text, splitting it into words, and then collecting the distinct words.

In this tutorial, we will learn how to find unique words in a text file.

Steps to find unique words

To find unique words in a text file, follow these steps.

  1. Open the text file in read mode and read its contents.
  2. Convert the text to lower case or upper case, so that 'apple' and 'Apple' are treated as the same word.
  3. Split the file contents into a list of words.
  4. Strip punctuation marks, such as full stops and commas, from the edges of the words.
  5. Remove the apostrophe-s ('s) from the words.
  6. You may add more text cleaning steps here.
  7. Find the unique words in the list using a Python For Loop and the Python Membership Operator.
  8. After finding the unique words, sort them for presentation.

During text cleaning, you can also remove helping verbs and other words that you do not want in the result.
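
As a rough illustration of such an extra cleaning step, the sketch below filters a small, hand-picked set of helping verbs out of a word list. The names HELPING_VERBS and remove_helping_verbs are hypothetical, and the set is only a sample, not a complete list.

```python
# Hypothetical sample set of helping verbs; extend it to suit your text.
HELPING_VERBS = {'is', 'am', 'are', 'was', 'were', 'be', 'been', 'has', 'have', 'had'}

def remove_helping_verbs(words):
    """Return the words that are not in HELPING_VERBS, preserving order."""
    return [word for word in words if word not in HELPING_VERBS]

sample = ['apple', 'is', 'a', 'very', 'big', 'company']
filtered = remove_helping_verbs(sample)
print(filtered)
```

A step like this would run after the words have been lowercased and stripped of punctuation, so that 'Is' and 'is.' are matched as well.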

Example 1: Find unique words in text file

Now, we will put all the above steps to work in a Python program.

Consider the following text file, data.txt.

Apple is a very big company. An apple a day keeps doctor away. A big fat cat came across the road beside doctor's office.
The doctor owns apple device.

Python Program

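# read the file; the with statement closes it when the block ends
with open('data.txt', 'r') as text_file:
    text = text_file.read()

# cleaning: lower case, split on white space, strip punctuation, drop 's
text = text.lower()
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]

# finding unique: keep a word only the first time it is seen
unique = []
for word in words:
    if word not in unique:
        unique.append(word)

# sort the unique words alphabetically
unique.sort()

# print the result
print(unique)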

Output

['a', 'across', 'an', 'apple', 'away', 'beside', 'big', 'came', 'cat', 'company', 'day', 'device', 'doctor', 'fat', 'is', 'keeps', 'office', 'owns', 'road', 'the', 'very']
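
The for loop with the membership operator scans the whole unique list for every word, which can get slow for large files. As an alternative sketch (not the method used in the program above), a set discards duplicates automatically. The text is given inline here, instead of being read from data.txt, so that the snippet runs on its own.

```python
text = ("Apple is a very big company. An apple a day keeps doctor away. "
        "A big fat cat came across the road beside doctor's office.\n"
        "The doctor owns apple device.")

# same cleaning as in the program above
words = text.lower().split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]

# a set keeps one copy of each word; sorted() returns a sorted list
unique = sorted(set(words))
print(unique)
```

This should print the same list as the output shown above.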

Translation of Steps into Python Code

The following Python concepts are used in the above program to find the unique words.

  • open() function to get a reference to the file object.
  • file.read() method to read contents of the file.
  • str.lower() method to convert text to lower case.
  • str.split() method to split the text into words separated by white space characters like single space, new line, tab, etc.
  • str.strip() method to strip the punctuation marks from the edges of words.
  • str.replace() method to replace 's with nothing. Note that it removes every occurrence of 's in a word, not only one at the end.
  • for loop to iterate over each word in the words list.
  • in - membership operator to check if the word is present in unique.
  • list.append() method to append the word to unique list.
  • list.sort() method to sort unique words in lexicographic ascending order.
  • print() function to print the unique words list.
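
Two of these methods behave in ways worth checking on small inputs: str.strip() removes characters only from the ends of a string, and str.replace() replaces every occurrence, wherever it appears. The sketch below shows both, along with a hypothetical helper, strip_possessive, that removes 's only when it is the last two characters of the word.

```python
# str.strip() only touches the edges of the word
print('(hello,)'.strip('.,!;()[]'))   # hello
print('e.g.'.strip('.,!;()[]'))       # e.g

# str.replace() removes 's wherever it occurs
print("'sup".replace("'s", ''))       # up

def strip_possessive(word):
    """Hypothetical helper: drop a trailing 's, leave the rest untouched."""
    return word[:-2] if word.endswith("'s") else word

print(strip_possessive("doctor's"))   # doctor
print(strip_possessive("'sup"))       # 'sup
```

For the sample text in this tutorial, both approaches give the same result, since 's appears only at the end of doctor's.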

Summary

In this tutorial of Python Examples, we learned how to find unique words in a text file, with the help of an example program.