How to Create a Search Engine With Python and Laravel

Indexing is a data structure technique that allows you to quickly retrieve records from a database file. It’s based on the same attributes on which the indices are done.

What Is an Index-Based Search Engine?

An index-based search engine on Python is one that relies on data stored in an inverted index to find answers. When a user enters a search query, it will scan the appropriate index to surface the closest matching result rather than searching one-by-one through an entire set of document lists.

An index is a small table with only two columns. The first column comprises a copy of the primary or candidate key of a table. Its second column contains a set of pointers for holding the address of the disk block where that specific key value is stored.

In search engine applications, this is called an inverted index.

What Is an Inverted Index?

An inverted index is an index data structure or database index built to facilitate search queries. It works by storing data mapping from the content of a document list, such as words or numbers, to its locations in the list of documents. With the words or numbers that have been mapped, the search process will be faster and more efficient than searching it one by one through all of the document lists.

What Is a Query?

A query, or a search query as it’s called in a search engine, is a query based on a specific search term that a user enters into a search engine to satisfy his or her information needs.

It’s a request for information that is made using a search engine. Every time a user puts a string of characters in a search engine and presses “Enter,” a search engine query is made. And the web application will display the search query results

Requirements to Build an Index-Based Search Engine in Python

You should have :

python 3.x.x
The package manager PIP
Code Editor (in this tutorial I will use Visual Studio Code, Optional)

Requirement Package

Pickle Library: Pickle is a default module installed on both Python 2 and Python 3. If your pickle module is not found, run this command in your terminal to install this library:

pip install pickle-mixin

We’ll use the pickle module to save our index file and load the index file when the query is called. Pickle module is useful for storing Python objects into a file. With this function, we will use pickle to save our dictionary Python to file, thus allowing us to read files faster.

More on Python: PCA Using Python: A Tutorial

How to Create An Inverted Index in Python

In order to make an inverted index, we’ll use Python’s dictionary. The dictionary will save the term as a key and the document’s score as a value. This way we can save the data document and score document for each word

To do that, we will use the term frequency — inverse dense frequency (TF-IDF) technique to calculate the score from the document.

TF-IDF is a technique that’s used to find the meaning of sentences and cancels out the shortcomings of the bag of words technique. The bag of words technique is good for text classification or for helping a machine read words in numbers.

Script Indexing with TF-IDF

import re
import sys
import json
import pickle
import math

#Argumen check
if len(sys.argv) !=3 :
	print ("\n\\Use python \n\t tf-idf.py [data.json] [output]\n")
	sys.exit(1)


#data argumen
input_data = sys.argv[1]
output_data = sys.argv[2]


data = open(input_data).read()
list_data = data.split("\n")

sw = open("stopword.txt").read().split("\n")

content=[]
for x in list_data :
	try:
		content.append(json.loads(x))
	except:
		continue


# Clean string function
def clean_str(text) :
	text = (text.encode('ascii', 'ignore')).decode("utf-8")
	text = re.sub("&.*?;", "", text)
	text = re.sub(">", "", text)    
	text = re.sub("[\]\|\[\@\,\$\%\*\&\\\(\)\":]", "", text)
	text = re.sub("-", " ", text)
	text = re.sub("\.+", "", text)
	text = re.sub("^\s+","" ,text)
	text = text.lower()
	return text



df_data={}
tf_data={}
idf_data={}

i=0;
for data in content :
	tf={} 
	#clean and list word
	clean_title = clean_str(data['book_title'])
	list_word = clean_title.split(" ")
	
	for word in list_word :
		if word in sw:
			continue
		
		#tf term frequency
		if word in tf :
			tf[word] += 1
		else :
			tf[word] = 1

		#df document frequency
		if word in df_data :
			df_data[word] += 1
		else :
			df_data[word] = 1

	tf_data[data['url']] = tf


# Calculate Idf
for x in df_data :
   idf_data[x] = 1 + math.log10(len(tf_data)/df_data[x])

tf_idf = {}

for word in df_data:
	list_doc = []
	for data in content:
		tf_value = 0

		if word in tf_data[data['url']] :
			tf_value = tf_data[data['url']][word]

		weight = tf_value * idf_data[word]

		doc = {
			'url' : data['url'],
			'title' : data['book_title'],
			'image' : data['image-url'],
			'price' : data['price'],
			'score' : weight
		}

		if doc['score'] != 0 :
			if doc not in list_doc:
				list_doc.append(doc)
		
		
	tf_idf[word] = list_doc

# Write dictionary to file
with open(output_data, 'wb') as file:
     pickle.dump(tf_idf, file)

Argument Line Script

if len(sys.argv) !=3 :
   print ("\n\nPenggunaan\n\ttf-idf.py [data.json] [output]\n")
   sys.exit(1)
input_data = sys.argv[1]
output_data = sys.argv[2]
data = open(input_data).read()
list_data = data.split("\n")
#load Stopword
sw = open("stopword.txt").read().split("\n")
content=[]
for x in list_data :
    try:
        content.append(json.loads(x))
    except: continue

In the code above, I loaded the data that we returned in the scraping process and entered a “stopword” term that we don’t use to index. To run the script, type the following command:

python3 tf-idf.py <data_json> <index_name>

is the data set that we get when doing the scraping process, and is the output from this script that makes it an index file.

Cleaning the String Data Sets

# Clean string function                       
def clean_str(text) :                        
    text = (text.encode('ascii', 'ignore')).decode("utf-8")
    text = text.lower()                   
    text = re.sub("&.*?;", "", text)                        
    text = re.sub(">", "", text)                            
    text = re.sub("[\]\|\[\@\,\$\%\*\&\\\(\)\":]", "", text)               
    text = re.sub("-", " ", text)                        
    text = re.sub("\.+", "", text)                        
    text = re.sub("^\s+","" ,text)                        
                          
    return text

As the name implies, this function works to clean the book_title from data sets that come from useless punctuation.

How to Calculate the TF-IDF

There are three terms you should know:

Term frequency (TF): This is the frequency of term for all of the document.
Document frequency (DF): This is the frequency of term of a specific document.
Inverse document frequency (IDF): This is the value you get from calculating the DF with the formula: IDF =Log[(# Number of documents) / (Number of documents containing the word (DF))]

With this TF-IDF technique, we can get the weight for each word against the document. The weight of each word is what we’ll use as our score value. The formula to get this weight is: WEIGHT = TF * IDF

Save the Index to File

with open(output_data, 'wb') as file:
   pickle.dump(tf_idf, file)

This code serves to save the dictionary into a file using the pickle module.

Making a Query Script

The query script that will be created must be able to read the dictionary stored in the earlier file. Of course, if you saved it using the pickle module, then we’ll use the pickle module to read it.

import re
import sys
import json
import pickle

#Argumen check
if len(sys.argv) != 4 :
	print ("\n\nPenggunaan\n\tquery.py [index] [n] [query]..\n")
	sys.exit(1)

query = sys.argv[3].split(" ")
n = int(sys.argv[2])

with open(sys.argv[1], 'rb') as indexdb:
	indexFile = pickle.load(indexdb)

#query
list_doc = {}
for q in query:
	try :
		for doc in indexFile[q]:
			if doc['url'] in list_doc :
				list_doc[doc['url']]['score'] += doc['score']
			else :
				list_doc[doc['url']] = doc
	except :
		continue


#convert to list
list_data=[]
for data in list_doc :
	list_data.append(list_doc[data])


#sorting list descending
count=1;
for data in sorted(list_data, key=lambda k: k['score'], reverse=True):
	y = json.dumps(data)
	print(y)
	if (count == n) :
		break
	count+=1

How to Run the Query Script

To run this script, you just must type this command in your terminal:

python3 query.py <index_file> <n-query> <string-query>

<index_file> is the index_file we created earlier in pickle module.

<n-query> is the number of results we want to display.

<string-query> is a keyword for the information the user wants.

How to Load the Index File

with open(sys.argv[1], 'rb') as indexdb:
   indexFile = pickle.load(indexdb)

It doesn’t take long to read the file dictionary with the pickle module. We can even write the index with a pickle module.

How to Calculate the Score of a Document Toward a Query

The score we get on the index can’t be a ranking reference because the index we make consists of only one gram of word. As a result, phrases that are made up of more than one word can’t be fixed on the score of the index.

To solve this problem, we’ll divide the words in the query and calculate the score for each query word.

for q in query:
   try :
      for doc in indexFile[q]:
          if doc['url'] in list_doc :
              list_doc[doc['url']]['score'] += doc['score']
          else :
              list_doc[doc['url']] = doc
    except :
          continue

How to Rank a Query Result

To make a rank, we only need to sort the search results based on the highest score.

count=1;
for data in sorted(list_data, key=lambda k: k['score'], reverse=True):
   y = json.dumps(data)
   print(y)
   if (count == n) :
      break
   count+=1

A tutorial on how to build a search engine in Python. | Video: PyGotham 2017

More on Python: Python Databases 101: How to Choose a Database Library

Query Script Example

Using Query Script. | Image: Andika Pratama

You can see in the output of a query script that the data set is related to querying keywords in the JSON format. JSON format will make reading the results easier on the search engine interface we will create.