Building your own Semantic Search Implementation

Incorporating semantic search on your website can significantly enhance the discoverability of content. Semantic search is a feature available in Optimizely’s ‘Content Graph’, as well as in dedicated search providers such as Algolia, Coveo, and Hawksearch.

I wanted to investigate what other options are available, primarily to gain knowledge but also to see if it was possible to develop a solution and have control over the ML models being used.

So far, I have barely scratched the surface, but I have learnt a lot, and I wanted to share this knowledge via this blog post. I have also created a demo application, which is available on GitHub (https://github.com/andrewmarkham/Machine-Learning).

Technology Used

Large Language Models (LLMs) and Vector Databases are the foundational elements for building Semantic / Neural search solutions.

Python is the dominant language for developing AI / ML applications, but Java and JavaScript also have good support. I used JavaScript for the demo application.

Large Language Models

Large Language Models are used to process natural language and are trained to perform different tasks, such as Sentence Similarity. The model transforms the input text into a vector embedding, which is stored in a vector database for querying.

A vector embedding is a representation of source data (text, image, etc.) as a multidimensional numerical array. The number of dimensions and representation of the data are specific to the model, so you cannot mix and match models when indexing and querying.
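To make this concrete, below is a toy sketch (not part of the demo application) showing how two embeddings can be compared with cosine similarity. The vectors are made-up three-dimensional examples; the model used later in this post produces 384 dimensions, but the principle is identical.

// Cosine similarity: dot product divided by the product of the magnitudes
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Toy 'embeddings'; a real model would produce these from the input text
const cat = [0.9, 0.1, 0.2];
const kitten = [0.85, 0.15, 0.25];
const car = [0.1, 0.9, 0.3];

console.log(cosineSimilarity(cat, kitten)); // ~0.99: semantically close
console.log(cosineSimilarity(cat, car));    // ~0.27: semantically distant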

I am using a model from Hugging Face; the website is a fantastic resource for accessing different models and datasets.

Vector Database

A vector database stores the data (text is transformed into embeddings and persisted). To perform a query (semantic search), you transform the search query into an embedding and then perform a query with the embedding against the database.

There are many options when sourcing a vector database. MongoDB and Elasticsearch now include this functionality, or if you are looking at dedicated services, Pinecone and Milvus are two options. In my demo application, I use Postgres with the ‘pgvector’ extension enabled; this adds vector support and can run within a Docker container.

Demo Application

I have created an application that demonstrates two functions:

  1. How to create embeddings from text input and store them in a database
  2. How to search for content semantically.

Create the database

docker-compose.yml

services:
  db:
    hostname: db
    image: ankane/pgvector # this image already has the pgvector extension installed
    ports:
      - 5432:5432
    restart: always
    environment:
      - POSTGRES_DB=vectordb
      - POSTGRES_USER=testuser
      - POSTGRES_PASSWORD=testpwd
      - POSTGRES_HOST_AUTH_METHOD=trust
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql

Create the docker-compose.yml file shown above, then start the instance using the command docker-compose up -d. This will start a new container and initialise the database using the init.sql script shown below.

/* This file is used to initialise the vector extension in the database */
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS Articles (
    id SERIAL PRIMARY KEY,
    embedding vector(384), /* 384 is the dimension of the embedding model */
    text text,
    created_at timestamptz DEFAULT now()
);
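To check that the container initialised correctly, you can connect and confirm the extension exists. The snippet below is a minimal sketch (not part of the demo application) using the node-postgres ‘pg’ package, with the credentials taken from the docker-compose file:

import { Client } from 'pg';

// Connection details taken from docker-compose.yml above
const client = new Client({
    host: 'localhost',
    port: 5432,
    database: 'vectordb',
    user: 'testuser',
    password: 'testpwd'
});

await client.connect();

// The init script should have registered the extension in the pg_extension catalog
const res = await client.query("SELECT extname FROM pg_extension WHERE extname = 'vector'");
console.log(res.rows); // expected: [ { extname: 'vector' } ]

await client.end();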

Indexing and searching

The demo application includes a Node.js service (search-server) that 1) indexes text and 2) searches for content. Both of these functions use the Hugging Face inference library (see: https://github.com/huggingface/huggingface.js).
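Both functions also rely on a getPool() helper that is not shown in this post. As an illustration only, a minimal version using the node-postgres ‘pg’ package (with the connection details from the docker-compose file) might look like this:

import { Pool } from 'pg';

let pool: Pool | undefined;

// Lazily create a single connection pool and reuse it across requests
function getPool(): Pool {
    if (!pool) {
        pool = new Pool({
            host: 'localhost',
            port: 5432,
            database: 'vectordb',
            user: 'testuser',
            password: 'testpwd'
        });
    }
    return pool;
}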

Indexing

app.post('/', async (req: Request, res: Response) => {
    const { text, id } = req.body;

    // HUGGINGFACE holds the Hugging Face access token (defined elsewhere in the demo)
    const hfInference = new HfInference(HUGGINGFACE);

    // Create the embedding for the passage text
    const embeddings = await hfInference.featureExtraction({
        model: "intfloat/e5-small-v2",
        inputs: `passage: ${text}`
    }) as number[];

    await addRecordToTable(text, embeddings);

    // Send a response
    res.status(200).json({ message: 'Data received successfully' });
});

The call to featureExtraction above uses the Hugging Face API to produce the embedding. The ‘model’ parameter informs Hugging Face which ML model to use, and the ‘inputs’ parameter is the text used to create the embeddings.

Note: the ‘passage:’ prefix is a directive the model requires.

async function addRecordToTable(text: string, embeddings: number[]) {

    // Get a PostgreSQL connection pool
    const pool = getPool();

    const client = await pool.connect();

    // pgvector accepts a vector as a string literal such as '[0.1,0.2,...]',
    // which is exactly what JSON.stringify produces for a number array
    const s = JSON.stringify(embeddings);

    try {
        await client.query('BEGIN');
        await client.query('INSERT INTO Articles (text, embedding) VALUES ($1, $2)', [text, s]);
        await client.query('COMMIT');
        console.log('Record added successfully!');
    } catch (error) {
        await client.query('ROLLBACK');
        console.error('Error adding record:', error);
    } finally {
        client.release();
    }
};

The code above adds the text and associated embeddings to the database. Note that pgvector accepts the embedding as a string such as ‘[0.1,0.2,...]’, which is why JSON.stringify is used.
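As a usage illustration (the port and base URL are assumptions for this sketch, not taken from the demo application), a document could then be indexed by posting to the service:

// Index a sample document; assumes the search-server listens on port 3000
const response = await fetch('http://localhost:3000/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        id: 1,
        text: 'PostgreSQL can store vector embeddings using the pgvector extension.'
    })
});

console.log(await response.json()); // { message: 'Data received successfully' }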

Searching

app.get('/search', async (req: Request, res: Response) => {
    const { text } = req.query;

    if (typeof text === "string") {
        const hfInference = new HfInference(HUGGINGFACE);

        // Create the embedding for the search phrase
        const embeddings = await hfInference.featureExtraction({
            model: "intfloat/e5-small-v2",
            inputs: `query: ${text}`
        }) as number[];

        const results = await query(embeddings);

        // query() returns undefined if the database call failed
        res.status(200).json(results?.rows ?? []);
    }
    else {
        console.log("text is not a string");
        res.status(400).json({ message: 'text is not a string' });
    }
});

Searching is very straightforward. The submitted search phrase is used to create an embedding, which is then used to query the database.

Note: the ‘query:’ prefix is another directive the ML model requires.
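Again as an illustration (the port is an assumption), the search endpoint could be called like this:

// Search for content semantically; assumes the search-server listens on port 3000
const params = new URLSearchParams({ text: 'storing vectors in a database' });
const res = await fetch(`http://localhost:3000/search?${params}`);

// Each row contains the stored text and its cosine similarity to the query
for (const row of await res.json()) {
    console.log(row.cosine_similarity, row.text);
}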

async function query(embeddings: number[]) {
    // Get a PostgreSQL connection pool
    const pool = getPool();

    const client = await pool.connect();
    const s = JSON.stringify(embeddings);
    try {
        await client.query('BEGIN');

        // Return the five records closest to the search embedding
        const res = await client.query('SELECT text, 1 - (embedding <=> $1) AS cosine_similarity FROM Articles ORDER BY cosine_similarity desc LIMIT 5', [s]);

        await client.query('COMMIT');

        return res;
    } catch (error) {
        await client.query('ROLLBACK');
        console.error('Error executing query:', error);
    } finally {
        client.release();
    }
}

The method above contains the SQL query that is run against the database and returns records similar to the search phrase.

You may have noticed the ‘<=>’ operator; this is pgvector’s ‘cosine distance’ operator, a calculation used to determine similarity. The query converts the distance into a similarity score (1 - distance), so the closer the result is to 1.0, the better the match.

Other options are:

| Operator | Description |
| -------- | ----------- |
| + | element-wise addition |
| - | element-wise subtraction |
| * | element-wise multiplication |
| <-> | Euclidean distance |
| <#> | negative inner product |
| <=> | cosine distance |

pgvector operators; read more at https://github.com/pgvector/pgvector

Note: some ML models will only work with specific operators.
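To sketch how this changes the query, here is how the SELECT inside the query() function above could be rewritten to rank by Euclidean distance instead. Note that with a distance metric, smaller values are better, so the sort order becomes ascending:

// Same search as before, but ranked by Euclidean distance (<->) instead of cosine
const res = await client.query(
    'SELECT text, embedding <-> $1 AS euclidean_distance FROM Articles ORDER BY euclidean_distance ASC LIMIT 5',
    [JSON.stringify(embeddings)]
);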

It is beyond the scope of this blog post to discuss these concepts in greater detail. I have included links to other websites at the end of the blog if you want more detailed information.

Demo

The video below demonstrates the demo application’s various semantic search capabilities.

[Video: Semantic Search Demo]

The demo site, test data, and a Postman collection for indexing the test data are all available in the GitHub repository: https://github.com/andrewmarkham/Machine-Learning

Conclusion

I have always considered semantic search or anything AI-related to be limited to using a third-party service, but as I hope this blog post demonstrates, this is not the case.

This doesn’t mean I advocate building your own semantic search solution over using a recognised search provider; rather, you may want to augment the existing search or deliver other use cases such as classification or image/voice search.

This blog post covers the concepts at a very high level. I am just starting to learn about this subject area and how I can use it to build my own solutions. A wealth of information is available on the internet if you want to gain more knowledge and learn how to leverage these tools yourself.

What, no OpenAI?

I can’t write a blog post about AI and not mention OpenAI.

You can use OpenAI to create embeddings, but I chose not to, as I wanted to demonstrate alternatives. The same is true for vector databases. Many other options are available, and Pinecone seems to be one of the leaders.
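For completeness, creating an embedding with OpenAI’s Node library looks roughly like the sketch below (based on the ‘openai’ npm package; not part of the demo). Bear in mind the earlier point about dimensions: OpenAI’s embedding models produce far more than 384 dimensions, so the table definition and the indexing model would both have to change together.

import OpenAI from 'openai';

// Reads the API key from the OPENAI_API_KEY environment variable by default
const openai = new OpenAI();

const result = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: 'passage text to embed'
});

console.log(result.data[0].embedding.length); // e.g. 1536 dimensions for this model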

Useful links