Vector Db First Step
Explore the creation of a similarity search system using pgVector, focusing on vector theory and practical implementation with PostgreSQL.
January 12, 2024
Hugo Mufraggi

In this article, we build a similarity search with the help of pgVector. In Part 1, we created a scraper to build a real estate dataset. You can find Part 1 and Part 2 here.
A little theory is important before we get our hands dirty.
Vector Theory
A vector database is based on geometric theory. In the following section, I'll explain simply how it works.
Mathematical Vector Theory
Lexicon:
- Vector: A mathematical object that has both magnitude and direction. It is often represented as an ordered list of numbers or coordinates. Vectors are commonly used in various branches of mathematics, physics, computer science, and engineering to represent quantities such as displacement, velocity, acceleration, force, and more.
- Euclidean space: Named after the ancient Greek mathematician Euclid, it refers to a geometric space in which the fundamental principles of Euclidean geometry hold true. Euclidean space is a specific type of metric space, meaning it is equipped with a distance function that satisfies the properties of a metric, such as non-negativity, symmetry, and triangle inequality.
- Vector Space: A set of vectors along with vector addition and scalar multiplication operations. These operations must satisfy certain properties like closure, associativity, commutativity, the existence of an additive identity, and the existence of additive inverses. Euclidean space, which includes the familiar 2D and 3D spaces, is an example of a vector space.
Now that we have defined some mathematical notions, let’s apply the theory.
To facilitate my explanation, we’ll use a 2-dimensional space.

In this diagram, we have drawn 2 vectors:
- u is the vector from A to B and equals (6, 8).
- v is the vector from A to C and equals (4, 12).
Euclidean geometry provides a function for calculating the distance between two vectors. The distance between vectors u and v is determined by calculating the square root of the sum of squared differences between their components, as given by the Euclidean norm formula.

The distance between u and v is the square root of (6 − 4)² + (8 − 12)² = 20, which is approximately 4.47.
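This calculation can be checked in a few lines of Python; the standard library's math.dist implements exactly this Euclidean distance:

```python
import math

u = (6, 8)   # vector from A to B
v = (4, 12)  # vector from A to C

# Euclidean distance: sqrt((6-4)^2 + (8-12)^2) = sqrt(20)
distance = math.dist(u, v)
print(round(distance, 2))  # → 4.47
```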
Database theory and use case
A vector database represents objects as vectors in multiple dimensions. With these vectors, we can query and find the closest objects. For example, you can vectorize a picture by extracting its features, creating a vector based on these features, and then comparing vectors to determine whether two pictures are similar.
Use case
In our specific scenario, we will create a vector with 9 dimensions by extracting the following data from the CSV file:
- Latitude (geo_latitude)
- Longitude (geo_longitude)
- Ground surface area (floor_size_value)
- Total price (price)
- Price per square meter (price_m2)
- Total number of rooms (rooms)
- Number of bedrooms (bed_rooms)
- Number of bathrooms (bathroom)
- Number of shower rooms (shower_room)
Subsequently, we aim to query the database to identify the most closely related real estate announcements.
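As a sketch, building that 9-dimensional vector from one CSV row could look like this (the column names match the CSV headers used later in the article; the helper function name and the sample values are my own):

```python
def build_embedding(row: dict) -> list:
    """Build the 9-dimensional feature vector for one real estate row."""
    return [
        float(row['geo.latitude']),
        float(row['geo.longitude']),
        float(row['floorSize.value']),
        float(row['announce_detail.price']),
        float(row['announce_detail.price_m2']),
        float(row['announce_detail.rooms']),
        float(row['announce_detail.bed_rooms']),
        float(row['announce_detail.bathroom']),
        float(row['announce_detail.shower_room']),
    ]

# Hypothetical sample row for illustration
sample = {
    'geo.latitude': '48.8566', 'geo.longitude': '2.3522',
    'floorSize.value': '90.94', 'announce_detail.price': '295000',
    'announce_detail.price_m2': '3244', 'announce_detail.rooms': '4',
    'announce_detail.bed_rooms': '2', 'announce_detail.bathroom': '1',
    'announce_detail.shower_room': '1',
}
print(len(build_embedding(sample)))  # → 9
```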
Practice
For this article, I will use the Postgres extension pgvector, but many vector database projects have been developed. If you are curious, you can check Qdrant or Meilisearch (I'm not sure whether Meilisearch has released its vector database yet).
SQL Stuff
First, read the pgvector documentation. I will use the Docker image it recommends along with my own SQL seed. The SQL code is available here.
For my SQL seed, I use dbmate, and I have wrapped everything inside a Makefile.
DB_CONTAINER_NAME := my-postgres
DB_VOLUME_NAME := my-postgres-volume
DB_PORT := 5432
DB_NAME := mydb
DB_USER := user
DB_PWD := password

.PHONY: start-db-container
start-db-container:
	docker run --name $(DB_CONTAINER_NAME) \
		-v $(CURDIR)/data:/var/lib/postgresql/data \
		-e POSTGRES_USER=$(DB_USER) \
		-e POSTGRES_PASSWORD=$(DB_PWD) \
		-e POSTGRES_DB=$(DB_NAME) \
		-p $(DB_PORT):$(DB_PORT) \
		-d ankane/pgvector

.PHONY: stop-db-container
stop-db-container:
	docker stop $(DB_CONTAINER_NAME)
	docker rm $(DB_CONTAINER_NAME)

.PHONY: new
new:
	dbmate -u "postgres://$(DB_USER):$(DB_PWD)@localhost:$(DB_PORT)/$(DB_NAME)?sslmode=disable" new $(migration_name)

.PHONY: migrate
migrate:
	dbmate -u "postgres://$(DB_USER):$(DB_PWD)@localhost:$(DB_PORT)/$(DB_NAME)?sslmode=disable" up

.PHONY: rollback
rollback:
	dbmate -u "postgres://$(DB_USER):$(DB_PWD)@localhost:$(DB_PORT)/$(DB_NAME)?sslmode=disable" down

.PHONY: print_url
print_url:
	@echo "postgres://$(DB_USER):$(DB_PWD)@localhost:$(DB_PORT)/$(DB_NAME)"

.PHONY: help
help:
	@echo "Available targets:"
	@echo "  start-db-container : Start the PostgreSQL container."
	@echo "  stop-db-container  : Stop and remove the PostgreSQL container."
	@echo "  new                : Create a new migration (migration_name=...)."
	@echo "  migrate            : Run database migrations."
	@echo "  rollback           : Roll back the last database migration."
Architecture
The documentation of pgvector provides an example for creating tables like this:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
In our use case, we will create two tables:
- The first is a representation of the real estate.
- The second is very close to the example definition.
Migration
We create four migrations with make new migration_name=NAME_FILE.
Migration 1: enable the uuid-ossp extension.
Migration 2: enable the vector extension.
Migration 3: create the real_estate_properties table.
Migration 4: create the real_estate_vectors table.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE real_estate_properties (
id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
created_at timestamptz DEFAULT NOW(),
updated_at timestamptz DEFAULT NOW(),
name VARCHAR(255),
url VARCHAR(255),
floor_size_value NUMERIC(10, 2),
address_locality VARCHAR(255),
address_region VARCHAR(255),
address_postal_code VARCHAR(20),
geo_latitude NUMERIC(10, 6),
geo_longitude NUMERIC(10, 6),
rooms INTEGER,
bed_rooms INTEGER,
bathroom INTEGER,
shower_room INTEGER,
price NUMERIC(10, 2),
price_m2 NUMERIC(10, 2),
dpe VARCHAR(1),
ges VARCHAR(1)
);
CREATE TABLE real_estate_vectors (
id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
property_id uuid,
embedding vector(9),
FOREIGN KEY (property_id) REFERENCES real_estate_properties (id)
);
Populate Data
In the first step, we populate the real_estate_properties table from the CSV with Python. The code is straightforward: we read the CSV row by row and insert the extracted data.
import csv
import psycopg2

conn = psycopg2.connect(
    dbname="mydb",
    user="user",
    password="password",
    host="localhost",
    port=5432
)
cur = conn.cursor()

insert_query = """
INSERT INTO real_estate_properties (
    name, url, floor_size_value,
    address_locality, address_region, address_postal_code,
    geo_latitude, geo_longitude, rooms, bed_rooms, bathroom,
    shower_room, price, price_m2, dpe, ges
) VALUES (
    %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
)
"""

with open('donnees_immobilieres.csv', 'r', encoding='utf-8') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row['name']
        url = row['url']
        floor_size_value = float(row['floorSize.value'])
        address_locality = row['address.addressLocality']
        address_region = row['address.addressRegion']
        address_postal_code = row['address.postalCode']
        geo_latitude = float(row['geo.latitude'])
        geo_longitude = float(row['geo.longitude'])
        rooms = int(row['announce_detail.rooms'])
        bed_rooms = int(row['announce_detail.bed_rooms'])
        bathroom = int(row['announce_detail.bathroom'])
        shower_room = int(row['announce_detail.shower_room'])
        price = float(row['announce_detail.price'])
        price_m2 = float(row['announce_detail.price_m2'])
        dpe = row['announce_detail.dpe']
        ges = row['announce_detail.ges']
        cur.execute(
            insert_query,
            (name, url, floor_size_value,
             address_locality, address_region, address_postal_code,
             geo_latitude, geo_longitude, rooms, bed_rooms, bathroom,
             shower_room, price, price_m2, dpe, ges)
        )

conn.commit()
cur.close()
conn.close()
And now, a script to create and feed real_estate_vectors.
import psycopg2

conn = psycopg2.connect(
    dbname="mydb",
    user="user",
    password="password",
    host="localhost",
    port=5432
)
cur = conn.cursor()

cur.execute("SELECT * FROM real_estate_properties;")
data = cur.fetchall()

# The column is named property_id in the real_estate_vectors migration
insert_query = """INSERT INTO real_estate_vectors (property_id, embedding) VALUES (%s, %s)"""

for row in data:
    property_id = row[0]
    m2 = row[5]
    lat = row[9]    # geo_latitude
    long = row[10]  # geo_longitude
    rooms = row[11]
    bed_rooms = row[12]
    bath_room = row[13]
    shower_room = row[14]
    price = row[15]
    price_m2 = row[16]
    # Cast NUMERIC columns (returned as Decimal) to float
    vec = [float(v) for v in
           (m2, lat, long, price, price_m2, rooms, bed_rooms, bath_room, shower_room)]
    # psycopg2 has no adapter for the vector type, so pass the pgvector text literal
    cur.execute(insert_query, (property_id, str(vec)))

conn.commit()
cur.close()
conn.close()
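One detail worth knowing: psycopg2 does not ship an adapter for pgvector's vector type, but the text literal pgvector accepts looks like a printed Python list, so a float list can be serialized by hand. A minimal sketch (the function name is my own; the official pgvector-python package provides proper adapters):

```python
def to_pgvector_literal(values: list) -> str:
    """Serialize a float list into pgvector's text input format, e.g. '[1.0,2.0,3.0]'."""
    return '[' + ','.join(str(v) for v in values) + ']'

print(to_pgvector_literal([48.85, 2.35, 90.94]))  # → [48.85,2.35,90.94]
```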
If everything went well, the database is now populated. Let's test it.
Testing
For that, I will use DataGrip and run a few SQL queries. In the pgvector documentation, we can find this query:
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
The goal of this query is, given one id, to return the closest items. For the test, I will take inspiration from it.
First, I will select a random uuid from my database.
SELECT r.url, r.price, r.floor_size_value, r.price_m2
from real_estate_properties r
where id = '7beefa7f-5c5e-4dd6-8836-2c6d87296851';
-- result
-- https://immobilier.lefigaro.fr/annonces/annonce-64799502.html,295000.00,90.94,3244.00
Second, we query real_estate_vectors:
SELECT r.property_id
FROM real_estate_vectors r
WHERE r.property_id != '7beefa7f-5c5e-4dd6-8836-2c6d87296851'
ORDER BY embedding <->
(SELECT embedding FROM real_estate_vectors re WHERE re.property_id = '7beefa7f-5c5e-4dd6-8836-2c6d87296851')
LIMIT 5;
-- result
--a09df0bc-c043-4bdb-991d-9b506e42812e
--dc59121f-2915-430f-85ef-a66dd6dc24f1
--911b9324-1e73-4bf4-81fe-760dc3036ce1
--7a3a2b07-03a7-46ab-a7b7-a0f1f7e09e54
--83e0b0b8-1349-48df-b958-052f76b9e0c4
Third, we list the matching real_estate_properties for verification.
select r.url, r.price, r.floor_size_value, r.price_m2
from real_estate_properties r
where id in (
'a09df0bc-c043-4bdb-991d-9b506e42812e',
'dc59121f-2915-430f-85ef-a66dd6dc24f1',
'911b9324-1e73-4bf4-81fe-760dc3036ce1',
'7a3a2b07-03a7-46ab-a7b7-a0f1f7e09e54',
'83e0b0b8-1349-48df-b958-052f76b9e0c4'
);
-- result
-- https://immobilier.lefigaro.fr/annonces/annonce-64489914.html,295000.00,70.00,4214.00
-- https://immobilier.lefigaro.fr/annonces/annonce-64489914.html,295000.00,70.00,4214.00
-- https://immobilier.lefigaro.fr/annonces/annonce-64190252.html,295000.00,81.00,3642.00
-- https://immobilier.lefigaro.fr/annonces/annonce-58419200.html,295000.00,80.12,3682.00
-- https://immobilier.lefigaro.fr/annonces/annonce-63028684.html,295000.00,72.00,4097.00
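What the <-> operator does can be mimicked in plain Python: sort the candidate vectors by Euclidean distance to the query vector and keep the k closest. This brute-force sketch (function name and toy data are my own; pgvector does the same ordering far more efficiently, with optional indexes) uses 2 dimensions for readability, while our real embeddings have 9:

```python
import math

def nearest(query, candidates, k=5):
    """Return the ids of the k candidates closest to the query vector,
    ordered by Euclidean distance — the same ordering as pgvector's <-> operator."""
    return sorted(candidates, key=lambda cid: math.dist(query, candidates[cid]))[:k]

# Toy example: 'a' and 'b' are near the origin, 'c' is far away
vectors = {'a': [1.0, 1.0], 'b': [2.0, 2.0], 'c': [10.0, 10.0]}
print(nearest([0.0, 0.0], vectors, k=2))  # → ['a', 'b']
```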
Conclusion
Congratulations! We have built our first similarity search system. I'm very happy to finish this series of articles: I enjoyed testing the system, I now have my own dataset to play with, and the SQL query I found is very powerful. If you want, you can find the first two parts here and here. I hope this is helpful, and see you soon!