AI Development · 7 min read

Embedding Server


Author

Web Developer


Modern AI applications rely heavily on embeddings.

They power many common capabilities, such as semantic search, document retrieval, recommendation systems, and Retrieval-Augmented Generation (RAG) pipelines.

Today developers usually generate embeddings in one of three ways.

Some use managed APIs. Some run Python inference servers. Others rely on vector databases that include embedding modules.

All of these approaches work well depending on the situation. But while building a small AI project recently, none of them felt ideal for my setup.

So I built a small self-hosted embedding server that any application can call through a simple HTTP API.

This article explains why I built it and how it works.


The Story Behind This Project

This project started while I was building a portfolio project exploring agentic AI workflows.

The stack looked like this:

  • Next.js for the application
  • LangGraph for orchestration
  • Weaviate for vector storage
  • LLMs for reasoning

To enable semantic search I needed embeddings.

Initially I used the text2vec module in Weaviate. It works well and is convenient because the database can generate embeddings automatically.

During development everything worked smoothly.

The issue appeared after deployment.

The server I deployed to had 16 GB of RAM and was already hosting several small hobby projects. When the embedding model was loaded through Weaviate using Transformers, memory usage increased significantly.

At that point the database was responsible for two things:

  • storing vectors
  • running the embedding model

That was more infrastructure overhead than I actually needed.

At the same time, most standalone embedding services are built in Python. My project was mostly Node and TypeScript, so introducing another runtime just for embeddings felt unnecessary.

That led to a simple idea.

Embedding generation could be a small independent service that any application can call.

That idea eventually became this project.


Understanding Embeddings

If you are new to embeddings, the concept can sound complicated, but the idea is simple.

Embeddings convert text into numerical vectors that represent meaning.

For example:

Text                 | Meaning
How to cook pasta    | cooking
Best pasta recipes   | cooking
Install Node.js      | programming

Even though the words differ, the first two sentences get vectors that sit close together, which lets a machine treat them as related.

These vectors allow AI systems to perform semantic similarity search.
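
To make similarity search concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are made up for illustration and do not come from a real model; real embeddings have hundreds of dimensions.

```typescript
// Cosine similarity: near 1 means the vectors point the same way,
// near 0 means they are unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented toy vectors standing in for the three example sentences.
const cookPasta = [0.9, 0.1, 0.0];
const pastaRecipes = [0.8, 0.2, 0.1];
const installNode = [0.0, 0.1, 0.9];

console.log(cosineSimilarity(cookPasta, pastaRecipes)); // close to 1
console.log(cosineSimilarity(cookPasta, installNode)); // close to 0
```

A vector database runs a comparison like this, usually with an approximate index, across every stored vector to find the nearest matches.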

Embeddings are used in many modern systems including:

  • semantic search
  • RAG pipelines
  • recommendation engines
  • document retrieval systems



The Architecture Idea

Many AI systems generate embeddings directly inside the vector database.

Typical setup:

Application
     |
     v
Vector Database
(with embedding model)

In this setup the database both stores vectors and runs the machine learning model.

That works well in many cases, but it also means the database must load and manage ML models.

This project separates those responsibilities.

New architecture:

Application
     |
     v
Embedding Server
     |
     v
Vector Database

In this approach:

The embedding server focuses only on turning text into vectors.

The vector database focuses only on storage and similarity search.

Separating these responsibilities keeps the system simpler and more flexible.
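
The split can be sketched from the application's point of view. This is an illustration under assumptions, not code from the project: `Embedder`, `InMemoryVectorStore`, `indexDocument`, and `searchDocuments` are hypothetical names, and the in-memory store only stands in for a real vector database.

```typescript
// An embedder turns text into a vector. In a real setup this would call the
// embedding server over HTTP; it is passed in so either side can be swapped.
type Embedder = (text: string) => Promise<number[]>;

interface StoredItem {
  id: string;
  text: string;
  vector: number[];
}

// Hypothetical stand-in for a vector database: storage and similarity search only.
class InMemoryVectorStore {
  private items: StoredItem[] = [];

  add(item: StoredItem): void {
    this.items.push(item);
  }

  // Ranks stored items by dot product with the query vector, highest first.
  search(query: number[], limit: number): StoredItem[] {
    const score = (v: number[]) =>
      v.reduce((sum, x, i) => sum + x * (query[i] ?? 0), 0);
    return this.items
      .slice()
      .sort((a, b) => score(b.vector) - score(a.vector))
      .slice(0, limit);
  }
}

// The application coordinates the two services but does neither job itself.
async function indexDocument(
  embed: Embedder,
  store: InMemoryVectorStore,
  id: string,
  text: string,
): Promise<void> {
  const vector = await embed(text); // step 1: embedding server
  store.add({ id, text, vector }); // step 2: vector database
}

async function searchDocuments(
  embed: Embedder,
  store: InMemoryVectorStore,
  query: string,
  limit: number,
): Promise<StoredItem[]> {
  const queryVector = await embed(query);
  return store.search(queryVector, limit);
}
```

Because the store never sees raw text being embedded and the embedder never sees storage, either one can be replaced without touching the other.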



Existing Options

Before building this project I explored several existing solutions.

Each of them has advantages depending on the use case.

Managed APIs

Examples include:

  • OpenAI embeddings
  • Cohere embeddings

Advantages

  • extremely easy to start
  • no infrastructure to manage

Limitations

  • recurring API cost
  • data must be sent to external services

Vector Databases With Built-In Models

Examples include:

  • Weaviate text2vec
  • Qdrant inference
  • Milvus embedding integrations

Advantages

  • convenient integration
  • fewer moving parts

Limitations

  • database also runs model inference
  • memory usage increases when models are loaded

Python-Based Embedding Services

Examples include:

  • HuggingFace Transformers
  • sentence-transformers
  • FastAPI inference servers

Advantages

  • flexible ecosystem
  • large collection of models

Limitations

  • introduces another runtime if the main stack is Node

This project provides another option: a small standalone embedding server that integrates easily with TypeScript applications.


Introducing the Embedding Server

The embedding server is a lightweight service that converts text into vectors using open source embedding models.

It exposes a simple HTTP API compatible with the OpenAI embeddings endpoint. Because of this, many existing AI frameworks can interact with it without modification.

The main goals of the project are:

  • simple local deployment
  • minimal infrastructure overhead
  • easy integration with Node and TypeScript stacks
  • support for multiple embedding models

Key Features

Self Hosted

The server runs entirely inside your infrastructure.

No external APIs are required.

This helps keep data private and avoids API costs.


Open Source

The project is fully open source and available on GitHub.

Developers can inspect the implementation, extend the functionality, or integrate it into their own workflows.


Container Based Deployment

The embedding server is available as a container image and can be started with a single command.

docker run -p 8000:8000 ghcr.io/abdullah85398/embedding-server:latest

Container image:

https://github.com/abdullah85398/embedding-server/pkgs/container/embedding-server

Because the service is containerized it can easily run in environments such as:

  • Docker Compose
  • Kubernetes
  • local development environments
  • self hosted infrastructure

Dynamic Model Loading

Models can be loaded when required and unloaded when idle.

Example configuration:

models:
  mini:
    name: all-MiniLM-L6-v2
    preload: true

  code:
    name: jinaai/jina-embeddings-v2-base-code
    preload: false
    idle_timeout: 300

This helps control memory usage while still supporting multiple models.
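
The load-on-demand and unload-when-idle behavior can be sketched as follows. This is not the server's actual implementation; `ModelRegistry`, `use`, and `sweep` are hypothetical names, and "loading" is reduced to a map entry so the bookkeeping is easy to follow. I assume here that `idle_timeout` is expressed in seconds.

```typescript
interface ModelConfig {
  name: string;
  preload: boolean;
  idleTimeoutSeconds?: number; // mirrors idle_timeout (assumed to be seconds)
}

interface LoadedModel {
  name: string;
  lastUsedMs: number;
}

// Hypothetical registry: loads models on first use and drops idle ones.
class ModelRegistry {
  private configs: Record<string, ModelConfig>;
  private loaded = new Map<string, LoadedModel>();

  constructor(configs: Record<string, ModelConfig>, nowMs: number) {
    this.configs = configs;
    // Models marked preload are loaded at startup.
    Object.keys(configs).forEach((key) => {
      const config = configs[key];
      if (config.preload) {
        this.loaded.set(key, { name: config.name, lastUsedMs: nowMs });
      }
    });
  }

  // Returns the model for a key, loading it if needed, and records the access time.
  use(key: string, nowMs: number): LoadedModel {
    const config = this.configs[key];
    if (!config) throw new Error(`Unknown model: ${key}`);
    const model = this.loaded.get(key) ?? { name: config.name, lastUsedMs: nowMs };
    model.lastUsedMs = nowMs;
    this.loaded.set(key, model);
    return model;
  }

  // Unloads every model whose idle timeout has elapsed; returns the unloaded keys.
  sweep(nowMs: number): string[] {
    const unloaded: string[] = [];
    this.loaded.forEach((model, key) => {
      const timeout = this.configs[key]?.idleTimeoutSeconds;
      if (timeout !== undefined && nowMs - model.lastUsedMs >= timeout * 1000) {
        this.loaded.delete(key);
        unloaded.push(key);
      }
    });
    return unloaded;
  }

  isLoaded(key: string): boolean {
    return this.loaded.has(key);
  }
}

// Same two models as the example configuration.
const registry = new ModelRegistry(
  {
    mini: { name: "all-MiniLM-L6-v2", preload: true },
    code: { name: "jinaai/jina-embeddings-v2-base-code", preload: false, idleTimeoutSeconds: 300 },
  },
  0,
);
```

In a running service, something like `sweep(Date.now())` would be called on a timer. Passing the clock in as a parameter keeps the logic deterministic and easy to test.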


Optional Caching

For a given model, embedding generation is effectively deterministic.

The same text sent to the same model produces the same vector.

Because of this, caching can significantly reduce repeated computation.

The server supports:

  • in-memory LRU cache
  • optional Redis cache

This is particularly useful for RAG pipelines where documents may be embedded multiple times.
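
As an illustration of the in-memory option, here is a minimal LRU cache sketch. It is not the server's own cache; `LruCache`, `cacheKey`, and `embedWithCache` are hypothetical names. It relies on the fact that a JavaScript `Map` iterates keys in insertion order.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private entries = new Map<string, V>();
  private capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// The key includes the model name because different models give different vectors.
function cacheKey(model: string, text: string): string {
  return JSON.stringify([model, text]);
}

// Returns a cached vector when available, otherwise computes and stores it.
async function embedWithCache(
  cache: LruCache<number[]>,
  model: string,
  text: string,
  embed: (model: string, text: string) => Promise<number[]>,
): Promise<number[]> {
  const key = cacheKey(model, text);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const vector = await embed(model, text);
  cache.set(key, vector);
  return vector;
}
```

For long documents, hashing the text before using it as a key would keep the keys small. A Redis-backed cache could follow the same get-then-compute pattern with a shared store.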


OpenAI Compatible API

The server exposes an API compatible with the OpenAI embeddings endpoint.

This allows easy integration with common AI frameworks such as:

  • LangChain
  • Vercel AI SDK
  • OpenAI compatible clients
  • custom TypeScript applications

How to Use the Embedding Server

Running the server locally takes only one command.

Start the container:

docker run -p 8000:8000 ghcr.io/abdullah85398/embedding-server:latest

Once running, the server exposes an embeddings API.

http://localhost:8000/v1/embeddings

Applications can now send text to generate vectors.


Example Request

Here is a simple TypeScript example.

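const res = await fetch("http://localhost:8000/v1/embeddings", {
  method: "POST",
  headers: {
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "mini", // model key defined in the server configuration
    input: "example text"
  })
})

// Fail loudly instead of trying to parse an error response
if (!res.ok) {
  throw new Error(`Embedding request failed with status ${res.status}`)
}

// OpenAI-style response: data holds one entry per input
const { data } = await res.json()
const vector = data[0].embedding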

The response contains the embedding vector that can be stored in any vector database such as Weaviate, Qdrant, or Milvus.
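
For several texts at once, a small typed helper could look like the sketch below. It assumes the server accepts an array for `input` and returns the OpenAI-style `data` array with `index` and `embedding` fields, as the OpenAI embeddings endpoint does; I have not confirmed batch input against this server. `embedBatch` and `FetchLike` are hypothetical names.

```typescript
// Shape of an OpenAI-style embeddings response (only the fields used here).
interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

// Minimal subset of fetch, declared so a fake can be passed in for testing.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function embedBatch(
  fetchImpl: FetchLike,
  baseUrl: string,
  model: string,
  texts: string[],
): Promise<number[][]> {
  const res = await fetchImpl(`${baseUrl}/v1/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Assumption: an array is accepted for "input", as in the OpenAI API.
    body: JSON.stringify({ model, input: texts }),
  });
  if (!res.ok) {
    throw new Error(`Embedding request failed with status ${res.status}`);
  }
  const body = (await res.json()) as EmbeddingsResponse;
  // Sort by index so the vectors line up with the order of the input texts.
  return body.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```

With the container running, a call might look like `await embedBatch(fetch, "http://localhost:8000", "mini", ["first text", "second text"])`.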


When This Architecture Makes Sense

This approach is useful when you want to:

  • run open source embedding models locally
  • keep embedding inference separate from vector storage
  • build AI systems using Node or TypeScript
  • experiment with multiple embedding models
  • keep infrastructure simple

It is particularly helpful for self-hosted AI systems, prototypes, and personal projects.


Repository

GitHub repository: https://github.com/abdullah85398/embedding-server

Container image: https://github.com/abdullah85398/embedding-server/pkgs/container/embedding-server

This project started as a small weekend experiment but may continue evolving as new features and improvements are added.

If you find it useful, feel free to explore the repository, open issues, or star the project.
