It depends on scale. If you're storing a small number of embeddings (hundreds of thousands, millions) and don't have complicated filters, then absolutely the convenience factor of pgvector will win out. Beyond that, you'll need something more powerful. I do think the dedicated vector stores serve a useful place in the market in that they're extremely "managed" - it is really, really easy to just call an API and never worry about pre- or post-filtering or sharding your index across a large cluster. But they also have weaknesses in that they're usually optimized around the small(er) scale where the bulk of their customers lie, and they don't really replace an actual search system like Elasticsearch.
I do not think data stores are a bottleneck for serving embedding search. I think the raft of new-fangled vector db services (or pgvector or whatever) can be a bottleneck because they are mostly optimized around the long tail of pretty small data. Real internet-scale search systems like ES or Vespa won’t struggle with serving embedding search assuming you have the necessary scale and time/money to invest in them.
* Filterable ANN certainly decomposes into pre- and post-filtering, and there is definitely a lot of interesting innovation occurring around filterable ANN. But large-scale search systems currently do a pretty good job with pre-filtering, falling back to brute force search in the case of restrictive filters.
* You'd have to be a bit more exact re: dynamic updates/versioning for me to understand the challenges you're facing.
* Building graph indices can be slow, but in my experience (billions of embeddings) it is possible to build HNSW indices in tens of minutes.
* How is this any different to combining traditional keyword search with, say, recency boosting?
You might be missing my argument here - I stated that there are workable solutions to this, like the ones you have pointed out.
But ANN search is still a sledgehammer, and there's still room for innovation in building out hybrid solutions that bridge the gap between it and traditional data stores.
Fair enough - agreed there are lots of interesting innovations here - but my point is that semantic search and its associated issues don't really differ that much from other types of search problems at scale, and I therefore don't think that the current crop of vector database products add a lot of value from a technical perspective (perhaps they do from an ease-of-use perspective; or they work great at small scale, etc. etc.)
Oh, then you must have the secret sauce that allows scaling ES vector search beyond 10,000 results without requiring infinite RAM. I know their forums would welcome it, because that question comes up a lot
Or I guess that's why you included the qualifier about money to invest
Would you mind putting aside the snark? I have a couple questions. How large is the corpus? I am also curious about the use-case for top-k ANN, k > 10000?
Not the person you asked, but at work (we are a CRM platform) we allow our clients to arbitrarily query their userbase to find matching users for marketing campaigns (email, SMS, WhatsApp). These campaigns can sometimes target a few hundred thousand people. We are on a really ancient version of ES, but it sucks at this job in terms of throughput. Some experimenting with BigQuery indicates it is so much better at mass exporting.
Fair; my question was mostly in the context of ANN, since that was the discussion point - I have to assume ES (as a search engine) would not necessarily be the right tool for data warehousing types of workloads.
The Romans were actually quite smart after Cannae; they had lost a bunch of pitched battles, so they decided to shadow Hannibal's army to make his foraging logistics much more complicated (and forcing him to stay close to Southern Italy where he could easily resupply). The logistics of attacking Rome were therefore challenging at best, and the Romans used this as a delaying tactic to score wins on other fronts (since they enjoyed an overall manpower advantage).
As the article points out, the difference in cost between these two routes is pretty small in the grand scheme of things; more than two-thirds of the costs associated with the project are at either end getting in and out of Los Angeles/the SF Bay Area. On the other hand, as the author points out, building the route through the Central Valley population centers has a number of advantages (political support, having a useful rail line before the entire project is complete, etc. etc.)
If you use Google Maps to find the distance between Los Banos and Tehachapi, it's 210 miles via I-5 and 223 miles via HWY 99 - a 13-mile difference. That's 5-7 minutes at high-speed rail speeds.
And you are right the extra cost is minimal. It's probably $10-15 billion.
My thought about the grade separation costs is that those projects need to be done anyway. My beef is: why is the high-speed rail project paying for road infrastructure? That should come out of gas taxes or something.
Looks like there's a mistake here - it was Juvenal, not Horace, who coined the phrase "bread and circuses." While we're at it, Juvenal's point was not that the provision of bread and circuses was detrimental to the Roman people, but rather that Romans had been reduced (a process in which they were partially culpable) from being able to exercise their votes to only hoping for "bread and circuses."
I was the first employee at a company which uses RAG (Halcyon), and I’ve been working through issues with various vector store providers for almost two years now. We’ve gone from tens of thousands to billions of embeddings in that timeframe - so I feel qualified to at least offer my opinion on the problem.
I agree that starting with pgvector is wise. It’s the thing you already have (postgres), and it works pretty well out of the box. But there are definitely gotchas that don’t usually get mentioned. Although the pgvector filtering story is better than it was a year ago, high-cardinality filters still feel like a bit of an afterthought (low-cardinality filters can be solved with partial indices even at scale). You should also be aware that the workload for ANN is pretty different from normal web-app stuff, so you probably want your embeddings in a separate, differently-optimized database. And if you do lots of updates or deletes, you’ll need to make sure autovacuum is properly tuned or else index performance will suffer. Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
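To make the partial-index point concrete, here's a minimal sketch of what that looks like (hypothetical table and column names, and tiny 3-dimensional vectors just to keep it short):
```
-- minimal sketch; hypothetical table, tiny 3-dim vectors for brevity
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, tenant text, embedding vector(3));

-- partial HNSW index for a single low-cardinality filter value; the graph
-- only contains that tenant's rows, so there's nothing to post-filter away
CREATE INDEX items_acme_hnsw
    ON items USING hnsw (embedding vector_cosine_ops)
    WHERE tenant = 'acme';

-- queries repeating the same predicate can use the partial index
SELECT id
FROM items
WHERE tenant = 'acme'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'  -- cosine distance to the query embedding
LIMIT 10;
```
The catch is that this only really works when the filter column has a handful of known values; it doesn't help with the high-cardinality case.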
Dedicated vector stores often solve some of these problems but create others. Index builds are often much faster, and you’re working at a higher level (for better or worse) so there’s less time spent on tuning indices or database configurations. But (as mentioned in other comments) keeping your data in sync is a huge issue. Even if updates and deletes aren’t a big part of your workload, figuring out what metadata to index alongside your vectors can be challenging. Adding new pieces of metadata may involve rebuilding the entire index, so you need a robust way to move terabytes of data reasonably quickly. The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
> Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale
For anyone coming across this without much experience here, for building these indexes in pgvector it makes a massive difference to increase your maintenance memory above the default. Either as a separate db like whakim mentioned, or for specific maintenance periods depending on your use case.
```
-- check the current setting (the PostgreSQL default is a modest 64MB)
SHOW maintenance_work_mem;

-- raise it for the duration of the index build; '8GB' is only an example
-- value, size it to the RAM you can actually spare on the instance
SET maintenance_work_mem = '8GB';
```
In one of our semantic search use cases, we control the ingestion of the searchable content (laws, basically) so we can control when and how we choose to index it. And then I've set up classic relational db indexing (in addition to vector indexing) for our quite predictable query patterns.
For us that means our actual semantic db query takes about 10ms.
Starting from 10s of millions of entries, filtered to ~50k (jurisdictionally, in our case) relevant ones and then performing vector similarity search with topK/limit.
Built into our ORM and zero round-trip latency to Pinecone or syncing issues.
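Concretely, the query is roughly this shape (simplified and with made-up names, since ours actually goes through the ORM):
```
-- simplified shape of the query (made-up names; $1 = jurisdiction, $2 = query embedding)
SELECT id, title
FROM laws
WHERE jurisdiction = $1        -- classic B-tree-indexed pre-filter (~50k rows survive)
ORDER BY embedding <=> $2      -- cosine distance to the query embedding
LIMIT 20;                      -- topK
```
With a selective filter like this, Postgres can use the B-tree index on jurisdiction first and then order only the surviving rows by distance.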
EDIT: I imagine whakim has more experience than me and YMMV, just sharing lessons learned. Even with higher maintenance mem the index building is super slow for HNSW.
Thanks for sharing! Yes, getting good performance out of pgvector with even a trivial amount of data requires a bit of understanding of how Postgres works.
Thank you for the comment! Compared to you I have only scratched the surface of this quite complex domain - would love to get more of your input!
> building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
Yes, I experienced this too. I went from 1536 to 256 dimensions and did not try as many values as I’d have liked because spinning up a new database and recreating the embeddings simply took too long. I’m glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I’ve struck the tradeoff at the right place.
Someone on Twitter reached out and pointed out that one could quantize the embeddings to bit vectors and search with Hamming distance - supposedly the performance hit is actually very negligible, especially if you add a quick rescore step: https://huggingface.co/blog/embedding-quantization
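For what it's worth, recent pgvector (0.7+) can express that idea directly via binary_quantize and the Hamming distance operator. I haven't tried this myself, so treat it as a sketch - it assumes a toy items table with 3-dimensional embeddings:
```
-- assumes a table items(id bigint, embedding vector(3)) and pgvector 0.7+
-- HNSW index over the binary-quantized embeddings (bit(n) must match the dimension)
CREATE INDEX items_bq_hnsw
    ON items USING hnsw ((binary_quantize(embedding)::bit(3)) bit_hamming_ops);

-- coarse search by Hamming distance over the bit vectors, then rescore the
-- shortlist with the full-precision cosine distance
SELECT id
FROM (
    SELECT id, embedding
    FROM items
    ORDER BY binary_quantize(embedding)::bit(3) <~> binary_quantize('[0.1, 0.2, 0.3]')
    LIMIT 100
) shortlist
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```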
> But (as mentioned in other comments) keeping your data in sync is a huge issue.
Curious if you have any good solutions in this respect.
> The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
I realize they market heavily on this, but for open source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?
> Yes, I experienced this too. I went from 1536 to 256 dimensions and did not try as many values as I’d have liked because spinning up a new database and recreating the embeddings simply took too long. I’m glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I’ve struck the tradeoff at the right place.
Yeah, this is totally a concern. You can mitigate this to some extent by testing on a representative sample of your dataset.
> Curious if you have any good solutions in this respect.
Most vector store providers have some facility to import data from object storage (e.g. s3) in bulk, so you can periodically export all your data from your primary data store, then have a process grab the exported data, transform it into the format your vector store wants, put it in object storage, and then kick off a bulk import.
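If your primary store is Postgres, the export half of that can be as simple as a COPY. Everything below is hypothetical (table, columns, watermark), and the upload plus the vendor-specific import call are separate steps:
```
-- hypothetical export step (the upload to object storage and the vendor's
-- bulk-import call happen in a separate job)
COPY (
    SELECT id, embedding, category, updated_at
    FROM documents
    WHERE updated_at > '2024-01-01'  -- watermark from the last export, hard-coded here for illustration
) TO '/tmp/documents_export.csv' WITH (FORMAT csv, HEADER true);
-- (use psql's \copy instead if the database user can't write server-side files)
```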
> I realize they market heavily on this, but for open source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?
This is definitely a selling point for any open-source solution, but there are lots of dedicated vector stores which are not open source, so it is hard to really know how their filtering algorithms perform at scale.
We've used Pinecone for a while (and it definitely "just worked" up to a certain point), but we're now actively exploring Google's Vertex Search as a potential alternative if some of the pain points can't be mitigated.