It depends on scale. If you're storing a small number of embeddings (hundreds of thousands, millions) and don't have complicated filters, then absolutely the convenience factor of pgvector will win out. Beyond that, you'll need something more powerful. I do think the dedicated vector stores serve a useful place in the market in that they're extremely "managed" - it is really, really easy to just call an API and never worry about pre- or post-filtering or sharding your index across a large cluster. But they also have weaknesses in that they're usually optimized around the small(er) scale where the bulk of their customers lie, and they don't really replace an actual search system like Elasticsearch.
I do not think data stores are a bottleneck for serving embedding search. I think the raft of new-fangled vector db services (or pgvector or whatever) can be a bottleneck because they are mostly optimized around the long tail of pretty small data. Real internet-scale search systems like ES or Vespa won’t struggle with serving embedding search assuming you have the necessary scale and time/money to invest in them.
* Filterable ANN certainly decomposes into pre- and post-filtering, and there is definitely a lot of interesting innovation occurring around filterable ANN. But large-scale search systems currently do a pretty good job with pre-filtering, falling back to brute force search in the case of restrictive filters.
* You'd have to be a bit more exact re: dynamic updates/versioning for me to understand the challenges you're facing.
* Building graph indices can be slow, but in my experience (billions of embeddings) it is possible to build HNSW indices in tens of minutes.
* How is this any different to combining traditional keyword search with, say, recency boosting?
You might be missing my argument here - I stated that there are workable solutions to this, like the ones you have pointed out.
But ANN search is still a sledgehammer, and there's still room for innovation in building out hybrid solutions that bridge the gap between it and traditional data stores.
Fair enough - agreed there are lots of interesting innovations here - but my point is that semantic search and its associated issues don't really differ that much from other types of search problems at scale, and I therefore don't think that the current crop of vector database products add a lot of value from a technical perspective (perhaps they do from an ease-of-use perspective; or they work great at small scale, etc. etc.)
Oh, then you must have the secret sauce that allows scaling ES vector search beyond 10,000 results without requiring infinite RAM. I know their forums would welcome it, because that question comes up a lot
Or I guess that's why you included the qualifier about money to invest
Would you mind putting aside the snark? I have a couple questions. How large is the corpus? I am also curious about the use-case for top-k ANN, k > 10000?
Not the person you asked, but at work (we are a CRM platform) we allow our clients to arbitrarily query their userbase to find matching users for marketing campaigns (email, SMS, WhatsApp). These campaigns can sometimes target a few hundred thousand people. We are on a really ancient version of ES, but it sucks at this job in terms of throughput. Some experimenting with BigQuery indicates it is so much better at mass exporting.
Fair; my question was mostly in the context of ANN, since that was the discussion point - I have to assume ES (as a search engine) would not necessarily be the right tool for data warehousing types of workloads.
The Romans were actually quite smart after Cannae; they had lost a bunch of pitched battles, so they decided to shadow Hannibal's army to make his foraging logistics much more complicated (and forcing him to stay close to Southern Italy where he could easily resupply). The logistics of attacking Rome were therefore challenging at best, and the Romans used this as a delaying tactic to score wins on other fronts (since they enjoyed an overall manpower advantage).
As the article points out, the difference in cost between these two routes is pretty small in the grand scheme of things; more than two-thirds of the costs associated with the project are at either end getting in and out of Los Angeles/the SF Bay Area. On the other hand, as the author points out, building the route through the Central Valley population centers has a number of advantages (political support, having a useful rail line before the entire project is complete, etc. etc.)
If you use Google Maps to find the distance between Los Banos and Tehachapi, it's 210 miles via I-5 and 223 miles via HWY 99 - a 13-mile difference. That's 5-7 minutes at high-speed rail speeds.
And you are right the extra cost is minimal. It's probably $10-15 billion.
My thought about the grade separation costs is that those projects need to be done anyway. My beef is: why is the high-speed rail project paying for road infrastructure? That should come out of gas taxes or something.
Looks like there's a mistake here - it was Juvenal, not Horace, who coined the phrase "bread and circuses." While we're at it, Juvenal's point was not that the provision of bread and circuses was detrimental to the Roman people, but rather that Romans had been reduced (a process in which they were partially culpable) from being able to exercise their votes to only hoping for "bread and circuses."
I was the first employee at a company which uses RAG (Halcyon), and I’ve been working through issues with various vector store providers for almost two years now. We’ve gone from tens of thousands to billions of embeddings in that timeframe - so I feel qualified to at least offer my opinion on the problem.
I agree that starting with pgvector is wise. It’s the thing you already have (postgres), and it works pretty well out of the box. But there are definitely gotchas that don’t usually get mentioned. Although the pgvector filtering story is better than it was a year ago, high-cardinality filters still feel like a bit of an afterthought (low-cardinality filters can be solved with partial indices even at scale). You should also be aware that the workload for ANN is pretty different from normal web-app stuff, so you probably want your embeddings in a separate, differently-optimized database. And if you do lots of updates or deletes, you’ll need to make sure autovacuum is properly tuned or else index performance will suffer. Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
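To make the partial-index point concrete, here's a minimal sketch of what that looks like (hypothetical table and column names, and tiny 3-dimensional vectors just to keep it short):
```
-- minimal sketch; hypothetical table, tiny 3-dim vectors for brevity
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, tenant text, embedding vector(3));

-- partial HNSW index for a single low-cardinality filter value; the graph
-- only contains that tenant's rows, so there's nothing to post-filter away
CREATE INDEX items_acme_hnsw
    ON items USING hnsw (embedding vector_cosine_ops)
    WHERE tenant = 'acme';

-- queries repeating the same predicate can use the partial index
SELECT id
FROM items
WHERE tenant = 'acme'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'  -- cosine distance to the query embedding
LIMIT 10;
```
The catch is that this only really works when the filter column has a handful of known values; it doesn't help with the high-cardinality case.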
Dedicated vector stores often solve some of these problems but create others. Index builds are often much faster, and you’re working at a higher level (for better or worse) so there’s less time spent on tuning indices or database configurations. But (as mentioned in other comments) keeping your data in sync is a huge issue. Even if updates and deletes aren’t a big part of your workload, figuring out what metadata to index alongside your vectors can be challenging. Adding new pieces of metadata may involve rebuilding the entire index, so you need a robust way to move terabytes of data reasonably quickly. The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
> Finally, building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale
For anyone coming across this without much experience here, for building these indexes in pgvector it makes a massive difference to increase your maintenance memory above the default. Either as a separate db like whakim mentioned, or for specific maintenance periods depending on your use case.
```
-- check the current setting (the PostgreSQL default is a modest 64MB)
SHOW maintenance_work_mem;

-- raise it for the duration of the index build; '8GB' is only an example
-- value, size it to the RAM you can actually spare on the instance
SET maintenance_work_mem = '8GB';
```
In one of our semantic search use cases, we control the ingestion of the searchable content (laws, basically) so we can control when and how we choose to index it. And then I've set up classic relational db indexing (in addition to vector indexing) for our quite predictable query patterns.
For us that means our actual semantic db query takes about 10ms.
Starting from 10s of millions of entries, filtered to ~50k (jurisdictionally, in our case) relevant ones and then performing vector similarity search with topK/limit.
Built into our ORM and zero round-trip latency to Pinecone or syncing issues.
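Concretely, the query is roughly this shape (simplified and with made-up names, since ours actually goes through the ORM):
```
-- simplified shape of the query (made-up names; $1 = jurisdiction, $2 = query embedding)
SELECT id, title
FROM laws
WHERE jurisdiction = $1        -- classic B-tree-indexed pre-filter (~50k rows survive)
ORDER BY embedding <=> $2      -- cosine distance to the query embedding
LIMIT 20;                      -- topK
```
With a selective filter like this, Postgres can use the B-tree index on jurisdiction first and then order only the surviving rows by distance.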
EDIT: I imagine whakim has more experience than me and YMMV, just sharing lessons learned. Even with higher maintenance mem the index building is super slow for HNSW.
Thanks for sharing! Yes, getting good performance out of pgvector with even a trivial amount of data requires a bit of understanding of how Postgres works.
Thank you for the comment! Compared to you I have only scratched the surface of this quite complex domain - would love to get more of your input!
> building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.
Yes, I experienced this too. I went from 1536 to 256 dimensions and did not try as many values as I’d have liked because spinning up a new database and recreating the embeddings simply took too long. I’m glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I’ve struck the tradeoff at the right place.
Someone on Twitter reached out and pointed out that one could quantize the embeddings to bit vectors and search with Hamming distance - supposedly the performance hit is actually very negligible, especially if you add a quick rescore step: https://huggingface.co/blog/embedding-quantization
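For what it's worth, recent pgvector (0.7+) can express that idea directly via binary_quantize and the Hamming distance operator. I haven't tried this myself, so treat it as a sketch - it assumes a toy items table with 3-dimensional embeddings:
```
-- assumes a table items(id bigint, embedding vector(3)) and pgvector 0.7+
-- HNSW index over the binary-quantized embeddings (bit(n) must match the dimension)
CREATE INDEX items_bq_hnsw
    ON items USING hnsw ((binary_quantize(embedding)::bit(3)) bit_hamming_ops);

-- coarse search by Hamming distance over the bit vectors, then rescore the
-- shortlist with the full-precision cosine distance
SELECT id
FROM (
    SELECT id, embedding
    FROM items
    ORDER BY binary_quantize(embedding)::bit(3) <~> binary_quantize('[0.1, 0.2, 0.3]')
    LIMIT 100
) shortlist
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```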
> But (as mentioned in other comments) keeping your data in sync is a huge issue.
Curious if you have any good solutions in this respect.
> The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.
I realize they market heavily on this, but for open source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?
> Yes, I experienced this too. I went from 1536 to 256 dimensions and did not try as many values as I’d have liked because spinning up a new database and recreating the embeddings simply took too long. I’m glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I’ve struck the tradeoff at the right place.
Yeah, this is totally a concern. You can mitigate this to some extent by testing on a representative sample of your dataset.
> Curious if you have any good solutions in this respect.
Most vector store providers have some facility to import data from object storage (e.g. s3) in bulk, so you can periodically export all your data from your primary data store, then have a process grab the exported data, transform it into the format your vector store wants, put it in object storage, and then kick off a bulk import.
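If your primary store is Postgres, the export half of that can be as simple as a COPY. Everything below is hypothetical (table, columns, watermark), and the upload plus the vendor-specific import call are separate steps:
```
-- hypothetical export step (the upload to object storage and the vendor's
-- bulk-import call happen in a separate job)
COPY (
    SELECT id, embedding, category, updated_at
    FROM documents
    WHERE updated_at > '2024-01-01'  -- watermark from the last export, hard-coded here for illustration
) TO '/tmp/documents_export.csv' WITH (FORMAT csv, HEADER true);
-- (use psql's \copy instead if the database user can't write server-side files)
```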
> I realize they market heavily on this, but for open source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?
This is definitely a selling point for any open-source solution, but there are lots of dedicated vector stores which are not open source, so it is hard to really know how their filtering algorithms perform at scale.
We've used Pinecone for a while (and it definitely "just worked" up to a certain point), but we're now actively exploring Google's Vertex Search as a potential alternative if some of the pain points can't be mitigated.