
> Performance has been very

for how many records?



During our initial testing, ~1M nodes on a local Docker container with 1 GB RAM and 1 vCPU.

But here I mean "performance" in both retrieval time and the overall quality of the fragments retrieved for RAG, compared to a `pgvector`-only implementation. It is possible to "simulate" these kinds of graph traversals in Postgres as well, but you'll have to work much harder to get the performance (we tried that first).
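To make the "simulate it in Postgres" idea concrete: the usual approach is a recursive CTE over an edge table. A minimal sketch (using SQLite for a self-contained demo; the schema and data are hypothetical, and Postgres's `WITH RECURSIVE` syntax is nearly identical):

```python
import sqlite3

# Toy graph: a nodes table and an edge list, with an index on the edge source
# so each expansion step is an index lookup rather than a scan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE edges (src INTEGER, dst INTEGER);
CREATE INDEX idx_edges_src ON edges(src);
INSERT INTO nodes VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d');
INSERT INTO edges VALUES (1,2),(2,3),(3,4);
""")

# All nodes reachable from node 1, with hop count, capped at 3 hops.
rows = conn.execute("""
WITH RECURSIVE reachable(id, depth) AS (
    SELECT 1, 0
    UNION
    SELECT e.dst, r.depth + 1
    FROM reachable r JOIN edges e ON e.src = r.id
    WHERE r.depth < 3
)
SELECT n.label, r.depth
FROM reachable r JOIN nodes n ON n.id = r.id
ORDER BY r.depth;
""").fetchall()
print(rows)  # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
```

This works, but it's also where the "work much harder" part shows up: variable-length traversals, deduplication, and per-hop filtering all have to be hand-rolled in SQL instead of expressed as a single pattern.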


Huh. I've had the opposite experience. Neo4j has a pretty nice interface and package overall, but I was not impressed with the performance, and the developer experience was about on par with Elasticsearch (not comparing the two databases, just the developer resources and communities). For general-purpose use I've still not found anything better than Postgres (and yes, I would consider knowledge graphs general purpose). For my day-to-day work I'm constantly querying a regularly updated knowledge graph of >10M active, highly connected nodes; I keep previous versions in the same database so I can traverse backwards through time. This is all on my laptop. No problems with latency or performance.

I'm always curious what people's use cases are with graph databases; do people find Cypher and SPARQL helpful? I've tried several times, but SQL is just so expressive. Postgres is still my favorite graph database (and CRUD RDBMS, and filesystem, and "data conversion tool").


If your performance is poor, try running your query with `PROFILE {your_query}`. It's very easy to write a query that ends up loading way more nodes than expected. Years ago we had one query that performed progressively worse -- it turned out one leg was loading the full node space!
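For anyone who hasn't used it: you just prefix the query, and Neo4j executes it and reports per-operator row counts and db hits, which makes an accidental full scan obvious. A hypothetical example (labels and properties made up):

```cypher
PROFILE
MATCH (p:Person {name: 'Alice'})-[:KNOWS*1..3]->(friend:Person)
RETURN DISTINCT friend.name;
```

If the plan shows a `NodeByLabelScan` (or worse, `AllNodesScan`) where you expected an index seek, that leg of the query is the problem.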

What I have found is that "land and expand" -- using an index to find the landing spots -- is key for performance. The reason is that once you "land" effectively, the "expand" step is cheap and fast.
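Concretely (again with hypothetical labels), that means indexing the property you anchor on, then letting the traversal fan out from the anchored node only:

```cypher
// "Land": the index makes finding the anchor node(s) an O(log n) lookup.
CREATE INDEX doc_id_index IF NOT EXISTS FOR (d:Document) ON (d.doc_id);

// "Expand": the traversal then only touches the anchor's neighborhood,
// regardless of total graph size.
MATCH (d:Document {doc_id: $id})-[:CITES*1..2]->(related:Document)
RETURN related.title;
```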

Some of it will also come down to your graph design. If you have a lot of super-dense nodes (analogous to a large JOIN), they will create a lot of memory pressure, which Neo4j does not handle well.

But in a RAG use case, I don't see these as being issues.


The number of nodes means nothing. What matters for performance is how interconnected your network is and how complex the relationships you want to extract are.


Right. I once created a Neo4j db with millions of nodes and relationships. Individual queries were very performant for all of my access patterns. Where it failed was throughput (queries/sec): throw more users at it, and it slowed to a crawl. Yes, read replicas are an option, but I was really discouraged by Neo4j's performance with more than a few concurrent users.



