Have you profiled to see what is actually causing your performance issues? You could find something unexpected is eating up all that time. When I profiled a similar performance issue I thought was caused by Lucene, it turned out the problem was mostly string concatenations.

As to whether you should use StringField or IntField (or TextField, or whatever), you should determine that based on what is in the field and how you are going to search it. If you might want to search the field as a range of numeric values, it should be an IntField, not a StringField. By the way, StringField indexes the entire value as a single term and skips analysis, so it is also the wrong field for full text, for which you should use a TextField.

Basically, using a StringField for everything seems very much like a bad code smell to me, and could cause performance issues at index time, but I would expect the much larger problems to appear when you start trying to search. As far as "how will the search performance be with 5 billion values" goes, that's far too vague a question to even attempt to answer.

This page is now quite dated, but if you exhaust all your other options, it may provide hints of which levers you can pull to tweak performance: How to make indexing faster

These answers are all anecdotal, but perhaps they'll be of some use to you.

Our current use case seems somewhat similar to yours: 1.6 billion rows, most fields are exact matches, periodic addition of files/rows, regular searching. Our initial indexing is not distributed or parallelized in any way, currently, and takes around 9 hours. I only offer that number to give you a very vague sense of what your indexing experience may be. Our indexing time does not grow exponentially with the number of rows already indexed, though it does slow down very gradually: for us, perhaps 20% slower by the end, though that could also be specific to our data. If you are experiencing significant slow-down, I support femtoRgon's suggestion that you profile to see what's eating the time. Lucene has never been the slowest/weakest component in our system.

Yes, you can write to your index in parallel, and you may see improved throughput. Whether it helps or not depends on where your bottlenecks are, of course. Consider using Solr - it may ease your efforts here.

We use a mixture of StringField, LongField, and TextField. It seems unlikely that the type of field is causing your slowdown on its own.

So each time we change an entity, we do something like: open a db transaction and open a Lucene IndexWriter; make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(...) then indexWriter.addDocument(...). If all went well, commit the db transaction and the IndexWriter.
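The field-type advice above can be sketched in code. This is a minimal illustration against the Lucene 4.x API these answers reference (later Lucene versions replaced IntField/LongField with IntPoint/LongPoint); the field names and values are made up:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldChoice {
    // Pick the field type from how each value will be searched,
    // rather than using StringField for everything.
    public static Document build() {
        Document doc = new Document();

        // Exact-match identifier: one un-analyzed term.
        doc.add(new StringField("isbn", "978-0321356680", Field.Store.YES));

        // Numeric value you may query as a range: IntField, not StringField.
        doc.add(new IntField("pageCount", 832, Field.Store.YES));

        // Full text: analyzed into individual terms at index time.
        doc.add(new TextField("body", "some long body text ...", Field.Store.NO));

        return doc;
    }
}
```

A StringField("pageCount", "832", ...) would only ever match the literal string "832"; the IntField version also supports NumericRangeQuery.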
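On the parallel-writing point: a single IndexWriter instance is thread-safe and meant to be shared across threads, rather than opening one writer per thread on the same directory. A rough sketch, again assuming the Lucene 4.x API, with an arbitrary index path, thread count, and field name:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/idx")), cfg)) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int t = 0; t < 4; t++) {
                final int shard = t;
                pool.submit(() -> {
                    // Each task indexes its own slice of the rows,
                    // all sharing the one writer.
                    for (int i = 0; i < 1000; i++) {
                        Document doc = new Document();
                        doc.add(new StringField("id", shard + "-" + i, Field.Store.NO));
                        try {
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
```

Whether this buys you throughput depends on where the bottleneck is, as noted above; if the source database is the slow part, more indexing threads won't help.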
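The delete-then-add update pattern in the last answer might look roughly like this. The `Entity` type and the `saveToDb`/`toDocument` helpers are hypothetical stand-ins for your own persistence code; only the IndexWriter calls are Lucene API:

```java
import java.sql.Connection;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class EntityIndexSync {
    // Hypothetical entity; replace with your own domain class.
    static class Entity { String id; String name; }

    void saveToDb(Connection db, Entity e) throws Exception { /* your SQL here */ }

    Document toDocument(Entity e) {
        Document doc = new Document();
        doc.add(new StringField("id", e.id, Field.Store.YES));
        return doc;
    }

    void updateEntity(Connection db, IndexWriter writer, Entity entity) throws Exception {
        db.setAutoCommit(false);              // open the db transaction
        try {
            saveToDb(db, entity);             // change the entity in the db
            // Replace the old copy in the index: delete by unique id, then re-add.
            writer.deleteDocuments(new Term("id", entity.id));
            writer.addDocument(toDocument(entity));
            db.commit();                      // all went well: commit the db...
            writer.commit();                  // ...then make the index change durable
        } catch (Exception e) {
            db.rollback();                    // failure: undo the db change
            throw e;
        }
    }
}
```

Note that IndexWriter also offers updateDocument(Term, Document), which performs the same delete-then-add as a single call.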
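femtoRgon's string-concatenation anecdote is easy to reproduce in plain Java: repeated `+=` on a String copies the whole accumulated buffer on every pass (quadratic overall), while a StringBuilder appends in place. A small self-contained illustration (method and variable names are made up):

```java
public class ConcatDemo {
    // Builds one line per row with String +=; each iteration copies
    // everything accumulated so far.
    static String slow(String[] rows) {
        String out = "";
        for (String r : rows) {
            out += r + "\n";
        }
        return out;
    }

    // Same output, but appends in place without re-copying.
    static String fast(String[] rows) {
        StringBuilder sb = new StringBuilder();
        for (String r : rows) {
            sb.append(r).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] rows = {"a", "b", "c"};
        System.out.println(slow(rows).equals(fast(rows))); // prints "true"
    }
}
```

At a few rows the difference is invisible; at millions of rows per indexing run, the `+=` version can dominate a profile even though Lucene gets the blame.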