Fastest way to search through text records

Hi all,

Hoping to pick your collective minds.

I've got a MySQL table (pretty much just a couple of fields - an ID and a string) which I need to search by the string and get the ID(s) of any matching rows.

Using LIKE "%search%" works perfectly - but it requires a full table scan and there's 61 million rows.

I've tried using a MATCH ... AGAINST() search, but that seems to match against the FULL word only, and I need to include results that match part of a word too, if that makes sense.

I've considered putting the dataset into something like Redis, but does anyone have any suggestions on the fastest way to do this? It doesn't have to use a database - I can put the data into any format I want - the only "requirement" is that I've got to be able to search it using PHP.

Thanks in advance!


Comments

  • InceptionHosting Hosting Provider, OG

    I don't think there will be any magic bullet for 61 million rows short of a complete database redesign to separate them out. I would say your best bet is better caching so the table sits in RAM.

    Perhaps someone like @joepie91 or @Francisco can give a better answer.


  • How about full-text search (FTS) indexing with ADD FULLTEXT?

    Example:

    ALTER TABLE news ADD FULLTEXT (title, content, author)
    
  • @AnthonySmith said: I don't think there will be any magic bullet for 61 million rows short of a complete database redesign to separate them out. I would say your best bet is better caching so the table sits in RAM.

    Yeah I think that might be the way to go.

    @isunbejo said: How about full-text search (FTS) indexing with ADD FULLTEXT?

    I've added a FULLTEXT index to the table already; it hasn't really sped up the LIKE searches. Using MATCH/AGAINST is quicker but doesn't return the correct set of results.

  • edited March 2020

    Basically, what you're asking for is a search engine, and a Solr or Elasticsearch cluster might be better suited for this case. They're designed for text search.

    If you want to keep using MySQL, see if there is a distributed DB module which will automatically shard the DB and parallelize the query. PostgreSQL has the Citus extension and Postgres-XL, but I'm not as familiar with the MySQL ecosystem as I am with the Postgres ecosystem.

    Of course, you'll need to throw lots of medium boxes at the problem to get the best throughput.

  • @FlamingSpaceJunk said: Basically, what you're asking for is a search engine, and a Solr or Elasticsearch cluster might be better suited for this case. They're designed for text search.

    Thanks! That's probably the best way of looking at it.

    I might set up a Solr or Elasticsearch install and import the data to see how it fares speed-wise.

    There's no reason for it to be kept in a database - if grepping a CSV file is quickest, I'll write a solution that does that lol.

  • @Mr_Tom said:
    I might set up a Solr or Elasticsearch install and import the data to see how it fares speed-wise.

    You'll probably need a 3 node cluster to start seeing performance gains.

    There's no reason for it to be kept in a database - if grepping a CSV file is quickest, I'll write a solution that does that lol.

    You can use a CSV as the backend for some database systems. :smiley:

  • edited March 2020

    if grepping a CSV file is quickest, I'll write a solution that does that lol

    No matter how silly this sounds, these types of solutions get the job done most of the time.
    Fuzzy search is already a PITA - you're never going to like what you get from it (at least in my experience).
    Text-search-oriented solutions like Elasticsearch/Solr will perform better, but they come with their own set of extra work.

    Using LIKE "%search%" works perfectly - but it requires a full table scan and there's 61 million rows.

    Have you tried loading the whole dataset into the buffer (RAM)? Which DB engine are you using? If your data is mostly read-only, may I suggest the Aria engine (or Memory)? Of course, I don't know how many GB your 61 million rows actually are, but if all of your data is in the buffer, I don't see any problem with a FULL TABLE SCAN.
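
    Roughly what I have in mind - assuming MariaDB and a table called records (the table name and the sizes below are only examples, so adjust to your data and RAM):

    -- Keep the whole table in memory with the MEMORY engine (contents are lost on restart),
    -- or switch to Aria (MariaDB) as a crash-safe, read-friendly alternative.
    -- Note: MEMORY tables don't support FULLTEXT indexes and are capped by max_heap_table_size,
    -- so that variable would need raising first for a dataset this size.
    ALTER TABLE records ENGINE=MEMORY;   -- or: ALTER TABLE records ENGINE=Aria;

    -- If you stay on InnoDB instead, make sure the buffer pool can hold the whole table.
    SET GLOBAL innodb_buffer_pool_size = 4294967296;  -- 4 GB, example value only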

  • joepie91 OG, Services Provider

    I haven't played with it much yet, but AFAIK, PostgreSQL's full-text search capabilities are a lot better (and probably more performant) than those of MySQL.

    Given that PostgreSQL is generally nicer to work with (and faster) than MySQL, but its SQL dialect is similar, it may be worth considering a switch - after trying out your usecase in PostgreSQL and verifying that it does work well there, of course. You can also use EXPLAIN ANALYZE (along with this tool) to understand exactly how a query was scheduled and executed, so that you can track down what exactly made it slow.
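
    For the partial-word matching in the OP, it's probably Postgres's pg_trgm trigram extension rather than its full-text search that does the heavy lifting. A minimal sketch, assuming a table records(id, txt) - names are illustrative:

    -- A trigram GIN index lets LIKE '%term%' use an index instead of a sequential scan.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX records_txt_trgm_idx ON records USING gin (txt gin_trgm_ops);

    -- Check that the planner actually uses the index (very short patterns, e.g. two
    -- characters, may still fall back to a scan since they contain no full trigram).
    EXPLAIN ANALYZE SELECT id FROM records WHERE txt LIKE '%term%';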

    ElasticSearch is a possibility, but note that it is open-core (not open-source) and certainly not low-maintenance. I wouldn't bother with it unless you've verified that PostgreSQL really can't do what you need. As already mentioned elsewhere, it's probably not going to be faster in and of itself either, you'd need to actually add more servers for that.

    But honestly, 61 million records should be peanuts for an RDBMS on a single server. It's not the sort of dataset size where you should be running into performance issues, so long as you get your indexes right and don't run especially unperformant queries. This should just work.

  • willie OG
    edited March 2020

    @Mr_Tom said: I've got a MySQL table (pretty much just a couple of fields - an ID and a string) which I need to search by the string and get the ID(s) of any matching rows.

    That is a crazy DB setup. You shouldn't have to scan strings at query time. Process each string when you insert it into the database, extracting any IDs from the string. Make a new table that combines the extracted IDs with the ID of the row you just added. Add suitable indexes and do appropriate JOIN operations to find the rows you want. Full-text searching doesn't seem like the right thing at all.
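
    Roughly the kind of schema I mean (table and column names here are just for illustration):

    -- Hypothetical lookup table: values extracted at insert time, stored once and indexed.
    CREATE TABLE extracted_terms (
        record_id INT NOT NULL,          -- ID of the row in the original table
        term      VARCHAR(32) NOT NULL,  -- value pulled out of the string on insert
        INDEX (term)
    );

    -- The search becomes an indexed lookup plus a join instead of a string scan.
    SELECT r.id
    FROM records r
    JOIN extracted_terms t ON t.record_id = r.id
    WHERE t.term = 'searchvalue';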

    What is the application? If it's not an online op taking a high query load, maybe you can use a plain text file. 61M records is probably less than a GB of text and you can grep that much text in a second or so. Maybe that is ok for what you are doing.

  • havoc OG, Content Writer

    Pretty sure the standard Unix search stuff is optimised to death, so I'd probably try to find a way to turn this into text files.

  • @joepie91 said:
    ElasticSearch is a possibility, but note that it is open-core (not open-source) and certainly not low-maintenance. I wouldn't bother with it unless you've verified that PostgreSQL really can't do what you need. As already mentioned elsewhere, it's probably not going to be faster in and of itself either, you'd need to actually add more servers for that.

    The base functionality is still in Elasticsearch; the proprietary addons don't add much when it just needs to be a search engine, provided this isn't filled with sensitive information and on the Internet.

    If you do need everything, Amazon's Open Distro for Elasticsearch is worth a look. (https://opendistro.github.io/for-elasticsearch/) It's what Amazon uses for their hosted Elasticsearch instances.

    Yes, it's like any other Java application. :expressionless:

    @Mr_Tom said:
    There's no reason for it to be kept in a database - if grepping a CSV file is quickest, I'll write a solution that does that lol.

    I found Whoosh, which is a search engine library for Python (https://whoosh.readthedocs.io/en/latest/intro.html), and Xapian, a search engine library for lots of languages (https://xapian.org/). (I haven't used either of these; I just found them after getting curious about search engine libraries.)

    Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

  • @FlamingSpaceJunk said:
    Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

    The answer is always awk based somehow.


  • @WSS said:

    @FlamingSpaceJunk said:
    Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

    The answer is always awk based somehow.

    I thought of awk. I haven't used it enough to get a feel for its speed, or how well text searching works.

  • @FlamingSpaceJunk said:

    @WSS said:

    @FlamingSpaceJunk said:
    Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

    The answer is always awk based somehow.

    I thought of awk. I haven't used it enough to get a feel for its speed, or how well text searching works.

    awk '/term/ {print $1}' file_with_penis_ascii.txt


  • @WSS said:
    awk '/term/ {print $1}' file_with_penis_ascii.txt

    61 millions lines of ASCII peens. :astonished:

    Would that be a linear search? I'm thinking part of the solution is getting as much parallelism as possible. Something, something O(log(n)).

  • @FlamingSpaceJunk said:
    Would that be a linear search? I'm thinking part of the solution is getting as much parallelism as possible. Something, something O(log(n)).

    All I know is that my log comes first.


  • ..oh god did I just make a sexual math joke?


  • Thanks for all the info/tips guys.

    I've tweaked some of the indexes and it's helped - doing a WHERE text="xx" search is now almost instant, but it's still a little slow for LIKE queries. I've also started having a look at the "search engine" type options, as these might be a better bet for searching/returning "close" matches as well as exact matches. At the moment I handle that in an acceptable way in PHP, by generating "close" terms based on the search term and then querying the database for those depending on the number of results.

  • I would offload search to Sphinx Search using SphinxSE with MariaDB.
    https://mariadb.com/kb/en/about-sphinxse/

  • LIKE 'XX%' should also be fast, because of the index, and there is sometimes a way to make an index on the reverse spelling of the string so that LIKE '%XX' is also fast. The slow one is LIKE '%X%'. But really if the stuff you are parsing out of those strings is id numbers or something else you can identify when you import the strings in the first place, then as mentioned above it's best to extract that at import time and put it in its own indexed column.

    The way you described the problem initially, it sounds a bit weird if you really have to search free text. If that is the case though, Postgres's text search stuff is supposed to be quite good (almost as good as Elastic/Solr) while MySQL's is terrible.

  • exception0x876 Hosting Provider
    edited March 2020

    You can still achieve it using MySQL FULLTEXT (not really, see update below)

    It supports trailing-wildcard searches (matching word prefixes), so you could search like this: MATCH(column) AGAINST('term*' IN BOOLEAN MODE)

    If you want a leading wildcard as well, you can create another column that contains the reversed words of the original column. The final query would look like this:

    MATCH(original_column) AGAINST('term*' IN BOOLEAN MODE) OR MATCH(reversed_column) AGAINST('mret*' IN BOOLEAN MODE)

    You might need to update the MySQL stopwords list with the reversed counterparts in this case.
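
    A sketch of that setup (the table name is made up, and the reversed column has to be kept in sync on insert/update, e.g. by the application or a trigger):

    -- Hypothetical table "records"; this assumes single-word values -
    -- for multi-word text each word would need reversing individually.
    ALTER TABLE records ADD COLUMN reversed_column VARCHAR(255);
    UPDATE records SET reversed_column = REVERSE(original_column);

    ALTER TABLE records ADD FULLTEXT (original_column);
    ALTER TABLE records ADD FULLTEXT (reversed_column);

    SELECT id FROM records
    WHERE MATCH(original_column) AGAINST('term*' IN BOOLEAN MODE)
       OR MATCH(reversed_column) AGAINST('mret*' IN BOOLEAN MODE);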

    UPDATE: however, that is still not gonna find the partial match inside the word.

  • bikegremlin Moderator, OG, Content Writer
    edited March 2020

    Not 100% sure if this is a good option for your particular database, but to me it came highly recommended for editing WordPress ones:

    https://interconnectit.com/products/search-and-replace-for-wordpress-databases/

    DISCLAIMERS:
    Make backups before trying anything, don't blindly follow advice of idiots like me from the Internet, eat healthy food and drink enough water...


  • @willie said: The way you described the problem initially, it sounds a bit weird if you really have to search free text. If that is the case though, Postgres's text search stuff is supposed to be quite good (almost as good as Elastic/Solr) while MySQL's is terrible.

    Yeah - although the ID is the final required bit of information, it's user-inputted text which is being searched, in a way that needs to find "Mr_Tom" from the term "To". The text is only 9 characters max. If PostgreSQL is faster than MySQL that will probably solve it - I've replicated it on a server of similar spec to the live one, and with the index tweaks it's a lot faster - but not quite there for the expected times for a website.

    @exception0x876 said: UPDATE: however, that is still not gonna find the partial match inside the word.

    That's what I need ideally. LIKE %TERM% works perfectly, just not the fastest.

    @bikegremlin said: Not 100% sure if this is a good option for your particular database

    That looks a good option for what it does, but doesn't quite fit here.

    Cheers

  • MySQL is not that good for that; I guess Elasticsearch would make more sense.

    There is, however, a tool called Sphinx that can index your tables and then the search will be fast, but I don't know if it will fit your needs.

  • willie OG
    edited March 2020

    What is the actual application if you don't mind my asking, and what is the total amount of text to be searched? What size vm are you running in? I'm still perplexed that you have to find "Mr_Tom" from "to".

    Maybe you can index trigraphs or substrings from the tokens in this text? If the column length is 9 chars max, this is around 0.5 GB of text, and if you index the 9-char strings plus their substrings starting from the 2nd char, 3rd char, etc., you get around 500M total strings, around 2.5 GB - doesn't seem too bad. It will chew up some disk space, but you can index them all. It doesn't seem necessary to have them in RAM, since the index lookups on SSD will be fast enough.
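
    A sketch of that suffix-table idea (hypothetical names) - every suffix of each string becomes its own indexed row, so a substring search turns into an indexed prefix lookup:

    -- Hypothetical "tails" table: one row per suffix of each original string.
    -- For 'Mr_Tom' you would insert 'Mr_Tom', 'r_Tom', '_Tom', 'Tom', 'om', 'm'.
    CREATE TABLE string_tails (
        record_id INT NOT NULL,       -- ID of the original row
        tail      VARCHAR(9) NOT NULL,
        INDEX (tail)
    );

    -- A substring search is now a prefix search on the tails, which can use the index.
    SELECT DISTINCT record_id FROM string_tails WHERE tail LIKE 'To%';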

  • Postgres's native full-text capabilities are probably enough.

  • $ head -5 records
    B7C9607F0
    BEA9DE6AF
    EC1D58C44
    576099FE5
    88607C803
    
    $ wc -l records
    69905067 records
    
    $ time grep -F AAAAAA records
    9AAAAAAAB
    AAAAAA9F8
    1AAAAAA1F
    AAAAAADA7
    AAAAAAD4A
    AAAAAA519
    AAAAAA5DD
    DAAAAAA58
    4AAAAAA92
    6AAAAAA4C
    F81AAAAAA
    AAAAAA34D
    746AAAAAA
    BAAAAAAEF
    
    real    0m0.545s
    user    0m0.465s
    sys     0m0.079s
    
    $ cat generate.sh
    #!/bin/sh
    
    dd if=/dev/urandom bs=1M count=300 | basenc --base16 -w 9 > records
    
    $ ls -lh
    total 667M
    -rwxr--r-- 1 1000 1000   79 Mar 13 15:15 generate.sh
    -rw-r--r-- 1 1000 1000 667M Mar 13 15:16 records
    -rw-r--r-- 1 1000 1000   42 Mar 13 15:21 rep.txt
    drwxr-xr-x 2 1000 1000 4.0K Mar 13 15:05 v1
    
    $
    

    databases are lame B)

  • willie OG
    edited March 2020

    If this is for a web service, 0.5 sec per query is really too slow. We still haven't heard anything about the application though. Fulltext engines will be of no help with this since they search for tokens given prefixes. They won't look for substrings in the middle of tokens. All you can really do is index all the "tails" of each token, or (depending on the input distribution) do some hack like indexing trigraphs. Mr_Tom if you're still around and want serious advice about this, we need to know the query load, the type of server, the software stack, the origin of the data (so we can make reasonable guesses about how the substrings are distributed), what kinds of lengths you're willing to go to in the software to make it fast, etc. A full-out in-process data structure with everything in ram can probably handle millions of queries per second, but the implementation hassle would be considerable.

  • @willie said: 0.5 sec per query is really too slow

    you are not wrong, but:

    • it will satisfy a lot of usecases
    • it is 0.5s without database and all involved hassle
    • it is 0.5s without optimisations like query caching

    As you said yourself, it is not possible to index by substrings in the middle of tokens, so I'm not sure what kind of algo in a dynamic language is going to be faster.
    Also, if it's a web service, it is going to serve queries in parallel, so 0.5s does not mean two clients per second.
