
Dictionary handling _very_ slow. #3

Open
btimby opened this issue Oct 1, 2012 · 5 comments

Comments

btimby commented Oct 1, 2012

The first time I used the dictionary feature of this library, I pointed it at /usr/share/dict/words as outlined in the documentation. On my system, this file contains just shy of 0.5M words.

However, if one does this, the validation step takes several minutes as the file is loaded, parsed, then searched for the word. This happens for every form submission, which renders the feature unusable. I don't have time to fix this right now, but one way to handle it is a preprocessing step that converts the dictionary to a searchable form. This can easily be integrated into one's deploy procedure, so that the dictionary is sourced from a plain text file whenever the code is deployed. Optionally, a management command could be added that performs this pre-processing.

A precedent for this type of operation is Postfix (the MTA). Its postmap command converts text lists into searchable databases so that the MTA can do a huge number of lookups very quickly.

http://www.postfix.org/postmap.1.html
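
A minimal sketch of such a preprocessing step, using Python's standard-library dbm module in place of postmap (the file names and paths here are illustrative, not part of this library):

    import dbm

    # Build step: run once at deploy time to convert the plain-text
    # wordlist into an on-disk hash database.
    with open("/usr/share/dict/words") as src, dbm.open("words.db", "n") as db:
        for line in src:
            db[line.strip().encode()] = b"1"

    # Lookup step: each validation becomes a single key probe instead
    # of a linear scan of the whole file.
    with dbm.open("words.db", "r") as db:
        print(b"hunter2" in db)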

A further optimization would be to trim words from the dictionary that are shorter than the configured minimum length. Potential passwords shorter than this length are rejected outright, so the existence of these words in the dictionary bloats it unnecessarily.
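
A sketch of that trimming pass, assuming a configured minimum length of 8 (the real value would come from the validator's settings):

    MIN_LENGTH = 8  # assumed setting, for illustration only

    with open("/usr/share/dict/words") as src, open("words.trimmed", "w") as dst:
        for line in src:
            word = line.strip()
            # Words shorter than the minimum can never match a valid
            # candidate password, so drop them from the dictionary.
            if len(word) >= MIN_LENGTH:
                dst.write(word + "\n")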

btimby (Author) commented Oct 1, 2012

A bit more information about postmap:

Postfix uses a bsddb hash to store this data. Here is some sample code for interacting with it:

http://pastebin.com/X2rygT3K

It should be trivial to create and search this data using python-bsddb3. However, it might also be possible to simply pickle a set. Loading and searching the pickled set should be fast, especially if the first validation loads and caches the set and subsequent validations only search it.
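
A minimal sketch of that pickled-set approach (the file name and module-level cache are illustrative):

    import pickle

    # Preprocessing step, run once at deploy time.
    with open("/usr/share/dict/words") as src:
        words = {line.strip() for line in src}
    with open("words.pickle", "wb") as dst:
        pickle.dump(words, dst, protocol=pickle.HIGHEST_PROTOCOL)

    # Validation-time lookup: load on first use, then reuse the cache.
    _words_cache = None

    def is_dictionary_word(word):
        global _words_cache
        if _words_cache is None:
            with open("words.pickle", "rb") as f:
                _words_cache = pickle.load(f)
        return word in _words_cache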

maccesch (Contributor) commented

These are good thoughts. Would you mind creating a pull request for this?

darakian (Contributor) commented Mar 6, 2020

What about simply loading the dict into memory as a list if the file is under some arbitrary size, e.g. 1 MB? The way I see this function, it must return a complete set of dictionary words:

    def get_dictionary_words(self, dictionary):
        # Reads the entire wordlist into memory on every call; smart_text
        # is Django's text-coercion helper (django.utils.encoding).
        with open(dictionary) as wordlist:
            return [smart_text(line.strip()) for line in wordlist]

So for a small enough dictionary, keeping it resident in memory seems appropriate, while for a larger file, an iterator over the file is perhaps the best that can be done.
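
A sketch of that size check (the 1 MB threshold is the arbitrary cutoff mentioned above; smart_text is Django's text-coercion helper used by the existing function):

    import os
    from django.utils.encoding import smart_text

    MAX_IN_MEMORY_BYTES = 1 * 1024 * 1024  # arbitrary cutoff

    def get_dictionary_words(path):
        # Small file: keep the complete wordlist resident in memory.
        if os.path.getsize(path) <= MAX_IN_MEMORY_BYTES:
            with open(path) as f:
                return [smart_text(line.strip()) for line in f]

        # Large file: fall back to a lazy iterator over the file.
        def iter_words():
            with open(path) as f:
                for line in f:
                    yield smart_text(line.strip())
        return iter_words()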

darakian (Contributor) commented Mar 6, 2020

@maccesch I've made PR #56, which implements my simple cache proposal.

bobwhitelock commented

Related follow-up PR: #65
