How can an engineer relax at work using Python?

Hello everyone! My name is Vladimir Ganzyuk.

I work as an NSI (normative and reference information) engineer and am learning C# in my spare time. I had never touched Python, but I stumbled upon a very interesting library, Pymorph.

Pymorph (the pymorphy2 package) is a morphological analyzer for the Russian language that uses dictionaries from OpenCorpora. The source code is available on GitHub, and the library's documentation is quite well written.

Background:

While I was sipping tea with sweat on my brow, my manager wrote to tell me that over the next two sprints I needed to check and unify one of the attributes for a billion positions.

Okay, I was actually drinking coffee, but there really were a huge number of positions. Unfortunately, all this could only be done manually (or so everyone thought).

The gist:

The export from our system contains a working environment, and naturally there can be several of them per position. The list of environments was originally separated by “;”, until at some point someone decided that “,” would be better after all. That part can be fixed in Excel, but there is a second rule: the name of each working environment must begin with the noun, followed by the adjective. Here are some examples:

  • Diesel fuel;

  • Hydrocarbon gas, propane-propylene fraction, natural gas;

  • Butane-butylene fraction.

The working environments in the system were entered incorrectly, for example: “Butane butylene fraction”, “Gasoline; water”, “Hydrocarbon gas, dry gas”.

Recall the rule: the noun must come first, and the elements of the list must be separated by commas. So the entries had to be unified.
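The separator cleanup can also be done in Python instead of Excel. Here is a minimal sketch of that step; the function name normalize_separators is hypothetical and is not part of the original script. It only touches semicolons, so commas that are already in the text stay as they are.

```python
import re


def normalize_separators(cell):
    """Replace ';' list separators with ', ' (hypothetical helper)."""
    # Swallow the whitespace around each semicolon and emit exactly ", ".
    cell = re.sub(r"\s*;\s*", ", ", cell)
    # Remove a dangling separator left by a trailing semicolon.
    return cell.strip().rstrip(", ")


print(normalize_separators("Gasoline; water"))  # Gasoline, water
```

Entries that already use commas, such as “Hydrocarbon gas, dry gas”, pass through unchanged.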

In one evening I wrote a hack (function) sorting_dictionary:

import pymorphy2

morph = pymorphy2.MorphAnalyzer()


def upcase_first_letter(text):
    # Helper not shown in the post; assumed to capitalize only the first character.
    return text[:1].upper() + text[1:]


def sorting_dictionary(dictionary):
    result = []

    for cell in dictionary.values():
        result_items = []
        for item in cell.split(", "):
            # Parse every word once and keep the POS of the most probable parse.
            tagged = [(word, morph.parse(word)[0].tag.POS) for word in item.split()]
            ordered = []

            # 1. Nouns and prepositions in their original order;
            #    every noun is followed by the predicatives of the element.
            for word, pos in tagged:
                if pos == "NOUN":
                    ordered.append(word)
                    ordered.extend(w for w, p in tagged if p == "PRED")
                elif pos == "PREP":
                    ordered.append(word)
            # 2. Full adjectives, lowercased.
            ordered.extend(w.lower() for w, p in tagged if p == "ADJF")
            # 3. Conjunctions and full participles.
            ordered.extend(w for w, p in tagged if p in ("CONJ", "PRTF"))
            # 4. Verbs and infinitives.
            ordered.extend(w for w, p in tagged if p in ("VERB", "INFN"))
            # 5. Tokens without a part of speech (numbers, Latin text, etc.).
            ordered.extend(w for w, p in tagged if p is None)
            # Words with any other part of speech are dropped, as in the first version.

            result_items.append(" ".join(ordered))

        result.append(upcase_first_letter(", ".join(result_items)))
    return result

In short, the function takes a dictionary whose values have already been read from the Excel file with the openpyxl library, with “;” already replaced by “,”.
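The loading step is not shown in the post, so here is one possible sketch of it. The names build_dictionary and read_column, the column number, and the assumption that the first row is a header are all hypothetical; only load_workbook, the active sheet, and iter_rows come from openpyxl itself.

```python
def build_dictionary(rows):
    """Map row numbers to cell text, skipping empty cells."""
    dictionary = {}
    for row_number, value in rows:
        if value is None:
            continue
        # The replacement the post describes: ';' becomes ','.
        dictionary[row_number] = str(value).strip().replace(";", ",")
    return dictionary


def read_column(path, column=1, first_row=2):
    """Yield (row_number, value) pairs from one column of the active sheet."""
    # Imported here so build_dictionary also works without openpyxl installed.
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True)
    sheet = workbook.active
    rows = sheet.iter_rows(min_row=first_row, min_col=column,
                           max_col=column, values_only=True)
    for row_number, row in enumerate(rows, start=first_row):
        yield row_number, (row[0] if row else None)
    workbook.close()
```

With these helpers the input could be prepared as build_dictionary(read_column("export.xlsx")), where the file name is only a placeholder.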

The part of speech of a word is available through the POS attribute of its tag: p.tag.POS. If the requested characteristic is not defined for a given tag, None is returned. The designations of the grammemes are listed in the library documentation.

The function returns a list with the reordered version of every value in the dictionary.

Results:

The export for the workshop contained 2193 items, each of which would have had to be checked manually.

The function changed 565 positions, which means the other 1628 were already correct and could be set aside. These are mostly simple ones such as “Nitrogen” or “Unstable gasoline”, which Pymorph identifies without problems.

Of the 565 changed positions, 121 came out wrong, for example “Fresh alkali solution”, where the correct option is “Fresh alkali solution”. Brackets are another problem: “Gasoline product mixture (gasoline, VSG)” comes back as “Mixture (gasoline gas product, VSG)”.
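One possible contributor to the bracket problem is that the cell is split on every “, ”, including the one inside the parentheses, so the bracketed part is torn in two before any word is analyzed. Here is a minimal sketch of a splitter that ignores separators inside parentheses; split_top_level is a hypothetical helper, not part of the original script, and it does not address tokens such as “(gasoline” that still carry a bracket.

```python
def split_top_level(cell, separator=","):
    """Split on separators that are not inside parentheses."""
    items, current, depth = [], [], 0
    for char in cell:
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
        if char == separator and depth == 0:
            # A top-level separator closes the current element.
            items.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    items.append("".join(current).strip())
    # Drop empty elements produced by doubled or trailing separators.
    return [item for item in items if item]


print(split_top_level("Gasoline product mixture (gasoline, VSG), nitrogen"))
```

The bracketed part then stays attached to its element and could be set aside before the words are reordered.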

The whole run took 29 seconds. Not bad, right? But the library sometimes has difficulty determining the part of speech.
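To put the numbers above side by side, here is a short calculation. It assumes, as the post does, that every position the function left unchanged was already correct.

```python
total_items = 2193     # items in the workshop export
changed_items = 565    # items the function rewrote
wrong_changes = 121    # rewritten items that turned out incorrect
runtime_seconds = 29   # reported running time

unchanged_items = total_items - changed_items    # 1628
correct_changes = changed_items - wrong_changes  # 444

error_rate_among_changed = wrong_changes / changed_items
share_needing_manual_fix = wrong_changes / total_items
items_per_second = total_items / runtime_seconds

print(f"unchanged: {unchanged_items}, correctly changed: {correct_changes}")
print(f"wrong among changed: {error_rate_among_changed:.1%}")
print(f"share of export to fix by hand: {share_needing_manual_fix:.1%}")
print(f"throughput: {items_per_second:.1f} items per second")
```

Under that assumption roughly one changed position in five still needs a manual fix, which is about 5.5% of the whole export.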

For example, stepping away from working environments: if you analyze the sentence “Mom washed the frame” word by word, the library rates “washed” as a singular noun with a higher probability than as a verb. The library does provide a score, an estimate of the probability that a given parse is correct, but as the documentation puts it: “how a word should be parsed depends on its neighboring words; pymorphy2 only works at the level of individual words”.
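When the expected part of speech is known from context, the score can be used to pick among all parses instead of always taking the first one. The sketch below is hypothetical: pick_parse is not a pymorphy2 function, and the Tag and Parse tuples are stand-ins that only mimic the tag.POS and score attributes of the objects returned by morph.parse, so that the example runs without the library.

```python
from collections import namedtuple

# Stand-ins mimicking the attributes of pymorphy2 parse results.
Tag = namedtuple("Tag", "POS")
Parse = namedtuple("Parse", "word tag score")


def pick_parse(parses, preferred_pos):
    """Return the highest-scoring parse with a preferred POS, else the top parse."""
    candidates = [p for p in parses if p.tag.POS in preferred_pos]
    pool = candidates or list(parses)
    return max(pool, key=lambda p: p.score)


# Illustrative scores only, not real pymorphy2 output.
parses = [Parse("washed", Tag("NOUN"), 0.6), Parse("washed", Tag("VERB"), 0.4)]
print(pick_parse(parses, {"VERB"}).tag.POS)  # VERB
print(pick_parse(parses, {"ADJF"}).tag.POS)  # NOUN
```

With the real library the call would look like pick_parse(morph.parse(word), {"VERB"}). Deciding which part of speech to expect still requires looking at the neighboring words, which is exactly the part the library leaves to the caller.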

There are also many positions like “Vacuum gas oil; Diesel fuel”, where “gas oil” and “fuel” should come first; the function returned “Vacuum gas oil, diesel fuel”. To my surprise, I even found this environment in the export: “27% aqueous amine solution, H2S – up to 10% by weight, nitrogen”, for which the function returned the absolutely correct “Aqueous amine solution 27%, H2S – up to 10% by weight, nitrogen”.

The library can also change the case of a word. In the Natasha library, which I also tried, you can determine the case but, unfortunately, not change it. And unlike Pymorph, Natasha is very slow: Yargy implements the Earley parser algorithm, whose complexity is O(n^3), and its code is written for readability rather than optimization. Natasha processed 1000 positions in about 1 minute, while Pymorph handled roughly twice that volume in half the time. This is just a small digression, in case anyone faces a similar choice of library.

Changing the case of a word may be needed, for example, to convert the transported environment into the format “Transportation” + environment for another attribute.
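Case changes go through the inflect method of a parse result, which takes a set of grammemes and returns None when the requested form cannot be built. Here is a minimal sketch, assuming the genitive case ("gent") is the one needed after “Transportation”; to_case is a hypothetical wrapper, and the analyzer is passed in as a parameter.

```python
def to_case(word, analyzer, case="gent"):
    """Inflect a word into the given case; return it unchanged if that fails."""
    # Take the most probable parse, as the main script does.
    parsed = analyzer.parse(word)[0]
    inflected = parsed.inflect({case})
    return inflected.word if inflected is not None else word
```

With the real library the analyzer would be pymorphy2.MorphAnalyzer(). For an environment of several words, each word would have to be inflected separately, and tokens the library cannot inflect are returned as they are.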

Conclusion:

It would be interesting to hear the opinion of other experts. Perhaps someone has used artificial intelligence in a similar situation. I hope this method will help someone.

Ganzyuk Vladimir, engineer of normative and reference information (NSI)
