Business process mining and data visualization using Neo4j, Plotly and GPT

This material will be useful for COOs, business analysts, and top managers. Although the text contains some technical details, they are not too heavy. The purpose of the material is to show the general logic we used to extract and analyze the data.

Context:

The Ficus company is one of the leaders in the Russian market for landscaping corporate and public spaces. For a year now, we have been implementing AI solutions throughout the business: from support services for potential customers to the development of personal assistants for employees.

In the fall of 2023, the company underwent significant organizational changes in one of its key divisions. The head of the department left, and the team took advantage of this moment to eliminate bottlenecks in work processes. Up to this point, there was a certain ambiguity and inconsistency in the functioning of the department that required correction. The CEO suggested extracting and analyzing data from corporate information systems to identify the actual management and interaction patterns within the team, which differed from the officially approved business processes.

Solution:

The company actively uses the Kaiten management system, where a large amount of information has accumulated over several years, including project discussions and comments. This data formed the basis of our work.

Extracting and linking data

To be honest, we were in an advantageous position for several reasons:

a) The company consistently applies end-to-end project management methodologies and Kanban. End-to-end management involves participants from almost all departments at different stages and allows for systematic accumulation of information. Until 2022, Ficus worked in Trello, and then switched to Kaiten.

b) Kaiten provides an API, so it is easy to retrieve any stored information. If the management system is well organized (and in this company it is), a lot of useful data sits in one place, which significantly speeds up the mining process.

The first step was to extract two JSON files:

– JSON with all information on projects for 2023;
– JSON with all comments for the same period.

import os
import json
from datetime import datetime, timezone

import requests


def get_kaiten_cards(token, **params):
    # One page of cards from the Kaiten API; the endpoint URL is a placeholder.
    url = "URL"
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}'
    }

    response = requests.get(url, headers=headers, params=params)

    if response.status_code == 200:
        return response.json()
    else:
        response.raise_for_status()


def get_all_kaiten_cards_for_period(token, start_date, end_date):
    offset = 0
    limit = 100
    all_cards = []

    # Page through the API until an empty or short page signals the end.
    while True:
        response_cards = get_kaiten_cards(token, limit=limit, offset=offset,
                                          created_after=start_date,
                                          created_before=end_date)
        if not response_cards:
            break

        all_cards.extend(response_cards)
        if len(response_cards) < limit:
            break

        offset += limit

    return all_cards


def save_to_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


def load_from_json(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)


def get_card_comments(token, card_id):
    url = f'.../api/latest/cards/{card_id}/comments'
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}'
    }

    response = requests.get(url, headers=headers)

    try:
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as e:
        print(f"Error fetching comments for card {card_id}: {e}")
        return []


if __name__ == "__main__":
    token = 'TOKEN'
    start_date = "2023-01-01T00:00:00Z"
    end_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    cards_filename = "kaiten_cards.json"
    comments_filename = "kaiten_comments.json"

    # Reuse previously downloaded cards if present; otherwise fetch them.
    if os.path.exists(cards_filename):
        cards = load_from_json(cards_filename)
    else:
        cards = get_all_kaiten_cards_for_period(token, start_date, end_date)
        save_to_json(cards, cards_filename)

    all_comments = []
    for card in cards:
        card_id = card.get("id")
        if card_id:
            all_comments.extend(get_card_comments(token, card_id))

    save_to_json(all_comments, comments_filename)

Next, we linked all the comments left in the system to the projects they belonged to and to the employees who wrote them.

Linking information: comments + projects. Using the same logic, we later created a CSV file with comments from each employee.
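
To illustrate the linking step, here is a minimal sketch reusing the helpers defined above; the field names (title, author, full_name, text) are assumptions based on typical Kaiten payloads, so adjust them to the actual schema:

import csv

cards = load_from_json("kaiten_cards.json")

# Field names below are assumed; check them against the real API response.
with open("comments_linked.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["card_id", "project", "author", "text"])
    for card in cards:
        for c in get_card_comments(token, card["id"]):
            writer.writerow([
                card["id"],
                card.get("title", ""),
                c.get("author", {}).get("full_name", ""),
                c.get("text", ""),
            ])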

Data visualization

Already at this stage, before we had even started analyzing the messages, it was interesting to visualize the linked data in Neo4j. The result is a graph like this:
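
For those who want to reproduce this, a minimal loading sketch using the official neo4j Python driver; the connection details, node labels, and relationship type are our assumptions, not part of any Kaiten convention:

import csv
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# MERGE keeps nodes unique; the relationship counter tracks comment volume.
LOAD_QUERY = """
MERGE (e:Employee {name: $author})
MERGE (p:Project {title: $project})
MERGE (e)-[r:COMMENTED_ON]->(p)
ON CREATE SET r.count = 1
ON MATCH SET r.count = r.count + 1
"""

with driver.session() as session:
    with open("comments_linked.csv", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            session.run(LOAD_QUERY, author=row["author"], project=row["project"])

driver.close()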

The Neo4j desktop database limited our visualization options: we could not size graph vertices by the number of comments an employee left, which would have clearly shown each participant's communication load. As an alternative, we also tried Bloom, Neo4j's data visualization tool, but it did not present the data clearly enough either, although the cluster sizes already hinted at which employees left the most comments.

Since we were not satisfied with this visualization format, we chose Plotly, a flexible Python library that fit our task perfectly.

This time we used data about comment authors and employee mentions in comments to build a new graph: the more mentions, the larger the vertex. To balance the graph, we used degree centrality, i.e. the number of edges incident to a vertex, normalized by the maximum possible (n - 1 for a graph with n vertices). This centrality parameter helps evaluate the importance of each vertex (employee) based on its position in the structure. As a result, we got the following visualization:
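
A condensed sketch of this sizing approach with networkx and Plotly; the mention pairs below are hypothetical stand-ins for the pairs we extracted from the comments CSV:

import networkx as nx
import plotly.graph_objects as go

# Hypothetical (author, mentioned) pairs for illustration only.
mentions = [("Anna", "Boris"), ("Anna", "Vera"), ("Boris", "Vera"),
            ("Gleb", "Anna"), ("Dina", "Anna"), ("Gleb", "Boris")]

G = nx.Graph()
G.add_edges_from(mentions)

centrality = nx.degree_centrality(G)  # share of possible connections per employee
pos = nx.spring_layout(G, seed=42)    # force-directed layout

# Edge coordinates: None separates individual line segments.
edge_x, edge_y = [], []
for a, b in G.edges():
    edge_x += [pos[a][0], pos[b][0], None]
    edge_y += [pos[a][1], pos[b][1], None]

fig = go.Figure([
    go.Scatter(x=edge_x, y=edge_y, mode="lines",
               line=dict(width=0.5, color="#888"), hoverinfo="none"),
    go.Scatter(x=[pos[n][0] for n in G.nodes()],
               y=[pos[n][1] for n in G.nodes()],
               mode="markers+text", text=list(G.nodes()),
               textposition="top center",
               marker=dict(size=[10 + 60 * centrality[n] for n in G.nodes()])),
])
fig.update_layout(showlegend=False)
fig.show()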

This graph can already be used for analysis: it identifies the key participants in business processes. The intensity of communication with certain employees (reflected in the diameter and position of their vertices) shows that tasks either stall without their participation or move forward only thanks to it. The next step is to analyze the messages themselves carefully and formulate further tasks based on them, which is done in the next stage.

In addition, there is the concept of "information load". If key employees (those on whom core business processes depend) are overloaded with messages, they have less time for their main work. Our task is to minimize this load; exactly how is a subject of internal discussion, and we will not touch on it in this article.

Mining business processes using LLM

Now let's move on to the next step. We need to determine which business processes actually operate in the company, what expectations colleagues have of each other when interacting, what problems are solved along the way, and what components each process includes. To do this, we used a generative model from the GPT family (gpt-3.5-turbo) to analyze the messages in the information system. The result is a table like this:
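
To show the mechanics of a single extraction call, here is a minimal sketch assuming the current openai Python client; the prompt wording and output fields are illustrative, not the exact prompt we used:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You are analyzing an internal work comment. Return JSON with the fields: "
    "process (the business process discussed), expectation (what the author "
    "expects from colleagues), problem (the problem being solved)."
)

def extract_process(comment_text):
    # Temperature 0 keeps the extraction as deterministic as possible.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": comment_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content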

After processing all messages with the LLM, we cleaned the data by removing empty rows and duplicate values. We then identified the authors of the messages and everyone mentioned in them, and again built a graph in which the processes extracted from the messages were linked to the vertices (employees).
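
A sketch of the cleaning and edge-building step with pandas; the column names are assumptions matching the table described above:

import pandas as pd

# Assumed output of the LLM step; column names are illustrative.
df = pd.read_csv("llm_results.csv")

# Drop rows where the model extracted nothing, plus exact duplicates.
df = df.dropna(subset=["process"]).drop_duplicates()

# One edge per (employee, process) pair: both the author and anyone
# mentioned in a message are linked to the extracted process.
edges = pd.concat([
    df[["author", "process"]].rename(columns={"author": "employee"}),
    df[["mentioned", "process"]].rename(columns={"mentioned": "employee"}),
]).dropna().drop_duplicates()

print(edges.head())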

As a result, we obtained the required "as is" social graph of business processes. Going further, instead of process mentions we can plot expectations about the work performed, problems being solved, process components, and other data mined with the LLM. For example, like this:

Thus, based on the collected information and visualizations, we can make management decisions grounded in the real state of affairs in the company.

Conclusion

This is not the limit of what the data can do. We could also track the emergence of business processes over time, measure delays (how long a particular process takes, based on the time between a comment and the response to it), and much more. But that is a subject for another article.

Thank you for reading to the end.
