Homework 02: Decision Trees, Hidden Markov Models, and Bayesian Reasoning

  1. Collect suspect statements from the 6 suspects
  2. Load in the data the police had about each suspect and combine it with the suspect statements
  3. Use a Decision Tree to try to match each suspect to an alias based on their alleged activities and time spent in each room
  4. Use an HMM to see how similar the suspects' alleged paths are to the witness statements and GPS data
  5. Combine the results of the HMM and Decision Tree using Bayesian reasoning to get a final alias matching

Task 0: Collecting Suspect Statements

We need to gather more evidence if we're going to figure out whose alibi doesn't hold up. Your next task is to visit each suspect and collect their statements. Only one team member needs to visit each suspect, so feel free to divvy up the work. Please go to their offices during the office hours given below and ask them for their statement:

Professor Bui: 2:00 PM - 3:00 PM, Every Day - 326D Cushing

Professor Dingler: 2:00 PM - 3:30 PM, Monday, Tuesday, Wednesday, Thursday - 350 Fitz

Professor Rehberg: 10:30 AM - 12:00 PM, Tuesday and Thursday - 324 Cushing

Professor Levis: 10:30 AM - 11:15 AM, Tuesday and Thursday - 264 Geddes

Professor Chambers: Monday 1:00 PM - 3:00 PM and Thursday 2:00 PM - 3:00 PM - 180 Fitz

Professor Kumar: 3:25 PM - 4:15 PM, Monday and Wednesday - 378 Fitz

Once you've collected them, add them to the suspect data JSON files. You might consider using an LLM like ChatGPT-4o to turn a picture of the statement into the JSON you need.

Task 1: Load and create the necessary data

The starting data is available at Homework 02 Data

After getting in contact with the Theisen-Floyd Estate, the police have collected data from past costume parties the estate has thrown. It turns out there have been a lot of them. They've gathered data on the movement and activity patterns of the aliases. Load those records into a dictionary called "alias_data" with the format:

{
    "alias_name": [
        {"suspect_statement": [...]}
    ]
}

Then add the suspect statements you collected to the suspect data JSON files.

Thinking back to class, you realize that with all of this alias data you could structure this as a supervised learning problem, specifically a classification task! We could compare our suspect data against the alias data, treating the alias data as ground truth, and see if any of the suspects line up with the aliases!

Load the suspect data into a dictionary called "suspect_data" with the format:

{
    "suspect_name": {
        "suspect_statement": [...],
        "gps_data": [...],
        "witness_statements": {...}
    }
}
# Import necessary libraries
import os
import numpy as np
import json

from typing import List, Dict, Any
from collections import defaultdict

# TODO: Load alias training files
alias_data = {}

# TODO: Load suspect data
suspect_data = {}
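
As a minimal sketch of the loading step, assuming one JSON file per alias or suspect stored in directories named alias_data/ and suspects/ (these paths and the one-file-per-person layout are assumptions; adjust them to match your download):

# Sketch only: the directory names and file layout are assumptions.
def load_json_dir(directory: str) -> Dict[str, Any]:
    """Load every .json file in a directory, keyed by the filename stem."""
    data = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".json"):
            name = os.path.splitext(filename)[0]
            with open(os.path.join(directory, filename)) as f:
                data[name] = json.load(f)
    return data

# alias_data = load_json_dir("alias_data")
# suspect_data = load_json_dir("suspects")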

Looking through the data you have, you realize there are a lot of repeated, overlapping activities, and this reminds you of categorical data. You wonder if you could somehow set up a decision tree with the classes being the aliases and then try to match the suspects' statements against them!

Task 2: Decision Tree

Try training a decision tree classifier to match suspects to the aliases. You'll have to do some feature engineering to get the categorical data we have into a format that can be used by the decision tree.

Feature Extraction Function

We need to somehow turn all of the data we were given into a format that can be used by the decision tree. We can probably calculate how much of their time each suspect spends in each room, and maybe we can also make use of activities that suggest a weapon!

# TODO: Some of the activities seem related to the weapons found at the party, I wonder if we can use that to improve our decision tree.
# Note: My solution has three activities per weapon for a total of 18 pairings

activity_weapon_mapping = {
    'Ripping wippets': 'Gas', # Freebie as an example
}

# TODO: Write a function to convert the categorical data we have into numerical data for the decision tree
def extract_features(data, is_alias=True):
    rooms = ['Study', 'Dining Room', 'Kitchen', 'Pantry', 'Living Room', 'Bathroom']
    weapons = ['Poison', 'Knife', 'Gas', 'Rope', 'Bag', 'Firearm']

    features = {}

    room_time = {room: 0 for room in rooms}
    activities = []

    total_time = 0

    # TODO: Count the number of times the suspect or alias visits each room and build a list of the activities they do

    # TODO: Calculate the percentage of time the suspect or alias spends in each room 

    # TODO: Create features indicating if an activity they've done suggests a weapon

    return features
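
For orientation, here is one possible shape for that body, written as a separate sketch. It assumes each statement entry is a dict with 'room' and 'activity' keys; those key names are assumptions, so check them against your JSON files:

# Sketch only: assumes entries look like {"time": ..., "room": ..., "activity": ...}
def extract_features_sketch(statement_entries):
    rooms = ['Study', 'Dining Room', 'Kitchen', 'Pantry', 'Living Room', 'Bathroom']
    weapons = ['Poison', 'Knife', 'Gas', 'Rope', 'Bag', 'Firearm']

    room_time = {room: 0 for room in rooms}
    activities = []
    for entry in statement_entries:
        room_time[entry['room']] += 1      # each entry counts as one time step
        activities.append(entry['activity'])

    total_time = max(sum(room_time.values()), 1)  # avoid division by zero

    features = {}
    for room in rooms:
        # Fraction of time spent in each room
        features[f'pct_{room}'] = room_time[room] / total_time
    for weapon in weapons:
        # Binary flag: did any activity suggest this weapon?
        features[f'weapon_{weapon}'] = int(any(
            activity_weapon_mapping.get(act) == weapon for act in activities))
    return features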
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

# TODO: Prepare alias training data using the feature extraction function
alias_features_dict = {}

# TODO: Prepare suspect data using the feature extraction function
suspect_features_dict = {}

# TODO: Train Decision Tree Classifier
# I used the arguments criterion='entropy', random_state=42, max_depth=4, min_samples_split=4
# Feel free to play around with different values for the arguments

# TODO: Classify Suspects Probabilistically as each Alias

# TODO: Print your results and store them for use in the final task
decision_tree_results = {}
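
A sketch of what training and scoring might look like, assuming alias_features_dict maps each alias to a list of feature dicts and suspect_features_dict maps each suspect to a single feature dict (those shapes are assumptions, not requirements):

# Sketch only: the dict shapes described above are assumptions.
X_rows, y_labels = [], []
for alias, feature_list in alias_features_dict.items():
    for feats in feature_list:
        X_rows.append(feats)
        y_labels.append(alias)

X_train = pd.DataFrame(X_rows)
clf = DecisionTreeClassifier(criterion='entropy', random_state=42,
                             max_depth=4, min_samples_split=4)
clf.fit(X_train, y_labels)

for suspect, feats in suspect_features_dict.items():
    X_suspect = pd.DataFrame([feats], columns=X_train.columns)
    # predict_proba returns the class fractions in the leaf the suspect lands in
    probs = clf.predict_proba(X_suspect)[0]
    decision_tree_results[suspect] = dict(zip(clf.classes_, probs))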

Reference Output:

Probability of Each Suspect Matching Each Alias:

            ['Colonel Mustard' 'Miss Scarlet' 'Mr. Green' 'Mrs. Peacock' 'Mrs. White' 'Professor Plum']
Professor Bui: ['0.0000', '0.0000', '0.8636', '0.0000', '0.0455', '0.0909']
Professor Dingler: ['0.0000', '0.0000', '0.0000', '0.0000', '0.0000', '1.0000']
Professor Rehberg: ['0.5000', '0.0000', '0.0000', '0.5000', '0.0000', '0.0000']
Professor Levis: ['0.0000', '0.2000', '0.0000', '0.0000', '0.8000', '0.0000']
Professor Chambers: ['0.0000', '0.9412', '0.0000', '0.0000', '0.0000', '0.0588']
Professor Kumar: ['0.0000', '0.0000', '0.0000', '1.0000', '0.0000', '0.0000']

Nice! We've got a decision tree that can classify the suspects into the aliases! However, let's try to get some more evidence before we make a final decision. The decision tree makes use of the time spent in rooms, but it doesn't consider the paths the suspects take between the rooms. Suddenly inspiration strikes! You recall learning about Hidden Markov Models in class and think that maybe you could use them to get some more evidence!

Task 3: Hidden Markov Model

The police have acquired a blueprint of the estate and, based on visitor data, the likelihood that a person moves between rooms.

[Figure: hand-drawn blueprint of the estate]

Note: I am not an artist, don't laugh!

The transition probabilities are given below:

Using these, create a "blueprint" JSON structure containing:

"locations": {
    "ROOM NAME": {"id": "ROOM ID", "adjacent_to": ["NEIGHBOR 1", "NEIGHBOR 2"]}
    },
"transition_probabilities": {
    "ROOM ID": {
        "NEIGHBOR 1 ID": 0.4,
        "NEIGHBOR 2 ID": 0.4,
        "ROOM ID": 0.2
    }
}
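
Concretely, in Python the first couple of entries might look like the sketch below. The room IDs and every probability shown are placeholders for illustration; use the rooms from the blueprint and the actual transition values:

# Placeholder IDs and probabilities only; substitute the real transition values.
blueprint = {
    "locations": {
        "Kitchen": {"id": "K", "adjacent_to": ["Pantry", "Dining Room"]},
        "Pantry": {"id": "P", "adjacent_to": ["Kitchen"]},
        # ... remaining rooms ...
    },
    "transition_probabilities": {
        "K": {"P": 0.4, "D": 0.4, "K": 0.2},  # "D" would be the Dining Room's id
        "P": {"K": 0.8, "P": 0.2},
        # ... remaining rooms ...
    },
}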

Using a Hidden Markov Model, find the most likely path for a suspect given the GPS data and witness statements. Then compare the suspect's alleged path, taken from the statement you collected, to the most likely path found by the HMM. How similar is the suspect's stated path to the most likely path according to the data?
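
If the recursion feels abstract, here is a self-contained toy run of Viterbi on a two-room chain. Every number in it is invented for illustration (it is not the estate's transition matrix), but the skeleton below follows the same log-space pattern:

import numpy as np

# Toy example: two rooms, three time steps, every probability invented.
states = ["Kitchen", "Pantry"]
trans = np.array([[0.7, 0.3],   # P(next room | currently Kitchen)
                  [0.4, 0.6]])  # P(next room | currently Pantry)
emit = np.array([[0.9, 0.1],    # P(observation 0 or 1 | Kitchen)
                 [0.2, 0.8]])   # P(observation 0 or 1 | Pantry)
obs = [0, 1, 1]                 # the evidence seen at each time step
initial = np.array([0.5, 0.5])

T, S = len(obs), len(states)
V = np.zeros((T, S))            # best log-probability of any path ending in each state
back = np.zeros((T, S), dtype=int)

V[0] = np.log(initial) + np.log(emit[:, obs[0]])
for t in range(1, T):
    for s in range(S):
        scores = V[t - 1] + np.log(trans[:, s])  # coming from each previous state
        back[t, s] = np.argmax(scores)
        V[t, s] = scores[back[t, s]] + np.log(emit[s, obs[t]])

# Backtrack from the best final state
path = [int(np.argmax(V[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print([states[s] for s in path])  # prints ['Kitchen', 'Pantry', 'Pantry']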

import numpy as np
from typing import List, Dict, Tuple
from collections import defaultdict

# TODO: Create the blueprint dictionary structure

# TODO: Initial state distribution (uniform)

# TODO: Create the emission probability matrix using objective evidence (GPS and witnesses)
def weight_room_matrix(time: str, gps_data: List[Dict], witness_statements: Dict[str, Dict]) -> np.ndarray:
    """
    Create emission probability matrix using objective evidence (GPS and witnesses).
    """

    n_states = len(blueprint["locations"])  # one state per room

    # Start with a uniform distribution over rooms
    emission = np.ones(n_states) / n_states
    
    # TODO: 1. Weight based on GPS evidence
    
    # TODO: 2. Weight based on witness evidence
    
    # TODO: Normalize Probabilities

    return emission
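
# --- Comment-only sketch of one possible weighting scheme (not the required one). ---
# It assumes gps_data entries look like {"time": ..., "room": ...}, that you can
# resolve witness_statements to a room for a given time, and that room_index maps
# room names to state indices. All of those names, shapes, and the boost factors
# below are invented for illustration; check them against your actual JSON.
#
#     emission = np.ones(n_states) / n_states
#     for reading in gps_data:
#         if reading["time"] == time:
#             emission[room_index[reading["room"]]] *= 5.0  # boost the GPS-supported room
#     for room_seen in rooms_witnessed_at(time, witness_statements):  # hypothetical helper
#         emission[room_index[room_seen]] *= 3.0            # boost each witnessed room
#     emission /= emission.sum()                            # renormalize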

# TODO: Use Viterbi algorithm to find most likely sequence of true locations based on GPS and witness data
def viterbi(suspect_data: Dict, timestamps: List[str]) -> Tuple[List[str], float]:
    """
    Use Viterbi algorithm to find most likely sequence of true locations.
    Also returns the probability of the sequence.
    """
    total_timesteps = len(timestamps)
    num_rooms = len(blueprint["locations"])
    
    # TODO: Initialize matrices
    
    
    # TODO: Create emission matrices for all timestamps
    
    # Initialize first timestep
    V[0] = np.log(initial) + np.log(emissions[0])
    
    # TODO: Finish the Viterbi forward pass
    for time_step in range(1, total_timesteps):
        for _, room_info in blueprint["locations"].items():

            # Calculate probabilities of coming from each previous state
            probs = []
            for neighbor_room in room_info["adjacent_to"]:
                # TODO: Calculate probability of coming from this neighbor
                # TODO: Add to probabilities
                pass  # placeholder so the skeleton parses; replace with your code

            # TODO: Find most likely previous state

    # TODO: Backtrack
    
    # TODO: Reverse path and convert to room names
    
    # TODO: Calculate sequence probability
    
    return room_path, sequence_prob
    

# TODO: Analyze each suspect's path to determine how well it matches the most likely path from the HMM

movement_results = {}

print("\nHMM Path Analysis:")
print("=" * 50)

for suspect_name, suspect in suspect_data.items():
    # Get all timestamps in chronological order
    timestamps = []
    for entry in suspect['suspect_statement']:
        timestamps.append(entry['time'])
    timestamps.sort()
    
    # Run Viterbi to get most likely true path using GPS data and witness statements
    true_path, path_prob = viterbi(
        {
            'gps_data': suspect['gps_data'],
            'witness_statements': suspect['witness_statements']
        }, 
        timestamps
    )
    
    # TODO: Get suspect's stated path
    
    # TODO: Calculate matches between true and stated paths
    
    print(f"\n{suspect_name}:")
    print(f'\tAgreement of Statement Path with HMM Path: {agreement}, Number of Contradictions: {len(true_path) - matches}')

    path_agreement = {
        'path_probability': path_prob,
        'statement_agreement': agreement,
        'num_contradictions': len(true_path) - matches
    }
    
    movement_results[suspect_name] = path_agreement
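
For the two comparison TODOs inside the loop, one straightforward option is sketched below, using the loop's variable names. It assumes each statement entry has a 'room' key, which is an assumption about the JSON format:

# Sketch only: the 'room' key name is an assumption about the statement format.
stated_path = [entry['room'] for entry in
               sorted(suspect['suspect_statement'], key=lambda e: e['time'])]
matches = sum(1 for stated, likely in zip(stated_path, true_path)
              if stated == likely)
agreement = matches / len(true_path) if true_path else 0.0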

Reference Results:

HMM Path Analysis:
==================================================

Professor Bui:
	Agreement of Statement Path with HMM Path: 0.0, Number of Contradictions: 20

Professor Dingler:
	Agreement of Statement Path with HMM Path: 0.6, Number of Contradictions: 8

Professor Rehberg:
	Agreement of Statement Path with HMM Path: 0.05, Number of Contradictions: 19

Professor Levis:
	Agreement of Statement Path with HMM Path: 0.1, Number of Contradictions: 18

Professor Chambers:
	Agreement of Statement Path with HMM Path: 0.4, Number of Contradictions: 12

Professor Kumar:
	Agreement of Statement Path with HMM Path: 0.1, Number of Contradictions: 18

Nice, now we have two pieces of evidence that we can use to match the suspects to the aliases! If only there were some way to combine them. As you finish your drink, you think back to class and remember that you can use Bayesian reasoning to combine the evidence!

Task 4: Final Bayesian Analysis

Perform a final Bayesian analysis to combine the evidence from the decision tree and the Hidden Markov Model to get the likelihood of each suspect matching each alias. Hopefully this will help us figure out who was who!

def combine_evidence_bayesian(aliases, decision_tree_results, hmm_results) -> Dict[str, Dict]:
    # TODO: Combine all evidence, assuming that whoever is Mr. Green is lying!
    # TODO: P(Alias|Evidence) ∝ P(Alias|Activities) * P(Alias|Truthfulness)

    final_results = {}
    
    # We can assume that whoever is Mr. Green is lying,
    # so a low statement agreement with the most likely path → a higher probability of being Mr. Green

    for suspect_name in decision_tree_results.keys():
        # TODO: Get decision tree probabilities
        dt_probs = decision_tree_results[suspect_name]
        
        # TODO: Get path agreement from the HMM results 
        path_agreement = hmm_results[suspect_name]['statement_agreement']
        
        # TODO: Calculate posterior probabilities
        posterior = {}
        
        # TODO: Normalize (compute total as the sum of the posterior values)
        total = sum(posterior.values())
        
        final_results[suspect_name] = { alias: prob/total for alias, prob in posterior.items() }
        
    return final_results

# Combine evidence using Bayesian reasoning
final_results = combine_evidence_bayesian(aliases, decision_tree_results, movement_results)

for suspect, results in final_results.items():
    print(f"{suspect}:")
    for alias, prob in results.items():
        print(f"  {alias}: {prob:.3f}")

Reference Results:

Professor Bui:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.000
  Mr. Green: 1.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.000
  Professor Plum: 0.000
Professor Dingler:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.000
  Mr. Green: 0.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.000
  Professor Plum: 1.000
Professor Rehberg:
  Colonel Mustard: 0.500
  Miss Scarlet: 0.000
  Mr. Green: 0.000
  Mrs. Peacock: 0.500
  Mrs. White: 0.000
  Professor Plum: 0.000
Professor Levis:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.200
  Mr. Green: 0.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.800
  Professor Plum: 0.000
Professor Chambers:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.941
  Mr. Green: 0.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.000
  Professor Plum: 0.059
Professor Kumar:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.000
  Mr. Green: 0.000
  Mrs. Peacock: 1.000
  Mrs. White: 0.000
  Professor Plum: 0.000

Success!!

We did it! It seems extremely likely that Professor Bui is Mr. Green! You quickly call up Detective Caulfield and explain your findings. He laughs at you. It seems that a purely statistical analysis isn't going to be enough for the authorities to obtain a warrant. You pour yourself another drink and start brainstorming other ways to find more conclusive evidence.