We need to gather more evidence if we're going to figure out whose alibi doesn't hold up. Your next task is to visit each suspect and collect their statements. Only one team member needs to visit each suspect, so feel free to divvy up the work. Please go to their offices during the office hours given below and ask them for their statement:
Professor Bui: 2:00 PM - 3:00 PM, Every Day - 326D Cushing
Professor Dingler: 2:00 PM - 3:30 PM, Monday, Tuesday, Wednesday, Thursday - 350 Fitz
Professor Rehberg: 10:30 AM - 12:00 PM, Tuesday and Thursday - 324 Cushing
Professor Levis: 10:30 AM - 11:15 AM, Tuesday and Thursday - 264 Geddes
Professor Chambers: Monday 1:00 PM - 3:00 PM and Thursday 2:00 PM - 3:00 PM - 180 Fitz
Professor Kumar: 3:25 PM - 4:15 PM, Monday and Wednesday - 378 Fitz
Once you've collected the statements, add them to the suspect data json files. You might consider using an LLM like ChatGPT-4o to turn a picture of a statement into the JSON you need.
The starting data is available at Homework 02 Data
After getting in contact with the Theisen-Floyd Estate, the police have collected data for past costume parties the estate has thrown. It turns out there have been a lot of them. They've gathered data on the movement and activity patterns of the aliases. Load those into a dictionary called "alias_data" with the format:
{
    "alias_name": [
        {"suspect_statement": [...]}
    ]
}
Then you should add the suspect statements to the suspect data json files.
Thinking back to class, you realize that since you have all of this alias data, you could structure this as a supervised learning problem, specifically a classification task! We could compare our suspect data against the alias data, treating the alias data as the ground truth, and see whether any of the suspects line up with the aliases!
Load the suspect data into a dictionary called "suspect_data" with the format:
{
    "suspect_name": [
        {"suspect_statement": [...], "gps_data": [...], "witness_statements": {...}}
    ]
}
# Import necessary libraries
import os
import numpy as np
import json
from typing import List, Dict, Any
from collections import defaultdict
# TODO: Load alias training files
alias_data = {}

# TODO: Load suspect data
suspect_data = {}
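If you're not sure where to start, here's a minimal sketch of the loading step. It assumes the alias files and suspect files each sit in their own directory with one JSON file per person, named after that alias or suspect; the directory names below are hypothetical, so adjust the paths to match how the Homework 02 Data is actually organized:

def load_json_dir(directory: str) -> Dict[str, Any]:
    """Load every .json file in `directory` into a dict keyed by the filename (without extension)."""
    loaded = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".json"):
            name = os.path.splitext(filename)[0]
            with open(os.path.join(directory, filename), "r") as f:
                loaded[name] = json.load(f)
    return loaded

# Hypothetical directory names; point these at the actual data folders.
alias_data = load_json_dir("alias_data")
suspect_data = load_json_dir("suspect_data")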
Looking through the data you have, you realize there are a lot of repeated, overlapping activities, and this reminds you of categorical data. You wonder if you could somehow set up a decision tree with the classes being the aliases and then try to match the suspects' statements against them!
Try training a decision tree classifier to match suspects to the aliases. You'll have to do some feature engineering to get the categorical data we have into a format that can be used by the decision tree.
We need to somehow turn all of the data we were given into a format that can be used by the decision tree. We can probably calculate how much of their time each suspect spends in each room, and maybe we can also make use of activities that suggest a weapon!
# TODO: Some of the activities seem related to the weapons found at the party, I wonder if we can use that to improve our decision tree.
# Note: My solution has three activities per weapon for a total of 18 pairings
activity_weapon_mapping = {
    'Ripping wippets': 'Gas',  # Freebie as an example
}
# TODO: Write a function to convert the categorical data we have into numerical data for the decision tree
def extract_features(data, is_alias=True):
    rooms = ['Study', 'Dining Room', 'Kitchen', 'Pantry', 'Living Room', 'Bathroom']
    weapons = ['Poison', 'Knife', 'Gas', 'Rope', 'Bag', 'Firearm']

    features = {}
    room_time = {room: 0 for room in rooms}
    activities = []
    total_time = 0

    # TODO: Count the amount of times the suspect or alias visits each room and a list of activities they do
    # TODO: Calculate the percentage of time the suspect or alias spends in each room
    # TODO: Create features indicating if an activity they've done suggests a weapon

    return features
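For reference, the function body might be filled in roughly as below. This sketch assumes each statement entry has 'room' and 'activity' fields, that each entry counts as one unit of time, and that the statement list lives under data['suspect_statement'] for both aliases and suspects; none of those field names are guaranteed, so check them against the actual JSON:

def extract_features_sketch(data, is_alias=True):
    # Same room and weapon lists as the skeleton above; is_alias is kept only for parity with the skeleton.
    rooms = ['Study', 'Dining Room', 'Kitchen', 'Pantry', 'Living Room', 'Bathroom']
    weapons = ['Poison', 'Knife', 'Gas', 'Rope', 'Bag', 'Firearm']

    features = {}
    room_time = {room: 0 for room in rooms}
    activities = []
    total_time = 0

    # Tally room visits and collect activities (assumed field names: 'room', 'activity')
    for entry in data.get('suspect_statement', []):
        if entry.get('room') in room_time:
            room_time[entry['room']] += 1
        activities.append(entry.get('activity'))
        total_time += 1

    # Fraction of time spent in each room
    for room in rooms:
        features[f'time_in_{room}'] = room_time[room] / total_time if total_time else 0.0

    # One binary feature per weapon: did any of their activities suggest it?
    suggested = {activity_weapon_mapping.get(activity) for activity in activities}
    for weapon in weapons:
        features[f'suggests_{weapon}'] = 1 if weapon in suggested else 0

    return features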
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
# TODO: Prepare alias training data using the feature extraction function
alias_features_dict = {}

# TODO: Prepare suspect data using the feature extraction function
suspect_features_dict = {}
# TODO: Train Decision Tree Classifier
# I used the arguments criterion='entropy', random_state=42, max_depth=4, min_samples_split=4
# Feel free to play around with different values for the arguments
# TODO: Classify Suspects Probabilistically as each Alias
# TODO: Print your results and store them for use in the final task
decision_tree_results = {}
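Here's a sketch of the training and classification step, assuming alias_features_dict maps each alias to a list of feature dicts (one per past party) and suspect_features_dict maps each suspect to a single feature dict; clf.classes_ gives the alias ordering of the probability columns. The exact feature layout depends on your extract_features implementation:

# Build a training frame: one row per (alias, party) pair
rows, labels = [], []
for alias_name, feature_list in alias_features_dict.items():
    for feats in feature_list:
        rows.append(feats)
        labels.append(alias_name)

X_train = pd.DataFrame(rows).fillna(0)
y_train = labels

clf = DecisionTreeClassifier(criterion='entropy', random_state=42,
                             max_depth=4, min_samples_split=4)
clf.fit(X_train, y_train)

# Probabilistic classification of each suspect against the aliases
decision_tree_results = {}
for suspect_name, feats in suspect_features_dict.items():
    X_suspect = pd.DataFrame([feats]).reindex(columns=X_train.columns, fill_value=0)
    decision_tree_results[suspect_name] = clf.predict_proba(X_suspect)[0]

print(clf.classes_)
for suspect_name, probs in decision_tree_results.items():
    print(f"{suspect_name}: {[f'{p:.4f}' for p in probs]}")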
Reference Output:
Probability of Each Suspect Matching Each Alias:
['Colonel Mustard' 'Miss Scarlet' 'Mr. Green' 'Mrs. Peacock' 'Mrs. White'
'Professor Plum']
Professor Bui: ['0.0000', '0.0000', '0.8636', '0.0000', '0.0455', '0.0909']
Professor Dingler: ['0.0000', '0.0000', '0.0000', '0.0000', '0.0000', '1.0000']
Professor Rehberg: ['0.5000', '0.0000', '0.0000', '0.5000', '0.0000', '0.0000']
Professor Levis: ['0.0000', '0.2000', '0.0000', '0.0000', '0.8000', '0.0000']
Professor Chambers: ['0.0000', '0.9412', '0.0000', '0.0000', '0.0000', '0.0588']
Professor Kumar: ['0.0000', '0.0000', '0.0000', '1.0000', '0.0000', '0.0000']
Nice! We've got a decision tree that can classify the suspects into the aliases! However, let's try to get some more evidence before we make a final decision. The decision tree makes use of the time spent in rooms, but it doesn't consider the paths the suspects take between the rooms. Suddenly, inspiration strikes! You recall learning about Hidden Markov Models in class and think that maybe you could use them to get some more evidence!
import numpy as np
from typing import List, Dict, Tuple
from collections import defaultdict
# TODO: Create the blueprint dictionary structure
# TODO: Initial state distribution (uniform)
# TODO: Create the emission probability matrix using objective evidence (GPS and witnesses)
def weight_room_matrix(time: str, gps_data: List[Dict], witness_statements: Dict[str, Dict]) -> np.ndarray:
    """
    Create emission probability matrix using objective evidence (GPS and witnesses).
    """
    # Start with uniform distribution
    emission = np.ones(n_states) / n_states

    # TODO: 1. Weight based on GPS evidence
    # TODO: 2. Weight based on witness evidence
    # TODO: Normalize Probabilities

    return emission
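Here's a rough sketch of how the weighting could work. It assumes a global blueprint dict shaped like the toy example below (the adjacency shown is not the real floor plan), that each GPS reading and each witness statement has 'time' and 'room' fields, and that the boost factors of 5 and 3 are arbitrary choices you should tune; all of those are assumptions to adapt to the real data:

# Hypothetical blueprint shape; the real one should describe the estate's actual rooms and adjacencies,
# with each room listed as adjacent to itself so a suspect can stay put between timestamps.
blueprint = {
    "rooms": {
        "Study":       {"adjacent_to": ["Study", "Living Room"]},
        "Living Room": {"adjacent_to": ["Living Room", "Study", "Kitchen"]},
        "Kitchen":     {"adjacent_to": ["Kitchen", "Living Room"]},
    }
}

def weight_room_matrix_sketch(time, gps_data, witness_statements):
    room_names = list(blueprint["rooms"].keys())
    emission = np.ones(len(room_names)) / len(room_names)

    # 1. GPS evidence: boost the room the GPS places the suspect in at this timestamp
    for reading in gps_data:
        if reading.get('time') == time and reading.get('room') in room_names:
            emission[room_names.index(reading['room'])] *= 5.0

    # 2. Witness evidence: boost rooms where a witness reports seeing the suspect at this timestamp
    for witness, sighting in witness_statements.items():
        if sighting.get('time') == time and sighting.get('room') in room_names:
            emission[room_names.index(sighting['room'])] *= 3.0

    # Normalize so the emission probabilities sum to 1
    return emission / emission.sum()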
# TODO: Use Viterbi algorithm to find most likely sequence of true locations based on GPS and witness data
def viterbi(suspect_data: Dict, timestamps: List[str]) -> Tuple[List[str], float]:
    """
    Use Viterbi algorithm to find most likely sequence of true locations.
    Also returns the probability of the sequence.
    """
    total_timesteps = len(timestamps)
    num_rooms = len(blueprint["rooms"])

    # TODO: Initialize matrices
    # TODO: Create emission matrices for all timestamps

    # Initialize first timestep
    V[0] = np.log(initial) + np.log(emissions[0])

    # TODO: Finish the Viterbi forward pass
    for time_step in range(1, total_timesteps):
        for _, room_info in blueprint["rooms"].items():
            # Calculate probabilities of coming from each previous state
            probs = []
            for neighbor_room in room_info["adjacent_to"]:
                # TODO: Calculate probability of coming from this neighbor
                # TODO: Add to probabilities
            # TODO: Find most likely previous state

    # TODO: Backtrack
    # TODO: Reverse path and convert to room names
    # TODO: Calculate sequence probability

    return room_path, sequence_prob
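And here's a compact end-to-end sketch of the Viterbi pass, under the same assumptions as the emission sketch above: the blueprint lists every room as adjacent to itself, the initial distribution is uniform, emissions come from weight_room_matrix_sketch, and transitions out of a room are uniform over its neighbors. None of these modeling choices are mandated, so feel free to change them:

def viterbi_sketch(suspect_evidence, timestamps):
    room_names = list(blueprint["rooms"].keys())
    n_rooms = len(room_names)
    total_timesteps = len(timestamps)

    # One emission row per timestamp (columns follow the room_names ordering)
    emissions = np.array([
        weight_room_matrix_sketch(t, suspect_evidence['gps_data'],
                                  suspect_evidence['witness_statements'])
        for t in timestamps
    ])

    initial = np.ones(n_rooms) / n_rooms
    V = np.full((total_timesteps, n_rooms), -np.inf)   # best log-probability of ending in each room
    backpointer = np.zeros((total_timesteps, n_rooms), dtype=int)

    V[0] = np.log(initial) + np.log(emissions[0])

    for time_step in range(1, total_timesteps):
        for j, room in enumerate(room_names):
            # Only rooms adjacent to `room` can transition into it (uniform transition probabilities)
            best_prev, best_score = 0, -np.inf
            for neighbor_room in blueprint["rooms"][room]["adjacent_to"]:
                i = room_names.index(neighbor_room)
                trans_prob = 1.0 / len(blueprint["rooms"][neighbor_room]["adjacent_to"])
                score = V[time_step - 1, i] + np.log(trans_prob)
                if score > best_score:
                    best_prev, best_score = i, score
            V[time_step, j] = best_score + np.log(emissions[time_step, j])
            backpointer[time_step, j] = best_prev

    # Backtrack from the most likely final room
    path = [int(np.argmax(V[-1]))]
    for time_step in range(total_timesteps - 1, 0, -1):
        path.append(backpointer[time_step, path[-1]])
    path.reverse()

    room_path = [room_names[i] for i in path]
    sequence_prob = float(np.exp(np.max(V[-1])))
    return room_path, sequence_prob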
# TODO: Analyze each suspect's path to determine how well it matches the most likely path from the HMM
movement_results = {}

print("\nHMM Path Analysis:")
print("=" * 50)

for suspect_name, suspect in suspect_data.items():
    # Get all timestamps in chronological order
    timestamps = []
    for entry in suspect['suspect_statement']:
        timestamps.append(entry['time'])
    timestamps.sort()

    # Run Viterbi to get most likely true path using GPS data and witness statements
    true_path, path_prob = viterbi(
        {
            'gps_data': suspect['gps_data'],
            'witness_statements': suspect['witness_statements']
        },
        timestamps
    )

    # TODO: Get suspect's stated path
    # TODO: Calculate matches between true and stated paths

    print(f"\n{suspect_name}:")
    print(f'\tAgreement of Statement Path with HMM Path: {agreement}, Number of Contradictions: {len(true_path) - matches}')

    path_agreement = {
        'path_probability': path_prob,
        'statement_agreement': agreement,
        'num_contradictions': len(true_path) - matches
    }
    movement_results[suspect_name] = path_agreement
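The two remaining TODOs in that loop can be handled with a small helper like the one below. It assumes each statement entry has a 'room' field giving the room the suspect claims to have been in (a hypothetical field name), and defines agreement as the fraction of timesteps where the stated room matches the HMM's inferred room:

def path_agreement_stats(stated_path, true_path):
    """Return (agreement fraction, number of matching timesteps) between two room sequences."""
    matches = sum(1 for stated, inferred in zip(stated_path, true_path) if stated == inferred)
    agreement = matches / len(true_path) if true_path else 0.0
    return agreement, matches

# Inside the loop above (assuming a 'room' field in each statement entry):
# stated_path = [entry['room'] for entry in sorted(suspect['suspect_statement'], key=lambda e: e['time'])]
# agreement, matches = path_agreement_stats(stated_path, true_path)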
Reference Results:
HMM Path Analysis:
==================================================
Professor Bui:
    Agreement of Statement Path with HMM Path: 0.0, Number of Contradictions: 20
Professor Dingler:
    Agreement of Statement Path with HMM Path: 0.6, Number of Contradictions: 8
Professor Rehberg:
    Agreement of Statement Path with HMM Path: 0.05, Number of Contradictions: 19
Professor Levis:
    Agreement of Statement Path with HMM Path: 0.1, Number of Contradictions: 18
Professor Chambers:
    Agreement of Statement Path with HMM Path: 0.4, Number of Contradictions: 12
Professor Kumar:
    Agreement of Statement Path with HMM Path: 0.1, Number of Contradictions: 18
Nice, now we have two pieces of evidence that we can use to match the suspects to the aliases! If only there were some way to combine them. As you finish your drink you think back to class and remember that you can use Bayesian reasoning to combine the evidence!
Perform a final Bayesian analysis to combine the evidence from the decision tree and the Hidden Markov Model to get the likelihood of each suspect matching each alias. Hopefully this will help us figure out who was who!
def combine_evidence_bayesian(aliases, decision_tree_results, hmm_results) -> Dict[str, Dict]:
    # TODO: Combine all evidence, assuming that whoever is Mr. Green is lying!
    # TODO: P(Alias|Evidence) ∝ P(Alias|Activities) * P(Alias|Truthfulness)
    final_results = {}

    # We can assume that whoever is Mr. Green is lying
    # So low statement agreement with the highest-likelihood path → higher probability of being Mr. Green
    for suspect_name in decision_tree_results.keys():
        # TODO: Get decision tree probabilities
        dt_probs = decision_tree_results[suspect_name]

        # TODO: Get path agreement from the HMM results
        path_agreement = hmm_results[suspect_name]['statement_agreement']

        # TODO: Calculate posterior probabilities
        posterior = {}

        # TODO: Normalize
        final_results[suspect_name] = {alias: prob / total for alias, prob in posterior.items()}

    return final_results
# Combine evidence using Bayesian reasoning
final_results = combine_evidence_bayesian(aliases, decision_tree_results, movement_results)

for suspect, results in final_results.items():
    print(f"{suspect}:")
    for alias, prob in results.items():
        print(f"  {alias}: {prob:.3f}")
Reference Results:
Professor Bui:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.000
  Mr. Green: 1.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.000
  Professor Plum: 0.000
Professor Dingler:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.000
  Mr. Green: 0.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.000
  Professor Plum: 1.000
Professor Rehberg:
  Colonel Mustard: 0.500
  Miss Scarlet: 0.000
  Mr. Green: 0.000
  Mrs. Peacock: 0.500
  Mrs. White: 0.000
  Professor Plum: 0.000
Professor Levis:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.200
  Mr. Green: 0.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.800
  Professor Plum: 0.000
Professor Chambers:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.941
  Mr. Green: 0.000
  Mrs. Peacock: 0.000
  Mrs. White: 0.000
  Professor Plum: 0.059
Professor Kumar:
  Colonel Mustard: 0.000
  Miss Scarlet: 0.000
  Mr. Green: 0.000
  Mrs. Peacock: 1.000
  Mrs. White: 0.000
  Professor Plum: 0.000
We did it! It seems extremely likely that Professor Bui is Mr. Green! You quickly call up Detective Caulfield and explain your findings. He laughs at you. It seems that a purely statistical analysis isn't going to be enough for the authorities to get a warrant. You pour yourself another drink and start brainstorming other ways to find more conclusive evidence.