Implementation of Graph Database using Neo4j on Historical Dataset

Mayanka
Nov 3, 2017
Updated: Dec 12, 2017

Japan’s attack on Pearl Harbor in 1941 drew the United States into World War II and set off a wave of shock and fear across the country. It prompted the U.S. government to round up Japanese-Americans and send them to internment camps. Between 110,000 and 120,000 Japanese-Americans, 70 percent of them born in the United States, were forced to leave their homes on the West Coast and were incarcerated in makeshift camps in desolate areas until after the end of World War II.

Problem Statement

The National Archives and Records Administration (NARA) is the repository of our data. The dataset contains paper records of internal security cases and the associated paper index cards for the 10 Relocation Centers. These records have not been released to the public because some of them are subject to access restrictions. The National Archives wants to use the metadata extracted from the index cards to identify cards that contain information about internees under 18 years of age, which should not be released to the public.

For this project, we are restricting our study to one of the objectives:

“Explore and discover the untold stories hidden in the forest of index cards and analyze the social network among the people who were sent to these internment camps.”

Objectives

Study the Japanese-American WW2 incarceration camps
Study the index cards and capture their contents in MS Excel
Model a graph database in Neo4j using CSV files
Visualize the data to detect any patterns or correlations in it and understand the data’s significance

Dataset

The index cards contain the following data:

Name of the Person
The case report ID
The relevant Page number in the report
The subject of the case report (offenses such as a Riot)
The Japanese-American or Japanese internee name
The residence ID in the camp (9999-D)
Remarks section

Below is an example of one of these:

The index cards have different styles. This is most likely because they indexed a record series consisting of files created and used in different offices and at different internment camps. After completing the initial analysis of about 500 index cards, we created an Excel sheet containing the structured text output. This is saved as a CSV file for input into the graph database model.

Data Visualization

We created two Dashboards for our dataset. The graphs and dashboards can be viewed here: https://us-east-1.online.tableau.com/#/site/mayanka/views

Why Graph Database

The index cards correspond to the 120,000 Japanese-Americans who were sent to these camps during World War II, which makes for a huge set of data. With conventional technologies such as a relational database or an Excel sheet, it would be impractical to visualize the social network among these people and store the metadata of each individual at the same time.

Hence, we use the Neo4j platform, a graph database that physically stores data together with its relationships. The nodes in the graph can be people, organizations, or events (essentially the entities we extracted in GATE and stored in the database). The edges can represent family connections or the co-appearance of people on an incident card. Both nodes and relationships can be further tagged with attribute-value pairs. After “stitching” nodes together with a number of computed relationships, a social network can be built. This allows for a variety of analyses, including clustering, finding the shortest path between two nodes, calculating various measures of centrality and closeness, and recognizing hidden relationships in the network. The results of this type of network analysis may have strong social impacts, and when we are ready, we hope to engage with experts and survivors who can help guide the process in a meaningful and ethical way, taking into account the underlying sensitivities and navigating the inherent collection biases and propaganda.
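To make the co-appearance idea concrete, here is a minimal Python sketch of the kind of network described above. It is not the Neo4j model used in this project. The card IDs and person names are made-up placeholders, and the function names are our own for illustration. It links people who are named on the same card, then computes a shortest path and a simple degree centrality.

```python
from collections import defaultdict, deque
from itertools import combinations

# Placeholder cards (not real records): card id -> people named on that card.
CARDS = {
    "card-1": ["Person A", "Person B"],
    "card-2": ["Person B", "Person C"],
    "card-3": ["Person C", "Person D"],
}


def build_coappearance_graph(cards):
    """Connect every pair of people who are named on the same card."""
    graph = defaultdict(set)
    for people in cards.values():
        for a, b in combinations(sorted(set(people)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour in seen:
                continue
            if neighbour == goal:
                return path + [neighbour]
            seen.add(neighbour)
            queue.append(path + [neighbour])
    return None


def degree_centrality(graph):
    """Share of the other people each person is directly connected to."""
    others = len(graph) - 1
    if others < 1:
        return {person: 0.0 for person in graph}
    return {person: len(nbrs) / others for person, nbrs in graph.items()}


graph = build_coappearance_graph(CARDS)
path = shortest_path(graph, "Person A", "Person D")
centrality = degree_centrality(graph)
```

In this toy network, "Person A" reaches "Person D" only through the two people in between, and those two have the highest degree centrality. A graph database runs the same kinds of questions over the stored relationships.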

Implementation using Neo4j

The two most important components of the project were to extract as much information as we could from these index cards and to store it in our database for better and faster queries and analytics. Since the index cards do not have one specific format, we had a lot of missing data. If we tried to create one complex graph including all attributes at once, we would lose a lot of important information in the process of omitting null values. Hence, we operationalized our problem statement as the following questions:

How many different events occurred, and who participated in them? Did any one person participate in more than one event?
Which residence address did each Japanese-American belong to, and who were their family?
Which case file corresponds to each person?

Cypher Query

Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ASCII-art syntax. It allows us to state what we want to select, insert, update or delete from our graph data without requiring us to describe exactly how to do it. Cypher is Neo4j’s query language and is strongly based on patterns, which is helpful in encoding complex ideas in the form of nodes and relationships.

Cypher is based on the Property Graph Model, which in addition to the standard graph elements of nodes and edges adds labels and properties as concepts. Nodes may have zero or more labels, while each relationship has exactly one relationship type.

Cypher contains a variety of clauses. Among the most common are MATCH and WHERE. These clauses work slightly differently than in SQL. MATCH describes the structure of the pattern being searched for, primarily based on relationships. WHERE adds further constraints to patterns.

We ran the following query to load the dataset from our CSV file and create the different nodes and relationships:

LOAD CSV WITH HEADERS FROM "file:///Users/mayankajha/Documents/INFM_600/Final_Project/Case_Name.csv" AS row

MERGE (c:Case_Report {Key: row.Case_ID})

MERGE (p:People {Key: row.Names})

MERGE (p)-[r:CaseFile_No]->(c)

RETURN c,p
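The query relies on MERGE, which matches an existing node or relationship with the given label and key and creates it only if none exists. A rough Python analogy of that match-or-create behaviour is sketched below. It is an illustration, not how Neo4j works internally. The rows are placeholders, and merge_rows is a hypothetical helper that reuses the labels and column names from the query above.

```python
def merge_rows(rows):
    """Mimic MERGE: each node and relationship is stored only once."""
    nodes = set()          # (label, key) pairs; a set keeps each node unique
    relationships = set()  # (person key, relationship type, case key)
    for row in rows:
        case = ("Case_Report", row["Case_ID"])
        person = ("People", row["Names"])
        nodes.update([case, person])
        relationships.add((person[1], "CaseFile_No", case[1]))
    return nodes, relationships


# Placeholder rows (not real records), shaped like the CSV header above.
ROWS = [
    {"Case_ID": "case-1", "Names": "Person A"},
    {"Case_ID": "case-2", "Names": "Person A"},
    {"Case_ID": "case-2", "Names": "Person B"},
]

nodes, relationships = merge_rows(ROWS)
```

Although "Person A" and "case-2" each appear in two rows, each ends up as a single node. This is why a person who appears in several case files shows up as one node with several relationships in the graph.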

We split our CSV file into three separate CSV files, each containing one crucial piece of information and its relationship to the person:

The name of the person and events along with dates: This graph shows the names of persons associated with major events. The largest number of people is associated with the riot on 11-4-43. We also see names of persons who are common across several events on different dates.

The name of the person and residence: This graph models the relationships between persons and residences. We can use it to understand the relationships between people living in the same locality or camp and map them to the common events that they were involved in.

The name of the person and their case file number: This graph shows the people and their associated case files. We see the number of cases that a person was involved in. It can also help us understand the role of a person in a particular case, whether he/she was a participant, leader or target.
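The split into three files can be sketched in Python as follows. This is a sketch under assumptions, not the exact procedure we followed. Names and Case_ID are the column names from the LOAD CSV query, while Event and Residence are assumed column names, and the rows are placeholders. The point is that a row is skipped only for the pair in which it has a blank value, so it still contributes to the other two files.

```python
import csv
import io

# Placeholder CSV text (not real records). Names and Case_ID come from the
# LOAD CSV query; Event and Residence are assumed column names.
SAMPLE = """Names,Case_ID,Event,Residence
Person A,case-1,Event X,
Person B,,Event X,9999-D
Person A,case-1,,9999-D
"""


def extract_pairs(rows, left, right):
    """Return unique (left, right) pairs, skipping rows where either is blank."""
    pairs = []
    for row in rows:
        a = (row.get(left) or "").strip()
        b = (row.get(right) or "").strip()
        if a and b and (a, b) not in pairs:
            pairs.append((a, b))
    return pairs


def to_csv(pairs, left, right):
    """Serialize the pairs as CSV text with a two-column header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([left, right])
    writer.writerows(pairs)
    return out.getvalue()


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
name_case = extract_pairs(rows, "Names", "Case_ID")
name_event = extract_pairs(rows, "Names", "Event")
name_residence = extract_pairs(rows, "Names", "Residence")
case_csv = to_csv(name_case, "Names", "Case_ID")
```

Here "Person B" has no case ID, so that row is absent from the case file list but still appears in the event and residence lists.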

Limitations

During the course of the project, we encountered the following limitations:
Qualitative dataset: Our dataset is primarily qualitative. Only a few columns held numeric data, which limited the variety of visualizations possible.
Null values in the dataset: There were several missing values in the dataset, owing to the absence of a standard format for the index cards. Our Neo4j load could not handle rows with null values (a Cypher MERGE on a null property value fails), and deleting rows with null values was not the best solution, as it led to loss of important information.
Access to only 250 index cards, with the remaining ones confidential: In order to model a complete system and make it available on a portal for public use, we would need access to all case cards, which was not possible as most of the data is still confidential.
No standardized format of the index card: The index cards were prepared by different agents at different locations, hence they do not follow a single format for capturing the data. For example, some of them capture the residence address whereas some do not. Some of them capture ‘First Name, Last Name’ while some of them use ‘Last Name, First Name’.
Redaction of data associated with minors: Some of the participants in the events were minors, whose data is confidential and needs to be protected. This data was redacted from the index cards and not captured in our dataset.
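One way to reduce the name-format problem before loading is to normalize names in the CSV. The sketch below is an illustration only and was not part of our pipeline. normalize_name is a hypothetical helper, and it assumes that a comma always marks a last-name-first entry, which would need to be checked against the actual cards.

```python
def normalize_name(raw):
    """Return 'First Last' for input written as 'Last, First' or 'First Last'.

    Assumption: a comma means the last name was written first.
    """
    cleaned = " ".join(raw.split())  # collapse runs of whitespace
    if "," in cleaned:
        last, first = (part.strip() for part in cleaned.split(",", 1))
        if first and last:
            return f"{first} {last}"
        return first or last  # one side was empty; keep whatever is left
    return cleaned


# Placeholder names (not real records).
same_person = normalize_name("Lastname,  Firstname") == normalize_name("Firstname Lastname")
```

Without a step like this, the same person written in two styles would become two separate nodes when merged on the name key.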

Future Work

In conclusion, this dataset is a good model for understanding the social relationships between individuals in these Japanese-American camps. However, to understand the bigger picture, we would need the entire dataset, which was inaccessible at this point. Once we have a more robust model, archivists can use it to publish the data publicly. We also need to find ways to include null values and prevent loss of information. Most importantly, we need to understand how we can interpolate the missing values by looking at social connections.

The collection of records can then be used to build a prototype research tool that 1) integrates and simplifies access to the dispersed resources and 2) provides unprecedented access to the biographical-historical contexts of the people documented in the resources, including the social-professional-intellectual networks within which they lived.

Mayanka Jha
