Data Structure of the CHCD
Introduction
The CHCD is a graph database that focuses on geographic and relational connections. It uses the open-source, industry-leading graph database platform Neo4j. This section covers graph database basics and the general design of the database itself.
Graph Basics
Graph databases mimic the natural relationships that exist in the real world; their structures often look like what you might draw on a white board when trying to describe how things are related. It is helpful to keep this basic framework in mind when devising spreadsheets for data collection.
The example image and four definitions below offer a basic understanding of the graph database approach:
Terminology
- Nodes: These are the primary entities of a database. (e.g. Matteo Ricci, Xu Guangqi)
- Relationships: Also called “edges.” These are the relationships between entities. Relationships in graph databases tend to be directional (e.g. Clavius taught Ricci, Ricci baptized Xu).
- Labels: These are the primary markers for a node or a relationship. They can also be used to help communicate a flexible structure to the database. In Neo4j, a node can carry multiple labels if needed, while a relationship has exactly one type (e.g. Ricci was a priest like Clavius, Xu was a lay person).
- Properties: This is additional information that pertains to a specific node or relationship. This information is not relational and does not change (e.g. Clavius taught Ricci at the Roman College).
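The four terms above can be sketched in a few lines of Python. This is only an illustration of the vocabulary, not the CHCD's actual storage format: the class names, the labels Priest and LayPerson, the relationship types TAUGHT and BAPTIZED, and the property key "place" are all assumptions taken from the examples in the definitions rather than from the CHCD schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A primary entity, such as a person."""
    name: str
    labels: set = field(default_factory=set)       # a node may carry several labels
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed edge running from `start` to `end`."""
    start: Node
    rel_type: str                                   # a relationship has one type
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical labels, used only to mirror the examples in the definitions.
clavius = Node("Clavius", labels={"Person", "Priest"})
ricci = Node("Matteo Ricci", labels={"Person", "Priest"})
xu = Node("Xu Guangqi", labels={"Person", "LayPerson"})

edges = [
    # The property records where the teaching happened; it is not itself a node.
    Relationship(clavius, "TAUGHT", ricci, {"place": "Roman College"}),
    Relationship(ricci, "BAPTIZED", xu),
]


def outgoing(node, all_edges):
    """Return the relationships that start at `node` (direction matters)."""
    return [edge for edge in all_edges if edge.start is node]
```

Because relationships are directional, `outgoing(ricci, edges)` finds the baptism of Xu but not the teaching by Clavius, which points toward Ricci rather than away from him.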
Graph Schema
The CHCD has six main kinds of nodes (i.e. six node labels) in the database: :Person, :CorporateEntity, :Institution, :Event, :Publication, and :GeneralArea. In addition, there are five kinds of geographic nodes, which represent the five different levels of geography in the database: :Village, :Township, :County, :Prefecture, and :Province.
These six main nodes and five geographic nodes are connected by eight kinds of relationship (i.e. eight edge labels) in the database: :PART_OF, :RELATED_TO, :CONNECTED_TO, :PRESENT_AT, :INVOLVED_WITH, :LOCATED_IN, :LINKED_TO, and :INSIDE_OF.
The schema below depicts the overall structure of the database by showing which relationships are possible between the various types of nodes.
Node Descriptions
- :Person: these nodes represent human beings. People are at the core of the database and thus they have the most kinds of relationships possible.
- :CorporateEntity: these nodes represent organizations that do not have a direct geographic footprint. For example, the Society of Jesus is an organization, but it only exists in space through people and institutions.
- :Institution: these nodes represent organizations that do have a direct geographic footprint. Common examples in the database include churches, hospitals, and schools.
- :Event: these nodes represent important events that took place in Chinese Christianity. Events are, by definition, temporary happenings that have specific geographic locations. Examples in the database range from Christian conferences to imperial hunting parties.
- :Publication: these nodes represent a publication associated with Chinese Christianity. Publications are categorized as either a book, a series, an issue, or ephemera. Examples include memoirs of missionaries, a journal published for a local Chinese audience, catechetical books translated from English into Chinese, or religious posters.
- :GeneralArea: these nodes represent general locations in China. The database’s controlled geography system does not allow people to have a direct link with a geographic node. Therefore, the General Area nodes are used to indicate a person’s location in China when there is no corresponding institution they were affiliated with.
Relationship Descriptions
- :PART_OF: used to connect :Institution, :Person, and :Event nodes to :CorporateEntity nodes. Enables the ability to capture administrative hierarchies.
- :PRESENT_AT: used to connect :Person nodes to :Institution, :GeneralArea, and :Event nodes. These relationships are the only way individuals receive geographic location in the database.
- :LOCATED_IN: used to connect :Institution, :GeneralArea, and :Event nodes to geography nodes.
- :RELATED_TO: used to connect :Person nodes to each other. These can capture any sort of interpersonal relationship.
- :LINKED_TO: used to connect :Institution nodes and :Event nodes. This can capture any sort of relationship between institutions and/or events.
- :CONNECTED_TO: used to connect :CorporateEntity nodes to each other. This can capture any sort of relationship between corporate entities.
- :INVOLVED_WITH: used to connect :Institution, :Person, :CorporateEntity, :GeneralArea, and :Event nodes to :Publication nodes. Also used to connect :Publication nodes to one another.
- :INSIDE_OF: used to connect geography nodes to one another. This allows the database to reflect administrative hierarchy and fuzzy geographic data.
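The relationship rules above can be read as a lookup table of permitted start and end labels. The following Python sketch encodes them that way; it is an illustration, not the CHCD's own validation code. The names `ALLOWED` and `is_allowed` are hypothetical, the edge directions are inferred from the phrasing "used to connect X to Y", and allowing :INSIDE_OF between any two geography levels is an assumption, since the descriptions do not say which levels may nest.

```python
MAIN_LABELS = {
    "Person", "CorporateEntity", "Institution",
    "Event", "Publication", "GeneralArea",
}
GEO_LABELS = {"Village", "Township", "County", "Prefecture", "Province"}

PLACED = ("Institution", "GeneralArea", "Event")  # labels that may touch geography

# Relationship type -> set of permitted (start label, end label) pairs.
# Directions are an assumption based on the wording of the descriptions.
ALLOWED = {
    "PART_OF": {(s, "CorporateEntity") for s in ("Institution", "Person", "Event")},
    "PRESENT_AT": {("Person", e) for e in PLACED},
    "LOCATED_IN": {(s, g) for s in PLACED for g in GEO_LABELS},
    "RELATED_TO": {("Person", "Person")},
    "LINKED_TO": {
        (s, e)
        for s in ("Institution", "Event")
        for e in ("Institution", "Event")
    },
    "CONNECTED_TO": {("CorporateEntity", "CorporateEntity")},
    "INVOLVED_WITH": {
        (s, "Publication")
        for s in ("Institution", "Person", "CorporateEntity",
                  "GeneralArea", "Event", "Publication")
    },
    # Assumption: any geography level may sit inside any other.
    "INSIDE_OF": {(a, b) for a in GEO_LABELS for b in GEO_LABELS},
}


def is_allowed(start_label, rel_type, end_label):
    """Return True if the schema table permits this kind of edge."""
    return (start_label, end_label) in ALLOWED.get(rel_type, set())
```

For example, `is_allowed("Person", "LOCATED_IN", "County")` is false, which reflects the rule that people receive a location only through :PRESENT_AT.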
Implications of Design
The CHCD design schema is a careful balance of flexibility and rigidity. This combination allows a wide range of historical data to be collected without altering the core structure of the database. While limiting in some respects, these design choices provide the following benefits.
Bridges the Catholic-Protestant Divide
While Protestant and Catholic organizational structures are quite different, the flexible database structure of the CHCD allows them to be recorded and analyzed together, something rarely done in the study of Christianity in China.
Allows for Multiple Forms of Belonging
Christian people moved between institutions, institutions changed locations, and corporate entities often split. The CHCD design makes it easy to track these changes while limiting redundancy.
Controls Complex and Fuzzy Geographies
Geography is regulated using two principles. First, the only nodes that have geographic coordinates attached to them are geography nodes (i.e. Village, Township, County, Prefecture, Province). Second, the only nodes which can relate to geography nodes are Institution, GeneralArea, and Event nodes. These principles, in turn, accomplish three main goals: 1) historical locations with varying levels of geographic specificity can be recorded, 2) redundancy and errors are reduced, and 3) changes in an institution’s location over time can be easily tracked. For more information, see the documentation on Geography.
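The two principles can be illustrated with a toy graph in Python. This is a sketch under stated assumptions, not CHCD data: every node name and coordinate below is a placeholder, and the helper names `targets`, `locate`, and `ancestors` are hypothetical. Coordinates live only on geography nodes, and a person reaches them only by way of :PRESENT_AT followed by :LOCATED_IN.

```python
# Edges as (start, relationship type, end) triples; all names are placeholders.
EDGES = [
    ("person_a", "PRESENT_AT", "church_x"),
    ("church_x", "LOCATED_IN", "county_1"),
    ("county_1", "INSIDE_OF", "prefecture_1"),
    ("prefecture_1", "INSIDE_OF", "province_1"),
    # A person known only to have been somewhere in a province.
    ("person_b", "PRESENT_AT", "area_y"),
    ("area_y", "LOCATED_IN", "province_1"),
]

# Only geography nodes carry coordinates (placeholder values).
COORDS = {
    "county_1": (1.0, 1.0),
    "prefecture_1": (2.0, 2.0),
    "province_1": (3.0, 3.0),
}


def targets(node, rel_type):
    """Return the end nodes of edges of `rel_type` that start at `node`."""
    return [end for start, kind, end in EDGES if start == node and kind == rel_type]


def locate(person):
    """Return (place, geography node, coordinates) for each place a person was."""
    found = []
    for place in targets(person, "PRESENT_AT"):
        for geo in targets(place, "LOCATED_IN"):
            found.append((place, geo, COORDS[geo]))
    return found


def ancestors(geo):
    """Follow INSIDE_OF upward and return the enclosing geography nodes."""
    chain = []
    parents = targets(geo, "INSIDE_OF")
    while parents:
        chain.append(parents[0])
        parents = targets(parents[0], "INSIDE_OF")
    return chain
```

Here `person_a` resolves to a county while `person_b` resolves only to a province, which shows how records with different levels of geographic specificity can sit in the same structure.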
Easy and Understandable Query
When users download the database and use its native Neo4j environment, they can use the easy-to-learn Cypher query language. Complex data schemas can be hard to write queries for, which makes it difficult to derive meaningful results. By putting a whole host of historical data into a single, simple schema, the CHCD lets researchers analyze disparate sources more easily.
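To give a sense of how the simple schema translates into a query, the Python sketch below assembles a Cypher `MATCH` pattern as a string. It is an illustration only: the function name is hypothetical, the edge directions are inferred from the relationship descriptions, and the query has not been run against the CHCD. It returns whole nodes because this section does not list property names.

```python
GEO_LABELS = {"Village", "Township", "County", "Prefecture", "Province"}


def match_person_locations(geo_label):
    """Build a Cypher query for people, their places, and a geography level.

    The pattern follows Person -> PRESENT_AT -> place -> LOCATED_IN -> geography,
    with directions assumed from the relationship descriptions.
    """
    if geo_label not in GEO_LABELS:
        raise ValueError(f"unknown geography label: {geo_label}")
    return (
        "MATCH (p:Person)-[:PRESENT_AT]->(place)-[:LOCATED_IN]->"
        f"(g:{geo_label}) "
        "RETURN p, place, g"
    )


# Print the query text so it can be pasted into a Neo4j session.
print(match_person_locations("County"))
```

The whole traversal fits in one readable pattern because every person reaches geography through the same two relationship types.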
Easy to Grow
The six primary node types are not the only kinds of historical entities or information that could be placed in a database. While initial data collection will focus on these entities, the graph database structure can grow to include different kinds of information. This future growth will further enrich any data already recorded. The addition of two new node types (:GeneralArea and :Publication) in version 2 of the CHCD shows the flexibility that this design affords.