This project explores SQLite capacity and features for managing hierarchical systems of categories. While the classical relational model does not mesh well with hierarchical data, there are several approaches to marrying these concepts.
The data model explored in this project combines Materialized Paths (MPs) and Parent References patterns. The MPs pattern stores redundant structural information and the structure modification operations usually affect entire subtrees, as opposed to localized changes in less redundant models. At the same time, it often facilitates data retrieval operations, which should be prioritized based on the anticipated usage pattern (consider adding this idea as a design rule).
The project proposes a set of design rules for modeling category systems. This rule set, in my opinion, reflects the intuitive behavior of such a system, and anticipated use patterns and functionality provide the rationale for specific choices behind individual rules.
The project also explores the possibility of developing an object-oriented (to some extent) SQL/SQLite source code library based on proposed rules. Such a library includes parameterized queries for standard MPs operations (such as new path creation and delete/move/copy operations similar to those provided by file systems). Structure modification operations rely on foreign key cascades where possible and employ multi-statement transactions where the business logic is more complicated. The SQL code uses a CTEs-based modular design and defines a unified JSON-based input format for both single- and multi-statement queries encapsulating specific implementation details. All redundant information (associated with the MPs pattern) is added via generated columns or constrained via appropriate foreign key relations to ensure data consistency (for further details, see the following section).
Category systems facilitate information management by associating each category with a specific characteristic/feature that distinguishes its members from nonmembers. The design of these category systems shall meet the following specification:
- Categories form a tree/forest structure.
- Each category has at most one parent.
- A category is identified by its path (vis-a-vis an absolute normalized file system path).
- The category path encodes a simple classification feature associated with that category.
- The category path is unique (no namesake siblings).
- Category names are case-preserving and case-insensitive.
- Any categorized item may be assigned multiple categories.
- No category-item name collision.
It is instructive to compare and contrast systems described by these rules with file systems (FS), probably the most familiar hierarchical structures for the organization of file system objects (FSO), particularly files. Like directories organize files, categories facilitate the organization of categorized items. A large portion of the above rules governs naming conventions, some of which are similar to those of FSs, while others may appear counterintuitive. Fundamental differences between the two systems justify the latter choices. For once, categories and categorized items belong to separate object spaces (database tables), whereas directories and files share the same namespace. Also, categories act more like selection filters than surrogate containers. Therefore, category assignment is akin to tagging rather than placing items in folder-like bins/containers (no item relocation, physical or virtual). With these differences in mind, let us discuss the rationale behind individual rules.
Rule #1
Classification schemes often employ tree-based FS-like structures. Typically, the hierarchy establishes general-to-specific vis-a-vis root-to-leaf mapping. The forest extension with multiple top-level/root nodes is beneficial for combining multiple classification modalities. While a “super” root node may convert a forest into a tree, the additional meaningless hierarchy level does not provide apparent benefits.
Rule #2
This rule is already implied by Rule #1 and is more restrictive than modern FSs, which support multiparent structures. While the multiparent feature has certain benefits, it enables loops and potentially unbound paths, complicating basic FS operations. In the case of category systems, I cannot think of a sound justification for having these complications.
Rules #3
Each category is assigned a name (vis-a-vis FSO name) and prefix (vis-a-vis absolute normalized FS path of the parent directory). Widespread classification systems illustrate this FS-like organization with several common variations in path encoding schemes. Natural classifications without any particular order, such as biological taxonomy, employ textual names for categories. Other systems use numerical names combined with a dictionary assigning semantic meaning to individual codes or a combination of a hierarchical numeric code and descriptive name. Conceptually, all these schemes are equivalent. Using descriptive names as human-facing identifiers is probably the most natural approach for a generic classification system. Additionally, category records may include numeric or alphanumeric identifiers for internal use.
Rules #4
This rule is essentially common sense. Many broadly used classification systems freely available on the Internet vividly illustrate it.
Rule #5
Each category is associated with a distinct classification feature (encoded in the category’s path, see Rule #4) used to identify a subset of objects sharing the same feature. Allowing sibling namesakes among categories is, therefore, meaningless and degrades, if not breaks, the utility of the classification system (cf. with FS paths, which identify FSOs). Because of this rule, the path column constitutes a natural primary key and is a suitable reference for foreign key relationships. Also, category subtree COPY/MOVE operations may cause name conflicts, and the processing script should partially follow FS logic (see general notes above and following sections for further discussion). Only a single node remains for each conflicting pair, retaining item assignments from both nodes.
Rule #6
There are FS with both case-sensitive and case-insensitive FSO names. For example, Windows FSO names are case-preserving and case-insensitive. IMHO, human-facing object identifiers should always be case-insensitive and, if possible, case-preserving, hence this rule.
Rules #7
The ability to assign multiple categories to items is a de-facto standard. This feature does not create loop-related complications for two reasons. On the one hand, categories are always parents to categorized items. More importantly, categories and items belong to separate object spaces (database tables). For these reasons, a category path and the item name combination is meaningless (no FS path analogy here).
Rules #8
Technically, category and item names do not share the same namespace; semantically, both are item attributes (this is very different from FS counterparts). Therefore, category and item names cannot collide.
The similar item-item interaction is beyond the scope of category system rules. Generally, categories act as tags and should not affect the item-item interaction. Typically, items should not define any common namespace. While, occasionally, having multiple identically named items within the same category may be suboptimal, they should not cause any interference in the functioning of the category system as an information management tool.
References
Trees and RDBMS
- Joe Celko’s Trees and Hierarchies in SQL for Smarties, 2nd edition
- SQL Design Patterns: The Expert Guide to SQL Programming
- Nested Sets and Materialized Path SQL Trees
- Storing trees in RDBMS
- Django-Treebeard tree library for Django Web Framework
- PostgreSQL tree module ltree
- Trees in MongoDB