Research Compass

How to Create a High-Quality Data Dictionary or Codebook

A data dictionary (also called a codebook) is one of the most important pieces of documentation you will ever create for your research dataset. It is essentially the “user manual” for your data — without it, even the cleanest spreadsheet or database becomes difficult, if not impossible, for others (and often your future self) to understand and reuse.

In this blog post, we explain what a data dictionary (or codebook) is and why it matters in research data management. We then walk through the essential steps and best practices for creating a robust one, as a practical guide for researchers.

15 Apr 2026

[3 min read]

What is a Data Dictionary or Codebook?

A data dictionary, also known as a codebook, is a detailed description of the data in a dataset. It includes information about each variable, such as its name, type, format, possible values, and meaning. Essentially, it acts as a "user manual" or roadmap for anyone who needs to understand, work with, or reuse the data.

What is the difference between a Data Dictionary and a Codebook?

Data Dictionary: Focuses on variables — what each column means, its type, units, valid values, missing-value codes, etc. Most common for tabular/quantitative data.
Codebook: Broader term that also includes study-level information (questionnaire text, survey flow, sampling method, data collection process). Very common in social sciences and survey research.

In practice, most researchers use the two terms interchangeably.

Why does a high-quality Data Dictionary matter?

- Many funders and journals require a Data Management Plan. A high-quality data dictionary shows reviewers how the data will be documented, which can smooth the approval process.
- It clarifies what each variable represents, so users can interpret the data correctly. Well-documented datasets are easier to reuse and therefore more likely to be cited.
- By documenting the data structure and constraints, it helps maintain data integrity and consistency, reducing errors when others (or you) work with the data.
- Analysts can identify variables and understand their context, which enables secondary analysis and meta-analysis.
- It helps make your data FAIR (Findable, Accessible, Interoperable, Reusable).

Step-by-Step: How to create a high-quality Data Dictionary

Decide the Scope and Format
- Use a simple spreadsheet-style dictionary for tabular data (Excel, CSV, SPSS, Stata, R data files, etc.)
- Use a full codebook with question text for survey / questionnaire data
- Combine both for mixed or complex data
- Recommended formats:
  - Excel / Google Sheets (most accessible)
  - Markdown table (easy to put in README files)
  - PDF (the final archival version)
  - JSON / XML (machine-readable, advanced)

Include all essential fields

Column / Section	What to put here	Example
Variable name	Exact column name in the dataset	`income_2023`
Variable label	Human-readable description (1–2 sentences)	Annual household income in HKD (2023)
Question text (if survey)	Full wording of the survey question	“What was your total household income last year?”
Data type	Numeric, String/Text, Date, Categorical, Boolean, etc.	Numeric
Units	e.g., HKD, kg, %, years, etc.	HKD
Value range / Valid values	Minimum, maximum, or list of allowed values	0 – 9999999
Value labels	For categorical variables: code → meaning	1 = Full-time employed, 2 = Part-time, 3 = Unemployed
Missing value codes	How missing data is coded	-99 = Refused, -88 = Don’t know
Measurement level	Nominal, Ordinal, Interval, Ratio, Binary	Ratio
Derivation / Calculation	If the variable was calculated from others	= income_monthly * 12
Source / Instrument	Which questionnaire, sensor, or database it came from	Wave 3 Household Survey
Notes / Caveats	Any special instructions or known issues	Some respondents refused to answer
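For the machine-readable option, the fields in the table above can be held in a small data structure and exported to JSON. The following is a minimal Python sketch, not a standard format: the class name `VariableEntry` and its attribute names are hypothetical, chosen only to mirror the table columns, and the values come from the example column.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class VariableEntry:
    """One dictionary row: one variable per entry (hypothetical structure)."""
    name: str                      # exact column name in the dataset
    label: str                     # human-readable description
    data_type: str                 # Numeric, String/Text, Date, ...
    units: str = ""
    valid_range: tuple = ()        # (minimum, maximum) for numeric variables
    value_labels: dict = field(default_factory=dict)   # code -> meaning
    missing_codes: dict = field(default_factory=dict)  # code -> reason
    measurement_level: str = ""
    derivation: str = ""
    source: str = ""
    notes: str = ""


income = VariableEntry(
    name="income_2023",
    label="Annual household income in HKD (2023)",
    data_type="Numeric",
    units="HKD",
    valid_range=(0, 9999999),
    missing_codes={-99: "Refused", -88: "Don't know"},
    measurement_level="Ratio",
    derivation="income_monthly * 12",
    source="Wave 3 Household Survey",
    notes="Some respondents refused to answer",
)

# JSON object keys are always strings, so the integer codes are stored as "-99", "-88".
dictionary_json = json.dumps([asdict(income)], indent=2, ensure_ascii=False)
print(dictionary_json)
```

Keeping one object per variable makes it straightforward to generate the spreadsheet, Markdown, or PDF versions from a single source.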

Build it systematically
- Start early: begin drafting while you are still collecting or cleaning data.
- Use a template (see “Utilize tools and templates” below).
- One row per variable: never combine multiple variables in one row.
- Be consistent: use the same style and level of detail for every variable.
- Include a cover sheet with:
  - Dataset title
  - Version number and date
  - Number of variables and observations
  - Principal Investigator / Contact person
  - License
  - Citation recommendation

Utilize tools and templates
Recommended free tools:
- R packages: codebook or dataMaid
- Python: ydata-profiling or sweetviz (auto-generate profiling reports that can seed a basic dictionary)
- Survey-specific: Qualtrics / REDCap export (convert to codebook)

Final review with a quality checklist
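Parts of a final review can be automated. The sketch below is one possible set of checks, assuming the dataset is a CSV with a header row and the dictionary is a list of rows with `name` and `label` keys; the function name `review_dictionary` and the specific checks are illustrative, drawn from the rules above (exact column names, one row per variable, a label for every variable).

```python
import csv
import io


def review_dictionary(dataset_csv: str, dictionary_rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    columns = next(csv.reader(io.StringIO(dataset_csv)))  # header row only
    documented = [row["name"] for row in dictionary_rows]

    # Every dataset column needs a dictionary entry with the exact same name.
    for col in columns:
        if col not in documented:
            problems.append(f"undocumented column: {col}")

    # Every dictionary entry should correspond to a real column.
    for name in documented:
        if name not in columns:
            problems.append(f"dictionary entry without a column: {name}")

    # One row per variable: no variable may be described twice.
    seen = set()
    for name in documented:
        if name in seen:
            problems.append(f"duplicate dictionary entry: {name}")
        seen.add(name)

    # Every variable needs a non-empty label.
    for row in dictionary_rows:
        if not row.get("label", "").strip():
            problems.append(f"missing label: {row['name']}")

    return problems


dataset = "age,gender,income_monthly\n34,2,25000\n"
entries = [
    {"name": "age", "label": "Respondent age in years"},
    {"name": "gender", "label": ""},
]
issues = review_dictionary(dataset, entries)
print(issues)
```

Checks like these cover only the mechanical part of a review; whether a label is actually understandable still needs a human reader.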

An example: Small excerpt from a real data dictionary

Variable	Label	Type	Values	Missing codes
`age`	Respondent age in years	Numeric	18–99	-99
`gender`	Gender	Categorical	1 = Male, 2 = Female, 3 = Non-binary, 4 = Prefer not to say	-88
`income_monthly`	Monthly household income (HKD)	Numeric	0–999999	-99
`education_level`	Highest education level completed	Ordinal	1 = Primary, 2 = Secondary, 3 = Bachelor, 4 = Postgraduate	-99
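A dictionary like this excerpt can also be used to validate records. The following is a minimal Python sketch under stated assumptions: the `DICTIONARY` structure and the `check_record` function are hypothetical, values are assumed to be integers already, and the ordinal `education_level` is treated as a set of allowed codes.

```python
# Rules transcribed from the excerpt above (structure is illustrative).
DICTIONARY = {
    "age": {"type": "numeric", "min": 18, "max": 99, "missing": {-99}},
    "gender": {"type": "categorical", "codes": {1, 2, 3, 4}, "missing": {-88}},
    "income_monthly": {"type": "numeric", "min": 0, "max": 999999, "missing": {-99}},
    "education_level": {"type": "categorical", "codes": {1, 2, 3, 4}, "missing": {-99}},
}


def check_record(record: dict) -> list[str]:
    """Compare one record against the dictionary and list any violations."""
    errors = []
    for name, rule in DICTIONARY.items():
        if name not in record:
            errors.append(f"{name}: variable absent from record")
            continue
        value = record[name]
        if value in rule["missing"]:
            continue  # a declared missing code is valid, not an error
        if rule["type"] == "numeric":
            if not rule["min"] <= value <= rule["max"]:
                errors.append(f"{name}: {value} outside {rule['min']}-{rule['max']}")
        elif value not in rule["codes"]:
            errors.append(f"{name}: {value} is not a valid code")
    return errors


good = check_record({"age": 34, "gender": 2, "income_monthly": 25000, "education_level": 3})
bad = check_record({"age": 17, "gender": -88, "income_monthly": -99, "education_level": 5})
print(good)
print(bad)
```

Checking the missing-value codes first matters: otherwise -99 would be reported as an out-of-range age or income.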

Conclusion

Creating a high-quality data dictionary or codebook is a crucial step in effective Research Data Management (RDM). A clear and comprehensive guide to your dataset improves data understanding, quality, and collaboration. Follow the steps and best practices outlined in this post to develop a data dictionary that serves all data users. It takes time (usually 1–3 days for a medium-sized project), but it is one of the highest-return activities in RDM: it turns a “data file” into a reusable scientific asset.

Highly Cited Researchers 2025 List Released by Clarivate

Highly Cited Researchers from Clarivate is an annual recognition of influential scientists and social scientists worldwide who have demonstrated significant and broad influence in their fields of research. The list highlights recent research accomplishments, evaluating Highly Cited Papers published in reputable journals indexed in the Science Citation Index Expanded™ and Social Sciences Citation Index™ over an 11-year timeframe.

19 Nov 2025

[2 min read]

2025 Highlights:
In the 2025 edition, 6,868 researchers from 60 countries and regions were recognized, receiving a total of 7,131 Highly Cited Researcher designations, because some individuals were recognized in more than one subject. The recognition honors researchers whose publications rank in the top 1% by citations for the 2014-2024 publication years, representing the most cited 0.1% of scholars in the sciences and social sciences.

Methodological Adjustments in 2025:
The methodology for the 2025 list changed significantly, chiefly in how highly cited papers are selected and how existing selection standards are applied. As a result, 12 ESI (Essential Science Indicators) subject categories had fewer researchers selected than last year. The changes also allowed Mathematics to return to the list after being excluded for the past two years over concerns about suspicious citation patterns.

Researchers from City University of Hong Kong (CityUHK) have excelled on the 2025 Highly Cited Researchers list, ranking ninth in Asia and second in Hong Kong. Access detailed records of CityUHK researchers on CityUHK Scholars: 2025

Identifying Top Research with Normalized Indicators

Normalized article-level metrics move beyond raw citation counts to reveal how individual papers truly perform within their specific fields, publication years and types, and against global benchmarks. From field-normalized measures like Field-Weighted Citation Impact (FWCI) and Category Normalized Citation Impact (CNCI) to elite designations such as Highly Cited Papers and Hot Papers, these tools offer a nuanced view of research influence.

25 Mar 2026

[2 min read]

These metrics are accessible via leading research analytics platforms: Scopus-powered SciVal for FWCI and percentile-based counts, and Web of Science-fed tools like InCites, Journal Citation Reports (JCR), and Essential Science Indicators (ESI) for CNCI, journal quartiles, and top paper recognitions. While methodologies differ, the core aim remains consistent: normalize for fairness so high performers stand out regardless of context.

Metrics Comparison Tables by Data Sources

Data source / Tool	Scopus / SciVal
Metric	Field-Weighted Citation Impact (FWCI)	Outputs in Top Citation Percentiles	Publications in Top Journal Percentiles	Publications in Journal Quartiles
Publication coverage	1996 -	1996 -	1996 -	1996 -
Publication types	All types	All types	All types with journal metrics	All types with journal metrics
Citations	Citations from the publication year plus following 3 years	All citations; can show as field-weighted	Journal-level citations by CiteScore, SCImago Journal Rank (SJR), or Source-Normalized Impact per Paper (SNIP)	Journal-level citations by CiteScore, SCImago Journal Rank (SJR), or Source-Normalized Impact per Paper (SNIP)
Subject field	All Science Journal Classification Codes (ASJC)	All Science Journal Classification Codes (ASJC)	ASJC for CiteScore and SJR; Not required for SNIP	ASJC for CiteScore and SJR; Not required for SNIP
Threshold	Publication-level: >1 = above world average; 1 = world average; <1 = below world average	Publication-level: top 1%, 5%, 10%, 25%	Journal-level: top 1%, 5%, 10%, 25%	Journal-level: Q1 (top 25th percentile); Q2 (26th–50th percentiles); Q3 (51st–75th percentiles); Q4 (76th–100th percentiles)
Normalized by	Publication type, publication year, and subject field	Publication year	Publication year and/or subject field	Publication year and/or subject field
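The FWCI threshold row is easiest to read as a ratio: citations received divided by the citations expected for comparable publications (same type, year, and subject field). The sketch below illustrates only that ratio and its interpretation. The function names and the expected-citation figure are hypothetical, and the details of how SciVal derives the expected value are not reproduced here.

```python
def field_weighted_impact(citations: int, expected_citations: float) -> float:
    """Ratio of actual citations to the expected count for comparable publications."""
    if expected_citations <= 0:
        raise ValueError("expected_citations must be positive")
    return citations / expected_citations


def interpret(score: float) -> str:
    """Apply the thresholds from the table: >1, =1, <1."""
    if score > 1:
        return "above world average"
    if score < 1:
        return "below world average"
    return "world average"


# Hypothetical paper: 12 citations where comparable papers average 8.
score = field_weighted_impact(12, 8.0)
print(score, interpret(score))
```

A score of 1.5 in this example would mean the paper was cited 50% more than comparable publications.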

Data source / Tool	Web of Science (WoS) / InCites / Journal Citation Reports (JCR) / Essential Science Indicators (ESI)
Metric	Category Normalized Citation Impact (CNCI)	Percentile in Subject Area	Documents in Top 1% / 10%	Documents in Q1-Q4 Journals	Highly Cited Papers	Hot Papers
Publication coverage	1980 -	1980 -	1980 -	1980 -	Past 10 years	Past 2 years
Publication types	All types	All types	All types	All types with journal metrics	Regular scientific articles and review articles from Science Citation Index Expanded (SCIE) and Social Sciences Citation Index (SSCI)	Regular scientific articles and review articles from Science Citation Index Expanded (SCIE) and Social Sciences Citation Index (SSCI)
Citations	All citations	All citations	All citations	Journal-level citations by Journal Impact Factor (JIF)	Citations from SCIE, SSCI, and Arts & Humanities Citation Index (AHCI) over the 10-year period	Citations from SCIE, SSCI, and Arts & Humanities Citation Index (AHCI) over the most recent 2-month period
Subject field	WoS Research Areas	WoS Research Areas	WoS Research Areas	WoS Research Areas	ESI Research Fields	ESI Research Fields
Threshold	Publication-level: >1 = above world average; 1 = world average; <1 = below world average	Publication-level: 99th–100th = top 1%; 90th–100th = top 10%; below 90th = others	Publication-level: Top 1%; Top 10%	Journal-level: Q1 (top 25th percentile); Q2 (26th–50th percentiles); Q3 (51st–75th percentiles); Q4 (76th–100th percentiles)	Publication-level: Top 1%	Publication-level: Top 0.1%
Normalized by	Publication type, publication year, and subject field	Publication type, publication year, and subject field	Publication type, publication year, and subject field	Publication year and/or subject field	ESI research field and publication year	ESI research field and period added to WoS
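The journal quartile bands in both tables can be expressed as a simple rule on a journal's rank within its subject category. The following is a sketch that assumes the percentile is computed as rank divided by the number of journals in the category, with rank 1 being the highest-scoring journal; the function name and the example figures are hypothetical, and the tools themselves may handle ties or boundaries differently.

```python
def quartile_from_rank(rank: int, journals_in_category: int) -> str:
    """Map a journal's rank in its category (1 = best) to Q1-Q4."""
    if not 1 <= rank <= journals_in_category:
        raise ValueError("rank must be between 1 and journals_in_category")
    percentile = 100 * rank / journals_in_category
    if percentile <= 25:
        return "Q1"
    if percentile <= 50:
        return "Q2"
    if percentile <= 75:
        return "Q3"
    return "Q4"


# Hypothetical category of 80 journals.
for rank in (12, 20, 21, 80):
    print(rank, quartile_from_rank(rank, 80))
```

Because the quartile depends on the category, the same journal can fall in different quartiles in different subject fields.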

Metric details can be viewed at the Library Guide on Impact of Articles.

Navigating the Landscape of Standardized Metadata Schemas in Academic Research

Metadata is a vital component of research data management, providing the necessary context and information to make data discoverable, usable, preservable, and citable. Several standardized metadata schemas have emerged, offering structured frameworks for describing data and ensuring consistency and interoperability across systems and disciplines.

In this blog post, we'll explore some of the prevailing standardized metadata schemas used in academic research, delve deeper into each schema, compare their features, and discuss practical use cases.

18 Mar 2026

[3 min read]

Prevailing Standardized Metadata Schemas

Dublin Core
Overview: Dublin Core is a simple yet flexible metadata schema used across various disciplines. It consists of 15 core elements, making it suitable for a wide range of data types.
- Core metadata fields: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type (The nature or genre of the content. Controlled vocabularies like the DCMI Type Vocabulary can be used), Format (The file format, physical medium, or dimensions of the resource. Controlled vocabularies such as MIME types may be applied), Identifier (A unique reference, such as a URL or DOI), Source, Language (Controlled vocabularies like ISO 639 language codes are often used), Relation, Coverage, and Rights.
- Structure: Dublin Core is designed to be simple and flexible, allowing for both minimal and detailed descriptions. It can be expressed in various syntaxes, including JSON, HTML, XML, and RDF. Its structure is very flat and human-readable.
- Best fit: General-purpose, cross-disciplinary, digital libraries, and any project that needs quick, lightweight metadata.
- Use cases: Institutional repositories, open-access journals, OAI-PMH harvesting, and basic dataset landing pages.
- Strength: Extremely simple and widely supported.
- Weakness: Too basic for complex research data (no version history, no funding info, no spatial/temporal granularity).

DataCite Metadata Schema
Overview: The DataCite Metadata Schema is specifically designed for research data, facilitating data citation and sharing.
- Core metadata fields: Identifier (typically a DOI), Creator, Title, Publisher, Publication Year, Resource Type (e.g., dataset, software, or image, drawn from a controlled vocabulary such as DataCite's Resource Type General), Version, Description, Geo-location, Rights, and Funding Reference.
- Structure: DataCite metadata is structured in XML format, allowing for detailed and machine-readable descriptions. It supports linking between datasets and related publications.
- Best fit: Any discipline that assigns DOIs to datasets.
- Use cases: Citing datasets in scientific publications and managing research data in institutional repositories.
- Strength: Excellent for citation, versioning, and discoverability. Mandatory fields ensure minimum quality.
- Weakness: Less rich for highly structured social-science surveys or geospatial data.

ISO 19115
Overview: ISO 19115 is a comprehensive schema for geospatial data, providing detailed descriptions of geographic information and services.
- Core metadata fields: File Identifier, Language (Controlled vocabularies like ISO 639 language codes are often used), Character set (the character encoding used), Hierarchy Level, Contact, Date Stamp, Spatial Representation (e.g., vector, raster), Extent (the spatial and temporal extent of the dataset), Lineage, Constraints, and Distribution Information.
- Structure: ISO 19115 is structured in XML and is highly detailed, supporting complex descriptions of geospatial data.
- Best fit: Geography, GIS, remote sensing, environmental science, oceanography.
- Use cases: Documenting geospatial datasets in environmental research and managing spatial data in government agencies and NGOs. Adopters include national mapping agencies, the INSPIRE directive (Europe), USGS, and satellite data portals.
- Strength: Extremely rich spatial and temporal metadata.
- Weakness: Heavy and complex for non-geospatial data.

MODS (Metadata Object Description Schema)
Overview: Developed by the Library of Congress, MODS is used for bibliographic data and includes elements like Title, Name, Type of Resource, and Language.
- Core metadata fields: Title, Name of Creator/Contributor, Type of Resource, Genre, Origin Information, Language (Controlled vocabularies like ISO 639 language codes are often used), Physical Description, Abstract, Subject, and Identifier.
- Structure: MODS is XML-based, allowing for rich and hierarchical descriptions of bibliographic records.
- Best fit: Digital repository metadata, interoperable resource description, detailed bibliographical records, and MARC to XML conversion.
- Use cases: Cataloging books, articles, and other bibliographic materials in libraries and managing metadata for digital collections in cultural heritage institutions.
- Strength: Richness and granularity, compatibility and interoperability, user-friendly syntax, flexible XML structure, and reduced complexity.
- Weakness: No mandatory fields, no built-in business rules, and lossy conversion (not round-trippable: converting from MARC to MODS and back to MARC can lose specific data or granular tagging).

VRA Core
Overview: VRA Core is used in the visual resources community, designed for describing images and works of art.
- Core metadata fields: Work type (e.g., painting, sculpture), Title, Agent (the creator or contributor), Material (Controlled vocabularies like AAT may be used), Technique (Controlled vocabularies like AAT may be used), Cultural context, Location, and Subject.
- Structure: VRA Core is XML-based, providing a structured format for describing visual resources.
- Best fit: Cultural heritage management, digital image repositories, academic and special collections, and hierarchical relationship mapping.
- Use cases: Documenting art collections in museums and galleries and managing visual resources in academic institutions.
- Strength: Specialized for art and images, standardized XML, explicit work-image relationships, and user-friendly tooling (e.g., Omeka, which offers a well-integrated data entry interface).
- Weakness: Implementation complexity of the relational structure of VRA Core 4.0, data migration issues, and lack of adoption by some platforms like ARTstor.

Darwin Core
Overview: Darwin Core (DwC) is a standardized vocabulary, maintained by TDWG, designed to facilitate the sharing and integration of biodiversity data. It provides consistent terms (labels and definitions) for documenting organism occurrences, specimens, and samples. It is widely used to share data on GBIF through structured text files, often packaged as Darwin Core Archives.
- Core metadata fields: Occurrence, Event, Location, Taxon (biological classification), Identification, Measurement or Fact, DNA, and Extension for multimedia.
- Structure: Simple tabular (CSV) or RDF; very flexible extensions.
- Best fit: Biodiversity, ecology, natural history collections.
- Use cases: GBIF, iNaturalist, museum collections, ecological field studies.
- Strength: Designed for sharing species occurrence data globally.
- Weakness: Not suitable outside life sciences.

Schema.org
Overview: Schema.org provides a standardized, collaborative vocabulary that improves SEO (Search Engine Optimization) through richer search results (snippets, star ratings) by helping search engines understand content context.
- Core metadata fields: Dataset, Creator, Description, License, Keywords, Temporal coverage, Spatial coverage, and Variable measured.
- Structure: JSON-LD embedded in web pages.
- Best fit: Any discipline that wants Google Dataset Search visibility.
- Use cases: Institutional websites, repositories that embed structured data (Dataverse, Figshare, Zenodo all support it).
- Strength: Broad adoption, support for multiple formats (JSON-LD, Microdata), and improved click-through rates.
- Weakness: Potential for implementation complexity and limited applicability for niche topics.

Comparison of Metadata Schemas

Schema	Simplicity	Domain-Specific	Interoperability	Use Cases
Dublin Core	High	No	High	Libraries, archives, multimedia
DataCite	Medium	Yes (Research)	High	Research data, institutional repositories
ISO 19115	Low	Yes (Geospatial)	High	Geospatial data, GIS systems
MODS	Medium	Yes (Bibliographic)	Medium	Libraries, digital collections
VRA Core	Medium	Yes (Visual)	Medium	Art collections, visual resources
Darwin Core	Low	Yes (Biodiversity)	Low	Ecological field studies, museum collections
Schema.org	Medium	No	High	Institutional websites

Recommendations for Researchers

- Start with DataCite: it is the current global standard for most repositories (including Dataverse).
- Add domain-specific schemas when needed (e.g., DDI, the Data Documentation Initiative standard, for surveys; Darwin Core for biodiversity; ISO 19115 for GIS).
- Use controlled vocabularies wherever available: they improve search and interoperability.
- Embed Schema.org JSON-LD on your dataset landing page: it takes little effort and makes the dataset eligible for indexing by Google Dataset Search.
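To make the last recommendation concrete, here is a minimal Python sketch that builds a Schema.org `Dataset` description and wraps it in the `<script type="application/ld+json">` tag used for embedding in a landing page. The property names are Schema.org terms, but every value (title, creator, coverage, variables) is a hypothetical placeholder, and the helper name `to_jsonld_script` is an assumption for illustration.

```python
import json

# Placeholder values; replace them with your dataset's real metadata.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Household Survey, Wave 3",
    "description": "Hypothetical survey dataset used to illustrate JSON-LD markup.",
    "creator": {"@type": "Person", "name": "Example Researcher"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["household income", "survey"],
    "temporalCoverage": "2023",
    "spatialCoverage": {"@type": "Place", "name": "Hong Kong"},
    "variableMeasured": ["income_monthly", "education_level"],
}


def to_jsonld_script(metadata: dict) -> str:
    """Serialize metadata as JSON-LD inside an HTML script tag."""
    payload = json.dumps(metadata, indent=2, ensure_ascii=False)
    # Escape "</" so a value can never close the script tag early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = to_jsonld_script(dataset_metadata)
print(snippet)
```

The printed snippet would be pasted into the HTML of the landing page; many repositories generate equivalent markup automatically, in which case nothing needs to be added by hand.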

Conclusion

Choosing the right metadata schema is crucial for effective data management and sharing. Each schema offers unique features and is suited to specific types of data and research fields. By understanding the strengths and applications of these standardized metadata schemas, researchers can ensure their data is well documented, discoverable, and interoperable, ultimately contributing to the advancement of knowledge and innovation.
