MOVR DataHub Analytics

Python Package GSoC 2026

MOVR Legacy DataHub Library

A Python package for analyzing clinical registry data from the MOVR DataHub (2019-2025), covering 3,570 participants and 11,501 encounters across ALS, DMD, SMA, BMD, LGMD, FSHD, and Pompe disease.


Overview

MOVR DataHub Analytics transforms raw Excel registry exports into analysis-ready datasets through automated data wrangling, quality validation, and cohort management.

Key Capabilities

  • Data Pipeline: Excel to Parquet conversion with audit logging
  • Data Wrangling: YAML-configurable transformation rules
  • Cohort Management: Flexible patient cohort building
  • Analytics: Descriptive statistics and reporting
  • Data Dictionary: Field search and metadata exploration
  • Plugin System: Custom transformation extensions
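
To make the wrangling and plugin capabilities more concrete, here is a minimal sketch of how rule-driven transformations with a plugin-style registry could work. It is not the package's actual API: the names `TRANSFORMS`, `register`, `apply_rules`, and the rule keys `field` and `transform` are all assumptions made for illustration, and `RULES` stands in for whatever a YAML rules file would parse to.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a rule name to a transformation function.
TRANSFORMS: Dict[str, Callable[[object], object]] = {}


def register(name: str):
    """Decorator that adds a function to the registry under `name`."""
    def decorator(func):
        TRANSFORMS[name] = func
        return func
    return decorator


@register("strip")
def strip_whitespace(value):
    # Non-string values pass through unchanged.
    return value.strip() if isinstance(value, str) else value


@register("upper")
def to_upper(value):
    return value.upper() if isinstance(value, str) else value


def apply_rules(records: List[dict], rules: List[dict]) -> List[dict]:
    """Apply each rule, in order, to the named field of every record."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the input record is left untouched
        for rule in rules:
            field = rule["field"]
            if field in row:
                row[field] = TRANSFORMS[rule["transform"]](row[field])
        cleaned.append(row)
    return cleaned


# The structure a YAML rules file might parse to (illustrative only).
RULES = [
    {"field": "disease", "transform": "strip"},
    {"field": "disease", "transform": "upper"},
]

raw = [{"FACPATID": "P001", "disease": " dmd "}]
rows = apply_rules(raw, RULES)
print(rows)
```

Keeping the rules as plain data and looking functions up by name is what would let a plugin add a new transformation without touching the rule-application loop.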

Project Structure

movr-datahub-analytics/
├── src/movr/               # Main package
│   ├── config/             # Configuration management
│   ├── data/               # Excel/Parquet loading
│   ├── wrangling/          # Data cleaning
│   ├── cohorts/            # Cohort management
│   ├── analytics/          # Analysis framework
│   ├── dictionary/         # Data dictionary tools
│   └── cli/                # Command-line interface
├── config/                 # YAML configuration files
├── data/                   # Data storage (gitignored)
├── notebooks/              # Jupyter examples
├── tests/                  # Test suite
└── docs/                   # Documentation

Roadmap

Phase 1: Core Library (2025)

  • Package structure
  • Excel to Parquet conversion
  • Data wrangling
  • Cohort management
  • CLI implementation

Phase 2: Advanced (2026)

  • Config-driven cohort builder
  • Disease-specific analysis rules
  • Workflow orchestration
  • Visualization tools
  • Web interface (FastAPI)

Installation

Note: This package is not yet on PyPI. Install locally in editable mode.

Requirements

  • Python 3.9+
  • Git
  • Virtual environment (recommended)

Basic Installation

# Clone the repository
git clone https://github.com/OpenMOVR/movr-datahub-analytics.git
cd movr-datahub-analytics

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install package
pip install -e .

Full Installation (recommended)

# Install with all optional dependencies
pip install -e ".[dev,viz,notebooks]"

# This includes:
# - dev: pytest, black, mypy, pre-commit
# - viz: matplotlib, seaborn, plotly
# - notebooks: jupyter, ipywidgets

Configuration Setup

# Run the setup wizard (first time)
movr setup

# This will:
# - Create config/config.yaml
# - Set up data directories
# - Configure Excel file paths
# - Initialize audit logging

Verify Installation

# Check CLI is working
movr --help

# Check version
movr --version

# Run tests
pytest

CLI Reference

The movr command-line interface provides access to all package functionality.

Setup Commands

# Interactive setup wizard
movr setup

# Show current configuration
movr config show

# Validate configuration
movr config validate
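
The README does not say what `movr config validate` checks. As a rough sketch of the idea only, the Python below validates a parsed configuration against a list of required keys. The key names `data_dir`, `excel_files`, and `audit_log` are assumptions loosely modeled on what the setup wizard is described as creating; the real config/config.yaml schema may differ.

```python
from typing import List

# Hypothetical required keys and their expected types.
REQUIRED = {"data_dir": str, "excel_files": list, "audit_log": str}


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            actual = type(config[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    return problems


good = {"data_dir": "data", "excel_files": ["export.xlsx"], "audit_log": "audit.jsonl"}
bad = {"data_dir": "data", "excel_files": "export.xlsx"}

print(validate_config(good))
print(validate_config(bad))
```

Collecting every problem instead of stopping at the first one lets a user fix the whole file in a single pass.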

Data Pipeline

# Convert Excel files to Parquet
movr convert

# Convert specific file
movr convert --file "path/to/file.xlsx"

# Validate data quality
movr validate

# View data summary
movr summary --registry datahub --metric all
movr summary --registry datahub --metric enrollment
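
The audit logging that accompanies conversion is not described in detail here. One plausible shape, sketched below under stated assumptions, is an append-only JSON Lines file with one entry per converted file. The function names, the entry fields, and the file names are all hypothetical, and the Excel-to-Parquet step itself is omitted: a placeholder file stands in for a real export.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def audit_entry(source: Path, output: Path, row_count: int) -> dict:
    """Describe one conversion: what was read, what was written, and when."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source.name,
        "source_sha256": digest,  # lets a later run detect a changed export
        "output": output.name,
        "rows": row_count,
    }


def append_audit(log_path: Path, entry: dict) -> None:
    """Append the entry as a single JSON line; earlier entries are never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "encounters.xlsx"
    # Placeholder bytes, not a real workbook; no conversion happens in this sketch.
    source.write_bytes(b"placeholder bytes standing in for an Excel export")
    log_path = Path(tmp) / "audit.jsonl"

    entry = audit_entry(source, Path(tmp) / "encounters.parquet", row_count=0)
    append_audit(log_path, entry)

    lines = log_path.read_text(encoding="utf-8").splitlines()
    logged = [json.loads(line) for line in lines]

print(logged[0]["source"], logged[0]["rows"])
```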

Data Dictionary

# Search for fields
movr dictionary search "age"
movr dictionary search "medication" --diseases "DMD,SMA"
movr dictionary search "ambulation" --diseases "all"

# List all fields
movr dictionary list-fields

# Show specific field details
movr dictionary show-field FACPATID

# Export dictionary
movr dictionary export --format csv
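
For readers who want a mental model of what a search with a `--diseases` filter does, the following sketch filters an in-memory dictionary by search term and disease. It is an illustration, not the package's implementation: `search_fields` is a hypothetical function, and apart from `FACPATID` the field names, descriptions, and disease assignments are made up.

```python
from typing import Iterable, List, Optional

# Illustrative entries only; descriptions and disease lists are placeholders.
DICTIONARY = [
    {"field": "FACPATID", "description": "identifier",
     "diseases": ["ALS", "DMD", "SMA"]},
    {"field": "example_age_at_visit", "description": "age in years",
     "diseases": ["ALS", "DMD", "SMA"]},
    {"field": "example_ambulation", "description": "ambulation status",
     "diseases": ["DMD", "SMA"]},
]


def search_fields(entries: List[dict], term: str,
                  diseases: Optional[Iterable[str]] = None) -> List[str]:
    """Return names of fields whose name or description contains `term`.

    Matching is a case-insensitive substring test, so "age" would also
    match a word such as "dosage". Passing `diseases=None` searches all.
    """
    term = term.lower()
    wanted = {d.upper() for d in diseases} if diseases else None
    hits = []
    for entry in entries:
        text = f"{entry['field']} {entry['description']}".lower()
        if term not in text:
            continue
        # Keep the entry only if it applies to at least one requested disease.
        if wanted is not None and not wanted & set(entry["diseases"]):
            continue
        hits.append(entry["field"])
    return hits


print(search_fields(DICTIONARY, "age"))
print(search_fields(DICTIONARY, "ambulation", ["DMD", "SMA"]))
print(search_fields(DICTIONARY, "ambulation", ["ALS"]))
```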

Cohort Management

# Create cohort from YAML config
movr cohort create --config cohort_config.yaml

# List existing cohorts
movr cohort list

# Export cohort
movr cohort export --name "my_cohort" --format parquet
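
The format of cohort_config.yaml is not documented in this README. The sketch below shows one way config-driven cohort selection could work, using a Python dict in place of the parsed YAML. The config keys (`name`, `include`, `min`, `max`), the `age` and `disease` fields, and the sample participants are all assumptions for illustration.

```python
from typing import List

# Made-up participants; only the FACPATID field name comes from this README.
PARTICIPANTS = [
    {"FACPATID": "P001", "disease": "DMD", "age": 12},
    {"FACPATID": "P002", "disease": "SMA", "age": 30},
    {"FACPATID": "P003", "disease": "DMD", "age": 25},
]

# What a cohort YAML file might parse to (hypothetical schema).
COHORT_CONFIG = {
    "name": "my_cohort",
    "include": {
        "disease": ["DMD"],            # list: value must be one of these
        "age": {"min": 0, "max": 17},  # dict: value must fall in this range
    },
}


def matches(record: dict, criteria: dict) -> bool:
    """True if the record satisfies every criterion."""
    for field, rule in criteria.items():
        value = record.get(field)
        if isinstance(rule, dict):
            low = rule.get("min", float("-inf"))
            high = rule.get("max", float("inf"))
            if value is None or not (low <= value <= high):
                return False
        elif value not in rule:
            return False
    return True


def build_cohort(records: List[dict], config: dict) -> List[dict]:
    return [record for record in records if matches(record, config["include"])]


cohort = build_cohort(PARTICIPANTS, COHORT_CONFIG)
ids = [record["FACPATID"] for record in cohort]
print(COHORT_CONFIG["name"], ids)
```

Criteria are combined with a logical AND in this sketch; a real cohort builder would likely also need exclusion rules and encounter-level conditions.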

Analytics

# Run descriptive statistics
movr analytics describe --cohort "my_cohort"

# Generate report
movr analytics report --cohort "my_cohort" --output report.html
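
As a rough idea of the kind of output a descriptive-statistics step produces, the sketch below summarizes a small made-up cohort using only the Python standard library. The `describe` function, its parameters, and the sample records are hypothetical; the statistics actually reported by `movr analytics describe` are not specified in this README.

```python
import statistics
from collections import Counter
from typing import List


def describe(records: List[dict], numeric_field: str, group_field: str) -> dict:
    """Summarize one numeric field and count records per group."""
    values = [r[numeric_field] for r in records if r.get(numeric_field) is not None]
    return {
        "n": len(records),
        "missing": len(records) - len(values),  # records with no numeric value
        "mean": statistics.mean(values) if values else None,
        "median": statistics.median(values) if values else None,
        "counts": dict(Counter(r[group_field] for r in records)),
    }


# Made-up records for illustration only.
records = [
    {"disease": "DMD", "age": 10},
    {"disease": "DMD", "age": 14},
    {"disease": "SMA", "age": None},
    {"disease": "ALS", "age": 60},
]

summary = describe(records, "age", "disease")
print(summary)
```

Reporting the number of missing values alongside the mean and median matters for registry data, where fields are often incompletely recorded.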

Contributing

We welcome contributions from both the research and software engineering communities.

GSoC 2026 Priority Project

This is a priority project for Google Summer of Code 2026. Prospective GSoC contributors should review the Contributing Guide.

Development Setup

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run all checks
pre-commit run --all-files

Code Quality

# Run tests
pytest

# Run tests with coverage (requires the pytest-cov plugin)
pytest --cov=src/movr

# Format code
black src/ tests/

# Type checking
mypy src/

# Lint (requires ruff to be installed)
ruff check src/ tests/

Pull Request Process

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes, following the project's coding standards
  4. Add tests for any new functionality
  5. Run the tests and linters locally
  6. Commit with clear, descriptive messages
  7. Push the branch and open a pull request

Contribution Areas

Code

  • Core library features
  • Bug fixes
  • Performance improvements
  • Plugin development

Documentation

  • Tutorials and guides
  • API documentation
  • Example notebooks
  • Translations

Resources

Support

GitHub Issues: Report bugs or request features

Technical questions: andre.paredes@ymail.com