Design a Unified AI Evaluation Architecture for Gender-Based Violence Court Data
Background. TrackGBV uses AI to extract 60+ structured fields from court sentencing decisions in gender-based violence cases. We have three separate pieces of evaluation code in our system: a legacy eval pipeline that has been dormant for about a year, a standalone comparative evaluation notebook, and a new prompt improvement pipeline recently built by an MIT GenAI Lab student team. These three pieces were built at different times by different people, and we need someone to review all of them and tell us what the unified target should look like.
The project. Review the three existing evaluation codebases, identify what each does, what overlaps, what should be retired, and what gaps remain. Then design a target evaluation architecture that unifies the valuable pieces and supports two specific integrations we need: automatic evaluation runs after each extraction, and using human-corrected data as the preferred ground truth source.
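The first integration named above, automatic evaluation runs after each extraction, amounts to a post-extraction hook. A minimal sketch of that shape follows; every name here is a hypothetical placeholder, not TrackGBV's actual code:

```python
from typing import Any, Callable, Dict

# Hypothetical placeholder types standing in for the real pipeline's schemas.
ExtractionResult = Dict[str, Any]
EvalReport = Dict[str, Any]

def run_extraction(document_id: str) -> ExtractionResult:
    # Stand-in for the real extraction pipeline (60+ structured fields).
    return {"document_id": document_id, "fields": {"court": "High Court"}}

def evaluate(result: ExtractionResult) -> EvalReport:
    # Stand-in for an evaluation pass against ground truth.
    return {"document_id": result["document_id"],
            "fields_checked": len(result["fields"])}

def extract_and_evaluate(document_id: str,
                         record_eval: Callable[[EvalReport], None]) -> ExtractionResult:
    """Run extraction, then immediately trigger an evaluation run on its output."""
    result = run_extraction(document_id)
    record_eval(evaluate(result))
    return result
```

In a real system the hook would likely write to an evaluation store rather than call back into the caller, but the shape of the integration is the same: every extraction produces an evaluation record as a side effect.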
Deliverables, organized into three phases:
Phase 1 (Weeks 1-2): Codebase review and landscape mapping
- Review the three evaluation codebases (legacy eval pipeline, comparative evaluation notebook, prompt improvement pipeline)
- Document what each does, what metrics each uses, how each handles ground truth
- Produce a landscape map showing overlaps, gaps, and redundancies
Phase 2 (Weeks 3-4): Requirements and target architecture design
- Work with ICAAD to define requirements for the unified eval system, focusing on automatic post-extraction runs and human-corrected data as ground truth
- Design the target architecture: which components to retain, which to retire, what new pieces are needed
- Define the data flow between extraction, corrections, and evaluation
- Produce target architecture document with component diagrams and responsibilities
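One way to read the "human-corrected data as the preferred ground truth" requirement: whenever a reviewer has corrected a field, the corrected value supersedes the raw extraction as the evaluation reference. A minimal sketch under that assumption (stdlib dataclasses here for self-containment; the production schemas are pydantic models, and all names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedField:
    field_name: str
    value: str

@dataclass
class HumanCorrection:
    field_name: str
    corrected_value: str

def ground_truth_value(extracted: ExtractedField,
                       correction: Optional[HumanCorrection]) -> str:
    """Prefer the human-corrected value whenever a reviewer supplied one;
    fall back to the raw extracted value otherwise."""
    if correction is not None:
        return correction.corrected_value
    return extracted.value
```

The target architecture document would pin down exactly this precedence rule, plus how corrections flow from the QC interface into the evaluation store.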
Phase 3 (Weeks 5-6): Backlog and handoff
- Translate the target architecture into a prioritized implementation backlog
- Estimate effort for each backlog item
- Document assumptions, open questions, and risks
- This backlog directly informs a follow-on Taproot project for the integration engineer who will implement the architecture
Commitment: 6-8 hours per week across 6 weeks.
Skills needed:
- Senior-level ML engineering or ML infrastructure experience
- Evaluation methodology background (metrics for structured extraction, MAE vs accuracy tradeoffs, multi-label and regression metrics)
- Ability to read and critically assess existing Python codebases (pandas, SQL, pydantic)
- Experience designing ML evaluation systems that integrate with production pipelines
- Clear technical writing for architecture documents and implementation backlogs
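To illustrate the metric choices mentioned above: structured extraction typically scores numeric fields by absolute error (MAE) and categorical fields by exact match, aggregated per field. The sketch below assumes that split; the field names and the numeric/categorical partition are illustrative, not TrackGBV's actual schema:

```python
# Hypothetical set of numeric fields; everything else is treated as categorical.
NUMERIC_FIELDS = {"sentence_length_months", "fine_amount"}

def field_error(field: str, predicted, actual) -> float:
    """Absolute error for numeric fields; 0/1 mismatch for categorical fields."""
    if field in NUMERIC_FIELDS:
        return abs(float(predicted) - float(actual))
    return 0.0 if predicted == actual else 1.0

def mean_errors(rows):
    """Aggregate per-field errors over (field, predicted, actual) rows.
    For numeric fields the result is MAE; for categorical fields it is
    the error rate (1 - accuracy)."""
    totals, counts = {}, {}
    for field, pred, act in rows:
        totals[field] = totals.get(field, 0.0) + field_error(field, pred, act)
        counts[field] = counts.get(field, 0) + 1
    return {f: totals[f] / counts[f] for f in totals}
```

For example, two sentence-length predictions off by 6 and 0 months give an MAE of 3.0, while one correct and one wrong offence type give an error rate of 0.5; reporting both on one dashboard is exactly the kind of tradeoff the unified design must settle.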
Note on sensitive content: This project involves working with infrastructure that processes court sentencing decisions in gender-based violence cases. The volunteer will not do case-level review, but will need a trauma-informed approach when working with the system and its data.
As we scale, our AI extraction accuracy becomes increasingly important. Our current evaluation infrastructure evolved organically across multiple volunteers and student teams, and we now have three separate pieces of eval code that don't speak to each other. Without a unified view, we cannot reliably answer "is our extraction getting better or worse over time?" or "does human correction data improve our AI when fed back in?"
This volunteer project is the diagnostic step. By reviewing what exists and designing a target architecture, the volunteer enables ICAAD to:
- Make an informed decision about what evaluation code to keep, retire, or build
- Set a clear specification for the follow-on implementation project, avoiding scope creep and duplicated work
- Close the feedback loop between human QC review and AI prompt improvement
- Establish ongoing accuracy measurement as cases expand to new jurisdictions
For a volunteer with ML evaluation experience, this is architectural work with durable impact. Your assessment and design directly shape how TrackGBV measures its accuracy for years to come.
ICAAD will support the volunteer by:
- Providing access to all three existing evaluation codebases, with context on when and why each was built
- Providing documentation and READMEs to accompany the code
- Defining clear integration requirements (automatic post-extraction runs, human-corrected data as ground truth) so the volunteer has concrete targets to design against
- Ensuring the volunteer has access to sample data, extracted output, and corrected ground-truth examples for context
- Committing to a review-and-sign-off process at the end of each phase, so the volunteer knows their direction is correct before moving forward
The volunteer will work directly with me (Director of Analytics and Justice Tech) for all decisions and reviews.
International Center for Advocates Against Discrimination Inc.
Location
Remote, US-NY
Website
https://www.icaad.ngo
Member Since
Oct 2021
Completed Taproot Plus Partnerships
0