Pre-processing coverage data for Data Visualizations

@mhucka has been exploring ways to make it easier to visually drill down into the coverage data (i.e., the public record of all the data held by participating orgs). Discussion of dataviz options is here: https://github.com/datatogether/research/tree/master/data_visualization

This will inevitably require pre-processing of the data, partly because a given layer of the navigation tree often ends up holding tens of thousands of items (e.g., URLs). Simple content-based pre-processing, such as running files through FITS to extract content types, will help, but there is also a clear need for deeper machine analysis. At the very least, entity extraction could be used to identify patterns/topics within a corpus.
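As a rough illustration of the simplest kind of pre-processing, here is a minimal sketch that collapses a flat list of URLs into buckets keyed by host plus leading path segments, so a layer shows a handful of counts rather than every item. The function names (`bucket_key`, `summarize`), the `depth` parameter, and the example URLs are all hypothetical; this is not the approach anyone has settled on.

```python
from collections import Counter
from urllib.parse import urlsplit


def bucket_key(url, depth=1):
    """Return the host plus the first `depth` path segments of a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.netloc] + segments[:depth])


def summarize(urls, depth=1):
    """Count how many URLs fall into each bucket at the given depth."""
    return Counter(bucket_key(u, depth) for u in urls)


# Hypothetical sample data, for illustration only.
urls = [
    "https://example.gov/data/air/2016.csv",
    "https://example.gov/data/water/2016.csv",
    "https://example.gov/reports/annual.pdf",
]
counts = summarize(urls)
```

Raising `depth` splits a crowded bucket into finer ones, which could map onto drilling down one level in the visualization.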

@mhucka has already been working on some of this. Let's rope in a few more people. @chrpr and @mejackreed come to mind.

The ETL (extract, transform, load) pattern seems pretty applicable here, and it opens up opportunities to experiment with incorporating distributed data and distributed tools into machine analysis pipelines:
1. aggregate the essential info into a workable dataset (the info is currently tracked in a SQL database; eventually it will be distributed)
2. analyze that dataset
3. write the analyzed/reformatted result (e.g., to IPFS)
4. pass around a reference to the updated/processed/extended dataset (e.g., an IPFS hash)
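The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory SQLite table named `urls` stands in for the real SQL database, a plain dict stands in for IPFS, and a SHA-256 hex digest stands in for an IPFS hash (it is content-addressed in the same spirit, but it is not a real IPFS identifier). The table schema, function names, and sample rows are hypothetical.

```python
import hashlib
import json
import posixpath
import sqlite3
from collections import Counter
from urllib.parse import urlsplit


def extract(conn):
    """Step 1: aggregate the essential info (here, just the URLs)."""
    return [row[0] for row in conn.execute("SELECT url FROM urls ORDER BY url")]


def analyze(urls):
    """Step 2: placeholder analysis that counts URLs by file extension."""
    counts = Counter(
        posixpath.splitext(urlsplit(u).path)[1].lower() or "(none)" for u in urls
    )
    return {"total": len(urls), "by_extension": dict(sorted(counts.items()))}


def write(result, store):
    """Step 3: write the result to a content-addressed store.

    `store` is a dict standing in for IPFS; the key is a SHA-256 digest,
    which is NOT a real IPFS hash.
    """
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    ref = hashlib.sha256(payload).hexdigest()
    store[ref] = payload
    return ref


# Hypothetical sample database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT)")
conn.executemany(
    "INSERT INTO urls VALUES (?)",
    [
        ("https://example.gov/data/air.csv",),
        ("https://example.gov/data/water.csv",),
        ("https://example.gov/about",),
    ],
)

store = {}
ref = write(analyze(extract(conn)), store)  # Step 4: pass `ref` around
```

Because the reference is derived from the content, anyone holding `ref` can check that the dataset they fetched is the one that was analyzed, which is the property that makes passing around a hash attractive.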


