https://github.com/dmvianna/conduit-patents

parsing addresses out of free text
https://github.com/dmvianna/conduit-patents

conduit csv exercise haskell

Last synced: 2 months ago
JSON representation

parsing addresses out of free text

Host: GitHub
URL: https://github.com/dmvianna/conduit-patents
Owner: dmvianna
License: bsd-3-clause
Created: 2018-01-23T04:42:49.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-02-08T04:48:55.000Z (over 7 years ago)
Last Synced: 2025-01-31T09:12:42.879Z (4 months ago)
Topics: conduit, csv, exercise, haskell
Language: Haskell
Size: 77.1 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.org
- License: LICENSE

Awesome Lists containing this project

README

#+TITLE: conduit-patents
#+AUTHOR: Daniel Vianna
#+DATE:<2018-01-23 Tue>
#+INFOJS_OPT: path:http://orgmode.org/org-info.js
#+INFOJS_OPT: toc:nil ltoc:nil view:info mouse:underline buttons:nil
#+STARTUP: content
#+TODO: TODO IN-PROGRESS WAITING DONE

I'm learning how to parse big CSV files in Haskell. This is my [[https://github.com/dmvianna/framesy][fourth attempt]]. I'll be trying things (hopefully) that are almost directly translatable to my work, as parsing addresses out of free text. Good luck to me!

* The dataset
The data we'll be analysing are the [[https://ipaustralia.gov.au/about-us/economics-ip/ip-government-open-data][public records of Australian patents]]. You can get the CSV files from data.gov.au. No, I will not include 785 MB of data (compressed) in this repository. However you can find the =head= of that file in the =./data= directory. Obtained with

=$ head -n 20 IPGOD.IPGOD122B_PAT_ABSTRACTS.csv > pat_abstracts.csv=

* What I plan to achieve

** TODO 1. Read a stream
Read a ~CSV~ file as a stream, so I don't need to load the entire thing to work on it.

** TODO 2. Inspect the stream
Being able to inspect the stream using something like =take= or =show= with indexing. I assume I would be doing it in ~GHCi~.

** TODO 3. Extract relevant info
from unstructured text, such as addresses. That's a big part of what I do for work, and the main motivation for looking beyond Python. I want to move away from regular expressions and do it fast.

** TODO 4. Filter the stream
After parsing we must be able to subset the stream according to boolean constraints. These must be composable.

** TODO 5. Tabular results
reshape results in tabular form as a prelude to exporting to ~CSV~ or database.

** TODO 6. Group by
At some point we will want to analyse results for such things like counts.

** TODO 7. Output
Encoding results back into an output file, or sending it to a database.

* How can I achieve it?

Finally, I eagerly welcome help to move this forward. Get in touch!

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dmvianna/conduit-patents

Awesome Lists containing this project

README