https://github.com/linsomniac/spawk

Text processing library for Python

Host: GitHub
URL: https://github.com/linsomniac/spawk
Owner: linsomniac
License: mit
Created: 2018-08-16T01:39:20.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-01-01T16:21:11.000Z (11 months ago)
Last Synced: 2024-01-01T17:30:12.577Z (11 months ago)
Language: Python
Size: 48.8 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Spawk Text Processing Library

## Overview

Analysis, Spawk!  This is a record processing library inspired by the AWK
tool, in a Python style.  It is currently a work in progress, exploring
different ways of achieving this.  The library may change significantly
as it matures.

## Abilities

- Read lines from an input.

- Call a function if a regex is matched.

- Call a function on every line between a start and end regex.

- Enrich lines from the input with metadata, such as the line number each
  line came from.  Pipeline elements can enrich a line as well; for example,
  "split()" sets a "fields" attribute on the line containing the split-out
  fields.

## Snippets

Print out lines that start with "a":

```python
import spawk

t = spawk.Spawk()
# Calling the decorator directly registers the rule without a function
t.pattern(r'^a')()
t.run()
```

OR:

```python
import spawk

t = spawk.Spawk()
t.grep(r'^a')
```
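The `t.pattern(r'^a')()` idiom works because a decorator factory returns an ordinary callable, so it can be invoked directly instead of being placed above a `def`. The following is a minimal sketch of that mechanism, not Spawk's actual implementation. `MiniRules`, `default_action`, and `rules` are hypothetical names, and the sketch assumes that a rule registered without a function falls back to printing the line.

```python
import re
import sys


class MiniRules:
    """Hypothetical stand-in for a rule registry; not Spawk's real class."""

    def __init__(self):
        self.rules = []  # list of (compiled regex, action) pairs

    @staticmethod
    def default_action(context, line):
        # Assumed default: print the matching line, as AWK does for /regex/.
        sys.stdout.write(line)

    def pattern(self, regex):
        compiled = re.compile(regex)

        def register(func=None):
            # Used as a decorator, func is the decorated function.
            # Called with no arguments, fall back to the default action.
            action = func if func is not None else self.default_action
            self.rules.append((compiled, action))
            return func

        return register

    def run(self, lines):
        # Offer every line to every rule whose regex matches it.
        for line in lines:
            for compiled, action in self.rules:
                if compiled.search(line):
                    action(self, line)
```

With this shape, `rules.pattern(r'^a')()` and `@rules.pattern(r'^a')` above a function both go through the same `register` call; only the argument differs.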

Select lines that start with "a" and, among those, append the lines that
contain a "q" to "t.context.data":

```python
import spawk

t = spawk.Spawk()
t.grep(r'^a')
t.context.data = ''

@t.pattern(r'q')
def line(context, line):
    context.data += line

t.run()
```

The context includes the regex match.  The line data is a string subclass with
some extra attributes for line numbers and extracting fields:

```python
import spawk

t = spawk.Spawk().split()

@t.pattern(r'hello (\S+)')
def line(context, line):
    print(
        'Line {} says hello to {}.  Field 3 is: {}'.format(
            line.line_number, context.regex.group(1), line.fields[2]))

t.run()
```
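To illustrate how a line can be a plain string and still carry metadata, here is a minimal sketch of a `str` subclass with `line_number` and `fields` attributes. It is not Spawk's code: `EnrichedLine` and `enrich` are hypothetical names, and numbering lines from 1 is an assumption.

```python
class EnrichedLine(str):
    """Hypothetical str subclass that carries per-line metadata."""

    def __new__(cls, text, line_number, separator=None):
        line = super().__new__(cls, text)
        line.line_number = line_number
        # str.split(None) splits on runs of whitespace, like AWK's default FS.
        line.fields = text.split(separator)
        return line


def enrich(lines, separator=None):
    """Yield EnrichedLine objects, numbered from 1 (an assumption)."""
    for number, text in enumerate(lines, start=1):
        yield EnrichedLine(text, number, separator)
```

Because the object is still a `str`, it can be passed to `re.search()`, written to a file, or compared with other strings without unwrapping.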

Display the username and UID for "/etc/passwd" lines that start with "s":

```python
import spawk

t = spawk.Spawk().split(':')

@t.pattern(r'^s')
def show(ctx, line):
    # Field 0 is the username, field 2 is the numeric UID
    print('{0} uid={2}'.format(*line.fields))

t.run()
```

When a line contains "CREATE TABLE", save it and the remaining lines up until
a closing paren and semicolon into "t.context.data".  This parses out the
table creation from a SQL schema.

Within a range there is an additional context that includes the line number
within the range, and whether it is the last line.  So we can add line numbers
and print the create statement at the end:

```python
import spawk

t = spawk.Spawk()
t.context.data = ''

@t.range(r'CREATE TABLE', r'\);')
def line(context, line):
    context.data += 'line %d:' % context.range.line_number + line
    if context.range.is_last_line:
        print(context.data)
        context.data = ''

t.run()
```
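The range behaviour described above can be approximated with the standard library alone. This is a sketch under stated assumptions rather than Spawk's implementation: `in_range` is a hypothetical helper, range line numbers are assumed to start at 1, and a single line is allowed to both open and close a range (as in AWK).

```python
import re


def in_range(lines, start, end):
    """Yield (range_line_number, is_last_line, line) for lines inside ranges.

    A range opens on a line matching `start` and closes on the next line
    matching `end`; lines outside any range are skipped.
    """
    start_re, end_re = re.compile(start), re.compile(end)
    active = False
    count = 0
    for line in lines:
        if not active and start_re.search(line):
            active = True
            count = 0
        if active:
            count += 1
            is_last = bool(end_re.search(line))
            yield count, is_last, line
            if is_last:
                active = False
```

Note that the end regex is only consulted while a range is open, so a stray `);` elsewhere in the input does not trigger anything.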

There is also a FileFollower which implements "tail -F" functionality.
It will look for new data to be appended to the file, and will re-open
the file if it shrinks, or a new file is created in place of the old.

Simple "tail +0 -F" implementation:

```python
import spawk

for line in spawk.FileFollower('/var/log/syslog'):
    print(line.rstrip())
```
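The two re-open conditions mentioned above (the file shrinks, or a new file replaces the old one) can be detected by comparing the open handle with what is currently on disk. The sketch below is one plausible way to do that check and is not taken from FileFollower itself; `should_reopen` is a hypothetical helper, and it assumes a POSIX-like filesystem and a handle opened in binary mode so that `tell()` is a byte offset.

```python
import os


def should_reopen(handle, path):
    """Return True if `path` was replaced or truncated since `handle` was opened."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        # The file is temporarily missing (mid-rotation); keep the old handle.
        return False
    opened = os.fstat(handle.fileno())
    # A different inode or device means a new file now lives at this path.
    replaced = (on_disk.st_ino, on_disk.st_dev) != (opened.st_ino, opened.st_dev)
    # A size smaller than our read position means the file was truncated.
    truncated = on_disk.st_size < handle.tell()
    return replaced or truncated
```

A follower loop would call this between reads and, when it returns true, close the handle and open the path again from the start.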

Emulate the Unix "uniq" command: read stdin and drop lines that repeat the
previous line:

```python
import sys

import spawk

t = spawk.Spawk()
t.context.lastline = None

@t.eval('lastline != line')
def unique(ctx, line):
    sys.stdout.write(line)
    ctx.lastline = line

t.run()
```
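The expression passed to `eval()` refers to `lastline` and `line` as bare names. One way such a rule could be evaluated, shown here as an assumption rather than a description of Spawk's internals, is to expose the context's attributes plus the current line as the local namespace for Python's built-in `eval`. `eval_rule` and `uniq` are hypothetical helpers.

```python
from types import SimpleNamespace


def eval_rule(expression, context, line):
    """Evaluate `expression` with the context's attributes and `line` as names."""
    names = dict(vars(context), line=line)
    # An empty __builtins__ limits accidental name lookups; it is not a sandbox.
    return bool(eval(expression, {"__builtins__": {}}, names))


def uniq(lines):
    """Yield lines that differ from the line before them."""
    context = SimpleNamespace(lastline=None)
    for line in lines:
        if eval_rule("lastline != line", context, line):
            yield line
            context.lastline = line
```

For example, `list(uniq(['a\n', 'a\n', 'b\n']))` keeps one copy of the repeated line.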

## Examples

There are some example programs in the "Examples" directory:

- "css_colors_to_rgba" - Read lines with CSS colors in the form
  "#000000" or "#000" and convert them to "rgba(0, 0, 0, 0.3)".

- "extract_db_tables" - Read a database dump and extract out the "CREATE TABLE"

  commands to reproduce the schema.  This is an example of the "range()"

  decorator.

- "uniq" - Read stdin and output unique lines; lines that are the same as the
  previous line are dropped.
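As an illustration of the color conversion described for "css_colors_to_rgba", here is a standalone sketch using only the `re` module. It is not the code from the "Examples" directory: `HEX_COLOR`, `hex_to_rgba`, and `convert_line` are hypothetical names, and the fixed alpha of 0.3 is simply taken from the sample output above.

```python
import re

# Match "#rrggbb" or "#rgb"; \b rejects lengths such as four hex digits.
HEX_COLOR = re.compile(r'#([0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b')


def hex_to_rgba(match, alpha=0.3):
    """Turn a HEX_COLOR match into an "rgba(r, g, b, a)" string."""
    digits = match.group(1)
    if len(digits) == 3:
        # Shorthand "#abc" expands to "#aabbcc".
        digits = ''.join(ch * 2 for ch in digits)
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return 'rgba({}, {}, {}, {})'.format(red, green, blue, alpha)


def convert_line(line):
    """Replace every hex color in `line` with its rgba() form."""
    return HEX_COLOR.sub(hex_to_rgba, line)
```

`re.sub` accepts a function as the replacement, which keeps the per-match conversion separate from the line handling.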

## Future Projects

Here are some ideas I'm still trying to figure out: whether they make sense
and, if so, how best to implement them:

- More examples.

- Negate patterns.  Or just "not()" or "notpattern()"?  Or a "not()" wrapper

  around regexes?  Also how to do other AWKisms like "pattern && pattern",

  "pattern || pattern", "pattern ? pattern : pattern".

- Else decorator for if no pattern matched?

- @call(func) gets called with the line, possible uses:

    - Define a function and then @call(func)

    - @call(lambda x: x.match(r'xxx') and x.match(r'yyy') and x.eval('NR > 1'))

    - @call(lambda x: x.match(r'xxx') if x.match(r'yyy') else x.eval('NR > 1'))

- @filter() that uses truth of return to determine if line is passed to

  additional steps in program.

- Can decorators also be used as filters, for use in the "default print" case
  like awk-style "/regex/" with no code?

- Switch from "@pattern" to "@match" or "@search" or other "re" module names?

- A way to send updated record back to the processing engine.

    - Can the line be changed in the processing functions like it can in AWK?

      Make it so that the fields can be updated too.

- Documentation pages.

- Maybe: pattern() and the like could take as the argument:

    - String: Interpreted as a regex.

    - Otherwise, call it with the line?  But how do we get the context in there?

      Kind of want the context on the line so it can call re.match() or the like.

    - If return from above is truthy, call the decorated function.

    - BUT, if return was a SRE match, set that in the context.

- Take ideas from "elvish" shell and "jq".

- Pluggable field/record modules could allow much richer options, like a CSV
  input source, JSON, htpasswd/passwd, or even dbapi input or output.  But
  does that make sense?  CSV and JSON do.

    - A record reader that can do JSON.

    - A record reader that can do CSV.

    - Record reader that can do FS and RS.

    - FS (Field Separator) and RS (Record Separator)?  Currently the fields are
      implemented by str.split(), which can take things other than whitespace.
      RS may mean that multiple lines are handed to the processing rules, and
      I don't know exactly how that would fit into the current setup.  For
      example, FS="\n" and RS="", for processing addresses separated by blank
      lines.

    - OFS/ORS?  These are output versions of the above, which means that there

      needs to be some way to do the equivalent of "print" or "print $1 $3, $5".

    - Probably would have record processors that would have an input handler

      and an output handler and a way to specify if the outputter was called

      by default at the end of the pipeline or only on demand.  Does "Continue"

      trigger default output and what is the way to break pipeline in the

      converse way?  Maybe deliberately calling the outputter then Continue?

    - Come up with new name for "line" in eval, is it "record", "row"?

      Use that throughout?
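
To make the FS/RS idea above concrete, the following sketch shows what a record reader with AWK-like separators might look like, including the `FS="\n"`, `RS=""` case for blank-line-separated addresses. It is purely exploratory and not part of Spawk: `read_records` and its parameters are hypothetical, and it reads the whole input into memory for simplicity.

```python
def read_records(lines, record_separator="\n", field_separator=None):
    """Yield a list of fields per record.

    An empty record_separator selects AWK-style paragraph mode, where
    records are separated by one or more blank lines.
    """
    text = "".join(lines)
    if record_separator == "":
        records, current = [], []
        for line in text.splitlines():
            if line.strip():
                current.append(line)
            elif current:
                # A blank line closes the record collected so far.
                records.append("\n".join(current))
                current = []
        if current:
            records.append("\n".join(current))
    else:
        records = text.split(record_separator)
        # Drop the empty piece left after a trailing separator.
        if records and records[-1] == "":
            records.pop()
    for record in records:
        yield record.split(field_separator)
```

With `record_separator=""` and `field_separator="\n"`, each address block becomes one record whose fields are its lines.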