Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dohliam/elements-of-a-in-b
Find matching strings in two columns using regular expressions
https://github.com/dohliam/elements-of-a-in-b
columns javascript matches regex regular-expression tiny-tools
Last synced: 9 days ago
JSON representation
Find matching strings in two columns using regular expressions
- Host: GitHub
- URL: https://github.com/dohliam/elements-of-a-in-b
- Owner: dohliam
- License: mit
- Created: 2016-10-24T12:52:33.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2020-07-08T22:01:53.000Z (over 4 years ago)
- Last Synced: 2024-11-27T20:33:46.337Z (2 months ago)
- Topics: columns, javascript, matches, regex, regular-expression, tiny-tools
- Language: JavaScript
- Homepage: https://dohliam.github.io/tiny_tools/elements/
- Size: 6.84 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Elements of _a_ in _b_ - Find matching strings in two columns using regex
This is a small JavaScript tool to solve a very specific problem: Imagine you have two columns of data, and potentially duplicate substrings that may exist in both columns. How do you quickly identify which of the elements (substrings) of _column A_ are also included in _column B_?
Although this might seem like a fairly niche problem, it actually comes up all the time in various forms. Generally, the solution involves writing a quick throwaway script in Ruby, Python, Perl, or some other scripting language. Depending on the nature of the data, it may also be solvable with clever piping of command-line tools like `sed`, `sort`, and `uniq`.
Nevertheless, being able to quickly find matching subsets of data without needing to write a custom script each time can be a great time-saver for situations where this kind of thing comes up frequently.
## Supported features
* Live search (results update as you type)
* Works offline (just clone or download this repository and open `index.html` in any browser)
* [Online demo](https://dohliam.github.io/tiny_tools/elements/) available
* Group capturing using parentheses
* Display total number of matches found
* Dictionary mode (see [below](#dictionary-mode))## Usage
Enter some data in box **a** and box **b**. By default, values will appear in the **result** box as you type if there are sequences of numerals (`/\d\d+/`) that appear anywhere in both box _a_ and box _b_.
For example, if you input the following into box _a_:
foo foo 54 bar bar
bar bar 96 foo foo
foo foo 21 bar barAnd the following into box _b_:
234
abc
96
foo barThe result will show as `96`, because this sequence of numerals occurs in both column _a_ and column _b_.
The sequence matched in both columns is easily configurable, and does _not_ have to be the same. The examples below demonstrate the flexibility possible using capture groups, character classes, metacharacters and other regex features.
### Grouping
Use rounded brackets `()` to isolate capture groups. Characters outside of the capture group will be ignored.
This can be useful for data that is separated by regular delimiters, for example tabs, commas or other characters.
For example, input the following into column _a_:
abc@def#ghi
jkl@mno#pqr
stu@vwx#yzaAnd the following in column _b_:
abc
def
ghi
jkl
mno
pqr
stu
vwx
yzaFor **Regex _a_** enter:
#([a-z]+)
And for **Regex _b_** enter:
[a-z]+
This gives the result:
ghi
pqr
yzaThis is possible because the regular expression `/#([a-z]+)/` matched the strings of letters (`/[a-z]+/`) following a hash/pound/number symbol (`/#/`) in column _a_, but only _captured_ the group of letters in each line (without the `#` symbol) for the purposes of matching with the data in column _b_.
When capturing groups, anything outside of the parentheses is ignored. If _Regex a_ had been `@([a-z]+)` instead (try it!), the result would be:
def
mno
vwxThis is because the regular expression `/@([a-z]+)/` matches the "middle column" of data in box _a_, and matches it against the text in box _b_.
To match any sequence of characters in column _b_ (not just letters), use `.*` instead of `[a-z]+`.
If the data in either column has been separated by tabs, the metacharacter `\t` can be used to match it.
For example, given the following data in column _a_:
abc 123 def
ghi 456 jkl
mno 789 pqr
stu 101 vwxAnd the following data in column _b_:
980
765
432
123
987
765Enter `\t(.*)\t` for _Regex a_, and `.*` for _Regex b_.
The result will be `123`, because only sequences surrounded by tabs were matched in column _a_.
Note: An easier way to approach the above example in particular might be to simply use `\d+` for _Regex a_, which will match all sequences of digits (which in this case happen to only occur in the middle column).
### Dictionary mode
Clicking on the checkbox at the bottom of the page enables the optional _dictionary mode_. This mode is meant to handle a specific common subset of problems involving elements in a list (column _b_) that match keys in a key-value pair database (column _a_). You can think of it like a simple hash table or dictionary lookup.
Key features of dictionary mode:
* All elements in the list remain in order
* All keys are printed in the result, even if they do not have a matching value
* Keys and values are returned together in the result
* The delimiter (default `TAB`) can be changed arbitrarily
* Duplicate values are retained in the resultIn its simplest form, the dictionary in column _a_ is a two-column list of values separated by tab spaces, for example:
apple manzana
apricot albaricoque
banana plátano
peach melocotón
pear pera
plum ciruelaIf you then enter the following into column _b_:
pear
orange
plum
apple
pear
plum
kiwiYou would get the following result:
pear pera
orange
plum ciruela
apple manzana
pear pera
plum ciruela
kiwi(Note the blank values for `orange` and `kiwi`, which were not in the original dictionary list.)
By default, the delimiter for the input (the dictionary data in column _a_) and the output (the result in column _b_) is set to a `TAB` stop (`\t`). To change the delimiter, adjust the values in the **Input delimiter** and **Output delimiter** boxes.
## See also
"Elements of _a_ in _b_" is part of the [**tiny tools**](https://dohliam.github.io/tiny_tools/) series.
Other tools for working with columns of data that might also be of interest:
* [Sum columns](https://github.com/dohliam/sum-columns)
* [Compare columns](https://github.com/dohliam/compare-columns)
* [Sort columns](https://github.com/dohliam/sort-columns)## Credits
* [milligram](https://github.com/milligram/milligram) CSS by @cjpatoilo, prototyped using [dropin-minimal-css](https://github.com/dohliam/dropin-minimal-css)
* [github-corners](https://github.com/tholman/github-corners) by @tholman## License
MIT.