An open API service indexing awesome lists of open source software.

https://github.com/fpopic/master-indexers

(Class) C++ script to preprocess input files (mapping ids to indices and vice versa) on a single core machine.
https://github.com/fpopic/master-indexers

cpp id-to-index indexing-engine recommender-system

Last synced: 3 months ago
JSON representation

(Class) C++ script to preprocess input files (mapping ids to indices and vice versa) on a single core machine.

Awesome Lists containing this project

README

        

#### Usage:

Used to preprocess input files on a single machine (because it constructs shared lookup table,
therefore synchronisation and locking for mutable collection are needed on a multicore-distributed environment).

It maps ids to indices (e.g. userId => userIndex, itemId => itemIndex) and saves results to a file with it's lookup table.

Indices are prefered over ids because linear algebra libraries operate with indices.

Scala time > 45 min (only for item-item matrix)\
C++ time < 5 min (for both matrices)

Unindexer is used to revert indices to ids after the recommendations are computed.

#### Compile:

using g++:

```
g++ -static -std=c++11 -O3 indexer.cpp -o indexer
g++ -static -std=c++11 -O3 unindexer.cpp -o unindexer
```

or with cmake:

```
mkdir cmake-build-debug
cmake --build cmake-build-debug --target indexer -- -j 4
cmake --build cmake-build-debug --target unindexer -- -j 4
```

#### Run indexer:

```./indexer ```

Output files will be in the same folder as input with .indexed and .lookup suffix.

#### Output:
```
Header: itemId1,itemId2,a,b,c,d
Processed lines:1000000
...
Processed lines:71000000
Output file with 71930771 lines saved.
Lookup file with 67052 entries saved.
Time: 166s

Header: userId,date,itemId,quantity
Processed lines:1000000
...
Processed lines:29000000
Output file with 29154707 lines saved.
Lookup file with 20998 entries saved.
Time: 50s

Process finished with exit code 0
```

#### Run unindexer:

```./unindexer ```

file must be formated like:

```"userId:itemId1,itemId2,...,itemIdX"```

Output files will be in the same folder as input with .unindexed suffix formated like:

```"userIndex:itemIndex1,itemIndex2,...,itemIndexX"```

where k is the number of top k recommendations for a user, so 0 <= X <= k

#### Output:
```
Processed lines:1000
...
Processed lines:2000
Output file with 21225 lines saved.
Time: 4s

Process finished with exit code 0
```