* Introduction
The wikipedia_solr system is a locally installed full text index of
the wikipedia. It allows fast searches through the entire wikipedia.

This document has four main sections.
1. Installation details the steps for a basic install.
2. Running wikipedia_solr goes over the core commands to run the
   system, gives instructions for common scenarios, and lists
   configuration options.
3. Querying details how other programs can query solr.
4. Implementation notes describes how the system was built.

A running solr installation depends on a few pieces. It needs a
wikipedia dump of articles; that dump is indexed by solr and written
to an index, the data structure solr stores documents in for fast
retrieval. Once solr has finished indexing, it is ready to respond to
queries. The solr server needs to be running both to index documents
and to query the index.


* Installation
** Prerequisites

*** Java 6
Java 6 is installed by default on OS X 10.6 and above.
*** Apache Ant installed
[[http://ant.apache.org/manual/install.html][The Apache Ant official installation guide]]
Note: installing NetBeans is a quick shortcut that ensures java and ant are installed properly.

*** git command line tools installed
The git command line tools must be installed on your machine, along
with ssh key access to the github repo. You can read more about
setting up ssh keys here:

[[http://help.github.com/set-up-git-redirect/]]

*** curl
[[http://en.wikipedia.org/wiki/CURL]] needs to be installed and working
properly on your system. Mac OS X has a copy of curl by default.

You can test by running the following command in a shell.
#+BEGIN_SRC shell
curl http://google.com
#+END_SRC
You should see the HTML of Google's landing page.

*** ubuntu apt-get
On ubuntu, running

#+BEGIN_SRC shell
sudo apt-get install ant git openjdk-6-jdk curl
#+END_SRC
will install all of the requirements.

*** Space requirements
The source code and compiled files take about 700 MB.
A minimal index of only the first 1,000 articles takes about
50-60 MB.

The complete index takes 16 GB when finished, and up to 34 GB while the
indexing process is running.

The complete wikipedia data dump file is 8 GB.

The greatest amount of space that this system could take on your
system is

43 GB = 8 (dump) + 34 (index while building) + 1 (code)

You can store all three components on separate drives. The repository
root must be on a locally mounted drive; the system will fail if it is
on a network share.

** Installation Instructions
Run the one-step installer at a terminal by copying and pasting
#+BEGIN_SRC shell
nohup curl https://raw.github.com/paddymul/wikipedia_solr/master/apps/update/onestep_install.sh | sh > onestep.out &
#+END_SRC
You can read the one step installer commands [[https://raw.github.com/paddymul/wikipedia_solr/master/apps/update/onestep_install.sh][here]].
Here is approximately the expected [[https://raw.github.com/paddymul/wikipedia_solr/master/onestep.out][onestep installer output]].

After the installer has run take a look at [[https://github.com/paddymul/wikipedia_solr#i-have-successfully-downloaded-the-data-dump-what-next][ I have successfully downloaded the data dump. What next?]]

If you want to customize install locations read [[https://github.com/paddymul/wikipedia_solr/blob/master/readme.org#how-do-i-control-where-various-files-are-stored][how do I control where
various files are stored]].

Once some articles have been indexed, you can [[http://localhost:8983/solr/select/?q=articlePlainText%3A%22american%22&version=2.2&start=0&rows=1000&indent=on&wt=json][search local solr for american]] to confirm that everything is working.

* Running wikipedia_solr

** Apps
There is a set of apps that lets you perform most regular
maintenance tasks. These are the primitives that the Common Workflows
section is built on. After the initial install steps, all of them can
be found in $GC_WIKIPEDIA_REPO_ROOT/apps .

Every app opens a terminal window and prints "$app_name:STARTING"
when it begins; when the app has finished running,
"$app_name:FINISHED" is written to the same terminal window.

*** start_solr.app
This app starts the solr server and leaves a terminal window showing
the current server log. Do not close this window, because doing so
will kill the solr server. Any code changes or file location changes
(the variables specified in ~/.profile) require a restart of the
server before they are recognized. The solr server needs to be
running to index documents and to perform queries.

To verify that the solr server is running properly go to this url with
your browser [[http://localhost:8983/solr/admin]].
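
If you prefer to check from a script, the following minimal sketch
(plain standard-library python, not part of the repository) asks the
select handler for a document count; it assumes the default port 8983.

#+BEGIN_SRC py
# a quick liveness check against the local solr server (assumes the default port 8983)
import json
from urllib.request import urlopen

URL = "http://localhost:8983/solr/select/?q=*%3A*&rows=0&wt=json"

try:
    with urlopen(URL, timeout=5) as resp:
        body = json.load(resp)
    print("solr is up, %d documents indexed" % body["response"]["numFound"])
except OSError as exc:
    print("solr does not appear to be running: %s" % exc)
#+END_SRC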

*** update_all.app
This app pulls all the latest code and recompiles every module
necessary for the system.

*** download_dump.app
This app downloads the latest wikipedia dump into your
$GC_WIKIPEDIA_DL_DIR . A wikipedia dump is necessary to
index articles.

*** make_helper_files.app
This script makes the smaller sections of the initial dump that allow
faster testing of indexing: a 1,000, a 10,000, and a 100,000 article
dump file. While it is running, bunzip2 will take around 80-100% of a
single core, and three copies of tee and three copies of grep will
also be running.

This script is run automatically by download_dump.app, but sometimes
it is useful to run it on its own, for instance if you downloaded the
article dump separately.

This app takes about an hour to run.

*** Indexing apps
There are four indexing apps:
index_1_000.app, index_10_000.app, index_100_000.app and index_all.app.
The first three index the first 1,000, 10,000, or 100,000 articles
from the data dump; index_all.app pulls in everything in the data dump file.

Running any one of the Indexing apps blows away the current index. If
you want to preserve the current index you might want to change your
GC_WIKIPEDIA_SOLR_DATA_DIR location to keep an existing large
index as a backup.

**** Sample output from a successful run
#+BEGIN_SRC shell
Last login: Sun Apr 29 22:01:11 on ttys003
/Users/patrickmullen/code/wikipedia_solr/apps/reindex/index_1_000 ; exit;
You have mail.
bash_profile
Macintosh-5:~ patrickmullen$ /Users/patrickmullen/code/wikipedia_solr/apps/reindex/index_1_000 ; exit;
index_1_000:STARTING
mkdir: /Users/patrickmullen/code/wikipedia_solr/named_pipes: File exists
linking /Volumes/LaCie_1/data/data_wikipedia//first_1_000.xml to /Users/patrickmullen/code/wikipedia_solr/named_pipes/import.xml
/Volumes/LaCie_1/data/data_wikipedia//first_1_000.xml

01./data-config.xml/Users/patrickmullen/code/wikipedia_solr/solr_home/solr/Users/patrickmullen/code/wikipedia_solr/named_pipes/import.xmlfull-importidle010001982012-04-29 22:01:11Indexing completed. Added/Updated: 802 documents. Deleted 0 documents.2012-04-29 22:01:422012-04-29 22:01:428020:0:31.835This response format is experimental. It is likely to change in the future.

the index has been triggered, but not necessarily completed
read the indexing_apps section of the readme
---------
https://github.com/paddymul/wikipedia_solr/blob/master/readme.org#progress-tracking
---------
the solr_server should take aproximately 40 seconds to complete this index
index_1_000:FINISHED
logout

[Process completed]

#+END_SRC

**** estimated run times
The indexing apps only trigger a reindexing by the solr server; the
apps themselves finish very quickly. You can read more about
following the server's progress in the progress tracking section below.
Here are the time estimates for the whole indexing:
index_1_000: 40 seconds
index_10_000: 4 minutes
index_100_000: 40 minutes
index_all: 12 hours

**** progress tracking
When a large import is running, your machine should be running at
about 100% cpu for a java process, and 10-20% for a bunzip2 process.

You can track the import progress at this url
[[http://localhost:8983/solr/admin/stats.jsp]]

Search that page for "Rows Fetched". Reload as necessary to watch
this number increase; it counts the total number of documents that the
import process has seen. The page also counts "Documents Skipped":
documents are skipped as part of the import process because they would
pollute the index with poor results, and this is perfectly normal. The
"Documents Processed" field lists the total number actually put into
the index.

The full wikipedia dump has about 12 million rows in it; at the last
run on 3/6/2012, 5,785,453 documents were in the index after a full
import.

Note: when running a full import, the "Rows Fetched" count will reach
12 million, and then the system will take approximately 1 more hour
compacting and optimizing indexes before you can search against the
new data; this is normal. The index is not finished until the status
reads "IDLE" again.
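
If you would rather watch progress from a script than keep reloading
the page, a rough sketch along these lines can poll the stats page for
the counter. The regex is a best-effort guess at the stats.jsp markup,
which varies between solr versions, so treat this as illustrative only.

#+BEGIN_SRC py
# poll the solr stats page and print the "Rows Fetched" counter once a minute
import re
import time
from urllib.request import urlopen

STATS_URL = "http://localhost:8983/solr/admin/stats.jsp"

while True:
    page = urlopen(STATS_URL).read().decode("utf-8", "replace")
    match = re.search(r"Rows Fetched[^0-9]*([\d,]+)", page)
    print(match.group(1) if match else "no 'Rows Fetched' counter found")
    time.sleep(60)  # roughly equivalent to reloading the page by hand
#+END_SRC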

** Common Workflows

*** I want to run a new index, but not lose an existing index.
Solr stores indexes in the path controlled by the
GC_WIKIPEDIA_SOLR_DATA_DIR environment variable.

edit the line of your ~/.profile that looks like this
#+BEGIN_SRC shell
export GC_WIKIPEDIA_SOLR_DATA_DIR="/Volumes/LaCie_1/data/index_wikipedia"
#+END_SRC
to point to another location. Changes to this location require a
restart of the server to be recognized.

*** I want to reindex the entire wikipedia with newer parsing code
First close the existing start_solr window; solr has to be
restarted to use the new code.

The apps have the same names as the scripts below, except that they
end in ".app" and are not in sub-directories; to run them,
double-click the .app in Finder.

From the repo root, run the following scripts (or the corresponding apps):

1: Pull the most recent code and compile it.
#+BEGIN_SRC shell
./update/update_all
#+END_SRC
2: Start the server so that it is reading the most recent codebase.
#+BEGIN_SRC shell
./start_solr/start_solr
#+END_SRC
3: Next kickoff the reindex.
#+BEGIN_SRC shell
./reindex/index_all
#+END_SRC

*** I have successfully downloaded the data dump. What next?
1st, you need to start solr. Run the app or script
apps/start_solr.app .

2nd, go to this url with a browser: [[http://localhost:8983/solr/admin/]]

3rd, kick off an indexing run: run apps/reindex/index_1_000 .

4th, when the import has finished running, run a query. Go to this url
[[http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&wt=json]]
That url returns the first 10 documents of everything in your index;
there should be some results. A scripted version of this check is
sketched below.
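
The sketch below (plain standard-library python, not part of the
repository) performs the same check from the command line. It assumes
the title field is stored, as the parser debugging queries later in
this document suggest.

#+BEGIN_SRC py
# sanity check after index_1_000: confirm that documents landed in the index
import json
from urllib.request import urlopen

URL = "http://localhost:8983/solr/select/?q=*%3A*&start=0&rows=10&wt=json"

with urlopen(URL) as resp:
    result = json.load(resp)

# the 1,000 article dump produced 802 documents in the sample run above
print("numFound:", result["response"]["numFound"])
for doc in result["response"]["docs"]:
    print(doc.get("title"))
#+END_SRC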

*** I'm getting mkfifo errors when running the download_dump_wrap.app
OS X doesn't support creating named pipes on network mapped
drives. [[http://caml.inria.fr/pub/docs/manual-ocaml/libref/Unix.html]]

Named pipes are key to the way the whole import system works. They
let the system read the compressed dump into solr, without ever
expanding the whole thing on disk.

It is necessary to have at least the $GC_WIKIPEDIA_REPO_ROOT on a
locally mounted disk.
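
If you are unsure whether a drive is suitable, the short sketch below
(standard-library python, not part of the repository) tries to create
and remove a throwaway named pipe in a directory and reports whether
that worked.

#+BEGIN_SRC py
# check whether a directory supports named pipes before pointing
# GC_WIKIPEDIA_REPO_ROOT at it; os.mkfifo fails on most network mounts
import os

def supports_named_pipes(directory):
    path = os.path.join(directory, "fifo_test_%d" % os.getpid())
    try:
        os.mkfifo(path)
    except OSError as exc:
        print("%s: named pipes NOT supported (%s)" % (directory, exc))
        return False
    os.remove(path)
    print("%s: named pipes supported" % directory)
    return True

supports_named_pipes(os.environ.get("GC_WIKIPEDIA_REPO_ROOT", "."))
#+END_SRC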

*** I have already downloaded the wikipedia but I want to run make_helper_files
Sometimes you will want to rerun the make_helper_files script that
creates smaller sections of the initial dump, without rerunning the
whole download script.

In a terminal window, run these commands:
#+BEGIN_SRC shell
source ~/.profile
$GC_WIKIPEDIA_REPO_ROOT/apps/update/make_helper_files
#+END_SRC

*** I'm getting LockObtainFailedException 500 errors.
This error shouldn't occur anymore; it used to happen because solr had
trouble creating lock files on network drives. Solr is now configured
to use the NoLockFactory, so this shouldn't be an issue. You can
read more about lock factories here: [[http://wiki.apache.org/lucene-java/AvailableLockFactories]]

*** How do I find oddly parsed articles?

These queries are quite helpful:
[[http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=articlePlainTextCount%2CwikimediaMarkupCount%2CmarkupPlainRatio%2Ctitle%2Cid%2Cscore&qt=&wt=&explainOther=&hl.fl=&sort=markupPlainRatio%20asc][markup ratio ascending]]
[[http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=articlePlainTextCount%2CwikimediaMarkupCount%2CmarkupPlainRatio%2Ctitle%2Cid%2Cscore&qt=&wt=&explainOther=&hl.fl=&sort=markupPlainRatio%20desc][markup ratio descending]]

[[http://localhost:8983/solr/select/?q=error%3Afalse&stats=true&stats.field=markupPlainRatio&stats.facet&fl=markupPlainRatio&version=2.2&start=0&rows=10&indent=on][average of markupPlainRatio]]

** system setup

*** how do I control where various files are stored
The wikipedia_solr system reads your ~/.profile for three locations.

**** GC_WIKIPEDIA_DL_DIR
This is the directory in which the system expects to find the raw data
dump files. This directory grows to about 9 GB with a full download
and helper files.
#+BEGIN_SRC shell
export GC_WIKIPEDIA_DL_DIR="/Volumes/LaCie_1/data/data_wikipedia"
#+END_SRC

**** GC_WIKIPEDIA_SOLR_DATA_DIR
This is where solr looks for an index; when indexing, solr writes the
index to this directory. For a full index of the wikipedia it should
be about 16 GB, and it can grow to 34 GB during indexing.
#+BEGIN_SRC shell
export GC_WIKIPEDIA_SOLR_DATA_DIR="/Volumes/LaCie_1/data/index_wikipedia/"
#+END_SRC

**** GC_WIKIPEDIA_REPO_ROOT
This is the directory where source code and compiled files for the
wikipedia_solr system are stored. This directory should take less
than 1 GB of storage.
#+BEGIN_SRC shell
export GC_WIKIPEDIA_REPO_ROOT="/Users/patrickmullen/code/wikipedia_solr"
#+END_SRC

* Querying

The solr system is queried over HTTP; results can be returned in JSON
or XML format. All examples are given using the JSON format.

** Official documentation

[[http://wiki.apache.org/solr/CommonQueryParameters]]

** Breakdown of a query url
http://localhost:8983/solr/select/?q=articlePlainText%3A%22american%22&version=2.2&start=0&rows=1000&indent=on&wt=json

*** q parameter
The q parameter is the actual query; un-urlescaped, this query looks
like articlePlainText:"american" .

This tells solr to search the 'articlePlainText' field of the entire
database for the term american.

*** version parameter
The 'version' parameter is of unknown consequence; use a value of 2.2
for continuity.

*** start parameter
The start parameter controls the first row of the result set to be
returned.

*** rows parameter
The rows parameter controls how many documents (at most) to return
after the start document.

*** indent parameter
The indent=on causes solr to pretty print the result.

*** wt parameter
The wt=json causes solr to return the result in json format.
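
Putting these pieces together, the minimal sketch below (plain
standard-library python, not part of the repository) rebuilds the
example url from the individual parameters described above.

#+BEGIN_SRC py
# rebuild the example query url from its parameters;
# urlencode handles the escaping of the q value
from urllib.parse import urlencode

params = {
    "q": 'articlePlainText:"american"',  # the actual query
    "version": "2.2",                    # kept for continuity
    "start": 0,                          # first row of the result set to return
    "rows": 1000,                        # maximum number of documents to return
    "indent": "on",                      # pretty print the result
    "wt": "json",                        # return json rather than xml
}

print("http://localhost:8983/solr/select/?" + urlencode(params))
#+END_SRC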

** Interactive tour of query formation with solr
*** complex queries - phrases ANDs ORs NOTs

Take a look at [[https://github.com/paddymul/wikipedia_solr/blob/master/py/query_demo.py][py/query_demo.py]] to see this as a running program.

Note that qp takes the un-urlencoded q parameter as input, executes
the query, and prints some simple stats about it, including the
completely formed url. It returns the total number of documents found
for that query.

Triple quotes are a python convention for encoding multiline strings
or quote-containing strings. The value of a triple-quoted string is
everything between the opening and closing triple quotes.

A string preceded by a 'u' is a unicode string; for ascii-only
sequences it can be thought of as an ordinary string.

The leading and trailing spaces in the queries are there only for
readability.

assert is a python statement that throws an error when it is given a
false value; none of the asserts in this tour throw an error.
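
The real qp helper lives in py/query_demo.py; the rough, standalone
approximation below (standard-library python only) shows the general
shape of such a helper, under the assumption that the default response
header echoes the request parameters.

#+BEGIN_SRC py
# a rough approximation of the qp helper described above
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def qp(query):
    """Run an un-urlencoded q parameter against the local solr, print some
    simple stats, and return the total number of documents found."""
    params = {"q": query.strip(), "start": 0, "rows": 10,
              "indent": "on", "wt": "json"}
    url = "http://localhost:8983/solr/select/?" + urlencode(params)
    result = json.load(urlopen(url))
    print("solr url ", url)
    print("QTime    ", result["responseHeader"]["QTime"])
    print("params   ", result["responseHeader"].get("params"))
    print("numFound ", result["response"]["numFound"])
    return result["response"]["numFound"]
#+END_SRC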

**** search for american with quotes surrounding
#+BEGIN_SRC py
american = qp(''' articlePlainText:"american" ''')
#+END_SRC

|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3A%22american%22&start=0&rows=10&indent=on&wt=json]]|
|QTime|1|
|params|{u'q': u'articlePlainText:"american"', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|619399|

**** search for american without surrounding quotes
#+BEGIN_SRC py
american_no_quote = qp(''' articlePlainText:american ''')
#+END_SRC
Note: for single terms, we get the same number of documents back
whether or not "american" is quoted.

| solr url | [[http://localhost:8983/solr/select/?q=articlePlainText%3Aamerican&start=0&rows=10&indent=on&wt=json]] |
| QTime | 0 |
| params | {u'q': u'articlePlainText:american', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'} |
| numFound | 619399 |

**** search for american without leading/trailing space
#+BEGIN_SRC py
american_no_trail = qp('''articlePlainText:american''')
#+END_SRC

|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3Aamerican&start=0&rows=10&indent=on&wt=json]]|
|QTime|0|
|params|{u'q': u'articlePlainText:american', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|619399|

**** syntax verification
#+BEGIN_SRC py
assert american == american_no_quote
assert american_no_trail == american_no_quote
#+END_SRC

**** search for 'samoa' get 4755 docs
#+BEGIN_SRC py
samoa = qp(''' articlePlainText:samoa ''')
#+END_SRC
|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3Asamoa&start=0&rows=10&indent=on&wt=json]]|
|QTime|186|
|params|{u'q': u'articlePlainText:samoa', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|4755|

**** search for 'american' or 'samoa' get 621,927 docs
#+BEGIN_SRC py
american_or_samoa = qp(''' articlePlainText:american OR _query_:"articlePlainText:samoa" ''')
#+END_SRC
|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3Aamerican+OR+_query_%3A%22articlePlainText%3Asamoa%22&start=0&rows=10&indent=on&wt=json]]|
|QTime|191|
|params|{u'q': u'articlePlainText:american OR _query_:"articlePlainText:samoa"', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|621,927|

**** search for documents containing 'american' and 'samoa' -> 2227
#+BEGIN_SRC py
american_and_samoa = qp(''' articlePlainText:american AND _query_:"articlePlainText:samoa" ''')
#+END_SRC
|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3Aamerican+AND+_query_%3A%22articlePlainText%3Asamoa%22&start=0&rows=10&indent=on&wt=json]]|
|QTime|183|
|params|{u'q': u'articlePlainText:american AND _query_:"articlePlainText:samoa"', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|2227|

**** search for docs containing 'samoa' but not containing 'american' -> 2528
#+BEGIN_SRC py
samoa_not_american = qp(''' articlePlainText:samoa NOT _query_:"articlePlainText:american" ''')
#+END_SRC
|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3Asamoa+NOT+_query_%3A%22articlePlainText%3Aamerican%22&start=0&rows=10&indent=on&wt=json]]|
|QTime|46|
|params|{u'q': u'articlePlainText:samoa NOT _query_:"articlePlainText:american"', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|2528|

**** search for the phrase "american samoa" -> 1397
#+BEGIN_SRC py
american_samoa_phrase = qp(''' articlePlainText:"american samoa" ''')
#+END_SRC

|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3A%22american+samoa%22&start=0&rows=10&indent=on&wt=json]]|
|QTime|1|
|params|{u'q': u'articlePlainText:"american samoa"', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|1397|

**** proof of system consistency
#+BEGIN_SRC py
assert american_or_samoa == (american + samoa_not_american)
assert 621927 == (619399 + 2528)

assert american_and_samoa >= american_samoa_phrase
assert 2227 >= 1397
#+END_SRC

**** double phrase AND query
#+BEGIN_SRC py
a = qp(''' articlePlainText:"american samoa" AND _query_:"articlePlainText:'manifest destiny'" ''')
#+END_SRC

|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3A%22american+samoa%22+AND+_query_%3A%22articlePlainText%3A%27manifest+destiny%27%22&start=0&rows=10&indent=on&wt=json]]|
|QTime|199|
|params|{u'q': u'articlePlainText:"american samoa" AND _query_:"articlePlainText:\'manifest destiny\'"', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|4 |

**** ambiguous syntax
Note: the syntax of the following query is unclear and I can't
decipher the results. Don't issue queries like this; the results are
undefined and unsupported by me.
#+BEGIN_SRC py
a = qp('''articlePlainText:american samoa''')
#+END_SRC
|solr url|[[http://localhost:8983/solr/select/?q=articlePlainText%3Aamerican+samoa&start=0&rows=10&indent=on&wt=json]]|
|QTime|73|
|params|{u'q': u'articlePlainText:american samoa', u'start': u'0', u'wt': u'json', u'indent': u'on', u'rows': u'10'}|
|numFound|619,399|

** Additional query formation resources
If you want more information about solr query syntax, try these resources:

nested queries in solr
[[http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/]]

the solr-wiki page, not actually that helpful
[[http://wiki.apache.org/solr/SolrQuerySyntax]]

* Implementation notes
These notes are meant as a guide for a future maintainer of the
solr/java search system. They assume a knowledge of solr, java, and
common development practices.
** Overview
This wikipedia search system uses solr [[http://wiki.apache.org/solr/]]
and the jwpl wikimedia markup parsing library
[[http://code.google.com/p/jwpl/]].

I used the DataImportHandler framework to import the XML wikipedia
dump. I wrote a custom transformer that integrates into the
DataImportHandler framework; this transformer calls the jwpl parsing
library to extract the article text from the wikimedia markup.

I modified solr in two places. First, I changed the file reader so
that it can read from a named pipe. This lets us keep the article
dump compressed on disk, which means faster I/O and less disk usage.

I also modified the xml reader so that it doesn't kill an entire
import if there is a missing xml tag. This extra fault tolerance
ensures that hours of work aren't lost. Wikipedia article dumps are
of the format described in [[http://www.mediawiki.org/xml/export-0.5.xsd]].
The downloaded dumps seem to be missing the final closing
tag. We could compare md5sums if we are worried about
integrity.

*** Custom code
The custom code I wrote for this project can be found in the following files.

**** transformer
[[https://github.com/paddymul/wikipedia_solr/blob/master/solr_home/Wikipedia_importer/wikipedia_solr/src/wikipedia_solr/WikimediaToTextTransformer.java][solr_home/Wikipedia_importer/wikipedia_solr/src/wikipedia_solr/WikimediaToTextTransformer.java]]

**** named pipe file reader
[[https://github.com/paddymul/lucene-solr/blob/06a176316bba15bf6967c87d3799ef743067e972/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/FileDataSource.java][
lib/solr/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/FileDataSource.java]]

**** tolerant xml reader
[[https://github.com/paddymul/lucene-solr/blob/06a176316bba15bf6967c87d3799ef743067e972/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/XPathEntityProcessor.java][
lib/solr/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/XPathEntityProcessor.java]]

** Solr configuration
*** Schema configuration

The solr [[http://wiki.apache.org/solr/SchemaXml][schema.xml]]
for this project can be found at
[[https://github.com/paddymul/wikipedia_solr/blob/master/solr_home/solr/conf/schema.xml][solr_home/solr/conf/schema.xml]].

**** field explanation
[[http://wiki.apache.org/solr/SchemaXml#Fields]]

Each field can have one or more flags applied to it:
***** stored
A stored field has its original value saved by lucene.
***** indexed
An indexed field can be searched against.
**** wikipedia_solr schema

The schema controls which fields are stored and indexed. We have a
very simple schema, with only three relevant fields: title,
articlePlainText and sectionParsed.

***** articlePlainText
articlePlainText is the field that is searched on; it is an indexed
version of the plaintext of each wikipedia article. It isn't stored,
since the plaintext on its own isn't that useful.
***** sectionParsed
This field is stored, but not indexed. It is a json string
([[http://www.json.org/]]) of article sections, in the form

[{"section_name": ["paragraph1", "paragraph2"]}, {"another section title": ["paragraph1", "paragraph2"]}]
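
Reading the field back takes a second json decode after the response
itself has been parsed. The short sketch below uses the example value
above (field names as described in this section).

#+BEGIN_SRC py
# decode a sectionParsed value: a json string of {section title: [paragraphs]} maps
import json

section_parsed = ('[{"section_name": ["paragraph1", "paragraph2"]},'
                  ' {"another section title": ["paragraph1", "paragraph2"]}]')

for section in json.loads(section_parsed):
    for title, paragraphs in section.items():
        print(title, "-", len(paragraphs), "paragraphs")
#+END_SRC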

*** solr-config.xml

The [[http://wiki.apache.org/solr/SolrConfigXml][solr-config]] for this project can be found at
[[https://github.com/paddymul/wikipedia_solr/blob/master/solr_home/solr/conf/solrconfig.xml][solr_home/solr/conf/solrconfig.xml]].

It stays pretty close to the example config, except for additional
java properties that it reads, which allow the system to be more
easily configured.

*** Maintenance scripts

There are a variety of maintenance scripts that can be found in apps/* ;
they are explained in the Apps section of Running wikipedia_solr above.

** Parser debugging
Several fields have been added to make debugging parsing errors easier.

=wikimediaMarkupCount= stores the count of characters in the original version of the text.
=articlePlainTextCount= stores the count of characters in the plaintext version of the text.
=markupPlainRatio= stores articlePlainTextCount / wikimediaMarkupCount .

There are additional fields for java error debugging:
=error= When the parsing code detects an error, this field has the
value "true"; normally it is "false".
=exception= When there is an exception, this field stores the name
of that exception.
=stackTrace= This field stores the stacktrace of an exception.
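
For example, the query sketched below (standard-library python, not
part of the repository) pulls back a handful of errored documents with
their exception names, assuming those fields are stored.

#+BEGIN_SRC py
# list documents the parser flagged as errors, with their exception names
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = {
    "q": "error:true",                    # only documents that failed to parse
    "fl": "title,exception,stackTrace",   # assumes these fields are stored
    "rows": 10,
    "wt": "json",
}
url = "http://localhost:8983/solr/select/?" + urlencode(params)

for doc in json.load(urlopen(url))["response"]["docs"]:
    print(doc.get("title"), "->", doc.get("exception"))
#+END_SRC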

*** Helpful queries

The [[http://localhost:8983/solr/select?indent=on&version=2.2&q=error%3Afalse&fq=&start=0&rows=500&fl=articlePlainTextCount%2CwikimediaMarkupCount%2CmarkupPlainRatio%2Ctitle%2Cid%2Cscore&qt=&explainOther=&hl.fl=&sort=markupPlainRatio%20asc&wt=json][markup ratio ascending]] query is very useful for finding examples
of documents that have a lot of wikimediaMarkup and very little
articlePlainText. Note that this query excludes documents that have
errors; errored articles rarely have any articlePlainText, so they are
of no use here.

[[http://localhost:8983/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=articlePlainTextCount%2CwikimediaMarkupCount%2CmarkupPlainRatio%2Ctitle%2Cid%2Cscore&qt=&wt=&explainOther=&hl.fl=&sort=markupPlainRatio%20desc][markup ratio descending]]

The [[http://localhost:8983/solr/select/?q=error%3Afalse&stats=true&stats.field=markupPlainRatio&stats.facet&fl=markupPlainRatio&version=2.2&start=0&rows=10&indent=on][average of markupPlainRatio]] (you can read more about solr stats
queries here [[http://wiki.apache.org/solr/StatsComponent][solr stats component]]) is very useful for determining the
effect of a change in the parsing code over a large quantity of
documents. My assumption is that as parsing improves, the standard
deviation of the markupPlainRatio will get smaller.

*** testing tools
There is a work-in-progress python testing tool that pulls the
wikimediaMarkup for an anomalous document from the index and
stores it on the filesystem. This tool can be found in
=py/make_test_case.py=.

The main class of the wikipediaTransformer jar can now accept a
command line argument naming a file of wikimediaMarkup; it reads this
file and writes the parsed articlePlainText to stdout. Together, these
two tools allow quick iteration over problematic documents.