{"id":21047187,"url":"https://github.com/jsv4/atticusclassifier","last_synced_at":"2025-05-15T19:31:38.966Z","repository":{"id":130217290,"uuid":"325605661","full_name":"JSv4/AtticusClassifier","owner":"JSv4","description":"Trained BERT and Word2Vec legal clause classifiers for SPACY using the Atticus Project's Open Source Contract Label Corpus","archived":true,"fork":false,"pushed_at":"2021-01-02T03:07:27.000Z","size":404,"stargazers_count":14,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T20:45:30.255Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JSv4.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-30T17:09:09.000Z","updated_at":"2025-01-16T17:58:56.000Z","dependencies_parsed_at":"2023-04-16T00:32:35.434Z","dependency_job_id":null,"html_url":"https://github.com/JSv4/AtticusClassifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FAtticusClassifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FAtticusClassifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FAtticusClassifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FAtticusClassifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JSv4",
"download_url":"https://codeload.github.com/JSv4/AtticusClassifier/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254407377,"owners_count":22066228,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T14:35:52.135Z","updated_at":"2025-05-15T19:31:38.957Z","avatar_url":"https://github.com/JSv4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"******************************************\nAtticus Legal Clause Classifiers for Spacy\n******************************************\n\nIntroduction\n############\n\nThe `Atticus Project \u003chttps://www.atticusprojectai.org/\u003e`_ was recently announced as an initiative\nto, among other things, build a world-class corpus of labelled legal contracts which could be used\nto train and/or benchmark text classifiers and question-answering NLP models. Their initial release\ncontains 200 labelled contracts. 
I wanted to experiment with the dataset and build a working classifier\nthat I could use on contract data, so I set out to build a simple project to load the dataset, convert it\ninto a format that Spacy can read, and then train some classifiers to see how the dataset performs.\nThis repository contains the code I used to train classifiers based on 1) Word2Vec embeddings and 2)\na BERT-based transformer model.\n\nQuickstart - Use the Classifier\n###############################\n\nIf you are in a hurry to test out the classifiers and are not really interested in how they were trained,\nyou can currently install the classifier directly from a package I'm hosting on my AWS bucket by typing::\n\n    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz\n\nThis should install Spacy, Spacy-transformers, a BERT model and the classifiers. Once you've installed the\nmodel, you can use it like this::\n\n    import spacy\n\n    nlp = spacy.load('en_atticus_classifier_bert')\n\n    clause = \"\"\"The Joint Venturers shall maintain adequate books\n    and records to be kept of all the Joint Venture activities and affairs\n    conducted pursuant to the terms of this Agreement. All direct costs and\n    expenses, which shall include any insurance costs in connection with the\n    distribution of the Products or operations of the Joint Venture, or if the\n    business of the Joint Venture requires additional office facilities than\n    those now presently maintained by each Joint Venturer\"\"\"\n\n    cats = nlp(clause).cats  # Dict mapping each clause label to its score\n    cats = [label for label in cats if cats[label] \u003e .7]  # Optional: keep only labels with scores \u003e .7\n    print(cats)  # Show the categories\n\n\nAs discussed below, the performance of the model is good enough to be interesting,\nbut currently not good enough to really be production-ready. 
I *think* this is primarily\ndue to the dataset being relatively small and many clause categories having fewer than 20\nexamples. I wanted to release this as-is, however, so others could experiment. As the Atticus\nProject corpus grows, these classifiers should get better. In my experience, 50 - 100 examples\nper category is typically a good target to aim for, so doubling or tripling the Atticus corpus will\nhopefully lead to much, much better performance.\n\nBuild a Word2Vec-Based Model\n############################\n\nI first experimented with using Spacy's out-of-the-box Word2Vec models. This approach was very\nquick to train, but the performance was not very good. The F-score was about .6. I also\ntried using a different set of word embeddings released as \"Law2Vec\", and these improved\nperformance marginally to an F-score of ~.64. I've included the code to train these models\nin Word2VecModelBuilder.py. You can simply run that Python script. The default settings\nwill load Spacy's en_core_web_lg model and embeddings. You can also load the Law2Vec model\nif you download the vector file::\n\n    wget -O ~/Downloads/Law2Vec.200d.txt https://archive.org/download/Law2Vec/Law2Vec.200d.txt\n\nThen you can use Spacy to convert this file into a Spacy-compatible model like so::\n\n    mkdir /models\n    python -m spacy init-model en /models/Law2VecModel --vectors-loc ~/Downloads/Law2Vec.200d.txt\n\nThen you can change the model argument (per the example above) to '/models/Law2VecModel'.\nYou probably want to change the output_dir too. Once you've trained a new model, you can\nload the trained model with spacy.load(output_dir).\n\nTrain a BERT-based Model\n########################\n\nOverview\n  The transformer models encode a lot more contextual information about words than Word2Vec models,\n  so I wanted to see if I could squeeze more performance out of the dataset using BERT. The good\n  news is that performance increased substantially using a BERT-based model. 
This is still probably not good enough for use in production, but it's good\n  enough to yield some interesting insights, particularly if you set your score threshold very\n  high.\n\nTraining Results\n  Using a BERT-based model, the beta release of the Atticus training set yields\n  an acceptable (but still not really production-ready) F-score of .735::\n\n    LOSS \t  P  \t  R  \t  F\n    1.093\t0.739\t0.472\t0.576\n    1.960\t0.763\t0.566\t0.649\n    0.290\t0.756\t0.661\t0.706\n    0.985\t0.764\t0.683\t0.721\n    1.616\t0.770\t0.681\t0.723\n    0.517\t0.743\t0.673\t0.706\n    1.044\t0.754\t0.697\t0.724\n    0.127\t0.762\t0.728\t0.745\n    0.542\t0.748\t0.722\t0.735\n    0.946\t0.756\t0.722\t0.739\n    0.219\t0.751\t0.720\t0.735\n    0.551\t0.751\t0.720\t0.735\n\n  Training the BERT-based model takes a lot more computing power, and a CUDA-compatible\n  graphics card is absolutely recommended. Using an Nvidia 1050 Ti, the above training\n  took about three hours.\n\nStep 1 - Sign Up for Atticus Project Data and Download\n  I've included the Atticus CSV in the repository for convenience, but you should go to the\n  Atticus Project website and sign up there. For one, they would like to collect user and\n  contact info for people downloading their dataset. For another, you should go there to make\n  sure you get the latest version of their dataset.\n\nStep 2 - Install Python Dependencies and Spacy BERT Model\n  First, install Python dependencies (I'm using LexNLP to tokenize test data; you do not\n  need it to build the model)::\n\n    pip install lexnlp spacy spacy-transformers==0.5.2 pandas\n\n  Then, download the BERT transformer model::\n\n    python -m spacy download en_trf_bertbaseuncased_lg\n\nStep 3 - Load Atticus Data and Format for Spacy\n  The Atticus dataset is a CSV, so we can use Pandas to load and manipulate it. 
Since\n  we're training classifiers and not answering questions, we only care about the columns\n  containing text for a given classification. The columns with headers marked \"...-Answer\"\n  are meant for question-answering and we don't want to train on this data. We also don't\n  really want the filename column or the document title column, which are the first and\n  second columns respectively. The following function will load our Atticus CSV and filter\n  out the ...-Answer cols, the filename col and the document title col. Then, it will\n  format the data into Spacy's preferred training format. A second function splits the data into\n  two pieces - a training set and an evaluation set. The default is to split the total data\n  set so 80% is used for training and 20% is used for evaluation.\n\n  **Code**::\n\n        import random\n\n        import pandas as pd\n\n\n        def load_atticus_data(filepath='/tmp/aok_beta/Final Publication/master_clauses.csv'):\n            \"\"\"\n            Load data from the Atticus CSV (omitting the answer cols as we want to train classifiers,\n            not question-answering models).\n\n            Data is returned in the Spacy training format:\n                TRAIN_DATA = [\n                    (\"text1\", {\"cats\": {\"POSITIVE\": 1.0, \"NEGATIVE\": 0.0}})\n                ]\n\n            A list of headers is also returned so you can add these labels. 
FYI, the Filename and Doc name\n            columns are dropped as well.\n\n            \"\"\"\n\n            # Load the CSV\n            atticus_clauses_df = pd.read_csv(filepath)\n\n            # Keep only the clause-text columns by skipping the \"...-Answer\" columns\n            data_headers = [h for h in atticus_clauses_df.columns if \"Answer\" not in h]\n            data_headers.pop(0)  # Drop filename col (index 0 for col 1)\n            data_headers.pop(0)  # Drop doc name (orig col 2 (index 1) but now first col (index 0))\n\n            # Template label dict with every category set to 0\n            training_values = {i: 0 for i in data_headers}\n            atticus_clauses_data_df = atticus_clauses_df.loc[:, data_headers]\n\n            train_data = []\n\n            # Each non-empty cell becomes one example labelled with its column header\n            for header in atticus_clauses_data_df.columns:\n                for value in atticus_clauses_data_df[header].dropna():\n                    train_data.append((value, {'cats': {**training_values, header: 1}}))\n\n            return train_data, data_headers\n\n\n        def create_training_set(train_data, limit=0, split=0.8):\n            \"\"\"Shuffle the examples and split off a held-out evaluation set.\"\"\"\n            train_data = list(train_data)  # Copy so the caller's list is not shuffled in place\n            random.shuffle(train_data)\n            train_data = train_data[-limit:]  # limit=0 keeps every example\n\n            texts, labels = zip(*train_data)\n            split_index = int(len(train_data) * split)\n\n            # Return data in format that matches example here:\n            # https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py\n            return (texts[:split_index], labels[:split_index]), (texts[split_index:], labels[split_index:])\n\n\nStep 4 - Build the Model\n  *WARNING - running the training takes a looong time, even if you have a CUDA-compatible\n  graphics card and it's properly configured in your environment*\n\n  You can just run BertModelBuilder.py with default settings. 
On my Nvidia 1050 Ti, it took\n  about 3 - 4 hours to run the training. Unless you're adding additional data, I'd suggest you\n  just use my pre-built models.\n\nPackaging / Serving Model for Use\n#################################\n\nYou can follow Spacy's excellent instructions `here \u003chttps://spacy.io/api/cli#package\u003e`_\nto package up the final model into a tar that can be installed with pip like this::\n\n    pip install local_path_to_tar.tar.gz\n\nI've uploaded the package to my public AWS bucket, and you can install directly from there\nlike so::\n\n    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz\n\nNow you can load it just like this::\n\n    import spacy\n\n    nlp = spacy.load('en_atticus_classifier_bert')\n\nI also plan to upload this to PyPI so you can just do something like this::\n\n    pip install atticus_classifiers_spacy  # DOESN'T WORK YET\n\nAnother option is to load the pickled model in the pre-trained folder::\n\n    import pickle\n    import spacy\n\n    # Only unpickle files you trust; unpickling can execute arbitrary code\n    with open(\"/path/to/BertClassifier.pickle\", \"rb\") as f:\n        nlp = pickle.load(f)\n\n    # Then you can use the spacy object just like normal:\n    clause = \"Test clause\"\n    cats = nlp(clause).cats\n    cats = [label for label in cats if cats[label] \u003e .7]  # If you want to look only at labels with scores over .7\n    print(cats)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Fatticusclassifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjsv4%2Fatticusclassifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Fatticusclassifier/lists"}