OpenEDGAR is designed to run on Amazon Web Services, which provides high-quality, reliable Internet access and low-latency intra-datacenter access to Amazon S3 for storage. While users can run OpenEDGAR outside of AWS, an AWS account is still required for S3 usage, and performance will be substantially reduced.
- Launch an EC2 instance
- Update all packages
  a. `$ sudo apt update`
  b. `$ sudo apt upgrade`
- Reboot
- Format and mount disks (optional)
  a. `$ sudo mkfs.ext4 /dev/nvme1n1`
  b. Add the new filesystem to `/etc/fstab`
  c. Reboot to test the mount
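The `/etc/fstab` entry in step (b) might look like the following; the device name and mount point are illustrative, so match them to your instance (later optional steps in this guide assume the data volume is mounted at `/data`):

```
# /etc/fstab — example entry for the NVMe data volume (illustrative)
/dev/nvme1n1  /data  ext4  defaults,nofail  0  2
```

The `nofail` option lets the instance boot even if the volume is absent.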
- Install Python:
  `$ sudo apt install build-essential python3-dev python3-pip virtualenv`
- Install Postgres:
  `$ sudo apt install postgresql-9.5 postgresql-client-common libpq-dev`
- Install Oracle Java
  a. `$ sudo add-apt-repository ppa:webupd8team/java`
  b. `$ sudo apt-get update`
  c. `$ sudo apt-get install oracle-java8-installer oracle-java8-set-default oracle-java8-unlimited-jce-policy`
  d. `$ java -version`
- Clone the repo (you may need to ensure you have permission to create a directory under `/opt`)
  a. `$ cd /opt`
  b. `$ git clone https://github.com/LexPredict/openedgar.git`
- Setup virtual environment
  a. `$ cd /opt/openedgar`
  b. `$ virtualenv -p /usr/bin/python3 env`
  c. `$ ./env/bin/pip install -r lexpredict_openedgar/requirements/full.txt`
- Setup database. Note that the password chosen for the `openedgar` user must later be set as `DJANGO_PASSWORD` in the `.env` file
  a. `$ sudo -u postgres createuser -l -P -s openedgar`
  b. `$ sudo -u postgres createdb -O openedgar openedgar`
  c. Move the Postgres data folder (optional):

     $ sudo systemctl stop postgresql
     $ sudo systemctl status postgresql
     $ sudo mv /var/lib/postgresql /data
     $ sudo ln -s /data/postgresql /var/lib/postgresql
     $ sudo chown -R postgres:postgres /var/lib/postgresql
     $ sudo systemctl start postgresql
     $ sudo systemctl status postgresql
     $ sudo -u postgres psql
- Install and configure RabbitMQ
  a. `$ wget https://packages.erlang-solutions.com/erlang-solutions_1.0_all.deb`
  b. `$ sudo dpkg -i erlang-solutions_1.0_all.deb`
  c. `$ sudo apt update`
  d. `$ sudo apt install rabbitmq-server`
  e. `$ sudo rabbitmqctl add_user openedgar openedgar`
  f. `$ sudo rabbitmqctl add_vhost openedgar`
  g. `$ sudo rabbitmqctl set_permissions -p openedgar openedgar ".*" ".*" ".*"`
  h. Move the RabbitMQ data folder (optional):

     $ sudo systemctl stop rabbitmq-server.service
     $ sudo mv /var/lib/rabbitmq /data/
     $ sudo ln -s /data/rabbitmq /var/lib/rabbitmq
     $ sudo chown -R rabbitmq:rabbitmq /var/lib/rabbitmq
     $ sudo systemctl start rabbitmq-server.service
     $ sudo systemctl status rabbitmq-server.service
- Update the `.env` file. For local testing (downloading files locally instead of to S3), set `CLIENT_TYPE` to `LOCAL` and `DOWNLOAD_PATH` to a local path
  a. `$ cp lexpredict_openedgar/sample.env lexpredict_openedgar/.env`
  b. Update `DATABASE_URL`
  c. Update `CELERY_BROKER_URL`
  d. Setup an AWS S3 bucket
  e. Setup an IAM policy:

     {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Sid": "[REPLACE:unique ID]",
           "Effect": "Allow",
           "Action": ["s3:*"],
           "Resource": ["arn:aws:s3:::[REPLACE:your bucket]"]
         },
         {
           "Sid": "[REPLACE:unique ID]",
           "Effect": "Allow",
           "Action": ["s3:*"],
           "Resource": ["arn:aws:s3:::[REPLACE:your bucket]/*"]
         }
       ]
     }

  f. Update `S3_ACCESS_KEY`, `S3_SECRET_KEY`, and `S3_BUCKET`
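Putting the pieces together, the relevant `.env` lines might look like the sketch below. All values are placeholders: `changeme` stands for the Postgres password chosen earlier, and the standard `postgres://user:password@host:port/db` and `amqp://user:password@host:port/vhost` URL schemes are assumed for `DATABASE_URL` and `CELERY_BROKER_URL`; check `sample.env` for the authoritative variable names.

```shell
# Illustrative .env values — placeholders, not real credentials
DATABASE_URL=postgres://openedgar:changeme@localhost:5432/openedgar
DJANGO_PASSWORD=changeme
CELERY_BROKER_URL=amqp://openedgar:openedgar@localhost:5672/openedgar
# For local testing without S3:
CLIENT_TYPE=LOCAL
DOWNLOAD_PATH=/data/openedgar
```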
- Initial database migration
  a. `$ cd /opt/openedgar/lexpredict_openedgar`
  b. `$ source ../env/bin/activate`
  c. `$ source .env`
  d. `$ python manage.py migrate`
- Setup Apache Tika and run
  a. `$ cd /opt/openedgar/tika`
  b. `$ bash download_tika.sh`
  c. `$ bash run_tika.sh` (run with `&`, `nohup`, or as a service)
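One way to satisfy "run as a service" is a systemd unit. A minimal sketch, assuming the paths from this guide; the unit name `tika.service` and its options are illustrative, not part of OpenEDGAR:

```ini
# /etc/systemd/system/tika.service (illustrative)
[Unit]
Description=Apache Tika server for OpenEDGAR
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/openedgar/tika
ExecStart=/bin/bash /opt/openedgar/tika/run_tika.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `$ sudo systemctl enable --now tika`. An analogous unit can wrap `scripts/run_celery.sh` for the Celery step below.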
- Setup Celery
  a. `$ cd /opt/openedgar/lexpredict_openedgar`
  b. `$ source ../env/bin/activate`
  c. `$ source .env`
  d. `$ bash scripts/run_celery.sh` (run with `&`, `nohup`, or as a service)
- Build a database of 10-Ks from 2018 from the latest SEC EDGAR data
  a. `$ cd /opt/openedgar/lexpredict_openedgar`
  b. `$ source ../env/bin/activate`
  c. `$ source .env`
  d. `$ python manage.py shell_plus`
  e. Retrieve all 10-Ks from 2018:

     >>> from openedgar.processes.edgar import download_filing_index_data, process_all_filing_index
     >>> download_filing_index_data(year=2018)
     >>> process_all_filing_index(year=2018, form_type_list=["10-K"])

  f. Sample timing on an `m5.large` (2 cores, 8 GB RAM): ~24 hours to retrieve and parse all 2018 10-Ks
  g. Sample statistics for 2018 10-Ks as of May:

     # Data on S3
     Size of edgar/ on S3:           Objects: 1645    Size: 2.4 GB
     Size of documents/raw/ on S3:   Objects: 135497  Size: 2.1 GB
     Size of documents/text/ on S3:  Objects: 130469  Size: 1000.4 MB

     # Data in Postgres
     In [7]: Filing.objects.count()
     Out[7]: 1521
     In [8]: FilingDocument.objects.count()
     Out[8]: 147598
     In [9]: Company.objects.count()
     Out[9]: 1451
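For context on what `download_filing_index_data(year=2018)` fetches: SEC EDGAR publishes quarterly full-index files under `https://www.sec.gov/Archives/edgar/full-index/`. The sketch below builds those URLs; the function `index_url` is illustrative and not part of the OpenEDGAR API.

```python
# Sketch: build SEC EDGAR quarterly full-index URLs for a given year.
# EDGAR publishes index files (form.idx, company.idx, master.idx, ...)
# under /Archives/edgar/full-index/<year>/QTR<n>/.
EDGAR_BASE = "https://www.sec.gov/Archives/edgar/full-index"

def index_url(year: int, quarter: int, index_type: str = "form") -> str:
    """Return the URL of one EDGAR quarterly index file."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    return f"{EDGAR_BASE}/{year}/QTR{quarter}/{index_type}.idx"

# The four quarterly indexes a year=2018 run covers:
urls_2018 = [index_url(2018, q) for q in (1, 2, 3, 4)]
```

This is why the `edgar/` prefix on S3 above holds on the order of a few thousand small index objects, while `documents/` holds the filings themselves.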