Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.

Artefactual

Last update: Dec 16, 2022

Overview

Archivematica

Archivematica is a web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content. Our target users are archivists, librarians, and anyone working to preserve digital objects.

You are free to copy, modify, and distribute Archivematica with attribution under the terms of the AGPLv3 license. See the LICENSE file for details.

Installation

Other resources

Website: User and administrator documentation
Wiki: Developer-facing documentation, requirements analysis and community resources
Issues: Git repository used for tracking Archivematica issues and feature/enhancement ideas
User Google Group: Forum/mailing list for user questions (both technical and end-user)
Paid support: Paid support, hosting, training, consulting and software development contracts from Artefactual

Contributing

Thank you for your interest in Archivematica! For more details, see the contributing guidelines.

Reporting an issue

Issues related to Archivematica, the Storage Service, or any related repository can be filed in the Archivematica Issues repository.

Security

If you have a security concern about Archivematica or any related repository, please see the SECURITY file for information about how to safely report vulnerabilities.

Related projects

Archivematica consists of several projects working together, including:

Archivematica: This repository! Main repository containing the user-facing dashboard, the task manager (MCPServer) and the client scripts for the MCPClient
Storage Service: Responsible for moving files to Archivematica for processing, and from Archivematica into long-term storage
Format Policy Registry: Submodule shared between Archivematica and the Format Policy Registry (FPR) server that displays and updates FPR rules and commands

For more projects in the Archivematica ecosystem, see the getting started page.

Comments

Problem: using symlinks breaks Windows dev environments

Archivematica cannot be deployed on Windows, but this PR from @minusdavid https://github.com/artefactual/deploy-pub/pull/39 makes it possible to deploy a development environment on Windows, using Vagrant to deploy to a Linux VM.

That PR is working great, but there is a problem with checking out a git repo that contains symlinks into a Windows filesystem (google it, lots of links). Windows doesn't properly support symlinks, so checking out a repo with symlinks is difficult: Ansible roles choke, you get weird git errors, etc.

In this repo, there are only a few symlinks being used - it would not be hard to remove them altogether. I think the only place left is in the osdeps folders. Removing those symlinks and creating duplicate files for now would allow osdeps to differ for each platform, which is fine, and would make developing in a Windows environment much easier, which is a bonus.
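
To confirm which symlinks are left before removing them, something like the following could list them. This is only a sketch: `find_symlinks` is a hypothetical helper, not part of Archivematica, and it assumes it is run on a filesystem that represents symlinks properly (for example inside the Linux VM).

```python
import os


def find_symlinks(repo_root):
    """Return sorted paths (relative to repo_root) of every symlink in the tree."""
    found = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Skip git's own metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        # os.walk lists symlinks to directories in dirnames but does not
        # descend into them by default, so checking both lists is enough.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(os.path.relpath(path, repo_root))
    return sorted(found)
```

Calling `find_symlinks(".")` from the top of a checkout returns the relative path of each link, which makes it easy to see whether anything outside the osdeps folders is affected.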

opened by jhsimpson 24

Problem: Extract contents crashes due to UnicodeEncodeError

We have come across a transfer where the "Extract contents from compressed archives" job seems to run fine, until it comes across a new compressed object where it fails with the following message in the task overview in the dashboard:

....
Not extracting contents from Cotu_K.doc  - No rule found to extract
Not extracting contents from UMT_24.02.12.pdf  - No rule found to extract
Not extracting contents from Rapport_d_activite_Alen.niger_version2.doc  - No rule found to extract
Not extracting contents from Wawan_10.04.17.pdf  - No rule found to extract

extractContents.py: INFO      2018-05-17 20:56:32,786  archivematica.mcp.client.extractContents:get_dir_uuids:240:  Assigning UUID d425717b-eadf-45fc-b5d7-ab13cf550682 to directory path %transferDirectory%objects/SEMINAIRES_2010/MID_TERM_EVALUATION/revised_ghislaine.zip-2018-05-17T20:54:14.968039+00:00/
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/extractContents.py", line 188, in <module>
    sys.exit(main(transfer_uuid, sip_directory, date, task_uuid, delete=delete))
  File "/usr/lib/archivematica/MCPClient/clientScripts/extractContents.py", line 164, in main
    transfer_mdl)
  File "/usr/share/archivematica/dashboard/main/models.py", line 502, in create_many
    for dir_path, dir_uuid in dir_paths_uuids])
  File "/usr/lib/archivematica/archivematicaCommon/archivematicaFunctions.py", line 237, in get_dir_uuids
    dir_uuid, dir_path)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x82' in position 119: ordinal not in range(128)

The script crashes here, with an exit code of 1 according to the task overview dashboard. In the Transfer dashboard however, the job is incorrectly marked as 'Completed successfully'.
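
For illustration, the failure mode in the first traceback (a strict ASCII encode of a path containing `\x82`) can be reproduced in isolation. This is a sketch in Python 3, whereas the traceback above comes from Python 2.7; `ascii_safe` and the sample path are hypothetical, and this is not necessarily how the script should be fixed.

```python
def ascii_safe(text):
    """Return text with non-ASCII characters shown as escape sequences."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


dir_path = "objects/revis\x82.zip/"  # hypothetical path containing U+0082

try:
    dir_path.encode("ascii")  # a strict ASCII encode raises on U+0082
except UnicodeEncodeError as err:
    print("strict encode failed:", err.reason)

# Escaping instead of failing keeps the log message printable.
print(ascii_safe(dir_path))
```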

Furthermore, it seems to have skipped the jobs 'Sanitize extracted objects' file and directory names', 'Scan for viruses on extracted files', etc. that would normally run after extraction of packages. Instead it simply moves forward to the 'Update METS.xml document' as if no packages to extract were found.

This finally results in a 'real' error during METS creation during ingest:

Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaCreateMETS2.py", line 1314, in <module>
    baseDirectoryPath, objectsDirectoryPath, directories)
  File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaCreateMETS2.py", line 1182, in get_normative_structmap
    add_normative_structmap_div(all_fsitems, normativeStructMapDiv, directories)
  File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaCreateMETS2.py", line 1220, in add_normative_structmap_div
    LABEL=basename)
  File "src/lxml/lxml.etree.pyx", line 3112, in lxml.etree.SubElement (src/lxml/lxml.etree.c:81786)
  File "src/lxml/apihelpers.pxi", line 203, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:18358)
  File "src/lxml/apihelpers.pxi", line 198, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:18281)
  File "src/lxml/apihelpers.pxi", line 302, in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:19840)
  File "src/lxml/apihelpers.pxi", line 316, in lxml.etree._addAttributeToNode (src/lxml/lxml.etree.c:20196)
  File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32441)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Further investigation into this error reveals it failed to insert the filename of an extracted file into a normative structmap, due to the fact that the file was of course never normalized after extraction.

IISH

opened by kerim1 22

MCPClient error check Gearman worker creation

This patch adds a try/except block to the MCPClient when creating a Gearman worker in startThread().

Without this patch, if the MCPClient configuration item "MCPArchivematicaServer" has an invalid value, no Gearman worker will be created and Archivematica will be stuck thinking that a job is executing indefinitely with no indication of what happened in the user interface or the logs.

To test, open "/etc/archivematica/MCPClient/clientConfig.conf", and change "MCPArchivematicaServer" to something invalid like "buffalo" or "localhost::9999", and then try to do a standard transfer in the Archivematica dashboard UI. In the micro-service "Verify transfer compliance", you'll get stuck at "Job: Set file permissions". It will say it's still executing but the job will never actually run.
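
The general shape of the change is roughly as follows. This is a simplified sketch, not the patch itself: `start_worker` and `fake_factory` are hypothetical stand-ins, and the real code creates a Gearman worker inside startThread().

```python
import logging

logger = logging.getLogger("mcpclient.sketch")


def start_worker(create_worker, server):
    """Create a worker for `server`, logging the failure instead of hiding it.

    `create_worker` is a hypothetical factory standing in for whatever
    constructs the Gearman worker.
    """
    try:
        return create_worker(server)
    except Exception:
        # Record the full traceback so the problem shows up in the logs.
        logger.exception("Could not create a worker for %r", server)
        return None


def fake_factory(server):
    # Hypothetical factory: accepts only "host:port" with a numeric port.
    host, sep, port = server.partition(":")
    if not host or not sep or not port.isdigit():
        raise ValueError("invalid server address: %s" % server)
    return (host, int(port))
```

With this factory, `start_worker(fake_factory, "buffalo")` logs the error and returns `None` rather than raising inside the thread.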

opened by minusdavid 21

Rework the MCP Server, MCP Client and MCP Client scripts to support batching tasks

The MCP Server now batches file-level tasks into fixed-size groups, creating one Gearman task per batch, rather than one per file. It also uses fixed-size thread pools to limit contention between threads.
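
As a rough illustration of the batching idea (not the code in this PR), splitting a list of file-level tasks into fixed-size groups could look like this. `make_batches` and the batch size of 3 are assumptions made for the example.

```python
def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


files = ["file-%d" % n for n in range(7)]
for batch in make_batches(files, 3):
    # One task would be submitted per batch rather than one per file.
    print(batch)
```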

The MCP Client now operates in batches, processing one batch at a time. It also supports running tasks using a pool of processes, which improves throughput where tasks benefit from spanning multiple CPUs.

The MCP Client scripts now accept a batch of jobs and process them as a single unit. There is a new Job API that provides a standard interface for these client scripts, and all scripts have been converted to use this.

The motivation for this work was to improve performance on transfer and ingest workflows, and to provide an improved interface for implementing client scripts.

Our testing shows transfers and ingests taking approximately half the time they did without these changes.

These changes also permit further optimisation of client scripts, by taking advantage of processing files in batches rather than one at a time. We did some work on optimising a few of the client scripts, but there is likely more improvement to be gained by further optimisation.

This is connected to #938.

Jisc RDSS

opened by jambun 19

Problem: Consistent Ingest failure with media/video transfer

This package of data here is causing Ingest to fail in Archivematica 1.7:

To recreate:

Untar the data and begin transfer
Transfer will complete but will hang on Normalize -> Validate Preservation Derivatives job

If we look at the MCP Server Log we see a large chunk of MediaConch output, followed by a SQL failure:

2[1]\\" formatid=\\"0xBF\\">0x434FD850</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"300\\" context=\\"/Segment[1]/Info[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xBAB0729C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"387\\" context=\\"/Segment[1]/Tracks[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xAD6CDF0C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"657\\" context=\\"/Segment[1]/Tags[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x634624A1</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"336713942\\" context=\\"/Segment[1]/Cues[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x96E4D111</value>\\n        </test>\\n      </check>\\n      <check icid=\\"EBML-CRC-VALID\\" version=\\"1\\" tests_run=\\"5\\" fail_count=\\"0\\" pass_count=\\"5\\">\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"65\\" context=\\"/Segment[1]/SeekHead[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x434FD850</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"300\\" context=\\"/Segment[1]/Info[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xBAB0729C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"387\\" context=\\"/Segment[1]/Tracks[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xAD6CDF0C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"657\\" context=\\"/Segment[1]/Tags[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x634624A1</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"336713942\\" context=\\"/Segment[1]/Cues[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x96E4D111</value>\\n        </test>\\n      </check>\\n      <check 
icid=\\"MKV-VALID-TRACKTYPE-VALUE\\" version=\\"1\\" tests_run=\\"2\\" fail_count=\\"0\\" pass_count=\\"2\\">\\n        <context name=\\"Valid Values\\">1 2 3 16 17 18 32</context>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"TrackType\\" offset=\\"419\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[1]/TrackType[1]\\" formatid=\\"0x83\\">1</value>\\n          <value offset=\\"419\\" name=\\"TrackType\\">1</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"TrackType\\" offset=\\"616\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[2]/TrackType[1]\\" formatid=\\"0x83\\">2</value>\\n          <value offset=\\"616\\" name=\\"TrackType\\">2</value>\\n        </test>\\n      </check>\\n      <check icid=\\"MKV-VALID-BOOLEANS\\" version=\\"1\\" tests_run=\\"2\\" fail_count=\\"0\\" pass_count=\\"2\\">\\n        <context name=\\"Valid Values\\">0 1</context>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"FlagLacing\\" offset=\\"409\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[1]/FlagLacing[1]\\" formatid=\\"0x9C\\">0</value>\\n          <value offset=\\"409\\" name=\\"FlagLacing\\">0</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"FlagLacing\\" offset=\\"591\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[2]/FlagLacing[1]\\" formatid=\\"0x9C\\">0</value>\\n          <value offset=\\"591\\" name=\\"FlagLacing\\">0</value>\\n        </test>\\n      </check>\\n    </implementationChecks>\\n    <implementationChecks checks_run=\\"0\\" fail_count=\\"0\\" pass_count=\\"0\\">\\n      <name>MediaConch FFV1 Implementation Checker</name>\\n    </implementationChecks>\\n    <implementationChecks checks_run=\\"1\\" fail_count=\\"0\\" pass_count=\\"1\\">\\n      <name>MediaConch PCM Implementation Checker</name>\\n      <check icid=\\"PCM-IS-CBR\\" version=\\"1\\" tests_run=\\"1\\" fail_count=\\"0\\" pass_count=\\"1\\">\\n        <context name=\\"Valid 
Values\\">CBR</context>\\n        <test outcome=\\"pass\\">\\n          <value offset=\\"\\" name=\\"\\">CBR</value>\\n        </test>\\n      </check>\\n    </implementationChecks>\\n  </media>\\n</MediaConch>\\n\\r\\n\\n", "eventOutcomeDetailNote": "MediaConch implementation check result: The implementation check MediaConch EBML Implementation Checker returned failure for the following check(s): EBML-ELEMENT-VALID-PARENT."}\n\n'}
archivematica-mcp-server_1       | ERROR     2018-03-09 03:06:01  archivematica.mcp.server:utils:wrapped:16:  Uncaught exception
archivematica-mcp-server_1       | Traceback (most recent call last):
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/utils.py", line 14, in wrapped
archivematica-mcp-server_1       |     return fn(*args, **kwargs)
archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 47, in wrapper
archivematica-mcp-server_1       |     return f(*args, **kwargs)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 91, in performTask
archivematica-mcp-server_1       |     self.check_request_status(completed_job_request)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 100, in check_request_status
archivematica-mcp-server_1       |     self.linkTaskManager.taskCompletedCallBackFunction(self)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/linkTaskManagerFiles.py", line 143, in taskCompletedCallBackFunction
archivematica-mcp-server_1       |     databaseFunctions.logTaskCompletedSQL(task)
archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 263, in logTaskCompletedSQL
archivematica-mcp-server_1       |     task.save()
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
archivematica-mcp-server_1       |     force_update=force_update, update_fields=update_fields)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base
archivematica-mcp-server_1       |     updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 827, in _save_table
archivematica-mcp-server_1       |     forced_update)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 877, in _do_update
archivematica-mcp-server_1       |     return filtered._update(values) > 0
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 580, in _update
archivematica-mcp-server_1       |     return query.get_compiler(self.db).execute_sql(CURSOR)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1062, in execute_sql
archivematica-mcp-server_1       |     cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
archivematica-mcp-server_1       |     cursor.execute(sql, params)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
archivematica-mcp-server_1       |     six.reraise(dj_exc_type, dj_exc_value, traceback)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
archivematica-mcp-server_1       |     return self.cursor.execute(query, args)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 226, in execute
archivematica-mcp-server_1       |     self.errorhandler(self, exc, value)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
archivematica-mcp-server_1       |     raise errorvalue
archivematica-mcp-server_1       | OperationalError: (2006, 'MySQL server has gone away')
archivematica-mcp-server_1       | Exception in thread Thread-1105:
archivematica-mcp-server_1       | Traceback (most recent call last):
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
archivematica-mcp-server_1       |     self.run()
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/threading.py", line 754, in run
archivematica-mcp-server_1       |     self.__target(*self.__args, **self.__kwargs)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/utils.py", line 14, in wrapped
archivematica-mcp-server_1       |     return fn(*args, **kwargs)
archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 47, in wrapper
archivematica-mcp-server_1       |     return f(*args, **kwargs)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 91, in performTask
archivematica-mcp-server_1       |     self.check_request_status(completed_job_request)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 100, in check_request_status
archivematica-mcp-server_1       |     self.linkTaskManager.taskCompletedCallBackFunction(self)
archivematica-mcp-server_1       |   File "/src/MCPServer/lib/linkTaskManagerFiles.py", line 143, in taskCompletedCallBackFunction
archivematica-mcp-server_1       |     databaseFunctions.logTaskCompletedSQL(task)
archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 263, in logTaskCompletedSQL
archivematica-mcp-server_1       |     task.save()
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
archivematica-mcp-server_1       |     force_update=force_update, update_fields=update_fields)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base
archivematica-mcp-server_1       |     updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 827, in _save_table
archivematica-mcp-server_1       |     forced_update)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 877, in _do_update
archivematica-mcp-server_1       |     return filtered._update(values) > 0
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 580, in _update
archivematica-mcp-server_1       |     return query.get_compiler(self.db).execute_sql(CURSOR)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1062, in execute_sql
archivematica-mcp-server_1       |     cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
archivematica-mcp-server_1       |     cursor.execute(sql, params)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
archivematica-mcp-server_1       |     six.reraise(dj_exc_type, dj_exc_value, traceback)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
archivematica-mcp-server_1       |     return self.cursor.execute(query, args)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 226, in execute
archivematica-mcp-server_1       |     self.errorhandler(self, exc, value)
archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
archivematica-mcp-server_1       |     raise errorvalue
archivematica-mcp-server_1       | OperationalError: (2006, 'MySQL server has gone away')

This has been seen with other transfer material, but it wasn't measured and recreated under the same controlled conditions as here.

Will investigate more as I get an opportunity.

Type: bug

opened by ross-spencer 19

Issues with installing Archivematica 1.8 RPMs/Debs on Fresh Servers

Using the documentation located here, I encountered the following issues with the CentOS and Ubuntu packages posted to last meeting's agenda:

CentOS 7

At step 3 of the instructions in the documentation, I got the following error:

[centos@ip-172-31-8-206 ~]$ sudo -u root yum install -y java-1.8.0-openjdk-headless mariadb-server gearmand
Loaded plugins: fastestmirror
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
Trying other mirror.
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
Trying other mirror.
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
Trying other mirror.
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
Trying other mirror.
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
Trying other mirror.
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
Trying other mirror.
https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"

This was just due to a difference in the URL in the documentation vs the URL from last meeting. I updated the yum repo to use the correct URL and received the following:

[centos@ip-172-31-8-206 ~]$ sudo -u root yum install -y python-pip archivematica-storage-service
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.usc.edu
 * epel: mirrors.kernel.org
 * extras: mirror.web-ster.com
 * updates: mirror.keystealth.org
No package archivematica-storage-service available.

So I was unable to proceed past step 3 in the instructions since the storage service wasn't available. I did try the steps afterwards just to see how far I could get without the storage service and got up to the following (step 5):

[centos@ip-172-31-8-206 ~]$ sudo -u archivematica bash -c " \
> set -a -e -x
> source /etc/sysconfig/archivematica-dashboard
> cd /usr/share/archivematica/dashboard
> /usr/lib/python2.7/archivematica/dashboard/bin/python manage.py syncdb --noinput
> ";
+ source /etc/sysconfig/archivematica-dashboard
++ ARCHIVEMATICA_DASHBOARD_DASHBOARD_DJANGO_SECRET_KEY=Ptpucrhu0doIq2QcHZtcO9caaqE11fk2
++ ARCHIVEMATICA_DASHBOARD_DASHBOARD_DJANGO_ALLOWED_HOSTS='*'
++ AM_GUNICORN_BIND=127.0.0.1:7400
++ DJANGO_SETTINGS_MODULE=settings.production
++ ARCHIVEMATICA_DASHBOARD_DB_NAME=MCP
++ ARCHIVEMATICA_DASHBOARD_DB_USER=archivematica
++ ARCHIVEMATICA_DASHBOARD_DB_PASSWORD=demo
++ ARCHIVEMATICA_DASHBOARD_DB_HOST=localhost
++ ARCHIVEMATICA_DASHBOARD_DB_PORT=3306
++ ARCHIVEMATICA_DASHBOARD_GEARMAN=localhost:4730
++ ARCHIVEMATICA_DASHBOARD_ELASTICSEARCH=localhost:9200
++ PYTHONPATH=/usr/lib/archivematica/archivematicaCommon/:/usr/share/archivematica/dashboard
+ cd /usr/share/archivematica/dashboard
+ /usr/lib/python2.7/archivematica/dashboard/bin/python manage.py syncdb --noinput
bash: line 3: /usr/lib/python2.7/archivematica/dashboard/bin/python: No such file or directory

Ubuntu 16.04

I got up to step 3 on Ubuntu. When I ran apt-get update I received the following:

Reading package lists... Done
W: The repository 'http://jenkins-ci.archivematica.org/1.8.x/ubuntu xenial Release' does not have a Release file.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: http://packages.archivematica.org/1.6.x/ubuntu-externals/dists/trusty/InRelease: Signature by key 486650CDD6355E25DA542E06C8F04D025236CA08 uses weak digest algorithm (SHA1)
E: Failed to fetch http://jenkins-ci.archivematica.org/1.8.x/ubuntu/dists/xenial/main/binary-amd64/Packages  404  Not Found
E: Some index files failed to download. They have been ignored, or old ones used instead.

Again there was a slight gap between the URLs from our last meeting and the documentation. When I switched the URL to use what was in our meeting notes I was met with:

Reading package lists... Done
W: The repository 'http://jenkins-ci.archivematica.org/repos/apt/dev-1.8.x-xenial xenial Release' does not have a Release file.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: http://packages.archivematica.org/1.6.x/ubuntu-externals/dists/trusty/InRelease: Signature by key 486650CDD6355E25DA542E06C8F04D025236CA08 uses weak digest algorithm (SHA1)
E: Failed to fetch http://jenkins-ci.archivematica.org/repos/apt/dev-1.8.x-xenial/dists/xenial/main/binary-amd64/Packages  404  Not Found
E: Some index files failed to download. They have been ignored, or old ones used instead.

This looks like it might just be due to how the repo is structured, since the current release has dist and architecture-specific subdirs.

Status: in progress Columbia University Library CUL: phase 1

opened by jpellman 18

Problem: ES client timeout is not configurable

With big METS files, 10 seconds might not be enough. This adds a configuration parameter so that the timeout can be changed.

I went for the conservative approach of only changing the ES AIP index call, but this can also be handled at connection level, with something like:

es_client = Elasticsearch(**{ 'hosts': _es_hosts, 'timeout': request_timeout, 'dead_timeout': 2, })
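
Reading the timeout from configuration with a fallback to the current 10 seconds could look roughly like this. It is only a sketch: the `ES_TIMEOUT` variable name and the `es_timeout` helper are hypothetical and are not the names used in this PR.

```python
import os

DEFAULT_ES_TIMEOUT = 10  # seconds; the current value mentioned above


def es_timeout(environ=os.environ, name="ES_TIMEOUT"):
    """Read the timeout in seconds from the environment, with a fallback.

    The variable name ES_TIMEOUT is hypothetical.
    """
    raw = environ.get(name, "")
    try:
        value = float(raw)
    except ValueError:
        # Missing or malformed values fall back to the default.
        return DEFAULT_ES_TIMEOUT
    return value if value > 0 else DEFAULT_ES_TIMEOUT
```

The returned value could then be passed as `request_timeout` in the snippet above.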

Refs: #10734

opened by scollazo 18

Problem: Parse Dataverse Mets fails for some datasets

Testing the new "Parse Dataverse Mets" job within the "Parse External Files" microservice.

When testing with datasets I added to Dataverse, this job completes successfully.

When testing datasets created by the Scholar's Portal team, this job is failing.

The error message from the task that doesn't complete is:

I am not sure if it is relevant, but to get access to this dataset, I used "data&subtree=archivematica" in the relative path of the Dataverse location. For the datasets that have passed this job, that field is only set to "archivematica".

OCUL: AM-Dataverse

opened by joel-simpson 17

Problem: antivirus scanning errors if file is too big

Some defaults in clamd.conf are causing the antivirus scanning client script to fail. I believe these are the attributes involved:

■ MaxScanSize SIZE
Sets the maximum amount of data to be scanned for each input file. Archives and other containers are recursively extracted and scanned up to this value. Warning: disabling this limit or setting it too high may result in severe damage to the system.
Default: 100M

■ MaxFileSize SIZE
Files larger than this limit won't be scanned. Affects the input file itself as well as files contained inside it (when the input file is an archive, a document or some other kind of container). Warning: disabling this limit or setting it too high may result in severe damage to the system.
Default: 25M

■ StreamMaxLength SIZE
Clamd uses FTP-like protocol to receive data from remote clients. If you are using clamav-milter to balance load between remote clamd daemons on firewall servers you may need to tune the Stream* options. This option allows you to specify the upper limit for data size that will be transfered to remote daemon when scanning a single file. It should match your MTA's limit for a maximum attachment size.
Default: 10M

I tried to change their values to:

StreamMaxLength 0
MaxScanSize 0
MaxFileSize 0

Zero means unlimited. However the value of StreamMaxLength seems to be hard-coded to 4G (read this).
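
One way to reason about these limits is to compare a file's size against them before sending it to clamd. The sketch below is hypothetical and not part of the client script; it assumes sizes are written with an optional K or M suffix and that 0 means unlimited, as described above.

```python
UNITS = {"K": 1024, "M": 1024 ** 2}


def parse_size(text):
    """Convert a clamd.conf-style size such as '25M' or '0' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def exceeds_limit(size_in_bytes, limit):
    """True if the size is over the limit; a limit of 0 means unlimited."""
    limit_bytes = parse_size(limit)
    return limit_bytes != 0 and size_in_bytes > limit_bytes
```

For a real file, the size would come from `os.path.getsize(path)`. Such a check would let the script report a file that is too large to scan instead of failing with a connection reset.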

With the previous config, I've made the following tests:

The largest one (4.1G) failed.

The output of the client script is:

archivematicaClamscan.py: ERROR     2017-10-11 22:05:26,861  archivematica.mcp.client.clamscan:main:95:  Unexpected error scanning: /var/archivematica/sharedDirectory/currentlyProcessing/Test-4.1G-File-With-Accents_2-31a95a67-a0e0-4546-815d-fdbeebd41da6/objects/Volcán.jpg
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaClamscan.py", line 93, in main
    result = client.instream(open(target))
  File "/usr/share/python/archivematica-mcp-client/local/lib/python2.7/site-packages/clamd/__init__.py", line 190, in instream
    self.clamd_socket.send(size + chunk)
error: [Errno 104] Connection reset by peer
archivematicaClamscan.py: INFO      2017-10-11 22:05:26,867  archivematica.mcp.client.clamscan:record_event:56:  Recording event for fileUUID=29ffd407-35c6-41bf-ae5a-30dee0589962 outcome=Fail

In /var/log/clamav/clamav.log:

WARNING: INSTREAM: Size limit reached, (requested: 1024, max: 1023)

Type: bug Severity: critical

opened by sevein 17

Problem: index_aip crashes elasticsearch for large transfers

While testing large and multiple transfers for the rate limiting investigation, we noticed elasticsearch crashing when it hit its maximum memory. Transfers with many files produced larger JSON documents (50 to 100MB), and the post to elasticsearch would take longer than the 10-second timeout, causing a retry soon after. As these retries pile up, elasticsearch quickly hits its memory limit and barfs.

We tried increasing the elasticsearch memory allocation to 3x the default and still hit the limit. However, we think we can avoid this situation by increasing the default timeout from 10 seconds to 5 minutes during AIP indexing. This will lessen the load on the elasticsearch server (by avoiding all those retries) and allow time for those large documents to be indexed.
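
To illustrate why a short timeout multiplies the load, here is a deliberately simplified model. It assumes the client retries immediately after each timeout and the server keeps working on every copy it has received. The 60-second indexing time and the retry count are made-up numbers for illustration, and `copies_in_flight` is a hypothetical helper.

```python
import math


def copies_in_flight(index_seconds, timeout_seconds, max_retries):
    """Estimate how many copies of one document are sent before the
    first attempt could have finished (simplified model)."""
    attempts = math.ceil(index_seconds / timeout_seconds)
    # The client never sends more than the first attempt plus its retries.
    return max(1, min(attempts, max_retries + 1))


# A document that needs 60 s to index, with up to 10 retries allowed.
with_short_timeout = copies_in_flight(60, 10, 10)   # 10-second timeout
with_long_timeout = copies_in_flight(60, 300, 10)   # 5-minute timeout
```

Under these assumptions the 10-second timeout hands the server six copies of the same 50 to 100MB document, while the 5-minute timeout sends it once.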

We'll test this out and prepare a PR.
Jisc RDSS Piql NHA

opened by payten 16
Problem: quarantine delay is not reliable

When a transfer is sent to quarantine, the user is prompted to remove it from quarantine manually. By default, however, the transfer is also removed from quarantine automatically after 28 days. The user can configure this delay in the processing configuration.

The purpose of the delay is to allow virus definitions to update before the virus scan.

There are two modules in MCP implementing the processing delay (1, 2).

It's done with a timer from the threading module, which doesn't seem to provide real guarantees. What would happen if the process is interrupted before the timer finishes? AMQP or Redis seem to offer primitives that allow implementing scheduling. Gearman doesn't seem to provide any.

Solution 1

Deprecate quarantine delay functionality. Only the user would be able to remove it from quarantine.

Solution 2

Update virus definitions before antivirus checking?

Solution 3

Implement delayed jobs using Redis or similar. Once a new scheduled job comes in, MCP would persist it somewhere. In a loop, MCP would poll the stored tasks frequently and dispatch new jobs when they are due. The following is a library that could be used for this purpose or as a reference: https://github.com/josiahcarlson/rpqueue/ (it uses Python + Redis).

Remember that Redis is already used as a Gearman backend; we've tried this successfully in our local Docker developer setup. Adding Redis to our stack would also be beneficial for other purposes like caching in Django.
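
The persist-and-poll idea in Solution 3 can be sketched as follows. To keep the example self-contained it swaps Redis for the standard library's sqlite3, so it shows the pattern rather than the proposed implementation. The table and function names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store that holds scheduled jobs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS delayed_jobs ("
        "job_id TEXT PRIMARY KEY, run_at REAL NOT NULL)"
    )
    return conn


def schedule(conn, job_id, run_at):
    """Persist a job so it survives a restart of the polling process."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO delayed_jobs (job_id, run_at) VALUES (?, ?)",
            (job_id, run_at),
        )


def pop_due(conn, now):
    """Return and remove every job whose scheduled time has passed."""
    with conn:
        rows = conn.execute(
            "SELECT job_id FROM delayed_jobs WHERE run_at <= ? ORDER BY run_at",
            (now,),
        ).fetchall()
        conn.executemany("DELETE FROM delayed_jobs WHERE job_id = ?", rows)
    return [job_id for (job_id,) in rows]
```

A polling loop would call `pop_due(conn, time.time())` every few seconds and dispatch the returned jobs. With a file-backed path instead of `:memory:`, jobs scheduled before an interruption are still there when the process restarts, which is the guarantee the in-memory timer lacks.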
Severity: high

opened by sevein 16
Truncates filenames if they exceed os limit

Gets the maximum filename length and truncates the name of the renamed file if it exceeds the maximum allowed by the underlying OS. This is one path to addressing Archivematica/Issues#1586
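
One way such a truncation could look is sketched below. This is not the code from the pull request: `max_filename_length` and `truncate_filename` are hypothetical names, the limit is counted in UTF-8 bytes, and `os.pathconf` is only available on Unix-like systems.

```python
import os


def max_filename_length(directory="."):
    """Ask the OS for the longest filename allowed in this directory (Unix only)."""
    return os.pathconf(directory, "PC_NAME_MAX")


def truncate_filename(name, max_bytes):
    """Shorten name to at most max_bytes UTF-8 bytes, keeping the extension."""
    if len(name.encode("utf-8")) <= max_bytes:
        return name
    root, ext = os.path.splitext(name)
    budget = max_bytes - len(ext.encode("utf-8"))
    if budget <= 0:
        # The extension alone does not fit, so truncate the whole name.
        root, ext, budget = name, "", max_bytes
    # Cutting bytes may split a multi-byte character; drop the partial one.
    short_root = root.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return short_root + ext
```

Counting bytes rather than characters matters for names with accents, such as the `Volcán.jpg` file mentioned in the ClamAV report, because most filesystems express the limit in bytes.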

opened by helrond 0
Prefer QuerySet.exists() to QuerySet.count() > 0

This is a micro-optimisation I found while looking at some database issues in the clamscan script; the Django documentation suggests using this instead of count() > 0.

See https://docs.djangoproject.com/en/4.1/ref/models/querysets/#django.db.models.query.QuerySet.exists

This is for https://github.com/archivematica/Issues/issues/1578 and https://github.com/wellcomecollection/archivematica-infrastructure/issues/101
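
The difference between the two calls comes down to the SQL they issue: `count()` asks the database to count every matching row, while `exists()` only needs to find one. The sketch below reproduces the two query shapes with the standard library's sqlite3 instead of Django, so the table and function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory table standing in for an events model."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (file_uuid TEXT, outcome TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("uuid-1", "Pass"), ("uuid-1", "Pass"), ("uuid-2", "Fail")],
    )
    return conn


def has_events_by_count(conn, file_uuid):
    """Shape of `queryset.count() > 0`: count all matching rows."""
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM events WHERE file_uuid = ?", (file_uuid,)
    ).fetchone()
    return total > 0


def has_events_by_exists(conn, file_uuid):
    """Shape of `queryset.exists()`: stop at the first matching row."""
    row = conn.execute(
        "SELECT 1 FROM events WHERE file_uuid = ? LIMIT 1", (file_uuid,)
    ).fetchone()
    return row is not None
```

Both functions return the same answer; the second lets the database stop as soon as it finds a match.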

opened by alexwlchan 0
Micro-optimisations from the Wellcome fork

Part of https://github.com/archivematica/Issues/issues/1578

I've been reviewing the changes in our fork of artefactual/archivematica. After removing changes that are already merged (OIDC/zipped bag) or changes which won't be merged (Wellcome-specific bits), this is what was left. It's not especially substantial, but it seemed a shame to let it go to waste.

opened by alexwlchan 0
Make ES limits configurable

The total AIPs limit on the "Archival Storage" and "Appraisal" tabs was limited to 10000 items. The limit was hardcoded; this pull request makes it configurable.

Connects to https://github.com/archivematica/Issues/issues/1571

opened by mamedin 1
Add ability to customize LDAP Attributes

This allows users to override the attributes for first name, last name, and email.

I tested this locally and it works. It would be great not to have to overwrite core files on the system.

Thanks!

Related to https://github.com/archivematica/Issues/issues/1565
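
An override of this kind could look like the sketch below, assuming the dashboard maps LDAP attributes through django-auth-ldap's `AUTH_LDAP_USER_ATTR_MAP` setting. The environment variable names and the `build_user_attr_map` helper are hypothetical and are not taken from the pull request.

```python
# Commonly used LDAP attribute names, taken here as defaults (assumption).
DEFAULT_ATTRS = {"first_name": "givenName", "last_name": "sn", "email": "mail"}


def build_user_attr_map(environ):
    """Return the attribute map, applying any overrides found in environ."""
    attr_map = dict(DEFAULT_ATTRS)
    for field in attr_map:
        # Hypothetical variable names, e.g. HYPOTHETICAL_LDAP_ATTR_EMAIL.
        override = environ.get("HYPOTHETICAL_LDAP_ATTR_" + field.upper())
        if override:
            attr_map[field] = override
    return attr_map
```

In a settings module this might be used as `AUTH_LDAP_USER_ATTR_MAP = build_user_attr_map(os.environ)`, so a site can change the mapping without editing core files.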

opened by misilot 0

Releases(v1.13.2)

v1.13.2 (Dec 13, 2021)
v1.12.2 (Dec 13, 2021)
v1.13.1 (Oct 19, 2021)
v1.13.0 (Jul 11, 2021)
v1.12.1 (Jan 11, 2021)
v1.12.0 (Oct 6, 2020)
v1.11.2 (Jun 11, 2020)
v1.11.1 (May 20, 2020)
v1.10.2 (May 20, 2020)
v1.9.3 (May 20, 2020)
v1.11.0-rc.2 (Mar 30, 2020)
v1.11.0 (Mar 31, 2020)
v1.11.0-rc.1 (Mar 23, 2020)
v1.10.1 (Oct 23, 2019)
v1.10.0 (Sep 4, 2019)
v1.10.0-rc.2 (Aug 28, 2019)
v1.10.0-rc.1 (Jul 3, 2019)
v1.9.2 (Jun 28, 2019)
v1.9.2-rc.1 (Jun 20, 2019)
v1.9.1 (Apr 9, 2019)
v1.9.1-rc.4 (Apr 2, 2019)
v1.9.1-rc.3 (Mar 28, 2019)
v1.9.1-rc.2 (Mar 21, 2019)
v1.9.1-rc.1 (Mar 20, 2019)
v1.9.0 (Mar 6, 2019)
v1.9.0-rc.2 (Mar 4, 2019)
v1.9.0-rc.1 (Feb 12, 2019)
v1.8.1 (Jan 10, 2019)
v1.8.0 (Dec 19, 2018)
v1.7.2 (Sep 11, 2018): see https://wiki.archivematica.org/Archivematica_1.7.2_release_notes for details.

Owner

Artefactual

GitHub http://www.archivematica.org


