First commit
Original PyPDF code. Updates should be coming from Noah soon.
commit c59a212a4c
@@ -0,0 +1,2 @@
*.pyc
*.swp
@@ -0,0 +1,205 @@
Version 1.12, 2008-09-02
------------------------

- Added support for XMP metadata.

- Fix reading files with xref streams with multiple /Index values.

- Fix extracting content streams that use graphics operators longer than 2
  characters. Affects merging PDF files.


Version 1.11, 2008-05-09
------------------------

- Patch from Hartmut Goebel to permit RectangleObjects to accept NumberObject
  or FloatObject values.

- PDF compatibility fixes.

- Fix to read object xref stream in correct order.

- Fix for comments inside content streams.


Version 1.10, 2007-10-04
------------------------

- Text strings from PDF files are returned as Unicode string objects when
  pyPdf determines that they can be decoded (as UTF-16 strings, or as
  PDFDocEncoding strings). Unicode objects are also written out when
  necessary. This means that string objects in pyPdf can be either
  generic.ByteStringObject instances, or generic.TextStringObject instances.

- The extractText method now returns a unicode string object.

- All document information properties now return unicode string objects. In
  the event that a document provides docinfo properties that are not decoded
  by pyPdf, the raw byte strings can be accessed with a "_raw" property
  (i.e., title_raw rather than title).

- generic.DictionaryObject instances have been enhanced to be easier to use.
  Values coming out of dictionary objects will automatically be de-referenced
  (.getObject will be called on them), unless accessed by the new "raw_get"
  method. DictionaryObjects can now only contain PdfObject instances (as keys
  and values), making it easier to debug where non-PdfObject values (which
  cannot be written out) are entering dictionaries.

- Support for reading named destinations and outlines in PDF files. Original
  patch by Ashish Kulkarni.

- Stream compatibility reading enhancements for malformed PDF files.

- Cross reference table reading enhancements for malformed PDF files.

- Encryption documentation.

- Replace some "assert" statements with error raising.

- Minor optimizations to FlateDecode algorithm increase speed when using PNG
  predictors.


Version 1.9, 2006-12-15
-----------------------

- Fix several serious bugs introduced in version 1.8, caused by a failure to
  run through our PDF test suite before releasing that version.

- Fix bug in NullObject reading and writing.


Version 1.8, 2006-12-14
-----------------------

- Add support for decryption with the standard PDF security handler. This
  allows for decrypting PDF files given the proper user or owner password.

- Add support for encryption with the standard PDF security handler.

- Add new pythondoc documentation.

- Fix bug in ASCII85 decode that occurs when whitespace exists inside the
  two terminating characters of the stream.


Version 1.7, 2006-12-10
-----------------------

- Fix a bug when using a single page object in two PdfFileWriter objects.

- Adjust PyPDF to be tolerant of whitespace characters that don't belong
  inside a stream object.

- Add documentInfo property to PdfFileReader.

- Add numPages property to PdfFileReader.

- Add pages property to PdfFileReader.

- Add extractText function to PdfFileReader.


Version 1.6, 2006-06-06
-----------------------

- Add basic support for comments in PDF files. This allows us to read some
  ReportLab PDFs that could not be read before.

- Add "auto-repair" for finding xref table at slightly bad locations.

- New StreamObject backend, cleaner and more powerful. Allows the use of
  stream filters more easily, including compressed streams.

- Add a graphics state push/pop around page merges. Improves quality of
  page merges when one page's content stream leaves the graphics
  in an abnormal state.

- Add PageObject.compressContentStreams function, which filters all content
  streams and compresses them. This will reduce the size of PDF pages,
  especially after they have been decompressed in a mergePage operation.

- Support inline images in PDF content streams.

- Add support for using .NET framework compression when zlib is not
  available. This does not make pyPdf compatible with IronPython, but it
  is a first step.

- Add support for reading the document information dictionary, and extracting
  title, author, subject, producer and creator tags.

- Add patch to support NullObject and multiple xref streams, from Bradley
  Lawrence.


Version 1.5, 2006-01-28
-----------------------

- Fix a bug where merging pages did not work in "no-rename" cases when the
  second page has an array of content streams.

- Remove some debugging output that should not have been present.


Version 1.4, 2006-01-27
-----------------------

- Add capability to merge pages from multiple PDF files into a single page
  using the PageObject.mergePage function. See example code (README or web
  site) for more information.

- Add ability to modify a page's MediaBox, CropBox, BleedBox, TrimBox, and
  ArtBox properties through PageObject. See example code (README or web site)
  for more information.

- Refactor pdf.py into multiple files: generic.py (contains objects like
  NameObject, DictionaryObject), filters.py (contains filter code),
  utils.py (various). This does not affect importing PdfFileReader
  or PdfFileWriter.

- Add new decoding functions for standard PDF filters ASCIIHexDecode and
  ASCII85Decode.

- Change url and download_url to refer to new pybrary.net web site.


Version 1.3, 2006-01-23
-----------------------

- Fix new bug introduced in 1.2 where PDF files with \r line endings did not
  work properly anymore. A new test suite developed with various PDF files
  should prevent regression bugs from now on.

- Fix a bug where inheriting attributes from page nodes did not work.


Version 1.2, 2006-01-23
-----------------------

- Improved support for files with CRLF-based line endings, fixing a common
  reported problem stating "assertion error: assert line == "%%EOF"".

- Software author/maintainer is now officially a proud married person, which
  is sure to result in better software... somehow.


Version 1.1, 2006-01-18
-----------------------

- Add capability to rotate pages.

- Improved PDF reading support to properly manage inherited attributes from
  /Type=/Pages nodes. This means that page groups that are rotated or have
  different media boxes will now work properly.

- Added PDF 1.5 support, namely cross-reference streams and object streams.
  This release can mangle Adobe's PDFReference16.pdf successfully.


Version 1.0, 2006-01-17
-----------------------

- First distutils-capable true public release. Supports a wide variety of PDF
  files that I found sitting around on my system.

- Does not support some PDF 1.5 features, such as object streams and
  cross-reference streams.
@@ -0,0 +1,28 @@
Copyright (c) 2006-2008, Mathieu Fenniak
Some contributions copyright (c) 2007, Ashish Kulkarni <kulkarni.ashish@gmail.com>

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.
* The name of the author may not be used to endorse or promote products
  derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1 @@
include CHANGELOG
@@ -0,0 +1,4 @@
from pdf import PdfFileReader, PdfFileWriter
from merger import PdfFileMerger

__all__ = ["pdf", "PdfFileMerger"]
@@ -0,0 +1,252 @@
# vim: sw=4:expandtab:foldmethod=marker
#
# Copyright (c) 2006, Mathieu Fenniak
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# * Redistributions of source code must retain the above copyright notice,
#   this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
#   this list of conditions and the following disclaimer in the documentation
#   and/or other materials provided with the distribution.
# * The name of the author may not be used to endorse or promote products
#   derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.


"""
Implementation of stream filters for PDF.
"""
__author__ = "Mathieu Fenniak"
__author_email__ = "biziqe@mathieu.fenniak.net"

from utils import PdfReadError
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

try:
    import zlib
    def decompress(data):
        return zlib.decompress(data)
    def compress(data):
        return zlib.compress(data)
except ImportError:
    # Unable to import zlib.  Attempt to use the System.IO.Compression
    # library from the .NET framework. (IronPython only)
    import System
    from System import IO, Collections, Array
    def _string_to_bytearr(buf):
        retval = Array.CreateInstance(System.Byte, len(buf))
        for i in range(len(buf)):
            retval[i] = ord(buf[i])
        return retval
    def _bytearr_to_string(bytes):
        retval = ""
        for i in range(bytes.Length):
            retval += chr(bytes[i])
        return retval
    def _read_bytes(stream):
        ms = IO.MemoryStream()
        buf = Array.CreateInstance(System.Byte, 2048)
        while True:
            bytes = stream.Read(buf, 0, buf.Length)
            if bytes == 0:
                break
            else:
                ms.Write(buf, 0, bytes)
        retval = ms.ToArray()
        ms.Close()
        return retval
    def decompress(data):
        bytes = _string_to_bytearr(data)
        ms = IO.MemoryStream()
        ms.Write(bytes, 0, bytes.Length)
        ms.Position = 0  # fseek 0
        gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Decompress)
        bytes = _read_bytes(gz)
        retval = _bytearr_to_string(bytes)
        gz.Close()
        return retval
    def compress(data):
        bytes = _string_to_bytearr(data)
        ms = IO.MemoryStream()
        gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress, True)
        gz.Write(bytes, 0, bytes.Length)
        gz.Close()
        ms.Position = 0  # fseek 0
        bytes = ms.ToArray()
        retval = _bytearr_to_string(bytes)
        ms.Close()
        return retval


class FlateDecode(object):
    def decode(data, decodeParms):
        data = decompress(data)
        predictor = 1
        if decodeParms:
            predictor = decodeParms.get("/Predictor", 1)
        # predictor 1 == no predictor
        if predictor != 1:
            columns = decodeParms["/Columns"]
            # PNG prediction:
            if predictor >= 10 and predictor <= 15:
                output = StringIO()
                # PNG prediction can vary from row to row
                rowlength = columns + 1
                assert len(data) % rowlength == 0
                prev_rowdata = (0,) * rowlength
                for row in xrange(len(data) / rowlength):
                    rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
                    filterByte = rowdata[0]
                    if filterByte == 0:
                        pass
                    elif filterByte == 1:
                        for i in range(2, rowlength):
                            rowdata[i] = (rowdata[i] + rowdata[i-1]) % 256
                    elif filterByte == 2:
                        for i in range(1, rowlength):
                            rowdata[i] = (rowdata[i] + prev_rowdata[i]) % 256
                    else:
                        # unsupported PNG filter
                        raise PdfReadError("Unsupported PNG filter %r" % filterByte)
                    prev_rowdata = rowdata
                    output.write(''.join([chr(x) for x in rowdata[1:]]))
                data = output.getvalue()
            else:
                # unsupported predictor
                raise PdfReadError("Unsupported flatedecode predictor %r" % predictor)
        return data
    decode = staticmethod(decode)

    def encode(data):
        return compress(data)
    encode = staticmethod(encode)


class ASCIIHexDecode(object):
    def decode(data, decodeParms=None):
        retval = ""
        char = ""
        x = 0
        while True:
            c = data[x]
            if c == ">":
                break
            elif c.isspace():
                x += 1
                continue
            char += c
            if len(char) == 2:
                retval += chr(int(char, base=16))
                char = ""
            x += 1
        assert char == ""
        return retval
    decode = staticmethod(decode)


class ASCII85Decode(object):
    def decode(data, decodeParms=None):
        retval = ""
        group = []
        x = 0
        hitEod = False
        # remove all whitespace from data
        data = [y for y in data if not (y in ' \n\r\t')]
        while not hitEod:
            c = data[x]
            if len(retval) == 0 and c == "<" and data[x+1] == "~":
                x += 2
                continue
            elif c == 'z':
                assert len(group) == 0
                retval += '\x00\x00\x00\x00'
                x += 1  # advance past 'z'; without this the loop never terminates
                continue
            elif c == "~" and data[x+1] == ">":
                if len(group) != 0:
                    # cannot have a final group of just 1 char
                    assert len(group) > 1
                    cnt = len(group) - 1
                    group += [ 85, 85, 85 ]
                    hitEod = cnt
                else:
                    break
            else:
                c = ord(c) - 33
                assert c >= 0 and c < 85
                group += [ c ]
            if len(group) >= 5:
                b = group[0] * (85**4) + \
                    group[1] * (85**3) + \
                    group[2] * (85**2) + \
                    group[3] * 85 + \
                    group[4]
                assert b < (2**32 - 1)
                c4 = chr((b >> 0) % 256)
                c3 = chr((b >> 8) % 256)
                c2 = chr((b >> 16) % 256)
                c1 = chr(b >> 24)
                retval += (c1 + c2 + c3 + c4)
                if hitEod:
                    retval = retval[:-4+hitEod]
                group = []
            x += 1
        return retval
    decode = staticmethod(decode)


def decodeStreamData(stream):
    from generic import NameObject
    filters = stream.get("/Filter", ())
    if len(filters) and not isinstance(filters[0], NameObject):
        # we have a single filter instance
        filters = (filters,)
    data = stream._data
    for filterType in filters:
        if filterType == "/FlateDecode":
            data = FlateDecode.decode(data, stream.get("/DecodeParms"))
        elif filterType == "/ASCIIHexDecode":
            data = ASCIIHexDecode.decode(data)
        elif filterType == "/ASCII85Decode":
            data = ASCII85Decode.decode(data)
        elif filterType == "/Crypt":
            decodeParms = stream.get("/DecodeParms", {})
            if "/Name" not in decodeParms and "/Type" not in decodeParms:
                pass
            else:
                raise NotImplementedError("/Crypt filter with /Name or /Type not supported yet")
        else:
            # unsupported filter
            raise NotImplementedError("unsupported filter %s" % filterType)
    return data


if __name__ == "__main__":
    assert "abc" == ASCIIHexDecode.decode('61\n626\n3>')

    ascii85Test = """
<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>
"""
    ascii85_originalText = "Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure."
    assert ASCII85Decode.decode(ascii85Test) == ascii85_originalText
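FlateDecode's PNG-predictor loop above reverses the per-row PNG filters (type 0 = None, type 1 = Sub against the previous byte, type 2 = Up against the previous row) after zlib decompression. The same unfiltering can be sketched in modern Python 3; the helper name and the hand-built two-row stream below are illustrative, not part of pyPdf:

```python
import zlib

def unfilter_up(data, columns):
    # Reverse the PNG "Up" filter (type 2): each stored byte is the
    # difference from the byte directly above it in the previous row.
    rowlength = columns + 1          # one leading filter-type byte per row
    assert len(data) % rowlength == 0
    prev = bytes(rowlength)          # row "above" the first row is all zeros
    out = bytearray()
    for r in range(len(data) // rowlength):
        row = bytearray(data[r * rowlength:(r + 1) * rowlength])
        assert row[0] == 2           # this sketch handles only the Up filter
        for i in range(1, rowlength):
            row[i] = (row[i] + prev[i]) % 256
        prev = row
        out += row[1:]               # drop the filter-type byte
    return bytes(out)

# Build a tiny two-row "predicted" stream by hand and recover the original.
original = bytes([10, 20, 30, 12, 22, 33])   # two rows of 3 columns
row1 = bytes([2, 10, 20, 30])                # Up deltas against the zero row
row2 = bytes([2, 2, 2, 3])                   # Up deltas against row1
decoded = unfilter_up(zlib.decompress(zlib.compress(row1 + row2)), 3)
```

Running the last line recovers `original`, mirroring what `FlateDecode.decode` does for `/Predictor` values 10-15 (pyPdf dispatches on the per-row filter byte, so mixed filter types within one stream are handled too).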
File diff suppressed because it is too large
@ -0,0 +1,401 @@
|
|||
# vim: sw=4:expandtab:foldmethod=marker
|
||||
#
|
||||
# Copyright (c) 2006, Mathieu Fenniak
|
||||
# All rights reserved.
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are
|
||||
# met:
|
||||
#
|
||||
# * Redistributions of source code must retain the above copyright notice,
|
||||
# this list of conditions and the following disclaimer.
|
||||
# * Redistributions in binary form must reproduce the above copyright notice,
|
||||
# this list of conditions and the following disclaimer in the documentation
|
||||
# and/or other materials provided with the distribution.
|
||||
# * The name of the author may not be used to endorse or promote products
|
||||
# derived from this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
|
||||
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
|
||||
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
|
||||
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
||||
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
|
||||
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
||||
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
|
||||
# POSSIBILITY OF SUCH DAMAGE.
|
||||
|
||||
from generic import *
|
||||
from pdf import PdfFileReader, PdfFileWriter, Destination
|
||||
|
||||
class _MergedPage(object):
|
||||
"""
|
||||
_MergedPage is used internally by PdfFileMerger to collect necessary information on each page that is being merged.
|
||||
"""
|
||||
def __init__(self, pagedata, src, id):
|
||||
self.src = src
|
||||
self.pagedata = pagedata
|
||||
self.out_pagedata = None
|
||||
self.id = id
|
||||
|
||||
class PdfFileMerger(object):
|
||||
"""
|
||||
PdfFileMerger merges multiple PDFs into a single PDF. It can concatenate,
|
||||
slice, insert, or any combination of the above.
|
||||
|
||||
See the functions "merge" (or "append") and "write" (or "overwrite") for
|
||||
usage information.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
"""
|
||||
>>> PdfFileMerger()
|
||||
|
||||
Initializes a PdfFileMerger, no parameters required
|
||||
"""
|
||||
self.inputs = []
|
||||
self.pages = []
|
||||
self.output = PdfFileWriter()
|
||||
self.bookmarks = []
|
||||
self.named_dests = []
|
||||
self.id_count = 0
|
||||
|
||||
def merge(self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True):
|
||||
"""
|
||||
>>> merge(position, file, bookmark=None, pages=None, import_bookmarks=True)
|
||||
|
||||
Merges the pages from the source document specified by "file" into the output
|
||||
file at the page number specified by "position".
|
||||
|
||||
Optionally, you may specify a bookmark to be applied at the beginning of the
|
||||
included file by supplying the text of the bookmark in the "bookmark" parameter.
|
||||
|
||||
You may prevent the source document's bookmarks from being imported by
|
||||
specifying "import_bookmarks" as False.
|
||||
|
||||
You may also use the "pages" parameter to merge only the specified range of
|
||||
pages from the source document into the output document.
|
||||
"""
|
||||
|
||||
my_file = False
|
||||
if type(fileobj) in (str, unicode):
|
||||
fileobj = file(fileobj, 'rb')
|
||||
my_file = True
|
||||
|
||||
if type(fileobj) == PdfFileReader:
|
||||
pdfr = fileobj
|
||||
fileobj = pdfr.file
|
||||
else:
|
||||
pdfr = PdfFileReader(fileobj)
|
||||
|
||||
# Find the range of pages to merge
|
||||
if pages == None:
|
||||
pages = (0, pdfr.getNumPages())
|
||||
elif type(pages) in (int, float, str, unicode):
|
||||
raise TypeError('"pages" must be a tuple of (start, end)')
|
||||
|
||||
srcpages = []
|
||||
|
||||
if bookmark:
|
||||
bookmark = Bookmark(TextStringObject(bookmark), NumberObject(self.id_count), NameObject('/Fit'))
|
||||
|
||||
outline = []
|
||||
if import_bookmarks:
|
||||
outline = pdfr.getOutlines()
|
||||
outline = self._trim_outline(pdfr, outline, pages)
|
||||
|
||||
if bookmark:
|
||||
self.bookmarks += [bookmark, outline]
|
||||
else:
|
||||
self.bookmarks += outline
|
||||
|
||||
dests = pdfr.namedDestinations
|
||||
dests = self._trim_dests(pdfr, dests, pages)
|
||||
self.named_dests += dests
|
||||
|
||||
# Gather all the pages that are going to be merged
|
||||
for i in range(*pages):
|
||||
pg = pdfr.getPage(i)
|
||||
|
||||
id = self.id_count
|
||||
self.id_count += 1
|
||||
|
||||
mp = _MergedPage(pg, pdfr, id)
|
||||
|
||||
srcpages.append(mp)
|
||||
|
||||
self._associate_dests_to_pages(srcpages)
|
||||
self._associate_bookmarks_to_pages(srcpages)
|
||||
|
||||
|
||||
# Slice to insert the pages at the specified position
|
||||
self.pages[position:position] = srcpages
|
||||
|
||||
# Keep track of our input files so we can close them later
|
||||
self.inputs.append((fileobj, pdfr, my_file))
|
||||
|
||||
|
||||
def append(self, fileobj, bookmark=None, pages=None, import_bookmarks=True):
|
||||
"""
|
||||
>>> append(file, bookmark=None, pages=None, import_bookmarks=True):
|
||||
|
||||
Identical to the "merge" function, but assumes you want to concatenate all pages
|
||||
onto the end of the file instead of specifying a position.
|
||||
"""
|
||||
|
||||
self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
|
||||
|
||||
|
||||
def write(self, fileobj):
|
||||
"""
|
||||
>>> write(file)
|
||||
|
||||
Writes all data that has been merged to "file" (which can be a filename or any
|
||||
kind of file-like object)
|
||||
"""
|
||||
my_file = False
|
||||
if type(fileobj) in (str, unicode):
|
||||
fileobj = file(fileobj, 'wb')
|
||||
my_file = True
|
||||
|
||||
|
||||
# Add pages to the PdfFileWriter
|
||||
for page in self.pages:
|
||||
self.output.addPage(page.pagedata)
|
||||
page.out_pagedata = self.output.getReference(self.output._pages.getObject()["/Kids"][-1].getObject())
|
||||
|
||||
|
||||
# Once all pages are added, create bookmarks to point at those pages
|
||||
self._write_dests()
|
||||
self._write_bookmarks()
|
||||
|
||||
# Write the output to the file
|
||||
self.output.write(fileobj)
|
||||
|
||||
if my_file:
|
||||
fileobj.close()
|
||||
|
||||
|
||||
|
||||
def close(self):
|
||||
"""
|
||||
>>> close()
|
||||
|
||||
Shuts all file descriptors (input and output) and clears all memory usage
|
||||
"""
|
||||
self.pages = []
|
||||
for fo, pdfr, mine in self.inputs:
|
||||
if mine:
|
||||
fo.close()
|
||||
|
||||
self.inputs = []
|
||||
self.output = None
|
||||
|
||||
def _trim_dests(self, pdf, dests, pages):
|
||||
"""
|
||||
Removes any named destinations that are not a part of the specified page set
|
||||
"""
|
||||
new_dests = []
|
||||
prev_header_added = True
|
||||
for k, o in dests.items():
|
||||
for j in range(*pages):
|
||||
if pdf.getPage(j).getObject() == o['/Page'].getObject():
|
||||
o[NameObject('/Page')] = o['/Page'].getObject()
|
||||
assert str(k) == str(o['/Title'])
|
||||
new_dests.append(o)
|
||||
break
|
||||
return new_dests
|
||||
|
||||
def _trim_outline(self, pdf, outline, pages):
|
||||
"""
|
||||
Removes any outline/bookmark entries that are not a part of the specified page set
|
||||
"""
|
||||
new_outline = []
|
||||
prev_header_added = True
|
||||
for i, o in enumerate(outline):
|
||||
if type(o) == list:
|
||||
sub = self._trim_outline(pdf, o, pages)
|
||||
if sub:
|
||||
if not prev_header_added:
|
||||
new_outline.append(outline[i-1])
|
||||
new_outline.append(sub)
|
||||
else:
|
||||
prev_header_added = False
|
||||
for j in range(*pages):
|
||||
if pdf.getPage(j).getObject() == o['/Page'].getObject():
|
||||
o[NameObject('/Page')] = o['/Page'].getObject()
|
||||
new_outline.append(o)
|
||||
prev_header_added = True
|
||||
break
|
||||
return new_outline
|
||||
|
||||
def _write_dests(self):
|
||||
dests = self.named_dests
|
||||
|
||||
for v in dests:
|
||||
pageno = None
|
||||
pdf = None
|
||||
if v.has_key('/Page'):
|
||||
for i, p in enumerate(self.pages):
|
||||
if p.id == v['/Page']:
|
||||
v[NameObject('/Page')] = p.out_pagedata
|
||||
pageno = i
|
||||
pdf = p.src
|
||||
if pageno != None:
|
||||
self.output.addNamedDestinationObject(v)
|
||||
|
||||
def _write_bookmarks(self, bookmarks=None, parent=None):
|
||||
|
||||
if bookmarks == None:
|
||||
bookmarks = self.bookmarks
|
||||
|
||||
|
||||
last_added = None
|
||||
for b in bookmarks:
|
||||
if type(b) == list:
|
||||
self._write_bookmarks(b, last_added)
|
||||
continue
|
||||
|
||||
pageno = None
|
||||
pdf = None
|
||||
if b.has_key('/Page'):
|
||||
for i, p in enumerate(self.pages):
|
||||
if p.id == b['/Page']:
|
||||
b[NameObject('/Page')] = p.out_pagedata
|
||||
pageno = i
|
||||
pdf = p.src
|
||||
if pageno != None:
|
||||
last_added = self.output.addBookmarkDestination(b, parent)
|
||||
|
||||
|
||||
def _associate_dests_to_pages(self, pages):
|
||||
for nd in self.named_dests:
|
||||
pageno = None
|
||||
np = nd['/Page']
|
||||
|
||||
if type(np) == NumberObject:
|
||||
continue
|
||||
|
||||
for p in pages:
|
||||
if np.getObject() == p.pagedata.getObject():
|
||||
pageno = p.id
|
||||
|
||||
if pageno != None:
|
||||
nd[NameObject('/Page')] = NumberObject(pageno)
|
||||
else:
|
||||
raise ValueError, "Unresolved named destination '%s'" % (nd['/Title'],)
|
||||
|
||||
def _associate_bookmarks_to_pages(self, pages, bookmarks=None):
|
||||
if bookmarks == None:
|
||||
bookmarks = self.bookmarks
|
||||
|
||||
for b in bookmarks:
|
||||
if type(b) == list:
|
||||
self._associate_bookmarks_to_pages(pages, b)
|
||||
continue
|
||||
|
||||
pageno = None
|
||||
bp = b['/Page']
|
||||
|
||||
if type(bp) == NumberObject:
|
||||
continue
|
||||
|
||||
for p in pages:
|
||||
if bp.getObject() == p.pagedata.getObject():
|
||||
pageno = p.id
|
||||
|
||||
if pageno != None:
|
||||
b[NameObject('/Page')] = NumberObject(pageno)
|
||||
else:
|
||||
raise ValueError, "Unresolved bookmark '%s'" % (b['/Title'],)
|
||||
|
||||
def findBookmark(self, bookmark, root=None):
|
||||
if root == None:
|
||||
root = self.bookmarks
|
||||
|
||||
for i, b in enumerate(root):
|
||||
if type(b) == list:
|
||||
res = self.findBookmark(bookmark, b)
|
||||
            if res:
                return [i] + res
            if b == bookmark or b['/Title'] == bookmark:
                return [i]
        return None

    def addBookmark(self, title, pagenum, parent=None):
        """
        Add a bookmark to the pdf, using the specified title and pointing at
        the specified page number.  A parent can be specified to make this a
        nested bookmark below the parent.
        """
        if parent is None:
            iloc = [len(self.bookmarks)-1]
        elif isinstance(parent, list):
            iloc = parent
        else:
            iloc = self.findBookmark(parent)

        dest = Bookmark(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826))

        if parent is None:
            self.bookmarks.append(dest)
        else:
            bmparent = self.bookmarks
            for i in iloc[:-1]:
                bmparent = bmparent[i]
            npos = iloc[-1]+1
            if npos < len(bmparent) and isinstance(bmparent[npos], list):
                bmparent[npos].append(dest)
            else:
                bmparent.insert(npos, [dest])

    def addNamedDestination(self, title, pagenum):
        """
        Add a destination to the pdf, using the specified title and pointing
        at the specified page number.
        """
        dest = Destination(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826))
        self.named_dests.append(dest)


class OutlinesObject(list):
    def __init__(self, pdf, tree, parent=None):
        list.__init__(self)
        self.tree = tree
        self.pdf = pdf
        self.parent = parent

    def remove(self, index):
        obj = self[index]
        del self[index]
        self.tree.removeChild(obj)

    def add(self, title, pagenum):
        # resolve the page reference from the page number
        pageRef = self.pdf.getObject(self.pdf._pages)['/Kids'][pagenum]
        action = DictionaryObject()
        action.update({
            NameObject('/D') : ArrayObject([pageRef, NameObject('/FitH'), NumberObject(826)]),
            NameObject('/S') : NameObject('/GoTo')
        })
        actionRef = self.pdf._addObject(action)
        bookmark = TreeObject()
        bookmark.update({
            NameObject('/A') : actionRef,
            NameObject('/Title') : createStringObject(title),
        })
        self.pdf._addObject(bookmark)
        self.tree.addChild(bookmark)

    def removeAll(self):
        for child in list(self.tree.children()):
            self.tree.removeChild(child)
            self.pop()
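findBookmark above walks a nested list in which a sublist immediately following an item holds that item's children, and returns an index path into that structure. A minimal standalone sketch of the same search, using plain strings in place of Bookmark objects:

```python
# Simplified stand-in for the nested-bookmark search above: titles are plain
# strings here (the real code compares Bookmark objects and '/Title' values),
# and a sublist directly after an item holds that item's children.
def find_path(target, root):
    for i, b in enumerate(root):
        if isinstance(b, list):
            res = find_path(target, b)
            if res:
                return [i] + res
        elif b == target:
            return [i]
    return None

outline = ["Chapter 1", ["Section 1.1", "Section 1.2"], "Chapter 2"]
path = find_path("Section 1.2", outline)   # [1, 1]: second item, then its second child
```

The returned path is exactly the `iloc` list that addBookmark uses to decide where to append or insert a new child list.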
@ -0,0 +1,125 @@
# vim: sw=4:expandtab:foldmethod=marker
#
# Copyright (c) 2006, Mathieu Fenniak
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# * Redistributions of source code must retain the above copyright notice,
#   this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
#   this list of conditions and the following disclaimer in the documentation
#   and/or other materials provided with the distribution.
# * The name of the author may not be used to endorse or promote products
#   derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.


"""
Utility functions for PDF library.
"""
__author__ = "Mathieu Fenniak"
__author_email__ = "biziqe@mathieu.fenniak.net"

#ENABLE_PSYCO = False
#if ENABLE_PSYCO:
#    try:
#        import psyco
#    except ImportError:
#        ENABLE_PSYCO = False
#
#if not ENABLE_PSYCO:
#    class psyco:
#        def proxy(func):
#            return func
#        proxy = staticmethod(proxy)

def readUntilWhitespace(stream, maxchars=None):
    txt = ""
    while True:
        tok = stream.read(1)
        if tok.isspace() or not tok:
            break
        txt += tok
        if len(txt) == maxchars:
            break
    return txt

def readNonWhitespace(stream):
    tok = ' '
    while tok == '\n' or tok == '\r' or tok == ' ' or tok == '\t':
        tok = stream.read(1)
    return tok
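Both helpers read one character at a time from any file-like object, which is how the parser tokenizes PDF content without loading it wholesale. A quick standalone check (functions repeated here so the sketch runs on its own, against io.StringIO):

```python
# Standalone check of the byte-at-a-time tokenizer helpers above,
# run against an in-memory stream.
from io import StringIO

def readUntilWhitespace(stream, maxchars=None):
    # accumulate characters until whitespace, end of stream, or maxchars
    txt = ""
    while True:
        tok = stream.read(1)
        if tok.isspace() or not tok:
            break
        txt += tok
        if len(txt) == maxchars:
            break
    return txt

def readNonWhitespace(stream):
    # skip spaces, tabs, and newlines; return the first other character
    tok = ' '
    while tok == '\n' or tok == '\r' or tok == ' ' or tok == '\t':
        tok = stream.read(1)
    return tok

stream = StringIO("obj   123")
first = readUntilWhitespace(stream)   # "obj"; the space that ends it is consumed
nxt = readNonWhitespace(stream)       # skips the remaining blanks, returns '1'
rest = readUntilWhitespace(stream)    # "23"; stops at end of stream
```

Note that readUntilWhitespace consumes the terminating whitespace character, while readNonWhitespace consumes and returns the first non-blank one.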
class ConvertFunctionsToVirtualList(object):
    def __init__(self, lengthFunction, getFunction):
        self.lengthFunction = lengthFunction
        self.getFunction = getFunction

    def __len__(self):
        return self.lengthFunction()

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError, "sequence indices must be integers"
        len_self = len(self)
        if index < 0:
            # support negative indexes
            index = len_self + index
        if index < 0 or index >= len_self:
            raise IndexError, "sequence index out of range"
        return self.getFunction(index)
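ConvertFunctionsToVirtualList adapts a (length, getter) pair into a lazy read-only sequence, so items are produced on demand rather than materialized up front. A standalone sketch of the idea (repeated here with function-call raise syntax so it runs on its own under any Python):

```python
# Standalone sketch of the virtual-list wrapper above.
class ConvertFunctionsToVirtualList(object):
    def __init__(self, lengthFunction, getFunction):
        self.lengthFunction = lengthFunction
        self.getFunction = getFunction

    def __len__(self):
        return self.lengthFunction()

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("sequence indices must be integers")
        len_self = len(self)
        if index < 0:
            index = len_self + index  # support negative indexes
        if index < 0 or index >= len_self:
            raise IndexError("sequence index out of range")
        return self.getFunction(index)

# e.g. a "list" of squares computed on demand, never stored anywhere:
squares = ConvertFunctionsToVirtualList(lambda: 5, lambda i: i * i)
```

Because `__getitem__` raises IndexError past the end, the wrapper also supports plain iteration via the sequence protocol.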
def RC4_encrypt(key, plaintext):
    S = [i for i in range(256)]
    j = 0
    for i in range(256):
        j = (j + S[i] + ord(key[i % len(key)])) % 256
        S[i], S[j] = S[j], S[i]
    i, j = 0, 0
    retval = ""
    for x in range(len(plaintext)):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        t = S[(S[i] + S[j]) % 256]
        retval += chr(ord(plaintext[x]) ^ t)
    return retval
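RC4 is a symmetric stream cipher: the same keystream is XORed onto the data in both directions, so RC4_encrypt also decrypts. A standalone round-trip check (function repeated here so the sketch runs alone):

```python
# RC4 is its own inverse: encrypting twice with the same key returns
# the original plaintext.
def RC4_encrypt(key, plaintext):
    # key-scheduling algorithm (KSA)
    S = [i for i in range(256)]
    j = 0
    for i in range(256):
        j = (j + S[i] + ord(key[i % len(key)])) % 256
        S[i], S[j] = S[j], S[i]
    # pseudo-random generation, XORed with the plaintext
    i, j = 0, 0
    retval = ""
    for x in range(len(plaintext)):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        t = S[(S[i] + S[j]) % 256]
        retval += chr(ord(plaintext[x]) ^ t)
    return retval

ciphertext = RC4_encrypt("Key", "Plaintext")
recovered = RC4_encrypt("Key", ciphertext)
```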
def matrixMultiply(a, b):
    return [[sum([float(i)*float(j)
                  for i, j in zip(row, col)]
                ) for col in zip(*b)]
            for row in a]
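matrixMultiply computes a standard row-by-column product, using zip(*b) to transpose b into its columns. A quick standalone check (function repeated here so the sketch runs on its own):

```python
# Check the nested-comprehension matrix product: rows of a are dotted
# with columns of b (obtained by transposing b with zip(*b)).
def matrixMultiply(a, b):
    return [[sum([float(i)*float(j)
                  for i, j in zip(row, col)]
                ) for col in zip(*b)]
            for row in a]

identity = [[1, 0], [0, 1]]
m = [[1, 2], [3, 4]]
product = matrixMultiply(m, identity)   # multiplying by I leaves m unchanged
```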
class PyPdfError(Exception):
    pass

class PdfReadError(PyPdfError):
    pass

class PageSizeNotDefinedError(PyPdfError):
    pass

class PdfReadWarning(UserWarning):
    pass

if __name__ == "__main__":
    # test RC4
    out = RC4_encrypt("Key", "Plaintext")
    print repr(out)
    pt = RC4_encrypt("Key", out)
    print repr(pt)

@ -0,0 +1,355 @@
import re
import datetime
import decimal
from generic import PdfObject
from xml.dom import getDOMImplementation
from xml.dom.minidom import parseString

RDF_NAMESPACE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
XMP_NAMESPACE = "http://ns.adobe.com/xap/1.0/"
PDF_NAMESPACE = "http://ns.adobe.com/pdf/1.3/"
XMPMM_NAMESPACE = "http://ns.adobe.com/xap/1.0/mm/"

# What is the PDFX namespace, you might ask?  I might ask that too.  It's
# a completely undocumented namespace used to place "custom metadata"
# properties, which are arbitrary metadata properties with no semantic or
# documented meaning.  Elements in the namespace are key/value-style storage,
# where the element name is the key and the content is the value.  The keys
# are transformed into valid XML identifiers by substituting an invalid
# identifier character with \u2182 followed by the unicode hex ID of the
# original character.  A key like "my car" is therefore "my\u21820020car".
#
# \u2182, in case you're wondering, is the unicode character
# \u{ROMAN NUMERAL TEN THOUSAND}, a straightforward and obvious choice for
# escaping characters.
#
# Intentional users of the pdfx namespace should be shot on sight.  A
# custom data schema and sensible XML elements could be used instead, as is
# suggested by Adobe's own documentation on XMP (under "Extensibility of
# Schemas").
#
# Information presented here on the /pdfx/ schema is a result of limited
# reverse engineering, and does not constitute a full specification.
PDFX_NAMESPACE = "http://ns.adobe.com/pdfx/1.3/"

iso8601 = re.compile("""
        (?P<year>[0-9]{4})
        (-
            (?P<month>[0-9]{2})
            (-
                (?P<day>[0-9]+)
                (T
                    (?P<hour>[0-9]{2}):
                    (?P<minute>[0-9]{2})
                    (:(?P<second>[0-9]{2}(\.[0-9]+)?))?
                    (?P<tzd>Z|[-+][0-9]{2}:[0-9]{2})
                )?
            )?
        )?
        """, re.VERBOSE)

##
# An object that represents Adobe XMP metadata.
class XmpInformation(PdfObject):

    def __init__(self, stream):
        self.stream = stream
        docRoot = parseString(self.stream.getData())
        self.rdfRoot = docRoot.getElementsByTagNameNS(RDF_NAMESPACE, "RDF")[0]
        self.cache = {}

    def writeToStream(self, stream, encryption_key):
        self.stream.writeToStream(stream, encryption_key)

    def getElement(self, aboutUri, namespace, name):
        for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"):
            if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri:
                attr = desc.getAttributeNodeNS(namespace, name)
                if attr is not None:
                    yield attr
                for element in desc.getElementsByTagNameNS(namespace, name):
                    yield element

    def getNodesInNamespace(self, aboutUri, namespace):
        for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"):
            if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri:
                for i in range(desc.attributes.length):
                    attr = desc.attributes.item(i)
                    if attr.namespaceURI == namespace:
                        yield attr
                for child in desc.childNodes:
                    if child.namespaceURI == namespace:
                        yield child

    def _getText(self, element):
        text = ""
        for child in element.childNodes:
            if child.nodeType == child.TEXT_NODE:
                text += child.data
        return text

    def _converter_string(value):
        return value

    def _converter_date(value):
        m = iso8601.match(value)
        year = int(m.group("year"))
        month = int(m.group("month") or "1")
        day = int(m.group("day") or "1")
        hour = int(m.group("hour") or "0")
        minute = int(m.group("minute") or "0")
        second = decimal.Decimal(m.group("second") or "0")
        seconds = int(second.to_integral(decimal.ROUND_FLOOR))
        microseconds = int((second - seconds) * 1000000)
        tzd = m.group("tzd") or "Z"
        dt = datetime.datetime(year, month, day, hour, minute, seconds, microseconds)
        if tzd != "Z":
            tzd_hours, tzd_minutes = [int(x) for x in tzd.split(":")]
            # normalize to UTC by applying the negated offset
            tzd_hours *= -1
            if tzd_hours < 0:
                tzd_minutes *= -1
            dt = dt + datetime.timedelta(hours=tzd_hours, minutes=tzd_minutes)
        return dt
    _test_converter_date = staticmethod(_converter_date)
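The tail of _converter_date normalizes a parsed local timestamp to UTC by adding the negated "+HH:MM" offset. A standalone sketch of just that normalization step (`to_utc` is a hypothetical helper name, not part of the module):

```python
# Standalone sketch of the UTC normalization used by _converter_date:
# the "+HH:MM" offset is negated, then added as a timedelta.
import datetime

def to_utc(dt, tzd):
    # tzd is "Z" or an offset like "+05:30" / "-06:00"
    if tzd == "Z":
        return dt
    tzd_hours, tzd_minutes = [int(x) for x in tzd.split(":")]
    tzd_hours *= -1
    if tzd_hours < 0:
        tzd_minutes *= -1
    return dt + datetime.timedelta(hours=tzd_hours, minutes=tzd_minutes)

local = datetime.datetime(2008, 9, 2, 10, 30)
```

For example, 10:30 at offset +05:30 becomes 05:00 UTC, and 10:30 at offset -06:00 becomes 16:30 UTC.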
    def _getter_bag(namespace, name, converter):
        def get(self):
            cached = self.cache.get(namespace, {}).get(name)
            if cached:
                return cached
            retval = []
            for element in self.getElement("", namespace, name):
                bags = element.getElementsByTagNameNS(RDF_NAMESPACE, "Bag")
                if len(bags):
                    for bag in bags:
                        for item in bag.getElementsByTagNameNS(RDF_NAMESPACE, "li"):
                            value = self._getText(item)
                            value = converter(value)
                            retval.append(value)
            ns_cache = self.cache.setdefault(namespace, {})
            ns_cache[name] = retval
            return retval
        return get

    def _getter_seq(namespace, name, converter):
        def get(self):
            cached = self.cache.get(namespace, {}).get(name)
            if cached:
                return cached
            retval = []
            for element in self.getElement("", namespace, name):
                seqs = element.getElementsByTagNameNS(RDF_NAMESPACE, "Seq")
                if len(seqs):
                    for seq in seqs:
                        for item in seq.getElementsByTagNameNS(RDF_NAMESPACE, "li"):
                            value = self._getText(item)
                            value = converter(value)
                            retval.append(value)
                else:
                    value = converter(self._getText(element))
                    retval.append(value)
            ns_cache = self.cache.setdefault(namespace, {})
            ns_cache[name] = retval
            return retval
        return get

    def _getter_langalt(namespace, name, converter):
        def get(self):
            cached = self.cache.get(namespace, {}).get(name)
            if cached:
                return cached
            retval = {}
            for element in self.getElement("", namespace, name):
                alts = element.getElementsByTagNameNS(RDF_NAMESPACE, "Alt")
                if len(alts):
                    for alt in alts:
                        for item in alt.getElementsByTagNameNS(RDF_NAMESPACE, "li"):
                            value = self._getText(item)
                            value = converter(value)
                            retval[item.getAttribute("xml:lang")] = value
                else:
                    retval["x-default"] = converter(self._getText(element))
            ns_cache = self.cache.setdefault(namespace, {})
            ns_cache[name] = retval
            return retval
        return get

    def _getter_single(namespace, name, converter):
        def get(self):
            cached = self.cache.get(namespace, {}).get(name)
            if cached:
                return cached
            value = None
            for element in self.getElement("", namespace, name):
                if element.nodeType == element.ATTRIBUTE_NODE:
                    value = element.nodeValue
                else:
                    value = self._getText(element)
                break
            if value is not None:
                value = converter(value)
            ns_cache = self.cache.setdefault(namespace, {})
            ns_cache[name] = value
            return value
        return get
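Each metadata property below is produced by one of the getter factories above: the factory closes over (namespace, name, converter) and returns a get function that parses lazily on first access and caches per namespace. A simplified standalone sketch of the pattern, with a plain dict (hypothetical data) standing in for the parsed RDF tree:

```python
# Standalone sketch of the cached-getter-factory pattern used below.
# A plain dict stands in for the parsed RDF tree; str.upper stands in
# for a converter.
class Metadata(object):
    def __init__(self, raw):
        self.raw = raw      # stands in for the parsed XMP/RDF data
        self.cache = {}

    def _getter_single(namespace, name, converter):
        def get(self):
            cached = self.cache.get(namespace, {}).get(name)
            if cached:
                return cached
            value = converter(self.raw.get((namespace, name)))
            self.cache.setdefault(namespace, {})[name] = value
            return value
        return get

    title = property(_getter_single("dc", "title", str.upper))

m = Metadata({("dc", "title"): "my document"})
```

The first access to `m.title` runs the converter and fills `m.cache`; later accesses return the cached value without touching the raw data again.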
    ##
    # Contributors to the resource (other than the authors).  An unsorted
    # array of names.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_contributor = property(_getter_bag(DC_NAMESPACE, "contributor", _converter_string))

    ##
    # Text describing the extent or scope of the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_coverage = property(_getter_single(DC_NAMESPACE, "coverage", _converter_string))

    ##
    # A sorted array of names of the authors of the resource, listed in order
    # of precedence.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_creator = property(_getter_seq(DC_NAMESPACE, "creator", _converter_string))

    ##
    # A sorted array of dates (datetime.datetime instances) of significance to
    # the resource.  The dates and times are in UTC.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_date = property(_getter_seq(DC_NAMESPACE, "date", _converter_date))

    ##
    # A language-keyed dictionary of textual descriptions of the content of
    # the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_description = property(_getter_langalt(DC_NAMESPACE, "description", _converter_string))

    ##
    # The mime-type of the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_format = property(_getter_single(DC_NAMESPACE, "format", _converter_string))

    ##
    # Unique identifier of the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_identifier = property(_getter_single(DC_NAMESPACE, "identifier", _converter_string))

    ##
    # An unordered array specifying the languages used in the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_language = property(_getter_bag(DC_NAMESPACE, "language", _converter_string))

    ##
    # An unordered array of publisher names.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_publisher = property(_getter_bag(DC_NAMESPACE, "publisher", _converter_string))

    ##
    # An unordered array of text descriptions of relationships to other
    # documents.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_relation = property(_getter_bag(DC_NAMESPACE, "relation", _converter_string))

    ##
    # A language-keyed dictionary of textual descriptions of the rights the
    # user has to this resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_rights = property(_getter_langalt(DC_NAMESPACE, "rights", _converter_string))

    ##
    # Unique identifier of the work from which this resource was derived.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_source = property(_getter_single(DC_NAMESPACE, "source", _converter_string))

    ##
    # An unordered array of descriptive phrases or keywords that specify the
    # topic of the content of the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_subject = property(_getter_bag(DC_NAMESPACE, "subject", _converter_string))

    ##
    # A language-keyed dictionary of the title of the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_title = property(_getter_langalt(DC_NAMESPACE, "title", _converter_string))

    ##
    # An unordered array of textual descriptions of the document type.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    dc_type = property(_getter_bag(DC_NAMESPACE, "type", _converter_string))

    ##
    # An unformatted text string representing document keywords.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    pdf_keywords = property(_getter_single(PDF_NAMESPACE, "Keywords", _converter_string))

    ##
    # The PDF file version, for example 1.0, 1.3.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    pdf_pdfversion = property(_getter_single(PDF_NAMESPACE, "PDFVersion", _converter_string))

    ##
    # The name of the tool that created the PDF document.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    pdf_producer = property(_getter_single(PDF_NAMESPACE, "Producer", _converter_string))

    ##
    # The date and time the resource was originally created.  The date and
    # time are returned as a UTC datetime.datetime object.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    xmp_createDate = property(_getter_single(XMP_NAMESPACE, "CreateDate", _converter_date))

    ##
    # The date and time the resource was last modified.  The date and time
    # are returned as a UTC datetime.datetime object.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    xmp_modifyDate = property(_getter_single(XMP_NAMESPACE, "ModifyDate", _converter_date))

    ##
    # The date and time that any metadata for this resource was last
    # changed.  The date and time are returned as a UTC datetime.datetime
    # object.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    xmp_metadataDate = property(_getter_single(XMP_NAMESPACE, "MetadataDate", _converter_date))

    ##
    # The name of the first known tool used to create the resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    xmp_creatorTool = property(_getter_single(XMP_NAMESPACE, "CreatorTool", _converter_string))

    ##
    # The common identifier for all versions and renditions of this resource.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    xmpmm_documentId = property(_getter_single(XMPMM_NAMESPACE, "DocumentID", _converter_string))

    ##
    # An identifier for a specific incarnation of a document, updated each
    # time a file is saved.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    xmpmm_instanceId = property(_getter_single(XMPMM_NAMESPACE, "InstanceID", _converter_string))

    def custom_properties(self):
        if not hasattr(self, "_custom_properties"):
            self._custom_properties = {}
            for node in self.getNodesInNamespace("", PDFX_NAMESPACE):
                key = node.localName
                while True:
                    # see documentation about PDFX_NAMESPACE earlier in file
                    idx = key.find(u"\u2182")
                    if idx == -1:
                        break
                    # unichr: escaped code points may fall outside the byte range
                    key = key[:idx] + unichr(int(key[idx+1:idx+5], 16)) + key[idx+5:]
                if node.nodeType == node.ATTRIBUTE_NODE:
                    value = node.nodeValue
                else:
                    value = self._getText(node)
                self._custom_properties[key] = value
        return self._custom_properties
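The loop in custom_properties undoes the PDFX key escaping described at the top of the file: each \u2182 followed by four hex digits is replaced by the original character. A standalone check of that decoding (written with chr here so the sketch runs on its own; the Python 2 module code uses unichr for the same step):

```python
# Decode a PDFX-escaped key: "\u2182" + 4 hex digits -> original character.
def unescape_pdfx_key(key):
    while True:
        idx = key.find(u"\u2182")
        if idx == -1:
            break
        key = key[:idx] + chr(int(key[idx+1:idx+5], 16)) + key[idx+5:]
    return key

decoded = unescape_pdfx_key(u"my\u21820020car")   # hex 0020 is a space
```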

    ##
    # Retrieves custom metadata properties defined in the undocumented pdfx
    # metadata schema.
    # <p>Stability: Added in v1.12, will exist for all future v1.x releases.
    # @return Returns a dictionary of key/value items for custom metadata
    # properties.
    custom_properties = property(custom_properties)

@ -0,0 +1,38 @@
Example:

    from pyPdf import PdfFileWriter, PdfFileReader

    output = PdfFileWriter()
    input1 = PdfFileReader(file("document1.pdf", "rb"))

    # add page 1 from input1 to output document, unchanged
    output.addPage(input1.getPage(0))

    # add page 2 from input1, but rotated clockwise 90 degrees
    output.addPage(input1.getPage(1).rotateClockwise(90))

    # add page 3 from input1, rotated the other way:
    output.addPage(input1.getPage(2).rotateCounterClockwise(90))
    # alt: output.addPage(input1.getPage(2).rotateClockwise(270))

    # add page 4 from input1, but first add a watermark from another pdf:
    page4 = input1.getPage(3)
    watermark = PdfFileReader(file("watermark.pdf", "rb"))
    page4.mergePage(watermark.getPage(0))
    output.addPage(page4)

    # add page 5 from input1, but crop it to half size:
    page5 = input1.getPage(4)
    page5.mediaBox.upperRight = (
        page5.mediaBox.getUpperRight_x() / 2,
        page5.mediaBox.getUpperRight_y() / 2
    )
    output.addPage(page5)

    # print how many pages input1 has:
    print "document1.pdf has %s pages." % input1.getNumPages()

    # finally, write "output" to document-output.pdf
    outputStream = file("document-output.pdf", "wb")
    output.write(outputStream)

@ -0,0 +1,40 @@
#!/usr/bin/env python

from distutils.core import setup

long_description = """
A Pure-Python library built as a PDF toolkit.  It is capable of:

 - extracting document information (title, author, ...),
 - splitting documents page by page,
 - merging documents page by page,
 - cropping pages,
 - merging multiple pages into a single page,
 - encrypting and decrypting PDF files.

By being Pure-Python, it should run on any Python platform without any
dependencies on external libraries.  It can also work entirely on StringIO
objects rather than file streams, allowing for PDF manipulation in memory.
It is therefore a useful tool for websites that manage or manipulate PDFs.
"""

setup(
    name="pyPdf",
    version="1.12",
    description="PDF toolkit",
    long_description=long_description,
    author="Mathieu Fenniak",
    author_email="biziqe@mathieu.fenniak.net",
    url="http://pybrary.net/pyPdf/",
    download_url="http://pybrary.net/pyPdf/pyPdf-1.12.tar.gz",
    classifiers = [
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: BSD License",
        "Programming Language :: Python",
        "Operating System :: OS Independent",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
    packages=["pyPdf"],
)